Loading...

Can static type systems speed up programming? An experimental evaluation of static and dynamic type systems

Master's Thesis 2011 114 Pages

Computer Science - Software

Excerpt

Table of Contents

Abstract

Zusammenfassung (German Abstract)

Directory of Figures

Directory of Tables

Directory of Listings

1. Introduction

2. Motivation & Background
2.1 Motivation
2.2 Maintenance and Debugging
2.2.1 Maintenance in a Nutshell
2.2.2 Debugging in a Nutshell
2.3 Documentation and APIs
2.3.1 Documentation of Software Systems
2.3.2 APIs and Application of their Design Principles in General Programming
2.4 Type Systems
2.5 Empirical Research in Software Engineering
2.5.1 On Empirical Research
2.5.2 Controlled Experiments
2.5.3 Current State of Empirical Research in Software Engineering

3. Related Work
3.1 Gannon (1977)
3.2 Prechelt and Tichy (1998)
3.3 Daly, Sazawal and Foster (2009)
3.4 Hanenberg (2010)
3.5 Steinberg, Mayer, Stuchlik and Hanenberg - A running Experiment series
3.5.1 Steinberg (2011)
3.5.2 Mayer (2011)
3.5.3 Stuchlik and Hanenberg (2011)

4. The Experiment
4.1 The Research Question
4.2 Experiment Overview
4.2.1 Initial Considerations
4.2.2 Further Considerations: Studies on Using Students as Subjects
4.2.3 Design of the Experiment
4.3 Questionnaire
4.4 Hard- and Software Environment
4.4.1 Environment
4.4.2 Programming Languages
4.5 Workspace Applications and Tasks
4.5.1 The Java Application - A Labyrinth Game
4.5.2 The Groovy Application - A simple Mail Viewer
4.5.3 Important Changes made to both Parts
4.5.4 The Tasks
4.6 Experiment Implementation

5. Threats to Validity
5.1 Internal Validity
5.2 External Validity

6. Analysis and Results
6.1 General Descriptive Statistics
6.2 Statistical Tests and Analysis
6.2.1 Within-Subject Analysis on the complete data
6.2.2 Analysis for residual effects between the two Participant Groups
6.2.3 Within-Subject Analysis on the two Participant Groups
6.2.4 Exploratory Analysis of the Results based on Participants’ Performance
6.2.5 Hypotheses and Task based Analysis

7. Summary and Discussion
7.1 Final Remarks
7.2 Result Summary
7.3 Discussion

8. Conclusion

References

A. Appendix
A.1 Statistical Methods and Tests
A.1.1. Box plots (box-whisker-diagrams)
A.1.2. Kolmogorov-Smirnov and Shapiro-Wilk
A.1.3. Independent and Dependent t-test
A.1.4. Wilcoxon Signed Rank Test
A.1.5. Mann-Whitney-U and Kolmogorov-Smirnov Z test
A.1.6. Regression Analysis
A.2 Supplemental Data
A.2.1. Participant Results Tasks 1 to 9 (Java)
A.2.2. Participant Results Tasks 1 to 9 (Groovy)
A.2.3. Results of the Tests for Normal Distribution on the results split by the two Groups
A.2.4. Results of Tests for Normal Distribution for the Participant Performance Analyses
A.2.5. Participant Performance Analysis based on the complete data
A.2.5.1. Outperformers
A.2.5.2. Underperformers
A.2.6. Demographic of participants and Questionnaire Results
A.3 An Example of a problematic Experiment Design and Analysis

Abstract

Type systems of programming languages are a much discussed topic of software engineering. There are many voices arguing towards static as well as dynamic type systems, although their actual impact on software development is rarely evaluated using rigorous scientific methods. In the context of this work, a controlled experiment with 36 participants was conducted which tried to compare the performance of software developers using a static and a dynamic type system for the same tasks using an undocumented API. The two programming languages used were Java and Groovy. The experiment and its results are analyzed and discussed in this thesis. Its main hypothesis was that a static type system speeds up the time developers need to solve programming tasks in an undocumented API. The main results of the experiment speak strongly in favor of this hypothesis, because the static type system seems to have a significantly positive impact on the development time.

Zusammenfassung (German Abstract)

Typsysteme von Programmiersprachen sind ein vieldiskutiertes Thema in der Softwaretechnik. Es gibt sowohl für statische als auch dynamische Typsysteme große Gruppen von Befürwortern, obwohl der tatsächliche Einfluss beider auf die Softwareentwicklung selten mithilfe strenger wissenschaftlicher Methoden ausgewertet wurde. Im Kontext dieser Arbeit wurde ein kontrolliertes Experiment mit 36 Teilnehmern durchgeführt, um die Performanz von Softwareentwicklern mit einem statischen und einem dynamischen Typsystem anhand gleicher Aufgaben in einer undokumentierten Anwendung zu vergleichen. Die hierfür genutzten Programmiersprachen waren Java und Groovy. Das Experiment und die Ergebnisse werden in dieser Arbeit analysiert und diskutiert. Die Haupthypothese des Experiments besagt dass ein statisches Typsystem die Zeit verkürzt die ein Entwickler benötigt um Programmieraufgaben in einer undokumentierten Umgebung zu lösen. Die Ergebnisse sprechen stark für diese Hypothese, da das statische Typsystem tatsächlich einen signifikanten positiven Einfluss auf die Entwicklungszeit zu haben scheint.

Directory of Figures

Figure 4-1: Assumed occurrence of learning effect in experiment design (Figure taken from [Stuchlik and Hanenberg 2011])

Figure 4-2: The labyrinth game interface along with some annotations for the participants

Figure 4-3: The mail viewer interface along with some annotations for the participants

Figure 4-4: Simplified call stack containing bug creation and runtime error

Figure 4-5: Simplified call stack containing bug creation and runtime error for task 16

Figure 4-6: Simplified Call Stack containg bug creation and runtime error for Task 18

Figure 6-1: Boxplot of complete experiment results

Figure 6-2: Boxplot of results for the Groovy starter group

Figure 6-3: Sample histogram for a positively skewed frequency distribution of task results

Figure 6-4: Boxplot of results for the Java starter group

Figure 6-5: Scatterplot of the results for the type identification tasks in Groovy

Figure 6-6: Scatterplot of the results for the type identification tasks in Java

Figure 6-7: Boxplot of the Groovy part results for tasks 7 and 9

Figure 6-8: Boxplot of results for task 4 and 5 results of only the first language used

Figure A-1: Example of a box plot

Directory of Tables

Table 4-1: Experiment Blocking Design

Table 4-2: Summary of the independent variables, their values, corresponding tasks and dependent variables

Table 6-1: Descriptive statistics data of Groovy tasks for all participants (in seconds)

Table 6-2: Descriptive statistics data of Java tasks for all participants (in seconds)

Table 6-3: Descriptive statistics data of total experiment time for all participants (in seconds)

Table 6-4: Results of the tests for normal-distribution for the complete data, comparing task differences based on the language used

Table 6-5: Results of the t-test and Wilcoxon Signed Rank tests on the complete data, comparing tasks based on the language used

Table 6-6: Results of the Mann-Whitney-U and Kolmogorov-Smirnov-Z test when comparing Java task results between Groups (GS=GroovyStarters)

Table 6-7: Results of the Mann-Whitney-U and Kolmogorov-Smirnov-Z test when comparing Groovy task results between Groups (JS = JavaStarters)

Table 6-8: Descriptive statistics of Groovy tasks for participants that started with Groovy (in seconds)

Table 6-9: Descriptive statistics of Java tasks for participants that started with Groovy (in seconds)

Table 6-10: Descriptive statistics of total time for participants that started with Groovy (in seconds)

Table 6-11: Results of the tests for normal-distribution for the Groovy starter group, comparing task time differences based on the language used

Table 6-12: Results of the t-test and Wilcoxon-test for the Groovy starter group

Table 6-13: Descriptive statistics of Groovy tasks for participants that started with Java (in seconds)

Table 6-14: Descriptive statistics of Java tasks for participants that started with Java (in seconds)

Table 6-15: Descriptive statistics of total time for participants that started with Java (in seconds)

Table 6-16: Results of the tests for normal-distribution for the Java starter group, comparing task time differences based on the language used

Table 6-17: Results of the t-test and Wilcoxon-test for the Java starter group

Table 6-18: Descriptive statistics of Groovy tasks for outperformer participants that started with Groovy (in seconds)

Table 6-19: Descriptive statistics of Java tasks for outperformer participants that started with Groovy (in seconds)

Table 6-20: Descriptive statistics of total time for outperformer participants that started with Groovy (in seconds)

Table 6-21: Results of the t-test and Wilcoxon test for the Groovy starter outperformer group

Table 6-22: Descriptive statistics of Groovy tasks for underperformer participants that started with Groovy (in seconds)

Table 6-23: Descriptive statistics of Java tasks for underperformer participants that started with Groovy (in seconds)

Table 6-24: Descriptive statistics of total time for underperformer participants that started with Groovy (in seconds)

Table 6-25: Results of the t-test and Wilcoxon test for the Groovy starter underperformer group

Table 6-26: Descriptive statistics of Groovy tasks for outperformer participants that started with Java (in seconds)

Table 6-27: Descriptive statistics of Java tasks for outperformer participants that started with Java (in seconds)

Table 6-28: Descriptive statistics of total time for outperformer participants that started with Java (in seconds)

Table 6-29: Results of the t-test and Wilcoxon test for the Java starter outperformer group

Table 6-30: Descriptive statistics of Groovy tasks for underperformer participants that started with Java (in seconds)

Table 6-31: Descriptive statistics of Java tasks for underperformer participants that started with Java (in seconds)

Table 6-32: Descriptive statistics of total time for underperformer participants that started with Java (in seconds)

Table 6-33: Results of the t-test and Wilcoxon test for the Java starter underperformer group

Table 6-34: Results of regression analysis for task time depending on number of types to identify and language

Table 6-35: Coefficients of regression analysis for type identification task time depending on number of types to identify and language

Table 6-36: Results of t-test and Wilcoxon test for comparing tasks 7 and 9 for Groovy

Table 6-37: Results of Mann-Whitney-U and Kolmogorov-Smirnov-Z test for first tasks 4 and 5 results

Table A-1: Participants raw results for the Java part in seconds for Groovy Starters

Table A-2: Participants raw results for the Java part in seconds for Java Starters

Table A-3: Participants raw results for the Groovy part in seconds for Groovy Starters

Table A-4: Participants raw results for the Groovy part in seconds for Java Starters

Table A-5: Results of tests for normal distribution on task results split by group (GS = GroovyStarters, JS=JavaStarters)

Table A-6: Tests for Normal Distribution on Differences between Groovy and Java times for Groovy Starter Outperformers

Table A-7: Tests for Normal Distribution on Differences between Groovy and Java times for Groovy Starter Underperformers

Table A-8: Tests for Normal Distribution on Differences between Groovy and Java times for Java Starter Outperformers

Table A-9: Tests for Normal Distribution on Differences between Groovy and Java times for Java Starter Underperformers

Table A-10: Descriptive statistics for outperformer participants over complete experiment (in seconds)

Table A-11: Descriptive statistics for outperformer participants over complete experiment (in seconds)

Table A-12: Descriptive statistics for outperformer participants over complete experiment (in seconds)

Table A-13: Tests for normal distribution on differences between groovy and java times for complete experiment outperformers

Table A-14: Results of t-test and Wilcoxon test for outperformer students of whole experiment

Table A-15: Descriptive statistics for underperformer participants over complete experiment (in seconds)

Table A-16: Descriptive statistics for underperformer participants over complete experiment (in seconds)

Table A-17: Descriptive statistics for underperformer participants over complete experiment (in seconds)

Table A-18: Tests for normal distribution on differences between groovy and java times for complete experiment underperformers

Table A-19: Results of t-test and Wilcoxon test for underperformer students of whole experiment

Table A-20: Questionnaire results for programming skill questions

Table A-21: Descriptive statistics for Groovy group of independent design (in seconds)

Table A-22: Descriptive statistics for Groovy group of independent design (in seconds)

Table A-23: Descriptive statistics for Groovy group of independent design (in seconds)

Table A-24: Descriptive statistics for Groovy group of independent design (in seconds)

Table A-25: Results of Mann-Whitney-U (MW-U) and Kolmogorov-Smirnov-Z (KS-Z) test for independent design

Directory of Listings

Listing 2-1: Examples for variable declarations in a statically typed language

Listing 2-2: Examples for variable declarations in a dynamically typed language

Listing 2-3: Redundancy of information through static type system

Listing 4-1: Simple Java Code Example

Listing 4-2: Simple Groovy Code Example

Listing 4-3: Example code to explain stack and branch size

Listing 4-4: Solution to task 1 (Java)

Listing 4-5: Solution to task 10 (Groovy)

Listing 4-6: Solution to task 2 (Java)

Listing 4-7: Solution to task 11 (Groovy)

Listing 4-8: Solution to task 3 (Java)

Listing 4-9: Solution to task 12 (Groovy)

Listing 4-10: The part of task 4 with the error that leads to wrong behavior

Listing 4-11: Solution to task 4 (Java)

Listing 4-12: Solution to task 13 (Groovy)

Listing 4-13: Part of code from task 14 with missing line to remove reference

Listing 4-14: Solution to task 14 (Groovy)

Listing 4-15: Solution to task 5 (Java)

Listing 4-16: Solution to task 6 (Java)

Listing 4-17: Solution to task 15 (Groovy)

Listing 4-18: The simulated interaction of task 16 (Groovy)

Listing 4-19: Point of bug insertion for task 16

Listing 4-20: Point where bug results in runtime error for task 16

Listing 4-21: Solution to task 8 (Java)

Listing 4-22: Solution to task 17 (Groovy)

Listing 4-23: Simulated interaction for task 18 (Groovy)

Listing 4-24: Point of bug insertion for Task 18

Listing 4-25: Point where bug results in runtime Error for Task 18

1. Introduction

Software development is generally a complex process that is up to today almost impossible to predict. This is true even if the focus is on the pure programming part of software development and other associated tasks like requirements gathering and specification are ignored. Commonly, software developers need to fix errors or extend an existing program. Both tasks are often summarized as software maintenance and it is stated that software maintenance makes up a significant part of software project costs ([Boehm 1976], [Lientz et al. 1978] and [Gould 1975]). In addition, different programming languages are in use, which in most cases have either a static or a dynamic type system. These two type systems are the source of a controversial discussion in the scientific world as well as in the software industry. Some argue very strongly towards static type systems and the advantages they are supposed to bring, while others oppose these ideas with their own arguments on how these type systems supposedly restrict and complicate the use of programming languages. It is mostly a battle of beliefs, as both sides’ arguments are purely speculative and based on logical reasoning. While static type systems have experienced a popularity boost during the last years and are wide spread over many popular programming languages (like Java, C#, C++), there is very little scientific evidence of either their positive or negative impact on software development. This work aims toward closing this knowledge gap a little by conducting a controlled experiment that compares the performance of developers with a static and a dynamic type system on the same tasks. It is not the first experiment conducted on the impact of type systems (others are [Gannon 1977], [Prechelt and Tichy 1998], [Daly et al. 2009], [Hanenberg 2010b] and [Stuchlik and Hanenberg 2011]), but research on the topic is still scarce as the next chapters will show.

The main focus of the experiment is the impact of a static type system on development time, comparing the time needed to solve tasks in Java (which has static type system) and Groovy (using a dynamic type system). For this, several different tasks were designed and all participants of the experiment had to solve all tasks in both programming languages. Three hypotheses were the base of the experiment design. The assumptions were that first, tasks where different classes need to be identified are solved faster in a statically typed language. This was also tested with different numbers of classes to identify. Second, for semantic errors, the kind of type system should not make a difference on the completion time. Third, when using a dynamically typed language, it is assumed that it takes a longer time to fix errors if runtime error occurrence and bug insertion (meaning the location in the code that is faulty and which later results in the runtime error) are farther removed from each other. All this is based on programs that are undocumented, meaning that there are neither comments or documentation, nor variables that explicitly point toward their contained types (more about this notion of no documentation later).

Chapter 2 explains the motivation for this thesis and the general scientific history of empirical research in software engineering (or lack thereof) and also gives some background information on some important topics associated with this work’s context. These topics are type systems, maintenance, debugging, as well as documentation in software engineering. The chapter closes with a summary of empirical research and methods with a focus on controlled experiments. Afterwards, chapter 3 summarizes and discusses related work that has previously been conducted on the topic of static and dynamic type systems. In chapter 4, the experiment’s research question and design is explained in detail. Afterwards, chapter 5 sums up and discusses some threats to validity of the experiment results. The sixth chapter contains the complete analysis of the gathered experiment data. Then, in the seventh chapter, the experiment results are summarized and discussed. Finally, chapter 8 concludes this work.

2. Motivation & Background

In this chapter, first the motivation for this work is represented, along with some information about certain topics that are directly related to the subject of research. These topics are maintenance and debugging, documentation and APIs, as well as type systems. A summary of empirical research and controlled experiments and the current state of the art in software engineering closes the chapter.

2.1 Motivation

The main motivation of this work is to find out whether static type systems improve developer performance when doing software maintenance. Type systems are said to have some inherent qualities that supposedly support a developer when writing and maintaining code which could lead to faster development. As will be seen later, work on this specific topic is scarce and thus this work was meant to provide more insight on the topic. Also, a part of the motivation is the possible impact of APIs on software maintenance in this context.

2.2 Maintenance and Debugging

2.2.1 Maintenance in a Nutshell

Almost anyone does probably have a very intuitive understanding of what maintenance means. Spoken plainly, to maintain something is to keep it from breaking down or stop doing what it was meant to do. For software maintenance, the IEEE gives a short and understandable definition of the term: “Modification of a software product after delivery to correct faults, to improve performance or other attributes, or to adapt the product to a modified environment”, [IEEE 1998]. Not only does it include the purpose of fixing or preventing faults, but also improvements. Interestingly, in a newer version of the standard released together with the ISO, the newly definition states that software maintenance is “the totality of activities required to provide cost-effective support to a software system […]”, [ISO/IEC/IEEE 2006]. A definition that is fuzzy compared to the former, though older one. It is still more appropriate and should serve as the base definition for maintenance in this work. In the newer standard, they also define four corresponding maintenance types: Corrective, Preventive, Adaptive and Perfective Maintenance. Corrective and adaptive maintenance are in the focus of this work.

Something that is related to maintenance but infinitely harder to define is the notion of maintainability. Citing the 2006 standard again (the 98 did not yet define the term), it says that maintainability represents “the capability of the software product to be modified […]”, [ISO/IEC/IEEE 2006]. This definition (along with many similar definitions for maintainability from software engineering literature which will not be mentioned here) is very unclear. Worse, there is still no objective measure for maintainability yet. Many attempts were made to either measure it on a more quantitative base using software metrics (e.g. measuring factors that influence maintainability and infer a measure of maintainability this way) and modeling it on a qualitative base. Both approaches have yet to yield any commonly accepted measure of maintainability [Broy et al. 2006].

Software maintenance is a huge cost factor. Boehm claimed that cost of software maintenance is made up of more than 50% of the total project cost of software projects [Boehm 1976]. There a few more studies that give rough estimates, which range from 40% up to 75% of total project cost (study results summarized by [Lientz et al. 1978]). Although quite a few years have passed, other studies do not give the impression that this has changed. In an older study, Gould cites that about 25% of the time is spent on maintenance/error correction [Gould 1975]. While no current figures could be found that support all these percentages, it can be assumed (and personal experience confirms this) that maintenance still takes up a bulk of a software system’s lifetime and cost.

2.2.2 Debugging in a Nutshell

It was made clear that maintenance of software takes up huge amounts of time and money. An important part of the maintenance process is the fixing of errors (corrective maintenance) which are commonly called bugs. Consequently, removing an error from a program is called debugging. There is actually some history on the origin of the term “bug” for an error in a computer program, but it is controversial and shall not play any further role here.[1]

There are possible classifications of bugs as well as some research works on debugging approaches and modeling them ([Katz and Anderson 1987], [Vessey 1986] and [Ducassé and Emde 1988] for example). According to them all, the debugging process always involves the following tasks (not necessarily in this exact granularity or order): Reading and understanding program code to locate the possible error source, possibly introduce test outputs, gain enough understanding of the program to find a solution for the problem and test the solution.[2] These tasks become significantly harder when the program to fix was written by someone else or if a long time has passed since someone took a look into the code he has written.

A bug can also be the deviation of a program from its specification, which does not necessarily always lead to an error. So it also important that a programmer knows what the actual intention of the program is to fix this type of bug. Ducasse classifies some of the knowledge that helps in the debugging process, among it the knowledge of the intended program, the knowledge of the actual program, an understanding of the programming language, general programming expertise, knowledge of the application domain, knowledge of bugs and knowledge of debugging methods [Ducassé and Emde 1988].

2.3 Documentation and APIs

2.3.1 Documentation of Software Systems

Before being able to change a program, one must first understand it or at least the part needing change. In other words, gain knowledge about it. Schneidewind says that “it is a major detective operation to find out how the program works, and each attempt to change it sets off mysterious bugs form the tangled undergrowth of unstructed code”, [Schneidewind 1987]. So because a good understanding of a program is extremely important for maintenance and debugging tasks, it seems wise to take a look at the documentation aspect of software systems.

Understanding a program or relevant parts of it before being able to make changes takes a significant amount of the total software maintenance time, especially if the program was written by someone else. [Standish 1984] claims it to be 50-90% of total maintenance cost, although a more recent study by Tjortjis estimates about 30% of maintenance time was devoted to program comprehension [Tjortjis and Layzell 2001][3]. There are different types of documentation artifacts, like class diagrams, flow charts, data dictionaries, glossaries, requirements documents, and more. But in many cases, the only documentation available is the program code and possibly contained comments.

So, whenever programmers need to change an existing program, “the automated extraction of design documentation from the source code of a legacy system is often the only reliable description of what the software system is doing”, [Buss and Henshaw 1992]. This is true not only for the automated extraction. Often programmers have to manually read or at least quickly scan huge parts of the code to understand what is going on. Sousa and Moreira conducted a field study and concluded that among the three biggest problems related to the software maintenance process is the lack of documentation of applications [Sousa 1998], leading to the necessity of reading the code. Similar results concerning the primary usage of code as documentation have been reported by [Singer et al. 1997], [de Souza et al. 2005] and [Das et al. 2007].

It should be mentioned that the decision to use source code as a base for program understanding was not always the first choice, but sometimes rather a last resort due to lack of other documentation artifacts. On the other hand, the study by [de Souza et al. 2005] states that source code was always the most used artifact, no matter if other documentation artifacts existed. This contradicts the notion that code is a last resort documentation. Furthermore, there are many reasons for other types of documentation to be either completely absent or outdated in comparison to the current code base. For example, other artifacts are seldom updated along with maintenance in the code to reflect these changes (even comments or annotations in the code tend to “age”), sometimes due to lack of time, budget or motivation.

All these problems are far beyond the scope of this work, which from a documentation view focuses exclusively on the documentation value of the source code. First, source code will always be present even when all other artifacts are missing or outdated, and second it can be assumed that building a cognitive model from code is easier for a programmer. After all, it is what all programmers are used to see regularly. As an interesting side note with practitioner’s view on the pros and cons of documenting software, there is a growing community of followers of the so called “Clean Code” initiative. The “Clean Code” approach was originally spawned by Robert C. Martin in his book [Martin 2009]. Basically, he proposes that good source code is meant to document and explain itself, without the need for many comments. Source code should be written in a way that a reader can read it literally like a book, with speaking method and variable names that clearly state their purpose and function.

2.3.2 APIs and Application of their Design Principles in General Programming

Assuming that a programmer only has the available source code as documentation, he needs to make do with what he can get. Depending on the circumstances, there is a difference whether he reuses components from the outside or simply jumps into an existing program and makes changes there. Commonly, if code has been written to be used in different contexts and is in itself a finished component, the part of the code that can be used from the outside is commonly known as the API (Application Programming Interface). An API generally consists of different classes with different methods, along with possible documentation. These need to be public so that they can be used from the outside of the component. Put another way, they are the adapters which should be used to plug the component into another program. The component’s internals are usually hidden behind the API classes and methods.

There are many reusable APIs available; Java for example is supplied with its own set of class libraries that provide functionality from printing to a console window to reading and writing files to disk. Most current programs are written with a smaller or larger amount of API usage in them. In this work, programmers did not have to use any specific third-party API, but only the classes given to them, to whom they had full access. This is generally not considered as API usage, although the principles of good API design still apply here. An important rule is given by Henning in [Henning 2007]: “APIs should be designed from the perspective of the caller”. He also gives a very important statement which neatly summarizes what APIs are all about: “Even though we tend to think of APIs as machine interfaces, they are not: they are human-machine interfaces”, [Henning 2007]. Disregarding for a moment the fact that having complete access to a program’s code is not exactly what one would call API use, the same principles should nevertheless apply to all parts of program code in general. It is reasonable to assume that a programmer who has to fix a bug in existing code is often still at a loss when trying to understand many functions just by looking at the interface of a class.

2.4 Type Systems

Broken down, the essence of a type system is that it constraints the use of variables and other statements in a program by enforcing them to adhere to a certain type (like containing a text, or a number, but not both). Cardelli and Wegner use an interesting metaphor to describe the fundamental purpose of a type system: “A type may be viewed as a set of clothes (or a suit of armor) that protects an underlying untyped representation from arbitrary or unintended use.” [Cardelli and Wegner 1985].

Two types of type systems are common, the static and the dynamic type system. The difference between a dynamic and static type system is the time at which the type of an object is actually checked. A programming language implementing a static type system (like Java, C++ or C# to name just a few popular ones) usually has a type checker that can tell the developer if there are any errors in the program based on the static information in the written code. The code does not need to be executed to find these errors. For example if the programmer tries to put an object of type Ship into a variable of type Car the compiler can detect this error and tell the programmer. Usually this means a program cannot be run until the compiler can detect no more type errors based on static information. A dynamic type system usually does this check only at runtime. This means the program will run, but as soon as the program tries to tell the Car object to “SetSail” (a method only a ship would have), the program terminates with a runtime type error.

So the main difference of the two type systems is the time at which a type is checked for its constraints. Sometimes dynamically typed languages are mistaken for untyped/typeless languages, which is wrong. Dynamically typed languages use types, although do not enforce them statically and perform the type check during runtime.

illustration not visible in this excerpt

Listing 2-1: Examples for variable declarations in a statically typed language

illustration not visible in this excerpt

Listing 2-2: Examples for variable declarations in a dynamically typed language

The first of the above code snippets shows some variable declarations and values/objects put into them for a statically typed language. The number variable is declared as type int, which means it can store whole numbers. The last line would result in an error during compilation, because the type checker can see that the programmer is trying to put a text into a variable of type int and tell him. It thus prevents a possible error during program execution. In the second snippet, the information of what types the variables are of is omitted. It is very possible to put a number into a variable, and immediately afterwards replace its contents with a text. This could lead to a runtime error (if the programmer tries to multiply two variables that contain texts by mistake).

Both systems have their intrinsic advantages and disadvantages, of which some should be discussed here from the viewpoint of static type systems.

Advantages of a static type system (taken from [Cardelli 1997] and [Pierce 2002])

- A static type system prevents the programmer from making mundane type related errors through disciplining him because of the type enforcement. (Cardelli page 6 and Pierce pages 4-5)
- Because of the static type information available they can detect a lot of type related errors (calling a method on a wrong type) during compilation and thus reduce the amount of runtime type errors (Cardelli page 6 and Pierce pages 4-5)
- As another result of the reduced type errors, they also minimize security risks, e.g. by preventing harmful type conversions (Cardelli page 6 and Pierce pages 6-8)
- A static type system can provide the reader of code with an implicit documentation. Because a static type system enforces type declarations for variables, method parameters and return types, it implicitly increases the documentation factor by making the code speak for itself. (Pierce page 5)
- A type system may enable certain forms of optimization by the compiler or the runtime environment because type casts and runtime checks are made obsolete in certain situations. It can thus make the language more efficient (Pierce page 8)

Disadvantages of a static type system (all taken from [Tratt 2009][4] pages 7-10)

- Restrictions on the range of possible applications. Because a type system can be overly restrictive and force the programmer to sometimes work around the type system.
- Limitations on the degree of possible modification during runtime. Statically typed programming languages can only rely on heavily complex reflection operations to be able to be changed during runtime.
- They can get in the way of simple changes or additions to the program which would be easily implemented in a dynamic type system but make it difficult in the static type systems because of dependencies that always have to be type correct.

In addition to the disadvantages taken from Tratt, there is also the general notion of documentation redundancy introduced through the use of static type systems, leading to very verbose code. Simply consider the following example to see the point:

illustration not visible in this excerpt

Listing 2-3: Redundancy of information through static type system

In the end, both sides have good arguments that are logically coherent. Nonetheless, it remains a conflict of ideologies as long as no reliable results from multiple studies or other scientific methods are available. This experiment focuses mainly on the documentation and error prevention aspect of type systems when trying to shed some light on their advantages and disadvantages.

2.5 Empirical Research in Software Engineering

2.5.1 On Empirical Research

By “research”, this work refers to the rigorous scientific research methods that are usually applied in the natural and some of the social sciences and have matured over hundreds of years. Normally, this type of research is driven by the desire of the researcher to answer a question, possibly after observing some condition in reality and then formulating a theory (an explanation) about the nature of this condition. This can include how or why it might have occurred or what it is made of, among other things. Anything occurring in nature or anywhere else can be the subject of such a theory.

Next, a hypothesis is formulated based on the theory which predicts the condition or aspects of it. This hypothesis can be rejected or hardened (it can never be proven) by collecting data relevant for this hypothesis and analyzing it. Do the data confirm the hypothesis, the theory seems more sound, if they contradict the hypothesis, then the hypothesis is considered falsified and has to be rejected (although a new theory based on changed assumptions can be created afterwards). Data collection is commonly done using experiments or other methods of empirical research. Although this approach dates back to Sir Francis Bacon[5] and others that have modified the ideas over time, Karl Popper [Popper 2008], a famous scientific philosopher, is quoted most frequently in this context. Popper believed that experimental observation is the key to scientific discovery and proposed that hypotheses should be falsifiable.

The above described method of scientific research that is concerned with creating theories and hardening or falsifying those using experiments or observations is nowadays commonly called empirical research. The term empiricism is derived from a Greek word for “based on experience”. Empirical research uses different methods which also differ between the sciences that apply them, e.g. the natural sciences, psychology, social science and medicine. Each uses different methods and approaches that have proven to produce reliable and valid results in specific areas. The notion of validity means that something really measured what it was made to measure, and reliability means that it measures this something consistently across different conditions (e.g. when taking the measure repeatedly while assuming all other variables are similar, the measured result should be similar for both measures).

2.5.2 Controlled Experiments

Although there are whole books dedicated to the art of creating and designing experiments and studies, only a short summary of controlled experiments is presented here. More information can be taken from [Prechelt 2001], which is a good introduction into experimentation for the software engineering discipline.[6] The following part explains controlled experiments. A summary of more methods can also be found in this authors own Bachelor thesis [Kleinschmager 2009].

The method of research used in this work is a controlled experiment. These experiments try to rigorously control the experiment conditions by keeping as many factors constant as is possibly, while deliberately manipulating only one or a few experiment variables. Conducting a controlled experiment can be very cumbersome and difficult, because planning and implementation take a lot of time and consideration as well as attention to detail.

There are three types of variables in controlled experiments (not to be confused with programming variables): Independent, dependent and the sum of all other variables, sometimes also called “noise” or unsystematic variation. The first type, the independent variable, is manipulated by the experimenter and then the impact of the manipulation is measured using the dependent variables (which are called dependent because they depend on the independent variable). Variation resulting from manipulation of the independent variables is called systematic variation. An example would be the time an athlete needs until his heartbeat reaches a certain limit (the dependent variable) depending on what kind of exercise he has to do (the independent variable).

Controlled experiments with humans are especially tricky to design because the human factor introduces an infinite amount of unsystematic variation (for example, the athlete might have slept badly the night before he did the first exercise, but had a very refreshing sleep during the night before the second exercise). Unsystematic variation or “noise” can seriously harm the usefulness of any results, and therefore many measures need to be taken in order to reduce them as much as possible. As will be explained later in more detail, one of these measures is to design experiments where all participants partake in all conditions and also in a random or balanced order. This makes it easier to calculate the systematic variation using statistical methods.

All in all, controlled experiments have a high validity and can easily be reproduced many times (when following the exact setup and implementation), producing reliable and comparable results. Their biggest disadvantage is their large cost in time and work for preparation, buildup and evaluation.

2.5.3 Current State of Empirical Research in Software Engineering

Software systems today are getting more and more complex, and more and more expensive to develop. Their impact on everyday life is immense. Cars, planes, medical equipment, computers for financial transactions and almost infinitely more examples of machines or devices depend on software.

Under normal circumstances, one might think that the creation and maintenance of software would be a well-researched and perfected field of work. But in software engineering –sadly-, the situation is quite the contrary. There is a huge deficit of research based on experimenting and hypotheses that can be falsified. In most parts, the current state-of-the-art in the software sciences lacks scientific method, which is insufficient and inadequate for a field that claims to do science. In 1976, Boehm tried the definition of software engineering as: “The practical application of scientific knowledge in the design and construction of computer programs and the associated documentation required to develop, operate, and maintain them”, [Boehm 1976]. Intuitively, this definition still fits quite well today, although Boehm leaves open what exactly he means by scientific knowledge. In 1991, Basili and Selby already wrote that the “immaturity of the field is reflected by the fact that most of its technologies have not yet been analyzed to determine their effects on quality and productivity. Moreover, when these analyses have occurred the resulting guidance is not quantitative but only ethereal.” [Basili and Selby 1991].

In the days of 1991, software engineering was still young and in its early stages and evolution, so it might be alright to say that such a young field had yet to find its scientific base. But, some years later, apparently not much had changed. Lukowicz, et al. did a study on research articles [Lukowicz et al. 1994] and found out that from over 400 articles only a fraction included experimental validation (and there is no mentioning of the quality of the experimental setup and analysis in the other articles). In 1996, Basili again criticized the lack of experimentation and learning, although a few studies had already been conducted. He says that there should be a “cycle of model building, experimentation and learning. We cannot rely solely on observation followed by logical thought” [Basili 1996].

The current state is that observation is still mainly followed by logical thought and arguments towards generalization, maybe model building. In some cases, even a field study is conducted, which is commendable, but hardly sufficient for science.

There are many arguments and excuses to not do experiments in software sciences. In 1997, Tichy summarized some of them [Tichy 1997]. Snelting [Snelting 1998] even accuses software scientists of applying constructivism and pleads for a more rigorous methodological research approach in software engineering, but also sees some improvement some years later [Snelting 2001]. So it seems obvious that there is a lot of arguing about which direction the software sciences should go. Both sides do have reasonable arguments (some good examples are [Denning 2005] and [Génova 2010]).

All this does not mean that experiments are the holy grail of science, but the status quo should be that no good model or theory can hold or be generalizable without sound experimental validation. Free speculation and experimentation should work in hand in hand, as is demanded in [Génova 2010], even if he argues strongly towards a more speculative approach. Especially the human factor in software engineering has been neglected for many years, as Hanenberg rightfully criticizes [Hanenberg 2010c]. It is great if people come up with new techniques, models and approaches, as long as they are usable and systematic studies with humans show that it helps them develop better software. Because it is the humans, in this case especially the developers, who have to put all things together and actually implement the software, no matter how great the tool support and theoretical foundation is. The current state of events is that techniques, models and approaches develop (and often vanish) much quicker than they can be validated in scientific experiments.

3. Related Work

This chapter summarizes the results from the few preceding studies available on static type systems. All of the here presented experiments are those that use humans as subjects of the experiment and are specifically targeted on comparing static and dynamic type systems. There are other works which for example focus on using a type checker on a program previously written with a dynamically typed language to check whether these programs contain possible errors [Furr et al. 2009]. Some analyze a large base of open source projects and try to measure programmer productivity based on the language used, like [Delorey et al. 2007]. Others compare the general performance of developers for a task using very different programming languages and/or the runtime performance of the final program [Hudak and Jones 1994], [Gat 2000] and [Prechelt 2000]. But as this work was targeted on evaluating the impact on the performance of developers, only papers that implement a comparable design with humans as participants are mentioned.

3.1 Gannon (1977)

The oldest study that could be found which directly experimented on the impact of type systems was conducted in 1977 by Gannon [Gannon 1977]. It used a simple repeated measures cross-over design with 38 graduate and undergraduate students who had to program the solution to a problem twice; once with a statically typed language and the second time with a dynamically typed language. Both languages were designed specifically for the experiment. The participants were split into two groups, each group starting with a different language and then using the other the second time. The hypothesis was that the reliability of software was enhanced by a language if the errors made by the participants in that language where less numerous than those of the second language. He did not give any clear indication of favoring any of the two languages in the hypothesis.

The methodology used has the advantage of the repeated-measures design by giving the possibility of a within-subject analysis. Although there certainly would have been a carry-over effect because it was the same task that had to be solved both times, so the participants might have already had a “solution roadmap” in their minds after finishing with the first language. His measure against this was to rely on the features of the two languages that were explicitly altered: The dynamically typed language did not include any built-in string functionality, which the participants had to build themselves. It can be argued that this considerably threatens the experiments internal validity because it is not clear if the results measure the impact of the type system or the impact of missing string operations for the dynamically typed language.

Gannon stated that the results were that the statically typed language increased programming reliability and that inexperienced students benefit more from it. They are based primarily on measuring the number of runs in the environment and the number of error occurrences. Only some of the results are statistically significant. In general, the results are questionable, because of the fact that the participants had quite a few problems with the missing string functionality in the dynamically typed language.

3.2 Prechelt and Tichy (1998)

In 1998, Prechelt and Tichy conducted a similar experiment using two different variations of the C programming language [Prechelt and Tichy 1998], one employing static type checking and one employing dynamic type checking. 40 participants (most of them PhD computer science students) were first split into main groups, one working with the type checker, one without. They then employed a slightly more complicated design, where all participants had to solve both parts of the experiment, although each with a different language. Subgroups were assigned which differed in both the order of the tasks (A then B or vice versa) and also the order of the language to use (first with type checker, then without). So in the end, all participants had been split among four roughly equal sized groups.

The tasks were supposed to be short and modestly complex. Their hypotheses were that type checking increases interface use productivity, reduces the number of defects and also reduces the defect lifetime. They made sure that the programmers were familiar with the language, so that the majority of problems would result from using the rather complex library they were given to solve the tasks. Dependent variables were the number of defects that were introduced, changed or removed with each program version and put them into different defect categories. In addition, the number of compilation cycles and time till delivery were measured.

The results strengthened all their hypotheses and they concluded that type checking increases productivity, reduces defects and also the time they stay in the program. This is based primarily on the number of defects in the delivered programs as well as the reduced defect lifetime that was achieved through the static type system. While the results seem sound from a methodical point of view, there are possible sources of strong unsystematic variation. They discuss some of them, including the learning effect they definitely measured. But (without having the exact task implementations to look at) it seems that -judging from their rough description of the two tasks- that the difference between them could have been a strong source of unsystematic variation. This is a fact they did not mention. Nevertheless, it is still one of the few methodologically sound experiments conducted on type systems and deserves approval, especially because it was the second experiment on the topic ever.

3.3 Daly, Sazawal and Foster (2009)

Many years later, in 2009, Daly, Sazawal and Foster [Daly et al. 2009] conducted a small study using the scripting language Ruby, which is dynamically typed, and Diamond Ruby, a static type system for Ruby. They had four participants and their design was also a repeated-measures cross-over design. All participants were said to be familiar with Ruby and were recruited from a user group of practitioners. In contrast to the other studies mentioned here, they only ran a qualitative analysis on the results without any statistical measures (which would not have yielded any useful results with only four participants anyway).

They could not find any specific advantage of the type system. Apart from some threats to validity that the authors’ already mentioned themselves, there again is the difference between the tasks that might have introduced some random effect, even if the authors claim that they were of approximately similar complexity (one was a simplified Sudoku solver and one a maze solver). What is more, the first participant was not given starter code and therefore only solved a fraction of the total tasks’ work. And he also did not have internet access, which the experimenters realized was a mistake and made it accessible for the next three; both very possible sources of unwanted effects.

Although the results of the study are more or less exploratory and only qualitative, the authors could induce an interesting hypothesis from their analysis: They reason that in small scale applications like the one from the experiment developers can compensate for the lack of a type system by relying on their own memory or by giving meaningful names to variables and other code artifacts. This is an interesting hypothesis which would be worth testing in a larger experiment.

3.4 Hanenberg (2010)

Another huge experiment was conducted in 2009 by Hanenberg [Hanenberg 2010b], [Hanenberg 2010a]. In his experiment, a total of 49 participants (undergraduate students) had to solve two tasks of writing a scanner and a parser in a language called Purity specifically written for the experiment in a statically and a dynamically typed version. The hypothesis was that the statically typed language would have a positive impact on the development time. He used an independent (between-subject) design where every participant only solved the two tasks once with either the statically or the dynamically typed language. The reasoning behind this design was that participants would probably be biased toward the same language with a different type system after having used it with another type system already.

Compared to the other previously mentioned experiments, where the total experiment time was about four hours per participant, here they had 27 hours of working time (about 45 hours when including teaching time). He also took a different approach by making the time a fixed factor, because the 27 hours were the maximum time given to the participants, in contrast to other experiments were participants usually had as many time as needed. It was also designed in a way that it was very hard to actually fulfill all requirements in the provided time frame.

The results were that the type systems never had a significantly positive impact on the result, in once case of the tests even produced a significantly negative impact, even though this did not lead to overall significantly negative results for the type system. Concerning threats to validity, Hanenberg discusses quite a few of them in detail. But it should be noted that a huge threat to the validity of the results that Hanenberg already mentioned himself is the amount of unsystematic variation that possibly lurks in the independent design of the experiment. Because no within-subject comparison is possible, any differences between the two type systems’ performance could also have been due to differences in participant quality and many other factors which could have interfered with measuring the intended effect.

3.5 Steinberg, Mayer, Stuchlik and Hanenberg - A running Experiment series

Next, it should be mentioned that a set of experiments was conducted at this institute and they form an experiment series that focuses on the comparison of static and dynamic type systems. Some of them are mentioned with their description and some preliminary results in a summarizing report [Hanenberg 2011]. This thesis can be considered a part of the series, too.

3.5.1 Steinberg (2011)

One of the still unpublished experiments (the Master thesis by Steinberg [Steinberg 2011]) investigated the impact of the type system on debugging for type errors and semantic errors with 30 participants in a repeated-measures cross-over design. It turned out that the type system speeded up the fixing of type errors, and no significant difference was discovered for fixing semantic errors. This could be considered a success, although some contradicting results were achieved that falsified one of the hypotheses which stated that the farther the code that is responsible for the error is away from the point where the error actually occurs, the greater the fixing time should be. The study might have suffered from a huge learning effect and other factors like a larger influence of the kinds of programming tasks on the unsystematic variation.

3.5.2 Mayer (2011)

The second still unpublished experiment (the Bachelor thesis by Mayer [Mayer 2011]) has the hypothesis that a static type system aids in the use of an undocumented API and shortens the time needed for a task. Again, a two group repeated measures design was used and 27 participants took part. The results were at the time of this work’s writing not finished, but initial results revealed rejection of the hypothesis for some tasks and confirmation for others. Again, the results seem inconclusive for the experiment.

3.5.3 Stuchlik and Hanenberg (2011)

Despite the fact that some work is still underway and unpublished, one of the earlier experiments of the series already spawned a separate publication [Stuchlik and Hanenberg 2011] and therefore deserves a deeper look: In the experiment, 21 participants (undergraduates) had to solve 7 Tasks in random order in Java as well as in Groovy. For each language, an own application was used to reduce the learning effect that decreased usefulness of the results in further experiments, even if the two application where structurally equal, only the methods and classes were renamed. One of the assumption was that the more type casts would be needed for a task with a static type system, the larger the difference would be in time taken between the static and the dynamic type system.

Two of the tasks had to be discarded for the analysis because of many comprehension problems with the task descriptions, resulting in a lot of variation. But the results show that there was a significant positive impact of the dynamic type system for some of the tasks. Their reasoning is that type casts are not a trivial aspect of static type systems and need some intellectual effort on part of the developer (even though the results of the better Java developers in the study did not show this effect). Additionally, the initial assumption that more type casts also lead to longer development time had to be rejected, leading the authors to reason that for larger tasks type casts do not play such a major role as they were assumed to.

One of the study’s problems was that the tasks were rather constructed tasks that forced the use of type casts were some developers would argue they would not have been needed in a real application. The combined analysis of both groups as one set is also rather problematic, but this fact was already mentioned by the authors and only makes up a small part of the analysis. Using the development time as the major dependent variable is also problematic from the view that software development is usually so much more than just trying to solve a programming task as quickly as possible, but that fact can be overlooked because it was designed as a controlled experiment where as many factors as possible need to be fixed. Also, there are no other objective measures that can be applied to software.

[...]


[1] Interested individuals might take a look at the article on Wikipedia about the term’s origin: http://en.wikipedia.org/wiki/Debugging

[2] Quite a few studies have been conducted to research debugging approaches, especially the differences between novice and expert programmers. A good starting point is the work of Murphy et al. Murphy et al. [2008] which also references the most important older studies on the topic.

[3] Buss and Henshaw [1992] cite another paper that claims that „some 30-35% of total life-cycle costs are consumed in trying to understand software after it has been delivered, to make changes“. Unfortunately, the original article Hall [1992] could not be retrieved for reviewing.

[4] Also see Lamport and Paulson [1999] or Bracha [2004] for more discussions

[5] His method was published in his book called “Novum Organum”. An English translation is available on http://www.constitution.org/bacon/nov_org.htm

[6] A much more thorough book on research methods and the corresponding evaluation is Bortz and Döring [2006], whose focus is on the social sciences though. Two other books are available that focus on experimentation in software engineering. These are Wohlin [2000] and Juristo and Moreno [2001].

Details

Pages
114
Year
2011
ISBN (eBook)
9783656256595
ISBN (Book)
9783656258698
File size
1.6 MB
Language
English
Catalog Number
v199362
Institution / College
University of Duisburg-Essen – Institute for Computer Science and Business Information Systems
Grade
1,0
Tags
Software Empirie Empirical Research Software Engineering Controlled Experiment Softwareempirie Programming Programming Study Java Groovy Type Systems Empirische Softwareforschung Entwicklungsproduktivität Developer Productivity Softwareentwicklung Kontrolliertes Experiment

Share

Previous

Title: Can static type systems speed up programming? An experimental evaluation of static and dynamic type systems