
Robust Graph-Based Static Code Analysis

Author: Samuel Hopstock
ISBN: 9783346063663

Bachelor Thesis, 2019

59 Pages, Grade: 1,0

Samuel Hopstock (Author)

Excerpt

Abstract

Kurzfassung

Glossary

1 Introduction
1.1 Problem Statement
1.2 Thesis Structure

2 Background and State of the Art
2.1 Static Code Analysis
2.2 Robust Analysis
2.2.1 Handling Incomplete Code
2.2.2 Handling Erroneous Code
2.2.3 Handling Inheritance and Interprocedural Dataflow
2.3 Code Property Graph
2.3.1 Abstract Syntax Tree
2.3.2 Control Flow Graph
2.3.3 Data Flow Graph
2.4 Graph Databases
2.5 Related Work

3 Approach and Implementation
3.1 Existing Setup
3.1.1 CPG Generation from Java Source Code
3.1.2 Graph Persistence with Neo4j-OGM
3.2 Improvements to CPG Generation for Robust Analysis
3.2.1 Wrapping Incomplete Code Snippets
3.2.2 Enhanced Analysis Passes
3.2.3 Data Flow Analysis
3.2.4 Type Propagation and the Type Listener System
3.3 Automated Code Crawler
3.3.1 Preparation: Collecting Java Files
3.3.2 CPG Generation
3.3.3 Analysis: Running Queries on the Graph

4 Analyzing Java Cryptography Extension API Misuse
4.1 Misusing Cryptography
4.2 Automated Detection with CPG Queries
4.2.1 Insecure Algorithm Usage
4.2.2 Constant Encryption Passwords
4.3 Analyzing GitHub Repositories
4.3.1 Discovering Java Repositories that use Cryptography
4.3.2 Experiment Setting
4.3.3 Detected Cryptography API Misuses
4.3.4 Performance of the Analysis Process

5 Conclusion

6 Future Work

List of Figures

List of Algorithms

Bibliography

Abstract

Automatic code analysis is a widely used technique for finding and eliminating errors in software projects. Instead of executing the program and verifying that its behavior is correct, as dynamic analysis does, static analysis is applied to its source code. Here, we search for suspicious patterns that are likely to indicate erroneous behavior.

A special class of software bugs are those errors that lead to security vulnerabilities. In this case, attackers may be able to undermine fundamental security aspects, for example by exfiltrating sensitive user data from server applications or by assuming control over the machine running the program in question. Security vulnerabilities in the code can have drastic consequences, which is why it is important to identify them as quickly as possible and fix them immediately afterwards.

This thesis extends the concept of Code Property Graphs (CPGs), which has been proposed for static analysis of C/C++ code, so that it can be applied to programs and incomplete code snippets written in Java. By unifying Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs) in a single data structure, this approach enables searching for vulnerabilities whose code patterns are spread out across the boundaries of single methods and classes. These patterns are identified using the graph query language Cypher, which is provided by the graph database Neo4j.

In an evaluation run on 100 public repositories on GitHub using cryptography, 135 findings of cryptographic API misuse have been identified using this technique. These include the use of insecure algorithms, like the Data Encryption Standard (DES) or Electronic Code Book mode (ECB), and hardcoded passwords that are used for encryption purposes.

Kurzfassung

Automatic code analysis is a widely used technique for finding and eliminating errors in software projects. Instead of executing the program and checking whether its behavior is correct, as is the case in dynamic analysis, static analysis is applied to source code. Here, we search for suspicious patterns that may indicate erroneous behavior.

A special kind of software bug is an error that leads to security vulnerabilities. In this case, attackers may be able to undermine fundamental security aspects by exfiltrating confidential user data from server applications, or by taking control of the computer on which the program in question is running. Security vulnerabilities in code can have drastic consequences, which is why it is important to identify and fix them as quickly as possible.

This thesis extends the concept of Code Property Graphs (CPGs), which were proposed for the static analysis of C/C++ code, in order to apply it to programs and incomplete code fragments written in Java. This approach unifies Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs) in a single data structure and thus enables the search for vulnerabilities whose code patterns are spread across the boundaries of individual methods and classes. These patterns are identified using the graph query language Cypher, which is provided by the graph database Neo4j.

In an evaluation of 100 public repositories on GitHub that use cryptography, 135 cases of cryptographic API misuse were found with this technique. These include the use of insecure algorithms such as the Data Encryption Standard (DES) or Electronic Code Book (ECB) mode, as well as passwords stored in the program code that are used for encryption purposes.

Glossary

Figure not included in this excerpt

1 Introduction

Writing software is an error-prone process. The industry average is suggested to range from 1 to 25 errors per 1000 lines of code [McC04, p. 521]. In many cases, those errors not only affect correct functionality but also cause serious security problems: When working with user-supplied input data, for example, one needs to be especially cautious in order to prevent bugs that would allow the user to inject arbitrary code into the program. If this happens in a server application, an attacker might be able to exfiltrate sensitive data or take total control of the server itself, which must be prevented by any means possible.

In order to reduce the number of defects in software that is to be shipped, especially security-related issues, several analysis techniques have to be applied to find them. Unfortunately, all techniques that aim to check whether a program functions correctly are fundamentally imprecise: As stated in Rice’s theorem, only trivial semantic properties of programs can be decided algorithmically [Ric53]. Semantic correctness is a nontrivial property, so it can only be approximated. Thus, there is not one specific way of finding all bugs, but rather many different approaches that all have individual advantages and weaknesses. Overall, there exist two families of analyses:

- Dynamic analysis means that the program is executed, usually in the form of extensive test suites. Different inputs are provided, and the result is checked for correctness. The effectiveness of finding defects using software tests of course heavily depends on the quality of the test suite. Automated approaches, like fuzzing, help improve the effectiveness of software tests: Fuzzers provide random or invalid inputs, trying to provoke crashes. This may reveal errors in cases where the test suite developers overlooked possible types of inputs. Google, for example, heavily and successfully uses fuzzing techniques to find bugs, both internally and for various open source projects [Ary+19]. If performed thoroughly, dynamic analysis is a computationally expensive process (which is why Google runs its fuzzing platform on 25,000 cores).
- Static analysis is applied directly to the source code of a program, without executing it. This typically makes the process faster than dynamic analysis, as there are no time-consuming tests to be run. As a result, modern development environments provide many different static analysis options, which are able to alert the developer about potential issues nearly in real time. This short feedback loop is very helpful to prevent as many errors as possible at an early stage during development.

1.1 Problem Statement

The downside of static analysis is that it is less precise than dynamic analysis: It can only approximate what will occur when the code is executed, as it lacks access to information that is only present at runtime (e.g. variable values). Because of this, static analysis needs to combine as many techniques as possible in order to collect enough information about the program. Unfortunately, some of them require the program to be in a fully compilable state, with code from third-party dependencies included in the analysis scope.

The topic of this thesis is to develop a graph-based static analysis framework for Java code that tolerates incomplete or non-compiling source code. For this purpose, the concept of Code Property Graphs (CPGs) is to be researched and extended, in order to provide information about more complex erroneous patterns in Java source code. Additionally, an evaluation of the resulting graph model is to be performed by searching for cryptographic vulnerabilities in publicly available Java projects. This evaluation needs to show whether this graph-based analysis approach is capable of finding security issues in Java code, and how feasible the analysis is from a performance point of view.

1.2 Thesis Structure

Chapter 2 explains the fundamental techniques that will be used for this project: What types of analysis do we mean when speaking of static analysis, what is the structure of a CPG and how can it be persisted for further analysis?

In Chapter 3, we will then look at how this graph-based analysis is implemented, by building upon an existing implementation that already provides basic graph generation and persistence features. The existing graph model will be extended to meet CPG requirements. Later, this workflow is integrated into an automated analysis tool, which is able to fetch code from GitHub, convert it into a CPG representation and run analysis queries on it.

The aforementioned analysis tool will finally be used in Chapter 4 to evaluate our CPG model: 100 random Java repositories making use of cryptographic features have been selected and are then processed by the automated analysis tool. As an example of the types of analyses that can be run on top of a CPG representation, queries searching for insecure encryption algorithms and hardcoded passwords will be presented, and their results on the selected repositories will be discussed.

Chapters 5 and 6 conclude the thesis and provide some thoughts about enhancements that can be carried out in the future, in order to further improve the previously developed graph-based analysis process.

2 Background and State of the Art

2.1 Static Code Analysis

There are several types of inspections that are usually carried out during static analysis:

- Lexical Analysis: The raw source code is converted into a tokenized form, where individual words are grouped together in order to form a single token. Analysis is then carried out on the token level.
- Control Flow Analysis: The statements are analyzed in their execution order.
- Data Flow Analysis: Here, we look at the way data is exchanged inside the program, as well as interaction with external data providers.
- Taint Analysis: At which places does the program work with data that has been provided by the user, leading to potential security issues? [OWA19]

All of these techniques can be performed more or less separately, but in the context of graph-based analysis, they can be unified. More details about this graph model will be provided in Section 2.3.

2.2 Robust Analysis

As already mentioned, code found somewhere on the Internet might not always be fully executable. If this is due not only to issues with build configuration files or other organizational matters, but to the source code itself not being directly compilable, static analysis techniques may run into problems as well. Our goal is to make the graph-based analysis approach, which will be discussed in Chapter 3, resilient against several common analysis issues.

2.2.1 Handling Incomplete Code

In light of the huge popularity of programming-centered Q&A websites like StackOverflow, we might also want to apply our graph-based approach to some of the code snippets found there. In this context, we are not dealing with fully working code (sometimes referred to as “minimal working examples”), but rather with single methods or even only statements. For simple problems, like how to use a certain library, this may well be enough to answer the question sufficiently. But, depending on the tools used for static analysis, analyzed code may need to contain all parts that are required by a compiler’s syntax definition. The Java compiler, for example, expects methods to be contained in a class, so snippets containing a single method do not fulfill this specification.

2.2.2 Handling Erroneous Code

Some pieces of code intended to be analyzed might also contain smaller syntactical errors. Often, those are not actually related to the code locations that are of interest: If a semicolon has been forgotten in a line that is only responsible for importing some third-party class, this error is not of interest when looking for serious issues, like security vulnerabilities. Thus, such smaller errors should not prevent the complete file or sample from being analyzed; the analysis should merely skip the erroneous places.

2.2.3 Handling Inheritance and Interprocedural Dataflow

Especially when dealing with object-oriented languages like Java, code tends to be spread over multiple classes and methods. This presumably enhances code quality and maintainability, but makes static analysis more difficult. It is not enough to analyze the code method by method; a more global picture is necessary. We need to keep track of which variables are passed from method to method, in order to see the so-called “interprocedural” dataflow inside the program. This is further complicated by the presence of “virtual” methods when dealing with inheritance: Classes may provide different implementations for methods that have already been defined in a parent class [Gos+18, p. 258]. Thus, the final call target may not be known when the program is compiled, and is resolved dynamically at runtime. This of course is a problem for static analysis, as we cannot use runtime information to find out a method call’s target.
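The following minimal sketch illustrates why virtual methods are hard to resolve statically. The class names Greeter, LoudGreeter and DispatchDemo are hypothetical and exist only for this illustration.

```java
// Hypothetical classes, named only for this illustration.
class Greeter {
    String greet() {
        return "hello from Greeter";
    }
}

class LoudGreeter extends Greeter {
    @Override
    String greet() {
        return "HELLO FROM LOUDGREETER";
    }
}

class DispatchDemo {
    // The declared type of g is Greeter, so the source code alone only says
    // that some greet() implementation is invoked. Which method body runs
    // depends on the runtime type of the object that is passed in.
    static String call(Greeter g) {
        return g.greet();
    }

    public static void main(String[] args) {
        System.out.println(call(new Greeter()));
        System.out.println(call(new LoudGreeter()));
    }
}
```

A static analysis that looks only at call(...) has to consider both greet() implementations as possible call targets.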

2.3 Code Property Graph

A CPG is a graph representation for source code, introduced by Yamaguchi et al. It “combines properties of Abstract Syntax Trees, Control Flow Graphs and Program Dependence Graphs in a joint data structure” [Yam+14]. The advantage of this representation is that subsequent analysis steps can take both syntactic and semantic properties of the program into account. This makes detecting erroneous behaviour much easier, as more complex relationships between code pieces can be identified.

2.3.1 Abstract Syntax Tree

The basis of a CPG is formed by the Abstract Syntax Tree (AST), which can be seen in Figure 2.1 (nodes with children are blue, leaves green). It models the syntactical structure of a program, grouping the simple text representation into individual Java-specific tokens (statements, declarations etc.).

Figure not included in this excerpt

Figure 2.1: General Structure of an Abstract Syntax Tree

The corresponding graph representation is defined as follows:

$G_{AST} = (V_{AST}, E_{AST}, \lambda_{AST}, \mu_{AST})$ is a directed labeled multigraph (multiple edges between the same pair of nodes are allowed), where $V_{AST}$ is the AST’s set of nodes, connected by edges $E_{AST}$. Edges are given labels by the function $\lambda_{AST}$. This label describes the relationship type between parent and child node: As an example, let $p$ be a CallExpression and $q$ this call’s argument. Then we get $\lambda_{AST}(p, q) = \text{“ARGUMENTS”}$ as the label for an edge from $p$ to $q$.
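As a rough illustration of this definition, the sketch below models a directed labeled multigraph in plain Java. All class and method names (AstNode, AstEdge, AstGraph, labels) and the node type "Argument" are assumptions made for this example and are not taken from the thesis implementation.

```java
import java.util.ArrayList;
import java.util.List;

// All names in this sketch are hypothetical.
class AstNode {
    final String type; // e.g. "CallExpression"
    final String code; // source text covered by this node

    AstNode(String type, String code) {
        this.type = type;
        this.code = code;
    }
}

class AstEdge {
    final AstNode from;
    final AstNode to;
    final String label; // the label that lambda_AST assigns to this edge

    AstEdge(AstNode from, AstNode to, String label) {
        this.from = from;
        this.to = to;
        this.label = label;
    }
}

class AstGraph {
    final List<AstNode> nodes = new ArrayList<>(); // V_AST
    // A list instead of a map keyed by (from, to): parallel edges are allowed.
    final List<AstEdge> edges = new ArrayList<>(); // E_AST

    AstNode addNode(String type, String code) {
        AstNode node = new AstNode(type, code);
        nodes.add(node);
        return node;
    }

    void addEdge(AstNode from, AstNode to, String label) {
        edges.add(new AstEdge(from, to, label));
    }

    // Labels of all edges that lead from p to q.
    List<String> labels(AstNode p, AstNode q) {
        List<String> result = new ArrayList<>();
        for (AstEdge e : edges) {
            if (e.from == p && e.to == q) {
                result.add(e.label);
            }
        }
        return result;
    }
}

class AstGraphDemo {
    static List<String> example() {
        AstGraph g = new AstGraph();
        AstNode p = g.addNode("CallExpression", "println(msg)");
        AstNode q = g.addNode("Argument", "msg");
        g.addEdge(p, q, "ARGUMENTS");
        return g.labels(p, q);
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```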

One piece of source code, usually a single file, is called a CompilationUnit (in the context of this thesis: TranslationUnit). It serves as the root node of the AST. This file can then be split into a number of pieces: At its top level, there might be a package declaration and several imports, followed by one or more class declarations (also referred to as RecordDeclarations). A class then again consists of multiple parts, namely fields, methods and optionally inner classes. And this process goes on until finally we have constructed a full syntax tree for the source code. An example of how this might look for a small “Hello World!” program can be seen in Figure 2.2. [SBT19, pp. 5-8]

Figure not included in this excerpt

Figure 2.2: Abstract Syntax Tree of a short program

2.3.2 Control Flow Graph

In the scope of an AST, we still have no way of determining the execution order of statements in a method. The only information that we get is which statements are part of this method. This is where the concept of a Control Flow Graph (CFG) comes in: We extract further information from the source code (e.g. each statement's line number), in order to determine which one is executed first. Then, this data is integrated into the AST, by adding CFG edges from statements executed earlier to their successor. Formally, this happens like this:

$G_{AST+CFG} = (V_{AST}, (E_{AST} \cup E_{CFG}), (\lambda_{AST} \cup \lambda_{CFG}), \mu_{AST})$, with $E_{CFG}$ the newly added CFG edges and $\lambda_{CFG}$ assigning them "CFG" labels for identification. The rest of the graph stays the same.

In Figure 2.3a, this process has been performed for the AST from Figure 2.2 (CFG edges marked in red).
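The step of adding CFG edges can be sketched as follows. This is a deliberately simplified illustration that assumes a straight-line method body without branches or loops, orders statements by line number, and uses hypothetical names (CfgSketch, Statement, Edge, the AST label "BODY"). It is not the CFG pass of the thesis.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CfgSketch {
    static final class Statement {
        final String code;
        final int line;

        Statement(String code, int line) {
            this.code = code;
            this.line = line;
        }
    }

    static final class Edge {
        final String from;
        final String to;
        final String label;

        Edge(String from, String to, String label) {
            this.from = from;
            this.to = to;
            this.label = label;
        }

        @Override
        public String toString() {
            return from + " -[" + label + "]-> " + to;
        }
    }

    // Returns the union of the given AST edges and the new CFG edges.
    // Assumption: the body is straight-line code, so the execution order
    // equals the order of line numbers.
    static List<Edge> withCfgEdges(List<Edge> astEdges, List<Statement> body) {
        List<Statement> ordered = new ArrayList<>(body);
        ordered.sort(Comparator.comparingInt((Statement s) -> s.line));
        List<Edge> result = new ArrayList<>(astEdges); // AST edges stay unchanged
        for (int i = 0; i + 1 < ordered.size(); i++) {
            result.add(new Edge(ordered.get(i).code, ordered.get(i + 1).code, "CFG"));
        }
        return result;
    }

    static List<String> example() {
        List<Edge> ast = new ArrayList<>();
        ast.add(new Edge("main", "int a = 1;", "BODY"));
        ast.add(new Edge("main", "int b = a;", "BODY"));
        List<Statement> body = new ArrayList<>();
        body.add(new Statement("int b = a;", 4)); // deliberately out of order
        body.add(new Statement("int a = 1;", 3));
        List<String> out = new ArrayList<>();
        for (Edge e : withCfgEdges(ast, body)) {
            out.add(e.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        example().forEach(System.out::println);
    }
}
```

Real control flow also needs edges for branches, loops and exceptions, which this sketch leaves out on purpose.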

Figure not included in this excerpt

Figure 2.3: Enhancing the AST from Figure 2.2

2.3.3 Data Flow Graph

In Yamaguchi’s definition of a CPG, the third component is a Program Dependence Graph (PDG). This graph representation has been proposed by Ferrante et al., and consists of two parts: Data dependencies that show which variables affect each other’s values, and control dependencies that model relationships between variable values and changes in control flow [FOW87]. For now, we will limit this part of the CPG to data dependencies, leaving control dependencies for possible future enhancement of the graph model. Those data dependencies form the Data Flow Graph (DFG) in our program representation.

DFG edges in the graph are needed between variables that affect each other, as defined by the PDG. These edges are directed from the variable that provides the value (source) towards the one receiving it (sink). In Figure 2.3b, green DFG edges have been inserted. Note that we will treat the “value” of a method as being its return value, so accordingly, a data dependency edge would be inserted from all return statements towards the method declaration itself. In our example, this situation should be depicted, but we are dealing with a void function that does not return anything, which is why the edge from its return statement is drawn as a dotted line.
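The direction of DFG edges can be illustrated with a small sketch. It assumes a much simplified statement model in which every assignment has exactly one source and one sink. The names DfgSketch, Assignment and dfgEdges, as well as the method f(), are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class DfgSketch {
    // Simplified model of "sink = source;" (an assumption of this sketch).
    static final class Assignment {
        final String sink;
        final String source;

        Assignment(String sink, String source) {
            this.sink = sink;
            this.source = source;
        }
    }

    // Creates one DFG edge from source to sink for each assignment, and one
    // DFG edge from each return statement to the method declaration.
    static List<String> dfgEdges(String method, List<Assignment> assignments,
                                 List<String> returnStatements) {
        List<String> edges = new ArrayList<>();
        for (Assignment a : assignments) {
            edges.add(a.source + " -[DFG]-> " + a.sink);
        }
        for (String ret : returnStatements) {
            edges.add(ret + " -[DFG]-> " + method);
        }
        return edges;
    }

    static List<String> example() {
        // Models the body "int a = 1; int b = a; return b;" of a method f().
        List<Assignment> assignments = new ArrayList<>();
        assignments.add(new Assignment("a", "1"));
        assignments.add(new Assignment("b", "a"));
        List<String> returns = new ArrayList<>();
        returns.add("return b;");
        return dfgEdges("f()", assignments, returns);
    }

    public static void main(String[] args) {
        example().forEach(System.out::println);
    }
}
```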

Finally, this yields the complete CPG:

$G_{CPG} = G_{AST+CFG+DFG} = (V_{AST}, (E_{AST} \cup E_{CFG} \cup E_{DFG}), (\lambda_{AST} \cup \lambda_{CFG} \cup \lambda_{DFG}), \mu_{AST})$.

Here, $E_{DFG}$ of course are the dataflow edges, with $\lambda_{DFG}$ labeling them as “DFG”.

2.4 Graph Databases

One of the easiest ways to persist any kind of graph data structure is to use a dedicated graph database system. Graph databases do not store data in tables like the common relational databases do. Instead, as the name suggests, data points are stored as nodes of a graph that are connected to other nodes via edges. Additionally, graph database systems typically support a more detailed graph model, the so-called “labeled property graph”.

A labeled property graph has the following characteristics:

- It contains nodes and relationships
- Nodes contain properties (key-value pairs).
- Nodes can be labeled with one or more labels.
- Relationships are named and directed, and always have a start and end node.
- Relationships can also contain properties.

[RWE15, p. 4]

When looking at our Code Property Graph, we can see that it can be nicely transformed into a labeled property graph:

- Being a graph, of course it trivially fulfills the requirement of having nodes and edges
- CPG nodes also have properties, like their name, maybe a datatype, associated source code snippets and so on
- Nodes in the CPG are labelled according to their node type: Is it a MethodDeclaration, a RecordDeclaration,. . . ?
- Edge names are used to distinguish the different interleaved graphs that are part of the CPG: dataflow edges get a different label than AST or CFG edges. Additionally, edges are inherently directed, as all of our subgraphs describe some kind of order, be it syntax tree hierarchy, control flow or the direction of data flow
- The only thing that does not play a really important role in the domain of CPGs is the fact that edges can contain properties on their own. For our purposes, simply labelling them is sufficient
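To make the mapping more concrete, here is a minimal in-memory sketch of a labeled property graph that holds two CPG nodes. The classes PgNode, PgRelationship and PropertyGraph, the relationship name "METHODS" and the property values are assumptions for this illustration and do not reflect how Neo4j stores data internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// All names in this sketch are hypothetical.
class PgNode {
    final Set<String> labels = new LinkedHashSet<>();             // one or more labels
    final Map<String, Object> properties = new LinkedHashMap<>(); // key-value pairs

    PgNode(String... labels) {
        Collections.addAll(this.labels, labels);
    }

    PgNode with(String key, Object value) {
        properties.put(key, value);
        return this;
    }
}

class PgRelationship {
    final PgNode start; // relationships are directed: start -> end
    final String type;  // the relationship's name
    final PgNode end;
    final Map<String, Object> properties = new LinkedHashMap<>(); // unused for CPGs

    PgRelationship(PgNode start, String type, PgNode end) {
        this.start = start;
        this.type = type;
        this.end = end;
    }
}

class PropertyGraph {
    final List<PgNode> nodes = new ArrayList<>();
    final List<PgRelationship> relationships = new ArrayList<>();

    PgNode addNode(PgNode node) {
        nodes.add(node);
        return node;
    }

    PgRelationship relate(PgNode start, String type, PgNode end) {
        PgRelationship rel = new PgRelationship(start, type, end);
        relationships.add(rel);
        return rel;
    }
}

class PropertyGraphDemo {
    static PropertyGraph example() {
        PropertyGraph g = new PropertyGraph();
        PgNode record = g.addNode(
            new PgNode("Node", "RecordDeclaration").with("name", "HelloWorld"));
        PgNode method = g.addNode(
            new PgNode("Node", "MethodDeclaration").with("name", "main"));
        g.relate(record, "METHODS", method);
        return g;
    }

    public static void main(String[] args) {
        PgRelationship rel = example().relationships.get(0);
        System.out.println(rel.start.properties.get("name") + " -[" + rel.type
            + "]-> " + rel.end.properties.get("name"));
    }
}
```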

One big advantage graph databases have over traditional relational systems is that, in order to find related data, no computationally expensive join operations have to be performed. As a consequence, they are often used in systems with huge amounts of data, where it is crucial to find relevant connections in real time. Social media platforms thus rely heavily on graph databases. Twitter, for example, created FlockDB, their own graph database system [Poi10].

Figure not included in this excerpt

Figure 2.4: Graphical Representation of a Cypher Query

But an additional advantage, which is the more interesting point for our use case, is the natively exposed feature of running graph-specific traversals on the data. When we want to analyze a CPG, real-time insertion and query abilities are not our top concern. Instead, we will only persist the graph once. But then, we want to find various patterns inside the graph itself. And this is where graph database systems shine: Searching for patterns that can be described as paths across the graph, containing specific edge and node types, is a much easier task when performed on graph databases than on traditional data stores. They expose a graph-specific query interface, like Neo4j's "Cypher", which intuitively resembles a graphical notation of the desired pattern. Consider this example: We have a simple database of people's phone calls, where nodes describe people and a phone call is modelled by an edge from caller to callee. Now when we would like to get all the people that "Alice" has called, we issue the following Cypher query: match (a)-[:call]->(b) where a.name = "Alice" return b. Looking at this query, we can intuitively see that in the graph, a and b represent nodes, connected by a "call" edge from a to b.

Similarly, we can also search for more complex paths, for example if we would like to find pairs of people that Alice has called at some point, who have also had another conversation between themselves. Writing a query like this in traditional SQL can be quite an effort. But modelling the situation graphically is rather easy, as shown in Figure 2.4: Alice has outgoing "call" edges to Bob and Charlie, and those two nodes need to be connected as well. But here, it is irrelevant who initiated the call. Thus this edge is undirected. And this can be directly translated into a Cypher query: match (b)<-[:call]-(a {name: "Alice"})-[:call]->(c)-[:call]-(b) return b, c. As a small side note, this query also shows that matching based on node properties can be done not only with the SQL-like syntax from the previous query, but also by integrating the condition into the node pattern.
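To show what the two queries compute, the sketch below evaluates the same patterns on a small in-memory list of calls in plain Java. It is only an illustration: people are identified by name, parallel calls are not distinguished, and the class CallGraph with its methods is hypothetical. It does not use Neo4j.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class CallGraph {
    // Directed "call" edges, stored as {caller, callee} name pairs.
    private final List<String[]> calls = new ArrayList<>();

    void call(String caller, String callee) {
        calls.add(new String[] {caller, callee});
    }

    // Counterpart of: match (a)-[:call]->(b) where a.name = name return b
    Set<String> calledBy(String name) {
        Set<String> result = new TreeSet<>();
        for (String[] c : calls) {
            if (c[0].equals(name)) {
                result.add(c[1]);
            }
        }
        return result;
    }

    // Undirected check, like (c)-[:call]-(b): the call direction is ignored.
    boolean talked(String x, String y) {
        for (String[] c : calls) {
            boolean forward = c[0].equals(x) && c[1].equals(y);
            boolean backward = c[0].equals(y) && c[1].equals(x);
            if (forward || backward) {
                return true;
            }
        }
        return false;
    }

    // Counterpart of:
    // match (b)<-[:call]-(a {name: name})-[:call]->(c)-[:call]-(b) return b, c
    // Each pair is reported in both orders, formatted as "b,c".
    List<String> calledPairsThatTalked(String name) {
        List<String> pairs = new ArrayList<>();
        Set<String> callees = calledBy(name);
        for (String b : callees) {
            for (String c : callees) {
                if (!b.equals(c) && talked(c, b)) {
                    pairs.add(b + "," + c);
                }
            }
        }
        return pairs;
    }
}

class CallGraphDemo {
    static CallGraph build() {
        CallGraph g = new CallGraph();
        g.call("Alice", "Bob");
        g.call("Alice", "Charlie");
        g.call("Alice", "Dave");
        g.call("Charlie", "Bob");
        return g;
    }

    public static void main(String[] args) {
        CallGraph g = build();
        System.out.println(g.calledBy("Alice"));
        System.out.println(g.calledPairsThatTalked("Alice"));
    }
}
```

The nested loops hint at why such pattern searches become unwieldy by hand, while the Cypher pattern stays a single line.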

So all in all, graph databases are efficient both for real-time insertion of data points, where relations between them are a central part of the data model, and for running complex pattern-based queries. And because of this last point, a graph database is ideally suited for being the persistence layer in our CPG analysis project.

2.5 Related Work

Based on the CPG model developed by Yamaguchi et al., a tool called “Joern”¹ has been created, which uses this technique for C/C++ code analysis. Java is another widely used language for all kinds of software development, so utilizing a CPG-based analysis technique for projects in this language would provide a great benefit. This is why, in this thesis, we will bring the CPG approach to Java.

Regarding the graph-based search for cryptographic vulnerabilities, as we will present it in Chapter 4, a similar solution has been developed by Benz [Ben16]. For his approach, he did not rely on Yamaguchi’s CPG model but rather used the Graph-Based Object Usage Model (GROUM) that was introduced by Nguyen et al. [Ngu+09]. He modified the base model in a way that interprocedural program flow could be considered during the analysis process. Interprocedural analysis takes into account that data might be exchanged between methods and does not only consider code inside the boundaries of single methods. A goal of this thesis is to show that the Java version of the CPG inherently supports interprocedural analysis, without additional method inlining or other techniques needed for the GROUM approach.

3 Approach and Implementation

3.1 Existing Setup

In the following sections, those parts of the CPG generation process will be discussed that have already been implemented before the start of this thesis. We will treat this situation as the starting point for all further enhancements that will be presented.

3.1.1 CPG Generation from Java Source Code

AST Generation with the JavaParser

As a starting point for CPG generation, we first convert the Java code into its AST representation. This is done with the JavaParser¹ library, which can handle any code from Java versions 1 through 12. This flexibility is important for our goal of robust analysis, as we want our tool to be able to parse any Java code we can find.

The JavaParser provides us with a custom tree data structure that describes the AST. In order to provide a layer of abstraction, we then proceed to translate JavaParser nodes into our own node classes. This node mapping gives us the ability to provide alternative ways of generating the AST (e.g. supporting multiple languages) without having to change further analysis steps (see Figure 3.1). One example of this is a C/C++ frontend that is being developed in parallel to this thesis project. Of course the JavaParser cannot parse C/C++ code, but the nodes used in the AST of C/C++ are mostly similar to Java syntax nodes. So having a more abstract node representation makes it possible to create a graph model that can be further processed, without having to know whether it was created from Java source code or from a different language.

AST parsing and translation to our abstract graph model has already been implemented, but in the context of this thesis, one central shortcoming will be addressed: The JavaParser expects the input code to be (more or less) fully parseable by a normal Java compiler. Luckily, it is already capable of dealing with small syntax errors, so we do not have to take care of them. But when trying to analyze incomplete code fragments, the parser will refuse to work. A solution for this issue will be presented in a later section.

Figure not included in this excerpt

Figure 3.1: The individual language frontends translate their internal graph model to an abstract one. Further manipulations are then independent from the previously used frontend

Graph Transformation using a Pass System

After the AST creation, we only have a really basic graph representation of the analyzed code. We still have no edges describing control flow and data flow paths, but those are a central part of a CPG. So at this point, we need to further enrich the graph by adding this information. For this purpose, our CPG parser uses a pass system: A pass is a procedure that performs graph modifications. Once the initial AST has been created, each registered pass (e.g. one dataflow analysis pass, one control flow analysis pass and so on) is invoked one after another, and receives the current state of the graph. The pass is then expected to modify the graph in one specific way. The dataflow pass, for example, should add DFG edges. Like this, the desired CPG is gradually built from the original AST, containing more precise information after each pass.
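A pass system of this kind can be sketched in a few lines. The names CpgGraph, Pass, PassManager and the string-based edge representation are assumptions made for this illustration. The actual pass interface of the implementation is not shown in this excerpt.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the graph; edges are plain strings to keep the sketch short.
class CpgGraph {
    final List<String> edges = new ArrayList<>();
}

// A pass receives the current state of the graph and modifies it in one way.
interface Pass {
    void run(CpgGraph graph);
}

class PassManager {
    private final List<Pass> passes = new ArrayList<>();

    void register(Pass pass) {
        passes.add(pass);
    }

    // Invokes the registered passes one after another, in registration order.
    void runAll(CpgGraph graph) {
        for (Pass pass : passes) {
            pass.run(graph);
        }
    }
}

class PassDemo {
    static List<String> example() {
        CpgGraph graph = new CpgGraph();
        graph.edges.add("AST: method -> statement"); // state after AST creation

        PassManager manager = new PassManager();
        manager.register(g -> g.edges.add("CFG: statement1 -> statement2"));
        manager.register(g -> g.edges.add("DFG: source -> sink"));
        manager.runAll(graph);
        return graph.edges;
    }

    public static void main(String[] args) {
        example().forEach(System.out::println);
    }
}
```

Because each pass only sees the graph, new analyses can be added by registering another pass without touching the language frontend.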

Only a rudimentary pass for CFG generation and a simple call resolver had already been implemented. In the scope of this thesis, several improvements to this situation will be presented (see section 3.2.2 for details).

3.1.2 Graph Persistence with Neo4j-OGM

Internally, the gradually constructed graph is modeled by using Java objects: A single class Node serves as a base for all objects that should be later converted to nodes in the graph database. Node types (e.g. RecordDeclaration, MethodDeclaration etc.) are modeled by subclasses of Node. Like this, we get a hierarchical structure of node types whose root is Node.

As a next step, we need to be able to create edges in the graph. This is done by the use of fields: Whenever a node object has field variables that are also of type Node, these will later on be converted to individual edges. E.g. if a RecordDeclaration has a field variable named methods that contains a list of MethodDeclarations, this is interpreted as the RecordDeclaration having “methods” edges to each of the contained MethodDeclarations.

The translation process itself is handled by an Object Graph Mapping (OGM) framework called Neo4j-OGM. It performs the just described actions of mapping objects to graph nodes and field variables to edges (or relations, as they are called in Neo4j). Using this approach provides great flexibility during the graph generation phase, as we do not have to operate directly on a graph database but can rather use a Java abstraction layer. This allows us to embed additional functionality inside the node objects that could not be persisted to the database in an easy way.
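The idea of deriving edges from fields can be illustrated with plain Java reflection. This sketch only mimics the principle. It is not how Neo4j-OGM is implemented, it does not use the Neo4j-OGM API, and the helper FieldEdgeMapper as well as the simplified node classes are hypothetical.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Simplified stand-ins for the node classes described above.
class Node {
    String name;

    Node(String name) {
        this.name = name;
    }
}

class MethodDeclaration extends Node {
    MethodDeclaration(String name) {
        super(name);
    }
}

class RecordDeclaration extends Node {
    List<MethodDeclaration> methods = new ArrayList<>();

    RecordDeclaration(String name) {
        super(name);
    }
}

class FieldEdgeMapper {
    // Turns every field that holds a Node, or a collection of Nodes,
    // into edges that are named after the field.
    static List<String> edgesOf(Node node) {
        List<String> edges = new ArrayList<>();
        for (Class<?> c = node.getClass(); c != Object.class; c = c.getSuperclass()) {
            for (Field field : c.getDeclaredFields()) {
                field.setAccessible(true);
                Object value;
                try {
                    value = field.get(node);
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
                if (value instanceof Node) {
                    edges.add(describe(node, field, (Node) value));
                } else if (value instanceof Collection) {
                    for (Object element : (Collection<?>) value) {
                        if (element instanceof Node) {
                            edges.add(describe(node, field, (Node) element));
                        }
                    }
                }
            }
        }
        return edges;
    }

    private static String describe(Node from, Field field, Node to) {
        return from.name + " -[" + field.getName() + "]-> " + to.name;
    }

    static List<String> example() {
        RecordDeclaration person = new RecordDeclaration("Person");
        person.methods.add(new MethodDeclaration("getName"));
        person.methods.add(new MethodDeclaration("setName"));
        return edgesOf(person);
    }

    public static void main(String[] args) {
        example().forEach(System.out::println);
    }
}
```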

3.2 Improvements to CPG Generation for Robust Analysis

Using the just described situation as a baseline, we can now start to improve the graph generation process. The goal is that we will reach a graph representation that meets all the requirements of a CPG as discussed in Section 2.3. The starting point will be to make the generation process resilient against issues resulting from incomplete code. Then, the remaining parts of a CPG will be added, namely CFG and DFG.

3.2.1 Wrapping Incomplete Code Snippets

The JavaParser expects each parsed file to contain Java code that is fully syntactically correct. An example for this can be seen in Figure 3.2: The main functionality (printing "Hello world") needs to be contained inside a method of a class. If this is not the case, the JavaParser refuses to produce an abstract syntax tree for the program.

But we also want to be able to analyze incomplete code snippets, e.g. single methods from sources like StackOverflow. This is why we need a way to overcome this limitation of the JavaParser. As a first step, we need to look at what the different forms are, in which incomplete code (that programmers can still understand) can be provided. Those are the types of code that will come up on code sharing sites and thus the ones that are relevant for our analysis.
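One conceivable way to make such snippets parseable is to wrap them in the missing surrounding structure before handing them to the parser. The sketch below is an assumption about how this could look and is not the procedure developed in the thesis. The snippet kinds and the placeholder names WrapperClass and wrapperMethod are hypothetical, and deciding which kind applies (for example by retrying the parser) is left out.

```java
class SnippetWrapper {
    // Hypothetical classification of what a snippet contains.
    enum Kind {
        COMPILATION_UNIT, // complete file, nothing to add
        CLASS_MEMBERS,    // methods or fields without a surrounding class
        STATEMENTS        // bare statements without a surrounding method
    }

    // Placeholder names, chosen only for this sketch.
    static final String WRAPPER_CLASS = "WrapperClass";
    static final String WRAPPER_METHOD = "wrapperMethod";

    static String wrap(String snippet, Kind kind) {
        switch (kind) {
            case COMPILATION_UNIT:
                return snippet;
            case CLASS_MEMBERS:
                return "class " + WRAPPER_CLASS + " {\n" + snippet + "\n}\n";
            case STATEMENTS:
                return "class " + WRAPPER_CLASS + " {\n"
                    + "void " + WRAPPER_METHOD + "() {\n" + snippet + "\n}\n}\n";
            default:
                throw new IllegalArgumentException("unknown kind: " + kind);
        }
    }

    public static void main(String[] args) {
        System.out.print(wrap("System.out.println(\"Hello world\");", Kind.STATEMENTS));
    }
}
```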

[...]

¹ https://joern.io/

¹ https://github.com/javaparser/javaparser

Excerpt out of 59 pages

Details

Title: Robust Graph-Based Static Code Analysis
College: Technical University of Munich (Fakultät für Informatik)
Grade: 1,0
Author: Samuel Hopstock (Author)
Year: 2019
Pages: 59
Catalog Number: V505779
ISBN (eBook): 9783346063663
Language: English
Keywords: Security, Code Property Graph, CPG, Abstract Syntax Tree, AST, Control Flow Graph, CFG, Data Flow Graph, DFG, Neo4j, Vulnerability, Java

Quote paper: Samuel Hopstock (Author), 2019, Robust Graph-Based Static Code Analysis, Munich, GRIN Verlag, https://www.grin.com/document/505779

