STATOR:Establishing code-base
Here we list pros and cons of all our approaches towards establishing STATOR's code-base. All pros/cons are listed in the order from the most important to the least one.
- TODO:
- Update information in all pros/cons points in all subsections below such that free tools like PAGAI, FramaC, UFO, CPROVER, CBMC, KLEE, Bugst, etc. are included.
Contents
- 1 Perspective 1: Main implementation language/platform
- 2 Perspective 2: Programs in what languages should STATOR analyse
- 3 Perspective 3: What sort of programs we consider
- 4 Perspective 4: Build STATOR on top on a single freely available tool
- 5 Perspective 5: Build STATOR on top of a code collected from several different sources (tools)
- 6 Open questions
- 7 Results
Perspective 1: Main implementation language/platform
Java
- pros
- We can take a lot of code from tools like CPAchecker and Stanse with minimal effort.
- Huge standard library
- There several very good IDEs for development of Java applications.
- It is stable and portable to many different platforms.
- George already implemented its analysis there and no issues appeared.
- Many students know it.
- cons
- Implementation of some program analyses may be slower then the one in programming languages like OCaml or F#. Nevertheless, it is quite difficult to predict impact of this potentially increased programming effort.
F# (.NET/Mono)
- pros
- May provide faster implementation of some program analyses then in Java, because symbolic manipulations of particular data structures (like syntax trees, expressions, BDD,...)
- We can take a C front end from tools like CPAchecker, Stanse, Bugst, or Predator with minimal effort. It only requires to dump internal CFG representations of the tools to disc and load them into equivalent data structures syntactically expressed in F#.
- When we are in .NET we can freely use any other .NET language (C#, Visual basic, managed C++, ...). Since many students know C# and C# is similar to (or better then) Java, so from this point of view F# is better choice then Java. Moreover, there is also implemented integration of non-managed languages, but portability is then a question.
- We can move to Windows and thus develop STATOR in Visual Studio. (Visual Studio can really speed-up development (e.g. excellent debugging, edit&continue, ...)
- cons
- Development in Mono can be less comfortable. Also support of F# in Mono known to us yet. Similarly, state of standard library is not know.
- Portability mostly depends on qualities of Mono.
- Not many people knows it, but the situation improves.
OCaml
- pros
- May provide faster implementation of some program analyses then in Java, because symbolic manipulations of particular data structures (like syntax trees, expressions, BDD,...)
- We can take a C front end from tools like CPAchecker, Stanse, Bugst, or Predator with minimal effort. It only requires to dump internal CFG representations of the tools to disc and load them into equivalent data structures syntactically expressed in OCaml.
- cons
- OCaml development environments are still immature and primitive. This can slow down whole development.
- Not many people knows it, but the situation improves.
- OCaml is not that portable as Java.
C++
- pros
- There is software like PAGAI, KLEE, Predator, Bugst, Symbiotic from which we can take code. Moreover, if we use F# (i.e. .NET) then calls to native C++ libraries are no problem (they are quite nicely supported in .NET).
- Analyses can be faster by constant factor relative to managed languages (Java, F#, ...).
- Overloading of operators allows to do symbolic manipulations of expressions very brief. So we can possibly get close to F#/OCaml features. Nevertheless, it will require to actually implement those operators first.
- cons
- Standard library is definitely not as large as one of Java or .NET languages. Question is whether it is larger or not in a comparison with OCaml's one. But in any case, there are other C++ libraries available, like Boost, which make the situation not that bad.
- It is not type-safe language, i.e. there is vulnerability to more kinds of programming errors.
- Students generally do not like this language.
Perspective 2: Programs in what languages should STATOR analyse
Using dedicated C front-end
- pros
- Such front-ends are already available in tools like CPAchecker, Stanse, Predator, ...
- In CPAchecker they are implementing front-end to Java. So we can take their code, when it is ready. There might further be implemented front-ends to other languages in some other free tools. But then we could use them as external translators from their internal CFG representation into our automata files.
- cons
- Extension of STATOR to other languages will be complicated, if others (like Dirk's team) terminates development of Java front-end or we fail to find free tool which provides front-end to language we are interested in.
Using LLVM front-end
- pros
- LLVM project intends to compile several languages into LLVM byte code (C,C++, Objective C, Java,...). So, this might be very valuable for future of STATOR. But if development of STATOR will formally end in 2017. If there are no successor projects, then this feature does not matter.
- cons
- I know only one tool, Bugst, which translates LLVM into an internal control-flow graph representation. So, only this tool can be used for F# or OCaml platform for STATOR. But Bugst is not finished, so some LLVM features are not supported. Other LLVM based tools like PAGAI, KLEE do not transform LLVM into an internal CFG representation.
Perspective 3: What sort of programs we consider
There should mostly be considered benchmarks from SV_COMP. We are further considering some (probably smaller) open-source projects and some SW for drones and airplanes.
It would be really nice to try C front-ends from CPAchecker, Stanse, Bugst, etc. to few representative programs from each category.
Perspective 4: Build STATOR on top on a single freely available tool
The tools are listed in no particular order (just as they came into my mind). Did not we forget some interesting open-source tool?
CPAchecker
- pros
- According to README.txt file of CPAchecker: "CPAchecker is able to parse and analyze a large subset of (GNU)C." (But unfortunately, I did not find a list of missing features.)
- All preprocessed C files passed into the tool are linked into one big file. This is stated in README.txt file of the tool. Bit it is still unclear, whether it works as in a linker (e.i. properly are handled definitions/declarations, externa/static variables and functions, what about functions without definitions, ...), or whether they simply concatenate all the files into big one. Only the former case is valuable for us.
- There is a support of Z3, MATH SAT 4, and CVC4 SMT solvers. TODO: check it, is it correct/complete??
- The tool has live community, it is frequently improved.
- They currently work on front-end to Java. Nevertheless, CFG representation differs from the one of C and it is also the "old-school" one.
- George already implemented its analysis there and no issues appeared.
- There are already implemented several program analyses. So, we can try/compare with/use then with minimal effort.
- cons
- It does not provide the "new-school" control-flow graph representation. All its analyses and all internal algorithms of CPAchecker work on the "old-school" representation.
- Since the tool is written in Java, we cannot use those interesting programming features of OCaml or F#.
- The tool accepts already preprocessed C files. So, we need to handle build systems of analysed C project and the preprocessing phase by ourselves.
PAGAI
- pros
- Uses LLVM front-end. So we can parse ANSI/GNU programs.
- Since it is implemented in C++ we can get close to features of OCaml or F#, but it will require to overload several operators.
- We can use its analyses for our purposes.
- cons
- It does not convert LLVM into internal representation. It directly analyses LLVM. So, changes in LLVM has impact potentially to entire tool.
- It uses the "old-school" control-flow graph representation (the one which is natively available from LLVM)
FramaC
- pros
- TODO
- cons
- TODO
UFO
- pros
- TODO
- cons
- TODO
Stanse
- pros
- It provide its own ANSII+GNU C parser and translates the program into internal control-flow graph representation and also AST, both expressed in terms of Java classes. AST is an XML tree. Further, individual program expressions, linked to particular control-flow graph edges, are also AST trees represented in XML form.
- There is a support of makefiles. It means that the tool can be passed a particular makefile instead of a single C file.
- cons
- Since the tool has its own C parser, then any changes in ANSII+GNU will have to be updated by ourselves.
- When we wanted to analyse programs written in other languages, then we will have to provide new front-end which compiled to internal Java classes. Moreover, this internal representation does not support, for example, exceptions.
- The internal control-flow graph representation of the tool is the "old-school" one.
- Stanse is written in Java (except the C parsed which is written in C and is based on ANTLR), so language features of F# or OCaml are not available.
- The tool cannot provide many other program analyses. There is basically only one (which may appear in many form according to its configuration though).
- The tool is not further improved.
Bugst
Bugst is not a tool. It is a collection of C++ libraries (and their tests) providing different functionality related to program analysis. Nevertheless, there are also available several tools (of different purpose) which are strongly based on (or somehow support) Bugst libraries and so they are distributed via Bugst's repository.
- pros
- It translates LLVM to an internal control-flow graph representation. So, there is an potential to analyse programs is several languages and further changes in LLVM have impact only to minority of the Bugst.
- There is a support of Z3, MATH SAT 4, and CVC4 SMT solvers. Note though that support of MATH SAT 4 and CVC4 stays in separate branch and it still was not merged into master branch.
- Since Bugst is written in C++ it is possible to achieve features of languages like F# or OCaml. But it will be necessary to overload several operators first.
- cons
- Bugst provides the "old-school" control-flow graph representation.
- Although Bugst is maintained, there is currently only on maintainer. So the progress is very slow.
- Bugst can only be built in Visual Studio 2010, since it uses MS Build system. Nevertheless, CMake support is in progress.
Predator
- pros
- Since its front-end is GCC, it can parse ANSI+GNU C programs. So, there is also an potential to analyse programs is several languages.
- Since the tool is written in C++ it is possible to achieve features of languages like F# or OCaml. But it will be necessary to overload several operators first.
- The tool is maintained and improved.
- There is already implemented quite mature shape analysis.
- cons
- The tool provides the "old-school" control-flow graph representation.
- The portability is questionable. For example, how about Windows platform?
KLEE
It is a tool providing automated test generation. It is based on program analysis called symbolic execution. It analyses LLVM code and it is distributed via LLVM repository.
- pros
- Uses LLVM front-end. So we can parse ANSI/GNU programs.
- Since it is implemented in C++ we can get close to features of OCaml or F#, but it will require to overload several operators.
- We can use its symbolic execution program analysis.
- cons
- It does not convert LLVM into internal representation. It directly analyses LLVM. So, changes in LLVM has impact potentially to entire tool. Nevertheless, this issue is maintained by LLVM community.
- It uses the "old-school" control-flow graph representation (the one which is natively available from LLVM)
CBMC
- pros
- TODO
- cons
- TODO
CPROVER
- pros
- TODO
- cons
- TODO
Perspective 5: Build STATOR on top of a code collected from several different sources (tools)
Here we consider possibilities to take useful code from different open-source tools. This code may differ in implementation language. The goal here is:
- Make I list of code we would like to have (like front-end to C/LLVM, ...)
- Search open-source tools for their implementations of items in our list (we can extend the list if we find some interesting functionality).
- Prune the result s.t. variant with minimal compatibility issue (i.e. language differences and type incompatibility). For the pruning it is crucial a good estimation of amount of work which has to be done to resolve incompatibility issue (e.g. where we can use dynamic libraries, where inter-language call, what will have to be reimplemented, and for what items in the list we have nothing).
TODO: complete the section!
Open questions
How much time/programming effort can we actually save when we use languages like F# or OCaml relative to other languages like Java or C++?
- ???
How much we can actually benefit (i.e. time/programming effort savings) from the "new-school" control-flow graph representation relative to the use of the "old-school" one?
- ???
How much is important for as that CPAchecker already provides several program analyses like predicate abstraction?
- ???
How much restrictive is CPAchecker's concept of CPAs with respect to kind of program analyses we intend to implement in STATOR? (Of course, there is a possibility to circumvent their concept, but it is against tool's philosophy, i.e. it is bad.)
- ???
What is the ideal code-base (i.e. even not real/available, hypothetical) for STATOR from the point of view from members of the STATOR team?
- David: ???
- George: ???
- Marek: In this moment I cannot say, because I do not have clear view of features of our analyses. I need to read related papers first. Even such essential things like "new-school vs. new-school" control-flow graph representation is a mystery for me. So, my opinion to this question does not matter.
How many PhD students we plan to join the STATOR project till the end of 2017?
- ???