Joern for Beginners: A How-To Guide for Source Code Analysis

This article introduces the use of open source Joern for vulnerability mining, discussing specific tools and methods for identifying vulnerabilities in code security audits.

Joern 101

Joern is an open source code analysis platform that converts source code into a Code Property Graph (CPG) through a variety of front ends, and then lets you query and analyze the CPG with Joern's built-in query syntax. Readers familiar with CodeQL can think of Joern as an open source alternative to CodeQL.

Joern supports different languages through different front ends: for example, Eclipse CDT for fuzzy parsing of C/C++ code, Ghidra for parsing binary files, and Soot for parsing Java bytecode. The maturity of the front ends varies, as shown in the following table:

Name	Built with	Maturity
C/C++	Eclipse CDT	Very High
Java	JavaParser	Very High
JavaScript	GraalVM	High
Python	JavaCC	High
x86/x64	Ghidra	High
JVM Bytecode	Soot	Medium
Kotlin	IntelliJ PSI	Medium
PHP	PHP-Parser	Medium
Go	go.parser	Medium
Ruby	ANTLR	Medium-Low
Swift	SwiftSyntax	Medium
C#	Roslyn	Medium-Low

Joern-cli

joern-cli is the suite we use to build code property graphs and run searches. You can download the release version of joern-cli.zip from GitHub, or you can compile it yourself.

When writing complex rules later, you will want auto-completion and the ability to search the source code for usage examples. Therefore, it is recommended to build from source:

git clone https://github.com/joernio/joern
cd joern
sbt stage

After the installation is complete, you can test it with a simple Java file, Hello.java, as follows:

package demo;

import a.b.c.Foo;

public class Hello {
  static Foo foo = new Foo();

  public static void main(String[] args) {
    String data = foo.bar(args[0]);
    Runtime.exec(data);
  }
}

To build the source code file into a code property graph database (cpg), you can use the joern-parse script:

joern-parse --language javasrc Hello.java -o hello.cpg

There are a few points to note:

The language selected is javasrc rather than java; java is the Soot-based front end for bytecode.

You can use joern-parse --list-languages to list all supported front ends (languages).

Hello.java references the external class a.b.c.Foo, so it cannot actually be compiled, but the cpg is still generated correctly.

Query the cpg:

joern hello.cpg
joern> cpg
val res0: io.shiftleft.codepropertygraph.generated.Cpg = Cpg[Graph[136 nodes]]
# ...
joern> def source = cpg.method("main").parameter
joern> def sink = cpg.call("exec").argument
joern> sink.reachableByFlows(source).p
val res7: List[String] = List(
  """
┌─────────────────┬──────────────────────────────┬────┬──────┬────┐
│nodeType         │tracked                       │line│method│file│
├─────────────────┼──────────────────────────────┼────┼──────┼────┤
│MethodParameterIn│main(String[] args)           │8   │main  │    │
│Call             │demo.Hello.foo.bar(args[0])   │9   │main  │    │
│Call             │demo.Hello.foo.bar(args[0])   │9   │main  │    │
│Call             │demo.Hello.foo.bar(args[0])   │9   │main  │    │
│Identifier       │String data = foo.bar(args[0])│9   │main  │    │
│Identifier       │Runtime.exec(data)            │10  │main  │    │
└─────────────────┴──────────────────────────────┴────┴──────┴────┘"""
)

The core query is just three lines, looking for the data flow link from the main parameter to the Runtime.exec parameter:

def source = cpg.method("main").parameter
def sink = cpg.call("exec").argument
sink.reachableByFlows(source).p

Here cpg is the root object of our code property graph, and every code search starts from it. The query statements above are actually legal Scala statements, and joern-cli itself is also a Scala interpreter. In addition to the cpg object, some global objects and methods are defined, which can be viewed using help:

joern> help
val res8: Helper = Welcome to the interactive help system. Below you find
a table of all available top-level commands. To get
more detailed help on a specific command, just type

`help.<command>`.

Try `help.importCode` to begin with.
┌────────────────┬────────────────────────────────────────────────┬─────────────────────────┐
│command         │description                                     │example                  │
├────────────────┼────────────────────────────────────────────────┼─────────────────────────┤
│close           │Close project by name                           │close(projectName)       │
│cpg             │CPG of the active project                       │cpg.method.l             │
│delete          │Close and remove project from disk              │delete(projectName)      │
│exit            │Exit the REPL                                   │                         │
│importCode      │Create new project from code                    │importCode("example.jar")│
│importCpg       │Create new project from existing CPG            │importCpg("cpg.bin.zip") │
│open            │Open project by name                            │open("projectName")      │
│openForInputPath│Open project for input path                     │                         │
│project         │Currently active project                        │project                  │
│run             │Run analyzer on active CPG                      │run.securityprofile      │
│save            │Write all changes to disk                       │save                     │
│switchWorkspace │Close current workspace and open a different one│                         │
│workspace       │Access to the workspace directory               │workspace                │
└────────────────┴────────────────────────────────────────────────┴─────────────────────────┘

joern> cpg.help
...

In addition to joern-parse and joern, the joern-cli suite also includes joern-export, joern-scan and other scripts for exporting control flow graphs and running batch scans. For details on these tools, please refer to the official Joern Documentation.

Steps

With cpg as the root node, we can reach every node type in the code property graph, such as class, method, call, control flow, etc. These nodes are obtained as properties, for example:

cpg.method: represents all method nodes.

cpg.parameter: represents the parameters of all methods.

cpg.method("main").parameter: Represents the parameters of all methods named main.

cpg.typeDecl("Foo").method: represents all methods in all classes named Foo.

Taking cpg.method as an example, its return value is of type Iterator[Method], so all Scala Iterator methods can be called on it, such as map, filter, toList, size, head, etc. These are covered again when Scala is introduced later.

cpg.method.parameter actually extends Iterator to support subqueries. In addition, Joern also implements some special Steps, such as repeat and times for recursive queries and sink.reachableBy(source) for data flow analysis.

For all types of Steps, please refer to the Node-Type Steps section in the documentation.

Scala 101

As mentioned earlier, Joern is written in Scala, and its query statements are also Scala statements, so to get comfortable with Joern queries we first need a brief look at Scala. Many people who have never used Scala assume it is as unfriendly as Lisp or Haskell when they first see it, but it is actually a fairly simple language.

Scala is a JVM-based language, which means that .scala source code is compiled into .class files, just like Java, and loaded and run by the JVM. In fact, Scala can be regarded as an extension of Java and can use the existing Java ecosystem. If you have used Kotlin, Scala's syntax should feel familiar.

HelloWorld

A simple hello.scala looks like this:

object hello {
  def main(args: Array[String]) = {
    println("Hello, World!")
  }
}

If it's Scala3, it can be even simpler:

@main def hello() = println("Hello, World!")

Compile and run:

scalac hello.scala
scala hello

Grammar Speedrun

This section mainly introduces some frequently used syntaxes of Joern queries.

First, the variable definition:

var a = 1
val b = List(1, 2)
val c = 1 to 5 // Range

You can see that although Scala is a strongly typed language, it can do intelligent type inference, so there is usually no need to specify the type when assigning a variable.

Function definition:

def fn(x: Any) = println(x)
def fn1(x: Int) = { x * x }
def fn2 = 1 to 5

For function definitions, all parameters still need to be declared, but the return type can be automatically inferred.

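Note the difference between def a = 1 to 5 and val a = 1 to 5. The former re-evaluates the expression on every use, while the latter evaluates it once. For a Range this hardly matters, because a Range can be traversed repeatedly, but Joern traversals are Iterators: a def returns a new Iterator each time, while a val points to the same Iterator, so it can only be traversed once.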
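The difference is easiest to see with a plain Iterator. The following is a minimal sketch in ordinary Scala with no Joern involved; the names freshIt and sharedIt are made up for illustration:

```scala
// A def is re-evaluated on every use, so each use gets a fresh Iterator.
def freshIt = Iterator(1, 2, 3)
val defFirst = freshIt.toList
val defSecond = freshIt.toList

// A val is evaluated once; the single Iterator is exhausted by the first pass.
val sharedIt = Iterator(1, 2, 3)
val valFirst = sharedIt.toList
val valSecond = sharedIt.toList
```

This is why the queries in this article define source and sink with def rather than val.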

Anonymous functions are similar to JavaScript and can be written as arrow functions, for example (x: Int) => x * x. They are often used where a function needs to be passed in, such as a filter or map:

def it = 1 to 5
it.map(x => x * 2)
// A single function argument can be passed in braces instead of parentheses
it.map { x =>
  val y = x * 2
  println(y)
  y
}
// You can use `_` instead of a single parameter:
it.map(_ * 2)

There are some special syntactic sugars in Scala, such as infix syntactic sugars:

case class Pair(x: Int, y: Int) {
  def plus(other: Pair): Int = x + other.x + y + other.y
}

val p1 = Pair(1, 2)
val p2 = Pair(3, 4)
val sum = p1 plus p2  // Use infix syntax
// Equivalent to
val sum2 = p1.plus(p2)

Therefore the following two expressions are equivalent:

5 + 3
5.+(3)

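By convention, a method that is called for its side effects is declared and called with empty parentheses. If the method has no side effects (it does not modify other program state), the parentheses are omitted. Java methods without side effects can be called either way:

"hello".length()
// Generally written as
"hello".length

Methods that Scala itself declares without parentheses, such as toList, must be called without them: (1 to 5).toList.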
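As a small illustration of the convention, here is a sketch of a class that follows it. Counter and its members are hypothetical names chosen for this example:

```scala
final class Counter {
  private var count = 0

  // Mutates state, so it is declared and called with empty parentheses.
  def increment(): Unit = count += 1

  // Pure accessor, so it is declared and called without parentheses.
  def current: Int = count
}

val counter = new Counter
counter.increment()
counter.increment()
val seen = counter.current
```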

In terms of class definitions, ordinary classes and abstract classes are similar to those in Java, while interfaces are similar to those in Rust and are defined using trait:

trait MyTrait {
  def method(): Unit = println("This is a trait method.")
}

A case class is used to define immutable data types; the compiler automatically generates apply, unapply, toString, equals and other methods:

case class MyCaseClass(name: String, age: Int)
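To make the generated methods concrete, the sketch below exercises each of them on a hypothetical Person case class (the class and the values are illustrative only):

```scala
final case class Person(name: String, age: Int)

val alice = Person("Alice", 30)             // generated apply: no `new` needed
val sameAlice = Person("Alice", 30)
val structurallyEqual = alice == sameAlice  // generated equals compares fields
val Person(personName, personAge) = alice   // generated unapply enables pattern matching
val rendered = alice.toString               // generated toString
val older = alice.copy(age = 31)            // copy returns a new value; alice is unchanged
```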

An object, also called a singleton object, has exactly one instance:

object MyObject {
  def method(): Unit = println("yeah.")
}

That's all about the basic syntax. You can get familiar with more Scala syntax and usage details by reading the source code of Joern or other projects (projects using Scala include Apache Spark, Kafka, Flink, etc.). After all, language is just a tool, and you can make progress by reading and writing more.

sbt

Just as Java uses Maven for project management, Scala generally uses sbt. Interestingly, sbt's build file, build.sbt, is also written in Scala. An example is as follows:

ThisBuild / scalaVersion := "2.13.12"
ThisBuild / organization := "com.example"

lazy val hello = project
  .in(file("."))
  .settings(
    name := "Hello",
    libraryDependencies += "org.scala-lang" %% "toolkit-test" % "0.1.7" % Test
  )

sbt can create a new project through a template just like maven:

sbt new scala/scala-seed.g8

Compile and test run:

sbt compile
sbt test
sbt clean run

Joern's project is also managed by sbt, so understanding sbt will help us understand its architecture and read the source code later. For more sbt functions, refer to the official documentation https://www.scala-sbt.org/

Practical Sharing

Now that you have learned Joern's basic code property graph and Scala syntax, you can start writing your first code analysis rule! This section lists some common query examples so that you can become familiar with the style of Joern queries.

We use two different projects for testing, one is a Spring source application, and the other is an Android application.

joern
joern> importCode("vuln-spring")
joern> importCode("xiaomi.apk")

The same joern-cli can load multiple projects. You can view all projects with the workspace command and open the corresponding project with workspace.openProject.

At the code property graph level, queries are language-independent, although some steps only apply to specific languages (such as annotations), so the choice of language is not very important here. You can also use C/C++, Python or JavaScript projects for testing.

Web Vulnerability Mining

vuln-spring is a Spring Web application. For this type of project, the first thing we might want to find is routing:

cpg.annotation.where(_.name(".*Mapping")).method.fullName.l

This query should be relatively straightforward. One thing to note is the main difference between where and filter: the function passed to where returns an Iterator (a sub-traversal), while the function passed to filter must return a Boolean. The .l at the end is an alias of .toList, which converts the Iterator into a List for output.
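The distinction can be modelled in plain Scala. The sketch below is a toy stand-in, not Joern's implementation: the Method case class and the where helper are hypothetical, and the helper simply keeps an element when the sub-traversal yields at least one result.

```scala
// Hypothetical, simplified node type for illustration.
final case class Method(name: String, annotations: List[String])

// Toy version of `where`: keep elements whose sub-traversal is non-empty.
def where[A](it: Iterator[A])(traversal: A => Iterator[Any]): Iterator[A] =
  it.filter(a => traversal(a).hasNext)

def methods = Iterator(
  Method("index", List("GetMapping")),
  Method("helper", Nil),
  Method("login", List("PostMapping", "ResponseBody"))
)

// filter: the function returns a Boolean.
val viaFilter = methods.filter(_.annotations.exists(_.endsWith("Mapping"))).map(_.name).toList

// where: the function returns an Iterator.
val viaWhere = where(methods)(_.annotations.iterator.filter(_.endsWith("Mapping"))).map(_.name).toList
```

Both forms select the same methods; where is simply more convenient when the condition is itself a traversal.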

Then we need to find vulnerabilities, such as SQL injection. Here we take the springframework JdbcTemplate as an example and find its class methods.

Intuitively it is easy to write the following query:

cpg.typeDecl.fullNameExact("org.springframework.jdbc.core.JdbcTemplate").method.l

Here fullNameExact is a shortcut method, equivalent to typeDecl.filter(_.fullName.equals("xx")). The name("xxx") used in the previous query is implemented in a similar way. Taking name as an example, the commonly used shortcuts include:

nameExact("xxx"): equal to

name("xxx"): Regular expression matching

nameNot("xxx"): Not equal to, also a regular match

.filter and .where also have similar filterNot and whereNot shortcuts for negated matches.
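The three name shortcuts can be approximated in plain Scala as follows. This is only a toy model: Node and the helper functions are hypothetical, and the sketch assumes the regular expression must match the whole name, which is why the earlier query uses ".*Mapping" rather than "Mapping".

```scala
// Hypothetical, simplified node type for illustration.
final case class Node(name: String)

def nodes = Iterator(Node("GetMapping"), Node("PostMapping"), Node("Autowired"))

// Exact string comparison.
def nameExact(it: Iterator[Node], value: String): Iterator[Node] =
  it.filter(_.name == value)

// Regular expression match (assumed here to be a full match).
def name(it: Iterator[Node], regex: String): Iterator[Node] =
  it.filter(_.name.matches(regex))

// Negated regular expression match.
def nameNot(it: Iterator[Node], regex: String): Iterator[Node] =
  it.filterNot(_.name.matches(regex))

val exact = nameExact(nodes, "GetMapping").map(_.name).toList
val byRegex = name(nodes, ".*Mapping").map(_.name).toList
val negated = nameNot(nodes, ".*Mapping").map(_.name).toList
```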

However, the query above will actually return an empty list, because the JdbcTemplate class is not defined in the code, but is an external class in the Spring jar package. When we built the CPG database, we did not obtain dependencies, so naturally we could not know all the methods under this class.

This is a common mistake made by novices, but it is easy to avoid as long as you understand the principle. The correct way to find external methods is:

cpg.method.fullName("org\\.springframework\\.jdbc\\.core\\.JdbcTemplate\\..*")
// or
cpg.method.filter(_.fullName.startsWith("org.springframework.jdbc.core.JdbcTemplate."))
// Or, more usefully, find the call sites directly
cpg.call.filter(_.methodFullName.startsWith("org.springframework.jdbc.core.JdbcTemplate.")).code.l

Combining source and sink, we have our first SQL injection query rule:

def source = cpg.annotation.where(_.name(".*Mapping")).method.parameter
def sink = cpg.call.filter(_.methodFullName.startsWith("org.springframework.jdbc.core.JdbcTemplate.")).argument(1)
sink.reachableByFlows(source).p

Like .l, .p is a shortcut method; it outputs the Iterator as formatted text. The results above contain some false positives, mainly because all route parameters are used as sources, including objects such as Model or HttpSession that an attacker cannot control. The rule can be refined further to exclude them.

Note that the sink specifies argument(1), which is the first parameter; argument(0) generally represents the receiver object of the method call (this).

sink.reachableBy(source) returns the sources that meet the conditions; for our query, that is Iterator[MethodParameterIn]. reachableByFlows returns Iterator[Path], which .p prints as a table. If you don't like the default table output, you can use map instead of .p to convert the paths to a custom format:

def prettyPath = (p: Path) => p.elements.map(
	e => String.format("%-20s %s", e.label, e.code)
).mkString("\n")
sink.reachableByFlows(source).map(prettyPath).mkString("\n\n===\n")

mkString is equivalent to Java's String.join method: it combines a list into a single string. Output example:

METHOD_PARAMETER_IN  @RequestParam(name = "password", required = true) String password
IDENTIFIER           password
METHOD_PARAMETER_IN  String password
IDENTIFIER           password
CALL                 "SELECT * FROM users WHERE USERNAME=\"" + username + "\" AND PASSWORD=\"" + password + "\""
IDENTIFIER           query

In this way, we can structure the vulnerability scan output to form a report or connect to other subsequent verification logic.

Android Vulnerability Discovery

Unlike Java source projects, some JAR packages contain only bytecode. Joern supports these as well, using Soot to load the bytecode and convert it to an IR from which the code property graph is built. Since APKs can be loaded through Soot as well, does Joern support Android vulnerability scanning? The answer is yes.

For a small APK, you can load it directly with jimple2cpg or importCode using the default parameters. For large applications, it is best to add the --android parameter to specify android.jar (this is needed because of a bug caused by Soot's API). Large applications also need a correspondingly larger JVM heap:

jimple2cpg -J-Xmx30g --android $ANDROID_HOME/platforms/android-34/android.jar large.apk -o large.cpg

If you want to practice, you can use some open source vulnerability target applications for testing, such as:

oversecured/ovaa (ovaa-debug.apk is available for download)

DIVA Android

awesome-vulnerable-apps

Here we take a real vulnerability as an example: a command injection vulnerability in a Xiaomi system application, disclosed last year.

For more information about the vulnerability, please refer to Oversecured's blog post 20 Security Issues Found in Xiaomi Devices. This section reproduces a command injection vulnerability in the System Tracing app. The corresponding APK can be downloaded from apkmirror, see System Tracing 1.0.

The principle of this vulnerability is that Xiaomi dynamically registers an AppReceiver, and in its onReceive method, it splices the received parameters into shell commands that it executes, resulting in command injection.

What makes this vulnerability difficult to find automatically? First of all, among the four major components of Android, BroadcastReceiver is special. It is the only component that can be used without being declared in AndroidManifest.xml, because it can be registered dynamically at runtime.

So our first task is to find every BroadcastReceiver in the code. This requirement is relatively simple:

cpg.typeDecl.fullNameExact("android.content.BroadcastReceiver").derivedTypeDeclTransitive.fullName.l

derivedTypeDecl and its variant derivedTypeDeclTransitive are helper methods for finding subclasses. If you were to write them yourself, you could find interfaces and parent classes by recursively resolving the inheritsFromTypeFullName property.
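To show what such a recursive lookup involves, here is a minimal sketch in plain Scala. It does not use Joern's API: TypeDecl is a hypothetical two-field model of a type declaration, and the demo.* class names are invented for the example.

```scala
import scala.annotation.tailrec

// Hypothetical, simplified model of a type declaration.
final case class TypeDecl(fullName: String, inheritsFromTypeFullName: List[String])

// Collect all direct and indirect subtypes of `base`, one inheritance level per iteration.
def derivedTransitive(base: String, decls: List[TypeDecl]): Set[String] = {
  @tailrec
  def loop(frontier: Set[String], found: Set[String]): Set[String] = {
    val next = decls.collect {
      case d if d.inheritsFromTypeFullName.exists(frontier.contains) && !found.contains(d.fullName) =>
        d.fullName
    }.toSet
    if (next.isEmpty) found else loop(next, found ++ next)
  }
  loop(Set(base), Set.empty)
}

val decls = List(
  TypeDecl("demo.AppReceiver", List("android.content.BroadcastReceiver")),
  TypeDecl("demo.BaseReceiver", List("android.content.BroadcastReceiver")),
  TypeDecl("demo.DumpReceiver", List("demo.BaseReceiver")),
  TypeDecl("demo.Unrelated", List("java.lang.Object"))
)

val receivers = derivedTransitive("android.content.BroadcastReceiver", decls)
```

The result contains demo.DumpReceiver even though it only inherits from BroadcastReceiver indirectly, which is the behavior the Transitive variant provides.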

This query only finds all Receivers, but not all of them are necessarily useful, so we further search for the actual registered Receiver object through Context.registerReceiver and locate the actual type.

We can use Joern's data flow analysis capabilities to find the data flow from the BroadcastReceiver constructor to the registerReceiver call, as shown below:

val baseCls = "android.content.BroadcastReceiver"
val receiverCls = cpg.typeDecl.fullNameExact(baseCls).derivedTypeDeclTransitive.fullName.l
def source = cpg.call.nameExact(Operators.alloc).filter(n => receiverCls.contains(n.typeFullName))
def sink = cpg.call.nameExact("registerReceiver").argument(1)

// Find data flows
sink.reachableBy(source)

This query finds calls like:

BroadcastReceiver r = new FooReceiver();
this.registerReceiver(r, filter);

However, it does not find the location of our vulnerability. The decompiled code is:

public class ApplicationImpl extends Application {  
    private AppReceiver mDumpReceiver = new AppReceiver();  
  
    @Override // android.app.Application  
    public void onCreate() {  
        super.onCreate();  
        IntentFilter intentFilter = new IntentFilter();  
        intentFilter.addAction("com.android.traceur.DumpReceiver");  
        registerReceiver(this.mDumpReceiver, intentFilter);  
    }
}

Here the data is actually passed through a field. Joern's data flow analysis does not connect the two ends, because the source and the sink are in two unrelated methods, <init> and onCreate() respectively, and a static analysis tool does not know when a lifecycle method such as onCreate will be called.

One way is to modify Joern's front-end code to simulate the runtime's lifecycle method calls, adding custom calls to these methods when building the code property graph so that the control flow graph (CFG) is complete.

Of course, this approach requires modifying the Joern source code. Alternatively, we can take a shortcut: the break between source and sink is at the class field, so we can treat field accesses as an intermediate node and run a second data flow analysis.

def fieldAccess = cpg.fieldAccess.filter(fa => receiverCls.contains(fa.typeFullName) || fa.typeFullName.equals(baseCls))
// Field read nodes whose type is BroadcastReceiver or one of its subclasses
def fieldRead = fieldAccess.filter(fa => fa.argumentIndex == 2)
// Field reads that actually reach registerReceiver
val reads = sink.reachableBy(fieldRead).toList
// Collect the class name and field name of each read; the result is List[List[String]]
val fields = reads.map(r => List(r.typeDecl.fullName.head, r.fieldIdentifier.canonicalName.head))
// Match all writes to the fields above
def fieldWrite = fieldAccess
  .filter(fa => fa.argumentIndex == 1)
  .filter{fa =>
    val li = fa.map( r =>
        List(r.typeDecl.fullName.head, r.fieldIdentifier.canonicalName.head)
    )
    li.exists(fields.contains)
  }

// Data flow query
fieldWrite.reachableByFlows(source).p

A little explanation:

fieldAccess.argumentIndex indicates the position of the field access in its parent expression: 1 is a write, as in r0.field = value, and 2 is a read, as in r1 = r0.field. This restriction is not required, but it can slightly improve speed when the code base is large.

fieldAccess contains the accesses to all fields whose type is BroadcastReceiver or one of its subclasses. Note that receiverCls only contains the subclasses, not the parent class BroadcastReceiver itself, so the parent has to be added to match polymorphic declarations.

fields collects all the field names as lists of class name and field name. They could also be stored as strings.

fieldWrite then matches the writes to the corresponding fields. List.exists is used to check whether any element of li is contained in fields.

Finally, reachableByFlows finds all paths from source to the selected fields. The returned results are as follows:

joern> fieldWrite.reachableByFlows(source).p
val res131: List[String] = List(
  """
┌──────────────────┬──────────────────────────────────────────────────┬────┬────────┬────────────────────────────────┐
│nodeType          │tracked                                           │line│method  │file                            │
├──────────────────┼──────────────────────────────────────────────────┼────┼────────┼────────────────────────────────┤
│Call              │new com.android.traceur.MainFragment$10           │228 │onCreate│com.android.traceur.MainFragment│
│Identifier        │$r17 = new com.android.traceur.MainFragment$10    │228 │onCreate│com.android.traceur.MainFragment│
│Identifier        │$r17.com.android.traceur.MainFragment$10(r0)      │228 │onCreate│com.android.traceur.MainFragment│
│MethodParameterIn │<init>(this, com.android.traceur.MainFragment $r1)│227 │<init>  │com.android.traceur.MainFragment│
│MethodParameterOut│RET                                               │227 │<init>  │com.android.traceur.MainFragment│
│Identifier        │$r17.com.android.traceur.MainFragment$10(r0)      │228 │onCreate│com.android.traceur.MainFragment│
│Identifier        │r0.mRefreshReceiver = $r17                        │228 │onCreate│com.android.traceur.MainFragment│
│Call              │r0.mRefreshReceiver = $r17                        │228 │onCreate│com.android.traceur.MainFragment│
└──────────────────┴──────────────────────────────────────────────────┴────┴────────┴────────────────────────────────┘""",
  """
┌──────────────────┬─────────────────────────────────────────┬────┬──────┬───────────────────────────────────┐
│nodeType          │tracked                                  │line│method│file                               │
├──────────────────┼─────────────────────────────────────────┼────┼──────┼───────────────────────────────────┤
│Call              │new com.android.traceur.AppReceiver      │8   │<init>│com.android.traceur.ApplicationImpl│
│Identifier        │$r1 = new com.android.traceur.AppReceiver│8   │<init>│com.android.traceur.ApplicationImpl│
│Identifier        │$r1.com.android.traceur.AppReceiver()    │8   │<init>│com.android.traceur.ApplicationImpl│
│MethodParameterIn │<init>(this)                             │38  │<init>│com.android.traceur.AppReceiver    │
│MethodParameterOut│RET                                      │38  │<init>│com.android.traceur.AppReceiver    │
│Identifier        │$r1.com.android.traceur.AppReceiver()    │8   │<init>│com.android.traceur.ApplicationImpl│
│Identifier        │r0.mDumpReceiver = $r1                   │8   │<init>│com.android.traceur.ApplicationImpl│
│Call              │r0.mDumpReceiver = $r1                   │8   │<init>│com.android.traceur.ApplicationImpl│
└──────────────────┴─────────────────────────────────────────┴────┴──────┴───────────────────────────────────┘"""
)

Two matches were found. The second is the vulnerable class we are looking for, and the first corresponds to the following code:

public class MainFragment extends PreferenceFragment {
	// ...
	private BroadcastReceiver mRefreshReceiver;
	@Override // androidx.preference.PreferenceFragment, android.app.Fragment
    public void onCreate(Bundle bundle) {
	    // ...
	    this.mRefreshReceiver = new BroadcastReceiver() { // from class: com.android.traceur.MainFragment.10
            @Override // android.content.BroadcastReceiver
            public void onReceive(Context context, Intent intent) {
                MainFragment.this.refreshUi();
            }
        };
    }
	@Override // androidx.preference.PreferenceFragment, android.app.Fragment
    public void onStart() {
        super.onStart();
        // ...
        getActivity().registerReceiver(this.mRefreshReceiver, new IntentFilter("com.android.traceur.REFRESH_TAGS"), 4);
        Receiver.updateTracing(getContext());
    }
}

We can see that our query statement also finds anonymous class assignments based on polymorphism, where the dynamically registered class name is com.android.traceur.MainFragment$10. Although the rule is a bit ugly, it is basically usable.

The next step is to keep adding rules: merge the two data flows, extract the class name of the constructor in the source, locate its onReceive method, and use the parameters of that method as the source for the next stage of vulnerability analysis, so as to find calls to dangerous functions such as Runtime.exec.

Advanced Operations

From the previous sections, you should have a basic understanding of Joern's usage and common queries, and also know that the tool has some limitations. Fortunately, Joern is also highly extensible. This section introduces some useful extension operations.

Data Flow Semantics

When we introduced Web vulnerability mining earlier, our SQL injection query actually had many false positives. The most serious one was that although the sink was explicitly the first parameter of the JdbcTemplate method, flows into the other parameters were also reported as valid paths:

def sink = cpg.call.filter(_.methodFullName.startsWith("org.springframework.jdbc.core.JdbcTemplate.")).argument(1)

This is because, for external methods whose code is not available, Joern assumes that taint propagates to all parameters and to the return value. This keeps the analysis sound, accepting a higher false positive rate in exchange for a lower false negative rate. In the words of the official documentation:

However, to remain sound, Joern will treat external methods with no semantic definitions as able to propagate taint from all arguments, to all arguments including the return value.

This undoubtedly increases the cost of our analysis of false positives. So is there any way to optimize data flow analysis, that is, to configure semantics for specified methods? The answer is yes. We can specify additional FlowSemantic lists, as shown below:

import io.joern.dataflowengineoss.layers.dataflows.{OssDataFlow, OssDataFlowOptions}
import io.shiftleft.semanticcpg.layers.LayerCreatorContext
import io.joern.dataflowengineoss.semanticsloader.FlowSemantic

val extraFlows = List(
    FlowSemantic.from(
        "^path.*<module>\\.sanitizer$", // Method full name
        List((1, 1)), // Flow mappings
        regex = true  // Interpret the method full name as a regex string
    )
)

val context = new LayerCreatorContext(cpg)
val options = new OssDataFlowOptions(extraFlows = extraFlows)
new OssDataFlow(options).run(context)

Flow mappings are expressed as integer tuples:

(1, -1) indicates that data flowing into the first parameter propagates to the return value

(1, 2) indicates that data flowing into the first parameter propagates to the second parameter

(1, 0) indicates that data flowing into the first parameter propagates to the instance object (this)

(1, 1) indicates that data flowing into the first parameter propagates to itself; this controls whether the data flow is interrupted, that is, it is used to specify a sanitizer

For example, the following code:

x = source()
foo(x) // "foo" 1->1 means that the data flow continues to propagate downward, otherwise it will be interrupted
sink(x)
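The effect of these tuples on a single external call can be pictured with a toy model in plain Scala. This is not Joern's engine: propagate is a hypothetical helper that, given the tainted positions going into a call and the call's flow mappings, returns the positions that are tainted afterwards.

```scala
// Toy model of flow mappings for one external call; not Joern's actual engine.
// Positions: 0 = receiver (this), 1..n = arguments, -1 = return value.
def propagate(tainted: Set[Int], mappings: List[(Int, Int)]): Set[Int] =
  mappings.collect { case (from, to) if tainted.contains(from) => to }.toSet

// With 1 -> 1 declared, taint on the first argument survives the call.
val keepsFlow = propagate(Set(1), List((1, 1)))

// With no mapping for the first argument, the flow is interrupted.
val cutsFlow = propagate(Set(1), Nil)

// One source position can feed several targets.
val toReturnAndSecond = propagate(Set(1), List((1, -1), (1, 2)))
```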

In addition to defining semantics in Scala code with FlowSemantic, Joern also supports loading semantics files through semanticsloader/Parser. Each line defines the semantics for one method: the method name followed by flow mappings separated by spaces, for example:

"foo" 1->-1 2->3

Each parameter must be positional, but some languages allow named parameters:

"foo" 1 "param1"->2 3 -> 2 "param2"

A common default assumption in data flow analysis is that parameters do not taint each other, but each parameter taints the return value. Because this pattern is written so often, Joern provides a shorthand for it, PASSTHROUGH:

"foo" PASSTHROUGH 0 -> 0

The Scala code is:

FlowSemantic("foo", List(PassThroughMapping))

For a call foo(1, 2), this is semantically equivalent to 1 -> 1, 2 -> 2, 1 -> -1, 2 -> -1; that is, every parameter taints itself and the return value of the function. Note that 0 -> 0 is not included, so the this object does not taint itself unless it is added explicitly.

We can view the current default FlowSemantic list with context.semantics.elements. Note that the val context = new LayerCreatorContext(cpg) used earlier when adding semantics overwrites the current context object.
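The expansion described above can be written down as a short sketch. expandPassThrough is a hypothetical helper for illustration only, not part of Joern; it lists the mappings that PASSTHROUGH stands for given the number of positional arguments.

```scala
// Toy expansion of PASSTHROUGH for a call with `argCount` positional arguments.
// Each argument maps to itself and to the return value (-1); the receiver (0) is not included.
def expandPassThrough(argCount: Int): List[(Int, Int)] =
  (1 to argCount).toList.flatMap(i => List((i, i), (i, -1)))

val fooMappings = expandPassThrough(2)
```

For two arguments this yields the same four mappings listed above, and no 0 -> 0 entry.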

From a practical point of view, there are many ways to create semantics, and you can choose the appropriate method:

import io.joern.dataflowengineoss.semanticsloader.{FlowSemantic, PassThroughMapping, Parser}
// Call the constructor
val s = FlowSemantic("org\\.springframework.*", List(PassThroughMapping), true)
// Call the static method
val s = FlowSemantic.from("org\\.springframework.*", List((1, -1)), true)
// Use the Parser
val parser = Parser()
// Load one or more semantics from a string and return a list
val extraFlows = parser.parse(""" "foo" PASSTHROUGH 0 -> 0 """)
// Load from a file
val extraFlows = parser.parseFile("semantics.txt")

For more details, see the official documentation on Dataflow Semantics or view the source code.

Control Flow Enhancements

Different from data flow analysis, sometimes we only focus on control flow. For example, the following statement can recursively find calls:

cpg.method.name("exec").repeat(_.caller)(_.emit.dedup).fullName.sorted

In our previous example of Android vulnerability mining, data flow analysis was unable to track some callback functions in the Android runtime, essentially because the control flow was not associated. Since Joern itself is based on Scala queries and exposes all data structures to users, it can actually add new control flow rules just like custom data flow semantics.

The pseudo code is as follows:

val methods = cpg.method
val node1 = methods.next
val node2 = methods.next
node1.addEdge(EdgeTypes.AST, node2)

You can view the supported edge types based on joern-cli's auto-completion:

joern> EdgeTypes.
ALIAS_OF             AST                  CALL                 CDG                  CONTAINS             IMPORTS              PARAMETER_LINK       REACHING_DEF         SOURCE_FILE
ALL                  BINDS                CAPTURE              CFG                  DOMINATE             INHERITS_FROM        POINTS_TO            RECEIVER             TAGGED_BY
ARGUMENT             BINDS_TO             CAPTURED_BY          CONDITION            EVAL_TYPE            IS_CALL_FOR_IMPORT   POST_DOMINATE        REF

This is the usage in older versions. In newer versions, use the diffGraph object, which is of type DiffGraphBuilder. diffGraph.addEdge can be used to add control flow edges directly.

Consider one of the most common multi-threaded calling scenarios:

class ThreadDemo {
  public static void main(String args[]) throws Exception {
    final String cmd = String.format("sh -c \"%s\"", args[0]);
    Thread th = new Thread(new Runnable() {
      @Override
      public void run() {
        System.out.println("Running in a new thread");
        try {
          System.out.println("return: " + Runtime.getRuntime().exec(cmd));
        } catch(Exception ignore) {}
      }
    });
    th.start();
    Thread.sleep(1000);
  }
}

As mentioned earlier, plain static analysis cannot identify the data flow from the input to the sink here:

def source = cpg.method.nameExact("main").parameter
def sink = cpg.call.nameExact("exec").argument
sink.reachableBy(source)

For this use case, we need to tell Joern that Thread.start will call Runnable.run, so we can add an edge directly:

val call = cpg.call("start").head
val target = cpg.method("run").head
diffGraph.addEdge(call, target, EdgeTypes.CALL)
run.commit

Then run the query again, this time with sink.reachableByFlows, to find the correct call chain:

joern> sink.reachableByFlows(source).p
val res4: List[String] = List(
  """                                                                                                                  
┌──────────────────┬────────────────────────────────────────────────────────────────────────────┬────┬──────┬────┐     
│nodeType          │tracked                                                                     │line│method│file│   
├──────────────────┼────────────────────────────────────────────────────────────────────────────┼────┼──────┼────┤
│MethodParameterIn │main(String[] args)                                                         │2   │main  │    │
│Call              │<operator>.arrayInitializer                                                 │3   │main  │    │
│Call              │<operator>.arrayInitializer                                                 │3   │main  │    │
│Call              │String.format("sh -c \"%s\"", args[0])                                      │3   │main  │    │
│Identifier        │String cmd = String.format("sh -c \"%s\"", args[0])                         │3   │main  │    │
│Identifier        │new Runnable() { @Override public void run() { System.out.println("Running  │4   │main  │    │
│                  │in a new thread"); try { System.out.println("return: " +                    │    │      │    │
│                  │Runtime.getRuntime().exec(cmd)); } catch (Exception ignore) { } } }         │    │      │    │
│MethodParameterIn │<init>(this, cmd)                                                           │4   │<init>│    │
│Identifier        │this.cmd = cmd                                                              │4   │<init>│    │
│Call              │this.cmd = cmd                                                              │4   │<init>│    │
│MethodParameterOut│RET                                                                         │N/A │<init>│    │
│Identifier        │new Runnable() { @Override public void run() { System.out.println("Running  │4   │main  │    │
│                  │in a new thread"); try { System.out.println("return: " +                    │    │      │    │
│                  │Runtime.getRuntime().exec(cmd)); } catch (Exception ignore) { } } }         │    │      │    │
│Block             │$obj0                                                                       │4   │main  │    │
│Identifier        │new Thread(new Runnable() { @Override public void run() {                   │4   │main  │    │
│                  │System.out.println("Running in a new thread"); try {                        │    │      │    │
│                  │System.out.println("return: " + Runtime.getRuntime().exec(cmd)); } catch    │    │      │    │
│                  │(Exception ignore) { } } })                                                 │    │      │    │
│Identifier        │th.start()                                                                  │13  │main  │    │
│MethodParameterIn │run(this)                                                                   │5   │run   │    │
│Call              │Runtime.getRuntime().exec(cmd)                                              │9   │run   │    │
└──────────────────┴────────────────────────────────────────────────────────────────────────────┴────┴──────┴────┘""",

Note that diffGraph is only a temporary diff graph: the changes must be committed (run.commit) before they are applied to the cpg. This is just a simple example, and the code does not yet associate each Thread with its specific Runnable class, which is left to the reader as an exercise.
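
The "stage first, commit later" behaviour can be illustrated with a small model in plain Scala. This is not Joern's DiffGraphBuilder: the names DiffGraphSketch, Graph, DiffBuilder and applyTo are hypothetical, and the sketch only shows that staged edges are invisible to queries until the diff is applied.

```scala
object DiffGraphSketch {
  final case class Edge(src: String, dst: String, label: String)

  // Immutable toy graph; applying a diff produces a new graph
  final case class Graph(edges: Set[Edge]) {
    // targets of CALL edges leaving `src`
    def callees(src: String): Set[String] =
      edges.collect { case Edge(`src`, dst, "CALL") => dst }
  }

  // Collects changes without touching any graph
  final class DiffBuilder {
    private val staged = scala.collection.mutable.ListBuffer.empty[Edge]

    def addEdge(src: String, dst: String, label: String): this.type = {
      staged += Edge(src, dst, label)
      this
    }

    // The "commit" step: only now do the staged edges become visible
    def applyTo(graph: Graph): Graph = Graph(graph.edges ++ staged)
  }
}
```

In this model, a query on the original graph still sees no callee for start after addEdge("start", "run", "CALL") has been staged; only the graph returned by applyTo contains the new CALL edge.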

If you need to perform complex operations on the code property graph, such as implementing data flow tracking for reflection, you can use a custom Pass, as described in the next section.

CpgPass

We mentioned diffGraph earlier. I actually discovered it by searching the code, where I found a similar operation in JumpPass.scala of ghidra2cpg:

class JumpPass(cpg: Cpg) extends ForkJoinParallelCpgPass[Method](cpg) {

  override def generateParts(): Array[Method] =
    cpg.method.toArray
  override def runOnPart(diffGraph: DiffGraphBuilder, method: Method): Unit = {
    method.ast
      .filter(_.isInstanceOf[Call])
      .map(_.asInstanceOf[Call])
      .nameExact("<operator>.goto")
      .where(_.argument.order(1).isLiteral)
      .foreach { sourceCall =>
        sourceCall.argument.order(1).code.l.headOption.flatMap(parseAddress) match {
          case Some(destinationAddress) =>
            method.ast.filter(_.isInstanceOf[Call]).lineNumber(destinationAddress).foreach { destination =>
              diffGraph.addEdge(sourceCall, destination, EdgeTypes.CFG)
            }
          case _ => // Ignore for now
          /* TODO: Ask ghidra to resolve addresses of JMPs */
        }
      }
  }

  private def parseAddress(address: String): Option[Int] = {
    Try(Integer.parseInt(address.replaceFirst("0x", ""), 16)).toOption
  }
}

A Pass is a unit in Joern used to post-process the CPG graph database, for example to generate the CFG (Control Flow Graph) and the DDG (Data Dependency Graph).

For the control flow enhancement described in the previous section, the author wrote a simple DemoPass:

import io.shiftleft.codepropertygraph.generated.{Cpg, EdgeTypes, PropertyNames}
import io.shiftleft.codepropertygraph.generated.nodes.{Call, Method, StoredNode, Type, TypeDecl}
import io.shiftleft.passes.CpgPass


class DemoPass(cpg: Cpg) extends CpgPass(cpg) {

  override def run(diffGraph: DiffGraphBuilder): Unit = {
    val call = cpg.call("start").head
    val target = cpg.method("run").head
    val targetNode = methodFullNameToNode(target.fullName).get
    diffGraph.addEdge(call, targetNode, EdgeTypes.CALL)
    println(s"Add Edge: $call -> $targetNode")
  }

  private def nodesWithFullName(x: String): Iterator[StoredNode] =
    cpg.graph.nodesWithProperty(PropertyNames.FULL_NAME, x).cast[StoredNode]

  private def methodFullNameToNode(x: String): Option[Method] =
    nodesWithFullName(x).collectFirst { case x: Method => x }
}

Load it via:

new DemoPass(cpg).createAndApply()
run.commit

Closely related to passes are Overlays. When Joern first loads a CPG or generates a database using joern-parse, it executes a set of passes by default, running them in sequence, layer by layer. We can view the currently applied overlays through project.appliedOverlays:

joern> project.availableOverlays
joern> project.appliedOverlays
res0: Seq[String] = IndexedSeq("base", "controlflow", "typerel", "callgraph", "dataflowOss")

Take dataflowOss, which is responsible for data flow analysis, as an example. Its code is implemented in OssDataFlow.scala; it first executes ReachingDefPass, and it also uses extraFlows to implement custom data flow semantics.

We can refer to its code to implement our own Overlay to enhance the functionality of Joern, which is also a manifestation of its strong extensibility.

Other

When writing Joern query rules, it is essential to know the API it supports. In joern-cli, we can use the help command to list the steps available on cpg and on each subsequent step, along with a brief description:

cpg.help
cpg.method.help
cpg.typeDecl.help

You can also refer to the following documents:

Node Type Steps - Reference Card - Official documentation

queries.joern.io - Example query rules for joern-query

But neither the help command of joern-cli nor the official documentation covers all the steps, such as annotation, which queries annotations, and derivedTypeDecl, which queries all subclasses, both of which we used above. So how did the author learn about these APIs? One way is to read other people's questions and replies in the community, but this is obviously too inefficient. Another way is to find this information by searching the source code.

There are many unit tests in Joern's main repository, joernio/joern. For example, to look at examples of recursive search, we can search for .repeat( and find specific options such as .emit, maxDepth, etc.

In addition, Joern's graph database is based on joernio/flatgraph, which replaced the former overflowdb only recently, in version 4.0. The implementation of each step can be found in this repository. From it, we can see that the recursive search .repeat defaults to depth-first and can be switched to breadth-first with .bfs, and so on.
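
The practical difference between the two strategies is the order in which nodes are visited. The following plain Scala sketch shows this on a tiny, acyclic, made-up caller relation. The names RepeatOrderSketch, callers, dfs and bfs are hypothetical and the code does not reproduce flatgraph's implementation.

```scala
object RepeatOrderSketch {
  // Hypothetical acyclic relation: method -> its direct callers
  val callers: Map[String, List[String]] = Map(
    "exec" -> List("a", "b"),
    "a"    -> List("c"),
    "b"    -> List("d")
  )

  /** Depth-first: follow one branch to the end before taking the next. */
  def dfs(start: String): List[String] =
    callers.getOrElse(start, Nil).flatMap(n => n :: dfs(n))

  /** Breadth-first: visit everything at depth 1, then depth 2, and so on. */
  def bfs(start: String): List[String] = {
    @annotation.tailrec
    def loop(frontier: List[String], acc: List[String]): List[String] =
      if (frontier.isEmpty) acc
      else {
        val next = frontier.flatMap(n => callers.getOrElse(n, Nil))
        loop(next, acc ++ next)
      }
    loop(List(start), Nil)
  }
}
```

Starting from exec, the depth-first version visits a, c, b, d, while the breadth-first version visits a, b, c, d. Both reach the same set of nodes, so the choice mainly matters when the traversal is cut off early, for example by a depth limit or by taking only the first few results.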

From the code, we can find some practical steps, such as collectAll, which returns only the nodes of a specified type. In Scala, collect is equivalent to a combination of filter and map:

.collectAll[Call]
// Equivalent to
.collect { case x: Call => x }
// Equivalent to
.filter(_.isInstanceOf[Call]).map(_.asInstanceOf[Call])
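
The equivalence between collect and filter plus map is ordinary Scala collection behaviour and can be checked without Joern. In the sketch below, Node, Call and Identifier are simplified stand-ins invented for illustration, not Joern's generated node classes.

```scala
object CollectSketch {
  // Simplified stand-ins for CPG node types
  sealed trait Node
  final case class Call(name: String) extends Node
  final case class Identifier(name: String) extends Node

  val nodes: List[Node] = List(Call("exec"), Identifier("cmd"), Call("start"))

  // collect: keep the elements the partial function matches, already typed as Call
  val viaCollect: List[Call] = nodes.collect { case c: Call => c }

  // filter + map: the same result, but with an explicit type test and cast
  val viaFilterMap: List[Call] =
    nodes.filter(_.isInstanceOf[Call]).map(_.asInstanceOf[Call])
}
```

Both values contain the exec and start calls in their original order. The collect form is usually preferred because it avoids the unchecked cast.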

Custom Step

The steps we used before can also be customized. For example, if we want to add a fooStep step to method traversals, we can do it as follows:

implicit class MyMethodTraversals(method: Traversal[Method]) {
  def fooStep = method.fullName(".*org.example.*").isPublic
}

cpg.method.fooStep

This uses Scala's implicit classes feature, which adds new methods to existing types through implicit conversions. Implicit classes are often used to enhance the functionality of existing types without directly modifying the definitions of those types.
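
The same mechanism can be tried outside Joern on an ordinary collection. In this sketch, Method is a simplified case class invented for illustration and MethodSteps is a hypothetical name. Only the implicit class pattern itself mirrors the fooStep example above.

```scala
object CustomStepSketch {
  // Simplified stand-in for a CPG method node
  final case class Method(fullName: String, isPublic: Boolean)

  // Adds fooStep to any List[Method] that is in scope of this implicit class
  implicit class MethodSteps(methods: List[Method]) {
    def fooStep: List[Method] =
      methods.filter(m => m.isPublic && m.fullName.matches(".*org.example.*"))
  }

  val methods: List[Method] = List(
    Method("org.example.Hello.main", isPublic = true),
    Method("org.example.Hello.helper", isPublic = false),
    Method("java.lang.Runtime.exec", isPublic = true)
  )

  // Reads like a built-in step although List has no fooStep method
  val hits: List[Method] = methods.fooStep
}
```

Here hits contains only org.example.Hello.main, the single method that is both public and inside the org.example package.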

Visualization

Sometimes it is hard to reason about certain conditions when writing rules, and printing all the nodes is not intuitive. Joern provides a family of plotting methods that render the graph of the specified nodes:

joern> cpg.method("main").plotDot
plotDotAst     plotDotCdg     plotDotCfg     plotDotCpg14   plotDotDdg     plotDotPdg

The definitions of various graphs are as follows:

AST: Abstract Syntax Tree.

CDG: Control Dependence Graph, which mainly includes the dependency relationships of control structures such as if/else.

CFG: Control Flow Graph, all possible paths of program execution.

DDG: Data Dependency Graph, which captures data dependency relationships.

PDG: Program Dependence Graph, which includes control dependency and data dependency.

plotDotXXX will directly open the default image preview tool. If there is no graphical interface, you can use dotXXX to get the graph in dot's digraph text format:

cpg.method("main").dotAst.head #> "/tmp/main.dot"

Here the custom operator #> is used to redirect the string to a file, which is a handy feature. There are other uses that are worth exploring in depth.
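
An operator like this can be built with the implicit class technique from the Custom Step section. The following is a minimal sketch of how such an operator might be defined in plain Scala. It is not Joern's actual implementation, and the names RedirectSketch and StringRedirect are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object RedirectSketch {
  // Adds a shell-like redirect operator to String
  implicit class StringRedirect(content: String) {
    /** Write the string to `path`, replacing any existing file content. */
    def #>(path: String): Unit = {
      Files.write(Paths.get(path), content.getBytes(StandardCharsets.UTF_8))
      ()
    }
  }
}
```

With import RedirectSketch._ in scope, an expression such as "digraph main {}" #> "/tmp/main.dot" writes the text to that file.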

Summary

After this walkthrough, you should have a more concrete understanding of Joern. Let's start with its disadvantages. First, the documentation is incomplete: for many interfaces you need to read the source code or look for answers in the community. Another widely criticized point is its data flow engine, which produces many false positives and false negatives. In the code above, we can see that it is named ossdataflow. It implements only a fairly simple data flow analysis, and there is still a gap compared with commercial software such as CodeQL.

However, ShiftLeft (the company behind Joern) also has a commercial code scanning tool called ocular, which shares most of its functionality with Joern, except that its data flow engine is optimized and models for many interfaces have been added. If you are interested, you can refer to joern-vs-ocular. This suggests that the gap is not due to a lack of ability. Rather, commercial interests mean they are unlikely to invest heavily in the open source engine.

Although Joern has many shortcomings, it also has its advantages. First, its architecture is elegant, and support for new languages can be added through separate front ends; in addition to source code, it also supports bytecode and binary programs (assembly). Second, its query language is based on Scala, a mature programming language, which makes it flexible enough to express all kinds of complex queries. Finally, as open source software, it offers "autonomy and controllability": as long as you have some coding ability, almost any need can be met, which makes up for many of the shortcomings mentioned above.