Implementation of Obfuscation Detection by Text Classification
LianSecurity Blog, 2023-06-25 10:00:15

Overview

Obfuscation in Android APKs

Obfuscation is a technique used in software development to make code more difficult to understand, analyze, and reverse engineer. It involves transforming code into a form that is complex and convoluted while preserving its functionality. The primary goal of obfuscation is to hinder unauthorized access to the code and protect intellectual property.

In Android APKs, several obfuscation techniques are commonly employed to protect the code and make it harder to understand or reverse engineer. Code obfuscation is one such technique that involves transforming the source code into an equivalent but more complex form, making it difficult to decipher and analyze. String encryption is another commonly used technique where sensitive strings, such as API keys or URLs, are encrypted to prevent easy extraction. Additionally, control flow obfuscation is employed to disrupt the logical flow of the code, making it challenging to follow the program's execution path and understand its functionality.

Obfuscation Implications for Android Security

The use of obfuscation techniques increases the difficulty of security research analysis and renders some signature-based detection methods ineffective. String encryption makes it challenging to trace critical information. These measures make malware more difficult to identify and track.

A Text Classification-Based Approach for Accurate Obfuscation Detection in Malware Analysis By LianSecurity

For these reasons, our company, LianSecurity, developed a product called Incinerator, which aims to provide efficient, precise, and automated reverse engineering services. Building on extensive malware analysis and research into previous obfuscation detection techniques, we implemented a text classification-based obfuscation detection method in Incinerator, our Android APK reverse engineering product. In our testing, the method achieves an accuracy of 98%, which exceeded our expectations. The following sections describe our approach in detail.

Background

One of the state-of-the-art systems for detecting obfuscation in Android applications is "AndrODet" [1]. Its authors built an obfuscation detection system that extracts different features for each obfuscation type and then trains an online machine learning model. The obfuscation types it targets and the F-measure AndrODet achieves for each are shown below:

  • Identifier Renaming: 0.92

  • String Encryption: 0.79

  • Control Flow Obfuscation: 0.67

The Limitations of AndrODet in Android Context

In the context of Android, AndrODet faces certain limitations that affect its accuracy and effectiveness as a static code analysis tool. This section highlights two major issues:

APK-based Calculation and Weakened Features

AndrODet calculates its metrics over the entire APK, including both the core business code and the bundled libraries. In the Android ecosystem, dependency libraries can be very large, sometimes larger than the core business code itself, and most of the time they do not need to be obfuscated. When metrics are computed over the entire APK, these large unobfuscated libraries dilute the signal from the obfuscated code, which reduces the accuracy of AndrODet's judgments.

Inability to Handle Unicode Encoding

AndrODet's method of calculating distances is limited to ASCII encoding. However, the use of Unicode encoding in obfuscation techniques has become increasingly prevalent. As a result, AndrODet cannot handle or analyze code that has been obfuscated using Unicode encoding. This limitation hampers the tool's ability to accurately detect and assess obfuscated code in real-world production scenarios.

These limitations reduce AndrODet's accuracy in real-world production scenarios. Understanding them and their impact on production environments is essential for researchers and practitioners seeking to improve code analysis tools for Android app security.

An Approach We've Come Up With

Our approach focuses on detecting Identifier Renaming, the obfuscation technique most commonly used by malware; the method can be extended to cover String Encryption as well. In our research, we observed that when researchers assess whether a code snippet is obfuscated, their initial judgment relies on how comprehensible the class names, method names, and variable names are: recognizable names that follow common coding conventions, which we refer to as "Coding English", versus unintelligible names such as 'a', 'Zb', 'c4', '1li', and '0Oo'. We first attempted algorithmic approaches to this problem but had limited success. We then realized that this is essentially a classic NLP classification problem.

Following this insight, we recast obfuscation detection as a text classification problem, a task that deep neural networks already solve well. Our test results show that this transformation was highly successful. Detecting "String Encryption" is essentially a text classification problem as well, so we believe the approach can be extended to it without significant issues.

Proposed Methodology

Step 1: Decompilation and Smali Extraction

The first step is to decompile the Android APK and extract its Smali code. Our implementation uses our own decompilation engine, "Reactor"; open source tools such as Androguard or Apktool work as well. From each class, we extract the class name, method names, and class variable (field) names. These names serve as the features for further analysis. In theory, more features could be extracted, such as function parameter names and local variable names, but they do not significantly improve accuracy because the three features above already achieve high accuracy.
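As a rough illustration of this step, the sketch below pulls names out of the text of a single Smali file with regular expressions. It is not how Reactor works internally; the function name `extract_names`, the regular expressions, and the choice to drop `<init>`/`<clinit>` and to split inner-class names on `$` are all assumptions made for the example.

```python
import re

# Smali directives look like ".class <modifiers> Lpkg/Name;",
# ".field <modifiers> name:Type" and ".method <modifiers> name(args)Ret".
CLASS_RE = re.compile(r"^\.class\s+(?:[\w-]+\s+)*L([^;]+);", re.MULTILINE)
FIELD_RE = re.compile(r"^\.field\s+(?:[\w-]+\s+)*([^\s:]+):", re.MULTILINE)
METHOD_RE = re.compile(r"^\.method\s+(?:[\w-]+\s+)*([^\s(]+)\(", re.MULTILINE)


def extract_names(smali_text):
    """Return the class, method, and field names found in one Smali file."""
    class_match = CLASS_RE.search(smali_text)
    if class_match is None:
        return None
    # Keep only the simple class name and split inner-class parts on "$".
    simple = class_match.group(1).rsplit("/", 1)[-1]
    class_name = " ".join(part for part in simple.split("$") if part)
    # Constructors and static initializers carry no naming information.
    methods = [m for m in METHOD_RE.findall(smali_text) if not m.startswith("<")]
    fields = FIELD_RE.findall(smali_text)
    return {"class": class_name, "methods": methods, "fields": fields}


SAMPLE = """.class public final Lcom/example/MainActivity$$ExternalSyntheticLambda15;
.super Ljava/lang/Object;
.field public final synthetic f$0:Lcom/example/MainActivity;
.field public final synthetic f$1:I
.method public constructor <init>()V
.end method
.method public final run()V
.end method
"""

print(extract_names(SAMPLE))
```

A real pipeline would run this over every Smali file produced by the decompiler and would likely need to handle edge cases (annotations, unusual modifiers) that this sketch ignores.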

Step 2: Creation of Training Sets

Once we have extracted the necessary information, we proceed to create two distinct training sets. The first training set comprises classes that have undergone obfuscation, labeled as 1. The second training set consists of classes that remain unobfuscated, labeled as 0. This categorization forms the basis for training our model accurately.
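A minimal sketch of how such labeled samples could be assembled is shown below. The text layout (`Class: ...; Method: ... Field: ...`) is modeled on the single example quoted later in this article; the helper names `to_text` and `build_dataset` and the dictionary layout of the inputs are assumptions for illustration only.

```python
def to_text(names):
    """Serialize one class's names into a single line of text."""
    return "Class: {}; Method: {} Field: {}".format(
        names["class"],
        " ".join(names["methods"]),
        " ".join(names["fields"]),
    )


def build_dataset(obfuscated_classes, plain_classes):
    """Label obfuscated classes 1 and unobfuscated classes 0."""
    samples = [(to_text(n), 1) for n in obfuscated_classes]
    samples += [(to_text(n), 0) for n in plain_classes]
    return samples


# Hypothetical inputs: one obfuscated and one unobfuscated class.
obfuscated = [{"class": "a", "methods": ["b", "c"], "fields": ["d"]}]
plain = [{"class": "LoginActivity", "methods": ["onCreate"], "fields": ["userName"]}]

for text, label in build_dataset(obfuscated, plain):
    print(label, text)
```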

[Image: 1.PNG]

Step 3: Text Classification Neural Network Training

To classify obfuscated and non-obfuscated classes effectively, we construct a Text Classification neural network. This neural network is trained using the extracted features from Step 1 and the corresponding labels from Step 2. By leveraging the power of deep learning, the network learns to differentiate between obfuscated and non-obfuscated classes based on the provided training data.

The model consists of three main layers: an Embedding layer, an LSTM layer, and a Dense layer.

1) Embedding Layer:

The Embedding layer converts input integer sequences into dense vector representations.

2) LSTM Layer:

The LSTM (Long Short-Term Memory) layer is a type of recurrent neural network (RNN) capable of processing sequential data and capturing long-term dependencies. In this model, an LSTM layer with 128 units is utilized.

3) Dense Layer:

The Dense layer is a fully connected layer that performs a linear transformation on the LSTM layer's output and applies an activation function. In this case, the Dense layer has one unit with a sigmoid activation function, indicating binary classification.
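The three layers described above can be sketched in Keras roughly as follows. Only the layer types, the 128 LSTM units, and the single sigmoid output come from the description; the character-level encoding, the vocabulary size, the embedding dimension, the maximum length, and the optimizer are assumptions chosen to make the example complete. TensorFlow is imported inside `build_model` so that the encoding helper can be tried without it.

```python
VOCAB_SIZE = 97   # assumed: ID 0 = padding, 1 = unknown, 2..96 = printable ASCII
MAX_LEN = 128     # assumed maximum number of characters per sample


def encode(text, max_len=MAX_LEN):
    """Turn a string into a fixed-length list of integer IDs."""
    # Printable ASCII (code points 32..126) maps to IDs 2..96; every other
    # character, including non-ASCII identifiers, shares the unknown ID 1.
    ids = [ord(c) - 30 if 32 <= ord(c) <= 126 else 1 for c in text[:max_len]]
    return ids + [0] * (max_len - len(ids))


def build_model(vocab_size=VOCAB_SIZE, embed_dim=64):
    """Embedding -> LSTM(128) -> Dense(1, sigmoid) binary classifier."""
    from tensorflow import keras  # imported lazily; requires TensorFlow

    model = keras.Sequential([
        keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
        keras.layers.LSTM(128),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


print(encode("Class: a; Method: b Field: c", max_len=32))
```

With a model built this way, training would amount to calling `model.fit` on an array of encoded samples and their 0/1 labels; the sigmoid output can then be read as the probability that a class is obfuscated.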

Step 4: Train

We started with 1000 data samples and found that the results were already promising. As we increased the sample size to 10000, both the accuracy and validation accuracy became highly satisfactory. Ultimately, our model was trained using 100000 data samples. We attempted to augment the dataset further, but there was no significant improvement in the accuracy and validation accuracy. To avoid bias caused by data generated from a single APK, we randomly extracted several hundred APKs from the database to generate our data. From the resulting several million data samples, we randomly selected 100000 for training.
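The sampling described here could look like the sketch below: samples from many APKs are pooled, a fixed number is drawn at random, and part of the draw is held out for validation. The function name, the 20% validation share, and the fixed seed are assumptions; the article does not state how the validation set was formed.

```python
import random


def sample_training_data(samples_by_apk, total, val_fraction=0.2, seed=0):
    """Pool samples from all APKs, draw `total` at random, split train/val."""
    rng = random.Random(seed)
    pool = [s for samples in samples_by_apk.values() for s in samples]
    chosen = rng.sample(pool, min(total, len(pool)))
    n_val = round(len(chosen) * val_fraction)
    return chosen[n_val:], chosen[:n_val]


# Hypothetical pool: 5 APKs with 100 (text, label) samples each.
pool = {
    f"apk{i}": [(f"apk{i}-class{j}", j % 2) for j in range(100)]
    for i in range(5)
}
train, val = sample_training_data(pool, total=200)
print(len(train), len(val))
```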

[Image: 2.png]

Training results are as follows:

Training accuracy: 99.75%

Validation accuracy: 98.50%

Experimental Results and Analysis

In practice, to determine whether an APK is obfuscated, we classify each class in the APK individually. Dividing the number of classes flagged as obfuscated by the total number of classes gives the proportion of obfuscated code in the APK, which we call the obfuscation coverage. This approach may produce false positives or false negatives for individual classes that are only minimally obfuscated for technical reasons, or whose names make the judgment difficult. It is nevertheless highly unlikely to err on whether the APK as a whole exhibits obfuscation, because in a properly obfuscated APK no obfuscated class resembles code written in normal, understandable English; if it did, that would defeat the purpose of obfuscation. As a result, our model achieves an accuracy of nearly 100% when determining the presence of obfuscation in an APK.
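The per-APK calculation can be sketched as below, assuming each class has already been scored by the model with a value between 0 and 1. The function name and the 0.5 cut-off on the sigmoid output are assumptions, not values taken from the product.

```python
def obfuscation_coverage(class_scores, threshold=0.5):
    """Share of classes whose score exceeds the threshold (0.0 to 1.0)."""
    if not class_scores:
        return 0.0
    flagged = sum(1 for score in class_scores if score > threshold)
    return flagged / len(class_scores)


# Hypothetical scores for an APK with four classes, two of them flagged.
scores = [0.97, 0.91, 0.08, 0.12]
print(f"{obfuscation_coverage(scores):.0%}")
```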

After the first round of training, we obtained 1000 APKs from both F-Droid and Abuse for verification testing. F-Droid represents benign APKs, while Abuse represents malware APKs. During testing, we observed a high number of false positives, mostly on very short internal classes such as 'Class: MainActivity ExternalSyntheticLambda15; Method: run Field: f$0 f$1 f$2'. To address this, we extracted 3000 data samples from the APKs with false positives and added them to the training set. After retraining, the false positive rate dropped significantly.

Below are 100 randomly selected test samples [2]. Since our validation accuracy is about 98%, APKs with an obfuscation coverage of 1% or 2% should be considered non-obfuscated. The APK at 4% (md5: 8328cd96c931d06d25f67d42a50fd20d) is a false positive: it contains very few classes, so three misclassified classes were enough to produce this result. The APKs at 5% (md5: 923df6854199e999fdd274729b28a1ad) and 7% (md5: 71e293f29e636112e0a00ebac8cf3eb8) are genuine instances of obfuscation. Hence the model identifies obfuscation with an accuracy close to 100%, even when only a small part of the APK is obfuscated.
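The decision rule implied here, treating coverage at or below the model's roughly 2% error rate as noise, can be written as a short sketch. The function name and the exact form of the rule are assumptions; the coverage values are copied from the test samples in this article.

```python
# Coverage values (in percent) copied from the test samples.
SAMPLES = {
    "com.example.poleidoscope.apk": 1,
    "in.ac.iitb.cse.cartsbusboarding.apk": 2,
    "org.xapek.andiodine.apk": 5,
    "com.belt.space.apk": 7,
    "cz.jirkovsky.lukas.chmupocasi.apk": 99,
}


def is_obfuscated(coverage_percent, noise_floor_percent=2):
    """Flag an APK only when coverage exceeds the assumed noise floor."""
    return coverage_percent > noise_floor_percent


verdicts = {apk: is_obfuscated(cov) for apk, cov in SAMPLES.items()}
for apk, verdict in verdicts.items():
    print(apk, "obfuscated" if verdict else "not obfuscated")
```

A fixed percentage is a blunt rule for very small APKs, where a handful of misclassified classes already moves the coverage by several points, as the 4% false positive shows.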

In our training dataset, we did not encounter any samples with Unicode obfuscation. However, during testing, such cases are still recognized as obfuscated because the model has excellent recognition capabilities for non-obfuscated text. Therefore, even if there are other obfuscation patterns that were not present in the training samples, the model is still able to identify them.

APK | APK MD5 | Obfuscation Coverage
et.nWifiManager.apk | 11c43f6d781457352e5e61e725998ea8 | 0%
jackpal.androidterm.apk | 8bbc3d9173e6d6b19e561a8651e83731 | 0%
com.boombuler.widgets.contacts.apk | 8328cd96c931d06d25f67d42a50fd20d | 4%
cz.jirkovsky.lukas.chmupocasi.apk | 86f763c8cf4530e1c46c75d26374855a | 99%
com.example.poleidoscope.apk | 08cf9be157669f3e0f7dd88975fdc22c | 1%
dufmvh.frdnoj.oggtsh.apk | cf2f9963933457dcdd1f28fec054cd07 | 56%
ua.com.radiokot.lnaddr2invoice.apk | c1ade85027c6178e43daac2e957ba9b1 | 96%
org.openbmap.unifiedNlp.apk | 79ce98b9d38490625ad15f5948afe32f | 0%
com.dekics.chat.message.apk | dc84f225fdb1c21071ee70d43af39224 | 50%
org.getdisconnected.libreipsum.apk | a394d3131303bd24bdcddc7e0a507f0d | 1%
com.pnr.engproverbsandsayings.apk | 1d28e138a9ecf1c9b3240868879bbd54 | 10%
org.ligi.blexplorer.apk | 49619da57858ffdd6bd55bb5b962efe3 | 1%
net.osmand.srtmPlugin.paid.apk | c7dd9b418933ceea723527487bd94268 | 1%
org.broeuschmeul.android.gps.bluetooth.provider.apk | cf1d9aa2d5eec5a8e0af76d9708a8da0 | 0%
com.intense.pub1.sbgs.apk | e272df5c9abd7d4c03982bb506922428 | 15%
tgr.kitach.messenger.apk | cd4acd78cf29adf56837e944c0ea3791 | 50%
com.github.lamarios.clipious.apk | 0e728b50b101456d74329f97552ea2db | 94%
com.ctbcad.cnove01.apk | 782216c3d9db96da2ef0285daddbdcdb | 0%
in.ac.iitb.cse.cartsbusboarding.apk | ee83d9a3c3fcffbd833f1b73d28d28cd | 2%
de.reimardoeffinger.quickdic.apk | 8e5e7cc0e581fac6c5d83802dadc0095 | 98%
com.fastcleaner.forphoneandroid.freenoads.apk | c31ca58e67d55bb20a06e0f986cf04c1 | 92%
com.gh4a.apk | 22556b8c3b0f4196b0db777d64cac5ee | 1%
nznm.qfvxs.apk | a827ee829d6067eda9c19f1dee15b9af | 1%
com.freezingwind.animereleasenotifier.apk | c0786ccbcfe7cb57f82f36a66040d452 | 1%
ogjp.otmyswhz.apk | ef3c97b748088019dc986dce53ae0755 | 1%
com.scare.obscure.apk | b11e72c94d810958df65d8716d853bc3 | 46%
org.smc.inputmethod.indic.apk | c9eeb111666c723e3a4f78e2e11ab10d | 1%
com.blame.annual.apk | 376fc34c1eb64a348311156b1f22763e | 45%
org.xapek.andiodine.apk | 923df6854199e999fdd274729b28a1ad | 5%
ir.PluTus.pluto.apk | dc9f73c8ec88a8b493a15a3cbcb36f15 | 33%
org.sufficientlysecure.viewer.apk | dcb35395a9a3fa0aea0bd9c876c4fadc | 1%
ir.shz.shzkisi.apk | 7ec247424733c287c3322fc49f1a7766 | 33%
com.mimic.left.apk | 4076db4387eb8ddf8f2010e3db8c8b07 | 59%
com.igllc.reign.apk | bb78d33aac9b1c0c741b9e66d1ad9710 | 96%
org.tuxpaint.apk | 5f1d4d542004efd946a40a26166aed00 | 1%
Adliran.ir.apk | 3c0cccf2790ba49a122d0235225dbceb | 26%
com.believe.blouse.apk | 768ec2246d2c92330ba8fafe6513963e | 5%
Rahbar.Api.apk | 2f1570b5b5723d3f4ddd615905e8c08f | 27%
net.everythingandroid.smspopup.apk | 1e5d955dabdd0ee548054c8cdc223653 | 1%
com.cointrend.apk | cb3726beeb870d96e2dd458da66af96b | 97%
com.junjunguo.pocketmaps.apk | 0be11a3a032b35e2ce8021d32780cf32 | 21%
com.kabood.koroshkabir.apk | 6129cc4392d2e10ffdb80db67ca2534b | 24%
site.leos.setter.apk | 2f03d669939c74b508a3959838fbba4c | 94%
jp.co.qsdn.android.jinbei3d.apk | f25da1334e4db5d6c14c2361ba4defa8 | 1%
ir.game.co.apk | 9849247aef1aa1ae82c4dc06a638f29d | 1%
fr.xgouchet.texteditor.apk | a3f79b347a1c06140697326acb04581a | 1%
org.smssecure.smssecure.apk | a6dcb00ee7482256f8070b2d2eb23f62 | 2%
com.ebaschiera.triplecamel.apk | d36cd1850f8dfec7298c08e8eed3f997 | 1%
org.y20k.trackbook.apk | d4054bf60b2fbcfc152b32397cb861b0 | 97%
com.comfort.digital.apk | a32c36009a37893be90e4f385b26b5ee | 35%
com.kylecorry.trail_sense.apk | 42501430e5b199df00f0068b3bd59db4 | 92%
com.helphomestickers.heartcarejingchat.apk | fec9d39eb80814e1eec29e52e0fede2d | 50%
de.markusfisch.android.pielauncher.apk | d0cf7f183b84ff040f237da0d7e89c58 | 90%
org.xcsoar.apk | 35923a4197bcd2efd8d22a167af3f028 | 1%
com.takela.message.apk | 55774d1c8251ee3c12ce08af65000bd7 | 16%
tech.bogomolov.incomingsmsgateway.apk | 85d0288b9b04c7d71bfd8185a916490b | 1%
com.rmowa.wpamz.apk | 23e49cc28a5feeed4b9e362aa43e158a | 65%
piste.security.path.vf.apk | 95d33595783ede50bd428a18823ca0a9 | 20%
de.rwth_aachen.phyphox.apk | 0a3fa3b09980e629c6a983a2c33d0400 | 1%
com.brief.blouse.apk | be9d61e3363c3399b55a44895fd1cf60 | 47%
xjl.lrl.jzk.xkbnif.apk | f140ec3c051717491aac1a477c0f453a | 44%
net.goroid.maya.apk | 9b1de8718bb348e74ecde66dfa7332a8 | 19%
eu.polarclock.apk | c3c6f8ba040f1715d32ac7563d7d9b0c | 33%
tube.chikichiki.sako.apk | d79144a6e4aad73e78bc25af25e8f8d1 | 1%
org.dyndns.fules.ck.apk | 7c1e243288ff30b602976d2ce634b0f3 | 0%
com.nima.demomusix.apk | 93a79a8f1b2ad1eb2b670782e571107d | 1%
aps.js.piste.asd.apk | b1e0ad60b4113ecfdf74e930848dcab4 | 21%
com.tutpro.baresip.apk | 702d0800421413f73f0f3d65a577986e | 1%
iroj.jnafjk.apk | d0118fe80f1af4cf2fad4579fa7f8741 | 1%
de.monocles.mail.apk | 21ce417bd40a12c2333ab505a0095891 | 1%
com.example.myapplication.apk | 52a5b10ae074459fbbeb1a0e8c297eac | 1%
com.piolang.transltor.voice.apk | c4c0982149feaf5266d6b2a9c4634858 | 84%
net.sourceforge.kid3.apk | 7bff47951d893d50b7bf1bb151225006 | 1%
com.burtonben.goodlauncher.apk | d7ffbdf8e491f0c3e53901cf830f10b2 | 9%
com.howwatchfunsms.locktextmessage.apk | d59b366ab1870d17f9abdd4824461327 | 0%
free.vpn.unblock.proxy.turbovpn.apk | 1fd53adfc1ff5f6262567592dfc88fd4 | 70%
com.yshlhh.com.apk | f0c84c3ffcc77a88ce344e7f632afb2d | 67%
com.feis.bphealthy.blood.apk | 6e05b674fb8725a4f1faae9d39be1b94 | 14%
org.servalproject.apk | 8b2df68517574eb0c7d1b42858403695 | 1%
plus.H59300BC9.apk | 889e1c52bdebe6e1ae952bcc38b5daf1 | 11%
mon.suxzgi.apk | b48f43a3c6b7c4ef07b7f87b62f64d61 | 1%
com.seleuco.easshs.apk | 013a0f9ddc9db42f06ae2cd1b6228c8f | 31%
com.vicman.toonmeapp.apk | f724e92bdf978fb3bbdac308d4ba800c | 73%
com.hugo.apk | 6320c822ba4ce417ffb82746dbf6f6f8 | 27%
org.segin.bfinterpreter.apk | 69d3cd2ef0e619193f145c89b22ce920 | 1%
de.jonasbernard.tudarmstadtmoodlewrapper.apk | 29bf40b35ce52d6e44c61304fdd8561a | 1%
com.belt.space.apk | 71e293f29e636112e0a00ebac8cf3eb8 | 7%
center.bestlinks.samuraivpn.apk | f6e5f704bf5910b4d0aff44df2a77a8b | 91%
com.dev.xavier.tempusromanum.apk | 1d51ef04566cc66661358f7708c0a9d3 | 1%
com.zanghh.pdfreader.apk | e9133a533614dafee5780d50b29484c3 | 91%
org.avmedia.gshockGoogleSync.apk | f942cf3de1107400be084ddd596016d9 | 1%
com.github.igrmk.smsq.apk | 72dfae851b1c93838094fe3b059ac5b1 | 1%
com.ljechbei.apk | 87118a9b63adebe8ad642509ff76818b | 16%
org.courville.nova.apk | 7041af61162329c4e2022d82939a2d2d | 1%
com.cliambrown.easynoise.apk | 8755ffdd6fe155593af77536bc8d1da1 | 1%
net.mullvad.mullvadvpn.apk | 956659e2df6362a79e110fac0fda3534 | 65%
nl.eduvpn.app.apk | aa2099699b3c8b68aa33925899ad9e84 | 96%
com.cheogram.android.apk | 4987ea46c3679a191434c1546231bade | 1%
io.pslab.apk | 37f9a2a3e4c906bf2cc3c14895620b1e | 1%
ru.yanus171.feedexfork.apk | 742aebc4c88564678e78276dbf29e935 | 1%

Limitations and Future Directions

Our current model cannot detect Control Flow Obfuscation, and AndrODet's results are also unsatisfactory in this respect. In the future, we will research models designed specifically for code recognition to address this problem.

Compared to AndrODet, our model takes relatively more time to determine whether an APK is obfuscated because it needs to evaluate each class individually. Although batch processing is possible, an APK may contain thousands or even tens of thousands of classes. In a production environment this is acceptable, since analyzing an APK also involves other work, such as static and dynamic analysis, that takes longer to perform. The waiting time for obfuscation detection in our product is therefore reasonable, and it can be reduced further through parallel processing.
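One way to combine batching and parallelism is sketched below. The callable `predict_batch` stands in for whatever scores a list of class texts (for example, a wrapper around the trained model); it, the batch size, and the worker count are assumptions, and a thread pool only helps if the underlying prediction call releases the interpreter lock or runs out of process.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_apk(class_texts, predict_batch, batch_size=256, workers=4):
    """Score every class of an APK, batch by batch, on a thread pool."""
    batches = list(batched(class_texts, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input batches.
        results = list(pool.map(predict_batch, batches))
    return [score for batch in results for score in batch]


# Stand-in scorer for demonstration: returns the length of each text.
def fake_predict(batch):
    return [len(text) for text in batch]


scores = classify_apk(["a" * i for i in range(1000)], fake_predict)
print(len(scores))
```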

Conclusion

We have proposed a text classification-based approach to detect whether an APK is obfuscated. This method has not been previously applied in existing research and can be extended to obfuscation detection in other software as well as String Encryption detection. Furthermore, we suggest that the detection of obfuscation in an APK should be performed at the class level, which results in an accuracy rate of nearly 100%.

We have already implemented this method in a production environment.

Appendix

[1] https://0m1d.com/software/AndrODet

[2] https://drive.google.com/file/d/1OYYegY7MP7nGgfMORz_M7L4c3QFEjJW0/view?usp=sharing
