Implementation of obfuscation detection by Text Classification

LianSecurity Blog, 2023-06-25 10:00:15

Overview

Obfuscation in Android APKs

Obfuscation is a technique used in software development to make code more difficult to understand, analyze, and reverse engineer. It transforms code into a complex, convoluted form while preserving its functionality. The primary goal of obfuscation is to hinder unauthorized access to the code and protect intellectual property.

In Android APKs, several obfuscation techniques are commonly employed to protect the code and make it harder to understand or reverse engineer. Code obfuscation is one such technique that involves transforming the source code into an equivalent but more complex form, making it difficult to decipher and analyze. String encryption is another commonly used technique where sensitive strings, such as API keys or URLs, are encrypted to prevent easy extraction. Additionally, control flow obfuscation is employed to disrupt the logical flow of the code, making it challenging to follow the program's execution path and understand its functionality.

Obfuscation Implications for Android Security

The use of obfuscation techniques increases the difficulty of security research analysis and renders some signature-based detection methods ineffective. String encryption makes it challenging to trace critical information. These measures make malware more difficult to identify and track.

A Text Classification-Based Approach for Accurate Obfuscation Detection in Malware Analysis By LianSecurity

For these reasons, our company, LianSecurity, developed Incinerator, a product that aims to provide efficient, precise, and automated reverse engineering. After extensive malware analysis and a review of earlier obfuscation detection techniques, we implemented a text classification-based obfuscation detection method in Incinerator, our Android APK reverse engineering product. In our testing, the method reaches an accuracy of about 98%, which exceeded our expectations. The following sections describe the approach in detail.

Background

One of the state-of-the-art systems for detecting obfuscation in Android applications is "AndrODet" [1]. Its authors built an obfuscation detection system that extracts different features for each obfuscation type and then trains an online machine learning model. The obfuscation types it targets, and the F-measure AndrODet achieves for each, are shown below:

Identifier Renaming: 0.92
String Encryption: 0.79
Control Flow Obfuscation: 0.67

The Limitations of AndrODet in Android Context

In the context of Android, AndrODet faces certain limitations that affect its accuracy and effectiveness as a static code analysis tool. This section highlights two major issues:

APK-based Calculation and Weakened Features

AndrODet calculates its metrics over the entire APK, including both the core business code and the bundled libraries. In the Android ecosystem, dependency libraries can be very large, sometimes larger than the core business code itself, and most of the time they do not need to be obfuscated. When the calculation covers the entire APK, these large unobfuscated libraries dilute the signal from the obfuscated code, which reduces the accuracy of AndrODet's judgments.
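The dilution effect can be illustrated with a small calculation. The class counts below are made-up numbers, and the ratio is a simplified stand-in for a whole-APK metric, not AndrODet's actual feature computation.

```python
# Hypothetical class counts, chosen only to illustrate the dilution effect.
APP_CLASSES = 200        # core business code, fully obfuscated in this example
LIBRARY_CLASSES = 1800   # bundled dependencies, left unobfuscated


def obfuscated_fraction(obfuscated: int, total: int) -> float:
    """Return the fraction of classes that are obfuscated (0.0 if total is 0)."""
    return obfuscated / total if total else 0.0


# Measured over the whole APK, the obfuscated code is a small minority.
whole_apk = obfuscated_fraction(APP_CLASSES, APP_CLASSES + LIBRARY_CLASSES)
# Measured over the app's own code, everything is obfuscated.
app_only = obfuscated_fraction(APP_CLASSES, APP_CLASSES)

print(f"whole APK: {whole_apk:.0%}, app code only: {app_only:.0%}")
```

In this made-up case the same app looks 10% obfuscated when measured over the whole APK and 100% obfuscated when only its own code is counted.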

Inability to Handle Unicode Encoding

AndrODet's method of calculating distances is limited to ASCII encoding. However, Unicode identifiers have become increasingly common in obfuscation. As a result, AndrODet cannot handle or analyze code that has been obfuscated with Unicode names. This limitation hampers the tool's ability to accurately detect and assess obfuscated code in real-world production scenarios.

These limitations reduce AndrODet's accuracy in real-world production scenarios. Understanding them and their impact on production environments is essential for researchers and practitioners seeking to improve code analysis tools for Android app security.

An Approach We've Come Up With

Our approach focuses on detecting Identifier Renaming, the obfuscation technique most commonly used by malware. The method can be expanded to cover String Encryption as well. In our research, we observed that when researchers assess whether a code snippet is obfuscated, their initial judgment relies on how comprehensible the class names, method names, and variable names are: recognizable names that follow common coding conventions, which we call "Coding English", versus unintelligible names like 'a', 'Zb', 'c4', '1li', '0Oo', etc. We first tried algorithmic approaches to this problem, with limited success. We then realized that this is essentially a classic NLP classification problem.

With this insight, we transformed obfuscation detection into a text classification problem, a task that deep neural networks already handle well. Our test results show that this transformation was highly successful. "String Encryption" is essentially a text classification problem as well, so we believe the approach can be extended to String Encryption without significant issues.

Proposed Methodology

Step 1: Decompilation and Smali Extraction

The first step involves decompiling the Android APK and extracting Smali code. In our implementation, we use our own decompilation engine, "Reactor". Open source tools such as Androguard or Apktool work as well. From each class, we extract the class name, method names, and class variable (field) names. These names serve as the features for further analysis. In theory, more features can be extracted, such as function parameter names and local variables, but they do not significantly improve accuracy because the three features above already achieve high accuracy.
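As a rough illustration of this step, the sketch below pulls names out of the text of a single Smali file with regular expressions. It does not show Reactor, Androguard, or Apktool themselves. The function names, the sample class, and the output text format are assumptions made for this example. Method names are extracted as well, following the "Class / Method / Field" sample shown later in this post.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Smali directives look like:
#   .class public final Lcom/example/Foo;
#   .field private bar:I
#   .method public baz(I)V
CLASS_RE = re.compile(r"^\s*\.class\s+(?:[\w-]+\s+)*L([^;]+);", re.MULTILINE)
FIELD_RE = re.compile(r"^\s*\.field\s+(?:[\w-]+\s+)*([^\s:]+):", re.MULTILINE)
METHOD_RE = re.compile(r"^\s*\.method\s+(?:[\w-]+\s+)*([^\s(]+)\(", re.MULTILINE)


@dataclass
class ClassNames:
    class_name: str
    methods: List[str] = field(default_factory=list)
    fields: List[str] = field(default_factory=list)


def extract_names(smali_text: str) -> ClassNames:
    """Extract the simple class name, method names, and field names."""
    match = CLASS_RE.search(smali_text)
    if match is None:
        raise ValueError("no .class directive found")
    simple_name = match.group(1).rsplit("/", 1)[-1]  # drop the package path
    # Constructors and static initializers carry no naming information.
    methods = [m for m in METHOD_RE.findall(smali_text)
               if m not in ("<init>", "<clinit>")]
    return ClassNames(simple_name, methods, FIELD_RE.findall(smali_text))


def to_text(names: ClassNames) -> str:
    """Flatten the names into one line of text (format is an assumption)."""
    return "Class: {}; Method: {} Field: {}".format(
        names.class_name, " ".join(names.methods), " ".join(names.fields))


# A made-up Smali class used only for illustration.
SAMPLE = """\
.class public final Lcom/example/a/Zb;
.super Ljava/lang/Object;

.field private final c4:Ljava/lang/String;
.field public static a:I

.method public constructor <init>()V
    return-void
.end method

.method public final a(I)V
    return-void
.end method
"""

print(to_text(extract_names(SAMPLE)))
```

Each class then becomes one short line of text, which is the unit the classifier works on.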

Step 2: Creation of Training Sets

Once we have extracted the necessary information, we proceed to create two distinct training sets. The first training set comprises classes that have undergone obfuscation, labeled as 1. The second training set consists of classes that remain unobfuscated, labeled as 0. This categorization forms the basis for training our model accurately.
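A minimal sketch of this labeling step follows, assuming each class has already been flattened into one line of text. The function name `build_dataset` and the example strings are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def build_dataset(obfuscated: Sequence[str],
                  plain: Sequence[str],
                  seed: int = 0) -> Tuple[List[str], List[int]]:
    """Label obfuscated classes 1 and unobfuscated classes 0, then shuffle."""
    samples = [(text, 1) for text in obfuscated] + [(text, 0) for text in plain]
    random.Random(seed).shuffle(samples)  # fixed seed keeps the order reproducible
    texts = [text for text, _ in samples]
    labels = [label for _, label in samples]
    return texts, labels


# Made-up examples of the two kinds of class text.
obfuscated_classes = ["Class: Zb; Method: a b Field: c4", "Class: a; Method: b Field: c"]
plain_classes = ["Class: MainActivity; Method: onCreate Field: adapter"]

texts, labels = build_dataset(obfuscated_classes, plain_classes)
```

Shuffling the two sets together avoids feeding the network long runs of a single label during training.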

Step 3: Text Classification Neural Network Training

To classify obfuscated and non-obfuscated classes effectively, we construct a Text Classification neural network. This neural network is trained using the extracted features from Step 1 and the corresponding labels from Step 2. By leveraging the power of deep learning, the network learns to differentiate between obfuscated and non-obfuscated classes based on the provided training data.

The model consists of three main layers: an Embedding layer, an LSTM layer, and a Dense layer.

1) Embedding Layer:

The Embedding layer converts input integer sequences into dense vector representations.

2) LSTM Layer:

The LSTM (Long Short-Term Memory) layer is a type of recurrent neural network (RNN) capable of processing sequential data and capturing long-term dependencies. In this model, an LSTM layer with 128 units is utilized.

3) Dense Layer:

The Dense layer is a fully connected layer that performs a linear transformation on the LSTM layer's output and applies an activation function. In this case, the Dense layer has one unit with a sigmoid activation function, indicating binary classification.
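The three layers above can be sketched with Keras as follows. Only the layer types, the 128 LSTM units, and the single sigmoid output come from the description. The character-level encoding, the embedding size of 64, the Adam optimizer, and the binary cross-entropy loss are assumptions made for this example, as are all function names.

```python
from typing import Dict, Iterable, List

PAD, UNK = 0, 1  # reserved ids: padding, and any character not in the vocabulary


def build_vocab(texts: Iterable[str]) -> Dict[str, int]:
    """Map every character seen in the training texts to an integer id."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Turn text into a fixed-length integer sequence for the Embedding layer."""
    ids = [vocab.get(ch, UNK) for ch in text[:max_len]]  # truncate long texts
    return ids + [PAD] * (max_len - len(ids))            # pad short texts


def build_model(vocab_size: int, max_len: int, embed_dim: int = 64):
    """Embedding -> LSTM(128) -> Dense(1, sigmoid) binary classifier."""
    from tensorflow import keras  # imported lazily; requires TensorFlow

    model = keras.Sequential([
        keras.Input(shape=(max_len,)),
        keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
        keras.layers.LSTM(128),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    # Optimizer and loss are assumptions; the post does not state them.
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

With this encoding, `vocab_size` would be `len(vocab) + 2` to account for the two reserved ids. Characters that never appeared in training, for example unusual Unicode identifiers, fall into the `UNK` bucket instead of causing an error.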

Step 4: Train

We started with 1000 data samples and found that the results were already promising. As we increased the sample size to 10000, both the accuracy and validation accuracy became highly satisfactory. Ultimately, our model was trained using 100000 data samples. We attempted to augment the dataset further, but there was no significant improvement in the accuracy and validation accuracy. To avoid bias caused by data generated from a single APK, we randomly extracted several hundred APKs from the database to generate our data. From the resulting several million data samples, we randomly selected 100000 for training.
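The sampling described above might look like the sketch below: samples from many APKs are pooled, a fixed number is drawn at random, and part of the draw is held out for validation. The function name and the 20% validation share are assumptions, since the post does not state the split.

```python
import random
from typing import Dict, List, Tuple


def sample_training_data(samples_by_apk: Dict[str, List[str]],
                         n: int,
                         val_fraction: float = 0.2,
                         seed: int = 0) -> Tuple[List[str], List[str]]:
    """Pool samples from all APKs, draw n at random, split off validation."""
    pool = [s for samples in samples_by_apk.values() for s in samples]
    rng = random.Random(seed)
    chosen = rng.sample(pool, min(n, len(pool)))  # sampling without replacement
    n_val = round(len(chosen) * val_fraction)
    return chosen[n_val:], chosen[:n_val]


# Three made-up APKs with ten class samples each.
demo = {f"apk{i}": [f"apk{i}-class{j}" for j in range(10)] for i in range(3)}
train, val = sample_training_data(demo, n=20)
```

Drawing from the pooled samples rather than from one APK at a time keeps any single app's naming habits from dominating the training set.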

Training results are as follows:

Training accuracy: 99.75%

Validation accuracy: 98.50%

Experimental Results and Analysis

In practice, to determine whether an APK is obfuscated, we check each class within the APK for obfuscation. Dividing the number of obfuscated classes by the total number of classes gives the proportion of obfuscated code in the APK. This approach may produce false positives or false negatives for individual classes that are only minimally obfuscated for technical reasons, or whose names make it hard to tell whether they are obfuscated. It is nevertheless highly unlikely to err when deciding whether the APK as a whole exhibits obfuscation, because in a properly obfuscated APK no obfuscated class resembles code written in normal, understandable English. If it did, it would defeat the purpose of obfuscation. Therefore, our model achieves an accuracy of nearly 100% when determining the presence of obfuscation in an APK.
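The per-APK calculation is a simple ratio. In the sketch below, the classifier's per-class outputs are assumed to be probabilities between 0 and 1, and the 0.5 decision threshold is an assumption made for this example.

```python
from typing import Sequence


def obfuscation_coverage(class_scores: Sequence[float],
                         threshold: float = 0.5) -> float:
    """Share of classes whose score marks them as obfuscated."""
    if not class_scores:
        return 0.0  # an APK with no classes has nothing to flag
    flagged = sum(1 for score in class_scores if score >= threshold)
    return flagged / len(class_scores)


# Made-up scores for a four-class APK: two classes look obfuscated.
coverage = obfuscation_coverage([0.9, 0.1, 0.8, 0.2])
print(f"{coverage:.0%}")
```

Because the result is a proportion over all classes, a handful of misclassified classes moves it only slightly unless the APK is very small.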

After the first round of training, we obtained 1000 APKs from both F-Droid and Abuse for verification testing. F-Droid represents benign APKs, while Abuse represents malware APKs. During testing, we observed a high number of false positives, mostly on very short inner classes such as 'Class: MainActivity ExternalSyntheticLambda15; Method: run Field: f$0 f$1 f$2'. To address this issue, we extracted 3000 data samples from the false positive APKs and added them to the training set. Retraining significantly reduced the false positive rate.

Below are 100 randomly selected test samples [2]. Since our validation accuracy is about 98%, APKs with an obfuscation coverage of 1% or 2% should be considered non-obfuscated. The 4% case (md5: 8328cd96c931d06d25f67d42a50fd20d) is a false positive: the APK contains very few classes, so three misclassified classes were enough to produce this figure. The 5% (md5: 923df6854199e999fdd274729b28a1ad) and 7% (md5: 71e293f29e636112e0a00ebac8cf3eb8) cases are genuine instances of obfuscation. Hence, the model is close to 100% accurate at identifying obfuscation, even when only a small part of the APK is obfuscated.

In our training dataset, we did not encounter any samples with Unicode obfuscation. However, during testing, such cases are still recognized as obfuscated because the model has excellent recognition capabilities for non-obfuscated text. Therefore, even if there are other obfuscation patterns that were not present in the training samples, the model is still able to identify them.

APK	APK Md5	Obfuscation Coverage
et.nWifiManager.apk	11c43f6d781457352e5e61e725998ea8	0%
jackpal.androidterm.apk	8bbc3d9173e6d6b19e561a8651e83731	0%
com.boombuler.widgets.contacts.apk	8328cd96c931d06d25f67d42a50fd20d	4%
cz.jirkovsky.lukas.chmupocasi.apk	86f763c8cf4530e1c46c75d26374855a	99%
com.example.poleidoscope.apk	08cf9be157669f3e0f7dd88975fdc22c	1%
dufmvh.frdnoj.oggtsh.apk	cf2f9963933457dcdd1f28fec054cd07	56%
ua.com.radiokot.lnaddr2invoice.apk	c1ade85027c6178e43daac2e957ba9b1	96%
org.openbmap.unifiedNlp.apk	79ce98b9d38490625ad15f5948afe32f	0%
com.dekics.chat.message.apk	dc84f225fdb1c21071ee70d43af39224	50%
org.getdisconnected.libreipsum.apk	a394d3131303bd24bdcddc7e0a507f0d	1%
com.pnr.engproverbsandsayings.apk	1d28e138a9ecf1c9b3240868879bbd54	10%
org.ligi.blexplorer.apk	49619da57858ffdd6bd55bb5b962efe3	1%
net.osmand.srtmPlugin.paid.apk	c7dd9b418933ceea723527487bd94268	1%
org.broeuschmeul.android.gps.bluetooth.provider.apk	cf1d9aa2d5eec5a8e0af76d9708a8da0	0%
com.intense.pub1.sbgs.apk	e272df5c9abd7d4c03982bb506922428	15%
tgr.kitach.messenger.apk	cd4acd78cf29adf56837e944c0ea3791	50%
com.github.lamarios.clipious.apk	0e728b50b101456d74329f97552ea2db	94%
com.ctbcad.cnove01.apk	782216c3d9db96da2ef0285daddbdcdb	0%
in.ac.iitb.cse.cartsbusboarding.apk	ee83d9a3c3fcffbd833f1b73d28d28cd	2%
de.reimardoeffinger.quickdic.apk	8e5e7cc0e581fac6c5d83802dadc0095	98%
com.fastcleaner.forphoneandroid.freenoads.apk	c31ca58e67d55bb20a06e0f986cf04c1	92%
com.gh4a.apk	22556b8c3b0f4196b0db777d64cac5ee	1%
nznm.qfvxs.apk	a827ee829d6067eda9c19f1dee15b9af	1%
com.freezingwind.animereleasenotifier.apk	c0786ccbcfe7cb57f82f36a66040d452	1%
ogjp.otmyswhz.apk	ef3c97b748088019dc986dce53ae0755	1%
com.scare.obscure.apk	b11e72c94d810958df65d8716d853bc3	46%
org.smc.inputmethod.indic.apk	c9eeb111666c723e3a4f78e2e11ab10d	1%
com.blame.annual.apk	376fc34c1eb64a348311156b1f22763e	45%
org.xapek.andiodine.apk	923df6854199e999fdd274729b28a1ad	5%
ir.PluTus.pluto.apk	dc9f73c8ec88a8b493a15a3cbcb36f15	33%
org.sufficientlysecure.viewer.apk	dcb35395a9a3fa0aea0bd9c876c4fadc	1%
ir.shz.shzkisi.apk	7ec247424733c287c3322fc49f1a7766	33%
com.mimic.left.apk	4076db4387eb8ddf8f2010e3db8c8b07	59%
com.igllc.reign.apk	bb78d33aac9b1c0c741b9e66d1ad9710	96%
org.tuxpaint.apk	5f1d4d542004efd946a40a26166aed00	1%
Adliran.ir.apk	3c0cccf2790ba49a122d0235225dbceb	26%
com.believe.blouse.apk	768ec2246d2c92330ba8fafe6513963e	5%
Rahbar.Api.apk	2f1570b5b5723d3f4ddd615905e8c08f	27%
net.everythingandroid.smspopup.apk	1e5d955dabdd0ee548054c8cdc223653	1%
com.cointrend.apk	cb3726beeb870d96e2dd458da66af96b	97%
com.junjunguo.pocketmaps.apk	0be11a3a032b35e2ce8021d32780cf32	21%
com.kabood.koroshkabir.apk	6129cc4392d2e10ffdb80db67ca2534b	24%
site.leos.setter.apk	2f03d669939c74b508a3959838fbba4c	94%
jp.co.qsdn.android.jinbei3d.apk	f25da1334e4db5d6c14c2361ba4defa8	1%
ir.game.co.apk	9849247aef1aa1ae82c4dc06a638f29d	1%
fr.xgouchet.texteditor.apk	a3f79b347a1c06140697326acb04581a	1%
org.smssecure.smssecure.apk	a6dcb00ee7482256f8070b2d2eb23f62	2%
com.ebaschiera.triplecamel.apk	d36cd1850f8dfec7298c08e8eed3f997	1%
org.y20k.trackbook.apk	d4054bf60b2fbcfc152b32397cb861b0	97%
com.comfort.digital.apk	a32c36009a37893be90e4f385b26b5ee	35%
com.kylecorry.trail_sense.apk	42501430e5b199df00f0068b3bd59db4	92%
com.helphomestickers.heartcarejingchat.apk	fec9d39eb80814e1eec29e52e0fede2d	50%
de.markusfisch.android.pielauncher.apk	d0cf7f183b84ff040f237da0d7e89c58	90%
org.xcsoar.apk	35923a4197bcd2efd8d22a167af3f028	1%
com.takela.message.apk	55774d1c8251ee3c12ce08af65000bd7	16%
tech.bogomolov.incomingsmsgateway.apk	85d0288b9b04c7d71bfd8185a916490b	1%
com.rmowa.wpamz.apk	23e49cc28a5feeed4b9e362aa43e158a	65%
piste.security.path.vf.apk	95d33595783ede50bd428a18823ca0a9	20%
de.rwth_aachen.phyphox.apk	0a3fa3b09980e629c6a983a2c33d0400	1%
com.brief.blouse.apk	be9d61e3363c3399b55a44895fd1cf60	47%
xjl.lrl.jzk.xkbnif.apk	f140ec3c051717491aac1a477c0f453a	44%
net.goroid.maya.apk	9b1de8718bb348e74ecde66dfa7332a8	19%
eu.polarclock.apk	c3c6f8ba040f1715d32ac7563d7d9b0c	33%
tube.chikichiki.sako.apk	d79144a6e4aad73e78bc25af25e8f8d1	1%
org.dyndns.fules.ck.apk	7c1e243288ff30b602976d2ce634b0f3	0%
com.nima.demomusix.apk	93a79a8f1b2ad1eb2b670782e571107d	1%
aps.js.piste.asd.apk	b1e0ad60b4113ecfdf74e930848dcab4	21%
com.tutpro.baresip.apk	702d0800421413f73f0f3d65a577986e	1%
iroj.jnafjk.apk	d0118fe80f1af4cf2fad4579fa7f8741	1%
de.monocles.mail.apk	21ce417bd40a12c2333ab505a0095891	1%
com.example.myapplication.apk	52a5b10ae074459fbbeb1a0e8c297eac	1%
com.piolang.transltor.voice.apk	c4c0982149feaf5266d6b2a9c4634858	84%
net.sourceforge.kid3.apk	7bff47951d893d50b7bf1bb151225006	1%
com.burtonben.goodlauncher.apk	d7ffbdf8e491f0c3e53901cf830f10b2	9%
com.howwatchfunsms.locktextmessage.apk	d59b366ab1870d17f9abdd4824461327	0%
free.vpn.unblock.proxy.turbovpn.apk	1fd53adfc1ff5f6262567592dfc88fd4	70%
com.yshlhh.com.apk	f0c84c3ffcc77a88ce344e7f632afb2d	67%
com.feis.bphealthy.blood.apk	6e05b674fb8725a4f1faae9d39be1b94	14%
org.servalproject.apk	8b2df68517574eb0c7d1b42858403695	1%
plus.H59300BC9.apk	889e1c52bdebe6e1ae952bcc38b5daf1	11%
mon.suxzgi.apk	b48f43a3c6b7c4ef07b7f87b62f64d61	1%
com.seleuco.easshs.apk	013a0f9ddc9db42f06ae2cd1b6228c8f	31%
com.vicman.toonmeapp.apk	f724e92bdf978fb3bbdac308d4ba800c	73%
com.hugo.apk	6320c822ba4ce417ffb82746dbf6f6f8	27%
org.segin.bfinterpreter.apk	69d3cd2ef0e619193f145c89b22ce920	1%
de.jonasbernard.tudarmstadtmoodlewrapper.apk	29bf40b35ce52d6e44c61304fdd8561a	1%
com.belt.space.apk	71e293f29e636112e0a00ebac8cf3eb8	7%
center.bestlinks.samuraivpn.apk	f6e5f704bf5910b4d0aff44df2a77a8b	91%
com.dev.xavier.tempusromanum.apk	1d51ef04566cc66661358f7708c0a9d3	1%
com.zanghh.pdfreader.apk	e9133a533614dafee5780d50b29484c3	91%
org.avmedia.gshockGoogleSync.apk	f942cf3de1107400be084ddd596016d9	1%
com.github.igrmk.smsq.apk	72dfae851b1c93838094fe3b059ac5b1	1%
com.ljechbei.apk	87118a9b63adebe8ad642509ff76818b	16%
org.courville.nova.apk	7041af61162329c4e2022d82939a2d2d	1%
com.cliambrown.easynoise.apk	8755ffdd6fe155593af77536bc8d1da1	1%
net.mullvad.mullvadvpn.apk	956659e2df6362a79e110fac0fda3534	65%
nl.eduvpn.app.apk	aa2099699b3c8b68aa33925899ad9e84	96%
com.cheogram.android.apk	4987ea46c3679a191434c1546231bade	1%
io.pslab.apk	37f9a2a3e4c906bf2cc3c14895620b1e	1%
ru.yanus171.feedexfork.apk	742aebc4c88564678e78276dbf29e935	1%

Limitations and Future Directions

Our current model cannot detect Control Flow Obfuscation, and AndrODet's results for it are not satisfactory either. In the future, we will research models designed specifically for code recognition to address this problem.

Compared to AndrODet, our model takes more time to determine whether an APK is obfuscated because it needs to evaluate each class individually. Batch processing is possible, but an APK may contain thousands or even tens of thousands of classes. In a production environment this is acceptable, since analyzing an APK also involves steps such as static analysis and dynamic analysis, which take longer to perform. Therefore, in our product, the waiting time for obfuscation detection is reasonable. This time can also be reduced through parallel processing.
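Batching and parallel processing can be combined along the lines of the sketch below. The batch size, the worker count, and the use of threads are assumptions, and `dummy_predict` is a placeholder standing in for the real model's batch prediction call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_in_parallel(class_texts: Sequence[str],
                         predict_batch: Callable[[Sequence[str]], List[float]],
                         batch_size: int = 256,
                         workers: int = 4) -> List[float]:
    """Score all classes batch by batch, running batches concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so scores line up with classes.
        results = pool.map(predict_batch, batches(class_texts, batch_size))
        return [score for batch in results for score in batch]


def dummy_predict(batch: Sequence[str]) -> List[float]:
    """Placeholder scorer: treats names of one or two characters as obfuscated."""
    return [1.0 if len(text) <= 2 else 0.0 for text in batch]


scores = classify_in_parallel(
    ["a", "MainActivity", "Zb", "parseJson", "c4"],
    dummy_predict, batch_size=2, workers=2)
```

Whether threads or separate processes give the better speedup depends on the model runtime, so the executor choice here is only a starting point.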

Conclusion

We have proposed a text classification-based approach to detect whether an APK is obfuscated. To our knowledge, this method has not been applied in existing research, and it can be extended to obfuscation detection in other software as well as to String Encryption detection. Furthermore, we suggest that obfuscation detection for an APK should be performed at the class level, which results in an accuracy of nearly 100%.

We have already implemented this method in a production environment.

Appendix

[1]. https://0m1d.com/software/AndrODet

[2]. https://drive.google.com/file/d/1OYYegY7MP7nGgfMORz_M7L4c3QFEjJW0/view?usp=sharing
