Problem Description
The openness of the Android ecosystem has led to a proliferation of malicious software (Malware). According to a report by the 360 Security Center, tens of thousands of Android Malware samples are intercepted daily, making Android Malware detection a hot topic among researchers. Traditional detection methods rely primarily on rule-based or signature-based mechanisms. A signature-based detector is relatively simple to implement with Yara, but it is information-intensive and requires the continual collection of new signatures to stay effective. Rule-based implementations, on the other hand, are complex to build and highly susceptible to false positives, or to evasion by cleverly crafted Malware.
With the rising popularity of machine learning, an increasing number of researchers are exploring its application to Malware detection. The core idea is to extract information from Android APK files to train a model, which is then used to predict whether other APK files are malicious. Based on how the information is extracted, there are two main categories: static analysis, which extracts components such as permissions, intents, and receivers, as well as code calls obtained through decompilation; and dynamic analysis, which actually installs and executes the APK, intercepting network requests and monitoring system API calls to gather training data.
Because the dynamic detection functionality of our "Incinerator" project is not yet fully implemented, we currently lack the conditions to research dynamic analysis. We have therefore opted to investigate the static analysis approach.
Current Mainstream Solutions
The machine learning training schemes based on static analysis primarily fall into the following categories:
1. **Permission Extraction Training**: extracting `permission` information from `AndroidManifest.xml` for model training.
2. **Comprehensive Information Extraction Training**: extracting information such as `permissions`, `intents`, and `receivers` from `AndroidManifest.xml`, together with Android system API call information obtained from the decompiled APK code according to predefined rules, for model training.
3. **Whole APK Training**: using the entire APK file as the training input, either as the raw binary byte stream or as the `opcode` sequences extracted through decompilation.
According to the literature, these schemes can achieve accuracy above 90%. In particular, scheme 3 reaches an accuracy of 92%-93%, while schemes 1 and 2 exceed 95% in most studies.
We attempted to replicate the schemes mentioned in the literature, initially focusing on the permission-based scheme for validation.
Data Sources:
- Malware Collection:
  - 2,000 Malware APKs from the threat intelligence company abuse.ch (referred to as MB).
  - Malware collections from VirusShare for the years 2020, 2021, and 2022 (referred to as VS2020/VS2021/VS2022).
- Benign Collection:
  - 10,000 APKs downloaded from Tencent's MyApp.
  - 10,000 APKs downloaded from APKPure.
Training Methodology
Through static analysis, AOSP (Android Open Source Project) permissions are extracted from the `AndroidManifest.xml` of each APK and fed into the model as One-hot encoded features. For model selection, we tried traditional binary classifiers such as Random Forest and SVM. In testing, Random Forest yielded the best results, with an accuracy of 98%.
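As a minimal sketch of this pipeline (with made-up permission lists rather than our real corpus, and scikit-learn standing in for the full training code), the One-hot encoding and Random Forest steps look like this:

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical permission sets, as would be parsed from AndroidManifest.xml.
samples = [
    ["android.permission.INTERNET", "android.permission.READ_SMS",
     "android.permission.SEND_SMS"],                               # malware-like
    ["android.permission.INTERNET", "android.permission.CAMERA"],  # benign-like
    ["android.permission.READ_SMS", "android.permission.WRITE_SMS"],
    ["android.permission.INTERNET", "android.permission.NFC"],
]
labels = [1, 0, 1, 0]  # 1 = Malware, 0 = Benign

# One-hot encode each APK's permission list into a fixed-width vector.
encoder = MultiLabelBinarizer()
X = encoder.fit_transform(samples)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

# Predict a new APK by encoding its permissions with the same vocabulary.
new_apk = encoder.transform([["android.permission.READ_SMS",
                              "android.permission.SEND_SMS"]])
print(clf.predict(new_apk))
```

In the real experiments the feature matrix is built from the 265 AOSP permissions observed across the corpus, but the shape of the code is the same.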
We chose the MyApp APKs as benign samples and VS2022 as the Malware samples for training. The training results are as follows:
| Model | Precision | Recall | FPR |
|---|---|---|---|
| Random Forest | 0.983 | 0.983 | 0.056 |
| SVM | 0.981 | 0.977 | 0.063 |
Then, we proceeded to test and validate on other datasets:
| Dataset | Precision | Recall | FPR |
|---|---|---|---|
| APKPure | 0.0 | NAN | 0.59 |
| MB | 1.0 | 0.95 | NAN |
| VS2020 | 1.0 | 0.96 | NAN |
| VS2021 | 1.0 | 0.94 | NAN |
Upon validating with the APKPure dataset, we discovered an exceptionally high false positive rate in the model, exceeding 50%. This indicated that the model performed poorly in cross-validation across different datasets. Additionally, the high accuracy rates obtained on the MB, VS2020, and VS2021 datasets became meaningless due to the high false positive rate.
To dig deeper into the model's predictive behavior, we used a Linear Support Vector Machine (LinearSVC) to interpret its predictions and look for potential issues. A total of 265 permissions were used to train the model; below are the 30 permissions with the greatest impact on Malware predictions:
```
0 1.9507425950717683 android.permission.READ_SMS
1 1.6805547441380115 android.permission.SEND_SMS
2 1.5291784053142392 android.permission.RECEIVE_SMS
3 1.281383891333467 android.permission.WRITE_SMS
4 1.1385944832617678 android.permission.GET_DETAILED_TASKS
5 1.0870145778775504 android.permission.MANAGE_USERS
6 0.9822953162458009 android.permission.SET_TIME_ZONE
7 0.9815855293627985 android.permission.REQUEST_DELETE_PACKAGES
8 0.8705538278525148 android.permission.ACCOUNT_MANAGER
9 0.7701851337780519 android.permission.ACCESS_CACHE_FILESYSTEM
10 0.7493889020376178 android.permission.PERSISTENT_ACTIVITY
11 0.742267985802697 android.permission.SET_PREFERRED_APPLICATIONS
12 0.6575763216374741 android.permission.USE_SIP
13 0.6423455602781643 android.permission.MODIFY_PHONE_STATE
14 0.5733719308777389 android.permission.READ_CALL_LOG
15 0.5713221448442122 android.permission.WRITE_SECURE_SETTINGS
16 0.5177117115666185 android.permission.CLEAR_APP_CACHE
17 0.5013751180995185 android.permission.WRITE_SYNC_SETTINGS
18 0.47540432455574055 android.permission.INJECT_EVENTS
19 0.450576746748121 android.permission.BIND_ACCESSIBILITY_SERVICE
20 0.4497437629117625 android.permission.READ_SYNC_STATS
21 0.40721040702182304 com.android.alarm.permission.SET_ALARM
22 0.3958974436391258 android.permission.GET_PACKAGE_SIZE
23 0.35828369132005317 android.permission.TRANSMIT_IR
24 0.3538089622374305 android.permission.CHANGE_COMPONENT_ENABLED_STATE
25 0.3303834311984685 android.permission.STATUS_BAR
26 0.3277728921018696 android.permission.WRITE_USER_DICTIONARY
27 0.31322691738916597 android.permission.SET_DEBUG_APP
28 0.28600828593282673 android.permission.INSTALL_PACKAGES
29 0.27804088205285526 android.permission.SHUTDOWN
```
The 30 permissions most crucial for predicting Benign results:
```
1 -1.0280830288092226 android.permission.FORCE_STOP_PACKAGES
2 -1.0244749163270055 android.permission.DELETE_CACHE_FILES
3 -0.9235183435775582 android.permission.READ_PRIVILEGED_PHONE_STATE
4 -0.7975588094210508 android.permission.USE_BIOMETRIC
5 -0.7691538868495551 android.permission.READ_CELL_BROADCASTS
6 -0.7288571523071693 android.permission.REQUEST_INSTALL_PACKAGES
7 -0.7278186994140812 android.permission.WRITE_CALL_LOG
8 -0.7029898754031535 android.permission.READ_SEARCH_INDEXABLES
9 -0.6832562629713737 android.permission.ACCESS_NOTIFICATION_POLICY
10 -0.6442707037030093 android.permission.BIND_NOTIFICATION_LISTENER_SERVICE
11 -0.6229441323892875 android.permission.CAPTURE_AUDIO_OUTPUT
12 -0.5951302503005503 android.permission.REORDER_TASKS
13 -0.552113274404841 android.permission.FACTORY_TEST
14 -0.5512329811397917 android.permission.CAMERA
15 -0.5415431826751977 android.permission.PACKAGE_USAGE_STATS
16 -0.5373788445105623 android.permission.READ_SYNC_SETTINGS
17 -0.5300427083556158 android.permission.ACCESS_WIFI_STATE
18 -0.48952375397337794 android.permission.READ_PHONE_NUMBERS
19 -0.4822239255635727 android.permission.STOP_APP_SWITCHES
20 -0.4525220364959383 android.permission.WRITE_MEDIA_STORAGE
21 -0.4133049145725493 com.android.browser.permission.WRITE_HISTORY_BOOKMARKS
22 -0.3902532535519829 android.permission.CAPTURE_VIDEO_OUTPUT
23 -0.34681147328619505 android.permission.READ_FRAME_BUFFER
24 -0.34134222449779317 android.permission.WRITE_GSERVICES
25 -0.3335042039412585 android.permission.BIND_APPWIDGET
26 -0.3263774109427998 android.permission.AUTHENTICATE_ACCOUNTS
27 -0.3136298914538836 android.permission.NFC
28 -0.3000955825422318 android.permission.READ_EXTERNAL_STORAGE
29 -0.2846046321402758 android.permission.CALL_PRIVILEGED
30 -0.28338090002182315 android.permission.READ_CALENDAR
```
In the listings above, the second column shows the weights learned by the SVM. Given the label settings, where Malware is marked as 1 and Benign as 0 and the training data is formatted as boolean values (0,1,1,0,...), a positive weight means the permission pushes the prediction toward Malware, and the larger the weight, the stronger its influence. Conversely, a negative weight pushes the prediction toward Benign, and the more negative the weight, the stronger that influence.
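The weights above are simply the coefficients of the linear model. A minimal sketch of how they can be obtained and ranked, using synthetic data (the `perm_i` names are placeholders, not our real permission vocabulary):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_perms = 5
perm_names = [f"perm_{i}" for i in range(n_perms)]  # stand-ins for permission names

# Synthetic one-hot matrix; in this toy setup the label is driven entirely
# by permission 0, so its weight should come out large and positive.
X = rng.integers(0, 2, size=(200, n_perms))
y = X[:, 0]

svm = LinearSVC(C=1.0).fit(X, y)
weights = svm.coef_[0]

# Rank by weight: positive pushes toward Malware (1), negative toward Benign (0).
ranked = sorted(zip(weights, perm_names), reverse=True)
for w, name in ranked:
    print(f"{w:+.3f}  {name}")
```

With the real 265-permission matrix, the same `svm.coef_[0]` vector produces the two rankings shown above.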
Upon analyzing these permissions and their functions, we found that the permissions associated with Malware generally pose a higher risk than those associated with Benign, so to an extent the model's design is rational. For instance, the model correctly identified that SMS-related permissions are mainly associated with Malware and assigned them high weights, implying that an app requesting SMS permissions is highly suspicious. In reality, ordinary apps should not request such permissions, as SMS management is typically the job of system apps.
However, a problem arises here: permissions exist precisely because the Android system considers certain actions potentially harmful and requires user confirmation. In theory, then, every requested permission could cause harm to the user, and the absence of permission requests should be the only genuinely positive signal. But in a binary classification setting the model must discriminate between the two classes, so the mere existence of the Benign class forces some permissions to be treated as evidence supporting Benign.
Now, let's analyze why such a high false positive rate occurs. Using LinearSVC to interpret the model's predictions, we examined the permission lists of several false-positive samples (each group below corresponds to one misclassified APK):
```
0.1773649887447295 android.permission.WAKE_LOCK
0.01285824377030036 android.permission.INTERNET
-0.1357928094523775 android.permission.ACCESS_NETWORK_STATE

0.43102404170044467 com.android.alarm.permission.SET_ALARM
0.1773649887447295 android.permission.WAKE_LOCK
0.14741402851800423 android.permission.SYSTEM_ALERT_WINDOW
0.02740438240042149 android.permission.FOREGROUND_SERVICE
0.01285824377030036 android.permission.INTERNET
-0.1357928094523775 android.permission.ACCESS_NETWORK_STATE
-0.15043626374678254 android.permission.WRITE_EXTERNAL_STORAGE
-0.1975995718519041 android.permission.CHANGE_WIFI_STATE
-0.20461138790573433 android.permission.VIBRATE
-0.511067438637911 android.permission.ACCESS_WIFI_STATE

0.1773649887447295 android.permission.WAKE_LOCK
0.02740438240042149 android.permission.FOREGROUND_SERVICE
0.01285824377030036 android.permission.INTERNET
-0.1357928094523775 android.permission.ACCESS_NETWORK_STATE
-0.33867385510052594 android.permission.READ_EXTERNAL_STORAGE
-0.511067438637911 android.permission.ACCESS_WIFI_STATE
```
And the permission list of a correctly identified (true positive) sample:
```
0.32757400447767016 android.permission.INSTALL_PACKAGES
0.2870058866311678 android.permission.READ_PHONE_STATE
0.1773649887447295 android.permission.WAKE_LOCK
0.1545767541451571 android.permission.FLASHLIGHT
0.14613075920332474 android.permission.BLUETOOTH_ADMIN
0.140268653568319 android.permission.GET_ACCOUNTS
0.08641386050999389 android.permission.MOUNT_UNMOUNT_FILESYSTEMS
0.06460516872049353 android.permission.ACCESS_COARSE_LOCATION
0.01285824377030036 android.permission.INTERNET
-0.009804892771664459 android.permission.ACCESS_FINE_LOCATION
-0.12321341834571817 android.permission.READ_LOGS
-0.1357928094523775 android.permission.ACCESS_NETWORK_STATE
-0.15043626374678254 android.permission.WRITE_EXTERNAL_STORAGE
-0.15994619600450963 android.permission.CHANGE_NETWORK_STATE
-0.16005902734200772 android.permission.WRITE_SETTINGS
-0.1975995718519041 android.permission.CHANGE_WIFI_STATE
-0.20461138790573433 android.permission.VIBRATE
-0.23536025455979454 android.permission.CALL_PHONE
-0.24802834827531783 android.permission.ACCESS_LOCATION_EXTRA_COMMANDS
-0.30018060973660377 android.permission.BLUETOOTH
-0.33867385510052594 android.permission.READ_EXTERNAL_STORAGE
-0.511067438637911 android.permission.ACCESS_WIFI_STATE
-0.5625902678304402 android.permission.CAMERA
-0.7242676191415552 android.permission.REQUEST_INSTALL_PACKAGES
```
Through analysis, we identified a pattern: APKs with fewer permissions tend to be misjudged, while those with more permissions are generally predicted correctly. Further investigation showed that this arose mainly because most APKs in the APKPure set request few permissions, while our training data was chiefly based on the MyApp samples, which request many. To some extent, then, the prediction errors were driven by distribution differences between the sample sets.
To address this issue, a straightforward approach is to incorporate the APKPure data into the training process, aiming to enhance the model's generalization capability and prediction accuracy.
We took the following steps: we randomly selected half of the APKPure samples (5,000 APKs) and half of the MyApp samples (roughly 5,000 APKs) for joint model training, then used the newly trained model to predict the samples that were not part of the training. The results demonstrated a significant improvement in the new model's predictive accuracy.
| Model | Precision | Recall | FPR |
|---|---|---|---|
| Random Forest | 0.994 | 0.967 | 0.008 |
| SVM | 0.994 | 0.967 | 0.008 |
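The resampling step above can be sketched as follows (the file names are placeholders for the real APK lists):

```python
import random

random.seed(0)

# Stand-in file lists for the two benign corpora.
apkpure = [f"apkpure_{i}.apk" for i in range(10000)]
myapp = [f"myapp_{i}.apk" for i in range(10000)]

# Shuffle, then take half of each corpus for training and hold out the rest.
random.shuffle(apkpure)
random.shuffle(myapp)

train_benign = apkpure[:5000] + myapp[:5000]
holdout_benign = apkpure[5000:] + myapp[5000:]
print(len(train_benign), len(holdout_benign))  # 10000 10000
```

The held-out halves are what the cross-dataset table below evaluates against.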
Then we conducted testing and validation on other datasets:
| Dataset | Precision | Recall | FPR |
|---|---|---|---|
| APKPure (held-out samples) | 0.0 | NAN | 0.018 |
| MB | 1.0 | 0.878 | NAN |
| VS2020 | 1.0 | 0.92 | NAN |
| VS2021 | 1.0 | 0.89 | NAN |
The false positive rate was reduced to an acceptable level. This experiment highlighted a crucial point: achieving ideal results on a training set is relatively easy, but predicting accurately in the real world is hard. No one can guarantee that the collected samples perfectly reflect the real-world distribution, and our goal is to identify the features genuinely associated with Malware. Hence, we decided to explore other possible solutions.
1. Training Based on Intents and Receivers
Subsequently, we expanded our feature set, incorporating information extracted from `AndroidManifest.xml` such as `intents` and `receivers` into the training. However, this did not improve the model's accuracy.
2. Training Based on System API Calls
We further attempted to extract all system API calls from the APKs, converting them into a format suitable for Convolutional Neural Networks (CNN), and then trained the model using CNN. The model achieved a satisfactory accuracy rate of 97% on the training set, but the performance remained subpar during cross-dataset validation.
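One way to turn variable-length API call sequences into the fixed-size numeric input a CNN expects is sketched below, with placeholder API names and a toy vocabulary (the real vocabulary covers all extracted system APIs):

```python
import numpy as np

# Hypothetical system API call sequence decompiled from one APK.
calls = ["Landroid/telephony/SmsManager;->sendTextMessage",
         "Ljava/net/URL;-><init>",
         "Landroid/telephony/SmsManager;->sendTextMessage"]

# Global vocabulary mapping each known system API to an integer id.
vocab = {"Landroid/telephony/SmsManager;->sendTextMessage": 1,
         "Ljava/net/URL;-><init>": 2}

MAX_LEN = 8  # pad or truncate every sequence to this length
ids = [vocab.get(c, 0) for c in calls][:MAX_LEN]
ids += [0] * (MAX_LEN - len(ids))  # 0 = padding / unknown API

# One-hot each position, giving a (MAX_LEN, vocab_size + 1) image-like
# matrix that a 1D or 2D convolutional layer can consume.
grid = np.eye(len(vocab) + 1)[ids]
print(grid.shape)  # (8, 3)
```

Stacking one such grid per APK yields the batched tensor fed to the CNN.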
During this process, we encountered several issues:
2.1 Insignificant Differences in API Call Frequency
Initially, after decompiling, we extracted all API calls, only to find no significant difference in their call frequencies between the two classes. For instance, we initially believed that calls to `accessibilityservice` were strongly related to Malware, but found them frequently present in benign software as well. We later understood that this is mainly because most APKs depend on the `android` library and other third-party libraries, which contain vast numbers of system API calls; since both Malware and Benign APKs rely heavily on these libraries, it became hard to differentiate the two classes from the statistics of system API calls. Even after using Incinerator's SCA analysis feature to detect and eliminate these third-party libraries, the results were still disappointing.
2.2 Interference from Third-Party Libraries
We found that very few studies consider the interference from third-party libraries or propose a concrete plan to eliminate them. Without removing these libraries, the static-analysis-based method is almost meaningless: static extraction pulls out a large number of never-called APIs that show no significant difference between Malware and Benign APKs.
2.3 Decompiling and Unshelling Issues
Since this method extracts system API calls from APKs, decompiling is necessary. However, many Malware samples are packed ("shelled"), and unpacking is a complex adversarial technique that very few of the references we consulted even mention. Unpacking may be too challenging for non-specialist teams, but at the very least, unpacked APKs should be selected for the dataset. Without unpacking, only a very few system API calls such as the `dexloader` entry points can be extracted directly, which significantly hurts training effectiveness.
3. Reverting to the Model Based on Permissions
Given the unsatisfactory outcomes of the previous attempts, we reverted to the model based on Permissions, seeking to enhance cross-validation accuracy through other methods. We made the following attempts:
3.1 Selecting Permissions through the Select Model
The aim of this method was to find the Permissions actually utilized by Malware. However, the actual test results showed that the training accuracy decreased, the validation accuracy also decreased, and the cross-validation accuracy decreased even more.
3.2 Our experimental process is as follows
Using the `SelectFromModel` method on a `RandomForestClassifier` model (trained with all the data), we extracted feature importances and selected the top 100, 50, and 30 permissions respectively for experimentation, obtaining the following results:
- Selecting 100 Permissions

  Training data:

  | Model | Precision | Recall | FPR |
  |---|---|---|---|
  | Random Forest | 0.989 | 0.974 | 0.013 |

  Cross-dataset validation:

  | Dataset | Precision | Recall | FPR |
  |---|---|---|---|
  | APKPure | 0.0 | NAN | 0.64 |
  | MB | 1.0 | 0.95 | NAN |
  | VS2020 | 1.0 | 0.96 | NAN |
  | VS2021 | 1.0 | 0.94 | NAN |
- Selecting 50 Permissions

  Training data:

  | Model | Precision | Recall | FPR |
  |---|---|---|---|
  | Random Forest | 0.981 | 0.986 | 0.063 |

  Cross-dataset validation:

  | Dataset | Precision | Recall | FPR |
  |---|---|---|---|
  | APKPure | 0.0 | NAN | 0.59 |
  | MB | 1.0 | 0.94 | NAN |
  | VS2020 | 1.0 | 0.96 | NAN |
  | VS2021 | 1.0 | 0.94 | NAN |
- Selecting 30 Permissions

  Training data:

  | Model | Precision | Recall | FPR |
  |---|---|---|---|
  | Random Forest | 0.983 | 0.983 | 0.054 |

  Cross-dataset validation:

  | Dataset | Precision | Recall | FPR |
  |---|---|---|---|
  | APKPure | 0.0 | NAN | 0.59 |
  | MB | 1.0 | 0.96 | NAN |
  | VS2020 | 1.0 | 0.96 | NAN |
  | VS2021 | 1.0 | 0.95 | NAN |
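The selection step in these experiments can be sketched as follows, on synthetic one-hot data (the `max_features` value corresponds to the 100/50/30 runs):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 265))  # 265 one-hot permission columns
y = (X[:, 0] | X[:, 1]).astype(int)      # toy labels driven by two permissions

# Rank permissions by forest importance and keep only the top 30;
# threshold=-inf ensures exactly max_features columns survive.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    max_features=30, threshold=-np.inf)
X_small = selector.fit_transform(X, y)
print(X_small.shape)  # (300, 30)
```

The reduced matrix `X_small` is then used to retrain the classifier, exactly as in the tables above.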
Analysis and New Attempts
From the preceding data, it is evident that reducing the number of permissions neither lowered the false positive rate nor improved accuracy.
- Dimensionality Reduction Using an Autoencoder

  To further refine model performance, we turned to autoencoders, hoping to boost accuracy by grouping and reducing the dimensionality of the permission features. An autoencoder is a neural network that learns a compact encoding of its input and is often used for dimensionality reduction or feature learning. Our idea was to map the high-dimensional permission vectors into a lower-dimensional space while retaining the information relevant to Malware identification.

  However, despite this attempt at dimensionality reduction, the cross-validation results were still unsatisfactory, and the false positive rate remained high.
| Model | Precision | Recall | FPR |
|---|---|---|---|
| Random Forest | 0.977 | 0.980 | 0.074 |

| Dataset | Precision | Recall | FPR |
|---|---|---|---|
| APKPure | 0.0 | NAN | 0.64 |
| MB | 1.0 | 0.95 | NAN |
| VS2020 | 1.0 | 0.95 | NAN |
| VS2021 | 1.0 | 0.92 | NAN |
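As a rough stand-in for the autoencoder step (here scikit-learn's `MLPRegressor` is trained to reconstruct its own input; a real implementation would more likely use a deep-learning framework):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 265)).astype(float)  # one-hot permission matrix

# Train the network to reproduce its input; the narrow hidden layer
# acts as the low-dimensional bottleneck encoding.
ae = MLPRegressor(hidden_layer_sizes=(32,), activation="relu",
                  max_iter=300, random_state=0)
ae.fit(X, X)

# A manual forward pass to the hidden layer yields the 32-dim encoding,
# which then replaces the raw 265-dim permission vector for training.
encoded = np.maximum(X @ ae.coefs_[0] + ae.intercepts_[0], 0.0)
print(encoded.shape)  # (200, 32)
```

The hidden-layer width (32 here) is a hypothetical choice for illustration, not the dimensionality we actually used.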
- Selecting Permissions Through Business Logic

  Next, we tried to choose appropriate permissions using business rules. Our methodology was as follows:

  - We counted the occurrence frequency of each permission across all Malware (mp) and across all Benign samples (bp).
  - We then computed the ratio mp/bp and selected all permissions with mp/bp > 1 for training. The objective was to pick out the permissions proportionally more common in Malware.
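The counting rule can be sketched like this, with toy permission sets (the real frequencies come from the full corpus):

```python
# Toy per-APK permission sets; real ones are parsed from the manifests.
malware_perms = [
    {"READ_SMS", "INTERNET"}, {"SEND_SMS", "INTERNET"}, {"READ_SMS"},
]
benign_perms = [
    {"INTERNET", "CAMERA"}, {"INTERNET"}, {"CAMERA", "NFC"},
]

def freq(datasets, perm):
    """Fraction of APKs in the dataset that request this permission."""
    return sum(perm in s for s in datasets) / len(datasets)

all_perms = set().union(*malware_perms, *benign_perms)

# Keep permissions proportionally more common in Malware than in Benign
# (permissions absent from Benign are kept unconditionally).
selected = [p for p in all_perms
            if freq(benign_perms, p) == 0 or
               freq(malware_perms, p) / freq(benign_perms, p) > 1]
print(sorted(selected))  # ['READ_SMS', 'SEND_SMS']
```

In this toy data the SMS permissions survive the filter while INTERNET, which is near-universal in both classes, is dropped.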
However, this technique also failed to improve the training and validation accuracy. For instance, `android.permission.INTERNET` appears with very similar probability in both Malware and Benign (around 97% in both), so including it in training contributes little. Similarly, `android.permission.WAKE_LOCK` appears in roughly 80% of both classes.
We also attempted some methods mentioned in other research papers. Through testing, we discovered that regardless of the method employed to filter Permissions, the training accuracy post-filtering never surpasses the accuracy attained by training with all Permissions unfiltered. This finding contradicts the conclusions of many papers that focus on Permission filtering as the primary research objective.
Conclusion
After numerous attempts, we have so far been unable to train a model on a single dataset that also achieves ideal accuracy on other datasets. Hence, we decided to merge the MyApp and APKPure data as the Benign training set, while also combining the several Malware datasets as the Malware training set, aiming to improve sample balance.
Of course, we have also discerned some potential optimization directions:
- We noticed that there are certain patterns in the signatures of Malware and Benign APKs. If these patterns can be verified, they could be incorporated into the existing model.
- While manually analyzing the samples, we realized that many Malware APK samples cannot be identified as malicious during the static analysis phase, as their permissions don't harbor any suspicious information. For example, many dropper, downloader, and other types of malicious APKs don't exhibit any noticeable difference in permissions compared to Benign APKs. Hence, manual screening of malicious APKs might help to elevate the training accuracy.
- Additionally, incorporating information from dynamic detection into the training model is also a direction worth exploring. We plan to integrate dynamic information in subsequent research to enhance the model's performance.