App markets, despite being crucial to today’s mobile ecosystem, have become the most powerful malware delivery channel, since they in effect “lend credibility” to malicious apps. In the past decade, machine learning (ML) techniques have been widely explored for automated, robust malware detection. Unfortunately, to date we have yet to see an ML-based malware detection solution deployed at market scale. To understand the real-world challenges, we undertake a collaborative study with a major Android app market (Market-X) that offers us large-scale ground-truth data. In summary, we find that successfully developing such a system depends on several factors, including feature selection/engineering, dynamic analysis speed, developer engagement, and model evolution. Failure in any of these aspects leads to the “wooden barrel effect”, in which the weakest component limits the whole system. This paper documents our judicious design choices and first-hand deployment experiences in building such an ML-powered malware detection system. It has been operational at Market-X for over one year, using a single commodity server to vet ∼10K apps every day, and achieves an overall precision of 98% and recall of 96% with an average per-app scan time of 1.3 minutes.
File Name | Detailed Information |
---|---|
apichecker.pkl | dataset file including features and labels |
apk_detect_result.txt | detection results for APKs from Market-X |
running_logs/ | testing logs of apps tested on 18 Feb 2018 |
src/ | source code files used in our experiments |
We release the model configurations in the following table.
Algorithm | Training Model in Sklearn |
---|---|
Naive Bayes | BernoulliNB() |
Random Forest | RandomForestClassifier(n_estimators=15) |
CART Decision Tree | DecisionTreeClassifier() |
Logistic Regression | LinearRegression() |
K-Nearest Neighbor | KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto') |
Support Vector Machine | svm.SVC(gamma='auto') |
Artificial Neural Network | MLPClassifier(hidden_layer_sizes=(100), solver='adam', alpha=0.0001) |
Deep Neural Network | MLPClassifier(hidden_layer_sizes=(100,75,50,20), solver='adam', alpha=0.0001) |
Gradient Boosting Decision Tree | GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0) |
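
As a rough illustration of how the configurations in the table could be instantiated and compared, here is a minimal sketch. It is not the code in `src/`. The helper names `build_models` and `evaluate_on_synthetic`, the synthetic data, and the labeling rule are all assumptions made for this example. Note also that the table lists `LinearRegression()` for logistic regression; since that class is a regressor, the sketch substitutes `LogisticRegression()` as an assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models():
    """Return a name -> unfitted estimator mapping following the table."""
    return {
        "Naive Bayes": BernoulliNB(),
        "Random Forest": RandomForestClassifier(n_estimators=15),
        "CART Decision Tree": DecisionTreeClassifier(),
        # Assumption: LogisticRegression() stands in for the LinearRegression()
        # entry in the table, which is a regressor rather than a classifier.
        "Logistic Regression": LogisticRegression(),
        "K-Nearest Neighbor": KNeighborsClassifier(
            n_neighbors=5, weights="uniform", algorithm="auto"
        ),
        "Support Vector Machine": SVC(gamma="auto"),
        # (100,) is the tuple form of the single 100-unit hidden layer.
        "Artificial Neural Network": MLPClassifier(
            hidden_layer_sizes=(100,), solver="adam", alpha=0.0001
        ),
        "Deep Neural Network": MLPClassifier(
            hidden_layer_sizes=(100, 75, 50, 20), solver="adam", alpha=0.0001
        ),
        "Gradient Boosting Decision Tree": GradientBoostingClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
        ),
    }


def evaluate_on_synthetic(model, n_apps=300, n_apis=426, seed=0):
    """Fit `model` on synthetic binary API features; return (precision, recall).

    The data is random and the labeling rule is hypothetical: an app is
    marked malicious when it uses both of the first two APIs.
    """
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_apps, n_apis))
    y = X[:, 0] & X[:, 1]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    return precision, recall


if __name__ == "__main__":
    for name, model in build_models().items():
        p, r = evaluate_on_synthetic(model)
        print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

To reproduce the reported numbers, the synthetic data would need to be replaced with the features and labels stored in `apichecker.pkl`, whose internal layout is not described here.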
We have uploaded our lightweight emulator to Google Drive: Android x86 emulator. In the emulator, we installed the Xposed framework and a hook module that hooks the 426 key APIs.
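
To make the role of the hooked APIs concrete, the following minimal sketch shows one way the API calls observed during an emulator run could be turned into a fixed-length feature vector. It assumes a simple binary encoding (1 if a key API was invoked at least once, 0 otherwise), which may differ from the encoding actually used. The three API names are illustrative stand-ins for the real 426-entry list, and `api_feature_vector` is a hypothetical helper.

```python
from typing import Iterable, List, Sequence

# Illustrative stand-ins only; the real key-API list has 426 entries.
KEY_APIS = [
    "android.telephony.SmsManager.sendTextMessage",
    "android.telephony.TelephonyManager.getDeviceId",
    "java.lang.Runtime.exec",
]


def api_feature_vector(
    observed_calls: Iterable[str], key_apis: Sequence[str] = KEY_APIS
) -> List[int]:
    """Return one 0/1 entry per key API, in the order given by `key_apis`."""
    seen = set(observed_calls)  # repeated calls collapse to a single flag
    return [1 if api in seen else 0 for api in key_apis]


# Hypothetical call trace from one app run; calls to APIs outside the key list are ignored.
calls = [
    "java.lang.Runtime.exec",
    "android.util.Log.d",
    "java.lang.Runtime.exec",
]
print(api_feature_vector(calls))  # [0, 0, 1]
```

Keeping the vector order tied to a fixed key-API list means that every app maps to the same feature layout, which is what the classifiers in the model table expect.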
We study ground-truth data from one of the world’s largest Android app markets (called Market-X) and extract 426 key APIs (out of 50K APIs) that are of fundamental importance to malware detection. These key APIs generally belong to the following three categories (with some overlap).
Table 1: The 426 key APIs which generally belong to the three categories (with some overlap).