1. Background

App markets, despite being crucial to today’s mobile ecosystem, have become the most powerful malware delivery channel because they in effect “lend credibility” to malicious apps. Over the past decade, machine learning (ML) techniques have been widely explored for automated, robust malware detection. To date, however, we have yet to see an ML-based malware detection solution deployed at market scale. To understand the real-world challenges, we undertook a collaborative study with a major Android app market (Market-X), which offered us large-scale ground-truth data. In summary, we find that successfully developing such a system depends on several factors, including feature selection/engineering, dynamic analysis speed, developer engagement, and model evolution. Failure in any one of these aspects leads to a “wooden barrel effect”, in which the weakest component limits the whole system. This paper documents our design choices and first-hand deployment experiences in building such an ML-powered malware detection system. It has been operational at Market-X for over one year, using a single commodity server to vet ∼10K apps every day, and achieves an overall precision of 98% and recall of 96% with an average per-app scan time of 1.3 minutes.

2. Dataset

We release, on Google Drive, the extracted features of an unbiased sample of the 500K apps studied.
Because some files are larger than 100 MB, we store the files on Google Drive (APIChecker).

File Name              Detailed Information
apichecker.pkl         Dataset file, including features and labels
apk_detect_result.txt  Detection results for APKs from Market-X
running_logs/          Test logs of apps tested on 18 Feb 2018
src/                   Source code files used in our experiments
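
The following is a minimal sketch of reading the dataset file, assuming apichecker.pkl is a standard Python pickle. The internal layout of the unpickled object is not documented here, so the sketch only reports its type and size rather than assuming particular fields. The helper names load_dataset and describe are hypothetical and are not part of the released source code.

```python
import pickle
from pathlib import Path


def load_dataset(path):
    """Unpickle the dataset file. Only unpickle files from a source you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def describe(obj):
    """Summarize an unpickled object without assuming a particular layout."""
    size = len(obj) if hasattr(obj, "__len__") else None
    return {"type": type(obj).__name__, "len": size}
```

A typical call would be describe(load_dataset("apichecker.pkl")), where the path is an assumption about where the file was downloaded. If the pickle contains NumPy or pandas objects, those libraries must be installed before loading.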

3. Classification Model Configurations

We release the model configurations in the following table.

Algorithm                        Training Model in Sklearn
Naive Bayes                      BernoulliNB()
Random Forest                    RandomForestClassifier(n_estimators=15)
CART Decision Tree               DecisionTreeClassifier()
Logistic Regression              LogisticRegression()
K-Nearest Neighbor               KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto')
Support Vector Machine           svm.SVC(gamma='auto')
Artificial Neural Network        MLPClassifier(hidden_layer_sizes=(100,), solver='adam', alpha=0.0001)
Deep Neural Network              MLPClassifier(hidden_layer_sizes=(100, 75, 50, 20), solver='adam', alpha=0.0001)
Gradient Boosting Decision Tree  GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
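
As a rough sketch, the configurations in the table can be collected into one dictionary of unfitted scikit-learn estimators. The function name build_models is hypothetical. The Logistic Regression entry uses scikit-learn's LogisticRegression classifier, which is an assumption about the intended class, and the single hidden layer is written as the tuple (100,).

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models():
    """Return a name -> unfitted estimator mapping mirroring the table."""
    return {
        "Naive Bayes": BernoulliNB(),
        "Random Forest": RandomForestClassifier(n_estimators=15),
        "CART Decision Tree": DecisionTreeClassifier(),
        # Assumption: the classifier class, not LinearRegression.
        "Logistic Regression": LogisticRegression(),
        "K-Nearest Neighbor": KNeighborsClassifier(
            n_neighbors=5, weights="uniform", algorithm="auto"
        ),
        "Support Vector Machine": SVC(gamma="auto"),
        "Artificial Neural Network": MLPClassifier(
            hidden_layer_sizes=(100,), solver="adam", alpha=0.0001
        ),
        "Deep Neural Network": MLPClassifier(
            hidden_layer_sizes=(100, 75, 50, 20), solver="adam", alpha=0.0001
        ),
        "Gradient Boosting Decision Tree": GradientBoostingClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
        ),
    }
```

Each estimator can then be trained with its fit(X, y) method and evaluated with predict(X), where X and y are the feature matrix and labels taken from the dataset.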

4. Lightweight Emulator

We have uploaded our lightweight emulator to Google Drive: Android x86 emulator. In the emulator, we installed the Xposed framework and a hook module that hooks 426 APIs.
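
To illustrate how hooked API invocations could feed a classifier, here is a minimal sketch that turns a list of observed API names into a 0/1 feature vector. The binary encoding, the log format, the function name to_feature_vector, and the three-entry KEY_APIS list are all assumptions made for illustration; they are placeholders and not the actual 426-API list or the hook module's real output.

```python
def to_feature_vector(observed_calls, key_apis):
    """Map the API names seen at run time to a 0/1 vector ordered like key_apis."""
    seen = set(observed_calls)
    return [1 if api in seen else 0 for api in key_apis]


# Hypothetical three-entry stand-in for the 426 key APIs.
KEY_APIS = [
    "android.telephony.SmsManager.sendTextMessage",
    "android.telephony.TelephonyManager.getDeviceId",
    "java.lang.Runtime.exec",
]

# Hypothetical hook log: one invoked API name per entry, repeats allowed.
hook_log = [
    "java.lang.Runtime.exec",
    "android.app.Activity.onCreate",
    "java.lang.Runtime.exec",
]

vector = to_feature_vector(hook_log, KEY_APIS)
print(vector)  # [0, 0, 1]
```

Calls outside the key list are ignored, and repeated calls collapse to a single 1, so the vector length always equals the number of key APIs.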

5. API Set

We study ground-truth data from one of the world’s largest Android app markets (called Market-X) and extract 426 key APIs (out of 50K APIs) that are of fundamental importance to malware detection. These key APIs generally belong to the following three categories (with some overlap).

Table 1: The 426 key APIs which generally belong to the three categories (with some overlap).