122 MiB materialized 122 MiB stored


This repo gives a new ML practitioner a walkthrough of tackling the handwritten digit recognition problem using different learning algorithms.

Data Set

To start, create your own repo on XetHub (e.g. Digit_Recognizer) and clone it onto your laptop. Then download the data folder from this repo into your Digit_Recognizer repo.

Make sure you are on the main branch, then run git add . and git commit -m "import data" to snapshot the imported data.

1. Decision Tree Model

As a baseline, the first model we will try is a decision tree classifier. It is simple yet still powerful. To start, create a new feature branch and switch to it using 'git checkout -b decision_tree'. After that, download the decision tree.ipynb notebook to your repo root folder and experiment with it. The notebook itself should be quite self-explanatory. If you don't change the default settings in the notebook, the model should reach 90.57% accuracy on the training data set and 85.12% on the validation data set. This result is not bad for a baseline! In the next sections we explore whether it can be improved.
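For readers who want to see the shape of such a baseline before opening the notebook, here is a minimal sketch. It assumes scikit-learn and uses its bundled 8x8 digits data (64 features) as a stand-in for the 784-feature data in this repo. The split ratio and max_depth are illustrative choices, not the notebook's defaults.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's small digits set, not the repo's data folder.
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# max_depth=10 is an illustrative setting, not taken from the notebook.
tree = DecisionTreeClassifier(max_depth=10, random_state=0)
tree.fit(X_train, y_train)

# Compare accuracy on data the model has seen with data it has not.
train_acc = accuracy_score(y_train, tree.predict(X_train))
val_acc = accuracy_score(y_val, tree.predict(X_val))
print(f"training accuracy:   {train_acc:.2%}")
print(f"validation accuracy: {val_acc:.2%}")
```

Because the data and settings differ, the numbers printed here will not match the 90.57% and 85.12% quoted above. The gap between training and validation accuracy is the thing to watch.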

After this, switch back to the main branch and merge this feature branch using git merge decision_tree.

2. Random Forest Model

The next model we will try is a random forest classifier. It is more complex, since it combines multiple decision trees. As before, create a new feature branch with git checkout -b random_forest and download the random_forest.ipynb notebook. With the default settings in the notebook, this model yields 97.27% training accuracy and 94.6% validation accuracy. Note that this is already a big improvement over the decision tree baseline.
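The idea of combining multiple decision trees can be sketched as follows. This again assumes scikit-learn and its bundled digits data as a stand-in for the repo's data. The number of trees (n_estimators=100) is an illustrative choice and may differ from the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): scikit-learn's small digits set, not the repo's data folder.
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Each tree is trained on a bootstrap sample; predictions are combined across trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# score() returns mean accuracy for classifiers.
train_acc = forest.score(X_train, y_train)
val_acc = forest.score(X_val, y_val)
print(f"trees in the forest: {len(forest.estimators_)}")
print(f"training accuracy:   {train_acc:.2%}")
print(f"validation accuracy: {val_acc:.2%}")
```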

After this, switch back to the main branch and merge this feature branch using git merge random_forest.

3. KNN and Naive Bayes Models

We can explore two more models: a k-nearest-neighbors (KNN) classifier and a naive Bayes classifier. To begin, create a new feature branch with git checkout -b knn_and_naive_bayes and download the knn_and_naive_bayes.ipynb notebook. Run as given in the notebook, the KNN model yields 97.79% training accuracy and 96.74% validation accuracy, while the naive Bayes model yields 82.72% training accuracy and 82.8% validation accuracy. Note that so far the KNN model gives the best predictions.
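A side-by-side comparison of the two models might look like the sketch below. It assumes scikit-learn and its bundled digits data as a stand-in. The choice of GaussianNB as the naive Bayes variant and n_neighbors=5 are assumptions, since the text above does not say which settings the notebook uses.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): scikit-learn's small digits set, not the repo's data folder.
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Both model choices and their settings are illustrative assumptions.
models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "naive_bayes": GaussianNB(),
}

# Fit each model and record (training accuracy, validation accuracy).
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_val, y_val))
    print(f"{name}: training {scores[name][0]:.2%}, validation {scores[name][1]:.2%}")
```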

After this, switch back to the main branch and merge this feature branch using git merge knn_and_naive_bayes.

4. Feature Engineering

Up to now we have trained our models on 784 features, one per pixel. That is likely more than the models need to converge, and it incurs a heavy computational burden. One optimization is to reduce the number of features using algorithms such as truncated SVD (TSVD), PCA and t-SNE. The given code uses a combination of TSVD and t-SNE. To give it a try, create a new feature branch with git checkout -b feature_eng and download the feature_eng.ipynb notebook.

5. Back to our Decision Tree Model

Now let's test how much improvement we get from the feature engineering. In the feature_eng.ipynb notebook we train a simple decision tree classifier after reducing the 784 features to 2. It yields 97.96% accuracy on the training data and 97.0% on the validation data, which makes it the best result so far.
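The reduce-then-classify idea can be sketched as below, assuming scikit-learn and a 1000-sample subset of its bundled digits data as a stand-in. The intermediate dimension (30) and all other settings are illustrative, not the notebook's. One caveat worth knowing: scikit-learn's TSNE has no transform method for unseen samples, so this sketch embeds the training and validation samples together (without using labels) and splits afterwards.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): a subset of scikit-learn's digits set keeps t-SNE quick.
X, y = load_digits(return_X_y=True)
X, y = X[:1000], y[:1000]

# Step 1: truncated SVD compresses the pixel features (30 is an illustrative choice).
X_svd = TruncatedSVD(n_components=30, random_state=0).fit_transform(X)

# Step 2: t-SNE maps the compressed features down to 2 dimensions.
# All samples are embedded together because TSNE cannot transform new data.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_svd)

# Step 3: split the 2-D embedding and train a plain decision tree on it.
X_train, X_val, y_train, y_val = train_test_split(
    X_2d, y, test_size=0.2, random_state=0, stratify=y
)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
val_acc = tree.score(X_val, y_val)
print(f"reduced shape:       {X_2d.shape}")
print(f"training accuracy:   {train_acc:.2%}")
print(f"validation accuracy: {val_acc:.2%}")
```

Since the validation samples take part in the embedding step, the validation accuracy from a setup like this is best read as an optimistic estimate rather than a measure of performance on truly new images.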

