# Digit Recognizer
This repo gives a new ML practitioner a walkthrough of tackling the handwritten digit recognition problem using different learning algorithms.
## Data Set
To start, create your repo, e.g. `Digit_Recognizer`, on XetHub and clone it onto your laptop. Download the `data` folder in this repo into your `Digit_Recognizer` repo. Make sure you are on the `main` branch, then run `git add .` and `git commit -m "import data"` to snapshot the imported data.
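Before modeling, it is worth a quick sanity check that the data loads correctly. A minimal sketch, assuming the `data` folder follows the Kaggle Digit Recognizer layout (`train.csv` with a `label` column plus 784 pixel columns); adjust the path if your files differ:

```python
import pandas as pd

# Assumed layout: data/train.csv with a "label" column
# and 784 pixel columns (28x28 grayscale images).
df = pd.read_csv("data/train.csv")
print(df.shape)                                 # rows x 785 columns
print(df["label"].value_counts().sort_index())  # digit class balance
```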
## 1. Decision Tree Model
As a baseline, the first model we will try is a decision tree classifier. It is simple yet still powerful. To start, create a new feature branch and check it out with `git checkout -b decision_tree`. Then download the `decision tree.ipynb` notebook to your repo root folder and experiment with it. The notebook should be quite self-explanatory. With the default settings, the model reaches 90.57% accuracy on the training set and 85.12% on the validation set. Not bad for a baseline! In the next sections we explore whether this can be improved.
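For orientation, the core of such a notebook typically looks like the sketch below. This is not the notebook's exact code: the file path, the 80/20 split, and the default hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the data (assumed: "label" column + 784 pixel columns).
df = pd.read_csv("data/train.csv")
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit a plain decision tree as the baseline classifier.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print("train accuracy:     ", accuracy_score(y_train, model.predict(X_train)))
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```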
After this, switch back to the `main` branch and merge the feature branch with `git merge decision_tree`.
## 2. Random Forest Model
The next model we will try is a random forest classifier. It is more sophisticated: an ensemble of many decision trees whose predictions are combined by majority vote. Similarly, create a new feature branch with `git checkout -b random_forest` and download the `random_forest.ipynb` notebook. With the default settings in the notebook, this model yields 97.27% training accuracy and 94.6% validation accuracy. Note that this is already a big improvement.
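Swapping in the new estimator is essentially a one-line change from the baseline sketch above (again an assumption: `n_estimators=100` stands in for whatever the notebook actually uses):

```python
from sklearn.ensemble import RandomForestClassifier

# Reuses X_train/X_val/y_train/y_val from the baseline sketch.
# Each tree trains on a bootstrap sample with random feature subsets,
# and the forest combines the trees' votes.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("validation accuracy:", forest.score(X_val, y_val))
```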
After this, switch back to the `main` branch and merge the feature branch with `git merge random_forest`.
## 3. KNN and Naive Bayes Models
We can explore two more models: a k-nearest-neighbors (KNN) classifier and a naive Bayes model. To begin, create a new feature branch with `git checkout -b knn_and_naive_bayes` and download the `knn_and_naive_bayes.ipynb` notebook. Run as given, KNN yields 97.79% training accuracy and 96.74% validation accuracy, while naive Bayes yields 82.72% and 82.8% respectively. So far the KNN model gives the best predictions.
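Both follow the same fit-and-score pattern; a sketch under the same assumptions as before (`n_neighbors=5` and the Gaussian naive Bayes variant are guesses, the notebook may use different settings):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# KNN: label each image by majority vote of its k closest training images.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("KNN validation accuracy:", knn.score(X_val, y_val))

# Gaussian naive Bayes: models each pixel as an independent
# per-class Gaussian, hence "naive".
nb = GaussianNB()
nb.fit(X_train, y_train)
print("naive Bayes validation accuracy:", nb.score(X_val, y_val))
```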
After this, switch back to the `main` branch and merge the feature branch with `git merge knn_and_naive_bayes`.
## 4. Feature Engineering
Up to now we have trained our models on 784 features: each pixel is one feature fed to the model. That is unnecessarily large for models to converge on and incurs a heavy computational burden. One optimization is to reduce the number of features with dimensionality-reduction algorithms such as truncated SVD (TSVD), PCA, and t-SNE. The given code uses a combination of TSVD and t-SNE. To give it a try, create a new feature branch with `git checkout -b feature_eng` and download the `feature_eng.ipynb` notebook.
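The two-stage reduction plausibly looks like the sketch below; the component counts are assumptions, not the notebook's exact values. Note that scikit-learn's t-SNE has no `transform` method for unseen data, so the embedding is computed on the full dataset before splitting.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Stage 1: TSVD compresses the 784 pixel columns to a dense 50-dim space,
# which makes the (expensive) t-SNE step tractable.
svd = TruncatedSVD(n_components=50, random_state=42)
X_svd = svd.fit_transform(X)

# Stage 2: t-SNE embeds the 50-dim vectors into 2 dimensions,
# preserving local neighborhood structure between images.
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_svd)
```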
## 5. Back to our Decision Tree Model
Now let's test how much improvement the feature engineering buys us. In the `feature_eng.ipynb` notebook we train a simple decision tree classifier after reducing the 784 features to 2. It yields 97.96% accuracy on the training data and 97.0% accuracy on the validation data, the best result so far.
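Continuing the sketch, the retrained baseline on the 2-D embedding is the same model on far fewer columns:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split the 2-D embedding from the feature-engineering sketch above.
Xr_train, Xr_val, yr_train, yr_val = train_test_split(
    X_2d, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(Xr_train, yr_train)
print("train accuracy:     ", tree.score(Xr_train, yr_train))
print("validation accuracy:", tree.score(Xr_val, yr_val))
```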