Intro_to_ML

Kelton Zhang

This repo gives a new ML practitioner a walkthrough of tackling the handwritten digit recognition problem using different learning algorithms.

Data Set

To start, create your own repo (for example, Digit_Recognizer) on XetHub and clone it onto your laptop. Then download the data folder from this repo into your Digit_Recognizer repo.

Make sure you are on the main branch, then run `git add .` and `git commit -m "import data"` to snapshot the imported data.

1. Decision Tree Model

As a baseline, the first model we will try is a decision tree classifier. It is simple yet still powerful. To start, create a new feature branch and switch to it using `git checkout -b decision_tree`. Then download the decision tree.ipynb notebook to your repo root folder and experiment with it. The notebook should be largely self-explanatory. If you keep the notebook's default settings, the model should reach 90.57% accuracy on the training set and 85.12% on the validation set. That is not bad for a baseline! In the next sections we explore whether it can be improved.
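For readers who want to see the shape of such a baseline before opening the notebook, here is a minimal sketch. It assumes scikit-learn is installed and uses scikit-learn's bundled 8x8 digits (`load_digits`) as a stand-in for the data folder, so the numbers it prints will not match the ones quoted above. The `max_depth` value and the 80/20 split are illustrative choices, not the notebook's settings.

```python
# Sketch only: stand-in data and illustrative hyperparameters (assumptions).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1797 samples, 64 pixel features each (the repo's data has 784 features).
X, y = load_digits(return_X_y=True)

# Hold out 20% of the rows as a validation set, keeping class balance.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# max_depth=10 is an assumed value chosen to limit overfitting.
tree = DecisionTreeClassifier(max_depth=10, random_state=0)
tree.fit(X_train, y_train)

# score() returns the fraction of correctly classified samples.
train_acc = tree.score(X_train, y_train)
val_acc = tree.score(X_val, y_val)
print(f"training accuracy:   {train_acc:.2%}")
print(f"validation accuracy: {val_acc:.2%}")
```

A gap between the two numbers is a quick indication of how much the tree overfits the training set.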

After this, switch back to the main branch and merge the feature branch using `git merge decision_tree`.

2. Random Forest Model

The next model we will try is a random forest classifier. It is more complex: an ensemble of multiple decision trees. As before, create a new feature branch with `git checkout -b random_forest` and download the random_forest.ipynb notebook. With the notebook's default settings, this model yields 97.27% training accuracy and 94.6% validation accuracy, which is already a big improvement over the baseline.
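The same experiment with a random forest can be sketched as follows. As an assumption, it again uses scikit-learn's bundled `load_digits` data rather than the data folder, and `n_estimators=100` is an illustrative value rather than the notebook's setting.

```python
# Sketch only: stand-in data and illustrative hyperparameters (assumptions).
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Each of the 100 trees is fit on a bootstrap sample of the training rows;
# the forest predicts by averaging the trees' class probabilities.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

train_acc = forest.score(X_train, y_train)
val_acc = forest.score(X_val, y_val)
print(f"trees in the forest: {len(forest.estimators_)}")
print(f"training accuracy:   {train_acc:.2%}")
print(f"validation accuracy: {val_acc:.2%}")
```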

After this, switch back to the main branch and merge the feature branch using `git merge random_forest`.

3. KNN and Naive Bayes Models

We can explore two more models: a k-nearest-neighbors (KNN) classifier and a naive Bayes classifier. To begin, create a new feature branch with `git checkout -b knn_and_naive_bayes` and download the knn_and_naive_bayes.ipynb notebook. Run as given, the KNN model reaches 97.79% training accuracy and 96.74% validation accuracy, while the naive Bayes model reaches 82.72% training accuracy and 82.8% validation accuracy. So far, the KNN model gives the best predictions.
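A side-by-side comparison of the two models might look like the sketch below. It assumes scikit-learn's bundled `load_digits` data in place of the data folder. `GaussianNB` and `n_neighbors=5` are our assumptions; the notebook may use a different naive Bayes variant or a different number of neighbors.

```python
# Sketch only: stand-in data; model variants and settings are assumptions.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "naive_bayes": GaussianNB(),
}

# Fit each model on the same split and record (train, validation) accuracy.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train), model.score(X_val, y_val))

for name, (train_acc, val_acc) in results.items():
    print(f"{name:12s} train={train_acc:.2%}  validation={val_acc:.2%}")
```

Using the same split for both models keeps the comparison fair, since neither model sees easier validation rows than the other.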

After this, switch back to the main branch and merge the feature branch using `git merge knn_and_naive_bayes`.

4. Feature Engineering

Up to now we have trained our models on 784 features, with each pixel fed to the model as one feature. That is likely more input than the models need, and it incurs a substantial computational burden. One optimization is to reduce the number of features with algorithms such as TSVD (truncated singular value decomposition), PCA (principal component analysis), and t-SNE (t-distributed stochastic neighbor embedding). The given code uses a combination of TSVD and t-SNE. To give it a try, create a new feature branch with `git checkout -b feature_eng` and download the feature_eng.ipynb notebook.
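One common way to combine the two reducers is to run TSVD first and then t-SNE on its output, sketched below. This is an assumption about the general technique, not a copy of the notebook. The sketch uses scikit-learn's bundled `load_digits` data (64 features rather than 784), a 500-row subset to keep t-SNE fast, and an illustrative `n_components=30` for the TSVD step.

```python
# Sketch only: stand-in data, subset size and component counts are assumptions.
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # small subset so the sketch runs quickly

# Step 1: linear reduction with truncated SVD (64 -> 30 features).
svd = TruncatedSVD(n_components=30, random_state=0)
X_svd = svd.fit_transform(X)

# Step 2: non-linear embedding with t-SNE (30 -> 2 features).
tsne = TSNE(n_components=2, random_state=0)
X_2d = tsne.fit_transform(X_svd)

print(X.shape, "->", X_svd.shape, "->", X_2d.shape)
print(f"variance kept by TSVD: {svd.explained_variance_ratio_.sum():.2%}")
```

Running TSVD first shrinks the input that t-SNE has to process, which matters because t-SNE is by far the slower of the two steps.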

5. Back to our Decision Tree Model

Now let's test how much improvement we get from feature engineering. In the feature_eng.ipynb notebook we train a simple decision tree classifier after reducing the 784 features to 2. It yields 97.96% accuracy on the training data and 97.0% accuracy on the validation data, which makes it the best result so far.
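To try the same idea outside the notebook, here is a minimal sketch that trains a decision tree on a 2-feature embedding. It assumes scikit-learn's bundled `load_digits` data and a 600-row subset, so its numbers will differ from those above. Note one design choice: scikit-learn's `TSNE` has no `transform` method for unseen rows, so this sketch embeds all rows first and splits afterwards. That means the validation rows influence the embedding, and the validation accuracy of the sketch may be optimistic. Whether the notebook works the same way is not something this sketch assumes.

```python
# Sketch only: stand-in data, subset size and settings are assumptions.
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X, y = X[:600], y[:600]  # small subset so t-SNE runs quickly

# Reduce to 2 features: TSVD first, then t-SNE on the TSVD output.
X_svd = TruncatedSVD(n_components=30, random_state=0).fit_transform(X)
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_svd)

# Split AFTER embedding, because TSNE cannot embed new rows later.
X_train, X_val, y_train, y_val = train_test_split(
    X_2d, y, test_size=0.2, random_state=0, stratify=y
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
val_acc = tree.score(X_val, y_val)
print(f"features seen by the tree: {tree.n_features_in_}")
print(f"training accuracy:   {train_acc:.2%}")
print(f"validation accuracy: {val_acc:.2%}")
```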
