Python Machine Learning
Intro
Now we are starting to move towards AI development. Machine learning is a subset of AI used to build models from data that our programs can then work off. Those models can be used for language processing, vision processing, and even for forecasting stock market and crypto trends.
The Process
Steps
- Import the Data
- Clean the Data; this can involve removing duplicates, handling missing values, or anything else that could skew the results
- Split the Data into Training/Test Sets
- Create a Model; this can be as simple as a decision tree or as sophisticated as a neural network
- Train the Model
- Make Predictions
- Evaluate and Improve (a quick end-to-end sketch of these steps follows this list)
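A minimal sketch of all seven steps, using the small iris dataset that ships with scikit-learn so nothing needs downloading yet; the music example later on follows exactly the same pattern.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Import the data (a small flower dataset bundled with scikit-learn)
iris = load_iris(as_frame=True)
df = iris.frame

# 2. Clean the data (nothing to fix here, but dropping duplicates is a typical step)
df = df.drop_duplicates()

# 3. Split the data into training and test sets
X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 4-5. Create a model and train it
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# 6-7. Make predictions on the test set and evaluate them
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))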
Libraries
- NumPy - useful Python library for working with numbers and arrays (docs)
- Pandas - useful for working with rows and columns of data (docs)
- Matplotlib - used for plotting 2D charts (docs)
- Scikit-Learn - a common library for working with machine learning (docs)
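The conventional import aliases you will see for these libraries throughout most tutorials (np, pd and plt are community conventions, not requirements):
import numpy as np               # numerical arrays and maths
import pandas as pd              # rows and columns (DataFrames)
import matplotlib.pyplot as plt  # 2D charts
from sklearn.tree import DecisionTreeClassifier  # one of scikit-learn's many models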
Tools
Prerequisites
Installation
- Download and install Anaconda
- Run the following in your terminal to install Jupyter Notebook, pandas, and scikit-learn
$ pip install -U notebook
$ pip install -U pandas
$ pip install -U scikit-learn
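To confirm the installs worked, you can ask pip for the installed versions:
$ pip show notebook pandas scikit-learn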
Run
- Now we will run Jupyter Notebook
- This will load us into the Jupyter Dashboard
$ jupyter notebook
Jupyter Notebook Intro
Create a new Notebook
- Select a directory in the Dashboard
- In the top right, select New and then Python 3
- Once the new Notebook is loaded, change the title from Untitled to a name of your choice
Importing Dataset
- We will download a dataset from kaggle.com
- You will need to register for an account if you do not already have one
- Once logged in, search for "Video Game Sales"
- Download and extract the dataset to the same directory as the Notebook
Coding the Application
- Each Notebook is made up of segments (cells): separate blocks of code that all run in the same context
- Each of the following pieces of code goes in its own segment
Segment 1
- If you have multiple versions of Python on your machine, Jupyter can pick up the wrong version of pandas. This can be resolved by adding !pip install pandas as the first line of your first segment
import pandas as pd
df = pd.read_csv('vgsales.csv') # Pandas allows us to read the CSV file
# df # This allows us to dump the raw data out if we want to see it
df.shape # This shows the number of rows and columns in the dataset
Example
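If you only want a quick peek at the data rather than the full dump, df.head() shows the first few rows:
df.head() # First five rows of the dataset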
Segment 2
df.describe() # Another way to dump data about the dataset
Example
Segment 3
df.values # This outputs the underlying data as a NumPy array
Example
The Application
- We will create a music recommendation program
- We will get our initial data from here
Segment 1
- We will break the data into two sets
- The first set is expected input from the user
- The second set is results based on the expected input
import pandas as pd
music_data = pd.read_csv('music.csv')
X = music_data.drop(columns=['genre']) # Input data (by convention, a capital X for the input features)
y = music_data['genre'] # Output data (by convention, a lowercase y for the output)
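The dataset here is assumed to have three columns (age, gender and genre, as the later graphviz step implies), so after this split X holds the two input columns and y holds only the genre. You can confirm that in the same segment:
print(X.columns.tolist()) # ['age', 'gender'] - the input features
print(y.name)             # 'genre' - the output label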
Segment 2
- Here we will use scikit-learn to predict from the given dataset
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X.values, y.values) # Here we train the model on our data
predictions = model.predict([[21, 1], [22, 0]]) # Then we ask it to predict the genre for new inputs
predictions
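Each inner list passed to predict() is one hypothetical listener, in the same column order as X (age, then gender). A small loop pairs each input with its predicted genre:
for person, genre in zip([[21, 1], [22, 0]], predictions):
    print(person, '->', genre)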
Segment 3
- Next we want to test how accurate our model's predictions are
- This will help us determine whether we should change our initial training set
- We will get a score back between 0 and 1; the closer to 1 the better
- Running this segment multiple times will give a different result each time, because the train/test split is random
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # Allocating 20% of data for testing and 80% for training, this method returns a tuple
model.fit(X_train.values, y_train.values)
predictions2 = model.predict(X_test.values)
score = accuracy_score(y_test.values, predictions2)
score
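If you want the same split (and therefore the same score) on every run, train_test_split accepts a random_state parameter that fixes the random shuffle:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 42 is an arbitrary but fixed seed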
Segment 4
- In real examples we will have millions of rows, so it is important to persist our trained model rather than retrain it every time
- We will use the joblib library to save and load models
import joblib # joblib used to ship inside scikit-learn (sklearn.externals.joblib) but is now a standalone package
joblib.dump(model, 'music-recommender.joblib') # Here we save the model to a file in the Notebook's directory for future use
Segment 5
- Next we will load our model and use it for predictions
model2 = joblib.load('music-recommender.joblib')
predictions2 = model2.predict([[21, 1]])
Segment 6
- Next we will visualise our Decision Tree
- To view the graph we use a VSCode plugin that can preview DOT (Graphviz) files
from sklearn import tree
tree.export_graphviz(
model2,
out_file='music-recommender.dot',
feature_names=['age', 'gender'],
class_names=sorted(y.unique()), # This is required as many names are repeated
label='all',
rounded=True,
filled=True
)
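If you would rather not use a VSCode plugin, and assuming the Graphviz command-line tools are installed on your machine, the same .dot file can be rendered to an image from the terminal:
$ dot -Tpng music-recommender.dot -o music-recommender.png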
Final Product
- Here we have the decision tree from our dot file output
- It shows the decisions our model takes to resolve predictions
Decision Tree



