Data mining is currently popular among informatics students, and there are courses devoted specifically to it. Naive Bayes is one of the methods most commonly taught in data mining courses.
Naive Bayes is a probabilistic classification method based on Bayes' theorem, which is named after the British scientist Thomas Bayes. It predicts the most likely outcome for new data based on past experience, that is, on previously observed data.
In this article we will learn together how Naive Bayes works.
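As a brief sketch of the idea, which is not taken from the article's own material: for a class C and features x_1, ..., x_n, Naive Bayes applies Bayes' theorem under the assumption that the features are conditionally independent given the class, and then picks the class with the highest posterior probability. The symbols C, x_i, and n are notation introduced here for illustration only.

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C),
\qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

In the Gaussian variant used later in this article, each P(x_i | C) is modeled as a normal distribution whose mean and variance are estimated from the training data.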
import pandas as pd
import numpy as np
The code above imports the pandas and numpy libraries, which will be used in the analysis phase. The pandas library is used for processing tabular data in data frames, while the numpy library is used for easy and fast array manipulation.
Reading Data Using Python
The training data, which is stored as an xlsx file, is read with the read_excel function of the pandas library into a data frame named data_training.
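A minimal sketch of this reading step is shown below. The helper name load_excel and the file name data_training.xlsx are assumptions made for illustration, since the article does not state the actual file name. Reading xlsx files with pandas also requires an Excel engine such as openpyxl to be installed.

```python
from pathlib import Path

import pandas as pd


def load_excel(path, sheet_name=0):
    """Read one worksheet of an .xlsx file into a pandas DataFrame.

    Hypothetical helper: it only adds a clearer error message when the
    file is missing and otherwise delegates to pandas.read_excel.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Excel file not found: {path}")
    return pd.read_excel(path, sheet_name=sheet_name)


# Usage (the file name is an assumption; replace it with the real path):
# data_training = load_excel("data_training.xlsx")
# data_training.head(30)
```

The same helper could be reused later for the testing data, so that both files are read in the same way.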
Converting Data To Integer
Because the Naive Bayes classifier used here (from scikit-learn) expects numeric input and cannot work with string values directly, the categorical columns of the training data are converted to integer codes.
Here is the code to convert the training data to integers.
from sklearn.preprocessing import LabelEncoder

# Replace each column's values with integer codes.
# The same encoder object is refit for every column, so it only
# remembers the mapping of the last column it was fitted on.
enc = LabelEncoder()
data_training['Gender'] = enc.fit_transform(data_training['Gender'].values)
data_training['Ever_Married'] = enc.fit_transform(data_training['Ever_Married'].values)
data_training['Age'] = enc.fit_transform(data_training['Age'].values)
data_training['Graduated'] = enc.fit_transform(data_training['Graduated'].values)
data_training['Profession'] = enc.fit_transform(data_training['Profession'].values)
data_training['Spending_Score'] = enc.fit_transform(data_training['Spending_Score'].values)
data_training.head(30)
After that, the values in these columns are displayed as integers.
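One point worth illustrating is that the testing data must be encoded with the same mapping as the training data, otherwise the same category could receive different integer codes in the two data sets. The sketch below is one possible way to do this, not the article's own method: it keeps a separate LabelEncoder per column in a dictionary and reuses it for the testing data. The tiny data frames and the names categorical_columns and encoders are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-ins for the real data; the values are illustrative only.
data_training = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Ever_Married": ["No", "Yes", "Yes", "No"],
    "Spending_Score": ["Low", "High", "Average", "Low"],
})
data_test = pd.DataFrame({
    "Gender": ["Female"],
    "Ever_Married": ["No"],
    "Spending_Score": ["Average"],
})

categorical_columns = ["Gender", "Ever_Married", "Spending_Score"]

# Fit one encoder per column on the training data and keep it.
encoders = {}
for column in categorical_columns:
    encoder = LabelEncoder()
    data_training[column] = encoder.fit_transform(data_training[column])
    encoders[column] = encoder

# Reuse the fitted encoders so the testing data gets the same codes.
for column in categorical_columns:
    data_test[column] = encoders[column].transform(data_test[column])

print(encoders["Gender"].classes_)
print(data_test)
```

Note that transform raises an error if the testing data contains a category that never appeared in the training data, so such values would need to be handled separately.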
Checking Training Data
The .info() method summarizes a data frame: the number of rows, the column names, the data type of each column, and the number of non-null values.
Its output shows that the training data has 30 rows and 9 columns, that the columns are of integer and object type, and that no column contains null (empty) values.
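The check described above can be sketched as follows. The small data frame is a stand-in with made-up values, since the real training file is not part of this article; the extra isnull().sum() call is an optional cross-check that is not in the original steps.

```python
import pandas as pd

# Small stand-in for the training data; the values are illustrative only.
data_training = pd.DataFrame({
    "Gender": [1, 0, 0, 1],
    "Age": [22, 38, 67, 40],
    "Profession": ["Healthcare", "Engineer", "Lawyer", "Artist"],
    "Segmentation": ["D", "A", "B", "B"],
})

# Prints the row count, column names, non-null counts, and dtypes.
data_training.info()

# Count missing values per column as an explicit cross-check.
missing_per_column = data_training.isnull().sum()
print(missing_per_column)
```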
Determining Independent And Dependent Variables
x = data_training.drop('Segmentation', axis=1)
x.head(30)
The code above drops the dependent variable, the Segmentation column, so that x contains only the independent variables. Calling head(30) then displays the first 30 rows of x.
To select the dependent variable, index the data frame by the column name, namely the "Segmentation" column. To display the result, use the head method.
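The two selections in this section can be sketched together as below. The variable name y matches the name passed to the classifier later in the article; the small data frame is a stand-in with made-up values.

```python
import pandas as pd

# Small stand-in for the encoded training data; values are illustrative.
data_training = pd.DataFrame({
    "Gender": [1, 0, 0, 1],
    "Age": [22, 38, 67, 40],
    "Spending_Score": [2, 0, 1, 2],
    "Segmentation": ["D", "A", "B", "B"],
})

# Independent variables: every column except Segmentation.
x = data_training.drop("Segmentation", axis=1)

# Dependent variable: the Segmentation column on its own.
y = data_training["Segmentation"]

print(x.head())
print(y.head())
```

Because drop returns a new data frame, data_training itself still contains the Segmentation column after these two statements.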
Reading Testing Data
To load the testing data, use the read_excel function from the pandas library.
The Segmentation column is then dropped from the testing data so that only the independent variables remain.
x_test = data_test.drop('Segmentation', axis=1)
x_test.head(1)
To display the Segmentation column, which holds the dependent variable of the testing data, use this command.
y_test = data_test['Segmentation']
y_test.head(1)
To perform the classification, you can run the following command:
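from sklearn.naive_bayes import GaussianNB

# y is the dependent variable: the Segmentation column of the training data
y = data_training['Segmentation']

modelnb = GaussianNB()
nbtrain = modelnb.fit(x, y)          # fit returns the fitted model
Y_predict = nbtrain.predict(x_test)  # predicted segment for each test row
print(Y_predict)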
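The predictions can then be compared with y_test, which was extracted earlier. The sketch below is an optional follow-up, not part of the original steps: it trains GaussianNB on a tiny made-up data set and measures the share of correct predictions with accuracy_score from scikit-learn. All values in the data frames are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Tiny made-up training set (already encoded as numbers).
x = pd.DataFrame({
    "Age": [20, 22, 25, 60, 62, 65],
    "Spending_Score": [0, 0, 1, 2, 2, 1],
})
y = pd.Series(["A", "A", "A", "B", "B", "B"], name="Segmentation")

# Tiny made-up testing set with known labels.
x_test = pd.DataFrame({"Age": [21, 63], "Spending_Score": [0, 2]})
y_test = pd.Series(["A", "B"], name="Segmentation")

modelnb = GaussianNB()
modelnb.fit(x, y)
Y_predict = modelnb.predict(x_test)

# Fraction of test rows whose predicted label equals the true label.
accuracy = accuracy_score(y_test, Y_predict)
print(Y_predict)
print(f"Accuracy: {accuracy:.2f}")
```

With only one testing row, as in head(1) above, accuracy can only be 0 or 1, so a larger testing set gives a more informative estimate.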
- 181080200142 Jagad Yudha Awali
- 181080200134 Dewi Nur Afidah
- 181080200133 Ruri Aditya Pratama
- 181080200128 Tutut Anjarsari
- 181080200149 Jefry Fernando