Advancing Beginners Towards Machine Learning

Naveen Singh
5 min read · May 11, 2020

Previous article: Beginner’s Machine Learning (ML)

When dealing with a real-life problem, there is hardly any dataset that has only one independent variable. Take the example of a smartphone: everyone owns a phone with a vast list of specifications and spent a lot of time choosing it, because the phone was picked on the basis of requirement, whether it is for PUBG, MasterChef, or ‘Yeh Rishta Kya Kehlata Hai’.

Here the dependent variable is your requirement and the independent variables are the specifications. For our youth, the best example is:

And if you are really searching for this, then watch YOUTUBE VS TIKTOK.

In the above example, y = TikTok, x1 = camera, and x2 = slow-motion feature, and this is an example of multiple linear regression.

The basic formula on which multiple linear regression works is

y = b0 + b1*x1 + b2*x2 + b3*x3 + … + bn*xn

where y = dependent variable; x1, x2, x3, …, xn = independent variables

b0 = bias/constant; b1, b2, b3, …, bn = weights/coefficients

Let’s start the practical part:

We all have done this in the previous session.
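As a quick refresher, here is a minimal sketch of that setup (the file name placement.csv and the column name salary are assumptions; use whatever your dataset actually has):

import pandas as pd

# Load the placement dataset (file name is an assumption)
dataset = pd.read_csv('placement.csv')

# X = independent variables (candidate details), y = dependent variable (salary)
X = dataset.drop('salary', axis=1)
y = dataset['salary']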

In this practical, I have a dataset of college placements with the columns shown in the figure. We have to predict the salary of a selected candidate.

‘X’ has so many columns or features, but are all of them really helpful in deciding whether an individual will get the job?

Have you ever heard that if you got a 10 CGPA in high school or 90% in intermediate (12th standard) you will get a job? I never have. This means that X has some useless columns that don’t matter in prediction. Let’s remove the useless ones, but again a question arises: why remove them if they don’t have any effect on the result?

Answer:- We use machine learning to calculate the weights and bias, and the more independent variables there are, the more time the program will take to train the model. That’s why it is good practice to remove the useless ones.

This process is known as feature elimination or feature selection.

And I did it! For this, one should either know the domain (i.e., the dataset) or analyze the coefficient values (to be discussed later).
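For example, a rough sketch of dropping columns by name (these column names are illustrative guesses, not necessarily the real ones in the placement dataset):

# Columns that domain knowledge says do not influence salary (names are assumptions)
useless_columns = ['sl_no', 'ssc_board', 'hsc_board']
X = X.drop(useless_columns, axis=1)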

Now just repeat the linear regression process that was discussed and train your model, but an error will come up when you try this. Why? Because the program or function only understands numbers, not strings, and some of the attributes/columns/features contain strings. These columns are known as categorical variables.

Figure 4

Now the task is to convert the strings into numbers, and for this we use one-hot encoding, which gives each category in a column its own indicator column. But another question arises: why can’t we just assign a number to every string ourselves? Yes, you can do it, and you will know which number indicates which string, but the machine will take this as an ordering between the categories.
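A tiny sketch to see the difference (the gender values here are made up for illustration):

import pandas as pd

demo = pd.DataFrame({'gender': ['M', 'F', 'M']})

# Manual numbering: 0 and 1 look like an order or magnitude to the machine
demo['gender_manual'] = demo['gender'].map({'M': 0, 'F': 1})

# One-hot encoding: every category gets its own 0/1 column, so no ordering is implied
print(pd.get_dummies(demo['gender']))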


# One-hot encode all the categorical columns; head() shows the first few rows
pd.get_dummies(X).head()

You know, a machine is quite similar to a brain once it gets trained. If I ask you, “the candidate is not a male, so what is the gender?”, anyone would say ‘female’. The machine will also know the answer, and this is exactly what creates a problem for it. The machine says these columns are correlated, “I can’t differentiate between them, so please remove one of them.” This problem is known as the dummy variable trap.

# drop_first=True drops one dummy column per category to avoid the dummy variable trap
pd.get_dummies(X, drop_first=True)

With that, this part is complete. head() is used to see the top rows.

ACCURACY……

You trained your model, but how accurate is it? The motto of machine learning is not just to create a model but for it to be accurate. Accuracy is judged by the loss given by the model: the smaller the error, the higher the accuracy, and vice versa.

Don’t get scared by the loss function; the machine will compute it by itself, and behind the scenes it uses the MEAN ABSOLUTE ERROR.
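If you want to see the calculation yourself, a small sketch with made-up salary numbers:

import numpy as np

actual = np.array([300000, 250000, 400000])     # made-up actual salaries
predicted = np.array([280000, 260000, 390000])  # made-up predictions

# Mean absolute error: the average of |actual - predicted|
mae = np.mean(np.abs(actual - predicted))
print(mae)  # ~13333.33 -> the smaller this number, the more accurate the model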

Whenever you plot the data in 2-D, a line appears, called the best-fit line or regression line. Actually, the machine tries a lot of lines on the graph according to the observations and finally keeps the line with the least error, known as the best-fit line.

So a better way to calculate the accuracy is to test on existing data, because you already know the answers, and it can be done by importing a function:

Figure 6
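A minimal sketch of that import and split (the 80/20 split and random_state=0 are example choices, not fixed rules):

from sklearn.model_selection import train_test_split

# Hold back 20% of the data for testing; the model is trained on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)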

Compare figure 4 and figure 6.

What is random_state?

It is just a seed for the random number generator: fixing it (for example, random_state=0) makes the train/test split reproducible, so you get the same rows in the same sets every time you run the code.

I tried to train my model and the result is:

It says the dataset has NaN, i.e., “not a number”. It means the dataset contains null values, and we have to deal with them.

Either remove the rows that contain NaN, or the whole column, but in our case Y itself has NaN values. Alternatively, fill the missing values with the mean; the motive is just to get rid of the NaNs.

Play with NaN or the mean.
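A quick sketch of both options (whether to fill with the mean or drop the rows is your call):

# Check how many NaN values each column has
print(dataset.isnull().sum())

# Option 1: fill the missing salary values with the column mean
y = y.fillna(y.mean())

# Option 2: drop the rows that contain NaN instead
# dataset = dataset.dropna()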

If you want the values of the bias and coefficients:
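Assuming the model is a fitted scikit-learn LinearRegression (the variable name regressor is just an assumption):

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

print(regressor.intercept_)  # b0, the bias/constant
print(regressor.coef_)       # b1..bn, one weight per independent variable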

Yes, sometimes weights have a negative value, which means that feature has an adverse effect. For example, your phone may be used for online classes, but who knows which class you are attending… just kidding. Using a phone also affects your eyes, and that is a negative effect.

Give multiple linear regression a try and smash it again like Iron Man does.

In the next article, we will cover Feature Selection/ Elimination.
