Running Random Forest – Dev Community

Introduction:

Random Forest is a supervised machine learning algorithm widely used for classification and regression problems. It builds many decision trees on different samples of the data and takes their majority vote for classification or their average for regression.
One of the most important features of the Random Forest algorithm is that it can handle data sets containing both continuous variables, as in regression, and categorical variables, as in classification. It tends to give better results for classification problems.

Real life analogy:-

Suppose you want to watch a movie but are unsure whether you will like it. Instead of relying on one friend's opinion, you ask several friends and go with the majority recommendation. Random Forest works the same way: many decision trees each make a prediction, and the majority vote decides.

Working of the Random Forest Algorithm:

First we need to know the ensemble technique. Ensemble simply means combining multiple models, so that a collection of models is used to make predictions rather than an individual model. Ensemble uses two types of methods:

1. Bagging :- It creates different training subsets from the sample training data with replacement, and the final output is based on majority voting. For example, Random Forest.

2. Boosting :- It combines weak learners into strong learners by building sequential models, so that the final model has the highest accuracy. For example, AdaBoost, XGBoost.
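The contrast between the two methods can be sketched with scikit-learn's built-in ensembles. The synthetic data set and the value `n_estimators=50` below are illustrative choices, not part of the article:

```python
# Bagging vs. boosting on a synthetic classification problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: full decision trees fitted independently on bootstrap samples, then majority vote.
bagging = BaggingClassifier(n_estimators=50, random_state=42)
# Boosting: weak learners fitted sequentially, each focusing on the previous one's mistakes.
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))
```

Random Forest is bagging plus one extra trick: each tree also considers only a random subset of features at every split, which decorrelates the trees further.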


Steps involved in Random Forest:-

Step 1: n random records are taken (with replacement) from a data set containing k records.
Step 2: An individual decision tree is constructed for each sample.
Step 3: Each decision tree generates an output.
Step 4: The final output is based on majority voting for classification or averaging for regression.
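The four steps above can be sketched directly for the classification case: bootstrap-sample the rows, fit one tree per sample, collect each tree's prediction, and take the majority vote. A minimal illustration on synthetic data (the data set and the choice of 25 trees are assumptions for the sketch):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a data set with k records.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # Step 1: bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))  # Step 2

preds = np.array([t.predict(X) for t in trees])      # Step 3: every tree predicts
majority = (preds.mean(axis=0) >= 0.5).astype(int)   # Step 4: majority vote (binary labels)
print("training accuracy:", (majority == y).mean())
```

A real Random Forest additionally samples a random subset of features at each split; `RandomForestClassifier` handles all of this internally.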


Coding in Python:-

1. Let’s import the libraries:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```

2. Importing Data Sets:

```python
df = pd.read_csv('heart_v2.csv')
print(df.head())
sns.countplot(x='heart disease', data=df)  # newer seaborn requires keyword arguments here
plt.title('Value counts of heart disease patients')
plt.show()
```


3. Assigning the Feature Variables to X and the Target Variable to y:

```python
X = df.drop('heart disease', axis=1)
y = df['heart disease']
```

4. Train-Test Split:

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
X_train.shape, X_test.shape
```
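When the target classes are imbalanced (as disease labels often are), passing `stratify` keeps the class ratio equal in both halves. This is an optional refinement, not part of the original code, shown here on hypothetical dummy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y_demo = np.array([0] * 80 + [1] * 20)   # imbalanced labels (hypothetical)
X_demo = np.arange(100).reshape(-1, 1)   # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=42, stratify=y_demo)
print(y_tr.mean(), y_te.mean())  # both splits keep the 20% positive rate
```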


5. Import RandomForestClassifier and fit the data:

```python
from sklearn.ensemble import RandomForestClassifier

classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1, max_depth=5,
                                       n_estimators=100, oob_score=True)
classifier_rf.fit(X_train, y_train)  # in a notebook, %%time must be the first line of the cell
classifier_rf.oob_score_
```
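`oob_score_` deserves a word: each tree's bootstrap sample leaves out roughly a third of the rows, and the out-of-bag (OOB) score is the accuracy of the forest on those left-out rows, a free validation estimate without a separate hold-out set. A self-contained sketch on synthetic data (the data set here is an assumption, not the article's heart data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

rf_demo = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_demo.fit(X_tr, y_tr)

print("OOB estimate :", round(rf_demo.oob_score_, 3))         # accuracy on each tree's left-out rows
print("test accuracy:", round(rf_demo.score(X_te, y_te), 3))  # held-out accuracy for comparison
```

The two numbers are typically close, which is why the OOB score is a convenient sanity check.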


6. Hyperparameter Tuning for Random Forest Using GridSearchCV:

```python
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state=42, n_jobs=-1)
params = {
    'max_depth': [2, 3, 5, 10, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100, 200],
    'n_estimators': [10, 25, 30, 50, 100, 200]
}
grid_search = GridSearchCV(estimator=rf, param_grid=params, cv=4,
                           n_jobs=-1, verbose=1, scoring="accuracy")
grid_search.fit(X_train, y_train)  # in a notebook, %%time must be the first line of the cell
```


```python
grid_search.best_score_
```


```python
rf_best = grid_search.best_estimator_
rf_best
```
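Since heart_v2.csv is not bundled with this post, the same tuning pattern can be reproduced end to end on synthetic data. This is a sketch with a deliberately tiny grid so it runs in seconds; the grid values are illustrative, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

params_demo = {"max_depth": [3, 5], "n_estimators": [25, 50]}  # tiny illustrative grid
grid = GridSearchCV(RandomForestClassifier(random_state=42, n_jobs=-1),
                    param_grid=params_demo, cv=4, scoring="accuracy")
grid.fit(X_demo, y_demo)

print(grid.best_params_)           # the winning parameter combination
print(round(grid.best_score_, 3))  # its mean cross-validated accuracy
best_rf = grid.best_estimator_     # refitted on all of X_demo, ready to use
```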


7. Visualization:

```python
from sklearn.tree import plot_tree

plt.figure(figsize=(80, 40))
plot_tree(rf_best.estimators_[5], feature_names=X.columns,
          class_names=['Disease', 'No Disease'], filled=True)
```


```python
# A different tree from the same forest, for comparison.
plt.figure(figsize=(80, 40))
plot_tree(rf_best.estimators_[7], feature_names=X.columns,
          class_names=['Disease', 'No Disease'], filled=True)
```


8. Sorting Features by Importance:

```python
rf_best.feature_importances_
```


```python
imp_df = pd.DataFrame({
    "Varname": X_train.columns,
    "Imp": rf_best.feature_importances_
})
imp_df.sort_values(by="Imp", ascending=False)
```
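The same ranking reads well as a horizontal bar chart. Sketched here on synthetic data with hypothetical feature names, since the real heart_v2.csv columns are not reproduced in this post:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                     random_state=42)
cols = [f"feat_{i}" for i in range(X_demo.shape[1])]  # hypothetical feature names

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)
imp = pd.Series(model.feature_importances_, index=cols).sort_values()

imp.plot.barh(title="Feature importance")  # largest bar ends up on top
plt.tight_layout()
plt.show()
```

Note that `feature_importances_` is normalized to sum to 1, so the bars show each feature's relative share of the forest's impurity reduction.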


Summary:

We can conclude that Random Forest is a high-performing technique widely used across industries for its efficiency. It can handle binary, continuous, and categorical data.
Random Forest is a great option for building models quickly and efficiently; the algorithm also copes well with missing values, although scikit-learn's RandomForestClassifier itself requires missing values to be imputed first.
Overall, Random Forest is a fast, simple, flexible, and robust model with few limitations.
