Machine learning models

When you first create a model, you have to choose among many model types.

Data preprocessing

Before you run your model's scenario, it helps to understand how the model is processed. First, the model is trained on 80% of the dataset. The remaining 20% is then used to test the model and calculate the model score. A high score means the trained model's predictions are close to the actual values in the test set.
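The 80/20 split described above can be sketched in a few lines of plain Python. This is only an illustration, not the product's actual implementation: the `train_test_split` and `accuracy` helpers, the toy dataset, and the threshold "model" are all hypothetical stand-ins, and accuracy is assumed as the score.

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle the rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, test_rows):
    """Fraction of test rows whose label is predicted correctly."""
    correct = sum(1 for feature, label in test_rows if predict(feature) == label)
    return correct / len(test_rows)

# Hypothetical toy dataset: (feature, label) pairs, label is 1 when feature >= 50.
dataset = [(x, int(x >= 50)) for x in range(100)]
train, test = train_test_split(dataset)  # 80% for training, 20% for testing

# Stand-in "model": a threshold halfway between the class means of the training set.
zeros = [x for x, y in train if y == 0]
ones = [x for x, y in train if y == 1]
threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

# The score is computed only on rows the model never saw during training.
score = accuracy(lambda x: int(x >= threshold), test)
print(f"train={len(train)} test={len(test)} score={score:.2f}")
```

Keeping the test rows out of training is what makes the score a fair estimate of how the model behaves on new data.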

Data preprocessing is a crucial step in machine learning. It improves model accuracy and performance by transforming and cleaning the raw data: removing inconsistencies, handling missing values, scaling features, and ensuring compatibility with the chosen algorithm.

During preprocessing, we can deal with:

  • null values: if a column is 50% null or more, the column will not be included in model training

  • missing values: in a numerical column, missing values are replaced with the column average; in a categorical column, they are replaced with "not_available"

  • One Hot Encoding: categorical data is transformed into numeric values before training, so that it is suitable for machine learning algorithms

  • fix imbalance: correcting an unequal distribution of the target classes, which is not ideal for training

  • normalization: rescaling the values of numerical columns to get a better training result

  • constants: if the column has one unique value (a constant), the column will not be included in the model training

  • cardinality: if the column has a high number of unique values, the column will not be included in the model training.
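As a rough illustration, most of the rules in the list above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the product's actual pipeline: the `preprocess` function and the sample table are hypothetical, the cardinality cutoff `MAX_CARDINALITY` is an assumed value (the text does not give one) and is applied to categorical columns only, and normalization is assumed to be min-max scaling. Class rebalancing is left out.

```python
from statistics import mean

NULL_THRESHOLD = 0.5   # drop columns that are 50% null or more (from the list above)
MAX_CARDINALITY = 3    # assumed cutoff for "high" cardinality; not given in the text

def preprocess(columns):
    """columns: dict of column name -> list of values, with None marking a null."""
    result = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if 1 - len(present) / len(values) >= NULL_THRESHOLD:
            continue  # null values: too many nulls, column excluded
        if len(set(present)) <= 1:
            continue  # constants: a single unique value, column excluded
        if all(isinstance(v, (int, float)) for v in present):
            # missing values: numerical nulls become the column average
            fill = mean(present)
            filled = [fill if v is None else v for v in values]
            # normalization: min-max rescaling to the range [0, 1] (assumed method)
            lo, hi = min(filled), max(filled)
            result[name] = [(v - lo) / (hi - lo) for v in filled]
        else:
            if len(set(present)) > MAX_CARDINALITY:
                continue  # cardinality: too many unique values, column excluded
            # missing values: categorical nulls become "not_available"
            filled = ["not_available" if v is None else v for v in values]
            # One Hot Encoding: one 0/1 column per category
            for category in sorted(set(filled)):
                result[f"{name}={category}"] = [int(v == category) for v in filled]
    return result

# Hypothetical sample table.
table = {
    "age": [20, None, 40, 30],
    "city": ["Paris", "Lyon", None, "Paris"],
    "country": ["FR", "FR", "FR", "FR"],   # constant
    "notes": [None, None, "x", None],      # 75% null
    "id": ["a", "b", "c", "d"],            # high cardinality
}
features = preprocess(table)
for column, values in features.items():
    print(column, values)
```

In this sample, only "age" and "city" survive the filters: "age" is imputed and rescaled, and "city" is expanded into one 0/1 column per category, including "not_available".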
