Advanced parameters in ML Models
The Advanced Parameters section in Graphite Note lets users fine-tune their machine learning models for binary classification, multiclass classification, and regression tasks. These parameters mirror the adjustments a data scientist would make to optimize model performance.
While advanced parameters offer flexibility and control, changes to these settings can significantly impact model training and behavior. Users are advised to adjust them cautiously and only with a clear understanding of their effects.
Description: Specifies the proportion of the dataset to be used for training the model, while the remaining portion is reserved for testing. For example, a value of 0.75 means 75% of the data is used to train the model, and 25% is used to evaluate its performance.
Default Value: 0.75 (75% training and 25% testing).
Impact: Adjusting the training dataset size affects the balance between model learning and evaluation:
• A higher training size (e.g., 0.85) gives the model more data to learn from, which can improve its ability to recognize patterns. However, it leaves less data for testing, which may limit the ability to accurately assess how well the model will perform on new data.
• A lower training size (e.g., 0.6) reserves more data for testing, providing a better evaluation of the model’s generalization to unseen data. However, this reduces the data available for training, which might result in a less accurate model.
Choosing the right balance ensures the model has enough data to learn effectively while leaving sufficient data for reliable testing and validation.
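Graphite Note performs this split internally; as a rough sketch of the idea, here is how the same 75/25 split could be expressed with scikit-learn (the toy data is purely illustrative):

```python
# Illustrative sketch of a 75/25 train/test split, assuming scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 rows, 2 toy features
y = np.arange(50) % 2               # toy binary labels

# train_size=0.75 mirrors the default described above:
# 75% of rows go to training, 25% are held out for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42
)
```

With 50 rows this yields 37 training rows and 13 test rows; the held-out rows are never shown to the model during training, which is what makes the evaluation honest.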
Description: A list of machine learning algorithms that will be evaluated and compared during model training. The available algorithms depend on the type of task, with separate sets of algorithms for Regression and Classification (Binary and Multiclass).
Regression algorithms: Linear Regression, Ridge Regression, Decision Tree, Random Forest, Support Vector Machine, Light Gradient Boosting Machine (LightGBM), K-Nearest Neighbors.
Binary and Multiclass Classification algorithms: K-Nearest Neighbors, Decision Tree, Random Forest, Logistic Regression, LightGBM, Gradient Boosting Classifier, AdaBoost, Multi-Layer Perceptron.
Impact: By choosing from these algorithms, users can experiment to identify the best-performing model for their specific use case. Selecting an appropriate algorithm based on the task type ensures optimal results and efficient model training.
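Graphite Note runs this comparison for you; conceptually, it resembles the following scikit-learn sketch, which cross-validates a few of the listed classifiers on a toy dataset and keeps the best scorer (the candidate list and data are illustrative, not the platform's exact procedure):

```python
# Illustrative sketch: comparing several candidate classifiers,
# assuming scikit-learn. Not Graphite Note's actual implementation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
```

Restricting the candidate list shortens training time; widening it increases the chance of finding a strong fit for your data.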
Description: The Sort Models By option allows users to rank classification models (binary and multiclass) based on specific evaluation metrics after training. This helps users identify the best-performing model for their specific goals. Note that this option is available only for classification tasks and is not applicable to regression models.
Users can sort models by the following metrics:
• Accuracy: Measures the proportion of correct predictions among all predictions.
• AUC (Area Under the Curve): Indicates how well the model distinguishes between classes; higher values indicate better performance.
• F1 Score: The harmonic mean of precision and recall, balancing both metrics.
• Precision: The proportion of correctly predicted positive cases out of all positive predictions.
• Recall: The proportion of actual positives correctly identified by the model.
Default Value: The default metric for sorting is F1 Score, as it balances precision and recall, making it suitable for many classification tasks.
Impact: Sorting models allows users to prioritize and identify the best-performing model based on a specific metric that aligns with their business or project needs. For instance:
• If minimizing false negatives is critical, users might prioritize Recall.
• If balancing precision and recall is essential, F1 Score would be a better choice.
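The ranking itself is simple to picture: compute the chosen metric for every trained model and sort descending. A small sketch using scikit-learn's F1 metric (the model names and predictions are made up for illustration):

```python
# Illustrative sketch: ranking models by F1 Score, assuming
# scikit-learn metrics. Model names and predictions are toy values.
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]          # held-out ground truth
preds = {
    "model_a": [1, 0, 1, 0, 0, 1, 1, 0],   # one false positive, one miss
    "model_b": [1, 0, 1, 1, 0, 0, 0, 0],   # no false positives, one miss
}

# Sort model names by F1, best first
ranked = sorted(preds, key=lambda name: f1_score(y_true, preds[name]),
                reverse=True)
```

Swapping `f1_score` for `recall_score` or `precision_score` gives the other sort orders described above.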
Description: Sets the decision threshold for classifying probabilities in binary classification models. For example, if the threshold is set to 0.5, the model will classify predictions with a probability above 50% as “positive” and below 50% as “negative.” Note that this option is available only for classification tasks and is not applicable to regression models.
Default Value: 0.5.
Impact: Adjusting the threshold changes how the model makes decisions, which can influence its behavior in identifying positive and negative outcomes.
• A lower threshold (e.g., 0.3) makes the model more likely to classify predictions as “positive.” This increases sensitivity (catching more actual positives) but may also increase false positives (incorrectly predicting positives).
• A higher threshold (e.g., 0.7) makes the model more conservative in predicting positives. This increases specificity (fewer false positives) but may miss some true positives, leading to more false negatives.
Simple Example: Imagine you are using a model to detect spam emails:
• A low threshold might flag more emails as spam, including some legitimate ones (false positives).
• A high threshold might avoid labeling legitimate emails as spam but could miss some actual spam emails (false negatives).
Choosing the right threshold depends on what is more important for your use case—minimizing missed positives or avoiding false alarms. For most general scenarios, the default value of 0.5 works well.
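The mechanics of the threshold can be sketched in a few lines: the model outputs a probability per example, and the threshold converts that probability into a 0/1 decision (the probabilities below are invented for illustration):

```python
# Illustrative sketch: converting predicted probabilities into
# class labels at different thresholds. Probabilities are toy values.
import numpy as np

proba = np.array([0.15, 0.35, 0.55, 0.72, 0.90])  # P(spam) per email

default = (proba >= 0.5).astype(int)   # default threshold
lenient = (proba >= 0.3).astype(int)   # flags more emails as spam
strict  = (proba >= 0.7).astype(int)   # flags fewer emails as spam
```

Lowering the threshold from 0.5 to 0.3 flags one extra email; raising it to 0.7 drops one, exactly the sensitivity/specificity trade-off described above.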
Description: A toggle to remove highly correlated features from the dataset to address multicollinearity issues. When enabled, the model will automatically exclude features that are too similar to each other.
Default Value: True.
Impact: Removing multicollinearity improves model stability, interpretability, and performance by ensuring that features are independent and not redundant.
• What is Multicollinearity? Multicollinearity occurs when two or more features in a dataset are highly correlated, meaning they provide overlapping information to the model. For example, “Total Price” and “Price per Unit” might be highly correlated because one depends on the other.
• Why is it a Problem? When features are highly correlated, the model struggles to determine which feature is actually influencing the prediction. This can lead to instability in the model’s results and make it harder to interpret which features are important.
• How Does Removing It Help? By removing one of the correlated features, the model focuses only on unique, non-redundant information. This makes the model more reliable and easier to understand.
ELI5 - Imagine you are solving a puzzle, but you have duplicate pieces that fit in the same spot. Removing the duplicate pieces makes it easier to complete the puzzle and understand how each piece fits. Similarly, removing multicollinearity helps the model work more efficiently and effectively.
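The "Total Price" vs. "Price per Unit" example can be made concrete with pandas: when one column is derived from another, their correlation is essentially perfect, which is exactly what the toggle detects (column names and data are illustrative):

```python
# Illustrative sketch: measuring how correlated two overlapping
# features are, assuming pandas. Columns echo the example above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price_per_unit": np.full(300, 9.99),          # fixed unit price
    "units_sold": rng.integers(1, 50, size=300),   # toy sales counts
})
# total_price is fully determined by units_sold, so the two overlap
df["total_price"] = df["price_per_unit"] * df["units_sold"]

r = df["units_sold"].corr(df["total_price"])  # essentially 1.0
```

A correlation this close to 1 means the two columns carry the same information, so keeping both only adds redundancy.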
Description: Defines the correlation threshold (e.g., 0.95) to determine which features in the dataset are considered multicollinear. If the correlation between two features exceeds this threshold, one of them will be removed. This option is only available if the Remove Multicollinearity toggle is set to True.
Default Value: 0.95.
Impact: Adjusting the multicollinearity threshold helps control how strictly the model identifies and removes redundant features. This improves model interpretability, simplifies feature selection, and ensures that only unique and valuable information is used for predictions.
• What Does the Threshold Do? The threshold determines how strong the correlation between two features must be for them to be considered “too similar.” For example:
• A threshold of 0.95 means that features with a correlation of 95% or more are considered redundant.
• A lower threshold (e.g., 0.85) will remove more features because it considers lower correlations as redundant.
• Why Does It Matter? Highly correlated features confuse the model because they provide the same or overlapping information. By setting the threshold, you decide how much overlap is acceptable before a feature is removed.
ELI5 - Think of the threshold like deciding how similar two books need to be before you donate one of them to save space. If the books tell almost the same story (high correlation), you keep just one. The same logic applies to features in your dataset!
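A threshold-based removal pass can be sketched with pandas: scan the correlation matrix and drop one feature from every pair above the threshold (this mirrors the described behavior, not Graphite Note's exact implementation; data and names are illustrative):

```python
# Illustrative sketch: dropping one feature from each pair whose
# absolute correlation exceeds a threshold, assuming pandas.
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    corr = df.corr().abs()
    cols = corr.columns
    to_drop = set()
    # scan the upper triangle; drop the later column of each pair
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a not in to_drop and b not in to_drop \
                    and corr.loc[a, b] > threshold:
                to_drop.add(b)
    return df.drop(columns=sorted(to_drop))

rng = np.random.default_rng(1)
units = rng.integers(1, 20, size=200).astype(float)
df = pd.DataFrame({
    "units": units,
    "total_price": units * 10 + rng.normal(0, 1, 200),  # near-duplicate
    "age": rng.uniform(18, 70, 200),                    # independent
})

reduced = drop_correlated(df, threshold=0.95)
```

At the default 0.95 only the near-duplicate column is removed; lowering the threshold would start removing more weakly related pairs as well.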
Description: When enabled, applies hyperparameter tuning to optimize the model’s configuration. Hyperparameter tuning adjusts internal settings of the algorithm to find the combination that delivers the best results.
Default Value: False.
Impact: Enabling model tuning can significantly improve the model’s accuracy and overall performance by finding the optimal settings for how the algorithm works. However, this process requires additional training time, as the system runs multiple tests to identify the best configuration.
• What is Hyperparameter Tuning? Think of hyperparameters as “knobs” that control how a model learns. For example, in a Random Forest algorithm, hyperparameters might decide how many decision trees to use or how deep each tree can grow. Tuning adjusts these knobs to find the best combination for your specific data.
• Why Enable Model Tuning? Without tuning, the model uses default settings, which might not be the best for your dataset. Tuning customizes the algorithm, helping it perform better by maximizing accuracy or minimizing errors.
• What’s the Trade-off? Tuning takes more time because the system tests many combinations of hyperparameters to find the best one. This makes training longer, but the results are usually more accurate and reliable.
ELI5 - Imagine you’re baking a cake and adjusting the temperature and baking time to get the perfect result. Hyperparameter tuning is like trying different combinations of time and temperature to make the cake just right. Enabling this feature ensures your “cake” (model) performs its best!
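The "knobs" analogy maps directly onto a grid search: define candidate values for each hyperparameter and let the search try every combination. A scikit-learn sketch with an intentionally tiny, illustrative grid (Graphite Note's actual tuning procedure may differ):

```python
# Illustrative sketch: tuning Random Forest hyperparameters with a
# grid search, assuming scikit-learn. Grid values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=150, n_features=6, random_state=0)

grid = {
    "n_estimators": [25, 50],   # how many trees to grow
    "max_depth": [3, None],     # how deep each tree may grow
}
# 4 combinations x 3 folds = 12 fits: this is why tuning takes longer
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3)
search.fit(X, y)

best_params = search.best_params_   # the winning "knob" settings
```

Every extra hyperparameter value multiplies the number of fits, which is the time trade-off described above.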
Description: Specifies whether to remove outliers—extreme or unusual data points—from the dataset based on a defined threshold. If set to True, you can adjust the Outliers Threshold option to determine which data points are considered outliers.
Default Value: False.
Impact: Removing outliers can improve model performance by eliminating data points that are far from the majority of the data and could negatively affect predictions. However, removing too many points might result in losing important information, so it’s essential to set the threshold carefully.
• What are Outliers? Outliers are data points that are very different from the rest of your dataset. For example, if most customers spend $100 to $200 monthly but one customer spends $10,000, that’s an outlier.
• Why Remove Them? Outliers can confuse the model because they don’t represent typical behavior. For example, if the model tries to adjust for the $10,000 spender, it might make poor predictions for customers in the normal $100-$200 range.
• What Happens if You Enable This? When you set Remove Outliers to True, you can choose an Outliers Threshold to decide how far a data point must be from the average to be removed. This helps keep only relevant and meaningful data for training the model.
ELI5 - Imagine you’re cooking and one ingredient is wildly over-measured compared to the rest. Removing that extreme amount ensures your dish tastes balanced. Similarly, removing outliers ensures your model isn’t influenced by extreme, unusual data points.
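The $10,000-spender example is easy to quantify: a single extreme value can drag a summary statistic far away from typical behavior (the numbers below are illustrative):

```python
# Illustrative sketch: how one extreme value distorts the average,
# using the customer-spend example above. Numbers are toy values.
import numpy as np

spend = np.array([120, 150, 180, 110, 160, 10_000])  # one extreme spender

mean_with = spend.mean()                  # pulled far above typical spend
mean_without = spend[spend < 1000].mean() # close to typical spend
```

With the outlier included the mean is roughly ten times the typical customer's spend; without it, the mean lands back inside the normal $100-$200 range, so the model sees representative data.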
Description: The Outliers Threshold defines the proportion of data points that are considered outliers. For example, setting the threshold to 0.05 means that 5% of the most extreme data points in the dataset will be treated as outliers and removed. This option is available only if the Remove Outliers toggle is set to True.
Default Value: 0.05 (5% of data points are considered outliers).
Impact: Adjusting the threshold controls how strict the model is in identifying and removing outliers.
• A lower threshold (e.g., 0.02) is stricter and identifies fewer but more extreme outliers. This ensures that only the most unusual data points are removed, preserving the majority of the data.
• A higher threshold (e.g., 0.1) is less strict and removes a larger portion of the data. This can be useful for datasets with significant variability but might risk removing useful information.
By setting the threshold appropriately, users can ensure that extreme values that could negatively affect the model’s performance are removed while retaining as much meaningful data as possible. This balance is crucial for improving model accuracy and ensuring the dataset represents typical patterns.
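One simple way to realize a proportion-based threshold is to rank points by their distance from the mean and discard the most extreme fraction; the sketch below follows that idea (it matches the description above, not necessarily Graphite Note's exact method):

```python
# Illustrative sketch: treating the most extreme 5% of points as
# outliers by distance from the mean. A simplification of the
# proportion-based threshold described above.
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(100, 15, size=1000)  # toy feature values

threshold = 0.05                          # default: remove top 5%
dist = np.abs(values - values.mean())     # distance from the average
cutoff = np.quantile(dist, 1 - threshold) # 95th percentile of distances
kept = values[dist <= cutoff]             # drop the extreme 5%
```

Raising `threshold` to 0.1 would keep only the central 90% of points, while lowering it to 0.02 would remove only the 2% most extreme, mirroring the strictness trade-off above.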