Preprocessing Data


Last updated 6 months ago


In Graphite Note, data preparation is divided into two main steps. Both run automatically, so you don’t have to handle them yourself. Data preprocessing is a crucial step in machine learning: it improves model accuracy and performance by cleaning and transforming raw data to remove inconsistencies, handle missing values, scale features, and ensure compatibility with the chosen algorithm.

Step 1: Exclusion of Columns

Features Not Fit for Model: Graphite Note automatically excludes columns that aren’t suitable for modeling, such as date/datetime columns, so that only relevant features are used in training.
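
As a rough illustration of this step, the sketch below drops date-like columns from a small table held as a list of dictionaries. It is not Graphite Note's actual implementation: the helper names `is_date_like` and `exclude_date_columns`, the sample column names, and the rule "a column is a date column if all its non-null values are dates or ISO-8601 date strings" are all assumptions made for the example.

```python
from datetime import date, datetime


def is_date_like(value):
    """Return True for date/datetime objects and ISO-8601 date strings."""
    if isinstance(value, (date, datetime)):
        return True
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)
            return True
        except ValueError:
            return False
    return False


def exclude_date_columns(rows):
    """Drop every column whose non-null values are all date-like.

    Hypothetical helper: returns the filtered rows and the dropped column names.
    """
    columns = list(rows[0].keys()) if rows else []
    date_columns = []
    for col in columns:
        values = [row[col] for row in rows if row[col] is not None]
        if values and all(is_date_like(v) for v in values):
            date_columns.append(col)
    kept = [
        {col: val for col, val in row.items() if col not in date_columns}
        for row in rows
    ]
    return kept, date_columns


# Made-up sample data for illustration only.
rows = [
    {"signup_date": "2024-01-15", "plan": "pro", "monthly_spend": 49.0},
    {"signup_date": "2024-02-03", "plan": "basic", "monthly_spend": 19.0},
]
features, dropped = exclude_date_columns(rows)
print(dropped)
print(features)
```

Here `signup_date` is removed, while `plan` and `monthly_spend` remain available as features.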

Step 2: Preprocessing

To achieve the best results, Graphite Note takes care of several preprocessing steps:

• Null Values: Graphite Note identifies null values and handles them according to best practices. A column in which 50% or more of the values are null is excluded from model training.

• Missing Values: Missing values are filled in automatically to maintain data integrity. In a numerical column, each missing value is replaced with the column average; in a categorical column, it is replaced with the value "not_available".

• One-Hot Encoding: Categorical variables are automatically transformed using one-hot encoding, converting categories into numerical formats suitable for model training.

• Fix Imbalance: In classification tasks, Graphite Note corrects class imbalance, that is, an unequal distribution of the target classes, so that the classes are represented in a balanced way during training.

• Normalization: Numeric columns are scaled to a uniform range, ensuring consistent data for models that require normalized input.

• Constants: Columns with constant values, which don’t contribute useful information, are identified and excluded from the dataset.

• Cardinality: Graphite optimizes high-cardinality categorical columns for model performance, handling complex categorical data effectively.
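
To make several of the steps above concrete, here is a minimal sketch in plain Python. It covers the 50% null rule, imputation (average or "not_available"), removal of constant columns, one-hot encoding, and normalization. It is an illustration under stated assumptions, not Graphite Note's implementation: the function name `preprocess`, the order of the steps, the `column=category` naming of encoded columns, and the choice of min-max scaling to the range 0 to 1 are all assumptions, and the imbalance and cardinality steps are left out.

```python
from statistics import mean

NULL_THRESHOLD = 0.5  # from the text: columns that are 50% null or more are dropped


def preprocess(rows, numeric_cols, categorical_cols):
    """Hypothetical sketch: returns a dict mapping column name -> list of values."""
    if not rows:
        return {}
    n = len(rows)

    def null_share(col):
        return sum(row[col] is None for row in rows) / n

    # 1. Null values: drop columns that are at least 50% null.
    num_cols = [c for c in numeric_cols if null_share(c) < NULL_THRESHOLD]
    cat_cols = [c for c in categorical_cols if null_share(c) < NULL_THRESHOLD]

    # 2. Missing values: column average for numbers, "not_available" for categories.
    data = {}
    for col in num_cols:
        avg = mean(row[col] for row in rows if row[col] is not None)
        data[col] = [avg if row[col] is None else row[col] for row in rows]
    for col in cat_cols:
        data[col] = ["not_available" if row[col] is None else row[col] for row in rows]

    # 3. Constants: drop columns that hold a single distinct value.
    data = {col: vals for col, vals in data.items() if len(set(vals)) > 1}

    out = {}
    for col, vals in data.items():
        if col in num_cols:
            # 4. Normalization: min-max scaling (an assumed choice of scaler).
            lo, hi = min(vals), max(vals)
            out[col] = [(v - lo) / (hi - lo) for v in vals]
        else:
            # 5. One-hot encoding: one 0/1 column per category.
            for category in sorted(set(vals)):
                out[f"{col}={category}"] = [int(v == category) for v in vals]
    return out


# Made-up sample data for illustration only.
rows = [
    {"age": 20, "plan": "basic", "country": "US", "notes": None},
    {"age": None, "plan": "pro", "country": "US", "notes": None},
    {"age": 40, "plan": None, "country": "US", "notes": "vip"},
    {"age": 60, "plan": "pro", "country": "US", "notes": None},
]
result = preprocess(rows, numeric_cols=["age"], categorical_cols=["plan", "country", "notes"])
for name, values in result.items():
    print(name, values)
```

In this sample, `notes` is dropped because 75% of its values are null, `country` is dropped because it is constant, the missing `age` is filled with the average of the other three values, and the missing `plan` becomes its own "not_available" category.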

In traditional data science projects, these steps would require manual effort from data scientists, including data cleaning, encoding, scaling, and testing, often involving a significant amount of time and expertise. Graphite Note automates this entire process, completing these steps in seconds and allowing users to focus on insights and decision-making rather than data preparation.
