Expanding datasets

Last updated 7 months ago

Expanding your dataset with more records can be beneficial for your machine learning model, but the impact depends on the quality and relevance of the new data. To learn how to re-upload or append data to your existing CSV dataset, see Re-upload or append CSV.

Here’s how adding more records might help:

1. Improved Generalization

More data generally helps the model to learn better and generalize to unseen data. If your initial dataset was limited, your model might have overfitted to the specific patterns in that data. Adding more data helps the model capture a wider range of patterns, leading to better performance on new data.

2. Reducing Overfitting

When your dataset is small, the model may learn noise or irrelevant patterns (overfitting). Expanding the dataset introduces more variety, making it harder for the model to memorize specific samples, thereby helping to reduce overfitting.
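One way to see whether more records are likely to help is to compare the error on the training rows with the error on held-out rows as the training set grows. The sketch below is a generic illustration that runs outside Graphite Note. It uses only the Python standard library, synthetic data, and a simple straight-line model; the function names (`fit_line`, `mse`, `learning_curve`) are hypothetical helpers, not part of any Graphite Note API.

```python
import random


def fit_line(points):
    """Ordinary least squares fit of y = a * x + b to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def mse(model, points):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)


def learning_curve(train, holdout, sizes):
    """Fit on the first `size` training rows and score on both sets."""
    curve = []
    for size in sizes:
        subset = train[:size]
        model = fit_line(subset)
        curve.append((size, mse(model, subset), mse(model, holdout)))
    return curve


# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = random.Random(0)


def make_point():
    x = rng.uniform(0, 10)
    return x, 2 * x + 1 + rng.gauss(0, 2)


train = [make_point() for _ in range(200)]
holdout = [make_point() for _ in range(200)]

curve = learning_curve(train, holdout, [5, 20, 80, 200])
for size, train_err, holdout_err in curve:
    print(size, round(train_err, 2), round(holdout_err, 2))
```

A large gap between the training error and the held-out error at small sizes that narrows as rows are added is the usual sign that the model was overfitting and that more data is helping. If the two errors are already close, more records of the same kind are unlikely to change much.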

3. Better Representing the Data Distribution

A larger dataset often better represents the underlying data distribution, especially if the new records cover more edge cases, outliers, or scenarios that were underrepresented in the original dataset. This helps the model become more robust and perform well across a wider range of inputs.

4. Enhanced Model Accuracy

Expanding your dataset often improves the accuracy of the model, especially if the model is data-hungry (like deep learning models) and the new records are as reliable as the existing ones. More data means more examples for the model to learn from, allowing it to better predict future outcomes.

5. Handling Class Imbalance

If your dataset suffers from class imbalance (e.g., if one class has far more records than another in a classification problem), adding more records from the minority class can make your dataset more balanced, improving the model’s ability to predict minority classes correctly.
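To judge whether new records actually reduce an imbalance, it can help to compare the class shares before and after adding them. The following is a minimal sketch using only the Python standard library; the `class_balance` helper and the "stay" / "churn" labels are made up for illustration.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the records and the majority/minority ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {cls: n / total for cls, n in counts.items()}
    ratio = max(counts.values()) / min(counts.values())
    return shares, ratio


# Hypothetical target column: 90 "stay" records and 10 "churn" records.
before = ["stay"] * 90 + ["churn"] * 10
# The same dataset after appending 20 more minority-class records.
after = before + ["churn"] * 20

shares_before, ratio_before = class_balance(before)
shares_after, ratio_after = class_balance(after)

print(ratio_before, shares_before)  # 9.0, churn share 0.1
print(ratio_after, shares_after)    # 3.0, churn share 0.25
```

In this example the majority class goes from 9 times to 3 times the size of the minority class, so the added records move the dataset toward balance without removing any existing data.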

Considerations:

• Quality over Quantity: Simply adding more data isn’t always beneficial if the additional data is noisy, irrelevant, or incorrectly labeled. High-quality, representative data is more important than just increasing the size of the dataset.

• Data Diversity: Adding data that captures a wider variety of features or scenarios is more helpful than adding redundant or very similar data points. If the new data points are too similar to the existing ones, the impact on model performance might be minimal.

• Graphite Note plan limits: Expanding the dataset increases both the computational requirements for training the model and the total number of dataset rows counted against your current plan. For more about Graphite Note plans, see Subscription Plans.
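Before appending a file, the considerations above can be checked with a quick local preview: how many incoming rows are exact duplicates of rows you already have, and what the row total would be afterwards. This is a rough sketch under stated assumptions: `preview_append` is a hypothetical helper, rows are represented as tuples, and `row_limit` is a placeholder number, not an actual Graphite Note plan limit.

```python
def preview_append(existing_rows, new_rows, row_limit):
    """Summarize what appending new_rows would do to the dataset."""
    seen = set(existing_rows)
    novel = []
    duplicates = 0
    for row in new_rows:
        if row in seen:
            # Exact copy of a row already present: adds volume, not information.
            duplicates += 1
        else:
            seen.add(row)
            novel.append(row)
    total_after = len(existing_rows) + len(novel)
    return {
        "novel": len(novel),
        "duplicates": duplicates,
        "total_after": total_after,
        "within_limit": total_after <= row_limit,
    }


# Hypothetical rows: (customer_id, order_value).
existing = [("A", 10), ("B", 20), ("C", 30)]
incoming = [("B", 20), ("D", 40), ("D", 40), ("E", 50)]

# row_limit=5 is a placeholder; check your own plan for the real figure.
summary = preview_append(existing, incoming, row_limit=5)
print(summary)
```

Here two of the four incoming rows are duplicates, so only two add new information. The preview counts only the novel rows toward the total, which assumes you drop duplicates before uploading.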
