# Outliers and Data Quality Checks

### Purpose

Outliers are data points that differ significantly from the majority of observations in a dataset. They may represent unusual events, measurement errors, or incorrect data entries. Detecting and handling outliers is an important part of exploratory data analysis because extreme values can strongly influence statistical results and machine learning models.

Data quality checks aim to ensure that the dataset accurately represents real-world conditions. By identifying anomalies, inconsistencies, or impossible values, analysts can prevent models from learning misleading patterns.

***

### What Are Outliers?

An outlier is an observation that lies far outside the typical range of values in a dataset.

Outliers can occur for several reasons:

* measurement errors
* incorrect data entry
* rare but valid events
* system or sensor failures

For example, if nearly all values in a column fall between 10 and 15 and a single observation is 100, that observation lies far from the rest and may be considered an outlier. Whether it is an error or a rare but valid event still has to be checked.

***

### Why Outliers Matter

Outliers can significantly affect statistical measures such as averages, correlations, and standard deviations. When extreme values are present, they may distort the patterns that machine learning algorithms attempt to learn.

Some potential consequences include:

* misleading correlations
* unstable model coefficients
* reduced predictive performance
* models that focus too heavily on rare cases

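Because of these risks, outliers should always be investigated during exploratory analysis before deciding whether to keep, correct, or remove them. A rare but valid event carries real information, whereas a measurement or data-entry error does not.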
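
The effect on summary statistics can be illustrated with a minimal sketch. The numbers below are hypothetical and chosen only for illustration; the `summarize` helper is not part of any library.

```python
import statistics

# Hypothetical values; the final 100 is an illustrative outlier.
typical = [10, 12, 11, 13, 12, 11, 10]
with_outlier = typical + [100]


def summarize(values):
    """Return the mean, median, and sample standard deviation."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


before = summarize(typical)
after = summarize(with_outlier)

print("without outlier:", before)
print("with outlier:   ", after)
```

In this example a single extreme value roughly doubles the mean and inflates the standard deviation many times over, while the median moves only from 11 to 11.5. This is one reason the median is often reported alongside the mean when outliers are suspected.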

***

### Detecting Outliers

Several techniques are commonly used to identify unusual observations:

**Boxplots**

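Boxplots summarize the central distribution of values through the median and the interquartile range (IQR). Points that fall beyond the whiskers, conventionally more than 1.5 × IQR below the first quartile or above the third quartile, are drawn individually, which makes them easy to spot.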
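
The whisker convention that many boxplots use, flagging points more than 1.5 × IQR beyond the quartiles, can also be applied numerically. The following is a minimal sketch under stated assumptions: `iqr_outliers` is a hypothetical helper, the data is invented, and quartiles are computed with the "inclusive" method of Python's `statistics.quantiles` (other tools may use slightly different quartile definitions).

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical readings; 100 is far from the rest.
readings = [10, 12, 11, 13, 12, 11, 10, 100]
flagged = iqr_outliers(readings)
print(flagged)
```

A flagged value is only a candidate for review. The rule says nothing about whether the point is an error or a rare but valid event.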

**Histograms**

Histograms reveal the distribution of values and may expose extreme values in the tails of the distribution.
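
As a rough illustration of what a histogram exposes, the sketch below counts values per fixed-width bin and prints a text histogram. The `bin_counts` helper and the data are hypothetical, and the helper assumes integer values and an integer bin width.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Count integer values per fixed-width bin, keyed by each bin's lower edge.

    Empty bins between the smallest and largest value are kept with a count
    of zero so that gaps in the distribution stay visible.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    first, last = min(counts), max(counts)
    return {
        edge: counts.get(edge, 0)
        for edge in range(first, last + bin_width, bin_width)
    }


# Hypothetical readings; 100 sits alone in the right tail.
readings = [10, 12, 11, 13, 12, 11, 10, 100]
counts = bin_counts(readings, bin_width=10)

for edge, count in counts.items():
    print(f"{edge:>4} | {'#' * count}")
```

A long run of empty bins followed by an isolated bar in the tail is the textual equivalent of the extreme value a plotted histogram would show.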

**Scatter plots**

Scatter plots help detect unusual observations when comparing relationships between two variables.
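
Some observations look ordinary in each variable on its own and only stand out when the two are viewed together. One way to mimic what the eye does on a scatter plot is to fit a straight line and flag points that sit far from it. This is a swapped-in illustrative technique, not a method prescribed by this page: the helpers, the data, and the threshold (three times the median absolute residual) are all assumptions.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def unusual_pairs(xs, ys, factor=3.0):
    """Flag (x, y) pairs whose absolute residual exceeds
    factor * median absolute residual (an illustrative threshold)."""
    slope, intercept = fit_line(xs, ys)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    threshold = factor * statistics.median(residuals)
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if r > threshold]


# Hypothetical data following roughly y = 2x, except for the pair (5, 2).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 2, 12, 14, 16]
flagged_pairs = unusual_pairs(xs, ys)
print(flagged_pairs)
```

Here neither `x = 5` nor `y = 2` is extreme within its own column, yet the pair breaks the relationship the other points follow. Note that the outlier itself pulls the fitted line, so this kind of check is a starting point for inspection rather than a definitive test.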

These visualizations allow analysts to quickly spot anomalies and assess whether they represent genuine observations or potential data issues.

***

### Data Quality Checks

In addition to identifying outliers, data quality checks focus on detecting inconsistencies or impossible values within a dataset. These checks may include:

* verifying that numerical values fall within realistic ranges
* identifying missing values
* detecting duplicate records
* checking for invalid measurements or formatting errors

Ensuring high data quality is essential because machine learning models rely entirely on the information provided in the dataset.
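
The four checks listed above can be sketched in a few lines. Everything in this example is hypothetical: the record fields, the plausible age range of 0 to 120, and the deliberately loose email pattern are illustrative assumptions, and `quality_report` is not part of any library.

```python
import re

# Hypothetical records; field names and values are for illustration only.
records = [
    {"id": "A001", "age": 34, "email": "ana@example.com"},
    {"id": "A002", "age": None, "email": "ben@example.com"},   # missing value
    {"id": "A003", "age": 212, "email": "cara@example.com"},   # unrealistic age
    {"id": "A001", "age": 34, "email": "ana@example.com"},     # duplicate of row 0
    {"id": "A004", "age": 51, "email": "not-an-email"},        # formatting error
]

# Deliberately loose pattern: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def quality_report(rows, age_range=(0, 120)):
    """Return the row indexes that fail each basic check."""
    report = {"missing": [], "out_of_range": [], "duplicate": [], "bad_format": []}
    low, high = age_range
    seen = set()
    for index, row in enumerate(rows):
        # Missing values: any field that is None or an empty string.
        if any(value is None or value == "" for value in row.values()):
            report["missing"].append(index)
        # Range check: age must fall inside the assumed plausible range.
        age = row.get("age")
        if age is not None and not (low <= age <= high):
            report["out_of_range"].append(index)
        # Duplicates: an exact repeat of an earlier row.
        key = tuple(row[field] for field in sorted(row))
        if key in seen:
            report["duplicate"].append(index)
        else:
            seen.add(key)
        # Formatting: email must match the loose pattern.
        email = row.get("email")
        if email and not EMAIL_PATTERN.match(email):
            report["bad_format"].append(index)
    return report


report = quality_report(records)
print(report)
```

Returning row indexes rather than dropping rows keeps the decision about how to handle each issue with the analyst.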

***

### Why Data Quality Is Critical for Machine Learning

Machine learning models learn patterns directly from the data they receive. If the data contains errors, inconsistencies, or unrealistic values, the resulting predictions may be unreliable.

By performing careful data quality checks and addressing potential issues early in the workflow, analysts create a stronger foundation for building accurate and trustworthy predictive models.

