{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Class Parity Label Analysis Tutorial\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## _Problem Statement_\n",
    "\n",
    "For machine learning tasks, a discrepancy in label frequencies between train and test datasets can result in poor model performance.\n",
    "\n",
    "To help with this, DataEval has a tool that compares the label distributions of two datasets.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### _When to use_\n",
    "\n",
    "The `Parity` class and similar should be used when you would like to determine if two datasets have statistically independent labels.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### _What you will need_\n",
    "\n",
    "1. A labeled training image dataset\n",
    "2. A labeled test image dataset to evaluate the label distribution of\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### _Setting up_\n",
    "\n",
    "Let's import the required libraries needed to set up a minimal working example\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "remove_cell"
    ]
   },
   "outputs": [],
   "source": [
    "try:\n",
    "    import google.colab  # noqa: F401\n",
    "\n",
    "    # specify the version of DataEval (==X.XX.X) for versions other than the latest\n",
    "    %pip install -q dataeval\n",
    "except Exception:\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataeval.metrics.bias import label_parity\n",
    "from dataeval.utils.dataset.datasets import MNIST"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data\n",
    "\n",
    "We will use the MNIST dataset from torchvision for this tutorial on class label statistics\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_ds = MNIST(\"./data\", train=True, download=True, size=2000)\n",
    "test_ds = MNIST(\"./data\", train=False, download=True, size=500)\n",
    "\n",
    "# Take a subset of 2000 training images and 500 test images\n",
    "train_labels = train_ds.targets\n",
    "test_labels = test_ds.targets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate label statistical independence\n",
    "\n",
    "Now, let's look at how to use DataEval's label statistics analyzer.\n",
    "Start by initializing a `Parity` object. Compute the chi-squared value of hypothesis that test_ds has the same class distribution as train_ds by specifying the two datasets to be compared, as well as the number of unique classes (for MNIST, there are 10 unique classes). It also returns the p-value of the test.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = label_parity(train_labels, test_labels)\n",
    "print(f\"The chi-squared value for the two label distributions is {results.score}, with p-value {results.p_value}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "remove_cell"
    ]
   },
   "outputs": [],
   "source": [
    "### TEST ASSERTION CELL ###\n",
    "assert results.score == 0.0\n",
    "assert results.p_value == 1.0"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv-3.12",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}