How to configure global hardware defaults in DataEval

Problem statement

DataEval provides global configuration settings to control computational resources and hardware acceleration. This guide shows how to configure the default PyTorch device, batch size, and the maximum number of worker processes.

When to use

  • You need to specify GPU or CPU execution for PyTorch-based operations

  • You want to set a global default batch size for data processing operations

  • You want to control the number of parallel worker processes

  • You need to optimize performance for your hardware configuration

What you will need

  1. A Python environment with dataeval installed

Getting started

import dataeval

Configuring the PyTorch device

DataEval lets you set the PyTorch device that its operations use. See the PyTorch torch.device documentation for valid device values.

Set the default device to CPU

dataeval.config.set_device("cpu")

print(f"Current device for DataEval: {dataeval.config.get_device()}")
Current device for DataEval: cpu

Set the default device to CUDA GPU

dataeval.config.set_device("cuda")

print(f"Current device for DataEval: {dataeval.config.get_device()}")
Current device for DataEval: cuda

Set the default device to a specific CUDA GPU

dataeval.config.set_device("cuda:1")

print(f"Current device for DataEval: {dataeval.config.get_device()}")
Current device for DataEval: cuda:1

Reset the device to use PyTorch’s default device

dataeval.config.set_device(None)

print(f"Current device for DataEval: {dataeval.config.get_device()}")
Current device for DataEval: cpu
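
Hard-coding "cuda" ties a script to machines that have a GPU. The following is a minimal sketch of choosing the device name at runtime instead. pick_device_name is a hypothetical helper, not part of DataEval, and it assumes that falling back to "cpu" is acceptable when PyTorch is not importable or reports no CUDA device.

```python
import importlib.util


def pick_device_name() -> str:
    """Return "cuda" when PyTorch reports a usable CUDA device, else "cpu"."""
    # Hypothetical helper for illustration; not part of DataEval.
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device_name = pick_device_name()
print(f"Selected device: {device_name}")
```

The resulting string can then be passed to dataeval.config.set_device(device_name), in the same way as the literal values shown above.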

Configuring the default batch size

DataEval allows setting a global default batch size for operations that process data in batches. The batch size must be a positive integer.

Note that functions and methods that require a batch_size will fail if one is not provided and no global batch size is set.

Set the default batch size

dataeval.config.set_batch_size(64)

print(f"Current batch size: {dataeval.config.get_batch_size()}")
Current batch size: 64

Reset the batch size to unset

dataeval.config.set_batch_size(None)

# When no batch size is set, get_batch_size() requires an explicit value
print("Batch size has been unset")
Batch size has been unset
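
To make the rule above concrete, here is a small sketch of how an explicit argument and a global default could be combined. It is an illustration only: resolve_batch_size and _global_batch_size are hypothetical names, and the precedence shown (explicit argument first, then the global default, otherwise an error) is an assumption about the behavior, not DataEval's actual code.

```python
from typing import Optional

_global_batch_size: Optional[int] = None  # stand-in for the global default


def resolve_batch_size(batch_size: Optional[int] = None) -> int:
    """Pick the explicit argument if given, else the stand-in global default."""
    # Hypothetical helper; DataEval's internal logic may differ.
    value = batch_size if batch_size is not None else _global_batch_size
    if value is None:
        raise ValueError("No batch_size given and no global batch size set")
    if isinstance(value, bool) or not isinstance(value, int) or value < 1:
        raise ValueError("batch_size must be a positive integer")
    return value


print(resolve_batch_size(32))  # explicit value, no global default needed

_global_batch_size = 64
print(resolve_batch_size())  # falls back to the stand-in global default
```

Under this assumption, calling resolve_batch_size() before any default is set raises a ValueError, which mirrors the failure described in the note.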

Configuring maximum worker processes

DataEval follows the maximum worker configuration conventions used by scikit-learn and joblib.

Set the maximum number of worker processes

dataeval.config.set_max_processes(4)
print(f"Max processes: {dataeval.config.get_max_processes()}")
Max processes: 4

Set the maximum number of workers to all visible CPU cores

dataeval.config.set_max_processes(-1)
print(f"Max processes: {dataeval.config.get_max_processes()}")
Max processes: -1

Unset the maximum number of workers

dataeval.config.set_max_processes(None)
print(f"Max processes: {dataeval.config.get_max_processes()}")
Max processes: None
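
The -1 value follows joblib's n_jobs convention, where negative numbers count back from the number of available cores. The sketch below shows how such a setting translates into a concrete process count. effective_workers is a hypothetical helper that mirrors joblib's documented rule; how DataEval itself interprets None or values below -1 is an assumption here.

```python
import os
from typing import Optional


def effective_workers(max_processes: Optional[int], n_cpus: Optional[int] = None) -> int:
    """Translate a joblib-style worker setting into a concrete process count."""
    # Hypothetical helper that mirrors joblib's n_jobs convention;
    # it is not DataEval's implementation.
    if n_cpus is None:
        # os.cpu_count() ignores CPU affinity limits; good enough for a sketch
        n_cpus = os.cpu_count() or 1
    if max_processes is None:
        return 1  # joblib treats an unset n_jobs as a single worker
    if max_processes == 0:
        raise ValueError("0 workers has no meaning in this convention")
    if max_processes < 0:
        # -1 -> all cores, -2 -> all but one, and so on (never below 1)
        return max(n_cpus + 1 + max_processes, 1)
    return max_processes


print(effective_workers(4, n_cpus=8))     # 4
print(effective_workers(-1, n_cpus=8))    # 8
print(effective_workers(-2, n_cpus=8))    # 7
print(effective_workers(None, n_cpus=8))  # 1
```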

Using temporary context managers

Temporarily override configuration settings using context managers:

dataeval.config.set_batch_size(64)
print(f"Before context: {dataeval.config.get_batch_size()}")

with dataeval.config.use_batch_size(16):
    print(f"Inside context: {dataeval.config.get_batch_size()}")
    # Perform operations with batch_size=16

print(f"After context: {dataeval.config.get_batch_size()}")
Before context: 64
Inside context: 16
After context: 64

dataeval.config.set_max_processes(8)
print(f"Before context: {dataeval.config.get_max_processes()}")

with dataeval.config.use_max_processes(2):
    print(f"Inside context: {dataeval.config.get_max_processes()}")
    # Perform operations with max_processes=2

print(f"After context: {dataeval.config.get_max_processes()}")
Before context: 8
Inside context: 2
After context: 8