
Introduction

Autolabel supports transformation of the input data! Input datasets come in many shapes and formats, and transforms help you ingest your data in whatever form is most useful for the downstream LLM or labeling task you have in mind. We have tried to make the transforms performant and configurable, with outputs formatted in a way that is useful for the LLM.

Example

Here we will show you how to run an example transform. We will use the Webpage Transform to ingest national park websites and label the US state that each national park belongs to. You can find a Jupyter notebook with code that you can run on your own here.

You can also try the webpage transform yourself in Colab - open in colab

Changes to config

{
    "task_name": "NationalPark",
    "task_type": "question_answering",
    "dataset": {
    },
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo"
    },
    "transforms": [{
        "name": "webpage_transform",
        "params": {
            "url_column": "url"
        },
        "output_columns": {
            "content_column": "content"
        }
    }],
    "prompt": {
        "task_guidelines": "You are an expert at understanding websites of national parks. You will be given a webpage about a national park. Answer with the US State that the national park is located in.",
        "output_guidelines": "Answer in one word the state that the national park is located in.",
        "example_template": "Content of wikipedia page: {content}\State:",
    }
}

Notice the transforms key in the config. This is where we define our transforms. It is a list, so multiple transforms can be defined; each element of the list is one transform. A transform is a JSON object with three inputs:

  1. name: This tells the agent which transform needs to be loaded. Here we are using the webpage transform.
  2. params: This is the set of parameters that will be passed to the transform. Read the documentation of the specific transform to see which params it accepts. Here we pass url_column, i.e. the column containing the webpages that need to be loaded.
  3. output_columns: Each transform can define multiple outputs. This dictionary maps each output we need, in this case content_column, to the name of the column in the output dataset that it should populate.
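For concreteness, here is a small sketch of how the input dataset for this config might be prepared. The URLs are illustrative, config refers to the dictionary shown above, and the AutolabelDataset construction assumes it accepts a pandas DataFrame (or CSV path) together with the config, as in the Autolabel quickstart; double-check the exact signature against the current docs.

import pandas as pd

from autolabel import AutolabelDataset

# Illustrative input data: one row per national park, with the page URL in the
# "url" column that url_column points to in the transform params above.
df = pd.DataFrame(
    {
        "url": [
            "https://en.wikipedia.org/wiki/Yosemite_National_Park",
            "https://en.wikipedia.org/wiki/Zion_National_Park",
        ]
    }
)

# config is the dictionary shown above (as a Python dict, or loaded from a JSON file).
ds = AutolabelDataset(df, config=config)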

Running the transform

from autolabel import LabelingAgent, AutolabelDataset
agent = LabelingAgent(config)
ds = agent.transform(ds)

This runs the transformation. The scraped webpage content is written to the column specified under output_columns in the config; access the underlying dataframe using ds.df on the AutolabelDataset.
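For a quick sanity check of the transform output (the column names below follow the output_columns mapping in the config above):

# The scraped page text now lives in the "content" column (the name we chose
# under output_columns); rows that failed also carry an error column named
# after the transform, e.g. webpage_transform_error.
print(ds.df.columns)
print(ds.df["content"].head())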

Running the labeling job

ds = agent.run(ds)

Simply run the labeling job on the transformed dataset. This will extract the state of the national park from each webpage.
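The labels are written back onto the same dataframe, so you can inspect them the same way; the exact names of the columns autolabel appends depend on the task configuration and library version.

# Inspect the transformed and labeled rows side by side.
print(ds.df.head())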

Output of the transformation labeling run

Custom Transforms

We support the following transforms:

  1. Webpage Transform
  2. PDF Transform

We expect this list to grow in the future, and we need the community's help to build transforms that work best for their data. To make this easy, we provide an abstraction that is simple to use: any new transform just needs to extend the BaseTransform class, as outlined below.

BaseTransform

Bases: ABC

Base class for all transforms.

Source code in src/autolabel/transforms/base.py
class BaseTransform(ABC):
    """Base class for all transforms."""

    TTL_MS = 60 * 60 * 24 * 7 * 1000  # 1 week
    NULL_TRANSFORM_TOKEN = "NO_TRANSFORM"

    def __init__(self, cache: BaseCache, output_columns: Dict[str, Any]) -> None:
        """
        Initialize a transform.
        Args:
            cache: A cache object to use for caching the results of this transform.
            output_columns: A dictionary of output columns. The keys are the names of the output columns as expected by the transform. The values are the column names they should be mapped to in the dataset.
        """
        super().__init__()
        self._output_columns = output_columns
        self.cache = cache

    @staticmethod
    @abstractmethod
    def name() -> str:
        """
        Returns the name of the transform.
        """
        pass

    @property
    def output_columns(self) -> Dict[str, Any]:
        """
        Returns a dictionary of output columns. The keys are the names of the output columns
        as expected by the transform. The values are the column names they should be mapped to in
        the dataset.
        """
        return {k: self._output_columns.get(k, None) for k in self.COLUMN_NAMES}

    @property
    def transform_error_column(self) -> str:
        """
        Returns the name of the column that stores the error if transformation fails.
        """
        return f"{self.name()}_error"

    @abstractmethod
    async def _apply(self, row: Dict[str, Any]) -> Dict[str, Any]:
        """
        Applies the transform to the given row.
        Args:
            row: A dictionary representing a row in the dataset. The keys are the column names and the values are the column values.
        Returns:
            A dictionary representing the transformed row. The keys are the column names and the values are the column values.
        """
        pass

    @abstractmethod
    def params(self) -> Dict[str, Any]:
        """
        Returns a dictionary of parameters that can be used to uniquely identify this transform.
        Returns:
            A dictionary of parameters that can be used to uniquely identify this transform.
        """
        return {}

    async def apply(self, row: Dict[str, Any]) -> Dict[str, Any]:
        if self.cache is not None:
            cache_entry = TransformCacheEntry(
                transform_name=self.name(),
                transform_params=self.params(),
                input=row,
                ttl_ms=self.TTL_MS,
            )
            output = self.cache.lookup(cache_entry)

            if output is not None:
                # Cache hit
                return output

        try:
            output = await self._apply(row)
        except Exception as e:
            logger.error(f"Error applying transform {self.name()}. Exception: {str(e)}")
            output = {
                k: self.NULL_TRANSFORM_TOKEN
                for k in self.output_columns.values()
                if k is not None
            }
            output[self.transform_error_column] = str(e)
            return output

        if self.cache is not None:
            cache_entry.output = output
            self.cache.update(cache_entry)

        return output

    def _return_output_row(self, row: Dict[str, Any]) -> Dict[str, Any]:
        """
        Returns the output row with the correct column names.
        Args:
            row: The output row.
        Returns:
            The output row with the correct column names.
        """
        # remove null key
        row.pop(None, None)
        return row

output_columns: Dict[str, Any] property

Returns a dictionary of output columns. The keys are the names of the output columns as expected by the transform. The values are the column names they should be mapped to in the dataset.

transform_error_column: str property

Returns the name of the column that stores the error if transformation fails.

__init__(cache, output_columns)

Initialize a transform. Args: cache: A cache object to use for caching the results of this transform. output_columns: A dictionary of output columns. The keys are the names of the output columns as expected by the transform. The values are the column names they should be mapped to in the dataset.

Source code in src/autolabel/transforms/base.py
def __init__(self, cache: BaseCache, output_columns: Dict[str, Any]) -> None:
    """
    Initialize a transform.
    Args:
        cache: A cache object to use for caching the results of this transform.
        output_columns: A dictionary of output columns. The keys are the names of the output columns as expected by the transform. The values are the column names they should be mapped to in the dataset.
    """
    super().__init__()
    self._output_columns = output_columns
    self.cache = cache

name() abstractmethod staticmethod

Returns the name of the transform.

Source code in src/autolabel/transforms/base.py
@staticmethod
@abstractmethod
def name() -> str:
    """
    Returns the name of the transform.
    """
    pass

params() abstractmethod

Returns a dictionary of parameters that can be used to uniquely identify this transform. Returns: A dictionary of parameters that can be used to uniquely identify this transform.

Source code in src/autolabel/transforms/base.py
@abstractmethod
def params(self) -> Dict[str, Any]:
    """
    Returns a dictionary of parameters that can be used to uniquely identify this transform.
    Returns:
        A dictionary of parameters that can be used to uniquely identify this transform.
    """
    return {}


_apply() abstractmethod

Applies the transform to the given row. Args: row: A dictionary representing a row in the dataset. The keys are the column names and the values are the column values. Returns: A dictionary representing the transformed row. The keys are the column names and the values are the column values.

Source code in src/autolabel/transforms/base.py
@abstractmethod
async def _apply(self, row: Dict[str, Any]) -> Dict[str, Any]:
    """
    Applies the transform to the given row.
    Args:
        row: A dictionary representing a row in the dataset. The keys are the column names and the values are the column values.
    Returns:
        A dictionary representing the transformed row. The keys are the column names and the values are the column values.
    """
    pass
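To make the abstraction concrete, here is a minimal sketch of a hypothetical custom transform that counts the words in a text column. The class name, parameter names, and the BaseCache import path are illustrative, and registering the transform so it can be referenced by name in a config is not shown. It simply demonstrates the methods a subclass must implement (name, params, _apply) plus the COLUMN_NAMES list that the output_columns property expects.

from typing import Any, Dict

from autolabel.cache.base import BaseCache  # import path assumed for illustration
from autolabel.transforms.base import BaseTransform


class WordCountTransform(BaseTransform):
    """Hypothetical transform that counts the words in a text column."""

    # Keys this transform can produce; output_columns maps them to dataset column names.
    COLUMN_NAMES = ["word_count_column"]

    def __init__(
        self, cache: BaseCache, output_columns: Dict[str, Any], text_column: str
    ) -> None:
        super().__init__(cache, output_columns)
        self.text_column = text_column

    @staticmethod
    def name() -> str:
        return "word_count_transform"

    def params(self) -> Dict[str, Any]:
        # Uniquely identifies this transform and its configuration for caching.
        return {
            "text_column": self.text_column,
            "output_columns": self._output_columns,
        }

    async def _apply(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # Count words in the configured input column and write the result to the
        # dataset column mapped to "word_count_column".
        word_count = len(str(row[self.text_column]).split())
        transformed_row = {self.output_columns["word_count_column"]: word_count}
        return self._return_output_row(transformed_row)

Since apply() checks self.cache is not None before doing any cache lookups, you can pass cache=None while developing a new transform and plug in a real cache later.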