Introduction
Autolabel supports transformation of the input data. Input datasets come in many shapes and formats. Transforms help you ingest your data in the format you want, in the way that is most useful for the downstream LLM or labeling task you have in mind. We have tried to make the transforms performant and configurable, and to format their outputs in a way that is useful for the LLM.
Example
Here we show how to run an example transform. We use the Webpage Transform to ingest national park websites and label the state that each national park belongs to. A Jupyter notebook with code that you can run on your own is available, and you can also try this webpage transform yourself in a Colab.
Changes to config
{
    "task_name": "NationalPark",
    "task_type": "question_answering",
    "dataset": {
    },
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo"
    },
    "transforms": [{
        "name": "webpage_transform",
        "params": {
            "url_column": "url"
        },
        "output_columns": {
            "content_column": "content"
        }
    }],
    "prompt": {
        "task_guidelines": "You are an expert at understanding websites of national parks. You will be given a webpage about a national park. Answer with the US State that the national park is located in.",
        "output_guidelines": "Answer in one word the state that the national park is located in.",
        "example_template": "Content of wikipedia page: {content}\nState:"
    }
}
Notice the transforms key in the config. This is where we define our transforms. It is a list, so we can define multiple transforms here; every element of the list is a transform. A transform is a JSON object requiring 3 inputs -
1. name: This tells the agent which transform needs to be loaded. Here we are using the webpage transform.
2. params: This is the set of parameters that will be passed to the transform. Read the documentation of the individual transform to see which params it accepts. Here we pass the url_column, i.e. the column containing the webpages that need to be loaded.
3. output_columns: Each transform can define multiple outputs. In this dictionary we map the output we need, in this case content_column, to the name of the column in the output dataset that we want to populate with it.
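The three keys described above can be checked with a few lines of plain Python before the config is handed to the agent. This is only a minimal sketch: the helper `check_transforms` is hypothetical and not part of Autolabel, and the config shown is trimmed to the `transforms` key.

```python
from typing import Any, Dict, List

# The three inputs every transform entry is expected to carry.
REQUIRED_KEYS = ("name", "params", "output_columns")


def check_transforms(config: Dict[str, Any]) -> List[str]:
    """Return the names of the transforms declared in `config`.

    Hypothetical helper (not an Autolabel API): raises ValueError if any
    entry of the `transforms` list is missing one of the three keys.
    """
    names = []
    for i, transform in enumerate(config.get("transforms", [])):
        missing = [key for key in REQUIRED_KEYS if key not in transform]
        if missing:
            raise ValueError(f"transform #{i} is missing keys: {missing}")
        names.append(transform["name"])
    return names


# Trimmed copy of the config from this guide.
config = {
    "transforms": [
        {
            "name": "webpage_transform",
            "params": {"url_column": "url"},
            "output_columns": {"content_column": "content"},
        }
    ]
}

transform_names = check_transforms(config)
print(transform_names)
```

Because `transforms` is a list, the same loop covers a config that chains several transforms.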
Running the transform
from autolabel import LabelingAgent, AutolabelDataset

# `config` is the config shown above; `ds` is the AutolabelDataset built from your input data.
agent = LabelingAgent(config)
ds = agent.transform(ds)
This runs the transformation. The transformed content now appears in the column named in output_columns. Access it using ds.df in the AutolabelDataset.
Running the labeling job
Simply run the labeling job on the transformed dataset. This will extract the state of the national park from each webpage.
Custom Transforms
We support the following transforms -
- Webpage Transform
- PDF Transform
We expect this list to grow in the future and need the help of the community to build transforms that work best for their data. For this, we provide an abstraction that is easy to use. Any new transform just needs to extend the BaseTransform class, documented below.
BaseTransform
Bases: ABC
Base class for all transforms.
Source code in src/autolabel/transforms/base.py
output_columns: Dict[str, Any] (property)
Returns a dictionary of output columns. The keys are the names of the output columns as expected by the transform. The values are the column names they should be mapped to in the dataset.
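To make the mapping concrete, here is a rough sketch in plain Python of how a transform's own output keys could be renamed to dataset columns. The function `map_outputs`, the `extra_column` key, and the sample values are illustrative assumptions, not Autolabel internals.

```python
from typing import Any, Dict


def map_outputs(
    raw_outputs: Dict[str, Any], output_columns: Dict[str, str]
) -> Dict[str, Any]:
    """Rename transform outputs to the dataset column names chosen in the config.

    Hypothetical helper: outputs not listed in `output_columns` are dropped.
    """
    return {
        dataset_column: raw_outputs[transform_key]
        for transform_key, dataset_column in output_columns.items()
        if transform_key in raw_outputs
    }


# What a transform might produce for one row (made-up values and keys).
raw_outputs = {
    "content_column": "Yosemite National Park is in California.",
    "extra_column": "not requested in the config",
}

# Mapping taken from the config in this guide.
output_columns = {"content_column": "content"}

mapped = map_outputs(raw_outputs, output_columns)
print(mapped)
```

Only the outputs named in the mapping end up in the dataset, under the column names the config asks for.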
transform_error_column: str (property)
Returns the name of the column that stores the error if transformation fails.
__init__(cache, output_columns)
Initialize a transform.
Args:
cache: A cache object to use for caching the results of this transform.
output_columns: A dictionary of output columns. The keys are the names of the output columns as expected by the transform. The values are the column names they should be mapped to in the dataset.
name() (abstractmethod, staticmethod)
params() (abstractmethod)
Returns a dictionary of parameters that can be used to uniquely identify this transform.
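One plausible use for such identifying parameters is building a cache key, so that the same transform with the same settings on the same row is not recomputed. The sketch below is an assumption about how that could work, not a description of Autolabel's cache; `transform_cache_key` is a hypothetical helper.

```python
import hashlib
import json
from typing import Any, Dict


def transform_cache_key(name: str, params: Dict[str, Any], row: Dict[str, Any]) -> str:
    """Build a deterministic key from the transform name, its params, and the input row."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(
        {"name": name, "params": params, "row": row}, sort_keys=True, default=str
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


row = {"url": "https://example.com"}
key_a = transform_cache_key("webpage_transform", {"url_column": "url"}, row)
key_b = transform_cache_key("webpage_transform", {"url_column": "url"}, row)
key_c = transform_cache_key("webpage_transform", {"url_column": "link"}, row)

# Same name, params, and row give the same key; changing a param changes it.
print(key_a == key_b, key_a == key_c)
```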
_apply() (abstractmethod)
Applies the transform to the given row.
Args:
row: A dictionary representing a row in the dataset. The keys are the column names and the values are the column values.
Returns:
A dictionary representing the transformed row. The keys are the column names and the values are the column values.
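Putting the pieces of the interface together, a custom transform could look roughly like the sketch below. To keep it runnable without Autolabel installed, it uses a stand-in class, `BaseTransformSketch`, that only mirrors the members documented above; it is not the real `BaseTransform`. The toy `UppercaseTransform`, its `text_column` parameter, the error column naming scheme, and the choice to make `_apply` a coroutine are all assumptions for illustration; check base.py for the actual signatures.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class BaseTransformSketch(ABC):
    """Stand-in mirroring the documented interface (not the real BaseTransform)."""

    def __init__(self, cache: Optional[Any], output_columns: Dict[str, Any]) -> None:
        self.cache = cache
        self._output_columns = output_columns

    @property
    def output_columns(self) -> Dict[str, Any]:
        return self._output_columns

    @property
    def transform_error_column(self) -> str:
        # Assumed naming scheme for the error column.
        return f"{self.name()}_error"

    @staticmethod
    @abstractmethod
    def name() -> str:
        ...

    @abstractmethod
    def params(self) -> Dict[str, Any]:
        ...

    @abstractmethod
    async def _apply(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # Assumption: modeled as a coroutine; the reference only says "abstract".
        ...


class UppercaseTransform(BaseTransformSketch):
    """Toy transform: upper-cases the text found in `text_column`."""

    def __init__(self, cache: Optional[Any], output_columns: Dict[str, Any], text_column: str) -> None:
        super().__init__(cache, output_columns)
        self.text_column = text_column

    @staticmethod
    def name() -> str:
        return "uppercase_transform"

    def params(self) -> Dict[str, Any]:
        return {"text_column": self.text_column, "output_columns": self.output_columns}

    async def _apply(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # Write the result under the dataset column chosen in the config.
        out_col = self.output_columns["content_column"]
        return {out_col: str(row[self.text_column]).upper()}


transform = UppercaseTransform(
    cache=None, output_columns={"content_column": "content"}, text_column="text"
)
result = asyncio.run(transform._apply({"text": "yosemite"}))
print(result)
```

A real transform would subclass `BaseTransform` from `autolabel` instead of the stand-in, and would be referenced from the config by the string its `name()` returns.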