Description
SDG's instructlab.sdg.taxonomy has a method called _read_taxonomy_file which returns a value called seed_instruction_data, a list of dictionaries of this form:
{
    "questions_and_answers": question_answer_list,
    "context": context,
    "taxonomy_path": tax_path,
    "documents": document_contents,
    "filepaths": doc_filepaths,
    "domain": domain,
    "document_outline": contents.get("document_outline"),
}
Any code that consumes this list needs to know the string key for each field. The effect is essentially the same as using a Python class, but without the type-checking advantages of a real class with named fields. It would be better to represent this as a class, perhaps using typing.NamedTuple, which is a convenient way to define a class with a simple list of fields.
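A minimal sketch of what the NamedTuple version could look like. The class name SeedInstruction, the helper from_legacy_dict, and all of the field types are assumptions made for illustration; only the field names come from the dictionary above.

```python
from typing import Any, Dict, List, NamedTuple, Optional


class SeedInstruction(NamedTuple):
    """Hypothetical replacement for one seed_instruction_data entry.

    Field names mirror the existing dictionary keys; the types are guesses.
    """

    questions_and_answers: List[Dict[str, str]]
    context: str
    taxonomy_path: str
    documents: Optional[List[str]]
    filepaths: Optional[List[str]]
    domain: Optional[str]
    document_outline: Optional[str]


def from_legacy_dict(entry: Dict[str, Any]) -> SeedInstruction:
    """Build a SeedInstruction from the current dictionary form.

    Raises TypeError if a key is missing or unexpected, so a mismatch is
    caught where the object is built rather than where a field is read.
    """
    return SeedInstruction(**entry)
```

Consumers would then read item.domain instead of item["domain"], so a misspelled field name can be flagged by a type checker or IDE. During a migration, item._asdict() gives back the dictionary form for callers that have not been updated yet.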
Alternatively, we could at least replace the string labels with constants, but that seems like a less robust solution.
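For comparison, the constants approach might look like the sketch below. The constant names and the summarize function are hypothetical. This removes the scattered string literals, but each entry is still an untyped dictionary, which is why it seems less robust.

```python
from typing import Any, Dict

# Hypothetical module-level constants, one per key of the dictionary above.
QUESTIONS_AND_ANSWERS = "questions_and_answers"
CONTEXT = "context"
TAXONOMY_PATH = "taxonomy_path"
DOCUMENTS = "documents"
FILEPATHS = "filepaths"
DOMAIN = "domain"
DOCUMENT_OUTLINE = "document_outline"


def summarize(item: Dict[str, Any]) -> str:
    """Hypothetical consumer that looks fields up through the constants.

    The value types are still unchecked, and nothing guarantees that
    every key is present in the dictionary.
    """
    return f"{item[TAXONOMY_PATH]} ({item[DOMAIN]})"
```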
Note that this method is called by instructlab/rag/taxonomy_utils.py in core, so changing this code will also require corresponding changes in core. It would therefore be easier to do this after the SDG preprocessing also moves to core, so that the change is contained in one repo.
Acceptance Criteria
- The _read_taxonomy_file method returns a list of structured objects with named fields instead of a list of dictionaries with hard-coded strings.
- All consumers of this method are updated to use these objects.