add langchain for chunking#772

Merged
anik120 merged 2 commits into instructlab:main from aartij22:multi-doc-support
Apr 4, 2024

aartij22 (Contributor) commented Apr 1, 2024

Changes

Which issue is resolved by this Pull Request:
Resolves #750

Description of your changes:
Knowledge documents auto-chunking support

The function accepts a chunk_word_count value (default: 1500) from the user in either of two ways:

  1. CLI option
    lab generate --chunk-word-count 2000
  2. config.yaml
    generate:
      model: merlinite-7b-Q4_K_M
      num_cpus: 10
      num_instructions: 100
      output_dir: generated
      prompt_file: prompt.txt
      seed_file: seed_tasks.json
      taxonomy_base: origin/main
      taxonomy_path: /Users/aajha/Desktop/knowledge-docs-git/taxonomy/
      chunk_word_count: 2000
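
For readers who want a concrete picture of the two inputs above, here is a minimal sketch of word-count chunking in plain Python. It is not the code merged in this PR (which relies on langchain for splitting). The helper names `resolve_chunk_word_count` and `chunk_document` are hypothetical, and the precedence order (CLI value, then config.yaml, then the default of 1500) is an assumption made for illustration.

```python
DEFAULT_CHUNK_WORD_COUNT = 1500  # default stated in the PR description


def resolve_chunk_word_count(cli_value=None, config=None):
    """Pick the chunk size: CLI value first, then config, then the default.

    The precedence order is an assumption, not confirmed by the PR.
    `config` is a dict shaped like the parsed config.yaml shown above.
    """
    if cli_value is not None:
        return cli_value
    generate_cfg = (config or {}).get("generate", {})
    return generate_cfg.get("chunk_word_count", DEFAULT_CHUNK_WORD_COUNT)


def chunk_document(text, chunk_word_count=DEFAULT_CHUNK_WORD_COUNT):
    """Split text into chunks of at most chunk_word_count whitespace-separated words."""
    if chunk_word_count <= 0:
        raise ValueError("chunk_word_count must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_word_count])
        for i in range(0, len(words), chunk_word_count)
    ]


if __name__ == "__main__":
    config = {"generate": {"chunk_word_count": 2000}}
    size = resolve_chunk_word_count(cli_value=None, config=config)
    chunks = chunk_document("word " * 4500, size)
    print(size, [len(c.split()) for c in chunks])  # 2000 [2000, 2000, 500]
```

A plain word split like this ignores sentence and paragraph boundaries. A recursive text splitter is typically preferred because it tries to break on natural separators first.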

Review comment on cli/generator/generate_data.py (outdated, resolved)
xukai92 (Member) commented Apr 2, 2024

I tested it locally and am happy with it.
@anik120 if you can help with the functional tests, we can move forward with merging it.

xukai92 (Member) commented Apr 3, 2024

If it is langchain that causes the issue (e.g. by forcing some dependencies to change), we could try one of its sub-packages instead: https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter

Signed-off-by: aajha <aajha@redhat.com>
aartij22 (Contributor, Author) commented Apr 3, 2024

@xukai92 I replaced langchain with langchain-text-splitters, but I'm still getting the same error.

Signed-off-by: aajha <aajha@redhat.com>
xukai92 (Member) commented Apr 3, 2024

@anik120 is there a way we can see the exact deps (e.g. pip list) that the functional test is running with?
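
One way to capture that information from inside the test environment is sketched below, using only the standard library. It is an illustration, not something the project's CI is known to run. The helper name `installed_packages` is hypothetical, and the output mimics the `name==version` format of `pip freeze`.

```python
from importlib import metadata


def installed_packages():
    """Return {distribution name: version} for the current environment, sorted by name."""
    pkgs = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing or broken metadata
            pkgs[name] = dist.version
    return dict(sorted(pkgs.items(), key=lambda kv: kv[0].lower()))


if __name__ == "__main__":
    for name, version in installed_packages().items():
        print(f"{name}=={version}")
```

Running this as a step just before the functional tests would log the resolved dependency set, which can then be compared against a local environment.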

anik120 (Contributor) commented Apr 4, 2024

@xukai92 now that you found the issue, and you're happy with this PR after testing it locally, let's merge this. I tested it locally too and it checks out.

Thanks a lot @aartij22 🎉 and sorry for the hair-splitting test issue (@xukai92 got on the case 🎉)

anik120 merged commit 2bb6ef9 into instructlab:main on Apr 4, 2024
anik120 added a commit that referenced this pull request Apr 4, 2024
Follow up to #772

Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>

Development

Successfully merging this pull request may close these issues.

auto-chunking support for knowldge

3 participants
