Hello. I have the following use case: I would like to generate a Knowledge Graph, but starting from existing embedded chunks.

I have been reading about how the indexing process is structured. Essentially, for my use case, I want to skip the step where we start from text Documents and create TextUnits.

Does the config allow for populating TextUnits directly, or how would one adapt the code to achieve this? Thank you in advance for all the help you can give me.

Replies: 3 comments · 3 replies

This would be really neat, as you could combine it with existing vector databases.


Please see the response here: #396

@gianpycea
Hey, thank you so much for your reply, and yes, that is essentially the same question. To be honest, though, I still don't see in practice how one would write the script to bring in my own chunks (for the moment I'll assume I don't need to bring in the embeddings of the chunk content and am happy to use whatever GraphRAG uses).

I have been following the documentation, and I can use GraphRAG in its "basic way", where you run the indexing code starting from a txt document.

If I want to skip the chunking phase and start from my own chunks, what is the best way to do this? Do I need to create my own workflow, or is there another way?

I think a deeper dive into the code base, or a minimal script showing how one would use GraphRAG for this use case, would be helpful; it's unclear to me how to achieve this from the docs alone.

Thanks in advance!

@natoverse
If you want to skip our chunking and start from your own chunks, replace the starting txt document with individual txt documents, one per chunk. As long as each of those documents is shorter than your configured GraphRAG chunk_size, our chunking will be skipped and your chunks will be used directly. The actual process/config for running GraphRAG in this case does not change at all; you have just supplied a different set of input documents.
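As a sketch of that approach (the `ragtest/input` path, file naming, and chunk list here are placeholder assumptions; point it at whatever input folder your settings use):

```python
from pathlib import Path

# Hypothetical pre-made chunks; replace with your own chunk strings.
chunks = [
    "First chunk of text.",
    "Second chunk of text.",
]

# Assumed input folder; use whatever your GraphRAG config points at.
input_dir = Path("ragtest/input")
input_dir.mkdir(parents=True, exist_ok=True)

# One .txt file per chunk: GraphRAG reads each file as a document, and any
# document shorter than chunk_size passes through chunking unchanged.
for i, chunk in enumerate(chunks):
    (input_dir / f"chunk_{i:05d}.txt").write_text(chunk, encoding="utf-8")
```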

If you need to create your chunks, I would suggest using tiktoken, which has encode/decode methods to match the encoding of your model. So your script would encode the document into a list of tokens, iterate through the tokens to subdivide them into sublists that are shorter than your chunk_size, and then decode those token lists back to text that you can write to a file.

@SS8816
Hey, could we pass .faiss and .pkl embedded versions of our .txt files in the input folder? Will that work, or does the input need to be .txt, .json, etc.?


I have a similar use case - but with a lexical graph in a Neo4j graph database which has been created from multiple documents.

The input step in GraphRAG only takes in plain text strings (at least in the documentation I have found, including the CSV example). It would be cool to learn how to run GraphRAG on an existing lexical property graph.

In my case, my graph contains both section and table-row-data nodes with properties such as an id and text. To run the index engine on each node and let it extract text units, entities, communities, etc. would be MEGA. For small-token nodes (like table-row-data nodes) the relation between a node and a text unit would be 1:1, whereas with a larger section node (containing larger pieces of text) it would be 1:many.
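One way to bridge that today, following the file-per-chunk suggestion above, is to export each node's text property to its own file and run GraphRAG on those. A minimal sketch, assuming a hypothetical `Section` label with `id` and `text` properties (adjust the Cypher to your schema):

```python
from pathlib import Path


def fetch_node_texts(uri: str, user: str, password: str) -> list[tuple[str, str]]:
    """Pull (id, text) pairs from Neo4j. Label and property names are assumptions."""
    from neo4j import GraphDatabase  # pip install neo4j; imported lazily here

    driver = GraphDatabase.driver(uri, auth=(user, password))
    with driver.session() as session:
        result = session.run("MATCH (n:Section) RETURN n.id AS id, n.text AS text")
        return [(record["id"], record["text"]) for record in result]


def write_chunks(rows: list[tuple[str, str]], out_dir: str) -> None:
    """Write each node's text to its own .txt file for GraphRAG's input folder."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for node_id, text in rows:
        (out / f"{node_id}.txt").write_text(text, encoding="utf-8")
```

Small table-row-data nodes would then map 1:1 to text units automatically; larger section nodes would still need splitting under chunk_size first (or be re-chunked by GraphRAG).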
