
Any solution for this? I am reading Markdown-format data from a local directory, in binary and streaming mode.
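For reference, a minimal sketch of the reader described above (the `./content` path and the `with_metadata` flag are assumptions; the format and mode are as stated):

```python
import pathway as pw

# Hypothetical setup: watch a local directory of .md files as raw bytes.
docs = pw.io.fs.read(
    "./content",         # assumed location of the Markdown files
    format="binary",     # read each file as raw bytes
    mode="streaming",    # keep watching the directory for new/changed files
    with_metadata=True,  # expose path, modification time, etc. per file
)
```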


Replies: 2 comments · 12 replies

@berkecanrizai

Hey @sarim2000, could you tell us about your hardware, which app you are running (one of the provided ones or a custom one), whether you are running on Docker, your Pathway version, and the modules you are using (such as parser, splitter, OpenAI embedder, etc.)?
Thanks,

8 replies
@sarim2000

> I briefly checked your app; I think you can return a tuple (string, filename) from the function, then store them in two separate columns, and later write to the file with pw.io.subscribe based on row values.

I did not get this point. Can you show me an example?

@sarim2000

> I think this is very likely the result of IO writes in the UDF (user-defined function). The functions you put into doc_post_processors are converted to pw.UDFs and applied to tables inside the VectorStoreServer.

And why is this given as an option if it creates this issue? Am I using it wrong?
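For context on the quoted point, here is a rough sketch of how a post-processor gets wired in. The doc_post_processors parameter and the (text, metadata) -> (text, metadata) shape follow the pathway.xpacks.llm VectorStoreServer; the add_date body and the embedder are illustrative assumptions. Any file write placed inside such a function runs inside a pw.UDF, once per document:

```python
import pathway as pw
from pathway.xpacks.llm.vector_store import VectorStoreServer

def add_date(text: str, metadata: dict) -> tuple[str, dict]:
    # A doc post-processor receives (text, metadata) and returns the same shape.
    # Avoid file writes here: this body becomes a pw.UDF applied per document.
    metadata["published"] = "2024-01-01"  # illustrative value only
    return text, metadata

docs = pw.io.fs.read("./content", format="binary", mode="streaming", with_metadata=True)

vector_server = VectorStoreServer(
    docs,
    embedder=embedder,               # assumed to be defined elsewhere
    doc_post_processors=[add_date],  # converted to pw.UDFs internally
)
```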

@sarim2000

More specifically, taking this RAG example, can you tell me the flow so this doesn't cause max CPU usage?

I did not get your answer. More specifically, what should the flow be? As I am using pw.io.fs.read, should I make changes there to add an additional metadata column using pw.io.csv, then add the vector server and not pass doc_post_processors into it?

@sarim2000

Also, I removed writing to file inside the doc_post_processor functions, and it's still hogging CPU resources. I am just extracting metadata and saving it.

@berkecanrizai

If this is happening even after removing the file save, it is not expected; I will try to replicate and get back to you.
That much CPU usage is not normal without any IO writes in the UDFs.

> I did not get this point. Can you show me an example?

I meant something like this (removing the doc_post_processors and keeping the rest of the code the same):

```python
import pathway as pw

def save_file_callback(key, row, time, is_addition):
    # Runs in the subscriber, outside the UDFs, once per change to result_table.
    file_name = row["filename"].value
    write_success_metadata_files(row["date"], file_name)

@pw.udf
def extract_published_date(doc) -> tuple[str, str]:
    data: dict = doc.as_dict()
    text = data["text"]
    metadata = data["metadata"]
    file_name = metadata["path"]
    ...  # date-parsing logic elided here; it produces parsed_date from text
    return parsed_date, file_name

docs_table = vector_server._graph["parsed_docs"]

# Add the (date, filename) tuple as a new column on the parsed documents.
docs_table += docs_table.select(extract_tup=extract_published_date(pw.this.data))

# Unpack the tuple into two separate columns.
result_table = docs_table.select(date=pw.this.extract_tup[0], filename=pw.this.extract_tup[1])

# Write the metadata file from a subscriber callback instead of inside a UDF.
pw.io.subscribe(result_table, on_change=save_file_callback)
```
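For reference, pw.io.subscribe calls on_change once for every change to a row of result_table, so the file write in save_file_callback happens in the subscriber rather than inside a UDF that the engine runs per document.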
@KamilPiechowiak

Hey @sarim2000,
thank you for your report. To determine whether it is a problem on the Pathway side, please run an experiment not involving Pathway: read the files one by one, apply your processing logic, and see how long it takes.
You can simply read all the files in your ./content directory and apply extract_published_date to each of them.
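A minimal sketch of that experiment, assuming the date-extraction logic can be lifted out of the UDF into a plain function (parse_date below is a hypothetical stand-in for it):

```python
import time
from pathlib import Path

def parse_date(text: str) -> str:
    ...  # the same date-extraction logic used in extract_published_date

start = time.perf_counter()
count = 0
for path in Path("./content").rglob("*.md"):
    parse_date(path.read_text())
    count += 1
print(f"processed {count} files in {time.perf_counter() - start:.2f}s")
```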

4 replies
@sarim2000

Even after removing the doc_post_processor functions, Pathway still takes up 130% CPU on average after all ingestion is done.

@KamilPiechowiak

Ok, thanks for the info. I'll investigate and get back to you once I have some conclusions.

@sarim2000

Ya, cool. Buzz me if you guys need additional info.

@KamilPiechowiak

Hey @sarim2000, one fix is already out. If you install Pathway v0.13.2, CPU usage in the idle state should drop significantly (in our tests it's at 4% now). We're working on decreasing it even further, but we hope this fix already makes the app usable for you.
