
Any solution for this? I am reading Markdown-format data from a local directory, in binary and streaming mode.
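For reference, a minimal sketch of the reader described above (the `./content` path and the `with_metadata` flag are assumptions; the format and mode are as stated):

```python
import pathway as pw

# Hypothetical setup: watch a local directory of .md files as raw bytes.
docs = pw.io.fs.read(
    "./content",         # assumed location of the Markdown files
    format="binary",     # read each file as raw bytes
    mode="streaming",    # keep watching the directory for new/changed files
    with_metadata=True,  # expose path, modification time, etc. per file
)
```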


Replies: 2 comments · 12 replies

@berkecanrizai

Hey @sarim2000, could you tell us about your hardware, which app you are running (one of the provided ones or a custom one), whether you are running on Docker, your Pathway version, and the modules you are using (such as parser, splitter, OpenAI embedder, etc.)?
Thanks,

8 replies
@sarim2000

> I briefly checked your app; I think you can return a tuple (string, filename) from the function, then store them in two separate columns, and later write to the file with pw.io.subscribe based on row values.

I did not get this point. Can you show me an example?

@sarim2000

> I think this is very likely the result of IO writes in the UDF (user-defined function). The functions you put into doc_post_processors are converted to pw.UDFs and applied to tables inside the VectorStoreServer.

And why is this given as an option if it creates this issue? Am I using it wrong?
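For context on the quoted point, here is a rough sketch of how a post-processor gets wired in. The doc_post_processors parameter and the (text, metadata) -> (text, metadata) shape follow the pathway.xpacks.llm VectorStoreServer; the add_date body and the embedder are illustrative assumptions. Any file write placed inside such a function runs inside a pw.UDF, once per document:

```python
import pathway as pw
from pathway.xpacks.llm.vector_store import VectorStoreServer

def add_date(text: str, metadata: dict) -> tuple[str, dict]:
    # A doc post-processor receives (text, metadata) and returns the same shape.
    # Avoid file writes here: this body becomes a pw.UDF applied per document.
    metadata["published"] = "2024-01-01"  # illustrative value only
    return text, metadata

docs = pw.io.fs.read("./content", format="binary", mode="streaming", with_metadata=True)

vector_server = VectorStoreServer(
    docs,
    embedder=embedder,               # assumed to be defined elsewhere
    doc_post_processors=[add_date],  # converted to pw.UDFs internally
)
```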

@sarim2000

More specifically, taking this RAG example, can you tell me the flow so this doesn't cause max CPU usage?

I did not get your answer. More specifically, what should the flow be? As I am using pw.io.fs.read, should I make changes there to add an additional metadata column using pw.io.csv, then add the vector server and not pass doc_post_processors into it?

@sarim2000

Also, I removed writing to file inside the doc_post_processor functions, and it's still hogging CPU resources. I am just extracting metadata and saving it.

@berkecanrizai

If this is happening even after removing the file save, it is not expected; I will try to replicate and get back to you.
That much CPU usage is not normal without any IO writes in the UDFs.

> I did not get this point. Can you show me an example?

I meant something like this (removing the doc_post_processors and keeping the rest of the code the same):

```python
import pathway as pw

def save_file_callback(key, row, time, is_addition):
    # Runs in the subscriber, outside the UDFs, once per change to result_table.
    file_name = row["filename"].value
    write_success_metadata_files(row["date"], file_name)

@pw.udf
def extract_published_date(doc) -> tuple[str, str]:
    data: dict = doc.as_dict()
    text = data["text"]
    metadata = data["metadata"]
    file_name = metadata["path"]
    ...  # date-parsing logic elided here; it produces parsed_date from text
    return parsed_date, file_name

docs_table = vector_server._graph["parsed_docs"]

# Add the (date, filename) tuple as a new column on the parsed documents.
docs_table += docs_table.select(extract_tup=extract_published_date(pw.this.data))

# Unpack the tuple into two separate columns.
result_table = docs_table.select(date=pw.this.extract_tup[0], filename=pw.this.extract_tup[1])

# Write the metadata file from a subscriber callback instead of inside a UDF.
pw.io.subscribe(result_table, on_change=save_file_callback)
```
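For reference, pw.io.subscribe calls on_change once for every change to a row of result_table, so the file write in save_file_callback happens in the subscriber rather than inside a UDF that the engine runs per document.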
@KamilPiechowiak

Hey @sarim2000,
thank you for your report. To determine whether it is a problem on the Pathway side, please run an experiment not involving Pathway: read the files one by one, apply your processing logic, and see how long it takes.
You can simply read all the files in your ./content directory and apply extract_published_date to each of them.
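A minimal sketch of that experiment, assuming the date-extraction logic can be lifted out of the UDF into a plain function (parse_date below is a hypothetical stand-in for it):

```python
import time
from pathlib import Path

def parse_date(text: str) -> str:
    ...  # the same date-extraction logic used in extract_published_date

start = time.perf_counter()
count = 0
for path in Path("./content").rglob("*.md"):
    parse_date(path.read_text())
    count += 1
print(f"processed {count} files in {time.perf_counter() - start:.2f}s")
```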

4 replies
@sarim2000

Even after removing the doc_post_processor functions, Pathway still takes up 130% CPU on average after all ingestion is done.

@KamilPiechowiak

Ok, thanks for the info. I'll investigate and get back to you once I have some conclusions.

@sarim2000

Ya, cool. Buzz me if you guys need additional info.

@KamilPiechowiak

Hey @sarim2000, one fix is already out. If you install Pathway v0.13.2, CPU usage in the idle state should drop significantly (in our tests it's at 4% now). We're working on decreasing it even further, but we hope this fix already makes the app usable for you.
