\$\begingroup\$

As part of my NLP project at work, I want to loop over all files in the same directory that are either PDF or docx. The end goal is to create a dataframe with the text content of each file in one column and the filename in another. So I wrote a loop that opens each file, extracts the text, and adds it to a list, as below:

import os

import docx
import pandas as pd
import PyPDF2


# Word files
def ReadingText(filename):
    doc = docx.Document(filename)
    comp = []
    for par in doc.paragraphs:
        comp.append(par.text)
    return '\n'.join(comp)


# PDF files (PdfFileReader, getNumPages, getPage and extractText are the
# pre-3.0 PyPDF2 names)
def Readingpdf(pdfname):
    pdfRead = PyPDF2.PdfFileReader(pdfname)
    comp = ""
    for i in range(pdfRead.getNumPages()):
        comp += pdfRead.getPage(i).extractText()
    return comp

After that, I specify the directory and run the loop as below:

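directory = "my directory"
all_together = []
name = []

for filename in os.listdir(directory):
    # os.listdir returns bare names, so join each one with the directory
    path = os.path.join(directory, filename)
    if filename.endswith(".docx"):
        all_together.append(ReadingText(path))
        name.append(filename)
    elif filename.endswith(".pdf"):
        try:
            pdf = Readingpdf(path)
        except Exception:
            # skip PDFs that cannot be read
            continue
        all_together.append(pdf)
        name.append(filename)

# build the dataframe once, after the loop has finished
df = pd.DataFrame({"article": name, "text": all_together})
df.head(10)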
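
For comparison, here is a rough sketch of how the same loop might be organized around a mapping from file suffix to reader function. The names `collect_texts`, `readers`, and `failed` are hypothetical and not part of `docx` or `PyPDF2`. The sketch assumes each reader takes a path and returns a string, as the two functions above do. It also records which files failed instead of silently skipping them.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def collect_texts(
    directory: str,
    readers: Dict[str, Callable[[str], str]],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (rows, failed): rows are (filename, text) pairs,
    failed lists the names of files whose reader raised an error."""
    rows = []
    failed = []
    for path in sorted(Path(directory).iterdir()):
        # pick a reader by lower-cased suffix, e.g. ".PDF" -> ".pdf"
        reader = readers.get(path.suffix.lower())
        if reader is None or not path.is_file():
            continue  # not a file type we know how to read
        try:
            rows.append((path.name, reader(str(path))))
        except Exception:
            failed.append(path.name)  # remember the failure instead of hiding it
    return rows, failed


# Possible usage with the functions above (assumed, not run here):
# rows, failed = collect_texts(directory, {".docx": ReadingText, ".pdf": Readingpdf})
# df = pd.DataFrame(rows, columns=["article", "text"])
```

Keeping the filename and text together in one tuple avoids the two parallel lists drifting out of step, and adding a new file type would only need one more entry in the mapping.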

It does the job, but I really want to improve, so I would like your opinion. Be as harsh as you can.

\$\endgroup\$
