Looping over files to create a dataframe

As part of my NLP project at work, I want to loop over all files in a directory that are either PDF or docx. The end goal is a dataframe with the text content of each file in one column and its filename in another. So I wrote a loop that opens each file, extracts the text, and adds it to a list, as below:

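import os

import docx
import pandas as pd
import PyPDF2


# Word (.docx) files: join the text of every paragraph
def ReadingText(filename):
    doc = docx.Document(filename)
    comp = []
    for par in doc.paragraphs:
        comp.append(par.text)
    return '\n'.join(comp)


# PDF files: concatenate the text extracted from every page
# (these are the PyPDF2 1.x method names; later releases renamed them)
def Readingpdf(pdfname):
    pdfRead = PyPDF2.PdfFileReader(pdfname)
    comp = ""
    for i in range(pdfRead.getNumPages()):
        comp += pdfRead.getPage(i).extractText()
    return comp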

After that, I specify the directory and run the loop as below:

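directory = "my directory"
all_together = []
name = []

for filename in os.listdir(directory):
    # os.listdir returns bare names, so join them with the directory
    path = os.path.join(directory, filename)
    if filename.endswith(".docx"):
        f = ReadingText(path)
        all_together.append(f)
        name.append(filename)
    elif filename.endswith(".pdf"):
        try:
            pdf = Readingpdf(path)
            all_together.append(pdf)
            name.append(filename)
        except Exception:
            # skip PDFs that cannot be read
            pass

# build the dataframe once, after the loop has finished
df = pd.DataFrame({"article": name, "text": all_together})

df.head(10)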

It is doing the job, but I really want to develop myself so I want to know your opinion - be as harsh as you can.
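
For comparison, here is one possible restructuring of the loop, offered as a minimal sketch rather than a definitive answer. It assumes a hypothetical helper, `collect_texts`, that takes a mapping from file suffix to reader function, so each file type is handled by the same code path and a failed file is reported instead of silently dropped. The helper name, the `readers` parameter, and the record layout are assumptions made for illustration; only the standard library is used, so the reader functions are passed in by the caller.

```python
from pathlib import Path
from typing import Callable, Dict, List


def collect_texts(directory: str,
                  readers: Dict[str, Callable[[str], str]]) -> List[dict]:
    """Return one {"article", "text"} record per readable file in `directory`.

    `readers` maps a lowercase suffix such as ".pdf" to a function that
    takes a file path and returns the extracted text (hypothetical design).
    """
    records = []
    for path in sorted(Path(directory).iterdir()):
        reader = readers.get(path.suffix.lower())
        if reader is None or not path.is_file():
            continue  # unsupported extension, or not a regular file
        try:
            text = reader(str(path))
        except Exception as exc:
            # Report the failure so unreadable files do not vanish unnoticed.
            print(f"Skipping {path.name}: {exc}")
            continue
        records.append({"article": path.name, "text": text})
    return records
```

With the functions from the question, the call would presumably look like `collect_texts("my directory", {".docx": ReadingText, ".pdf": Readingpdf})`, and the resulting list of dictionaries can be passed straight to `pd.DataFrame(...)`. Keeping the name and the text in one record also avoids the two parallel lists getting out of step.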

edited Feb 12, 2021 at 4:26 by Jamal

asked Feb 11, 2021 at 8:30 by Sam.H
