As part of my NLP project at work, I want to loop over all files in a directory that are either PDF or docx. The end goal is to create a dataframe with the text content of each file in one column and the filename in another. So I wrote code to open the files, extract the text, and add it to a list, as below:
import os

import docx
import pandas as pd
import PyPDF2
# Word files
def ReadingText(filename):
    doc = docx.Document(filename)
    comp = []
    for par in doc.paragraphs:
        comp.append(par.text)
    return '\n'.join(comp)

# PDF files
# Note: these are the PyPDF2 1.x names. In PyPDF2 >= 3.0 they were removed in
# favour of PdfReader, len(reader.pages) and page.extract_text().
def Readingpdf(pdfname):
    pdfRead = PyPDF2.PdfFileReader(pdfname)
    comp = ""
    for i in range(pdfRead.getNumPages()):
        comp += pdfRead.getPage(i).extractText()
    return comp
After that, I specify the directory and run the loop as below:
directory = "my directory"
all_together = []
name = []
for filename in os.listdir(directory):
    # os.listdir returns bare names, so join them with the directory;
    # otherwise the files are only found when it is the working directory.
    path = os.path.join(directory, filename)
    if filename.endswith(".docx"):
        f = ReadingText(path)
        all_together.append(f)
        name.append(filename)
    elif filename.endswith(".pdf"):
        try:
            pdf = Readingpdf(path)
            all_together.append(pdf)
            name.append(filename)
        except:
            pass
df = pd.DataFrame({"article": name, "text": all_together})
df.head(10)
It does the job, but I really want to improve, so I would like your opinion. Be as harsh as you can.
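For comparison, here is a minimal sketch of one way the directory loop could be restructured. It is not a definitive implementation. The names `collect_texts`, `readers`, `failures` and `demo` are hypothetical and do not come from the code above. The sketch assumes each reader is a function that takes a file path and returns a string, selects the reader by file suffix (case-insensitively), and records which files failed instead of silently skipping them. The demo uses stand-in readers so that it runs without python-docx or PyPDF2 installed.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def collect_texts(
    directory: str,
    readers: Dict[str, Callable[[str], str]],
) -> Tuple[List[str], List[str], List[Tuple[str, str]]]:
    """Return (names, texts, failures) for files whose suffix has a reader."""
    names: List[str] = []
    texts: List[str] = []
    failures: List[Tuple[str, str]] = []
    for path in sorted(Path(directory).iterdir()):
        # Pick the reader from the lower-cased suffix, e.g. ".PDF" -> ".pdf".
        reader = readers.get(path.suffix.lower())
        if reader is None or not path.is_file():
            continue
        try:
            text = reader(str(path))
        except Exception as exc:
            # Keep a record of the failure rather than hiding it.
            failures.append((path.name, repr(exc)))
            continue
        names.append(path.name)
        texts.append(text)
    return names, texts, failures


def demo():
    """Exercise collect_texts with stand-in readers on a temporary directory."""

    def read_plain(path: str) -> str:
        with open(path, encoding="utf-8") as handle:
            return handle.read()

    def always_fails(path: str) -> str:
        raise ValueError("unreadable file")

    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.docx").write_text("hello", encoding="utf-8")
        Path(tmp, "b.PDF").write_text("broken", encoding="utf-8")
        Path(tmp, "c.txt").write_text("ignored", encoding="utf-8")
        return collect_texts(tmp, {".docx": read_plain, ".pdf": always_fails})
```

With the functions from the post, the call would look like `collect_texts(directory, {".docx": ReadingText, ".pdf": Readingpdf})`, and the first two returned lists could be passed to `pd.DataFrame({"article": names, "text": texts})`. Keeping the `failures` list makes it visible when a PDF could not be parsed.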