In a folder I have 10 PDF files and 5 Word files (doc, docx). I would like to know how I can create a table in Python with two columns:
ID of the file
Text of the PDF or Word file
Thanks for your help
For reading the PDFs, you can use this library: https://pypi.org/project/PyPDF2/
For docx files, this library: https://github.com/ankushshah89/python-docx2txt
By file ID, do you simply mean the filename? You can use the following code to fetch a list of files and folders in the current working directory:
import os
filelist = os.listdir()
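A minimal sketch of how the pieces could fit together, assuming PyPDF2 3.x, docx2txt, and pandas are installed, and that the file ID is just the filename (note that docx2txt reads .docx but not the older binary .doc format):

import os
import docx2txt
import pandas as pd
from PyPDF2 import PdfReader

rows = []
for filename in os.listdir():
    if filename.lower().endswith(".pdf"):
        reader = PdfReader(filename)
        # extract_text() may return None for pages without a text layer
        text = "".join(page.extract_text() or "" for page in reader.pages)
    elif filename.lower().endswith(".docx"):
        text = docx2txt.process(filename)
    else:
        continue  # skip .doc and any other file types
    rows.append({"ID": filename, "Text": text})

df = pd.DataFrame(rows, columns=["ID", "Text"])
print(df)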
I have a folder C:\test_xml with a list of XML files. I want to get all the XML file names and store them in a CSV file, xml_file.csv. I am trying the Python code below but don't know how to proceed, as I am quite new to Python.
import os
import glob

# raw string so that \t in the path is not read as a tab character
files = list(glob.glob(os.path.join(r'C:\temp', '*.xml')))
print(files)
A way to get a list of only the filenames:
import pathlib
files = [file.name for file in pathlib.Path(r"C:\temp").glob("*.xml")]
The documentation for the csv module has some examples of how to write a .csv file.
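For instance, a short sketch writing the collected names to xml_file.csv (the single header column is an assumption):

import csv
import pathlib

files = [file.name for file in pathlib.Path(r"C:\temp").glob("*.xml")]

with open("xml_file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename"])                # header row
    writer.writerows([name] for name in files)   # one filename per row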
I have many folders, each containing a couple of PDF files (other file types like .xlsx or .doc are there as well). My goal is to extract the text of each PDF and build a data frame where each record is the folder name and each column holds the text content of one PDF file in that folder, as a string.
I managed to extract text from one PDF file with the tika package (code below), but I cannot work out how to loop over the other PDFs in the folder, or over the other folders, to construct a structured dataframe.
# import the parser object from tika
from tika import parser

# open the pdf file
parsed_pdf = parser.from_file("document_1.pdf")

# save the content of the pdf
# parsed_pdf['content'] returns a string
data = parsed_pdf['content']

# print the content
print(data)
print(type(data))  # <class 'str'>
The desired output should look like this:

Folder_Name | pdf1             | pdf2
17534       | text of the pdf1 | text of the pdf 2
63546       | text of the pdf1 | text of the pdf1
26374       | text of the pdf1 | -
If you want to find all the PDFs in a directory and its subdirectories, you can combine os.walk and glob; see Recursive sub folder search and return files in a list python. I've gone for a slightly longer form so it is easier for beginners to follow what is happening.
Then, for each file, call Apache Tika and save the text to the next row of a pandas DataFrame:
#!/usr/bin/python3
import os, glob
from tika import parser
from pandas import DataFrame

# What file extension to find, and where to look from
ext = "*.pdf"
PATH = "."

# Find all the files with that extension
files = []
for dirpath, dirnames, filenames in os.walk(PATH):
    files += glob.glob(os.path.join(dirpath, ext))

# Create a pandas DataFrame to hold the filenames and the text
df = DataFrame(columns=("filename", "text"))

# Process each file in turn, parsing with Tika and storing in the DataFrame
for idx, filename in enumerate(files):
    data = parser.from_file(filename)
    text = data["content"]
    df.loc[idx] = [filename, text]

# For debugging, print what we found
print(df)
It is extremely easy to get a list of all the PDFs on Unix:

import os

# save the paths of all pdfs in a single string
# (the [2:-1] slice strips the leading "./" and the trailing newline)
a = os.popen(r"du -a|awk '{print $2}'|grep '.*\.pdf$'").read()[2:-1]
print(a)
On my computer the output was:
[luca#artix tmp]$ python3 forum.py
a.pdf
./foo/test.pdf
You can then just do something like

for line in a.split('\n'):
    print(line, line.split('/'))

and you'll know the folder of each PDF. I hope this helps.
I have multiple txt files in a folder. I need to insert the data from the txt files into a MySQL table named TAR.
I also need to sort the files by modified date before inserting the data.
Below is the content of one of the txt files. I also need to remove the first character of every line:
SSerial1234
CCustomer
IDivision
Nat22
nAembly
rA0
PFVT
fchassis1-card-linec
RUnk
TP
Oeka
[06/22/2020 10:11:50
]06/22/2020 10:27:22
My code only reads all the files in the folder and prints their contents; I'm not sure how to sort the files before reading them one by one.
Is there also a way to read only specific files (JPE*.log)?
import os

# raw string so backslashes in the Windows path are kept literally
for path, dirs, files in os.walk(r"C:\TAR\TARS_Source"):
    for f in files:
        fileName = os.path.join(path, f)
        with open(fileName, "r") as myFile:
            print(myFile.read())
Use the glob.glob method to get all the files matching a wildcard pattern like the following...

import glob
files = glob.glob('./JPE*.log')

And since plain sorted(files) would sort by name, sort by modified date with a key function:

import os
sorted_files = sorted(files, key=os.path.getmtime)
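Putting it together, a rough sketch of the whole pipeline, assuming the mysql.connector driver and a hypothetical single-column TAR table (adjust the path, credentials, and schema to your setup):

import glob
import os
import mysql.connector  # assumed driver; pymysql works similarly

# find the JPE*.log files and order them by modified date
files = sorted(glob.glob(r"C:\TAR\TARS_Source\JPE*.log"), key=os.path.getmtime)

conn = mysql.connector.connect(host="localhost", user="user",
                               password="password", database="mydb")
cur = conn.cursor()

for fileName in files:
    with open(fileName, "r") as myFile:
        for line in myFile:
            cleaned = line.rstrip("\n")[1:]  # drop the first character of every line
            # hypothetical column name; adjust to your actual schema
            cur.execute("INSERT INTO TAR (line) VALUES (%s)", (cleaned,))

conn.commit()
conn.close()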
I'm trying to find the proper solution to look only at the folders inside a folder and read only XML files.
My code:

data = os.listdir('./home/')
for packs in data:
    file = open('./home/' + packs + '/files' + 'data.xml', 'r')  # I have 100s of folders inside home
    all_file = file.read()

Output: it reads all the folders as required, but I also have 3 CSV and 2 text files in the home folder. My code tries to read those as well and raises an error. I don't want to read those; is there a method to read only the XML files?
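One way to skip the non-folder entries is to check each one with os.path.isdir before opening anything, as in this minimal sketch (the 'files' + 'data.xml' path is kept exactly as written in the question):

import os

for packs in os.listdir('./home/'):
    pack_dir = os.path.join('./home/', packs)
    if not os.path.isdir(pack_dir):
        continue  # skip the csv/txt files sitting next to the folders
    xml_path = os.path.join(pack_dir, 'files' + 'data.xml')
    if os.path.isfile(xml_path):
        with open(xml_path, 'r') as f:
            all_file = f.read()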
Does anyone know how to compare the number of files and the size of the files in archiwum.rar with its extracted content in a folder?
The reason I want to do this is that the server I'm working on was restarted a couple of times during extraction, and I am not sure whether all the files were extracted correctly.
The .rar files are more than 100 GB each and the server is not that fast.
Any ideas?
PS: if the solution is some code instead of a standalone program, my preference is Python.
Thanks
In Python you can use the rarfile module. The usage is similar to the built-in zipfile module.
import rarfile
import os.path

extracted_dir_name = "samples/sample"  # directory with the extracted files
file = rarfile.RarFile("samples/sample.rar", "r")

# list file information and compare each entry with its extracted counterpart
for info in file.infolist():
    print(info.filename, info.date_time, info.file_size)
    extracted_file = os.path.join(extracted_dir_name, info.filename)
    if info.file_size != os.path.getsize(extracted_file):
        print("Different size!")