I have a folder containing files in several formats, such as PDF, TXT, DOCX and HTML. I want to validate the format of the files in Python.
Here is my attempt
import glob
import pdftables_api

path = r"myfolder\*"
files = glob.glob(path)

for i in files:
    if i.endswith('.pdf'):
        # Convert the PDF to an Excel file via the PDFTables API
        conversion = pdftables_api.Client('my_api')
        conversion.xlsx(i, r"destination\*")
The idea is to iterate through each file and, if it is a PDF, convert it to Excel using the PDFTables API (the pdftables_api package) and save the result in the destination folder. But I don't feel this is an efficient way to do it.
Can anyone tell me whether there is a more efficient way of achieving this?
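For reference, a slightly tidier sketch of the same idea, using pathlib to select the PDFs and give each output a matching name (my_api, myfolder and destination are placeholders from the question), might look like this:
import pathlib
import pdftables_api

src = pathlib.Path("myfolder")      # folder with the mixed-format files
dst = pathlib.Path("destination")   # folder for the converted spreadsheets

client = pdftables_api.Client("my_api")  # placeholder API key

# Convert only the PDFs, writing one .xlsx per input file
for pdf in src.glob("*.pdf"):
    client.xlsx(str(pdf), str(dst / (pdf.stem + ".xlsx")))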
This is the link from which I want the CSV files: http://archive.ics.uci.edu/ml/datasets/selfBACK
My approach right now is to download it locally by simply clicking it. But this folder contains many subfolders with many CSVs in them. How do I import them in an efficient manner?
I know how to do it one by one but I feel there has to be a more efficient way.
You can first read all the paths in that folder and filter for CSV files (or add other filters, e.g. for specific file names). After that, combine the files; here I use pandas, assuming the data is tabular and structured the same way.
import os
import pandas as pd

path = 'your_folder_path'

# Read every CSV in the folder into its own DataFrame
dfs = [pd.read_csv(os.path.join(path, f)) for f in os.listdir(path) if f.endswith('.csv')]

# combine them (if they have the same format) like this:
df = pd.concat(dfs)
Note: you could also make a dictionary instead (key=filename, value=dataframe) and then access the data by using the filename.
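If you prefer the dictionary variant, a minimal sketch (same placeholder folder path as above) could be:
import os
import pandas as pd

path = 'your_folder_path'

# Map each CSV filename to its DataFrame
dfs = {f: pd.read_csv(os.path.join(path, f))
       for f in os.listdir(path) if f.endswith('.csv')}

# Then access one file's data by name, e.g. dfs['some_file.csv']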
In a folder I have 10 PDF files and 5 Word files (doc, docx). I would like to know how I can create a table in Python with two columns:
ID of the file
Text of the pdf or word
Thanks for your help
For reading the PDFs, you can use the library: https://pypi.org/project/PyPDF2/
For docx, the library: https://github.com/ankushshah89/python-docx2txt
By file ID, do you simply mean the filename? You can use the following code to fetch a list of files and folders in the current working folder:
import os
filelist = os.listdir()
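Putting the pieces together, here is a rough sketch of the two-column table, using the filename as the ID and assuming a recent PyPDF2 (PdfReader / extract_text) and docx2txt.process; legacy .doc files would need a different tool:
import os
import docx2txt
from PyPDF2 import PdfReader

rows = []  # (file_id, text) pairs

for filename in os.listdir():
    if filename.lower().endswith('.pdf'):
        reader = PdfReader(filename)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        rows.append((filename, text))
    elif filename.lower().endswith('.docx'):
        rows.append((filename, docx2txt.process(filename)))
    # note: old binary .doc files are not handled by docx2txt

for file_id, text in rows:
    print(file_id, len(text), "characters")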
I have a folder C:\test_xml with a list of XML files. I want to get all the XML file names and store them in a CSV file, xml_file.csv. I am trying the Python code below but don't know how to proceed, as I am quite new to Python.
import os
import glob

# Use a raw string so the backslash is not treated as an escape character
files = list(glob.glob(os.path.join(r'C:\temp', '*.xml')))
print(files)
A way to get a list of only the filenames:
import pathlib
files = [file.name for file in pathlib.Path(r"C:\temp").glob("*.xml")]
The documentation for the csv module has some examples on how to write a .csv file
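For example, a minimal sketch that writes one filename per row to xml_file.csv (as named in the question) could be:
import csv
import pathlib

files = [file.name for file in pathlib.Path(r"C:\temp").glob("*.xml")]

# Write one filename per row
with open("xml_file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for name in files:
        writer.writerow([name])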
I'm pretty new to Python and I'm using the Python-docx module to manipulate some docx files.
I'm importing the docx files using this code:
doc = docx.Document('filename.docx')
The thing is that I need to work with many docx files, and in order to avoid writing the same line of code for each file, I was wondering: if I create a folder in my working directory, is there a way to import all the docx files more efficiently?
Something like:
import docx
from glob import glob

def edit_document(path):
    document = docx.Document(path)
    # --- do things on document ---

for path in glob("./*.docx"):
    edit_document(path)
You'll need to adjust the glob expression to suit.
There are plenty of other ways to do that part, like os.walk() if you want to recursively descend directories, but this is maybe a good place to start.
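If you do need to recurse into subdirectories, a minimal sketch with os.walk (reusing the edit_document function above) could be:
import os

for dirpath, dirnames, filenames in os.walk("."):
    for filename in filenames:
        if filename.endswith(".docx"):
            edit_document(os.path.join(dirpath, filename))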
Does anyone know how to compare the number of files, and their sizes, in archiwum.rar against its extracted content in the folder?
The reason I want to do this is that the server I'm working on has been restarted a couple of times during extraction, and I am not sure if all the files have been extracted correctly.
The .rar files are more than 100 GB each and the server is not that fast.
Any ideas?
ps. if the solution would be some code instead standalone program, my preference is Python.
Thanks
In Python you can use the rarfile module. Its usage is similar to the built-in zipfile module.
import rarfile
import os.path

extracted_dir_name = "samples/sample"  # Directory with extracted files

archive = rarfile.RarFile("samples/sample.rar", "r")

# List file information and compare each entry with the extracted file
for info in archive.infolist():
    print(info.filename, info.date_time, info.file_size)

    extracted_file = os.path.join(extracted_dir_name, info.filename)
    if info.file_size != os.path.getsize(extracted_file):
        print("Different size!")