I read all the files in one folder one by one into a pandas.DataFrame and then I check them for some conditions. There are a few thousand files, and I would love to make pandas raise an exception when a file is empty, so that my reader function would skip this file.
I have something like:
class StructureReader(FileList):
    def __init__(self, dirname, filename):
        self.dirname = dirname
        self.filename = str(self.dirname + "/" + filename)

    def read(self):
        self.data = pd.read_csv(self.filename, header=None, sep=",")
        if len(self.data) == 0:
            raise ValueError
class Run(object):
    def __init__(self, dirname):
        self.dirname = dirname
        self.file__list = FileList(dirname)
        self.result = Result()

    def run(self):
        for k in self.file__list.file_list[:]:
            self.b = StructureReader(self.dirname, k)
            try:
                self.b.read()
                self.b.find_interesting_bonds(self.result)
                self.b.find_same_direction_chain(self.result)
            except ValueError:
                pass
A regular file that I'm searching for the condition looks like:
"A/C/24","A/G/14","WW_cis",,
"B/C/24","A/G/15","WW_cis",,
"C/C/24","A/F/11","WW_cis",,
"d/C/24","A/G/12","WW_cis",,
But somehow ValueError is never raised, and my functions end up searching empty files, which gives me a lot of "Empty DataFrame ..." lines in my results file. How can I skip empty files?
I'd first check whether the file is empty, and only try to use it with pandas if it isn't. Following this link https://stackoverflow.com/a/15924160/5088142 you can find a nice way to check whether a file is empty:
import os

def is_non_zero_file(fpath):
    return os.path.isfile(fpath) and os.path.getsize(fpath) > 0
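For example, a minimal sketch of how this check could be wired into the run() loop from the question (FileList, Result and the find_* methods are assumed to exist as in the original code):

def run(self):
    for k in self.file__list.file_list[:]:
        path = self.dirname + "/" + k
        if not is_non_zero_file(path):  # skip empty or missing files entirely
            continue
        self.b = StructureReader(self.dirname, k)
        self.b.read()
        self.b.find_interesting_bonds(self.result)
        self.b.find_same_direction_chain(self.result)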
You should not use pandas for this, but the Python standard library directly. The answer is here: python how to check file empty or not
You can get your work done with the following code: just set the path variable to the folder containing your CSVs and run it. You should end up with an object raw_data, which is a pandas DataFrame.
import glob
import os

import pandas as pd

path = "/home/username/data_folder"
files_list = glob.glob(os.path.join(path, "*.csv"))

for file_path in files_list:
    try:
        raw_data = pd.read_csv(file_path)
    except pd.errors.EmptyDataError:
        print(file_path, "is empty and has been skipped.")
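If you also want to keep what was read, a small extension of the same idea (just a sketch) could collect each successfully read frame and concatenate them at the end:

frames = []
for file_path in files_list:
    try:
        frames.append(pd.read_csv(file_path))
    except pd.errors.EmptyDataError:
        print(file_path, "is empty and has been skipped.")

all_data = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()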
How about this:
import glob
import os
import pandas as pd

files = glob.glob('*.csv')
files = list(filter(lambda file: os.stat(file).st_size > 0, files))
data = [pd.read_csv(file) for file in files]  # read each non-empty CSV into its own DataFrame
After checking Stack Overflow, I reviewed the path directory, and it works fine as shown in "Loading files".
However, when I try to concatenate the files I get the above ValueError, as shown in "Concatenate files".
What can I learn from this?
Also, is it ideal to use the below line of code?
files_names=os.listdir()
Loading files:
import pandas as pd
import os

files_names = os.listdir()

def load_all_csv(files_names):
    # Follow this function template: take a list of file names and return one dataframe
    # YOUR CODE HERE
    return files_names

print(files_names)
all_data = load_all_csv(files_names)
all_data
Concatenate files:
combined_data = []
for filename in all_data:
    if filename.endswith('csv'):
        # print("All csv files: ", filename)
        df = pd.read_csv(filename, index_col='Month')
        combined_data.append(df)

all_data = pd.concat(combined_data, axis=1)
all_data
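For illustration only, here is a minimal sketch of one way load_all_csv could be implemented, assuming every CSV in the directory really has a 'Month' column (as the concatenation snippet expects). Note that pd.concat raises ValueError("No objects to concatenate") when it is handed an empty list, which is worth guarding against:

import os
import pandas as pd

def load_all_csv(files_names):
    # Read every CSV in the list and join the frames side by side on the 'Month' index.
    frames = []
    for filename in files_names:
        if filename.endswith('.csv'):
            frames.append(pd.read_csv(filename, index_col='Month'))
    if not frames:
        raise ValueError("No CSV files found in the given list")
    return pd.concat(frames, axis=1)

all_data = load_all_csv(os.listdir())
print(all_data)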
I am working on locating a file whose name contains hyphens (e.g., Hours-2021.xml). When I perform the character replacement, I then get an error that the file cannot be found. If I simply use a filename without hyphens it works as expected. I had found a solution on another thread to reformat the filename, and it doesn't appear to work. Most likely it is a simple fix that is eluding me. Here is a sample of my code...
import os
import os.path
import win32com.client
import pandas as pd

in_file = input('Enter filename to use:')

for file in os.listdir():
    if file.startswith(in_file):
        new_fn = file.replace('-', '')
        new_1 = os.rename(file, new_fn)

try:
    xlApp = win32com.client.Dispatch("Excel.Application")
    xlWbk = xlApp.Workbooks.Open(new_1)
    xlWbk.SaveAs(r"hours_conv.xlsx", 51)
    xlWbk.Close(True)
    xlApp.Quit()
except Exception as e:
    print(e)
finally:
    xlWbk = None; xlApp = None
    del xlWbk; del xlApp

# READ EXCEL FILE
output_df = pd.read_excel(r"hours_conv.xlsx", skiprows=3)
print(output_df)
Everything before the try: gives me the output I expect (e.g., Hours2021). Further on, I get the error, in this case "Sorry, we couldn't find Hours2021.xml ...".
Without delving too deep in your code, it looks like you have an indentation problem. Your try-except-finally block should probably be indented to be under the if file.startswith line.
Plus, you should probably check that new_fn is not the same as file before doing the os.rename(). Or alternatively you can make the call conditional, like changing:
if file.startswith(in_file):
to:
if file.startswith(in_file) and '-' in file:
or something along those lines.
Also, os.rename() does not return a value. So new_1 gets set to None.
Lastly, keep in mind that each time you run your program, it is renaming files. So you may have to rename them back before each run.
Also, keep in mind that your:
xlWbk = xlApp.Workbooks.Open(new_1)
will probably always fail, since new_1 is None.
Assuming I understand what you're trying to do, here is a working version of your code. In addition to the above comments, it also addresses path issues, since Excel wants to know the full path of the doc to work with. The working code is as follows:
import os.path
import win32com.client
import pandas as pd

cwd = os.getcwd()
print(f"{cwd=}")
out_filename = "hours_conv.xlsx"

in_file = input('Enter filename (starting string) to use: ')

for file in os.listdir():
    if file.startswith(in_file):
        new_fn = file.replace('-', '')
        if new_fn != file:
            os.rename(file, new_fn)

        # indent
        try:
            xlApp = win32com.client.Dispatch("Excel.Application")
            xlWbk = xlApp.Workbooks.Open(f"{cwd}/{new_fn}")  # mod to be new_fn
            xlWbk.SaveAs(f"{cwd}/{out_filename}", 51)
            xlWbk.Close(True)
            xlApp.Quit()
        except Exception as e:
            print(e)
        finally:
            xlWbk = xlApp = None
            del xlWbk; del xlApp

# READ EXCEL FILE
if os.path.exists(f"{cwd}/{out_filename}"):
    output_df = pd.read_excel(f"{cwd}/{out_filename}", skiprows=3)
    print(output_df)
else:
    print(f"Output file {cwd}/{out_filename} Not Found")
I'm trying to monitor a CSV file that is being written to by a separate program. Around every 10 seconds, the CSV file is updated with a couple more lines. Each time the file is updated, I want to be able to detect the file has been changed (will always be the same file), take the new lines, and write them to console (just for a test).
I have looked around the website, and have found numerous ways of watching a file to see if it's updated (like so http://thepythoncorner.com/dev/how-to-create-a-watchdog-in-python-to-look-for-filesystem-changes/), but I can't seem to find anything that will let me get at the changes made in the file so I can print them to the console.
Current code:
import time

from watchdog.observers import Observer
from watchdog.events import PatternMatchingEventHandler


def on_created(event):
    print(f"hey, {event.src_path} has been created!")

def on_deleted(event):
    print(f"Someone deleted {event.src_path}!")

def on_modified(event):
    print(f"{event.src_path} has been modified")

def on_moved(event):
    print(f"ok ok ok, someone moved {event.src_path} to {event.dest_path}")


if __name__ == "__main__":
    patterns = "*"
    ignore_patterns = ""
    ignore_directories = False
    case_sensitive = True
    my_event_handler = PatternMatchingEventHandler(patterns, ignore_patterns, ignore_directories, case_sensitive)

    my_event_handler.on_created = on_created
    my_event_handler.on_deleted = on_deleted
    my_event_handler.on_modified = on_modified
    my_event_handler.on_moved = on_moved

    path = "."
    go_recursively = True
    my_observer = Observer()
    my_observer.schedule(my_event_handler, path, recursive=go_recursively)

    my_observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        my_observer.stop()
        my_observer.join()
This runs, but looks for changes in files all over the place. How do I make it listen for changes from one single file?
If you're more or less happy with the script other than it tracking a bunch of files, you could change the patterns = "*" part, which is a wildcard matching string that tells the PatternMatchingEventHandler to look for any file. You could change that to patterns = 'my_file.csv' and also change the path variable to the directory that the file is in, to save some time recursively scanning all the directories in '.'. You don't need recursive set to True for a single file either.
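A minimal sketch of that narrowed setup, assuming the file is my_file.csv and lives in /path/to/dir (both placeholders); note that patterns is passed as a list here, which is what PatternMatchingEventHandler expects:

import time

from watchdog.observers import Observer
from watchdog.events import PatternMatchingEventHandler

def on_modified(event):
    print(f"{event.src_path} has been modified")

if __name__ == "__main__":
    # Watch a single file: patterns holds exactly one entry.
    my_event_handler = PatternMatchingEventHandler(
        patterns=["my_file.csv"], ignore_patterns=[], ignore_directories=False, case_sensitive=True)
    my_event_handler.on_modified = on_modified

    my_observer = Observer()
    my_observer.schedule(my_event_handler, "/path/to/dir", recursive=False)
    my_observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        my_observer.stop()
    my_observer.join()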
Print new lines to console part (one option):
import pandas as pd
...

def on_modified(event):
    print(f"{event.src_path} has been modified")
    # You said "a couple more lines"; I'm going to take that as two:
    df = pd.read_csv(event.src_path)
    print("Newest 2 lines:")
    print(df[-2:])
If it's not two lines you'll want to track the length of the file and pass that to the function which opens the CSV so it knows how many lines are new.
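A rough sketch of that idea, keeping the previously seen row count in a module-level variable (this assumes the CSV has a header row, which pd.read_csv expects by default):

import pandas as pd

last_row_count = 0

def on_modified(event):
    global last_row_count
    df = pd.read_csv(event.src_path)
    new_rows = df[last_row_count:]  # everything that hasn't been printed yet
    if not new_rows.empty:
        print(f"{len(new_rows)} new line(s) in {event.src_path}:")
        print(new_rows)
    last_row_count = len(df)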
I believe that since this is a CSV file, reading the file with pandas and checking the file size can help. You can use df.tail(2) to print the last two rows after reading the CSV with pandas.
My goal is to read a directory with several PDF files and return the number of pages in each file using Python. I'm trying to use the pyPdf library but it fails.
If I do this:
from pyPdf import PdfFileReader
testFile = "C:\\path\\file.pdf"
pdfFile = PdfFileReader(file(testFile, 'rb'))
print pdfFile.getNumPages()
I'll get a result
If I do this, it fails:
pdfList = []
for root, dirs, files in os.walk("C:\\path"):
    for file in files:
        pdfList.append(os.path.join(root, file))

for item in pdfList:
    targetPdf = PdfFileReader(file(item, 'rb'))
    numPages = targetPdf.getNumPages()
    print item, numPages
This always results in:
TypeError: 'str' object is not callable
If I try to recreate a pyPdf object manually, I get the same thing.
What am I doing wrong?
The issue is due to using the name file as a variable.
You are using file as a variable name in the first for loop, and as a function call in the statement targetPdf = PdfFileReader(file(item, 'rb')). The loop variable shadows the built-in file, so by the time the second loop runs, file is bound to a string and calling it raises the TypeError.
Try changing the variable name in the first for loop from file to fileName, as sketched below.
Hope that helps.
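A sketch of the corrected loop with the variable renamed (this keeps the question's pyPdf API and Python 2's built-in file()):

import os
from pyPdf import PdfFileReader

pdfList = []
for root, dirs, files in os.walk("C:\\path"):
    for fileName in files:  # renamed from 'file'
        pdfList.append(os.path.join(root, fileName))

for item in pdfList:
    targetPdf = PdfFileReader(file(item, 'rb'))  # built-in file() is no longer shadowed
    numPages = targetPdf.getNumPages()
    print item, numPages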
I have a Python class which is supposed to unpack an archive, recursively iterate over the directory structure and then return the files for further processing. In my case I want to hash those files. I'm struggling with returning the files. Here is my take.
I created an unzip function and a function which creates a log file with all the paths of the files that were unpacked. Then I want to access this log file and return ALL of the files so I can use them in another Python class for further processing. This doesn't seem to work yet.
Structure of log-file:
/home/usr/Downloads/outdir/XXX.log
/home/usr/Downloads/outdir/Code/XXX.py
/home/usr/Downloads/outdir/Code/XXX.py
/home/usr/Downloads/outdir/Code/XXX.py
Code of interest:
@staticmethod
def read_received_files(from_log):
    with open(from_log, 'r') as data:
        data = data.readlines()
        for lines in data:
            # This does not seem to work yet
            read_files = open(lines.strip())
            return read_files
I believe that's what you're looking for:
@staticmethod
def read_received_files(from_log):
    files = []
    with open(from_log, 'r') as data:
        for line in data:
            files.append(open(line.strip()))
    return files
You returned while iterating, which prevented the other files from being opened.
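And a small usage sketch for the hashing step mentioned in the question (the log path 'unpack.log' is just a placeholder, and read_received_files is assumed to be reachable, e.g. as a static method on your unpacking class):

import hashlib

for f in read_received_files('unpack.log'):
    with f:  # make sure each handle is closed after hashing
        digest = hashlib.md5(f.read().encode()).hexdigest()
    print(f.name, digest)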
Since you are primarily after the meta data and hash of the files stored in the zip file, but not the file itself, there is no need to extract the files to the file system.
Instead you can use the ZipFile.open() method to access the contents of the file through a file-like object. Meta data could be gathered using the ZipInfo object for each file. Here's an example which gets file name and file size as meta data, and the hash of the file.
import hashlib
import zipfile
from collections import namedtuple

def get_files(archive):
    FileInfo = namedtuple('FileInfo', ('filename', 'size', 'hash'))
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if not info.filename.endswith('/'):  # exclude directories
                f = zf.open(info)
                hash_ = hashlib.md5(f.read()).hexdigest()
                yield FileInfo(info.filename, info.file_size, hash_)

for f in get_files('some_file.zip'):
    print('{}: {} {} bytes'.format(f.hash, f.filename, f.size))