Reading & processing CSV files individually, outputting results to new individual files - Python

I am making what I suspect to be a very silly error here, but the vast majority of what I've found online talks about reading multiple files into a single dataframe or outputting results into a single file, which is not my goal here.
Aim: read hundreds of CSV files one by one, filter each one, and output the result to a file that carries the original file's name (e.g. "Processed_<original_file>.csv"), then move on to the next file in the loop, read and filter that, put its results in a new output file, and so on.
Problem: I either run into a problem where only a single result file is produced (from the last file read in the loop) or, if I use the code below, having read various SO pages, I get an invalid argument error.
Error: OSError: [Errno 22] Invalid argument: 'c:/users/my Directory/sourceFiles\Processed_c:/users/my Directory/sourceFiles\files1.csv'
I know I'm getting my loop and renaming wrong at the moment, but I can't figure out how to do this without loading ALL my CSVs into a single dataframe using list & concat and outputting everything into a single result file (which is not my aim). I want to output each result to an individual file that shares the name of the original file.
Ideally, given the size and number of files involved (700+, each ~400 MB), I'd rather use Pandas, as that seems to be more efficient from what I've learnt so far.
import pandas as pd
import glob
import os

path = "c:/users/my Directory/"
csvFiles = glob.glob(path + "/sourceFiles/files*")
for files in csvFiles:
    df = pd.read_csv(files, index_col=None, encoding='Latin-1',
                     engine='python', error_bad_lines=False)
    df_f = df[df.iloc[:, 2] == "Office"]
    filepath = os.path.join(path, 'Processed_' + str(files) + '.csv')
    df_f.to_csv(filepath)

The error message is nice because it shows you exactly what is wrong: your filename for the output save is wrong because the c:/users/... part is repeated twice and concatenated together.
Try something with os.path.basename() and os.path.splitext() to strip the path and file extension:
fileout = path + '\\' + 'Processed_' + os.path.splitext(os.path.basename(files))[0] + '.csv'
And, most importantly, test it with a couple of print statements to see if your ins and outs are what you expect. Just comment out the analysis lines.
import pandas as pd
import glob
import os

path = "c:/users/my Directory/"
csvFiles = glob.glob(path + "/sourceFiles/files*")
for files in csvFiles:
    print(files)
    #df = pd.read_csv(files, index_col=None, encoding='Latin-1',
    #                 engine='python', error_bad_lines=False)
    #df_f = df[df.iloc[:, 2] == "Office"]
    filepath = path + '\\' + 'Processed_' + os.path.splitext(os.path.basename(files))[0] + '.csv'
    print(filepath)
    #df_f.to_csv(filepath)
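Putting the pieces together, here is a minimal sketch of the corrected loop (an illustration, assuming the same 'Office' filter on the third column; note that newer pandas versions replace error_bad_lines=False with on_bad_lines='skip'):

import glob
import os

import pandas as pd

path = "c:/users/my Directory/"

for file in glob.glob(os.path.join(path, "sourceFiles", "files*")):
    df = pd.read_csv(file, index_col=None, encoding='Latin-1',
                     engine='python', error_bad_lines=False)
    # keep only rows whose third column equals "Office"
    df_f = df[df.iloc[:, 2] == "Office"]
    # build the output name from the bare file name, not the full path
    name = os.path.splitext(os.path.basename(file))[0]
    df_f.to_csv(os.path.join(path, "Processed_" + name + ".csv"), index=False)

Using os.path.join with only the base name avoids the doubled c:/users/... path from the error message entirely.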

Related

Why are CSV files not being created from df? (even though the code is running without error)

import glob
import pandas as pd

PREFIX = "/Users/test/Desktop/2022-selected"
filenames = [path.split("/")[-1] for path in glob.glob(PREFIX + "123456/*.csv")]
for filename in filenames:
    filepath = f"{PREFIX}*/{filename}"
    print(filepath)
    csv_paths = glob.glob(filepath)
    df = pd.concat((pd.read_csv(f, header=0) for f in csv_paths), ignore_index=True)
    df.to_csv(f"{PREFIX}output_{filename}", index=False)
I am trying to concatenate different .csv files across folders (e.g., 123456 in the code) into new aggregated .csv files. Even though my code runs without error, the .csv files are not being written. Please help!
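No answer is included for this one, but one likely suspect (an assumption, not confirmed in the thread) is the missing path separator between PREFIX and everything appended to it, which would put the output files next to the folders rather than where you expect. A small diagnostic sketch:

import glob
import os

# Hypothetical check: print exactly what the patterns match and where the output
# would land. There is no "/" between PREFIX and the rest of each path, so the
# output name becomes ".../2022-selectedoutput_<name>.csv" on the Desktop
# instead of a file inside one of the folders.
PREFIX = "/Users/test/Desktop/2022-selected"

print("inputs matched:", glob.glob(PREFIX + "123456/*.csv"))
print("output path example:", os.path.abspath(f"{PREFIX}output_example.csv"))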

How can I speed up importing many CSVs while filtering and adding filenames?

I have some Python (3.8) code that does the following:
Walks directory and subdirectories of a given path
Finds all .csv files
Finds all .csv files with 'Pct' in filename
Joins path and file
Reads CSV
Adds filename to df
Concatenates all dfs together
The code below works, but takes a long time (15 minutes) to ingest all the CSVs - there are 52,000 files. This might not in fact be a long time, but I want to reduce it as much as possible.
My current working code is below:
import os
import fnmatch
import pandas as pd

start_dirctory = '/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/'  # change this

df_result = None
#loop_number = 0

for path, dirs, files in os.walk(start_dirctory):
    for file in sorted(fnmatch.filter(files, '*.csv')):  # find .csv files
        # print(file)
        if 'Pct' in file:  # filter if contains 'Pct'
            # print('Pct = ', file)
            full_name = os.path.join(path, file)  # make full file path
            df_tmp = pd.read_csv(full_name, header=None)  # read file to df_tmp
            df_tmp['file'] = os.path.basename(file)  # df.file = file name
            if df_result is None:
                df_result = df_tmp
            else:
                df_result = pd.concat([df_result, df_tmp], axis='index', ignore_index=True)
            #print(full_name, 'imported')
            #loop_number = loop_number + 1
            #print('Loop number =', loop_number)
Inspired by this post (glob to find files recursively) and this post (how to speed up importing csvs), I have tried to reduce the time it takes to ingest all the data, but I can't figure out a way to integrate a filter for only filenames that contain 'Pct' and then add the filename to the df. This might not be possible with the code from these examples.
What I have tried below (incomplete):
%%time
import glob
import pandas as pd

df = pd.concat(
    [pd.read_csv(f, header=None)
     for f in glob.glob('/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/**/*.csv', recursive=True)],
    axis='index', ignore_index=True
)
Question
Is there any way that I can reduce the time to read and ingest the CSV's in my code above?
Thanks!
Check out the following solution. It assumes the open-file limit is high enough, because it streams every file one by one, but it has to open each of them to read the headers. In cases where files have different columns, you will get the superset of them in the resulting file:
from convtools import conversion as c
from convtools.contrib.tables import Table

files = sorted(
    os.path.join(path, file)
    for path, dirs, files in os.walk(start_dirctory)
    for file in files
    if "Pct" in file and file.endswith(".csv")
)

table = None
for file in files:
    table_ = Table.from_csv(file, header=True)  # assuming there's a header
    if table is None:
        table = table_
    else:
        table.chain(table_)

# this will be an iterable of dicts, so consume with pandas or whatever
table.into_iter_rows(dict)  # or list, or tuple
# or just write the new file like:
# >>> table.into_csv("concatenated.csv")
# HOWEVER: into_* can only be used once, because Table
# cannot assume the incoming data stream can be read twice
If you are sure that all the files have the same columns (one file is being opened at a time):
edited to add file column
def concat_files(files):
    for file in files:
        yield from Table.from_csv(file, header=True).update(
            file=file
        ).into_iter_rows(dict)

# this will be an iterable of dicts, so consume with pandas or whatever
concat_files(files)
P.S. Of course you can replace Table.from_csv with a standard/other reader, but this one adapts to the file, so it is generally faster on large files.
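For comparison, here is a pandas-only sketch along the lines of the attempt in the question, folding the 'Pct' filter and the filename column into a single comprehension (assuming headerless CSVs, as in the original code). Concatenating once at the end avoids repeatedly growing df_result inside the loop, which is the main cost in the original version:

import glob
import os

import pandas as pd

start_dirctory = '/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/'

# recursive glob, filtered to file names that contain 'Pct'
files = [
    f for f in glob.glob(os.path.join(start_dirctory, '**', '*.csv'), recursive=True)
    if 'Pct' in os.path.basename(f)
]

# read each file, tag it with its own name, and concatenate once at the end
df_result = pd.concat(
    (pd.read_csv(f, header=None).assign(file=os.path.basename(f)) for f in files),
    axis='index', ignore_index=True,
)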

Python: writing file to specific directory

This is likely a fundamental Python question, but I'm stumped (still learning). My script uses Pandas to create txt files from csv cells, and it works properly. However, I'd like to write the files to a specific directory, listed as save_path below, and my efforts to put this together keep running into errors.
Here's my (not) working code:
import os
import pandas as pd

save_path = r"C:\users\name\folder\txts"
df = pd.read_csv(r"C:\users\name\folder\test.csv", sep=",")
df2 = df.fillna('')

for index in range(len(df)):
    with open(df2["text_number"][index] + '.txt', 'w') as output:
        output2 = os.path.join(save_path, output)  # I'm uncertain how to structure or place the os.path.join command.
        output2.write(df2["text"][index])
The resulting error is below:
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'TextIOWrapper'
Thoughts? Any assistance is greatly appreciated.
You need to first generate the file name and then open it in write mode to put the contents.
for index in range(len(df)):
    # create the file name
    filename = df2["text_number"][index] + '.txt'
    # then generate the full path using the os lib
    full_path = os.path.join(save_path, filename)
    # now open that file; don't forget to use w+ to create the file if it doesn't exist
    with open(full_path, 'w+') as output_file_handler:
        # and write the contents
        output_file_handler.write(df2["text"][index])
This should work.
(But you might want to check out this answer)
for index in range(len(df)):
    filename = df2["text_number"][index] + '.txt'
    fp = os.path.join(save_path, filename)
    with open(fp, 'w') as output:
        output.write(df2["text"][index])
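As a side note, pathlib handles the joining just as neatly if you prefer it; a sketch under the same assumptions about the df2 columns:

from pathlib import Path

save_path = Path(r"C:\users\name\folder\txts")
save_path.mkdir(parents=True, exist_ok=True)  # create the folder if it doesn't exist

for index in range(len(df2)):
    out_file = save_path / (df2["text_number"][index] + ".txt")
    out_file.write_text(df2["text"][index])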

'EmptyDataError: No columns to parse from file' in Pandas when concatenating all files in a directory into single CSV

So I'm working on a project that analyzes Covid-19 data from this entire year. I have multiple csv files in a given directory and am trying to merge all the files' contents from each month into a single, comprehensive csv file. Here's what I've got so far, shown below. Specifically, the error message that appears is 'EmptyDataError: No columns to parse from file.' If I delete df = pd.read_csv('./csse_covid_19_daily_reports_us/' + file) and simply run print(file), it lists all the correct files that I am trying to merge. However, when trying to merge all the data into one, I get that error message. What gives?
import pandas as pd
import os

df = pd.read_csv('./csse_covid_19_daily_reports_us/09-04-2020.csv')
files = [file for file in os.listdir('./csse_covid_19_daily_reports_us')]
all_data = pd.DataFrame()

for file in files:
    df = pd.read_csv('./csse_covid_19_daily_reports_us/' + file)
    all_data = pd.concat([all_data, df])

all_data.head()
Folks, I have resolved this issue. Instead of sifting through files with files = [file for file in os.listdir('./csse_covid_19_daily_reports_us')], I used files = [f for f in os.listdir("./") if f.endswith('.csv')]. This filtered out some garbage files that were not .csv, allowing me to compile all the data into a single csv.
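In context, the fix looks roughly like this (a sketch; the output file name is hypothetical, and it assumes the .csv filter is applied to the same daily-reports directory):

import os

import pandas as pd

src = './csse_covid_19_daily_reports_us'

# keep only real CSV files; the stray non-CSV files were what triggered EmptyDataError
files = [f for f in os.listdir(src) if f.endswith('.csv')]

all_data = pd.concat(
    (pd.read_csv(os.path.join(src, f)) for f in files),
    ignore_index=True,
)
all_data.to_csv('combined_daily_reports.csv', index=False)  # hypothetical output name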

How to move multiple images from one folder to another folder in Python?

I am trying to move multiple images from one folder to another using shutil.move(). I have saved the image names in a CSV file, e.g. [img1, img25, img55, ...].
I have tried the code below:
import pandas as pd
import shutil

cop_folder = 'path to folder containing images'
destination_folder = 'path where I want to move the images'

df = pd.read_csv('', header=None)

for i in df:
    if i in cop_folder:
        shutil.move(i, dest_folder)
    else:
        print('fail')
TypeError: 'in <string>' requires string as left operand, not int
Try this approach:
import pandas as pd
import os

def move_file(old_file_path, new_directory):
    if not os.path.isdir(new_directory):
        os.mkdir(new_directory)
    base_name = os.path.basename(old_file_path)
    new_file_path = os.path.join(new_directory, base_name)
    # Deletes a file if that file already exists there, you can change this behavior
    if os.path.exists(new_file_path):
        os.remove(new_file_path)
    os.rename(old_file_path, new_file_path)

cop_folder = 'origin-folder\\'
destination_folder = 'dest_folder\\'

df = pd.read_csv('files.csv', header=None)
for i in df[0]:
    filename = os.path.join(cop_folder, i)
    move_file(filename, destination_folder)
The file names inside the csv must have an extension. If they don't, then you should use filename = os.path.join(cop_folder, i + '.jpg')
There are a few issues here. Firstly, you are iterating over a dataframe, which returns the column labels, not the values - that's what's causing the error you posted. If you really want to use pandas just to import a CSV, you could change it to for i in df.iterrows(), but even then it won't simply return the file name, it will return a series object. You'd probably be better off using the standard csv module to read the CSV; that way your filenames will be read in as a list and will behave as you intended.
Secondly, unless there is something else going on in your code, you can't look for files in a folder using the 'in' keyword; you'll need to construct a full filepath by concatenating the filename and the folder path.
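A rough sketch of that suggestion (assuming the names sit in the first column of a file called files.csv and already include their extensions):

import csv
import os
import shutil

cop_folder = 'origin-folder'
destination_folder = 'dest_folder'

# read the file names with the standard csv module so they come back as plain strings
with open('files.csv', newline='') as f:
    names = [row[0] for row in csv.reader(f) if row]

for name in names:
    src = os.path.join(cop_folder, name)
    if os.path.isfile(src):
        shutil.move(src, destination_folder)
    else:
        print('not found:', src)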
