Applying the same operations on multiple .csv file in pandas - python

I have six .csv files. They overall size is approximately 4gigs. I need to clean each and do some data analysis task on each. These operations are the same for all the frames.
This is my code for reading them.
#df = pd.read_csv(r"yellow_tripdata_2018-01.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-02.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-03.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-04.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-05.csv")
df = pd.read_csv(r"yellow_tripdata_2018-06.csv")
Each time I run the kernel, I activate one of the files to be read.
I am looking for a more elegant way to do this. I thought about doing a for-loop. Making a list of file names and then reading them one after the other but I don't want to merge them together so I think another approach must exist. I have been searching for it but it seems all the questions lead to concatenating the files read at the end.

Use for and format like this. I use this every single day:
number_of_files = 6
for i in range(1, number_of_files+1):
df = pd.read_csv("yellow_tripdata_2018-0{}.csv".format(i)))
#your code here, do analysis and then the loop will return and read the next dataframe

You could use a list to hold all of the dataframes:
number_of_files = 6
dfs = []
for file_num in range(len(number_of_files)):
dfs.append(pd.read_csv(f"yellow_tripdata_2018-0{file_num}.csv")) #I use Python 3.6, so I'm used to f-strings now. If you're using Python <3.6 use .format()
Then to get a certain dataframe use:
df1 = dfs[0]
Edit:
As you are trying to keep from loading all of these in memory, I'd resort to streaming them. Try changing the for loop to something like this:
for file_num in range(len(number_of_files)):
with open(f"yellow_tripdata_2018-0{file_num}.csv", 'wb') as f:
dfs.append(csv.reader(iter(f.readline, '')))
Then just use a for loop over dfs[n] or next(dfs[n]) to read each line into memory.
P.S.
You may need multi-threading to iterate through each one at the same time.
Loading/Editing/Saving: - using csv module
Ok, so I've done a lot of research, python's csv module does load one line at a time, it's most likely in the mode we are opening it in. (explained here)
If you don't want to use Pandas (which chunking may honestly be the answer, just implement that into #seralouk's answer if so), otherwise, then yes! This below is in my mind would be the best approach, we just need to change a couple of things.
number_of_files = 6
filename = "yellow_tripdata_2018-{}.csv"
for file_num in range(number_of_files):
#notice I'm opening the original file as f in mode 'r' for read only
#and the new file as nf in mode 'a' for append
with open(filename.format(str(file_num).zfill(2)), 'r') as f,
open(filename.format((str(file_num)+"-new").zfill(2)), 'a') as nf:
#initialize the writer before looping every line
w = csv.writer(nf)
for row in csv.reader(f):
#do your "data cleaning" (THIS IS PER-LINE REMEMBER)
#save to file
w.writerow(row)
Note:
You may want to consider using a DictReader and/or DictWriter, I'd prefer them over regular reader/writers as I find them easier to understand.
Pandas Approach - using chunks
PLEASE READ this answer - if you'd like to steer away from my csv approach and stick with Pandas :) It literally seems like it's the same issue as yours and the answer is what you're asking for.
Basically Panda's allows for you to partially load a file as chunks, execute any alterations, then you can write those chunks to a new file. Below is majorly from that answer but I did do some more reading up myself in the docs
number_of_files = 6
chunksize = 500 #find the chunksize that works best for you
filename = "yellow_tripdata_2018-{}.csv"
for file_num in range(number_of_files):
for chunk in pd.read_csv(filename.format(str(file_num).zfill(2))chunksize=ch)
# Do your data cleaning
chunk.to_csv(filename.format((str(file_num)+"-new").zfill(2)), mode='a') #see again we're doing it in append mode so it creates the file in chunks
For more info on chunking the data see here as well it's good reading for those such as yourself getting headaches over these memory issues.

Use glob.glob to get all files with similar names:
import glob
files = glob.glob("yellow_tripdata_2018-0?.csv")
for f in files:
df = pd.read_csv(f)
# manipulate df
df.to_csv(f)
This will match yellow_tripdata_2018-0<any one character>.csv. You can also use yellow_tripdata_2018-0*.csv to match yellow_tripdata_2018-0<anything>.csv or even yellow_tripdata_*.csv to match all csv files that start with yellow_tripdata.
Note that this also only loads one file at a time.

Use os.listdir() to make a list of files you can loop through?
samplefiles = os.listdir(filepath)
for filename in samplefiles:
df = pd.read_csv(filename)
where filepath is the directory containing multiple csv's?
Or a loop that changes the filename:
for i in range(1, 7):
df = pd.read_csv(r"yellow_tripdata_2018-0%s.csv") % ( str(i))

# import libraries
import pandas as pd
import glob
# store file paths in a variable
project_folder = r"C:\file_path\"
# Save all file path in a variable
all_files_paths = glob.glob(project_folder + "/*.csv")
# Create a list to save whole data
li = []
# Use list comprehension to iterate over all files; and append data in each file to list
list_all_data = [li.append(pd.read_csv(filename, index_col=None, header=0)) for filename in all_files]
# Convert list to pandas dataframe
df = pd.concat(li, axis=0, ignore_index=True)

Related

How to improve my append and read excel For loop in python

Hope you can help me.
I have a folder where there are several .xlsx files with similar structure (NOTE that some of the files might be bigger than 50MB). I want to combine them all together and (eventually) send them to a database. But before that, I need to improve the performance of this block of code because sometimes it takes a lot of time to process all those files.
The code in question is this:
df_list = []
for file in location:
df_list.append(pd.read_excel(file, header=0, engine='openpyxl'))
df_concat = pd.concat(df_list)
Any suggestions?
Somewhere I read that converting Excel files to CSV might improve the performance, but should I do that before appending the files or after everything is concatenated?
And considering df_list is a list, can I do that conversion?
I've found a solution with xlsx2csv
xlsx_path = './data/Extract/'
csv_path = './data/csv/'
list_of_xlsx = glob.glob(xlsx_path+'*.xlsx')
for xlsx in list_of_xlsx:
# Extract File Name on group 2 "(.+)"
filename = re.search(r'(.+[\\|\/])(.+)(\.(xlsx))', xlsx).group(2)
# Setup the call for subprocess.call()
call = ["python", "./xlsx2csv.py", xlsx, csv_path+filename+'.csv']
try:
subprocess.call(call) # On Windows use shell=True
except:
print('Failed with {}'.format(filepath)
outputcsv = './data/bigcsv.csv' #specify filepath+filename of output csv
listofdataframes = []
for file in glob.glob(csv_path+'*.csv'):
df = pd.read_csv(file)
if df.shape[1] == 24: # make sure 24 columns
listofdataframes.append(df)
else:
print('{} has {} columns - skipping'.format(file,df.shape[1]))
bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
bigdataframe.to_csv(outputcsv,index=False)
I tried to make this work for me but had no success. Maybe you might be able to have it working for you? Or does anyone have any ideas?
Reading excel files is quite slow in pandas as you stated, you shoudld have a look at this answer. It bascally uses a vbscript before running the python script to convert excel file to csv file, which is way faster to read for the python script.
To be more specific and answer the second part of your question, you should convert teh excel files to csv before loading them with pandas. The read_excel function is the slow part.

Grab values from seperate csv file and replace the values of columns in a pipe delimited file

Trying to whip this out in python. Long story short I got a csv file that contains column data i need to inject into another file that is pipe delimited. My understanding is that python can't replace values, so i have to re-write the whole file with the new values.
data file(csv):
value1,value2,iwantthisvalue3
source file(txt, | delimited)
value1|value2|iwanttoreplacethisvalue3|value4|value5|etc
fixed file(txt, | delimited)
samevalue1|samevalue2| replacedvalue3|value4|value5|etc
I can't figure out how to accomplish this. This is my latest attempt(broken code):
import re
import csv
result = []
row = []
with open("C:\data\generatedfixed.csv","r") as data_file:
for line in data_file:
fields = line.split(',')
result.append(fields[2])
with open("C:\data\data.txt","r") as source_file, with open("C:\data\data_fixed.txt", "w") as fixed_file:
for line in source_file:
fields = line.split('|')
n=0
for value in result:
fields[2] = result[n]
n=n+1
row.append(line)
for value in row
fixed_file.write(row)
I would highly suggest you use the pandas package here, it makes handling tabular data very easy and it would help you a lot in this case. Once you have installed pandas import it with:
import pandas as pd
To read the files simply use:
data_file = pd.read_csv("C:\data\generatedfixed.csv")
source_file = pd.read_csv('C:\data\data.txt', delimiter = "|")
and after that manipulating these two files is easy, I'm not exactly sure how many values or which ones you want to replace, but if the length of both "iwantthisvalue3" and "iwanttoreplacethisvalue3" is the same then this should do the trick:
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3]
now all you need to do is save the dataframe (the table that we just updated) into a file, since you want to save it to a .txt file with "|" as the delimiter this is the line to do that (however you can customize how to save it in a lot of ways):
source_file.to_csv("C:\data\data_fixed.txt", sep='|', index=False)
Let me know if everything works and this helped you. I would also encourage to read up (or watch some videos) on pandas if you're planning to work with tabular data, it is an awesome library with great documentation and functionality.

Deleting rows from several CSV files using Python

I wanted to delete specific rows from every single csv. files in my directory (i.e. from row 0 to 33), but I have 224 separate csv. files which need to be done. I would be happy if you help me how can I use one code to carry out this.
I think you can use glob and pandas to do this quite easily, I'm not sure if you want to write over your original files something I never recommend, so be careful as this code will do that.
import os
import glob
import pandas as pd
os.chdir(r'yourdir')
allFiles = glob.glob("*.csv") # match your csvs
for file in allFiles:
df = pd.read_csv(file)
df = df.iloc[33:,] # read from row 34 onwards.
df.to_csv(file)
print(f"{file} has removed rows 0-33")
or something along those lines..
This is a simple combination of two separate tasks.
First, you need to loop through all the csv files in a folder. See this StackOverflow answer for how to do that.
Next, within that loop, for each file, you need to modify the csv by removing rows. See this answer for how to read a csv, write a csv, and omit certain rows based on a condition.
One final aspect is that you want to omit certain line numbers. A good way to do this is with the enumerate function.
So code such as this will give you the line numbers.
import csv
input = open('first.csv', 'r')
output = open('first_edit.csv', 'w')
writer = csv.writer(output)
for i, row in enumerate(input):
if i > 33:
writer.writerow(row)
input.close()
output.close()
Iterate over CSV files and use Pandas to remove the top 34 rows of each file then save it to an output directory.
Try this code after installing pandas:
from pathlib import Path
import pandas as pd
source_dir = Path('path/to/source/directory')
output_dir = Path('path/to/output/directory')
for file in source_dir.glob('*.csv'):
df = pd.read_csv(file)
df.drop(df.head(34).index, inplace=True)
df.to_csv(output_dir.joinpath(file.name), index=False)

How to tag records with filename, imported to pandas dataframe from multiple csv files?

I have a set of csv files I need to import into a pandas dataframe.
I have imported the filepaths as a list, FP, and I am using the following code to read the data:
for i in FP:
df = pd.read_csv(i,index_col=None, header=0).append(df)
This is working great, but unfortunately there are no datetimestamps or file identifying attributes in the files. I need to know which file each record came from.
I tried adding this line, but this just returned the filename of the final file read:
for i in FP:
df = pd.read_csv(i,index_col=None, header=0).append(df)
df['filename'] = i
I can imagine some messy multi-step solutions, but wondered if there was something more elegant I could do within my existing loop.
I'd do it this way:
df = pd.concat([pd.read_csv(f, header=None).assign(filename=f) for f in FP],
ignore_index=True)

Extracting columns containing a certain name

I'm trying to use it to manipulate data in large txt-files.
I have a txt-file with more than 2000 columns, and about a third of these have a title which contains the word 'Net'. I want to extract only these columns and write them to a new txt file. Any suggestion on how I can do that?
I have searched around a bit but haven't been able to find something that helps me. Apologies if similar questions have been asked and solved before.
EDIT 1: Thank you all! At the moment of writing 3 users have suggested solutions and they all work really well. I honestly didn't think people would answer so I didn't check for a day or two, and was happily surprised by this. I'm very impressed.
EDIT 2: I've added a picture that shows what a part of the original txt-file can look like, in case it will help anyone in the future:
One way of doing this, without the installation of third-party modules like numpy/pandas, is as follows. Given an input file, called "input.csv" like this:
a,b,c_net,d,e_net
0,0,1,0,1
0,0,1,0,1
(remove the blank lines in between, they are just for formatting the
content in this post)
The following code does what you want.
import csv
input_filename = 'input.csv'
output_filename = 'output.csv'
# Instantiate a CSV reader, check if you have the appropriate delimiter
reader = csv.reader(open(input_filename), delimiter=',')
# Get the first row (assuming this row contains the header)
input_header = reader.next()
# Filter out the columns that you want to keep by storing the column
# index
columns_to_keep = []
for i, name in enumerate(input_header):
if 'net' in name:
columns_to_keep.append(i)
# Create a CSV writer to store the columns you want to keep
writer = csv.writer(open(output_filename, 'w'), delimiter=',')
# Construct the header of the output file
output_header = []
for column_index in columns_to_keep:
output_header.append(input_header[column_index])
# Write the header to the output file
writer.writerow(output_header)
# Iterate of the remainder of the input file, construct a row
# with columns you want to keep and write this row to the output file
for row in reader:
new_row = []
for column_index in columns_to_keep:
new_row.append(row[column_index])
writer.writerow(new_row)
Note that there is no error handling. There are at least two that should be handled. The first one is the check for the existence of the input file (hint: check the functionality provide by the os and os.path modules). The second one is to handle blank lines or lines with an inconsistent amount of columns.
This could be done for instance with Pandas,
import pandas as pd
df = pd.read_csv('path_to_file.txt', sep='\s+')
print(df.columns) # check that the columns are parsed correctly
selected_columns = [col for col in df.columns if "net" in col]
df_filtered = df[selected_columns]
df_filtered.to_csv('new_file.txt')
Of course, since we don't have the structure of your text file, you would have to adapt the arguments of read_csv to make this work in your case (see the the corresponding documentation).
This will load all the file in memory and then filter out the unnecessary columns. If your file is so large that it cannot be loaded in RAM at once, there is a way to load only specific columns with the usecols argument.
You can use pandas filter function to select few columns based on regex
data_filtered = data.filter(regex='net')

Categories

Resources