How do I combine large CSV files in Python?

I have 18 csv files, each approximately 1.6 GB and each containing approximately 12 million rows. Each file represents one year's worth of data. I need to combine all of these files, extract data for certain geographies, and then analyse the time series. What is the best way to do this?
I have tried using pd.read_csv but I hit a memory limit. I have tried including a chunksize argument, but this gives me a TextFileReader object and I don't know how to combine these to make a dataframe. I have also tried pd.concat, but this does not work either.

Here is an elegant way of using pandas to combine very large csv files.
The technique is to load a fixed number of rows (defined as CHUNK_SIZE) into memory per iteration until each file is exhausted. These rows are appended to the output file in "append" mode.
import pandas as pd

CHUNK_SIZE = 50000
csv_file_list = ["file1.csv", "file2.csv", "file3.csv"]
output_file = "./result_merge/output.csv"

for csv_file_name in csv_file_list:
    chunk_container = pd.read_csv(csv_file_name, chunksize=CHUNK_SIZE)
    for chunk in chunk_container:
        chunk.to_csv(output_file, mode="a", index=False)
But if your files contain headers, you want a single header row in the output, not one repeated for every file and every chunk. Since pd.read_csv already consumes each input file's header as column names, the fix is to write the header to the output only once, for the very first chunk:
import pandas as pd

CHUNK_SIZE = 50000
csv_file_list = ["file1.csv", "file2.csv", "file3.csv"]
output_file = "./result_merge/output.csv"

first_one = True  # write the header row only for the very first chunk
for csv_file_name in csv_file_list:
    chunk_container = pd.read_csv(csv_file_name, chunksize=CHUNK_SIZE)
    for chunk in chunk_container:
        chunk.to_csv(output_file, mode="a", index=False, header=first_one)
        first_one = False

The memory limit is hit because you are trying to load the whole csv into memory. An easy solution would be to read the files line by line (assuming your files all have the same structure), check each line, then write it to the target file:
filenames = ["file1.csv", "file2.csv", "file3.csv"]
sep = ";"

def check_data(data):
    # ... your tests
    return True  # << True if data should be written into target file, else False

with open("/path/to/dir/result.csv", "a+") as targetfile:
    for filename in filenames:
        with open("/path/to/dir/" + filename, "r") as f:
            next(f)  # << only if the first line contains headers
            for line in f:
                data = line.split(sep)
                if check_data(data):
                    targetfile.write(line)
Update: An example of the check_data method, following your comments:
def check_data(data):
    return data[n] == 'USA'  # << where n is the index of the column holding the country

The TextFileReader you get when passing chunksize is an iterator, and each item it yields is already a DataFrame. So you can process each chunk as it arrives and then use pd.concat to concatenate the individual dataframes.
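As a minimal sketch of that pattern, filtering each chunk before concatenating (the column name 'geography' and the values to keep are placeholders, not taken from your data):

import pandas as pd

wanted = {"London", "Manchester"}  # hypothetical geographies of interest

filtered_chunks = []
for chunk in pd.read_csv("file1.csv", chunksize=50000):
    # each yielded chunk is a regular DataFrame, so it can be filtered right away
    filtered_chunks.append(chunk[chunk["geography"].isin(wanted)])

df = pd.concat(filtered_chunks, ignore_index=True)

Because only the filtered rows are kept in memory, this also sidesteps the original memory limit.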

Related

How to combine multiple csv files on specific columns in a fast way?

I have about 200 CSV files and I need to combine them on specific columns. Each CSV file contains 1000 filled rows on specific columns. My file names are like below:
csv_files = [en_tr_translated0.csv, en_tr_translated1000.csv, en_tr_translated2000.csv, ......... , en_tr_translated200000.csv]
My csv file columns are en_sentence, tr_sentence, translated_tr_sentence and translated_en_sentence.
The first two columns are prefilled with the same 200,000 rows/sentences in all the csv files. Each of my en_tr_translated{ }.csv files contains 1000 translated sentences related to its file name. For example, the en_tr_translated1000.csv file contains translated sentences from row 0 to row 1000, the en_tr_translated2000.csv file contains translated sentences from row 1000 to row 2000, and so on; the rest is nan/empty.
I want to copy/merge/join the rows to have one full csv file that contains all the translated sentences. I tried the below code:
out = pd.read_csv(path + 'en_tr_translated0.csv', sep='\t', names=['en_sentence', 'tr_sentence', 'translated_tr_sentence', 'translated_en_sentence'], dtype=str, encoding='utf-8', low_memory=False)
##
i = 1000
for _ in tqdm(range(200000)):
    new = pd.read_csv(path + f'en_tr_translated{i}.csv', sep='\t', names=['en_sentence', 'tr_sentence', 'translated_tr_sentence', 'translated_en_sentence'], dtype=str, encoding='utf-8', low_memory=False)
    out.loc[_, 'translated_tr_sentence'] = new.loc[_, 'translated_tr_sentence']
    out.loc[_, 'translated_en_sentence'] = new.loc[_, 'translated_en_sentence']
    if _ == i:
        i += 1000
Actually, it works fine but my problem is, it takes 105 HOURS!!
Is there any faster way to do this? I have to do this for like 5 different datasets and this is getting very annoying.
Any suggestion is appreciated.
Your input files have exactly one row of data per line in the file, correct? Then it would probably be even faster not to use pandas at all, although if done correctly, 200,000 rows should still be very fast whether you use pandas or not.
For doing it without pandas: just open each file, move to the matching index, write 1000 lines to the output file, then move on to the next file. You might have to fix headers etc. and watch out that there is no shift in the indices, but here is an idea of how to do that:
with open(path + 'en_tr_translated_combined.csv', 'w') as f_out:  # open the output file in write mode
    for filename_index in tqdm(range(0, 201000, 1000)):  # iterate over each index in steps of 1000 between 0 and 200000
        with open(path + f'en_tr_translated{filename_index}.csv') as f_in:  # open the file with that index
            for row_index, line in enumerate(f_in):  # iterate over its rows
                if row_index < filename_index:  # skip rows until you reach the ones containing translations
                    continue
                if row_index > filename_index + 1000:  # stop once you are past the part where the translations end
                    break
                f_out.write(line)  # for the rows in between: copy the content to the output file
I would load all the files, drop the rows that are not fully filled, and afterwards concatenate all of the dataframes.
Something like:
from pathlib import Path
import pandas as pd

dfs = []
for ff in Path('.').rglob('*.csv'):
    dfs.append(pd.read_csv(ff, names=['en_sentence', 'tr_sentence', 'translated_tr_sentence', 'translated_en_sentence'], dtype=str, encoding='utf-8', low_memory=True).dropna())
df = pd.concat(dfs)

Read csv file with empty lines

Analysis software I'm using outputs many groups of results in 1 csv file and separates the groups with 2 empty lines.
I would like to break the results in groups so that I can then analyse them separately.
I'm sure there is a built-in function in Python (or one of its libraries) that does this. I tried this piece of code that I found somewhere, but it doesn't seem to work.
import csv
results = open('03_12_velocity_y.csv').read().split("\n\n")
# Feed first csv.reader
first_csv = csv.reader(results[0], delimiter=',')
# Feed second csv.reader
second_csv = csv.reader(results[1], delimiter=',')
Update:
The original code actually works, but my python skills are pretty limited and I did not implement it properly.
The .split('\n\n\n') method does work, but csv.reader returns an iterator object; to get the data into a list (or something similar), you need to iterate through all of its rows and append them to the list.
I then used pandas to remove the header and convert the values in scientific notation to float. The code is below. Thanks everyone for the help.
import csv
import pandas as pd

# Open the csv file, read it and split it when it encounters 2 empty lines (\n\n\n)
results = open('03_12_velocity_y.csv').read().split('\n\n\n')

# Create csv.reader objects that are used to iterate over rows in a csv file
# Define the output - create an empty multi-dimensional list
output1 = [[],[]]

# Iterate through the rows in the csv file and append the data to the empty list
# Feed first csv.reader
csv_reader1 = csv.reader(results[0].splitlines(), delimiter=',')
for row in csv_reader1:
    output1.append(row)

df = pd.DataFrame(output1)
# remove first 7 rows of data (the start position of the slice is always included)
df = df.iloc[7:]
# Convert all data from string to float
df = df.astype(float)
If your row counts are inconsistent across groups, you'll need a little state machine to check when you're between groups and do something with the last group.
#!/usr/bin/env python3
import csv

def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as out_f:
        csv.writer(out_f).writerows(group)

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    group_i = 1
    group = []
    last_row = []
    for row in reader:
        if row == [] and last_row == [] and group != []:
            write_group(group, group_i)
            group = []
            group_i += 1
            continue
        if row == []:
            last_row = row
            continue
        group.append(row)
        last_row = row

# flush remaining group
if group != []:
    write_group(group, group_i)
I mocked up this sample CSV:
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3


g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3


g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
And when I run the program above I get three CSV files:
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
group_2.csv
g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3
group_3.csv
g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
If your row counts are consistent, you can do this with fairly vanilla Python or using the Pandas library.
Vanilla Python
Define your group size and the size of the break (in "rows") between groups.
Loop over all the rows adding each row to a group accumulator.
When the group accumulator reaches the pre-defined group size, do something with it, reset the accumulator, and then skip break-size rows.
Here, I'm writing each group to its own numbered file:
import csv

group_sz = 5
break_sz = 2

def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as f_out:
        csv.writer(f_out).writerows(group)

with open("input.csv", newline="") as f_in:
    reader = csv.reader(f_in)
    group_i = 1
    group = []
    for row in reader:
        group.append(row)
        if len(group) == group_sz:
            write_group(group, group_i)
            group_i += 1
            group = []
            for _ in range(break_sz):
                try:
                    next(reader)
                except StopIteration:  # gracefully ignore an expected StopIteration (at the end of the file)
                    break
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
g1r4c1,g1r4c2,g1r4c3
g1r5c1,g1r5c2,g1r5c3
With Pandas
I'm new to Pandas, and learning this as I go, but it looks like Pandas will automatically trim blank rows/records from a chunk of data^1.
With that in mind, all you need to do is specify the size of your group, and tell Pandas to read your CSV file in "iterator mode", where you can ask for a chunk (your group size) of records at a time:
import pandas as pd

group_sz = 5

with pd.read_csv("input.csv", header=None, iterator=True) as reader:
    i = 1
    while True:
        try:
            df = reader.get_chunk(group_sz)
        except StopIteration:
            break
        df.to_csv(f"group_{i}.csv")
        i += 1
Pandas adds an index column and a default header when it writes out the CSV:
group_1.csv
,0,1,2
0,g1r1c1,g1r1c2,g1r1c3
1,g1r2c1,g1r2c2,g1r2c3
2,g1r3c1,g1r3c2,g1r3c3
3,g1r4c1,g1r4c2,g1r4c3
4,g1r5c1,g1r5c2,g1r5c3
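If that extra index column and default header are unwanted, they can be dropped by passing the corresponding arguments to to_csv inside the same loop, e.g.:

df.to_csv(f"group_{i}.csv", index=False, header=False)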
TRY this out with your output:
import pandas as pd

# csv file name to be read in
in_csv = 'input.csv'

# get the number of lines of the csv file to be read
number_lines = sum(1 for row in open(in_csv))

# size of rows of data to write to the csv,
# you can change the row size according to your need
rowsize = 500

# start looping through data writing it to a new file for each set
for i in range(1, number_lines, rowsize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows=rowsize,  # number of rows to read at each loop
                     skiprows=i)     # skip rows that have been read

    # csv to write data to a new file with indexed name. input_1.csv etc.
    out_csv = 'input' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a')  # append data to csv file
I updated the question with the last details that answer my question.

I have a large CSV (53Gs) and I need to process it in chunks in parallel

I tried out the pool.map approach given in similar answers but I ended up with 8 files of 23G each which is worse.
import os
import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool

# csv file name to be read in
in_csv = 'mycsv53G.csv'

# get the number of lines of the csv file to be read
number_lines = sum(1 for row in tqdm(open(in_csv, encoding='latin1'), desc='Reading number of lines....'))
print(number_lines)

# size of rows of data to write to the CSV,
# you can change the row size according to your need
rowsize = 11367260  # decided based on your CPU core count

# start looping through data writing it to a new file for each set
def reading_csv(filename):
    for i in tqdm(range(1, number_lines, rowsize), desc='Reading CSVs...'):
        print('in reading csv')
        df = pd.read_csv(in_csv, encoding='latin1',
                         low_memory=False,
                         header=None,
                         nrows=rowsize,  # number of rows to read at each loop
                         skiprows=i)     # skip rows that have been read

        # csv to write data to a new file with indexed name. input_1.csv etc.
        out_csv = './csvlist/input' + str(i) + '.csv'
        df.to_csv(out_csv,
                  index=False,
                  header=False,
                  mode='a',            # append data to csv file
                  chunksize=rowsize)   # size of data to append for each loop

def main():
    # get a list of file names
    files = os.listdir('./csvlist')
    file_list = [filename for filename in tqdm(files) if filename.split('.')[1] == 'csv']

    # set up your pool
    with Pool(processes=8) as pool:  # or whatever your hardware can support
        print('in Pool')
        # have your pool map the file names to dataframes
        try:
            df_list = pool.map(reading_csv, file_list)
        except Exception as e:
            print(e)

if __name__ == '__main__':
    main()
The above approach took 4 hours just to split the files concurrently, and parsing every CSV will take even longer; I am not sure whether multiprocessing helped at all.
Currently, I read the CSV file through this code:
import pandas as pd
import datetime
import numpy as np

for chunk in pd.read_csv(filename, chunksize=10**5, encoding='latin-1', skiprows=1, header=None):
    # chunk processing
    final_df = final_df.append(agg, ignore_index=True)

final_df.to_csv('Final_Output/'+output_name, encoding='utf-8', index=False)
It takes close to 12 hours to process the large CSV in one go.
What improvements can be made here?
Any suggestions?
I am willing to try out dask now, but there are no other options left.
I would read the csv file line by line and feed the lines into a queue from which the processes pick their tasks. This way, you don't have to split the file first.
See this example here: https://stackoverflow.com/a/53847284/4141279
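A minimal sketch of that idea, assuming the work per line is independent (the filter in process_line, the ';' separator and the worker count are placeholders, not taken from your data):

import multiprocessing as mp

def process_line(line):
    # placeholder test: keep the line if its first column equals 'USA'
    return line.split(';')[0] == 'USA'

def worker(queue, worker_id):
    # each worker writes its matches to its own file to avoid sharing one handle
    with open(f'matches_{worker_id}.csv', 'w', encoding='latin1') as out:
        for line in iter(queue.get, None):  # None is the sentinel telling the worker to stop
            if process_line(line):
                out.write(line)

if __name__ == '__main__':
    queue = mp.Queue(maxsize=10000)  # bounded, so the reader cannot run far ahead of the workers
    workers = [mp.Process(target=worker, args=(queue, i)) for i in range(8)]
    for w in workers:
        w.start()

    with open('mycsv53G.csv', encoding='latin1') as f:
        next(f)              # skip the header line
        for line in f:
            queue.put(line)  # feed lines to the workers as they are read

    for _ in workers:
        queue.put(None)      # one sentinel per worker
    for w in workers:
        w.join()

The whole 53 GB file is only streamed once, and no intermediate split files are written.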

Python Pandas, Reading in file and skipping rows ahead of header

I am trying to loop over some files and skip the rows before the header in each file using pandas. All of the files are in the same data format, except some have a different number of rows before the header. Is there a way to loop over the files and start at the header of each file when some have more rows to skip than others?
For example,
some files require this:
f = pd.read_csv(fname,skiprows = 7,parse_dates=[0])
And some require this:
f = pd.read_csv(fname,skiprows = 15, parse_dates=[0])
Here is my chunk of code looping over my files:
for name, ID in stations:
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        print(fname)
        f = pd.read_csv(fname, skiprows=15, parse_dates=[0])  # could also skip 7 depending on file
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert to m/s from km/h
        dt = f['Date/Time']
One way is to read your file using pure Python I/O to find the index of the header row, then feed this into the skiprows argument of pd.read_csv.
This is fairly efficient since the first step uses a generator expression which reads only until the desired row is reached.
from io import StringIO
import pandas as pd
from copy import copy

mystr = StringIO("""dasfaf
kgafsda
Date/Time,num1,num2
2018-01-01,0,1
2018-01-02,2,3
""")

mystr2 = copy(mystr)

# replace mystr with open('file.csv', 'r')
with mystr as fin:
    idx = next(i for i, j in enumerate(fin) if j.startswith('Date/Time'))

# replace mystr2 with 'file.csv'
df = pd.read_csv(mystr2, skiprows=idx, parse_dates=[0])
print(df)
Date/Time num1 num2
0 2018-01-01 0 1
1 2018-01-02 2 3
Wrap this in a function if you need to repeat the task:
def calc_skiprows(fname):
    with open(fname) as fin:
        return next(i for i, j in enumerate(fin) if j.startswith('Date/Time'))

df = pd.read_csv(fname, skiprows=calc_skiprows(fname), parse_dates=[0])
The first suggestion/answer seemed like a really good way to handle it, but I couldn't get it to work for me for some reason. I did find another way to fix my problem using try and except in Python:
for name, ID in stations:
    # read in each station's .csv files, concatenate together, insert station id column
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        print(fname)
        try:
            f = pd.read_csv(fname, skiprows=7, parse_dates=[0])
        except:
            f = pd.read_csv(fname, skiprows=15, parse_dates=[0])
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert to m/s from km/h
        dt = f['Date/Time']
This way, if the first attempt to read the file fails (skipping 7 rows), it tries again using the other read_csv line (skipping 15 rows). This is not 100% correct since I am still hardcoding the number of lines to skip, but it works for my needs right now.

Pandas to_csv index=False not working when writing incremental chunks

I'm writing a fixed-width file to CSV. Because the file is too large to read at once, I'm reading it in chunks of 100000 rows and appending to the CSV. This works fine; however, it adds an index to the rows despite my having set index = False.
How can I complete the CSV file without index?
infile = filename
outfile = outfilename
cols = [(0,10), (12,19), (22,29), (34,41), (44,52), (54,64), (72,80), (82,106), (116,144), (145,152), (161,169), (171,181)]

for chunk in pd.read_fwf(path, colspecs=col_spec, index=False, chunksize=100000):
    chunk.to_csv(outfile, mode='a')
The to_csv method has a header parameter indicating whether to output the header; you probably only want it for the first write. The row index is added by to_csv as well, so that is also where index=False belongs (not in read_fwf).
So, you could do something like this:
for i, chunk in enumerate(pd.read_fwf(...)):
    first = i == 0
    chunk.to_csv(outfile, header=first, index=False, mode='a')
