Renaming large csv files [duplicate] - python

I have some giant CSV files - around 23 GB in size - and I want to do the following with their column headers:
If there is a column named SFID:
Rename column "Id" to "IgnoreId"
Rename column "SFID" to "Id"
else:
Do nothing
All the Google search results I see are about how to import the CSV into a dataframe, rename the column, and export it back to a CSV.
To me that feels like a giant waste of time/memory, because we are effectively only working with the very first row of the CSV file (the headers). I don't know if it is necessary to load the whole CSV as a dataframe and export it to a new CSV (or export it to the same CSV, effectively overwriting it).
Being huge CSVs, I have to load them in small chunks and perform the operation, which takes time and memory. Again, it feels like a waste of memory because, apart from the headers, we are not really doing anything with the remaining chunks.
Is there a way to just load the header of a CSV file, make changes to the headers, and save it back into the same CSV file?
I am open to ideas of using something other than pandas as well. The only real constraint is that the CSV files are too big to just double-click and open.

Write the header row first and copy the data rows using shutil.copyfileobj
shutil.copyfileobj took 38 seconds for a 0.5 GB file whereas fileinput took 125 seconds for the same.
Using shutil.copyfileobj
import os
import shutil
import pandas as pd

df = pd.read_csv(filename, nrows=0)  # read only the header row
if 'SFID' in df.columns:
    # rename columns
    df.rename(columns={"Id": "IgnoreId", "SFID": "Id"}, inplace=True)
    # construct new header row
    header_row = ','.join(df.columns) + "\n"
    # write the new header and the untouched data rows to a temporary file,
    # then swap it over the original (overwriting the original in place with
    # two handles would corrupt data here, because the new header is longer
    # than the old one)
    with open(filename, "r") as f1, open(filename + ".tmp", "w") as f2:
        f1.readline()                 # skip the old header row
        f2.write(header_row)          # write the new header row
        shutil.copyfileobj(f1, f2)    # copy the data rows
    os.replace(filename + ".tmp", filename)
Using fileinput
import fileinput
import pandas as pd

df = pd.read_csv(filename, nrows=0)  # read only the header row
if 'SFID' in df.columns:
    # rename columns
    df.rename(columns={"Id": "IgnoreId", "SFID": "Id"}, inplace=True)
    # construct new header row
    header_row = ','.join(df.columns)
    # modify header in csv file; inplace=True redirects print() into the file
    f = fileinput.input(filename, inplace=True)
    for line in f:
        if fileinput.isfirstline():
            print(header_row)         # print adds the trailing newline
        else:
            print(line, end='')       # keep the data rows as they are
    f.close()

For huge files, a simple command-line solution with the stream editor sed might be faster than a Python script:
sed -e '1 {/SFID/ {s/Id/IgnoreId/; s/SFID/Id/}}' -i myfile.csv
This changes Id to IgnoreId and SFID to Id in the first line if it contains SFID. If other column headers also contain the string Id (e.g. ImportantId), then you'll have to refine the regexes in the s commands accordingly.
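If your headers do contain other Id-like names, an alternative to hand-tuning the regexes is to parse the header row with the standard csv module and rename exact field names only. A rough stdlib-only sketch (no pandas), assuming filename holds the path to the CSV; it writes the result to a temporary file and then swaps it over the original:

import csv
import os
import shutil

with open(filename, "r", newline="") as src:
    header = next(csv.reader(src))                  # parse only the header row
    if "SFID" in header:
        renames = {"Id": "IgnoreId", "SFID": "Id"}  # exact-name renames only
        header = [renames.get(name, name) for name in header]
        tmp_path = filename + ".tmp"
        with open(tmp_path, "w", newline="") as dst:
            csv.writer(dst).writerow(header)        # write the new header row
            shutil.copyfileobj(src, dst)            # copy the data rows unchanged
        os.replace(tmp_path, filename)              # replace the original file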

Related

How to combine multiple csv files on specific columns in a fast way?

I have about 200 CSV files and I need to combine them on specific columns. Each CSV file contains 1000 filled rows on specific columns. My file names are like below:
csv_files = [en_tr_translated0.csv, en_tr_translated1000.csv, en_tr_translated2000.csv, ......... , en_tr_translated200000.csv]
My csv file columns are like below:
The first two columns are prefilled with the same 200,000 rows/sentences in all of the CSV files. Each of my en_tr_translated{ }.csv files contains 1000 translated sentences, corresponding to its file name. For example:
en_tr_translated1000.csv contains translated sentences from row 0 to row 1000, en_tr_translated2000.csv contains translated sentences from row 1000 to row 2000, etc. The rest is nan/empty. Below is an example image from the en_tr_translated3000.csv file.
I want to copy/merge/join the rows to have one full csv file that contains all the translated sentences. I tried the below code:
out = pd.read_csv(path + 'en_tr_translated0.csv', sep='\t', names=['en_sentence', 'tr_sentence', 'translated_tr_sentence', 'translated_en_sentence'], dtype=str, encoding='utf-8', low_memory=False)
##
i = 1000
for _ in tqdm(range(200000)):
    new = pd.read_csv(path + f'en_tr_translated{i}.csv', sep='\t', names=['en_sentence', 'tr_sentence', 'translated_tr_sentence', 'translated_en_sentence'], dtype=str, encoding='utf-8', low_memory=False)
    out.loc[_, 'translated_tr_sentence'] = new.loc[_, 'translated_tr_sentence']
    out.loc[_, 'translated_en_sentence'] = new.loc[_, 'translated_en_sentence']
    if _ == i:
        i += 1000
Actually, it works fine but my problem is, it takes 105 HOURS!!
Is there any faster way to do this? I have to do this for like 5 different datasets and this is getting very annoying.
Any suggestion is appreciated.
Your input files have exactly one row of data per line in the file, correct? Then it would probably be even faster if you don't use pandas at all, although if done correctly, 200,000 rows should still be very fast no matter whether you use pandas or not.
For doing it without pandas: just open each file, move to the fitting index, write 1000 lines to the output file, then move on to the next file. You might have to fix headers etc. and watch out that there is no shift in the indices, but here is an idea of how to do that:
with open(path + 'en_tr_translated_combined.csv', 'w') as f_out:  # open the output file in write mode
    for filename_index in tqdm(range(0, 201000, 1000)):  # iterate over the file indices in steps of 1000 between 0 and 200000
        with open(path + f'en_tr_translated{filename_index}.csv') as f_in:  # open the file with that index
            for row_index, line in enumerate(f_in):  # iterate over its rows
                if row_index < filename_index:  # skip rows until you reach the ones with translated content
                    continue
                if row_index > filename_index + 1000:  # stop once you are past the part where the translations end
                    break
                f_out.write(line)  # for the rows in between: copy the content to the output file
I would load all the files, drop the rows that are not fully filled, and afterwards concatenate all of the dataframes.
Something like:
from pathlib import Path
import pandas as pd

dfs = []
for ff in Path('.').rglob('*.csv'):
    dfs.append(pd.read_csv(ff, sep='\t', names=['en_sentence', 'tr_sentence', 'translated_tr_sentence', 'translated_en_sentence'], dtype=str, encoding='utf-8', low_memory=True).dropna())
df = pd.concat(dfs)
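One thing to watch: rglob returns the files in no particular order, but dropna() keeps each file's original row labels, so the combined frame can be put back into the original sentence order afterwards (assuming the default integer index from read_csv):
df = df.sort_index()  # restore the original row order by row label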

Rename a column header in csv using python pandas


how to edit a csv in python and add one row after the 2nd row that will have the same values in all columns except 1

I'm new to the Python language and I'm facing a small challenge which I haven't been able to figure out so far.
I receive a csv file with around 30-40 columns and 5-50 rows, with various details in each cell. The 1st row of the csv has the title for each column, and from the 2nd row on I have item values.
What I want to do is create a python script which will read the csv file and every time do the following:
Add a row after the actual 1st item row (literally after the 2nd row, since the 1st row is titles), and have that new 3rd row contain the same information as the one above it with one difference only: in the column "item_subtotal" I want to add the value from the column "discount_total".
All the rows below should remain as they are, and then save this modified csv as a new file with the word "edited" added to the file name.
I could really use some help, because so far I've only managed to open the csv file with the python script I'm developing, but I haven't been able to add the contents of the row above to the newly created row and replace that specific value.
Looking forward to any help.
Thank you
Here I'm attaching the CSV with some values changed for privacy reasons.
order_id,order_number,date,status,shipping_total,shipping_tax_total,fee_total,fee_tax_total,tax_total,discount_total,order_total,refunded_total,order_currency,payment_method,shipping_method,customer_id,billing_first_name,billing_last_name,billing_company,billing_email,billing_phone,billing_address_1,billing_address_2,billing_postcode,billing_city,billing_state,billing_country,shipping_first_name,shipping_last_name,shipping_address_1,shipping_address_2,shipping_postcode,shipping_city,shipping_state,shipping_country,shipping_company,customer_note,item_id,item_product_id,item_name,item_sku,item_quantity,item_subtotal,item_subtotal_tax,item_total,item_total_tax,item_refunded,item_refunded_qty,item_meta,shipping_items,fee_items,tax_items,coupon_items,order_notes,download_permissions_granted,admin_custom_order_field:customer_type_5
15001_TEST_2,,"2017-10-09 18:53:12",processing,0,0.00,0.00,0.00,5.36,7.06,33.60,0.00,EUR,PayoneCw_PayPal,"0,00",0,name,surname,,name.surname#gmail.com,0123456789,"address 1",,41541_TEST,location,,DE,name,surname,address,01245212,14521,location,,DE,,,1328,302,"product title",103,1,35.29,6.71,28.24,5.36,0.00,0,,"id:1329|method_id:free_shipping:3|method_title:0,00|total:0.00",,id:1330|rate_id:1|code:DE-MWST-1|title:MwSt|total:5.36|compound:,"id:1331|code:#getgreengent|amount:7.06|description:Launchcoupon for friends","text string",1,
You can also use pandas to manipulate the data from the csv like this:
import pandas
import copy
Read the csv file into a pandas dataframe:
df = pandas.read_csv(filename)
Make a deepcopy of the first row of data and add the discount total to the item subtotal:
new_row = copy.deepcopy(df.loc[0])  # df.loc[0] is the first row of data (the header is not a data row)
new_row['item_subtotal'] += new_row['discount_total']
Concatenate the first data row, the new row (turned back into a one-row DataFrame), and everything after that:
df = pandas.concat([df.loc[:0], new_row.to_frame().T, df.loc[1:]], ignore_index=True)
Change the filename and write out the new csv file:
filename = filename[:-len('.csv')] + 'edited.csv'  # drop the ".csv" extension and add "edited"
df.to_csv(filename, index=False)  # index=False keeps the row index out of the file
I hope this helps! Pandas is great for cleanly handling massive amounts of data, but may be overkill for what you are trying to do. Then again, maybe not. It would help to see an example data file.
The first step is to turn that .csv into something that is a little easier to work with. Fortunately, python has the 'csv' module which makes it easy to turn your .csv file into a much nicer list of lists. The below will give you a way to both turn your .csv into a list of lists and turn the modified data back into a .csv file.
import csv
import copy

def csv2list(ifile):
    """
    ifile = the path of the csv to be converted into a list of lists
    """
    olist = []
    with open(ifile, newline='') as f:
        c = csv.reader(f, dialect='excel')
        for line in c:
            olist.append(line)  # and update the outer list
    return olist

#------------------------------------------------------------------------------
def list2csv(ilist, ofile):
    """
    ilist = the list of lists to be converted
    ofile = the output path for your csv file
    """
    with open(ofile, 'w', newline='') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=',',
                               quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for x in ilist:
            csvwriter.writerow(x)
Now, you can simply copy ilist[1] (the first data row, since ilist[0] is the header) and change the appropriate element to reflect your summed value using:
listTemp = copy.deepcopy(ilist[1])
# n is the index of the "item_subtotal" column and n-x the index of the
# "discount_total" column; the csv values are strings, so convert before adding
listTemp[n] = str(float(listTemp[n]) + float(listTemp[n-x]))
ilist.insert(2, listTemp)
As for how to change the file name, just use:
import os
newFileName = os.path.splitext(oldFileName)[0] + "edited" + os.path.splitext(oldFileName)[1]
Hopefully this will help you out!

Pandas to_csv index=False not working when writing incremental chunks

I'm writing a fixed-width file out to CSV. Because the file is too large to read at once, I'm reading it in chunks of 100000 rows and appending to the CSV. This works fine, however it adds an index to the rows despite index = False being set.
How can I write the CSV file without the index?
infile = filename
outfile = outfilename
cols = [(0,10), (12,19), (22,29), (34,41), (44,52), (54,64), (72,80), (82,106), (116,144), (145,152), (161,169), (171,181)]

for chunk in pd.read_fwf(infile, colspecs=cols, index=False, chunksize=100000):
    chunk.to_csv(outfile, mode='a')
The to_csv method has a header parameter, indicating whether to output the header. In this case, you probably do not want it for any write other than the first one.
So you could do something like this:
for i, chunk in enumerate(pd.read_fwf(...)):
    first = i == 0
    chunk.to_csv(outfile, header=first, mode='a')
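For the index column itself, index=False needs to go on to_csv rather than read_fwf; as the question shows, setting it on the reader does not stop the index from being written. A small sketch combining both fixes, reusing infile, outfile and cols from the question:

import pandas as pd

for i, chunk in enumerate(pd.read_fwf(infile, colspecs=cols, chunksize=100000)):
    chunk.to_csv(outfile,
                 mode='w' if i == 0 else 'a',  # start a fresh file, then append
                 header=(i == 0),              # write the column names only once
                 index=False)                  # do not write the row index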

Python- Import Multiple Files to a single .csv file

I have 125 data files, each containing two columns and 21 rows of data, and I'd like to import them into a single .csv file (as 125 pairs of columns and only 21 rows).
This is what my data files look like:
I am fairly new to python but I have come up with the following code:
import glob

Results = glob.glob('./*.data')
fout = 'c:/Results/res.csv'
fout = open("res.csv", 'w')
for file in Results:
    g = open(file, "r")
    fout.write(g.read())
    g.close()
fout.close()
The problem with the above code is that all the data are copied into only two columns with 125*21 rows.
Any help is very much appreciated!
This should work:
import glob

files = [open(f) for f in glob.glob('./*.data')]  # make a list of open files
fout = open("res.csv", 'w')
for row in range(21):
    for f in files:
        fout.write(f.readline().strip())  # strip removes the trailing newline
        fout.write(',')
    fout.write('\n')
fout.close()
Note that this method will probably fail if you try a large number of files; I believe the default limit on simultaneously open files is often around 256, depending on the operating system.
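If you do hit that limit, one workaround is to read each file's lines up front so that only one input file is open at a time. A rough sketch of the same column-pasting idea, assuming (as in the question) that every file has the same 21 lines:

import glob

# Read each file completely so that only one file is open at a time.
file_lines = []
for fname in glob.glob('./*.data'):
    with open(fname) as f:
        file_lines.append([line.strip() for line in f])

# Paste the i-th line of every file side by side, separated by commas.
# zip() stops at the shortest file, so all files must have the same row count.
with open("res.csv", 'w') as fout:
    for row_parts in zip(*file_lines):
        fout.write(','.join(row_parts) + '\n')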
You may want to try the python CSV module (http://docs.python.org/library/csv.html), which provides very useful methods for reading and writing CSV files. Since you stated that you want only 21 rows with 250 columns of data, I would suggest creating 21 python lists as your rows and then appending data to each row as you loop through your files.
something like:
import csv

rows = []
for i in range(0, 21):
    row = []
    rows.append(row)

# Not sure of the structure of your input files or how they are delimited, but
# for each one, as you have it open and iterate through its rows, you would
# want to append the values in each row to the end of the corresponding list
# contained within the rows list.

# then, write each row to the new csv:
writer = csv.writer(open('output.csv', 'w', newline=''), delimiter=',')
for row in rows:
    writer.writerow(row)
(Sorry, I cannot add comments, yet.)
[Edited later, the following statement is wrong!!!] "davesnitty's row-generating loop can be replaced by rows = [[]] * 21." It is wrong because this would create a list of empty lists in which the "empty lists" are actually a single empty list object shared by all elements of the outer list.
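A quick demonstration of that sharing pitfall (a throwaway snippet, not part of the solution):

rows = [[]] * 21                 # 21 references to one and the same inner list
rows[0].append('x')
print(rows[1])                   # ['x'] -- the append shows up in every "row"

rows = [[] for _ in range(21)]   # 21 independent inner lists
rows[0].append('x')
print(rows[1])                   # []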
My +1 to using the standard csv module. But the files should always be closed -- especially when you open that many of them. Also, there is a gap: the rows read from the input files are never appended anywhere, even though the result is written out -- that part of the solution is missing. Basically, each row read from a file should be appended to the sublist for its line number, and the line number should be obtained via enumerate(reader), where reader is csv.reader(fin, ...).
[added later] Try the following code, fixing the paths for your purpose:
import csv
import glob
import os

datapath = './data'
resultpath = './result'
if not os.path.isdir(resultpath):
    os.makedirs(resultpath)

# Initialize the empty rows. It does not check how many rows are
# in the file.
rows = []

# Read data from the files into the above matrix.
for fname in glob.glob(os.path.join(datapath, '*.data')):
    with open(fname, newline='') as f:
        reader = csv.reader(f)
        for n, row in enumerate(reader):
            if len(rows) < n + 1:
                rows.append([])       # add another row
            rows[n].extend(row)       # append the elements from the file

# Write the data from memory to the result file.
fname = os.path.join(resultpath, 'result.csv')
with open(fname, 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
