I have some giant CSV files - around 23 GB each - and I want to do the following with their column headers:
If there is a column named SFID, perform this:
Rename column "Id" to "IgnoreId"
Rename column "SFID" to "Id"
else:
Do nothing
All the Google search results I see are about how to import the CSV into a dataframe, rename the column, and export it back to a CSV.
To me that feels like a giant waste of time/memory, because we are effectively only working with the very first row of the CSV file (which represents the headers). I don't know if it is necessary to load the whole CSV as a dataframe and export it to a new CSV (or export it to the same CSV, effectively overwriting it).
These being huge CSVs, I have to load them in small chunks and perform the operation, which takes time and memory. Again, it feels like a waste of memory, because apart from the headers we are not really doing anything with the remaining chunks.
Is there a way to load just the header of a CSV file, make changes to it, and save it back into the same CSV file?
I am open to ideas of using something other than pandas as well. The only real constraint is that the CSV files are too big to just double-click and open.
Write the header row first and copy the data rows using shutil.copyfileobj
shutil.copyfileobj took 38 seconds for a 0.5 GB file whereas fileinput took 125 seconds for the same.
Using shutil.copyfileobj
import os
import shutil
import tempfile
import pandas as pd

df = pd.read_csv(filename, nrows=0)  # read only the header row
if 'SFID' in df.columns:
    # rename columns
    df.rename(columns={"Id": "IgnoreId", "SFID": "Id"}, inplace=True)
    # construct new header row
    header_row = ','.join(df.columns) + "\n"
    # write the new header, then stream the data rows after it
    with open(filename, "r") as f_in, tempfile.NamedTemporaryFile(
            "w", dir=os.path.dirname(os.path.abspath(filename)), delete=False) as f_out:
        f_in.readline()                  # move the pointer past the old header row
        f_out.write(header_row)          # write the new header
        shutil.copyfileobj(f_in, f_out)  # copy the data rows unchanged
    os.replace(f_out.name, filename)     # swap the rewritten file into place
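(Note: opening the same file twice and writing through one handle while reading through the other is only safe if the new header is exactly as long as the old one; here renaming Id to IgnoreId makes the header row longer, so the write position would overrun data rows that have not been read yet. Writing to a temporary file in the same directory and swapping it in with os.replace avoids that, at the cost of temporarily needing disk space for a second copy.)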
Using fileinput
import fileinput
import pandas as pd

df = pd.read_csv(filename, nrows=0)  # read only the header row
if 'SFID' in df.columns:
    # rename columns
    df.rename(columns={"Id": "IgnoreId", "SFID": "Id"}, inplace=True)
    # construct new header row (print adds the trailing newline)
    header_row = ','.join(df.columns)
    # rewrite the file in place, replacing only the first line
    with fileinput.input(filename, inplace=True) as f:
        for line in f:
            if fileinput.isfirstline():
                print(header_row)    # stdout is redirected into the file
            else:
                print(line, end='')  # copy every other line unchanged
For huge files, a simple command-line solution with the stream editor sed might be faster than a Python script:
sed -e '1 {/SFID/ {s/Id/IgnoreId/; s/SFID/Id/}}' -i myfile.csv
This changes Id to IgnoreId and SFID to Id in the first line if it contains SFID. If other column headers also contain the string Id (e.g. ImportantId), you'll have to refine the regexes in the s commands accordingly.
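One way to refine them (a sketch, assuming GNU sed and a header with no quoted or embedded commas) is to anchor each pattern to a whole comma-separated field:
sed -E -e '1 {/SFID/ {s/(^|,)Id(,|$)/\1IgnoreId\2/; s/(^|,)SFID(,|$)/\1Id\2/}}' -i myfile.csv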
I have about 200 CSV files and I need to combine them on specific columns. Each CSV file contains 1000 filled rows in specific columns. My file names are like below:
csv_files = [en_tr_translated0.csv, en_tr_translated1000.csv, en_tr_translated2000.csv, ......... , en_tr_translated200000.csv]
My CSV file columns are like below:
The first two columns are prefilled with the same 200,000 rows/sentences in all of the CSV files. Each en_tr_translated{ }.csv file contains 1000 translated sentences corresponding to its file name. For example:
The en_tr_translated1000.csv file contains translated sentences from row 0 to row 1000, the en_tr_translated2000.csv file contains translated sentences from row 1000 to row 2000, etc. The rest is NaN/empty. Below is an example image from the en_tr_translated3000.csv file.
I want to copy/merge/join the rows to have one full CSV file that contains all the translated sentences. I tried the code below:
out = pd.read_csv(path + 'en_tr_translated0.csv', sep='\t', names=['en_sentence', 'tr_sentence', 'translated_tr_sentence', 'translated_en_sentence'], dtype=str, encoding='utf-8', low_memory=False)

i = 1000
for _ in tqdm(range(200000)):
    new = pd.read_csv(path + f'en_tr_translated{i}.csv', sep='\t', names=['en_sentence', 'tr_sentence', 'translated_tr_sentence', 'translated_en_sentence'], dtype=str, encoding='utf-8', low_memory=False)
    out.loc[_, 'translated_tr_sentence'] = new.loc[_, 'translated_tr_sentence']
    out.loc[_, 'translated_en_sentence'] = new.loc[_, 'translated_en_sentence']
    if _ == i:
        i += 1000
Actually, it works fine, but my problem is that it takes 105 HOURS!!
Is there any faster way to do this? I have to do this for like 5 different datasets and this is getting very annoying.
Any suggestion is appreciated.
Your input files have exactly one row of data per line in the file, correct? Then it would probably be even faster if you don't use pandas at all, although, done correctly, 200,000 rows should still be very fast either way.
For doing it without pandas: just open each file, skip ahead to the fitting index, and write 1000 lines to the output file, then move on to the next file. You might have to fix headers etc. and watch out that there is no shift in the indices, but here is an idea of how to do that:
with open(path + 'en_tr_translated_combined.csv', 'w') as f_out:  # open the output file in write mode
    for filename_index in tqdm(range(0, 201000, 1000)):  # iterate over the file indices in steps of 1000
        with open(path + f'en_tr_translated{filename_index}.csv') as f_in:  # open the file with that index
            for row_index, line in enumerate(f_in):  # iterate over its rows
                if row_index < filename_index:  # skip rows until you reach the ones with translated content
                    continue
                if row_index >= filename_index + 1000:  # stop once the block of 1000 translations ends (>= so exactly 1000 lines are copied)
                    break
                f_out.write(line)  # for the rows in between: copy the content to the output file
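For comparison, here is a rough pandas variant of the same idea (a sketch reusing path and the column names from the question, and assuming each en_tr_translated{N}.csv holds the 1000 translations starting at row N; shift the offset if the naming is off by 1000). It reads only the filled slice of each file via skiprows/nrows:

import pandas as pd

cols = ['en_sentence', 'tr_sentence', 'translated_tr_sentence', 'translated_en_sentence']
parts = []
for start in range(0, 200000, 1000):
    # read only rows start .. start+999, the slice this file actually fills
    part = pd.read_csv(path + f'en_tr_translated{start}.csv', sep='\t', names=cols,
                       dtype=str, encoding='utf-8', skiprows=start, nrows=1000)
    parts.append(part)
out = pd.concat(parts, ignore_index=True)
out.to_csv(path + 'en_tr_translated_combined.csv', sep='\t', index=False)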
I would load all the files, drop the rows that are not fully filled, and afterwards concatenate all of the dataframes.
Something like:
from pathlib import Path
import pandas as pd

dfs = []
for ff in Path('.').rglob('*.csv'):
    dfs.append(pd.read_csv(ff, sep='\t',
                           names=['en_sentence', 'tr_sentence', 'translated_tr_sentence', 'translated_en_sentence'],
                           dtype=str, encoding='utf-8', low_memory=True).dropna())
df = pd.concat(dfs)
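Note: rglob is not guaranteed to return the files in numeric order, and pd.concat keeps the input order, so it is worth sorting the paths by the number embedded in the file name first (a small sketch, assuming the naming scheme from the question):

files = sorted(Path('.').rglob('en_tr_translated*.csv'),
               key=lambda p: int(p.stem.replace('en_tr_translated', '')))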
I am trying to add a header to my CSV file.
I am importing data from a .csv file which has two columns of data, each containing float numbers. Example:
11 22
33 44
55 66
Now I want to add a header for both columns like:
ColA ColB
11 22
33 44
55 66
I have tried this:
with open('mycsvfile.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerow(('ColA', 'ColB'))
I used 'a' to append, but this added the header values in the bottom row of the file instead of the first row. Is there any way I can fix this?
One way is to read all the data in, then overwrite the file with the header and write the data out again. This might not be practical with a large CSV file:
import csv

with open('file.csv', newline='') as f:
    r = csv.reader(f)
    data = [line for line in r]

with open('file.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['ColA', 'ColB'])
    w.writerows(data)
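If the file really is too large for that, a streaming variant (a sketch: write the new header plus the original content to a temporary file, then swap it into place) avoids holding all the data in memory:

import os
import shutil
import tempfile

with open('file.csv', newline='') as f_in, tempfile.NamedTemporaryFile(
        'w', dir='.', delete=False, newline='') as f_out:
    f_out.write('ColA,ColB\n')       # new header first
    shutil.copyfileobj(f_in, f_out)  # then the original lines, unchanged
os.replace(f_out.name, 'file.csv')   # atomically replace the original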
I think you should use pandas to read the CSV file, insert the column headers/labels, and write out the new CSV file, assuming your CSV file is comma-delimited. Something like this should work:
from pandas import read_csv

# header=None so the first row of data is not mistaken for a header
df = read_csv('test.csv', header=None)
df.columns = ['a', 'b']
df.to_csv('test_2.csv', index=False)  # index=False keeps the row index out of the file
I know the question was asked a long time back. But for others stumbling across this question, here's an alternative to Python.
If you have access to sed (you do if you are working on Linux or Mac; you can also download Ubuntu Bash on Windows 10 and sed will come with it), you can use this one-liner:
sed -i 1i"ColA,ColB" mycsvfile.csv
The -i ensures that sed edits in place, meaning sed overwrites the file with the header added at the top. This is risky.
If you want to create a new file instead, do this:
sed 1i"ColA,ColB" mycsvfile.csv > newcsvfile.csv
In this case, you don't need the csv module. You need the fileinput module, as it allows in-place editing:
import fileinput

for line in fileinput.input(files=['mycsvfile.csv'], inplace=True):
    if fileinput.isfirstline():
        print('ColA,ColB')   # insert the header before the first line
    print(line, end='')      # then echo every original line
In the above code, print writes into the file because of the inplace=True parameter, which redirects stdout to the file being edited.
For the issue where the first row of the CSV file gets replaced by the header, you need to add the header=None option when reading:

import pandas as pd

df = pd.read_csv('file.csv', header=None)  # don't treat the first data row as a header
df.to_csv('file.csv', header=['col1', 'col2'], index=False)
You can set the field names for csv.DictReader as a list, like in your case (note that this only labels the columns while reading; it does not write a header into the file):

import csv

with open('mycsvfile.csv', newline='') as fd:  # open for reading, not appending
    reader = csv.DictReader(fd, fieldnames=["ColA", "ColB"])
    for row in reader:
        print(row)  # each row is a dict keyed by ColA/ColB
I'm writing a fixed-width file to CSV. Because the file is too large to read at once, I'm reading it in chunks of 100,000 rows and appending to the CSV. This works fine; however, it's adding an index to the rows despite my having set index = False.
How can I write the CSV file without the index?
infile = filename
outfile = outfilename
col_spec = [(0,10), (12,19), (22,29), (34,41), (44,52), (54,64), (72,80), (82,106), (116,144), (145,152), (161,169), (171,181)]

for chunk in pd.read_fwf(infile, colspecs=col_spec, index=False, chunksize=100000):
    chunk.to_csv(outfile, mode='a')
The to_csv method has a header parameter indicating whether to output the header. In this case, you probably do not want it for any write after the first one.
So, you could do something like this:
for i, chunk in enumerate(pd.read_fwf(...)):
    first = i == 0
    chunk.to_csv(outfile, header=first, mode='a', index=False)  # index=False (on to_csv, not read_fwf) drops the row index