I am trying to add a header to my CSV file.
I am importing data from a .csv file which has two columns of data, each containing float numbers. Example:
11 22
33 44
55 66
Now I want to add a header for both columns like:
ColA ColB
11 22
33 44
55 66
I have tried this:
import csv

with open('mycsvfile.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerow(('ColA', 'ColB'))
I used 'a' to append the data, but this added the values in the bottom row of the file instead of the first row. Is there any way I can fix it?
One way is to read all the data in, then overwrite the file with the header and write the data out again. This might not be practical with a large CSV file:
#!python3
import csv

with open('file.csv', newline='') as f:
    r = csv.reader(f)
    data = [line for line in r]

with open('file.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['ColA', 'ColB'])
    w.writerows(data)
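If the file really is too large to hold in memory, a variant of the same idea streams the rows through a temporary file and then swaps it into place. A sketch, assuming the same two-column layout as the question (the filenames are placeholders):

```python
import csv
import os
import tempfile

# sample data file without a header (stand-in for the real CSV)
with open('file.csv', 'w', newline='') as f:
    csv.writer(f).writerows([[11, 22], [33, 44], [55, 66]])

# stream the rows through a temp file so the whole CSV never sits in memory
with open('file.csv', newline='') as src, \
     tempfile.NamedTemporaryFile('w', newline='', delete=False,
                                 dir='.', suffix='.tmp') as tmp:
    writer = csv.writer(tmp)
    writer.writerow(['ColA', 'ColB'])    # new header first
    writer.writerows(csv.reader(src))    # then copy the data rows unchanged
os.replace(tmp.name, 'file.csv')         # swap the temp file into place
```

Writing to a temp file in the same directory keeps the original intact until the final rename.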
I think you should use pandas to read the CSV file, insert the column headers/labels, and emit the new CSV file. This assumes your CSV file is comma-delimited. Something like this should work:
from pandas import read_csv

# header=None keeps pandas from consuming the first data row as a header
df = read_csv('test.csv', header=None)
df.columns = ['a', 'b']
df.to_csv('test_2.csv', index=False)
I know the question was asked a long time back. But for others stumbling across this question, here's an alternative to Python.
If you have access to sed (you do if you are working on Linux or Mac; you can also download Ubuntu Bash on Windows 10 and sed will come with it), you can use this one-liner:
sed -i 1i"ColA,ColB" mycsvfile.csv
The -i flag makes sed edit in place, meaning sed overwrites the file with the header added at the top. This is risky.
If you want to create a new file instead, do this:
sed 1i"ColA,ColB" mycsvfile.csv > newcsvfile.csv
In this case, you don't need the csv module. You need the fileinput module, as it allows in-place editing:
import fileinput

for line in fileinput.input(files=['mycsvfile.csv'], inplace=True):
    if fileinput.isfirstline():
        print('ColA,ColB')
    print(line, end='')
In the above code, print writes to the file because the inplace=True parameter redirects stdout to it.
For the issue where the first row of the CSV file gets replaced by the header, we need to add header=None so pandas does not treat the first data row as a header:
import pandas as pd

df = pd.read_csv('file.csv', header=None)
df.to_csv('file.csv', header=['col1', 'col2'], index=False)
You can set reader.fieldnames in your code as a list, like in your case:
import csv

with open('mycsvfile.csv') as fd:
    reader = csv.DictReader(fd)
    reader.fieldnames = ["ColA", "ColB"]
    for row in reader:
        print(row)
I have some giant CSV files (around 23 GB each) in which I want to do this with their column headers:
If there is a column named SFID:
Rename column "Id" to "IgnoreId"
Rename column "SFID" to "Id"
else:
Do nothing
All the Google search results I see are about how to import the CSV into a dataframe, rename the column, and export it back into a CSV.
To me this feels like a giant waste of time/memory, because we are effectively working with just the very first row of the CSV file (the header). I don't know whether it is necessary to load the whole CSV as a dataframe and export it to a new CSV (or to the same CSV, effectively overwriting it).
Being huge CSVs, I have to load them in small chunks and perform the operation, which takes time and memory. Again, it feels like a waste of memory, because apart from the header we are not really doing anything with the remaining chunks.
Is there a way to load just the header of a CSV file, change it, and save it back into the same file?
I am open to ideas beyond pandas as well. The only real constraint is that the CSV files are too big to just double-click open.
Write the header row first and copy the data rows using shutil.copyfileobj
shutil.copyfileobj took 38 seconds for a 0.5 GB file whereas fileinput took 125 seconds for the same.
Using shutil.copyfileobj
import shutil
import pandas as pd

df = pd.read_csv(filename, nrows=0)  # read only the header row
if 'SFID' in df.columns:
    # rename columns
    df.rename(columns={"Id": "IgnoreId", "SFID": "Id"}, inplace=True)
    # construct new header row
    header_row = ','.join(df.columns) + "\n"
    # modify header in csv file; only safe if the new header is not longer
    # than the old one, since f2 writes over bytes f1 has yet to read
    with open(filename, "r+") as f1, open(filename, "r+") as f2:
        f1.readline()               # move f1 past the old header row
        f2.write(header_row)        # overwrite from the start of the file
        shutil.copyfileobj(f1, f2)  # copy the data rows
        f2.truncate()               # drop leftover bytes if the file shrank
Using fileinput
import fileinput
import pandas as pd

df = pd.read_csv(filename, nrows=0)  # read only the header row
if 'SFID' in df.columns:
    # rename columns
    df.rename(columns={"Id": "IgnoreId", "SFID": "Id"}, inplace=True)
    # construct new header row
    header_row = ','.join(df.columns)
    # modify header in csv file
    f = fileinput.input(filename, inplace=True)
    for line in f:
        if fileinput.isfirstline():
            print(header_row)
        else:
            print(line, end='')
    f.close()
For huge files, a simple command-line solution with the stream editor sed might be faster than a Python script:
sed -e '1 {/SFID/ {s/Id/IgnoreId/; s/SFID/Id/}}' -i myfile.csv
This changes Id to IgnoreId and SFID to Id in the first line, if it contains SFID. If other column headers also contain the string Id (e.g. ImportantId), you'll have to refine the regexes in the s commands accordingly.
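One way to refine exactly that, assuming GNU sed, is to use its word-boundary anchors \< and \>, so the Id inside SFID (or ImportantId) is left alone. A sketch with a made-up header:

```shell
# sample file whose header contains both Id and SFID
printf 'Id,Name,SFID\n1,foo,x1\n' > myfile.csv

# \< and \> are GNU sed word boundaries, so \<Id\> does not match the Id in SFID
sed -i '1 {/SFID/ {s/\<Id\>/IgnoreId/; s/\<SFID\>/Id/}}' myfile.csv

head -n1 myfile.csv   # IgnoreId,Name,Id
```

On BSD/macOS sed, -i needs an explicit (possibly empty) suffix argument and the word-boundary syntax differs, so this sketch is GNU-sed specific.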
Basically I have data from a mechanical test in the output format .raw and I want to access it in Python.
The file needs to be split using the delimiter ";", so it contains 13 columns.
The idea is to index and pull out the desired information, which in my case is the "Extension mm" and "Load N" values starting at row 41, as arrays, in order to create a plot.
I have never worked with .raw files and I don't know what to do.
The file can be downloaded here:
https://drive.google.com/file/d/0B0GJeyFBNd4FNEp0elhIWGpWWWM/view?usp=sharing
Hope somebody can help me out there!
You can convert the raw file into a CSV file and then use the csv module. Remember to set delimiter=' '; otherwise it takes a comma as the delimiter by default.
import csv

with open('TST0002.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ')
    for row in reader:     # reads the file row by row
        print(row[0])      # row[0] is the first element of that row
Your file looks basically like a .tsv with 40 lines to skip. Could you try this?
# export your file.raw to tsv
with open('TST0002.raw') as infile, open('new.tsv', 'w') as outfile:
    lines = infile.readlines()[40:]
    for line in lines:
        outfile.write(line)
Or, if you want to do some data analysis directly on your two columns:
import pandas as pd
df = pd.read_csv("TST0002.raw", sep="\t", skiprows=40, usecols=['Extension mm', 'Load N'])
print(df)
output:
Extension mm Load N
0 -118.284 0.1365034
1 -117.779 -0.08668576
2 -117.274 -0.1142517
3 -116.773 -0.1092401
4 -116.271 -0.1144083
5 -11.577 -0.1314806
6 -115.269 -0.03609632
7 -114.768 -0.06334914
....
I have a bunch of CSV files with the same columns but in a different order. We are trying to upload them with SQL*Plus, but we need the columns in a fixed arrangement.
Example
required order: A B C D E F
csv file: A C D E B (sometimes a column is not in the csv because it is not available)
Is this achievable with Python? We are using Access + macros to do it, but it is too time-consuming.
PS: Apologies in advance for my English skills.
You can use the csv module to read, reorder, and then write your file.
Sample File:
$ cat file.csv
A,B,C,D,E
a1,b1,c1,d1,e1
a2,b2,c2,d2,e2
Code
import csv

with open('file.csv', 'r', newline='') as infile, \
     open('reordered.csv', 'w', newline='') as outfile:
    # the output DictWriter needs a list giving the new column order
    fieldnames = ['A', 'C', 'D', 'E', 'B']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    # write the reordered header first
    writer.writeheader()
    for row in csv.DictReader(infile):
        # writes each row back out in the new column order
        writer.writerow(row)
output
$ cat reordered.csv
A,C,D,E,B
a1,c1,d1,e1,b1
a2,c2,d2,e2,b2
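Since the question notes that a column can be missing from some files, csv.DictWriter's restval parameter can fill the gap with an empty string while keeping a fixed output order. A sketch with hypothetical data, assuming the required order is A through F:

```python
import csv

# input file missing column F (and with B out of place)
with open('partial.csv', 'w', newline='') as f:
    f.write('A,C,D,E,B\na1,c1,d1,e1,b1\n')

fieldnames = ['A', 'B', 'C', 'D', 'E', 'F']   # required fixed order
with open('partial.csv', newline='') as infile, \
     open('fixed.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fieldnames, restval='')
    writer.writeheader()
    for row in csv.DictReader(infile):
        writer.writerow(row)   # the missing 'F' becomes '' via restval
```

restval only covers absent fields; extra, unexpected columns would raise a ValueError unless extrasaction='ignore' is also passed.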
One way to tackle this problem is to use the pandas library, which can easily be installed using pip. You can load the CSV file into a pandas DataFrame, reorder the columns, and save it back to a CSV file. For example, if your sample.csv looks like below:
A,C,B,E,D
a1,b1,c1,d1,e1
a2,b2,c2,d2,e2
Here is a snippet to solve the problem.
import pandas as pd
df = pd.read_csv('/path/to/sample.csv')
df_reorder = df[['A', 'B', 'C', 'D', 'E']] # rearrange column here
df_reorder.to_csv('/path/to/sample_reorder.csv', index=False)
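If one of the required columns can be missing from a given file (as the question mentions), df[['A', 'B', ...]] raises a KeyError; DataFrame.reindex instead keeps the fixed order and fills the absent column with NaN. A sketch with made-up data:

```python
import pandas as pd

# sample frame in the wrong order, with column F missing entirely
df = pd.DataFrame({'A': ['a1'], 'C': ['c1'], 'D': ['d1'],
                   'E': ['e1'], 'B': ['b1']})

# reindex enforces the required order and adds an empty F column
df_fixed = df.reindex(columns=['A', 'B', 'C', 'D', 'E', 'F'])
df_fixed.to_csv('sample_fixed.csv', index=False)
```

The missing column comes out as empty cells in the CSV, which is usually what a fixed-layout loader expects.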
csv_in = open("<infilename>.csv", "r")
csv_out = open("<outfilename>.csv", "w")       # must be a different file: opening the input in "w" mode would truncate it
for line in csv_in:
    field_list = line.rstrip('\n').split(',')  # split the line at commas
    output_line = ','.join([field_list[0],     # rejoin with commas, new order
                            field_list[2],
                            field_list[3],
                            field_list[4],
                            field_list[1]]) + '\n'
    csv_out.write(output_line)
csv_in.close()
csv_out.close()
You can use something similar to this to change the order, replacing ';' with ',' in your case.
Because you said you need to process multiple .csv files, you can use the glob module to get a list of your files:
import glob

for file_name in glob.glob('<Insert-your-file-filter-here>*.csv'):
    # do the work here
The csv module allows you to read CSV files with each value associated with its column name. This in turn allows you to rearrange columns arbitrarily, without having to explicitly permute lists.
import csv

for row in csv.DictReader(open("foo.csv")):
    print(row["b"], row["a"])
2 1
22 21
Given the file foo.csv:
a,b,d,e,f
1,2,3,4,5
21,22,23,24,25
I am looking for a way to read just the header row of a large number of large CSV files.
Using Pandas, I have this method available, for each csv file:
>>> df = pd.read_csv(PATH_TO_CSV)
>>> df.columns
I could do this with just the csv module:
>>> reader = csv.DictReader(open(PATH_TO_CSV))
>>> reader.fieldnames
The problem with these is that each CSV file is 500MB+ in size, and it seems to be a gigantic waste to read in the entire file of each just to pull the header lines.
My end goal of all of this is to pull out unique column names. I can do that once I have a list of column headers that are in each of these files.
How can I extract only the header row of a CSV file, quickly?
Expanding on the answer given by Jeff: it is now possible to use pandas without actually reading any rows.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.DataFrame(np.random.randn(10, 4), columns=list('abcd')).to_csv('test.csv', mode='w')
In [4]: pd.read_csv('test.csv', index_col=0, nrows=0).columns.tolist()
Out[4]: ['a', 'b', 'c', 'd']
pandas can have the advantage that it deals more gracefully with CSV encodings.
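For example, if your headers are not UTF-8 (the Latin-1 encoding below is an assumption for illustration), the encoding parameter still lets you grab just the column names:

```python
import pandas as pd

# header with an accented name, written in Latin-1 for the example
with open('latin.csv', 'w', encoding='latin-1') as f:
    f.write('Ångström,Load\n1,2\n')

# nrows=0 parses only the header line, decoded with the right codec
cols = pd.read_csv('latin.csv', nrows=0, encoding='latin-1').columns.tolist()
print(cols)
```

With the plain csv module you would have to pass the same encoding to open() yourself.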
I might be a little late to the party, but here's one way to do it using just the Python standard library. When dealing with text data, I prefer Python 3 because of its unicode handling. This is very close to your original suggestion, except I'm only reading in one row rather than the whole file.
import csv

with open(fpath, 'r') as infile:
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames
Hopefully that helps!
I've used iglob as an example to search for the .csv files, but one way is to use a set, then adjust as necessary, e.g.:
import csv
from glob import iglob

unique_headers = set()
for filename in iglob('*.csv'):
    with open(filename, newline='') as fin:
        csvin = csv.reader(fin)
        unique_headers.update(next(csvin, []))
Here's one way. You get 1 row.
In [9]: DataFrame(np.random.randn(10,4),columns=list('abcd')).to_csv('test.csv',mode='w')
In [10]: read_csv('test.csv',index_col=0,nrows=1)
Out[10]:
a b c d
0 0.365453 0.633631 -1.917368 -1.996505
What about:
pandas.read_csv(PATH_TO_CSV, nrows=1).columns
That'll read the first row only and return the columns found.
You missed the nrows=1 parameter to read_csv:
>>> df = pd.read_csv(PATH_TO_CSV, nrows=1)
>>> df.columns
It depends on what the header will be used for. If you need the headers for comparison purposes only (my case), this code is simple and super fast: it reads the whole header as one string. You can then transform all the collected strings together according to your needs:
import glob

for filename in glob.glob(files_path + "\*.csv"):
    with open(filename) as f:
        first_line = f.readline()
It is easy; you can use this:
import pandas as pd

df = pd.read_csv("path.csv", nrows=2)
df.columns.to_list()
This way you only need to read a very few rows to get your header.
If you are only interested in the headers and would like to use pandas, the only extra thing you need to pass in, apart from the CSV file name, is nrows=0:
headers = pd.read_csv("test.csv", nrows=0)
import pandas as pd
get_col = list(pd.read_csv("first_test_pipe.csv", sep="|", nrows=1).columns)
print(get_col)