Python Pandas - Read csv with commented header line

I want to read and process a csv file with pandas. The file (as seen below) contains multiple header lines which are indicated by a # tag. I can import that file easily by using
import pandas as pd

file = "data.csv"
data = pd.read_csv(file, delimiter=r"\s+",
                   names=["Time", "Cd", "Cs", "Cl", "CmRoll", "CmPitch", "CmYaw", "Cd(f)",
                          "Cd(r)", "Cs(f)", "Cs(r)", "Cl(f)", "Cl(r)"],
                   skiprows=13)
However, I have a lot of such files with different header names and I don't want to name the columns (Time Cd Cs ...) manually. Also, the number of commented lines differs between files. So I want to automate that task.
Do I have to use something like regular expressions here, before passing the data into a pandas dataframe?
Thanks for any advice.
And yes, the header names also begin with a #.
data.csv:
# Force coefficients
# dragDir : (9.9735673312816520e-01 7.2660490528994301e-02 0.0000000000000000e+00)
# sideDir : (0.0000000000000000e+00 0.0000000000000000e+00 -1.0000000000000002e+00)
# liftDir : (-7.2660490528994315e-02 9.9735673312816520e-01 0.0000000000000000e+00)
# rollAxis : (9.9735673312816520e-01 7.2660490528994301e-02 0.0000000000000000e+00)
# pitchAxis : (0.0000000000000000e+00 0.0000000000000000e+00 -1.0000000000000002e+00)
# yawAxis : (-7.2660490528994315e-02 9.9735673312816520e-01 0.0000000000000000e+00)
# magUInf : 4.5000000000000000e+01
# lRef : 5.9399999999999997e-01
# Aref : 3.5639999999999999e-03
# CofR : (1.4999999999999999e-01 0.0000000000000000e+00 0.0000000000000000e+00)
#
# Time Cd Cs Cl CmRoll CmPitch CmYaw Cd(f) Cd(r) Cs(f) Cs(r) Cl(f) Cl(r)
5e-06 1.8990180226147195e+00 1.4919925634649792e-11 2.1950119509976829e+00 -1.1085971520784955e-02 -1.0863798447281650e+00 9.5910040927874810e-03 9.3842303978657482e-01 9.6059498282814471e-01 9.5910041002474442e-03 -9.5910040853275178e-03 1.1126130770676479e-02 2.1838858202270064e+00
1e-05 2.1428508927716594e+00 1.0045114197556737e-08 2.5051633252700962e+00 -1.2652317494411272e-02 -1.2367567798452046e+00 1.0822379290263353e-02 1.0587731288914184e+00 1.0840777638802410e+00 1.0822384312820453e-02 -1.0822374267706254e-02 1.5824882789843508e-02 2.4893384424802525e+00
...

What about extracting the header before you read the file?
We only assume that your header lines start with #. Extraction of the header as well as its position in the file is automated. We also ensure that no more lines than necessary are read (except the first data line).
with open(file) as f:
    line = f.readline()
    cnt = 0
    while line.startswith('#'):
        prev_line = line
        line = f.readline()
        cnt += 1
        # print(prev_line)

header = prev_line.strip().lstrip('# ').split()

df = pd.read_csv(file, delimiter=r"\s+",
                 names=header,
                 skiprows=cnt)
With this, you can also process the other header lines, and it gives you the position of the header in the file.
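Processing those other header lines can be useful too: in data.csv they follow a "# key : value" pattern, so a small sketch (assuming that pattern holds for every metadata line) could collect them into a dict:
meta = {}
with open(file) as f:
    for line in f:
        if not line.startswith('#'):
            break
        # lines like "# magUInf : 4.5000000000000000e+01" split into key/value
        if ':' in line:
            key, _, value = line.lstrip('#').partition(':')
            meta[key.strip()] = value.strip()
meta['magUInf'] then holds the string '4.5000000000000000e+01', while the header extraction stays as above.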

This should do: it's simple and efficient, it keeps variables to a minimum, and it doesn't require any input aside from the filename.
with open(file, 'r') as f:
    for line in f:
        if line.startswith('#'):
            header = line
        else:
            break  # stop when there are no more #

header = header[1:].strip().split()

data = pd.read_csv(file, delimiter=r"\s+", comment='#', names=header)
You first open the file and read only the commented lines (this is fast and memory-efficient). The last valid line will be the final header, which is cleaned and converted to a list. Finally, you open the file with pandas.read_csv(), where comment='#' skips the commented lines, including the header line itself, which is why names=header is passed.
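Run against the data.csv from the question, the extracted header is exactly the list that was typed out by hand before:
print(header)
['Time', 'Cd', 'Cs', 'Cl', 'CmRoll', 'CmPitch', 'CmYaw', 'Cd(f)', 'Cd(r)', 'Cs(f)', 'Cs(r)', 'Cl(f)', 'Cl(r)']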

A little bit of regex could help. This is not the most beautiful of solutions, so feel free to post a better one.
Let's read the first 50 rows of any file to find the last occurrence of the hash, which should be the column-name row.
^ asserts position at start of a line
# matches the character # literally (case sensitive)
Code:
import re

n_rows = 50
path_ = 'your_file_location'

with open(path_, 'r') as f:
    data = []
    for i, line in enumerate(f):
        if i >= n_rows:  # read only the first 50 rows here.
            break
        if re.match('^#', line):
            data.append(line)

start_col = len(data) - 1  # index of the last commented line
df = pd.read_csv(path_, sep=r'\s+', skiprows=start_col)  # use your actual delimiter.
# Time Cd Cs Cl CmRoll CmPitch \
0 0.000005 1.899018 1.491993e-11 2.195012 -0.011086 -1.086380 0.009591
1 0.000010 2.142851 1.004511e-08 2.505163 -0.012652 -1.236757 0.010822
CmYaw Cd(f) Cd(r) Cs(f) Cs(r) Cl(f) Cl(r)
0 0.938423 0.960595 0.009591 -0.009591 0.011126 2.183886 NaN
1 1.058773 1.084078 0.010822 -0.010822 0.015825 2.489338 NaN
Edit: handling the # in the column name.
We can do this in two steps, reading in 0 rows but slicing the header columns.
First read in the file from the header row, but set the header argument to None so no headers will be set.
We can then set the column headers manually:
df = pd.read_csv(path_, sep=r'\s+', skiprows=start_col + 1, header=None)
df.columns = pd.read_csv(path_, sep=r'\s+', skiprows=start_col, nrows=0).columns[1:]
print(df)
Time Cd Cs Cl CmRoll CmPitch CmYaw \
0 0.000005 1.899018 1.491993e-11 2.195012 -0.011086 -1.086380 0.009591
1 0.000010 2.142851 1.004511e-08 2.505163 -0.012652 -1.236757 0.010822
Cd(f) Cd(r) Cs(f) Cs(r) Cl(f) Cl(r)
0 0.938423 0.960595 0.009591 -0.009591 0.011126 2.183886
1 1.058773 1.084078 0.010822 -0.010822 0.015825 2.489338

To simplify things, and to save time without using loops, you can create 2 dataframes: one for the #-commented rows, and one for the rest.
From those commented rows take the last one - that's your header - and then merge the data dataframe and this title row using concat(). If it's then necessary to assign the first row as the header, you can use df.columns = df.iloc[0] (see the sketch after the output below).
df = pd.DataFrame({
    'A': ['#test1 : (000000)', '#test1 (000000)', '#test1 (000000)',
          '#test1 (000000)', '#Time (000000)', '5e-06', '1e-05'],
})
print(df)
A
0 #test1 : (000000)
1 #test1 (000000)
2 #test1 (000000)
3 #test1 (000000)
4 #Time (000000)
5 5e-06
6 1e-05
df_header = df[df.A.str.contains('^#')]
print(df_header)
A
0 #test1 : (000000)
1 #test1 (000000)
2 #test1 (000000)
3 #test1 (000000)
4 #Time (000000)
df_data = df[~df.A.str.contains('^#')]
print(df_data)
A
5 5e-06
6 1e-05
df = (pd.concat([df_header.iloc[[-1]], df_data])).reset_index(drop=True)
df.A = df.A.str.replace(r'^#', "", regex=True)
print(df)
A
0 Time (000000)
1 5e-06
2 1e-05
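Completing the final step mentioned above on the same toy frame (a sketch): promote row 0 to the header and drop it from the data:
df.columns = df.iloc[0]                  # the first row ('Time (000000)') becomes the header
df = df.iloc[1:].reset_index(drop=True)  # keep only the data rows 5e-06 and 1e-05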

Assuming that comments always start with a single '#' and the header is in the last commented line:
import csv
import pandas as pd

def read_comments(csv_file):
    for row in csv_file:
        if row[0] == '#':
            yield row.split('#')[1].strip()

def get_last_commented_line(filename):
    with open(filename, 'r', newline='') as f:
        # note: csv.reader assumes a comma-separated header; for the
        # whitespace-separated data.csv above, pass delimiter=' ',
        # skipinitialspace=True here and sep=r"\s+" to read_csv below
        decommented_lines = [line for line in csv.reader(read_comments(f))]
        header = decommented_lines[-1]
        skiprows = len(decommented_lines)
        return header, skiprows

header, skiprows = get_last_commented_line(path)
pd.read_csv(path, names=header, skiprows=skiprows)

# Read the lines in the file
with open(file) as f:
    lines = f.readlines()

# The last commented line is the header
header = [line for line in lines if line.startswith('#')][-1]

# Strip the line and remove the '#'
header = header[1:].strip().split()

df = pd.read_csv(file, delimiter=r"\s+", names=header, comment='#')

Related

Read CSV file with quotechar-comma combination in string - Python

I have got multiple csv files which look like this:
ID,Text,Value
1,"I play football",10
2,"I am hungry",12
3,"Unfortunately",I get an error",15
I am currently importing the data using the pandas read_csv() function.
df = pd.read_csv(filename, sep = ',', quotechar='"')
This works for the first two rows in my csv file; unfortunately, I get an error in row 3. The reason is that within the 'Text' column there is a quotechar-comma combination before the end of the column.
ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 4
Is there a way to solve this issue?
Expected output:
ID Text Value
1 I play football 10
2 I am hungry 12
3 Unfortunately, I get an error 15
You can try to fix the CSV using the re module:
import re
import pandas as pd
from io import StringIO

with open("your_file.csv", "r") as f_in:
    s = re.sub(
        r'"(.*)"',
        # replace any stray quote inside the field with the escape character,
        # so the comma after it is escaped rather than treated as a separator
        lambda g: '"' + g.group(1).replace('"', "\\") + '"',
        f_in.read(),
    )

df = pd.read_csv(StringIO(s), sep=",", quotechar='"', escapechar="\\")
print(df)
Prints:
ID Text Value
0 1 I play football 10
1 2 I am hungry 12
2 3 Unfortunately,I get an error 15
One (not so flexible) approach would be to first remove all " quotes from the csv, and then enclose the elements of the specific column in "" quotes (this is done to avoid misinterpreting the "," separator while parsing), like this:
import csv

# Specify the column index (0-based)
column_index = 1

# Open the input CSV file
with open('input.csv', 'r') as f:
    reader = csv.reader(f)
    # Open the output CSV file
    with open('output.csv', 'w', newline='') as g:
        writer = csv.writer(g)
        # Iterate through the rows of the input CSV file
        for row in reader:
            # Replace the " character with an empty string
            row[column_index] = row[column_index].replace('"', '')
            # Enclose the modified element in "" quotes
            row[column_index] = f'"{row[column_index]}"'
            # Write the modified row to the output CSV file
            writer.writerow(row)
This code creates a new, modified csv file. Your problematic csv row will then look like this:
3,"Unfortunately,I get an error",15
Then you can import the data as you did before: df = pd.read_csv(filename, sep=',', quotechar='"')
To automate this conversion for all csv files within a directory:
import csv
import glob

# Specify the column index (0-based)
column_index = 1

# Get a list of all CSV files in the current directory
csv_files = glob.glob('*.csv')

# Iterate through the CSV files
for csv_file in csv_files:
    # Open the input CSV file
    with open(csv_file, 'r') as f:
        reader = csv.reader(f)
        # Open the output CSV file
        output_file = csv_file.replace('.csv', '_new.csv')
        with open(output_file, 'w', newline='') as g:
            writer = csv.writer(g)
            # Iterate through the rows of the input CSV file
            for row in reader:
                # Replace the " character with an empty string
                row[column_index] = row[column_index].replace('"', '')
                # Enclose the modified element in "" quotes
                row[column_index] = f'"{row[column_index]}"'
                # Write the modified row to the output CSV file
                writer.writerow(row)
This names the new csv files like the old ones, but ending in "_new.csv" instead of just ".csv".
A possible solution:
df = pd.read_csv(filename, sep=r'(?<=\d),|,(?=\d)', engine='python')  # split only on commas adjacent to a digit
df = df.reset_index().set_axis(['ID', 'Text', 'Value'], axis=1)
df['Text'] = df['Text'].replace('"', '', regex=True)
Another possible solution:
df = pd.read_csv(StringIO(text), sep='\t')  # text = the raw file contents; no tabs, so each line stays one field
df[['ID', 'Text']] = df.iloc[:, 0].str.split(',', expand=True, n=1)
df[['Text', 'Value']] = df['Text'].str.rsplit(',', expand=True, n=1)
df = df.drop(df.columns[0], axis=1).assign(
    Text=df['Text'].replace('"', '', regex=True))
Output:
ID Text Value
0 1 I play football 10
1 2 I am hungry 12
2 3 Unfortunately,I get an error 15

Edit a specific column in a text file

I have a text file with some content.
I want to edit only the column "Medicalization". For example, with a program, by entering B on the keypad, the "Medicalization" column becomes B:
This column has coordinate 14 for each letter of medicalization.
I tried something but I get an "index out of range" error:
with open('d:/test.txt','r') as infile:
    with open('d:/test2.txt','w') as outfile:
        for line in infile:
            line = line.split()
            new_line = '"B"\n'.format(line[14])
            outfile.write(new_line)
Is it possible to do that with Python?
Since the data is in tabular form, use pandas.read_csv with sep=r"\s+", then use pandas.DataFrame.loc to replace A with B in medicalization.
import pandas as pd

df = pd.read_csv("test.txt", sep=r"\s+")
df.loc[df["medicalization"] == "A", "medicalization"] = "B"
print(df)
typtpt name medicalization
0 1 Entrance B
1 2 Departure B
2 3 Consultation B
3 4 Meeting B
4 5 Transfer B
And if you want to save it back then use:
df.to_csv('test.txt', sep='\t', index=False)
The 'A' value you wish to change cannot possibly be at column 14 in every line. If you look at, for example, the 4th row (with 'Consultation' as the name), even with a single space separating the columns, the third column would be at column position 17. So your assumption about fixed column positions must be wrong. If there is, for example, a single space or tab character separating each column, then for the first row of actual data the 'A' value would be at offset 12, and this would explain your exception.
Assuming a single space separates each column, you could use the csv module as follows:
import csv

with open('d:/test.txt') as infile:
    with open('d:/test2.txt', 'w', newline='') as outfile:
        rdr = csv.reader(infile, delimiter=' ')
        wtr = csv.writer(outfile, delimiter=' ')
        # just write out the first row:
        header = next(rdr)
        wtr.writerow(header)
        for row in rdr:
            row[2] = 'B'
            wtr.writerow(row)
Or specify delimiter='\t' if a tab is used to separate the columns.
If an arbitrary number of whitespace characters (spaces or tabs) separates each column, then:
with open('test.txt') as infile:
    with open('test2.txt', 'w') as outfile:
        first_time = True
        for row in infile:
            columns = row.split()
            if first_time:
                first_time = False
            else:
                columns[2] = 'B'
            print(' '.join(columns), file=outfile)
The index out of range error is caused by the output of line = line.split(). This splits on all whitespace, so the output of line.split() is a list like ['01', 'Entrance', 'A'] for line 2, for example. When you then index at 14, that position does not exist within the list.
If your data file's format is consistent (all Medicalization data is in the 3rd column), you can achieve what you're after with pure Python like so:
with open('test.txt', 'r') as infile:
    with open('test2.txt', 'w') as outfile:
        for idx, line in enumerate(infile):
            line = line.split()
            # if idx is 0 it's the headers, so we don't want to change those
            if idx != 0:
                line[2] = '"B"'
            outfile.write(' '.join(line) + '\n')
However, @Hamza's answer using pandas is potentially a nicer one.

How to count lines in a text file with specified values?

I'm working with a .csv file that lists Timestamps in one column and Wind Speeds in the second column. I need to read through this .csv file and calculate the percent of time where the wind speed was above 2 m/s. Here's what I have so far.
txtFile = r"C:\Data.csv"

line = o_txtFile.readline()[:-1]
while line:
    line = oTextfile.readline()

for line in txtFile:
    line = line.split(",")[:-1]
How do I get a count of the lines where the 2nd element in the line is greater than 2?
CSV File Sample
You will probably have to update your CSV slightly, depending on the chosen option (for options 1 and 2, you will definitely want to remove all header rows, whereas for option 3, you will keep only the middle one, i.e. the one that starts with TIMESTAMP).
You actually have three options:
Option 1: Vanilla Python
count = 0
with open('data.csv', 'r') as file:
    for line in file:
        value = int(line.split(',')[1])
        if value > 100:
            count += 1

# Now you have the value in the ``count`` variable
Option 2: CSV module
Here I use Python's csv module (you could also use the DictReader, but I'll let you do the search yourself).
import csv

count = 0
with open('data.csv', 'r') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        if int(row[1]) > 100:
            count += 1

# Now you have the value in the ``count`` variable
Option 3: Pandas
Pandas is a really cool, awesome library used by a lot of people to do data analysis. Doing what you want to do would look like:
import pandas as pd

df = pd.read_csv('data.csv')

# Here you are
count = len(df[df['WindSpd_ms'] > 100])
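Since the question asks for the percent of time rather than a raw count, one extra line converts the count (a sketch; it assumes each row is one equally spaced observation):
pct = 100 * count / len(df)  # share of observations above the threshold
print(f"{pct:.1f}% of the time")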
You can read the file line by line and, if a line is not empty, split it.
Count the lines read and how many are above 10 m/s, then calculate the percentage:
# create a data file for processing with random data
import random

random.seed(42)
with open("data.txt", "w") as f:
    f.write("header\n")
    f.write("header\n")
    f.write("header\n")
    f.write("header\n")
    for sp in random.choices(range(10), k=200):
        f.write(f"some date,{sp+3.5}, data,data,data\n")

# open/read/calculate the percentage of datapoints with speeds above 10 m/s
days = 0
speedGreater10 = 0
with open("data.txt", "r") as f:
    for _ in range(4):
        next(f)  # ignore first 4 rows containing headers
    for line in f:
        if line:  # not empty
            _, speed, *p = line.split(",")
            # _ and *p are ignored (they take 'some date' + [data,data,data])
            days += 1
            if float(speed) > 10:
                speedGreater10 += 1

print(f"{days} datapoints, of which {speedGreater10} " +
      f"got more than 10 m/s: {speedGreater10/days:.1%}")
Output:
200 datapoints, of which 55 got more than 10 m/s: 27.5%
Datafile:
header
header
header
header
some date,9.5, data,data,data
some date,3.5, data,data,data
some date,5.5, data,data,data
some date,5.5, data,data,data
some date,10.5, data,data,data
[... some more ...]
some date,8.5, data,data,data
some date,3.5, data,data,data
some date,12.5, data,data,data
some date,11.5, data,data,data

Python Pandas, Reading in file and skipping rows ahead of header

I am trying to loop over some files and skip the rows before the header in each file using pandas. All of the files are in the same data format except some have different number of rows to skip before the header. Is there a way to loop over the files and start at the header of each file when some have more rows to skip than others?
For example,
some files require this:
f = pd.read_csv(fname,skiprows = 7,parse_dates=[0])
And some require this:
f = pd.read_csv(fname,skiprows = 15, parse_dates=[0])
Here is my chunk of code looping over my files:
for name, ID in stations:
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        print(fname)
        f = pd.read_csv(fname, skiprows=15, parse_dates=[0])  # could also skip 7 depending on the file
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert to m/s from km/h
        dt = f['Date/Time']
One way is to read your file using pure Python I/O to extract the index, then feed this into the skiprows argument of pd.read_csv.
This is fairly efficient, since the first step uses a generator expression which reads only until the desired row is reached.
from io import StringIO
import pandas as pd

mystr = StringIO("""dasfaf
kgafsda
Date/Time,num1,num2
2018-01-01,0,1
2018-01-02,2,3
""")

# replace mystr with open('file.csv', 'r') for a real file
idx = next(i for i, j in enumerate(mystr) if j.startswith('Date/Time'))

# rewind, then let skiprows drop everything above the header row
mystr.seek(0)
df = pd.read_csv(mystr, skiprows=idx, parse_dates=[0])

print(df)
Date/Time num1 num2
0 2018-01-01 0 1
1 2018-01-02 2 3
Wrap this in a function if you need to repeat the task:
def calc_skiprows(fname):
    with open(fname) as fin:
        return next(i for i, j in enumerate(fin) if j.startswith('Date/Time'))

df = pd.read_csv(fname, skiprows=calc_skiprows(fname), parse_dates=[0])
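Dropped into the question's loop over stations, that might look like this (a sketch; it assumes every file's header row starts with 'Date/Time', as in the question's columns):
for name, ID in stations:
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        f = pd.read_csv(fname, skiprows=calc_skiprows(fname), parse_dates=[0])
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert to m/s from km/h
        dt = f['Date/Time']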
The first suggestion/answer seemed like a really good way to handle it, but I couldn't get it to work for me for some reason. I did find another way to fix my problem using the try and except statements in Python:
for name, ID in stations:
    # read in each station's .csv files, concatenate together, insert station id column
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        print(fname)
        try:
            f = pd.read_csv(fname, skiprows=7, parse_dates=[0])
        except:
            f = pd.read_csv(fname, skiprows=15, parse_dates=[0])
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert to m/s from km/h
        dt = f['Date/Time']
This way, if the first attempt to read the file fails (skipping 7 rows), it tries again using the other read_csv line (skipping 15 rows). This is not 100% correct since I am still hardcoding the number of lines to skip, but it works for my needs right now.
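If you want to avoid the bare except while keeping the same idea, one variant (a sketch; it assumes only these two layouts occur, and that a wrong skiprows actually raises a parse error, which is not guaranteed) is to try each known offset in turn:
import pandas as pd

def read_station_csv(fname, candidate_skips=(7, 15)):
    # try each known header offset until one parses cleanly
    for skip in candidate_skips:
        try:
            return pd.read_csv(fname, skiprows=skip, parse_dates=[0])
        except (pd.errors.ParserError, ValueError):
            continue
    raise ValueError(f"no known layout matched {fname}")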

skipping unknown number of lines to read the header python pandas

I have an Excel-exported data file that I read in with Python pandas:
import pandas as pd

data = pd.read_csv('..../file.txt', sep='\t')
the mock data looks like this:
unwantedjunkline1
unwantedjunkline2
unwantedjunkline3
ID ColumnA ColumnB ColumnC
1 A B C
2 A B C
3 A B C
...
The data in this case contains 3 junk lines (lines I don't want to read in) before hitting the header, and sometimes it contains 4 or more such junk lines. So in this case I read in the data with:
data = pd.read_csv('..../file.txt', sep='\t', skiprows=3)
data looks like:
ID ColumnA ColumnB ColumnC
1 A B C
2 A B C
3 A B C
...
But each time the number of unwanted lines is different. Is there a way to read in a table file using pandas without using skiprows=, but instead using some command that matches the header, so it knows to start reading from the header? Then I wouldn't have to open the file to count how many unwanted lines it contains each time and then manually change the skiprows= option.
If you know what the header startswith:
import os
import pandas as pd

def skip_to(fle, line, **kwargs):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        while not cur_line.startswith(line):
            pos = f.tell()
            cur_line = f.readline()
        f.seek(pos)
        return pd.read_csv(f, **kwargs)
Demo:
In [18]: cat test.txt
1,2
3,4
The,header
foo,bar
foobar,foo
In [19]: df = skip_to("test.txt","The,header", sep=",")
In [20]: df
Out[20]:
The header
0 foo bar
1 foobar foo
By calling .tell we keep track of where the pointer was for the previous line, so when we hit the header we seek back to that line and just pass the file object to pandas.
Or, using the junk lines if they all start with something in common:
def skip_to(fle, junk, **kwargs):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        while cur_line.startswith(junk):
            pos = f.tell()
            cur_line = f.readline()
        f.seek(pos)
        return pd.read_csv(f, **kwargs)

df = skip_to("test.txt", "junk", sep="\t")
Another simple way to achieve a dynamic skiprows is something like this, which worked for me:
# Open the file
with open('test.txt', encoding='utf-8') as readfile:
    ls_readfile = readfile.readlines()

# Find the skiprows number, using 'ID' as the header's startswith marker
skip = next(filter(lambda x: x[1].startswith('ID'), enumerate(ls_readfile)))[0]
print(skip)

# Import the file with the separator \t
df = pd.read_csv('test.txt', skiprows=skip, sep='\t')
