Python Pandas, Reading in file and skipping rows ahead of header - python

I am trying to loop over some files and skip the rows before the header in each file using pandas. All of the files are in the same data format, except that some have a different number of rows to skip before the header. Is there a way to loop over the files and start at the header of each file when some have more rows to skip than others?
For example,
some files require this:
f = pd.read_csv(fname,skiprows = 7,parse_dates=[0])
And some require this:
f = pd.read_csv(fname,skiprows = 15, parse_dates=[0])
Here is my chunk of code looping over my files:
for name,ID in stations:
    path = str(ID)+'/*.csv'
    for fname in glob.glob(path):
        print(fname)
        f = pd.read_csv(fname,skiprows=15,parse_dates=[0]) #could also skip 7 depending on file
        ws = f['Wind Spd (km/h)']*0.27778 #convert to m/s from km/h
        dt = f['Date/Time']

One way is to read your file using pure Python I/O to find the index of the header row, then feed this into the skiprows argument of pd.read_csv.
This is fairly efficient since the first step uses a generator expression which reads only until the desired row is reached.
from io import StringIO
import pandas as pd
from copy import copy
mystr = StringIO("""dasfaf
kgafsda
Date/Time,num1,num2
2018-01-01,0,1
2018-01-02,2,3
""")
mystr2 = copy(mystr)
# replace mystr with open('file.csv', 'r')
with mystr as fin:
    idx = next(i for i, j in enumerate(fin) if j.startswith('Date/Time'))
# replace mystr2 with 'file.csv'
# idx is the 0-based position of the header line, so skipping idx rows leaves the header first
df = pd.read_csv(mystr2, skiprows=idx, parse_dates=[0])
print(df)
Date/Time num1 num2
0 2018-01-01 0 1
1 2018-01-02 2 3
Wrap this in a function if you need to repeat the task:
def calc_skiprows(fname):
    # open the file by name and return the 0-based index of the header line
    with open(fname) as fin:
        return next(i for i, j in enumerate(fin) if j.startswith('Date/Time'))
df = pd.read_csv(fname, skiprows=calc_skiprows(fname), parse_dates=[0])
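As a self-contained illustration of the header-search step, here is a minimal sketch that uses only the standard library. The function name find_header_row and the sample text are hypothetical; the 'Date/Time' marker comes from the question.

```python
import io

def find_header_row(lines, marker="Date/Time"):
    """Return the 0-based index of the first line that starts with marker."""
    for index, line in enumerate(lines):
        if line.startswith(marker):
            return index
    raise ValueError(f"no line starts with {marker!r}")

# Hypothetical sample: two metadata lines, then the real header.
sample = io.StringIO(
    "station info\n"
    "more metadata\n"
    "Date/Time,num1,num2\n"
    "2018-01-01,0,1\n"
)
header_row = find_header_row(sample)
print(header_row)
```

With a real file, the same function can be given an open file object, and its result can be passed as skiprows to pd.read_csv so that the header becomes the first line pandas sees.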

The first answer seemed like a really good way to handle it, but I couldn't get it to work for some reason. I did find another way to fix my problem using Python's try and except statements:
for name,ID in stations:
    #read in each stations .csv files, concatenate together, insert station id column
    path = str(ID)+'/*.csv'
    for fname in glob.glob(path):
        print(fname)
        try:
            f = pd.read_csv(fname,skiprows=7,parse_dates=[0])
        except:
            f = pd.read_csv(fname,skiprows=15,parse_dates=[0])
        ws = f['Wind Spd (km/h)']*0.27778 #convert to m/s from km/h
        dt = f['Date/Time']
This way if the first attempt to read in the file fails (skipping 7 rows), then it tries again using the other read_csv line (skipping 15 rows). This is not 100% correct since I am still hardcoding the number of lines to skip, but works for my needs right now.

Related

Read csv file with empty lines

Analysis software I'm using outputs many groups of results in 1 csv file and separates the groups with 2 empty lines.
I would like to break the results in groups so that I can then analyse them separately.
I'm sure there is a built-in function in Python (or one of its libraries) that does this. I tried this piece of code that I found somewhere, but it doesn't seem to work.
import csv
results = open('03_12_velocity_y.csv').read().split("\n\n")
# Feed first csv.reader
first_csv = csv.reader(results[0], delimiter=',')
# Feed second csv.reader
second_csv = csv.reader(results[1], delimiter=',')
Update:
The original code actually works, but my Python skills are pretty limited and I did not implement it properly.
The .split('\n\n\n') call does work, but csv.reader returns an iterator object; to get the data into a list (or something similar), you need to iterate through all the rows and append them to the list.
I then used Pandas to remove the header and convert the values in scientific notation to float. The code is below. Thanks everyone for the help.
import csv
import pandas as pd
# Open the csv file, read it and split it when it encounters 2 empty lines (\n\n\n)
results = open('03_12_velocity_y.csv').read().split('\n\n\n')
# Create csv.reader objects that are used to iterate over rows in a csv file
# Define the output - create an empty multi-dimensional list
output1 = [[],[]]
# Iterate through the rows in the csv file and append the data to the empty list
# Feed first csv.reader
csv_reader1 = csv.reader(results[0].splitlines(), delimiter=',')
for row in csv_reader1:
    output1.append(row)
df = pd.DataFrame(output1)
# remove first 7 rows of data (the start position of the slice is always included)
df = df.iloc[7:]
# Convert all data from string to float
df = df.astype(float)
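The split-then-parse idea can be sketched in isolation as follows. csv.reader iterates over whatever it is given, so a plain string is consumed character by character; passing a list of lines (via splitlines) is what makes each group parse row by row. The function name split_groups and the sample text are made up for illustration, and the sketch assumes groups are separated by exactly two empty lines.

```python
import csv

def split_groups(text, separator="\n\n\n"):
    """Split text on the separator and parse each non-empty block as CSV rows."""
    groups = []
    for block in text.split(separator):
        if block.strip():
            # splitlines() gives csv.reader one line per iteration
            groups.append(list(csv.reader(block.splitlines())))
    return groups

# Hypothetical sample: two groups separated by two empty lines.
sample = "a,b\n1,2\n\n\nc,d\n3,4\n"
groups = split_groups(sample)
print(groups)
```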
If your row counts are inconsistent across groups, you'll need a little state machine to check when you're between groups and do something with the last group.
#!/usr/bin/env python3
import csv
def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as out_f:
        csv.writer(out_f).writerows(group)
with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    group_i = 1
    group = []
    last_row = []
    for row in reader:
        if row == [] and last_row == [] and group != []:
            write_group(group, group_i)
            group = []
            group_i += 1
            continue
        if row == []:
            last_row = row
            continue
        group.append(row)
        last_row = row
    # flush remaining group
    if group != []:
        write_group(group, group_i)
I mocked up this sample CSV:
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3
g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
And when I run the program above I get three CSV files:
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
group_2.csv
g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3
group_3.csv
g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
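For comparison, a shorter variant of the same grouping can be sketched with itertools.groupby. It relies on csv.reader returning an empty list for a blank line. Note one difference from the state machine: this version treats any run of blank rows, even a single one, as a separator. The name iter_groups and the sample data are hypothetical.

```python
import csv
import io
from itertools import groupby

def iter_groups(rows):
    """Yield lists of consecutive non-empty rows; runs of empty rows separate groups."""
    for has_data, run in groupby(rows, key=bool):
        if has_data:
            yield list(run)

# Hypothetical sample: two groups separated by two blank lines.
sample = io.StringIO("a,b\nc,d\n\n\ne,f\n", newline="")
groups = list(iter_groups(csv.reader(sample)))
print(groups)
```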
If your row counts are consistent, you can do this with fairly vanilla Python or using the Pandas library.
Vanilla Python
Define your group size and the size of the break (in "rows") between groups.
Loop over all the rows adding each row to a group accumulator.
When the group accumulator reaches the pre-defined group size, do something with it, reset the accumulator, and then skip break-size rows.
Here, I'm writing each group to its own numbered file:
import csv
group_sz = 5
break_sz = 2
def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as f_out:
        csv.writer(f_out).writerows(group)
with open("input.csv", newline="") as f_in:
    reader = csv.reader(f_in)
    group_i = 1
    group = []
    for row in reader:
        group.append(row)
        if len(group) == group_sz:
            write_group(group, group_i)
            group_i += 1
            group = []
            for _ in range(break_sz):
                try:
                    next(reader)
                except StopIteration: # gracefully ignore an expected StopIteration (at the end of the file)
                    break
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
g1r4c1,g1r4c2,g1r4c3
g1r5c1,g1r5c2,g1r5c3
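The accumulate-then-skip loop can also be written as a generator with itertools.islice, which may be handy when the groups should be processed in memory rather than written to files. This is only a sketch under the same assumption of consistent group and break sizes; fixed_groups and the sample rows are hypothetical names and data.

```python
from itertools import islice

def fixed_groups(rows, group_size, break_size):
    """Yield groups of group_size rows, discarding break_size rows after each group."""
    iterator = iter(rows)
    while True:
        group = list(islice(iterator, group_size))
        if not group:
            return
        yield group
        # Consume and discard the separator rows that follow each group.
        for _ in islice(iterator, break_size):
            pass

# Hypothetical sample: two groups of two rows, separated by two empty rows.
rows = [["a"], ["b"], [], [], ["c"], ["d"]]
groups = list(fixed_groups(rows, group_size=2, break_size=2))
print(groups)
```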
With Pandas
I'm new to Pandas and learning this as I go, but it looks like Pandas automatically skips blank rows when it reads a chunk of data (read_csv has skip_blank_lines=True by default).
With that in mind, all you need to do is specify the size of your group, and tell Pandas to read your CSV file in "iterator mode", where you can ask for a chunk (your group size) of records at a time:
import pandas as pd
group_sz = 5
with pd.read_csv("input.csv", header=None, iterator=True) as reader:
    i = 1
    while True:
        try:
            df = reader.get_chunk(group_sz)
        except StopIteration:
            break
        df.to_csv(f"group_{i}.csv")
        i += 1
Pandas adds an index column and a default header when it writes out the CSV:
group_1.csv
,0,1,2
0,g1r1c1,g1r1c2,g1r1c3
1,g1r2c1,g1r2c2,g1r2c3
2,g1r3c1,g1r3c2,g1r3c3
3,g1r4c1,g1r4c2,g1r4c3
4,g1r5c1,g1r5c2,g1r5c3
Try this out with your output:
import pandas as pd
# csv file name to be read in
in_csv = 'input.csv'
# get the number of lines of the csv file to be read
number_lines = sum(1 for row in (open(in_csv)))
# size of rows of data to write to the csv,
# you can change the row size according to your need
rowsize = 500
# start looping through data writing it to a new file for each set
for i in range(1,number_lines,rowsize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows = rowsize,#number of rows to read at each loop
                     skiprows = i)#skip rows that have been read
    #csv to write data to a new file with indexed name. input_1.csv etc.
    out_csv = 'input' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a', #append data to csv file
              )
I updated the question with the last details that answer my question.

Add one column to a text file

I have multiple txt files and each of these txt files has 6 columns. What I want to do is add just one column as the last column, so that at the end the txt file has at most 7 columns, and if I run the script again it shouldn't add a new one:
At the beginning each file has six columns:
637.39 718.53 155.23 -0.51369 -0.18539 0.057838 3.209840789730089
636.56 720 155.57 -0.51566 -0.18487 0.056735 3.3520643559939938
635.72 721.52 155.95 -0.51933 -0.18496 0.056504 3.4997850701290125
What I want is to add a new column of zeros only if the current number of columns is 6, after that it should prevent adding a new column when I run the script again (7 columns is the total number where the last one is zeros):
637.39 718.53 155.23 -0.51369 -0.18539 0.057838 3.209840789730089 0
636.56 720 155.57 -0.51566 -0.18487 0.056735 3.3520643559939938 0
635.72 721.52 155.95 -0.51933 -0.18496 0.056504 3.4997850701290125 0
My code works and adds one additional column each time I run the script, but I want to add it just once, when the number of columns is 6. Here a gives me the number of columns, and if the condition is fulfilled a new column should be added:
from glob import glob
import numpy as np
new_column = [0] * 20
def get_new_line(t):
    l, c = t
    return '{} {}\n'.format(l.rstrip(), c)
def writecolumn(filepath):
    # Load data from file
    with open(filepath) as datafile:
        lines = datafile.readlines()
    a=np.loadtxt(lines, dtype='str').shape[1]
    print(a)
    #if a==6: (here is the problem)
    n, r = divmod(len(lines), len(new_column))
    column = new_column * n + new_column[:r]
    new_lines = list(map(get_new_line, zip(lines, column)))
    with open(filepath, "w") as f:
        f.writelines(new_lines)
if __name__ == "__main__":
    filepaths = glob("/home/experiment/*.txt")
    for path in filepaths:
        writecolumn(path)
When I check the number of columns with if a==6 and move the content inside the if statement, I get an error. Without moving the content inside the if, everything works fine, but it still adds one column each time I run it.
Any help is appreciated.
To test the code create two/one txt files with random number of six columns.
This could be an indentation problem, i.e. in the block below the 'if'. The lines that build and write new_lines must all be indented under it.
This works:
def writecolumn(filepath):
    # Load data from file
    with open(filepath) as datafile:
        lines = datafile.readlines()
    a=np.loadtxt(lines, dtype='str').shape[1]
    print(a)
    if int(a)==6:
        n, r = divmod(len(lines), len(new_column))
        column = new_column * n + new_column[:r]
        new_lines = list(map(get_new_line, zip(lines, column)))
        with open(filepath, "w") as f:
            f.writelines(new_lines)
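The "add the column only once" requirement can also be checked per line without NumPy. The sketch below is one possible approach, not the asker's code: add_zero_column is a hypothetical helper that appends a fill value only to lines with exactly expected_columns whitespace-separated fields, so running it a second time changes nothing. It also normalizes the spacing between fields to single spaces, which is an assumption about what is acceptable.

```python
def add_zero_column(lines, expected_columns=6, fill="0"):
    """Append fill to every line that has exactly expected_columns fields."""
    updated = []
    for line in lines:
        fields = line.split()
        if len(fields) == expected_columns:
            fields.append(fill)
        updated.append(" ".join(fields) + "\n")
    return updated

# Hypothetical sample with six columns per line.
lines = ["637.39 718.53 155.23 -0.51369 -0.18539 0.057838\n"]
once = add_zero_column(lines)
twice = add_zero_column(once)  # second run leaves the lines unchanged
print(once)
```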
Use pandas to read your text file:
import pandas as pd
df = pd.read_csv("whitespace.csv", header=None, delimiter=" ")
Add a column or more as needed
df['somecolname'] = 0
Save DataFrame with no header.

Quickly Remove Header from Large .csv Files

My question is not how to open a .csv file, detect which rows I want to omit, and write a new .csv file with my desired lines. I'm already doing that successfully:
def sanitize(filepath): #Removes header information, leaving only column names and data. Outputs "sanitized" file.
    with open(filepath) as unsan, open(dirname + "/" + newname + '.csv', 'w', newline='') as san:
        writer = csv.writer(san)
        line_count = 0
        headingrow = 0
        datarow = 0
        safety = 1
        for row in csv.reader(unsan, delimiter=','):
            #Detect data start
            if "DATA START" in str(row):
                safety = 0
                headingrow = line_count + 1
                datarow = line_count + 4
            #Detect data end
            if "DATA END" in str(row):
                safety = 1
            #Write data
            if safety == 0:
                if line_count == headingrow or line_count >= datarow:
                    writer.writerow(row)
            line_count += 1
I have .csv data files that are megabytes, sometimes gigabytes (up to 4Gb) in size. Out of 180,000 lines in each file, I only need to omit about 50 lines.
Example pseudo-data (rows I want to keep are indented):
[Header Start]
...48 lines of header data...
[Header End]
Blank Line
[Data Start]
    Row with Column Names
Column Units
Column Variable Type
    ...180,000 lines of data...
I understand that I can't edit a .csv file as I iterate over it (learned here: How to Delete Rows CSV in python). Is there a quicker way to remove the header information from the file, like maybe writing the remaining 180,000 lines as a block instead of iterating through and writing each line?
Maybe one solution would be to append all the data rows to a list of lists and then use writer.writerows(list of lists) instead of writing them one at a time (Batch editing of csv files with Python, https://docs.python.org/3/library/csv.html)? However, wouldn't that mean I'm loading essentially the whole file (up to 4Gb) into my RAM?
UPDATE:
I've got a pandas import working, but when I time it, it takes about twice as long as the code above. Specifically, the to_csv portion takes about 10s for a 26Mb file.
import csv, pandas as pd
filepath = r'input'
with open(filepath) as unsan:
    line_count = 0
    headingrow = 0
    datarow = 0
    safety = 1
    row_count = sum(1 for row in csv.reader(unsan, delimiter=','))
    unsan.seek(0) # rewind: counting the rows above consumed the file
    for row in csv.reader(unsan, delimiter=','):
        #Detect data start
        if "DATA START" in str(row):
            safety = 0
            headingrow = line_count + 1
            datarow = line_count + 4
        #Write data
        if safety == 0:
            if line_count == headingrow:
                colnames = row
                line_count +=1
                break
        line_count += 1
badrows = [*range(0, 55, 1),row_count - 1]
df = pd.read_csv(filepath, names=[*colnames], skiprows=[*badrows], na_filter=False)
df.to_csv(r'output', index = None, header=True)
Here's the research I've done:
Deleting rows with Python in a CSV file
https://intellipaat.com/community/18827/how-to-delete-only-one-row-in-csv-with-python
https://www.reddit.com/r/learnpython/comments/7tzbjm/python_csv_cleandelete_row_function_doesnt_work/
https://nitratine.net/blog/post/remove-columns-in-a-csv-file-with-python/
Delete blank rows from CSV?
If it is not important that the file is read in Python, or with a CSV reader/writer, you can use other tools. On *nix you can use sed:
sed -n '/DATA START/,/DATA END/p' myfile.csv > headerless.csv
This will be very fast for millions of lines.
perl is more multi-platform:
perl -F -lane "print if /DATA START/ .. /DATA END/;" myfile.csv
To avoid editing the file, and read the file with headers straight into Python and then into Pandas, you can wrap the file in your own file-like object.
Given an input file called myfile.csv with this content:
HEADER
HEADER
HEADER
HEADER
HEADER
HEADER

now, some, data
1,2,3
4,5,6
7,8,9
You can read that file in directly using a wrapper class:
import io
class HeaderSkipCsv(io.TextIOBase):
    def __init__(self, filename):
        """ create an iterator from the filename """
        self.data = self.yield_csv(filename)
    def readable(self):
        """ here for compatibility """
        return True
    def yield_csv(self, filename):
        """ open filename and read past the first empty line
        Then yield characters one by one. This reads just one
        line at a time in memory
        """
        with open(filename) as f:
            for line in f:
                if line.strip() == "":
                    break
            for line in f:
                for char in line:
                    yield char
    def read(self, n=None):
        """ called by Pandas with some 'n', this returns
        the next 'n' characters since the last read as a string
        """
        data = ""
        for i in range(n):
            try:
                data += next(self.data)
            except StopIteration:
                break
        return data
WANT_PANDAS=True #set to False to just write file
if WANT_PANDAS:
    import pandas as pd
    df = pd.read_csv(HeaderSkipCsv('myfile.csv'))
    print(df.head(5))
else:
    with open('myoutfile.csv', 'w') as fo:
        with HeaderSkipCsv('myfile.csv') as fi:
            c = fi.read(1024)
            while c:
                fo.write(c)
                c = fi.read(1024)
which outputs:
now some data
0 1 2 3
1 4 5 6
2 7 8 9
Because Pandas allows any file-like object, we can provide our own! Pandas calls read on the HeaderSkipCsv object as it would on any file object. Pandas just cares about reading valid csv data from a file object when it calls read on it. Rather than providing Pandas with a clean file, we provide it with a file-like object that filters out the data Pandas does not like (i.e. the headers).
The yield_csv generator iterates over the file without reading it in, so only as much data as Pandas requests is loaded into memory. The first for loop in yield_csv advances f to beyond the first empty line. f represents a file pointer and is not reset at the end of a for loop while the file remains open. Since the second for loop receives f under the same with block, it starts consuming at the start of the csv data, where the first for loop left it.
Another way of writing the first for loop would be
next((line for line in f if line.isspace()), None)
which is more explicit about advancing the file pointer, but arguably harder to read.
Because we skip the lines up to and including the empty line, Pandas just gets the valid csv data. For the headers, no more than one line is ever loaded.
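If the goal is only to write a header-free copy, a simpler sketch is to advance the file object past the marker line and then hand the rest to shutil.copyfileobj, which copies in blocks instead of row by row. This is an illustration under assumptions, not a benchmarked solution: copy_after_marker is a hypothetical name, the "DATA START" marker is taken from the question's code, and the sketch does not drop the units rows or a trailing "DATA END" line.

```python
import io
import shutil

def copy_after_marker(src, dst, marker="DATA START"):
    """Skip lines up to and including the marker line, then copy the rest in blocks."""
    for line in src:
        if marker in line:
            break
    # src is now positioned just after the marker line
    shutil.copyfileobj(src, dst)

# Hypothetical in-memory sample standing in for a large file.
src = io.StringIO("header 1\nheader 2\n\nDATA START\ncol1,col2\n1,2\n3,4\n")
dst = io.StringIO()
copy_after_marker(src, dst)
result = dst.getvalue()
print(result)
```

With real files, src and dst would be the objects returned by open() in text mode; only one block at a time is held in memory.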

How do I combine large csv files in python?

I have 18 csv files, each approximately 1.6Gb and each containing approximately 12 million rows. Each file represents one year's worth of data. I need to combine all of these files, extract data for certain geographies, and then analyse the time series. What is the best way to do this?
I have tried using pd.read_csv but I hit a memory limit. I have tried including a chunk size argument, but this gives me a TextFileReader object and I don't know how to combine these to make a dataframe. I have also tried pd.concat but this does not work either.
Here is an elegant way of using pandas to combine very large csv files.
The technique is to load a number of rows (defined as CHUNK_SIZE) into memory per iteration until completed. These rows are appended to the output file in "append" mode.
import pandas as pd
CHUNK_SIZE = 50000
csv_file_list = ["file1.csv", "file2.csv", "file3.csv"]
output_file = "./result_merge/output.csv"
for csv_file_name in csv_file_list:
    chunk_container = pd.read_csv(csv_file_name, chunksize=CHUNK_SIZE)
    for chunk in chunk_container:
        chunk.to_csv(output_file, mode="a", index=False)
But if your files contain headers, then it makes sense to keep the header from the first file only, since a repeated header in the output is unexpected. In this case the solution is as follows:
import pandas as pd
CHUNK_SIZE = 50000
csv_file_list = ["file1.csv", "file2.csv", "file3.csv"]
output_file = "./result_merge/output.csv"
first_chunk = True
for csv_file_name in csv_file_list:
    # read_csv uses each file's header row as column names, so it never appears in the data
    chunk_container = pd.read_csv(csv_file_name, chunksize=CHUNK_SIZE)
    for chunk in chunk_container:
        # write the column names only once, with the very first chunk
        chunk.to_csv(output_file, mode="a", index=False, header=first_chunk)
        first_chunk = False
The memory limit is hit because you are trying to load the whole csv into memory. An easy solution would be to read the files line by line (assuming your files all have the same structure), check each line, then write it to the target file:
filenames = ["file1.csv", "file2.csv", "file3.csv"]
sep = ";"
def check_data(data):
    # ... your tests
    return True # << True if data should be written into target file, else False
with open("/path/to/dir/result.csv", "a+") as targetfile:
    for filename in filenames :
        with open("/path/to/dir/"+filename, "r") as f:
            next(f) # << only if the first line contains headers
            for line in f:
                data = line.split(sep)
                if check_data(data):
                    targetfile.write(line)
Update: An example of the check_data method, following your comments:
def check_data(data):
    return data[n] == 'USA' # < where n is the column holding the country
Iterating over the TextFileReader object yields ordinary DataFrames, one per chunk (for chunk in reader: ...). You can then use pd.concat to concatenate the individual dataframes, ideally after filtering each chunk down to the rows you need so that the result fits in memory.

Python: To read files from a folder according to the date as specified on the filename and sort

I have files in a folder which are named according to the date and time they were recorded, for example Test_20150925_181323.data [in the file name, 20150925 is the date (2015/09/25) and 181323 is the time (18 hr 13 min 23 sec)]. I have more than 20 such files.
I want to do the following:
Read the files according to increasing order of date and time.
From each file take the values between lines 11 and 21 (these are values recorded in two columns)
Put those values in two arrays, say timevalues=[] and yvalues=[].
Then read the next file, do the same, and append the values between line 11 and line 21 to timevalues and yvalues.
Finally I should have two arrays, timevalues and yvalues, in which the data (between lines 11 and 21 from each file) is appended according to the time it was recorded.
My attempt:
import numpy as np
import re, os
import pandas as pd
from os import walk
path = r'C:\Users\Data1\\'
for data_file in sorted(os.listdir(path)):
    print data_file
    times = []
    yvals = []
    for line in data_file.readlines()[11:21]: # read lines from 11 to 21
        column = line.split('\t')
        times.append(column[0])
        yvals.append(column[1])
    #print times
    #print yvals
This always gives the error message:
for line in data_file.readlines()[11:21]: # read lines from 11 to 21
AttributeError: 'str' object has no attribute 'readlines'.
Also, I am not sure if this is the right way to read the files according to the time on its filename.
os.listdir() doesn't return the full paths of files, so you need to join the directory path to each file.
You're trying to .readlines() from your filename string. You need to open() the file to use .readlines().
If all your files follow Test_YYYYMMDD_HHMMSS.data format they will sort properly without any datetime work needed.
The fileinput module makes line counting convenient and handles opening the files sequentially:
import os
import fileinput
path = r'C:\Users\Data'
files = [os.path.join(path,data_file) for data_file in sorted(os.listdir(path))]
times = []
yvals = []
f = fileinput.input(files=files)
for line in f:
    if 11 <= f.filelineno() <= 21:
        columns = line.split('\t')
        times.append(columns[0])
        yvals.append(columns[1])
f.close()
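The line-range selection can be sketched on its own with itertools.islice, which stops reading once the last wanted line has been consumed. The sketch assumes "lines 11 to 21" means 1-based line numbers, inclusive, like the filelineno() check above; a slice such as readlines()[11:21] is 0-based and covers lines 12 to 21 instead. The name read_columns and the sample lines are hypothetical.

```python
from itertools import islice

def read_columns(lines, first=11, last=21, sep="\t"):
    """Collect the first two columns from 1-based lines first..last (inclusive)."""
    times, yvals = [], []
    for line in islice(lines, first - 1, last):
        columns = line.rstrip("\n").split(sep)
        times.append(columns[0])
        yvals.append(columns[1])
    return times, yvals

# Hypothetical sample: 30 tab-separated lines, "1\t10" through "30\t300".
lines = [f"{i}\t{i * 10}\n" for i in range(1, 31)]
times, yvals = read_columns(lines)
print(times[0], times[-1], len(times))
```

The lines argument can be an open file object, so the function can be called once per file and the results appended to the overall lists.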
This is assuming your for data_file in sorted(os.listdir(path)): is working properly. What happens next is the file is opened and within that loop you're going through lines 11:21.
import os
path = r'C:\Users\Data1\\'
times = []
yvals = []
for data_file in sorted(os.listdir(path)):
    # os.listdir returns bare names, so join each one to the directory
    with open(os.path.join(path, data_file), 'r') as f:
        for line in f.readlines()[11:21]: # read lines from 11 to 21
            column = line.split('\t')
            times.append(column[0])
            yvals.append(column[1])
import os
from datetime import datetime
def extract_datetime(filename):
    datetime_string = filename.replace('Test_', '')
    datetime_string = datetime_string.replace('.data', '')
    return datetime.strptime(datetime_string, '%Y%m%d_%H%M%S')
# Sort files by date/time
log_files = os.listdir(path)
log_files.sort(key=extract_datetime)
for f in log_files:
    # os.listdir returns bare names, so join each one to the directory
    with open(os.path.join(path, f), 'r') as infile:
        lines = infile.readlines()[11:21]
        # The rest of your original code here
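A quick check, under the assumption that every name follows the Test_YYYYMMDD_HHMMSS.data pattern, shows that plain string sorting and datetime-based sorting give the same order. Apart from the first file name, which comes from the question, the names below are made up.

```python
from datetime import datetime

def extract_datetime(filename):
    """Parse the timestamp embedded in a Test_YYYYMMDD_HHMMSS.data file name."""
    stem = filename.replace("Test_", "").replace(".data", "")
    return datetime.strptime(stem, "%Y%m%d_%H%M%S")

names = [
    "Test_20150925_181323.data",
    "Test_20150101_000000.data",  # hypothetical
    "Test_20150925_090000.data",  # hypothetical
]
by_name = sorted(names)
by_time = sorted(names, key=extract_datetime)
print(by_name == by_time)
```

The two orders agree because the timestamp fields are zero-padded and run from most to least significant; parsing becomes necessary only if some names deviate from the pattern.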
