I have a .txt file that looks like:
abcd this is the header
more header, nothing here I need
***********
column1 column2
========= =========
12.4 A
34.6 mm
1.3 um
=====================
footer, nothing that I need here
***** more text ******
I am trying to read the data in the columns, each into its own list: col1 = [12.4, 34.6, 1.3] and col2 = ['A', 'mm', 'um'].
This is what I have so far, but the only thing that is returned when I run the code is 'None':
def readfile():
    y = sys.argv[1]
    z = open(y)
    for line in z:
        data = False
        if data == True:
            toks = line.split()
            print toks
        if line.startswith('========= ========='):
            data = True
            continue
        if line.startswith('====================='):
            data = False
            break
print readfile()
Any suggestions?
There are many ways to do this.
One way involves:
Reading the file into lines
From the lines read, find the indices of the lines that contain the column header delimiter (which also matches against the footer delimiter).
Then, store the data between these lines.
Parse these lines by splitting them based on whitespace and storing them into their respective columns.
Like this:
with open('data.dat', 'r') as f:
    lines = f.readlines()

# This gets the limits of the lines that contain the header / footer delimiters.
# We can use the column header delimiter double-time as the footer delimiter:
# `=====================` also matches against it.
# Note: the output size is supposed to be 2. If more lines contain this
# delimiter, you'll get problems.
limits = [idx for idx, data in enumerate(lines) if '=========' in data]

# `data` now contains all the lines between these limits
data = lines[limits[0]+1:limits[1]]

# Now, you can parse the lines into rows by splitting each line on whitespace
rows = [line.split() for line in data]

# Column 1 has float data, so we convert the string data to float
col1 = [float(row[0]) for row in rows]

# Column 2 is string data, so there is nothing further to do
col2 = [row[1] for row in rows]

print col1, col2
This outputs (from your example):
[12.4, 34.6, 1.3] #Column 1
['A', 'mm', 'um'] #Column 2
The method you are adopting is not inefficient, but your code is a bit buggy, hence the erroneous data extraction.
You need to trigger the boolean data right after line.startswith('========= ========='); until then, it should be kept False.
From there on, your data will get extracted until line.startswith('=====================').
Hope I got you right.
def readfile():
    y = sys.argv[1]
    toks = []
    with open(y) as z:
        data = False
        for line in z:
            if line.startswith('========= ========='):
                data = True
                continue
            if line.startswith('====================='):
                data = False
                break
            if data:
                toks.append(line.split())
    print toks
    col1, col2 = zip(*toks)  # Or just simply, return zip(*toks)
    return col1, col2

print readfile()
The with statement is more pythonic & better than z = open(file).
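Note that zip(*toks) transposes the list of token rows into columns, but the values in col1 will still be strings ('12.4', not 12.4). If you want floats, as in your expected output, one extra step does it (a minimal sketch, assuming the two-column layout above):
col1 = [float(x) for x in col1]  # '12.4' -> 12.4
col2 = list(col2)                # unit strings stay as-is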
If you know how many lines of header/footer the file has, then you can use this method.
path = r'path\to\file.csv'
header = 2
footer = 2
buffer = []

with open(path, 'r') as f:
    # Throw away the header lines
    for _ in range(header):
        f.readline()
    # Pre-fill the buffer with `footer` lines
    for _ in range(footer):
        buffer.append(f.readline())
    # Each new line pushes one buffered line out for processing;
    # the final `footer` lines never leave the buffer
    for line in f:
        buffer.append(line)
        line = buffer.pop(0)
        # do stuff to line
        print(line)
Skipping header lines is trivial, but I had problems skipping footer lines since:
I didn't want to change the file in any way manually
I didn't want to count the number of lines in the file
I didn't want to store the entire file in a list (i.e., readlines()) ^
^ Note: If you don't mind storing the entire file in memory, you can use this:
path = r'path\to\file.csv'
header = 2
footer = 2

with open(path, 'r') as f:
    for line in f.readlines()[header:-footer if footer else None]:
        # do stuff to line
        print(line)
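If you need this in several places, the same buffering idea can be wrapped in a generator (a sketch under the same assumptions; skip_header_footer is a made-up helper name, not a standard one). A deque keeps the pop from the front O(1) even for large footers:
from collections import deque

def skip_header_footer(f, header=0, footer=0):
    """Yield the lines of an open file, minus the first `header`
    and the last `footer` lines, without reading it all into memory."""
    for _ in range(header):
        f.readline()
    buffer = deque()
    for _ in range(footer):
        buffer.append(f.readline())
    for line in f:
        buffer.append(line)
        yield buffer.popleft()

with open(r'path\to\file.csv', 'r') as f:
    for line in skip_header_footer(f, header=2, footer=2):
        print(line)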
I need to break up a 1.3M text file into smaller text files based on the 1st row of each section. The data inputs will likely vary over time, so I'd like to automate the process with something that looks like the following, but I'm open to any suggestions:
FirstLine test1
1 1 1
TIMESTEP Avg VARIANCE(mm^2) STD
2006-01-06T00:00:00Z 77.556335 114.23446 10.688052
2006-02-06T00:00:00Z 30.174097 20.363855 4.512633
2006-03-06T00:00:00Z 65.48971 146.99098 12.123984
2006-04-06T00:00:00Z 68.65635 335.42905 18.314722
2006-05-06T00:00:00Z 65.31086 121.24954 11.011337
2006-06-06T00:00:00Z 123.571075 172.97223 13.151891
FirstLine test2
1 1 1
TIMESTEP Avg VARIANCE(mm^2) STD
2006-01-06T00:00:00Z 66.34833 258.47723 16.077227
2006-02-06T00:00:00Z 16.08292 16.153652 4.0191607
2006-03-06T00:00:00Z 34.585014 185.23705 13.610182
I need the 1st row of each piece to be the FirstLine row, together with everything down to the next FirstLine row.
I've tried identifying the row number with this script:
def search_string_in_file(content, FirstLine):
    line_number = 0
    list_of_results = []
    RowList = []
    # Open the file in read only mode
    with open('test.csv', 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if line contains the string
            line_number += 1
            if FirstLine in line:
                # If yes, then add the line number & line as a tuple in the list
                list_of_results.append((line_number, line.rstrip()))
                print(list_of_results)
    # Return list of tuples containing line numbers and lines where string is found
    RowList = pd.DataFrame.from_string(list_of_results)
    return list_of_results
The above seems to run successfully, but there are no results and no errors.
Found a way to do this that actually cut some steps out.
found = re.findall(r'\n*(.*?\n\#)\n*', data, re.M | re.S)
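If the regex is hard to adapt, a plain loop that starts a new output file at every row beginning with FirstLine is another way to cut the file (a sketch, assuming every section starts with such a row; the section_%d.txt naming is illustrative):
outfile = None
count = 0

with open('test.csv', 'r') as infile:
    for line in infile:
        if line.startswith('FirstLine'):
            # Close the previous section and start a new one
            if outfile:
                outfile.close()
            count += 1
            outfile = open('section_%d.txt' % count, 'w')
        if outfile:
            outfile.write(line)

if outfile:
    outfile.close()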
The title isn't big enough for me to explain this so here it goes:
I have a csv file looking something like this:
Example csv containing
long string with some special characters , number, string, number
long string with some special characters , number, string, number
long string with some special characters , number, string, number
long string with some special characters , number, string, number
I want to go through the first column and, if the length of the string is greater than 20, do this:
LINE 20 : long string with som, e special characters
That is, split the string, modify the first csv with the first part of the string, then create a new csv and add the other part on the same line number, leaving the rest just whitespace.
What I have for now is shown below. It doesn't do anything right now; it's just what I made to try to explain to myself and figure out how I could do the new file writing with splitString:
fileName = file name
maxCollumnLength = number of rows in the whole set
lineNum = line number of a string that is greater than 20
splitString = second part of the split string that should be written to another file
def newopenfile(fileName, maxCollumnLength, lineNum, splitString):
    with open(fileName, 'rw', encoding="utf8") as nf:
        writer = csv.writer(fileName, quoting=csv.QUOTE_NONE)
        for i in range(0, maxCollumnLength-1):
            # write whitespace until reaching lineNum of a string that's
            # bigger than 20, then write that part of the string to a csv
            pass
This goes through the first column and checks the length:
fileName = 'uskrs.csv'
firstColList = []  # an empty list to store the first column
splitString = []
i = 0

with open(fileName, 'rw', encoding="utf8") as rf:
    reader = csv.reader(rf, delimiter=',')
    for row in reader:
        if len(row[0]) > 20:
            i += 1
            # split row and pass the other end of the row to
            # newopenfile(fileName, len(reader), i, splitString)
            #print(row[0])  # for debugging
            firstColList.append(row[0])
From this point I am stuck on how to actually change the string in the csv and how to split it.
THE STRING COULD ALSO HAVE 60+ chars, so it would need splitting more than 2 times and storing in more than 2 csvs.
I suck at explaining the problem, so if you have any questions please do ask.
Okay, so I was successful in dividing the first column if it has length greater than 20, and replacing the first column with the first 20 chars:
import csv

def checkLength(column, readFile, writeFile, maxLen):
    counter = 0
    i = 0
    idxSplitItems = []
    final = []
    newSplits = 0
    with open(readFile, 'r', encoding="utf8", newline='') as f:
        reader = csv.reader(f)
        your_list = list(reader)
    final = your_list
    for sublist in your_list:
        #del sublist[-1]  # remove last invisible element
        i += 1
        data = removeUnwanted(sublist[column])
        print(data)
        if len(data) > maxLen:
            counter += 1  # Number of large
            idxSplitItems.append(split_bylen(i, data, maxLen))
            if len(idxSplitItems) > newSplits: newSplits = len(idxSplitItems)
            final[i-1][column] = split_bylen(i, data, maxLen)[1]
            final[i-1][column] = removeUnwanted(final[i-1][column])
            print("After split data: " + data)
            print("After split final: " + final[i-1][column])
    writeSplitToCSV(writeFile, final)
    checkCols(final, 6)
    return final, idxSplitItems

def removeUnwanted(data):
    data = data.replace(',', ' ')
    return data

def split_bylen(index, item, maxLen):
    clean = removeUnwanted(item)
    splitList = [clean[ind:ind+maxLen] for ind in range(0, len(item), maxLen)]
    splitList.insert(0, index)
    return splitList

def writeSplitToCSV(writeFile, data):
    with open(writeFile, 'w', encoding="utf8", newline='') as f:
        writer = csv.writer(f)
        writer.writerows(data)

def checkCols(data, columns):
    for sublist in data:
        if len(sublist) - 1 != columns:
            print("[X] This row doesn't have the same amount of columns as others: " + str(sublist))
        else:
            print("All okay")

#len(data)  # how many split items
#print(your_list[0][0])
#print("Number of large: ", counter)

final, idxSplitItems = checkLength(0, 'test.csv', 'final.csv', 30)
print("------------------------")
print(idxSplitItems)
print("-------------------------")
print(final)
Now I have a problem with this part of the code; notice this:
print("After split data: "+ data)
print("After split final: "+ final[i-1][column])
This is to check if removing the comma worked.
with example of
"BUTKOVIĆ VESNA , DIPL.IUR."
data returns
BUTKOVIĆ VESNA DIPL.IUR.
but final returns
BUTKOVIĆ VESNA , DIPL.IUR.
Why does my final return the "," again when in data it's gone? It must be something done in split_bylen() that makes it do that.
Dictionaries are fun!
To overwrite the original csv, see here. You would have to use DictReader & DictWriter. I keep your method of reading just for clarity.
writecsvs = {} #store each line of each new csv
# e.g. {'csv1':[[row0_split1,row0_num,row0_str,row0_num],[row1_split1,row1_num,row1_str,row1_num],...],
# 'csv2':[[row0_split2,row0_num,row0_str,row0_num],[row1_split2,row1_num,row1_str,row1_num],...],
# .
# .
# .}
with open(fileName, mode='r', encoding="utf-8-sig") as rf:
    reader = csv.reader(rf, delimiter=',')
    for row in reader:
        col1 = row[0]
        # check size & split
        # decide number of new csvs
        # overwrite original csv
        # store new content in writecsvs dict

for name in writecsvs:  # loop over each csv in writecsvs
    writelines = writecsvs[name]  # get the list of lines
    out_file = open(name + '.csv', mode='w')  # use the keys in writecsvs for filenames
    for line in writelines:
        out_file.write(line)
    out_file.close()
Hope this helps.
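To make the splitting concrete, here is a minimal sketch of the whole flow, assuming a fixed chunk size of 20, rows that always have a first column, and whitespace padding so line numbers stay aligned (chunk_string and the file names are illustrative, not from the answer above):
import csv

CHUNK = 20

def chunk_string(s, size):
    # Split a string into pieces of at most `size` characters
    return [s[i:i+size] for i in range(0, len(s), size)] or ['']

with open('input.csv', encoding='utf-8-sig') as rf:
    rows = list(csv.reader(rf))

split_rows = [chunk_string(row[0], CHUNK) for row in rows]
extra_files = max(len(p) for p in split_rows) - 1  # csvs needed besides the original

# Rewrite the original csv with only the first chunk in column 1
with open('input.csv', 'w', newline='', encoding='utf-8') as wf:
    writer = csv.writer(wf)
    for row, pieces in zip(rows, split_rows):
        writer.writerow([pieces[0]] + row[1:])

# One extra csv per additional chunk level; rows without that many
# chunks get whitespace, so the line numbers stay aligned
for n in range(1, extra_files + 1):
    with open('csv%d.csv' % n, 'w', newline='', encoding='utf-8') as wf:
        writer = csv.writer(wf)
        for row, pieces in zip(rows, split_rows):
            chunk = pieces[n] if n < len(pieces) else ' '
            writer.writerow([chunk] + [' '] * (len(row) - 1))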
I want to get data from a table in a text file into a python array. The text file that I am using as an input has 7 columns and 31 rows. Here is an example of the first two rows:
10672 34.332875 5.360831 0.00004035881220 0.00000515052523 4.52E-07 6.5E-07
12709 40.837833 19.429158 0.00012010938453 -0.00000506426720 7.76E-06 2.9E-07
The code that I have tried to write isn't working, as it is not reading one line at a time when it goes through the for loop.
data = []
f = open('hyadeserr.txt', 'r')
while True:
    eof = "no"
    array = []
    for i in range(7):
        line = f.readline()
        word = line.split()
        if len(word) == 0:
            eof = "yes"
        else:
            array.append(float(word[0]))
    print array
    if eof == "yes": break
    data.append(array)
Any help would be greatly appreciated.
A file with space-separated values is just a dialect of the classic comma-separated values (CSV) file where the delimiter is a space (' '), possibly followed by more spaces, which can be ignored.
Happily, Python comes with a csv.reader class that understands dialects.
You should use this:
Example:
#!/usr/bin/env python

import csv

csv.register_dialect('ssv', delimiter=' ', skipinitialspace=True)

data = []
with open('hyadeserr.txt', 'r') as f:
    reader = csv.reader(f, 'ssv')
    for row in reader:
        floats = [float(column) for column in row]
        data.append(floats)

print data
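As a side note, if registering a named dialect feels like overkill, the same format options can be passed directly to csv.reader:
reader = csv.reader(f, delimiter=' ', skipinitialspace=True)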
If you don't want to use csv here, since you don't really need it:
data = []
with open("hyadeserr.txt") as file:
    for line in file:
        data.append([float(f) for f in line.strip().split()])
Or, if you know for sure that the only extra chars are spaces and line ending \n, you can turn the last line into:
data.append([float(f) for f in line[:-1].split()])
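For what it's worth, str.split() with no arguments already splits on any run of whitespace and discards leading/trailing whitespace, including the trailing \n, so plain line.split() gives the same result as both variants above:
data.append([float(f) for f in line.split()])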
I am trying to match elements 0, 2, 3, 4 of an array storing the columns of one tab-delimited file against elements 0, 2, 3, 4 of an array storing the columns of another tab-delimited file, and print out element 5 (column 6) from both input files in Python.
Here is the code that I worked on, but I guess this code matches line by line between the two files. However, I wanted to match each line of file1 against any line in file2.
#!/usr/bin/python

import sys
import itertools
import csv, pprint
from array import *

#print len(sys.argv)
if len(sys.argv) != 4:
    print 'Usage: python scores.py <infile1> <infile2> <outfile>'
    sys.exit(1)

f1 = open("/home/user/ab/ab/ab/file1.txt", "r")
f2 = open("/home/user/ab/ab/ab/file2.txt", "r")
f3 = open("out.txt", "w")

lines1 = f1.readlines()
lines2 = f2.readlines()

## for loop to read lines line by line simultaneously from two files
for f1line, f2line in zip(lines1, lines2):
#for f1line, f2line in itertools.izip(lines1, lines2):
    row1 = f1line.split('\t')  # split on tab
    row2 = f2line.split('\t')  # split on tab
    if (row1[0:1] + row1[2:5]) == (row2[0:1] + row2[2:5]):  # columns 0,2,3,4 matching between two infiles
        writer = csv.writer(f3, delimiter='\t')
        writer.writerow((row1[0:1] + row1[2:5]) + row1[5:6] + (row2[0:1] + row2[2:5]) + row2[5:6])
For each line of file1, to match it against every line of file2:
op = operator.itemgetter(0, 2, 3, 4)
f2 = file2.readlines()  # otherwise it won't work every loop
for line1 in file1:
    ...  # split 1
    for line2 in f2:
        ...  # split 2
        if op(row1) == op(row2):
            ...
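Filled in, that sketch might look like the following (a minimal version, assuming tab-delimited rows with at least six columns; import operator is needed for itemgetter, and the file names are illustrative):
import csv
import operator

op = operator.itemgetter(0, 2, 3, 4)

with open('file1.txt') as file1, open('file2.txt') as file2, \
        open('out.txt', 'w') as out:
    writer = csv.writer(out, delimiter='\t')
    lines2 = file2.readlines()  # read once, reuse on every outer iteration
    for line1 in file1:
        row1 = line1.rstrip('\n').split('\t')
        for line2 in lines2:
            row2 = line2.rstrip('\n').split('\t')
            if op(row1) == op(row2):
                writer.writerow(list(op(row1)) + [row1[5], row2[5]])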
So, just do what you said: for each line of file1, match each line of file2
for f1line in lines1:
    row1 = f1line.split('\t')  # split on tab
    for f2line in lines2:
        row2 = f2line.split('\t')  # split on tab
        if (row1[0:1] + row1[2:5]) == (row2[0:1] + row2[2:5]):
            ...
This assumes that each key value (row[0,3,4,5]) is unique per file:
import sys
import csv

datalen = 12
keyfn = lambda row: tuple(row[0:1] + row[3:6])
datafn = lambda row: row[8:datalen]

def load_dict(fname, keyfn, datafn):
    with open(fname, 'rb') as inf:
        data = (row.split() for row in inf if not row.startswith('##'))
        return {keyfn(row): datafn(row) for row in data if len(row) >= datalen}

def main(fname1, fname2, outfname):
    data1 = load_dict(fname1, keyfn, datafn)
    data2 = load_dict(fname2, keyfn, datafn)
    common_keys = sorted(set(data1).intersection(data2))
    with open(outfname, 'wb') as outf:
        outcsv = csv.writer(outf, delimiter='\t')
        outcsv.writerows(list(key) + data1[key] + data2[key] for key in common_keys)

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print 'Usage: python scores.py <infile1> <infile2> <outfile>'
        sys.exit(1)
    else:
        main(*sys.argv[1:4])
Edit: problems found:
I made one mistake: the return value from the key function was a list; a list is not hashable, therefore cannot be a dictionary key. I have made the return value a tuple instead.
You, on the other hand, failed to mention that
your files begin with several lines of comments (I have modified the script to ignore comment rows, meaning anything starting with ##)
your file is NOT tab-delimited (or at least the file examples you provided are not). It actually seems to be columnar, separated with multiple spaces, which is awkward for the csv module to handle directly. Luckily, the data seems simple enough to use .split() instead.
you are matching on the wrong columns; column 2 in your data files does not appear to match between files at all. I think you need to key on columns 0, 3, 4, 5 instead. I have updated keyfn to reflect this.
Columns 3 and 4 appear to be identical, but I am not certain of this. If columns 3 and 4 are always identical, you could save some memory and speed things up a bit by only keying on columns 0, 4, 5: keyfn = lambda row: tuple(row[0:1] + row[4:6])
I am guessing that columns 8,9,10,11 are the desired data; I have changed datafn to reflect this. The script should now work as required.
How can I skip the header row and start reading a file from line2?
with open(fname) as f:
    next(f)
    for line in f:
        # do something
        pass
f = open(fname, 'r')
lines = f.readlines()[1:]
f.close()
If you want the first line and then want to perform some operation on the rest of the file, this code will be helpful.
with open(filename, 'r') as f:
    first_line = f.readline()
    for line in f:
        # Perform some operations
        pass
If slicing could work on iterators...
from itertools import islice

with open(fname) as f:
    for line in islice(f, 1, None):
        pass
f = open(fname).readlines()
firstLine = f.pop(0)  # removes the first line
for line in f:
    ...
To generalize the task of reading multiple header lines and to improve readability, I'd use method extraction. Suppose you wanted to tokenize the first two lines of coordinates.txt to use as header information.
Example
coordinates.txt
---------------
Name,Longitude,Latitude,Elevation, Comments
String, Decimal Deg., Decimal Deg., Meters, String
Euler's Town,7.58857,47.559537,0, "Blah"
Faneuil Hall,-71.054773,42.360217,0
Yellowstone National Park,-110.588455,44.427963,0
Then method extraction allows you to specify what you want to do with the header information (in this example we simply tokenize the header lines based on the comma and return each as a list, but there's room to do much more).
def __readheader(filehandle, numberheaderlines=1):
    """Reads the specified number of lines and returns the comma-delimited
    strings on each line as a list"""
    for _ in range(numberheaderlines):
        yield map(str.strip, filehandle.readline().strip().split(','))

with open('coordinates.txt', 'r') as rh:
    # Single header line
    #print next(__readheader(rh))

    # Multiple header lines
    for headerline in __readheader(rh, numberheaderlines=2):
        print headerline  # Or do other stuff with headerline tokens
Output
['Name', 'Longitude', 'Latitude', 'Elevation', 'Comments']
['String', 'Decimal Deg.', 'Decimal Deg.', 'Meters', 'String']
If coordinates.txt contains another header line, simply change numberheaderlines. Best of all, it's clear what __readheader(rh, numberheaderlines=2) is doing, and we avoid the ambiguity of having to figure out or comment on why the author of the accepted answer uses next() in his code.
If you want to read multiple CSV files starting from line 2, this works like a charm
for files in csv_file_list:
    with open(files, 'r') as r:
        next(r)  # skip headers
        rr = csv.reader(r)
        for row in rr:
            # do something
            pass
(this is part of Parfait's answer to a different question)
# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Skip the column names
    file.readline()

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(0, 1000):

        # Split the current line into a list: line
        line = file.readline().split(',')

        # Get the value for the first column: first_col
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)
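For what it's worth, the counting logic above can also be expressed with collections.Counter and itertools.islice (an equivalent sketch, not part of the quoted answer):
from collections import Counter
from itertools import islice

with open('world_dev_ind.csv') as file:
    file.readline()  # skip the column names
    counts_dict = Counter(line.split(',')[0] for line in islice(file, 1000))

print(dict(counts_dict))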