I need to break up a 1.3M text file into smaller text files based on the first row of each section. The data inputs will likely vary over time, so I'd like to automate the process. The input looks something like this, but I'm open to any suggestions:
FirstLine test1
1 1 1
TIMESTEP Avg VARIANCE(mm^2) STD
2006-01-06T00:00:00Z 77.556335 114.23446 10.688052
2006-02-06T00:00:00Z 30.174097 20.363855 4.512633
2006-03-06T00:00:00Z 65.48971 146.99098 12.123984
2006-04-06T00:00:00Z 68.65635 335.42905 18.314722
2006-05-06T00:00:00Z 65.31086 121.24954 11.011337
2006-06-06T00:00:00Z 123.571075 172.97223 13.151891
FirstLine test2
1 1 1
TIMESTEP Avg VARIANCE(mm^2) STD
2006-01-06T00:00:00Z 66.34833 258.47723 16.077227
2006-02-06T00:00:00Z 16.08292 16.153652 4.0191607
2006-03-06T00:00:00Z 34.585014 185.23705 13.610182
I need the first row of each output file to be the FirstLine row, and the file to contain everything up to the next row that starts with FirstLine.
I've tried identifying the row number with this script:
import pandas as pd

def search_string_in_file(file_path, FirstLine):
    line_number = 0
    list_of_results = []
    # Open the file in read-only mode
    with open(file_path, 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if the line contains the string
            line_number += 1
            if FirstLine in line:
                # If yes, then add the line number & line as a tuple to the list
                list_of_results.append((line_number, line.rstrip()))
                print(list_of_results)
    # pd.DataFrame.from_string does not exist; build the DataFrame directly
    RowList = pd.DataFrame(list_of_results, columns=['line_number', 'line'])
    # Return list of tuples containing line numbers and lines where the string was found
    return list_of_results
The above seems to run successfully, but there are no results and no errors.
I found a way to do this that actually cuts some steps out:

import re

with open('test.csv') as f:
    data = f.read()
found = re.findall(r'\n*(.*?\n\#)\n*', data, re.M | re.S)
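For the actual splitting, here is a minimal sketch, assuming each section starts with a line beginning with "FirstLine" and that the text after the marker is safe to use as a file name (both assumptions; adapt the marker and naming scheme to your data):

import re

with open('test.csv') as f:
    data = f.read()

# Split at every newline that is immediately followed by the marker;
# the lookahead keeps the marker line at the top of each section.
sections = re.split(r'\n(?=FirstLine)', data)

for section in sections:
    # Hypothetical naming scheme: use the text after "FirstLine " on the
    # first row (e.g. "test1", "test2") as the output file name.
    name = section.splitlines()[0].replace('FirstLine ', '').strip()
    with open(name + '.txt', 'w') as out:
        out.write(section + '\n')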
I have an assignment that requires me to analyze data for presidential job creation without using dictionaries. I have to open a text file and average the data that applies to the Democratic and Republican presidents. I am having trouble understanding how to skip over certain lines (in my case, I don't want to include the first line, which holds the months, or index position 0 of each row, which holds the year). This is what I have so far, along with a bit of the input file:
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1979,14090,14135,14152,14191,14221,14239,14288,14328,14422,14484,14532,14559
1980,14624,14747,14754,14795,14827,14784,14861,14870,14824,14900,14903,14946
1981,14969,14981,14987,14985,14971,14963,14993,15007,14971,15028,15073,15075
1982,15056,15056,15050,15075,15132,15207,15299,15328,15403,15463,15515,15538
def g_avg():
    infile = open("government_employment_Windows.txt", 'r')
    lines = []
    for line in infile:
        print(line)
        lines.append(line)
    infile.close()
    print(lines)
    mean = 0
    for line in lines:
        number = float(line)
        mean = mean + number
    mean = mean / len(lines)
    print(mean)
Here is a very pythonic way to calculate this:
with open('filename') as f:
    lines = f.readlines()  # read the lines of the txt file
    total = 0
    n = 0
    for line in lines[1:]:  # use [1:] to skip the first row with the months
        row = line.split(',')  # split converts the line into a list, separated by commas
        for element in row[1:]:  # use [1:] to skip the years
            total += float(element)
            n += 1
    mean = total / n
This also looks like a CSV file, in which case you can use the built-in csv module:
import csv

total = 0
count = 0
with open("government_employment_Windows.txt", 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skips the headers
    for line in reader:
        for item in line[1:]:  # skip the year in the first column
            count += 1
            total += float(item)
print(total)
print(count)
print('average: ', total / count)
Use a slice to skip over the first line, i.e. file.readlines()[1:]:
# foo.txt
line, that, you, want, to, skip
important, stuff
other, important, stuff

with open('foo.txt') as file:
    for line in file.readlines()[1:]:
        print(line)

This prints:

important, stuff
other, important, stuff
Note that since I have used with to open the file, Python will close it for me automatically. If you had just done file = open(..), you would also have to do file.close().
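For comparison, a minimal sketch of the manual pattern (same hypothetical foo.txt):

file = open('foo.txt')
try:
    for line in file.readlines()[1:]:
        print(line)
finally:
    file.close()  # without the with statement, the file must be closed explicitly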
I have a file containing a block of introductory text for an unknown number of lines, then the rest of the file contains data. Before the data block begins, there are column titles and I want to skip those also. So the file looks something like this:
this is an introduction..
blah blah blah...
...
UniqueString
Time Position Count
0 35 12
1 48 6
2 96 8
...
1000 82 37
I want to record the Time, Position, and Count data to a separate file; that data appears only after UniqueString.
Is this what you're looking for?

reduce(lambda x, line: (x and (outfile.write(line) or x)) or line == 'UniqueString\n', infile, False)

How it works:

files can be iterated, so we can read infile line by line by simply looping over it
in the and part, we use the fact that outfile.write(line) will not be evaluated if the first operand of and is False
in the or part, we set the trigger once the desired line is found, so outfile.write fires for the next and all subsequent lines
False is passed as the explicit initial value for reduce; without it, reduce would use the first line of the file as its (truthy) starting value
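Spelled out as an explicit loop, the one-liner is equivalent to roughly this sketch (file names are hypothetical; note that, like the one-liner, it also copies the column-title line that follows UniqueString):

# the explicit-loop equivalent of the reduce one-liner
seen = False  # plays the role of the accumulator x
with open('data.txt') as infile, open('output.txt', 'w') as outfile:
    for line in infile:
        if seen:
            outfile.write(line)
        else:
            seen = (line == 'UniqueString\n')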
You could extract and write the data to another file like this:
with open("data.txt", "r") as infile:
x = infile.readlines()
x = [i.strip() for i in x[x.index('UniqueString\n') + 1:] if i != '\n' ]
with open("output.txt", "w") as outfile:
for i in x[1:]:
outfile.write(i+"\n")
It is pretty straightforward, I think: the file is opened and all lines are read, a list comprehension slices the list beginning after the marker string, and the desired remaining lines are written to the output file.
You could create a generator function that filters the file for you. It operates incrementally, so it doesn't require reading the entire file into memory at one time.
def extract_lines_following(file, marker=None):
    """Generator yielding all lines in the file after the line that follows the marker."""
    marker_seen = False
    while True:
        line = file.next()
        if marker_seen:
            yield line
        elif line.strip() == marker:
            marker_seen = True
            file.next()  # skip the following line (the column titles), too

# sample usage
with open('test_data.txt', 'r') as infile, open('cleaned_data.txt', 'w') as outfile:
    outfile.writelines(extract_lines_following(infile, 'UniqueString'))
This could be optimized a little if you're using Python 3, but the basic idea would be the same.
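For reference, a minimal Python 3 sketch of the same idea (file.next() no longer exists, and raising StopIteration inside a generator is an error in Python 3, so this version iterates over the file directly):

def extract_lines_following(file, marker=None):
    """Yield all lines in the file after the line that follows the marker."""
    marker_seen = False
    skip_next = False
    for line in file:
        if skip_next:
            skip_next = False  # this was the line right after the marker
        elif marker_seen:
            yield line
        elif line.strip() == marker:
            marker_seen = True
            skip_next = True  # skip the column-title line, too

with open('test_data.txt') as infile, open('cleaned_data.txt', 'w') as outfile:
    outfile.writelines(extract_lines_following(infile, 'UniqueString'))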
I have two files as follows. The first objective is to get the rows that are not common between 1.csv and 2.csv by comparing the first 14 digits of the first column.
The second objective is: if the first column in 1.csv matches any first column in 2.csv, compare that row's second column with the second column in 1.csv, and print the row that is not present in 1.csv but is present in 2.csv.
The script below is what I have so far, but I am not able to get the desired output:
import csv

t1 = open('1.csv', 'r')
t2 = open('2.csv', 'r')
fileone = t1.readlines()
filetwo = t2.readlines()
t1.close()
t2.close()

outFile = open('update.csv', 'w')
x = 0
for i in fileone:
    if i != filetwo[x]:
        outFile.write(filetwo[x])
    x += 1
outFile.close()
If the format is fixed as given, one solution is to split each line into two pieces so you can compare only the first 14 digits, as you requested.
The solution you have only does a line-by-line comparison. If you split up the lines, you can iterate over the data of either file and use a simple in check to see whether the line is in the other file.
First of all, always use with when handling files: it takes one line less, and you will never forget to close the files:
with open('1.csv', 'r') as file1, open('2.csv', 'r') as file2:
    file1_lines = file1.readlines()
    file2_lines = file2.readlines()

file1_headers = [line[:14] for line in file1_lines]
file2_headers = [line[:14] for line in file2_lines]

with open('update.csv', 'w') as out_file:
    # Objective 1: lines whose first 14 digits appear in one file only
    for line in file1_lines:
        if line[:14] not in file2_headers:
            out_file.write(line)
    for line in file2_lines:
        if line[:14] not in file1_headers:
            out_file.write(line)
    # Objective 2: lines that are in file 2 but not in file 1
    for line in file2_lines:
        if line not in file1_lines:
            out_file.write(line)
Your code doesn't mention 14 anywhere; that should have alerted you in the first place ;-) Cheers!
I'm trying to determine the number of columns that are present in a CSV file in python v2.6. This has to be in general, as in, for any input that I pass, I should be able to obtain the number of columns in the file.
Sample input file: love hurt hit
Other input files: car speed beforeTune afterTune repair
So far, what I have tried to do is read the file (which has lots of rows), get the first row, and count the number of words in that row. The delimiter is ,. I ran into a problem when I tried to split the headings based on the sample input: len(headings) gives me 14, which is wrong, as it should give me 3. Any ideas? I am a beginner.
with open(filename1, 'r') as f1:
    csvlines = csv.reader(f1, delimiter=',')
    for lineNum, line in enumerate(csvlines):
        if lineNum == 0:
            #colCount = getColCount(line)
            headings = ','.join(line)  # gives me `love, hurt, hit`
            print len(headings)  # gives me 14; I need 3
        else:
            a.append(line[0])
            b.append(line[1])
            c.append(line[2])
len("love, hurt, hit") is 14 because it's a string.
The len you want is of line, which is a list:
print len(line)
This outputs the number of columns, rather than the number of characters
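Applied to the question's snippet, the fix is just to take len of the list itself (filename1 is a hypothetical path here, and a, b, c are the column lists from the question):

import csv

filename1 = 'input.csv'  # hypothetical path
a, b, c = [], [], []
with open(filename1, 'r') as f1:
    csvlines = csv.reader(f1, delimiter=',')
    for lineNum, line in enumerate(csvlines):
        if lineNum == 0:
            colCount = len(line)  # counts the list elements, e.g. 3
            print colCount
        else:
            a.append(line[0])
            b.append(line[1])
            c.append(line[2])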
# old school
import csv

c = 0
field = {}
with open('csvmsdos.csv', 'r') as csvFile:
    reader = csv.reader(csvFile)
    for row in reader:
        field[c] = row
        print(field[c])
        c = c + 1
row = len(field[0])
column = len(field)
csvFile.close()  # redundant: the with statement already closed the file
A simple solution:
with open(filename1) as file:
    # for each row in a given file
    for row in file:
        # split that row into list elements using comma (",")
        # as a separator, count the elements and print
        print(len(row.split(",")))
        # break out of the loop after the first iteration
        break
I have a .txt file that looks like:
abcd this is the header
more header, nothing here I need
***********
column1 column2
========= =========
12.4 A
34.6 mm
1.3 um
=====================
footer, nothing that I need here
***** more text ******
I am trying to read the data in the columns, each into its own list: col1 = [12.4, 34.6, 1.3] and col2 = ['A', 'mm', 'um'].
This is what I have so far, but the only thing that is returned when I run the code is 'None':
import sys

def readfile():
    y = sys.argv[1]
    z = open(y)
    for line in z:
        data = False
        if data == True:
            toks = line.split()
            print toks
        if line.startswith('========= ========='):
            data = True
            continue
        if line.startswith('====================='):
            data = False
            break
print readfile()
Any suggestions?
There are many ways to do this.
One way involves:
Reading the file into lines
From the lines read, find the indices of the lines that contain the column header delimiter (which also matches against the footer delimiter).
Then, store the data between these lines.
Parse these lines by splitting them based on whitespace and storing them into their respective columns.
Like this:
with open('data.dat', 'r') as f:
    lines = f.readlines()

# This gets the limits of the lines that contain the header / footer delimiters.
# We can use the column header delimiter double-time as the footer delimiter:
# `=====================` also matches against it.
# Note: the output size is supposed to be 2. If more lines contain this
# delimiter, you'll get problems.
limits = [idx for idx, data in enumerate(lines) if '=========' in data]

# `data` now contains all the lines between these limits
data = lines[limits[0] + 1:limits[1]]

# Now, you can parse the lines into rows by splitting each line on whitespace
rows = [line.split() for line in data]

# Column 1 has float data, so we convert the string data to float
col1 = [float(row[0]) for row in rows]

# Column 2 is string data, so there is nothing further to do
col2 = [row[1] for row in rows]

print col1, col2
This outputs (from your example):
[12.4, 34.6, 1.3] #Column 1
['A', 'mm', 'um'] #Column 2
The method you are adopting is not only inefficient but also a bit buggy, hence your erroneous data extraction.
You need to set the boolean data to True right after line.startswith('========= ========='), and keep it False until then.
From there on, your data will be extracted until line.startswith('=====================').
Hope I got you right.
import sys

def readfile():
    y = sys.argv[1]
    toks = []
    with open(y) as z:
        data = False
        for line in z:
            if line.startswith('========= ========='):
                data = True
                continue
            if line.startswith('====================='):
                data = False
                break
            if data:
                toks.append(line.split())
    print toks
    col1, col2 = zip(*toks)  # or just simply: return zip(*toks)
    return col1, col2

print readfile()
The with statement is more pythonic and better than z = open(file).
If you know how many lines of header/footer the file has, then you can use this method.
path = r'path\to\file.csv'
header = 2
footer = 2
buffer = []

with open(path, 'r') as f:
    # skip the header lines
    for _ in range(header):
        f.readline()
    # pre-fill the buffer with `footer` lines
    for _ in range(footer):
        buffer.append(f.readline())
    # every new line pushes the oldest buffered line out for processing,
    # so the final `footer` lines never leave the buffer
    for line in f:
        buffer.append(line)
        line = buffer.pop(0)
        # do stuff to line
        print(line)
Skipping the header lines is trivial, but I had problems skipping the footer lines since:
I didn't want to change the file in any way manually
I didn't want to count the number of lines in the file
I didn't want to store the entire file in a list (i.e., readlines()) ^
^ Note: If you don't mind storing the entire file in memory, you can use this:
path = r'path\to\file.csv'
header = 2
footer = 2

with open(path, 'r') as f:
    # slice away the header lines and, if footer is non-zero, the footer lines
    for line in f.readlines()[header:-footer if footer else None]:
        # do stuff to line
        print(line)