I have very long text files with running measurements. The measurements are interrupted by header blocks that look almost the same throughout my text files. Here is an original extract:
10:10 10 244.576 0 0
10:20 10 244.612 0 0
10:30 10 244.563 0 0
HBCHa 9990 Seite 4
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 2. Januar 2000 10:40 10 244.555 0 0
10:50 10 244.592 0 0
11:00 10 244.595 0 0
11:10 10 244.593 0 0
...
I would like a continuous list containing only the measurements. As you can see, one measurement sits inside a header line, the one that starts with Sonntag. My problem is that I want to break that line after 2000 and keep the second part, 10:40 10 244.555 0 0, as a separate line.
My target is this:
10:20 10 244.612 0 0
10:30 10 244.563 0 0
10:40 10 244.555 0 0
10:50 10 244.592 0 0
11:00 10 244.595 0 0
11:10 10 244.593 0 0
...
So far I have managed to select only the lines that start with a time:
if i.startswith("0") or i.startswith("1") or i.startswith("2"):
and add them to a new list.
I can also select the lines that contain the expression "tag":
f = open(source_file, "r")
data = f.readlines()
for lines in data:
    if re.match("(.*)tag(.*)", lines):
        print lines
There are no other lines that match "tag"!
There's no need to worry about the invalid information if you can precisely match the valid information. So we'll use a regular expression to match only the data we want.
import re
MEASUREMENT_RE = re.compile(r"\b\d{2}:\d{2} \d{2} \d{3}\.\d{3} \d \d\b")
with open(source_file, mode="r") as f:
    print("\n".join(MEASUREMENT_RE.findall(f.read())))
Changes:
context manager (with block) used to open the file so the file closes automatically
read used instead of readlines since there's no point in applying a regular expression to each line instead of to all lines
measurements found with a regular expression that checks for exactly the digits you're looking for (if you need to match more digits in any section, it should be altered)
word boundaries (\b) used in the regular expression so that a match cannot begin or end in the middle of a longer run of digits or letters
This one matches digit sequences of variable length separated by colon, space and full stop:
import re
p = re.compile(r'\d+:\d+ \d+ \d+\.\d+ \d+ \d+')
with open(source_file, "r") as f:
    for line in f:
        line_clean = p.findall(line)
        if line_clean:
            print("".join(line_clean))
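As a quick check of the approach in the answers above, the sketch below runs a pattern with an escaped dot over the extract from the question, held in a string instead of being read from source_file. The names SAMPLE and extract_measurements are made up for this illustration, and the field widths are an assumption based on the extract only.

```python
import re

# Extract copied from the question; in practice this would come from f.read().
SAMPLE = """10:10 10 244.576 0 0
10:20 10 244.612 0 0
10:30 10 244.563 0 0
HBCHa 9990 Seite 4
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 2. Januar 2000 10:40 10 244.555 0 0
10:50 10 244.592 0 0
"""

# Time, interval, level with a literal dot, then two integer fields.
MEASUREMENT_RE = re.compile(r"\b\d{1,2}:\d{2} \d+ \d+\.\d+ \d+ \d+\b")


def extract_measurements(text):
    """Return every measurement found in text, one string per measurement."""
    return MEASUREMENT_RE.findall(text)


measurements = extract_measurements(SAMPLE)
print("\n".join(measurements))
```

The measurement embedded in the Sonntag line comes out as its own entry, so no separate line-breaking step is needed.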
Related
I am writing a script to gather results from an output file of a programme. The file contains headers, captions, and data in scientific format. I only want the data and I need a script that can do this repeatedly for different output files with the same results format.
This is the data:
GROUP 1 2 3 4 5 6 7 8
VELOCITY (m/s) 59.4604E+06 55.5297E+06 52.4463E+06 49.3329E+06 45.4639E+06 41.6928E+06 37.7252E+06 34.9447E+06
GROUP 9 10 11 12 13 14 15 16
VELOCITY (m/s) 33.2405E+06 30.8868E+06 27.9475E+06 25.2880E+06 22.8815E+06 21.1951E+06 20.1614E+06 18.7338E+06
GROUP 17 18 19 20 21 22 23 24
VELOCITY (m/s) 16.9510E+06 15.7017E+06 14.9359E+06 14.2075E+06 13.5146E+06 12.8555E+06 11.6805E+06 10.5252E+06
This is my code at the moment. I want it to open the file, search for the keyword 'INPUT:BETA' which indicates the start of the results I want to extract. It then takes the information between this input keyword and the end identifier that signals the end of the data I want. I don't think this section needs changing but I have included it just in case.
I have then tried to use regex to specify the lines that start with VELOCITY (m/s) as these contain the data I need. This works and extracts each line, whitespace and all, into an array. However, I want each numerical value to be a single element, so the next line is supposed to strip the whitespace out and split the lines into individual array elements.
with open(file_name) as f:
    t=f.read()
    t=t[t.find('INPUT:BETA'):]
    t=t[t.find(start_identifier):t.find(end_identifier)]
regex = r"VELOCITY \(m\/s\)\s(.*)"
res = re.findall(regex, t)
res = [s.split() for s in res]
print(res)
print(len(res))
This isn't working, here is the output:
[['33.2405E+06', '30.8868E+06', '27.9475E+06', '25.2880E+06', '22.8815E+06', '21.1951E+06', '20.1614E+06', '18.7338E+06'], ['16.9510E+06', '15.7017E+06', '14.9359E+06', '14.2075E+06', '13.5146E+06', '12.8555E+06', '11.6805E+06', '10.5252E+06']]
2
It's taking out the whitespace but not putting the values into separate elements, which I need for the next stage of the processing.
My question is therefore:
How can I extract each value into a separate array element, leaving the rest of the data behind, in a way that will work with different output files with different data?
Here is how you can flatten your list, which is your point 1.
import re
text = """
GROUP 1 2 3 4 5 6 7 8
VELOCITY (m/s) 59.4604E+06 55.5297E+06 52.4463E+06 49.3329E+06 45.4639E+06 41.6928E+06 37.7252E+06 34.9447E+06
GROUP 9 10 11 12 13 14 15 16
VELOCITY (m/s) 33.2405E+06 30.8868E+06 27.9475E+06 25.2880E+06 22.8815E+06 21.1951E+06 20.1614E+06 18.7338E+06
GROUP 17 18 19 20 21 22 23 24
VELOCITY (m/s) 16.9510E+06 15.7017E+06 14.9359E+06 14.2075E+06 13.5146E+06 12.8555E+06 11.6805E+06 10.5252E+06
"""
regex = r"VELOCITY \(m\/s\)\s(.*)"
res = re.findall(regex, text)
res = [s.split() for s in res]
res = [value for lst in res for value in lst]
print(res)
print(len(res))
Note that the regex itself does not skip the first VELOCITY line, so the missing row must be lost elsewhere in your code, presumably in the slicing that runs before the regex is applied.
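If the next processing stage needs numbers rather than strings, the flattened values can also be converted with float, which accepts the 59.4604E+06 notation. This is a minimal sketch, assuming the values always follow the VELOCITY (m/s) label on the same line; parse_velocities is a hypothetical helper name and the sample text is a shortened version of the data in the question.

```python
import re

# One capture group per VELOCITY line; MULTILINE makes ^ and $ work per line.
VELOCITY_RE = re.compile(r"^VELOCITY \(m/s\)\s+(.*)$", re.MULTILINE)


def parse_velocities(text):
    """Return all velocity values in text as one flat list of floats."""
    return [float(v) for row in VELOCITY_RE.findall(text) for v in row.split()]


# Shortened sample (three values per row instead of eight).
text = """GROUP 1 2 3
VELOCITY (m/s) 59.4604E+06 55.5297E+06 52.4463E+06
GROUP 4 5 6
VELOCITY (m/s) 49.3329E+06 45.4639E+06 41.6928E+06
"""

values = parse_velocities(text)
print(len(values))
print(values)
```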
I have a bunch of files from a LAMMPS simulation, each consisting of a 9-line header and an Nx5 data array (N is large, of order 10000). A file looks like this:
ITEM: TIMESTEP
1700000
ITEM: NUMBER OF ATOMS
40900
ITEM: BOX BOUNDS pp pp pp
0 59.39
0 59.39
0 59.39
ITEM: ATOMS id type xu yu zu
1 1 -68.737755560980844 1.190046376093027 122.754819323806714
2 1 -68.334493269859621 0.365731265115530 122.943111038981527
3 1 -68.413018326512173 -0.456802254452782 123.436843456292138
4 1 -68.821350328206080 -1.360098170077123 123.314784135612115
5 1 -67.876948635447775 -1.533699833382506 123.072964235308660
6 1 -67.062910322675322 -2.006415676993953 123.431518511867381
7 1 -67.069984116148134 -2.899068427170739 123.057125785834685
8 1 -66.207325578729183 -3.292545155979909 123.377770523297343
...
I would like to open every file, perform a certain operation on the numerical data and save the file with a different name leaving the header unchanged. My script is:
for f in files:
    filename=path+"/"+f
    with open(filename) as myfile:
        header = ' '.join([next(myfile) for x in xrange(9)])
    data=np.loadtxt(filename,skiprows=9)
    data[:,2:5]%=L #Put everything inside the box...
    savetxt(filename.replace("lammpstrj","fold.lammpstrj"),data,header=header,comments="",fmt="%d %d %.15f %.15f %.15f")
The output, though, looks like this:
ITEM: TIMESTEP
1700000
ITEM: NUMBER OF ATOMS
40900
ITEM: BOX BOUNDS pp pp pp
0 59.39
0 59.39
0 59.39
ITEM: ATOMS id type xu yu zu
1 1 50.042244439019157 1.190046376093027 3.974819323806713
2 1 50.445506730140380 0.365731265115530 4.163111038981526
3 1 50.366981673487828 58.933197745547218 4.656843456292137
4 1 49.958649671793921 58.029901829922878 4.534784135612114
5 1 50.903051364552226 57.856300166617494 4.292964235308659
6 1 51.717089677324680 57.383584323006048 4.651518511867380
7 1 51.710015883851867 56.490931572829261 4.277125785834684
8 1 52.572674421270818 56.097454844020092 4.597770523297342
...
The header is not exactly the same: there is a space at the beginning of every line except the first, and an extra newline after the last line of the header. I need to get rid of those, but I don't know how.
What am I doing wrong?
The issue is the separator in ' '.join(a):
>>> a = ['sadf\n', 'sdfg\n']
>>> ' '.join(a)
'sadf\n sdfg\n'
Note the space at the start of the second line. Instead, join with an empty string:
>>> ''.join(a)
'sadf\nsdfg\n'
You will also need to trim the last '\n' in your header to prevent the empty line:
>>> ''.join(a).rstrip()
'sadf\nsdfg'
The header parameter will add a newline after it automatically, so you can eliminate the original last '\n' as a redundant newline.
header = header.rstrip('\n')
The leading spaces appear because you join the lines with a space character as the separator. Joining with an empty string avoids them:
header = ''.join([next(myfile) for x in xrange(9)])
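Putting the two fixes together, the header handling can be sketched as follows. This is a minimal Python 3 illustration that uses io.StringIO as a stand-in for an open trajectory file; read_header and HEADER_LINES are names chosen for this example, not part of the original script.

```python
import io
from itertools import islice

HEADER_LINES = 9  # header length stated in the question

# Stand-in for an open trajectory file (shortened from the question's sample).
sample = io.StringIO(
    "ITEM: TIMESTEP\n"
    "1700000\n"
    "ITEM: NUMBER OF ATOMS\n"
    "40900\n"
    "ITEM: BOX BOUNDS pp pp pp\n"
    "0 59.39\n"
    "0 59.39\n"
    "0 59.39\n"
    "ITEM: ATOMS id type xu yu zu\n"
    "1 1 -68.737755560980844 1.190046376093027 122.754819323806714\n"
)


def read_header(fileobj, n_lines=HEADER_LINES):
    """Join the first n_lines without a separator and drop the final newline."""
    return "".join(islice(fileobj, n_lines)).rstrip("\n")


header = read_header(sample)
print(header)
```

The resulting string can then be passed as header= to np.savetxt together with comments="", as in the question, since savetxt appends its own newline after the header.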
My text:
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = first label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
What I want:
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = new label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
The code I am using:
import re
fo=open("test5.txt", "r+")
num_lines = sum(1 for line in open('test5.txt'))
count=1
while (count <= num_lines):
    line1=fo.readline()
    j= line1[17 : 72]
    j1=re.findall('\d+', j)
    k=map(int,j1)
    if (k==[30411]):
        count1=count-4
        line2=fo.readlines()[count1]
        r1=line2[10:72]
        r11=str(r1)
        r2="new label"
        r22=str(r2)
        newdata = line2.replace(r11,r22)
        f1 = open("output7.txt",'a')
        lines=f1.writelines(newdata)
    else:
        f1 = open("output7.txt",'a')
        lines=f1.writelines(line1)
    count=count+1
The problem is in writing the lines. Once 30411 is found, the script has to go three lines back and change the label to the new one. The output file should contain all lines unchanged except the label line, but it is not written properly. Can anyone help?
Apart from many blood-curdling but noncritical problems, you are calling readlines() in the middle of an iteration using readline(), causing you to read lines not from the beginning of the file but from the current position of the fo handle, i.e. after the line containing 30411.
You need to open the input file again with a separate handle or (better) store the last 4 lines in memory instead of rereading the one you need to change.
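The in-memory variant can be sketched like this. It is a simplified Python 3 illustration, not a drop-in replacement: it assumes the label always sits three lines above the matching $SUBCASE ID line, and it ignores the fixed-width column slicing used in the question. The function name relabel and its parameters are hypothetical.

```python
def relabel(lines, subcase_id, new_label, offset=3):
    """Return a copy of lines in which the $LABEL line that sits `offset`
    lines before the matching $SUBCASE ID line carries new_label."""
    out = list(lines)
    for i, line in enumerate(lines):
        # The subcase number is whatever follows the last '=' on the line.
        if "$SUBCASE ID" in line and line.split("=")[-1].strip() == str(subcase_id):
            target = i - offset
            if target >= 0 and "$LABEL" in out[target]:
                prefix = out[target].split("=")[0]
                out[target] = prefix + "= " + new_label + "\n"
    return out


# Sample taken from the question; normally this would be f.readlines().
lines = [
    "$TITLE = XXXX YYYY\n",
    "1 $SUBTITLE= XXXX YYYY ANSA\n",
    "2 $LABEL = first label\n",
    "3 $DISPLACEMENTS\n",
    "4 $MAGNITUDE-PHASE OUTPUT\n",
    "5 $SUBCASE ID = 30411\n",
]

result = relabel(lines, 30411, "new label")
print("".join(result), end="")
```

Because every line is held in a list, going back a few lines is plain indexing, and the output file can be written once with writelines(result) inside a single with block.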
So I want to count the occurrences of certain words, per line, in a text file. How many times each specific word occurred doesn't matter, just how many times any of them occurred per line. I have a file containing a list of words, delimited by newline characters. It looks like this:
amazingly
astoundingly
awful
bloody
exceptionally
frightfully
.....
very
I then have another text file containing lines of text. Let's say, for example:
frightfully frightfully amazingly Male. Don't forget male
green flag stops? bloody bloody bloody bloody
I'm biased.
LOOKS like he was headed very
green flag stops?
amazingly exceptionally exceptionally
astoundingly
hello world
I want my output to look like:
3
4
0
1
0
3
1
Here's my code:
def checkLine(line):
    count = 0
    with open("intensifiers.txt") as f:
        for word in f:
            if word[:-1] in line:
                count += 1
    print count
for line in open("intense.txt", "r"):
    checkLine(line)
Here's my actual output:
4
1
0
1
0
2
1
0
Any ideas?
How about this:
def checkLine(line):
    with open("intensifiers.txt") as fh:
        line_words = line.rstrip().split(' ')
        check_words = [word.rstrip() for word in fh]
        print(sum(line_words.count(w) for w in check_words))
for line in open("intense.txt", "r"):
    checkLine(line)
Output:
3
4
0
1
0
3
1
0
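The same counting can be done without reopening the word file for every line by loading the words into a set once. Below is a minimal Python 3 sketch with the data inlined; count_intensifiers is a hypothetical name, and the set contains only the words that are visible in the question, not the elided part of the list.

```python
def count_intensifiers(line, intensifiers):
    """Count whitespace-separated tokens of line that are in intensifiers."""
    return sum(1 for token in line.split() if token in intensifiers)


# Only the words shown in the question; the real file contains more.
intensifiers = {"amazingly", "astoundingly", "awful", "bloody",
                "exceptionally", "frightfully", "very"}

lines = [
    "frightfully frightfully amazingly Male. Don't forget male",
    "green flag stops? bloody bloody bloody bloody",
    "I'm biased.",
    "LOOKS like he was headed very",
    "green flag stops?",
    "amazingly exceptionally exceptionally",
    "astoundingly",
    "hello world",
]

counts = [count_intensifiers(line, intensifiers) for line in lines]
print(counts)
```

Comparing whole tokens, rather than testing with in on the raw line, is what makes repeated words count more than once and keeps substrings of longer words from matching.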
I have been trying to replace a word in a text file with a value (say 1), but my output file is blank. I am new to Python (I have only been learning it for a month).
My file is relatively large, but I just want to replace a word with the value 1 for now.
Here is a segment of what the file looks like:
NAME SECOND_1
ATOM 1 6 0 0 0 # ORB 1
ATOM 2 2 0 12/24 0 # ORB 2
ATOM 3 2 12/24 0 0 # ORB 2
ATOM 4 2 0 0 4/24 # ORB 3
ATOM 5 2 0 0 20/24 # ORB 3
ATOM 6 2 0 0 8/24 # ORB 3
ATOM 7 2 0 0 16/24 # ORB 3
ATOM 8 6 0 0 12/24 # ORB 1
ATOM 9 2 12/24 0 12/24 # ORB 2
ATOM 10 2 0 12/24 12/24 # ORB 2
#1
#2
#3
I want to first replace the word ATOM with the value 1. Next I want to replace # ORB with a space. Here is what I have tried so far:
input = open('SECOND_orbitsJ22.txt','r')
output=open('SECOND_orbitsJ22_out.txt','w')
for line in input:
    word=line.split(',')
    if(word[0]=='ATOM'):
        word[0]='1'
        output.write(','.join(word))
Can anyone offer any suggestions or help? Thanks so much.
The problem is that there's no comma after ATOM in the input, so word[0] doesn't equal ATOM. You should be splitting on spaces, not commas.
You could also just use split() without arguments.
Since you only do output.write when a match is found, the output stays empty.
P.S. Try to use with statements when opening files:
with open('SECOND_orbitsJ22.txt','r') as input, \
     open('SECOND_orbitsJ22_out.txt','w') as output:
    ...
Also, Alexander suggests the right tool for replacement: str.replace. However, use it with caution as it's not position-aware. re.sub is a more flexible alternative.
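To illustrate the position-awareness point, here is a small Python 3 sketch in which re.sub replaces ATOM only when it is the first word of a line. The helper name clean_line and the second test line containing SECOND_ATOM are invented for the example.

```python
import re


def clean_line(line):
    """Replace a leading ATOM keyword with 1 and the '# ORB' marker with a space."""
    # ^ anchors the match at the start of the line, \b ends it at the word edge.
    line = re.sub(r"^ATOM\b", "1", line)
    return line.replace("# ORB", " ")


sample = "ATOM 2 2 0 12/24 0 # ORB 2\n"
print(clean_line(sample), end="")

# A plain str.replace would also rewrite ATOM inside other words:
print("NAME SECOND_ATOM".replace("ATOM", "1"))
print(clean_line("NAME SECOND_ATOM"))
```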
Use replace.
line.replace("ATOM", "1").replace("# ORB", " ")
Untested code:
with open('inp.txt', 'r') as infile, open('out.txt', 'w') as outfile:
    clean = infile.read().replace("ATOM", "1").replace("# ORB", " ")
    outfile.write(clean)
Working example.
Based on the file segment you've pasted here, you'll want to split each line on whitespace rather than on a comma. If there are no commas, line.split(',') returns a list with a single element, the whole line, so word[0] is the entire line and never equals 'ATOM'. Your output file is empty because the write only happens when that comparison succeeds.
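The difference between the two split calls can be checked directly on one line from the question. This is a short Python 3 sketch; the variable names are chosen for the illustration.

```python
line = "ATOM 1 6 0 0 0 # ORB 1\n"

by_comma = line.split(",")  # no comma in the line: one element, the whole line
by_space = line.split()     # split on any run of whitespace

print(by_comma)
print(by_space[0] == "ATOM")
```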