I have a log file with an arbitrary number of lines. All I need to extract is the one line of data from the log file which starts with the string “Total”. I do not want any other lines from the file.
How do I write a simple Python program for this?
This is how my input file looks:
TestName id eno TPS GRE FNP
Test 1205 1 0 78.00 0.00 0.02
Test 1206 1 0 45.00 0.00 0.02
Test 1207 1 0 73400 0.00 0.02
Test 1208 1 0 34.00 0.00 0.02
Totals 64 0 129.61 145.64 1.12
I am trying to get an output file which looks like:
TestName id TPS GRE
Totals 64 129.61 145.64
OK, so I wanted only the 1st, 2nd, 4th and 5th columns from the input file, but not the others. I am trying list[index] to achieve this but getting an IndexError (list index out of range). Also, the spacing between two columns is not always the same, so I am not sure how to split the columns and select the ones that I want. Can somebody please help me with this? Below is the program I used:
newFile = open('sana.log','r')
for line in newFile.readlines():
    if ('TestName' in line) or ('Totals' in line):
        data = line.split('\t')
        print data[0]+data[1]
theFile = open('thefile.txt','r')
FILE = theFile.readlines()
theFile.close()
printList = []
for line in FILE:
    if ('TestName' in line) or ('Totals' in line):
        # here you may want to do some splitting/concatenation/formatting to your string
        printList.append(line)
for item in printList:
    print item # or write it to another file... or whatever
for line in open('filename.txt', 'r'):
    if line.startswith('TestName') or line.startswith('Totals'):
        fields = line.rsplit(None, 5)
        print '\t'.join(fields[:2] + fields[3:5])
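Since the question asks for an output file rather than printed lines, the filtered columns can also be written out directly. A minimal sketch (the demo builds the sample input itself; 'sana.log' and 'out.txt' are placeholder names):

```python
# Demo input taken from the question; in practice the file already exists.
sample = """TestName id eno TPS GRE FNP
Test 1205 1 0 78.00 0.00 0.02
Totals 64 0 129.61 145.64 1.12
"""
with open('sana.log', 'w') as f:
    f.write(sample)

# Keep only the header and "Totals" lines, columns 1, 2, 4 and 5.
with open('sana.log') as infile, open('out.txt', 'w') as outfile:
    for line in infile:
        if line.startswith('TestName') or line.startswith('Totals'):
            fields = line.split()              # split on any run of whitespace
            keep = fields[:2] + fields[3:5]    # TestName, id, TPS, GRE
            outfile.write('\t'.join(keep) + '\n')
```

Splitting on arbitrary whitespace with `split()` (no argument) sidesteps the uneven column spacing mentioned in the question.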
I have a text file from which I want to extract values at a specific distance from a string whenever the string is encountered. I'm completely new to this and got to know that these kinds of pattern matching problems can be solved using regular expressions.
<BEGIN> AUTO,CHANSTATE
<CH> Time: 2002-07-04
<CH> Chan Doppler Code Track CdDoppler CodeRange
<CH> 0 1449.32 2914.6679 0.00 833359.36 -154.093
<CH> 1 1450.35 2414.8292 0.00 833951.94 -154.093
<CH> 2 1450.35 6387.2597 0.00 833951.94 -154.093
<END>
<BEGIN> AUTO,CHSTAT
(it goes on)---------------------
The above structure is repeated multiple times inside the file. Is there any way I can extract the Doppler values (1449.32, 1450.35, 1450.35) and store them in a Python list? Since it all starts with " AUTO,CHANSTATE", is there a way it can be taken as a reference to get the values? Or any other way which I'm probably unable to think of.
Any help will be really appreciated.
A better approach is to parse the file line by line: split the line over whitespace and capture the value of Doppler using list index 2. The advantage of this approach is that you can access other parameter values as well if required in the future. Try this:
with open("sample.txt") as file:  # Use file to refer to the file object
    for line in file:  # Parse the file line by line
        data = line.split()  # Split the line over whitespace
        try:
            float(data[2])  # raises ValueError unless data[2] is a number
            print("Doppler =", data[2])
        except (IndexError, ValueError):
            pass
Output:
Doppler = 1449.32
Doppler = 1450.35
Doppler = 1450.35
Check this for demo: https://www.online-python.com/mgE32OXJW8
If you really want/need to use regex, you could do this.
Code:
import re
text = '''<BEGIN> AUTO,CHANSTATE
<CH> Time: 2002-07-04
<CH> Chan Doppler Code Track CdDoppler CodeRange
<CH> 0 1449.32 2914.6679 0.00 833359.36 -154.093
<CH> 1 1450.35 2414.8292 0.00 833951.94 -154.093
<CH> 2 1450.35 6387.2597 0.00 833951.94 -154.093
<END>
<BEGIN> AUTO,CHSTAT
(it goes on)---------------------'''
find_this = re.findall(r'<CH>.*?[0-9].*?\s.*?([0-9].*?)\s', text)
print(find_this)
['1449.32', '1450.35', '1450.35']
There are, however, other ways to do this without re, as others have pointed out.
Or any other way...
No regex, just string functions
iterate over the lines in the file
check if the line (starts with, contains, or equals) '<BEGIN> AUTO,CHANSTATE'
when it does, skip the next two lines
keep iterating, and for each line that starts with '<CH>',
split the line on whitespace and save the third item of the result (result[2])
continue until a line (starts with, contains, or equals) '<END>'
do it all over again.
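The steps above can be sketched roughly like this (a minimal sketch using the sample text from the question; for a real file, iterate over the file object instead of splitlines()):

```python
text = """<BEGIN> AUTO,CHANSTATE
<CH> Time: 2002-07-04
<CH> Chan Doppler Code Track CdDoppler CodeRange
<CH> 0 1449.32 2914.6679 0.00 833359.36 -154.093
<CH> 1 1450.35 2414.8292 0.00 833951.94 -154.093
<CH> 2 1450.35 6387.2597 0.00 833951.94 -154.093
<END>"""

dopplers = []
in_block = False
skip = 0
for line in text.splitlines():            # for a file: for line in open(...)
    if 'AUTO,CHANSTATE' in line:
        in_block = True
        skip = 2                          # skip the Time line and the header line
    elif '<END>' in line:
        in_block = False
    elif in_block and skip > 0:
        skip -= 1
    elif in_block and line.startswith('<CH>'):
        dopplers.append(line.split()[2])  # third whitespace-separated item

print(dopplers)                           # ['1449.32', '1450.35', '1450.35']
```

No regex involved; the block boundaries and the two-line skip do all the work.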
I am writing some simple code to read in a file line by line and split the lines. I can create a variable to hold each line as a list. I can then print each index of said list, but I am unable to assign the index to a new variable. Below is the example code.
fileList = [ f for f in os.listdir(outputDir) if f.endswith(".txt") ]
for dayFile in fileList:
    print dayFile
    with open(outputDir+dayFile) as openDayFile:
        for line in openDayFile.readlines():
            print line
If I simply print the line, I get:
20.20 -100.36 0.60
26.98 -102.06 0.00
19.36 -90.72 0.00
16.65 -95.93 0.00
If I add in:
Three = line.split()
print Three
it gives me:
['20.20', '-100.36', '0.60']
['26.98', '-102.06', '0.00']
['19.36', '-90.72', '0.00'] etc...
Now when I assign
A = Three[0]  # there is no error.
When I assign
A = Three[1]
or A = Three[2]
I get the error:
IndexError: list index out of range
If I simply do
print Three[1]
It prints...
-100.36
-102.06
-90.72
-95.93
-99.12
-97.93
-96.72
-96.10
-100.93
-98.14
Can anybody help me understand what the issue is?
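One likely cause (an assumption, since the full file is not shown): some line, such as a trailing blank line or a line with fewer values, splits into fewer items than expected, so indexing past its length raises IndexError on that iteration. A blank line splits into an empty list:

```python
# A normal line splits into three fields:
line = '20.20 -100.36 0.60\n'
assert line.split() == ['20.20', '-100.36', '0.60']

# A blank line splits into an empty list, so indexing it raises IndexError:
blank = '\n'
assert blank.split() == []

# Guard before indexing (hypothetical two-line input standing in for the file):
A = []
for line in ['20.20 -100.36 0.60\n', '\n']:
    fields = line.split()
    if len(fields) == 3:      # skip lines that don't have exactly three fields
        A.append(fields[1])

print(A)                      # only the well-formed line contributes
```

Printing inside the loop "works" because the failing iteration is only reached after the good lines have already printed.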
The background:
Table$Gene=Gene1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.928 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 2208 40 0.755 0.00803 0.739 0.771
5 2256 48 0.769 0.00787 0.754 0.784
6 2208 40 0.755 0.00803 0.739 0.771
Table$Gene=Gene2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.938 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 1000 40 0.744 0.00803 0.739 0.774
#There is a new line ("\n") here too, it just doesn't come out in the code.
What I want seems simple. I want to turn the above file into an output that looks like this:
Gene1 0.755
Gene2 0.744
i.e. each gene, and the last number in the survival column from each section.
I have tried multiple ways: using regular expressions, reading the file in as a list and calling .next(). One example of code that I have tried:
fileopen = open(sys.argv[1]).readlines() # Read in the file as a list.
for index,line in enumerate(fileopen): # Enumerate items in list
    if "Table" in line: # Find the items with "Table" (This will have my gene name)
        line2 = line.split("=")[1] # Parse line to get my gene name
        if "\n" in fileopen[index+1]: # This is the problem section.
            print fileopen[index]
        else:
            fileopen[index+1]
So as you can see in the problem section, I was trying to say in this attempt:
if the next item in the list is a newline, print the item; else, the next line becomes the current line (and then I can split the line to pull out the particular number I want).
If anyone could correct the code so I can see what I did wrong I'd appreciate it.
A bit of overkill, but instead of manually writing a parser for each data item, use an existing package like pandas to read in the file. You just need to write a bit of code to specify the relevant lines in the file. Un-optimized code (it reads the file twice):
import pandas as pd

def genetable(gene):
    l = open('gene.txt').readlines()
    l += "\n" # add newline to end of file in case last line is not newline
    lines = len(l)
    skiprows = -1
    for (i, line) in enumerate(l):
        if "Table$Gene=Gene"+str(gene) in line:
            skiprows = i+1
        if skiprows>=0 and line=="\n":
            skipfooter = lines - i - 1
            df = pd.read_csv('gene.txt', sep='\t', engine='python', skiprows=skiprows, skipfooter=skipfooter)
            # assuming tab separated data given your inputs. change as needed
            # assert df.columns.....
            return df
    return "Not Found"
This will read in a DataFrame with all the relevant data in that file.
You can then do:
genetable(2).survival           # series with all survival rates
genetable(2).survival.iloc[-1]  # last item in survival
The advantage of this is that you have access to all the items, and any mal-formatting of the file will probably be picked up, preventing incorrect values from being used. If it were my own code, I would add assertions on the column names before returning the pandas DataFrame; you want to pick up any parsing errors early so that they do not propagate.
This worked when I tried it:
import sys

filelines = open(sys.argv[1]).readlines()  # read the input file into a list of lines
gene = 1
for i in range(len(filelines)):
    if filelines[i].strip() == "":
        print("Gene" + str(gene) + " " + filelines[i-1].split()[3])
        gene += 1
You could try something like this (I copied your data into foo.dat):
In [1]: with open('foo.dat') as input:
   ...:     lines = input.readlines()
   ...:
Using with makes sure the file is closed after reading.
In [3]: lines = [ln.strip() for ln in lines]
This gets rid of extra whitespace.
In [5]: startgenes = [n for n, ln in enumerate(lines) if ln.startswith("Table")]
In [6]: startgenes
Out[6]: [0, 10]
In [7]: emptylines = [n for n, ln in enumerate(lines) if len(ln) == 0]
In [8]: emptylines
Out[8]: [9, 17]
Using emptylines relies on the fact that the records are separated by lines containing only whitespace.
In [9]: lastlines = [n-1 for n, ln in enumerate(lines) if len(ln) == 0]

In [10]: for first, last in zip(startgenes, lastlines):
   ....:     gene = lines[first].split("=")[1]
   ....:     num = lines[last].split()[3]  # index 3 is the survival column
   ....:     print gene, num
   ....:
Gene1 0.755
Gene2 0.744
here is my solution:
>>> with open('t.txt','r') as f:
...     for l in f:
...         if "Table" in l:
...             gene = l.split("=")[1][:-1]
...         elif l not in ['\n', '\r\n']:
...             surv = l.split()[3]
...         else:
...             print gene, surv
...
Gene1 0.755
Gene2 0.744
Instead of checking for a new line, simply print when you are done reading the file:
lines = open("testgenes.txt").readlines()

table = ""
finalsurvival = 0.0
for line in lines:
    if "Table" in line:
        if table != "": # print previous survival
            print table, finalsurvival
        table = line.strip().split('=')[1]
    else:
        try:
            finalsurvival = line.split()[3] # survival is the fourth column
        except IndexError:
            continue
print table, finalsurvival
I have quite a big text file to parse.
The main pattern is as follows:
step 1
[n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 2
[n2 != n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 3
[(n3 != n1) and (n3 !=n2) lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
in other words:
A separator: step #
Headers of known length (line numbers, not bytes)
Data 3-dimensional shape: nz, ny, nx
Data: Fortran formatting, ~10 floats/line in the original dataset
I just want to extract the data, convert them to floats, put it in a numpy array and ndarray.reshape it to the shapes given.
I've already done a bit of programming... The main idea is
to get the offsets of each separator first ("step X")
skip nX (n1, n2...) lines + 1 to reach the data
read bytes from there all the way to the next separator.
I wanted to avoid regex at first since they would slow things down a lot. It already takes 3-4 minutes just to get the first step done (browsing the file to get the offset of each part).
The problem is that I'm basically using file.tell() method to get the separator positions:
[file.tell() - len(sep) for line in file if sep in line]
The problem is two-fold:
for smaller files, file.tell() gives the right separator positions; for longer files, it does not. I suspect that file.tell() should not be used in such loops, whether with explicit file.readline() or with the implicit for line in file (I tried both). I don't know, but the result is there: with big files, [file.tell() for line in file if sep in line] does not systematically give the position of the line right after a separator.
len(sep) does not give the right offset correction to go back at the beginning of the "separator" line. sep is a string (bytes) containing the first line of the file (the first separator).
Does anyone know how I should parse that?
NB: I find the offsets first because I want to be able to browse inside the file: I might just want the 10th dataset or the 50000th one...
1- Finding the offsets
sep = "step "
with open("myfile") as f_in:
    offsets = [f_in.tell() for line in f_in if sep in line]
As I said, this is working in the simple example, but not on the big file.
New test:
sep = "step "
offsets = []
with open("myfile") as f_in:
    for line in f_in:
        if sep in line:
            print line
            offsets.append(f_in.tell())
The lines printed correspond to the separators, no doubt about it. But the offsets obtained with f_in.tell() do not correspond to the next line. I guess the file is buffered in memory and, as I try to use f_in.tell() in the implicit loop, I do not get the current position but the end of the buffer. This is just a wild guess.
I got the answer: for-loops on a file and tell() do not get along very well. Just like mixing for i in file and file.readline() raises an error.
So, use file.tell() with file.readline() or file.read() only.
Never ever use:
for line in file:
[do stuff]
offset = file.tell()
This is really a shame but that's the way it is.
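Following that conclusion, here is a minimal sketch of the readline()-based approach. It opens the file in binary mode so tell()/seek() are byte-exact; the demo writes a tiny stand-in file first so it is self-contained (the real headers and data will of course differ):

```python
sep = b"step "

# Build a tiny stand-in file; the real file already exists.
with open("myfile", "wb") as f:
    f.write(b"step 1\nheader\n1.0 2.0\nstep 2\nheader\n3.0 4.0\n")

offsets = []
with open("myfile", "rb") as f_in:   # binary mode: tell()/seek() are byte-exact
    while True:
        pos = f_in.tell()            # position *before* reading the line
        line = f_in.readline()
        if not line:                 # empty bytes object means EOF
            break
        if line.startswith(sep):
            offsets.append(pos)      # byte offset of the separator line itself

print(offsets)
# With the file still open, f_in.seek(offsets[i]) jumps straight
# to the i-th "step" line, e.g. to read only the 50000th dataset.
```

Because tell() is called before readline(), each recorded offset points at the start of the separator line, so no len(sep) correction is needed.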
I am new to Python and I am trying to make a program that reads a file and puts the information in its own vectors. The file is an xyz file that looks like this:
45
Fe -0.055 0.033 -0.047
N -0.012 -1.496 1.451
N 0.015 -1.462 -1.372
N 0.000 1.386 1.481
N 0.070 1.417 -1.339
C -0.096 -1.304 2.825
C 0.028 -1.241 -2.739
C -0.066 -2.872 1.251
C -0.0159 -2.838 -1.205
Starting from the 3rd line, I need to place each column in its own vector. So far I have this:
file=open("Question4.xyz","r+")
A = []
B = []
C = []
D = []
counter=0
for line in file:
    if counter>2: #information on particles start on the 2nd line
        a,b,c,d=line.split()
        A.append(a)
        B.append(float(b))
        C.append(float(c))
        D.append(float(d))
    counter=counter+1
I am getting this error:
File "<pyshell#72>", line 3, in <module>
a,b,c,d=line.split()
ValueError: need more than 0 values to unpack
Any ideas on where I am going wrong?
Thanks in advance!
It looks like you have lines in your file that don't actually result in 4 items on splitting. Add a condition for that:
for line in file:
    spl = line.strip().split()
    if len(spl) == 4: # this will take care of both empty lines and
                      # lines containing greater than or less than four items
        a, b, c, d = spl
        A.append(a)
        B.append(float(b))
        C.append(float(c))
        D.append(float(d))
Would you happen to have an empty line somewhere, by any chance (or one with only a '\n')?
You could force
if counter >= 2:
    if line.strip():
        (a,b,c,d) = line.strip().split()
An advantage of not checking whether your split line has a len of 4 is that it won't silently skip the line if it doesn't have the right number of fields (like you experienced yourself with the empty lines at the end of your files): you'll get an exception instead, which forces you to double-check your input (or your logic).