Using regex for extracting values from a text file - python

I have a text file from which I want to extract values at a specific distance from a string whenever the string is encountered. I'm completely new to this and got to know that these kinds of pattern matching problems can be solved using regular expressions.
<BEGIN> AUTO,CHANSTATE
<CH> Time: 2002-07-04
<CH> Chan Doppler Code Track CdDoppler CodeRange
<CH> 0 1449.32 2914.6679 0.00 833359.36 -154.093
<CH> 1 1450.35 2414.8292 0.00 833951.94 -154.093
<CH> 2 1450.35 6387.2597 0.00 833951.94 -154.093
<END>
<BEGIN> AUTO,CHSTAT
(it goes on)---------------------
The above structure is repeated multiple times inside the file. Is there any way I can extract the Doppler values (1449.32, 1450.35, 1450.35) and store them in a Python list? Since each block starts with "AUTO,CHANSTATE", is there a way to use it as a reference point to get the values? Or any other way which I'm probably unable to think of.
Any help would be really appreciated.

A better approach is to parse the file line by line: split each line on whitespace and read the Doppler value at list index 2. The advantage of this approach is that you can also access the other parameter values if required in the future. Try this:
with open("sample.txt") as file:      # use file to refer to the file object
    for line in file:                 # parse the file line by line
        data = line.split()           # split the line over whitespace
        try:
            float(data[2])            # raises ValueError if the field is not numeric
            print("Doppler = ", data[2])
        except (IndexError, ValueError):
            pass                      # too few fields, or a non-numeric row
Output:
Doppler = 1449.32
Doppler = 1450.35
Doppler = 1450.35
Check this for demo: https://www.online-python.com/mgE32OXJW8

If you really want/need to use regex you could do this.
Code:
import re
text = '''<BEGIN> AUTO,CHANSTATE
<CH> Time: 2002-07-04
<CH> Chan Doppler Code Track CdDoppler CodeRange
<CH> 0 1449.32 2914.6679 0.00 833359.36 -154.093
<CH> 1 1450.35 2414.8292 0.00 833951.94 -154.093
<CH> 2 1450.35 6387.2597 0.00 833951.94 -154.093
<END>
<BEGIN> AUTO,CHSTAT
(it goes on)---------------------'''
find_this = re.findall(r'<CH>.*?[0-9].*?\s.*?([0-9].*?)\s', text)  # raw string, so \s reaches re unmangled
print(find_this)
['1449.32', '1450.35', '1450.35']
There are, however, other ways to do this without re, as others have pointed out.

Or any other way...
No regex, just string functions:
- iterate over the lines in the file
- check if the line (starts with, contains, or equals) '<BEGIN> AUTO,CHANSTATE'
- when it does, skip the next two lines
- keep iterating, and for each line that starts with '<CH>', split the line on whitespace and save the third item of the result (result[2])
- continue till a line (starts with, contains, or equals) '<END>'
- do it all over again.
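The steps above can be sketched as follows (the function and file names are illustrative; the question doesn't give any):

```python
def extract_doppler(path):
    """Walk the file block by block and collect column 2 of each <CH> data row."""
    doppler = []
    with open(path) as f:
        in_block = False
        skip = 0
        for line in f:
            if line.startswith('<BEGIN> AUTO,CHANSTATE'):
                in_block = True
                skip = 2                      # skip the Time and column-header lines
            elif in_block and line.startswith('<END>'):
                in_block = False              # block done; wait for the next <BEGIN>
            elif in_block and skip > 0:
                skip -= 1
            elif in_block and line.startswith('<CH>'):
                result = line.split()         # e.g. ['<CH>', '0', '1449.32', ...]
                doppler.append(float(result[2]))
    return doppler
```

Calling extract_doppler('sample.txt') on the data shown above would return [1449.32, 1450.35, 1450.35], and the "<BEGIN> AUTO,CHSTAT" section is ignored because its header doesn't match.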

Related

Getting a list index out of range, while being able to print the exact index

I am writing some simple code to read in a file by line and split the lines. I can create a variable to hold each line as a list. I can then print each index of said list, but I am unable to assign the index to a new variable. Below is the example code.
fileList = [f for f in os.listdir(outputDir) if f.endswith(".txt")]
for dayFile in fileList:
    print dayFile
    with open(outputDir + dayFile) as openDayFile:
        for line in openDayFile.readlines():
            print line
If I simply print the line I get.
20.20 -100.36 0.60
26.98 -102.06 0.00
19.36 -90.72 0.00
16.65 -95.93 0.00
If I add in:
Three = line.split()
print Three
it gives me:
['20.20', '-100.36', '0.60']
['26.98', '-102.06', '0.00']
['19.36', '-90.72', '0.00'] etc...
Now when I assign
A = Three[0]  # there is no error.
When I assign
A = Three[1]
or A = Three[2]
I get the error:
IndexError: list index out of range
If I simply do
print Three[1]
It prints:
-100.36
-102.06
-90.72
-95.93
-99.12
-97.93
-96.72
-96.10
-100.93
-98.14
Can anybody help me understand what the issue is?
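One likely explanation, sketched with made-up lines (the real file wasn't posted): any line that splits into fewer fields than expected, such as a short trailing line, still has an index 0 after split() but blows up on index 1:

```python
lines = ["20.20 -100.36 0.60",
         "26.98 -102.06 0.00",
         "end"]                      # hypothetical short last line

for line in lines:
    three = line.split()
    a = three[0]                     # fine for every line above
    try:
        b = three[1]                 # raises IndexError on the short line
    except IndexError:
        print("short line:", repr(line))
```

So the assignment itself is not the problem; one of the file's lines simply doesn't have a second or third column.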

log file parsing python

I have a log file with an arbitrary number of lines. All I need to extract is the one line of data from the log file that starts with the string "Total". I do not want any other lines from the file.
How do I write a simple python program for this?
This is how my input file looks
TestName id eno TPS GRE FNP
Test 1205 1 0 78.00 0.00 0.02
Test 1206 1 0 45.00 0.00 0.02
Test 1207 1 0 73400 0.00 0.02
Test 1208 1 0 34.00 0.00 0.02
Totals 64 0 129.61 145.64 1.12
I am trying to get an output file which looks like
TestName id TPS GRE
Totals 64 129.61 145.64
Ok, so I wanted only the 1st, 2nd, 4th and 5th columns from the input file, not the others. I am trying list[index] to achieve this but getting an IndexError (list index out of range). Also, the spacing between columns is not consistent, so I am not sure how to split the columns and select the ones I want. Can somebody please help me with this? Below is the program I used:
newFile = open('sana.log','r')
for line in newFile.readlines():
    if ('TestName' in line) or ('Totals' in line):
        data = line.split('\t')
        print data[0]+data[1]
theFile = open('thefile.txt','r')
FILE = theFile.readlines()
theFile.close()
printList = []
for line in FILE:
    if ('TestName' in line) or ('Totals' in line):
        # here you may want to do some splitting/concatenation/formatting to your string
        printList.append(line)
for item in printList:
    print item  # or write it to another file... or whatever
for line in open('filename.txt', 'r'):
    if line.startswith('TestName') or line.startswith('Totals'):
        fields = line.rsplit(None, 5)
        print '\t'.join(fields[:2] + fields[3:5])  # TestName, id, TPS, GRE
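For comparison, a Python 3 sketch of the same column selection; split() with no argument collapses any run of spaces or tabs, so the inconsistent spacing stops mattering (the function name and column indices are assumptions based on the layout shown in the question):

```python
def select_columns(lines, wanted=(0, 1, 3, 4)):   # TestName, id, TPS, GRE
    """Keep only the header and Totals rows, then pick the wanted columns."""
    out = []
    for line in lines:
        if line.startswith(('TestName', 'Totals')):
            fields = line.split()                 # splits on any whitespace run
            out.append('\t'.join(fields[i] for i in wanted))
    return out
```

Called with the lines of sana.log, this would keep just the header row and the Totals row.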

How do I use index iteration to search in a list in Python?

This is for an assignment which I've nearly finished. So the goal is to be able to search the list based on CID, which is the first value in each line of the txt file.
The text file contains the following records, and is tab delimited:
0001 001 -- -- 1234.00 -- -- 148.08 148.08 13.21 1395.29
0002 011 -- 100.00 12000.00 -- 5.00 1440.00 1445.00 414.15 13959.15
0003 111 100.00 1000.00 1000.00 8.00 50.00 120.00 178.00 17.70 2295.70
0004 110 1200.00 100.00 -- 96.00 5.00 -- 101.00 6.15 1407.15
0005 101 100.00 -- 1300.00 8.00 -- 156.00 164.00 15.60 1579.60
0006 100 1200.00 -- -- 96.00 -- -- 96.00 5.40 1301.40
0007 010 -- 1500.00 -- -- 75.00 -- 75.00 2.25 1577.25
0008 001 -- -- 1000.00 -- -- 120.00 120.00 9.00 1129.00
0009 111 1000.00 1000.00 1000.00 80.00 50.00 120.00 250.00 28.50 3278.50
0010 111 100.00 10000.00 1000.00 8.00 500.00 120.00 628.00 123.90 11851.90
Text file can be found here.
I'm new to Python, and haven't got my head around it yet. I need to be able to somehow dynamically fill in lines[0] with other index positions. For example...'0002' is found in index [0], 0002 is found if I change to lines[1] and so forth. I've tried various whiles, enumerating, list-comprehension, but most of that is beyond my understanding. Or maybe there's an easier way to display the line for a particular 'customer'?
with open('customer.txt', 'r') as file:
    for line in file:
        lines = file.read().split('\n')

search = input("Please enter a CID to search for: ")
if search in lines[0]:
    print(search, "was found in the database.")
    CID = lines[0]
    print(CID)
else:
    print(search, "does not exist in the database.")
Not sure, are the lines supposed to be split into fields somehow?
search = input("Please enter a CID to search for: ")
with open('customer.txt', 'r') as file:
    for line in file:
        fields = line.split('\t')
        if fields[0] == search:
            print(search, "was found in the database.")
            CID = fields[0]
            print(line)
            break
    else:
        print(search, "does not exist in the database.")
Here's how I think you should solve this problem. Comments below the code.
_MAX_CID = 9999

while True:
    search = input("Please enter a CID to search for: ")
    try:
        cid = int(search)
    except ValueError:
        print("Please enter a valid number")
        continue
    if not 0 <= cid <= _MAX_CID:
        print("Please enter a number within the range 0..%d" % _MAX_CID)
        continue
    else:
        # number is good
        break

with open("customer.txt", "r") as f:
    for line in f:
        if not line.strip():
            continue  # completely blank line so skip it
        fields = line.split()
        try:
            line_cid = int(fields[0])
        except ValueError:
            continue  # invalid line so skip it
        if cid == line_cid:
            print("%d was found in the database." % cid)
            print(line.strip())
            break
    else:
        # NOTE! This "else" goes with the "for"! This case
        # will be executed if the for loop runs to the end
        # without breaking. We break when the CID matches
        # so this code runs when CID never matched.
        print("%d does not exist in the database." % cid)
Instead of searching for a text match, we are parsing the user's input as a number and searching for a numeric match. So, if the user enters 0, a text match would match every single line of your example file, but a numeric match won't match anything!
We take input, then convert it to an integer. Then we check it to see if it makes sense (isn't negative or too large). If it fails any test we keep looping, making the user re-enter. Once it's a valid number we break out of the loop and continue. (Your teacher may not like the way I use break here. If it makes your teacher happier, add a variable called done that is initially set to False, and set it to True when the input validates, and make the loop while not done:).
You seem a bit confused about input. When you open a file, you get back an object that represents the opened file. You can do several things with this object. One thing you can do is use method functions like .readlines() or .read(), but another thing you can do is just iterate it. To iterate it you just put it in a for loop; when you do that, each loop iteration gets one line of input from the file. So my code sample sets the variable line to a line from the file each time. If you use the .read() method, you slurp the entire file into memory, all at once, which isn't needed; and then your loop isn't looping over lines of the file. Usually you should use the for line in f: sort of loop; sometimes you need to slurp the file with f.read(); you never do both at the same time.
It's a small point, but file is a built-in type in Python, and by assigning to that you are rebinding the name, and "shadowing" the built-in type. Why not simply use f as I did in my program? Or, use something like in_file. When I have both an input file and an output file at the same time I usually use in_file and out_file.
Once we have the line, we can split it into fields using the .split() method function. Then the code forces the 0th field to an integer and checks for an exact match.
This code checks the input lines, and if they don't work, silently skips the line. Is that what you want? Maybe not! Maybe it would be better for the code to blow up if the database file is malformed. Then instead of using the continue statement, maybe you would want to put in a raise statement, and raise an exception. Maybe define your own MalformedDatabase exception, which should be a subclass of ValueError I think.
This code uses a pretty unique feature of Python, the else statement on a for loop. This is for code that is only executed when the loop runs all the way to the end, without an early exit. When the loop finds the customer ID, it exits early with a break statement; when the customer ID is never found, the loop runs to the end and this code executes.
This code will actually work okay with Python 2.x, but the error checking isn't quite adequate. If you run it under Python 3.x it is pretty well-checked. I'm assuming you are using Python 3.x to run this. If you run this with Python 2.x, enter xxx or crazy junk like 0zz and you will get different exceptions than just the ValueError being tested! (If you actually wanted to use this with Python 2.x, you should change input() to raw_input(), or catch more exceptions in the try/except.)
Another approach. Since the file is tab delimited, you can use the csv module as well.
This approach, unlike #gnibbler's answer, will read the entire file and then search its contents (so it will load the file in memory).
import csv

with open('customer.txt') as file:
    reader = csv.reader(file, delimiter='\t')
    lines = list(reader)

search = input('Please enter the id: ')
result = [line for line in lines if search in line]
print('\t'.join(result[0]) if result else 'Not Found')

How to split and parse a big text file in python in a memory-efficient way?

I have quite a big text file to parse.
The main pattern is as follows:
step 1
[n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 2
[n2 != n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 3
[(n3 != n1) and (n3 !=n2) lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
in other words:
A separator: step #
Headers of known length (line numbers, not bytes)
Data 3-dimensional shape: nz, ny, nx
Data: Fortran formatting, ~10 floats/line in the original dataset
I just want to extract the data, convert them to floats, put it in a numpy array and ndarray.reshape it to the shapes given.
I've already done a bit of programming... The main idea is
to get the offsets of each separator first ("step X")
skip nX (n1, n2...) lines + 1 to reach the data
read bytes from there all the way to the next separator.
I wanted to avoid regex at first since these would slow things down a lot. It already takes 3-4 minutes just to get the first step done (browsing the file to get the offset of each part).
The problem is that I'm basically using file.tell() method to get the separator positions:
[file.tell() - len(sep) for line in file if sep in line]
The problem is two-fold:
for smaller files, file.tell() gives the right separator positions; for longer files, it does not. I suspect that file.tell() should not be used inside a loop over the file, whether via explicit file.readline() or the implicit for line in file (I tried both). I don't know why, but the result is there: with big files, [file.tell() for line in file if sep in line] does not systematically give the position of the line right after a separator.
len(sep) does not give the right offset correction to go back to the beginning of the "separator" line. sep is a string (bytes) containing the first line of the file (the first separator).
Does anyone knows how I should parse that?
NB: I find the offsets first because I want to be able to browse inside the file: I might just want the 10th dataset or the 50000th one...
1- Finding the offsets
sep = "step "
with open("myfile") as f_in:
    offsets = [f_in.tell() for line in f_in if sep in line]
As I said, this is working in the simple example, but not on the big file.
New test:
sep = "step "
offsets = []
with open("myfile") as f_in:
    for line in f_in:
        if sep in line:
            print line
            offsets.append(f_in.tell())
The line printed corresponds to the separators, no doubt about it. But the offsets obtained with f_in.tell() do not correspond to the next line. I guess the file is buffered in memory and as I try to use f_in.tell() in the implicit loop, I do not get the current position but the end of the buffer. This is just a wild guess.
I got the answer: for-loops on a file and tell() do not get along very well, just like mixing for i in file with file.readline() raises an error.
So, use file.tell() only with file.readline() or file.read().
Never ever use:
for line in file:
    [do stuff]
    offset = file.tell()
This is really a shame, but that's the way it is.
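A minimal sketch of the readline()-only approach; opening in binary mode makes tell() return plain byte offsets you can later seek() back to (the separator comes from the question, the function name is made up):

```python
def find_offsets(path, sep=b"step "):
    """Byte offset of the start of every separator line, via readline()+tell()."""
    offsets = []
    with open(path, "rb") as f:       # binary mode: tell() is a real byte offset
        while True:
            pos = f.tell()            # safe here: no for-loop iteration involved
            line = f.readline()
            if not line:
                break                 # end of file
            if sep in line:
                offsets.append(pos)
    return offsets
```

Afterwards f.seek(offsets[k]) jumps straight to the k-th dataset, so only the block you actually want needs to be read and reshaped.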

How to get python to detect a space in a dictionary from an imported txt file?

Okay, the problem I have here is that I've got this .txt file Python imports into a dictionary.
What it does is take the values from the text file, which are in this specific format:
a 0.01
b 0.11
c 1.11
d 0.02
^ (shown as code formatting because it wouldn't stack like the .txt otherwise; it's not actually code)
and then puts them into a dictionary like this:
d = {'a': '0.01', 'b': '0.11', etc....}
Well, I'm using this so that it will change whatever value the user inputs (Later in the script) into whatever is defined inside the dictionary.
The problem is if I try and make it incorporate a space, it just doesn't work.
Like, I finished the letters, and their corresponding values in the .txt and began going onto symbols:
For example:
& &
* *
(so that when entered into the dictionary, the corresponding values are printed when I have it print the translated message) (I could change them up, but I decided to leave them as they are)
The problem arises when I try and have it make a space in the user input correspond to a space or another value in the translated message.
I tried leaving a row blank in my .txt, so that (space) is to (space)
But later, when it tried to load the .txt, it gave me an error, saying that: "need more than one value to unpack"
Can someone help me out?
EDIT: Adding code as requested.
TRvalues = {}
with open(r"C:\Users\Owatch\Documents\Python\Unisung Net Send\nsed.txt") as f:
    for line in f:
        (key, val) = line.split()
        TRvalues[key] = val

if TRvalues == False:
    print("\n\tError encountered while attempting to load the .txt file")
    print("\n\t The file does not contain any values")
else:
    print("Dictionary Loaded-")
Sample text file:
a 0.01
b 0.11
c 1.11
d 0.02
e 0.22
f 2.22
g 0.03
h 0.33
i 3.33
j 0.04
k 0.44
l 4.44
m 0.05
n 0.55
o 5.55
p 0.06
q 0.66
r 6.66
s 0.07
t 0.77
u 7.77
v 0.08
w 0.88
x 8.88
y 0.09
z 0.99
I get this error when I attempt to run the script:
Traceback (most recent call last):
  File "C:/Users/Owatch/Documents/Python/Unisung Net Send/Input Encryption 0.02.py", line 17, in <module>
    (key, val) = line.split()
ValueError: need more than 0 values to unpack
EDIT: Thanks for downvoting everybody! I'm now no longer able to ask questions.
Believe it or not I DO know the rules for this website, and DID research this before asking on Stack Overflow. It IS helpful for other people as well. What a nice community. Hopefully the people who answered did not downvote it. I appreciate what they did.
If I understand correctly, you're trying to use a space both as an unquoted value and as a delimiter, which won't work. I'd use the csv module and its quoting rules. For example (assuming you're using Python 3 from your print functions):
import csv

with open('nsed.txt', newline='') as f:
    reader = csv.reader((line.strip() for line in f), delimiter=' ')
    TRvalues = dict(reader)

print(TRvalues)
with an input file of
a 0.01
b 0.11
c 1.11
d 0.02
" " " "
gives
{' ': ' ', 'a': '0.01', 'b': '0.11', 'c': '1.11', 'd': '0.02'}
It seems that what you're in effect trying to do is:
In [177]: "a b".split()
Out[177]: ['a', 'b']
In [178]: " ".split()
Out[178]: []
i.e. have a blank line and expect the spaces to be preserved when doing a split(), but that won't work.
Therefore k, v = line.split() won't work either.
EDIT Presuming I understand the problem.
Perhaps you need to encode the values when you put them in the file and then decode them on the way out.
A naive approach might be to use urllib.quote and urllib.unquote.
On the write:
In [188]: urllib.quote(' ')
Out[188]: '%20'
which makes it a bit tricky for the special case of space, unless you quote all values on the write to the file:
fd.write("%s %s\n" % (urllib.quote(val1), urllib.quote(val2)))
Then on the read from the file:
k, v = map(urllib.unquote, line.split())
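In Python 3 the same functions live in urllib.parse; a sketch of the round trip (the sample pairs are made up):

```python
from urllib.parse import quote, unquote

pairs = {' ': ' ', 'a': '0.01', 'b': '0.11'}   # hypothetical translation table

# write side: percent-encode both fields so a literal space survives as %20
encoded = ["%s %s" % (quote(k, safe=''), quote(v, safe=''))
           for k, v in pairs.items()]

# read side: split on the delimiter space, then decode each field
decoded = dict(tuple(map(unquote, line.split(' '))) for line in encoded)
```

Because the space inside a value is written as %20, line.split() never sees a bare space except as the delimiter, and unquote restores the original value on the way back in.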
