Help parsing text file in python - python

I have been struggling with this one for some time now. I have many text files in a specific format, from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parsing parameters so that I get all the information correctly.
The format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of the whitespace, and 2) separating the fields from each other. See my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',(tmp_PA,tmp_K,tmp_DETAILS))
At the minute, I can pull the K and PA fields with no problem using this. However, my DETAILS field only gets one word; I need the entire sentence, or at least 25 characters of it.
Thanks very much for reading and I hope you can help! :)
K

You are splitting the whole line into words. You need to split it into the first word, the second word, and the rest, as in line.split(None, 2).
I would probably use regular expressions, with the opposite logic: if the line starts with a number 1 through 5, use it; otherwise skip it. Like:
import re
# K (one digit 1-5), PA (digits), DETAILS (rest of the line, trailing whitespace dropped)
pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r') # No calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using a prepared statement. Parsing SQL is orders of magnitude slower than executing it.
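To make the line.split(None, 2) suggestion concrete, here is a minimal sketch. The helper name parse_records is hypothetical, and it assumes that K is always a single digit from 1 to 5 and that PA consists only of digits, as in the sample in the question.

```python
def parse_records(lines):
    """Return (k, pa, details) tuples for the lines that look like data rows."""
    valid_k = {"1", "2", "3", "4", "5"}  # assumption: K is one digit, 1 through 5
    records = []
    for line in lines:
        # At most two splits: first word, second word, and the rest of the line.
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] in valid_k and parts[1].isdigit():
            # The remainder keeps its trailing newline, so strip it.
            records.append((parts[0], parts[1], parts[2].rstrip()))
    return records
```

Each tuple can then be reordered and passed to the existing cu.execute call; blank lines, the header, and the unwanted trailing lines are skipped because they fail the checks.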

If I understand your file format correctly, you can try this script:
filename = 'bug.txt'
f = open(filename, 'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        tokens = line.split()
        if tokens == ['K','PA','DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None,2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K,PA,tokens[2]))
f.close()
for r in records:
    print(r) # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.

Here is my attempt using re
import re
stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re
stuff = open("source", "r").readlines()
# Create a regular expression using named subpatterns.
# 'first', 'second' and 'third' are our own tags;
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    result = juicy_info.search(line)
    if result:
        # Do stuff with the data here; just use the tags we declared earlier.
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))

import re
reg = re.compile(r'K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3 * r'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')
with open('XX.txt') as f:
    mat = reg.search(f.read())
for tripl in ((2,1,3),(5,4,6),(8,7,9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches the following characters:
blank , '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol that represents more than what is to be matched, at the risk of matching stray newlines in places where they shouldn't be.
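A small illustration of that risk, as a sketch only: the two-line sample text is made up for the demonstration, with a first row that is missing its DETAILS field.

```python
import re

# Hypothetical input: the first row has no DETAILS column.
text = "2 4565434\n5 4879870 as well as this one\n"

loose = re.compile(r'^([1-5])\s+(\d+)\s+(.+)$', re.MULTILINE)
strict = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]+(.+)$', re.MULTILINE)

# \s+ is allowed to consume the newline, so the first row swallows the second.
loose_rows = loose.findall(text)

# [ \t]+ cannot cross a line boundary, so only the complete row matches.
strict_rows = strict.findall(text)
```

With \s the second line ends up as the DETAILS of the first row; with [ \t] the incomplete row is simply skipped.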
Edit
It may be sufficient to do:
import re
reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$',re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2,1,3))

Related

python how to increment vars in regex replacements

I want to replace multiple patterns in a file with regex.
This is my (working) code so far:
import re
with open('test.txt', "r") as fp:
    text = fp.read()
result = re.sub(r'pattern', 'replacement', str)
result2 = re.sub(r'anotherpattern', 'anotherreplacement2', result)
...
with open('results.txt', 'w') as fp:
    fp.write(result_x)
This works. But it seems to be inelegant to increment the vars names manually in every new line. How can I increment them better? It must be a for loop, I think. But how?
You do not need the previous result once you used it. You can store the new result in the same variable:
text = re.sub(r'pattern1', 'replacement1', text) # str() is a string constructor!
text = re.sub(r'pattern2', 'replacement2', text)
You can also have a list of patterns and replacements and loop through it:
to_replace = [('pattern1', 'replacement1'), ('pattern2', 'replacement2')]
for pattern,replacement in to_replace:
    text = re.sub(pattern, replacement, text)
Or in an even more Pythonic way:
to_replace = [('pattern1', 'replacement1'), ('pattern2', 'replacement2')]
for pr in to_replace:
    text = re.sub(*pr, string=text)
I don't know Python too well, but I think if you want to combine the patterns,
you could do it in a single pass using a callback.
Example:
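def repl(m):
    # sr1, sr2 and sr3 are the replacement strings, assumed to be defined elsewhere.
    # A group that did not take part in the match is None, not ''.
    if m.group(1) is not None:
        return sr1
    if m.group(2) is not None:
        return sr2
    if m.group(3) is not None:
        return sr3
    return m.group(0)
print(re.sub('(stuff1)|(stuff2)|(stuff3)', repl, text))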
And, it could also be looped inside the callback.
For instance, a var holding the fixed number of patterns
which is looped to test the match object.
There must be a replacement array the same size (and position) of the
number of groups in the regex.
How much of a performance increase will this give you?
Doing this in a single pass means the text is scanned once instead of once per pattern, so the saving grows with the number of patterns.
Note that it is almost an error to re-examine the same text over and over again. Imagine searching the Library of Congress one word at a time, starting from the beginning each time. How long would that take?
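The looped variant of the callback could be sketched roughly as follows. The helper name replace_all is hypothetical, and the sketch assumes that the individual patterns contain no capturing groups of their own and that the replacements are plain strings (no backreferences).

```python
import re

def replace_all(text, replacements):
    """Apply every (pattern, replacement) pair in one pass over text."""
    # Wrap each pattern in its own group so the match tells us which one hit.
    combined = re.compile("|".join("(%s)" % pattern for pattern, _ in replacements))

    def repl(m):
        # m.lastindex is the number of the group that matched (1-based).
        return replacements[m.lastindex - 1][1]

    return combined.sub(repl, text)
```

Note that a single pass is not always equivalent to a chain of re.sub calls: text produced by one replacement is never re-examined by a later pattern, so swapping two words works here but would not with sequential substitution.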

How to read variating data into dictionary?

I need to extract the name of the constants and their corresponding values from a .txt file into a dictionary. Where key = NameOfConstants and Value=float.
The start of the file looks like this:
speed of light 299792458.0 m/s
gravitational constant 6.67259e-11 m**3/kg/s**2
Planck constant 6.6260755e-34 J*s
elementary charge 1.60217733e-19 C
How do I get the names of the constants easily?
This is my attempt:
with open('constants.txt', 'r') as infile:
    file1 = infile.readlines()
constants = {i.split()[0]: i.split()[1] for i in file1[2:]}
I'm not getting it right with the split(), and I need a little correction!
{' '.join(line.split()[:-2]):' '.join(line.split()[-2:]) for line in lines}
I couldn't work out from your text file how many spaces separate the columns, so the code below scans each line character by character instead. Please have a look; it worked on the file shown above.
import string
valid_char = string.ascii_letters + ' '
# include 'e', '-' and '+' so that values such as 6.67259e-11 are read in full
valid_numbers = string.digits + '.e-+'
constants = {}
with open('constants.txt') as file1:
    for line in file1.readlines():
        key = ''
        for index, char in enumerate(line):
            if char in valid_char:
                key += char
            else:
                key = key.strip()
                break
        value = ''
        for char in line[index:]:
            if char in valid_numbers:
                value += char
            else:
                break
        constants[key] = float(value)
print(constants)
Have you tried using regular expressions?
For example,
([A-Za-z]|\s)*
matches the first part of a line, up to where the digits of the constant begin.
Python provides a very good tutorial on regular expressions (regex)
https://docs.python.org/2/howto/regex.html
You can try out your regex online as well
https://regex101.com/
with open('constants.txt', 'r') as infile:
    lines = infile.readlines()
constants = {' '.join(line.split()[:-2]):float(' '.join(line.split()[-2:-1])) for line in lines[2:]}
The slice lines[2:] skips the first two lines of the file, which are not needed.
This would best be solved using a regexp.
Focussing on your question (how to get the names) and your desires (have something shorter):
import re
# Regular expression fetches all characters
# until the first occurrence of a number
REGEXP = re.compile(r'^([a-zA-Z\s]+)\d.*$')
with open('tst.txt', 'r') as f:
    for line in f:
        match = REGEXP.match(line)
        if match:
            # On a match the part between parentheses
            # is copied to the first group
            name = match.group(1).strip()
        else:
            # Raise something, or change regexp :)
            pass
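Extending the same idea to capture the value as well, a sketch might look like this. The names NUMBER, LINE and parse_constants are made up for the example, and it assumes each data line is "name, whitespace, number, optional unit" with no digits in the name; lines that do not fit (such as header lines) are skipped.

```python
import re

# A decimal number with an optional fraction and exponent, e.g. 6.67259e-11
NUMBER = r'[-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?'
# Name (as short as possible), whitespace, number, then whitespace or end of line
LINE = re.compile(r'^\s*(.*?)\s+(' + NUMBER + r')(?:\s|$)')

def parse_constants(lines):
    """Map each constant's name to its value as a float."""
    constants = {}
    for line in lines:
        m = LINE.match(line)
        if m:
            constants[m.group(1)] = float(m.group(2))
    return constants
```

Called as parse_constants(open('constants.txt')), this would return a dictionary keyed by the full multi-word names.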
What about re.split-
import re
lines = open(r"C:\txt.txt",'r').readlines()
for line in lines:
    data = re.split(r'\s{3,}',line)
    print("{0} : {1}".format(data[0],''.join(data[1:])))
Or use oneliner to make dictionary-
{k:v.strip() for k,v in [(re.split(r'\s{3,}',line)[0],''.join(re.split(r'\s{3,}',line)[1:])) for line in open(r"C:\txt.txt",'r').readlines() ]}
Output-
gravitational constant : 6.67259e-11m**3/kg/s**2
Planck constant : 6.6260755e-34J*s
elementary charge : 1.60217733e-19C
Dictionary-
{'Planck constant': '6.6260755e-34J*s', 'elementary charge': '1.60217733e-19C', 'speed of light': '299792458.0m/s', 'gravitational constant': '6.67259e-11m**3/kg/s**2'}

Comparing multiple file items using re

Currently I have a script that finds all the lines across multiple input files that have something in the format of
Matches: 500 (54.3 %) and prints out the top 10 highest matches in percentage.
I want to be able to have it also output the top 10 lines for score ex: Score: 4000
import re
def get_values_from_file(filename):
    f = open(filename)
    winpat = re.compile("([\d\.]+)\%")
    xinpat = re.compile("[\d]") #ISSUE, is this the right regex for it? Score: 500****
    values = []
    scores = []
    for line in f.readlines():
        if line.find("Matches") >=0:
            percn = float(winpat.findall(line)[0])
            values.append(percn)
        elif line.find("Score") >=0:
            hey = float(xinpat.findall(line)[0])
            scores.append(hey)
    return (scores,values)
all_values = []
all_scores = []
for filename in ["out0.txt", "out1.txt"]:#and so on
    values = get_values_from_file(filename)
    all_values += values
    all_scores += scores
all_values.sort()
all_values.reverse()
all_scores.sort() #also for scores
all_scores.reverse()
print(all_values[0:10])
print(all_scores[0:10])
Is my regex for the score format correct? I believe that's where I am having the issue, as it doesn't output both correctly.
Any thoughts? Should I split it into two functions?
Thank you.
Is my regex for the score format correct?
No, it should be r"\d+".
You don't need []. Those brackets establish a character class representing all of the characters inside the brackets. Since you only have one character type inside the bracket, they do nothing.
You only match a single character. You need a * or a + to match a sequence of characters.
You have an unescaped backslash in your string. Use the r prefix to allow the regular expression engine to see the backslash.
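The difference between the two patterns is easy to see on the sample line from the question. This is only a small demonstration; the variable names are arbitrary.

```python
import re

line = "Score: 4000"

# One digit per match: the number is chopped into separate characters.
single_digits = re.findall(r"\d", line)

# One or more digits per match: the number comes back whole.
whole_numbers = re.findall(r"\d+", line)
```

With the original pattern, findall(line)[0] is just the first digit, which is why the scores came out wrong.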
Commentary:
If it were me, I'd let the regular expression do all of the work, and skip line.find() altogether:
#UNTESTED
def get_values_from_file(filename):
    winpat = re.compile(r"Matches:\s*\d+\s*\(([\d\.]+)\s*\%\)")
    xinpat = re.compile(r"Score:\s*([\d]+)")
    values = []
    scores = []
    # Note: "with open() as f" automatically closes f
    with open(filename) as f:
        # Note: "for line in f" more memory efficient
        # than "for line in f.readlines()"
        for line in f:
            # search() finds the pattern anywhere in the line
            win = winpat.search(line)
            xin = xinpat.search(line)
            # group(1) is the captured number; group(0) would be the whole match
            if win: values.append(float(win.group(1)))
            if xin: scores.append(float(xin.group(1)))
    return (scores,values)
Just for fun, here is a version of the routine which calls re.findall exactly once per file:
# TESTED
# Compile this only once to save time
pat = re.compile(r'''
    (?mx) # multi-line, verbose
    (?:Matches:\s*\d+\s*\(([\d\.]+)\s*%\)) # "Matches: 300 (43.2%)"
    |
    (?:Score:\s*(\d+)) # "Score: 4000"
''')
def get_values_from_file(filename):
    with open(filename) as f:
        values, scores = zip(*pat.findall(f.read()))
    values = [float(value) for value in values if value]
    scores = [float(score) for score in scores if score]
    return scores, values
No. xinpat will only match single digits, so findall() will return a list of single digits, which is a bit messy. Change it to
xinpat = re.compile(r"[\d]+")
Actually, you don't need the square brackets here, so you could simplify it to
xinpat = re.compile(r"\d+")
BTW, the names winpat and xinpat are a bit opaque. The pat bit is OK, but win and xin? And hey isn't great either. But I guess xin and hey are just temporary names you made up when you decided to expand the program.
Another thing I just noticed, you don't need to do
all_values.sort()
all_values.reverse()
You can (and should) do that in one hit:
all_values.sort(reverse=True)
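If only the top ten are needed, sorting the whole list is not strictly necessary. The sketch below swaps in heapq.nlargest from the standard library, which is a different technique from the sort(reverse=True) call above; top_n is a hypothetical helper name.

```python
import heapq

def top_n(values, n=10):
    """Return the n largest values, largest first, without sorting everything."""
    return heapq.nlargest(n, values)
```

For example, top_n(all_values) and top_n(all_scores) would replace the sort-then-slice steps in the original script.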

Python next substring search

I am transmitting a message with a pre/postamble multiple times. I want to be able to extract the message between two valid pre/postambles. My current code is
print(msgfile[msgfile.find(preamble) + len(preamble):msgfile.find(postamble, msgfile.find(preamble))])
The problem is that if the postamble is corrupt, it will print all data between the first valid preamble and the next valid postamble. An example received text file would be:
garbagePREAMBLEmessagePOSTcMBLEgarbage
garbagePRdAMBLEmessagePOSTAMBLEgarbage
garbagePREAMBLEmessagePOSTAMBLEgarbage
and it will print
messagePOSTcMBLEgarbage
garbagePRdAMBLEmessage
but what I really want it to print is the message from the third line, since it has both a valid preamble and a valid postamble. So I guess what I want is to be able to find and index from the next instance of a substring. Is there an easy way to do this?
Edit: I don't expect my data to be in nice discrete lines. I just formatted it that way so it would be easier to see.
Process it line by line:
>>> test = "garbagePREAMBLEmessagePOSTcMBLEgarbage\n"
>>> test += "garbagePRdAMBLEmessagePOSTAMBLEgarbage\n"
>>> test += "garbagePREAMBLEmessagePOSTAMBLEgarbage\n"
>>> for line in test.splitlines():
...     if line.find(preamble) != -1 and line.find(postamble) != -1:
...         print(line[line.find(preamble) + len(preamble):line.find(postamble)])
Are all messages on single lines?
If so, you can use regular expressions to identify lines with a valid preamble and postamble:
input_file = open(yourfilename)
import re
pat = re.compile('PREAMBLE(.+)POSTAMBLE')
messages = [pat.search(line).group(1) for line in input_file
            if pat.search(line)]
print(messages)
import re
lines = ["garbagePREAMBLEmessagePOSTcMBLEgarbage",
         "garbagePRdAMBLEmessagePOSTAMBLEgarbage",
         "garbagePREAMBLEmessagePOSTAMBLEgarbage"]
# you can use regex
my_regex = re.compile("garbagePREAMBLE(.*?)POSTAMBLEgarbage")
# get the match found between the preambles and print it
for line in lines:
    found = re.match(my_regex,line)
    # if there is a match print it
    if found:
        print(found.group(1))
# you can use string slicing
def validate(pre, post, messages):
    for line in messages:
        # method would break on a string smaller than both preambles
        if len(line) < len(pre) + len(post):
            print("error line is too small")
            continue
        # see if the message fits the pattern
        if line[:len(pre)] == pre and line[-len(post):] == post:
            # print message
            print(line[len(pre):-len(post)])
validate("garbagePREAMBLE","POSTAMBLEgarbage", lines)
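Since the data is not guaranteed to arrive in discrete lines, one possible sketch works on the whole stream instead. It relies on an assumption that is not in the question: that a valid message never exceeds a known maximum length (max_len), which is what stops a valid preamble from pairing with a postamble that belongs to a later transmission. The function name extract_messages is hypothetical.

```python
import re

def extract_messages(data, preamble, postamble, max_len):
    """Return every message framed by an intact preamble/postamble pair."""
    pattern = re.compile(
        re.escape(preamble)
        + "(.{1,%d}?)" % max_len  # assumption: messages are 1..max_len characters
        + re.escape(postamble),
        re.DOTALL,  # let '.' cross newlines, since the stream has no real line structure
    )
    return pattern.findall(data)
```

Without some bound like this, a corrupt postamble followed by a corrupt preamble cannot be told apart from one long message.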

How do I perform binary search on a text file to search a keyword in python?

The text file contains two columns- index number(5 spaces) and characters(30 spaces).
It is arranged in lexicographic order. I want to perform binary search to search for the keyword.
Here's an interesting way to do it with Python's built-in bisect module.
import bisect
import os
class Query(object):
    def __init__(self, query, index=5):
        self.query = query
        self.index = index
    def __lt__(self, comparable):
        return self.query < comparable[self.index:]
class FileSearcher(object):
    def __init__(self, file_pointer, record_size=35):
        self.file_pointer = file_pointer
        self.file_pointer.seek(0, os.SEEK_END)
        self.record_size = record_size + len(os.linesep)
        self.num_bytes = self.file_pointer.tell()
        self.file_size = (self.num_bytes // self.record_size)
    def __len__(self):
        return self.file_size
    def __getitem__(self, item):
        self.file_pointer.seek(item * self.record_size)
        return self.file_pointer.read(self.record_size)
if __name__ == '__main__':
    with open('data.dat') as file_to_search:
        query = raw_input('Query: ')
        wrapped_query = Query(query)
        searchable_file = FileSearcher(file_to_search)
        print "Located # line: ", bisect.bisect(searchable_file, wrapped_query)
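The same fixed-width idea can also be written as an explicit loop, which makes the seek arithmetic easier to follow. This is a sketch under several assumptions: ASCII text, records of exactly 5 + 30 characters padded with spaces, a single '\n' line terminator, and a file sorted by keyword. The name find_record and its parameters are made up for the example.

```python
import os

def find_record(path, keyword, index_width=5, key_width=30, newline_len=1):
    """Binary-search a file of fixed-width records; return the index field or None."""
    record_size = index_width + key_width + newline_len
    target = keyword.encode("ascii")
    with open(path, "rb") as f:  # binary mode so that seek offsets are exact bytes
        lo, hi = 0, os.path.getsize(path) // record_size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * record_size)
            record = f.read(record_size)
            # Drop the space padding before comparing keywords.
            key = record[index_width:index_width + key_width].rstrip()
            if key < target:
                lo = mid + 1
            elif key > target:
                hi = mid
            else:
                return record[:index_width].decode("ascii").strip()
    return None
```

Only about log2(n) records are read, and the file is never loaded into memory.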
Do you need to do a binary search? If not, try converting your flat file into a cdb (constant database). This will give you very speedy hash lookups to find the index for a given word:
import cdb
# convert the corpus file to a constant database one time
db = cdb.cdbmake('corpus.db', 'corpus.db_temp')
for line in open('largecorpus.txt', 'r'):
    index, word = line.split()
    db.add(word, index)
db.finish()
In a separate script, run queries against it:
import cdb
db = cdb.init('corpus.db')
db.get('chaos')
12345
If you need to find a single keyword in a file:
line_with_keyword = next((line for line in open('file') if keyword in line), None)
if line_with_keyword is not None:
    print(line_with_keyword) # found
To find multiple keywords you could use set() as @kriegar suggested:
def extract_keyword(line):
    return line[5:35] # assuming keyword starts on 6 position and has length 30
with open('file') as f:
    keywords = set(extract_keyword(line) for line in f) # O(n) creation
if keyword in keywords: # O(1) search
    print(keyword)
You could use dict() above instead of set() to preserve index information.
Here's how you could do a binary search on a text file:
import bisect
lines = open('file').readlines() # O(n) list creation
keywords = list(map(extract_keyword, lines)) # bisect needs a real list
i = bisect.bisect_left(keywords, keyword) # O(log(n)) search
if i < len(keywords) and keyword == keywords[i]:
    print(lines[i]) # found
There is no advantage compared to the set() variant.
Note: all variants except the first one load the whole file into memory. FileSearcher(), suggested by @Mahmoud Abdelkader, doesn't require loading the whole file into memory.
I wrote a simple Python 3.6+ package that can do this. (See its github page for more information!)
Installation: pip install binary_file_search
Example file:
1,one
2,two_a
2,two_b
3,three
Usage:
from binary_file_search.BinaryFileSearch import BinaryFileSearch
with BinaryFileSearch('example.file', sep=',', string_mode=False) as bfs:
# assert bfs.is_file_sorted() # test if the file is sorted.
print(bfs.search(2))
Result: [[2, 'two_a'], [2, 'two_b']]
It is quite possible, with a slight loss of efficiency, to perform a binary search on a sorted text file with records of unknown length by repeatedly bisecting the range and reading forward past the line terminator. Here's what I do to look through a CSV file with 2 header lines for a number in the first field. Give it an open file and the first-field value to look for. It should be fairly easy to modify this for your problem. A match on the very first line at offset zero will fail, so this may need to be special-cased. In my circumstance, the first 2 lines are headers, and are skipped.
Please excuse my lack of polished Python below. I use this function, and a similar one, to perform GeoCity Lite latitude and longitude calculations directly from the CSV files distributed by Maxmind.
Hope this helps
========================================
import os

# See if the input loc is in file
def look1(f,loc):
    # Compute filesize of open file sent to us
    hi = os.fstat(f.fileno()).st_size
    lo=0
    lookfor=int(loc)
    # print "looking for: ",lookfor
    while hi-lo > 1:
        # Find midpoint and seek to it
        loc = int((hi+lo)/2)
        # print " hi = ",hi," lo = ",lo
        # print "seek to: ",loc
        f.seek(loc)
        # Skip to beginning of line
        while f.read(1) != '\n':
            pass
        # Now skip past lines that are headers
        while 1:
            # read line
            line = f.readline()
            # print "read_line: ", line
            # Crude csv parsing, remove quotes, and split on ,
            row=line.replace('"',"")
            row=row.split(',')
            # Make sure 1st fields is numeric
            if row[0].isdigit():
                break
        s=int(row[0])
        if lookfor < s:
            # Split into lower half
            hi=loc
            continue
        if lookfor > s:
            # Split into higher half
            lo=loc
            continue
        return row # Found
    # If not found
    return False
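A more compact sketch of the same bisect-by-byte-offset technique is shown below. It is not the function above: it assumes a file with no header lines and no blank lines, sorted by an integer in the first comma-separated field, and the names find_line, line_after and key are hypothetical. Unlike the version above, it also handles a match on the very first line.

```python
import os

def find_line(path, target):
    """Return the first line whose leading integer field equals target, or None."""

    def line_after(f, pos):
        # First complete line starting after byte offset pos
        # (or the very first line when pos == 0); b"" at end of file.
        f.seek(pos)
        if pos > 0:
            f.readline()  # discard the rest of the line that contains pos
        return f.readline()

    def key(line):
        return int(line.split(b",", 1)[0])

    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose following line has key >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            line = line_after(f, mid)
            if not line or key(line) >= target:
                hi = mid
            else:
                lo = mid + 1
        line = line_after(f, lo)
    if line and key(line) == target:
        return line.decode("ascii").rstrip("\r\n")
    return None
```

Each probe costs one seek and at most two readline calls, so lines of different lengths are not a problem.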
Consider using a set instead of a binary search for finding a keyword in your file.
Set:
O(n) to create, O(1) to find, O(1) to insert/delete
If your input file is separated by a space then:
f = open('file')
keywords = set( (line.strip().split(" ")[1] for line in f.readlines()) )
f.close()
my_word in keywords
<returns True or False>
Dictionary:
f = open('file')
keywords = dict( [ (pair[1],pair[0]) for pair in [line.strip().split(" ") for line in f.readlines()] ] )
f.close()
keywords[my_word]
<returns index of my_word>
Binary Search is:
O(n log n) create, O(log n) lookup
edit: for your case of 5 characters and 30 characters you can just use string slicing
f = open('file')
keywords = set( (line[5:-1] for line in f.readlines()) )
f.close()
my_word in keywords
or
f = open('file')
keywords = dict( [(line[5:-1],line[:5]) for line in f.readlines()] )
f.close()
keywords[my_word]
