Extract data between two lines from text file - python

Say I have hundreds of text files like this example:
NAME
John Doe
DATE OF BIRTH
1992-02-16
BIO
THIS is
a PRETTY
long sentence
without ANY structure
HOBBIES
//..etc..
NAME, DATE OF BIRTH, BIO, and HOBBIES (and others) are always there, but text content and the number of lines between them can sometimes change.
I want to iterate through the file and store the string between each of these keys. For example, a variable called Name should contain the value stored between 'NAME' and 'DATE OF BIRTH'.
This is what I came up with:
lines = f.readlines()
for line_number, line in enumerate(lines):
    if "NAME" in line:
        name = lines[line_number + 1]  # In all files, Name is one line long.
    elif "DATE OF BIRTH" in line:
        date = lines[line_number + 2]  # Date is also always two lines after
    elif "BIO" in line:
        for x in range(line_number + 1, line_number + 20):  # Length of other data can be randomly bigger
            if "HOBBIES" not in lines[x]:
                bio += lines[x]
            else:
                break
    elif "HOBBIES" in line:
        # ...
This works well enough, but I feel like instead of using many double loops, there must be a smarter and less hacky way to do it.
I'm looking for a general solution where NAME would store everything until DATE OF BIRTH, and BIO would store everything until HOBBIES, etc., with the intention of cleaning up and removing extra white lines later.
Is it possible?
Edit: While I was reading through the answers, I realized I forgot a really significant detail: the keys will sometimes be repeated (in the same order).
That is, a single text file can contain more than one person. A list of persons should be created. The key Name signals the start of a new person.

I did it by storing everything in a dictionary; see the code below.
f = open("test.txt")
lines = f.readlines()
dict_text = {"NAME":[], "DATEOFBIRTH":[], "BIO":[]}
for line_number, line in enumerate(lines):
if not ("NAME" in line or "DATE OF BIRTH" in line or "BIO" in line):
text = line.replace("\n","")
dict_text[location].append(text)
else:
location = "".join((line.split()))

You could use a regular expression:
import re

keys = """
NAME
DATE OF BIRTH
BIO
HOBBIES
""".strip().splitlines()

key_pattern = '|'.join(f'{key.strip()}' for key in keys)
pattern = re.compile(fr'^({key_pattern})', re.M)
# uncomment to see the pattern
# print(pattern)

with open(filename) as f:
    text = f.read()

parts = pattern.split(text)
... process parts ...
parts will be a list of strings. The odd indexed positions (parts[1], parts[3], ...) will be the keys ('NAME', etc.) and the even indexed positions (parts[2], parts[4], ...) will be the text in between the keys. parts[0] will be whatever was before the first key.
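Regarding the question's edit about repeated keys: since parts alternates key, text, key, text, the list can be folded into one dict per person, starting a new record whenever NAME comes around again. A minimal sketch, assuming the keys always start a line exactly as shown:
def parse_people(parts):
    people = []
    person = None
    # parts[1::2] are the keys, parts[2::2] the text chunks after each key
    for key, chunk in zip(parts[1::2], parts[2::2]):
        if key == "NAME" or person is None:
            person = {}
            people.append(person)
        person[key] = chunk.strip()
    return people

people = parse_people(pattern.split(text))
# e.g. people[0]['NAME'] == 'John Doe'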

Instead of reading lines, you could read the whole file in as one long string. Use string.index() to find the start index of your trigger words, then set everything from that index to the next trigger word's index to a variable.
Something like:
def get_data(string, previous_end_index, current_start_index):
    usable_data = string[previous_end_index: current_start_index]
    return usable_data

string = f.read()  # read the whole file as one string
important_words = ['NAME', 'DATE OF BIRTH']
last_phrase = None
for phrase in important_words:
    phrase_start = string.index(phrase)
    phrase_end = phrase_start + len(phrase)
    if last_phrase is not None:
        get_data(string, last_phrase, phrase_start)
    last_phrase = phrase_end
Better/shorter variable names should probably be used

You can just read the text in as one long string and then make use of .split().
This will only work if the categories are in order and don't repeat.
Like so;
Categories = ["NAME", "DOB", "BIO"] // in the order they appear in text
Output = {}
Text = str(f)
for i in range(1,len(Categories)):
SplitText = Text.split(Categories[i])
Output.update({Categories[i-1] : SplitText[0] })
Text = SplitText[1]
Output.update({Categories[-1] : Text})
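Wrapped up as a function for clarity (a sketch; the strip() calls remove the stray newlines the question mentions wanting to clean up later):
def split_by_categories(text, categories):
    # Drop everything up to and including the first key.
    text = text.split(categories[0], 1)[1]
    output = {}
    for i in range(1, len(categories)):
        before, _, text = text.partition(categories[i])
        output[categories[i - 1]] = before.strip()
    output[categories[-1]] = text.strip()
    return output

sample = "NAME\nJohn Doe\nDOB\n1992-02-16\nBIO\nTHIS is\na PRETTY\nlong sentence"
print(split_by_categories(sample, ["NAME", "DOB", "BIO"]))
# {'NAME': 'John Doe', 'DOB': '1992-02-16', 'BIO': 'THIS is\na PRETTY\nlong sentence'}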

You can try the following.
keys = ["NAME","DATE OF BIRTH","BIO","HOBBIES"]
f = open("data.txt", "r")
result = {}
for line in f:
line = line.strip('\n')
if any(v in line for v in keys):
last_key = line
else:
result[last_key] = result.get(last_key, "") + line
print(result)
Output
{'NAME': 'John Doe', 'DATE OF BIRTH': '1992-02-16', 'BIO ': 'THIS is a PRETTY long sentence without ANY structure ', 'HOBBIES ': '//..etc..'}
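If the keys repeat because one file holds several people (per the question's edit), a variant of the same idea can start a new dict at each NAME and build a list of persons. A sketch:
keys = ["NAME", "DATE OF BIRTH", "BIO", "HOBBIES"]
people = []
with open("data.txt") as f:
    for line in f:
        line = line.strip()
        if line in keys:
            if line == "NAME":
                people.append({})
            last_key = line
        elif people:
            people[-1][last_key] = people[-1].get(last_key, "") + line
print(people)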

Related

Rearrange Dialogue Documents to Dataframe

I plan to analyze some conference documents, and before the analysis, I need to rearrange these documents into a data frame. The format I expect is that for each row of the data, the first value is the speaker and the second value is the utterance of that speaker. For instance, ["Jo", "I just had tacos."]. The sample document can be reached here. Below is the progress so far:
file = open('the document', 'r')
Name = []
sentence = []
for line in file:
    if line.find("Column") != -1:
        continue
    if line.find("Section") != -1 or line.find("Index") != -1 or line.find("Home Page") != -1:
        continue
    if line.find(':') != -1:
        tokens = line.split(":")
        Name.append(tokens[0])
    else:
        sentence.append(line + " ")
My first question is how I can combine the speaker and the utterance in one list and then search for the next person. The second question is whether there is any better way to get rid of the content before Oral Answers to Questions and after The House divided: Ayes 240, Noes 329.Division No. 54][9.59 pm.
I appreciate any help.
Here, I have come up with a simple solution. This simple solution has three parts:
When there is an empty line
When the line ends with :
Otherwise
Here is the code:
import re
from collections import defaultdict

def clean_speaker(sp):
    sp = re.sub(r"(\(\w+\))", "", sp)  # remove single words within parentheses
    sp = re.sub(r"(\d+\.?)", "", sp)   # remove digits such as 1. or 2.
    return sp.strip()

document = []
with open('the document', 'r') as fin:
    foundSpeaker = False
    dialogue = defaultdict(str)
    for line in fin.readlines():
        line = line.strip()  # remove white-spaces
        # ----- when line is empty -----
        if not line:
            dialogue = defaultdict(str)
            foundSpeaker = False
        # ----- when line ends with : -----
        elif line[-1] == ":":
            if dialogue:
                document.append(dialogue)
                dialogue = defaultdict(str)
            foundSpeaker = True
            dialogue["Speaker"] = clean_speaker(line[:-1])
        # ----- otherwise -----
        else:
            if foundSpeaker:
                dialogue["Sentence"] += " " + line
            else:
                if dialogue:
                    document.append(dialogue)
                    dialogue = defaultdict(str)
                    foundSpeaker = False
                continue
Now, the variable document has all the dialogue in the given file... it's a list of dictionaries where each dictionary has just two keys (Speaker and Sentence). So, we can see what's inside document like so:
for d in document:
    for key, value in d.items():
        print(key + ":", value)
Or you can do something smarter and convert that list into a pandas.DataFrame, then write that DataFrame to a CSV like so:
import pandas as pd

# document is a list of dicts, so the plain DataFrame constructor handles it
pd.DataFrame(document).to_csv('document.csv')
Now, open document.csv and you will find everything in order... I hope this helps you

How do I parse a sequentially organized multiline string into a data structure using regex/python?

I need to parse a multi-line string into a data structure containing (1) the identifier and (2) the text after the identifier (but before the next > symbol). The identifier always comes on its own line, but the text can take up multiple lines.
>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa
After execution I might have the data structured something like this:
id = ['identifier1', 'identifier2', 'identifier3']
and
txt =
['lalalalalalalalalalalalalalalalala',
'bababababababababababababababababa',
'wawawawawawawawawawawawawawawawawa']
It seems I would want to use regex to find (1) things after > but before carriage return, and (2) things between >'s, having temporarily deleted the identifier string and EOL, replacing with "".
The thing is I will have hundreds of these identifiers so I need to run the regex sequentially. Any ideas on how to attack this problem? I am working in python but feel free to use whatever language you want in your response.
Update 1: code from slater is getting closer, but things are still not partitioned sequentially into id, text, id, text, etc.
teststring = '''>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa'''
# First, split the text into relevant chunks
split_text = teststring.split('>')
# see where we are after split
print split_text
# remove empty strings that will mess up the partitioning
while '' in split_text:
    split_text.remove('')
# see where we are after removing '', before partitioning
print split_text
id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]
# see where we are after partition
print id
print txt
print len(split_text)
print len(id)
but the output was:
['', 'identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
3
3
Note: it needs to work for a multiline string, dealing with all the \n's. A better test case might be:
teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa
wawawawawawawawawawawawawawawawawa'''
# First, split the text into relevant chunks
split_text = teststring.split('>')
# see where we are after split
print split_text
# remove empty strings that will mess up the partitioning
while '' in split_text:
    split_text.remove('')
# see where we are after removing '', before partitioning
print split_text
id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]
# see where we are after partition
print id
print txt
print len(split_text)
print len(id)
current output:
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
4
4
Personally, I feel that you should use regex as little as possible. It's slow, difficult to maintain, and generally unreadable.
That said, solving this in python is extremely straightforward. I'm a little unclear on what exactly you mean by running this "sequentially", but let me know if this solution doesn't fit your needs.
# First, split the text into relevant chunks
split_text = text.split('>')
id = [text.partition('\n')[0] for text in split_text]
txt = [text.partition('\n')[2] for text in split_text]
Obviously, you could make the code more efficient, but if you're only dealing with hundreds of identifiers it really shouldn't be needed.
If you want to remove any blank entries that might occur, you can do the following:
list_with_blanks = ['', 'hello', '', '', 'world']
filter(None, list_with_blanks)
>>> ['hello', 'world']
Let me know if you have any more questions.
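Put together with the blank-entry filtering, and flattening each chunk's body, the whole thing might look like this (a sketch in the thread's Python 2 style):
chunks = [c for c in teststring.split('>') if c.strip()]
ids = [c.partition('\n')[0] for c in chunks]
txt = [c.partition('\n')[2].replace('\n', '') for c in chunks]
print ids  # ['identifier1', 'identifier2', 'identifier3']
print txt  # the concatenated body lines for each identifier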
Unless I misunderstood the question, it's as easy as
for line in your_file:
    if line.startswith('>'):
        id.append(line[1:].strip())
    else:
        text.append(line.strip())
Edit: to concatenate multiple lines:
ids, text = [], []
for line in teststring.splitlines():
    if line.startswith('>'):
        ids.append(line[1:])
        text.append('')
    elif text:
        text[-1] += line
I found a solution. It's certainly not very pythonic but it works.
teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala\n
lalalalalalalalalalalalalalalalala\n
>identifier2
bababababababababababababababababa\n
bababababababababababababababababa\n
>identifier3
wawawawawawawawawawawawawawawawawa\n
wawawawawawawawawawawawawawawawawa\n'''
i = 0
j = 0
# split the multiline string by line
dsplit = teststring.split('\n')
# the indices of identifiers
index = list()
for line in dsplit:
    if line.startswith('>'):
        print line
        index.append(i)
        j = j + 1
    i = i + 1
index.append(i)  # add this so you get the last block of text
# the text corresponding to each index
thetext = list()
# the names corresponding to each gene
thenames = list()
for n in range(0, len(index) - 1):
    thetext.append("")
    for k in range(index[n] + 1, index[n + 1]):
        thetext[n] = thetext[n] + dsplit[k]
    thenames.append(dsplit[index[n]][1:])  # the [1:] removes the first character (>) from the line
print "the indices", index
print "the text: ", thetext
print "the names", thenames
print "this many text entries: ", len(thetext)
print "this many index entries: ", j
This gives the following output:
>identifier1
>identifier2
>identifier3
the indices [1, 6, 11, 16]
the text: ['lalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalala', 'babababababababababababababababababababababababababababababababababa', 'wawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawa']
the names ['identifier1', 'identifier2', 'identifier3']
this many text entries: 3
this many index entries: 3
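For comparison, the same grouping can be written much more compactly with re.findall (a sketch, assuming > only ever introduces identifier lines):
import re

# Each match captures an identifier line plus everything up to the next '>'.
matches = re.findall(r'>(.*)\n([^>]*)', teststring)
thenames = [name for name, body in matches]
thetext = [body.replace('\n', '') for name, body in matches]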

Splitting or stripping a variable number of characters from a line of text in Python?

I have a large amount of data of this type:
array(14) {
  ["ap_id"]=>
  string(5) "22755"
  ["user_id"]=>
  string(4) "8872"
  ["exam_type"]=>
  string(32) "PV Technical Sales Certification"
  ["cert_no"]=>
  string(12) "PVTS081112-2"
  ["explevel"]=>
  string(1) "0"
  ["public_state"]=>
  string(2) "NY"
  ["public_zip"]=>
  string(5) "11790"
  ["email"]=>
  string(19) "ivorabey@zeroeh.com"
  ["full_name"]=>
  string(15) "Ivor Abeysekera"
  ["org_name"]=>
  string(21) "Zero Energy Homes LLC"
  ["org_website"]=>
  string(14) "www.zeroeh.com"
  ["city"]=>
  string(11) "Stony Brook"
  ["state"]=>
  string(2) "NY"
  ["zip"]=>
  string(5) "11790"
}
I wrote a for loop in python which reads through the file, creating a dictionary for each array and storing elements like so:
a = 0
data = [{}]
with open("mess.txt") as messy:
    lines = messy.readlines()
    for i in range(1, len(lines)):
        line = lines[i]
        if "public_state" in line:
            data[a]['state'] = lines[i + 1]
        elif "public_zip" in line:
            data[a]['zip'] = lines[i + 1]
        elif "email" in line:
            data[a]['email'] = lines[i + 1]
        elif "full_name" in line:
            data[a]['contact'] = lines[i + 1]
        elif "org_name" in line:
            data[a]['name'] = lines[i + 1]
        elif "org_website" in line:
            data[a]['website'] = lines[i + 1]
        elif "city" in line:
            data[a]['city'] = lines[i + 1]
        elif "}" in line:
            a += 1
            data.append({})
I know my code is terrible, but I am fairly new to Python. As you can see, the bulk of my project is complete. What's left is to strip away the code tags from the actual data. For example, I need string(15) "Ivor Abeysekera" to become Ivor Abeysekera.
After some research, I considered .lstrip(), but since the preceding text is always different, I got stuck.
Does anyone have a clever way of solving this problem? Cheers!
Edit: I am using Python 2.7 on Windows 7.
Depending on how the code tags are formatted, you could split the line on " then pick out the second element.
s = 'string(15) "Ivor Abeysekera"'
temp = s.split('"')[1]
# temp is 'Ivor Abeysekera'
Note that this will get rid of the trailing ", if you need it you can always just add it back on. In your example this would look like:
data[a]['state'] = lines[i + 1].split('"')[1]
# etc. for each call of lines[i + 1]
Because you are calling it so much (regardless of what answer you use) you should probably turn it into a function:
def prepare_data(line_to_fix):
    return line_to_fix.split('"')[1]

# later on...
data[a]['state'] = prepare_data(lines[i + 1])
This will give you some more flexibility.
BAD SOLUTION (based on the current question)
To answer your question as asked, just use:
info_string = lines[i + 1]
value_str = info_string.split(" ", 1)[-1].strip(" \"")
BETTER SOLUTION
Do you have access to the PHP generating that output? If you do, just do echo json_encode($data); instead of using var_dump.
If you have it output JSON instead, it (the JSON output) will look like
{"variable":"value","variable2":"value2"}
You can then read it in like:
import json
import requests  # or however you get the original text

json_str = requests.get("http://url.com/json_dump").text
data = json.loads(json_str)
print data
You should use regular expressions (regex) for this:
http://docs.python.org/2/library/re.html
What you intend to do can be easily done with the following code:
# Import the library
import re
# This is a string just to demonstrate
a = 'string(32) "PV Technical Sales Certification"'
# Create the regex
p = re.compile('[^"]+"(.*)"$')
# Find a match
m = p.match(a)
# Your result will now be in s
s = m.group(1)
Hope this helps!
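Applied to the question's loop, that would look something like (a sketch):
data[a]['state'] = p.match(lines[i + 1]).group(1)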
You can do this statefully by looping across all the lines and keeping track of where you are in a block:
# Map field names to dict keys
fields = {
    'public_state': 'state',
    'public_zip': 'zip',
    'email': 'email',
    'full_name': 'contact',
    'org_name': 'name',
    'org_website': 'website',
    'city': 'city',
}

data = []
current = {}
key = None
with open("mess.txt") as messy:
    for line in messy:
        line = line.lstrip()
        if line.startswith('}'):
            data.append(current)
            current = {}
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)
        elif key is not None:
            # Get everything between the first and last quotes on the line
            value = line.split('"', 1)[1].rsplit('"', 1)[0]
            current[key] = value
This avoids having to keep track of your position in the file, and also means that you could work across enormous data files (if you process the dictionary after each record) without having to load the whole thing into memory at once. In fact, let's restructure that as a generator that processes blocks of data at a time and yields dicts for you to work with:
fields = {
    'public_state': 'state',
    'public_zip': 'zip',
    'email': 'email',
    'full_name': 'contact',
    'org_name': 'name',
    'org_website': 'website',
    'city': 'city',
}

def dict_maker(fileobj):
    current = {}
    key = None
    for line in fileobj:
        line = line.lstrip()
        if line.startswith('}'):
            yield current
            current = {}
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)
        elif key is not None:
            # Get everything between the first and last quotes on the line
            value = line.split('"', 1)[1].rsplit('"', 1)[0]
            current[key] = value

with open("mess.txt") as messy:
    for d in dict_maker(messy):
        print d
That makes your main loop tiny and understandable: you loop across the potentially enormous set of dicts, one at a time, and do something with them. It totally separates the act of making the dictionaries from the act of consuming them. And since the generator is stateful, and only processes one line at a time, you could pass in anything that looks like a file, like a list of strings, the output of a web request, input from another program writing to sys.stdin, or whatever.
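For example, feeding it a list of strings instead of a file object (a sketch using a trimmed-down record):
sample = [
    'array(14) {',
    '["full_name"]=>',
    'string(15) "Ivor Abeysekera"',
    '["city"]=>',
    'string(11) "Stony Brook"',
    '}',
]
for d in dict_maker(sample):
    print d  # {'contact': 'Ivor Abeysekera', 'city': 'Stony Brook'}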

Capturing the usernames after the List: tag

I am trying to create a list named "userlist" with all the usernames listed beside "List:". My idea is to parse the line with "List:" and then split based on "," and put them in a list; however, I am not able to capture the line. Any inputs on how this can be achieved?
output=""" alias: tech.sw.host
name: tech.sw.host
email: tech.sw.host
email2: tech.sw.amss
type: email list
look_elsewhere: /usr/local/mailing-lists/tech.sw.host
text: List tech SW team
list_supervisor: <username>
List: username1,username2,username3,username4,
: username5
Members: User1,User2,
: User3,User4,
: User5 """
# print output
userlist = []
for line in output:
    if "List" in line:
        print line
If it were me, I'd parse the entire input so as to have easy access to every field:
import StringIO
import collections

inFile = StringIO.StringIO(ph)  # ph holds the question's multi-line string
d = collections.defaultdict(list)
for line in inFile:
    line = line.partition(':')
    key = line[0].strip() or key
    d[key] += [part.strip() for part in line[2].split(',')]
print d['List']
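A note on the key = line[0].strip() or key line: for continuation rows such as : username5 the text before the colon is empty, so the previous key is reused. Trailing commas do leave empty strings in the list, which can be filtered out afterwards (a sketch):
print filter(None, d['List'])
# ['username1', 'username2', 'username3', 'username4', 'username5']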
Using regex, str.translate and str.split :
>>> import re
>>> from string import whitespace
>>> strs = re.search(r'List:(.*)(\s\S*\w+):', ph, re.DOTALL).group(1)
>>> strs.translate(None, ':'+whitespace).split(',')
['username1', 'username2', 'username3', 'username4', 'username5']
You can also create a dict here, which will allow you to access any attribute:
def func(lis):
    return ''.join(lis).translate(None, ':' + whitespace)

lis = [x.split() for x in re.split(r'(?<=\w):', ph.strip(), re.DOTALL)]
dic = {}
for x, y in zip(lis[:-1], lis[1:-1]):
    dic[x[-1]] = func(y[:-1]).split(',')
dic[lis[-2][-1]] = func(lis[-1]).split(',')
print dic['List']
print dic['Members']
print dic['alias']
Output:
['username1', 'username2', 'username3', 'username4', 'username5']
['User1', 'User2', 'User3', 'User4', 'User5']
['tech.sw.host']
Try this:
for line in output.split("\n"):
    if "List" in line:
        print line
When Python is asked to treat a string like a collection, it'll treat each character in that string as a member of that collection (as opposed to each line, which is what you're trying to accomplish).
You can tell this by printing each line:
>>> for line in ph:
...     print line
...
a
l
i
a
s
:
t
e
...
By the way, there are far better ways of handling this. I'd recommend taking a look at Python's built-in RegEx library: http://docs.python.org/2/library/re.html
Try using strip() to remove the white spaces and line breakers before doing the check:
if 'List:' == line.strip()[:5]:
This should capture the line you need; then you can extract the usernames using split(','):
usernames = [i for i in line[5:].split(',')]
Here are my two solutions, which are essentially the same, but the first is easier to understand.
import re

output = """ ... """

# First solution: join continuation lines, then look for List
# Join lines such as username5 with the previous line
#   List: username1,username2,username3,username4,
#       : username5
# becomes
#   List: username1,username2,username3,username4,username5
lines = re.sub(r',\s*:\s*', ',', output)
for line in lines.splitlines():
    label, values = [token.strip() for token in line.split(':')]
    if label == 'List':
        userlist = [user.strip() for user in values.split(',')]
        print 'Users:', ', '.join(userlist)

# Second solution, same logic as above
# Different means
tokens, = [line for line in re.sub(r',\s*:\s*', ',', output).splitlines()
           if 'List:' in line]
label, values = [token.strip() for token in tokens.split(':')]
userlist = [user.strip() for user in values.split(',')]
print 'Users:', ', '.join(userlist)
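Either way, the expected output for the question's sample is:
Users: username1, username2, username3, username4, username5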

Help parsing text file in python

I've really been struggling with this one for some time now. I have many text files with a specific format from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parameters for parsing, ensuring I get all the info correctly.
The format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of whitespace, and 2) defining the fields from each other. See my best attempt below:
dict = {}
XX = open("XX.txt", "r").readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''', (tmp_PA, tmp_K, tmp_DETAILS))
At the minute, I can pull the K & PA fields no problem using this; however, my DETAILS is only pulling one word, and I need the entire sentence, or at least 25 chars of it.
Thanks very much for reading and I hope you can help! :)
You are splitting the whole line into words. You need to split into first word, second word and the rest. Like line.split(None, 2).
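For example (a sketch):
line = "2 4565434 i need this sentence as one DB record"
k, pa, details = line.split(None, 2)
# k == '2', pa == '4565434', details == 'i need this sentence as one DB record'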
I would probably use regular expressions, and use the opposite logic: that is, if the line starts with a number 1 through 5, use it; otherwise pass. Like:
import re

pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r')  # not calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using prepared statements. Parsing SQL is orders of magnitude slower than executing it.
If I understand your file format correctly, you can try this script:
filename = 'bug.txt'
f = file(filename, 'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        tokens = line.split()
        if tokens == ['K', 'PA', 'DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None, 2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K, PA, tokens[2]))
f.close()
for r in records:
    print r  # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.
Here is my attempt using re
import re

stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print result.group('third')
        print result.group('second')
        print result.group('first')
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re

stuff = open("source", "r").readlines()
# Create a regular expression using subpatterns.
# 'first', 'second' and 'third' are our own tags;
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    result = juicy_info.search(line)
    if result:  # do stuff with data here, just use the tag we declared earlier
        print result.group('third')
        print result.group('second')
        print result.group('first')
import re

reg = re.compile('K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3 * '([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')
with open('XX.txt') as f:
    mat = reg.search(f.read())
for tripl in ((2, 1, 3), (5, 4, 6), (8, 7, 9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches the following characters: blank, '\f', '\n', '\r', '\t', '\v'. I don't see any reason to use a symbol representing more than what is to be matched, with the risk of matching erratic newlines in places where they shouldn't be.
Edit
It may be sufficient to do:
import re

reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$', re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2, 1, 3))
