Parsing Text Document Lines in Python on Wildcards/Patterns - python

What I Have
I'm working on parsing a .txt file that contains scheduling information for who works when on a given day. The .txt file looks like this:
START PAGE 0
XYZ Schedule for: Saturday, March 30, 2013
Barnes, Michael8:00a10:00aTech
Collins, Jessica8:00a4:00pSupervisor
Hamilton, Patricia8:00a10:00aTech
Smith, Jan8:00a10:00aTech
Park, Kimberly8:00a10:00aTech
Edwards, Terrell10:00a12:00pTech
Green, Harrold12:00p2:00pTech
Tait, Jessica12:00p2:00pTech
Tait, Jessica2:00p4:00pTech
Hernandez, William (Monte)4:00p6:30pSupervisor
Tait, Chioma4:00p6:00pTech
Hernandez, William (Monte)6:30p7:00pSupervisor
Hernandez, William (Monte)7:00p9:00pSupervisor
Tailor, Thomas (Jason)9:00p12:00aSupervisor
Jones, Deslynne10:00p12:00aTech
3/28/2013 2:21:17 PM
END PAGE 0
So the first two and last two lines are not relevant, but each line in the middle is one shift for one person.
What I Want
I want to parse out the pieces of each line so that I can write it to a .csv file. I can use line.partition(',')[0] to get the last name (the first piece on each line) but after that I'm at a loss. I need to communicate the following to Python:
The part after the , up to the first number is a section (first name)
The part from the first number to either an a or a p (for am or pm) is another section (start time)
The part from the number just after that a or p to the next a or p is another section (end time)
Finally, the remaining part is another section (the type/position of the shift.)
A line in my resulting csv file might look like this:
Barnes,Michael,8:00a,10:00a,Tech
Things to Note
1) One person can have more than one shift during a day.
2) Some people have a nickname in parentheses but some don't.
3) If Python had wild cards like # for a number and * for anything I could see how I might be able to keep using partition and keep splitting the remaining pieces, something like this:
for line in input:
    name = str(line.partition(',')[0] + ',' + str(line.partition(',')[2].split(#)[0]))
    output.write("".join(x for x in name))
    output.write("\r\n")
However, Python doesn't seem to use wildcards like that. Also, this seems like a very inelegant solution.

This should be enough to get you started:
import re
data = '''Barnes, Michael8:00a10:00aTech
Collins, Jessica8:00a4:00pSupervisor
Hamilton, Patricia8:00a10:00aTech
Smith, Jan8:00a10:00aTech
Park, Kimberly8:00a10:00aTech
Edwards, Terrell10:00a12:00pTech
Green, Harrold12:00p2:00pTech
Tait, Jessica12:00p2:00pTech
Tait, Jessica2:00p4:00pTech
Hernandez, William (Monte)4:00p6:30pSupervisor
Tait, Chioma4:00p6:00pTech
Hernandez, William (Monte)6:30p7:00pSupervisor
Hernandez, William (Monte)7:00p9:00pSupervisor
Tailor, Thomas (Jason)9:00p12:00aSupervisor
Jones, Deslynne10:00p12:00aTech'''
print re.findall(r'(.*?)(\d{1,2}:\d\d[ap])(\d{1,2}:\d\d[ap])(.*)', data)
prints
[('Barnes, Michael', '8:00a', '10:00a', 'Tech'),
('Collins, Jessica', '8:00a', '4:00p', 'Supervisor'),
('Hamilton, Patricia', '8:00a', '10:00a', 'Tech'),
('Smith, Jan', '8:00a', '10:00a', 'Tech'),
('Park, Kimberly', '8:00a', '10:00a', 'Tech'),
('Edwards, Terrell', '10:00a', '12:00p', 'Tech'),
('Green, Harrold', '12:00p', '2:00p', 'Tech'),
('Tait, Jessica', '12:00p', '2:00p', 'Tech'),
('Tait, Jessica', '2:00p', '4:00p', 'Tech'),
('Hernandez, William (Monte)', '4:00p', '6:30p', 'Supervisor'),
('Tait, Chioma', '4:00p', '6:00p', 'Tech'),
('Hernandez, William (Monte)', '6:30p', '7:00p', 'Supervisor'),
('Hernandez, William (Monte)', '7:00p', '9:00p', 'Supervisor'),
('Tailor, Thomas (Jason)', '9:00p', '12:00a', 'Supervisor'),
('Jones, Deslynne', '10:00p', '12:00a', 'Tech')]
Read the documentation of the re module to understand the regular expression. You can parse the names as a separate step or expand the regex to be more specific. I recommend using the csv module to write to a csv file.
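For instance, here is a minimal sketch of that last step, building on the data string and regex above; the output file name schedule.csv is just a placeholder:
import csv
import re

pattern = re.compile(r'(.*?)(\d{1,2}:\d\d[ap])(\d{1,2}:\d\d[ap])(.*)')

with open('schedule.csv', 'wb') as f:   # 'wb' mode for the Python 2 csv module
    writer = csv.writer(f)
    for name, start, end, position in pattern.findall(data):
        last, _, first = name.partition(',')   # split 'Barnes, Michael' into last/first
        writer.writerow([last.strip(), first.strip(), start, end, position])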
If you get stuck, post specific questions with code.

Assuming that you know how to remove the first two and last two lines, and that the rest is in a string called s, here is how I would do what you want:
entries = [x.strip() for x in s.split('\n') if x]
for entry in entries:
    ind = [i for i, x in enumerate(entry) if x.isdigit() and not entry[i-1].isdigit()]
    name = entry[0:ind[0]]
    name = name.split(',')
    other = entry[ind[0]:]
    ind = [-1] + [i for i, x in enumerate(other) if x in ('a', 'p') and other[i-1].isdigit()]
    shifts = []
    for i in xrange(1, len(ind)):
        shifts.append(other[ind[i-1]+1:ind[i]+1])
    position = other[ind[-1]+1:]
    print name, shifts, position
This will work on an arbitrary number of shifts.
Output:
['Barnes', ' Michael'] ['8:00a', '10:00a'] Tech
['Collins', ' Jessica'] ['8:00a', '4:00p'] Supervisor
['Hamilton', ' Patricia'] ['8:00a', '10:00a'] Tech
['Smith', ' Jan'] ['8:00a', '10:00a'] Tech
['Park', ' Kimberly'] ['8:00a', '10:00a'] Tech
['Edwards', ' Terrell'] ['10:00a', '12:00p'] Tech
['Green', ' Harrold'] ['12:00p', '2:00p'] Tech
['Tait', ' Jessica'] ['12:00p', '2:00p'] Tech
['Tait', ' Jessica'] ['2:00p', '4:00p'] Tech
['Hernandez', ' William (Monte)'] ['4:00p', '6:30p'] Supervisor
['Tait', ' Chioma'] ['4:00p', '6:00p'] Tech
['Hernandez', ' William (Monte)'] ['6:30p', '7:00p'] Supervisor
['Hernandez', ' William (Monte)'] ['7:00p', '9:00p'] Supervisor
['Tailor', ' Thomas (Jason)'] ['9:00p', '12:00a'] Supervisor
['Jones', ' Deslynne'] ['10:00p', '12:00a'] Tech

Related

How to replace every second space with a comma the Pythonic way

I have a string with first and last names all separated with a space.
For example:
installers = "Joe Bloggs John Murphy Peter Smith"
I now need to replace every second space with ', ' (comma followed by a space) and output this as a string.
The desired output is
print installers
Joe Bloggs, John Murphy, Peter Smith
You should be able to do this with a regex that finds every second space and replaces it:
import re
installers = "Joe Bloggs John Murphy Peter Smith"
re.sub(r'(\s\S*?)\s', r'\1, ',installers)
# 'Joe Bloggs, John Murphy, Peter Smith'
This says: find a space followed by some non-spaces followed by another space, and replace that with the captured space-plus-non-spaces followed by ", ". You could add installers.strip() first if there's a possibility of trailing spaces on the string.
One way to do this is to split the string into a space-separated list of names, get an iterator for the list, then loop over the iterator in a for loop, collecting the first name and advancing the iterator again inside the loop to get the second name too.
names = installers.split()
it = iter(names)
out = []
for name in it:
    next_name = next(it)
    full_name = '{} {}'.format(name, next_name)
    out.append(full_name)
fixed = ', '.join(out)
print fixed
Joe Bloggs, John Murphy, Peter Smith
The one line version of this would be
>>> ', '.join(' '.join(s) for s in zip(*[iter(installers.split())]*2))
'Joe Bloggs, John Murphy, Peter Smith'
This works by creating a list that contains the same iterator twice, so the zip function returns both parts of the name. See also the grouper recipe from the itertools recipes.
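For reference, the grouper recipe from the Python 2 itertools documentation looks roughly like this; applied to the names it gives the same pairing (assuming installers is the string from above):
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks or blocks."""
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

print ', '.join('{} {}'.format(first, last) for first, last in grouper(installers.split(), 2))
# Joe Bloggs, John Murphy, Peter Smith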

Parse a very large text file with Python?

So, the file has about 57,000 book titles, author names and an ETEXT No. for each. I am trying to parse the file to get only the ETEXT Nos.
The file looks like this:
TITLE and AUTHOR ETEXT NO.
Aspects of plant life; with special reference to the British flora,      56900
by Robert Lloyd Praeger
The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]
Nancy Brandon's Mystery, by Lillian Garis 56894
Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]
Pensées sans langage, par Francis Picabia 56892
[Language: French]
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]
Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]
The Blue Star, by Fletcher Pratt 56889
Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]
And this is what I tried:
def search_by_etext():
    fhand = open('GUTINDEX.ALL')
    print("Search by ETEXT:")
    for line in fhand:
        if not line.startswith(" [") and not line.startswith("~"):
            if not line.startswith(" ") and not line.startswith("TITLE"):
                words = line.rstrip()
                words = line.lstrip()
                words = words[-7:]
                print(words)

search_by_etext()
Well, the code mostly works. However, for some lines it gives me part of a title or an author name instead of an ETEXT number. For example, the output contains 'decott', which is part of the author name 'Randolph Caldecott' and shouldn't be there.
For this:
The Bashful Earthquake, by Oliver Herford                                56765
[Subtitle: and Other Fables and Verses]
The House of Orchids and Other Poems, by George Sterling                 56764
North Italian Folk, by Alice Vansittart Strettel Carr                    56763
 and Randolph Caldecott
[Subtitle: Sketches of Town and Country Life]
Wild Life in New Zealand. Part 1, Mammalia, by George M. Thomson 56762
[Subtitle: New Zealand Board of Science and Art, Manual No. 2]
Universal Brotherhood, Volume 13, No. 10, January 1899, by Various 56761
De drie steden: Lourdes, door Émile Zola 56760
[Language: Dutch]
Another example, for this input:
Rhandensche Jongens, door Jan Lens 56702
[Illustrator: Tjeerd Bottema]
[Language: Dutch]
The Story of The Woman's Party, by Inez Haynes Irwin 56701
Mormon Doctrine Plain and Simple, by Charles W. Penrose 56700
[Subtitle: Or Leaves from the Tree of Life]
The Stone Axe of Burkamukk, by Mary Grant Bruce 56699
[Illustrator: J. Macfarlane]
The Latter-Day Prophet, by George Q. Cannon 56698
[Subtitle: History of Joseph Smith Written for Young People]
Here, 'Life]' shouldn't be there. Lines starting with a blank space have been filtered out with this:
if not line.startswith(" [") and not line.startswith("~"):
But still I am getting those stray values in my output.
Simple solution: regexps to the rescue!
import re
with open("etext.txt") as f:
for line in f:
match = re.search(r" (\d+)$", line.strip())
if match:
print(match.group(1))
The regular expression " (\d+)$" matches a space followed by one or more digits at the end of the string, and captures only the digits.
You can improve the regexp further - e.g. if you know all etext codes are exactly 5 digits long, you can change it to " (\d{5})$".
This works with the example text you posted. If it doesn't properly work on your own file then we need enough of the real data to find out what you really have.
It could be that those extra lines that are not being filtered out start with whitespace other than a " " char, like a tab for example. As a minimal change that might work, try filtering out lines that start with any whitespace rather than specifically a space char?
To check for whitespace in general rather than a space char, you'll need to use regular expressions. Try if not re.match(r'^\s', line) and ...
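A minimal sketch of that change dropped into the function from the question (keeping the question's file name and trailing-slice approach):
import re

def search_by_etext():
    fhand = open('GUTINDEX.ALL')
    print("Search by ETEXT:")
    for line in fhand:
        # skip the header and any line that starts with whitespace (space, tab, ...)
        if re.match(r'\s', line) or line.startswith('TITLE') or line.startswith('~'):
            continue
        print(line.rstrip()[-7:].strip())   # last 7 characters, trimmed

search_by_etext()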

Regex in Python: How to match a word pattern, if not preceded by another word of variable length?

I would like to reconstruct full names from photo captions using regex in Python, by appending the last name back to the first name in patterns like "FirstName1 and FirstName2 LastName". We can rely on names starting with a capital letter.
For example,
'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'
'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'
I need to avoid matching patterns like 'Jay Smith and Albert Diamond' and generating a non-existent name 'Smith Diamond'.
The photo captions may or may not have more words before this pattern, for example, 'It was a great day hanging out with John and Stephen Diamond.'
This is the code I have so far:
s = 'John and Albert McDonald'
so = re.search('([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
    print so.group(1) + ' ' + so.group(2).split()[1]
    print so.group(2)
This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' will result in a non-existent name 'Smith Diamond'.
An idea would be to check whether the pattern is preceded by a capitalized word, something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately negative lookbehind only works if we know the exact length of the preceding word, which I don't.
Could you please let me know how I can correct my regex? Or is there a better way to do what I want? Thanks!
As you can rely on names starting with a capital letter, then you could do something like:
((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)
Swapping out your current pattern with this one should work with your current Python code. You do need to strip() the captured results though.
Which for your examples and current code would yield:
Input: John and Albert McDonald
First print: John McDonald
Second print: Albert McDonald

Input: Stephen Stewart, John and Albert Diamond
First print: John Diamond
Second print: Albert Diamond

Input: It was a great day hanging out with John and Stephen Diamond.
First print: John Diamond
Second print: Stephen Diamond
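For reference, a minimal sketch of the pattern plugged into the question's code, with strip() applied to both captures (the variable names are just illustrative):
import re

s = 'It was a great day hanging out with John and Stephen Diamond.'
so = re.search(r'((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)', s)
if so:
    first = so.group(1).strip()            # 'John'
    rest = so.group(2).strip()             # 'Stephen Diamond'
    print first + ' ' + rest.split()[-1]   # John Diamond
    print rest                             # Stephen Diamond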

How can I split lines based on characters in python?

I've recently started working with Python 2.7 and I've got an assignment in which I get a text file containing data separated by spaces. I need to split every line into strings that each contain only one type of data. Here's an example:
Bruce Wayne 10012-34321 2016.02.20. 231231
John Doe 10201-11021 2016.01.10. 2310456
Chris Taylor 10001-31021 2015.12.30. 524432
James Michael Kent 10210-41011 2016.02.03. 3235332
I want to separate them into name, id, date and balance, but the only thing I know is split, which I can't really use here because the name on the last line has three parts instead of two. How can I split a line based on characters? What could be the solution in this case?
Any help is appreciated.
You'll want to use str.rsplit() and supply a max number of splits, like this:
>>> s = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
>>> s.rsplit(' ', 3)
['James Michael Kent', '10210-41011', '2016.02.03.', '3235332']
>>> s = 'Chris Taylor 10001-31021 2015.12.30. 524432'
>>> s.rsplit(' ', 3)
['Chris Taylor', '10001-31021', '2015.12.30.', '524432']
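If every record is guaranteed to have exactly those four fields, you could also unpack the result directly (the variable names here are just illustrative):
>>> name, account, date, balance = s.rsplit(' ', 3)
>>> name
'Chris Taylor'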
What you need is to index into the list created by split(), counting from the end.
To get the details:
ln = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
ln.split()[-3:]
['10210-41011', '2016.02.03.', '3235332']
ln = 'Bruce Wayne 10012-34321 2016.02.20. 231231'
ln.split()[-3:]
['10012-34321', '2016.02.20.', '231231']
To get names:
ln.split()[:-3]
['Bruce', 'Wayne']
ln = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
ln.split()[:-3]
['James', 'Michael', 'Kent']

How to split a string by commas positioned outside of parenthesis?

I've got a string in the following format:
"Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
so basically it's a list of actors' names (optionally followed by their role in parentheses). The role itself can contain a comma (an actor's name cannot, I strongly hope).
My goal is to split this string into a list of pairs - (actor name, actor role).
One obvious solution would be to go through each character, check for occurrences of '(', ')' and ',', and split whenever a comma occurs outside the parentheses. But this seems a bit heavy...
I was thinking about splitting it using a regexp: first split the string by parentheses:
import re
x = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
s = re.split(r'[()]', x)
# ['Wilbur Smith ', 'Billy, son of John', ', Eddie Murphy ', 'John', ', Elvis Presley, Jane Doe ', 'Jane Doe', '']
The odd elements here are actor names, the even ones are the roles. Then I could split the names by commas and somehow extract the name-role pairs. But this seems even worse than my first approach.
Are there any easier / nicer ways to do this, either with a single regexp or a nice piece of code?
One way to do it is to use findall with a regex that greedily matches things that can go between separators. eg:
>>> s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> r = re.compile(r'(?:[^,(]|\([^)]*\))+')
>>> r.findall(s)
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']
The regex above matches one or more:
non-comma, non-open-paren characters
strings that start with an open paren, contain 0 or more non-close-parens, and then a close paren
One quirk about this approach is that adjacent separators are treated as a single separator. That is, you won't see an empty string. That may be a bug or a feature depending on your use-case.
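A quick way to see that quirk, reusing the compiled pattern r from above:
>>> r.findall('Elvis Presley,, Jane Doe')
['Elvis Presley', ' Jane Doe']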
Also note that regexes are not suitable for cases where nesting is a possibility. So for example, this would split incorrectly:
"Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"
If you need to deal with nesting your best bet would be to partition the string into parens, commas, and everthing else (essentially tokenizing it -- this part could still be done with regexes) and then walk through those tokens reassembling the fields, keeping track of your nesting level as you go (this keeping track of the nesting level is what regexes are incapable of doing on their own).
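Here is a rough sketch of that tokenize-and-track-nesting idea (the function and variable names are my own, not from any library):
import re

def split_outside_parens(s):
    """Split s on commas that are not inside (possibly nested) parentheses."""
    fields, current, depth = [], [], 0
    # tokenize into parens, commas and runs of everything else
    for token in re.findall(r'[(),]|[^(),]+', s):
        if token == '(':
            depth += 1
        elif token == ')':
            depth -= 1
        if token == ',' and depth == 0:
            fields.append(''.join(current).strip())
            current = []
        else:
            current.append(token)
    fields.append(''.join(current).strip())
    return fields

print split_outside_parens(
    "Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)")
# ['Wilbur Smith (son of John (Johnny, son of James), aka Billy)', 'Eddie Murphy (John)']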
s = re.split(r',\s*(?=[^)]*(?:\(|$))', x)
The lookahead matches everything up to the next open-parenthesis or to the end of the string, iff there's no close-parenthesis in between. That ensures that the comma is not inside a set of parentheses.
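Applied to the sample string from the question (assuming it is stored in x), this should give something like:
>>> re.split(r',\s*(?=[^)]*(?:\(|$))', x)
['Wilbur Smith (Billy, son of John)', 'Eddie Murphy (John)', 'Elvis Presley', 'Jane Doe (Jane Doe)']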
I think the best way to approach this would be to use python's built-in csv module.
Because the csv module only allows a one-character quotechar, you would need to do a replace on your input to convert ( and ) to something like | or ". Then make sure you are using an appropriate dialect and off you go.
An attempt at a human-readable regex:
import re
regex = re.compile(r"""
    # name starts and ends on word boundary
    # no '(' or commas in the name
    (?P<name>\b[^(,]+\b)
    \s*
    # everything inside parentheses is a role
    (?:\(
        (?P<role>[^)]+)
    \))?    # role is optional
    """, re.VERBOSE)

s = ("Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley,"
     " Jane Doe (Jane Doe)")
print re.findall(regex, s)
Output:
[('Wilbur Smith', 'Billy, son of John'), ('Eddie Murphy', 'John'),
('Elvis Presley', ''), ('Jane Doe', 'Jane Doe')]
My answer will not use regex.
I think a simple character scanner with an "in_actor_name" state should work. Remember that the "in_actor_name" state is terminated either by ')' or by a comma encountered while in that state.
My try:
s = 'Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)'

in_actor_name = 1
role = ''
name = ''
for c in s:
    if c == ')' or (c == ',' and in_actor_name):
        in_actor_name = 1
        name = name.strip()
        if name:
            print "%s: %s" % (name, role)
        name = ''
        role = ''
    elif c == '(':
        in_actor_name = 0
    else:
        if in_actor_name:
            name += c
        else:
            role += c
if name:
    print "%s: %s" % (name, role)
Output:
Wilbur Smith: Billy, son of John
Eddie Murphy: John
Elvis Presley:
Jane Doe: Jane Doe
Here's a general technique I've used in the past for such cases:
Use the sub function of the re module with a function as the replacement argument. The function keeps track of opening and closing parens, brackets and braces, as well as single and double quotes, and performs a replacement only outside of such bracketed and quoted substrings. You can then replace the non-bracketed/non-quoted commas with another character which you're sure doesn't appear in the string (I use the ASCII group separator, chr(29)), then do a simple string split on that character. Here's the code:
import re

def srchrepl(srch, repl, string):
    """Replace non-bracketed/quoted occurrences of srch with repl in string"""
    resrchrepl = re.compile(r"""(?P<lbrkt>[([{])|(?P<quote>['"])|(?P<sep>["""
                            + srch + """])|(?P<rbrkt>[)\]}])""")
    return resrchrepl.sub(_subfact(repl), string)

def _subfact(repl):
    """Replacement function factory for regex sub method in srchrepl."""
    level = 0
    qtflags = 0
    def subf(mo):
        nonlocal level, qtflags
        sepfound = mo.group('sep')
        if sepfound:
            if level == 0 and qtflags == 0:
                return repl
            else:
                return mo.group(0)
        elif mo.group('lbrkt'):
            level += 1
            return mo.group(0)
        elif mo.group('quote') == "'":
            qtflags ^= 1  # toggle bit 1
            return "'"
        elif mo.group('quote') == '"':
            qtflags ^= 2  # toggle bit 2
            return '"'
        elif mo.group('rbrkt'):
            level -= 1
            return mo.group(0)
    return subf
If you don't have nonlocal in your version of Python, just change it to global and define level and qtflags at the module level.
Here's how it's used:
>>> GRPSEP = chr(29)
>>> string = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> lst = srchrepl(',', GRPSEP, string).split(GRPSEP)
>>> lst
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']
This post helped me a lot. I was looking to split a string by commas positioned outside quotes, and I used this as a starter. My final line of code was regEx = re.compile(r'(?:[^,"]|"[^"]*")+') and it did the trick. Thanks a ton.
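For anyone curious, a tiny demonstration of that pattern on a made-up line:
>>> import re
>>> regEx = re.compile(r'(?:[^,"]|"[^"]*")+')
>>> regEx.findall('Smith, John, "Acme, Inc.", 2013')
['Smith', ' John', ' "Acme, Inc."', ' 2013']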
I certainly agree with Wogan above that using the csv module is a good approach. Having said that, if you still want to try a regex solution, give this a try, but you will have to adapt it to Python's regex dialect:
string.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)
HTH
split by ")"
>>> s="Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> s.split(")")
['Wilbur Smith (Billy, son of John', ', Eddie Murphy (John', ', Elvis Presley, Jane Doe (Jane Doe', '']
>>> for i in s.split(")"):
... print i.split("(")
...
['Wilbur Smith ', 'Billy, son of John']
[', Eddie Murphy ', 'John']
[', Elvis Presley, Jane Doe ', 'Jane Doe']
['']
You can do further checking to handle the names that don't come with ().
None of the answers above are correct if there are any errors or noise in your data.
It's easy to come up with a good solution if you know the data is right every time. But what happens if there are formatting errors? What do you want to have happen?
Suppose there are nested parentheses? Suppose there are unmatched parentheses? Suppose the string begins or ends with a comma, or has two in a row?
All of the above solutions will produce more or less garbage and not report it to you.
Were it up to me, I'd start with a pretty strict restriction on what "correct" data was - no nested parentheses, no unmatched parentheses, and no empty segments before, between or after commas - validate as I went, and then raise an exception if I wasn't able to validate.
