How can I split lines based on characters in Python?

I've recently started working with Python 2.7 and I've got an assignment in which I get a text file containing data separated by spaces. I need to split every line into strings, each containing only one type of data. Here's an example:
Bruce Wayne 10012-34321 2016.02.20. 231231
John Doe 10201-11021 2016.01.10. 2310456
Chris Taylor 10001-31021 2015.12.30. 524432
James Michael Kent 10210-41011 2016.02.03. 3235332
I want to separate each line into name, id, date, and balance, but the only tool I know is split, which I can't really use here because the last name in the example has three parts instead of two. How can I split a line based on characters? What could be the solution in this case?
Any help is appreciated.

You'll want to use str.rsplit() and supply a max number of splits, like this:
>>> s = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
>>> s.rsplit(' ', 3)
['James Michael Kent', '10210-41011', '2016.02.03.', '3235332']
>>> s = 'Chris Taylor 10001-31021 2015.12.30. 524432'
>>> s.rsplit(' ', 3)
['Chris Taylor', '10001-31021', '2015.12.30.', '524432']
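As a minimal sketch of how the rsplit() result could be turned into named fields, the helper below unpacks the four pieces into a dictionary. The function name parse_line, the dictionary keys, and the conversion of the balance to int are illustrative assumptions, not part of the original answer.

```python
def parse_line(line):
    """Split one record into name, id, date and balance (hypothetical helper)."""
    # Splitting from the right at most 3 times keeps the whole name together,
    # no matter how many words it contains.
    name, account_id, date, balance = line.strip().rsplit(' ', 3)
    return {
        'name': name,
        'id': account_id,
        'date': date,
        'balance': int(balance),  # assumption: balances are whole numbers
    }


record = parse_line('James Michael Kent 10210-41011 2016.02.03. 3235332')
print(record['name'])     # James Michael Kent
print(record['balance'])  # 3235332
```

For a file, the same helper could be called once per line inside a loop over the open file object.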

What you need is to index the list created by split from the end.
To get the details:
ln = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
ln.split()[-3:]
['10210-41011', '2016.02.03.', '3235332']
ln = 'Bruce Wayne 10012-34321 2016.02.20. 231231'
ln.split()[-3:]
['10012-34321', '2016.02.20.', '231231']
To get names:
ln.split()[:-3]
['Bruce', 'Wayne']
ln = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
ln.split()[:-3]
['James', 'Michael', 'Kent']
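The two slices can be combined so that the name comes back as a single string. This is only a sketch of that idea; the function name split_record and the variable names are made up for illustration.

```python
def split_record(line):
    """Return (name, id, date, balance) using slices from the end (hypothetical helper)."""
    parts = line.split()
    # Everything except the last three items belongs to the name.
    name = ' '.join(parts[:-3])
    # The last three items are always id, date and balance.
    account_id, date, balance = parts[-3:]
    return name, account_id, date, balance


print(split_record('James Michael Kent 10210-41011 2016.02.03. 3235332'))
# ('James Michael Kent', '10210-41011', '2016.02.03.', '3235332')
```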

Related

Parse a list of names from 'First Last' to 'First.Last'

I'm trying to take a text based list of names and convert it to a format for active directory.
My goal is to have a string like this:
"Bob Smith, John Long, Matt Ball"
into
"Bob.Smith; John.Long; Matt.Ball"
So far I've done:
names = 'Bob Smith, John Long, Matt Ball'
x = names.replace(',', ';')
print(x.replace(' ', ''))
With the result:
"BobSmith;JohnLong;MattBall"
How can I achieve what I'm looking for?
You just have to make a small adjustment to what is being replaced:
string = 'Bob Smith, John Long, Matt Ball'
string = string.replace(' ', '.').replace(',.', '; ')
print(string)
Result
'Bob.Smith; John.Long; Matt.Ball'
'; '.join(['.'.join(n.strip().split(' ')) for n in names.split(',')])
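The list-comprehension one-liner above can be easier to follow when written out step by step. The sketch below does the same thing in three stages; the function name to_ad_format and the intermediate variable names are hypothetical.

```python
def to_ad_format(names):
    """Turn 'First Last, First Last' into 'First.Last; First.Last' (hypothetical helper)."""
    # 1. Split on commas and trim the space left after each comma.
    people = [n.strip() for n in names.split(',')]
    # 2. Join the words of each name with a dot.
    dotted = ['.'.join(p.split()) for p in people]
    # 3. Join the dotted names with '; '.
    return '; '.join(dotted)


print(to_ad_format('Bob Smith, John Long, Matt Ball'))
# Bob.Smith; John.Long; Matt.Ball
```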

Python - Applying a function to separate string in column every two words

I want to add a separator (,) every two words to capture/better delineate the full names in each row.
For example df['Names'] is currently:
John Smith David Smith Golden Brown Austin James
and I would like to be:
John Smith, David Smith, Golden Brown, Austin James
I was able to find some code which splits the string every x words which would be perfect for my purposes shown below:
def splitText(string):
    words = string.split()
    grouped_words = [' '.join(words[i: i + 2]) for i in range(0, len(words), 2)]
    return grouped_words
However I'm not sure how to apply this to the column of choice.
I tried the following:
df['Names'].apply(splitText())
This gives me a missing positional argument.
Asking for any advice on either modifying the function or my application of it to a column dataframe. I'm pretty new to this stuff so any advice would be great!
Cheers
Pass the function itself, without the parentheses:
df['Names'].apply(splitText)
This works the same as using a lambda function:
df['Names'].apply(lambda x: splitText(x))
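The difference between passing a function and calling it can be seen without pandas. The sketch below uses the built-in map on a plain list as a stand-in for the column; the names rows, grouped, joined and raised are illustrative, and joining with ', ' is an assumption about the final string the asker wants.

```python
def splitText(string):
    words = string.split()
    return [' '.join(words[i:i + 2]) for i in range(0, len(words), 2)]


# A plain list standing in for the DataFrame column (assumption for illustration).
rows = ['John Smith David Smith Golden Brown Austin James']

# Passing the function object: map calls splitText once per row.
grouped = list(map(splitText, rows))

# Writing splitText() calls it immediately with no argument, which is the
# "missing positional argument" error from the question.
try:
    splitText()
    raised = False
except TypeError:
    raised = True

# Join each list of pairs back into one comma-separated string.
joined = [', '.join(names) for names in grouped]
print(joined[0])  # John Smith, David Smith, Golden Brown, Austin James
```

With pandas, the same joining step could presumably be done as df['Names'].apply(lambda s: ', '.join(splitText(s))).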

How to replace every second space with a comma the Pythonic way

I have a string with first and last names all separated with a space.
For example:
installers = "Joe Bloggs John Murphy Peter Smith"
I now need to replace every second space with ', ' (comma followed by a space) and output this as string.
The desired output is
print installers
Joe Bloggs, John Murphy, Peter Smith
You should be able to do this with a regex that finds the spaces and replaces every second one:
import re
installers = "Joe Bloggs John Murphy Peter Smith"
re.sub(r'(\s\S*?)\s', r'\1, ',installers)
# 'Joe Bloggs, John Murphy, Peter Smith'
This says, find a space followed by some non-spaces followed by a space and replace it with the found space followed by some non-spaces and ", ". You could add installers.strip() if there's a possibility of trailing spaces on the string.
One way to do this is to split the string into a list of names, get an iterator for the list, then loop over the iterator in a for loop, collecting the first name and then advancing the iterator inside the loop to get the second name too.
names = installers.split()
it = iter(names)
out = []
for name in it:
    # the for loop consumed the first name; next() pulls the matching last name
    next_name = next(it)
    full_name = '{} {}'.format(name, next_name)
    out.append(full_name)
fixed = ', '.join(out)
print(fixed)
Joe Bloggs, John Murphy, Peter Smith
The one line version of this would be
>>> ', '.join(' '.join(s) for s in zip(*[iter(installers.split())]*2))
'Joe Bloggs, John Murphy, Peter Smith'
This works by creating a list that contains the same iterator twice, so each tuple produced by zip takes two consecutive words: both parts of one name. See also the grouper recipe from the itertools recipes.
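Both the loop and the zip one-liner assume an even number of words: the loop raises StopIteration on a dangling first name, and zip silently drops it. A hedged variant in the spirit of the grouper recipe is sketched below for Python 3, using itertools.zip_longest; the function name pair_names and the choice to keep a lone trailing word are assumptions.

```python
from itertools import zip_longest


def pair_names(text, sep=', '):
    """Group words two at a time and join the pairs with sep (hypothetical helper)."""
    words = text.split()
    # The same iterator appears twice, so each tuple takes two consecutive words.
    # fillvalue='' pads the last tuple when the word count is odd.
    pairs = zip_longest(*[iter(words)] * 2, fillvalue='')
    return sep.join(' '.join(pair).strip() for pair in pairs)


print(pair_names("Joe Bloggs John Murphy Peter Smith"))
# Joe Bloggs, John Murphy, Peter Smith
```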

Regex in Python: How to match a word pattern, if not preceded by another word of variable length?

I would like to reconstruct full names from photo captions using regex in Python, by appending the last name to the first name in patterns like "FirstName1 and FirstName2 LastName". We can rely on names starting with a capital letter.
For example,
'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'
'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'
I need to avoid matching patterns like 'Jay Smith and Albert Diamond', which would generate the non-existent name 'Smith Diamond'.
The photo captions may or may not have more words before this pattern, for example, 'It was a great day hanging out with John and Stephen Diamond.'
This is the code I have so far:
import re
s = 'John and Albert McDonald'
so = re.search(r'([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
    print(so.group(1) + ' ' + so.group(2).split()[1])
    print(so.group(2))
This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' will result in a non-existent name 'Smith Diamond'.
An idea would be to check whether the pattern is preceded by a capitalized word, something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately negative lookbehind only works if we know the exact length of the preceding word, which I don't.
Could you please let me know how I can correct my regex expression? Or is there a better way to do what I want? Thanks!
As you can rely on names starting with a capital letter, then you could do something like:
((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)
Swapping your current pattern for this one should work with your current Python code. You do need to strip() the captured results, though, because the groups can include trailing whitespace.
Which for your examples and current code would yield:
Input | First print | Second print
John and Albert McDonald | John McDonald | Albert McDonald
Stephen Stewart, John and Albert Diamond | John Diamond | Albert Diamond
It was a great day hanging out with John and Stephen Diamond. | John Diamond | Stephen Diamond
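A minimal sketch of how the suggested pattern might be used with strip() is shown below. The function name expand_names is hypothetical, and so is the rule that a first group containing more than one word is treated as an already complete name; the answer itself does not spell that rule out. A second group with only one word (no last name to borrow) is not handled.

```python
import re

# Pattern taken from the answer above.
NAME_PAIR = re.compile(r'((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)')


def expand_names(caption):
    """Return the two full names found around 'and', or [] if none (hypothetical helper)."""
    match = NAME_PAIR.search(caption)
    if match is None:
        return []
    first = match.group(1).strip()   # groups can carry trailing whitespace
    second = match.group(2).strip()
    # Assumption: a single capitalized word before 'and' is a bare first name
    # that borrows the last name of the second person.
    if len(first.split()) == 1:
        first = first + ' ' + second.split()[-1]
    return [first, second]


print(expand_names('John and Albert McDonald'))
# ['John McDonald', 'Albert McDonald']
print(expand_names('Jay Smith and Albert Diamond'))
# ['Jay Smith', 'Albert Diamond']
```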

Parsing Text Document Lines in Python on Wildcards/Patterns

What I Have
I'm working on parsing a .txt file that contains scheduling information for who works when on a given day. The .txt file looks like this:
START PAGE 0
XYZ Schedule for: Saturday, March 30, 2013
Barnes, Michael8:00a10:00aTech
Collins, Jessica8:00a4:00pSupervisor
Hamilton, Patricia8:00a10:00aTech
Smith, Jan8:00a10:00aTech
Park, Kimberly8:00a10:00aTech
Edwards, Terrell10:00a12:00pTech
Green, Harrold12:00p2:00pTech
Tait, Jessica12:00p2:00pTech
Tait, Jessica2:00p4:00pTech
Hernandez, William (Monte)4:00p6:30pSupervisor
Tait, Chioma4:00p6:00pTech
Hernandez, William (Monte)6:30p7:00pSupervisor
Hernandez, William (Monte)7:00p9:00pSupervisor
Tailor, Thomas (Jason)9:00p12:00aSupervisor
Jones, Deslynne10:00p12:00aTech
3/28/2013 2:21:17 PM
END PAGE 0
So the first two and last two lines are not relevant but every other line in the middle is the schedule for one person.
What I Want
I want to parse out the pieces of each line so that I can write it to a .csv file. I can use line.partition(',')[0] to get the last name (the first piece on each line) but after that I'm at a loss. I need to communicate the following to Python:
The part after the , to a number is a section (first name)
The part from the first number to either an a or a p (for am or pm) is another section (start time)
The part from the number just after that a or p to the next a or p is another section (end time)
Finally, the remaining section is another section (the type/position of the shift.)
A line in my resulting csv file might look like this:
Barnes,Michael,8:00a,10:00a,Tech
Things to Note
1) One person can have more than one shift during a day.
2) Some people have a nickname in parentheses but some don't.
3) If Python had wild cards like # for a number and * for anything I could see how I might be able to keep using partition and keep splitting the remaining pieces, something like this:
for line in input:
    name = str(line.partition(',')[0]+','+str(line.partition(',')[2].split(#)[0]))
    output.write("".join(x for x in name))
    output.write("\r\n")
However, Python doesn't seem to use wildcards like that. Also, this seems like a very inelegant solution.
This should be enough to get you started:
import re
data = '''Barnes, Michael8:00a10:00aTech
Collins, Jessica8:00a4:00pSupervisor
Hamilton, Patricia8:00a10:00aTech
Smith, Jan8:00a10:00aTech
Park, Kimberly8:00a10:00aTech
Edwards, Terrell10:00a12:00pTech
Green, Harrold12:00p2:00pTech
Tait, Jessica12:00p2:00pTech
Tait, Jessica2:00p4:00pTech
Hernandez, William (Monte)4:00p6:30pSupervisor
Tait, Chioma4:00p6:00pTech
Hernandez, William (Monte)6:30p7:00pSupervisor
Hernandez, William (Monte)7:00p9:00pSupervisor
Tailor, Thomas (Jason)9:00p12:00aSupervisor
Jones, Deslynne10:00p12:00aTech'''
print(re.findall(r'(.*?)(\d{1,2}:\d\d[ap])(\d{1,2}:\d\d[ap])(.*)', data))
prints
[('Barnes, Michael', '8:00a', '10:00a', 'Tech'),
('Collins, Jessica', '8:00a', '4:00p', 'Supervisor'),
('Hamilton, Patricia', '8:00a', '10:00a', 'Tech'),
('Smith, Jan', '8:00a', '10:00a', 'Tech'),
('Park, Kimberly', '8:00a', '10:00a', 'Tech'),
('Edwards, Terrell', '10:00a', '12:00p', 'Tech'),
('Green, Harrold', '12:00p', '2:00p', 'Tech'),
('Tait, Jessica', '12:00p', '2:00p', 'Tech'),
('Tait, Jessica', '2:00p', '4:00p', 'Tech'),
('Hernandez, William (Monte)', '4:00p', '6:30p', 'Supervisor'),
('Tait, Chioma', '4:00p', '6:00p', 'Tech'),
('Hernandez, William (Monte)', '6:30p', '7:00p', 'Supervisor'),
('Hernandez, William (Monte)', '7:00p', '9:00p', 'Supervisor'),
('Tailor, Thomas (Jason)', '9:00p', '12:00a', 'Supervisor'),
('Jones, Deslynne', '10:00p', '12:00a', 'Tech')]
Read the documentation of the re module to understand the regular expression. You can parse the names as a separate step or expand the regex to be more specific. I recommend using the csv module to write to a csv file.
If you get stuck, post specific questions with code.
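As a rough sketch of the two follow-up steps suggested above (splitting the name and writing with the csv module), the regex can be extended with a comma so that last and first name land in separate groups. The function name schedule_to_csv, the extended pattern, and the use of io.StringIO instead of a real file are assumptions made to keep the example self-contained (Python 3).

```python
import csv
import io
import re

# Same idea as the pattern above, with the name split at the comma.
LINE_RE = re.compile(r'(.*?),\s*(.*?)(\d{1,2}:\d\d[ap])(\d{1,2}:\d\d[ap])(.*)')


def schedule_to_csv(text):
    """Return CSV text with one row per schedule line (hypothetical helper)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # header and footer lines do not match and are skipped
            writer.writerow(match.groups())
    return out.getvalue()


sample = """START PAGE 0
Barnes, Michael8:00a10:00aTech
Hernandez, William (Monte)4:00p6:30pSupervisor
END PAGE 0"""
print(schedule_to_csv(sample))
# Barnes,Michael,8:00a,10:00a,Tech
# Hernandez,William (Monte),4:00p,6:30p,Supervisor
```

To write to disk instead, the StringIO object could be swapped for a file opened with newline=''.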
Assuming that you know how to remove the first two and last two lines, and that the rest is in a string called s, here is how I would do what you want:
entries = [x.strip() for x in s.split('\n') if x]
for entry in entries:
    # indices where a run of digits starts
    ind = [i for i, x in enumerate(entry) if x.isdigit() and not entry[i-1].isdigit()]
    name = entry[0:ind[0]]  # everything before the first digit
    name = name.split(',')
    other = entry[ind[0]:]
    # indices of each 'a' or 'p' that ends a time (it follows a digit)
    ind = [-1] + [i for i, x in enumerate(other) if x in ('a', 'p') and other[i-1].isdigit()]
    shifts = []
    for i in range(1, len(ind)):
        shifts.append(other[ind[i-1]+1:ind[i]+1])
    position = other[ind[-1]+1:]
    print(name, shifts, position)
This will work on an arbitrary number of shifts.
Output:
['Barnes', ' Michael'] ['8:00a', '10:00a'] Tech
['Collins', ' Jessica'] ['8:00a', '4:00p'] Supervisor
['Hamilton', ' Patricia'] ['8:00a', '10:00a'] Tech
['Smith', ' Jan'] ['8:00a', '10:00a'] Tech
['Park', ' Kimberly'] ['8:00a', '10:00a'] Tech
['Edwards', ' Terrell'] ['10:00a', '12:00p'] Tech
['Green', ' Harrold'] ['12:00p', '2:00p'] Tech
['Tait', ' Jessica'] ['12:00p', '2:00p'] Tech
['Tait', ' Jessica'] ['2:00p', '4:00p'] Tech
['Hernandez', ' William (Monte)'] ['4:00p', '6:30p'] Supervisor
['Tait', ' Chioma'] ['4:00p', '6:00p'] Tech
['Hernandez', ' William (Monte)'] ['6:30p', '7:00p'] Supervisor
['Hernandez', ' William (Monte)'] ['7:00p', '9:00p'] Supervisor
['Tailor', ' Thomas (Jason)'] ['9:00p', '12:00a'] Supervisor
['Jones', ' Deslynne'] ['10:00p', '12:00a'] Tech
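The answer above assumes the two header and two footer lines are already gone. One possible way to build the string s from the raw file contents is sketched here; the function name schedule_body is hypothetical, and it relies on the layout shown in the question (exactly two non-schedule lines at each end).

```python
def schedule_body(text):
    """Drop blank lines plus the first two and last two lines (hypothetical helper)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Assumption: the page always starts and ends with two non-schedule lines.
    return '\n'.join(lines[2:-2])


raw = """START PAGE 0
XYZ Schedule for: Saturday, March 30, 2013
Barnes, Michael8:00a10:00aTech
Jones, Deslynne10:00p12:00aTech
3/28/2013 2:21:17 PM
END PAGE 0
"""
s = schedule_body(raw)
print(s)
# Barnes, Michael8:00a10:00aTech
# Jones, Deslynne10:00p12:00aTech
```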
