I'm trying to take a text-based list of names and convert it to a format for Active Directory.
My goal is to convert a string like this:
"Bob Smith, John Long, Matt Ball"
into
"Bob.Smith; John.Long; Matt.Ball"
So far I've done:
names = 'Bob Smith, John Long, Matt Ball'
x = names.replace(',', ';')
print(x.replace(' ', ''))
With the result:
"BobSmith;JohnLong;MattBall"
How can I achieve what I'm looking for?
Change
You only need a small adjustment to what is being replaced:
string = 'Bob Smith, John Long, Matt Ball'
string = string.replace(' ', '.').replace(',.', '; ')
print(string)
Result
'Bob.Smith; John.Long; Matt.Ball'
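To see why the chained replace works, here is a small sketch that prints the intermediate string. The variable names step1 and step2 are only for illustration.

```python
names = 'Bob Smith, John Long, Matt Ball'

# Step 1: every space becomes a dot, so each ", " separator turns into ",."
step1 = names.replace(' ', '.')

# Step 2: each ",." (the old separator) becomes "; "
step2 = step1.replace(',.', '; ')

print(step1)  # Bob.Smith,.John.Long,.Matt.Ball
print(step2)  # Bob.Smith; John.Long; Matt.Ball
```

Note that this relies on every comma in the input being followed by exactly one space.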
'; '.join(['.'.join(n.strip().split(' ')) for n in names.split(',')])
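The one-liner above can also be unpacked into a small function, which may be easier to read. This is only a sketch: to_ad_format is a made-up name, and it uses split() with no argument instead of strip().split(' '), so it also tolerates repeated or missing spaces around the commas.

```python
def to_ad_format(names):
    """Hypothetical helper: 'A B, C D' -> 'A.B; C.D'."""
    parts = []
    for full_name in names.split(','):
        # split() with no argument splits on any run of whitespace
        # and ignores leading/trailing whitespace
        words = full_name.split()
        parts.append('.'.join(words))
    return '; '.join(parts)


result = to_ad_format('Bob Smith,  John Long ,Matt Ball')
print(result)  # Bob.Smith; John.Long; Matt.Ball
```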
I want to add a separator (,) every two words to capture/better delineate the full names in each row.
For example df['Names'] is currently:
John Smith David Smith Golden Brown Austin James
and I would like to be:
John Smith, David Smith, Golden Brown, Austin James
I was able to find some code which splits the string every x words which would be perfect for my purposes shown below:
def splitText(string):
    words = string.split()
    grouped_words = [' '.join(words[i: i + 2]) for i in range(0, len(words), 2)]
    return grouped_words
However I'm not sure how to apply this to the column of choice.
I tried the following:
df['Names'].apply(splitText())
This gives me a missing positional argument.
Asking for any advice on either modifying the function or my application of it to a column dataframe. I'm pretty new to this stuff so any advice would be great!
Cheers
Pass the function itself, without the ():
df['Names'].apply(splitText)
This works the same way as using a lambda function:
df['Names'].apply(lambda x: splitText(x))
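Note that splitText returns a list for each row, while the question asks for a comma-separated string. A minimal sketch of one way to get the string, assuming plain two-word names (split_text and join_names are illustrative names, not from the question):

```python
def split_text(text):
    """Group the words of `text` into chunks of two words each."""
    words = text.split()
    return [' '.join(words[i:i + 2]) for i in range(0, len(words), 2)]


def join_names(text):
    """Join the two-word chunks with ', ' to get one string per row."""
    return ', '.join(split_text(text))


row = 'John Smith David Smith Golden Brown Austin James'
print(join_names(row))  # John Smith, David Smith, Golden Brown, Austin James

# With pandas the function would be passed the same way as above, e.g.
# df['Names'] = df['Names'].apply(join_names)
```

apply returns a new Series rather than changing the column in place, so the result has to be assigned back if the column should be updated.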
I have a string with first and last names all separated with a space.
For example:
installers = "Joe Bloggs John Murphy Peter Smith"
I now need to replace every second space with ', ' (comma followed by a space) and output this as string.
The desired output is
print(installers)
Joe Bloggs, John Murphy, Peter Smith
You should be able to do this with a regex that finds pairs of spaces and replaces the second one:
import re
installers = "Joe Bloggs John Murphy Peter Smith"
re.sub(r'(\s\S*?)\s', r'\1, ',installers)
# 'Joe Bloggs, John Murphy, Peter Smith'
This says: find a space followed by some non-space characters followed by another space, and replace the match with the captured space and non-space characters followed by ", ". You could call installers.strip() first if there's a possibility of trailing spaces on the string.
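As a rough sketch of the strip() suggestion, the substitution can be wrapped in a function. The name comma_every_second_space is made up for this example.

```python
import re


def comma_every_second_space(text):
    # strip() first: a trailing space would otherwise complete one more
    # "space, word, space" match and leave a dangling ", " at the end
    return re.sub(r'(\s\S*?)\s', r'\1, ', text.strip())


installers = "Joe Bloggs John Murphy Peter Smith "
print(comma_every_second_space(installers))
# Joe Bloggs, John Murphy, Peter Smith
```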
One way to do this is to split the string into a list of names, get an iterator for the list, then loop over the iterator in a for loop, collecting the first name and then advancing the iterator inside the loop to get the second name too.
names = installers.split()
it = iter(names)
out = []
for name in it:
    next_name = next(it)
    full_name = '{} {}'.format(name, next_name)
    out.append(full_name)
fixed = ', '.join(out)
print(fixed)
# Joe Bloggs, John Murphy, Peter Smith
The one line version of this would be
>>> ', '.join(' '.join(s) for s in zip(*[iter(installers.split())]*2))
'Joe Bloggs, John Murphy, Peter Smith'
This works by creating a list that contains the same iterator twice, so each tuple returned by zip holds both parts of one name. See also the grouper recipe from the itertools recipes.
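Spelled out step by step, the one-liner does roughly the following (the intermediate names it, pairs and result are only for illustration):

```python
installers = "Joe Bloggs John Murphy Peter Smith"

# One iterator over the words; zip() is handed the *same* iterator twice,
# so each tuple is built from two consecutive words.
it = iter(installers.split())
pairs = list(zip(it, it))
print(pairs)
# [('Joe', 'Bloggs'), ('John', 'Murphy'), ('Peter', 'Smith')]

result = ', '.join(' '.join(pair) for pair in pairs)
print(result)
# Joe Bloggs, John Murphy, Peter Smith
```

One thing to be aware of: with an odd number of words, zip silently drops the last word, whereas the explicit loop with next(it) would raise StopIteration.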
I would like to reconstruct full names from photo captions using regex in Python, by appending the last name to the first name in patterns like "FirstName1 and FirstName2 LastName". We can rely on names starting with a capital letter.
For example,
'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'
'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'
I need to avoid matching patterns like 'Jay Smith and Albert Diamond', which would generate the non-existent name 'Smith Diamond'.
The photo captions may or may not have more words before this pattern, for example, 'It was a great day hanging out with John and Stephen Diamond.'
This is the code I have so far:
import re
s = 'John and Albert McDonald'
so = re.search(r'([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
    print(so.group(1) + ' ' + so.group(2).split()[1])
    print(so.group(2))
This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' will result in a non-existent name 'Smith Diamond'.
An idea would be to check whether the pattern is preceded by a capitalized word, something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately negative lookbehind only works if we know the exact length of the preceding word, which I don't.
Could you please let me know how I can correct my regex expression? Or is there a better way to do what I want? Thanks!
Since you can rely on names starting with a capital letter, you could do something like:
((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)
Swapping your current pattern for this one should work with your current Python code. You do need to strip() the captured results though.
Which for your examples and current code would yield:
Input                                                         | First print   | Second print
John and Albert McDonald                                      | John McDonald | Albert McDonald
Stephen Stewart, John and Albert Diamond                      | John Diamond  | Albert Diamond
It was a great day hanging out with John and Stephen Diamond. | John Diamond  | Stephen Diamond
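With this pattern, the first group captures every capitalized word directly before "and", so for 'Jay Smith and Albert Diamond' it holds 'Jay Smith '. A minimal sketch of how that could be used to skip such captions, assuming we only expand when exactly one capitalized word precedes "and" (expand_shared_surname is a made-up helper name):

```python
import re

PATTERN = re.compile(r'((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)')


def expand_shared_surname(caption):
    """Return ['First1 Last', 'First2 Last'], or [] if the caption does not fit."""
    m = PATTERN.search(caption)
    if not m:
        return []
    before = m.group(1).split()   # capitalized words right before "and"
    after = m.group(2).split()    # capitalized words right after "and"
    # Assumption: more than one word before "and" means a full name is
    # already there (e.g. 'Jay Smith'), so nothing should be expanded.
    if len(before) != 1 or len(after) < 2:
        return []
    surname = ' '.join(after[1:])
    return [before[0] + ' ' + surname, ' '.join(after)]


print(expand_shared_surname('John and Albert McDonald'))
# ['John McDonald', 'Albert McDonald']
print(expand_shared_surname('Jay Smith and Albert Diamond'))
# []
```

A limitation of this check: a capitalized non-name word directly before the first name (for example a caption starting with "With John and ...") would also be counted and cause the caption to be skipped.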
What I Have
I'm working on parsing a .txt file that contains scheduling information for who works when on a given day. The .txt file looks like this:
START PAGE 0
XYZ Schedule for: Saturday, March 30, 2013
Barnes, Michael8:00a10:00aTech
Collins, Jessica8:00a4:00pSupervisor
Hamilton, Patricia8:00a10:00aTech
Smith, Jan8:00a10:00aTech
Park, Kimberly8:00a10:00aTech
Edwards, Terrell10:00a12:00pTech
Green, Harrold12:00p2:00pTech
Tait, Jessica12:00p2:00pTech
Tait, Jessica2:00p4:00pTech
Hernandez, William (Monte)4:00p6:30pSupervisor
Tait, Chioma4:00p6:00pTech
Hernandez, William (Monte)6:30p7:00pSupervisor
Hernandez, William (Monte)7:00p9:00pSupervisor
Tailor, Thomas (Jason)9:00p12:00aSupervisor
Jones, Deslynne10:00p12:00aTech
3/28/2013 2:21:17 PM
END PAGE 0
So the first two and last two lines are not relevant but every other line in the middle is the schedule for one person.
What I Want
I want to parse out the pieces of each line so that I can write it to a .csv file. I can use line.partition(',')[0] to get the last name (the first piece on each line) but after that I'm at a loss. I need to communicate the following to Python:
The part after the , to a number is a section (first name)
The part from the first number to either an a or a p (for am or pm) is another section (start time)
The part from the number just after that a or p to the next a or p is another section (end time)
Finally, the remaining section is another section (the type/position of the shift.)
A line in my resulting csv file might look like this:
Barnes,Michael,8:00a,10:00a,Tech
Things to Note
1) One person can have more than one shift during a day.
2) Some people have a nickname in parentheses but some don't.
3) If Python had wild cards like # for a number and * for anything I could see how I might be able to keep using partition and keep splitting the remaining pieces, something like this:
for line in input:
    name = str(line.partition(',')[0]+','+str(line.partition(',')[2].split(#)[0]))
    output.write("".join(x for x in name))
    output.write("\r\n")
However, Python doesn't seem to use wildcards like that. Also, this seems like a very inelegant solution.
This should be enough to get you started:
import re
data = '''Barnes, Michael8:00a10:00aTech
Collins, Jessica8:00a4:00pSupervisor
Hamilton, Patricia8:00a10:00aTech
Smith, Jan8:00a10:00aTech
Park, Kimberly8:00a10:00aTech
Edwards, Terrell10:00a12:00pTech
Green, Harrold12:00p2:00pTech
Tait, Jessica12:00p2:00pTech
Tait, Jessica2:00p4:00pTech
Hernandez, William (Monte)4:00p6:30pSupervisor
Tait, Chioma4:00p6:00pTech
Hernandez, William (Monte)6:30p7:00pSupervisor
Hernandez, William (Monte)7:00p9:00pSupervisor
Tailor, Thomas (Jason)9:00p12:00aSupervisor
Jones, Deslynne10:00p12:00aTech'''
print(re.findall(r'(.*?)(\d{1,2}:\d\d[ap])(\d{1,2}:\d\d[ap])(.*)', data))
prints
[('Barnes, Michael', '8:00a', '10:00a', 'Tech'),
('Collins, Jessica', '8:00a', '4:00p', 'Supervisor'),
('Hamilton, Patricia', '8:00a', '10:00a', 'Tech'),
('Smith, Jan', '8:00a', '10:00a', 'Tech'),
('Park, Kimberly', '8:00a', '10:00a', 'Tech'),
('Edwards, Terrell', '10:00a', '12:00p', 'Tech'),
('Green, Harrold', '12:00p', '2:00p', 'Tech'),
('Tait, Jessica', '12:00p', '2:00p', 'Tech'),
('Tait, Jessica', '2:00p', '4:00p', 'Tech'),
('Hernandez, William (Monte)', '4:00p', '6:30p', 'Supervisor'),
('Tait, Chioma', '4:00p', '6:00p', 'Tech'),
('Hernandez, William (Monte)', '6:30p', '7:00p', 'Supervisor'),
('Hernandez, William (Monte)', '7:00p', '9:00p', 'Supervisor'),
('Tailor, Thomas (Jason)', '9:00p', '12:00a', 'Supervisor'),
('Jones, Deslynne', '10:00p', '12:00a', 'Tech')]
Read the documentation of the re module to understand the regular expression. You can parse the names as a separate step or expand the regex to be more specific. I recommend using the csv module to write to a csv file.
If you get stuck, post specific questions with code.
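A minimal sketch of how the regex and the csv module could fit together. The function name schedule_to_csv and the extra leading group that separates last name from first name are additions for this example, not part of the answer above; the sketch writes to an in-memory buffer so it can be run as is.

```python
import csv
import io
import re

# last name, first name (+ optional nickname), start time, end time, position
LINE_RE = re.compile(r'(.*?),\s*(.*?)(\d{1,2}:\d\d[ap])(\d{1,2}:\d\d[ap])(.*)')


def schedule_to_csv(lines, out):
    """Write one CSV row per line that looks like a shift; skip the rest."""
    writer = csv.writer(out)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:  # header/footer lines do not match and are ignored
            writer.writerow(m.groups())


sample = [
    'START PAGE 0',
    'XYZ Schedule for: Saturday, March 30, 2013',
    'Barnes, Michael8:00a10:00aTech',
    'Hernandez, William (Monte)4:00p6:30pSupervisor',
    '3/28/2013 2:21:17 PM',
    'END PAGE 0',
]
buf = io.StringIO()
schedule_to_csv(sample, buf)
print(buf.getvalue())
# Barnes,Michael,8:00a,10:00a,Tech
# Hernandez,William (Monte),4:00p,6:30p,Supervisor
```

To write to a real file instead, open it with newline='' as the csv documentation recommends, and pass the file object as out.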
Assuming that you know how to remove the first two and last two lines, and that the rest is in a string called s, here is how I would do what you want:
entries = [x.strip() for x in s.split('\n') if x]
for entry in entries:
    # index of each digit that starts a run of digits; the first one marks the end of the name
    ind = [i for i,x in enumerate(entry) if x.isdigit() and not entry[i-1].isdigit()]
    name = entry[0:ind[0]]
    name = name.split(',')
    other = entry[ind[0]:]
    # index of each 'a'/'p' that directly follows a digit, i.e. the end of each time
    ind = [-1]+[i for i,x in enumerate(other) if x in ('a', 'p') and other[i-1].isdigit()]
    shifts = []
    for i in range(1, len(ind)):
        shifts.append(other[ind[i-1]+1:ind[i]+1])
    position = other[ind[-1]+1:]
    print(name, shifts, position)
This will work on an arbitrary number of shifts.
Output:
['Barnes', ' Michael'] ['8:00a', '10:00a'] Tech
['Collins', ' Jessica'] ['8:00a', '4:00p'] Supervisor
['Hamilton', ' Patricia'] ['8:00a', '10:00a'] Tech
['Smith', ' Jan'] ['8:00a', '10:00a'] Tech
['Park', ' Kimberly'] ['8:00a', '10:00a'] Tech
['Edwards', ' Terrell'] ['10:00a', '12:00p'] Tech
['Green', ' Harrold'] ['12:00p', '2:00p'] Tech
['Tait', ' Jessica'] ['12:00p', '2:00p'] Tech
['Tait', ' Jessica'] ['2:00p', '4:00p'] Tech
['Hernandez', ' William (Monte)'] ['4:00p', '6:30p'] Supervisor
['Tait', ' Chioma'] ['4:00p', '6:00p'] Tech
['Hernandez', ' William (Monte)'] ['6:30p', '7:00p'] Supervisor
['Hernandez', ' William (Monte)'] ['7:00p', '9:00p'] Supervisor
['Tailor', ' Thomas (Jason)'] ['9:00p', '12:00a'] Supervisor
['Jones', ' Deslynne'] ['10:00p', '12:00a'] Tech
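For completeness, a small sketch of the two steps this answer leaves out: dropping the first two and last two lines to build s, and removing the leading space from the first name. The variable names raw, lines, last and first are only illustrative, and raw is a shortened copy of the sample file.

```python
raw = """START PAGE 0
XYZ Schedule for: Saturday, March 30, 2013
Barnes, Michael8:00a10:00aTech
Collins, Jessica8:00a4:00pSupervisor
3/28/2013 2:21:17 PM
END PAGE 0"""

# Drop the two header lines and the two footer lines
lines = raw.splitlines()
s = '\n'.join(lines[2:-2])
print(s)

# Strip each part of the name so ' Michael' becomes 'Michael'
last, first = [part.strip() for part in 'Barnes, Michael'.split(',')]
print(last, first)  # Barnes Michael
```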