Using regex to split a column - python

The regex I am using is \d+-\d+, but I'm not quite sure about how to separate the Roman numbers and how to create a new column with them.
I have this dataset:
Date_Title Date Copies
05-21 I. Don Quixote 1605 252
21-20 IV. Macbeth 1629 987
10-12 ML. To Kill a Mockingbird 1960 478
12 V. Invisible Man 1897 136
Basically, I would like to split the "Date Title", so, when I print a row, I would get this:
('05-21 I', 'I', 'Don Quixote', 1605, 252)
Or
('10-12 ML', 'ML', 'To Kill a Mockingbird',1960, 478)
In the first place, the numbers and the roman numeral, in the second; only the Roman numeral, in the third the name, and the fourth and fifth would be the same as the dataset.

You can use
df = pd.DataFrame({'Date_Title':['05-21 I. Don Quixote','21-20 IV. Macbeth','10-12 ML. To Kill a Mockingbird','12 V. Invisible Man'], 'Date':[1605,1629,1960,1897], 'Copies':[252,987,478,136]})
rx = r'^(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})))\.\s*(.*)'
df[['NumRoman','Roman','Name']] = df.pop('Date_Title').str.extract(rx)
df = df[['NumRoman','Roman','Name', 'Date', 'Copies']]
>>> df
NumRoman Roman Name Date Copies
0 05-21 I I Don Quixote 1605 252
1 21-20 IV IV Macbeth 1629 987
2 10-12 ML ML To Kill a Mockingbird 1960 478
3 12 V V Invisible Man 1897 136
See the regex demo. Details:
^ - start of string
(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3}))) - Group 1 ("NumRoman"):
\d+(?:-\d+)? - one or more digits followed with an optional sequence of a - and one or more digits
\s* - zero or more whitespaces
(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})) - Group 2 ("Roman"): see How do you match only valid roman numerals with a regular expression? for explanation
\. - a dot
\s* - zero or more whitespaces
(.*) - Group 3 ("Name"): any zero or more chars other than line break chars, as many as possible
Note df.pop('Date_Title') removes the Date_Title column and yields it as input for the extract method. df = df[['NumRoman','Roman','Name', 'Date', 'Copies']] is necessary if you need to keep the original column order.

I am pretty sure there might be a more optimal solution, but this is would be a fast way of solving it:
df['Date_Title'] = df['Date_Title'].apply(lambda x: (x.split()[0],x.split()[1],' '.join(x.split()[2:])
Or:
df['Date_Title'] = (df['Date_Title'].str.split().str[0],
df['Date_Title'].str.split().str[1],
' '.join(df['Date_Title'].str.split().str[2:])

Focusing on the string split:
string = "21-20 IV. Macbeth"
i = string.index(".") # Finds the first point
date, roman = string[:i].split() # 21-20, IV
title = string[i+2:] # Macbeth

df=df.assign(x=df['Date_Title'].str.split('\.').str[0],y=df['Date_Title'].str.extract('(\w+(?=\.))'),z=df['Date_Title'].str.split('\.').str[1:].str.join(','))

Related

Dealing with comma and fullstops as per convention

I have various instance of strings such as:
- hello world,i am 2000to -> hello world, i am 2000 to
- the state was 56,869,12th -> the state was 66,869, 12th
- covering.2% -> covering. 2%
- fiji,295,000 -> fiji, 295,000
For dealing with first case, I came up with two step regex:
re.sub(r"(?<=[,])(?=[^\s])(?=[^0-9])", r" ", text) # hello world, i am 20,000to
re.sub(r"(?<=[0-9])(?=[.^[a-z])", r" ", text) # hello world, i am 20,000 to
But this breaks the text in some different ways and other cases are not covered as well. Can anyone suggest a more general regex that solves all cases properly. I've tried using replace, but it does some unintended replacements which in turn raise some other problems. I'm not an expert in regex, would appreciate pointers.
This approach covers your cases above by breaking the text into tokens:
in_list = [
'hello world,i am 2000to',
'the state was 56,869,12th',
'covering.2%',
'fiji,295,000',
'and another example with a decimal 12.3not4,5 is right out',
'parrot,, is100.00% dead'
'Holy grail runs for this portion of 100 minutes,!, 91%. Fascinating'
]
tokenizer = re.compile(r'[a-zA-Z]+[\.,]?|(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?(?:%|st|nd|rd|th)?[\.,]?')
for s in in_list:
print(' '.join(re.findall(pattern=tokenizer, string=s)))
# hello world, i am 2000 to
# the state was 56,869, 12th
# covering. 2%
# fiji, 295,000
# and another example with a decimal 12.3 not 4, 5 is right out
# parrot, is 100.00% dead
# Holy grail runs for this portion of 100 minutes, 91%. Fascinating
Breaking up the regex, each token is the longest available substring with:
Only letters with or without a period or comma,[a-zA-Z]+[\.,]?
OR |
A number-ish expression which could be
1 to 3 digits \d{1,3} followed by any number of groups of comma + 3 digits (?:,\d{3})+
OR | any number of comma-free digits \d+
optionally a decimal place followed by at least one digit (?:\.\d+),
optionally a suffix (percent, 'st', 'nd', 'rd', 'th') (?:[\.,%]|st|nd|rd|th)?
optionally period or comma [\.]?
Note the (?:blah) is used to suppress re.findall's natural desire to tell you how every parenthesized group matches up on an individual basis. In this case we just want it to walk forward through the string, and the ?: accomplishes this.

Python Regular expression of group to match text before amount

I am trying to write a python regular expression which captures multiple values from a few columns in dataframe. Below regular expression attempts to do the same. There are 4 parts of the string.
group 1: Date - month and day
group 2: Date - month and day
group 3: description text before amount i.e. group 4
group 4: amount - this group is optional
Some peculiar conditions for group 3 - text that
(1)the text itself might contain characters like "-" , "$". So we cannot use - & $ as the boundary of text.
(2) The text (group 3) sometimes may not be followed by amount.
(3) Empty space between group 3 and 4 is optional
Below is python function code which takes in a dataframe having 4 columns c1,c2,c3,c4 adds the columns dt, txt and amt after processing to dataframe.
def parse_values(args):
re_1='(([JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC]{3}\s{0,}[\d]{1,2})\s{0,}){2}(.*[\s]|.*[^\$]|.*[^-]){1}([-+]?\$[\d|,]+(?:\.\d+)?)?'
srch=re.search(re_1, args[0])
if srch is None:
return args
m = re.match(re_1, args[0])
args['dt']=m.group(1)
args['txt']=m.group(3)
args['amt']=m.group(4)
if m.group(4) is None:
if pd.isnull(args['c3']):
args['amt']=args.c2
else:
args['amt']=args.c3
return args
And in order to test the results I have below 6 rows which needs to return a properly formatted amt column in return.
tt=[{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL ','c2':'$16.84'},
{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL','c2':'$16.84'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK-$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK $80,00,7770.70'}
]
t=pd.DataFrame(tt,columns=['c1','c2','c3','c4'])
t=t.apply(parse_values,1)
t
However due to the error in my regular expression in re_1 I am not getting the amt column and txt column parsed properly as they return NaN or miss some words (as dipicted in some rows of the output image below).
How about this:
(((?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s*[\d]{1,2})\s*){2}(.*?)\s*(?=[\-$])([-+]?\$[\d|,]+(?:\.\d+)?)
As seen at regex101.com
Explanation:
First off, I've shortened the regex by changing a few minor details like using \s* instead of \s{0,}, which mean the exact same thing.
The whole [Jan|...|DEC] code was using a character class i.e. [], whcih only takes a single character from the entire set. Using non capturing groups is the correct way of selecting from different groups of multiple letters, which in your case are 'months'.
The meat of the regex: LOOKAHEADS
(?=[\-$]) tells the regex that the text before it in (.*) should match as much as it can until it finds a position followed by a dash or a dollar sign. Lookaheads don't actually match whatever they're looking for, they just tell the regex that the lookahead's arguments should be following that position.

Insert space to separate conjoined alpha and numeric strings - Python RegEx

In Python, I need to create a regex that inserts a space between any concatenated AlphaNum combinations. For example, this is what I want:
8min15sec ==> 8 min 15 sec
7m12s ==> 7 m 12 s
15mi25s ==> 15 mi 25 s
RegEx101 demo
I am blundering around with solutions found online, but they are a bit too complex for me to parse/modify. For example, I have this:
[a-zA-Z][a-zA-Z\d]*
but it only identifies the first insertion point: 8Xmin15sec (the X)
And this
(?<=[a-z])(?=[A-Z0-9])|(?<=[0-9])(?=[A-Z])
but it only finds this point: 8minX15sec (the X)
I could sure use a hand with the full syntax for finding each insertion point and inserting the spaces.
RegEx101 demo (same link as above)
How about the following approach:
import re
for test in ['8min15sec', '7m12s', '15mi25s']:
print(re.sub(r'(\d+|\D+)', r'\1 ', test).strip())
Which would give you:
8 min 15 sec
7 m 12 s
15 mi 25 s
You can use this regex, which marks the point which are boundaries of numbers and alphabets with either order i.e. number first then alphabets or vice versa.
(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)
This regex (?<=\d)(?=[a-zA-Z]) marks a point with positive lookahead to look for an alphabet and positive look behind to look for a digit.
Similarly, (?<=[a-zA-Z])(?=\d) does same but in opposite order.
And then just replace that mark by a space.
Demo
Here is sample python code for same.
import re
arr = ['8min15sec', '7m12s', '15mi25s']
for s in arr:
print (s + ' --> ' + re.sub('(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)', ' ',s))
Which prints following output,
8min15sec --> 8 min 15 sec
7m12s --> 7 m 12 s
15mi25s --> 15 mi 25 s
How about:
"(\d+)([a-zA-Z]+)"
to
"\1 \2 "
https://regex101.com/r/yvqCtQ/2
And in python:
In [59]: re.sub(r'(\d+)([a-zA-Z]+)', r'\1 \2 ', '8min15sec')
Out[59]: '8 min 15 sec '

Insert space after the second or third capital letter python

I have a pandas dataframe containing addresses. Some are formatted correctly like 481 Rogers Rd York ON. Others have a space missing between the city quandrant and the city name, for example: 101 9 Ave SWCalgary AB or even possibly: 101 9 Ave SCalgary AB, where SW refers to south west and S to south.
I'm trying to find a regex that will add a space between second and third capital letters if they are followed by lowercase letters, or if there are only 2 capitals followed by lower case, add a space between the first and second.
So far, I've found that ([A-Z]{2,3}[a-z]) will match the situation correctly, but I can't figure out how to look back into it and sub at position 2 or 3. Ideally, I'd like to use an index to split the match at [-2:] but I can't figure out how to do this.
I found that re.findall('(?<=[A-Z][A-Z])[A-Z][a-z].+', '101 9 Ave SWCalgary AB')
will return the last part of the string and I could use a look forward regex to find the start and then join them but this seems very inefficient.
Thanks
You may use
df['Test'] = df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2')
See this regex demo
Details
\b - a word boundary
([A-Z]{1,2}) - Capturing group 1 (later referred with \1 from the replacement pattern): one or two uppercase letters
([A-Z][a-z]) - Capturing group 2 (later referred with \2 from the replacement pattern): an uppercase letter + a lowercase one.
If you want to specifically match city quadrants, you may use a bit more specific regex:
df['Test'] = df['Test'].str.replace(r'\b([NS][EW]|[NESW])([A-Z][a-z])', r'\1 \2')
See this regex demo. Here, [NS][EW]|[NESW] matches N or S that are followed with E or W, or a single N, E, S or W.
Pandas demo:
import pandas as pd
df = pd.DataFrame({'Test':['481 Rogers Rd York ON',
'101 9 Ave SWCalgary AB',
'101 9 Ave SCalgary AB']})
>>> df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2')
0 481 Rogers Rd York ON
1 101 9 Ave SW Calgary AB
2 101 9 Ave S Calgary AB
Name: Test, dtype: object
You can use
([A-Z]{1,2})(?=[A-Z][a-z])
to capture the first (or first and second) capital letters, and then use lookahead for a capital letter followed by a lowercase letter. Then, replace with the first group and a space:
re.sub(r'([A-Z]{1,2})(?=[A-Z][a-z])', r'\1 ', str)
https://regex101.com/r/TcB4Ph/1

joining multiple regular expression for readability

I have following requirements in date which can be any of the following format.
mm/dd/yyyy or dd Mon YYYY
Few examples are shown below
04/20/2009 and 24 Jan 2001
To handle this I have written regular expression as below
Few text scenarios are metnioned below
txt1 = 'Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox
+ fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical
Review of Systems Constitutional:'
txt2 = "s The patient is a 44 year old married Caucasian woman,
unemployed Decorator, living with husband and caring for two young
children, who is referred by Capitol Hill Hospital PCP, Dr. Heather
Zubia, for urgent evaluation/treatment till first visit with Dr. Toney
Winkler IN EIGHT WEEKS on 24 Jan 2001."
date = re.findall(r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}', txtData)
I am not getting 24 Jan 2001 where as if I run individually (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}' I am able to get output.
Question 1: What is bug in above expression?
Question 2: I want to combine both to make more readable as I have to parse any other formats so I used join as shown below
RE1 = '(?:\b(?<!\.)[\d{0,2}]+) (?:[/-]\d{0,}[/-]\d{2,4})'
RE2 = '(?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
regex_all = '|'.join([RE1, RE2])
regex_all = re.compile(regex_all)
date = regex_all.findall(txtData) // notice here txtData can be any one of the above string.
I am getting output as NaN in case of above for date.
Please suggest what is the mistake if I join.
Thanks for your help.
Note that it is a very bad idea to join such long patterns that also match at the same location within the string. That would cause the regex engine to backtrack too much, and possibly lead to crashes and slowdown. If there is a way to re-write the alternations so that they could only match at different locations, or even get rid of them completely, do it.
Besides, you should use grouping constructs (...) to groups sequences of patterns, and only use [...] character classes when you need to matches specific chars.
Also, your alternatives are overlapping, you may combine them easily. See the fixed regex:
\b(?<!\.)\d{1,2}(?:[/-]\d+[/-]|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b
See the regex demo.
Details
\b - a word boundary
(?<!\.) - no . immediately to the left of the current location
\d{1,2} - 1 or 2 digits
(?: - start of a non-capturing alternation group:
[/-]\d+[/-] - / or -, 1+ digits, - or /
| - or
(?:th|st|[nr]d)?\s*(?:
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)) - th, st, nd or rd (optionally), followed with 0+ whitespaces, and then month names
\s* - 0+ whitespaces
(?:\d{4}|\d{2}) - 2 or 4 digits
\b - trailing word boundary.
Another note: if you want to match the date-like strings with two matching delimiters, you will need to capture the first one, and use a backreference to match the second one, see this regex demo. In Python, you would need a re.finditer to get those matches.
See this Python demo:
import re
rx = r"\b(?<!\.)\d{1,2}(?:([/-])\d+\1|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{4})\b"
s = "Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox\nfluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical\nReview of Systems Constitutional:\n\nThe patient is a 44 year old married Caucasian woman, unemployed Decorator, living with husband and caring for two young children, who is referred by Capitol Hill Hospital PCP, Dr. Heather Zubia, for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001"
print([x.group(0) for x in re.finditer(rx, s, re.I)])
# => ['7/11/77', '24 Jan 2001']
I think your approach is too complicated. I suggest using a combination of a simple regex and strptime().
import re
from datetime import datetime
date_formats = ['%m/%d/%Y', '%d %b %Y']
pattern = re.compile(r'\b(\d\d?/\d\d?/\d{4}|\d\d? \w{3} \d{4})\b')
data = "... your string ..."
for match in re.findall(pattern, data):
print("Trying to parse '%s'" % match)
for fmt in date_formats:
try:
date = datetime.strptime(match, fmt)
print(" OK:", date)
break
except:
pass
The advantage of this approach is, besides a much more manageable regex, that it won't pick dates that look plausible but do not exist, like 2/29/2000 (whereas 2/29/2004 works).
r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
you should use raw strings (r'foo') for each string, not only the first one. This way backslashes (\) will be considered as normal character and usable by the re library.
[abc|def] matches any character between the [], while (one|two|three) matches any expression (one, two, or three)

Categories

Resources