Extracting multiple digits from pandas non-uniform column with classification - python

I have a df with a text column 'DescCol' whose values have been entered inconsistently, though with somewhat similar patterns. I need to:
(a) extract all substrings within parentheses
(b) if the extracted substring contains numbers then:
(b.i) check whether the text at the start of the substring is one of ('Up to', '<', 'Tolerance'), and mark a boolean column 'isToleranceSpec' Yes for 'Tolerance', No otherwise
(b.ii) extract the number following the beginning text of the substring (which may or may not have a comma separator) into a column called 'BandLimit'
(b.iii) then check if there is further follow-on text ('thereafter' AFAIK)
(b.iv) if (b.iii) then extract the number following 'thereafter' into a column called 'Marginal' else continue
(c) if not (b): continue
So the result df will look like below (the 'Remarks' column highlights some of the peculiarities I've noticed in the data so far):
df = pd.DataFrame({"DescCol": ["beginning text (Up to 1,234 days, thereafter 11d each) ending text",
                               "beginning text (Up to 1234 days, thereafter 11d each) ending text",
                               "beginning text (Tolerance 4,567 days, thereafter 12d each) ending text",
                               "beginning text (Tolerance 4567 days, thereafter 12d each) ending text",
                               "beginning text (Tolerance 891011 days) ending text",
                               "beginning text (<1,112 days, thereafter 13d each) ending text",
                               "beginning text (no numbers within parentheses) ending text"],
                   "Remarks": ["comma in number",
                               "no comma in number",
                               "tolerance with thereafter, comma in large number",
                               "tolerance with thereafter, no comma in large number",
                               "tolerance without thereafter",
                               "less than sign used + comma in number",
                               "non-relevant row"],
                   "isToleranceSpec": ["No", "No", "Yes", "Yes", "Yes", "No", ''],
                   "BandLimit": [1234, 1234, 4567, 4567, 891011, 1112, ''],
                   "Marginal": [11, 11, 12, 12, '', 13, '']})
I can uppercase DescCol and extract the substring between '(' and ')'; any pithy solutions after that are very welcome. Thanks

Not sure this is what you want, but here is a pithy solution:
import re

def extract_infos(row):
    # keep only rows that have numbers inside parentheses
    m = re.findall(r'\(.*\d.*\)', row.DescCol)
    if len(m) != 1:
        return row
    t = m[0][1:-1]  # strip the ()
    # tolerance flag and band limit
    row['isToleranceSpec'] = 'Yes' if t.startswith('Tolerance') else 'No'
    row['BandLimit'] = int(re.findall(r'\d+,?\d*', t)[0].replace(',', ''))
    # marginal
    m = re.search(r'thereafter (\d+)', t)
    if m is not None:
        row['Marginal'] = int(m.group(1))
    return row
This method can then be used like this:
# start with a DataFrame that has only DescCol
start = your_example_df[['DescCol']].copy()
# add default column values
for c in ['isToleranceSpec', 'BandLimit', 'Marginal']:
    start[c] = ''  # odd to have empty strings in int columns, but it matches the desired output
# do the magic: apply row-wise and keep the returned rows
start = start.apply(extract_infos, axis=1)
It does work for your example, but you might want to add some additional checks (e.g. I assume that if there is a 'thereafter', there is always a number after it, etc.)
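If you prefer a vectorized route, here is a minimal sketch (the pattern and column handling are my own assumptions, not part of the answer above) using Series.str.extract with named groups:

```python
import pandas as pd

df = pd.DataFrame({"DescCol": [
    "beginning text (Up to 1,234 days, thereafter 11d each) ending text",
    "beginning text (Tolerance 891011 days) ending text",
    "beginning text (no numbers within parentheses) ending text"]})

# Named groups land directly in columns; rows that do not match become NaN.
pat = (r'\((?P<prefix>Up to|<|Tolerance)\s*'
       r'(?P<BandLimit>\d+(?:,\d+)*)'
       r'(?:.*?thereafter\s*(?P<Marginal>\d+))?')
out = df['DescCol'].str.extract(pat)

df['isToleranceSpec'] = (out['prefix'].eq('Tolerance')
                         .map({True: 'Yes', False: 'No'})
                         .where(out['prefix'].notna(), ''))
df['BandLimit'] = out['BandLimit'].str.replace(',', '', regex=False)
df['Marginal'] = out['Marginal']
```

BandLimit and Marginal stay as strings here (with NaN for non-matching rows); cast with pd.to_numeric if you need integers.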

Related

regex - 1. add space between string, 2. ignore certain pattern

I have two things that I would like to replace in my text files.
Add " " between the characters of strings ending with '#' (e.g. ABC# becomes A B C)
Ignore strings ending with 'H' or matching 'xx:xx:xx' (e.g. 1111H - ignore); if it is plain digits like 1111, process it into 'one one one one'
So far this is my code:
import os
import re

dest1 = r"C:\Users\CL\Desktop\Folder"
files = os.listdir(dest1)
# dictionary to process int to str
numbers = {"0": "ZERO ", "1": "ONE ", "2": "TWO ", "3": "THREE ", "4": "FOUR ",
           "5": "FIVE ", "6": "SIX ", "7": "SEVEN ", "8": "EIGHT ", "9": "NINE "}
for f in files:
    with open(os.path.join(dest1, f), "r") as fh:
        text_read = fh.read()
    # num sub pattern
    text_read = re.sub(r'[%s]\s?' % ''.join(numbers),
                       lambda x: numbers[x.group().strip()] + ' ', text_read)
    # write result to file
    with open(os.path.join(dest1, f), "w") as fh:
        fh.write(text_read)
sample .txt
1111H I have 11 ABC# apples
11:12:00 I went to my# room
output required
1111H I have ONE ONE A B C apples
11:12:00 I went to M Y room
Also, I realized that when I write the new result, the formatting gets 'messy' without the line breaks; not sure why.
#current output
ONE ONE ONE ONE H - I HAVE ONE ONE ABC# APPLES
ONE ONE ONE TWO H - I WENT TO MY# ROOM
#overwritten output
ONE ONE ONE ONE H - I HAVE ONE ONE ABC# APPLES ONE ONE ONE TWO H - I WENT TO MY# ROOM
You can use:
def process_match(x):
    if x.group(1):
        return " ".join(x.group(1).upper())
    elif x.group(2):
        return numbers[x.group(2)]
    else:
        return x.group()

print(re.sub(r'\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b|\b([A-Za-z]+)#|([0-9])', process_match, text_read))
# => 1111H I have ONE ONE A B C apples
# 11:12:00 I went to M Y room
The main idea behind this approach is to parse the string only once, capturing (or not) parts of it, and to process each match on the go: either returning it as is (if it was not captured) or converting the captured chunk of text.
Regex details:
\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b - a word boundary, and then either one or more digits and one or more uppercase letters, or three occurrences of colon-separated double digits, and then a word boundary
| - or
\b([A-Za-z]+)# - Group 1: words with # at the end: a word boundary, then one or more letters, and a #
| - or
([0-9]) - Group 2: an ASCII digit.
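As a quick sanity check of the three branches (a sketch using the question's digit dictionary; note each digit's replacement carries its own trailing space, so the raw result can have extra spacing, normalized here):

```python
import re

numbers = {"0": "ZERO ", "1": "ONE ", "2": "TWO ", "3": "THREE ", "4": "FOUR ",
           "5": "FIVE ", "6": "SIX ", "7": "SEVEN ", "8": "EIGHT ", "9": "NINE "}

def process_match(x):
    if x.group(1):            # Group 1: word ending in '#' -> space out its letters
        return " ".join(x.group(1).upper())
    elif x.group(2):          # Group 2: lone digit -> spell it out
        return numbers[x.group(2)]
    else:                     # 1111H / 11:12:00 -> returned untouched
        return x.group()

pattern = r'\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b|\b([A-Za-z]+)#|([0-9])'
result = re.sub(pattern, process_match, "1111H I have 11 ABC# apples")
print(" ".join(result.split()))  # normalize spacing
```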

Replace N digit numbers in a sentence with specific strings for different values of N

I have a bunch of strings in a pandas dataframe that contain numbers in them. I could run the below code and replace them all:
df.feature_col = df.feature_col.str.replace(r'\d+', ' NUM ', regex=True)
But what I need to do is replace any 10 digit number with a string like masked_id, any 16 digit numbers with account_number, or any three-digit numbers with yet another string, and so on.
How do I go about doing this?
PS: since my data size is less, a less optimal way is also good enough for me.
Another way is to replace with the option regex=True and a dictionary. You can also use somewhat more relaxed match patterns (applied in order) than Tim's:
# test data
df = pd.DataFrame({'feature_col': ['this has 1234567',
                                   'this has 1234',
                                   'this has 123',
                                   'this has none']})
# patterns in decreasing length order
# these of course would also replace '12345' with 'ID45' :-)
df['feature_col'] = df.feature_col.replace({r'\d{7}': 'ID7',
                                            r'\d{4}': 'ID4',
                                            r'\d{3}': 'ID3'},
                                           regex=True)
Output:
feature_col
0 this has ID7
1 this has ID4
2 this has ID3
3 this has none
You could do a series of replacements, one for each length of number:
df.feature_col = df.feature_col.str.replace(r'\b\d{3}\b', ' 3mask ', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{10}\b', 'masked_id', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{16}\b', 'account_number', regex=True)
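The two answers combine naturally: a dictionary of patterns applied in one replace call, with \b anchors so a long number is never nibbled by a shorter rule. A sketch (treating masked_id/account_number as literal placeholder strings, which is my assumption about what's wanted):

```python
import pandas as pd

df = pd.DataFrame({'feature_col': ['card 1234567890123456',
                                   'user 0123456789',
                                   'code 123',
                                   'nothing here']})

# \b on both sides: \b\d{10}\b cannot match inside a 16-digit run,
# so the rules do not interfere regardless of order.
rules = {r'\b\d{16}\b': 'account_number',
         r'\b\d{10}\b': 'masked_id',
         r'\b\d{3}\b': '3mask'}
df['feature_col'] = df['feature_col'].replace(rules, regex=True)
print(df['feature_col'].tolist())
```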

Parsing complicated list of strings using regex, loops, enumerate, to produce a pandas dataframe

I have a long list of many elements, each element is a string. See below sample:
data = ['BAT.A.100', 'Regulation 2020-1233', 'this is the core text of', 'the regulation referenced ',
        'MOC to BAT.A.100', 'this', 'is', 'one method of demonstrating compliance to BAT.A.100',
        'BAT.A.120', 'Regulation 2020-1599', 'core text of the regulation ...', ' more free text', 'more free text',
        'BAT.A.145', 'Regulation 2019-3333', 'core text of', 'the regulation1111',
        'MOC to BAT.A.145', 'here is how you can show compliance to BAT.A.145', 'more free text',
        'MOC2 to BAT.A.145', ' here is yet another way of achieving compliance']
My desired output is ultimately a Pandas DataFrame as follows:
As the strings may have to be concatenated, I first join all the elements into a single string, using ## to separate the pieces that have been joined.
I am going all-regex because there would be a lot of conditions to check otherwise.
import re
import pandas as pd

re_req = re.compile(r'##(?P<Short_ref>BAT\.A\.\d{3})'
                    r'##(?P<Full_Reg_ref>Regulation\s\d{4}-\d{4})'
                    r'##(?P<Reg_text>.*?MOC to \1|.*?(?=##BAT\.A\.\d{3})(?!\1))'
                    r'(?:##)?(?:(?P<Moc_text>.*?MOC2 to \1)(?P<MOC2>(?:##)?.*?(?=##BAT\.A\.\d{3})(?!\1)|.+)'
                    r'|(?P<Moc_text_temp>.*?(?=##BAT\.A\.\d{3})(?!\1)))')
final_list = []
for match in re_req.finditer("##" + "##".join(data)):
    inner_list = [match.group('Short_ref').replace("##", " "),
                  match.group('Full_Reg_ref').replace("##", " "),
                  match.group('Reg_text').replace("##", " ")]
    if match.group('Moc_text_temp'):  # just Moc_text is present
        inner_list += [match.group('Moc_text_temp').replace("##", " "), ""]
    elif match.group('Moc_text') and match.group('MOC2'):  # both Moc_text and MOC2 are present
        inner_list += [match.group('Moc_text').replace("##", " "), match.group('MOC2').replace("##", " ")]
    else:  # neither Moc_text nor MOC2 is present
        inner_list += ["", ""]
    final_list.append(inner_list)
final_df = pd.DataFrame(final_list, columns=['Short_ref', 'Full_Reg_ref', 'Reg_text', 'Moc_text', 'MOC2'])
The first and second lines of the regex are the same as the ones you posted earlier and identify the first two columns.
The third line, r'##(?P<Reg_text>.*?MOC to \1|.*?(?=##BAT\.A\.\d{3})(?!\1))', matches all text up to 'MOC to <Short_ref>', or else all the text before the next Short_ref. The (?=##BAT\.A\.\d{3})(?!\1) part takes the text up to the next Short_ref pattern, provided that Short_ref is not the current one.
The fourth line is for when Moc_text and MOC2 are both present, and it is OR'd with the fifth line, which covers the case when just Moc_text is present. This part of the regex is similar to the third line.
Finally, we loop over all the matches using finditer and construct the rows of the dataframe.
final_df:

How to replace all numbers (with letters/symbols attached, i.e. 43$) in a dataframe column?

I have a dataframe of online comments related to the stock market.
Here's an example:
df = pd.DataFrame({'id': [1, 2, 3],
                   'comment': ["I made $425",
                               "I got mine at 42c. per share",
                               "Stocks saw a 12% increase"]})
I would like to replace all numbers in the dataframe (including the symbols and letters) with NUMBER to achieve:
"I made NUMBER",
"I got mine at NUMBER per share",
"Stocks saw a NUMBER increase"
I found a close solution in a previous comment, but this solution still leaves me with the remaining letters and symbols.
def repl(x):
    return re.sub(r'\d+', lambda m: "NUMBER", x)

repl("I made 428c with a 52% increase")
>> I made NUMBERc with a NUMBER% increase
Any help would be appreciated, thanks.
This should work:
import re

def repl(x):
    return re.sub(r'\S*\d+\S*', 'NUMBER', x)

print(repl("I made 428c with a 52% increase"))
Output:
I made NUMBER with a NUMBER increase
You can use a [^\d\s]*\d\S* regex to match any chunk of 0 or more chars other than digit and whitespace, then a digit, and then any amount of non-whitespace chars, and replace with NUMBER using a vectorized Series.str.replace method.
See a Pandas test:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'comment': ["I made $425",
                               "I got mine at 42c. per share",
                               "Stocks saw a 12% increase"]})
df['comment'] = df['comment'].str.replace(r'[^\s\d]*\d\S*', 'NUMBER', regex=True)
df
# => id comment
# => 0 1 I made NUMBER
# => 1 2 I got mine at NUMBER per share
# => 2 3 Stocks saw a NUMBER increase
Details:
[^\d\s]* - zero or more (*) occurrences of any char but a digit and whitespace ([^\d\s] is a negated character class)
\d - any digit char
\S* - zero or more non-whitespace chars.
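A quick standalone check of that pattern (a sketch with plain re, outside pandas):

```python
import re

# any non-digit/non-space prefix, then a digit, then any non-space tail
pat = re.compile(r'[^\d\s]*\d\S*')
result = pat.sub('NUMBER', "I made $425 with a 52% increase")
print(result)  # -> I made NUMBER with a NUMBER increase
```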
Try this:
def repl(l):
    s = ""
    for i in l.split():
        if any(ch.isdigit() for ch in i):
            s += "NUMBER" + ' '
        else:
            s += i + ' '
    return s.strip()

Group naming with group and nested regex (unit conversion from text file)

Basic Question:
How can you name a python regex group with another group value and nest this within a larger regex group?
Origin of Question:
Given a string such as 'Your favorite song is 1 hour 23 seconds long. My phone only records for 1 h 30 mins and 10 secs.'
What is an elegant solution for extracting the times and converted to a given unit?
Attempted Solution:
My best guess at a solution would be to create a dictionary and then perform operations on the dictionary to convert to the desired unit.
i.e. convert the given string to this:
string[0]:
{'time1': {'day':0, 'hour':1, 'minutes':0, 'seconds':23, 'milliseconds':0}, 'time2': {'day':0, 'hour':1, 'minutes':30, 'seconds':10, 'milliseconds':0}}
string[1]:
{'time1': {'day':4, 'hour':2, 'minutes':3, 'seconds':6, 'milliseconds':30}}
I have a regex solution, but it isn't doing what I would like:
import re
test_string = ['Your favorite song is 1 hour 23 seconds long. My phone only records for 1h 30 mins and 10 secs.',
               'This video is 4 days 2h 3min 6sec 30ms']
year_units = ['year', 'years', 'y']
day_units = ['day', 'days', 'd']
hour_units = ['hour', 'hours', 'h']
min_units = ['minute', 'minutes', 'min', 'mins', 'm']
sec_units = ['second', 'seconds', 'sec', 'secs', 's']
millisec_units = ['millisecond', 'milliseconds', 'millisec', 'millisecs', 'ms']
all_units = '|'.join(year_units + day_units + hour_units + min_units + sec_units + millisec_units)
print(all_units)
# pattern = r"""(?P<time> # time group beginning
# (?P<value>[\d]+) # value of time unit
# \s* # may or may not be space between digit and unit
# (?P<unit>%s) # unit measurement of time
# \s* # may or may not be space between digit and unit
# )
# \w+""" % all_units
pattern = r""".*(?P<time> # time group beginning
(?P<value>[\d]+) # value of time unit
\s* # may or may not be space between digit and unit
(?P<unit>%s) # unit measurement of time
\s* # may or may not be space between digit and unit
).* # may be words in between the times
""" % (all_units)
regex = re.compile(pattern)
for val in test_string:
    match = regex.search(val)
    print(match)
    print(match.groupdict())
This fails miserably due to not being able to properly deal with nested groupings and not being able to assign a name with the value of a group.
First of all, you can't just write a multiline regex with comments and expect it to match anything if you don't use the re.VERBOSE flag:
regex = re.compile(pattern, re.VERBOSE)
Like you said, the best solution is probably to use a dict:
for val in test_string:
    while True:  # find all times
        match = regex.search(val)  # find the first unit
        if not match:
            break
        matches = {}  # keep track of all units and their values
        while True:
            matches[match.group('unit')] = int(match.group('value'))  # add the match to the dict
            val = val[match.end():]  # remove the matched part so subsequent matches must start at index 0
            m = regex.search(val)
            if not m or m.start() != 0:  # no more matches, or text between this match and the next: abort
                break
            match = m
        print(matches)  # the finished dict
        # output will be like {'h': 1, 'secs': 10, 'mins': 30}
However, the code above won't work just yet. We need to make two adjustments:
The pattern cannot allow just any text between matches. To allow only whitespace and the word "and" between two matches, you can use
pattern = r"""(?P<time> # time group beginning
(?P<value>[\d]+) # value of time unit
\s* # may or may not be space between digit and unit
(?P<unit>%s) # unit measurement of time
\s* # may or may not be space between digit and unit
(?:\band\s+)? # allow the word "and" between numbers
) # may be words in between the times
""" % (all_units)
You have to change the order of your units like so:
year_units = ['years', 'year', 'y'] # yearS before year
day_units = ['days', 'day', 'd'] # dayS before day, etc...
Why? Because with a text like '3 years and 1 day', it would match '3 year' instead of '3 years'.
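The ordering point is easy to demonstrate in isolation (a minimal sketch; alternation tries branches left to right and takes the first one that matches):

```python
import re

# same value/unit groups as above, reduced to the year units only
wrong = re.compile(r'(?P<value>\d+)\s*(?P<unit>year|years|y)')
right = re.compile(r'(?P<value>\d+)\s*(?P<unit>years|year|y)')

text = '3 years and 1 day'
print(wrong.search(text).group('unit'))  # 'year'  -- stops one letter short
print(right.search(text).group('unit'))  # 'years'
```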
