Group naming with group and nested regex (unit conversion from text file) - python

Basic Question:
How can you name a Python regex group using the value of another group, and nest this within a larger regex group?
Origin of Question:
Given a string such as 'Your favorite song is 1 hour 23 seconds long. My phone only records for 1h 30 mins and 10 secs.'
What is an elegant solution for extracting the times and converting them to a given unit?
Attempted Solution:
My best guess at a solution would be to create a dictionary and then perform operations on the dictionary to convert to the desired unit.
i.e. convert the given string to this:
string[0]:
{'time1': {'day':0, 'hour':1, 'minutes':0, 'seconds':23, 'milliseconds':0}, 'time2': {'day':0, 'hour':1, 'minutes':30, 'seconds':10, 'milliseconds':0}}
string[1]:
{'time1': {'day':4, 'hour':2, 'minutes':3, 'seconds':6, 'milliseconds':30}}
I have a regex solution, but it isn't doing what I would like:
import re
test_string = ['Your favorite song is 1 hour 23 seconds long. My phone only records for 1h 30 mins and 10 secs.',
               'This video is 4 days 2h 3min 6sec 30ms']
year_units = ['year', 'years', 'y']
day_units = ['day', 'days', 'd']
hour_units = ['hour', 'hours', 'h']
min_units = ['minute', 'minutes', 'min', 'mins', 'm']
sec_units = ['second', 'seconds', 'sec', 'secs', 's']
millisec_units = ['millisecond', 'milliseconds', 'millisec', 'millisecs', 'ms']
all_units = '|'.join(year_units + day_units + hour_units + min_units + sec_units + millisec_units)
print(all_units)
# pattern = r"""(?P<time> # time group beginning
# (?P<value>[\d]+) # value of time unit
# \s* # may or may not be space between digit and unit
# (?P<unit>%s) # unit measurement of time
# \s* # may or may not be space between digit and unit
# )
# \w+""" % all_units
pattern = r""".*(?P<time> # time group beginning
(?P<value>[\d]+) # value of time unit
\s* # may or may not be space between digit and unit
(?P<unit>%s) # unit measurement of time
\s* # may or may not be space between digit and unit
).* # may be words in between the times
""" % (all_units)
regex = re.compile(pattern)
for val in test_string:
    match = regex.search(val)
    print(match)
    print(match.groupdict())
This fails miserably due to not being able to properly deal with nested groupings and not being able to assign a name with the value of a group.

First of all, you can't just write a multiline regex with comments and expect it to match anything if you don't use the re.VERBOSE flag:
regex = re.compile(pattern, re.VERBOSE)
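To see the flag's effect in isolation, here is a minimal sketch (independent of the pattern above):
import re

# Without re.VERBOSE, the whitespace and the "# ..." comment are part of the
# pattern, so this would only match a literal "42  # a number".
plain = re.compile(r"(?P<value>\d+)  # a number")
print(plain.search("42"))  # None

# With re.VERBOSE, unescaped whitespace and comments in the pattern are ignored.
verbose = re.compile(r"(?P<value>\d+)  # a number", re.VERBOSE)
print(verbose.search("42").group('value'))  # 42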
Like you said, the best solution is probably to use a dict:
for val in test_string:
    while True:  # find all groups of times
        match = regex.search(val)  # find the first unit
        if not match:
            break
        matches = {}  # keep track of all units and their values
        while True:
            matches[match.group('unit')] = int(match.group('value'))  # add the match to the dict
            val = val[match.end():]  # remove part of the string so subsequent matches must start at index 0
            m = regex.search(val)
            if not m or m.start() != 0:  # no more matches, or text between this match and the next: abort
                break
            match = m
        print(matches)  # the finished dict
        # output will be like {'h': 1, 'secs': 10, 'mins': 30}
However, the code above won't work just yet. We need to make two adjustments:
First, the pattern cannot allow just any text between matches. To allow only whitespace and the word "and" between two matches, you can use:
pattern = r"""(?P<time> # time group beginning
(?P<value>[\d]+) # value of time unit
\s* # may or may not be space between digit and unit
(?P<unit>%s) # unit measurement of time
\s* # may or may not be space between digit and unit
(?:\band\s+)? # allow the word "and" between numbers
) # may be words in between the times
""" % (all_units)
Second, you have to change the order of your units, like so:
year_units = ['years', 'year', 'y'] # yearS before year
day_units = ['days', 'day', 'd'] # dayS before day, etc...
Why? Because with a text like 3 years and 1 day, the alternation would otherwise match 3 year instead of 3 years.
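Putting both adjustments together, a complete runnable sketch might look like this. Note that it also moves the millisecond units to the front of the alternation so that 'ms' is tried before 'm' and 's' (the same ordering logic, applied across the lists), and that the to_seconds helper at the end is a hypothetical addition for the "convert to a given unit" part of the question:
import re

test_string = ['Your favorite song is 1 hour 23 seconds long. My phone only records for 1h 30 mins and 10 secs.',
               'This video is 4 days 2h 3min 6sec 30ms']

# Longer alternatives first within each list; millisecond units first overall.
millisec_units = ['milliseconds', 'millisecond', 'millisecs', 'millisec', 'ms']
year_units = ['years', 'year', 'y']
day_units = ['days', 'day', 'd']
hour_units = ['hours', 'hour', 'h']
min_units = ['minutes', 'minute', 'mins', 'min', 'm']
sec_units = ['seconds', 'second', 'secs', 'sec', 's']
all_units = '|'.join(millisec_units + year_units + day_units + hour_units + min_units + sec_units)

pattern = r"""(?P<time>            # time group beginning
              (?P<value>\d+)       # value of time unit
              \s*                  # optional space between digit and unit
              (?P<unit>%s)         # unit measurement of time
              \s*
              (?:\band\s+)?        # allow the word "and" between numbers
              )""" % all_units
regex = re.compile(pattern, re.VERBOSE)

# Map every unit alias to its length in seconds (hypothetical helper,
# not part of the original answer).
UNIT_SECONDS = {}
for aliases, secs in [(year_units, 365 * 24 * 3600), (day_units, 24 * 3600),
                      (hour_units, 3600), (min_units, 60),
                      (sec_units, 1), (millisec_units, 0.001)]:
    for alias in aliases:
        UNIT_SECONDS[alias] = secs

def to_seconds(matches):
    return sum(UNIT_SECONDS[unit] * value for unit, value in matches.items())

for val in test_string:
    while True:                    # find every group of adjacent times
        match = regex.search(val)
        if not match:
            break
        matches = {}               # unit -> value for one group of times
        while True:
            matches[match.group('unit')] = int(match.group('value'))
            val = val[match.end():]
            m = regex.search(val)
            if not m or m.start() != 0:
                break
            match = m
        print(matches, '->', to_seconds(matches), 'seconds')
# {'hour': 1, 'seconds': 23} -> 3623 seconds
# {'h': 1, 'mins': 30, 'secs': 10} -> 5410 seconds
# {'days': 4, 'h': 2, 'min': 3, 'sec': 6, 'ms': 30} -> 352986.03 seconds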

Related

Split column into multiple columns based on content of column in Pandas

I have a column with data like this
Ticket NO: 123456789 ; Location ID:ABC123; Type:Network;
Ticket No. 132123456, Location ID:ABC444; Type:App
Tickt#222256789 ; Location ID:AMC121; Type:Network;
I am trying it like this:
new = data["Description"].str.split(";", n = 1, expand = True)
data["Ticket"]= new[0]
data["Location"]= new[1]
data["Type"]= new[2]
# Dropping old columns
data.drop(columns =["Description"], inplace = True)
I can split on ";", but how do I do it for both ";" and ","?
A more general solution, that allows you to perform as much processing as you like comfortably. Let's start by defining an example dataframe for easy debugging:
df = pd.DataFrame({'Description': [
    'Ticket NO: 123456789 , Location ID:ABC123; Type:Network;',
    'Ticket NO: 123456789 ; Location ID:ABC123; Type:Network;']})
Then, let's define our processing function, where you can do anything you like:
def process(row):
    parts = re.split(r'[,;]', row)
    return pd.Series({'Ticket': parts[0], 'Location': parts[1], 'Type': parts[2]})
In addition to splitting on , or ; and then separating into the 3 sections, you can add code that strips whitespace, removes whatever is to the left of the colons, etc. For example, try:
def process(row):
    parts = re.split(r'[,;]', row)
    data = {}
    for part in parts:
        for field in ['Ticket', 'Location', 'Type']:
            if field.lower() in part.lower():
                data[field] = part.split(':')[1].strip()
    return pd.Series(data)
Finally, apply to get the result:
df['Description'].apply(process)
This is much more readable and easily maintainable than doing everything in a single regex, especially as you might end up needing additional processing.
The output of this application will look like this:
      Ticket Location     Type
0  123456789   ABC123  Network
1  123456789   ABC123  Network
To add this output to the original dataframe, simply run:
df[['Ticket', 'Location', 'Type']] = df['Description'].apply(process)
One approach using str.extract
Ex:
import re

df[['Ticket', 'Location', 'Type']] = df['Description'].str.extract(r"[Ticket\sNO:.#](\d+).*ID:([A-Z0-9]+).*Type:([A-Za-z]+)", flags=re.I)
print(df[['Ticket', 'Location', 'Type']])
Output:
      Ticket Location     Type
0  123456789   ABC123  Network
1  132123456   ABC444      App
2  222256789   AMC121  Network
You can use
new = data["Description"].str.split("[;,]", n = 2, expand = True)
new.columns = ['Ticket', 'Location', 'Type']
Output:
>>> new
                 Ticket            Location           Type
0  Ticket NO: 123456789   Location ID:ABC123  Type:Network;
1  Ticket No. 132123456   Location ID:ABC444       Type:App
2       Tickt#222256789   Location ID:AMC121  Type:Network;
The [;,] regex matches either a ; or a , char, and n=2 caps the number of splits at two.
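If you then want only the values without the labels, a hedged follow-up (assuming each value follows a ':', '#', or '.' delimiter, as in the sample data) could be:
# Keep only what follows the first ':', '#' or '.' in each cell,
# then trim surrounding whitespace; this drops the 'Ticket NO:' style labels.
for col in ['Ticket', 'Location', 'Type']:
    new[col] = new[col].str.extract(r'[:#.]\s*([^;]+)')[0].str.strip()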
Another regex Series.str.extract solution:
new[['Ticket', 'Location', 'Type']] = data['Description'].str.extract(r"(?i)Ticke?t\D*(\d+)\W*Location ID\W*(\w+)\W*Type:(\w+)")
>>> new
      Ticket Location     Type
0  123456789   ABC123  Network
1  132123456   ABC444      App
2  222256789   AMC121  Network
See the regex demo. Details:
(?i) - case insensitive flag
Ticke?t - Ticket with an optional e
\D* - zero or more non-digit chars
(\d+) - Group 1: one or more digits
\W* - zero or more non-word chars
Location ID - a string
\W* - zero or more non-word chars
(\w+) - Group 2: one or more word chars
\W* - zero or more non-word chars
Type: - a string
(\w+) - Group 3: one or more word chars

Replace N digit numbers in a sentence with specific strings for different values of N

I have a bunch of strings in a pandas dataframe that contain numbers in them. I could run the code below and replace them all:
df.feature_col = df.feature_col.str.replace(r'\d+', ' NUM ')
But what I need to do is replace any 10 digit number with a string like masked_id, any 16 digit numbers with account_number, or any three-digit numbers with yet another string, and so on.
How do I go about doing this?
PS: since my data size is less, a less optimal way is also good enough for me.
Another way is to use replace with regex=True and a dictionary. You can also use somewhat more relaxed match
patterns (applied in order) than Tim's:
# test data
# test data
df = pd.DataFrame({'feature_col': ['this has 1234567',
                                   'this has 1234',
                                   'this has 123',
                                   'this has none']})
# patterns in decreasing length order
# these of course would replace '12345' with 'ID45' :-)
df['feature_col'] = df.feature_col.replace({r'\d{7}': 'ID7',
                                            r'\d{4}': 'ID4',
                                            r'\d{3}': 'ID3'},
                                           regex=True)
Output:
feature_col
0 this has ID7
1 this has ID4
2 this has ID3
3 this has none
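If the length match must be exact, so that a 5-digit number like 12345 is left alone rather than partially replaced, a variant of the same idea adds word boundaries (a sketch):
# \b ... \b makes each pattern match only whole numbers of exactly that length.
df['feature_col'] = df.feature_col.replace({r'\b\d{7}\b': 'ID7',
                                            r'\b\d{4}\b': 'ID4',
                                            r'\b\d{3}\b': 'ID3'},
                                           regex=True)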
You could do a series of replacements, one for each length of number:
df.feature_col = df.feature_col.str.replace(r'\b\d{3}\b', ' 3mask ', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{10}\b', ' masked_id ', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{16}\b', ' account_number ', regex=True)
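A quick check with made-up rows (treating masked_id and account_number as the literal replacement strings, which is an assumption about the desired output):
import pandas as pd

df = pd.DataFrame({'feature_col': ['call 5551234567 now',
                                   'card 1234567812345678 on file',
                                   'area 415 only']})
df.feature_col = df.feature_col.str.replace(r'\b\d{3}\b', ' 3mask ', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{10}\b', ' masked_id ', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{16}\b', ' account_number ', regex=True)
print(df.feature_col.tolist())
# ['call  masked_id  now', 'card  account_number  on file', 'area  3mask  only']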

How to replace all numbers (with letters/symbols attached, i.e. 43$) in a dataframe column?

I have a dataframe of online comments related to the stock market.
Here's an example:
df = pd.DataFrame({'id': [1, 2, 3],
                   'comment': ["I made $425",
                               "I got mine at 42c. per share",
                               "Stocks saw a 12% increase"]})
I would like to replace all numbers in the dataframe (including the symbols and letters) with NUMBER to achieve:
"I made NUMBER",
"I got mine at NUMBER per share",
"Stocks saw a NUMBER increase"
I found a close solution in a previous comment, but this solution still leaves me with the remaining letters and symbols.
def repl(x):
    return re.sub(r'\d+', lambda m: "NUMBER", x)

repl("I made 428c with a 52% increase")
>> I made NUMBERc with a NUMBER% increase
Any help would be appreciated, thanks.
This should work:
import re

def repl(x):
    return re.sub(r'\S*\d+\S*', lambda m: "NUMBER", x)

print(repl("I made 428c with a 52% increase"))
Output:
I made NUMBER with a NUMBER increase
You can use a [^\d\s]*\d\S* regex to match any chunk of 0 or more chars other than digit and whitespace, then a digit, and then any amount of non-whitespace chars, and replace with NUMBER using a vectorized Series.str.replace method.
See a Pandas test:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'comment': ["I made $425",
                               "I got mine at 42c. per share",
                               "Stocks saw a 12% increase"]})
df['comment'] = df['comment'].str.replace(r'[^\s\d]*\d\S*', 'NUMBER', regex=True)
df
# =>    id                         comment
# => 0   1                   I made NUMBER
# => 1   2  I got mine at NUMBER per share
# => 2   3    Stocks saw a NUMBER increase
See the regex demo, too. Details:
[^\d\s]* - zero or more (*) occurrences of any char but a digit and whitespace ([^\d\s] is a negated character class)
\d - any digit char
\S* - zero or more non-whitespace chars.
Try this
def repl(l):
    s = ""
    for i in l.split():
        # replace any whitespace-delimited token that contains a digit
        if any(str(d) in i for d in range(10)):
            s += "Number" + ' '
        else:
            s += i + ' '
    return s.strip()
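Calling it on a sample sentence (note this word-level approach replaces the whole token, punctuation included):
print(repl("I made $425 at 42c. per share"))
# => I made Number at Number per share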

For each word in a string column in pandas dataframe find 5 surrounding words before and after and insert the new columns in a new dataframe [closed]

I am working on text analytics and am stuck on one problem.
I am trying to find the surrounding words (5 or more) before and after each word in a string column of a pandas dataframe. I have an id column and a Text column, and I am trying to create a new dataframe with four columns (ID, Before, Word, After).
[dummy dataframe and result dataframe screenshots not reproduced]
Initially I thought about using df.Text.extractall(...),
with 3 capturing groups (Before, Word and After), but the downside
was that e.g. the After group in one match could consume the content
that in the next match could be either the Word or at least the Before
group.
So I decided to do it another way:
Apply to each row a function, returning a "partial" result for this row.
Gather results in a list of DataFrames.
Concatenate them.
Setup
Source DataFrame:
ID Text
0 ID1 The Company sells its products worldwide through its wide network of
1 ID2 Provides one of most often used search engines for HTTP sites
2 ID3 The most known of its products is the greatest airliner of the world
3 ID4 Xyz nothing
Note that I added a "no match" row (ID4).
Words to match:
words = ['products', 'most', 'for']
No of words before / after:
wNo = 3
In your code change it to whatever number you want.
The solution
The function finding matches in the current row:
def find(row, wanted, wNo):
    wList = re.split(r'\W+', row.Text)
    wListLC = list(map(lambda x: x.lower(), wList))
    res = []
    for wd in wanted:  # Check each "wanted" word
        for indW in [i for i, x in enumerate(wListLC) if x == wd]:
            # For each index of "wd" in "wList"
            wdBef = ''
            if indW > 0:
                indBefBeg = indW - wNo if indW >= wNo else 0
                wdBef = ' '.join(wList[indBefBeg : indW])
            indAftBeg = indW + 1
            indAftEnd = indAftBeg + wNo
            wdAft = ' '.join(wList[indAftBeg : indAftEnd])
            res.append([row.ID, wdBef, wd, wdAft])
    return pd.DataFrame(res, columns=['ID', 'Before', 'Word', 'After'])
Parameters are:
row - the source row,
wanted - the list of "wanted" words (lower case),
wNo - number of words before / after the wanted word.
For each match found, the result contains a row with:
ID - from the current row,
Before, Word, After - respective parts of the current match.
Of course, the actual number of words in the Before / After group can be
smaller if there are not enough such words in the current row.
Note that this function splits the source row into two lists:
wList - "original" words, to return later,
wListLC - words converted to lower case, to match (remember that the
"wanted" list should also be in lower case).
The result is a "partial" DataFrame (for this row, if no match then empty),
to be later concatenated with other partial results.
And now, how to use this function. To gather the partial results as a list
of DataFrames, run:
tbl = df.apply(find, axis=1, wanted=words, wNo=wNo).tolist()
And to generate the final result, run:
pd.concat(tbl, ignore_index=True)
For my source data, the result is:
ID Before Word After
0 ID1 Company sells its products worldwide through its
1 ID2 Provides one of most often used search
2 ID2 used search engines for HTTP sites
3 ID3 known of its products is the greatest
4 ID3 The most known of its
Note that Before / After group can be an empty string, but only
in cases when the Word was either the first or the last in the current row.
How to speed up this solution
Some increase in speed can be achieved with the following steps:
Compile the regex in advance (pat = re.compile(r'\W+')) and use
it in the function finding matches.
Drop additional parameters and use global variables instead.
So the function can be:
def find2(row):
    wList = pat.split(row.Text)
    wListLC = list(map(lambda x: x.lower(), wList))
    res = []
    for wd in words:  # Check each "wanted" word
        for indW in [i for i, x in enumerate(wListLC) if x == wd]:
            # For each index of "wd" in "wList"
            wdBef = ''
            if indW > 0:
                indBefBeg = indW - wNo if indW >= wNo else 0
                wdBef = ' '.join(wList[indBefBeg : indW])
            indAftBeg = indW + 1
            indAftEnd = indAftBeg + wNo
            wdAft = ' '.join(wList[indAftBeg : indAftEnd])
            res.append([row.ID, wdBef, wd, wdAft])
    return pd.DataFrame(res, columns=['ID', 'Before', 'Word', 'After'])
And to call it, run:
tbl = df.apply(find2, axis=1).tolist()
pd.concat(tbl, ignore_index=True)
I compared both variants using %timeit (on my test data) and
the average execution time dropped from 46 ms to 39 ms (16% shorter).
For larger datasets the difference should be more significant.
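For reference, the comparison can be reproduced along these lines in IPython or Jupyter (the absolute numbers will of course differ on other data):
# IPython / Jupyter line magics:
%timeit pd.concat(df.apply(find, axis=1, wanted=words, wNo=wNo).tolist(), ignore_index=True)
%timeit pd.concat(df.apply(find2, axis=1).tolist(), ignore_index=True)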

In Python, how to parse a string representing a set of keyword arguments such that the order does not matter

I'm writing a class RecurringInterval which - based on the dateutil.rrule object - represents a recurring interval in time. I have defined a custom, human-readable __str__ method for it and would like to also define a parse method which (similar to the rrulestr() function) parses the string back into an object.
Here is the parse method and some test cases to go with it:
import re
from dateutil.rrule import FREQNAMES
import pytest

class RecurringInterval(object):
    freq_fmt = "{freq}"
    start_fmt = "from {start}"
    end_fmt = "till {end}"
    byweekday_fmt = "by weekday {byweekday}"
    bymonth_fmt = "by month {bymonth}"

    @classmethod
    def match_pattern(cls, string):
        SPACES = r'\s*'
        # The frequencies may be either lowercase or start with a capital letter
        freq_names = [freq.lower() for freq in FREQNAMES] + [freq.title() for freq in FREQNAMES]
        FREQ_PATTERN = '(?P<freq>{})?'.format("|".join(freq_names))
        # Start and end are required (their regular expressions match 1 repetition)
        START_PATTERN = cls.start_fmt.format(start=SPACES + r'(?P<start>.+?)')
        END_PATTERN = cls.end_fmt.format(end=SPACES + r'(?P<end>.+?)')
        # The remaining tokens are optional (their regular expressions match 0 or 1 repetitions)
        BYWEEKDAY_PATTERN = cls.optional(cls.byweekday_fmt.format(byweekday=SPACES + r'(?P<byweekday>.+?)'))
        BYMONTH_PATTERN = cls.optional(cls.bymonth_fmt.format(bymonth=SPACES + r'(?P<bymonth>.+?)'))
        PATTERN = SPACES + FREQ_PATTERN \
                  + SPACES + START_PATTERN \
                  + SPACES + END_PATTERN \
                  + SPACES + BYWEEKDAY_PATTERN \
                  + SPACES + BYMONTH_PATTERN \
                  + SPACES + "$"  # '$' is needed to make the non-greedy regular expressions parse till the end of the string
        return re.match(PATTERN, string).groupdict()

    @staticmethod
    def optional(pattern):
        '''Encloses the given regular expression in an optional group (i.e., one that matches 0 or 1 repetitions of the original regular expression).'''
        return '({})?'.format(pattern)

'''Tests'''
def test_match_pattern_with_byweekday_and_bymonth():
    string = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00 by weekday Monday, Tuesday by month January, February"
    groups = RecurringInterval.match_pattern(string)
    assert groups['freq'] == "Weekly"
    assert groups['start'].strip() == "2017-11-03 15:00:00"
    assert groups['end'].strip() == "2017-11-03 16:00:00"
    assert groups['byweekday'].strip() == "Monday, Tuesday"
    assert groups['bymonth'].strip() == "January, February"

def test_match_pattern_with_bymonth_and_byweekday():
    string = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00 by month January, February by weekday Monday, Tuesday "
    groups = RecurringInterval.match_pattern(string)
    assert groups['freq'] == "Weekly"
    assert groups['start'].strip() == "2017-11-03 15:00:00"
    assert groups['end'].strip() == "2017-11-03 16:00:00"
    assert groups['byweekday'].strip() == "Monday, Tuesday"
    assert groups['bymonth'].strip() == "January, February"

if __name__ == "__main__":
    # pytest.main([__file__])
    pytest.main([__file__ + "::test_match_pattern_with_byweekday_and_bymonth"])  # This passes
    # pytest.main([__file__ + "::test_match_pattern_with_bymonth_and_byweekday"])  # This fails
Although the parser works if you specify the arguments in the 'right' order, it is 'inflexible' in that it doesn't allow the optional arguments to be given in arbitrary order. This is why the second test fails.
What would be a way to make the parser parse the 'optional' fields in any order, such that both tests pass? (I was thinking of making an iterator with all permutations of the regular expressions and trying re.match on each one, but this does not seem like an elegant solution).
At this point, your language is getting complex enough that it's time to ditch regular expressions and learn how to use a proper parsing library. I threw this together using pyparsing, and I've annotated it heavily to try and explain what's going on, but if anything's unclear do ask and I'll try to explain.
from pyparsing import Regex, oneOf, OneOrMore
# Boring old constants, I'm sure you know how to fill these out...
months = ['January', 'February']
weekdays = ['Monday', 'Tuesday']
frequencies = ['Daily', 'Weekly']
# A datetime expression is anything matching this regex. We could split it down
# even further to get day, month, year attributes in our results object if we felt
# like it
datetime_expr = Regex(r'(\d{4})-(\d\d?)-(\d\d?) (\d{2}):(\d{2}):(\d{2})')
# A from or till expression is the word "from" or "till" followed by any valid datetime
from_expr = 'from' + datetime_expr.setResultsName('from_')
till_expr = 'till' + datetime_expr.setResultsName('till')
# A range expression is a from expression followed by a till expression
range_expr = from_expr + till_expr
# A weekday is any old weekday
weekday_expr = oneOf(weekdays)
month_expr = oneOf(months)
frequency_expr = oneOf(frequencies)
# A by weekday expression is the words "by weekday" followed by one or more weekdays
by_weekday_expr = 'by weekday' + OneOrMore(weekday_expr).setResultsName('weekdays')
by_month_expr = 'by month' + OneOrMore(month_expr).setResultsName('months')
# A recurring interval, then, is a frequency, followed by a range, followed by
# a weekday and a month, in any order
recurring_interval = frequency_expr + range_expr + (by_weekday_expr & by_month_expr)
# Let's parse!
if __name__ == '__main__':
    res = recurring_interval.parseString('Daily from 1111-11-11 11:00:00 till 1111-11-11 12:00:00 by weekday Monday by month January February')
    # Note that setResultsName causes everything to get packed neatly into
    # attributes for us, so we can pluck all the bits and pieces out with no
    # difficulty at all
    print(res)
    print(res.from_)
    print(res.till)
    print(res.weekdays)
    print(res.months)
You have many options here, each with different downsides.
One approach would be to use a repeated alternation, like (by weekday|by month)*:
(?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?)(?:\s+by weekday (?P<byweekday>.+?)|\s+by month (?P<bymonth>.+?))*$
This will match strings of the form week month and month week, but also week week or month week month etc.
Another option would be to use lookaheads, like (?=.*by weekday)?(?=.*by month)?:
(?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?(?=$| by))(?=.*\s+by weekday (?P<byweekday>.+?(?=$| by))|)(?=.*\s+by month (?P<month>.+?(?=$| by))|)
However, this requires a known delimiter (I used " by") to know how far to match. Also, it'll silently ignore any extra characters (meaning it'll match strings of the form by weekday [some garbage] by month).
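A quick demonstration of the first, repeated-alternation pattern (a sketch; as noted above, it will also accept duplicated or repeated sections):
import re

pattern = (r"(?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?)"
           r"(?:\s+by weekday (?P<byweekday>.+?)|\s+by month (?P<bymonth>.+?))*$")

s = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00 by month January, February by weekday Monday, Tuesday"
print(re.match(pattern, s).groupdict())
# {'freq': 'Weekly', 'start': '2017-11-03 15:00:00', 'end': '2017-11-03 16:00:00',
#  'byweekday': 'Monday, Tuesday', 'bymonth': 'January, February'}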
