Regex - Matching String values and Date together - python

I have a string that looks like the following and am supposed to extract the key : value and i am using Regex for the same.
line ="Date : 20/20/20 Date1 : 15/15/15 Name : Hello World Day : Month Weekday : Monday"
1) Extracting the key or attributes only.
re.findall(r'\w+\s?(?=:)',line)
#['Date ', 'Date1 ', 'Name ', 'Day ', 'Weekday ']
2)Extracting the dates only
re.findall(r'(?<=:)\s?\d{2}/\d{2}/\d{2}',line)
#[' 20/20/20', ' 15/15/15']
3)Extracting the strings perfectly but also some wrong format dates.
re.findall(r'(?<=:)\s?\w+\s?\w+',line)
# [' 20', ' 15', ' Hello World', ' Month', ' Monday']
But when I try to use the OR operator to pull both the strings and dates I get wrong output. I believe the piping has not worked properly.
re.findall(r'(?<=:)\s?\w+\s?\w+|\s?\d{2}/\d{2}/\d{2}',line)
# [' 20', ' 15', ' Hello World', ' Month', ' Monday']
Any help on the above command to extract both the dates (dd/mm/yy) format and the string values will be highly appreciated.

You need to flip it around.
\s?\d{2}/\d{2}/\d{2}|(?<=:)\s?\w+\s?\w+
Live preview
Regex will first try and match the first part. If it succeeds it will not try the next part. The reason it then breaks is because \w results in the first number of the date being matched. Since / isn't a \w (word character) it stops at that point.
Flipping it around makes it first try matching the date. If it doesn't match then it tries matching an attribute. Thus avoiding the problem.

Related

Extract text with multiple regex patterns in Python

I have a list with address information
The placement of words in the list can be random.
address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']
I want to extract each item of a list in a new string.
r = re.compile(".region")
region = list(filter(r.match, address))
It works, but there are more than 1 pattern "region". For example, there can be "South reg." or "South r-n".
How can I combine a multiple patterns?
And digit 4 in list means building number. There can be onle didts, or smth like 4k1.
How can I extract building number?
Hopefully I understood the requirement correctly.
For extracting the region, I chose to get it by the first word, but if you can be sure of the regions which are accepted, it would be better to construct the regex based on the valid values, not first word.
Also, for the building extraction, I am not sure of which are the characters you want to keep, versus the ones which you may want to remove. In this case I chose to keep only alphanumeric, meaning that everything else would be stripped.
CODE
import re
list1 = [' South region', ' district KTS', ' -4k-1.', ' app. 106', ' ent. 1', ' st. 15']
def GetFirstWord(list2,column):
return re.search(r'\w+', list2[column].strip()).group()
def KeepAlpha(list2,column):
return re.sub(r'[^A-Za-z0-9 ]+', '', list2[column].strip())
print(GetFirstWord(list1,0))
print(KeepAlpha(list1,2))
OUTPUT
South
4k1

Split a string in python with side by side delimiters

The Question:
Given a list of strings create a function that returns the same list but split along any of the following delimiters ['&', 'OR', 'AND', 'AND/OR', 'IFT'] into a list of lists of strings.
Note the delimiters can be mixed inside a string, there can be many adjacent delimiters, and the list is a column from a dataframe.
EX//
function(["Mary & had a little AND lamb", "Twinkle twinkle ITF little OR star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
My Solution Attempt
Start by replacing any kind of delimiter with a &. I include spaces on either side so that other words like HANDY dont get affected. Next, split each string along the & delimiter knowing that every other kind of delimiter has been replaced.
def clean_and_split(lolon):
# Constants
banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR '}
# Loop through each list of strings
for i in range(len(lolon)):
# Loop through each delimiter and replace it with ' & '
for word in banned_list:
lolon[i] = lolon[i].replace(word, ' & ')
# Split the string along the ' & ' delimiter
lolon[i] = lolon[i].split('&')
return lolon
The problem is that often side by side delimiters get replaced in a way that leaves an empty string in the middle. Also certain combinations of delimiters dont get removed. This is because when the 'replace' method reads ' OR OR OR ', it will replace the first ' OR ' (since it matches) but wont replace the second because it reads it as 'OR '.
EX//
clean_and_split(["Mario AND Luigi AND & Peach"]) >> ['Mario ', ' Luigi ', ' ', ' Peach'])
clean_and_split(["Mario OR OR OR Luigi", "Testing AND AND PlsWork "])
>> ['Mario ',' OR ', ' Luigi '], ['Testing', 'AND PlsWork]]
The work around to resolve this is to make banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR ', ' AND ', ' OR ', ' ITF ', ' AND/OR '} forcing the code to loop through everything twice.
Alternate Solution?
Split the column along a list of delimiters. The problem with this is that back to back delimiters don't get caught
df['Correct_Column'].str.split('(?: AND | IFT | OR | & )')
EX//
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'AND had a little', 'IFT lamb'], ['Twinkle twinkle', '& little', '& star']]
There HAS to be a more elegant way!
This is where a lookahead and lookbehind are useful, as they won't eat up the spaces you use to match correctly:
import re
text = 'Mary & had a little AND OR lamb, white as ITF snow OR'
replaced = re.sub('(?<=\s)&|OR|AND|ITF|AND/OR(?=\s)', '&', text)
parts = [stripped for s in replaced.split('&') if (stripped := s.strip())]
print(parts)
Result:
['Mary', 'had a little', 'lamb, white as', 'snow']
However, note that:
the parts = line may solve most of your problems anyway, using your own method;
a lookbehind or lookahead requires a fixed-width pattern in Python, so something like (?<=\s|^) won't work, i.e. the OR at the end causes an empty string to be found at the end;
the lookahead/lookbehind correctly deals with 'AND OR', but still finds an empty string in between, which is removed on the parts = line;
the walrus operator is in the parts = line as a simple way to filter out empty strings; stripped := s.strip() is not truthy if the result is an empty string, so stripped will only show up in the list if it is not an empty string.

Python regex not giving desired output

I'm scraping a site which contains the following string
"1 Year+ in Category"
or in some cases
"1 Year+ by user in Category
I want to separate the Year, Category and the User. I tried using regular split but it doesn't work in this case because there are two delimiters 'in' and 'by'.
So, I used regex. It kinda works but not properly. Here is the snippet
dateandcat=re.split(r'.\s[in , by]',rightside[0])
rightside[0] contains date,category and user.
It results in the following output:
['1 Year', 'n Movies']
['1 Year', 'y user', 'n TV shows']
['1 Year', 'y user', 'n TV shows']
['1 Year', 'n Movies']
I could just trim off first two characters in [1] and [2] but I want to fix the regex. Why is second character of 'in' and 'by' still showing? How do I fix this?
Try using:
import re
value = "1 Year+ in Category by User"
match = re.match(r"(\d+ \w+\+?) in (\w+)(?: by (\w+)*)?", value)
if match:
print(match.groups())
Output:
('1 Year+', 'Category', 'User')
You can use regex101 to learn more about that regex and others.

Parse sentences with [value](type) format

I want to parse and extract key, values from a given sentence which follow the following format:
I want to get [samsung](brand) within [1 week](duration) to be happy.
I want to convert it into a split list like below:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
I have tried to split it using [ or ) :
re.split('\[|\]|\(|\)',s)
which is giving output:
['I want to get ',
'samsung',
'',
'brand',
' within ',
'1 week',
'',
'duration',
' to be happy.']
and
re.split('\[||\]|\(|\)',s)
is giving below output :
['I want to get ',
'samsung](brand) within ',
'1 week](duration) to be happy.']
Any help is appreciated.
Note: This is similar to stackoverflow inline links as well where if we type : go to [this link](http://google.com) it parse it as link.
As first step we split the string, and in second step we modify the string:
s = 'I want to get [samsung](brand) within [1 week](duration) to be happy.'
import re
s = re.split('(\[[^]]*\]\([^)]*\))', s)
s = [re.sub('\[([^]]*)\]\(([^)]*)\)', r'\1:\2', i) for i in s]
print(s)
Prints:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
You may use a two step approach: process the [...](...) first to format as needed and protect these using some rare/unused chars, and then split with that pattern.
Example:
s = "I want to get [samsung](brand) within [1 week](duration) to be happy.";
print(re.split(r'⦅([^⦅⦆]+)⦆', re.sub(r'\[([^][]*)]\(([^()]*)\)', r'⦅\1:\2⦆', s)))
See the Python demo
The \[([^\][]*)]\(([^()]*)\) pattern matches
\[ - a [ char
([^\][]*) - Group 1 ($1): any 0+ chars other than [ and ]
]\( - ]( substring
([^()]*) - Group 2 ($2): any 0+ chars other than ( and )
\) - a ) char.
The ⦅([^⦅⦆]+)⦆ pattern just matches any ⦅...⦆ substring but keeps what is in between as it is captured.
You could replace the ]( pattern first, then split on [) characters
re.replace('\)\[', ':').split('\[|\)',s)
One approach, using re.split with a lambda function:
sentence = "I want to get [samsung](brand) within [1 week](duration) to be happy."
parts = re.split(r'(?<=[\])])\s+|\s+(?=[\[(])', sentence)
processTerms = lambda x: re.sub('\[([^\]]+)\]\(([^)]+)\)', '\\1:\\2', x)
parts = list(map(processTerms, parts))
print(parts)
['I want to get', 'samsung:brand', 'within', '1 week:duration', 'to be happy.']

Split string based on delimiter and word using re

I am using Python for natural language processing. I am trying to split my input string using re. I want to split using ;,. as well as word but.
import re
print (re.split("[;,.]", 'i am; working here but you are. working here, as well'))
['i am', ' working here but you are', ' working here', ' as well']
How to do that? When I put in word but in regex, it treats every character as splitting criterion. How do I get following output?
['i am', ' working here', 'you are', ' working here', ' as well']
you can filter as it : but | [;,.]
It will search for char ; , and . but also for word but !
import re
print (re.split("but |[;,.]", 'i am; working here but you are. working here, as well'))
hope this help.
Even this one works:
import re
print (re.split('; |, |\. | but', 'i am; working here but you are. working here, as well'))
Output:
['i am', 'working here', ' you are', 'working here', 'as well']

Categories

Resources