Python regex not giving desired output - python

I'm scraping a site which contains the following string
"1 Year+ in Category"
or in some cases
"1 Year+ by user in Category"
I want to separate the Year, Category and the User. I tried using regular split but it doesn't work in this case because there are two delimiters 'in' and 'by'.
So, I used regex. It kinda works but not properly. Here is the snippet
dateandcat=re.split(r'.\s[in , by]',rightside[0])
rightside[0] contains date,category and user.
It results in the following output:
['1 Year', 'n Movies']
['1 Year', 'y user', 'n TV shows']
['1 Year', 'y user', 'n TV shows']
['1 Year', 'n Movies']
I could just trim off first two characters in [1] and [2] but I want to fix the regex. Why is second character of 'in' and 'by' still showing? How do I fix this?

Try using:
import re
value = "1 Year+ in Category by User"
match = re.match(r"(\d+ \w+\+?) in (\w+)(?: by (\w+)*)?", value)
if match:
    print(match.groups())
Output:
('1 Year+', 'Category', 'User')
You can use regex101 to learn more about that regex and others.
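As for why the original split leaves the 'n' and 'y' behind: square brackets define a character class, so [in , by] matches exactly one character out of i, n, space, comma, b and y. It does not mean "in" or "by". The leading . consumes the +, \s consumes the space, and the class consumes only the i or the b. Below is a minimal sketch of a split that uses a group with alternation instead. It assumes the fields never contain " in " or " by " themselves, and split_fields is a hypothetical helper name, not something from the question.

```python
import re

def split_fields(text):
    # (?:in|by) is a non-capturing group matching the whole word "in" or "by";
    # \+? also drops the optional "+" that follows the duration.
    return re.split(r"\+?\s+(?:in|by)\s+", text)

print(split_fields("1 Year+ in Movies"))
print(split_fields("1 Year+ by user in TV shows"))
```

When "by user" is present, the result has three items (duration, user, category); otherwise it has two, so the caller still has to check the length.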

Related

Python replace in JSON list with for loop?

I'm quite new to python 3, and I'm developing a REST API to format some characters in a JSON of numerous strings (thousands sometimes) , the JSON has this structure:
[
[
"city",
"Street 158 No 96"
],
[
"city",
"st 144 11a 11 ap 104"
],
[
"city",
"Street83 # 85 - 22"
],
[
"city",
"str13 #153 - 81"
],
[
"city",
"street1h # 24 - 29"
]
]
So what I did to replace this on excel macros was.
text = Replace(text, "st", " street ", , , vbTextCompare)
For i = 0 To 9 Step 1
text = Replace(text, "street" & i, " street " & i, , , vbTextCompare)
text = Replace(text, "st" & i, " street " & i, , , vbTextCompare)
This would format every cell to 'street #' no matter the number, now the problem is when I try to do this with python, right now I have learned how to replace multiple values on a list like so:
addressList = []
for address in request.json:
    address = [element
               .replace('st', 'street ')
               .replace('street1', 'street 1')
               .replace('street2', 'street 2')
               .replace('street3', 'street 3')
               .replace('street4', 'street 4')
               .replace('street5', 'street 5')
               # and so on for st too
               for element in address]
    addressList.append(address)
This method is not just long but also really ugly, I'd like to do something like what I had before, but I can't seem to be able to use a for inside the replace, should I do it outside?
Thank you for helping.
--EDIT--
Edited the JSON format so it's valid.
I tried both revliscano's and The fourth bird's replies and they both work. Currently I'm using revliscano's method, as it lets me build the list from my original JSON in just 'one line'.
Instead of using multiple replace calls, you could use a pattern matching st with optional reet and an optional space, then capture 1+ digits in a group.
\bst(?:reet)? ?(\d+)\b
In the replacement use the capturing group street \1 using re.sub
Example code for a single element
import re
element = re.sub(r"\bst(?:reet)? ?(\d+)\b", r"street \1", "st 5")
print (element)
Output
street 5
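To apply the same substitution to the whole nested list from the question, something like the following sketch could work. The re.IGNORECASE flag is an added assumption so that "Street83" is matched as well, and normalize and STREET_RE are hypothetical names chosen for illustration.

```python
import re

# Same pattern as in the answer; IGNORECASE is an assumption added here
# so that capitalised input such as "Street83" is also matched.
STREET_RE = re.compile(r"\bst(?:reet)? ?(\d+)\b", re.IGNORECASE)

def normalize(rows):
    # Rewrite every string in every inner list.
    return [[STREET_RE.sub(r"street \1", element) for element in row]
            for row in rows]

rows = [
    ["city", "Street 158 No 96"],
    ["city", "st 144 11a 11 ap 104"],
    ["city", "Street83 # 85 - 22"],
    ["city", "str13 #153 - 81"],
    ["city", "street1h # 24 - 29"],
]
for row in normalize(rows):
    print(row)
```

With this pattern the last two rows come back unchanged: "str" is not one of the spellings the pattern covers, and in "street1h" there is no word boundary between the digit and the "h". Both cases would need the pattern to be loosened.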
I would use a regular expression to solve this. Try with the following
import re
address_list = [[re.sub(r'(?:st ?(\d)?\b)|(?:street(\d))', r'street \1\2', element)
                 for element in address]
                for address in request.json]
You can use regular expressions mixed with dictionary to make it faster.
I use a function like this in one of my programs
import re
def multiple_replace(adict, text):
    regex = re.compile("|".join(map(re.escape, adict.keys())))
    return regex.sub(lambda match: adict[match.group(0)], text)
adict is the dictionary that holds the mappings of the characters you want to replace.
For you, it can be
adict = {
    'street1': 'street 1',
    'street2': 'street 2',
    'street3': 'street 3',
    'street4': 'street 4',
    'street5': 'street 5',
}
Of course you can't use the exact same function. You will have to write another regular expression according to your needs, like @The fourth bird did.
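For illustration, here is how the dictionary-based function might be exercised. The dictionary is generated rather than typed out, and the keys are sorted longest first, which is an added precaution (not part of the answer) so that a longer key is never shadowed by a shorter one sharing the same prefix.

```python
import re

def multiple_replace(adict, text):
    # Longest keys first, so a longer key is never shadowed by a shorter prefix.
    keys = sorted(adict, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    # Look up each matched key in the dictionary to get its replacement.
    return regex.sub(lambda match: adict[match.group(0)], text)

# Build the ten "streetN" -> "street N" entries instead of writing them by hand.
adict = {f"street{i}": f"street {i}" for i in range(10)}

print(multiple_replace(adict, "street1h # 24 - 29"))
```

This only replaces exact keys, so it is case-sensitive and does nothing for "st 5" or "Street5" unless those spellings are added to the dictionary.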

Remove numbers from list but not those in a string

I have a list of strings as follows
list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
'2 days ago I saw it', 'every day is a blessing',
' 345000 people have eaten here at the beach']
I want to remove 3, but not 5th or 5x35omega44. All the solutions I have searched for and tried end up removing numbers in an alphanumeric string, but I want those to remain as is. I want my list to look as follows:
list_1 = ['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
'days ago I saw it', 'every day is a blessing',
' people have eaten here at the beach']
I am trying the following:
[' '.join(s for s in words.split() if not any(c.isdigit() for c in s)) for words in list_1]
Use lookarounds to check if digits are not enclosed with letters or digits or underscores:
import re
list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
'2 days ago I saw it', 'every day is a blessing',
' 345000 people have eaten here at the beach']
for l in list_1:
    print(re.sub(r'(?<!\w)\d+(?!\w)', '', l))
Output:
what are you guys doing there on 5th avenue
my password is 5x35omega44
days ago I saw it
every day is a blessing
people have eaten here at the beach
One approach would be to use try and except:
def is_intable(x):
    try:
        int(x)
        return True
    except ValueError:
        return False

[' '.join([word for word in sentence.split() if not is_intable(word)]) for sentence in list_1]
It sounds like you should be using regex. This will match numbers separated by word boundaries:
\b(\d+)\b
Here is a working example.
Some Python code may look like this:
import re
for item in list_1:
    new_item = re.sub(r'\b(\d+)\b', ' ', item)
    print(new_item)
I am not sure what the best way to handle spaces would be for your project. You may want to put \s at the end of the expression, making it \b(\d+)\b\s or you may wish to handle this some other way.
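One possible way to deal with the leftover spaces is to remove the numbers first and then rebuild the string from its whitespace-separated words. This is only a sketch, and strip_standalone_numbers is a hypothetical helper name; it assumes that collapsing every run of whitespace to a single space is acceptable for your data.

```python
import re

def strip_standalone_numbers(text):
    # Remove digit runs that stand alone between word boundaries,
    # leaving tokens such as "5th" or "5x35omega44" untouched.
    without_numbers = re.sub(r"\b\d+\b", "", text)
    # split() with no arguments discards leading, trailing and repeated whitespace.
    return " ".join(without_numbers.split())

list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
          '2 days ago I saw it', 'every day is a blessing',
          ' 345000 people have eaten here at the beach']
print([strip_standalone_numbers(item) for item in list_1])
```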
A shorter way is to filter the words with str.isdigit(). A check such as isinstance(word, int) would not work here, because every item produced by split() is a str:
[' '.join([word for word in expression.split() if not word.isdigit()]) for expression in list_1]
>>>['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
'days ago I saw it', 'every day is a blessing', 'people have eaten here at the beach']
Combining the very helpful regex solutions provided, in a list comprehension format that I wanted, I was able to arrive at the following:
[' '.join([re.sub(r'\b(\d+)\b', '', item) for item in expression.split()]) for expression in list_1]

Parse sentences with [value](type) format

I want to parse and extract key, values from a given sentence which follow the following format:
I want to get [samsung](brand) within [1 week](duration) to be happy.
I want to convert it into a split list like below:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
I have tried to split it using [ or ) :
re.split('\[|\]|\(|\)',s)
which is giving output:
['I want to get ',
'samsung',
'',
'brand',
' within ',
'1 week',
'',
'duration',
' to be happy.']
and
re.split('\[||\]|\(|\)',s)
is giving below output :
['I want to get ',
'samsung](brand) within ',
'1 week](duration) to be happy.']
Any help is appreciated.
Note: This is similar to stackoverflow inline links as well where if we type : go to [this link](http://google.com) it parse it as link.
As a first step we split the string while keeping the [value](type) parts, and in the second step we rewrite each part:
import re
s = 'I want to get [samsung](brand) within [1 week](duration) to be happy.'
s = re.split(r'(\[[^]]*\]\([^)]*\))', s)
s = [re.sub(r'\[([^]]*)\]\(([^)]*)\)', r'\1:\2', i) for i in s]
print(s)
Prints:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
You may use a two step approach: process the [...](...) first to format as needed and protect these using some rare/unused chars, and then split with that pattern.
Example:
import re
s = "I want to get [samsung](brand) within [1 week](duration) to be happy."
print(re.split(r'⦅([^⦅⦆]+)⦆', re.sub(r'\[([^][]*)]\(([^()]*)\)', r'⦅\1:\2⦆', s)))
The \[([^\][]*)]\(([^()]*)\) pattern matches
\[ - a [ char
([^\][]*) - Group 1 ($1): any 0+ chars other than [ and ]
]\( - ]( substring
([^()]*) - Group 2 ($2): any 0+ chars other than ( and )
\) - a ) char.
The ⦅([^⦅⦆]+)⦆ pattern just matches any ⦅...⦆ substring but keeps what is in between as it is captured.
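If the goal is the key/value pairs themselves rather than the split list, the same bracket pattern can be fed to re.findall. The sketch below is an alternative, not the approach from the answers: extract_pairs and PAIR_RE are hypothetical names, and it assumes each type (such as brand) appears at most once per sentence, since a repeated type would overwrite the earlier value.

```python
import re

# [value](type): group 1 is the text in square brackets, group 2 the text in parentheses.
PAIR_RE = re.compile(r"\[([^\[\]]*)\]\(([^()]*)\)")

def extract_pairs(sentence):
    # findall returns one (value, type) tuple per [value](type) occurrence.
    return {kind: value for value, kind in PAIR_RE.findall(sentence)}

s = "I want to get [samsung](brand) within [1 week](duration) to be happy."
print(extract_pairs(s))
```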
You could replace the ]( sequence with a colon first, then split on the [ and ) characters:
re.split(r'\[|\)', re.sub(r'\]\(', ':', s))
One approach, using re.split with a lambda function:
sentence = "I want to get [samsung](brand) within [1 week](duration) to be happy."
parts = re.split(r'(?<=[\])])\s+|\s+(?=[\[(])', sentence)
processTerms = lambda x: re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'\1:\2', x)
parts = list(map(processTerms, parts))
print(parts)
['I want to get', 'samsung:brand', 'within', '1 week:duration', 'to be happy.']

Extracting strings inside a list of strings

I have a string which is a list of strings.
Like below:
a= "['expert executive', 'internal committee period', 'report name', 'entry']"
type(a)
Out[23]:
str
Now if I want to extract all the inside strings and store it in a list I was using regular expression like below:
re.findall(r"\w+\s+\w+",a)
Out[24]:
['expert executive',
'internal committee',
'report name',
'entry']
As you can see, it only extracts two words from each string, so a string with more than two words is cut short, because my pattern only allows for two words. How do I make it work for any number of words inside a string, so that it extracts all of them? The output should be:
['expert executive',
'internal committee period',
'report name',
'entry']
The no. of words inside a string in the list can be variable.
This regular expression uses a positive lookbehind ((?<=')) and a positive lookahead ((?=')) to require the ' character at the beginning and end of each match, without including it in the resulting match:
>>> re.findall(r"(?<=')[\w\s]*(?=')",a)
['expert executive', 'internal committee period', 'report name', 'entry']
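Since the string in the question is itself a valid Python list literal, an alternative that avoids regular expressions altogether is ast.literal_eval from the standard library. This is a sketch that assumes the input is always well-formed literal syntax; literal_eval raises an exception (ValueError or SyntaxError) otherwise.

```python
import ast

a = "['expert executive', 'internal committee period', 'report name', 'entry']"

# literal_eval parses Python literals only (lists, strings, numbers, ...),
# so unlike eval() it does not execute arbitrary code.
items = ast.literal_eval(a)
print(items)
print(type(items))
```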

Regex - Matching String values and Date together

I have a string that looks like the following and am supposed to extract the key : value and i am using Regex for the same.
line ="Date : 20/20/20 Date1 : 15/15/15 Name : Hello World Day : Month Weekday : Monday"
1) Extracting the key or attributes only.
re.findall(r'\w+\s?(?=:)',line)
#['Date ', 'Date1 ', 'Name ', 'Day ', 'Weekday ']
2)Extracting the dates only
re.findall(r'(?<=:)\s?\d{2}/\d{2}/\d{2}',line)
#[' 20/20/20', ' 15/15/15']
3)Extracting the strings perfectly but also some wrong format dates.
re.findall(r'(?<=:)\s?\w+\s?\w+',line)
# [' 20', ' 15', ' Hello World', ' Month', ' Monday']
But when I try to use the OR operator to pull both the strings and dates I get wrong output. I believe the piping has not worked properly.
re.findall(r'(?<=:)\s?\w+\s?\w+|\s?\d{2}/\d{2}/\d{2}',line)
# [' 20', ' 15', ' Hello World', ' Month', ' Monday']
Any help on the above command to extract both the dates (dd/mm/yy) format and the string values will be highly appreciated.
You need to flip it around.
\s?\d{2}/\d{2}/\d{2}|(?<=:)\s?\w+\s?\w+
In an alternation, the regex engine tries the left-hand branch first, and if that branch succeeds it never tries the right-hand one. Your pattern breaks because \w+ matches the first number of the date; since / isn't a \w (word character), the match stops at that point.
Flipping it around makes it first try matching the date. If it doesn't match then it tries matching an attribute. Thus avoiding the problem.
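The effect of branch order can be seen in isolation with a much smaller example. The sample text and the variable names below are made up for illustration and are not taken from the question.

```python
import re

text = "Hello 20/20/20"
DATE = r"\d{2}/\d{2}/\d{2}"

# Word branch first: \w+ grabs "20" before the date branch gets a chance.
word_first = re.findall(r"\w+|" + DATE, text)

# Date branch first: the whole date is matched before \w+ is tried.
date_first = re.findall(DATE + r"|\w+", text)

print(word_first)  # ['Hello', '20', '20', '20']
print(date_first)  # ['Hello', '20/20/20']
```

As a rule of thumb, put the more specific branch first whenever one branch can match a prefix of what another branch is meant to match.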
