I'm trying to get a list of required names from a list of names using a regex query.
CSV file: FYI, I converted the countries from capital to small letters.
searchList:
['AU.LS1_james.aus',
'AU.LS1_scott.aus',
'AP.LS1_amanda.usa',
'AP.LS1_john.usa',
'LA.LS1_harsha.ind',
'LA.LS1_vardhan.ind',
'IECAU13_peter-tu.can',
'LONSA13_smith.gbp']
Format of the searchList: [(region)(Category)]_[EmployeeName].[country]
(region) and (Category) are concatenated.
I'm trying to get a list of each group like this,
[
['AU.LS1_james.aus', 'AU.LS1_scott.aus'],
['AP.LS1_amanda.usa', 'AP.LS1_john.usa'],
['LA.LS1_harsha.ind', 'LA.LS1_vardhan.ind']
]
Using the following regex query: \<({region}).*\{country}\>
for region, country in regionCountry:
    query = f"\<({region}).*\{country}\>"
    r = re.compile(query)
    group = list(filter(r.match, searchList))
I tried re.search as well, but the result is always None.
FYI: I also tried this query in Notepad++'s regex find functionality.
Can anyone tell me where my script is going wrong? Thank you.
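For what it's worth, Python's re module has no \< / \> word-boundary syntax (those are GNU/vim/Notepad++ extensions; \b is the closest Python equivalent), so the compiled pattern can never match. A minimal sketch of the grouping with plain anchors, assuming the regionCountry pairs from the sample data:

```python
import re

searchList = ['AU.LS1_james.aus', 'AU.LS1_scott.aus',
              'AP.LS1_amanda.usa', 'AP.LS1_john.usa',
              'LA.LS1_harsha.ind', 'LA.LS1_vardhan.ind',
              'IECAU13_peter-tu.can', 'LONSA13_smith.gbp']

regionCountry = [('AU', 'aus'), ('AP', 'usa'), ('LA', 'ind')]

groups = []
for region, country in regionCountry:
    # ^ and $ anchor the match; the literal dots are escaped with \.
    # \< and \> are NOT valid in Python's re, so they are dropped here.
    r = re.compile(rf"^{region}\..*\.{country}$")
    groups.append([s for s in searchList if r.match(s)])

print(groups)
```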
Without regex: split on the first '.' and use a dictionary to group the entries.

Data:
entries = ['AU.LS1_james.aus', 'AU.LS1_scott.aus', 'AP.LS1_amanda.usa', 'AP.LS1_john.usa', 'LA.LS1_harsha.ind', 'LA.LS1_vardhan.ind']
Solution 1: simple dict and setdefault
d = {}
for entry in entries:
    d.setdefault(entry.split('.', 1)[0], []).append(entry)
Solution 2: defaultdict
from collections import defaultdict

d = defaultdict(list)
for entry in entries:
    d[entry.split('.', 1)[0]].append(entry)
The result is in d.values():
>>> list(d.values())
[['AU.LS1_james.aus', 'AU.LS1_scott.aus'],
['AP.LS1_amanda.usa', 'AP.LS1_john.usa'],
['LA.LS1_harsha.ind', 'LA.LS1_vardhan.ind']]
Thanks to everyone who tried to help. This answer worked well for my use case. For some reason Python doesn't accept \< and \>, so I just removed them and it worked fine. I didn't expect that there could be such limitations in the re library.
Answer:

query = f"({region}).*{country}"
books_over10['Keywords'] = ""
r = Rake()  # uses stopwords for English from NLTK, and all punctuation characters
for index, row in books_over10.iterrows():
    r.extract_keywords_from_text(row['bookTitle'])  # results are stored on r
    c = r.get_ranked_phrases()  # keyword phrases ranked with scores, highest to lowest
    books_over10.at[index, 'Keywords'] = c
books_over10.head()
I am using the above code to process all rows, extract keywords from the bookTitle column of each row, and insert them as a list into a new Keywords column on the same row. The question is whether there is a more efficient way to do this without iterating over all rows, because it takes a lot of time. Any help would be appreciated. Thanks in advance!
Solution by Changming:

def extractor(row):
    r.extract_keywords_from_text(row)
    return r.get_ranked_phrases()  # keyword phrases ranked with scores, highest to lowest

r = Rake()  # uses stopwords for English from NLTK, and all punctuation characters
books_over10['Keywords'] = books_over10['bookTitle'].map(extractor)
Try looking into map. I'm not sure exactly which Rake you are using, and the way you have it coded is a bit confusing, but the general syntax would be:
books_over10['Keywords'] = books_over10['bookTitle'].map(lambda a: FUNCTION(a))
I need to find a lot of substrings in a string, but it takes a lot of time, so I need to combine them into one pattern.

I need to find strings like:
003.ru/%[KEYWORD]%
1click.ru/%[KEYWORD]%
3dnews.ru/%[KEYWORD]%
where % is any run of symbols,
and [KEYWORD] can be one of ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706'].
I tried to do the search with:
keywords = ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
for i, key in enumerate(keywords):
    coding['keyword_url'] = coding.url.apply(lambda x: x.replace('[KEYWORD]', key).replace('%', '[a-zA-Z0-9-_\.\?!##$%^&*+=]+') if '[KEYWORD]' in x else x.replace('%', '[a-zA-Z0-9-_\.\?!##$%^&*+=]+'))
    for (domain, keyword_url) in zip(coding.domain.values.tolist(), coding.keyword_url.values.tolist()):
        df.loc[df.event_address.str.contains(keyword_url), 'domain'] = domain
where df contains only event_address (URLs).
coding
domain url
003.ru 003.ru/%[KEYWORD]%
1CLICK 1click.ru/%[KEYWORD]%
33033.ru 33033.ru/%[KEYWORD]%
3D NEWS 3dnews.ru/%[KEYWORD]%
96telefonov.ru 96telefonov.ru/%[KEYWORD]%
How can I improve my pattern to make this faster?
First, you should consider using the re module. Look at the re.compile function to pre-compile your patterns once; then you can match them repeatedly.
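To sketch that idea (the domain-to-label mapping below is a hypothetical stand-in for the coding frame): translate the '%' placeholder to '.*' once, build one alternation over all keywords, and compile one pattern per domain up front instead of rebuilding patterns inside the keyword loop.

```python
import re

keywords = ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
# '%' in the keyword templates means "any symbols"; translate it to '.*'
kw_alt = '|'.join(k.replace('%', '.*') for k in keywords)

# hypothetical domain -> label mapping standing in for the `coding` frame
domains = {'003.ru': '003.ru', '1click.ru': '1CLICK', '3dnews.ru': '3D NEWS'}

# compile each pattern ONCE, outside any per-row loop
compiled = {label: re.compile(re.escape(host) + '/.*(?:' + kw_alt + ')')
            for host, label in domains.items()}

def find_domain(url):
    # single pass over the precompiled patterns
    for label, rx in compiled.items():
        if rx.search(url):
            return label
    return None

print(find_domain('003.ru/catalog/sony-xperia-z3'))  # -> 003.ru
print(find_domain('1click.ru/shop/iphone-6s'))       # -> 1CLICK
```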
I am trying to write some Python code that will replace some unwanted string using RegEx. The code I have written has been taken from another question on this site.
I have a text:
text_1=u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.'
I want to remove all of the \u2019m, \u2019s, \u2019ve, etc.
The code that I've written is given below:
rep={"\n":" ","\n\n":" ","\n\n\n":" ","\n\n\n\n":" ",u"\u201c":"", u"\u201d":"", u"\u2019[a-z]":"", u"\u2013":"", u"\u2018":""}
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text_1)
The code works perfectly for:
u"\u201c":"", u"\u201d":"", u"\u2013":"", and u"\u2018":""
However, it doesn't work for:
u"\u2019[a-z]": the presence of [a-z] means re.escape turns the key into \\[a\\-z\\], which doesn't match.
The output I am looking for is:
text_1=u'I winning, I enjoyed none of it. That why I withdrawing from the market,wrote Arment.'
How do I achieve this?
The information about the newlines completely changes the answer. For this, I think building the expression using a loop is actually less legible than just using better formatting in the pattern itself.
replacements = {'newlines': ' ',
                'deletions': ''}

pattern = re.compile(u'(?P<newlines>\n+)|'
                     u'(?P<deletions>\u201c|\u201d|\u2019[a-z]?|\u2013|\u2018)')

def lookup(match):
    return replacements[match.lastgroup]

text = pattern.sub(lookup, text_1)
The problem here is actually the escaping, this code does what you want more directly:
remove = (u"\u201c", u"\u201d", u"\u2019[a-z]?", u"\u2013", u"\u2018")
pattern = re.compile("|".join(remove))
text = pattern.sub("", text_1)
I've added the ? to the u2019 match, as I suppose that's what you want as well given your test string.
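Why re.escape breaks the original approach can be demonstrated directly: it makes every regex metacharacter literal, so the intended character class [a-z] stops being a class.

```python
import re

escaped = re.escape(u'\u2019[a-z]')
print(escaped)  # the [, -, ] of the class are escaped to literal characters

# the escaped pattern now looks for the literal text "[a-z]", so nothing matches
print(re.search(escaped, u'I\u2019m done'))

# keeping the class un-escaped matches as intended
print(re.search(u'\u2019[a-z]', u'I\u2019m done').group())
```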
For completeness, I think I should also link to the Unidecode package which may actually be more closely what you're trying to achieve by removing these characters.
The simplest way is a single alternation (note that \u2019 etc. in text_1 are real Unicode characters, not backslash sequences, so the pattern must contain the characters themselves):
X = re.compile(u'\u2019[a-z]?|[\u201c\u201d\u2018\u2013]')
text = X.sub('', text_1)
I'm confronted with such a challenge right now. I've read some web classes and Dive Into Python on regex and found nothing on my issue, so I'm not sure whether this is even possible to achieve.
Given this dict-alike string:
"Mon.":[11.76,7.13],"Tue.":[11.76,7.19],"Wed.":[11.91,6.94]
I'd like to compare values in brackets at corresponding positions and take only the greatest one. So comparing 11.76, 11.76, 11.91 should result in 11.91.
My alternative is to get all the values and compare them afterwards but I'm wondering whether regex could cope?
>>> import ast
>>> text = '''"Mon.":[11.76,7.13],"Tue.":[11.76,7.19],"Wed.":[11.91,6.94]'''
>>> rows = ast.literal_eval('{' + text + '}').values()
>>> [max(col) for col in zip(*rows)]
[11.91, 7.19]
Try this:
import re

text = '''"Mon.":[11.76,7.13],"Tue.":[11.76,7.19],"Wed.":[11.91,6.94]'''
values = re.findall(r'\[(.*?)\]', text)   # ['11.76,7.13', '11.76,7.19', '11.91,6.94']
values = [v.split(',') for v in values]   # split each bracket group on commas
columns = list(zip(*values))              # transpose: one tuple per position
print(max(map(float, columns[0])))
print(max(map(float, columns[1])))
Output:
11.91
7.19
I have a list like this:
Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]
I want to strip the unwanted characters using python so the list would look like:
Tomato
Populus trichocarpa
I can do the following for the first one:
name = ">Tomato4439"
name = name.strip(">1234567890")
print name
Tomato
However, I am not sure what to do with the second one. Any suggestion would be appreciated.
given:
s='Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]'
this:
s = s.split()
[s[0].strip('0123456789,'), s[-2].replace('[',''), s[-1].replace(']','')]
will give you
['Tomato', 'Populus', 'trichocarpa']
It might be worth investigating regular expressions if you are going to do this frequently and the "rules" might not be static, since regular expressions are much more flexible at handling such data. For the sample problem you present, though, this will work.
import re
a = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
re.sub(r"^([A-Za-z]+).+\[([^]]+)\]$", r"\1 \2", a)
This gives
'Tomato Populus trichocarpa'
If the strings you're trying to parse are consistent semantically, then your best option might be classifying the different "types" of strings you have, and then creating regular expressions to parse them using python's re module.
>>> import re
>>> line = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
>>> match = re.match("^([a-zA-Z]+).*\[([a-zA-Z ]+)\].*",line)
>>> match.groups()
('Tomato', 'Populus trichocarpa')
Edited to not include the [] in the second group. This should work for anything that matches the pattern of your query (i.e. starts with a name, ends with something in []). For example, it would also match:
"Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa apples]"
Previous answers were simpler than mine, but:
Here is one way to print the stuff that you don't want.
tag = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
import re
find = re.search('>(.+?) \[', tag).group(1)
print find
Gives you
gi|224089052|ref|XP_002308615.1| predicted protein
Then you can use the replace function to remove that from the original string, and the translate function to remove the extra unwanted characters.
tag2 = tag.replace(find, "")
tag3 = str.translate(tag2, None, ">[],")
print tag3
Gives you
Tomato4439 Populus trichocarpa