Python and regex: create a template - python

I need to find many substrings in a string, but it takes a lot of time, so I want to combine them into a single pattern.
I need to find strings like
003.ru/%[KEYWORD]%
1click.ru/%[KEYWORD]%
3dnews.ru/%[KEYWORD]%
where % stands for any run of characters
and [KEYWORD] can be any of ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
I tried to do the search with
keywords = ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
for i, key in enumerate(keywords):
    coding['keyword_url'] = coding.url.apply(lambda x: x.replace('[KEYWORD]', key).replace('%', '[a-zA-Z0-9-_\.\?!##$%^&*+=]+') if '[KEYWORD]' in x else x.replace('%', '[a-zA-Z0-9-_\.\?!##$%^&*+=]+'))
    for (domain, keyword_url) in zip(coding.domain.values.tolist(), coding.keyword_url.values.tolist()):
        df.loc[df.event_address.str.contains(keyword_url), 'domain'] = domain
where df contains only the event_address column (URLs), and coding looks like this:
domain          url
003.ru          003.ru/%[KEYWORD]%
1CLICK          1click.ru/%[KEYWORD]%
33033.ru        33033.ru/%[KEYWORD]%
3D NEWS         3dnews.ru/%[KEYWORD]%
96telefonov.ru  96telefonov.ru/%[KEYWORD]%
How can I improve my pattern to make this faster?

First, consider using the re module. Look at the re.compile function: compile each pattern once, then reuse the compiled object for matching.
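One way to do this could look like the sketch below. It is not the asker's code: the helper names (template_to_regex, find_domain), the tidied character class in ANY, and the two sample templates are assumptions made for illustration. The idea is to fold all keywords into one alternation per URL template, so each URL is tested once per domain instead of once per domain and keyword.

```python
import re

# Character class standing in for "%". Assumption: a tidied version of the
# class used in the question, with '-' moved last so it is a literal.
ANY = r"[a-zA-Z0-9_.?!#$%^&*+=-]+"

def template_to_regex(template, keywords):
    """Compile a template such as '003.ru/%[KEYWORD]%' into one pattern."""
    # A keyword may itself contain '%', so escape its literal parts and
    # join them with the wildcard class.
    alternatives = [
        ANY.join(re.escape(part) for part in keyword.split("%"))
        for keyword in keywords
    ]
    keyword_group = "(?:" + "|".join(alternatives) + ")"
    pieces = []
    # The capturing group keeps the '%' and '[KEYWORD]' separators in the result.
    for chunk in re.split(r"(%|\[KEYWORD\])", template):
        if chunk == "%":
            pieces.append(ANY)
        elif chunk == "[KEYWORD]":
            pieces.append(keyword_group)
        else:
            pieces.append(re.escape(chunk))
    return re.compile("".join(pieces))

keywords = ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
# Two rows of the coding table, used here as sample data.
templates = {'003.ru': '003.ru/%[KEYWORD]%', '3D NEWS': '3dnews.ru/%[KEYWORD]%'}
patterns = {domain: template_to_regex(t, keywords) for domain, t in templates.items()}

def find_domain(url):
    """Return the first domain whose pattern is found in url, else None."""
    for domain, pattern in patterns.items():
        if pattern.search(url):
            return domain
    return None
```

With pandas, the pattern source can be passed to the same call the question uses, e.g. df.event_address.str.contains(pattern.pattern). Whether this is faster on your data is something to measure rather than assume.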

Related

Making multiple "any" more efficient

I am using any to see if a string in a longer string (description) matches with any strings across several lists. I have the code working, but I feel like it's an inefficient way of doing a comparison, and would like feedback on how I can make it more efficient.
def convert_category(description):
    categoryFood = ['COUNTDOWN', 'BAKE', 'MCDONALDS', 'ST PIERRE', 'PAK N SAVE', 'NEW WORLD']
    categoryDIY = ['BUNNINGS', 'MITRE10']
    containsFood = any(keyword in description for keyword in categoryFood)
    containsDIY = any(keyword in description for keyword in categoryDIY)
    if(containsFood):
        return 'Food and Groceries'
    elif(containsDIY):
        return 'Home and DIY'
    return ''
I would use a regular expression. They are optimized for this kind of problem - searching for any of multiple strings - and the hot part of the code is pushed into a fast library. With big enough strings you should notice the difference.
import re
foodPattern = '|'.join(map(re.escape, categoryFood))
diyPattern = '|'.join(map(re.escape, categoryDIY))
containsFood = re.search(foodPattern, description) is not None
containsDiy = re.search(diyPattern, description) is not None
You can easily extend this with word boundaries (\b) or similar features to make the keyword matching smarter, for example to match only whole words.
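As a rough sketch of the whole-word variant, the patterns can be compiled once at module level and wrapped in \b. The dictionary layout and the names category_keywords and category_patterns are my own choices, not part of the question.

```python
import re

category_keywords = {
    'Food and Groceries': ['COUNTDOWN', 'BAKE', 'MCDONALDS', 'ST PIERRE', 'PAK N SAVE', 'NEW WORLD'],
    'Home and DIY': ['BUNNINGS', 'MITRE10'],
}

# Compile each alternation once; \b on both sides restricts matches to whole words.
category_patterns = [
    (category, re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'))
    for category, words in category_keywords.items()
]

def convert_category(description):
    """Return the first category whose keywords appear as whole words."""
    for category, pattern in category_patterns:
        if pattern.search(description):
            return category
    return ''
```

Note the behaviour change: with word boundaries, 'BAKERY LTD' no longer counts as a match for 'BAKE', which the plain substring test would have accepted.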
From the sounds of things, the only other way to make this faster is the negligible gain from returning earlier. Marking as answered and closing.

Python Regex re.compile query

I'm trying to get a list of required names from a list of names using a regex query.
CSV file (FYI, I converted the countries from capital to lowercase letters):
searchList:
['AU.LS1_james.aus',
'AU.LS1_scott.aus',
'AP.LS1_amanda.usa',
'AP.LS1_john.usa',
'LA.LS1_harsha.ind',
'LA.LS1_vardhan.ind',
'IECAU13_peter-tu.can',
'LONSA13_smith.gbp']
Format of the searchList: [(region)(Category)]_[EmployeeName].[country]
(region)(Category) is concatenated.
I'm trying to get a list of each group like this,
[
['AU.LS1_james.aus', 'AU.LS1_scott.aus'],
['AP.LS1_amanda.usa', 'AP.LS1_john.usa'],
['LA.LS1_harsha.ind', 'LA.LS1_vardhan.ind']
]
Using the following regex query: \<({region}).*\{country}\>
for region, country in regionCountry:
    query = f"\<({region}).*\{country}\>"
    r = re.compile(query)
    group = list(filter(r.match, searchList))
I tried re.search as well, but the group is always None.
FYI: I also tried this query in Notepad++ using its regex find functionality.
Can anyone tell me where my script is going wrong? Thank you.
Without regex: use split and a dictionary to group the entries.
Data:
entries = ['AU.LS1_james.aus', 'AU.LS1_scott.aus', 'AP.LS1_amanda.usa', 'AP.LS1_john.usa', 'LA.LS1_harsha.ind', 'LA.LS1_vardhan.ind']
Solution 1: simple dict and setdefault
d = {}
for entry in entries:
    d.setdefault(entry.split('.',1)[0], []).append(entry)
Solution 2: defaultdict
from collections import defaultdict
d = defaultdict(list)
for entry in entries:
    d[entry.split('.',1)[0]].append(entry)
Result is in d.values()
>>> list(d.values())
[['AU.LS1_james.aus', 'AU.LS1_scott.aus'],
['AP.LS1_amanda.usa', 'AP.LS1_john.usa'],
['LA.LS1_harsha.ind', 'LA.LS1_vardhan.ind']]
Thank you all for helping with my question. This answer worked out well for my use case. Python's re module does not support \< and \> (it uses \b for word boundaries), so I just removed them and it worked fine. I didn't expect the re library to have limitations like this.
Answer:
f({region}).*\{country}
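If you do want to stay with a regex, a sketch along these lines should work. The regionCountry pairs are hypothetical, since the question does not show them. re.escape keeps the region and country from being read as regex syntax, and \b replaces the unsupported \< and \>.

```python
import re

searchList = ['AU.LS1_james.aus', 'AU.LS1_scott.aus',
              'AP.LS1_amanda.usa', 'AP.LS1_john.usa',
              'LA.LS1_harsha.ind', 'LA.LS1_vardhan.ind',
              'IECAU13_peter-tu.can', 'LONSA13_smith.gbp']

# Hypothetical (region, country) pairs; the question does not list them.
regionCountry = [('AU', 'aus'), ('AP', 'usa'), ('LA', 'ind')]

groups = []
for region, country in regionCountry:
    # \b is the word-boundary assertion in Python's re module.
    pattern = re.compile(rf"\b{re.escape(region)}.*\.{re.escape(country)}\b")
    # match() anchors at the start of each name.
    groups.append([name for name in searchList if pattern.match(name)])
```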

Find date and time in a text column in a dataframe Python

I'm trying to find and extract the date and time in a column that contain text sentences. The example data is as below.
df = {'Id': ['001', '002',...],
      'Description': ['THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. THE INTERUPTION ALSO INVOLVED A, B, C AND SOME OTHER TOWN AREAS. OTC AND SST SERVICES INTERRUPTED AS GENSET ALSO WORKING AT THAT TIME. WE CALL FOR SERVICE. THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM', 'today is 23/3/2013 #10:AM we have',...],
      ....
      }
df = pd.DataFrame(df, columns=['Id', 'Description'])
I have tried the datefinder library as below, but it gives today's date, which is wrong.
findDate = dtf.find_dates(le['Description'][0])
for dates in findDate:
    print(dates)
Does anyone know what is the best way to extract it and automatically put it into a new column? Or does anyone know any library that can calculate duration between time and date in a string text. Thank you.
So you have two issues here.
You want to know how to apply a function to a DataFrame.
You want a function that extracts a pattern from a block of text.
Here is how to apply a function to a Series (if you select only one column, as I did, you get a Series). Bonus points: read the DataFrame.apply() and Series.apply() documentation (30s) to become a Pandas-chad!
def do_something(x):
    # Placeholder: put your extraction logic here and return the result.
    return x
df['new_text_column'] = df['original_text_column'].apply(do_something)
And here is one way to extract patterns from a string using regexes. Read the regex doc (or follow a course) and play around with RegExr to become an omniscient god (that is, if you use a command line on Linux, along with your regex knowledge).
Modified from: How to extract the substring between two markers?
import re
text = 'gfgfdAAA1234ZZZuijjk'
# Search for the first run of digits (raw string so \d reaches the regex engine intact).
m = re.search(r'\d+', text)
if m:
    found = m.group(0)
    # found: '1234'
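Putting the two pieces together for the data in the question, a minimal sketch might look like this. The patterns are assumptions based only on the formats visible in the sample rows (dates like 27.1.2020 or 23/3/2013, times like 9.6AM or 10.30AM), and the names DATE_RE, TIME_RE and extract_dates_times are made up for the example. Real data will probably need more cases.

```python
import re

# Day, month and 4-digit year separated by '.' or '/'.
DATE_RE = re.compile(r'\b\d{1,2}[./]\d{1,2}[./]\d{4}\b')
# Hour and minute separated by '.' or ':', followed by AM or PM.
TIME_RE = re.compile(r'\b\d{1,2}[.:]\d{1,2}\s?[AP]M\b', re.IGNORECASE)

def extract_dates_times(text):
    """Return (dates, times) as two lists of the matched substrings."""
    return DATE_RE.findall(text), TIME_RE.findall(text)

# Shortened excerpt of the first Description value from the question.
sample = ('THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. '
          'THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM')
dates, times = extract_dates_times(sample)
```

On a DataFrame this could be applied per row, e.g. df['Description'].apply(extract_dates_times). The matches are still plain strings, so computing a duration would require parsing them into datetime objects first.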

Array has multi strings against text with multiline ( regular expression) Python

I am working with regular expressions in Python. I have spent a whole week on this and can't understand what is wrong with my code. It seems obvious that multiple strings should match, but I only get a few of them, such as "model" and "US"; I can't match 37abc5afce16xxx or "-104.99875". My goal is just to tell whether any string in the array matches, and which one it is.
I have a string such as:
text = {'"version_name"': '"8.5.2"', '"abi"': '"arm64-v8a"', '"x_dpi"':
'515.1539916992188', '"environment"': '{"sdk_version"',
'"time_zone"':
'"America\\/Wash"', '"user"': '{}}', '"density_default"': '560}}',
'"resolution_width"': '1440', '"package_name"':
'"com.okcupid.okcupid"', '"d44bcbfb-873454-4917-9e02-2066d6605d9f"': '{"language"', '"country"':
'"US"}', '"now"': '1.515384841291E9', '{"extras"': '{"sessions"',
'"device"': '{"android_version"', '"y_dpi"': '37abc5afce16xxx',
'"model"': '"Nexus 6P"', '"new"': 'true}]', '"only_respond_with"':
'["triggers"]}\n0\r\n\r\n', '"start_time"': '1.51538484115E9',
'"version_code"': '1057', '"-104.99875"': '"0"', '"no_acks"': 'true}',
'"display"': '{"resolution_height"'}
The array of strings is:
Keywords =["37abc5afce16xxx","867686022684243", "ffffffff-f336-7a7a-0f06-65f40033c587", "long", "Lat", "uuid", "WIFI", "advertiser", "d44bcbfb-873454-4917-9e02-2066d6605d9f","deviceFinger", "medialink", "Huawei","Andriod","US","local_ip","Nexus", "android2.10.3","WIFI", "operator", "carrier", "angler", "MMB29M", "-104.99875"]
My code is:
for x in Keywords:
    pattern = r"^.*"+str(x)+"^.*"
    if re.findall(pattern, str(values1),re.M):
        print "Match"
        print x
    else:
        print "Not Match"
Your code's goal is a bit confusing, so I'm assuming you want to check which items from the Keywords list also appear in the text dictionary.
In your code, it looks like you only compare the regex to the dictionary values, not the keys (assuming that's what the values1 variable holds).
Also, instead of using the regex "^.*" to match strings, you can simply do
for X in Keywords:
    if X in yourDictionary.keys():      # exact match against a key
        doSomething
    if X in yourDictionary.values():    # exact match against a value
        doSomethingElse
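One caveat: the in test on keys() and values() only finds exact matches, and several keys and values in the question carry extra quotes or braces (for example '"Nexus 6P"' or '"US"}'). A substring test may be closer to what is wanted. The sketch below uses a cut-down copy of the dictionary and keyword list, and the function name find_keywords is hypothetical. It also avoids the original pattern, where the second ^ can only match at the start of a line.

```python
def find_keywords(keywords, mapping):
    """Return the keywords that occur as substrings of any key or value."""
    haystack = [str(k) for k in mapping] + [str(v) for v in mapping.values()]
    found = []
    for keyword in keywords:
        # Plain substring test: no regex, so '.' and '-' need no escaping.
        if keyword not in found and any(keyword in item for item in haystack):
            found.append(keyword)
    return found

# Cut-down sample of the data in the question.
text = {'"y_dpi"': '37abc5afce16xxx', '"model"': '"Nexus 6P"',
        '"country"': '"US"}', '"-104.99875"': '"0"'}
keywords = ["37abc5afce16xxx", "uuid", "US", "Nexus", "WIFI", "-104.99875"]
matches = find_keywords(keywords, text)
```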

Python - delete part of link in array

I have a list of links in an array, such as
results = ['link1/1254245',
'q%(random part)cache:link2/1254245& (random part) Dclnk',
'link3/1254245']
where link = http://www.whatever.com.
I want to replace the term q%3(random part)cache and &(random part)Dclnk with nothing so that the "clean" link2 is "cut" out and left over among the other "clean" links. The random part changes always in content and length. The q%3 : and & Dclnk stay the same.
How do I do that? I could not find a straight answer to that so far.
You could achieve this with re.sub and a list comprehension.
>>> import re
>>> l = ['link1/1254245', 'q%(random part)cache:link2/1254245& (random part) Dclnk', 'link3/1254245']
>>> [re.sub(r'q%[^(]*\([^()]*\)cache:|&\s*\([^()]*\)\s*Dclnk', r'', i) for i in l]
['link1/1254245', 'link2/1254245', 'link3/1254245']
[^()]* matches any character other than ( or ), zero or more times. The | alternation operator lets you combine multiple patterns.
