Python re.findall() purpose in this code

I am currently learning Python and I am trying to decipher some code I found online. The point of the code is to compare a raw-string pattern (the user-input key) against the data and, if it matches, print the matching record.
I am having trouble understanding what re.findall() is doing in this code.
So head[0] contains a data record:
('2016-12-22 06:28:36', u'Kith x New Era K 59FIFTY Cap - Pink',
u'http://kithnyc.com/products/kith-x-new-era-59fifty-cap-pink')
key contains a raw string:
key=r'Nike|Ultra'
head = self.data
for k in key:
    print k
    flag = re.findall(k, str(head[0]), flags=re.I)
    print len(flag)
    if len(flag) > 4:
        print head[0]
From my understanding, the purpose of the code is to loop through key and see if it matches head[0]; if it matches, it prints head[0]. However, it still prints head[0]
('2016-12-22 06:28:36', u'Kith x New Era K 59FIFTY Cap - Pink',
u'http://kithnyc.com/products/kith-x-new-era-59fifty-cap-pink')
even though the key doesn't match.

It is supposed to print the items in head if they match the key regex. Use the following code instead:
import re

head = ('2016-12-22 06:28:36', 'nike item', 'ultra item', 'Kith x New Era K 59FIFTY Cap - Pink', 'http://kithnyc.com/products/kith-x-new-era-59fifty-cap-pink')
key = r'Nike|Ultra'  # This is a regex pattern, matches `Nike` or `Ultra`
for s in head:  # Iterate over the items in head
    if re.search(key, s, flags=re.I):  # Search for a match in each item, case-insensitively
        print(s)  # Print if found
Output: nike item and ultra item.
In your code, you loop through the characters of the pattern with for k in key:, so re.findall searches for all non-overlapping matches of one single character k at a time. Moreover, only head[0] is ever checked; all the other items are never considered.
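To see why the original check fires even without a real match, here is a minimal sketch (standalone data in place of self.data, Python 3 print syntax) showing that iterating the pattern yields single characters, so len(flag) > 4 is easily satisfied by common letters:
import re

head = [('2016-12-22 06:28:36', u'Kith x New Era K 59FIFTY Cap - Pink',
         u'http://kithnyc.com/products/kith-x-new-era-59fifty-cap-pink')]
key = r'Nike|Ultra'

for k in key:  # iterates 'N', 'i', 'k', 'e', '|', 'U', ...
    flag = re.findall(k, str(head[0]), flags=re.I)
    # e.g. k = 'i' matches every 'i'/'I' in the record, so len(flag) > 4 is true
    print(k, len(flag))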

Specifying word boundaries for multiple string replacement with regex?

I'm trying to mask city names in a list of texts using 'PAddress' tags. To do this, I borrowed thejonny's solution here for performing multiple regex substitutions using a dictionary with regex expressions as keys. In my implementation, the cities are keys and the values are tags that correspond to the exact format of the keys (this is important because the format must be preserved down the line). Eg., {East-Barrington: PAddress-PAddress}, so East-Barrington would be replaced by PAddress-PAddress; one tag per word, with punctuation and spacing preserved. Below is my code - sub_mult_regex() is the helper function called by mask_multiword_cities().
import difflib
import re

def sub_mult_regex(text, keys, tag_type):
    '''
    Replaces/masks multiple words at once
    Parameters:
        Text: TIU note
        Keys: a list of words to be replaced by the regex
        Tag_type: string you want the words to be replaced with
    Creates a replacement dictionary of keys and values
    (values are the length of the key, preserving formatting).
    Eg., {68 Oak St., PAddress PAddress PAddress.,}
    Returns text with relevant text masked
    '''
    # Creating a list of values to correspond with keys (see key:value example in docstring)
    add_vals = []
    for val in keys:
        add_vals.append(re.sub(r'\w{1,100}', tag_type, val))  # To preserve the precise punctuation etc. of the keys, only replace word matches with tags
    # Zipping keys and values together as a dictionary
    add_dict = dict(zip(keys, add_vals))
    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("(" + key + ")" for key in add_dict), re.IGNORECASE)
    # This is where the multiple substitutions happen
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys
    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text)  # text_sub is the masked text
    else:
        text_sub = text  # Not all texts have names, so text_sub would otherwise have been NoneType and broken the function
    # Information on what words were changed pre- and post-masking (eg., would return 'ANN ARBOR' if that city was masked here)
    case_a = text
    case_b = text_sub
    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]
    return text_sub, diff_list

def mask_multiword_cities(text_string):
    multi_word_cities = list(set([city for city in us_cities_all if len(city.split(' ')) > 1 and len(city) > 3 and "Mc" not in city and "State" not in city and city != 'Mary D']))
    return sub_mult_regex(text_string, multi_word_cities, "PAddress")
The problem is that the keys in the regex dictionary don't have word boundaries specified, so while only exact matches should be tagged (case-insensitively), phrases like 'around others' get tagged because the regex thinks the city 'Round O' is in it (technically 'round o' is a substring of it). Take this example text, run through the mask_multiword_cities function:
add_string = "The cities are Round O , NJ and around others"
mask_multiword_cities(add_string)
#(output): ('The cities are PAddress PAddress NJ , and aPAddress PAddressthers', [' Round', ' O', ' around', ' others'])
The output should only be ('The cities are PAddress PAddress NJ , and around others', [' Round', ' O']). I've tried converting each key to a regex expression like r"\b(?=\w)key\b(?!\w)" at various points in the sub_mult_regex function (lines 26 and 37), but that didn't work as expected.
For testing, assume that:
us_cities_all = ['Great Barrington', 'Round O', 'East Orange'].
Also, if anyone can help make this run faster/be more efficient, that would be great! Right now, it takes about 30 seconds to run on a 1000-word note, likely because us_cities_all contains 5,000 cities. Let me know if it would be more helpful to directly post the cities list; I wasn't sure how to do so.
I figured out a word-boundary based solution that would handle multiple cities, in case anyone might find it helpful in a similar situation:
def sub_mult_regex(text, keys, tag_type, city):
    '''
    Replaces/masks multiple words at once
    Parameters:
        text: TIU note
        keys: a list of words to be replaced by the regex
        tag_type: string you want the words to be replaced with
        city: bool, True if replacing cities, False if replacing anything else
    Creates a replacement dictionary of keys and values
    (values are the length of the key, preserving formatting).
    Eg., {68 Oak St, PAddress PAddress PAddress}
    Returns text with relevant text masked
    '''
    # Creating a list of values to correspond with keys (see key:value example in docstring)
    if city:
        # If we're masking a city, handle word boundaries
        # Only including keys that show up in the text speeds the code up a lot, since it's not cross-referencing against thousands of cities, only the ones present
        keys = [r"\b" + key + r"\b" for key in keys if key in text or key.upper() in text]  # Add word boundaries for each key in the list
        add_vals = []
        for val in keys:
            # Create a dictionary of city word:PAddress by splitting the city on the '\\b' chars that remain and then adding one tag per word
            # Ex: '\\bDeer Island\\b' --> split('\\b') --> ['', 'Deer Island', ''] --> ''.join --> (key) Deer Island : (value) PAddress PAddress
            add_vals.append(re.sub(r'\w{1,100}', tag_type, ''.join(val.split('\\b'))))  # To preserve the precise punctuation etc. of the keys, only replace word matches with tags
        add_vals = [re.sub(r'\\b', "", val) for val in add_vals]
    else:
        # If we're not masking a city, we don't do the word boundary step
        add_vals = []
        for val in keys:
            add_vals.append(re.sub(r'\w{1,100}', tag_type, val))
    # Zipping keys and values together as a dictionary
    add_dict = dict(zip(keys, add_vals))
    print("add_dict:", add_dict)
    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("(" + key + ")" for key in add_dict), re.IGNORECASE)
    # This is where the multiple substitutions happen
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys
    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text)  # text_sub is the masked text
    else:
        text_sub = text  # Not all texts have names, so text_sub would otherwise have been NoneType and broken the function
    # Information on what words were changed pre- and post-masking (eg., would return 'ANN ARBOR' if that city was masked here)
    case_a = text
    case_b = text_sub
    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]
    return text_sub, diff_list
# sample call:
add_string = 'The cities are Round O NJ, around others and East Orange'
mask_multiword_cities(add_string)  # this function remained the same
# output:
# add_dict: {'\\bEast Orange\\b': 'PAddress PAddress', '\\bRound O\\b': 'PAddress PAddress'}
# ('The cities are PAddress PAddress NJ, around others and PAddress PAddress', [' Round', ' O', ' East', ' Orange'])
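On the efficiency question, a simpler variant may also be worth trying (a minimal sketch under the question's test assumptions, not the poster's code; the helper name mask_cities is made up): build one alternation with word boundaries and let a replacement callback generate one tag per word, which avoids the per-key re.sub passes and the group-index bookkeeping.
import re

us_cities_all = ['Great Barrington', 'Round O', 'East Orange']  # test list from the question

def mask_cities(text, cities, tag_type="PAddress"):
    # One alternation, longest names first so overlapping names prefer the longer match
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, sorted(cities, key=len, reverse=True))) + r")\b",
        re.IGNORECASE)
    # Replace each word of a matched city with the tag, preserving punctuation/spacing
    return pattern.sub(lambda m: re.sub(r'\w+', tag_type, m.group()), text)

print(mask_cities("The cities are Round O , NJ and around others", us_cities_all))
# The cities are PAddress PAddress , NJ and around others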

What is the fastest way to match words in text?

I have a list of regexes like:
regex_list = [".+rive.+", ".+ll", "[0-9]+ blue car.+", ...]  ## list of length 3000
What is the best method to match all of these regexes against my text?
For example:
text: Hello, Owning 2 blue cars for a single driver
In the output, I want a list of the matched words:
matched_words = ["Hello", "4 blue cars", "driver"]  ## Hello <==> .+llo
Alright, first of all, you will probably want to adjust your regex_list, because as it stands, matching those patterns will give you the entire text back as the match. This is because of .+, which allows any character to follow any number of times. What I have done here is the following:
import re

regex_list = [".rive.", ".+ll.", "[0-9]+ blue car."]
text = "Hello, Owning 2 blue cars for a single driver"

# Returns all the spans of matched regex items in text
# (skipping patterns that do not match at all, where re.search returns None)
spans = [m.span() for m in (re.search(regex_item, text) for regex_item in regex_list) if m]

# Sorts the spans on first occurrence (so, first element in item for every item in span).
spans.sort()

# Retrieves the text via index of spans in text.
matching_texts = [text[x[0]:x[1]] for x in spans]

print(matching_texts)
I adjusted your regex_list slightly, so it does not match the entire text. Then I retrieve all spans of the matches within the text, sort the spans on first occurrence, and lastly retrieve the matched texts via the indexes of the spans and print them out. What you will get is the following:
['Hello', '2 blue cars', 'driver']
NOTE: I am unsure why you would like to match '4 blue cars', because that is not in your text.
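A small efficiency note (my addition, not part of the answer): for a 3000-pattern list it helps to compile each pattern once up front rather than re-parsing it on every call:
import re

regex_list = [".rive.", ".+ll.", "[0-9]+ blue car."]
compiled = [re.compile(p) for p in regex_list]  # compile once, reuse many times

def match_words(text):
    # Collect spans of the patterns that actually match, sorted by position
    spans = sorted(m.span() for m in (p.search(text) for p in compiled) if m)
    return [text[a:b] for a, b in spans]

print(match_words("Hello, Owning 2 blue cars for a single driver"))
# ['Hello', '2 blue cars', 'driver']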
You could also try this, which is a multi-threaded version of @Lexpj's answer:
from concurrent.futures import ThreadPoolExecutor, as_completed
import re

# list of length 3000
regex_list = [".rive.", ".+ll.", "[0-9]+ blue car."]
my_string = "Hello, Owning 2 blue cars for a single driver "

def test(text, regex):
    # Returns the span of the matched regex item in text, guarding against no match
    match = re.search(regex, text)
    spans = [match.span()] if match else []
    # Retrieves the text via index of spans in text.
    matching_texts = [text[x[0]:x[1]] for x in spans]
    return matching_texts

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(test, my_string, regex)
               for regex in regex_list}
    # as_completed() gives you the threads once finished
    matched = set()
    for f in as_completed(futures):
        # Get the results
        rs = f.result()
        matched = matched.union(set(rs))
print(matched)
Looking at the desired result, your regexes are not correct. You don't want to match .+, but \w+, and also with the second regex, you'll want to match some letters after ll too.
The main idea is then to make one regex for all, by concatenating them with the | symbol:
import re
regex_list = [r"\w+rive\w+", r"\w+ll\w+", r"\d+ blue car\w+"]
regex = re.compile('|'.join(regex_list))
text = "Hello, Owning 2 blue cars for a single driver "
print(regex.findall(text)) # ["Hello","2 blue cars","driver"]
This could still give undesired effects when a part of your string would match more than one regex in the list. In that case the first will "win". So make sure that when multiple regexes could match the same text, they are ordered by their desired priority.
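A quick illustration of that ordering rule (a made-up example, not from the question): the alternation is tried left to right at each position, so the earlier pattern wins even when a later one would match more text.
import re

text = "driver"
# 'drive' is listed first, so it wins at position 0 even though 'driver' is longer
print(re.compile('|'.join([r"drive", r"driver"])).findall(text))  # ['drive']
# Listing the longer pattern first yields the longer match
print(re.compile('|'.join([r"driver", r"drive"])).findall(text))  # ['driver']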

python if/else list comprehension

I was wondering if it's possible to use list comprehension in the following case, or if it should be left as a for loop.
temp = []
for value in my_dataframe[my_col]:
    match = my_regex.search(value)
    if match:
        temp.append(value.replace(match.group(1), ''))
    else:
        temp.append(value)
I believe I can do it with the if/else section, but the 'match' line throws me off. This is close but not exactly it.
temp = [value.replace(match.group(1), '') if (match) else value
        for value in my_dataframe[my_col] if my_regex.search(value)]
Single-statement approach:
result = [
    value.replace(match.group(1), '') if match else value
    for value, match in (
        (value, my_regex.search(value))
        for value in my_dataframe[my_col])]
Functional approach - python 2:
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda (v, m): v.replace(m.group(1), '') if m else v
result = map(fix, gen)
Functional approach - python 3:
from itertools import starmap
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda v, m: v.replace(m.group(1), '') if m else v
result = list(starmap(fix, gen))
Pragmatic approach:
def fix_string(value):
    match = my_regex.search(value)
    return value.replace(match.group(1), '') if match else value

result = [fix_string(value) for value in my_dataframe[my_col]]
This is actually a good example of a list comprehension that performs worse than its corresponding for-loop and is (far) less readable.
If you wanted to do it, this would be the way:
temp = [value.replace(my_regex.search(value).group(1), '') if my_regex.search(value) else value
        for value in my_dataframe[my_col]]
Note that there is no place for us to define match inside the comprehension, and as a result we have to call my_regex.search(value) twice. This is of course inefficient.
As a result, stick to the for-loop!
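One addendum (mine, not the answerer's): since Python 3.8 an assignment expression can bind the match once inside the comprehension, removing the double search:
# Python 3.8+: the walrus operator binds the search result inside the comprehension
temp = [value.replace(m.group(1), '') if (m := my_regex.search(value)) else value
        for value in my_dataframe[my_col]]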
Another approach: use a regular expression with a subgroup pattern that captures a three-letter word ending in 'he' (such as 'the' or 'she'), followed by whitespace and a word containing 'el', with a word of context on either side. The subgroup pattern is applied to every sentence:
paragraph="""either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to wonder what was
going to happen next. first, she tried to look down and make out what
she was coming to, but it was too dark to see anything; then she
looked at the sides of the well, and noticed that they were filled with
cupboards and book-shelves; here and there she saw maps and pictures
hung upon pegs. she took down a jar from one of the shelves as
she passed; it was labelled 'orange marmalade', but to her great
disappointment it was empty: she did not like to drop the jar for fear
of killing somebody, so managed to put it into one of the cupboards as
she fell past it."""
import re

sentences = paragraph.split(".")
pattern = r"\w+\s+((\whe)\s+(\w+el\w+)){1}\s+\w+"
temp = []
for sentence in sentences:
    result = re.findall(pattern, sentence)
    for item in result:
        temp.append("".join(item[0]).replace(' ', ''))
print(temp)
output:
['thewell', 'shefell', 'theshelves', 'shefell']

Repeated regex groups of arbitrary number

I have this example text snippet
headline:
Status[apphmi]: blubb, 'Statustext1'
Main[apphmi]: bla, 'Maintext1'Main[apphmi]: blaa, 'Maintext2'
Popup[apphmi]: blaaa, 'Popuptext1'
and I want to extract the words within the quotes, grouped by their context (status, main, popup).
My current regex is (example at pythex.org):
headline:(?:\n +Status\[apphmi\]:.* '(.*)')*(?:\n +Main\[apphmi\]:.* '(.*)')*(?:\n +Popup\[apphmi\]:.* '(.*)')*
but with this I only get 'Maintext2' and not both. I don't know how to repeat the groups an arbitrary number of times.
You can try this:
r"(.*?]):(?:[^']*)'([^']*)'"
Group 1 and Group 2 of each match contain your key-value pair.
You cannot merge duplicate keys with the regex alone; once you have all the pairs, you can apply some programming to merge duplicate keys into one. Here I have used a dictionary of lists: if a key already exists in the dictionary, append the value to its list; otherwise insert the new key with a new list containing the value.
This is how it can be done (tested in Python 3):
import re

d = dict()
regex = r"(.*?]):(?:[^']*)'([^']*)'"
test_str = ("headline: \n"
            "Status[apphmi]: blubb, 'Statustext1'\n"
            "Main[apphmi]: bla, 'Maintext1'Main[apphmi]: blaa, 'Maintext2'\n"
            "Popup[apphmi]: blaaa, 'Popuptext1'")

for match in re.finditer(regex, test_str):
    if match.group(1) in d:
        d[match.group(1)].append(match.group(2))
    else:
        d[match.group(1)] = [match.group(2)]

print(d)
Output:
{
'Popup[apphmi]': ['Popuptext1'],
'Main[apphmi]': ['Maintext1', 'Maintext2'],
'Status[apphmi]': ['Statustext1']
}
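As a side note (my suggestion, not part of the original answer), collections.defaultdict removes the explicit key check:
import re
from collections import defaultdict

regex = r"(.*?]):(?:[^']*)'([^']*)'"
test_str = ("headline: \n"
            "Status[apphmi]: blubb, 'Statustext1'\n"
            "Main[apphmi]: bla, 'Maintext1'Main[apphmi]: blaa, 'Maintext2'\n"
            "Popup[apphmi]: blaaa, 'Popuptext1'")

d = defaultdict(list)
for match in re.finditer(regex, test_str):
    d[match.group(1)].append(match.group(2))  # missing keys start as empty lists
print(dict(d))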

Python regex: Match ALL consecutive capitalized words

Short question:
I have a string:
title="Announcing Elasticsearch.js For Node.js And The Browser"
I want to find all pairs of words where each word is properly capitalized.
So, expected output should be:
['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']
What I have right now is this:
'[A-Z][a-z]+[\s-][A-Z][a-z.]*'
This gives me the output:
['Announcing Elasticsearch.js', 'For Node.js', 'And The']
How can I change my regex to give desired output?
You can use this:
#!/usr/bin/python
import re
title="Announcing Elasticsearch.js For Node.js And The Browser TEst"
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'
print(re.findall(pattern, title))
A "normal" pattern can't match overlapping substrings, all characters are founded once for all. However, a lookahead (?=..) (i.e. "followed by") is only a check and match nothing. It can parse the string several times. Thus if you put a capturing group inside the lookahead, you can obtain overlapping substrings.
There's probably a more efficient way to do this, but you could use a regex like this:
(\b[A-Z][a-z.-]+\b)
Then iterate through the captured matches, testing each pair against this regex: (^[A-Z][a-z.-]+$), to ensure the current match and the next match are both properly capitalized.
Working example:
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
if m:
    for i in range(1, len(m)):  # start at 1 so the last word is not paired with the first
        if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
            matchlist.append([m[i - 1], m[i]])
print(matchlist)
Output:
[
    ['Announcing', 'Elasticsearch.js'],
    ['Elasticsearch.js', 'For'],
    ['For', 'Node.js'],
    ['Node.js', 'And'],
    ['And', 'The'],
    ['The', 'Browser']
]
If your Python code at the moment is this
title = "Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall(r"[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)
then your program is skipping the odd-numbered pairs. An easy solution is to re-run the search after skipping the first word, like this:
m = re.match(r"[A-Z][a-z]+[\s-]", title)
title_without_first_word = title[m.end():]
# Note: the first word part needs [a-z.]+ (dot included) here, otherwise words like
# 'Elasticsearch.js' can never start a pair and those matches are lost
results2 = re.findall(r"[A-Z][a-z.]+[\s-][A-Z][a-z.]*", title_without_first_word)
Now just combine results and results2.
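A hedged way to merge them back into string order (my sketch; it assumes each pair occurs only once in the title, which holds here):
# Sort the combined pairs by where each appears in the title
combined = sorted(results + results2, key=title.find)
print(combined)
# ['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js',
#  'Node.js And', 'And The', 'The Browser']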
