Python - delete part of link in array - python

I have a list of links in an array, such as
results = ['link1/1254245',
           'q%(random part)cache:link2/1254245& (random part) Dclnk',
           'link3/1254245']
where each link stands for something like http://www.whatever.com.
I want to replace the q%(random part)cache: part and the &(random part)Dclnk part with nothing, so that the "clean" link2 is cut out and left over among the other "clean" links. The random part always changes in content and length; the q% ... cache: and & ... Dclnk parts stay the same.
How do I do that? I could not find a straight answer to this so far.

You could achieve this with re.sub and a list comprehension.
>>> import re
>>> l = ['link1/1254245', 'q%(random part)cache:link2/1254245& (random part) Dclnk', 'link3/1254245']
>>> [re.sub(r'q%[^(]*\([^()]*\)cache:|&\s*\([^()]*\)\s*Dclnk', r'', i) for i in l]
['link1/1254245', 'link2/1254245', 'link3/1254245']
[^()]* matches any character except ( or ), zero or more times. The | alternation operator combines the two patterns, so either one is removed wherever it occurs.
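If the cleanup is needed in more than one place, the same pattern can be wrapped in a small helper. A minimal sketch of my own, assuming the prefix/suffix markers only vary in their random parts:
import re

# Strips the "q%...(random)cache:" prefix and the "&(random)Dclnk" suffix;
# links that carry neither are returned unchanged.
CACHE_WRAPPER = re.compile(r'q%[^(]*\([^()]*\)cache:|&\s*\([^()]*\)\s*Dclnk')

def clean_link(link):
    return CACHE_WRAPPER.sub('', link)

results = ['link1/1254245',
           'q%(random part)cache:link2/1254245& (random part) Dclnk',
           'link3/1254245']
print([clean_link(r) for r in results])
# ['link1/1254245', 'link2/1254245', 'link3/1254245']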

Related

Reformat a string with special tokens to a list/dictionary containing its tokens as elements?

I have a string (as an output of a model that generates sequences) in the format --
<bos> <new_gen> ent1 <gen> rel1_ent1 <gen> rel2_ent1 <new_gen> ent2 <gen> rel1_ent2 <eos>
Because this is a collection of elements generated as a sentence/sequence, I would like to reformat it to a list/dictionary (to evaluate the quality of responses) --
[ [ent1, rel1_ent1, rel2_ent1], [ent2, rel1_ent2] ] or
{ "ent1" : ["rel1_ent1", "rel2_ent1"], "ent2" : ["rel1_ent2"] }
So far, the way I have been looking at this is via splitting the string by <bos> and/or <eos> special tokens -- test_string.split("<bos>")[1].split("<eos>")[0].split("<rel>")[1:]. But I am not sure how to handle generality if I do this across a large set of sequences with varying length (i.e. # of rel_ents associated with a given ent).
Also, I feel there might be a more optimal way to do this (without ugly splitting and looping), maybe regex? Either way, I am entirely unsure and looking for a more optimal solution.
Added note: the special tokens <bos>, <new_gen>, <gen>, <eos> can be entirely removed from the generated output if that helps.
Well, there could be a smoother way without, as you call it, "ugly splitting and looping", but re.finditer may be a good option here. Find each substring of interest with the pattern:
<new_gen>\s(\w+)\s<gen>\s(\w+(?:\s<gen>\s\w+)*)
See an online demo. We can then use capture group 1 as the key and split capture group 2 again to build each list of values:
import regex as re

s = '<bos> <new_gen> ent1 <gen> rel1_ent1 <gen> rel2_ent1 <new_gen> ent2 <gen> rel1_ent2 <eos>'
result = re.finditer(r'<new_gen>\s(\w+)\s<gen>\s(\w+(?:\s<gen>\s\w+)*)', s)
d = {}
for match_obj in result:
    d[match_obj.group(1)] = match_obj.group(2).split(' <gen> ')
print(d)
Prints:
{'ent1': ['rel1_ent1', 'rel2_ent1'], 'ent2': ['rel1_ent2']}
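The same result can also be reached with plain string splitting, since the question notes the special tokens can be removed entirely. A rough sketch, not part of the original answer:
s = '<bos> <new_gen> ent1 <gen> rel1_ent1 <gen> rel2_ent1 <new_gen> ent2 <gen> rel1_ent2 <eos>'

# Drop <bos>/<eos>, split the sequence on <new_gen>, then split each chunk on <gen>.
body = s.replace('<bos>', '').replace('<eos>', '').strip()
d = {}
for chunk in body.split('<new_gen>'):
    if not chunk.strip():
        continue
    ent, *rels = [tok.strip() for tok in chunk.split('<gen>')]
    d[ent] = rels
print(d)  # {'ent1': ['rel1_ent1', 'rel2_ent1'], 'ent2': ['rel1_ent2']}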

Splitting strings in a list by the first 10 characters, Python

I am trying to get the ASIN for each product on Amazon, which is the first ten characters after dp/. I have gotten to the point where I have those characters, but the junk after them is still attached. Any help?
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
for url in product_lst:
    product_lst = url.split("dp/")
    for url in product_lst:
        del product_lst[::2]
    print(product_lst)
Output:
['B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ']
['B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals']
['B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae']
['B089RDSML3']
['B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2']
For searches in text, the re (regex) module is a good choice:
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
import re
results = []
for url in product_lst:
    m = re.search(r"/dp/([^/?]+)", url)
    if m:
        results.append(m.groups()[0])
print(results)
Output:
['B07R2CNSTK', 'B08SLYY1WD', 'B079QHML21', 'B089RDSML3', 'B081J8SGH7']
I use r"/dp/([^/?]+)" as the pattern, which boils down to a capture group for anything after /dp/ up to (but not including) the next / or ?.
You can test regexes online - I use http://regex101.com for complex ones - it can even generate Python code based on what you enter in its fields (not that I use that, though ;o)).
You can change your own code to
for url in product_lst:
    part = url.split("dp/")
    if len(part) > 1:      # "blablubb dp/ more things" splits into 2 or more parts
        print(part[1])     # print what is left after dp/
to avoid overwriting your list product_lst - but you will still need to trim the stuff after / and ?, as sketched below.
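A minimal sketch of that trimming step (my own addition, assuming the ASIN always ends at the first / or ? that follows it):
import re

for url in product_lst:
    part = url.split("dp/")
    if len(part) > 1:
        # Cut everything from the first "/" or "?" after the ASIN.
        asin = re.split(r"[/?]", part[1], maxsplit=1)[0]
        print(asin)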
After you split() on 'dp/', there is absolutely no reason to loop. You know exactly where the data you want is, so just get it directly:
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
for url in product_lst:
    split_lst = url.split("dp/")
    print(split_lst[1][:10])
I assume that the ASIN is always 10 characters. Adjust the slice if it is longer but still of fixed length; otherwise you will need a different approach.
You can directly get the ASIN without splitting the data.
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
ASIN = []
for url in product_lst:
    idx = url.find("/dp/")
    ASIN.append(url[idx+4:idx+14])
print(ASIN)
Output:
['B07R2CNSTK', 'B08SLYY1WD', 'B079QHML21', 'B089RDSML3', 'B081J8SGH7']
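Another option, sketched here as an addition to the answers above: parse each URL with the standard library's urllib.parse and take the path segment that follows dp, which avoids manual index arithmetic and ignores the query string automatically.
from urllib.parse import urlparse

asins = []
for url in product_lst:
    # urlparse separates the query string, so the path ends before any "?".
    segments = [s for s in urlparse(url).path.split("/") if s]
    if "dp" in segments:
        asins.append(segments[segments.index("dp") + 1])
print(asins)  # ['B07R2CNSTK', 'B08SLYY1WD', 'B079QHML21', 'B089RDSML3', 'B081J8SGH7']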

Extract strings between two words that are supplied from two lists respectively

I have a text which looks like an email body as follows.
To: Abc Cohen <abc.cohen#email.com> Cc: <braggis.mathew#nomail.com>,<samanth.castillo#email.com> Hi
Abc, I happened to see your report. I have not seen any abnormalities and thus I don't think we
should proceed to Braggis. I am open to your thought as well. Regards, Abc On Tue 23 Jul 2017 07:22 PM
Tony Stark wrote:
Then I have a list of key words as follows.
no_wds = ["No","don't","Can't","Not"]
yes_wds = ["Proceed","Approve","May go ahead"]
Objective:
I want to first search the text string given above, and if any of the key words listed above are present, extract the strings in between those key words. In this case the keywords not and don't are matched from no_wds, and the keyword proceed is matched from yes_wds. Thus I want the extracted text as a list, as follows:
txt = ["seen any abnormalities and thus I don't think we should", "think we should"]
My approach:
I have tried
re.findall(r'{}(.*){}'.format(re.escape('|'.join(no_wds)),re.escape('|'.join(yes_wds))),text,re.I)
Or
text_f = []
for i in no_wds:
    for j in yes_wds:
        t = re.findall(r'{}(.*){}'.format(re.escape(i), re.escape(j)), text, re.I)
        text_f.append(t)
I didn't get any suitable result. Then I tried the str.find() method; no success there either.
I tried to get a clue from here.
Can anybody help solve this? A non-regex solution is what I am keen to see, as regex is at times not a good fit. Having said that, if anyone can come up with a regex-based solution where I can iterate over the lists, that is welcome too.
Loop through the lists containing the keywords and use each keyword as a splitter (whatever.split(keyword)).
EDIT:
I am not doing your homework, but this should get you on your way:
I decided to split the message at every space, loop through the resulting list, search for the key words, and add the index of each hit to a list; those indexes are then used to slice the message. It is probably worth trying to slice the message without splitting it first, but I am not going to do your homework. You must also find a way to automate the process when there are more indexes; tip: check whether the number of hits is even, or you are going to have a bad time slicing.
*Note that you should replace the \n characters and find a way to sort the key lists.
message = """To: Abc Cohen <abc.cohen#email.com> Cc: <braggis.mathew#nomail.com>,<samanth.castillo#email.com> Hi
Abc, I happened to see your report. I have not seen any abnormalities and thus I don't think we
should proceed to Braggis. I am open to your thought as well. Regards, Abc On Tue 23 Jul 2017 07:22"""

no_wds = ["No", "don't", "Can't", "Not"]
yes_wds = ["Proceed", "Approve", "May go ahead"]

splittedMessage = message.split(' ')
msg = []
for i in range(0, len(splittedMessage)):
    temp = splittedMessage[i]
    for j, k in zip(no_wds, yes_wds):
        tempJ = j.lower()
        tempK = k.lower()
        if tempJ == temp or tempK == temp:
            msg.append(i)

found = ' '.join(splittedMessage[msg[0]:msg[1]])
print(found)
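Since a regex-based solution that iterates the lists is also welcome, here is a minimal sketch of my own (not part of the original answer): build one alternation per list, escape each keyword, and capture the text between a no-word and the next yes-word. Note that re.findall only returns non-overlapping spans, so it yields the first of the two expected fragments; adding word boundaries (\b) around the alternations would make the matching stricter.
import re

no_pat = "|".join(re.escape(w) for w in no_wds)
yes_pat = "|".join(re.escape(w) for w in yes_wds)
# Capture everything (non-greedy) between a "no" keyword and the next "yes" keyword.
pattern = rf"(?:{no_pat})\s+(.*?)\s+(?:{yes_pat})"

print(re.findall(pattern, message, flags=re.I | re.S))
# ["seen any abnormalities and thus I don't think we\nshould"]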

python if/else list comprehension

I was wondering if it's possible to use list comprehension in the following case, or if it should be left as a for loop.
temp = []
for value in my_dataframe[my_col]:
    match = my_regex.search(value)
    if match:
        temp.append(value.replace(match.group(1), ''))
    else:
        temp.append(value)
I believe I can do it with the if/else section, but the 'match' line throws me off. This is close but not exactly it.
temp = [value.replace(match.group(1), '') if match else value
        for value in my_dataframe[my_col] if my_regex.search(value)]
Single-statement approach:
result = [
    value.replace(match.group(1), '') if match else value
    for value, match in (
        (value, my_regex.search(value))
        for value in my_dataframe[my_col])]
Functional approach - python 2:
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda (v, m): v.replace(m.group(1), '') if m else v
result = map(fix, gen)
Functional approach - python 3:
from itertools import starmap
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda v, m: v.replace(m.group(1), '') if m else v
result = list(starmap(fix, gen))
Pragmatic approach:
def fix_string(value):
    match = my_regex.search(value)
    return value.replace(match.group(1), '') if match else value

result = [fix_string(value) for value in my_dataframe[my_col]]
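For illustration, here is a hypothetical usage of the pragmatic approach; the DataFrame, column name, and pattern below are made up, since the question does not show them:
import re
import pandas as pd

# Hypothetical inputs, for illustration only.
my_dataframe = pd.DataFrame({"text": ["alpha_old", "beta", "gamma_old"]})
my_col = "text"
my_regex = re.compile(r"(_old)$")

def fix_string(value):
    match = my_regex.search(value)
    return value.replace(match.group(1), '') if match else value

result = [fix_string(value) for value in my_dataframe[my_col]]
print(result)  # ['alpha', 'beta', 'gamma']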
This is actually a good example of a list comprehension that performs worse than its corresponding for-loop and is (far) less readable.
If you wanted to do it, this would be the way:
temp = [value.replace(my_regex.search(value).group(1), '') if my_regex.search(value) else value
        for value in my_dataframe[my_col]]
Note that there is no place for us to define match inside the comprehension, and as a result we have to call my_regex.search(value) twice. This is of course inefficient.
As a result, stick to the for-loop!
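As an aside not in the original answer: on Python 3.8 and later, an assignment expression can cache the match inside the comprehension so the pattern only runs once per value. A minimal sketch, assuming that Python version:
# Sketch, assuming Python 3.8+: the walrus operator binds the match once per value.
temp = [value.replace(m.group(1), '') if (m := my_regex.search(value)) else value
        for value in my_dataframe[my_col]]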
Use a regular expression with a sub-group pattern: match a word and a space, then capture a group made of a short word ending in "he" (\whe, e.g. "the" or "she"), a space, and a word containing "el" (\w+el\w+), followed by a space and one more word.
import re

paragraph = """either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to wonder what was
going to happen next. first, she tried to look down and make out what
she was coming to, but it was too dark to see anything; then she
looked at the sides of the well, and noticed that they were filled with
cupboards and book-shelves; here and there she saw maps and pictures
hung upon pegs. she took down a jar from one of the shelves as
she passed; it was labelled 'orange marmalade', but to her great
disappointment it was empty: she did not like to drop the jar for fear
of killing somebody, so managed to put it into one of the cupboards as
she fell past it."""

sentences = paragraph.split(".")
pattern = r"\w+\s+((\whe)\s+(\w+el\w+)){1}\s+\w+"
temp = []
for sentence in sentences:
    result = re.findall(pattern, sentence)
    for item in result:
        temp.append("".join(item[0]).replace(' ', ''))
print(temp)
Output:
['thewell', 'shefell', 'theshelves', 'shefell']

Extract weight of an item from its description using regex in python

I have a list of product descriptions, for example:
items = ['avuhovi Grillikaapeli 320g', 'Savuhovi Kisamakkara 320g',
         'Savuhovi Raivo 250g', 'AitoMaku str.garl.sal.dres.330ml',
         'Rydbergs 225ml Hollandaise sauce']
I want to extract the weights, that is, 320g, 320g, 250g, 330ml, 225ml. I know we can use a regex for this, but I do not know how to build one that extracts them. You can see that the weights are sometimes in the middle of the description and sometimes have a dot (.) as separator rather than a space, so I am confused about how to extract them.
Thanks for help in advance :)
Here is one solution that may work (using search and group, as suggested by Wiktor):
>>> import re
>>> for t in items:
...     re.search(r'([0-9]+(g|ml))', t).group(1)
...
'320g'
'320g'
'250g'
'330ml'
'225ml'
Indeed, a better solution (thanks, Wiktor) would be to test whether there is a match:
>>> res = []
>>> for t in items:
...     m = re.search(r'(\d+(g|ml))', t)
...     if m:
...         res.append(m.group(1))
...
>>> print(res)
https://regex101.com/r/gy5YTp/4
Match any run of digits with \d+, then add a matching but non-capturing group (?:ml|g) that matches ml or g.
import re
items = ['avuhovi Grillikaapeli 320g', 'Savuhovi 333ml Kisamakkara 320g', 'Savuhovi Raivo 250g', 'AitoMaku str.garl.sal.dres.330ml', 'Rydbergs 225ml Hollandaise sauce']
groupedWeights = [re.findall(r'(\d+(?:ml|g))', i) for i in items]
flattenedWeights = [y for x in groupedWeights for y in x]
print(flattenedWeights)
The findall call returns a list of lists of the weights found, so we need to flatten it with [y for x in groupedWeights for y in x].
That is, if you ever have more than one weight in an element. Otherwise we can take the first element of each list, like this:
weights = [re.findall(r'(\d+(?:ml|g))', i)[0] for i in items]
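One caveat worth adding (my note, not part of the original answer): if a description contains no recognizable weight at all, findall returns an empty list and the [0] indexing raises an IndexError. A small guard avoids that:
# Sketch: fall back to None when a description has no recognizable weight.
weights = []
for item in items:
    found = re.findall(r'\d+(?:ml|g)', item)
    weights.append(found[0] if found else None)
print(weights)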
