if you have a sitemap.xml containing:
abc.com/sitemap-1.xml
abc.com/sitemap-2.xml
abc.com/image-sitemap.xml
How do I write sitemap_follow so that it reads only the sitemap-N sitemaps and not image-sitemap.xml?
I tried with
^sitemap
with no luck. What should I do? Negate "image"? How?
Edit:
Scrapy code:
self._follow = [regex(x) for x in self.sitemap_follow]
and
if any(x.search(loc) for x in self._follow):
The regex is applied to the whole URL. The only way I see to solve this without modifying Scrapy is either to have a scraper just for abc.com and add the domain to the regex, or simply to add the / to the regex.
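A quick check of that behavior (a sketch using the example locations from the question and re.search() the way Scrapy applies it):

```python
import re

# hypothetical sitemap index entries from the question
locs = [
    "abc.com/sitemap-1.xml",
    "abc.com/sitemap-2.xml",
    "abc.com/image-sitemap.xml",
]

# "^sitemap" never matches because each loc starts with the domain;
# anchoring on the "/" instead works for these inputs
followed = [loc for loc in locs if re.search(r"/sitemap", loc)]
print(followed)  # ['abc.com/sitemap-1.xml', 'abc.com/sitemap-2.xml']
```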
To answer your question naively and directly, I offer this code. In other words, I can match each of the items in the sitemap index file using the regex ^.*$.
>>> import re
>>> sitemap_index_file_content = [
...     'abc.com/sitemap-1.xml',
...     'abc.com/sitemap-2.xml',
...     'abc.com/image-sitemap.xml'
... ]
>>> for s in sitemap_index_file_content:
...     m = re.match(r'^.*$', s)
...     if m:
...         m.group()
...
'abc.com/sitemap-1.xml'
'abc.com/sitemap-2.xml'
'abc.com/image-sitemap.xml'
This implies that you would set sitemap_follow in the following way, since the spiders documentation says that this variable expects to receive a list.
>>> sitemap_follow = ['^.*$']
But then the same page of documentation says, 'By default, all sitemaps are followed.' Thus, this would appear to be entirely unnecessary.
I wonder what you are trying to do.
EDIT: In response to a comment. You might be able to do this using what is called a 'negative lookbehind assertion'; in this case that's the (?<!image-). My reservation is that you need to be able to scan over stuff like abc.com at the beginnings of the URLs, which could present quite fascinating challenges.
>>> for s in sitemap_index_file_content:
...     m = re.match(r'[^\/]*\/(?<!image-)sitemap.*', s)
...     if m:
...         m.group()
...
'abc.com/sitemap-1.xml'
'abc.com/sitemap-2.xml'
One option to skip URLs is to override sitemap_filter() on your class:
import re
import logging

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "scraperapi_sitemap"

    # Your current code goes here ...

    def sitemap_filter(self, entries):
        """This method can be used to filter sitemap entries by their
        attributes; for example, you can filter locs with lastmod greater
        than a given date (see the docs), or skip entries matching a
        complex regex.
        """
        image_url_pattern = '.*/image-.*'
        for entry in entries:
            result = re.match(image_url_pattern, entry['loc'])
            if result:
                logging.info("Skipping " + str(entry))
            else:
                yield entry
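The filtering logic itself can be exercised without Scrapy; here is a minimal sketch of the same generator over hypothetical entry dicts:

```python
import re
import logging

def filter_entries(entries, image_url_pattern='.*/image-.*'):
    # skip any entry whose 'loc' matches the image pattern
    for entry in entries:
        if re.match(image_url_pattern, entry['loc']):
            logging.info("Skipping " + str(entry))
        else:
            yield entry

entries = [
    {'loc': 'http://abc.com/sitemap-1.xml'},
    {'loc': 'http://abc.com/image-sitemap.xml'},
]
kept = list(filter_entries(entries))
print(kept)  # [{'loc': 'http://abc.com/sitemap-1.xml'}]
```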
Related
I am trying to get the ASIN number for each product on Amazon, which is the first ten characters after dp/. I have gotten to the point where I have those characters, but there is still junk after them. Any help?
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
for url in product_lst:
    product_lst = url.split("dp/")
    for url in product_lst:
        del product_lst[::2]
    print(product_lst)
Output:
['B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ']
['B08SLYY1WD/?encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref=pd_gw_deals']
['B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae']
['B089RDSML3']
['B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2']
For searches in text, the re (regex) module is a good choice:
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
import re

results = []
for url in product_lst:
    m = re.search(r"/dp/([^/?]+)", url)
    if m:
        results.append(m.groups()[0])
print(results)
Output:
['B07R2CNSTK', 'B08SLYY1WD', 'B079QHML21', 'B089RDSML3', 'B081J8SGH7']
I use r"/dp/([^/?]+)" as the pattern, which boils down to a grouped match for anything after /dp/, matching everything up to the next / or ?.
You can test regexes online - I use http://regex101.com (for complex ones) - it can even generate Python code based on what you enter in its fields (not that I use that, though ;o)).
You can change your own code to
for url in product_lst:
    part = url.split("dp/")
    if len(part) > 1:  # blablubb dp/ more things => 2 or more parts
        print(part[1])  # print what is left after dp/
to avoid overwriting your list product_lst - but you will still need to trim stuff after / and ? with it.
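That trimming might look like this (a sketch with hypothetical, shortened URLs):

```python
import re

urls = [
    "https://www.amazon.com/Item-Name/dp/B07R2CNSTK/ref=foo?bar=1",
    "https://www.amazon.com/dp/B089RDSML3",
]
asins = []
for url in urls:
    part = url.split("dp/")
    if len(part) > 1:
        # cut everything from the first "/" or "?" onward
        asins.append(re.split(r"[/?]", part[1])[0])
print(asins)  # ['B07R2CNSTK', 'B089RDSML3']
```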
After you split() on the 'dp/', there is absolutely no reason to loop. You know exactly where the data is that you want, so just get it directly:
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
for url in product_lst:
    split_lst = url.split("dp/")
    print(split_lst[1][:10])
I assume that the ASIN is always 10 characters. Adjust the slice if there are more characters and the length is still fixed. Otherwise you will need to find a different approach.
You can directly get the ASIN without splitting the data.
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
ASIN = []
for url in product_lst:
    idx = url.find("/dp/")
    ASIN.append(url[idx+4:idx+14])
print(ASIN)
Output:
['B07R2CNSTK', 'B08SLYY1WD', 'B079QHML21', 'B089RDSML3', 'B081J8SGH7']
I am trying to extract elements using a regex, while needing to also distinguish which lines have "-External" at the end. The naming structure I am working with is:
<ServerName>: <Country>-<CountryCode>
or
<ServerName>: <Country>-<CountryCode>-External
For example:
test1 = 'Neo1: Brussels-BRU-External'
test2 = 'Neo1: Brussels-BRU'
match = re.search(r'(?<=: ).+', test1)
print match.group(0)
For test2 this gives me "Brussels-BRU" (test1 gives "Brussels-BRU-External"). I am trying to extract "Brussels" and "BRU" separately, while not caring about anything to the left of the ":".
After that, I need to know when a line has "-External". Is there a way I can treat the existence of "-External" as True and its absence as None?
I suggest that regexes are not needed here, and that a simple split or two can get you what you are after. Here is a way to split() the line into pieces from which you can then select what you are interested in:
Code:
def split_it(a_string):
    on_colon = a_string.split(':')
    return on_colon[0], on_colon[1].strip().split('-')
Test Code:
tests = (
    'Neo1: Brussels-BRU-External',
    'Neo1: Brussels-BRU',
)

for test in tests:
    print(split_it(test))
Results:
('Neo1', ['Brussels', 'BRU', 'External'])
('Neo1', ['Brussels', 'BRU'])
Analysis:
The length of the list can be used to determine if the additional field 'External' is present.
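To get the True/None behavior the question asks about, the length check could be wrapped like this (a sketch reusing split_it; is_external is a hypothetical helper name):

```python
def split_it(a_string):
    on_colon = a_string.split(':')
    return on_colon[0], on_colon[1].strip().split('-')

def is_external(line):
    # three fields means the trailing '-External' is present
    _, fields = split_it(line)
    return True if len(fields) == 3 else None

print(is_external('Neo1: Brussels-BRU-External'))  # True
print(is_external('Neo1: Brussels-BRU'))           # None
```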
I want all the tags in a text that look like <Bus:1234|Bob Alice> or <Car:5678|Nelson Mandela> to be replaced with <a my-inner-type="BS:1234">Bob Alice</a> and <a my-inner-type="CR:5678">Nelson Mandela</a> respectively. So basically, depending on the type (whether TypeA or TypeB), I want to replace the text accordingly in a text string using Python 3 and regex.
I tried doing the following in python but not sure if that's the right approach to go forward:
import re
def my_replace():
    re.sub(r'\<(.*?)\>', replace_function, data)
With the above, I am trying to do a regex match on the < > tags, and for every tag I find, pass it to a function called replace_function that splits the text inside the tag, determines whether it is a TypeA or a TypeB, computes the replacement, and returns the replacement tag dynamically. I am not even sure if this is possible with re.sub, but any leads would help. Thank you.
Examples:
<Car:1234|Bob Alice> becomes <a my-inner-type="CR:1234">Bob Alice</a>
<Bus:5678|Nelson Mandela> becomes <a my-inner-type="BS:5678">Nelson Mandela</a>
This is perfectly possible with re.sub, and you're on the right track using a replacement function (which is designed to allow dynamic replacements). See below for an example that works with the examples you give - you'll probably have to modify it to suit your use case, depending on what other data is present in the text (i.e. other tags you need to ignore).
import re

def replace_function(m):
    # note: to not modify the text (ie if you want to ignore this tag),
    # simply do (return the entire original match):
    # return m.group(0)
    inner = m.group(1)
    t, name = inner.split('|')
    # process type here - the following will only work if types always follow
    # the pattern given in the question
    typename = t[4:]
    # EDIT: based on your edits, you will probably need more processing here
    # eg:
    if t.split(':')[0] == 'Car':
        typename = 'CR'
    # etc
    return '<a my-inner-type="{}">{}</a>'.format(typename, name)

def my_replace(data):
    return re.sub(r'\<(.*?)\>', replace_function, data)

# let's just test it
data = 'I want all the tags in a text that look like <TypeA:1234|Bob Alice> or <TypeB:5678|Nelson Mandela> to be replaced with'
print(my_replace(data))
Warning: if this text is actually full HTML, regex matching will not be reliable - use an HTML processor like BeautifulSoup. ;)
Probably an extension to @swalladge's answer, but here we take advantage of a dictionary, assuming we know the mapping. (Think of replacing the dictionary with a custom mapping function.)
import re

d = {'TypeA': 'A',
     'TypeB': 'B',
     'Car': 'CR',
     'Bus': 'BS'}

def repl(m):
    return '<a my-inner-type="' + d[m.group(1)] + m.group(2) + '">' + m.group(3) + '</a>'

s = '<TypeA:1234|Bob Alice> or <TypeB:5678|Nelson Mandela>'
print(re.sub('<(.*?)(:\d+)\|(.*?)>', repl, s))
print()
s = '<Bus:1234|Bob Alice> or <Car:5678|Nelson Mandela>'
print(re.sub('<(.*?)(:\d+)\|(.*?)>', repl, s))
Output:
<a my-inner-type="A:1234">Bob Alice</a> or <a my-inner-type="B:5678">Nelson Mandela</a>
<a my-inner-type="BS:1234">Bob Alice</a> or <a my-inner-type="CR:5678">Nelson Mandela</a>
Working example here.
regex
We capture what we need in 3 groups and refer to them through the match object. The regex with its three capturing groups:
<(.*?)(:\d+)\|(.*?)>
We use these 3 groups in our repl function to return the right string.
Sorry, this isn't a complete answer (I'm falling asleep at the computer), but this is the regex that'll match either of the strings you provided: (<Type)(\w:)(\d+\|)(\w+\s\w+>). Check out https://pythex.org/ for testing your regex stuff.
Try with:
import re

def get_tag(match):
    base = '<a my-inner-type="{}">{}</a>'
    inner_type = match.group(1).upper()
    my_inner_type = '{}{}:{}'.format(inner_type[0], inner_type[-1], match.group(2))
    return base.format(my_inner_type, match.group(3))

print(re.sub(r'\<(\w+):(\d+)\W([^\>]+).*', get_tag, '<Bus:1234|Bob Alice>'))
print(re.sub(r'\<(\w+):(\d+)\W([^\>]+).*', get_tag, '<Car:5678|Nelson Mandela>'))
This code will work if you have it in the form <Type:num|name>:
def replaceupdate(tag):
    typex = ''
    ident = ''
    name = ''
    t = ''
    i = 1
    # read the type, up to and including the ':'
    while t != ':':
        typex += tag[i]
        t = tag[i]
        i += 1
    t = ''
    # read the number, up to the '|'
    while t != '|':
        if tag[i] == '|':
            break
        ident += tag[i]
        t = tag[i]
        i += 1
    i += 1
    # read the name, stopping before the closing '>'
    while tag[i] != '>':
        name += tag[i]
        i += 1
    replace = '<a my-inner-type="{}{}">{}</a>'.format(typex, ident, name)
    return replace
I know it does not use regex and it has to split the text some other way, but this is the main bulk.
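For comparison, the same transformation can be sketched with plain string splitting instead of index-walking (replaceupdate_split is a hypothetical name; it assumes the <Type:num|name> shape always holds):

```python
def replaceupdate_split(tag):
    # drop the surrounding '<' and '>', then split on the '|'
    inner = tag[1:-1]
    type_num, name = inner.split('|')
    return '<a my-inner-type="{}">{}</a>'.format(type_num, name)

print(replaceupdate_split('<Bus:1234|Bob Alice>'))
# <a my-inner-type="Bus:1234">Bob Alice</a>
```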
I'm using Python on GAE. I'm trying to get the following from HTML:
<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>
I want to get everything that has a "V" followed by 7 or more digits and has </FONT> behind it.
My regex is
response = urllib2.urlopen(url)
html = response.read()
tree = etree.HTML(html)
mls = tree.xpath('/[V]\d{7,10}</FONT>')
self.response.out.write(mls)
It's throwing an invalid expression error. I don't know what part of it is invalid, because it works on the online regex tester.
How can I do this in XPath format?
>>> import re
>>> s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> a = re.search(r'(.*)(V[0-9]{7,})',s)
>>> a.group(2)
'V1068078'
EDIT
(.*) is greedy. re.search(r'V[0-9]{7,}', s) will do the extraction without the greedy prefix.
EDIT: as @Kaneg said, you can use findall for all instances. You will get a list with all occurrences of 'V[0-9]{7,}'.
How can I do this in the XPath?
You can use starts-with() here.
>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath("//TD/FONT[starts-with(text(),'V')]")[0].text
>>> mls
'V1068078'
Or you can use a regular expression
>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath("//TD/FONT[re:match(text(), 'V\d{7,}')]",
...                  namespaces={'re': 'http://exslt.org/regular-expressions'})[0].text
>>> mls
'V1068078'
The example below can match multiple cases:
import re
s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V10683333</FONT></TD>,' \
    ' <TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068333333</FONT></TD>'
m = re.findall(r'V\d{7,}', s)
print m
The following will work:
result = re.search(r'V\d{7,}',s)
print result.group(0) # prints 'V1068078'
It will match any string of seven or more numeric digits that follows the letter V.
EDIT
If you want it to find all instances, replace search with findall
s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>V1068078 V1068078 V1068078'
re.findall(r'V\d{7,}',s)
['V1068078', 'V1068078', 'V1068078', 'V1068078']
For everyone that keeps posting purely regex solutions, you need to read the question -- the problem is not just formulating a regular expression; it is an issue of isolating the right nodes of the XML/HTML document tree, upon which regex can be employed to subsequently isolate the desired strings.
You didn't show any of your import statements -- are you trying to use ElementTree? In order to use ElementTree you need to have some understanding of the structure of your XML/HTML, from the root down to the target tag (in your case, "TD/FONT"). Next you would use the ElementTree methods, "find" and "findall" to traverse the tree and get to your desired tags/attributes.
As has been noted previously, "ElementTree uses its own path syntax, which is more or less a subset of xpath. If you want an ElementTree compatible library with full xpath support, try lxml." ElementTree does have support for xpath, but not the way you are using it here.
If you indeed do want to use ElementTree, you should provide an example of the html you are trying to parse so everybody has a notion of the structure. In the absence of such an example, a made up example would look like the following:
import xml, urllib2
from xml.etree import ElementTree
url = "http://www.uniprot.org/uniprot/P04637.xml"
response = urllib2.urlopen(url)
html = response.read()
tree = xml.etree.ElementTree.fromstring(html)
# namespace prefix, see https://stackoverflow.com/questions/1249876/alter-namespace-prefixing-with-elementtree-in-python
ns = '{http://uniprot.org/uniprot}'
root = tree.getiterator(ns+'uniprot')[0]
taxa = root.find(ns+'entry').find(ns+'organism').find(ns+'lineage').findall(ns+'taxon')
for taxon in taxa:
    print taxon.text
# Output:
Eukaryota
Metazoa
Chordata
Craniata
Vertebrata
Euteleostomi
Mammalia
Eutheria
Euarchontoglires
Primates
Haplorrhini
Catarrhini
Hominidae
Homo
And here is one without capturing groups.
>>> import re
>>> s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> m = re.search(r'(?<=>)V\d{7}', s)
>>> print m.group(0)
V1068078
I am practicing sending emails with Google App Engine with Python. This code checks to see if message.sender is in the database:
class ReceiveEmail(InboundMailHandler):
    def receive(self, message):
        querySender = User.all()
        querySender.filter("userEmail =", message.sender)
        senderInDatabase = None
        for match in querySender:
            senderInDatabase = match.userEmail
This works in the development server because I send the email as "az#example.com" and message.sender="az#example.com"
But I realized that on the production server emails come formatted as "az <az#example.com>", and my code fails because now message.sender="az <az#example.com>" but the email in the database is simply "az#example.com".
I searched for how to do this with regex and it is possible, but I was wondering whether I can do this with Python lists? Or what do you think is the best way to achieve this? I need to take just the email address from message.sender.
App Engine documentation acknowledges the formatting but I could not find a specific way to select the email address only.
Thanks!
EDIT2 (re: Forest answer)
#Forest:
parseaddr() appears to be simple enough:
>>> e = "az <az#example.com>"
>>> parsed = parseaddr(e)
>>> parsed
('az', 'az#example.com')
>>> parsed[1]
'az#example.com'
>>>
But this still does not cover the other type of formatting that you mention: user#example.com (Full Name)
>>> e2 = "<az#example.com> az"
>>> parsed2 = parseaddr(e2)
>>> parsed2
('', 'az#example.com')
>>>
Is there really a formatting where full name comes after the email?
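For what it's worth, parseaddr() does understand the trailing form, where the full name appears after the address as a comment in parentheses (a quick sketch, written with a literal @):

```python
from email.utils import parseaddr

# "addr-spec (comment)" style: the comment becomes the display name
parsed = parseaddr('az@example.com (az)')
print(parsed)  # ('az', 'az@example.com')
```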
EDIT (re: Adam Bernier answer)
My attempt at working out how the regex works (probably not correct):
r # raw string
< # first limit character
( # what is inside () is matched
[ # indicates a set of characters
^ # start of string
> # start with this and go backward?
] # end set of characters
+ # repeat the match
) # end group
> # end limit character
Rather than storing the entire contents of a To: or From: header field as an opaque string, why don't you parse incoming email and store email address separately from full name? See email.utils.parseaddr(). This way you don't have to use complicated, slow pattern matching when you want to look up an address. You can always reassemble the fields using formataddr().
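A sketch of that parse/store/reassemble round trip (written with a literal @):

```python
from email.utils import parseaddr, formataddr

# parse the incoming From: value into (full name, address)
name, addr = parseaddr('az <az@example.com>')
print((name, addr))              # ('az', 'az@example.com')

# store name and addr separately; reassemble on the way out
print(formataddr((name, addr)))  # az <az@example.com>
```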
If you want to use regex try something like this:
>>> import re
>>> email_string = "az <az#example.com>"
>>> re.findall(r'<([^>]+)>', email_string)
['az#example.com']
Note that the above regex handles multiple addresses...
>>> email_string2 = "az <az#example.com>, bz <bz#example.com>"
>>> re.findall(r'<([^>]+)>', email_string2)
['az#example.com', 'bz#example.com']
but this simpler regex doesn't:
>>> re.findall(r'<(.*)>', email_string2)
['az#example.com>, bz <bz#example.com'] # matches too much
Using slices—which I think you meant to say instead of "lists"—seems more convoluted, e.g.:
>>> email_string[email_string.find('<')+1:-1]
'az#example.com'
and if multiple:
>>> email_strings = email_string2.split(',')
>>> for s in email_strings:
...     s[s.find('<')+1:-1]
...
'az#example.com'
'bz#example.com'