How to extract some url from html? - python

I need to extract all image links from a local html file. Unfortunately, I can't install bs4 and cssutils to process html.
html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""
I tried to extract data using a regex:
images = []
for line in html.split('\n'):
images.append(re.findall(r'(https://s2.*\?lastmod=\d+)', line))
print(images)
[['https://s2.example.com/path/image0.jpg?lastmod=1625296911'],
['https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912']]
I suppose my regular expression is greedy because I used .*?
How to get the following outcome?
images = ['https://s2.example.com/path/image0.jpg',
'https://s2.example.com/path/image1.jpg',
'https://s2.example.com/path/image2.jpg',
'https://s2.example.com/path/image3.jpg']
If it can help all links are enclosed by src="..." or url(...)
Thanks for your help.

import re
indeces_start = sorted(
[m.start()+5 for m in re.finditer("src=", html)]
+ [m.start()+4 for m in re.finditer("url", html)])
indeces_end = [m.end() for m in re.finditer(".jpg", html)]
image_list = []
for start,end in zip(indeces_start,indeces_end):
image_list.append(html[start:end])
print(image_list)
That's a solution which comes to my mind. It consists of finding the start and end indeces of the image path strings. It obviously has to be adjusted if there are different image types.
Edit: Changed the start criteria, in case there are other URLs in the document

You can use
import re
html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""
images = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', html)
print(images)
See the Python demo. Output:
['https://s2.example.com/path/image0.jpg',
'https://s2.example.com/path/image1.jpg',
'https://s2.example.com/path/image2.jpg',
'https://s2.example.com/path/image3.jpg']
See the regex demo, too. It means
https://s2 - some literal text
[^\s?]* -zero or more chars other than whitespace and ? chars
(?=\?lastmod=\d) - immediately to the right, there must be ?lastmode= and a digit (the text is not added to the match since it is a pattern inside a positive lookahead, a non-consuming pattern).

import re
xx = '<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911" alt="asdasd"><img a src="https://s2.example.com/path/image0.jpg?lastmod=1625296911">'
r1 = re.findall(r"<img(?=\s|>)[^>]*>",xx)
url = []
for x in r1:
x = re.findall(r"src\s{0,}=\s{0,}['\"][\w\d:/.=]{0,}",x)
if(len(x)== 0): continue
x = re.findall(r"http[s]{0,1}[\w\d:/.=]{0,}",x[0])
if(len(x)== 0): continue
url.append(x[0])
print(url)

Related

Unable to extract GSTIN using python regex

Can anyone help in fixing the issue here.
I am trying to extract GSTIN/UIN from texts.
#None of these works
#GSTIN_REG = re.compile(r'^\d{2}([a-z?A-Z?0-9]){5}([a-z?A-Z?0-9]){4}([a-z?A-Z?0-9]){1}?[Z]{1}[A-Z\d]{1}$')
#GSTIN_REG = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[A-Z0-9]{1}Z{1}[A-Z0-9]{1}')
#GSTIN_REG = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[A-Z0-9]{1}[Z]{1}[A-Z0-9]{1}$')
GSTIN_REG = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[1-9A-Z]{1}Z[0-9A-Z]{1}$')
#GSTIN_REG = re.compile(r'19AISPJ4698P1ZX') #This works
#GSTIN_REG = re.compile(r'06AACCE2308Q1ZK') #This works
def extract_gstin(text):
return re.findall(GSTIN_REG, text)
text = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(extract_gstin(text))
Your second pattern in the commented out part works, and you can omit {1} as it is the default.
What you might do to make it a bit more specific is add word boundaries \b to the left and right to prevent a partial word match.
If it should be after GSTIN : you can use a capture group as well.
Example with the commented pattern:
import re
GSTIN_REG = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9]')
def extract_gstin(s):
return re.findall(GSTIN_REG, s)
s = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(extract_gstin(s))
Output
['06AACCE2308Q1ZK']
A bit more specific pattern (which has the same output as re.findall returns the value of the capture group)
\bGSTIN : ([0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9])\b
Regex demo

Python3 replace tags based on condition of the type of tag

I want all the tags in a text that look like <Bus:1234|Bob Alice> or <Car:5678|Nelson Mandela> to be replaced with <a my-inner-type="CR:1234">Bob Alice</a> and <a my-inner-type="BS:5678">Nelson Mandela</a> respectively. So basically, depending on the Type whether TypeA or TypeB, I want to replace the text accordingly in a text string using Python3 and regex.
I tried doing the following in python but not sure if that's the right approach to go forward:
import re
def my_replace():
re.sub(r'\<(.*?)\>', replace_function, data)
With the above, I am trying to do a regex of the< > tag and every tag I find, I pass that to a function called replace_function to split the text between the tag and determine if it is a TypeA or a TypeB and compute the stuff and return the replacement tag dynamically. I am not even sure if this is even possible using the re.sub but any leads would help. Thank you.
Examples:
<Car:1234|Bob Alice> becomes <a my-inner-type="CR:1234">Bob Alice</a>
<Bus:5678|Nelson Mandela> becomes <a my-inner-type="BS:5678">Nelson Mandela</a>
This is perfectly possible with re.sub, and you're on the right track with using a replacement function (which is designed to allow dynamic replacements). See below for an example that works with the examples you give - probably have to modify to suit your use case depending on what other data is present in the text (ie. other tags you need to ignore)
import re
def replace_function(m):
# note: to not modify the text (ie if you want to ignore this tag),
# simply do (return the entire original match):
# return m.group(0)
inner = m.group(1)
t, name = inner.split('|')
# process type here - the following will only work if types always follow
# the pattern given in the question
typename = t[4:]
# EDIT: based on your edits, you will probably need more processing here
# eg:
if t.split(':')[0] == 'Car':
typename = 'CR'
# etc
return '<a my-inner-type="{}">{}</a>'.format(typename, name)
def my_replace(data):
return re.sub(r'\<(.*?)\>', replace_function, data)
# let's just test it
data = 'I want all the tags in a text that look like <TypeA:1234|Bob Alice> or <TypeB:5678|Nelson Mandela> to be replaced with'
print(my_replace(data))
Warning: if this text is actually full html, regex matching will not be reliable - use an html processor like beautifulsoup. ;)
Probably an extension to #swalladge's answer but here we use the advantage of a dictionary, if we know a mapping. (Think replace dictionary with a custom mapping function.
import re
d={'TypeA':'A',
'TypeB':'B',
'Car':'CR',
'Bus':'BS'}
def repl(m):
return '<a my-inner-type="'+d[m.group(1)]+m.group(2)+'">'+m.group(3)+'</a>'
s='<TypeA:1234|Bob Alice> or <TypeB:5678|Nelson Mandela>'
print(re.sub('<(.*?)(:\d+)\|(.*?)>',repl,s))
print()
s='<Bus:1234|Bob Alice> or <Car:5678|Nelson Mandela>'
print(re.sub('<(.*?)(:\d+)\|(.*?)>',repl,s))
OUTPUT
<a my-inner-type="A:1234">Bob Alice</a> or <a my-inner-type="B:5678">Nelson Mandela</a>
<a my-inner-type="BS:1234">Bob Alice</a> or <a my-inner-type="CR:5678">Nelson Mandela</a>
Working example here.
regex
We capture what we need in 3 groups and refer to them through match object.Highlighted in bold are the three groups that we captured in the regex.
<(.*?)(:\d+)\|(.*?)>
We use these 3 groups in our repl function to return the right string.
Sorry this isn't a complete answer but I'm falling asleep at the computer, but this is the regex that'll match either of the strings you provided, (<Type)(\w:)(\d+\|)(\w+\s\w+>). Check out https://pythex.org/ for testing your regex stuff.
Try with:
import re
def get_tag(match):
base = '<a my-inner-type="{}">{}</a>'
inner_type = match.group(1).upper()
my_inner_type = '{}{}:{}'.format(inner_type[0], inner_type[-1], match.group(2))
return base.format(my_inner_type, match.group(3))
print(re.sub(r'\<(\w+):(\d+)\W([^\>]+).*', get_tag, '<Bus:1234|Bob Alice>'))
print(re.sub(r'\<(\w+):(\d+)\W([^\>]+).*', get_tag, '<Car:5678|Nelson Mandela>'))
This code will work if you have it in the form <Type:num|name>:
def replaceupdate(tag):
replace = ''
t = ''
i = 1
ident = ''
name = ''
typex = ''
while t != ':':
typex += tag[i]
t = tag[i]
i += 1
t = ''
while t != '|':
if tag[i] == '|':
break
ident += tag[i]
t = tag[i]
i += 1
t = ''
i += 1
while t != '>':
name += tag[i]
t = tag[i]
i += 1
replace = '<a my-inner-type="{}{}">{}</a>'.format(typex, ident, name)
return replace
I know it does not use regex and it has to split the text some other way, but this is the main bulk.

Python to extract the #user and url link in twitter text data with regex

There is a list string twitter text data, for example, the following data (actually, there is a large number of text,not just these data), I want to extract the all the user name after # and url link in the twitter text, for example: galaxy5univ and url link.
tweet_text = ['#galaxy5univ I like you',
'RT #BestOfGalaxies: Let's sit under the stars ...',
'#jonghyun__bot .........((thanks)',
'RT #yosizo: thanks.ddddd <https://yahoo.com>',
'RT #LDH_3_yui: #fam, ccccc https://msn.news.com']
my code:
import re
pu = re.compile(r'http\S+')
pn = re.compile(r'#(\S+)')
for row in twitter_text:
text = pu.findall(row)
name = (pn.findall(row))
print("url: ", text)
print("name: ", name)
Through testing the code in a large number of twitter data, I have got that my two patterns for url and name both are wrong(although in a few twitter text data is right). Do you guys have some documents or link about extract name and url from twitter text in the case of large twitter data.
If you have advices about extracting name and url from twitter data, please tell me, thanks!
Note that your pn = re.compile(r'#(\S+)') regex will capture any 1+ non-whitespace characters after #.
To exclude matching :, you need to convert the shorthand \S class to [^\s] negated character class equivalent, and add : to it:
pn = re.compile(r'#([^\s:]+)')
Now, it will stop capturing non-whitespace symbols before the first :. See the regex demo.
If you need to capture until the last :, you can just add : after the capturing group: pn = re.compile(r'#(\S+):').
As for a URL matching regex, there are many on the Web, just choose the one that works best for you.
Here is an example code:
import re
p = re.compile(r'#([^\s:]+)')
test_str = "#galaxy5univ I like you\nRT #BestOfGalaxies: Let's sit under the stars ...\n#jonghyun__bot .........((thanks)\nRT #yosizo: thanks.ddddd <https://y...content-available-to-author-only...o.com>\nRT #LDH_3_yui: #fam, ccccc https://m...content-available-to-author-only...s.com"
print(p.findall(test_str))
p2 = re.compile(r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?')
print(p2.findall(test_str))
# => ['galaxy5univ', 'BestOfGalaxies', 'jonghyun__bot', 'yosizo', 'LDH_3_yui']
# => ['https://yahoo.com', 'https://msn.news.com']
If the usernames doesn't contain special chars, you can use:
#([\w]+)
See Live demo

Python Regex Google App Engine

I'm using python on GAE
I'm trying to get the following from html
<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>
I want to get everything that will have a "V" followed by 7 or more digits and have behind it.
My regex is
response = urllib2.urlopen(url)
html = response.read()
tree = etree.HTML(html)
mls = tree.xpath('/[V]\d{7,10}</FONT>')
self.response.out.write(mls)
It's throwing out an invalid expression. I don't know what part of it is invalid because it works on the online regex tester
How can i do this in the xpath format?
>>> import re
>>> s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> a = re.search(r'(.*)(V[0-9]{7,})',s)
>>> a.group(2)
'V1068078'
EDIT
(.*) is a greedy method. re.search(r'V[0-9]{7,}',s) will do the extraction with out greed.
EDIT as #Kaneg said, you can use findall for all instances. You will get a list with all occurrences of 'V[0-9]{7,}'
How can I do this in the XPath?
You can use starts-with() here.
>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath("//TD/FONT[starts-with(text(),'V')]")[0].text
'V1068078'
Or you can use a regular expression
>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath("//TD/FONT[re:match(text(), 'V\d{7,}')]",
namespaces={'re': 'http://exslt.org/regular-expressions'})[0].text
'V1068078'
Below example can match multiple cases:
import re
s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V10683333</FONT></TD>,' \
' <TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068333333</FONT></TD>'
m = re.findall(r'V\d{7,}', s)
print m
The following will work:
result = re.search(r'V\d{7,}',s)
print result.group(0) # prints 'V1068078'
It will match any string of numeric digit of length 7 or more that follows the letter V
EDIT
If you want it to find all instances, replace search with findall
s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>V1068078 V1068078 V1068078'
re.search(r'V\d{7,}',s)
['V1068078', 'V1068078', 'V1068078', 'V1068078']
For everyone that keeps posting purely regex solutions, you need to read the question -- the problem is not just formulating a regular expression; it is an issue of isolating the right nodes of the XML/HTML document tree, upon which regex can be employed to subsequently isolate the desired strings.
You didn't show any of your import statements -- are you trying to use ElementTree? In order to use ElementTree you need to have some understanding of the structure of your XML/HTML, from the root down to the target tag (in your case, "TD/FONT"). Next you would use the ElementTree methods, "find" and "findall" to traverse the tree and get to your desired tags/attributes.
As has been noted previously, "ElementTree uses its own path syntax, which is more or less a subset of xpath. If you want an ElementTree compatible library with full xpath support, try lxml." ElementTree does have support for xpath, but not the way you are using it here.
If you indeed do want to use ElementTree, you should provide an example of the html you are trying to parse so everybody has a notion of the structure. In the absence of such an example, a made up example would look like the following:
import xml, urllib2
from xml.etree import ElementTree
url = "http://www.uniprot.org/uniprot/P04637.xml"
response = urllib2.urlopen(url)
html = response.read()
tree = xml.etree.ElementTree.fromstring(html)
# namespace prefix, see https://stackoverflow.com/questions/1249876/alter-namespace-prefixing-with-elementtree-in-python
ns = '{http://uniprot.org/uniprot}'
root = tree.getiterator(ns+'uniprot')[0]
taxa = root.find(ns+'entry').find(ns+'organism').find(ns+'lineage').findall(ns+'taxon')
for taxon in taxa:
print taxon.text
# Output:
Eukaryota
Metazoa
Chordata
Craniata
Vertebrata
Euteleostomi
Mammalia
Eutheria
Euarchontoglires
Primates
Haplorrhini
Catarrhini
Hominidae
Homo
And the one without capturing groups.
>>> import re
>>> str = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> m = re.search(r'(?<=>)V\d{7}', str)
>>> print m.group(0)
V1068078

python encoding regex issue

I am trying to get this line out from a page:
$ 55 326
I have made this regex to get the numbers:
player_info['salary'] = re.compile(r'\$ \d{0,3} \d{1,3}')
When I get the text I use bs4 and the text is of type 'unicode'
for a in soup_ntr.find_all('div', id='playerbox'):
player_box_text = a.get_text()
print(type(player_box_text))
I can't seem to get the result.
I have also tried with a regex like these
player_info['salary'] = re.compile(ur'\$ \d{0,3} \d{1,3}')
player_info['salary'] = re.compile(ur'\$ \d{0,3} \d{1,3}', re.UNICODE)
But I can't find out to get the data.
The page I am reading has this header:
Content-Type: text/html; charset=utf-8
Hope for some help to figure it out.
re.compile doesn't match anything. It just creates a compiled version of the regex.
You want something like this:
matchObj = re.match(r'\$ (\d{0,3}) (\d{1,3})', player_box_text)
player_info['salary'] = matchObj.group(1) + matchObj.group(2)
This is a good site for getting to grips with regex.
http://txt2re.com/
#!/usr/bin/python
# URL that generated this code:
# http://txt2re.com/index-python.php3?s=$%2055%20326&2&1
import re
txt='$ 55 326'
re1='.*?' # Non-greedy match on filler
re2='(\\d+)' # Integer Number 1
re3='.*?' # Non-greedy match on filler
re4='(\\d+)' # Integer Number 2
rg = re.compile(re1+re2+re3+re4,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
if m:
int1=m.group(1)
int2=m.group(2)
print "("+int1+")"+"("+int2+")"+"\n"

Categories

Resources