Python 3.7: RegEx for string between strings on multiple lines?

I would like to find 30,850 in:
<div class='user-information__achievements-heading' data-test-points-title>
Points
</div>
<div class='user-information__achievements-data' data-test-points-count>
30,850
</div>
</div>
with:
^(?!<div class='user-information__achievements-data' data-test-points-count>
|<.div>)(.*)$
(returns nothing)
How come ^(?!START\-OF\-FIELDS|END\-OF\-FIELDS)(.*)$ does work for:
START-OF-FIELDS
<div>
Line A
END-OF-FIELDS
(returns <div>)?

You can also search the text with bs4 (BeautifulSoup):
from bs4 import BeautifulSoup
tx = """
<div class='user-information__achievements-heading' data-test-points-title>
Points
</div>
<div class='user-information__achievements-data' data-test-points-count>
30,850
</div>
</div>
"""
bs = BeautifulSoup(tx,"lxml")
result = bs.find("div",{"class":"user-information__achievements-data"}).text
print(result.strip()) # 30,850

Although I totally agree that you should never parse HTML with re (and it's really fun to read, btw), if you only have this piece of text and need a quick re.search, a simple r'\d+,\d+' would do:
import re
s = '''<div class='user-information__achievements-heading' data-test-points-title>
Points
</div>
<div class='user-information__achievements-data' data-test-points-count>
30,850
</div>
</div>'''
re.search(r'\d+,\d+', s)
<re.Match object; span=(179, 185), match='30,850'>

No need for regex just do:
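# Strip each line first so that leading indentation does not break the lookup
i = "<div class='user-information__achievements-data' data-test-points-count>"
lines = [line.strip() for line in s.splitlines()]
print(lines[lines.index(i) + 1])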
Output:
30,850

You want re.DOTALL (alias re.S) because by default . doesn't match newlines (line breaks).
re.compile(YOUR_REGEX, flags=re.S)
You can also prepend your regex with (?s) for the same effect.
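To make the difference concrete, here is a minimal sketch using the snippet from the question. The pattern and the variable names are my own choice for illustration, not something from the original post. The captured group is stripped afterwards because it includes the newlines around the number.

```python
import re

html = """<div class='user-information__achievements-heading' data-test-points-title>
Points
</div>
<div class='user-information__achievements-data' data-test-points-count>
30,850
</div>
</div>"""

# Illustrative pattern: capture everything between the opening tag and the next </div>
pattern = r"data-test-points-count>(.*?)</div>"

# Without DOTALL, "." stops at the newline after the opening tag, so nothing matches
no_flag = re.search(pattern, html)

# With DOTALL (re.S), "." also matches newlines, so the lazy group spans the lines
with_flag = re.search(pattern, html, flags=re.S)
points = with_flag.group(1).strip()

# The inline (?s) prefix is equivalent to passing flags=re.S
inline = re.search(r"(?s)" + pattern, html)

print(no_flag)  # None
print(points)   # 30,850
```

The lazy quantifier (.*?) matters here: a greedy .* with DOTALL would run on to the last </div> in the text.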

Related

Extracting everything between two lxml tags Python

Consider the following html snippet
<html>
.
.
.
<div>
<p> Hello </p>
<div>
<b>
Text1
</b>
<p>
This is a huge paragraph text
</p>
.
.
.
</div>
</div>
.
.
.
<div>
<i>
Text2
</i>
</div>
Let us say that I need to extract everything from Text1 to Text2, including the tags.
Using a few methods, I have been able to extract the tags of those two, i.e. their unique ID.
Essentially I have 2 Element.etree elements, corresponding to the two tags I require.
How do I extract everything in between the two tags?
(One possible solution I can think of is to find the two tags' common ancestor, do an iterwalk(), start extracting at Element1, and stop at Element2. However, I'm not exactly sure how this would be done.)
Any solution would be appreciated.
Please note that I have already found the two tags that I require, and I'm not looking for solutions to find those tags (e.g. using xpath).
Edit: My desired output is
<b>
Text1
</b>
<p>
This is a huge paragraph text
</p>
.
.
.
</div>
</div>
.
.
.
<div>
<i>
Text2
</i>
Please note that I wouldn't mind the initial 2 <div> tags, but do not want the Hello.
The same goes for the closing tags at the end. I'm mostly interested in the in-between contents.
Edit 2: I have extracted the Etree elements using complex xpath conditions, which was not feasible with other alternatives such as bs4, so any solution using the lxml elements would be appreciated :)
After review and questioning:
from essentials.tokening import CreateToken # This was imported just to generate a random string - pip install mknxgn_essentials
import bs4
HTML = """<html>
<div>
<div>
<div id="start">
Hello, My name is mark
</div>
</div>
</div>
<div>
This is in the middle
</div>
<div>
<div id="end">
This is the end
</div>
</div>
<div>
Do not include this.
</div>
</html>"""
RandomString = CreateToken(30, HTML) # Generate a random string that could never occur on its own in the file; if it did occur, use something else
soup = bs4.BeautifulSoup(HTML, features="lxml") # Convert the text into soup
start_div = soup.find("div", attrs={"id": "start"}) #assuming you can find this element
start_div.insert_before(RandomString) # insert the random string before this element
end_div = soup.find("div", attrs={"id": "end"}) #again, i was assuming you can also find this element
end_div.insert_after(RandomString) # insert the random string after this element
print(str(soup).split(RandomString)[1]) # Get between both random strings
The output of this returns:
>>> <div id="start">
>>> Hello, My name is mark
>>> </div>
>>> </div>
>>> </div>
>>> <div>
>>> This is in the middle
>>> </div>
>>> <div>
>>> <div id="end">
>>> This is the end
>>> </div>
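For the idea mentioned in the question (walk the tree in document order, start collecting at the first element and stop at the second), here is a rough sketch. It uses the standard library's xml.etree.ElementTree as a stand-in so it runs without extra packages; lxml elements offer a similar iter() method, but I have not tested this against lxml. The sample document and the helper name elements_between are made up for illustration. Note that it returns the elements themselves, not the serialized markup, and that it stops at the end element without descending into its children.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the document in the question
doc = ET.fromstring(
    "<html>"
    "<div><p>Hello</p><div><b>Text1</b><p>This is a huge paragraph text</p></div></div>"
    "<div><i>Text2</i></div>"
    "<div><p>Do not include this</p></div>"
    "</html>"
)

# Assume the two boundary elements have already been located somehow
start = doc.find(".//b")
end = doc.find(".//i")


def elements_between(root, first, last):
    """Collect elements in document order from `first` through `last`, inclusive."""
    collected = []
    inside = False
    for element in root.iter():  # iter() walks the tree in document order
        if element is first:
            inside = True
        if inside:
            collected.append(element)
        if element is last:
            break
    return collected


tags = [element.tag for element in elements_between(doc, start, end)]
print(tags)  # ['b', 'p', 'div', 'i']
```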

Regular expression selecting a text with skipping a few lines

I need help selecting the price from some HTML code. Having extracted the title of a movie, I now need to extract the price. I have tried using a lookbehind regular expression, but I get an error when I use \n.* as it says "A quantifier inside a lookbehind makes it non-fixed width". I need the first and the second price in the text.
Regex I have tried:
(?<=Hello<\/a>.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*?(\$)
and:
Hello<\/a>.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*?(\$)
But neither works.
Text:
<a class="blue_link" href="http://www.ebgames.com.au/Games/sjbeiub108723">Hello:</a>
<div class="hi">
<p>Including <a class="blue_link">
<p>Price$<data1>40.00</p>
Pls help and thank you :)
You can use this regex with the DOTALL flag:
import re
r = r"The Durrells: Series 2.+\$(\d+\.\d+).+\$(\d+\.\d+)"
text = ''' <a class="blue_link fn url" href="http://www.fishpond.com.au/Movies/Durrells-Series-2-Keeley-Hawes/5014138609450">The Durrells: Series 2</a>
<div class="by">
<p>Starring <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Keeley+Hawes">Keeley Hawes</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Milo+Parker">Milo Parker</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Josh+O%27Connor">Josh O'Connor</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Daisy+Waterstone">Daisy Wat...</a></p>
<div class="productSearch-metainfo">
DVD (UK), May 2017 </div>
</div>
</div></td>
<td align="right" style="vertical-align:top;"><div class="productSearch-price-container">
<span class="rrp-label">Elsewhere</span> <s>$30.53</s> <span class="productSpecialPrice"><b>$27.46</b></span> <div style="white-space:nowrap;"> <span class="you_save">Save 10%</span> </div><span class="free-shipping">with Free Shipping!</span></div>
'''
print(re.findall(r, text, re.DOTALL))
Output:
[('30.53', '27.46')]
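As for the lookbehind error in the question: Python's re module also rejects a lookbehind whose width is not fixed, so the usual workaround is to match the prefix normally and capture only the part you need. A small sketch on the question's own text (the pattern is my own guess at what is wanted, so adjust it to the real page):

```python
import re

text = '''<a class="blue_link" href="http://www.ebgames.com.au/Games/sjbeiub108723">Hello:</a>
<div class="hi">
<p>Including <a class="blue_link">
<p>Price$<data1>40.00</p>'''

# A quantifier such as .* inside (?<=...) makes the lookbehind variable-width,
# which the re module refuses to compile
try:
    re.compile(r"(?<=Hello:</a>.*)\$")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Workaround: match the prefix as ordinary pattern text and capture the price in a group
match = re.search(r"Hello:</a>.*?\$.*?(\d+\.\d+)", text, flags=re.S)
price = match.group(1)

print(lookbehind_ok)  # False
print(price)          # 40.00
```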

how to access elements by path?

I am trying to parse an awful HTML page with BeautifulSoup to retrieve a few pieces of information. The code below:
import bs4
with open("smartradio.html") as f:
    html = f.read()
soup = bs4.BeautifulSoup(html)
x = soup.find_all("div", class_="ue-alarm-status", playerid="43733")
print(x)
extracts the fragments I would like to analyze further:
[<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 1: </div>
<div>allumé</div>
<div>7:00</div>
</div>
<div>
<div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>, <div alarmid="ea510709" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 2: </div>
<div>allumé</div>
<div>7:30</div>
</div>
<div>
<div class="ue-alarm-dow">Sa </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>]
I am interested in retrieving:
the hour (line 5 and 14)
the string (days in French) under <div class="ue-alarm-dow">
I believe that for the days it is enough to repeat a find() or find_all(). I am mentioning that because while it grabs the right information, I am not sure that this is the right way to parse the file with BeautifulSoup (but at least it works):
for y in x:
    z = y.find("div", class_="ue-alarm-dow")
    print(z.text)
# output:
# Lu, Ma, Me, Je, Ve
# Sa
I do not know how to get to the hour, though. Is there a way to navigate the tree by path (in the sense that I know that the hour is under the second <div>, three <div> deep)? Or should I do it differently?
You can also rely on the allumé text and get the next sibling div element:
y.find('div', text=u'allumé').find_next_sibling('div').text
or, in a similar manner, relying on the class of the previous div:
y.find('div', class_='ue-alarm-edit').find_next_siblings('div')[1].text
or, using regular expressions:
y.find('div', text=re.compile(r'\d+:\d+')).text
or, just get the div by index:
y.find_all('div')[4].text
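If you literally want to address the element by its position in the tree, a path expression is another option. The sketch below uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, and it assumes the fragment is well-formed markup, which may not hold for a messy real-world page. The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# One of the fragments from the question, written as a well-formed string
fragment = ET.fromstring(
    '<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">'
    '<div>'
    '<div class="ue-alarm-edit ue-link">Réveil 1: </div>'
    '<div>allumé</div>'
    '<div>7:00</div>'
    '</div>'
    '<div>'
    '<div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>'
    '<div class="ue-alarm-delete ue-link">Supprimer</div>'
    '</div>'
    '</div>'
)

# Positional path: first child <div>, then its third child <div>
hour = fragment.find("./div[1]/div[3]").text

# Attribute predicate: any descendant <div> with the given class value
days = fragment.find(".//div[@class='ue-alarm-dow']").text.strip()

print(hour)  # 7:00
print(days)  # Lu, Ma, Me, Je, Ve
```

Positional paths are brittle: if the page layout changes, the indexes silently point at something else, so the class-based lookups are usually the safer choice.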

BeautifulSoup - getting rid of paragraph whitespace/line breaks

similarlist = res.find_all_next("div", class_="result-wrapper")
for item in similarlist:
    print(item)
This returns:
<div class="result-wrapper">
<div class="row-fluid result-row">
<div class="span6 result-left">
<p>
<a class="tooltipLink warn-cs" data-original-title="Listen" href="..." rel="tooltip"><i class="..."></i></a>
<a class="muted-link" href="/dictionary/german-english/aa-machen">Aa <b>machen</b></a>
</p>
</div>
<div class="span6 result-right row-fluid">
<span class="span9">
<a class="muted-link" href="/dictionary/english-german/do-a-poo">to do a poo</a>, <a class="muted-link" href="/dictionary/english-german/pooh">to pooh</a>
</span>
</div>
</div>
</div>
When I choose to print item.get_text() instead, I get
abgeneigt machen
to disincline
abhängig machen
2137
to predicate
Absenker machen
to layer
So basically a lot of new lines between the list items that I don't need. Is this because of the <p> tags? How do I get rid of them?
Yes, between tags the HTML contains whitespace (including newlines) too.
You can easily collapse all multi-line whitespace with a regular expression:
import re
re.sub(r'\n\s*\n', r'\n\n', item.get_text().strip(), flags=re.M)
This removes any whitespace (newlines, spaces, tabs, etc.) between two newlines.
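A quick sketch of what that substitution does, on a made-up string that imitates the get_text() output (the sample text and variable names are only for illustration). Replacing with '\n\n' keeps one blank line between entries; replacing with '\n' drops the blank lines entirely.

```python
import re

# Imitation of item.get_text(): entries separated by runs of blank lines
raw_text = "\n\n\nabgeneigt machen\n\n\n   \n\nto disincline\n\n\n"

# Collapse each run of blank lines into a single blank line
one_blank_line = re.sub(r"\n\s*\n", "\n\n", raw_text.strip())

# Or remove the blank lines completely
no_blank_lines = re.sub(r"\n\s*\n", "\n", raw_text.strip())

print(repr(one_blank_line))  # 'abgeneigt machen\n\nto disincline'
print(repr(no_blank_lines))  # 'abgeneigt machen\nto disincline'
```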
You can use the strip() function in Python:
item.get_text().strip()

Python -- Regex -- How to find a string between two sets of strings

Consider the following:
<div id=hotlinklist>
Foo1
<div id=hotlink>
Home
</div>
<div id=hotlink>
Extract
</div>
<div id=hotlink>
Sitemap
</div>
</div>
How would you go about taking out the sitemap line with regex in python?
Sitemap
The following can be used to pull out the anchor tags.
'/<a(.*?)a>/i'
However, there are multiple anchor tags. Also there are multiple hotlink(s) so we can't really use them either?
Don't use a regex. Use BeautifulSoup, an HTML parser.
from bs4 import BeautifulSoup  # BeautifulSoup 4; the old "BeautifulSoup" package is Python 2 only
html = \
"""
<div id=hotlinklist>
Foo1
<div id=hotlink>
Home
</div>
<div id=hotlink>
Extract
</div>
<div id=hotlink>
Sitemap
</div>
</div>"""
soup = BeautifulSoup(html)
soup.findAll("div",id="hotlink")[2].a
# Sitemap
Parsing HTML with regular expression is a bad idea!
Think about the following piece of html
<a></a > <!-- legal html, but won't pass your regex -->
Sitemap<!-- proof that a>b iff ab>1 -->
There are many more such examples. Regular expressions are good for many things, but not for parsing HTML.
You should consider using Beautiful Soup python HTML parser.
Anyhow, an ad-hoc solution using regex is:
import re
data = """
<div id=hotlinklist>
Foo1
<div id=hotlink>
Home
</div>
<div id=hotlink>
Extract
</div>
<div id=hotlink>
Sitemap
</div>
</div>
"""
e = re.compile('<a *[^>]*>.*</a *>')
print(e.findall(data))
Output:
>>> e.findall(data)
['Foo1', 'Home', 'Extract', 'Sitemap']
In order to extract the contents of the tagline:
Sitemap
... I would use:
>>> import re
>>> s = '''
<div id=hotlinklist>
Foo1
<div id=hotlink>
Home
</div>
<div id=hotlink>
Extract
</div>
<div id=hotlink>
Sitemap
</div>
</div>'''
>>> m = re.compile(r'(.*?)').search(s)
>>> m.group(1)
'Sitemap'
Use BeautifulSoup or lxml if you need to parse HTML.
Also, what is it that you really need to do? Find the last link? Find the third link? Find the link that points to /sitemap? It's unclear from your question. What do you need to do with the data?
If you really have to use regular expressions, have a look at findall.
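For example, a rough findall sketch on the snippet as it is shown in the question. The pattern is my own and assumes the markup looks exactly like that sample; a parser is still the more robust choice.

```python
import re

data = """
<div id=hotlinklist>
Foo1
<div id=hotlink>
Home
</div>
<div id=hotlink>
Extract
</div>
<div id=hotlink>
Sitemap
</div>
</div>
"""

# Capture the text inside every <div id=hotlink> block; re.S lets "." cross line breaks.
# The outer <div id=hotlinklist> is not matched because "hotlink" must be followed by ">".
entries = re.findall(r"<div id=hotlink>\s*(.*?)\s*</div>", data, flags=re.S)

# Assuming the wanted entry is the last one
last_entry = entries[-1]

print(entries)     # ['Home', 'Extract', 'Sitemap']
print(last_entry)  # Sitemap
```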
