Pattern match not working as expected - python

I was playing around with pattern matching on the HTML source of different sites and noticed something weird. I used this pattern:
pat = '<div class="id-app-orig-desc">.*</div>'
I used it on an app page of the Play Store (picked a random app). As I understood it, it should just give what's between the div tags (i.e. the description), but that is not what happens. It gives everything starting from the first occurrence of the pattern and goes on till the end of the page, completely ignoring the closing </div> tags in between. Does anyone know what's happening?
Also, when I check the length of the returned list, it's just 1.

First of all, do not parse HTML with regex; use a specialized tool, an HTML parser. For example, BeautifulSoup:
from bs4 import BeautifulSoup
data = """
<div>
<div class="id-app-orig-desc">
Do not try to get me with a regex, please.
</div>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
print soup.find('div', {'class': 'id-app-orig-desc'}).text.strip()
Prints:
Do not try to get me with a regex, please.
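As for what was actually happening: regex quantifiers like .* are greedy by default, so the match runs from the first opening div all the way to the last </div> on the page, which is also why the returned list has length 1. A minimal sketch with made-up HTML showing the greedy vs. non-greedy difference:
import re

html = '<div class="id-app-orig-desc">description</div> rest of page <div>footer</div>'

# Greedy: .* consumes as much as possible, stopping at the LAST </div>
print(re.findall('<div class="id-app-orig-desc">(.*)</div>', html))
# -> ['description</div> rest of page <div>footer']

# Non-greedy: .*? stops at the FIRST </div>
print(re.findall('<div class="id-app-orig-desc">(.*?)</div>', html))
# -> ['description']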

Related

Python code to keep only a set of html tags in an input string

I have text like this:
<div>
<script></script>
<h1>name</h1>
<p> Description </p>
<i> italic </i>
</div>
I want to remove all html tags except the h tags and p tags. For this I'm trying to write a more generic method like this:
def strip_tags(text, a_list_of_tags_to_not_remove):
Using the following Beautiful Soup code I can remove all the html tags, but it doesn't let me keep a list of tags while removing the others.
from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html).text
Can I do this using Beautiful Soup, or is there another Python library that can?
Yes, you can.
You can use .find_all() to find all the tags you don't care about, then call .unwrap() on each to get rid of the tag while keeping its content (a sketch combining the two answers follows below).
Alternatively, you can use the find_all function:
soup.find_all(['h1', 'p'])
to get a list of just the tags you need, instead of having to find all the tags you don't want.
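A minimal sketch of the strip_tags function from the question, combining both suggestions above (the tag list is the one from the question):
from bs4 import BeautifulSoup

def strip_tags(text, keep_tags):
    soup = BeautifulSoup(text, 'html.parser')
    # find_all(True) matches every tag; unwrap() removes the tag but keeps its content
    for tag in soup.find_all(True):
        if tag.name not in keep_tags:
            tag.unwrap()
    return str(soup)

print(strip_tags('<div><h1>name</h1><p> Description </p><i> italic </i></div>', ['h1', 'p']))
# -> <h1>name</h1><p> Description </p> italic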

Python findall regex issue

So, essentially my main issue comes from the regex part of findall. I'm trying to webscrape some information, but I can't for the life of me get any data to come out correctly. I thought that the (\S+ \S+) was the regex part, and that I'd be extracting whatever sits between <li> and </li> in the HTML, but instead I get an empty list from print(data). I realize that I'm going to need a \S+ for every word in each list item, so how would I go about this? Also, how would I get it to print each of the different <li> parts of the HTML?
INPUT: Just the website.
OUTPUT: In this case, it should be album titles (i.e. Mikky Ekko - Time)
import urllib.request
from re import findall
url = "http://rnbxclusive.se"
response = urllib.request.urlopen(url)
html = response.read()
htmlStr = str(html)
data = findall("<li>(\S+ \S+)</li>.*", htmlStr)
print(data)
for item in data:
    print(item)
Use lxml:
import lxml.html
doc = lxml.html.fromstring(html)  # html = response.read() from above
for li in doc.findall('.//li'):
    print(li.text_content())
<li>([^><]*)<\/li>
Try this. It will give the contents of every <li> tag. See the demo:
http://regex101.com/r/dZ1vT6/55
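For reference, the same pattern dropped into Python's re.findall (the list contents here are made up):
import re

html = "<ul><li>Mikky Ekko - Time</li><li>Other Artist - Song</li></ul>"
print(re.findall(r"<li>([^><]*)</li>", html))
# -> ['Mikky Ekko - Time', 'Other Artist - Song']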

How can I get all images if I'm using Beautiful Soup?

How can I get an image if the code is like this:
<div class="galery-images">
<div class="galery-images-slide" style="width: 760px;">
<div class="galery-item galery-item-selected" style="background-image: url(/images/photo/1/20130206/30323/136666697057736800.jpg);"></div>
I want to get 136666697057736800.jpg
I wrote:
images = soup.select("div.galery-item")
And I get a list:
[<div class="galery-item galery-item-selected" style="background-image: url(/images/photo/1/20130206/30323/136666697057736800.jpg);"></div>,
<div class="galery-item" style="background-image: url(/images/photo/1/20130206/30323/136013892671126300.jpg);" ></div>,
<div class="galery-item" style="background-image: url(/images/photo/1/20130206/30323/136666699218876700.jpg);"></div>]
I don't understand: how can I get all the images?
Use a regex or a CSS parser to extract the URL, concatenate the host onto the beginning of it, and finally download the image like this:
import urllib
urllib.urlretrieve("https://www.google.com/images/srpr/logo11w.png", "google.png")
To make your life easier, you should use a regex:
import re

urls = []
pattern = re.compile(r'.*background-image:\s*url\((.*)\);')
for ele in soup.find_all('div', attrs={'class': 'galery-images-slide'}):
    match = pattern.match(ele.div['style'])
    if match:
        urls.append(match.group(1))
This works by finding the parent div (the one with class 'galery-images-slide'), then parsing its child div's style attribute (which contains the background-image URL) with a regex.
So, from your above example, this will output:
[u'/images/photo/1/20130206/30323/136666697057736800.jpg']
Now, to download the specified image, prepend the site name to the URL and you should be able to fetch it.
NOTE:
This requires the regex module (re) in addition to BeautifulSoup.
Also, the regex I used is quite naive, but you can adjust it as required to suit your needs.
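Putting the pieces together, a sketch that pulls every background-image URL out of the question's markup and downloads it (Python 3; the host is a placeholder, since the question doesn't name the site):
import re
import urllib.request
from bs4 import BeautifulSoup

host = "http://example.com"  # placeholder: use the real site's host
html = '<div class="galery-item" style="background-image: url(/images/photo/1/20130206/30323/136666697057736800.jpg);"></div>'

soup = BeautifulSoup(html, 'html.parser')
for div in soup.select("div.galery-item"):
    match = re.search(r"url\(([^)]+)\)", div.get("style", ""))
    if match:
        path = match.group(1)               # the path from the style attribute
        filename = path.rsplit("/", 1)[-1]  # 136666697057736800.jpg
        urllib.request.urlretrieve(host + path, filename)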

Excluding unwanted results of findAll using BeautifulSoup

Using BeautifulSoup, I am aiming to scrape the text associated with this HTML hook:
<p class="review_comment">
So, using the simple code as follows,
content = page.read()
soup = BeautifulSoup(content)
results = soup.find_all("p", "review_comment")
I am happily parsing the text that is living here:
<p class="review_comment">
This place is terrible!</p>
The bad news is that every 30 or so times soup.find_all gets a match, it also grabs something that I really don't want: a user's old review that they've since updated:
<p class="review_comment">
It's 1999, and I will always love this place…
Read more »</p>
In my attempts to exclude these old duplicate reviews, I have tried a hodgepodge of ideas.
I've tried altering the arguments in my soup.find_all() call to specifically exclude any text that comes before the <a href="#" class="show-archived">Read more »</a>.
I've drowned in regular-expression matching limbo with no success.
And I can't seem to take advantage of the class="show-archived" attribute.
Any ideas would be gratefully appreciated. Thanks in advance.
Is this what you are seeking?
for p in soup.find_all("p", "review_comment"):
    if p.find(class_='show-archived'):
        continue
    # p is now a wanted p
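A slightly fuller version of the same idea, collecting just the wanted review texts (this assumes soup was built from the page as in the question):
wanted = []
for p in soup.find_all("p", "review_comment"):
    if p.find(class_="show-archived"):
        continue  # skip old, archived reviews
    wanted.append(p.get_text(strip=True))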

Using Beautiful Soup Python module to replace tags with plain text

I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before, and they were all pointed to Beautiful Soup; that's how I got started with it.
I was able to successfully get most of the content, but I am running into some challenges with tags that are part of the content. (I am starting off with a basic strategy: if there are more than x characters in a node, then it is content.) Let's take the html code below as an example:
<div id="abc">
some long text goes <a href="/"> here </a> and hopefully it
will get picked up by the parser as content
</div>
results = soup.findAll(text=lambda(x): len(x) > 20)
When I use the above code to get at the long text, it breaks at the tags (the identified text will start from 'and hopefully..'). So I tried to replace the tag with plain text as follows:
anchors = soup.findAll('a')
for a in anchors:
    a.replaceWith('plain text')
The above does not work because Beautiful Soup inserts the string as a NavigableString and that causes the same problem when I use findAll with the len(x) > 20. I can use regular expressions to parse the html as plain text first, clear out all the unwanted tags and then call Beautiful Soup. But I would like to avoid processing the same content twice -- I am trying to parse these pages so I can show a snippet of content for a given link (very much like Facebook Share) -- and if everything is done with Beautiful Soup, I presume it will be faster.
So my question: is there a way to 'clear tags' and replace them with plain text using Beautiful Soup? If not, what would be the best way to do so?
Thanks for your suggestions!
Update: Alex's code worked very well for the sample example. I also tried various edge cases and they all worked fine (with the modification below). So I gave it a shot on a real-life website, and I ran into issues that puzzle me.
import urllib
from BeautifulSoup import BeautifulSoup
page = urllib.urlopen('http://www.engadget.com/2010/01/12/kingston-ssdnow-v-dips-to-30gb-size-lower-price/')
soup = BeautifulSoup(page)
anchors = soup.findAll('a')
i = 0
for a in anchors:
    print str(i) + ":" + str(a)
    i = i + 1
for a in anchors:
    if (a.string is None): a.string = ''
    if (a.previousSibling is None and a.nextSibling is None):
        a.previousSibling = a.string
    elif (a.previousSibling is None and a.nextSibling is not None):
        a.nextSibling.replaceWith(a.string + a.nextSibling)
    elif (a.previousSibling is not None and a.nextSibling is None):
        a.previousSibling.replaceWith(a.previousSibling + a.string)
    else:
        a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
        a.nextSibling.extract()
When I run the above code, I get the following error:
0:<a href="http://www.switched.com/category/ces-2010">Stay up to date with
Switched's CES 2010 coverage</a>
Traceback (most recent call last):
  File "parselink.py", line 44, in <module>
    a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
TypeError: unsupported operand type(s) for +: 'Tag' and 'NavigableString'
When I look at the HTML code, 'Stay up to date..' does not have any previous sibling (I did not know how previousSibling worked until I saw Alex's code, and based on my testing it looks for 'text' before the tag). So, if there is no previous sibling, I am surprised that it is not going through the if logic of a.previousSibling is None and a.nextSibling is None.
Could you please let me know what I am doing wrong?
-ecognium
An approach that works for your specific example is:
from BeautifulSoup import BeautifulSoup
ht = '''
<div id="abc">
some long text goes <a href="/"> here </a> and hopefully it
will get picked up by the parser as content
</div>
'''
soup = BeautifulSoup(ht)
anchors = soup.findAll('a')
for a in anchors:
    a.previousSibling.replaceWith(a.previousSibling + a.string)
results = soup.findAll(text=lambda(x): len(x) > 20)
print results
which emits
$ python bs.py
[u'\n some long text goes  here ', u' and hopefully it \n will get picked up by the parser as content\n']
Of course, you'll probably need to take a bit more care, i.e., what if there's no a.string, or if a.previousSibling is None -- you'll need suitable if statements to take care of such corner cases. But I hope this general idea can help you. (In fact, you may want to also merge the next sibling if it's a string -- I'm not sure how that plays with your heuristic len(x) > 20, but say for example that you have two 9-character strings with an <a> containing a 5-character string in the middle; perhaps you'd want to pick up the lot as a "23-character string"? I can't tell, because I don't understand the motivation for your heuristic.)
I imagine that besides <a> tags you'll also want to remove others, such as <b> or <strong>, maybe <p> and/or <br>, etc...? I guess this, too, depends on what the actual idea behind your heuristics is!
When I tried to flatten tags in the document in that way -- so that a tag's entire content would be pulled up to its parent node in place (I wanted to reduce the content of a p tag, with all sub-paragraphs, lists, div and span etc. inside, but get rid of the style and font tags and some horrible word-to-html generator remnants) -- I found it rather complicated to do with BeautifulSoup itself, since extract() also removes the content and replaceWith() unfortunately doesn't accept None as an argument. After some wild recursion experiments, I finally decided to use regular expressions either before or after processing the document with BeautifulSoup, with the following method:
import re
def flatten_tags(s, tags):
    pattern = re.compile(r"<(( )*|/?)(%s)(([^<>]*=\\\".*\\\")*|[^<>]*)/?>" % (isinstance(tags, basestring) and tags or "|".join(tags)))
    return pattern.sub("", s)
The tags argument is either a single tag or a list of tags to be flattened.
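Hypothetical usage (Python 2, since the function checks isinstance(tags, basestring)):
html = '<p>keep this <font color="red">but drop the font tags</font></p>'
print flatten_tags(html, ['font', 'style'])
# -> <p>keep this but drop the font tags</p>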
