I want to get the number of pages from the following HTML:
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ" class="outputText marginLeft0punto5">1</span>
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoTotalPaginaMAQ" class="outputText marginLeft0punto5">37</span>
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterTotalTotalMAQ" class="outputText marginLeft0punto5">736</span>
The goal is to get the numbers 1, 37 and 736.
My problem is that I don't know how to define the line that extracts the numbers; for example, for the number 1:
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
first_page = int(soup.find('span', {'id': 'viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ'}).getText())
Thanks so much
EDIT: I finally found a solution with Selenium:
numpag = int(driver.find_element_by_xpath('//*[@id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ"]').text)
pagtotal = int(driver.find_element_by_xpath('//*[@id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoTotalPaginaMAQ"]').text)
totaltotal = int(driver.find_element_by_xpath('//*[@id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterTotalTotalMAQ"]').text)
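(In case you're on Selenium 4+, where the find_element_by_xpath helpers have been removed, the equivalent call - shown here for the first value only, same XPath - would be:)
from selenium.webdriver.common.by import By

# Selenium 4 spelling of the same lookup
numpag = int(driver.find_element(By.XPATH, '//*[@id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ"]').text)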
Thanks @abarnert, sorry for the chaos in my question, it was my first post =)
The code you provided already works for the example you provided.
My guess is that your problem is that it doesn't work for any other page, probably because those id values are different each time.
If that's the case, you need to look at (or show us) multiple different outputs to figure out if there's a recognizable pattern that you can match with a regular expression or a function full of string operations or whatever. See Searching the tree in the docs for the different kinds of filters you can use.
As a wild guess, maybe that Z7 and AVEQAI930OBRD02JPMTPG21004 are replaced by different strings of capitals and digits each time, but the rest of the format is always the same? If so, there are some pretty obvious regular expressions you can use:
import re

rnumpag = re.compile(r'.*:form1:textfooterInfoNumPagMAQ')
rtotalpagina = re.compile(r'.*:form1:textfooterInfoTotalPaginaMAQ')
rtotaltotal = re.compile(r'.*:form1:textfooterTotalTotalMAQ')
numpag = int(soup.find('span', id=rnumpag).string)
totalpagina = int(soup.find('span', id=rtotalpagina).string)
totaltotal = int(soup.find('span', id=rtotaltotal).string)
This works on your provided example, and would also work on a different page that had different strings of characters within the part we're matching with .*.
And, even if my wild guess was wrong, this should show you how to write a search for whatever you actually do have to search for.
As a side note, you were using the undocumented legacy function getText(). This implies that you're copying and pasting ancient BS3 code. Don't do that. Some of it will work with BS4, even when it isn't documented to (as in this case), but it's still a bad idea. It's like trying to run Python 2 source code with Python 3 without understanding the differences.
What you want here is either get_text(), string, or text, and you should look at what all three of these mean in the docs to understand the difference—but here, the only thing within the tag is a text string, so they all happen to do the same thing.
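A quick standalone illustration of the three on a tag like yours, whose only child is a single string, so they all agree:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span id="n">37</span>', 'lxml')
tag = soup.find('span', id='n')
print(tag.get_text())  # '37' - concatenates all text in the subtree
print(tag.string)      # '37' - the single NavigableString child (None if there are several)
print(tag.text)        # '37' - an alias for get_text()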
Related
I need to grab a url from a text file.
The URL is stored in a string like so: 'URL=http://example.net'.
Is there any way I could grab everything after the = char, up until the . in '.net'?
Could I use the re module?
text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com
code below catches all urls in text and returns urls in list."""
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
output:
[
'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
'https://www.google.com/',
'https://www.facebook.com/',
'https://twitter.com'
]
I don't have much information, but I'll try to help with what I've got. I'm assuming that URL= is part of the string; in that case you can do this:
re.findall(r'URL=(.*?)\.', STRINGNAMEHERE)
Let me go into more detail about (.*?). The dot means any character (except the newline character), and the star means zero or more occurrences of it. The ? here makes the repetition non-greedy (lazy), so it matches as little text as possible; without it, .* would run all the way to the last . in the string. (Note that ? after a plain character means something different - per the docs, "ab? will match either 'a' or 'ab'".) The brackets place it all into a group. All of this together basically means it will capture everything in between URL= and the first .
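To see the difference the ? makes, compare the lazy and greedy versions on an illustrative string with more than one dot:
import re

line = 'URL=http://www.example.net'
print(re.findall(r'URL=(.*?)\.', line))  # ['http://www'] - lazy, stops at the first .
print(re.findall(r'URL=(.*)\.', line))   # ['http://www.example'] - greedy, runs to the last .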
You don't need regexes (the re module) for such a simple task.
If the string you have is of the form:
'URL=http://example.net'
Then you can solve this using basic Python in numerous ways, one of them being:
file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1 # this gives you the first position after =
end_position = file_line.find('.')
# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]
Of course, this is just going to extract one URL. Assuming that you're working with a large text from which you want to extract all URLs, you'll want to put this logic into a function so that you can reuse it and build around it - iterating with while or for loops and, depending on how you iterate, keeping track of the position of the last extracted URL - as in the sketch below.
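Here is one sketch of what that could look like - the function name and the iteration strategy are just one option among many:
def extract_urls(text):
    """Collect every URL=... value, from '=' up to (not including) the next '.'."""
    urls = []
    position = 0
    while True:
        start = text.find('URL=', position)
        if start == -1:
            break
        start += len('URL=')
        end = text.find('.', start)
        if end == -1:
            break
        urls.append(text[start:end])
        position = end + 1  # resume scanning after the dot we just consumed
    return urls

print(extract_urls('URL=http://example.net some text URL=http://foo.org'))
# ['http://example', 'http://foo']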
Word of advice
This question has been answered quite a lot on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, to a level of detail that you'd be amazed. And these are not all, I just picked the first few that popped up in my search results.
Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill, that you can't do without in the world of programming.
Remember, that whatever problem it is that you are encountering, there is a very high chance that somebody on this forum had already encountered it, and received an answer, you just need to find it.
Please try this. It worked for me.
import re

s = 'url=http://example.net'
print(re.findall(r"=(.*)\.", s)[0])  # http://example
Quick overview: I am writing a very simple script using Python and Selenium to view Facebook metrics for multiple Facebook pages.
I am trying to find a clean way to loop through the pages and output their results (it's only one number that I am collecting).
Here is what I have right now, but it is not working:
# Navigate to metrics page
pages = ["page_example_1", "page_example_2", "page_example_3"]
for link in pages:
    browser.get(('https://www.facebook.com/{link}/insights/?section=navVideos'))
# Navigate to metrics page
pages = ["page_example_1", "page_example_2", "page_example_3"]
for link in pages:
    browser.get('https://www.facebook.com/' + link + '/insights/?section=navVideos')
It's just string concatenation.
Or, if you are so much inclined to use that syntax, have a look at the comment by @heather.
It didn't work for you because you aimed to use Python 3.6's f-strings but forgot to prepend your string with the f char, which is crucial for this syntax. E.g. your code should be (only the relevant part):
browser.get(f'https://www.facebook.com/{link}/insights/?section=navVideos')
Alternatively, you could use string formatting (the established approach before 3.6):
browser.get('https://www.facebook.com/{}/insights/?section=navVideos'.format(link))
In general, string concatenation - 'string1' + variable + 'string2' - is discouraged in Python for performance and readability reasons.
BTW, in your sample code you had double brackets around get()'s argument - browser.get((arg)). With a single argument and no trailing comma that is not actually a tuple, just redundant parentheses, so it is harmless - but it is cleaner without them, and as you can see, I and the other responders have removed it.
I have a web app that reads from the Tumblr API and reformats the way that "reblog chains" are formatted.
With Tumblr, commentary for a post is stored as HTML blockquotes. As users respond to the commentary above, another level gets added to the blockquote chain, eventually resulting in many nested reblog chains.
Here is an example of how a "reblog chain" looks in plain HTML:
<p><a class="tumblr_blog" href="http://chainsaw-police.tumblr.com/post/96158438802/example-tumblr-post">chainsaw-police</a>:</p><blockquote>
<p><a class="tumblr_blog" href="http://example-blog-domain.tumblr.com/post/96158384215/example-tumblr-post">example-blog-domain</a>:</p><blockquote>
<p>Here is an example of a Tumblr post.</p> <p>It can have multiple <p> elements sometimes. It may only have one, though, at other times.</p>
</blockquote>
<p>This is an example of a user “reblogging” a post. As you can see, the previous comment is stored above as a <blockquote>.</p>
</blockquote>
<p>This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones residing deeper in the nest of blockquotes.</p>
And this is what it looks like when rendered.
I want to be able to reformat the reblog chain so that it looks more like this:
example-blog-domain:
Here is an example of a Tumblr post.
It can have multiple <p> elements sometimes. It may only have one, though, at other times.
chainsaw-police:
This is an example of a user “reblogging” a post. As you can see, the previous comment is stored above as a <blockquote>.
example-blog-domain:
This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones residing deeper in the nest of blockquotes.
I know, it's an incredibly confusing structure, hence why I'm trying to write something to make it more readable.
Is there any way to interpret the HTML and split the reblogs up into individual "comments"? For example, having an array or dict that has the username and the commentary would be more than enough. However, after messing with lxml and BeautifulSoup for months, I'm at my wits' end.
If there was even a way to do it in CSS, which I highly doubt, that would be fine.
Thanks in advance, everyone!
I guess CSS does not have such functionality.
You need to parse the HTML into a structure (with lxml, BeautifulSoup, ...) and render it yourself; that is the easier way. You could also build a regexp-based filter that strips the unwanted pieces of the HTML code.
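For what it's worth, here is a minimal BeautifulSoup sketch of that approach. It assumes exactly the structure shown in the question - each level is a <p><a class="tumblr_blog">name</a>:</p> label followed by a sibling <blockquote>, with the newest, unlabeled comment trailing after the outermost blockquote - and real-world posts would need adjustments (for instance, literal <p> text inside a comment would have to be entity-escaped in the input):
from bs4 import BeautifulSoup

def parse_reblog_chain(html):
    """Return a list of (username, comment) pairs, oldest first."""
    soup = BeautifulSoup(html, "lxml")
    comments = []
    # The innermost label is the original poster; walk the labels inside-out.
    for link in reversed(soup.find_all("a", class_="tumblr_blog")):
        name = link.get_text()
        quote = link.find_parent("p").find_next_sibling("blockquote")
        parts = []
        for p in quote.find_all("p", recursive=False):
            # Skip the <p> holding the next level's username label.
            if p.find("a", class_="tumblr_blog"):
                continue
            parts.append(p.get_text(" ", strip=True))
        comments.append((name, " ".join(parts)))
    # The newest comment sits after the outermost blockquote, unlabeled.
    outer = soup.find("blockquote")
    tail = [p.get_text(" ", strip=True) for p in outer.find_next_siblings("p")]
    if tail:
        comments.append(("Latest", " ".join(tail)))
    return comments

for who, said in parse_reblog_chain(open("post.html").read()):  # hypothetical input file
    print("%s: %s" % (who, said))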
reddit user /u/joyeusenoelle has answered my question over at /r/LearnPython using a tonne of convoluted regexes that end up looking more like a voodoo magic spell than a text manipulation script.
Lots of regexes later, I think I've solved this for an arbitrarily-deep comment chain.
import re

with open("tcomment.txt", "r") as tf:
    text = ""
    for line in tf:
        text += line

text = text.replace("\n", "")
text = text.replace(">", ">\n")
text = text.replace("<", "\n<")
text = re.sub(r"</p>\s*<p>", "<br><br>", text)
text = text.replace("<p>\n", "")
text = text.replace("</p>\n", "\n")
text = re.sub(r"<[/]{0,1}blockquote>", "<chunk>", text)
text = re.sub(r'<a class="tumblr_blog"[^>]+?>', "<chunk>", text)
text = text.replace("</a>", "")
text = re.sub(r"\n+", "", text)
text = re.sub(r"\s{2,}", " ", text)
text = re.sub(r"<chunk>\s*<chunk>", "<chunk>", text)

bits = text.split("<chunk>")
bits[0] = "Latest:"

comments = []
for i in range(len(bits)):
    j = 0 - (i + 1)
    if (len(bits) - i) > i:
        comments.append("<b>" + bits[i] + "</b> " + bits[j])
comments.reverse()

for comment in comments:
    print("<p>%s</p>" % (comment))
    print()
The line bits[0] = "Latest:" can be changed to whatever you want the
most recent comment to display, and you'll probably want to change how
the text comes into the script.
For the text you provided, this gives me:
<p><b>example-blog-domain:</b> Here is an example of a Tumblr post.<br><br>It can have multiple <p> elements sometimes. It may
only have one, though, at other times.
<p><b>chainsaw-police:</b> This is an example of a user "reblogging" a post. As you can see, the previous comment is stored
above as a <blockquote>.
<p><b>Latest:</b> This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones
residing deeper in the nest of blockquotes.
e: Some thoughts: this is in Python 3, but everything except the print
statements should work in Python 2, I think. I used plain string methods
like replace() and split() whenever possible, because direct string
manipulation is typically faster than regular expressions, but that may
not be appropriate here. And finally, it's possible that I'm making more
work for myself than I need to in the substitutions section, but at this
point I've looked at the code too long to figure out whether it could be slimmed down.
I have been trying to strip out some data from HTML files. I have the logic coded to get the right cells. Now I am struggling to get the actual contents of the 'cell':
Here is my HTML snippet:
headerRows[0][10].contents
[<font size="+0"><font face="serif" size="1"><b>Apples Produced</b><font size="3">
</font></font></font>]
Note that this is an item in a Python list [].
I need the value Apples Produced but can't get to it.
Any suggestions would be appreciated - and suggestions on a good book that explains this would earn my eternal gratitude.
Thanks for that answer. However, isn't there a more general answer? What happens if my cell doesn't have a bold attribute - say it is:
[<font size="+0"><font face="serif" size="1"><I>Apples Produced</I><font size="3">
</font></font></font>]
I still need the value Apples Produced.
I am trying to learn to read/understand the documentation, and your response will help.
I really appreciate this help. The best thing about these answers is that it is a lot easier to generalize from them than I have been able to do from the BeautifulSoup documentation. I learned to program in the Fortran era, and now I am learning Python, and I am amazed at its power - BeautifulSoup is an example. Making a coherent whole of the documentation is tough for me.
Cheers
The BeautifulSoup documentation should cover everything you need - in this case it looks like you want to use findNext:
headerRows[0][10].findNext('b').string
A more generic solution which doesn't rely on the <b> tag would be to use the text argument to findAll, which allows you to search only for NavigableString objects:
>>> s = BeautifulSoup(u'<p>Test 1 <span>More</span> Test 2</p>')
>>> u''.join([s.string for s in s.findAll(text=True)])
u'Test 1 More Test 2'
headerRows[0][10].contents[0].find('b').string
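On the follow-up about not relying on <b> specifically: one way that doesn't care which tag wraps the text is to grab the cell's first text node (BS3-era spelling, matching the rest of this thread; current BS4 prefers find(string=True)):
cell = headerRows[0][10]
# The first NavigableString anywhere under the cell, whatever tag wraps it
value = cell.find(text=True).strip()  # 'Apples Produced', whether it sits in <b> or <I>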
I have a base class that extends all the Beautiful Soup classes with a bunch of methods that help me get at text within a group of elements when I don't necessarily want to rely on the structure. One of those methods is the following:
def clean(self, val):
    if type(val) is not StringType: val = str(val)
    val = re.sub(r'<.*?>', '', val)  # remove tags
    val = re.sub(r'\s+', ' ', val)   # collapse internal whitespace
    return val.strip()               # remove leading & trailing whitespace
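As a standalone, Python 3 flavored sketch of the same idea (isinstance replaces the Python 2 StringType check):
import re

def clean(val):
    if not isinstance(val, str):
        val = str(val)
    val = re.sub(r'<.*?>', '', val)  # remove tags
    val = re.sub(r'\s+', ' ', val)   # collapse internal whitespace
    return val.strip()               # remove leading & trailing whitespace

print(clean('<font size="+0"><b>Apples\n  Produced</b></font>'))  # Apples Produced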
What is the best way to sanitize user input for a Python-based web application? Is there a single function to remove HTML characters and any other necessary characters combinations to prevent an XSS or SQL injection attack?
Here is a snippet that will remove all tags not on the whitelist, and all tag attributes not on the attributes whitelist (so you can't use onclick).
It is a modified version of http://www.djangosnippets.org/snippets/205/, with the regex on the attribute values to prevent people from using href="javascript:...", and other cases described at http://ha.ckers.org/xss.html.
(e.g. <a href="ja vascript:alert('hi')"> with a literal gap in the scheme, or the same gap smuggled in as an HTML entity such as &#x09;, etc.)
As you can see, it uses the (awesome) BeautifulSoup library.
import re
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup, Comment

def sanitizeHtml(value, base_url=None):
    rjs = r'[\s]*(&#x.{1,7})?'.join(list('javascript:'))
    rvb = r'[\s]*(&#x.{1,7})?'.join(list('vbscript:'))
    re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)
    validTags = 'p i strong b u a h1 h2 h3 pre br img'.split()
    validAttrs = 'href src width height'.split()
    urlAttrs = 'href src'.split()  # Attributes which should have a URL
    soup = BeautifulSoup(value)
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        # Get rid of comments
        comment.extract()
    for tag in soup.findAll(True):
        if tag.name not in validTags:
            tag.hidden = True
        attrs = tag.attrs
        tag.attrs = []
        for attr, val in attrs:
            if attr in validAttrs:
                val = re_scripts.sub('', val)  # Remove scripts (vbs & js)
                if attr in urlAttrs:
                    val = urljoin(base_url, val)  # Calculate the absolute url
                tag.attrs.append((attr, val))
    return soup.renderContents().decode('utf8')
As the other posters have said, pretty much all Python db libraries take care of SQL injection, so this should pretty much cover you.
Edit: bleach is a wrapper around html5lib which makes it even easier to use as a whitelist-based sanitiser.
html5lib comes with a whitelist-based HTML sanitiser - it's easy to subclass it to restrict the tags and attributes users are allowed to use on your site, and it even attempts to sanitise CSS if you're allowing use of the style attribute.
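For instance, a minimal bleach sketch - the tag and attribute whitelists here are just examples:
import bleach  # pip install bleach

dirty = '<a href="javascript:alert(1)" onclick="evil()">click</a><script>bad()</script>'
safe = bleach.clean(
    dirty,
    tags=['a', 'b', 'i', 'p'],            # tags to keep; everything else is escaped
    attributes={'a': ['href', 'title']},  # allowed attributes, per tag
)
print(safe)  # the javascript: href and the onclick are dropped, <script> is escaped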
Here's now I'm using it in my Stack Overflow clone's sanitize_html utility function:
http://code.google.com/p/soclone/source/browse/trunk/soclone/utils/html.py
I've thrown all the attacks listed in ha.ckers.org's XSS Cheatsheet (which are handily available in XML format) at it after performing Markdown to HTML conversion using python-markdown2, and it seems to have held up OK.
The WMD editor component which Stack Overflow currently uses is a problem, though - I actually had to disable JavaScript in order to test the XSS Cheatsheet attacks, as pasting them all into WMD ended up giving me alert boxes and blanking out the page.
The best way to prevent XSS is not to try and filter everything, but rather to simply do HTML entity encoding. For example, automatically turn < into &lt;. This is the ideal solution assuming you don't need to accept any HTML input (outside of forum/comment areas where it is used as markup, it should be pretty rare to need to accept HTML); there are so many permutations via alternate encodings that anything but an ultra-restrictive whitelist (a-z, A-Z, 0-9, for example) is going to let something through.
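In Python terms that's a single standard library call (cgi.escape() back then; html.escape() on Python 3):
from html import escape  # Python 3; the Python 2 era used cgi.escape

user_input = '<script>alert("xss")</script>'
print(escape(user_input))
# &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;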
SQL injection, contrary to other opinion, is still possible if you are just building out a query string. For example, if you are just concatenating an incoming parameter onto a query string, you will have SQL injection. The best way to protect against this is also not filtering, but rather to religiously use parameterized queries and NEVER concatenate user input.
This is not to say that filtering isn't still a best practice, but in terms of SQL injection and XSS, you will be far more protected if you religiously use parameterized queries and HTML entity encoding.
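A minimal sketch of the difference, using the standard library's sqlite3 (placeholder syntax varies by driver - sqlite3 uses ?, MySQLdb uses %s):
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (name TEXT)')

evil = "x'); DROP TABLE users;--"

# Vulnerable: user input concatenated straight into the SQL string.
# conn.execute("INSERT INTO users (name) VALUES ('" + evil + "')")

# Safe: the driver binds the value; it is never parsed as SQL.
conn.execute('INSERT INTO users (name) VALUES (?)', (evil,))
print(conn.execute('SELECT name FROM users').fetchall())  # [("x'); DROP TABLE users;--",)]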
Jeff Atwood himself described how StackOverflow.com sanitizes user input (in non-language-specific terms) on the Stack Overflow blog: https://blog.stackoverflow.com/2008/06/safe-html-and-xss/
However, as Justin points out, if you use Django templates or something similar then they probably sanitize your HTML output anyway.
SQL injection also shouldn't be a concern. All of Python's database libraries (MySQLdb, cx_Oracle, etc.) always sanitize the parameters you pass. These libraries are used by all of Python's object-relational mappers (such as Django models), so you don't need to worry about sanitization there either.
I don't do web development much any longer, but when I did, I did something like so:
When no parsing is supposed to happen, I usually just escape the data so it doesn't interfere with the database when I store it, and escape everything I read from the database so it doesn't interfere with HTML when I display it (cgi.escape() in Python).
Chances are, if someone tried to input HTML characters or stuff, they actually wanted that to be displayed as text anyway. If they didn't, well, tough :)
In short always escape what can affect the current target for the data.
When I did need some parsing (markup or whatever), I usually tried to keep that language in a non-intersecting set with HTML, so I could still store it suitably escaped (after validating for syntax errors) and parse it to HTML when displaying, without having to worry about the user's data interfering with my HTML.
See also Escaping HTML
If you are using a framework like django, the framework can easily do this for you using standard filters. In fact, I'm pretty sure django automatically does it unless you tell it not to.
Otherwise, I would recommend using some sort of regex validation before accepting input from forms. I don't think there's a silver bullet for your problem, but using the re module, you should be able to construct what you need.
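A small sketch of that kind of validation - the pattern is purely illustrative:
import re

def is_valid_username(value):
    # Accept only 3-30 letters, digits or underscores: a whitelist, not a blacklist.
    return re.fullmatch(r'\w{3,30}', value) is not None

print(is_valid_username('alice_42'))          # True
print(is_valid_username("x'; DROP TABLE--"))  # False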
To sanitize a string input which you want to store in the database (for example a customer name), you need either to escape it or plainly remove any quotes (', ") from it. This effectively prevents classical SQL injection, which can happen if you are assembling an SQL query from strings passed by the user.
For example (if it is acceptable to remove quotes completely):
datasetName = datasetName.replace("'","").replace('"',"")