Quick overview: I am writing a very simple script using Python and Selenium to view Facebook metrics for multiple Facebook pages.
I am trying to find a clean way to loop through the pages and output their results (it's only one number that I am collecting).
Here is what I have right now but it is not working.
# Navigate to metrics page
pages = ["page_example_1", "page_example_2", "page_example_3"]
for link in pages:
    browser.get(('https://www.facebook.com/{link}/insights/?section=navVideos'))
# Navigate to metrics page
pages = ["page_example_1", "page_example_2", "page_example_3"]
for link in pages:
    browser.get('https://www.facebook.com/' + link + '/insights/?section=navVideos')
It's just string concatenation.
Or, if you would rather use that syntax, have a look at the comment by @heather.
It didn't work for you, because you aimed to use Python 3.6's f-strings, but forgot to prepend your string with the f char - crucial for this syntax. E.g. your code should be (only the relevant part):
browser.get(f'https://www.facebook.com/{link}/insights/?section=navVideos')
Alternatively you could use string formatting (e.g. the established approach before 3.6):
browser.get('https://www.facebook.com/{}/insights/?section=navVideos'.format(link))
In general, string concatenation - 'string1' + variable + 'string2' - is discouraged in python for performance and readability reasons.
BTW, in your sample code you had an extra pair of parentheses around the get() argument, i.e. browser.get((arg)). Parentheses around a single expression do not create a tuple (that would need a trailing comma), so they are harmless but redundant. I'm not sure whether it was a typo or on purpose; as you can see, the other responders and I have removed them.
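To put the three approaches side by side, here is a minimal sketch that only builds the URLs and does not touch Selenium. The helper name build_urls is made up for illustration; the page names are the placeholders from the question.

```python
# Placeholder page names, taken from the question's sample list.
pages = ["page_example_1", "page_example_2", "page_example_3"]

def build_urls(pages):
    """Return the insights URL for each page, built with an f-string."""
    return [f"https://www.facebook.com/{page}/insights/?section=navVideos"
            for page in pages]

urls = build_urls(pages)

# The same URL via str.format() and via concatenation, for comparison.
with_format = "https://www.facebook.com/{}/insights/?section=navVideos".format(pages[0])
with_concat = "https://www.facebook.com/" + pages[0] + "/insights/?section=navVideos"

# Without the f prefix the braces stay in the string literally,
# which is what happened in the question's code.
no_prefix = "https://www.facebook.com/{link}/insights/?section=navVideos"
```

In the real script, each entry of urls would then be passed to browser.get() inside the loop.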
I need to grab a url from a text file.
The URL is stored in a string like so: 'URL=http://example.net'.
Is there any way I could grab everything after the = char up until the . in '.net'?
Could I use the re module?
text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com
code below catches all urls in text and returns urls in list."""
import re  # needed for re.findall
urls = re.findall(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
output:
[
'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
'https://www.google.com/',
'https://www.facebook.com/',
'https://twitter.com'
]
I don't have much information, but I'll try to help with what I've got. I'm assuming that URL= is part of the string; in that case you can do this:
re.findall(r'URL=(.*?)\.', STRINGNAMEHERE)
Let me go into more detail about (.*?). The dot means any character (except a newline), the star means zero or more occurrences, and the ? after the star makes the match non-greedy, so it consumes as few characters as possible. The parentheses place it all into a group. The trailing \. is an escaped dot, which matches a literal period (an unescaped dot would match any character). All together, this finds everything between URL= and the first period.
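The difference between a lazy and a greedy group only shows up when the string has more than one dot, so here is a small sketch with a made-up sample line (the www. host is an assumption added for illustration):

```python
import re

line = "URL=http://www.example.net"  # made-up sample with two dots

# Lazy group: stops at the first literal dot after "URL=".
lazy = re.search(r"URL=(.*?)\.", line).group(1)

# Greedy group: runs on to the last literal dot in the line.
greedy = re.search(r"URL=(.*)\.", line).group(1)

print(lazy)    # http://www
print(greedy)  # http://www.example
```

For the question's 'URL=http://example.net' both forms give the same result, because there is only one dot.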
You don't need RegEx'es (the re module) for such a simple task.
If the string you have is of the form:
'URL=http://example.net'
Then you can solve this using basic Python in numerous ways, one of them being:
file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1 # this gives you the first position after =
end_position = file_line.find('.')
# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]
Of course, this is only going to extract one URL. Assuming that you're working with a large text, where you'd want to extract all URLs, you'll want to put this logic into a function so that you can reuse it, and build around it (iterate with a while or for loop and, depending on how you're iterating, keep track of the position of the last extracted URL and so on).
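As a rough sketch of that "put it into a function" idea, assuming one URL= entry per line: the function name extract_url_stem and the sample lines are made up, and it uses rfind() so that it cuts at the last dot, which differs from the find('.') above when the host itself contains dots.

```python
def extract_url_stem(line):
    """Return the text between the first '=' and the last '.', or None."""
    start = line.find('=')
    end = line.rfind('.')
    # Skip lines that have no '=' or no '.' after the '='.
    if start == -1 or end == -1 or end <= start:
        return None
    return line[start + 1:end]

# Made-up sample input, one entry per line.
sample_lines = [
    'URL=http://example.net',
    'URL=http://www.example.org',
    'no url on this line',
]

stems = []
for file_line in sample_lines:
    stem = extract_url_stem(file_line)
    if stem is not None:
        stems.append(stem)
```

With a real file you would loop over the open file object instead of sample_lines.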
Word of advice
This question has been answered quite a lot on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, to a level of detail that you'd be amazed. And these are not all, I just picked the first few that popped up in my search results.
Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill, that you can't do without in the world of programming.
Remember, that whatever problem it is that you are encountering, there is a very high chance that somebody on this forum had already encountered it, and received an answer, you just need to find it.
Please try this. It worked for me.
import re
s='url=http://example.net'
print(re.findall(r"=(.*)\.",s)[0])
I want to get the page numbers from the following HTML:
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ" class="outputText marginLeft0punto5">1</span>
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoTotalPaginaMAQ" class="outputText marginLeft0punto5">37</span>
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterTotalTotalMAQ" class="outputText marginLeft0punto5">736</span>
The goal is to get the numbers 1, 37 and 736.
My problem is that I don't know how to write the line that extracts the numbers. For example, for the number 1:
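import requests
from bs4 import BeautifulSoup

# url is the address of the page (not shown in the question)
req = requests.get(url)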
soup = BeautifulSoup(req.text, "lxml")
first_page = int(soup.find('span', {'id': 'viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ'}).getText())
Thanks so much
EDIT: Finally i found a solution with Selenium:
numpag = int(driver.find_element_by_xpath('//*[@id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ"]').text)
pagtotal = int(driver.find_element_by_xpath('//*[@id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoTotalPaginaMAQ"]').text)
totaltotal = int(driver.find_element_by_xpath('//*[@id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterTotalTotalMAQ"]').text)
Thanks @abarnert, sorry for the chaos in my question; it was my first post =)
The code you provided already works for the example you provided.
My guess is that your problem is that it doesn't work for any other page, probably because those id values are different each time.
If that's the case, you need to look at (or show us) multiple different outputs to figure out if there's a recognizable pattern that you can match with a regular expression or a function full of string operations or whatever. See Searching the tree in the docs for the different kinds of filters you can use.
As a wild guess, that Z7 and AVEQAI930OBRD02JPMTPG21004 are replaced by different strings of capitals and digits each time, but the rest of the format is always the same? If so, there are some pretty obvious regular expressions you can use:
import re

rnumpag = re.compile(r'.*:form1:textfooterInfoNumPagMAQ')
rtotalpagina = re.compile(r'.*:form1:textfooterInfoTotalPaginaMAQ')
rtotaltotal = re.compile(r'.*:form1:textfooterTotalTotalMAQ')
numpag = int(soup.find('span', id=rnumpag).string)
totalpagina = int(soup.find('span', id=rtotalpagina).string)
totaltotal = int(soup.find('span', id=rtotaltotal).string)
This works on your provided example, and would also work on a different page that had different strings of characters within the part we're matching with .*.
And, even if my wild guess was wrong, this should show you how to write a search for whatever you actually do have to search for.
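If you want to try the "match the stable suffix of the id" idea without installing anything, here is a rough standard-library sketch. Note that this swaps BeautifulSoup for html.parser, so it is not the code above, just the same idea; the class name SpanTextById and the XXXX ids are made up for illustration.

```python
import re
from html.parser import HTMLParser

class SpanTextById(HTMLParser):
    """Collect the text of <span> tags whose id matches a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.current_id = None   # id of the matching span we are inside, if any
        self.found = {}          # maps span id -> collected text

    def handle_starttag(self, tag, attrs):
        span_id = dict(attrs).get("id")
        if tag == "span" and span_id and self.pattern.search(span_id):
            self.current_id = span_id
            self.found[span_id] = ""

    def handle_data(self, data):
        if self.current_id is not None:
            self.found[self.current_id] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_id = None

# Made-up markup with a different generated prefix than the question's.
html_doc = (
    '<span id="viewns_Z7_XXXX_:form1:textfooterInfoNumPagMAQ">1</span>'
    '<span id="viewns_Z7_XXXX_:form1:textfooterInfoTotalPaginaMAQ">37</span>'
)

parser = SpanTextById(r":form1:textfooterInfoTotalPaginaMAQ$")
parser.feed(html_doc)
total_pages = int(next(iter(parser.found.values())))
```

Only the part of the id after the generated prefix is matched, so the lookup keeps working if the prefix changes between page loads.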
As a side note, you were using the undocumented legacy function getText(). This implies that you're copying and pasting ancient BS3 code. Don't do that. Some of it will work with BS4, even when it isn't documented to (as in this case), but it's still a bad idea. It's like trying to run Python 2 source code with Python 3 without understanding the differences.
What you want here is either get_text(), string, or text, and you should look at what all three of these mean in the docs to understand the difference—but here, the only thing within the tag is a text string, so they all happen to do the same thing.
I am new to both Python (and django) - but not to programming.
I am having no end of problems with indentation in my view. I am trying to generate my HTML dynamically, so that means a lot of string manipulation. Obviously, I can't have my entire HTML page in one line, so what is required in order to be able to dynamically build an HTML string, i.e. mixing strings and other variables?
For example, using PHP, the following trivial example demonstrates generating an HTML doc containing a table
<?php
$output = '<html><head><title>Getting worked up over Python indentations</title></head><body>';
$output .= '<table><tbody>';
for ($i = 0; $i < 10; $i++) {
    $output .= '<tr class="'.(($i % 2) ? 'even' : 'odd').'"><td>Row: '.$i.'</td></tr>';
}
$output .= '</tbody></table></body></html>';
echo $output;
I am trying to do something similar in Python (in my views.py), and I get errors like:
EOL while scanning string literal (views.py, line 21)
When I put everything in a single line, it gets rid of the error.
Could someone show how the little PHP script above would be written in Python, so I can use that as a template to fix my view?
[Edit]
My python code looks something like this:
def just_frigging_doit(request):
    html = '<html>
<head><title>What the funk<title></head>
<body>'
    # try to start building dynamic HTML from this point onward...
    # but server barfs even further up, on the html var declaration line.
[Edit2]
I have added triple quotes as suggested by Ned and S.Lott, and that works fine if I want to print out static text. If I want to create dynamic HTML (for example a row number), I get an exception: cannot concatenate 'str' and 'int' objects.
I am trying to generate my html dynamically, so that means a lot of string manipulation.
Don't do this.
Use Django's templates. They work really, really well. If you can't figure out how to apply them, do this. Ask a question showing what you want to do. Don't ask how to make dynamic HTML. Ask about how to create whatever page feature you're trying to create. 80% of the time, a simple {%if%} or {%for%} does everything you need. The rest of the time you need to know how filters and the built-in tags work.
Use string.Template if you must fall back to "dynamic" HTML. http://docs.python.org/library/string.html#template-strings Once you try this, you'll find Django's is better.
Do not do string manipulation to create HTML.
cannot concatenate 'str' and 'int' objects.
Correct. You cannot.
You have three choices.
1. Convert the int to a string. Use the str() function. This doesn't scale well. You have lots of ad-hoc conversions and stuff. Unpleasant.
2. Use the format() method of a string to insert values into the string. This is slightly better than complex string manipulation. After doing this for a while, you figure out why templates are a good idea.
3. Use a template. You can try string.Template. After a while, you figure out why Django's are a good idea.
my_template.html
<html><head><title>Getting worked up over Python indentations</title></head><body>
<table><tbody>
{%for object in objects%}
<tr class="{%cycle 'even' 'odd'%}"><td>Row: {{object}}</td></tr>
{%endfor%}
</tbody></table></body></html>
views.py
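from django.shortcuts import render

def myview(request):
    # render() replaces render_to_response(), which was removed in Django 3.0
    return render(request, 'my_template.html', {'objects': range(10)})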
I think that's all you'd need for a mockup.
In Python, a string can span lines if you use triple-quoting:
"""
This is a
multiline
string
"""
You probably want to use Django templates to create your HTML. Read a Django tutorial to see how it's done.
Python is strongly typed, meaning it won't automatically convert types for you to make your expressions work out, the way PHP will. So you can't concatenate strings and numbers like this: "hello" + num.
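Since the question asked what the PHP snippet would look like in Python, here is a minimal sketch using only plain strings. It is meant to illustrate triple quotes and str.format(), not to replace the advice to use Django templates; the variable names are made up.

```python
# Build the same table as the PHP example, using plain Python strings.
rows = []
for i in range(10):
    css_class = 'even' if i % 2 else 'odd'   # mirrors ($i % 2) ? 'even' : 'odd'
    # str.format() converts the int for us, so there is no str + int concatenation.
    rows.append('<tr class="{}"><td>Row: {}</td></tr>'.format(css_class, i))

# Triple quotes let the string literal span several lines.
html = """<html>
<head><title>Getting worked up over Python indentations</title></head>
<body>
<table><tbody>
{rows}
</tbody></table>
</body></html>""".format(rows='\n'.join(rows))
```

In a Django view you would wrap html in an HttpResponse, but at that point a template is usually less work.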
Usually when we search, we have a list of stories, we provide a search string, and expect back a list of results where the given search strings matches the story.
What I am looking to do, is the opposite. Give a list of search strings, and one story and find out which search strings match to that story.
Now this could be done with re, but in this case I want to use complex search queries as supported by Solr. Full details of the query syntax here. Note: I won't use boost.
Basically I want to get some pointers for the doesitmatch function in the sample code below.
def doesitmatch(contents, searchstring):
    """
    returns result of searching contents for searchstring (True or False)
    """
    ???????
    ???????
story = "big chunk of story 200 to 1000 words long"
searchstrings = ['sajal' , 'sajal AND "is a jerk"' , 'sajal kayan' , 'sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python))' , 'bangkok']
matches = [[searchstr] for searchstr in searchstrings if doesitmatch(story, searchstr) ]
Edit: Additionally would also be interested to know if any module exists to convert lucene query like below into regex:
sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python) OR "is a jerk")
After extensive googling, I realized that what I am looking to do is a Boolean search.
Found code that makes regex Boolean-aware: http://code.activestate.com/recipes/252526/
Issue looks solved for now.
Probably slow, but easy solution:
Make a query on the story plus each string to the search engine. If it returns anything, then it matches.
Otherwise you need to implement the search syntax yourself. If that includes things like "title:" and stuff this can be rather complex. If it's only the AND and OR from your example, then it's a recursive function that isn't too hairy.
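For the simple AND/OR case, the recursive function could look roughly like the sketch below. It is not Solr's real query parser: it assumes AND binds tighter than OR, treats adjacent terms with no operator as OR, matches whole words case-insensitively, and ignores fields, boosts and wildcards. The story text is made up.

```python
import re

# A token is a quoted phrase, a parenthesis, or a bare word.
TOKEN_RE = re.compile(r'"[^"]*"|[()]|[^\s()"]+')

def doesitmatch(contents, searchstring):
    """Return True if contents satisfies the AND/OR query in searchstring."""
    tokens = TOKEN_RE.findall(searchstring)
    text = contents.lower()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_or():
        nonlocal pos
        result = parse_and()
        while peek() is not None and peek() != ')':
            if peek() == 'OR':
                pos += 1
            right = parse_and()          # always parse, so the tokens are consumed
            result = result or right
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while peek() == 'AND':
            pos += 1
            right = parse_atom()
            result = result and right
        return result

    def parse_atom():
        nonlocal pos
        token = peek()
        if token is None or token in (')', 'AND', 'OR'):
            raise ValueError('unexpected token: %r' % (token,))
        pos += 1
        if token == '(':
            result = parse_or()
            if peek() != ')':
                raise ValueError('missing closing parenthesis')
            pos += 1
            return result
        phrase = token.strip('"').lower()
        return re.search(r'\b' + re.escape(phrase) + r'\b', text) is not None

    matched = parse_or()
    if pos != len(tokens):
        raise ValueError('unexpected token: %r' % (tokens[pos],))
    return matched

story = "sajal kayan is a webmaster living in bangkok"  # made-up story text
searchstrings = ['sajal', 'sajal AND "is a jerk"', 'sajal kayan',
                 'sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python))',
                 'bangkok']
matches = [s for s in searchstrings if doesitmatch(story, s)]
```

Each grammar level gets its own small function (OR calls AND, AND calls the atom parser, and a parenthesis recurses back to OR), which is what keeps the code short.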
Some time ago I looked for a Python implementation of Lucene and I came across Whoosh, which is a pure-Python full-text search engine. Maybe it will satisfy your needs.
You can also try PyLucene, but I didn't investigate this one.
Here's a suggestion in pseudocode. I'm assuming you store a story identifier with the search terms in the index, so that you can retrieve it with the search results.
def search_strings_matching(story_id_to_match, search_strings):
    result = set()
    for s in search_strings:
        result_story_ids = query_index(s)  # query_index returns an id iterable
        if story_id_to_match in result_story_ids:
            result.add(s)
    return result
This is probably less interesting to you now, since you've already solved your problem, but what you're describing sounds like Prospective Search, which is what you call it when you have the query first and you want to match it against documents as they come along.
Lucene's MemoryIndex is a class that was designed specifically for something like this, and in your case it might be efficient enough to run many queries against a single document.
This has nothing to do with Python, though. You'd probably be better off writing something like this in Java.
If you are writing Python on AppEngine, you can use the AppEngine Prospective Search Service to achieve exactly what you are trying to do here. See: http://code.google.com/appengine/docs/python/prospectivesearch/overview.html
What is the best way to sanitize user input for a Python-based web application? Is there a single function to remove HTML characters and any other necessary character combinations to prevent an XSS or SQL injection attack?
Here is a snippet that will remove all tags not on the whitelist, and all tag attributes not on the attributes whitelist (so you can't use onclick).
It is a modified version of http://www.djangosnippets.org/snippets/205/, with the regex on the attribute values to prevent people from using href="javascript:...", and other cases described at http://ha.ckers.org/xss.html.
(e.g. <a href="ja&#x09;vascript:alert('hi')"> or <a href="ja vascript:alert('hi')">, etc.)
As you can see, it uses the (awesome) BeautifulSoup library.
# Note: written for Python 2 and BeautifulSoup 3; the urlparse and
# BeautifulSoup modules were renamed in Python 3 and bs4.
import re
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup, Comment

def sanitizeHtml(value, base_url=None):
    rjs = r'[\s]*(&#x.{1,7})?'.join(list('javascript:'))
    rvb = r'[\s]*(&#x.{1,7})?'.join(list('vbscript:'))
    re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)
    validTags = 'p i strong b u a h1 h2 h3 pre br img'.split()
    validAttrs = 'href src width height'.split()
    urlAttrs = 'href src'.split() # Attributes which should have a URL
    soup = BeautifulSoup(value)
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        # Get rid of comments
        comment.extract()
    for tag in soup.findAll(True):
        if tag.name not in validTags:
            tag.hidden = True
        attrs = tag.attrs
        tag.attrs = []
        for attr, val in attrs:
            if attr in validAttrs:
                val = re_scripts.sub('', val) # Remove scripts (vbs & js)
                if attr in urlAttrs:
                    val = urljoin(base_url, val) # Calculate the absolute url
                tag.attrs.append((attr, val))
    return soup.renderContents().decode('utf8')
As the other posters have said, pretty much all Python db libraries take care of SQL injection, so this should pretty much cover you.
Edit: bleach is a wrapper around html5lib which makes it even easier to use as a whitelist-based sanitiser.
html5lib comes with a whitelist-based HTML sanitiser - it's easy to subclass it to restrict the tags and attributes users are allowed to use on your site, and it even attempts to sanitise CSS if you're allowing use of the style attribute.
Here's how I'm using it in my Stack Overflow clone's sanitize_html utility function:
http://code.google.com/p/soclone/source/browse/trunk/soclone/utils/html.py
I've thrown all the attacks listed in ha.ckers.org's XSS Cheatsheet (which are handily available in XML format) at it after performing Markdown to HTML conversion using python-markdown2, and it seems to have held up OK.
The WMD editor component which Stackoverflow currently uses is a problem, though - I actually had to disable JavaScript in order to test the XSS Cheatsheet attacks, as pasting them all into WMD ended up giving me alert boxes and blanking out the page.
The best way to prevent XSS is not to try and filter everything, but rather to simply do HTML Entity encoding. For example, automatically turn < into &lt;. This is the ideal solution assuming you don't need to accept any HTML input (outside of forum/comment areas where it is used as markup, it should be pretty rare to need to accept HTML); there are so many permutations via alternate encodings that anything but an ultra-restrictive whitelist (a-z, A-Z, 0-9 for example) is going to let something through.
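In Python 3 the standard library covers this kind of encoding with html.escape(). A minimal sketch, with a made-up input string:

```python
import html

user_input = '<script>alert("hi")</script>'  # made-up hostile input

# quote=True is the default, so double and single quotes are escaped as well.
safe = html.escape(user_input)

print(safe)  # &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```

The escaped string can be placed in HTML text or in a quoted attribute value; it is not enough on its own inside script blocks or URLs.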
SQL Injection, contrary to other opinion, is still possible, if you are just building out a query string. For example, if you are just concatenating an incoming parameter onto a query string, you will have SQL Injection. The best way to protect against this is also not filtering, but rather to religiously use parameterized queries and NEVER concatenate user input.
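A small sketch of what a parameterized query looks like, using the standard library's sqlite3 module with an in-memory database. The table and the hostile value are made up, and other drivers use a different placeholder style (for example %s instead of ?).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT)")

# Made-up hostile input; it is passed as data, never spliced into the SQL text.
hostile_name = "Robert'); DROP TABLE customers;--"
conn.execute("INSERT INTO customers (name) VALUES (?)", (hostile_name,))

# The same placeholder style works for lookups.
rows = conn.execute(
    "SELECT name FROM customers WHERE name = ?", (hostile_name,)
).fetchall()
```

The value is stored and read back verbatim, quotes included, and the table is still there afterwards.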
This is not to say that filtering isn't still a best practice, but in terms of SQL Injection and XSS, you will be far more protected if you religiously use Parameterized Queries and HTML Entity Encoding.
Jeff Atwood himself described how StackOverflow.com sanitizes user input (in non-language-specific terms) on the Stack Overflow blog: https://blog.stackoverflow.com/2008/06/safe-html-and-xss/
However, as Justin points out, if you use Django templates or something similar then they probably sanitize your HTML output anyway.
SQL injection also shouldn't be a concern. All of Python's database libraries (MySQLdb, cx_Oracle, etc) always sanitize the parameters you pass. These libraries are used by all of Python's object-relational mappers (such as Django models), so you don't need to worry about sanitation there either.
I don't do web development much any longer, but when I did, I did something like so:
When no parsing is supposed to happen, I usually just escape the data to not interfere with the database when I store it, and escape everything I read up from the database to not interfere with HTML when I display it (cgi.escape() in Python at the time; html.escape() in Python 3, where cgi.escape() has since been removed).
Chances are, if someone tried to input html characters or stuff, they actually wanted that to be displayed as text anyway. If they didn't, well tough :)
In short always escape what can affect the current target for the data.
When I did need some parsing (markup or whatever) I usually tried to keep that language in a non-intersecting set with html so I could still just store it suitably escaped (after validating for syntax errors) and parse it to html when displaying without having to worry about the data the user put in there interfering with your html.
See also Escaping HTML
If you are using a framework like Django, the framework can easily do this for you using standard filters. In fact, I'm pretty sure Django automatically does it unless you tell it not to.
Otherwise, I would recommend using some sort of regex validation before accepting inputs from forms. I don't think there's a silver bullet for your problem, but using the re module, you should be able to construct what you need.
To sanitize a string input which you want to store to the database (for example a customer name) you need either to escape it or plainly remove any quotes (', ") from it. This effectively prevents classical SQL injection which can happen if you are assembling an SQL query from strings passed by the user.
For example (if it is acceptable to remove quotes completely):
datasetName = datasetName.replace("'","").replace('"',"")