I've read a lot about multipart/form-data, mechanize, and twill, but I couldn't find out how to implement the code.
Using MultipartPostHandler to POST form-data with Python
First, I tried to fill in the forms on
www.imagebam.com/basic-upload
I can fill in the forms, but I can't actually send the data, even when I call submit().
After looking at the source code of the page above, I realized that all I need to do is POST the data with the correct content type to the page (correct me if I'm wrong, please)
http://www.imagebam.com/sys/upload/save
directly.
I tried to use poster.py, but couldn't understand how it works. I can use mechanize and twill a little bit, but I'm stuck, since this is more complex than simple form posting, I think.
So my questions:
- How can I use poster.py (or a user-created multipart form class) to upload images to imagebam.com?
- Or is there any alternative solution? :)
Don't rely completely on third-party libraries like mechanize. Either implement its official API in Python (ImageBam API),
or look at this project developed in PyQt4 (pymguploader) that uploads images, and then try to implement it yourself.
Mechanize is not the right tool for the task.
Implementing http://code.google.com/p/imagebam-api/ in Python is way more robust.
The examples are in PHP/curl; converting them to Python/urllib2 should be trivial.
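For illustration only, the urllib2 shape of such a POST looks roughly like this; the endpoint and field names below are placeholders, not the real imagebam-api parameters, so take the actual values from the linked project:
import urllib
import urllib2
# Placeholder endpoint and fields -- substitute the real values from the
# imagebam-api documentation; this only shows the general urllib2 shape
# of a curl-style POST.
endpoint = "http://www.imagebam.com/sys/API/example"  # hypothetical URL
fields = urllib.urlencode({"some_param": "some_value"})  # hypothetical fields
response = urllib2.urlopen(urllib2.Request(endpoint, fields))
print response.read()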
Yes! I did it. I used
this question.
Here is the code:
>>> from poster.encode import multipart_encode
>>> from poster.streaminghttp import register_openers
>>> import urllib2
>>> register_openers()
<urllib2.OpenerDirector instance at 0x02CDD828>
>>> datagen, headers = multipart_encode({"file[]": open(r"D:\hedef\myfile.jpg", "rb"), "content_type": "1", "thumb_size": "350"})
>>> request = urllib2.Request("http://www.imagebam.com/sys/upload/save", datagen, headers)
>>> print urllib2.urlopen(request).read()
Now all I need to do is use BeautifulSoup to fetch the thumbnail codes :)
Related
For example, I want to download the latest WHO PDF on COVID-19. I'm really not sure how to do this.
If you type in 'who covid19 pdf' on Google, the pdf and link will come up.
I noticed that the links branch off from the main WHO domain name - maybe this can help?
Does anyone know how I can go about this?
From Python's standard library, use the urllib package, specifically URLopener and its retrieve method. A succinct example, recreated from this reference, is below.
import urllib
testfile = urllib.URLopener()
testfile.retrieve("http://randomsite.com/file.pdf", "file.pdf")
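Note that URLopener is the Python 2 interface; on Python 3 the same functionality lives in urllib.request, for example:
# Python 3 equivalent: urlretrieve moved into urllib.request.
import urllib.request
url = "http://randomsite.com/file.pdf"  # the placeholder URL from the example above
urllib.request.urlretrieve(url, "file.pdf")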
I am trying to grab content from Wikipedia and use the HTML of the article. Ideally I would also like to be able to alter the content (eg, hide certain infoboxes etc).
I am able to grab page content using mwclient:
>>> import mwclient
>>> site = mwclient.Site('en.wikipedia.org')
>>> page = site.Pages['Samuel_Pepys']
>>> print page.text()
{{Redirect|Pepys}}
{{EngvarB|date=January 2014}}
{{Infobox person
...
But I can't see a relatively simple, lightweight way to translate this wikicode into HTML using Python.
Pandoc is too much for my needs.
I could just scrape the original page using Beautiful Soup but that doesn't seem like a particularly elegant solution.
mwparserfromhell might help in the process, but I can't quite tell from the documentation if it gives me anything I need and don't already have.
I can't see an obvious solution on the Alternative Parsers page.
What have I missed?
UPDATE: I wrote up what I ended up doing, following the discussion below.
page="""<html>
your pretty html here
<div id="for_api_content">%s</div>
</html>"""
Now you can grab your raw content with your API and just call
generated_page = page % api_content
This way you can design any HTML you want and just insert the API content in a designed spot.
Those APIs that you are using are designed to return raw content so it's up to you to style how you want the raw content to be displayed.
UPDATE
Since you showed me the actual output you are dealing with, I realize your dilemma. Luckily for you, there are modules that already parse the wikitext and convert it to HTML for you.
There is one called mwlib that will parse the wiki markup and output HTML, PDF, etc. You can install it with pip using the install instructions. This is probably one of your better options, since it was created in cooperation between the Wikimedia Foundation and PediaPress.
Once you have it installed you can use the writer method to do the dirty work.
def writer(env, output, status_callback, **kwargs): pass
Here are the docs for this module: http://mwlib.readthedocs.org/en/latest/index.html
And you can set attributes on the writer object to set the filetype (HTML, PDF, etc).
writer.description = 'PDF documents (using ReportLab)'
writer.content_type = 'application/pdf'
writer.file_extension = 'pdf'
writer.options = {
'coverimage': {
'param': 'FILENAME',
'help': 'filename of an image for the cover page',
}
}
I don't know exactly what the rendered HTML looks like, but I would imagine it's close to the actual wiki page. And since it's rendered in code, I'm sure you have control over modifications as well.
I would go with HTML parsing: the page content is reasonably semantic (class="infobox" and such), and there are classes explicitly meant to demarcate content that should not be displayed in alternative views (the first rule of the print stylesheet might be interesting).
That said, if you really want to manipulate wikitext, the best way is to fetch it, use mwparserfromhell to drop the templates you don't like, and use the parse API to get the modified HTML. Or use the Parsoid API which is a partial reimplementation of the parser returning XHTML/RDFa which is richer in semantic elements.
At any rate, trying to set up a local wikitext->HTML converter is by far the hardest way you can approach this task.
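For the wikitext route, a minimal sketch of the mwparserfromhell step might look like the following; it drops templates whose name starts with "Infobox" (just an example filter) and then asks the API's parse action to render the cleaned wikitext:
import mwclient
import mwparserfromhell
site = mwclient.Site('en.wikipedia.org')
text = site.Pages['Samuel_Pepys'].text()
# Drop every template whose name starts with "Infobox" (example filter only).
wikicode = mwparserfromhell.parse(text)
for template in wikicode.filter_templates():
    if str(template.name).strip().lower().startswith('infobox'):
        wikicode.remove(template)
# Render the modified wikitext to HTML via the parse action.
result = site.api('parse', text=str(wikicode), contentmodel='wikitext', prop='text')
html = result['parse']['text']['*']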
The MediaWiki API contains a (perhaps confusingly named) parse action that in effect renders wikitext into HTML. I find that mwclient's faithful mirroring of the API structure sometimes actually gets in the way. There's a good example of just using requests to call the API to "parse" (i.e. render) a page given its title.
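Something along these lines (these are the standard MediaWiki API parameters; formatversion=2 just makes parse.text come back as a plain string):
import requests
API_URL = "https://en.wikipedia.org/w/api.php"
def render_page(title):
    # Ask the MediaWiki parse action to render a page's wikitext as HTML.
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
        "formatversion": 2,  # return parse.text as a plain string
    }
    response = requests.get(API_URL, params=params)
    response.raise_for_status()
    return response.json()["parse"]["text"]
html = render_page("Samuel Pepys")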
I am working on writing a keyword extractor in Python. I would like to use the Yahoo Content API. The question is: is there a Python 2.7 (or even 3.x) wrapper for the Yahoo Content API? I could not find one with normal searches.
In parallel, I am trying AlchemyAPI, OpenCalais, and DBpedia Spotlight. I would love to make a comparison to figure out which one to use in production.
Any guidance would be most appreciated.
Thanks
I was interested in the answer as well. This is a possible solution:
import requests
text = """
Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration
"""
payload = {'q': "select * from contentanalysis.analyze where text='{text}'".format(text=text)}
r = requests.post("http://query.yahooapis.com/v1/public/yql", data=payload)
print(r.text)
According to the documentation, you can POST requests to the Yahoo Content API and get JSON back. Python has the urllib2, requests, and json libraries for that, all of which are well documented and easy to use.
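For example, a small requests sketch asking YQL for JSON output (this assumes the public YQL endpoint and the contentanalysis.analyze table used above are still available):
import requests
text = "Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration"
payload = {
    "q": "select * from contentanalysis.analyze where text='{0}'".format(text),
    "format": "json",  # ask YQL for JSON instead of the default XML
}
response = requests.post("http://query.yahooapis.com/v1/public/yql", data=payload)
response.raise_for_status()
print(response.json())  # parsed JSON, courtesy of requests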
I am coding a Python 2 script to perform some automatic actions on a website. I'm using urllib/urllib2 to accomplish this task. It involves GET and POST requests, custom headers, etc.
I stumbled upon an issue which doesn't seem to be mentioned in the documentation. Let's pretend we have the following valid URL: https://stackoverflow.com/index.php?abc=def&fgh=jkl and we need to perform a POST request there.
This is how my code looks (please ignore any typos):
data = urllib.urlencode({ "data": "somedata", "moredata": "somemoredata" })
urllib2.urlopen(urllib2.Request("https://stackoverflow.com/index.php?abc=def&fgh=jkl", data))
No errors are shown, but according to the web server, the request is being received at "https://stackoverflow.com/index.php" and not at "https://stackoverflow.com/index.php?abc=def&fgh=jkl". What is the problem here?
I know that I could use Requests, but I'd like to use urllib/urllib2 first.
If I'm not wrong, you should pass your request parameters in the data dictionary that you pass to urlopen(), e.g.:
data = urllib.urlencode({'abc': 'def', 'fgh': 'jkl'})
urllib2.urlopen(urllib2.Request('http://stackoverflow.com/index.php', data))
Also, just like you said, use Requests unless you absolutely need the low level access urllib provides.
Hope this helps.
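For completeness, here is a small sketch of the original setup with the query string left in the URL and the form fields sent as the POST body; it is worth re-testing this, because urllib2 itself does not strip the query string from a Request, so the problem may lie on the server side:
import urllib
import urllib2
url = "https://stackoverflow.com/index.php?abc=def&fgh=jkl"
body = urllib.urlencode({"data": "somedata", "moredata": "somemoredata"})
# Passing a data argument makes this a POST; the query string stays in the URL.
response = urllib2.urlopen(urllib2.Request(url, body))
print response.read()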
I was recently asked by a client to build a website for their insurance business. As part of this, they want to do some screen scraping of the quote site of one of their providers. They asked if there was an API to do this, and were told there wasn't one, but that if they could get the data from the provider's engine they could use it as they wanted.
My question: is it even possible to perform screen scraping on the response to a form submission to another site? If so, what are the gotchas I should look out for? Obvious legal/ethical issues aside, since they have already asked for permission to do what we're planning.
As an aside, I would prefer to do any processing in python.
Thanks
A really nice library for screen scraping is mechanize, which I believe is a clone of an original library written in Perl. Anyway, use that in combination with the ClientForm module, plus some additional help from BeautifulSoup, and you should be away.
I've written loads of screen-scraping code in Python, and these modules turned out to be the most useful. Most of the stuff that mechanize does could in theory be done by simply using the urllib2 or httplib modules from the standard library, but mechanize makes this stuff a breeze: essentially it gives you a programmatic browser (note, it does not require a browser to work, but merely provides you with an API that behaves like a completely customisable browser).
For post-processing, I've had a lot of success with BeautifulSoup, but lxml.html is a good choice too.
Basically, you will be able to do this in Python for sure, and your results should be really good with the range of tools out there.
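A minimal mechanize session might look like the sketch below; the URL, form index, and field names are invented for illustration, so read the real ones off the provider's quote form:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt if needed (check the site's terms first)
br.open("http://quotes.example.com/quote")  # hypothetical quote form URL
br.select_form(nr=0)  # pick the first form on the page (adjust as needed)
br["age"] = "30"  # hypothetical field names
br["postcode"] = "12345"
response = br.submit()
html = response.read()  # raw HTML of the results page, ready for BeautifulSoup or lxml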
You can pass a data parameter to urllib.urlopen to send POST data with the request, just as if you had filled out the form. You'll obviously have to take a look at exactly what data the form contains.
Also, if the form has method="GET", the request data is just part of the url given to urlopen.
Pretty much standard for scraping the returned HTML data is BeautifulSoup.
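A bare-bones version of that approach (the URL, field names, and markup are invented for illustration; pull the real ones out of the form's HTML):
import urllib
from bs4 import BeautifulSoup  # or the older BeautifulSoup 3 import, depending on your setup
# Hypothetical form target and fields -- read them off the provider's quote form.
form_data = urllib.urlencode({"age": "30", "postcode": "12345"})
response = urllib.urlopen("http://quotes.example.com/quote", form_data)  # the data argument makes this a POST
html = response.read()
soup = BeautifulSoup(html, "html.parser")
for cell in soup.find_all("td", class_="quote"):  # hypothetical markup
    print cell.get_text()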
I see the other two answers already mention all the major libraries of choice for the purpose... as long as the site being scraped does not make extensive use of JavaScript, that is. If it IS a JavaScript-heavy site, dependent on JS for the data it fetches and displays (e.g. via AJAX), your problem is an order of magnitude harder; in that case, I might suggest starting with crowbar, some customization of diggstripper, or selenium, etc.
You'll have to do substantial work in JavaScript, and probably dedicated work to deal with the specifics of the (hypothetically JS-heavy) site in question, depending on the JS frameworks it uses, etc.; that's why the job is so much harder in that case. But either way you should end up with (at least in part) local HTML copies of the site's pages as displayed, and finish by scraping those copies with the other tools already recommended. Good luck: may the sites you scrape always be JavaScript-light!-)
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
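A short lxml.html sketch for comparison; the URL and XPath are made-up examples, so adapt both to the page you are actually scraping:
import urllib2
import lxml.html
# Hypothetical results page and markup -- adjust both to the real site.
html_source = urllib2.urlopen("http://quotes.example.com/results").read()
doc = lxml.html.fromstring(html_source)
for cell in doc.xpath('//table[@id="quotes"]//td'):  # hypothetical XPath
    print cell.text_content()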