Retrieving images loaded in the browser - python

The following code works great for fetching static images:
import urllib  # Python 2; on Python 3 use urllib.request.urlretrieve instead
urllib.urlretrieve({SOME_URL}, "00000001.jpg")
I'd however like to fetch all media that I can load on WhatsApp Web. For instance, looking at the source of a conversation that's fully loaded on my screen, I can see image URLs such as blob:https://web.whatsapp.com/a2e9249a-365c-4ce7-a3ce-795be018400e – that link/image can actually be opened in a separate browser tab.
If I replace {SOME_URL} with that link, Python doesn't recognise blob: as a valid URL scheme. I have kept my browser with WhatsApp Web loaded in the background, as I assume those links change every time I load the conversation. But the first problem is finding a urlretrieve equivalent that can handle such links. Any thoughts? Thank you!
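A blob: URL points at data that exists only inside the browser's memory, so no plain HTTP client can download it; the usual workaround is to read the blob in-page via browser automation. A minimal sketch, assuming Selenium is driving a logged-in WhatsApp Web session (the blob URL is the example from the question; everything else is illustrative):

import base64
from selenium import webdriver

driver = webdriver.Chrome()  # assumes this session is logged in to WhatsApp Web
driver.set_script_timeout(10)
blob_url = "blob:https://web.whatsapp.com/a2e9249a-365c-4ce7-a3ce-795be018400e"

# Fetch the blob inside the page and hand its bytes back to Python as base64;
# execute_async_script supplies the 'done' callback as the last argument.
b64 = driver.execute_async_script("""
    const done = arguments[arguments.length - 1];
    fetch(arguments[0]).then(r => r.blob()).then(blob => {
        const reader = new FileReader();
        reader.onload = () => done(reader.result.split(',')[1]);
        reader.readAsDataURL(blob);
    });
""", blob_url)

with open("00000001.jpg", "wb") as f:
    f.write(base64.b64decode(b64))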

Related

Is there any other way to extract data from dynamic website, rather than using selenium?

I am trying to extract the data from the website https://shop.nordstrom.com/ for all the products (shirts, t-shirts and so on). The page is dynamically loaded. I know I can use selenium with a headless browser, but that is a time-consuming process, and inspecting the elements, which have strange IDs and class names, is not promising either.
So I thought of looking in the Network tool to see whether I could find the path to the API from which the data is loaded (XHR requests), but I could not find anything helpful. So is there a way to get the data from the website?
If you don't want to use selenium, the alternative is to fetch the pages with the requests module and parse them with a parser like bs4.
You are on the right path in looking for the call to the API. XHR requests can be seen under the Network tab, but the multitude of resources that appears there makes it hard to pick out the requests that matter. A simple way around this is the following:
Instead of the Network tab, go to the Console tab. There, click the settings icon and tick only the option Log XMLHttpRequests.
Now refresh the page and scroll down to trigger the dynamic calls. You will see the logs of all XHRs much more clearly.
For example
(index):29 Fetch finished loading: GET "https://shop.nordstrom.com/api/recs?page_type=home&placement=HP_SALE%2CHP_TOP_RECS%2CHP_CUST_HIS%2CHP_AFF_BRAND%2CHP_FTR&channel=web&bound=24%2C24%2C24%2C24%2C6&apikey=9df15975b8cb98f775942f3b0d614157&session_id=0&shopper_id=df0fdb2bb2cf4965a344452cb42ce560&country_code=US&experiment_id=945b2363-c75d-4950-b255-194803a3ee2a&category_id=2375500&style_id=0%2C0%2C0%2C0&ts=1593768329863&url=https%3A%2F%2Fshop.nordstrom.com%2F&zip_code=null".
Making a GET request to that URL returns a bunch of JSON objects. You can now use this URL, and others you can derive the same way, to request the data directly.
See the linked answer for how to combine such a URL with the requests module to fetch the data.
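As an illustration, a minimal sketch of replaying the logged request with requests; the endpoint and parameter values are taken from the log line above (the apikey and session values are whatever your own session produced), and the User-Agent header is an assumption, added only because many sites reject clients without a browser-like one:

import requests

url = "https://shop.nordstrom.com/api/recs"
params = {
    # Decoded from the query string in the logged URL above.
    "page_type": "home",
    "placement": "HP_SALE,HP_TOP_RECS,HP_CUST_HIS,HP_AFF_BRAND,HP_FTR",
    "channel": "web",
    "bound": "24,24,24,24,6",
    "apikey": "9df15975b8cb98f775942f3b0d614157",
    "country_code": "US",
}
resp = requests.get(url, params=params, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()
products = resp.json()  # a JSON payload of recommendation/product objects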

Tradingview snapshot feature; pulling request get with python to execute and grab url?

I am trying to use Python 3.6 to access the snapshot feature of TradingView and grab the URL it creates. I was able to do it with selenium, but it took around 4 seconds to execute. I was hoping to achieve this with the requests library instead.
First off: is this even possible? The URL is https://s.tradingview.com/widgetembed/?symbol=DRYS. There is a little camera icon on the right side. An HTML scraper can't be used, since the snapshot URL is only generated after pressing the icon.
Any pointers can help.
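For reference, a minimal sketch of the selenium route the asker describes as working; every locator here is hypothetical and has to be found by inspecting the widget's DOM, since the snapshot URL only exists after the click:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://s.tradingview.com/widgetembed/?symbol=DRYS")
wait = WebDriverWait(driver, 10)

# Hypothetical locators: the camera button, then whatever element the
# page uses to display the generated snapshot link after the click.
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".snapshot-button"))).click()
link = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".snapshot-link")))
print(link.get_attribute("value") or link.text)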

Post API search with Python

I'm trying to scrape all news items from this website: http://www.uvm.dk/aktuelt. They do not show up in the page source.
I've tried using Firefox's Live HTTP Headers and Chrome's developer tools, but I still can't figure out what goes on behind the scenes. I'm sure it's pretty simple :-)
I have this information, but how do I use it to scrape the wanted news?
http://www.uvm.dk/api/search
Request Method:POST
Connection: keep-alive
PageId=8938bc1b-a673-4513-80d1-e1714ca93d7c&Term=&Years%5B%5D=2017&WorkAreaIds=&SubjectIds=&TemplateIds=&NewsListIds%5B%5D=Emner&TimeSearch%5BEvaluation%5D=&FlagSearch%5BEvaluation%5D=Alle&DepartmentNames=&Letters=&RootItems=&Language=da&PageSize=10&Page=1
Can anyone help?
Not a direct answer but some hints.
Your approach with Live HTTP Headers is a good one. Open the sidebar before loading the home page and clear everything. Then load the home page and an article. There will usually be a ton of HTTP requests because of images, CSS and JS, but you'll be able to find the few useful ones. Usually the very first is for the home page, and one somewhere below is for the article's main page. Another interesting one is the request fired when you click "next page".
I like to decouple downloading (HTTP) from scraping (HTML, JSON and so on).
I download to a file with a first script and scrape it with a second one.
First, because I want to be able to adjust the scraping without downloading again and again. Second, because I prefer to use bash+curl to download and python+lxml to scrape. If I need information from the scraping step to continue downloading, my scraping script outputs it on the console.
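Putting those hints together, a minimal sketch of replaying the captured POST with requests; the endpoint and form fields are exactly the ones captured in the question (only Page and PageSize would normally change between calls), and saving to a file follows the download/scrape split described above:

import requests

# Form fields decoded from the captured POST body in the question.
data = {
    "PageId": "8938bc1b-a673-4513-80d1-e1714ca93d7c",
    "Term": "",
    "Years[]": "2017",
    "WorkAreaIds": "",
    "SubjectIds": "",
    "TemplateIds": "",
    "NewsListIds[]": "Emner",
    "TimeSearch[Evaluation]": "",
    "FlagSearch[Evaluation]": "Alle",
    "DepartmentNames": "",
    "Letters": "",
    "RootItems": "",
    "Language": "da",
    "PageSize": "10",
    "Page": "1",
}
resp = requests.post("http://www.uvm.dk/api/search", data=data)
resp.raise_for_status()

# Save the raw response so the scraping script can re-parse it
# without hitting the server again.
with open("uvm_search_page1.json", "wb") as f:
    f.write(resp.content)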

Scraping site that uses AJAX

I've read some relevant posts here but couldn't figure out an answer.
I'm trying to crawl a web page with reviews. When the site is visited there are only 10 reviews at first, and the user has to press "Show more" to load 10 more reviews (which also appends #add10 to the site's address) every time they scroll to the end of the review list. In fact, a user can get the full review list by appending #add1000 (where 1000 is the number of additional reviews) to the site's address. The problem is that with site_url#add1000 my spider gets only the first 10 reviews, just as with site_url, so this approach doesn't work.
I also can't find a way to make an appropriate Request imitating the original one from the site. The original AJAX URL is of the form 'domain/ajaxlst?par1=x&par2=y' and I have tried all of these:
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all)
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers}, cookies={all_cookies})
But every time I get a 404 error. Can anyone explain what I'm doing wrong?
What you need here is a headless browser, since the requests module cannot handle AJAX well.
One such option is selenium.
For example:
driver.find_element_by_id("show more").click()  # just an example case; use the real button's id
Normally, when you scroll down the page, AJAX sends a request to the server, and the server responds with a JSON/XML file that your browser uses to refresh the page.
You need to figure out the URL of that JSON/XML request. Normally, you can open Firefox, go to Tools / Web Developer / Web Console, monitor the network activity, and easily catch this JSON/XML file.
Once you find this file, you can parse the reviews directly from it (I recommend the Python modules requests and bs4 for this) and save a huge amount of time. Remember to use some different clients and IPs. Be nice to the server and it won't block you.
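A minimal sketch of that approach; 'domain/ajaxlst' and the parameters are the placeholders from the question, and the X-Requested-With header is an assumption worth testing, since many AJAX endpoints reject requests without it (a possible cause of the 404s described above):

import requests

resp = requests.get(
    "https://domain/ajaxlst",  # substitute the real endpoint from the web console
    params={"par1": "x", "par2": "y"},
    headers={
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",  # marks the request as AJAX
    },
)
resp.raise_for_status()
reviews = resp.json()  # or resp.text if the endpoint returns XML/HTML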

Python: upload images

Can Python upload images to the internet and provide a URL for them? For example, is it possible for Python to upload an image to Photobucket or any other image hosting service and then retrieve the URL for it?
Certainly. You'll need to find an image hosting service with an API (hint: Flickr), and then write some Python code to interact with it (hint: XML-RPC).
Pseudocode
import xmlrpclib  # Python 2; on Python 3 this module is xmlrpc.client

with open("...") as imagelist:
    for image in imagelist:
        message = xmlrpclib.make_some_message_or_other  # placeholder, not a real call
        response = message.send()
        response.parse()
You'll need a more specific question if you want a more specific answer!
Sure!
To do it, you basically have to have Python pretend to be a web browser. You need to go get the upload form, fill in all the fields, and pick your image. Then you need to submit the form data, uploading the image. Once that's done, you need to get the "upload complete" page from the site, and find the URL where the image went.
A good library to start with might be mechanize, the Python version of Perl's famous WWW::Mechanize module. The library is basically a programmable web browser that you can script from Python.
EDIT: If you plan to do this a lot, you probably want to use an actual supported API. Otherwise the image host might get annoyed that your Python bot is spamming their site.
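A minimal sketch of the browser-pretending route with mechanize; the URL, form and field names are all hypothetical and have to be found by inspecting the real upload page:

import mechanize  # pip install mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("https://example-image-host.test/upload")  # hypothetical upload page
br.select_form(nr=0)  # or select by name once you've inspected the page

# "image" is a hypothetical field name; find the real one in the form's HTML.
br.form.add_file(open("photo.jpg", "rb"), "image/jpeg", "photo.jpg", name="image")
response = br.submit()
html = response.read()  # parse this "upload complete" page for the hosted image URL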
