I'm trying to scrape all news items from this website. They are not showing in the source code: http://www.uvm.dk/aktuelt
I've tried using Firefox's Live HTTP Headers and Chrome's developer tools but still can't figure out what goes on behind the scenes. I'm sure it's pretty simple :-)
I have this information, but how do I use it to scrape the news I want?
http://www.uvm.dk/api/search
Request Method:POST
Connection: keep-alive
PageId=8938bc1b-a673-4513-80d1-e1714ca93d7c&Term=&Years%5B%5D=2017&WorkAreaIds=&SubjectIds=&TemplateIds=&NewsListIds%5B%5D=Emner&TimeSearch%5BEvaluation%5D=&FlagSearch%5BEvaluation%5D=Alle&DepartmentNames=&Letters=&RootItems=&Language=da&PageSize=10&Page=1
Can anyone help?
Not a direct answer but some hints.
Your approach with Live HTTP Headers is a good one. Open the sidebar before loading the home page and clear everything. Then load the home page and an article. There will usually be a ton of HTTP requests because of images, CSS and JS, but you'll be able to find the few useful ones. Usually the very first is for the home page, and one somewhere below is the article's main page. Another interesting one is the request sent when you click "next page".
I like to decouple download (HTTP) and scraping (HTML or JSON or so).
I download to a file with a first script and scrape with a second one.
First, because I want to be able to adjust the scraping without downloading again and again. Second, because I prefer to use bash+curl to download and python+lxml to scrape. If I need information from the scraping to continue downloading, my scraping script prints it to the console.
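The download half of that split can be sketched in Python for the search endpoint from the question. This is only a sketch under assumptions: it uses Python's standard urllib rather than bash+curl, the helper names `build_form` and `download` are made up for illustration, and it is not known whether the server also expects particular headers or cookies. The response format is not assumed either, so the body is saved as raw bytes for a separate scraping script to parse.

```python
from pathlib import Path
from urllib.parse import parse_qsl, urlencode
from urllib.request import Request, urlopen

# URL and form body copied from the request captured in the question.
API_URL = "http://www.uvm.dk/api/search"
RAW_FORM = (
    "PageId=8938bc1b-a673-4513-80d1-e1714ca93d7c&Term=&Years%5B%5D=2017"
    "&WorkAreaIds=&SubjectIds=&TemplateIds=&NewsListIds%5B%5D=Emner"
    "&TimeSearch%5BEvaluation%5D=&FlagSearch%5BEvaluation%5D=Alle"
    "&DepartmentNames=&Letters=&RootItems=&Language=da&PageSize=10&Page=1"
)


def build_form(raw_form, page):
    """Decode the captured form body and swap in the wanted page number."""
    fields = parse_qsl(raw_form, keep_blank_values=True)
    return [(key, str(page) if key == "Page" else value) for key, value in fields]


def download(page, out_dir):
    """POST the form for one page and save the raw response body to a file."""
    body = urlencode(build_form(RAW_FORM, page)).encode("ascii")
    request = Request(API_URL, data=body, method="POST")
    with urlopen(request, timeout=30) as response:
        raw = response.read()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"page_{page}.raw"
    path.write_bytes(raw)
    return path
```

Calling `download(1, "downloads")`, then `download(2, "downloads")` and so on, would fill a folder that the scraping script can re-read as often as needed without hitting the server again.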
I've done a ton of research online, read through many of the similar questions on Stack Overflow, but can't find anything too useful.
I'm trying to scrape some information off a housing website for a research paper. I don't think I can use requests, because the username and password fields have no "name" attribute and the site requires login, so I am trying to use Selenium. The website uses infinite scrolling to display its information.
I see all of the information I need is in the XHR tab in the developer tools:
In fact, the preview tab has exactly the information I need:
The recommendlist pops up as I scroll on the page.
I was just wondering if there was any way to access the information in this tab using Python, so I can parse and analyze the data?
Is there a different approach I'm not thinking of?
Thanks!
Follow the steps:
Open the web console.
Go to the Network tab and filter by XHR.
Select the request; at the top you will see "Request URL".
Copy it and make requests to this URL via requests/selenium.
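The steps above can be sketched as follows. This is a minimal sketch, not a tested solution for the site in question: the helper names are hypothetical, the standard urllib is used so the example has no third-party dependencies, and it assumes the XHR endpoint returns JSON and accepts the login cookies of an existing browser session.

```python
import json
from urllib.request import Request, urlopen


def cookie_header(cookies):
    """Turn a list of {'name': ..., 'value': ...} dicts into a Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def build_xhr_request(url, cookies=()):
    """Build a GET request that looks like the browser's XHR call."""
    headers = {
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    }
    if cookies:
        headers["Cookie"] = cookie_header(cookies)
    return Request(url, headers=headers)


def fetch_json(url, cookies=()):
    """Send the request and decode the JSON body (assumes a JSON response)."""
    with urlopen(build_xhr_request(url, cookies), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

If the login is done with Selenium, `driver.get_cookies()` returns a list of dicts with `name` and `value` keys, which is the shape `cookie_header` expects.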
I am new to web scraping and building crawlers, and I started practicing on a grocery website.
I've been trying to crawl data from a website for quite some time and couldn't get past three pages. For the first three pages the website lets me access the data, but after that I don't get any response, and for a few seconds I stop getting a response in the browser as well. The website uses an API to get all the data, so I can't even use BeautifulSoup. I thought of using Selenium, but no luck there either.
I am using Python's requests library to get the data and json to parse it. The website requires the POST method to access all the products, so I am sending cookies, headers and params as well, and using the same cookies etc. for the next pages too.
I am looking for some general responses if anyone went through the same situation and got a workaround maybe.
Thank you.
Here is how you can unblock this website. (Sorry, I can't provide the code because it is unlikely to function without my location details, so follow the method below to get the code yourself.)
Open that link in Google Chrome > open Developer Tools by pressing Ctrl + Shift + I > go to the Network tab. There, go to XHR and find 'details'. This looks like:
Right-click on it and choose Copy as cURL (bash).
Go to Curl to Requests, paste the command, and press Enter. The curl command gets converted to requests code. Copy that and run it.
In that code, the last line will look like:
response = requests.post('https://www.kroger.com/products/api/products/details', headers=headers, cookies=cookies, data=data)
This sends the request.
Now, after this, we extract what we require:
data = response.json() # saving as a dictionary
product = data['products'] # getting the product
Now from this scraped data, take whatever you need. Happy Coding :)
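Since the question is about being cut off after a few pages, one thing to try is spacing the page requests out. The sketch below is an assumption-laden illustration, not a known fix for this site: `collect_products` and `fetch_page` are hypothetical names, `fetch_page` stands for whatever function wraps the converted `requests.post(...)` call for one page, and only the `'products'` key is taken from the answer above.

```python
import time


def collect_products(fetch_page, pages, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch_page(page) for each page, pausing between calls.

    fetch_page is assumed to return the decoded JSON dictionary for one page.
    """
    products = []
    for index, page in enumerate(pages):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        data = fetch_page(page)
        # Missing or empty 'products' is treated as "nothing on this page".
        products.extend(data.get("products", []))
    return products
```

Passing `sleep` as a parameter keeps the pause easy to shorten or replace when trying the function out with a fake `fetch_page`.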
I've read some relevant posts here but couldn't figure an answer.
I'm trying to crawl a web page with reviews. When the site is visited there are only 10 reviews at first, and a user has to press "Show more" to get 10 more reviews (which also adds #add10 to the end of the site's address) every time they scroll down to the end of the review list. Actually, a user can get the full review list by adding #add1000 (where 1000 is the number of additional reviews) to the end of the site's address. The problem is that I get only the first 10 reviews using site_url#add1000 in my spider, just like with site_url, so this approach doesn't work.
I also can't find a way to make an appropriate Request imitating the original one from the site. The original AJAX URL is of the form 'domain/ajaxlst?par1=x&par2=y' and I tried all of these:
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all)
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
headers={all_headers}, cookies={all_cookies})
But every time I'm getting a 404 Error. Can anyone explain what I'm doing wrong?
What you need for this is a headless browser, since a plain HTTP request does not run the page's JavaScript that performs the AJAX call.
Selenium is one tool that can drive such a browser.
For example:
from selenium.webdriver.common.by import By  # find_element_by_id was removed in Selenium 4
driver.find_element(By.ID, "show more").click() # This is just an example case
Normally, when you scroll down the page, an AJAX call sends a request to the server, and the server responds with a JSON/XML file that your browser uses to refresh the page.
You need to figure out the URL of this JSON/XML file. You can open Firefox, go to Tools > Web Developer > Web Console, monitor the network activity, and easily catch this JSON/XML file.
Once you find this file, you can parse the reviews directly from it (I recommend the Python modules requests and bs4 for this) and save a huge amount of time. Remember to use some different clients and IPs. Be nice to the server and it won't block you.
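One detail worth checking in the question: the part of a URL after `#` is a fragment, which is handled inside the browser and is not sent to the server, so a spider requesting `site_url#add1000` asks for the same page as `site_url`. The sketch below shows this with the standard library and builds an absolute AJAX URL. The function names, `example.com`, and the parameter values are placeholders, and nothing here is claimed to explain the 404.

```python
from urllib.parse import urldefrag, urlencode, urljoin


def url_sent_to_server(url):
    """Return the URL without its #fragment, which is what an HTTP client requests."""
    return urldefrag(url).url


def build_ajax_url(page_url, path, params):
    """Resolve the AJAX path against the page URL and append the query string."""
    return urljoin(page_url, path) + "?" + urlencode(params)


# Placeholder values for illustration only.
page = "http://example.com/reviews#add1000"
print(url_sent_to_server(page))
print(build_ajax_url(page, "/ajaxlst", {"par1": "x", "par2": "y"}))
```

The second URL is the kind of absolute address to pass to `Request(url=...)`, using the real path and parameters seen in the browser's network monitor.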
I need to develop a web app that extracts book prices from different e-commerce sites such as amazon and homeshop18 when the user enters a book name in the interface, and displays all the information.
My questions are:
1) How do I pass that query to the amazon site's search box so that I get only the pages relevant to the query instead of crawling the whole site?
2) What can be used to develop this application: BeautifulSoup or Scrapy? APIs are not available for all e-commerce sites.
I am new to Python, so any help will be highly appreciated.
I personally use BeautifulSoup to parse web pages, but beware that it's a bit slow if you have to parse pages in bulk. I know that lxml is faster but a bit less coder-friendly.
To guess the right parameters (either for an HTTP GET or POST) for getting the result page you want, you should proceed like this:
Switch on the firebug plugin for Firefox or the integrated inspector for Chrome
Go on the web page you're interested in, and do the search
Go into firebug/inspector to see the parameters of the HTTP request Firefox or Chrome sent to the website.
Reproduce the request in your Python script, for example using urllib
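The last step can be sketched with Python 3's urllib. The base URL and the `q` parameter in the usage note are placeholders, not the real search interface of any site; take the actual URL and parameter names from what the inspector shows. The `User-Agent` value is likewise just an example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_request(base_url, params, user_agent="Mozilla/5.0"):
    """Build a GET request whose query string carries the search parameters."""
    url = base_url + "?" + urlencode(params)
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(request):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=30) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

For example, `build_search_request("https://example.com/search", {"q": "learning python"})` produces a request for `https://example.com/search?q=learning+python`. For a POST, the same encoded parameters would go into the `data` argument of `Request` as bytes instead of the URL.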
There is another way to guess the right HTTP GET or POST parameters: use a network analyzer like Wireshark. This is a more detailed approach, but it feels more like finding a needle in a haystack once you have used the tools in Firefox/Chrome.
I'd like to know if there is a way to get information from my banking website with Python. I'd like to retrieve my card history and display it, and possibly save it into a text document each month.
I have found the URLs etc. to log in and get the information from the website, which works from a browser, but I have been using urllib2 to "open" the webpages from Python and I have a feeling it's not working because of some cookie or session issue.
I can get any information I want from a website that does not require a login with urllib2, and then save the actual HTML and go through it later, but I can't on my bank's website.
Any help would be appreciated.
This is a part of web scraping:
Web scraping is a standard task that can serve various needs.
Scraping data from a secure website means using HTTPS.
Handling HTTPS is not a problem with mechanize and BeautifulSoup.
urllib2 with HTTPCookieJar also works fine.
If managing the cookies is the problem, then I would recommend mechanize.
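For reference, the cookie-jar approach looks roughly like this in Python 3, where urllib2 became `urllib.request` and the cookie jar lives in `http.cookiejar`. This is a sketch only: the function names are made up, and the login URL and form field names are hypothetical and must be read from the real login form.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener


def make_session():
    """Create an opener that stores cookies and sends them back on later requests."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def encode_form(fields):
    """Encode form fields as an application/x-www-form-urlencoded body."""
    return urlencode(fields).encode("utf-8")


def login(opener, login_url, fields):
    """POST the login form; any session cookie set by the server lands in the jar."""
    with opener.open(login_url, data=encode_form(fields), timeout=30) as response:
        return response.status
```

After a successful `login(...)`, later calls to `opener.open(...)` with the same opener send the stored session cookies automatically, which is the part that a bare `urlopen` call does not do.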
Considering the case of your bank's site:
I would recommend not playing with your account.
If you must, then it's not as easy as with any normal secure/non-secure site.
These sites are designed to withstand such scripts.
Problems that you would face with this:
Bank sites will surely have a CAPTCHA, which is almost impossible to bypass with a script unless you employ a lot of rocket science and effort.
The other problem that you will definitely face is JavaScript. Standard scripting solutions focus on managing cookies, HTML parsing, etc. To handle JavaScript-driven links you will have to process JS in your Python script, which again needs a lot of effort.
Then there is AJAX, which again comes from JavaScript and fetches data from the server after page load.
So it will take a lot of effort to do this task.
Also, if you try doing this you risk blocking access to your account, since banking sites are quick to block account access after 3-4 unsuccessful attempts at login or CAPTCHA, etc.
So, think before you do.