BeautifulSoup and mechanize to get AJAX call result - Python

Hi, I'm building a scraper using Python 2.5 and BeautifulSoup,
but I've stumbled upon a problem... part of the web page is generated
after the user clicks a button, which starts an AJAX request by calling a specific JavaScript function with the proper parameters.
Is there a way to simulate that user interaction and get the result? I came across the mechanize module, but it seems to me that it is mostly used to work with forms...
I would appreciate any links or code samples.
Thanks

OK, so I have figured it out... it was quite simple once I realised I could use a combination of urllib, urllib2 and BeautifulSoup:
import urllib, urllib2
from BeautifulSoup import BeautifulSoup as bs_parse
# url is the AJAX endpoint and values are its parameters, both found by
# reading the page's JavaScript
data = urllib.urlencode(values)
req = urllib2.Request(url, data)  # sending a body makes this a POST
res = urllib2.urlopen(req)
page = bs_parse(res.read())

No, you can't do that easily. AFAIK your options are, easiest first:
Read the AJAX JavaScript code yourself, as a human programmer, understand it, and then write Python code to simulate the AJAX calls by hand. You can also use capture software to record live requests/responses and try to reproduce them in code (a sketch follows this list);
Use Selenium or some other browser automation tool to fetch the page in a real web browser;
Use a Python JavaScript runner like spidermonkey or PyV8 to run the JavaScript code, and hook it up to your copy of the HTML DOM.
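For the first option, here is a minimal Python 2 sketch of replaying a captured AJAX call. The endpoint URL, the parameters, and the header are hypothetical; substitute whatever your capture tool's network log shows.
import urllib, urllib2
# hypothetical endpoint and parameters, taken from a capture session
params = urllib.urlencode({'page': 2, 'category': 'news'})
req = urllib2.Request('http://example.com/ajax/endpoint', params)
# many AJAX endpoints check for this header before responding
req.add_header('X-Requested-With', 'XMLHttpRequest')
res = urllib2.urlopen(req)
print res.read()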

Related

Requests: how to use two different web sites and switch between them?

Hello, is there a way to use two different website URLs and switch between them?
I mean I have two different websites, like:
import requests
session = requests.Session()
firstPage = session.get("https://stackoverflow.com")
print("Hey! I'm on the first page now!")
secondPage = session.get("https://youtube.com")
print("Hey! I'm on the second page now!")
I know a way to do it in Selenium, like this: driver.switch_to.window(driver.window_handles[1])
But I want to do it in "Requests", so is there a way to do it?
Selenium and Requests are two fundamentally different tools. Selenium is a browser automation tool that drives a real (optionally headless) browser and fully simulates a user. Requests is a Python library that simply sends HTTP requests.
Because of this, Requests is particularly good for scraping static data and data that does not involve JavaScript rendering (through jQuery or similar), such as RESTful APIs, which often return JSON-formatted data (with no HTML styling, or page rendering at all). With Requests, after the initial HTTP request is made, the data is saved in a response object and the connection is closed.
Selenium allows you to traverse complex, JavaScript-rendered menus and the like, since each page is actually built (under the hood) as if it were being displayed to a user. Selenium encapsulates everything your browser does except displaying the rendered page (including the HTTP requests that Requests is built to perform). After connecting to a page with Selenium, the browser session remains open, which lets you navigate through a complex site where, with Requests, you would need the full URL of the final page.
Because of this distinction, it makes sense that Selenium has a switch_to.window method but Requests does not. The way your code is written, you can access the response to each HTTP GET call directly through your variables (firstPage contains the response from stackoverflow, secondPage contains the response from youtube). While using Requests, you are never "in" a page the way you can be with Selenium, since it is an HTTP library and not a full browser.
Depending on what you're looking to scrape, either Requests or Selenium may be the better fit.
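For illustration, a minimal sketch using the URLs from the question: with Requests there is nothing to switch, because each response simply lives in its own variable.
import requests

session = requests.Session()
firstPage = session.get("https://stackoverflow.com")
secondPage = session.get("https://youtube.com")

# "switching" is just choosing which response object you read from
print(firstPage.url, firstPage.status_code)
print(secondPage.url, secondPage.status_code)
print(firstPage.text[:200])  # the stackoverflow HTML is still here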

BeautifulSoup request is returning an empty list from LinkedIn.com/jobs

I'm new to BeautifulSoup and web scraping, so please bear with me.
I'm using Beautiful Soup to pull all job post cards from LinkedIn with the title "Security Engineer". After using inspect element on https://www.linkedin.com/jobs/search/?keywords=security%20engineer on an individual job post card, I believe I have found the correct 'li' element and class. The code runs, but it returns an empty list '[ ]'. I don't want to use any APIs because this is an exercise for me to learn web scraping. Thank you for your help. Here's my code so far:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find_all('li', class_="jobs-search-results__list-item occludable-update p0 relative ember-view")
print(jobs)
As @baduker mentioned, using plain requests won't do all the heavy lifting that browsers do.
Whenever you open a page in your browser, the browser renders the visuals, makes extra network calls, and runs JavaScript. The first thing it does is load the initial response, which is all you're doing with requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').
The page you see in your browser is the result of many, many more requests.
The reason your list is empty is that the HTML you get back is very minimal. You can print it out to the console and compare it to what the browser shows.
To make things easier, instead of using requests you can use Selenium, which is essentially a library for programmatically controlling a browser. Selenium will make all those requests for you like a normal browser and let you access the page source as you were expecting it to look.
This is a good place to start, but your scraper will be slow. There are things you can do in Selenium to speed it up, like running in headless mode, which means the page is never rendered graphically, but it still won't be as fast as figuring out how to do it on your own with requests.
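A minimal sketch of that approach, assuming chromedriver is installed and on your PATH:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # skip graphical rendering for speed
driver = webdriver.Chrome(options=options)
driver.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer')
html = driver.page_source  # now includes the JS-rendered markup
driver.quit()

soup = BeautifulSoup(html, 'lxml')
jobs = soup.find_all('li')  # narrow this down with the class you inspected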
If you want to do it using requests, you're going to need to do a lot of snooping through the network traffic, maybe using a tool like Postman, and work out how to simulate the necessary steps to get the data from whatever page.
For example, some websites have a handshake process when logging in.
A website I've worked on goes like this (a hypothetical sketch follows the steps):
(Step 0, really) Set up request headers, because the site doesn't seem to respond unless a User-Agent header is included
Fetch the initial HTML and get a unique key from a hidden element in a <form>
Using this key, make a POST request to the URL from that form
Get a session id key from the response
Set up another POST request that combines username, password, and session id. The URL was in some JavaScript function, but I found it using the network inspector in the devtools
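A hypothetical requests sketch of those steps; every URL, field name, and key below is made up, so find the real ones in the network inspector:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'  # step 0

# step 1: fetch the initial HTML and pull the unique key out of the form
resp = session.get('https://example.com/login')
soup = BeautifulSoup(resp.text, 'html.parser')
key = soup.find('input', {'name': 'form_key'})['value']
action = soup.find('form')['action']

# step 2: POST the key to the URL from that form
resp = session.post('https://example.com' + action, data={'form_key': key})

# step 3: read the session id key from the response
session_id = resp.json()['sessionid']

# step 4: another POST combining username, password, and session id
resp = session.post('https://example.com/auth',
                    data={'username': 'me', 'password': 'secret',
                          'sessionid': session_id})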
So really, I just stick with Selenium if the site is too complicated and I'm only fetching the data once or not very often. I'll go through the heavy stuff if I'm building a scraper for an API that others will use frequently.
Hope some of this made sense to you. Happy scraping!

Python Requests Data Not in Source Page (requests.get)

I cannot scrape data from the following link: requests only sees the page source, and on this site the required data is not in the initial page source that Python receives.
I am using the requests.get method.
https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22parameters%22%3A%7B%22value%22%3A%22hbb%22%7D%2C%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22node_id%22%3A0%7D%2C%22return_type%22%3A%22entry%22%2C%22request_options%22%3A%7B%22pager%22%3A%7B%22start%22%3A0%2C%22rows%22%3A25%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%7D%2C%22request_info%22%3A%7B%22src%22%3A%22ui%22%2C%22query_id%22%3A%229185f4458c49741d6003f0a9aa8935c2%22%7D%7D
Can anyone help me out?
Thanks in advance.
That site is built dynamically using JavaScript. You will have to use something like Selenium, which runs a Chrome browser in the background to execute the JavaScript. The requests-HTML module does the same thing in a way familiar to requests users.
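A minimal requests-HTML sketch (pip install requests-html), where render() launches a headless Chromium to execute the page's JavaScript:
from requests_html import HTMLSession

url = 'https://www.rcsb.org/search?request=...'  # the full search URL from the question
session = HTMLSession()
r = session.get(url)
r.html.render()           # downloads Chromium on first use, then runs the JS
print(r.html.html[:500])  # the rendered markup now contains the results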

Using Beautiful Soup to parse Edabit - Python

I'm trying to write code to get the amount of XP obtained by completing Edabit's challenges by parsing the individual URL associated with a user on the site. Here's what I have:
from bs4 import BeautifulSoup
import requests
url = "https://edabit.com/user/xHRGAqa56TcXTLEMW"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
div = soup.find(id="react-root")
print(div)
The find is returning None, but I really don't know why. I think the site was made with Meteor, and that may be causing the problem.
Any help much appreciated.
This happens when there is dynamic content on the website that is only loaded when JavaScript is executed in the browser.
You can look at the page source of the webpage in your browser to see whether the tag is there or not.
Since your script is not a browser but just a program fetching the webpage from the website, that dynamic content never shows up in your script.
If you want that JavaScript to be executed for your script, you can set up something like a Splash server, as sketched below.
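A sketch of the Splash approach, assuming a Splash server is running locally (for example via docker run -p 8050:8050 scrapinghub/splash):
import requests
from bs4 import BeautifulSoup

resp = requests.get('http://localhost:8050/render.html',
                    params={'url': 'https://edabit.com/user/xHRGAqa56TcXTLEMW',
                            'wait': 2})  # give the page's JavaScript time to run
soup = BeautifulSoup(resp.text, 'html.parser')
div = soup.find(id='react-root')  # should now contain rendered content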
Another way is to check the network requests that the JavaScript makes in your browser to load that content (usually an API request) and make those same requests yourself to get the content from the API directly, instead of crawling it from the browser.
Hope it helps.
A None output means soup.find didn't match any id you searched for. Inspect the HTML file once again carefully; it may work then.

Python BeautifulSoup && Request to scrape search engines

I'm a little confused about how to do this. I'm not sure if this is correct, but I'm trying to query a search via a URL. I've tried doing this:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://duckduckgo.com/dogs?ia=meanings'
session = requests.Session()
r = session.get(url)
soup = bs(r.content, 'html.parser')
I get some HTML back from the response; however, when I look for all the links, it comes up with nothing besides the original search URL.
links = soup.find_all('a')
for link in links:
    print(link)
When I do the search in a browser and inspect the HTML, all the links exist, but for some reason they are not coming back to me via my request.
Anyone have any ideas? I'm trying to build a web-scraping application, and I thought this would be something really easy that I could incorporate into my terminal.
The problem is that the search results and most of the page are dynamically loaded by JavaScript code executed by the browser. requests only downloads the initial static HTML page; it has no JS engine since it is not a browser.
You have basically 3 main options:
use the DuckDuckGo API (there are Python wrappers; please recheck which is best maintained) - this option is preferred
load the page in a real browser through Selenium and then parse the HTML, which is now the same complete HTML that you see in your browser (see the sketch after this list)
try to explore what requests are made to load the page, and mimic them in your BeautifulSoup+requests code. This is the hardest and most fragile approach, and may involve complex logic and JavaScript code parsing.
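A minimal sketch of the Selenium option, assuming chromedriver is installed; note that DuckDuckGo expects the search query in the q parameter:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://duckduckgo.com/?q=dogs&ia=meanings')
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# the JS-rendered result links are now present in the parsed HTML
for link in soup.find_all('a'):
    print(link.get('href'))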
