I was making a simple Python program to update me about my Instagram likes and followers. I don't want to open a browser every time it runs, so I'm using requests and BeautifulSoup to get the page data. But the GET request gives me markup that is totally different from the actual page source (the source I see in the browser). The classes I need aren't there, so I can't use BeautifulSoup's find() method on it. It seems I'm receiving different markup because Instagram detects that a robot is accessing the page.
Is there any other module or method I can use?
I'm trying to scrape certain websites using requests & BeautifulSoup. The issue I'm having is that a GET via requests doesn't return the same result as a browser does. I know that requests doesn't execute JavaScript, but I'm trying to work around that. If I disable JavaScript in Chrome and go to an Instagram profile, the page still renders to the best of its ability. It doesn't fully load, but it still includes basic details like a <title></title> containing the username of the profile. However, when I send a GET request to the same URL in Python, the <title></title> element just comes back as <title>Instagram</title>.
Why am I getting a different result?
For example, if I go to https://instagram.com/test in Chrome (with JavaScript disabled), I get a blank page but still get <title>Zac (#test) • Instagram photos and videos</title>; if I use requests.get("https://instagram.com/test"), the response contains <title>Instagram</title>.
I've tried playing around with requests, and even tried switching to another library like requests-html, but it always returns a different result than the browser does.
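Here's a minimal snippet that reproduces the behaviour I'm describing (assuming requests and beautifulsoup4 are installed):

import requests
from bs4 import BeautifulSoup

# Same URL that shows the profile title in Chrome with JavaScript disabled
resp = requests.get("https://instagram.com/test")
soup = BeautifulSoup(resp.text, "html.parser")

# Prints <title>Instagram</title> instead of the profile's title
print(soup.title)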
Instead of scraping Instagram, you might want to use its API.
https://developers.facebook.com/docs/instagram-api/
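For a rough idea of what such a call looks like with requests, here is a sketch only: the exact endpoint and available fields depend on the API flavour and your app setup (this one resembles the Basic Display style), and ACCESS_TOKEN is a placeholder you obtain by following the linked docs.

import requests

# Sketch only; endpoint and fields vary by API flavour, see the linked docs.
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder from the app setup

resp = requests.get(
    "https://graph.instagram.com/me",
    params={"fields": "id,username,media_count", "access_token": ACCESS_TOKEN},
)
print(resp.json())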
Hello, is there a way to use two different website URLs and switch between them?
I mean, I have two different websites, like:
import requests

session = requests.Session()
firstPage = session.get("https://stackoverflow.com")
print("Hey! I'm on the first page now!")
secondPage = session.get("https://youtube.com")
print("Hey! I'm on the second page now!")
I know a way to do it in Selenium, like this: driver.switch_to.window(driver.window_handles[1])
but I want to do it with Requests, so is there a way to do it?
Selenium and Requests are two fundamentally different tools. Selenium drives a full browser (which can run headless) and fully simulates a user. Requests is a Python library that simply sends HTTP requests.
Because of this, Requests is particularly good for scraping static data and data that does not involve JavaScript rendering (through jQuery or similar), such as RESTful APIs, which often return JSON-formatted data (with no HTML styling or page rendering at all). With Requests, after the initial HTTP request is made, the data is saved in a response object and the connection is closed.
Selenium allows you to traverse complex, JavaScript-rendered menus and the like, since each page is actually built (under the hood) as if it were being displayed to a user. Selenium does everything your browser does short of displaying the HTML (including making the HTTP requests that Requests is built to perform). After connecting to a page with Selenium, the session remains open, which lets you navigate through a complex site where you would otherwise need the full URL of the final page to use Requests.
Because of this distinction, it makes sense that Selenium has a switch_to.window method and Requests does not. The way your code is written, you can access the response to each GET call directly through your variables (firstPage contains the response from stackoverflow.com, secondPage contains the response from youtube.com). With Requests you are never "in" a page the way you can be with Selenium, since it is an HTTP library, not a full browser.
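For example, a minimal sketch building on your snippet: both responses stay usable side by side, so there is nothing to "switch" between.

import requests

session = requests.Session()
firstPage = session.get("https://stackoverflow.com")
secondPage = session.get("https://youtube.com")

# Both responses are ordinary objects that remain available together;
# there is no notion of a "current page" to switch to.
print(firstPage.status_code, len(firstPage.text))
print(secondPage.status_code, len(secondPage.text))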
Depending on what you're looking to scrape, it might be better to use either Requests or Selenium.
I want to create a bot that scrapes all images from a website and sends them to me in return.
What I had thought was to write a Python scraper with BeautifulSoup that gets the image URLs, and then call bot.send_photo(chat_id, 'your URL') in a for loop over every URL.
The thing is, I don't actually know if it can be done this way, or if it has to use only Telegram's own functions to work on mobile.
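For what it's worth, the approach described can work. Here is a rough sketch, assuming pyTelegramBotAPI (the telebot package, which is what the bot.send_photo signature suggests); BOT_TOKEN, CHAT_ID and the target site are placeholders:

import requests
import telebot
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BOT_TOKEN = "YOUR_BOT_TOKEN"  # placeholder
CHAT_ID = 123456789           # placeholder

bot = telebot.TeleBot(BOT_TOKEN)

page = requests.get("https://example.com")  # placeholder site
soup = BeautifulSoup(page.text, "html.parser")

# Collect absolute image URLs and send each one to the chat
for img in soup.find_all("img"):
    src = img.get("src")
    if src:
        bot.send_photo(CHAT_ID, urljoin(page.url, src))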
I am trying to extract data from https://shop.nordstrom.com/ for all the products (shirts, t-shirts and so on). The page is dynamically loaded. I know I could use Selenium with a headless browser, but that is time-consuming, and inspecting the elements, which have strange IDs and class names, doesn't look promising either.
So I thought of looking in the Network tab to find the path to the API the data is loaded from (an XHR request), but I could not find anything helpful. Is there a way to get the data from this website?
If you don't want to use Selenium, the alternative is to parse the HTML with something like bs4, or simply use the requests module.
You are on the right path in looking for the call to the API. XHR requests can be seen under the Network tab, but the multitude of resources that appears there makes the requests hard to pick apart. A simple way around this is the following method:
Instead of the Network tab, go to the Console tab. There, click the settings icon and tick only the option "Log XMLHttpRequests".
Now refresh the page and scroll down to trigger the dynamic calls. You will see the logs of all XHRs much more clearly.
For example:
(index):29 Fetch finished loading: GET "https://shop.nordstrom.com/api/recs?page_type=home&placement=HP_SALE%2CHP_TOP_RECS%2CHP_CUST_HIS%2CHP_AFF_BRAND%2CHP_FTR&channel=web&bound=24%2C24%2C24%2C24%2C6&apikey=9df15975b8cb98f775942f3b0d614157&session_id=0&shopper_id=df0fdb2bb2cf4965a344452cb42ce560&country_code=US&experiment_id=945b2363-c75d-4950-b255-194803a3ee2a&category_id=2375500&style_id=0%2C0%2C0%2C0&ts=1593768329863&url=https%3A%2F%2Fshop.nordstrom.com%2F&zip_code=null".
Making a GET request to that URL returns a bunch of JSON objects. You can now use this URL, and others you can derive from it, to request the data directly.
See the answer here on how to integrate such a URL with the requests module to fetch data.
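As a rough illustration, a minimal sketch using the endpoint from the log above; the query parameters shown here are only a subset, so take the full set (apikey, session_id and the rest) from your own XHR log:

import requests

# Endpoint copied from the XHR log above; fill in the remaining query
# parameters (apikey, session_id, ...) from your own Console log.
url = "https://shop.nordstrom.com/api/recs"
params = {"page_type": "home", "channel": "web", "country_code": "US"}
headers = {"User-Agent": "Mozilla/5.0"}  # some APIs reject the default UA

resp = requests.get(url, params=params, headers=headers)
data = resp.json()  # the JSON objects mentioned above
print(type(data))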
I am using Python with Selenium WebDriver to browse through a website.
Now I have the problem that I want to monitor the XHR (AJAX) calls fired on the current site.
For example: I am on a specific page and the Python Selenium script clicks a button on it. That button fires an AJAX request within the site.
Is it possible to monitor this XHR request and get hold of the called AJAX URL in my Python script?
Thanks!
UPDATE: I want to do exactly something like this (but in Python, obviously):
https://dmitripavlutin.com/catch-the-xmlhttp-request-in-plain-javascript/
You can use browser.execute_script to catch the calls, as explained in the link you mentioned. In addition, start a fake website using Django on a separate thread, and in the JavaScript handler (sendReplacement) replace the URL with that of your Django server.
On that server you will receive the AJAX call and be able to examine it.
You may be able to implement a simpler solution, without the Django server, by simply monitoring the calls directly from the JavaScript snippet and making it return the value you want. But if you need to monitor many calls and perform more complex examination of the requests, the former solution is more powerful.
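A minimal sketch of the simpler, in-page variant: the snippet patches XMLHttpRequest.open via execute_script so every XHR method and URL is recorded, and the Python script reads the log back afterwards (the page URL and the _xhrLog name are placeholders):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder page

# Patch XMLHttpRequest.open so every XHR method/URL is recorded in the page
driver.execute_script("""
    window._xhrLog = [];
    var origOpen = XMLHttpRequest.prototype.open;
    XMLHttpRequest.prototype.open = function(method, url) {
        window._xhrLog.push({method: method, url: url});
        return origOpen.apply(this, arguments);
    };
""")

# ... click the button that triggers the AJAX call here ...

# Read the recorded calls back into Python
calls = driver.execute_script("return window._xhrLog;")
print(calls)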