Different Twitter HTML structure for browsers and python web opener - python

I'm working on a script that downloads some data from Twitter profiles. I found out that the HTML structure is different in a web browser than in my Python "robot": when I open the page through urllib2 and BeautifulSoup I get different tag IDs and classes. Is there a way to get the same content as in a web browser?
I need this for resolving short URLs, because in the browser the resolved URLs are stored in the link's title attribute.

Most websites adapt their response according to the User-Agent header on the request. If none is set, it is obvious that this is not a browser, but some sort of script. You'll probably want to set a User-Agent header that is somewhat similar to a "real" browser.
Lots of methods to do this are described here: Changing user agent on urllib2.urlopen and here: Fetch a Wikipedia article with Python
On an unrelated note, you might want to use Requests, which has a much nicer API than the standard urllib2.
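For example, a minimal sketch with Requests and BeautifulSoup, assuming a placeholder profile URL and a typical desktop User-Agent string (any recent browser string should do):
import requests
from bs4 import BeautifulSoup

# Pretend to be a regular browser by sending a typical User-Agent header.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

url = "https://twitter.com/some_profile"  # placeholder profile URL
response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# If the server now returns the browser-style markup, the resolved URLs
# should show up in the title attribute of the anchor tags, as described above.
for link in soup.find_all("a", title=True):
    print(link["title"])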

Don't screen-scrape for Twitter profile information. Use the API; your whole program will be much more robust. Changing your user agent to work around their markup is probably against their TOS as well.
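As a rough illustration, here is a minimal sketch against the v2 "user by username" endpoint; the bearer token and username are placeholders, and you need a developer account to get real credentials:
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder: create one in the developer portal
USERNAME = "some_profile"           # placeholder profile name

# v2 user lookup; user.fields asks for a few extra profile fields.
url = f"https://api.twitter.com/2/users/by/username/{USERNAME}"
params = {"user.fields": "description,url,public_metrics"}
headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}

response = requests.get(url, headers=headers, params=params, timeout=10)
response.raise_for_status()
print(response.json())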

Related

Requests: How to use two different web sites and switch between them?

Hello, is there a way to use two different website URLs and switch between them?
I mean I have two different websites, like:
import requests
session = requests.Session()
firstPage = session.get("https://stackoverflow.com")
print("Hey! I'm on the first page now!")
secondPage = session.get("https://youtube.com")
print("Hey! I'm on the second page now!")
I know a way to do it in Selenium, like this: driver.switch_to.window(driver.window_handles[1])
but I want to do it in Requests, so is there a way to do it?
Selenium and Requests are two fundamentally different tools. Selenium is a browser automation framework that drives a real (optionally headless) browser and fully simulates a user. Requests is a Python library which simply sends HTTP requests.
Because of this, Requests is particularly good for scraping static data and data that does not involve javascript rendering (through jQuery or similar), such as RESTful APIs, which often return JSON formatted data (with no HTML styling, or page rendering at all). With Requests, after the initial HTTP request is made, the data is saved in an object, and the connection is closed.
Selenium allows you to traverse through complex, javascript-rendered menus and the like, since each page is actually built (under the hood) as if it were being displayed to a user. Selenium encapsulates everything that your browser does except displaying the HTML (including the HTTP requests that Requests is built to perform). After connecting to a page with Selenium, the connection remains open. This allows you to navigate through a complex site where you would need the full URL of the final page to use Requests.
Because of this distinction, it makes sense that Selenium would have a switch_to_window method, but Requests would not. The way your code is written, you can access the response to the HTTP get calls which you've made directly through your variables (firstPage contains the response from stackoverflow, secondPage contains the response from youtube). While using Requests, you are never "in" a page in the sense that you can be with Selenium, since it is an HTTP library and not a full browser.
Depending on what you're looking to scrape, it might be better to use either Requests or Selenium.
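To make the difference concrete, here is a minimal sketch (reusing the variable names from the question) of how you "switch" pages with Requests: you just keep both Response objects around and read whichever one you need.
import requests

session = requests.Session()
firstPage = session.get("https://stackoverflow.com")
secondPage = session.get("https://youtube.com")

# There is no "active" page: each response is just data in memory,
# so switching between sites means reading a different variable.
print(firstPage.status_code, len(firstPage.text))
print(secondPage.status_code, len(secondPage.text))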

Retrieving all network requests required to load a webpage using python

Say I am making a Python request:
import requests
url = "https://www.google.com"
r = requests.get(url)
Is there any method for getting all the network requests needed to load such a website, for example those listed in Chrome's inspect/developer tools? I believe I could achieve the same effect using Selenium, but is there any library or method I could use to simply get all the network requests and responses made when requesting a URL?
Selenium Wire may be worth a try. I haven't been able to find much else in this space either.
https://github.com/wkeeling/selenium-wire
Selenium Wire extends Selenium's Python bindings to give you access to the underlying requests made by the browser. You author your code in the same way as you do with Selenium, but you get extra APIs for inspecting requests and responses and making changes to them on the fly.
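For example, a minimal sketch (assuming Chrome and the driver.requests attribute described in the project README):
# pip install selenium-wire
from seleniumwire import webdriver  # drop-in replacement for selenium's webdriver

driver = webdriver.Chrome()
driver.get("https://www.google.com")

# driver.requests holds every request the browser made while loading the page.
for request in driver.requests:
    if request.response:
        print(request.url, request.response.status_code, request.response.headers['Content-Type'])

driver.quit()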
This article describes more HTTP Request packages that may have similar capabilities or related extensions.
https://www.twilio.com/blog/5-ways-http-requests-python

How to surf the web without cookies from code

I was trying to scrape some links from the web via Google Search.
Let's say my query is [games site:pastebin.com].
I tried this in both Python and Dart, but the result I got was that I need to log in, and I don't want to use cookies.
So, is there any way to get the result of https://www.google.com/search?q=site%3Apastebin.com+games from code without cookies?
The Code I Tried:
Python 3.9.5
import requests
req = requests.get("https://www.google.com/search?q=games+site%3Apastebin.com")
That fully depends on the website you are trying to access. Some pages won't let you use certain features without cookies at all, others will. For what you are trying to achieve, I'd recommend using a search API instead, which doesn't require cookies, since cookies are meant for regular interactive users.
As far as I know, Google doesn't like it when you scrape their content using scripts.
As mentioned before, you can also look for alternative search engines that don't require cookies.
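One such option is Google's Custom Search JSON API, which works with an API key instead of cookies. A minimal sketch, assuming you have created an API key and a Programmable Search Engine ID (both placeholders below):
import requests

API_KEY = "YOUR_API_KEY"                    # placeholder
SEARCH_ENGINE_ID = "YOUR_SEARCH_ENGINE_ID"  # placeholder

params = {
    "key": API_KEY,
    "cx": SEARCH_ENGINE_ID,
    "q": "games site:pastebin.com",
}

# Results come back as JSON; no cookies or login are needed.
response = requests.get("https://www.googleapis.com/customsearch/v1", params=params, timeout=10)
response.raise_for_status()
for item in response.json().get("items", []):
    print(item["title"], item["link"])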

What information do I need when scraping a website that requires logging in?

I want to access my business' database on some site and scrape it using Python (I'm using Requests and BS4, I can go further if needed), but I couldn't get it to work.
Can someone provide info and simple resources on how to scrape such sites?
I'm not talking about providing usernames and passwords. The site requires much more than this.
How do I know what info I am required to provide for my script aside from the username and password (e.g., how do I know that I must provide, say, an auth token)?
How do I deal with the site when there are no plain HTTP URLs, only hrefs in the form of javascript:__doPostBack?
And in this regard, how do I get from the login page to the page I want (the one behind the aforementioned javascript:__doPostBack)?
Are the libraries I'm using enough, or do you recommend using (and, in my case, learning) something else?
Your help is greatly appreciated.
Since it sounds like a lot of the interaction on this site is based on client-side code, I'd suggest using a real browser to do the scraping, and interacting with the site not through low-level HTTP requests but through client-side interaction (such as typing into elements or clicking buttons). This way, you don't need to worry about what form data to send or how to work out the URLs of links yourself.
One recommended method of doing this would be to use BeautifulSoup with Selenium / WebDriver. There are multiple resources on how to do this, for example: How can I parse a website using Selenium and Beautifulsoup in python?
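As a rough sketch of what that looks like (the login URL, field names, and link text below are hypothetical placeholders; inspect the real login form to find the right locators):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://example.com/login")  # placeholder login URL

# Hypothetical form fields; the real names come from inspecting the page.
driver.find_element(By.NAME, "username").send_keys("my_user")
driver.find_element(By.NAME, "password").send_keys("my_password")
driver.find_element(By.ID, "login-button").click()

# javascript:__doPostBack links can simply be clicked; the browser runs the postback for you.
driver.find_element(By.LINK_TEXT, "Reports").click()  # placeholder link text

# Hand the rendered page to BeautifulSoup for parsing.
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.title.string if soup.title else "no title")

driver.quit()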

python open web page and get source code

We have developed a web-based application, with user login etc., and we developed a Python application that has to get some data from this page.
Is there any way to make Python communicate with the system's default browser?
Our main goal is to open a webpage with the system browser and get the HTML source code from it. We tried Python's webbrowser module and opened the web page successfully, but could not get the source code. We also tried urllib2, but in that case I think we would have to use the system default browser's cookies etc., and I don't want to do this for security reasons.
https://pypi.python.org/pypi/selenium
You can try to use Selenium; it was made for testing, but nothing prevents you from using it for other purposes.
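A minimal sketch of that approach (the URL is a placeholder; use whichever browser driver you have installed):
from selenium import webdriver

driver = webdriver.Chrome()           # or webdriver.Firefox()
driver.get("https://example.com")     # placeholder URL for your application

html = driver.page_source             # the fully rendered HTML, as the browser sees it
print(html[:500])

driver.quit()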
If your web site is navigable without Javascript, then you could try Mechanize or zope.testbrowser. These tools offer a higher level API than urllib2, letting you do things like follow links on pages and fill out HTML forms.
This can be helpful in navigating a site that uses cookie based authentication with HTML forms for login, for example.
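For example, a minimal mechanize sketch for a plain HTML login form (the URL and field names are hypothetical placeholders):
# pip install mechanize
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)           # some sites disallow robots in robots.txt
br.open("https://example.com/login")  # placeholder URL

br.select_form(nr=0)                  # pick the first form on the page
br["username"] = "my_user"            # hypothetical field names
br["password"] = "my_password"
response = br.submit()                # session cookies are kept automatically

print(response.read()[:500])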
Have a look at the nltk module; it has some utilities for looking at web pages and getting text. There's also BeautifulSoup, which is a bit more elaborate. I'm currently using both to scrape web pages for a learning algorithm; they're pretty widely used modules, so you can find lots of hints here :)
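For instance, a minimal sketch with Requests and BeautifulSoup that pulls the visible text out of a page (the URL is a placeholder):
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com", timeout=10)  # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")

# get_text() strips the markup and leaves only the visible text.
text = soup.get_text(separator=" ", strip=True)
print(text[:500])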
