Getting html of Facebook page using urllib2 - python

I am writing a Python script that takes a Facebook URL and saves that page locally as an HTML file. Based on the answer to this question: Inherent way to save web page source
I tried using urllib2, but the resulting HTML file is different (some parts are missing) from the file I get by manually right-clicking the Facebook page and saving the entire webpage. Why would they be different, and what other Python libraries could I use instead of urllib2?

Related

Python web scraping the current URL

I would like to ask whether (and if so, how) it is possible to scrape a webpage that is already open in the browser. I do not want Python to open the webpage itself or to need the URL pasted into the code, because the site uses multi-step authorization (you have to approve the login in an app on your mobile phone, and I don't know how to get around this) and its link doesn't change, no matter where on the site you are.
Basically, I need to scrape specific information from this webpage into Excel (the exact format doesn't matter). The webpage URL is constant, and when it is pasted into the code, Python cannot access the page because of the authentication.
In other words: how do I make Python identify the webpage that is already open and scrape the information I need?

Firefox seems to beautify HTML?

I am trying to make a small crawler app.
So I am constantly looking at the source of a page in Firefox, and I download the same page using urllib in Python.
Sample URL: https://store.steampowered.com/search/?term=arma 3
I checked using two separate Python libraries, with the same result: the file saved by the Python app looks different from the one saved with the browser.
Is the browser making the code more readable, or does the server treat my script differently from a standard browser?
Thanks!
I tried saving the page using 2 different Python libraries for accessing web pages.
I won't show the output here, because it's 200K of HTML without newlines.
The HTML has newlines when the page is downloaded with a normal browser.

Downloading dynamic web pages in python

I use the Python requests library (http://docs.python-requests.org/en/latest/) to download and parse particular web pages. This works fine as long as the page is not dynamic. Things look different if the page in question uses JavaScript.
In particular, I am talking about a web page that automatically loads more content once you have scrolled to the bottom, so that you can continue scrolling. This new content is not included in the page's source text, so I can't download it.
I thought about simulating a browser in Python (Selenium). Is this the right way to go?
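For the infinite-scroll case described in this question, one possible approach with Selenium is sketched below. It is only a sketch under assumptions: the helper names `scroll_to_end` and `run_with_firefox` are made up for illustration, the fixed `pause` is a crude stand-in for a proper wait, and the page is assumed to grow `document.body.scrollHeight` whenever it loads more content.

```python
import time


def scroll_to_end(driver, pause=1.0, max_rounds=30):
    """Scroll down until the page height stops growing, then return the HTML.

    `driver` is expected to behave like a Selenium WebDriver that has
    already loaded the page (it needs execute_script and page_source).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the bottom so the page's own JavaScript loads more content.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait; an assumption, tune it for the site
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was appended, assume the end was reached
        last_height = new_height
    return driver.page_source


def run_with_firefox(url):
    """Hypothetical usage; needs selenium and a Firefox driver installed."""
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return scroll_to_end(driver)
    finally:
        driver.quit()
```

The returned HTML can then be parsed the same way as a page fetched with requests.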

Viewing whole webpage with beautifulsoup

I am scraping a website, but my script only sees a portion of the page; at the bottom there is a "view more" button. Is there any way to view everything on the webpage via Python?
BeautifulSoup just parses the returned HTML. It doesn't execute JavaScript, which is often used to load new content or to modify the existing webpage after it has loaded.
You'll need to execute the JavaScript, which requires more than just an HTML parser. You basically need to use a browser. There are a few Python packages to do this:
Selenium
Ghost.py
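As a rough illustration of the Selenium route, the sketch below keeps clicking a "view more" button until it is gone and then hands the rendered HTML to BeautifulSoup. The function names and the CSS selector argument are hypothetical, and the fixed pause is an assumption rather than a robust wait.

```python
import time

CSS_SELECTOR = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def expand_and_get_html(driver, button_css, pause=1.0, max_clicks=50):
    """Click the 'view more' button until it disappears; return the page HTML.

    `driver` is expected to behave like a Selenium WebDriver that has
    already loaded the page.
    """
    for _ in range(max_clicks):
        visible = [
            button
            for button in driver.find_elements(CSS_SELECTOR, button_css)
            if button.is_displayed()
        ]
        if not visible:
            break  # no button left, assume everything is loaded
        visible[0].click()
        time.sleep(pause)  # crude wait for the new content (assumption)
    return driver.page_source


def run_with_firefox(url, button_css):
    """Hypothetical usage; needs selenium, a Firefox driver and bs4."""
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        html = expand_and_get_html(driver, button_css)
    finally:
        driver.quit()
    return BeautifulSoup(html, "html.parser")
```

The point of the split is that BeautifulSoup still does the parsing; the browser is only there to run the JavaScript first.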

Downloading a file using python BeautifulSoup

I am writing a web scraping script in Python and I have to download a file. On the website, there is an image link: <a href="javascript:DownloadMyFile();">. When I click it, it calls a function that brings up the dialog to save or open the file. How do I download the file directly in Python using Beautiful Soup?
BeautifulSoup does not render JavaScript, so you have two options:
Figure out what the JavaScript is doing to generate the url that is handed to your browser when you click that anchor.
Write your scraper using something modern like CasperJS -- it can handle JavaScript.
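For the first option, once the real file URL has been worked out (for example by reading the JavaScript function or watching the request in the browser's developer tools), the download itself needs no HTML parser. The sketch below uses only the standard library; `download_file` and its parameters are hypothetical names, and whether the server also expects cookies or particular headers is something that has to be checked per site.

```python
import shutil
import urllib.request


def download_file(url, dest_path, headers=None):
    """Stream the resource at `url` into `dest_path` and return that path.

    `headers` is an optional dict of extra request headers (for example a
    User-Agent or Cookie value copied from the browser); whether any are
    needed depends on the site.
    """
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request) as response, open(dest_path, "wb") as out:
        # Copy in chunks so large files are not held in memory at once.
        shutil.copyfileobj(response, out)
    return dest_path
```

A call would look like `download_file(found_url, "report.pdf")`, where `found_url` stands for whatever URL the JavaScript turns out to request.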
