Server errors while web scraping HTTPS in Python

I'm trying to scrape this site: https://propaccess.trueautomation.com/ClientDB/Property.aspx?prop_id=17471
I can type the address directly into my URL bar and I get the results I want, but when I scrape it in Python, I only get the source code for a "runtime error" page.
I'm thinking it might have something to do with HTTPS, because I can scrape pages served in the clear, like Craigslist.
My code is as follows:
import urllib

# The URL must be a single string; the original had it broken across two lines.
domain = ("https://propaccess.trueautomation.com/ClientDB/Property.aspx"
          "?prop_id=17471")
htmlfile = urllib.urlopen(domain)  # Python 2; in Python 3 use urllib.request.urlopen
htmltext = htmlfile.read()
print htmltext
I'm new to Python, but not to the internet. I was assuming that if I could type the URL into the browser with success, I'd be able to use the same URL in Python. That seems not to be the case, and I don't have a clue why.
Thanks.
Mike
Update: if I browse to the URL in a browser I have never used to visit this page, I get the "runtime error" page.

I cannot access the page you linked.
It seems like you are in an authenticated session, and your Python code, of course, has no idea about it. It will therefore get back a "permission denied" page or something of the sort.
If so, you probably want to pass the session cookie along when you make the request.
The Requests library will hopefully do what you need; see its documentation on Session objects:
http://docs.python-requests.org/en/latest/user/advanced/#session-objects
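A minimal sketch of that idea, assuming that hitting the ClientDB entry page first sets the cookies the Property.aspx page expects (the entry URL is an assumption):

import requests

# Reuse one Session so cookies set by the server are sent on later requests.
session = requests.Session()
session.get("https://propaccess.trueautomation.com/ClientDB/")  # assumed entry page
response = session.get(
    "https://propaccess.trueautomation.com/ClientDB/Property.aspx?prop_id=17471"
)
print(response.text)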
Hint: when you do a scraping job, view the page in incognito mode. How the page looks there is exactly how it will look to your Python environment.

Related

Python mechanize redirection destination leads to new redirects

I'm new to mechanize and need to scrape a webpage that requires authentication to view its content. I am able to log in with mechanize separately (with browser.open(login link)), but the login seems to have no effect when I call browser.open(content link) right after it. I also tried starting with browser.open(content link), hoping it would redirect to the login link, but browser.response().read() always gives me a destination URL (say www.url.com/url1), and when I try browser.open() on that destination link, it gives a new destination URL (something like www.url.com/xyz-url1), which looks like an endless loop.
I am therefore wondering if there is any way to fix this. I have tried browser.set_handle_redirect(mechanize.HTTPRedirectHandler) after browser.open and browser.follow_meta_refresh, but neither works. It would also help if I could stay logged in while I scrape the original content page, but I do not know how. I have only scraped with BeautifulSoup before, so I am really unfamiliar with mechanize. Thanks in advance.
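A common cause of this symptom is that browser.open(login link) only loads the login page; it does not submit any credentials. A rough sketch of actually submitting the form and reusing the same Browser (the URLs and field names here are hypothetical):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # don't choke on robots.txt
br.set_handle_refresh(False)  # avoid meta-refresh redirect loops

br.open("http://www.url.com/login")  # hypothetical login URL
br.select_form(nr=0)                 # assumes the login form is the first form
br["username"] = "my_user"           # field names are assumptions
br["password"] = "my_pass"
br.submit()                          # this is what actually logs you in

# The same Browser keeps its cookies across .open() calls.
response = br.open("http://www.url.com/content")  # hypothetical content URL
print(response.read())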

BeautifulSoup request is returning an empty list from LinkedIn.com/jobs

I'm new to BeautifulSoup and web scraping, so please bear with me.
I'm using Beautiful Soup to pull all the job post cards from LinkedIn with the title "Security Engineer". After using inspect element on https://www.linkedin.com/jobs/search/?keywords=security%20engineer on an individual job post card, I believe I have found the correct 'li' element and class. The code runs, but it returns an empty list '[ ]'. I don't want to use any APIs, because this is an exercise for me to learn web scraping. Thank you for your help. Here's my code so far:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find_all('li', class_="jobs-search-results__list-item occludable-update p0 relative ember-view")
print(jobs)
As @baduker mentioned, using plain requests won't do all the heavy lifting that browsers do.
Whenever you open a page in your browser, the browser renders the visuals, makes extra network calls, and runs JavaScript. The first thing it does is load the initial response, which is what you're doing with requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').
The page you see in your browser is the result of many, many more requests.
The reason your list is empty is that the HTML you get back is very minimal. You can print it out to the console and compare it to what the browser shows.
To make things easier, instead of using requests you can use Selenium, which is essentially a library for programmatically controlling a browser. Selenium will make all those requests for you like a normal browser and let you access the page source looking the way you expected it to.
This is a good place to start, but your scraper will be slow. There are things you can do in Selenium to speed it up, like running in headless mode, which means the page isn't rendered graphically, but it won't be as fast as figuring out how to do it on your own with requests.
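For example, a rough sketch with Selenium 4 and headless Chrome (the exact headless flag varies between Chrome versions, and Selenium 4.6+ can manage the driver binary for you):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")  # don't render the page graphically
driver = webdriver.Chrome(options=options)
driver.get("https://www.linkedin.com/jobs/search/?keywords=security%20engineer")
html = driver.page_source  # the HTML after JavaScript has run
driver.quit()

soup = BeautifulSoup(html, "lxml")  # then filter by class as before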
If you want to do it using requests, you're going to need to do a lot of snooping through the requests, maybe using a tool like Postman, and see how to simulate the necessary steps to get the data from whatever page.
For example, some websites have a handshake process when logging in.
A website I've worked on goes like this (sketched in code below):
(Step 0, really) Set up request headers, because the site doesn't seem to respond unless a User-Agent header is included.
Fetch the initial HTML and get a unique key from a hidden element in a <form>.
Using this key, make a POST request to the URL from that form.
Get a session id key from the response.
Set up another POST request that combines username, password, and session id. The URL was in some JavaScript function, but I found it using the network inspector in the devtools.
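A hypothetical sketch of that handshake with requests; every URL, header value, and field name here is an assumption, not a real site's API:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})  # step 0

# Step 1: fetch the initial HTML and pull the unique key out of the form.
html = session.get("https://example.com/login").text
soup = BeautifulSoup(html, "lxml")
form = soup.find("form")
key = form.find("input", {"name": "key"})["value"]  # hypothetical hidden field

# Steps 2-3: POST the key to the form's URL and read a session id back.
resp = session.post("https://example.com" + form["action"], data={"key": key})
session_id = resp.json()["sessionid"]  # hypothetical response shape

# Step 4: combine username, password, and session id in a final POST.
session.post("https://example.com/auth",  # URL found via the network inspector
             data={"user": "me", "pass": "secret", "sessionid": session_id})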
So really, I stick with Selenium if the site is too complicated and I'm only getting the data once or not very often. I'll go through the heavy stuff if I'm building a scraper for an API that others will use frequently.
Hope some of this made sense to you. Happy scraping!

Web scraping Twitter pages using Python

Using my Twitter developer credentials, I get Twitter API data from news channels. Now I want to access the source of the news using the URL embedded in the Twitter API data.
I tried using BeautifulSoup and requests to get the content of the Twitter page.
But I keep getting the error 'We've detected that JavaScript is disabled in your browser. Would you like to proceed to legacy Twitter?'
I cleared the browser cache and tried every browser, but I got the same response. Please help me solve this problem.
from bs4 import BeautifulSoup
import requests
url = 'https://twitter.com/i/web/status/1283691878588784642'
# get contents from url
content = requests.get(url).content
# get soup
soup = BeautifulSoup(content,'lxml')
Do you get the error 'We've detected that JavaScript is disabled in your browser. Would you like to proceed to legacy Twitter?' when running the script, or when visiting the page with your GUI web browser?
Have you tried getting your data through legacy Twitter?
If you get this error when running the script, there is nothing you can do on the browser side, like clearing the browser cache. The only way to get around this problem is to find another way to access the Twitter page in your Python program.
From my experience, the easiest way around this problem is to use geckodriver with Firefox, so that Twitter gets all the features it needs.
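A minimal sketch of that approach, assuming geckodriver is installed and on your PATH:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()  # real Firefox, so JavaScript actually runs
driver.get('https://twitter.com/i/web/status/1283691878588784642')
content = driver.page_source  # HTML after the scripts have executed
driver.quit()

soup = BeautifulSoup(content, 'lxml')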

Web login using python3 requests

I am trying to scrape a piece of news. I try to log in to the website with Python so that I can have full access to the whole web page, but I have looked at so many tutorials and still fail.
Here is the code. Can anyone tell me why?
There is no bug in my code, but I still cannot see the full text, which means I am still not logged in.
import requests

url = 'https://id.wsj.com/access/pages/wsj/us/signin.html?mg=id-wsj&mg=id-wsj'
payload = {'username': 'my_user_name',
           'password': '******'}
session = requests.Session()
session.get(url)
response = session.post(url, data=payload)
print(response.cookies)
r = requests.get('https://www.wsj.com/articles/companies-push-to-repeal-amt-after-senates-last-minute-move-to-keep-it-alive-1512435711')
print(r.text)
Try sending your last GET request through the session object. After all, it's the one that performed the login and holds the cookies (if there are any). You've used the top-level requests module for your last request, thus ignoring the login you just made.
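Concretely, the last two lines of the code above would become:

r = session.get('https://www.wsj.com/articles/companies-push-to-repeal-amt-after-senates-last-minute-move-to-keep-it-alive-1512435711')
print(r.text)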

How do I scrape website HTML after JavaScript has run?

So I was trying to scrape a website, but when I scrape it, the result isn't the same as when you right-click and view the page source in Mozilla or Google Chrome.
The code I used:
import urllib
page = urllib.urlopen("http://www.google.com/search?q=python")
#or any other website that uses search
python = page.read()
print python
It turns out that the code only takes the 'raw' web page, which isn't what I wanted. For websites like this, I want the code after JavaScript etc. has run, so that the result is the same as if you were right-clicking and viewing the source from your browser.
Is there any other way of doing this?
It's not exactly a raw page; it is an error page from Google to you.
In the printed output, the top of the message says:
Your client does not have permission to get URL /search?q=python from this server.
If you were to change your page variable to
page = urllib.urlopen("http://volt.al/")
you'd see the JavaScript.
Try it out with different pages to see what you get.
