Python - urllib2: Wait for page to load to scrape data

Firstly, I'd like to say that I do not want to use any libraries that are not provided with Python 2.7.10. The same question was posted on Stack Overflow but was answered with the Requests library.
I have a script that logs into Roblox.com using urllib2. To check whether there is a captcha before I try to log in, I wanted to do check_captcha = re.findall('recaptcha_image', newlogin), but Roblox needs to redirect to the captcha login page AND the captcha has to load onto the page.
So how can I make Python wait for the page to fully redirect/load before I go ahead and .read() and scrape it?

This will wait 10 seconds between opening the URL and reading the response:
import urllib2
import time
url = 'Roblox url'
data = urllib2.urlopen(url)
time.sleep(10)
data = data.read()
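Note that urllib2 follows HTTP redirects automatically, so by the time urlopen() returns you already have the HTML of the final page; the sleep only delays your read, it does not execute any JavaScript. A minimal sketch of the captcha check itself, assuming the redirect is an ordinary HTTP redirect (the login URL below is a placeholder):
import urllib2
import re

url = 'https://www.roblox.com/newlogin'  # placeholder; use the real login URL
html = urllib2.urlopen(url).read()  # any HTTP redirect has already been followed here
check_captcha = re.findall('recaptcha_image', html)
if check_captcha:
    print('captcha detected')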

Related

Why is the html content I got from inspector different from what I got from Request?

Here is the site I am trying to scrape data from:
https://www.onestopwineshop.com/collection/type/red-wines
import requests
from bs4 import BeautifulSoup
url = "https://www.onestopwineshop.com/collection/type/red-wines"
response = requests.get(url)
#print(response.text)
soup = BeautifulSoup(response.content,'lxml')
That is the code I have so far.
It seems like the HTML content I get from the inspector is different from what I get from BeautifulSoup.
My guess is that they are preventing me from getting their data because they detected that I am not accessing the site with a browser. If so, is there any way to bypass that?
(Update) Attempt with selenium:
from selenium import webdriver
import time

path = r"C:\Program Files (x86)\chromedriver.exe"  # raw string avoids backslash-escape surprises
# start web browser
browser = webdriver.Chrome(path)
# navigate to the page
url = "https://www.onestopwineshop.com/collection/type/red-wines"
browser.get(url)
# sleep the required amount to let the page load
time.sleep(3)
# get source code
html = browser.page_source
# close web browser
browser.close()
Update 2: (screenshot of the page loaded with DevTools omitted)
Any website with content that is loaded after the initial page load is unavailable to BS4 with your current method. This is because that content is loaded with an AJAX call via JavaScript, and the requests library is unable to parse and run JS code.
To achieve this you will have to look at something like Selenium, which controls a browser via Python or other languages. There is a separate driver for each browser, i.e. Firefox, Chrome, etc.
Personally I use Chrome, so the drivers can be found here:
https://chromedriver.chromium.org/downloads
download the correct driver for your version of Chrome
install selenium via pip
create a scrape.py file and put the driver in the same folder.
Then, to get the HTML string to use with BS4:
from selenium import webdriver
import time

# start web browser
browser = webdriver.Chrome()
# navigate to the page
browser.get('http://selenium.dev/')
# sleep the required amount to let the page load
time.sleep(2)
# get source code
html = browser.page_source
# close web browser
browser.close()
You should then be able to use the html variable with BS4
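For example, a one-line continuation (assuming lxml is installed; 'html.parser' works too):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # html is the page_source string from above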
I'll actually turn my comment into an answer, because it is a solution to your problem:
As others said, this page is loaded dynamically, but there are ways to retrieve data without running JavaScript; in your case you want to look at the "network" tab of your dev tools and filter "fetch" requests.
This could be particularly interesting for you:
You don't need selenium or beautifulsoup at all; you can just use requests and parse the json, if you are good enough ;)
Here is a working cURL request: curl 'https://api.commerce7.com/v1/product/for-web?&collectionSlug=red-wines' -H 'tenant: one-stop-wine-shop'
You get an error if you don't add the tenant header.
And that's it: no HTML parsing, no waiting for the page to load, no JavaScript running. Much more powerful than the selenium solution.
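Translated into Python, a minimal sketch of that cURL request (the shape of the returned JSON, e.g. a "products" list with "title" fields, is an assumption made to illustrate the parsing; inspect the response to confirm):
import requests

url = "https://api.commerce7.com/v1/product/for-web"
params = {"collectionSlug": "red-wines"}
headers = {"tenant": "one-stop-wine-shop"}  # required, or the API returns an error
response = requests.get(url, params=params, headers=headers)
data = response.json()
for product in data.get("products", []):  # field names assumed; print data to check
    print(product.get("title"))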

Requests Module not fetching full website in Python

Sorry for a noob question... I have written code which searches Google for an image stored locally on my computer. I accomplished this using the requests module. I want to scrape the result page for information about the image, but the requests module never fetches the entire page; it only fetches part of it, so I am not able to scrape the website for results.
import requests
import webbrowser
from bs4 import BeautifulSoup

filePath = "C:\\Users\\mjjha\\Documents\\Checkrow\\monaLisa.jpg"
searchUrl = 'http://www.google.com/searchbyimage/upload'
multipart = {'encoded_image': (filePath, open(filePath, 'rb')), 'image_content': ''}
response = requests.post(searchUrl, files=multipart, allow_redirects=False)
fetchUrl = response.headers['Location']
r = requests.get(fetchUrl)
webbrowser.open(fetchUrl)
soup = BeautifulSoup(r.content, 'html.parser')
head = soup.find_all('a')
for i in head:
    print(i['href'])
In the browser, the result page shows the image search results (screenshot omitted), but when I scrape it for anchor tag links using Beautiful Soup I get the following result:
http://www.google.co.in/imghp?hl=en&tab=wi
http://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=IN&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.hr/webhp%3Ftbs%3Dsbi:AMhZZisFTqPOZmEYpGB89rRLg4R4TfmF3WVQ_1gHEFiENQ8wbYqQq7-KsJUE5KuuxvINd0hFo10EMmP4RzWvOBvRxmsHZ7vm6etW-I36-QfCwmwir1NawORzsWZJffCnSwTpdts39mmQ1EfkcH0R8eGsiJ4_1Xw9DA_1C9mqLpChwRYdgOT-oFNcpt2O25Zhmo6ouG2XA5ZelCbKAChT4DJfGz0TXphXB_1dGEluDV6R_15n42URKCX5Q1zIqR6_16CR0rgXBphz95FMrETLqtPURRbAaWzauYisnSk6jF_1T5GbuoJKHtqThXevhogUSW9ERfZr5vbbWI6DA9c&ec=GAZAAQ
/search?ie=UTF-8&q=Anne+Frank&oi=ddle&ct=236393864&hl=en-GB&si=AC1wQDDagiMg03ncxeOQZbwVe-CJxRCchC-jr2hCPTxjc9wbgOxFdg4PkIAWeA8WhyCLGGzkibRoi5B84SONt2NaUNMtZff0HVDXAtNUKeMfxbgImvSIzyY%3D&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQPQgD
/advanced_search?hl=en-IN&authuser=0
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=hr&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAU
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=hi&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAY
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=bn&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAc
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=te&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAg
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=mr&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAk
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=ta&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAo
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=gu&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAs
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=kn&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAw
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=ml&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCA0
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=pa&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCA4
/intl/en/ads/
http://www.google.co.in/services/
/intl/en/about.html
http://www.google.hr/setprefdomain?prefdom=US&sig=K_fWA_BOpuaXy87gYOc9cg4jE6KwU%3D
/intl/en/policies/privacy/
/intl/en/policies/terms/
(Two further runs printed the same list of links, differing only in session tokens.)
The content fetched by the requests module doesn't contain the full web page, and I don't know why. I want to scrape information from the image, anchor, and h3 tags on the page using Beautiful Soup, but it's just not working out.
The main problem is that the Python requests module doesn't render JavaScript. As a result, you are not getting the webpage you are supposed to get.
You are using the webbrowser module to view your URL, where JavaScript is enabled, so you get the page as expected. But when you then use the requests module to get the page, JavaScript stays disabled, and Google doesn't render the page but instead redirects you to another page (the Google homepage). There you get different HTML, resulting in no search results.
(In the original screenshots, omitted here, 1 is the URL you are trying to hit and 2 is the URL you are redirected to.)
Look at the difference: google.com/webhp?tbs=sbi:AMhZZisX... vs. google.com/search?tbs=sbi:AMhZZisX...
Always check the source HTML given by the requests module, which shows you the actual result: as you can see, it is not the search result page.
So to reach your goal, try using Selenium.
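A minimal Selenium sketch of that suggestion, reusing fetchUrl from the code above (the fixed 3-second sleep is a crude wait; WebDriverWait would be more robust):
from selenium import webdriver
from bs4 import BeautifulSoup
import time

browser = webdriver.Chrome()
browser.get(fetchUrl)  # the redirect target obtained from the requests.post() step
time.sleep(3)  # let the JavaScript redirect and render finish
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()
for a in soup.find_all('a'):
    print(a.get('href'))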

Get redirected URL with Python to get access code

This question has been asked several times, and I didn't find any answer that works for me. I am using the requests library to get the redirect URL; however, my code returns the original URL. If I click the link it takes a few seconds before I get the redirect URL, from which I then manually extract the code, but I need to get this information with Python.
Here is my code. I have tried response.history, but it returns an empty list.
import requests
response = requests.get("https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/authorize?client_id={client_id}&response_type=code&redirect_uri=http%3A%2F%2Flocalhost%2Fmyapp%2F&response_mode=query&scope=user.read%20chat.read&state=12345")
print(response)
print('-------------------')
print(response.url)
I am trying to get the code by following this Microsoft documentation: https://learn.microsoft.com/en-us/graph/auth-v2-user.
Here are the links that I found on Stack Overflow that didn't solve my issue:
To get redirected URL with requests, How to get redirect url code with Python? (this is probably very close to my situation), how to get redirect url using python requests, and this one: Python Requests library redirect new url
I didn't have any luck getting the redirected URL back using requests, as mentioned in previous posts. But I was able to work around this by opening the URL with the webbrowser library and then reading the browser history with sqlite3, which gave me the result I was looking for.
I had to go through Postman and add the Postman URL to my app registration for using the Graph API, but if you simply want to get the redirected URL you can follow the same code and you should get it.
Let me know if there are better solutions.
Here is my code:
import webbrowser
import sqlite3
import time
import shutil
import pandas as pd

webbrowser.open("https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/authorize?client_id={client_id}&response_type=code&redirect_uri=http%3A%2F%2Flocalhost%2Fmyapp%2F&response_mode=query&scope=user.read%20chat.read&state=12345")
# the source file is where your web browser saves its history; I was using Chrome,
# but it should be the same process for a different browser
source_file = 'C:\\Users\\{your_user_id}\\AppData\\Local\\Google\\Chrome\\User Data\\Default\\History'
# could not connect to the history file directly as it was locked, so copy it to a different location
destination_file = 'C:\\Users\\{user}\\Downloads\\History'
time.sleep(30)  # there is some delay before the history file updates; 30 sec is enough for the last URL to be logged
shutil.copy(source_file, destination_file)  # copying the file
con = sqlite3.connect('C:\\Users\\{user}\\Downloads\\History')  # connecting to browser history
cursor = con.execute("SELECT * FROM urls")
names = [description[0] for description in cursor.description]
urls = cursor.fetchall()
con.close()
df_history = pd.DataFrame(urls, columns=names)
last_url = df_history.loc[len(df_history) - 1, 'url']
print(last_url)
>> https://oauth.pstmn.io/v1/browser-callback?code={code}&state=12345&session_state={session_state}#
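For context, a minimal sketch of what response.history does capture: it records HTTP 3xx redirects only, which is why it stays empty when the redirect happens in JavaScript after the login page loads (httpbin.org is used here purely as an illustration):
import requests

resp = requests.get("https://httpbin.org/redirect/1")
print(resp.history)  # [<Response [302]>] -- the HTTP-level hop
print(resp.url)      # the final URL after following that redirect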

Scraping a secure website requiring clicks on javascript links

I have a daily task at work to download some files from an internal company website. The site requires a login. The main URL is something like:
https://abcd.com
But when I open that in the browser, it redirects to something like:
https://abcdGW/ln-eng.aspx?lang=eng&lnid=e69d5d-xxx-xxx-1111cef&regl=en-US
My task normally is to open this site, log in, click some links back and forth, and download some files. This takes me 10 minutes every day, but I want to automate it using Python. With my basic knowledge I have written the code below:
import urllib3
from bs4 import BeautifulSoup
import requests
import http.cookiejar  # 'import http' alone does not load the cookiejar submodule

url = "https://abcd.com"
redirectURL = requests.get(url).url
jar = http.cookiejar.CookieJar(policy=None)
http = urllib3.PoolManager()  # note: this rebinds the name 'http', shadowing the module
acc_pwd = {'datasouce': 'Data1', 'user': 'xxxx', 'password': 'xxxx'}
response = http.request('GET', redirectURL)
soup = BeautifulSoup(response.data, 'html.parser')
r = requests.get(redirectURL, cookies=jar)
r = requests.post(redirectURL, cookies=jar, data=acc_pwd)
print("RData %s" % r.text)
This shows that I am able to log in successfully. The next step is where I am stuck. On the page after login I have some links on the left side, one of which I need to click. When I inspect them in Chrome, I see:
<a href="javascript:__doPostBack('myAppControl$menu_itm_proj11','')"><div class="menu-cell">
<img class="menu-image" src="images/LiteMenu/projects.png" style="border-width:0px;"><span class="menu-text">Projects</span></div></a>
This is probably a JavaScript link. I need to click this, then on the new page another link, then another to download a file, and then go back to the main page and do it all over again to download different files.
I would be grateful to anyone who can help or suggest something.
Thanks to chris, I was able to complete this.
First, using the requests library, I got the redirect URL:
redirectURL = requests.get(url).url
After that I used Scrapy and Selenium to click the links and download the files.
With Selenium driving the browser, it was quite simple.
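A minimal sketch of the Selenium click for a __doPostBack link like the one above; the browser runs the JavaScript for you when you click the anchor (the element locator is an assumption based on the HTML snippet):
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

browser = webdriver.Chrome()
browser.get(redirectURL)  # redirectURL obtained via requests as above
# ... perform the login in the browser here ...
projects = browser.find_element(By.XPATH, "//span[@class='menu-text' and text()='Projects']")
projects.click()  # the browser executes __doPostBack('myAppControl$menu_itm_proj11','')
time.sleep(2)  # crude wait for the postback to complete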

error while parsing url using python

I am working on a URL using Python. If I click the URL, I am able to get the Excel file, but if I run the following code, it gives me weird output.
>>> import urllib2
>>> urllib2.urlopen('http://intranet.stats.gov.my/trade/download.php?id=4&var=2012/2012%20MALAYSIA%27S%20EXPORTS%20BY%20ECONOMIC%20GROUPING.xls').read()
Output:
"<script language=javascript>window.location='2012/2012 MALAYSIA\\'S EXPORTS BY ECONOMIC GROUPING.xls'</script>"
Why is urllib2 not able to read the content?
Take a look using an HTTP listener (or even Google Chrome Developer Tools): there's a redirect using JavaScript when you get to the page.
You will need to access the initial URL, parse the result, and fetch the actual URL again.
@Kai in this question seems to have found an answer to JavaScript redirects using the Selenium module:
from selenium import webdriver
import time

driver = webdriver.Firefox()
link = "http://yourlink.com"
driver.get(link)
# this waits for the new page to load
while link == driver.current_url:
    time.sleep(1)
redirected_url = driver.current_url
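Once redirected_url points at the spreadsheet itself, downloading it is a plain fetch; a minimal sketch (the output filename is an illustration, and the URL may need percent-encoding if it contains spaces or quotes, as this one does):
import urllib2

data = urllib2.urlopen(redirected_url).read()
with open('exports.xls', 'wb') as f:  # illustrative filename
    f.write(data)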
