Scraping a react.js webpage with dryscrape - python

I have trouble scraping the homepage http://www.jobs.ch which is programmed with react.js.
I want to put the term Business in the search box and execute the search.
Dryscrape worked for another example which was not a react.js page.
How can I write the term Business in this search field?
The error message when my script is executed:
ubuntu@ubuntu:~/scripts$ python jobs.py
Traceback (most recent call last):
  File "jobs.py", line 30, in <module>
    name.set("Business")
AttributeError: 'NoneType' object has no attribute 'set'
Here is my script:
#We will write a Python script to visit a webpage. Fill in the form and submit the form.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import dryscrape
# make sure you have xvfb installed
dryscrape.start_xvfb()
root_url = 'http://www.jobs.ch/en/vacancies/'
if __name__ == '__main__':
    # set up a web scraping session
    session = dryscrape.Session(base_url = root_url)
    # we don't need images
    session.set_attribute('auto_load_images', False)
    session.set_header('User-agent', 'Google Chrome')
    # visit exact webpage which is the form in this example
    session.visit('http://www.jobs.ch/en/vacancies/')
    # fill in the form by taking ID of field from webdev tool
    #name = session.at_xpath('//*[@data-reactid="107"]')
    name = session.at_xpath('//*[@data-reactid="107"]//*[@class="search-input col-sm-4 col-md-5"]')
    name.set("Business")
    # submit form
    name.form().submit()
    # save a screenshot of the web page
    session.render("jobs.png")
    print("Session rendered successfully!")

I think your xpath has an issue but apart from that, your session itself has been configured incorrectly.
This line
session = dryscrape.Session(base_url = root_url)
sets the base of the URL to your root_url so when you do session.visit('http://www.jobs.ch/en/vacancies/') you are in fact visiting the concatenation of your root_url and the URL provided in session.visit.
If you print session.url() you would be able to see that the URL you actually visited was http://www.jobs.ch/en/vacancies/http://www.jobs.ch/en/vacancies/
The xpath of the search input, which I got from Chrome -> Inspect -> Right Click -> Copy XPath, is //*[@id="react-root"]/div/div[1]/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/div/div[1]/div/input
Please verify that you are using the correct xpath.
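The doubled URL can be reproduced without dryscrape. The following is only a minimal sketch, assuming the session prepends base_url by plain string concatenation, as described above. naive_join is a hypothetical helper written for illustration and is not part of dryscrape's API; the standard library's urljoin is shown for contrast.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.jobs.ch/en/vacancies/"


def naive_join(base, url):
    """Hypothetical helper: plain concatenation (assumed behaviour, not dryscrape code)."""
    return base + url


# Passing an absolute URL on top of a base URL doubles it.
doubled = naive_join(BASE_URL, "http://www.jobs.ch/en/vacancies/")

# Keeping only the host in the base and visiting a path avoids the problem.
fixed = naive_join("http://www.jobs.ch", "/en/vacancies/")

# urljoin, by contrast, returns an absolute URL unchanged.
joined = urljoin(BASE_URL, "http://www.jobs.ch/en/vacancies/")

print(doubled)
print(fixed)
print(joined)
```

Under that assumption, either drop base_url from the Session or pass only a path to session.visit; printing session.url() after the visit confirms which page was actually loaded.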

Related

Requests Module not fetching full website in Python

Sorry for the noob question. I have written code that searches Google for an image stored locally on my computer, using the requests module. I want to scrape the result page for information about the image, but the requests module never fetches the entire page. It only fetches part of it, so I am not able to scrape the page for results.
import requests
import webbrowser
from bs4 import BeautifulSoup
filePath = "C:\\Users\\mjjha\\Documents\\Checkrow\\monaLisa.jpg"
searchUrl = 'http://www.google.com/searchbyimage/upload'
multipart = {'encoded_image': (filePath, open(filePath, 'rb')), 'image_content': ''}
response = requests.post(searchUrl, files=multipart, allow_redirects=False)
fetchUrl = response.headers['Location']
r=requests.get(fetchUrl)
webbrowser.open(fetchUrl)
soup=BeautifulSoup(r.content,'html.parser')
head=soup.find_all('a')
for i in head:
    print(i['href'])
The web page looks like this:
but when I scrape it for anchor tag links using beautiful soup I get the following result:
http://www.google.co.in/imghp?hl=en&tab=wi
http://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=IN&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.hr/webhp%3Ftbs%3Dsbi:AMhZZisFTqPOZmEYpGB89rRLg4R4TfmF3WVQ_1gHEFiENQ8wbYqQq7-KsJUE5KuuxvINd0hFo10EMmP4RzWvOBvRxmsHZ7vm6etW-I36-QfCwmwir1NawORzsWZJffCnSwTpdts39mmQ1EfkcH0R8eGsiJ4_1Xw9DA_1C9mqLpChwRYdgOT-oFNcpt2O25Zhmo6ouG2XA5ZelCbKAChT4DJfGz0TXphXB_1dGEluDV6R_15n42URKCX5Q1zIqR6_16CR0rgXBphz95FMrETLqtPURRbAaWzauYisnSk6jF_1T5GbuoJKHtqThXevhogUSW9ERfZr5vbbWI6DA9c&ec=GAZAAQ
/search?ie=UTF-8&q=Anne+Frank&oi=ddle&ct=236393864&hl=en-GB&si=AC1wQDDagiMg03ncxeOQZbwVe-CJxRCchC-jr2hCPTxjc9wbgOxFdg4PkIAWeA8WhyCLGGzkibRoi5B84SONt2NaUNMtZff0HVDXAtNUKeMfxbgImvSIzyY%3D&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQPQgD
/advanced_search?hl=en-IN&authuser=0
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=hr&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAU
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=hi&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAY
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=bn&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAc
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=te&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAg
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=mr&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAk
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=ta&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAo
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=gu&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAs
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=kn&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAw
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=ml&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCA0
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=pa&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCA4
/intl/en/ads/
http://www.google.co.in/services/
/intl/en/about.html
http://www.google.hr/setprefdomain?prefdom=US&sig=K_fWA_BOpuaXy87gYOc9cg4jE6KwU%3D
/intl/en/policies/privacy/
/intl/en/policies/terms/
The content fetched by the requests module doesn't contain the full web page, and I don't know why. I want to scrape the information in the image, anchor and h3 tags from the page using Beautiful Soup, but it's just not working out.
The main problem is that the Python requests module doesn't render JavaScript. As a result, you are not getting the web page you expect.
You are using the webbrowser module to view your URL in a browser where JavaScript is enabled, so there you get the page as expected. But when you use the requests module to get the same page, no JavaScript runs, and Google doesn't serve the results page but instead redirects you to another page (the Google homepage). There you get different HTML, which contains none of the search results you saw in the browser.
1 is the URL you are trying to hit, and 2 is the URL you are redirected to.
Look at the difference: google.com/webhp?tbs=sbi:AMhZZisX...
VS google.com/search?tbs=sbi:AMhZZisX...
The HTML that page returns is this:
Always inspect the source HTML returned by the requests module, which shows what your script actually received.
As you can see, this is not the search result page.
So to reach your goal, try using Selenium.
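Before switching tools, the script can confirm which page it actually received. A minimal sketch, assuming the results page lives under the /search path and the homepage under /webhp, as in the two URLs above. is_results_page is a hypothetical helper, and the tbs tokens are placeholders, not real values.

```python
from urllib.parse import urlsplit


def is_results_page(url):
    """Return True when the URL path is /search rather than the homepage."""
    return urlsplit(url).path == "/search"


# Placeholder URLs modelled on the two addresses compared above.
requested = "https://www.google.com/search?tbs=sbi:EXAMPLE"
redirected = "https://www.google.com/webhp?tbs=sbi:EXAMPLE"

print(is_results_page(requested))
print(is_results_page(redirected))

# With requests, the final address is available as r.url and the
# intermediate redirects as r.history, so the check could be:
#     if not is_results_page(r.url): ...fall back to a real browser...
```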

Python webscraping page which requires login

I am trying to automate a web data gathering process using Python. In my case, I need to pull the information from the https://app.ixml.com.br/documentos/nfe page. However, before you can go to this page, you need to log in at https://app.ixml.com.br/login. The code below should theoretically log into the site:
import re
from robobrowser import RoboBrowser
username = 'email'
password = 'password'
br = RoboBrowser()
br.open('https://app.ixml.com.br/login')
form = br.get_form()
form['email'] = username
form['senha'] = password
br.submit_form(form)
src = str(br.parsed())
However, when I print the src variable, I get the source code of the https://app.ixml.com.br/login page, i.e. the page before logging in. If I add the following lines at the end of the previous code:
br.open('https://app.ixml.com.br/documentos/nfe')
src2 = str(br.parsed())
The src2 variable contains the source code of the page https://app.ixml.com.br/. I tried some variations, such as creating a new br object, but got the same result. How can I access the information at https://app.ixml.com.br/documentos/nfe?
If it is OK to have a browser window open, you can try to solve this using Selenium. This package makes it possible to write a program that interacts with the page just like a user would.
The following code would log you in:
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Firefox()
browser.get("https://app.ixml.com.br/login")
# find_element(By.ID, ...) replaces find_element_by_id, which newer Selenium 4 releases removed
browser.find_element(By.ID, "email").send_keys("abc@mail")
browser.find_element(By.ID, "senha").send_keys("abc")
browser.find_element(By.CSS_SELECTOR, "button").click()

Scraping a secure website requiring clicks on javascript links

I have a daily task at work to download some files from internal company website. The site requires a login. But the main url is something like:
https://abcd.com
But when I open that in the browser, it redirects to something like:
https://abcdGW/ln-eng.aspx?lang=eng&lnid=e69d5d-xxx-xxx-1111cef&regl=en-US
My task normally is to open this site, log in, click some links back and forth, and download some files. This takes me 10 minutes every day, but I want to automate it using Python. Using my basic knowledge, I have written the code below:
import urllib3
from bs4 import BeautifulSoup
import requests
import http
url = "https://abcd.com"
redirectURL = requests.get(url).url
jar = http.cookiejar.CookieJar(policy=None)
http = urllib3.PoolManager()
acc_pwd = {'datasouce': 'Data1', 'user':'xxxx', 'password':'xxxx'}
response = http.request('GET', redirectURL)
soup = BeautifulSoup(response.data)
r = requests.get(redirectURL, cookies=jar)
r = requests.post(redirectURL, cookies=jar, data=acc_pwd)
print ("RData %s" % r.text)
This shows that I am able to log in successfully. The next step is where I am stuck. On the page after login there are some links on the left side, one of which I need to click. When I inspect them in Chrome, I see them as:
href="javascript:__doPostBack('myAppControl$menu_itm_proj11','')"><div class="menu-cell">
<img class="menu-image" src="images/LiteMenu/projects.png" style="border-width:0px;"><span class="menu-text">Projects</span> </div></a>
This is probably a javascript link. I need to click this, and then on new page another link, then another to download a file and back to the main page and do this all over again to download different files.
I would be grateful to anyone who can help or suggest.
Thanks to chris, I was able to complete this.
First, using the requests library, I got the redirect URL:
redirectURL = requests.get(url).url
After that I used Scrapy and Selenium to click the links and download the files.
Adding Selenium to the browser as an add-in/plugin made this quite simple.
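For reference, a link of the form javascript:__doPostBack(target, argument) is the usual ASP.NET WebForms pattern: the script copies its two arguments into the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the page's form. The following is a minimal sketch of rebuilding that form data without a browser, assuming the site follows this standard pattern. The helper names and the __VIEWSTATE placeholder are hypothetical; a real page's hidden fields would have to be read from its HTML first.

```python
import re

# Matches the two quoted arguments of a __doPostBack('target','argument') call.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def parse_postback(href):
    """Return (target, argument) from a __doPostBack link, or None if it is not one."""
    match = POSTBACK_RE.search(href)
    if match is None:
        return None
    return match.group(1), match.group(2)


def build_postback_payload(href, hidden_fields):
    """Combine the page's hidden form fields with the postback target and argument."""
    parsed = parse_postback(href)
    if parsed is None:
        raise ValueError("not a __doPostBack link: %r" % href)
    target, argument = parsed
    payload = dict(hidden_fields)
    payload["__EVENTTARGET"] = target
    payload["__EVENTARGUMENT"] = argument
    return payload


# The href is the one shown in the question; the hidden field value is a placeholder.
href = "javascript:__doPostBack('myAppControl$menu_itm_proj11','')"
payload = build_postback_payload(href, {"__VIEWSTATE": "placeholder"})
print(payload)
```

The resulting dictionary would be sent as the data of a POST to the form's action URL, using the same logged-in session. Whether the server accepts it depends on the site, which is why driving a real browser with Selenium is often the simpler route.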

Can't get proper results with mechanize when using br.submit()

I am trying to submit a form, and get the results of the page that it heads to after submitting the form. I'm using mechanize.
1) When I'm using the code to click on the first-button, it is getting a response. But when I read the response, it is showing the source of the same page (the page where the form is located). Not of the page that the browser is redirected to after the submission of the form.
from mechanize import Browser
br = Browser()
br.open("http://link.net/form_page.php")
br.select_form(nr=0)
br.form['number'] = '0123456789'
response = br.submit(nr=0)
print(response.read())
Now, when I do this, the source of the same page (i.e. form_page.php) is showing up. But, it should have shown the source of "results.php" (that is where the browser leads to when I do it manually)
2) There are multiple submit buttons in the page. I am clicking only the first one. But when I'm trying to click other submit buttons using nr=1 or nr=2, it is showing this error.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 524, in select_form
    raise FormNotFoundError("no form matching "+description)
mechanize._mechanize.FormNotFoundError: no form matching nr 1
Can you please help me?
Make sure you are selecting the right form, and that the form you are selecting actually exists on the web page. You can check with code like this:
for form in br.forms():
    print(form)
and see what it returns.
This looks similar to this issue, where submit was calling some Javascript to validate the inputs before redirecting. It may be worth having a look at the HTML of the page and checking what it does on submit.
Try the following:
import mechanize
br = mechanize.Browser()
br.open("http://link.net/form_page.php")
br.select_form(nr=0)
br['number'] = '0123456789' ### try instead of 'br.form[]'
response = br.submit() ### no need to specify form again
text = response.read()
Don't forget about 'br.set_handle_robots(False)', 'br.set_all_readonly(False)', etc...

error while parsing url using python

I am working with a URL in Python.
If I click the URL, I am able to get the Excel file.
But if I run the following code, it gives me weird output.
>>> import urllib2
>>> urllib2.urlopen('http://intranet.stats.gov.my/trade/download.php?id=4&var=2012/2012%20MALAYSIA%27S%20EXPORTS%20BY%20ECONOMIC%20GROUPING.xls').read()
output :
"<script language=javascript>window.location='2012/2012 MALAYSIA\\'S EXPORTS BY ECONOMIC GROUPING.xls'</script>"
Why is it not able to read the content with urllib2?
Take a look using an HTTP listener (or even Google Chrome Developer Tools): there's a redirect using JavaScript when you get to the page.
You will need to access the initial URL, parse the result, and then fetch the actual URL.
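The parsing step can be sketched with the standard library alone. This is a minimal Python 3 example (the question itself uses Python 2's urllib2), assuming the page always redirects through a single-quoted window.location assignment like the one in the output above. extract_js_redirect is a hypothetical helper, and BODY is the response text copied from the question.

```python
import re
from urllib.parse import quote, urljoin

PAGE_URL = ("http://intranet.stats.gov.my/trade/download.php"
            "?id=4&var=2012/2012%20MALAYSIA%27S%20EXPORTS%20BY%20ECONOMIC%20GROUPING.xls")

# Body returned by the first request, as shown in the question.
BODY = ("<script language=javascript>window.location="
        "'2012/2012 MALAYSIA\\'S EXPORTS BY ECONOMIC GROUPING.xls'</script>")


def extract_js_redirect(html, base_url):
    """Return the absolute URL assigned to window.location, or None if absent."""
    match = re.search(r"window\.location\s*=\s*'((?:\\.|[^'\\])*)'", html)
    if match is None:
        return None
    # Undo JavaScript backslash escapes such as \' inside the string literal.
    target = re.sub(r"\\(.)", r"\1", match.group(1))
    # Percent-encode spaces and quotes, then resolve against the page URL.
    return urljoin(base_url, quote(target))


redirect_url = extract_js_redirect(BODY, PAGE_URL)
print(redirect_url)
```

The returned URL is the one to request in a second call to download the file. Pages that build the target dynamically would still need a real browser.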
@Kai in this question seems to have found an answer to JavaScript redirects using the Selenium module:
import time
from selenium import webdriver
driver = webdriver.Firefox()
link = "http://yourlink.com"
driver.get(link)
# this waits for the new page to load
while link == driver.current_url:
    time.sleep(1)
redirected_url = driver.current_url
