I'm trying to access a webpage and return all of the hyperlinks on that page. I'm using the same code from a question that was answered here.
I wish to access this correct page, but it only returns content from this incorrect page.
Here is the code I am running:
import httplib2
from bs4 import SoupStrainer, BeautifulSoup
http = httplib2.Http()
status, response = http.request('https://www.iparkit.com/Minneapolis')
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
Results:
/account
/monthlyAccount
/myproducts
/search
/
{{Market}}/search?expressSearch=true&timezone={{timezone}}
{{Market}}/search?expressSearch=false&timezone={{timezone}}
{{Market}}/events
monthly?timezone={{timezone}}
/login?next={{ getNextLocation(browserLocation) }}
/account
/monthlyAccount
/myproducts
find
parking-app
iparkit-express
https://interpark.custhelp.com
contact
/
/parking-app
find
https://interpark.custhelp.com
/monthly
/iparkit-express
/partners
/privacy
/terms
/contact
/events
I don't mind getting the above results, but they don't include any links that could get me to the page I want. Maybe it's protected? Any ideas or suggestions? Thank you in advance.
The page you are trying to scrape is generated entirely by JavaScript.
http.request('https://www.iparkit.com/Minneapolis') therefore gives you almost nothing in this case.
Instead, you must do what a real browser does: process the JavaScript, then scrape what has been rendered. For this you can try Selenium.
For your page, after running the JavaScript you get ~84 URLs; scraping without running it yields only ~7.
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome('PATH_TO_CHROME_WEBDRIVER', options=chrome_options)
driver.get('https://www.iparkit.com/Minneapolis')
content = driver.page_source  # the HTML after JavaScript has run
Then you extract what you want from that content with BeautifulSoup, as in your original code.
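Putting the two together, here is a minimal sketch (assuming chromedriver is on your PATH) that renders the page with Selenium and then reuses the SoupStrainer loop from the question:

from bs4 import BeautifulSoup, SoupStrainer
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)  # assumes chromedriver is on PATH
driver.get('https://www.iparkit.com/Minneapolis')
content = driver.page_source
driver.quit()

# Same link-extraction loop as in the question, now fed the rendered HTML
for link in BeautifulSoup(content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])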
Related
I am testing the requests module to get the content of a webpage, but the content I get back is not the full content of the page.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Also, if I view the page source in the Chrome web browser, I do not see the full content there either.
Is there a way to get the full content of the example page that I have provided?
The page is rendered with JavaScript, which makes additional requests to fetch more data. You can fetch the complete page with Selenium.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.prettify())
For other solutions see my answer to Scraping Google Finance (BeautifulSoup)
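As a quick sanity check, here is a sketch (not from the original answer; the exact counts will vary with the live site) that compares how many links the plain request sees against the rendered page:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"

# Plain HTTP: only the initial HTML, before any JavaScript runs
static_soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Rendered: what the browser shows after JavaScript has run
driver = webdriver.Chrome()
driver.get(url)
rendered_soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

print(len(static_soup.find_all('a')), "links without JavaScript")
print(len(rendered_soup.find_all('a')), "links after rendering")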
A request is different from the page source or the visual elements of a web page. Viewing the source also doesn't give you access to everything behind the page, such as database requests and other back-end processing. Either your question is not clear enough, or you've misinterpreted how web browsing works.
I am looking to scrape the following web page, and I want all the text on the page, including all the clickable elements.
I've attempted to use requests:
import requests
response = requests.get("https://cronoschimp.club/market/details/2424?isHonorary=false")
response.text
Which scrapes the meta-data but none of the actual data.
Is there a way to click through and get the elements in the floating boxes?
As it's a JavaScript-rendered web page, you can't get anything out of it using requests and bs4, because they can't render JavaScript. So you need an automation tool such as Selenium. Here I use Selenium with bs4 and it works fine. Please see the minimal working example below:
Code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')  # path to the chromedriver executable
driver.maximize_window()
time.sleep(8)

url = 'https://cronoschimp.club/market/details/2424?isHonorary=false'
driver.get(url)
time.sleep(20)  # give the JavaScript time to render the page

soup = BeautifulSoup(driver.page_source, 'lxml')
name = soup.find('div', class_="DetailsHeader_title__1NbGC").get_text(strip=True)
p = soup.find('span', class_="DetailsHeader_value__1wPm8")
price = p.get_text(strip=True) if p else "Not for sale"
print([name, price])
Output:
['Chimp #2424', 'Not for sale']
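The fixed time.sleep() calls work, but they waste time and can still be too short on a slow connection. Here is a sketch of the same flow with an explicit wait instead, assuming the DetailsHeader_title__1NbGC class from the answer above is a reliable element to wait for:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://cronoschimp.club/market/details/2424?isHonorary=false')

# Wait until the title element appears (up to 20 s) instead of sleeping blindly
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CLASS_NAME, "DetailsHeader_title__1NbGC"))
)

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()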
I have an app that uses requests to analyze and act on web page text. But it does not seem to work on this page, which is likely built with Angular: https://bio.tools/bowtie. The source HTML is different from the rendered content. I am trying to collect the DOI referenced on the page (10.1186/gb-2009-10-3-r25), but when requests picks up the HTML source, the DOI is not there.
I've heard that Google is able to parse pages that are generated using javascript. How do they do it? Any tips on viewing the DOI information with python?
You probably need an engine that runs the JavaScript of the HTTP response for you, like a browser does. You can use Selenium for this and then parse the HTML it returns with BeautifulSoup.
Example:
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://bio.tools/bowtie"
path = "path/to/chrome/webdriver"
browser = webdriver.Chrome(path) # Can also be Firefox, etc.
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
...
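For instance, to pull the DOI out of the rendered page, one option is a regex search over the page text. This is only a sketch: the DOI pattern is standard, but where the DOI sits in the bio.tools markup is an assumption here.

import re

# Search the visible text for a DOI-shaped string
doi_match = re.search(r'\b10\.\d{4,9}/[^\s"<>]+', soup.get_text())
if doi_match:
    print(doi_match.group())  # expected: 10.1186/gb-2009-10-3-r25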
Since the Instagram API is not working, I am trying to crawl information for a given hashtag. The hashtag search page has Ajax embedded, so I followed guides online to find the URL the data is retrieved from. That gave me the following link.
https://www.instagram.com/graphql/query/?query_hash=f92f56d47dc7a55b606908374b43a314&variables=%7B%22tag_name%22%3A%22cancun%22%2C%22show_ranked%22%3Afalse%2C%22first%22%3A20%2C%22after%22%3A%22QVFENlVELW9hZjlJVWU1RWd6anpWdGNsYkVwU3M5TzUtaDlRN3VoRHlwU1EwWWRBZ2t6TFkzbEl1M3RRcmItd0JKbVBiM2pLUXZpT0JzNWp3dFhIcElfWg%3D%3D%22%7D
However, when I try to crawl that page using urlopen, Instagram blocks my crawler. I tried setting a User-Agent to get around it, but that did not work.
Then I tried to use webdriver to fake a browser. It gets around the blockage, but I don't get anything from the crawl.
Does anyone know what is wrong here?
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)  # this line was missing: the options were never attached to a driver
url = 'https://www.instagram.com/graphql/query/?query_hash=f92f56d47dc7a55b606908374b43a314&variables=%7B%22tag_name%22%3A%22cancun%22%2C%22show_ranked%22%3Afalse%2C%22first%22%3A20%2C%22after%22%3A%22QVFENlVELW9hZjlJVWU1RWd6anpWdGNsYkVwU3M5TzUtaDlRN3VoRHlwU1EwWWRBZ2t6TFkzbEl1M3RRcmItd0JKbVBiM2pLUXZpT0JzNWp3dFhIcElfWg%3D%3D%22%7D'
driver.get(url)
pagesource = driver.page_source
bsObj = BeautifulSoup(pagesource, 'html.parser')
print(bsObj.prettify())
Appreciate any help!
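One thing worth noting about this endpoint: the graphql URL returns raw JSON rather than HTML, so prettified soup won't show much. When Chrome loads a bare JSON response, it wraps the body in a pre tag, so a sketch for getting at the data (assuming the driver setup above) might be:

import json

# Chrome wraps a raw JSON response in <html><body><pre>...</pre></body></html>
pre = bsObj.find('pre')
if pre:
    data = json.loads(pre.text)
    print(data.keys())  # top-level keys of the graphql response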
I've been using BeautifulSoup for a while and haven't had many problems.
But now I'm trying to scrape a site that gives me some trouble.
My code is this:
import requests
from bs4 import BeautifulSoup

preSoup = requests.get('https://www.betbrain.com/football/world/')
soup = BeautifulSoup(preSoup.content, "lxml")
print(soup)
The content I get back seems to be some sort of script and/or the API the site is connected to, but not the real content of the page I see in the browser.
I can't reach the games, for example. Does anyone know a way around this?
Thank you
Okay, requests only gets the HTML and doesn't load the JS;
you have to use a webdriver for that.
You can use Chrome, Firefox, etc. I use PhantomJS because it runs in the background; it's a "headless" browser. Below you will find some example code that will help you understand how to use it.
from bs4 import BeautifulSoup
import time
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("https://www.betbrain.com/football/world/")
time.sleep(5)  # give it some time to load the JS
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
for i in soup.findAll("span", {"class": "Participant1"}):
    print(i.text)
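A note on the design choice here: PhantomJS has been unmaintained since 2018 and newer Selenium releases dropped support for it, so if the code above fails for you, a headless Chrome equivalent (a sketch with the same scraping logic) would be:

from bs4 import BeautifulSoup
import time
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # no visible window, like PhantomJS
driver = webdriver.Chrome(options=options)
driver.get("https://www.betbrain.com/football/world/")
time.sleep(5)  # give the JavaScript time to load

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
for i in soup.find_all("span", {"class": "Participant1"}):
    print(i.text)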