Scraping website in which html is injected with javascript - python

I am trying to get the url and sneaker titles at https://stockx.com/sneakers.
This is my code so far:
in main.py
from bs4 import BeautifulSoup
from utils import generate_request_header
import requests
url = "https://stockx.com/sneakers"
html = requests.get(url, headers=generate_request_header()).content
soup = BeautifulSoup(html, "lxml")
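print(soup)
in utils.py
import random
# BASE_REQUEST_HEADER and USER_AGENT_HEADER_LIST are defined elsewhere in utils.py (not shown)
def generate_request_header():
    header = dict(BASE_REQUEST_HEADER)  # copy so the shared base header is not mutated
    header["User-Agent"] = random.choice(USER_AGENT_HEADER_LIST)
    return header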
But whenever I print soup, I get the following output: https://pastebin.com/Ua6B6241. There doesn't seem to be any HTML extracted. How would I get it? Should I be using something like Selenium?

requests doesn't seem to be able to verify the site's SSL certificate. To bypass this error temporarily, you can pass verify=False (this turns certificate checking off, so don't leave it in place), i.e.:
requests.get(url, headers=generate_request_header(), verify=False)
To fix it permanently, you may want to read:
http://docs.python-requests.org/en/master/user/advanced/#ssl-cert-verification

I'm guessing the data you're looking for are at line 126 in the pastebin. I've never tried to extract the text of a script but I'm sure it could be done.
In lxml, something like:
source_code.xpath('//script[@type="text/javascript"]') should return a list of all the script elements.
Or to try and get straight to the "tickers":
[s for s in source_code.xpath('//script[@type="text/javascript"]') if 'tickers' in (s.text or '')]
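As a rough sketch of that idea, the snippet below pulls a JSON object out of an inline script using only the standard library (html.parser and json), so it runs without lxml. The sample HTML, the window.preloaded variable name, and the shape of the "tickers" data are made up for illustration; the real page will almost certainly differ.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def extract_embedded_json(html, marker):
    """Return the first JSON object found in a script that mentions `marker`."""
    collector = ScriptCollector()
    collector.feed(html)
    for script in collector.scripts:
        if marker in script and "{" in script:
            start = script.index("{")
            # raw_decode parses one JSON value and ignores the trailing JS (";" etc.)
            obj, _ = json.JSONDecoder().raw_decode(script[start:])
            return obj
    return None


# Hypothetical page: the variable name and the data are invented for this sketch.
SAMPLE_HTML = """
<html><head>
<script type="text/javascript">var unrelated = 1;</script>
<script type="text/javascript">
window.preloaded = {"tickers": [{"title": "Example Shoe", "url": "/example-shoe"}]};
</script>
</head><body></body></html>
"""

data = extract_embedded_json(SAMPLE_HTML, "tickers")
for item in data["tickers"]:
    print(item["title"], item["url"])
```

This only works when the script assigns a literal JSON object; if the data is built up by JavaScript at runtime, a rendering tool such as Selenium is still needed.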

Related

Can't get for loop to work while parsing HTML using Beautiful Soup 4

I'm using the Beautiful Soup documentation to help me understand how to implement it. I'm not too familiar with Python as a whole, so maybe I'm making a syntax error, but I don't believe so. The code below should print out any links from the main Etsy page, but it's not doing that. The documentation states something similar to this, but maybe I'm missing something. Here's my code:
#!/usr/bin/python3
# import library
from bs4 import BeautifulSoup
import requests
import os.path
from os import path
# Request to website and download HTML contents
url='https://www.etsy.com/?utm_source=google&utm_medium=cpc&utm_term=etsy_e&utm_campaign=Search_US_Brand_GGL_ENG_General-Brand_Core_All_Exact&utm_ag=A1&utm_custom1=_k_Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB_k_&utm_content=go_227553629_16342445429_536666953103_kwd-1818581752_c_&utm_custom2=227553629&gclid=Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB'
req=requests.get(url)
content=req.text
soup=BeautifulSoup(content, 'html.parser')
for x in soup.head.find_all('a'):
    print(x.get('href'))
The HTML prints if I set it up that way, but I can't get the for loop to work.
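The loop searches soup.head, but <a> elements live in the page body, so find_all('a') on the head returns an empty list and the loop body never runs. If you're trying to get all the links from the specified URL, search the body instead:
from bs4 import BeautifulSoup
import requests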
url = 'https://www.etsy.com/?utm_source=google&utm_medium=cpc&utm_term=etsy_e&utm_campaign=Search_US_Brand_GGL_ENG_General-Brand_Core_All_Exact&utm_ag=A1&utm_custom1=_k_Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB_k_&utm_content=go_227553629_16342445429_536666953103_kwd-1818581752_c_&utm_custom2=227553629&gclid=Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB'
with requests.get(url) as r:
    r.raise_for_status()
    soup = BeautifulSoup(r.text, 'lxml')
    if (body := soup.body):
        for a in body.find_all('a', href=True):
            print(a['href'])
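To see why the original loop printed nothing, here is a small offline sketch that records which part of the document each anchor sits in. It uses the standard library's html.parser rather than Beautiful Soup so it runs without extra installs, and the sample markup is invented for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Record the href of every <a> tag, grouped by the section it appears in."""

    def __init__(self):
        super().__init__()
        self.section = None  # "head" or "body" once one of them has been opened
        self.hrefs = {"head": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "body"):
            self.section = tag
        elif tag == "a" and self.section:
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href, like href=True does in bs4
                self.hrefs[self.section].append(href)


# Made-up page: the <link> in the head has an href but is not an anchor.
SAMPLE_HTML = """
<html>
<head><title>Shop</title><link rel="stylesheet" href="/site.css"></head>
<body>
<a href="/cart">Cart</a>
<a name="top">no href here</a>
<a href="https://example.com/help">Help</a>
</body>
</html>
"""

collector = AnchorCollector()
collector.feed(SAMPLE_HTML)
print("anchors in head:", collector.hrefs["head"])
print("anchors in body:", collector.hrefs["body"])
```

The head list stays empty because anchors are body content, which is the same reason soup.head.find_all('a') finds nothing.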

python crawling text from <em></em>

Hi, I want to get the text (the number 18) from the em tag, as shown in the picture above.
When I ran my code, it did not work and gave me only an empty list. Can anyone help me? Thank you.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://blog.naver.com/kwoohyun761/221945923725'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
When you disable JavaScript you'll see that the like count is loaded dynamically, so you have to use a service that renders the page before you can parse the content.
You can use an API: https://www.scraperapi.com/
Or run your own renderer, for example: https://github.com/scrapinghub/splash
EDIT:
First of all, I missed that you were using urlopen incorrectly; the correct way is described here: https://docs.python.org/3/howto/urllib2.html . This assumes you are using Python 3, which seems to be the case judging by the print call.
Furthermore, looking at the issue again, it is a bit more complicated. The page actually loads an iframe, and the real content is inside that iframe. Press Ctrl+U to view the source of the original URL, since the site seems to block the browser context menu.
So to achieve your crawling objective you first have to grab the initial page and then grab the page you are actually interested in:
from urllib.request import urlopen
from bs4 import BeautifulSoup
# original url
url = "https://blog.naver.com/kwoohyun761/221945923725"
with urlopen(url) as response:
    html = response.read()
soup = BeautifulSoup(html, 'lxml')
iframe = soup.find('iframe')
# iframe grabbed, construct real url
print(iframe['src'])
real_url = "https://blog.naver.com" + iframe['src']
# do your crawling
with urlopen(real_url) as response:
    html = response.read()
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
You might be able to avoid one round trip by comparing the original URL with the URL in the iframe. At first glance it looks like the iframe URL can be constructed from the original URL.
You'll still need a rendered version of the iframe URL to grab the value you want.
I don't know what this site is about, but it seems they do not want to be crawled; you may want to respect that.
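One small hardening of the URL construction step: the string concatenation used for real_url only works when the iframe src is root-relative. A minimal sketch using urllib.parse.urljoin handles the other common forms as well. The src values below are hypothetical and are not taken from the real page.

```python
from urllib.parse import urljoin

PAGE_URL = "https://blog.naver.com/kwoohyun761/221945923725"


def resolve_iframe_src(page_url, src):
    """Resolve an iframe src attribute against the URL of the page containing it."""
    return urljoin(page_url, src)


# Hypothetical src values covering the three usual forms.
root_relative = resolve_iframe_src(PAGE_URL, "/frame/post?id=1")
absolute = resolve_iframe_src(PAGE_URL, "https://example.com/frame")
protocol_relative = resolve_iframe_src(PAGE_URL, "//example.com/frame")

print(root_relative)      # https://blog.naver.com/frame/post?id=1
print(absolute)           # https://example.com/frame
print(protocol_relative)  # https://example.com/frame
```

In the snippet above this would replace the concatenation with real_url = urljoin(url, iframe['src']).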

Extract the source code of this page using Python (https://mobile.twitter.com/i/bookmarks)

How can I extract the source code of this page using Python (https://mobile.twitter.com/i/bookmarks)?
The problem is that the actual page code does not appear.
import mechanicalsoup as ms
Browser = ms.StatefulBrowser()
Browser.open("https://mobile.twitter.com/login")
Browser.select_form('form[action="/sessions"]')
Browser["session[username_or_email]"] = 'email'
Browser["session[password]"] = 'password'
Browser.submit_selected()
Browser.open("https://mobile.twitter.com/i/bookmarks")
html = Browser.get_current_page()
print(html)
Use BeautifulSoup.
from urllib import request
from bs4 import BeautifulSoup
url_1 = "http://www.google.com"
page = request.urlopen(url_1)
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
print(soup.prettify())
From this answer:
https://stackoverflow.com/a/43290890/11034096
Edit:
It looks like the issue is that Twitter is trying to use a JS redirect to load the next page. JS isn't supported by mechanicalsoup, so you'll need to try something like selenium.
The html variable that you are returning is actually a BeautifulSoup object and not a plain HTML string. Note that html.text is a property, not a method, and it returns only the text with the tags stripped, so it will not show you the markup. To print the markup itself, I would try:
print(html.prettify())
Alternatively, from the BeautifulSoup documentation you should be able to get the non-pretty-printed markup with:
str(html)
or, on Python 2:
unicode(html.a)

How to download tickers from webpage, beautifulsoup didnt get all content

I want to get the ticker values from this webpage https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false
However, when using BeautifulSoup I don't seem to get all the content, and I don't quite understand how to change my code to achieve my goal.
import urllib3
from bs4 import BeautifulSoup
def oslobors():
    http = urllib3.PoolManager()
    url = 'https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false'
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, "html.parser")
    print(soup)
    return
print(oslobors())
The content you want to parse is generated dynamically. You can either use a browser automation tool such as Selenium, or request the URL below, which returns the data as JSON. The latter is the easier way to go.
import requests
url = 'https://www.oslobors.no/ob/servlets/components?type=table&generators%5B0%5D%5Bsource%5D=feed.ob.quotes.EQUITIES%2BPCC&generators%5B1%5D%5Bsource%5D=feed.merk.quotes.EQUITIES%2BPCC&filter=&view=DELAYED&columns=PERIOD%2C+INSTRUMENT_TYPE%2C+TRADE_TIME%2C+ITEM_SECTOR%2C+ITEM%2C+LONG_NAME%2C+BID%2C+ASK%2C+LASTNZ_DIV%2C+CLOSE_LAST_TRADED%2C+CHANGE_PCT_SLACK%2C+TURNOVER_TOTAL%2C+TRADES_COUNT_TOTAL%2C+MARKET_CAP%2C+HAS_LIQUIDITY_PROVIDER%2C+PERIOD%2C+MIC%2C+GICS_CODE_LEVEL_1%2C+TIME%2C+VOLUME_TOTAL&channel=a66b1ba745886f611af56cec74115a51'
res = requests.get(url)
for ticker in res.json()['rows']:
    ticker_name = ticker['values']['ITEM']
    print(ticker_name)
Results you may get (partial):
APP
HEX
APCL
ODFB
SAS NOK
WWI
ASC
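The loop above assumes every row has a 'values' dict with an 'ITEM' key. A slightly more defensive sketch is shown below, run against a made-up payload so it works offline. Only the rows/values/ITEM nesting comes from the answer's code; the sample rows themselves are invented.

```python
import json

# Invented sample payload that mimics the rows/values/ITEM nesting used above.
SAMPLE_PAYLOAD = """
{"rows": [
    {"values": {"ITEM": "APP", "LONG_NAME": "Example A"}},
    {"values": {"ITEM": "HEX"}},
    {"values": {}}
]}
"""


def ticker_names(payload):
    """Return the ITEM value of each row, skipping rows that lack one."""
    names = []
    for row in payload.get("rows", []):
        item = row.get("values", {}).get("ITEM")
        if item is not None:
            names.append(item)
    return names


names = ticker_names(json.loads(SAMPLE_PAYLOAD))
print(names)  # ['APP', 'HEX']
```

With the live endpoint, the same function could be called as ticker_names(res.json()).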

Python How to open an URL and get source code at the same time?

Here is what I have tried so far:
1)
response = urllib2.urlopen(url)
html = response.read()
This way, I can't open the URL in the browser.
2)
webbrowser.open(url)
This way, I can't get the source code of the URL.
So, how can I open a URL and get its source code at the same time?
Thanks for your help.
Have a look at BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
You can request a website and then read the HTML source code from it:
import requests
from bs4 import BeautifulSoup
r = requests.get(YourURL)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
If you need content that is generated by JavaScript, look into headless browsers.
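To do both things the question asks for, one option is simply to make the two calls one after the other: download the source, then hand the same URL to the default browser. The sketch below uses only the standard library. The function names are made up for this example, and the fetch and open steps are passed in as parameters so they can be swapped out (for instance for requests.get) or stubbed when testing.

```python
import webbrowser
from urllib.request import urlopen


def fetch_source(url):
    """Download a URL and return its body decoded to text."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def open_and_fetch(url, fetch=fetch_source, open_in_browser=webbrowser.open):
    """Return the page source and also open the same URL in the default browser."""
    html = fetch(url)
    open_in_browser(url)
    return html
```

Calling html = open_and_fetch(YourURL) would then open a browser tab and return the raw HTML. Note that the browser and the script make two separate requests, so the source returned here is the server's response, not what the browser shows after running JavaScript.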
