'NoneType' Error While WebScraping StockTwits - python

I am trying to write a script that simply reads and prints all of the tickers on a particular account's watchlist. I have managed to navigate to the page and print the user's name from the HTML, and now I want to print all the tickers he follows by using find() to locate the watchlist container and then find_all() to find each ticker. But every time I try to use find() to navigate to the watchlist tickers, it returns None.
Here is my code:
import requests
import xlwt
from xlutils.copy import copy
from xlwt import Workbook
import xlrd
import urllib.request as urllib2
from bs4 import BeautifulSoup
hisPage = ("https://stocktwits.com/GregRieben/watchlist")
page = urllib2.urlopen(hisPage)
soup = BeautifulSoup(page, "html.parser")
his_name = soup.find("span", {"class":"st_33aunZ3 st_31YdEUQ st_8u0ePN3 st_2mehCkH"})
name = his_name.text.strip()
print(name)
watchlist = soup.find("div", {"class":"st_16989tz"})
tickers = watchlist.find_all('span', {"class":"st_1QzH2P8"})
print(type(watchlist))
print(len(watchlist))
Here I want the highlighted value (LSPD.CA) and all the others after it (they all have exactly the same HTML structure).
Here is my error:
AttributeError: 'NoneType' object has no attribute 'find_all'
That content is dynamically added via an API call, so it is not present in the response to your request to the original URL (the DOM is not updated the way it would be when using a browser). You can find the API call for the watchlist in the network traffic. It returns JSON, and you can extract what you want from that.
import requests

# the watchlist endpoint returns JSON; 396907 is this user's id (see below for how to find it)
r = requests.get('https://api.stocktwits.com/api/2/watchlists/user/396907.json').json()
tickers = [i['symbol'] for i in r['watchlist']['symbols']]
print(tickers)
If you need to get the user id to pass to the API, it is present in a number of places in the response from your original URL. Here I use a regex to grab it from a script tag:
import requests, re

p = re.compile(r'subjectUser":{"id":(\d+)')

with requests.Session() as s:
    r = s.get('https://stocktwits.com/GregRieben/watchlist')
    user_id = p.findall(r.text)[0]
    r = s.get('https://api.stocktwits.com/api/2/watchlists/user/' + user_id + '.json').json()
    tickers = [i['symbol'] for i in r['watchlist']['symbols']]
    print(tickers)

Related

Crawling dynamic page elements

I am crawling a page with Python. The discount price on the page (shaded in red) exists as text inside a script tag when you inspect the site with the browser developer tools.
from bs4 import BeautifulSoup as bs4
import requests as req
import json
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
res = req.get(url)
soup = bs4(res.text,'html.parser')
# json_data1=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')[1].split('=')[1].replace(';',"")
# data=json.loads(json_data1)
# print(data)
json_data2=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')
print(json_data2)
However, when I print the result in the terminal, the discount price I saw in the web browser comes out as the regular price, as shown below. How can I get the discounted value?
Selenium takes a long time to run, so I would rather use requests or some other approach.
Using regular expressions will do the trick.
from bs4 import BeautifulSoup as bs4
import re
import requests as req
import json
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
res = req.get(url)
soup = bs4(res.text,'html.parser')
# json_data1=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')[1].split('=')[1].replace(';',"")
# data=json.loads(json_data1)
# print(data)
json_data2=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')
for i in json_data2:
    results = re.findall(r'lastPrc : (\d+?),', i)
    if results:
        print(results)
OUTPUT
['1310000']
The value that you are looking for is no longer there.
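As a quick sanity check (a minimal sketch; the URL is the product page from the question with the tracking parameters dropped, and the regex is a looser variant of the one above), you can grep the raw server response for the lastPrc key to confirm whether the value is still there:
import requests
import re

url = 'https://www.11st.co.kr/products/4976666261'
res = requests.get(url)

# search the raw (un-rendered) HTML for the lastPrc key; an empty result
# means the discounted price is no longer in the server response at all
matches = re.findall(r'lastPrc\s*:\s*(\d+)', res.text)
print(matches if matches else 'lastPrc not found in the raw response')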

Can't scrape a link connected to some download button from a page using requests

I'm trying to download a csv file from a webpage using the requests module. The idea is to parse the link connected to the download button so that I can use it to download the csv file. The link I'm trying to grab is a dynamic one, but I've noticed so far that there is always some way to find it. However, I just couldn't make it work.
I've tried with:
import requests
from bs4 import BeautifulSoup
link = "https://finance.yahoo.com/quote/AAPL/history?p=AAPL"
r = requests.get(link)
soup = BeautifulSoup(r.text,"html.parser")
file_link = soup.select_one("a[href='/finance/download/']").get("href")
print(file_link)
With the above attempt, the script throws an AttributeError, as it could not find the link on that site.
How can I fetch the download link from that page using requests?
It seems that the link to download the CSV is constructed dynamically via JavaScript, but you can construct a similar link using Python:
import requests
from datetime import datetime

csv_link = 'https://query1.finance.yahoo.com/v7/finance/download/{quote}?period1={from_}&period2={to_}&interval=1d&events=history'

quote = 'AAPL'
# '%s' formats a date as a Unix timestamp (a platform-dependent extension; see the EDIT below)
from_ = datetime(2019,9,27,0,0).strftime('%s')
to_ = datetime(2020,9,27,23,59).strftime('%s')

print(requests.get(csv_link.format(quote=quote, from_=from_, to_=to_)).text)
Prints:
Date,Open,High,Low,Close,Adj Close,Volume
2019-09-27,220.539993,220.960007,217.279999,218.820007,216.670242,25352000
2019-09-30,220.899994,224.580002,220.789993,223.970001,221.769623,25977400
2019-10-01,225.070007,228.220001,224.199997,224.589996,222.383545,34805800
2019-10-02,223.059998,223.580002,217.929993,218.960007,216.808853,34612300
2019-10-03,218.429993,220.960007,215.130005,220.820007,218.650574,28606500
2019-10-04,225.639999,227.490005,223.889999,227.009995,224.779770,34619700
2019-10-07,226.270004,229.929993,225.839996,227.059998,224.829269,30576500
2019-10-08,225.820007,228.059998,224.330002,224.399994,222.195404,27955000
2019-10-09,227.029999,227.789993,225.639999,227.029999,224.799576,18692600
2019-10-10,227.929993,230.440002,227.300003,230.089996,227.829498,28253400
2019-10-11,232.949997,237.639999,232.309998,236.210007,233.889374,41698900
2019-10-14,234.899994,238.130005,234.669998,235.869995,233.552719,24106900
2019-10-15,236.389999,237.649994,234.880005,235.320007,233.008133,21840000
2019-10-16,233.369995,235.240005,233.199997,234.369995,232.067444,18475800
2019-10-17,235.089996,236.149994,233.520004,235.279999,232.968521,16896300
2019-10-18,234.589996,237.580002,234.289993,236.410004,234.087433,24358400
2019-10-21,237.520004,240.990005,237.320007,240.509995,238.147125,21811800
2019-10-22,241.160004,242.199997,239.619995,239.960007,237.602539,20573400
2019-10-23,242.100006,243.240005,241.220001,243.179993,240.790909,18957200
...and so on.
EDIT: strftime('%s') is not portable (it is a glibc extension and fails on some platforms, e.g. Windows); here is a platform-independent way to build the timestamps:
import requests
from datetime import datetime
csv_link = 'https://query1.finance.yahoo.com/v7/finance/download/{quote}?period1={from_}&period2={to_}&interval=1d&events=history'
quote = 'AAPL'
from_ = str(datetime.timestamp(datetime(2019,9,27,0,0))).split('.')[0]
to_ = str(datetime.timestamp(datetime(2020,9,27,23,59))).split('.')[0]
print(requests.get(csv_link.format(quote=quote, from_=from_, to_=to_)).text)

How to download tickers from webpage, beautifulsoup didnt get all content

I want to get the ticker values from this webpage https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false
However, when using BeautifulSoup I don't seem to get all the content, and I don't quite understand how to change my code to achieve my goal.
import urllib3
from bs4 import BeautifulSoup
def oslobors():
    http = urllib3.PoolManager()
    url = 'https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false'
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, "html.parser")
    print(soup)
    return

print(oslobors())
The content you want to parse is generated dynamically. You can either use a browser simulator like selenium, or you can try the URL below, which returns the JSON response the page itself loads. The latter is the easy way to go.
import requests
url = 'https://www.oslobors.no/ob/servlets/components?type=table&generators%5B0%5D%5Bsource%5D=feed.ob.quotes.EQUITIES%2BPCC&generators%5B1%5D%5Bsource%5D=feed.merk.quotes.EQUITIES%2BPCC&filter=&view=DELAYED&columns=PERIOD%2C+INSTRUMENT_TYPE%2C+TRADE_TIME%2C+ITEM_SECTOR%2C+ITEM%2C+LONG_NAME%2C+BID%2C+ASK%2C+LASTNZ_DIV%2C+CLOSE_LAST_TRADED%2C+CHANGE_PCT_SLACK%2C+TURNOVER_TOTAL%2C+TRADES_COUNT_TOTAL%2C+MARKET_CAP%2C+HAS_LIQUIDITY_PROVIDER%2C+PERIOD%2C+MIC%2C+GICS_CODE_LEVEL_1%2C+TIME%2C+VOLUME_TOTAL&channel=a66b1ba745886f611af56cec74115a51'
res = requests.get(url)
for ticker in res.json()['rows']:
    ticker_name = ticker['values']['ITEM']
    print(ticker_name)
The results you get look like this (partial):
APP
HEX
APCL
ODFB
SAS NOK
WWI
ASC

Empty result while using bs4 to parse a site

I want to parse the price information on Bitmex using bs4.
(The site URL is 'https://www.bitmex.com/app/trade/XBTUSD')
So I wrote the code like this:
from bs4 import BeautifulSoup
import requests
url = 'https://www.bitmex.com/app/trade/XBTUSD'
bitmex = requests.get(url)
if bitmex.status_code == 200:
    print("connected...")
else:
    print("Error...")
bitmex_html = bitmex.text
soup = BeautifulSoup(bitmex_html , 'lxml' )
price = soup.find_all("span", {"class": "price"})
print(price)
And the result is like this:
connected...
[]
Why did '[]' pop up? And what should I do to get the price text, like '6065.5'?
The text I want to parse is
<span class="price">6065.5</span>
and the selector is
content > div > div.tickerBar.overflown > div > span.instruments.tickerBarSection > span:nth-child(1) > span.price
I'm just studying Python, so this question may seem odd to a pro... sorry.
You were pretty close. Give the following a try and see if it's more what you wanted. Perhaps the format you are seeing or retrieving is not quite what you expect. Hope this is helpful.
from bs4 import BeautifulSoup
import requests
import sys
import json
url = 'https://www.bitmex.com/app/trade/XBTUSD'
bitmex = requests.get(url)
if bitmex.status_code == 200:
    print("connected...")
else:
    print("Error...")
    sys.exit(1)
bitmex_html = bitmex.text
soup = BeautifulSoup(bitmex_html , 'lxml' )
# extract the json text from the returned page
price = soup.find_all("script", {"id": "initialData"})
price = price.pop()
# parse json text
d = json.loads(price.text)
# pull out the order book and then each price listed in the order book
order_book = d['orderBook']
prices = [v['price'] for v in order_book]
print(prices)
Example output:
connected...
[6045, 6044.5, 6044, 6043.5, 6043, 6042.5, 6042, 6041.5, 6041, 6040.5, 6040, 6039.5, 6039, 6038.5, 6038, 6037.5, 6037, 6036.5, 6036, 6035.5, 6035, 6034.5, 6034, 6033.5, 6033, 6032.5, 6032, 6031.5, 6031, 6030.5, 6030, 6029.5, 6029, 6028.5, 6028, 6027.5, 6027, 6026.5, 6026, 6025.5, 6025, 6024.5, 6024, 6023.5, 6023, 6022.5, 6022, 6021.5, 6021, 6020.5]
Your problem is that the page doesn't contain those span elements in the first place. If you check the response tab in your browser's developer tools (press F12 in Firefox), you can see that the page is composed of script tags whose JavaScript creates the elements dynamically when executed.
Since BeautifulSoup can't execute JavaScript, you can't extract the elements directly with it. You have two alternatives:
1. Use something like selenium, which allows you to drive a browser from Python. JavaScript will be executed because you're using a real browser, but performance suffers (see the Selenium sketch after the sample output below).
2. Read the JavaScript code, understand it, and write Python code to simulate it. This is usually harder, but luckily for you it is very simple for this page:
import json

import requests
import lxml.html

r = requests.get('https://www.bitmex.com/app/trade/XBTUSD')
doc = lxml.html.fromstring(r.text)
# the page embeds its initial state as JSON in a script tag with id "initialData"
data = json.loads(doc.xpath("//script[@id='initialData']/text()")[0])
As you can see, the data is in JSON format inside the page. After loading the data variable you can use it to access the information you want:
for row in data['orderBook']:
    print(row['symbol'], row['price'], row['side'])
Will print:
('XBTUSD', 6051.5, 'Sell')
('XBTUSD', 6051, 'Sell')
('XBTUSD', 6050.5, 'Sell')
('XBTUSD', 6050, 'Sell')
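For completeness, here is a minimal sketch of alternative 1. It assumes Selenium 4 with a local Chrome installation (both are assumptions, not part of the original answer); the span.price selector is the one from the question:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.bitmex.com/app/trade/XBTUSD')

# wait for the JavaScript-rendered price spans to appear, then read them
prices = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'span.price'))
)
print([p.text for p in prices])

driver.quit()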

How to scrape total search results using Python

I am a beginner in Python and web scraping, but I am really interested. What I want to do is extract the total number of search results per day.
If you open the page, you will see this:
Used Cars for Sale
Results 1 - 20 of 30,376
What I want is only the number 30,376. Is there any way to extract it automatically on a daily basis and save it to an Excel file? I have played around with some packages in Python, but all I got were error messages and irrelevant output, like below:
from bs4 import BeautifulSoup
from urllib.request import urlopen
base_url = "..."
def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

make_soup(base_url)
Can someone show me how to extract that particular number please? Thanks!
Here is one way, using the requests module and the soup.select function.
from bs4 import BeautifulSoup
import requests
base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
def make_soup(url):
    html = requests.get(url).content
    soup = BeautifulSoup(html, "lxml")
    txt = soup.select('#result-header .result-count')[0].text
    print(txt.split()[-1])

make_soup(base_url)
soup.select accepts a CSS selector as its argument. The selector #result-header .result-count means: find the element with class result-count inside the element whose id is result-header.
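A tiny self-contained demo of how that selector matches (the HTML snippet is hypothetical, modeled on the structure described above):
from bs4 import BeautifulSoup

# minimal HTML mirroring the id/class structure the selector targets
html = '<div id="result-header"><p class="result-count">Results 1 - 20 of 30,376</p></div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('#result-header .result-count')[0].text)  # Results 1 - 20 of 30,376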
from bs4 import BeautifulSoup
from urllib.request import urlopen
base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
html = urlopen(base_url).read()
soup = BeautifulSoup(html, 'lxml')
result_count = soup.find(class_="result-count").text.split('of ')[-1]
print(result_count)
out:
30,376
from bs4 import BeautifulSoup
import requests, re
base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
a = BeautifulSoup(requests.get(base_url).content, 'html.parser').select('div#result-header p.result-count')[0].text
num = re.search(r'([\w,]+)$', a)
print(int(num.groups(1)[0].replace(',', '')))
Output:
30378
This will also pick up whatever number appears at the end of that statement.
Appending new rows to an existing Excel file
Script to append today's date and the extracted number to existing excel file:
!!!Important!!!: Don't run this code directly on your main file. Instead, make a copy of it first and run it on that file. If it works properly, then you can run it on your main file. I'm not responsible if you lose your data :)
import openpyxl
import datetime

wb = openpyxl.load_workbook('/home/yusuf/Desktop/data.xlsx')
sheet = wb['Sheet1']

# append today's date and the extracted number as a new row at the bottom of the sheet
sheet.append([datetime.date.today(), 30378])  # use a variable here from the above (previous) code

wb.save('/home/yusuf/Desktop/data.xlsx')
