Alternatives to Python Beautiful Soup

I wrote a few lines to get data from a financial data website.
It simply uses Beautiful Soup to parse and requests to fetch.
Are there any other simpler or sleeker ways of getting the same result?
I'm just after a discussion to see what others have come up with.
from pandas import DataFrame
import bs4
import requests

def get_webpage():
    symbols = ('ULVR', 'AZN', 'HSBC')
    for ii in symbols:
        url = 'https://uk.finance.yahoo.com/quote/' + ii + '.L/history?p=' + ii + '.L'
        response = requests.get(url)
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        rows = soup.find_all('tr')
        data = [[td.getText() for td in rows[i].find_all('td')] for i in range(len(rows))]
        # for i in data:
        #     [-7:] Date
        #     [-6:] Open
        #     [-5:] High
        #     [-4:] Low
        #     [-3:] Close
        #     [-2:] Adj Close
        #     [-1:] Volume
        data = DataFrame(data)
        print(ii, data)

if __name__ == "__main__":
    get_webpage()
Any thoughts?

You can try the read_html() method:
import pandas as pd

symbols = ('ULVR', 'AZN', 'HSBC')
df = [pd.read_html('https://uk.finance.yahoo.com/quote/' + ii + '.L/history?p=' + ii + '.L') for ii in symbols]
df1 = df[0][0]
df2 = df[1][0]
df3 = df[2][0]
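Note that read_html returns a list of every table it finds on a page, which is why the [0] is needed to pick out the first one. If you would rather keep a single combined frame than three separate variables, a minimal sketch (assuming the first table on each page is the price history you want) could be:

import pandas as pd

symbols = ('ULVR', 'AZN', 'HSBC')
# read_html returns a list of tables found on the page; [0] is the history table
frames = {ii: pd.read_html('https://uk.finance.yahoo.com/quote/' + ii + '.L/history?p=' + ii + '.L')[0] for ii in symbols}
# The dict keys become the outer index level, so each row stays tagged with its symbol
combined = pd.concat(frames)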

As it's just the entire table that I want, it seems easier to use pandas.read_html, especially as I have no need to scrape anything in particular apart from the whole table.
There is some helpful information on this site as guidance: https://pbpython.com/pandas-html-table.html
By just using import pandas as pd I get the result I am after.
import pandas as pd

def get_table():
    symbols = ('ULVR', 'AZN', 'HSBC')
    position = 0
    for ii in symbols:
        table = [pd.read_html('https://uk.finance.yahoo.com/quote/' + ii + '.L/history?p=' + ii + '.L')]
        print(symbols[position])
        print(table, '\n')
        position += 1

if __name__ == "__main__":
    get_table()
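If you want to keep the tables around instead of just printing them, a small variation of the same idea might look like the sketch below; the dict keyed by symbol and the get_tables name are my own choices, not part of the original code.

import pandas as pd

def get_tables():
    symbols = ('ULVR', 'AZN', 'HSBC')
    tables = {}
    for ii in symbols:
        url = 'https://uk.finance.yahoo.com/quote/' + ii + '.L/history?p=' + ii + '.L'
        # read_html returns a list of DataFrames; the first entry is the history table
        tables[ii] = pd.read_html(url)[0]
    return tables

if __name__ == "__main__":
    for symbol, table in get_tables().items():
        print(symbol)
        print(table, '\n')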

Related

Python Streamlit, and yfinance issues

I'll list the two bugs I know of as of now; if you have any recommendations for refactoring my code, let me know.
yfinance is not appending the dividendYield to my dict, even though I made sure that there is an actual dividend yield for those symbols.
TypeError: can only concatenate str (not "Tag") to str, which I assume has something to do with how the XML is parsed: it ran into a Tag, so I am not able to create the expander. I thought I could solve it with this if statement, but instead I just don't get any expander at all.
with st.expander("Expand for stocks news"):
    for heading in fin_headings:
        if heading == str:
            st.markdown("* " + heading)
        else:
            pass
Full code for main.py:
import requests
import spacy
import pandas as pd
import yfinance as yf
import streamlit as st
from bs4 import BeautifulSoup

st.title("Fire stocks :fire:")
nlp = spacy.load("en_core_web_sm")

def extract_rss(rss_link):
    # Parses the XML and extracts the headings.
    headings = []
    response1 = requests.get(
        "http://feeds.marketwatch.com/marketwatch/marketpulse/")
    response2 = requests.get(rss_link)
    parse1 = BeautifulSoup(response1.content, features="xml")
    parse2 = BeautifulSoup(response2.content, features="xml")
    headings1 = parse1.findAll('title')
    headings2 = parse2.findAll('title')
    headings = headings1 + headings2
    return headings

def stock_info(headings):
    # Get the entities from each heading, link them with the Nasdaq data
    # if possible, and extract market data with yfinance.
    stock_dict = {
        'Org': [],
        'Symbol': [],
        'currentPrice': [],
        'dayHigh': [],
        'dayLow': [],
        'forwardPE': [],
        'dividendYield': []
    }
    stocks_df = pd.read_csv("./data/nasdaq_screener_1658383327100.csv")
    for title in headings:
        doc = nlp(title.text)
        for ent in doc.ents:
            try:
                if stocks_df['Name'].str.contains(ent.text).sum():
                    symbol = stocks_df[stocks_df['Name'].str.contains(
                        ent.text)]['Symbol'].values[0]
                    org_name = stocks_df[stocks_df['Name'].str.contains(
                        ent.text)]['Name'].values[0]
                    # Receive info from yfinance
                    stock_info = yf.Ticker(symbol).info
                    print(symbol)
                    stock_dict['Org'].append(org_name)
                    stock_dict['Symbol'].append(symbol)
                    stock_dict['currentPrice'].append(
                        stock_info['currentPrice'])
                    stock_dict['dayHigh'].append(stock_info['dayHigh'])
                    stock_dict['dayLow'].append(stock_info['dayLow'])
                    stock_dict['forwardPE'].append(stock_info['forwardPE'])
                    stock_dict['dividendYield'].append(
                        stock_info['dividendYield'])
                else:
                    # If the name can't be found, pass.
                    pass
            except:
                # Don't raise an error.
                pass
    output_df = pd.DataFrame.from_dict(stock_dict, orient='index')
    output_df = output_df.transpose()
    return output_df

# Add input field
user_input = st.text_input(
    "Add rss link here", "https://www.investing.com/rss/news.rss")

# Get financial headlines
fin_headings = extract_rss(user_input)
print(fin_headings)

# Output financial info
output_df = stock_info(fin_headings)
output_df.drop_duplicates(inplace=True, subset='Symbol')
st.dataframe(output_df)

with st.expander("Expand for stocks news"):
    for heading in fin_headings:
        if heading == str:
            st.markdown("* " + heading)
        else:
            pass
There is an issue with the logic in your stock_info function that causes the same symbol to end up with different values, and when you drop duplicates, only the row from the first occurrence of each symbol is retained.
The code below will solve both of your issues.
import requests
import spacy
import pandas as pd
import yfinance as yf
import streamlit as st
from bs4 import BeautifulSoup

st.title("Fire stocks :fire:")
nlp = spacy.load("en_core_web_sm")

def extract_rss(rss_link):
    # Parses the XML and extracts the headings.
    headings = []
    response1 = requests.get(
        "http://feeds.marketwatch.com/marketwatch/marketpulse/")
    response2 = requests.get(rss_link)
    parse1 = BeautifulSoup(response1.content, features="xml")
    parse2 = BeautifulSoup(response2.content, features="xml")
    headings1 = parse1.findAll('title')
    headings2 = parse2.findAll('title')
    headings = headings1 + headings2
    return headings

def stock_info(headings):
    stock_info_list = []
    stocks_df = pd.read_csv("./data/nasdaq_screener_1658383327100.csv")
    for title in headings:
        doc = nlp(title.text)
        for ent in doc.ents:
            try:
                if stocks_df['Name'].str.contains(ent.text).sum():
                    symbol = stocks_df[stocks_df['Name'].str.contains(
                        ent.text)]['Symbol'].values[0]
                    org_name = stocks_df[stocks_df['Name'].str.contains(
                        ent.text)]['Name'].values[0]
                    # Receive info from yfinance
                    print(symbol)
                    stock_info = yf.Ticker(symbol).info
                    stock_info['Org'] = org_name
                    stock_info['Symbol'] = symbol
                    stock_info_list.append(stock_info)
                else:
                    # If the name can't be found, pass.
                    pass
            except:
                # Don't raise an error.
                pass
    output_df = pd.DataFrame(stock_info_list)
    return output_df

# Add input field
user_input = st.text_input(
    "Add rss link here", "https://www.investing.com/rss/news.rss")

# Get financial headlines
fin_headings = extract_rss(user_input)

output_df = stock_info(fin_headings)
output_df = output_df[['Org', 'Symbol', 'currentPrice', 'dayHigh', 'dayLow', 'forwardPE', 'dividendYield']]
output_df.drop_duplicates(inplace=True, subset='Symbol')
st.dataframe(output_df)

with st.expander("Expand for stocks news"):
    for heading in fin_headings:
        heading = heading.text
        if type(heading) == str:
            st.markdown("* " + heading)
        else:
            pass
For issue #2, the patch code that you posted has a small mistake. Rather than checking if heading == str, which does something completely different from what you intended and will always be False, you want to check isinstance(heading, str). That way you get True if heading is a string and False if not. However, even then it would not be a solution, because heading is not a string. Instead you want to call get_text on heading to get the actual text part of the parsed object:
heading.get_text()
More information would be needed to solve issue #1. What does stock_dict look like before you create the DataFrame out of it? Specifically, what values are in stock_dict['dividendYield']? Can you print it and add it to your question?
Also, about the refactoring part: an
else:
    pass
block does nothing at all and should be deleted (when the if condition is false, nothing happens anyway).
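Putting both points together, a sketch of the expander block could look like this (assuming fin_headings holds the bs4 Tag objects returned by extract_rss, as in the question):

with st.expander("Expand for stocks news"):
    for heading in fin_headings:
        # get_text() pulls the plain text out of the parsed <title> tag
        text = heading.get_text()
        st.markdown("* " + text)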

Python: Get element next to href

Python code:
import string
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/players/'
initial = list(string.ascii_lowercase)
initial_url = [url + i for i in initial]
html_initial = [urllib.request.urlopen(i).read() for i in initial_url]
soup_initial = [BeautifulSoup(i, 'html.parser') for i in html_initial]
tags_initial = [i('a') for i in soup_initial]
print(tags_initial[0][50])
Results example:
Shareef Abdur-Rahim
From the example above, I want to extract the name of the player, which is 'Shareef Abdur-Rahim', but I want to do it for all of the tags_initial lists.
Does anyone have an idea?
Could you modify your post by adding your code so that we can help you better?
Maybe this could help you:
name = soup.findAll(YOUR_SELECTOR)[0].string
UPDATE
import re
import string
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'https://www.basketball-reference.com/players/'

# Alphabet
initial = list(string.ascii_lowercase)
datas = []

# URLs
urls = [url + i for i in initial]
for url in urls:
    # Soup object
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    # Player links
    url_links = soup.findAll("a", href=re.compile("players"))
    for link in url_links:
        # Player name
        datas.append(link.string)

print("datas : ", datas)
Then, "datas" contains all the names of the players, but I advise you to do a little processing afterwards to remove some erroneous information like "..." or perhaps duplicates
There are probably better ways but I'd do it like this:
html = "a href=\"/teams/LAL/2021.html\">Los Angeles Lakers</a"
index = html.find("a href")
index = html.find(">", index) + 1
index_end = html.find("<", index)
print(html[index:index_end])
If you're using a scraper library it probably has a similar function built-in.
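For instance, with BeautifulSoup (which the question already uses), the equivalent is just get_text() on the tag; a short sketch:

from bs4 import BeautifulSoup

html = '<a href="/teams/LAL/2021.html">Los Angeles Lakers</a>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.a.get_text())   # Los Angeles Lakers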

For Loop only prints the first value

I am trying to web scrape stock data using a for loop on a list of five stocks. The problem is that only the first value is returned, five times. I have tried appending to a list, but it still doesn't work, although clearly I am not appending correctly. On the website, I want to get the data for Operating Cash, which comes in the form of 14B or 1B for example, which is why I have removed the B and multiplied the number to get a raw value. Here is my code:
import requests
import yfinance as yf
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User Agent': 'Mozilla/5.0'}
stocks = ['AMC', 'AMD', 'PFE', 'AAPL', 'NVDA']
finished_list = []
for stock in stocks:
    url = f'https://www.marketwatch.com/investing/stock/{stock}/financials/cash-flow'
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    operating_cash = soup.findAll('div', class_="cell__content")[134].text
    finished_list.append(operating_cash)
    if 'B' in operating_cash:
        cash1 = operating_cash.replace('B', '')
        if '(' in cash1:
            cash2 = cash1.replace('(', '-')
            if ')' in cash2:
                cash3 = cash2.replace(')', '')
                cash3 = float(cash3)
                print(cash3 * 1000000000)
        else:
            cash1 = float(cash1)
            print(cash1 * 1000000000)
The current output is -1060000000.0 five times in a row, which is the correct value for operating cash for AMC but not for the other four. Thanks in advance to anyone who can help me out.
You don't need to use if conditions for str.replace(). Instead, do all your replacements in one line like so:
for stock in stocks:
    url = f'https://www.marketwatch.com/investing/stock/{stock}/financials/cash-flow'
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    operating_cash = soup.findAll('div', class_="cell__content")[134].text
    finished_list.append(operating_cash)
    cash = float(operating_cash.replace('B', '').replace('(', '-').replace(')', ''))
    print(cash * 1000000000)
-1060000000.0
1070000000.0000001
14400000000.0
80670000000.0
5820000000.0
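If values can also appear in millions (e.g. 14M) rather than only billions, a small hypothetical helper could centralise the parsing; parse_cash and its suffix map below are assumptions for illustration, not part of the original code:

def parse_cash(value):
    # Convert strings like '14B', '1.07B' or '(1.06B)' into raw floats
    multipliers = {'B': 1_000_000_000, 'M': 1_000_000, 'K': 1_000}
    value = value.replace('(', '-').replace(')', '')
    suffix = value[-1]
    if suffix in multipliers:
        return float(value[:-1]) * multipliers[suffix]
    return float(value)

print(parse_cash('(1.06B)'))   # -1060000000.0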

extracting multiple data from table row in BS4

In the code below I am trying to extract IP addresses and ports from the table on http://free-proxy-list.net using BeautifulSoup.
But every time I get the whole row, which is useless because I can't separate the IP addresses from their ports.
How can I get the IP and port separated?
Here is my code:
def get_proxy(self):
    response = requests.get(self.url)
    soup = bs(response.content, 'html.parser')
    data_list = [tr for tr in soup.select('tr') if tr.td]
    for i in data_list:
        print(i.text)
In your code, instead of i.text you could use i.getText(' ,') (or another separator of your choice instead of ',').
That will give you comma-separated IPs and ports.
Moreover, for convenience you could load the proxy list into a DataFrame as well.
Make the following changes/additions to your code:
soup = bs(response.content, 'html.parser')
data_list = [tr for tr in soup.select('tr') if tr.td]
data_list2 = [tr.getText(' ,') for tr in soup.select('tr') if tr.td]
# for i in data_list:
#     print(i.text)
df = pd.DataFrame(data_list2, columns=['proxy_list'])
df_proxyList = df['proxy_list'].str.split(',', expand=True)[0:300]
df_proxyList would then hold the IP and port in separate columns (along with a few garbage columns).
Try this. I had to add the isnumeric() condition to make sure that the code doesn't include the data from another table which is present on the same website.
from bs4 import BeautifulSoup as bs
import requests
from collections import defaultdict

def get_proxy(url):
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    mapping = defaultdict()
    for tr in soup.select('tr'):
        if len(list(tr)) == 8:
            ip_val = str(list(tr)[0].text)
            port_val = str(list(tr)[1].text)
            if port_val.isnumeric():
                mapping[ip_val] = port_val
    for items in mapping.keys():
        print("IP:", items)
        print("PORT:", mapping[items])

if __name__ == '__main__':
    url = "http://free-proxy-list.net"
    get_proxy(url)
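An alternative sketch that picks the td cells directly instead of indexing list(tr) (assuming the IP and port are the first two columns of the table, as they are on that site):

from bs4 import BeautifulSoup
import requests

response = requests.get("http://free-proxy-list.net")
soup = BeautifulSoup(response.content, 'html.parser')

proxies = {}
for tr in soup.select('tr'):
    cells = tr.find_all('td')
    # Keep only rows whose second cell is a numeric port
    if len(cells) >= 2 and cells[1].text.isnumeric():
        proxies[cells[0].text] = cells[1].text

for ip, port in proxies.items():
    print("IP:", ip, "PORT:", port)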

Scrape a table iterating over pages of a website: how to define the last page?

I have the following code that works OK:
import requests
from bs4 import BeautifulSoup
import pandas as pd

df_list = []
for i in range(1, 13):
    url = 'https://www.uzse.uz/trade_results?date=25.01.2019&mkt_id=ALL&page=%d' % i
    df_list.append(pd.read_html(url)[0])
df = pd.concat(df_list)
df
But for this particular page I happen to know the number of pages, which is why I wrote range(1, 13). Is there a way to detect the last page so I do not have to go and check how many pages there are for a given date?
Try with:
for i in range(1, 100):
    url = 'https://www.uzse.uz/trade_results?date=25.01.2019&mkt_id=ALL&page=%d' % i
    if pd.read_html(url)[0].empty:
        break
    else:
        df_list.append(pd.read_html(url)[0])

# Or the same idea with a while loop
page = 1
while True:
    url = 'https://www.uzse.uz/trade_results?date=25.01.2019&mkt_id=ALL&page=%d' % page
    if pd.read_html(url)[0].empty:
        break
    df_list.append(pd.read_html(url)[0])
    page = page + 1
print(page)
I know the number of pages, which is 13 in range(1, 13).
You seem to be suffering from an OBOB (https://en.wikipedia.org/wiki/Off-by-one_error). Put a print(i) in your loop and you'll see it counts from 1 up to 12.
You may be happier with:
for i in range(13):
and then use the expression ... % (i + 1).
Cf https://docs.python.org/3/library/stdtypes.html#range
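In other words, a minimal sketch of the adjusted loop:

for i in range(13):
    url = 'https://www.uzse.uz/trade_results?date=25.01.2019&mkt_id=ALL&page=%d' % (i + 1)
    df_list.append(pd.read_html(url)[0])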
For this particular website, you could detect the number of pages from the pagination bar. You can use something similar to the following code snippet:
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.uzse.uz/trade_results?date=25.01.2019&mkt_id=ALL')
soup = BeautifulSoup(r.text, 'html.parser')
lastpage_url = soup.find("li", {"class": "last next"}).findChildren("a")[0]['href']
num_pages = int(lastpage_url[lastpage_url.rfind("=")+1:])
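A sketch of how the detected page count could then drive the original loop (assuming num_pages was computed as above):

import pandas as pd

df_list = []
for i in range(1, num_pages + 1):
    url = 'https://www.uzse.uz/trade_results?date=25.01.2019&mkt_id=ALL&page=%d' % i
    df_list.append(pd.read_html(url)[0])
df = pd.concat(df_list)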
