Extracting specific Information from a website using BeautifulSoup (Python)

I am accessing the following website to extract a list of stocks:
http://www.barchart.com/stocks/performance/12month.php
I am using the following code:
from bs4 import BeautifulSoup
import requests
url = "http://www.barchart.com/stocks/performance/12month.php"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
# Print the href of every link on the page
for link in soup.find_all('a'):
    print(link.get('href'))
The problem is that I am getting a lot of other information that I don't need. Is there a way to get just the stock names and nothing else?

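import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.barchart.com/stocks/performance/12month.php")
html = r.text
soup = BeautifulSoup(html, 'html.parser')
# Each stock name sits in a link inside a <td class="ds_name"> cell
tds = soup.find_all("td", {"class": "ds_name"})
for td in tds:
    print(td.a.text)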
If you look at the page source, you will find that everything you need is in a table. Specifically, the stock names are in <td> elements with class="ds_name", so selecting those cells is enough.
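As an offline illustration of the same selection, here is a minimal sketch that runs against a small inline HTML fragment instead of the live page. The fragment, the ds_symbol class and the company names in it are made up for the example; only the ds_name class comes from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure described in the answer;
# the real page may be laid out differently.
SAMPLE_HTML = """
<table>
  <tr><td class="ds_symbol"><a href="/quotes/AAA">AAA</a></td>
      <td class="ds_name"><a href="/quotes/AAA">Alpha Corp</a></td></tr>
  <tr><td class="ds_symbol"><a href="/quotes/BBB">BBB</a></td>
      <td class="ds_name"><a href="/quotes/BBB">Beta Inc</a></td></tr>
  <tr><td class="ds_name">No link here</td></tr>
</table>
"""

def extract_names(html):
    """Return the link text of every <td class="ds_name"> cell that has a link."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for td in soup.find_all("td", class_="ds_name"):
        # td.a is None when the cell has no <a>, so skip those cells
        if td.a is not None:
            names.append(td.a.get_text(strip=True))
    return names

names = extract_names(SAMPLE_HTML)
print(names)
```

The None check is there because td.a.text raises AttributeError on a cell without a link, which is easy to hit if the page layout changes.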

Related

scraping table with bs4

I am trying to scrape a table that is under a div tag with id pcaxis_tablediv using the following code. However, when I print it, it returns None. I am looking at the source code of the website and I can't see what I am doing wrong.
import requests
from bs4 import BeautifulSoup
url = 'https://www.statistikdatabasen.scb.se/pxweb/sv/ssd/START__AM__AM0208__AM0208B/YREG65/sortedtable/tableViewSorted/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
wanted_table = soup.find_all('div', id="pcaxis_tablediv")
print(wanted_table)

How can I get the text from this specific div class?

I want to extract the text here
a lot of text
I used
url = ('https://osu.ppy.sh/users/1521445')
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
mestuff = soup.find("div", {"class":"bbcode bbcode--profile-page"})
but it never fails to return with "None" in the terminal.
How can I go about this?
Link is "https://osu.ppy.sh/users/1521445"
(This is a repost since the old question was super old. I don't know if I should've made another question or not but aa)

The data is loaded dynamically from a script tag, so, as the other answer says, you can grab it from that tag. Target the tag by its id, pull out the relevant JSON, take the HTML from that JSON, and then parse that HTML, which the page would otherwise have rendered dynamically. At that point you can use your original class selector.
import requests, json, pprint
from bs4 import BeautifulSoup as bs
r = requests.get('https://osu.ppy.sh/users/1521445')
soup = bs(r.content, 'lxml')
all_data = json.loads(soup.select_one('#json-user').text)
soup = bs(all_data['page']['html'], 'lxml')
pprint.pprint(soup.select_one('.bbcode--profile-page').get_text('\n'))
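The same steps can be tried without a network call. This is only a sketch on a hand-built page: the json-user id, the page/html keys and the bbcode--profile-page class are taken from the code above, while the page content itself is invented and the real site may structure its JSON differently.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page: the visible markup is stored as a string inside JSON
# held by a <script> tag, mirroring the structure the answer describes.
embedded = {"page": {"html": '<div class="bbcode bbcode--profile-page">Hello<br>world</div>'}}
SAMPLE_PAGE = (
    '<html><body><div id="app"></div>'
    '<script id="json-user" type="application/json">'
    + json.dumps(embedded)
    + "</script></body></html>"
)

outer = BeautifulSoup(SAMPLE_PAGE, "html.parser")

# Searching the outer document finds nothing: the div only exists as text.
missing = outer.find("div", class_="bbcode--profile-page")

# Step 1: locate the script tag by id and decode its JSON payload.
script = outer.find("script", id="json-user")
data = json.loads(script.string)

# Step 2: parse the embedded HTML string as its own document.
inner = BeautifulSoup(data["page"]["html"], "html.parser")
text = inner.find("div", class_="bbcode--profile-page").get_text("\n", strip=True)
print(text)
```

The first lookup returning None is the behaviour the question reports; the div becomes findable only after the embedded string is parsed separately.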

You could try this:
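import re
import requests
from bs4 import BeautifulSoup
url = 'https://osu.ppy.sh/users/1521445'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# The profile data is embedded as JSON text in the script tag with id "json-user"
x = soup.find_all("script", {"id": re.compile(r"json-user")})
result = re.findall(r'raw":(.+)},"previous_usernames', x[0].text.strip())
print(result)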
I'm not sure why the div with class='bbcode bbcode--profile-page' is stored as a string inside the script tag with id='json-user', but that is why you can't get its value by searching for the div directly.
Hope this helps.

Getting None when scraping for operating income from SEC EDGAR document

I'm trying to obtain the latest quarter's operating income/loss from a quarterly filing.
Desired output highlighted in green: financial statement
Here's the URL of the document that I'm trying to scrape: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm
If you'd like to see the data point visually, it is in PART I, Item 1. Financial Statements, Operating income.
The HTML code for the figure that I'm trying to get:
<ix:nonfraction id="fact-identifier-125" name="us-gaap:OperatingIncomeLoss" contextref="FD2019Q3QTD" unitref="usd" decimals="-6" scale="6" format="ixt:numdotdecimal" data-original-id="d305292495e1903-wk-Fact-6250FB76089207E7F73CB52756E0D8D0" continued-taxonomy="false" enabled-taxonomy="true" highlight-taxonomy="false" selected-taxonomy="false" hover-taxonomy="false" onclick="Taxonomies.clickEvent(event, this)" onkeyup="Taxonomies.clickEvent(event, this)" onmouseenter="Taxonomies.enterElement(event, this);" onmouseleave="Taxonomies.leaveElement(event, this);" tabindex="18" isadditionalitemsonly="false">11,544</ix:nonfraction>
The code that I used to obtain this data point (11,544):
from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm'
response = requests.get(url)
content = BeautifulSoup(response.content, 'html.parser')
operatingincomeloss = content.find('ix:nonfraction', attrs={"name": "us-gaap:OperatingIncomeLoss", "contextref":"FD2019Q3QTD"})
print(operatingincomeloss)
I also tried with
operatingincomeloss = content.find('ix:nonfraction', attrs={"name": "us-gaap:OperatingIncomeLoss"})
Eventually, I want to loop through all the relevant filings to pull this data point. Currently, I'm just getting None. When I Ctrl+F through content, I can't find the ix:nonfraction tag either.

The page is loaded via JavaScript. I tracked down the XHR request it makes and extracted the required data from that response.
import requests
from bs4 import BeautifulSoup
r = requests.get(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.select("#d305292495e1903-wk-Fact-6250FB76089207E7F73CB52756E0D8D0"):
    print(item.text)
Output:
11,544
Updated:
import requests
from bs4 import BeautifulSoup
r = requests.get(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.find_all("ix:nonfraction", {'contextref': 'FD2019Q3QTD', 'name': 'us-gaap:OperatingIncomeLoss'}):
    print(item.text)

As @αԋɱҽԃ αмєяιcαη said, the page is loaded via JavaScript.
I have used the XHR request URL in this code.
Of the attributes you used, I have kept only the name attribute, because contextref changes for each element.
You can also change the name attribute if you want to loop through other elements.
Since you said you want to loop through this tag, the code below prints every value it returns.
Code:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm')
soup = BeautifulSoup(res.text, 'html.parser')
for data in soup.find_all('ix:nonfraction', {'name': 'us-gaap:OperatingIncomeLoss'}):
    print(data.text)
Output:
11,544
12,612
48,305
54,780
7,442
7,496
26,329
26,580
3,687
3,892
14,371
15,044
3,221
3,414
12,142
15,285
1,795
1,765
7,199
7,193
1,155
1,127
4,811
4,980
17,300
17,694
64,852
69,082
11,544
12,612
48,305
54,780
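The printed values are strings with thousands separators, and several contexts share the same name attribute. As a rough sketch of turning such tags into numbers keyed by context, the fragment below reuses the attributes quoted in the question (name, contextref, scale) on a hand-written snippet. The second tag's context id is invented for illustration, and treating scale as a power-of-ten multiplier is an assumption about that attribute that the answers do not state.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; EXAMPLE_OTHER_CONTEXT is a hypothetical context id.
SAMPLE = """
<div>
  <ix:nonfraction name="us-gaap:OperatingIncomeLoss" contextref="FD2019Q3QTD"
                  unitref="usd" decimals="-6" scale="6">11,544</ix:nonfraction>
  <ix:nonfraction name="us-gaap:OperatingIncomeLoss" contextref="EXAMPLE_OTHER_CONTEXT"
                  unitref="usd" decimals="-6" scale="6">12,612</ix:nonfraction>
</div>
"""

def operating_income_by_context(html):
    """Map each contextref to the tag's value multiplied by 10**scale."""
    soup = BeautifulSoup(html, "html.parser")
    values = {}
    for tag in soup.find_all("ix:nonfraction", {"name": "us-gaap:OperatingIncomeLoss"}):
        raw = int(tag.get_text(strip=True).replace(",", ""))  # "11,544" -> 11544
        scale = int(tag.get("scale", "0"))                    # assumed power of ten
        values[tag["contextref"]] = raw * 10 ** scale
    return values

values = operating_income_by_context(SAMPLE)
print(values)
```

Keying by contextref makes it possible to pick out one reporting period afterwards instead of reading positions in a long printed list.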

BeautifulSoup scraping Bitcoin Price issue

I am still new to Python, and especially BeautifulSoup. I've been reading up on this stuff for a few days and playing around with a bunch of different code and getting mixed results. However, on this page is the Bitcoin price I would like to scrape. The price is located in:
<span class="text-large2" data-currency-value="">$16,569.40</span>
Meaning that, I'd like to have my script print only that line where the value is. My current code prints the whole page and it doesn't look very nice, since it's printing a lot of data. Could anybody please help to improve my code?
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
div = soup.find('text-large2', attrs={'class': 'stripe'})
for row in soup.findAll('div'):
    for cell in row.findAll('tr'):
        print cell.text
And this is a snip of the output I get after running the code. It doesn't look very nice or readable.
#SourcePairVolume (24h)PriceVolume (%)Updated
1BitMEXBTC/USD$3,280,130,000$15930.0016.30%Recently
2BithumbBTC/KRW$2,200,380,000$17477.6010.94%Recently
3BitfinexBTC/USD$1,893,760,000$15677.009.41%Recently
4GDAXBTC/USD$1,057,230,000$16085.005.25%Recently
5bitFlyerBTC/JPY$636,896,000$17184.403.17%Recently
6CoinoneBTC/KRW$554,063,000$17803.502.75%Recently
7BitstampBTC/USD$385,450,000$15400.101.92%Recently
8GeminiBTC/USD$345,746,000$16151.001.72%Recently
9HitBTCBCH/BTC$305,554,000$15601.901.52%Recently

Try this:
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
# Narrow to the container div first, then to the price span inside it
div = soup.find("div", {"class": "col-xs-6 col-sm-8 col-md-4 text-left"}).find("span", {"class": "text-large2"})
for i in div:
    print(i)
This prints 16051.20 for me.
Later edit: if you put the above code in a function and call it in a loop, it will keep returning the current value. I get different values now.
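A minimal sketch of that "function in a loop" idea, kept independent of the network: poll is a hypothetical helper, and the fetch argument stands in for whatever function wraps the scraping code above. Apart from 16051.20, the readings are placeholder values invented for the example.

```python
import time
from typing import Callable, List

def poll(fetch: Callable[[], str], times: int, delay_seconds: float = 0.0) -> List[str]:
    """Call fetch() `times` times, sleeping between calls, and collect the results."""
    results = []
    for attempt in range(times):
        results.append(fetch())
        # No need to wait after the final reading
        if attempt < times - 1:
            time.sleep(delay_seconds)
    return results

# Stand-in for the real scraping function, so the sketch runs offline.
fake_prices = iter(["16051.20", "16049.75", "16060.10"])
readings = poll(lambda: next(fake_prices), times=3)
print(readings)
```

With a real fetch function, a delay of several seconds or more between calls is a sensible choice so the site is not hit in a tight loop.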

This works. I think you are using an older version of BeautifulSoup, though; try running pip install bs4 in Command Prompt or PowerShell.
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
value = soup.find('span', {'class': 'text-large2'})
print(''.join(value.stripped_strings))
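For reference, the same lookup can be checked offline against the span quoted in the question. This is a sketch under the assumption that the markup still looks like that snippet; parse_price is a hypothetical helper, and converting to Decimal is an addition of mine, not part of the answer.

```python
from decimal import Decimal
from bs4 import BeautifulSoup

# The span is copied from the question; the wrapping div is added for the example.
SAMPLE = '<div><span class="text-large2" data-currency-value="">$16,569.40</span></div>'

def parse_price(html):
    """Return the price in the first span.text-large2 as a Decimal, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", {"class": "text-large2"})
    if span is None:
        return None
    text = "".join(span.stripped_strings)
    # Drop the currency symbol and thousands separators before converting
    return Decimal(text.replace("$", "").replace(",", ""))

price = parse_price(SAMPLE)
print(price)
```

Returning None when the span is missing keeps a layout change on the site from surfacing as an AttributeError deep inside the parsing code.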

beautifulsoup can't find any tags

I have a script that I've used for several years. One particular page on the site loads and returns soup, but all my finds return no result. This is old code that has worked on this site in the past. Instead of searching for a specific <div> I simplified it to look for any table, tr or td, with find or findAll. I've tried various methods of opening the page, including lxml - all with no results.
My interests are in the player_basic and player_records divs.
from BeautifulSoup import BeautifulSoup, NavigableString, Tag
import urllib2
url = "http://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=60456"
#html = urllib2.urlopen(url).read()
html = urllib2.urlopen(url,"lxml")
soup = BeautifulSoup(html)
#div = soup.find('div', {"class":"player_basic"})
#div = soup.find('div', {"class":"player_records"})
item = soup.findAll('td')
print item

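You're not reading the response. Try this:
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen
url = 'http://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=60456'
# urlopen's second positional argument is request data, not a parser name, so pass only the URL
response = urlopen(url)
html = response.read()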
Then you can use it with BeautifulSoup. If it still does not work, there is a good chance the page contains malformed HTML (missing closing tags, etc.), since the parsers that BeautifulSoup uses (especially html.parser) are not very tolerant of that.
UPDATE: try using the lxml parser:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
tds = soup.find_all('td')
print(len(tds))
This prints 142 for me.
