Webscraper in Python - How do I extract the exact text I need? - python

Good day
I am trying to write my first webscraper. I have managed to write the following:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get("http://www.sharenet.co.za/v3/quickshare.php?scode=BTI")
r = s.post("http://www.sharenet.co.za/v3/quickshare.php?scode=BTI")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.find_all("td", class_="dataCell"))
I am trying to extract a share price. When I inspect the element, this is the HTML code:
<td class="dataCell" align="right">85221</td>
Basically, my issue is that I can search for all the tags but can't extract the exact tag I want.
Thanks in advance for any help.

Tags have a get_text() method. find_all returns a list of tags.
for cell_tag in soup.find_all("td"):
    print(cell_tag.get_text())
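To narrow the result to the price cell only, the class filter from the question can be combined with get_text(). The following is a minimal sketch on an inline HTML sample rather than the live page; the second table row is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the cell shown in the question.
# The second row is a made-up cell that should be skipped.
html = """
<table>
  <tr><td class="dataCell" align="right">85221</td></tr>
  <tr><td class="otherCell">ignored</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of Tag objects; get_text() pulls the text out of each.
prices = [cell.get_text(strip=True) for cell in soup.find_all("td", class_="dataCell")]
print(prices)
```

On the real page several cells may share the dataCell class, so the wanted value may need to be picked out of the list by position.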

Related

How to use web scraping to get visible text on the webpage?

This is the link of the webpage I want to scrape:
https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html
I have also applied additional filters by clicking on the encircled heading.
This is what the webpage looks like after clicking on the heading.
I want to get the names of all the places displayed on the webpage, but I am having trouble because the URL doesn't change when the filter is applied.
I am using Python's urllib for this.
Here is my code:
from urllib.request import urlopen
url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
You can use bs4 (Beautiful Soup), a Python module for extracting data from webpages. This will get the text from the page:
from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)
If you would like to get something other than the full text, such as the elements with a certain tag, you can also use bs4:
soup.find_all('p') # Getting all p tags
soup.find_all('p', class_='Title') # Getting all p tags with a class of Title
Find what class and tag all of the place names have, and then use the above to get all the place names.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
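As a rough illustration of the tag-plus-class lookup described above, here is a small sketch on made-up markup. The p tag and the Title and Blurb class names are assumptions; the real page will use different ones, which you can find with the browser's inspector.

```python
from bs4 import BeautifulSoup

# Made-up markup; the real page uses different tags and class names.
html = """
<div>
  <p class="Title">Cafe One</p>
  <p class="Title">Cafe Two</p>
  <p class="Blurb">Open late</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the <p> tags whose class is "Title" and collect their text.
names = [p.get_text(strip=True) for p in soup.find_all("p", class_="Title")]
print(names)
```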

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the source code of the webpage, the data is inside a CDATA section and is passed to a jQuery call. I have managed to write the code below, which returns a large amount of text; the script at index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06" but don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print(soup.find_all('script')[21])
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value>b').text
If you view the page source of the URL, you will find two script elements that contain CDATA. The script element you are interested in also has jQuery in it, so you can select it on that basis. After that, you need to do some cleaning to get rid of the CDATA markers and the jQuery call. Then, with the help of the json library, convert the JSON data to a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
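As an alternative to the chain of replace calls, a regular expression can capture just the JSON argument of the jQuery.extend call. This is only a sketch on an inline sample: the markup below is a cut-down imitation of the page's script block, and the regex is a swapped-in technique, not the approach used in the answer above.

```python
import json
import re

from bs4 import BeautifulSoup

# Cut-down imitation of the page: one unrelated script and one CDATA-wrapped
# script that passes a JSON object to jQuery.extend.
html = """
<html><body>
<script>var unrelated = 1;</script>
<script>
<!--//--><![CDATA[//><!--
jQuery.extend(Drupal.settings, {"objectFlot": {"plotMain1": {"params": {"redshift": "0.06"}}}});
//--><!]]>
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Capture everything between "jQuery.extend(Drupal.settings," and the closing ");".
pattern = re.compile(r"jQuery\.extend\(Drupal\.settings,\s*(\{.*\})\s*\);", re.DOTALL)

redshift = None
for script in soup.find_all("script"):
    match = pattern.search(script.string or "")
    if match:
        settings = json.loads(match.group(1))
        redshift = settings["objectFlot"]["plotMain1"]["params"]["redshift"]
        break

print(redshift)
```

Capturing the JSON directly avoids stripping a stray ");" that might appear inside the data itself.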

BeautifulSoup returns the whole DIV, but without the value

This might be a bit of a basic question, but either I do not know how to phrase it, or I'm not finding the answer.
So, I want to scrape a specific value from a website (18.73kWh in this scenario).
> <div class="itemized-bill-header-consumption"data-bind="text:$root.formatItemizedbillConsumption(key.consumption,key.type)">18.73kWh</div>
So I am using Python and BeautifulSoup to get the value,
kwh = soup.findAll('div',{"class":"itemized-bill-header-consumption"})
The thing is that, as a result, I'm getting
[<div class="itemized-bill-header-consumption" data-bind="text:$root.formatItemizedbillConsumption(key.consumption,key.type)"></div>]
Which is pretty much everything minus the value I want... and I can't figure out why.
Thanks in advance for your help
Use the get_text() method.
html = """
<div class="itemized-bill-header-consumption"data-bind="text:$root.formatItemizedbillConsumption(key.consumption,key.type)">18.73kWh</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, features='lxml')
for div in soup.findAll('div',{"class":"itemized-bill-header-consumption"}):
    print(div.get_text())
Output
18.73kWh
You can also use a CSS selector with select:
from bs4 import BeautifulSoup
html_doc="""<div class="itemized-bill-header-consumption"data-bind="text:$root.formatItemizedbillConsumption(key.consumption,key.type)">18.73kWh</div>"""
soup = BeautifulSoup(html_doc, 'lxml')
kwh = soup.select("div.itemized-bill-header-consumption")[0].text
print(kwh)
Output will be:
18.73kWh
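A small sketch of the same lookup with select_one, which returns None instead of raising IndexError when nothing matches. The helper name consumption_text is hypothetical, and the sample markup is simplified from the question. The second sample imitates the empty div the asker got back; in that case there is simply no text in the downloaded HTML for get_text() to return.

```python
from bs4 import BeautifulSoup

# Simplified from the question's markup: one div with the value,
# one empty div like the result the asker actually received.
filled = '<div class="itemized-bill-header-consumption">18.73kWh</div>'
empty = '<div class="itemized-bill-header-consumption"></div>'


def consumption_text(html):
    """Return the stripped text of the consumption div, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.select_one("div.itemized-bill-header-consumption")
    return div.get_text(strip=True) if div is not None else None


print(consumption_text(filled))
print(consumption_text(empty))
print(consumption_text("<p>no div here</p>"))
```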

Finding name and codes of all airports

I am trying to scrape data to get the text I need. I want to find the line that says aberdeen and all lines after it which contain the airport info. Here is a pic of the html hierarchy:
I am trying to locate the text elements inside the class "i1" with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
But I am not getting the values I expect at all. Here is a link to the data if curious. I am new to scraping obviously.
The problem is your BeautifulSoup parser:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
If what you want is the text elements, you can use:
soup.get_text()
Note: this will give you all the text elements.
Why are people suggesting Selenium? This page doesn't load its data dynamically. requests + re is all you need; you don't even need Beautiful Soup.
import re
import requests
data = requests.get('http://www.airportcodes.org/').text
cities_and_codes = re.findall(r"([A-Za-z, ]+)\(([A-Z]{3})\)", data)
This looks for any run of letters, commas, and spaces
followed by exactly 3 uppercase letters in parentheses.
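To see what that pattern captures, here is a sketch run against a made-up sample string instead of the live page. The sample lines are assumptions about the "City, Region (CODE)" shape and are not copied from the site.

```python
import re

# Made-up sample in the "City, Region (CODE)" shape the pattern expects.
sample = "Aberdeen, SD (ABR)\nAberdeen, Scotland, UK (ABZ)\nAbilene, TX (ABI)"

# Group 1: letters, commas and spaces; group 2: exactly three uppercase letters.
matches = re.findall(r"([A-Za-z, ]+)\(([A-Z]{3})\)", sample)

# The first group keeps the trailing space before "(", so strip it.
pairs = [(name.strip(), code) for name, code in matches]
print(pairs)
```

Names containing other characters, such as hyphens, apostrophes or accented letters, would be cut short by this character class, so it may need widening for the real data.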

python3, web scraping, beautifulsoup can't return data

I have been trying for two days to extract the price of BTC from https://www.bitfinex.com/stats. I am missing something fundamental, as I have looked at lots of different tutorials, videos and blogs.
The price is located in the HTML like this:
<td class="col-currency">4849.7</td>
My code is below:
import requests
from bs4 import BeautifulSoup
#enter website address
url = requests.get('https://www.bitfinex.com/stats')
html = url.content
soup = BeautifulSoup(html, 'html.parser')
Where do I go from here?
You should read the bs4 documentation.
You're looking for this to find the element (the price sits in a td, not a div):
data = soup.find('td', attrs={'class': 'col-currency'})
then this to get the text
price = data.text
If that doesn't work, you can use string manipulation to get the result from data.
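Put together, the lookup might look like the sketch below. It runs on an inline copy of the cell from the question rather than the live page, and the surrounding table markup is assumed.

```python
from bs4 import BeautifulSoup

# The td is copied from the question; the table wrapper around it is assumed.
html = '<table><tr><td class="col-currency">4849.7</td></tr></table>'

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag, or None when nothing matches.
cell = soup.find("td", class_="col-currency")
price = float(cell.get_text(strip=True)) if cell is not None else None
print(price)
```

If find returns None on the real page, the cell is probably not present in the HTML that requests downloads.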
