Retrieve content field from html span - python

I have the following html code inside an object:
<span itemprop="price" content="187">187,00 €</span>
My idea is to get the content of the span object (the price). To do so, I am doing the following:
import requests
from lxml import html
tree = html.fromstring(res.content)
prices = tree.xpath('//span[@class="price"]/text()')
print(float(prices[0].split()[0].replace(',','.')))
Here, res.content contains the span shown above. As you can see, I am extracting the price from the text 187,00 € (after some string manipulation), when it would be easier to read it from the "content" attribute of the span. I have tried using:
tree.xpath('//span[@class="price"]/content()')
But it does not work. Is there a way to retrieve this data? I am open to using other libraries.

You can use the BeautifulSoup library for html parsing:
from bs4 import BeautifulSoup as soup
d = soup('<span itemprop="price" content="187">187,00 €</span>', 'html.parser')
content = d.find('span')['content']
Output:
'187'
To be even more specific, you can filter on the itemprop value:
content = d.find('span', {'itemprop':'price'})['content']
To get the text between the tags, use the tag's .text attribute:
content = d.find('span', {'itemprop':'price'}).text
Output:
'187,00\xa0€'

You can try reading the attribute with .get():
prices = tree.xpath('//span[@class="price"]')
for price in prices:
    print(price.get("content"))
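As a minimal sketch of the same idea, XPath can also select the attribute value directly with "/@content". The snippet below parses the span from the question as an inline string (an assumption standing in for res.content) and matches on itemprop, since that is the attribute the posted markup actually carries:

```python
from lxml import html

# The span from the question, used here as a stand-in for res.content.
snippet = '<div><span itemprop="price" content="187">187,00 €</span></div>'
tree = html.fromstring(snippet)

# "/@content" selects the attribute value itself rather than the element.
prices = tree.xpath('//span[@itemprop="price"]/@content')
print(float(prices[0]))
```

Because the attribute already holds a plain number, no comma or currency-symbol handling is needed before calling float().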

Related

Is there a way I can extract a list from a javascript document?

I need to obtain the list of owners of an online-game item from a website, and from my research this requires some web scraping. However, the information sits inside JavaScript code in the page rather than in plain HTML of the kind that bs4 makes easy to parse. I need to read a variable from that JavaScript (it contains a list of the owners of the item I'm looking at) and turn it into a list, JSON object, or string that I can use in my program. Is there a way to do this, and if so, how?
I've attached an image of the variable I need when viewing the page source of the site I'm on.
My current code:
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.rolimons.com/item/1029025').content #the item webpage
soup = BeautifulSoup(html, "lxml")
datas = soup.find_all("script")
print(datas) #prints the sections of the website content that have ja
BeautifulSoup alone cannot extract a JavaScript variable; a regular expression (the re module) is also needed to pull the assignment out of the script text.
Then use ast.literal_eval to convert the string representation of the dict into an actual dict.
from bs4 import BeautifulSoup
import requests
import re
import ast
html = requests.get('https://www.rolimons.com/item/1029025').content #the item webpage
soup = BeautifulSoup(html, "lxml")
ownership_data = re.search(r'ownership_data\s+=\s+.*;', soup.text).group(0)
ownership_data_dict = ast.literal_eval(ownership_data.split('=')[1].strip().replace(';', ''))
print(ownership_data_dict)
Output:
> {'id': 1029025, 'num_points': 1616, 'timestamps': [1491004800,
> 1491091200, 1491177600, 1491264000, 1491350400, 1491436800,
> 1491523200, 1491609600, 1491696000, 1491782400, 1491868800,
> 1491955200, 1492041600, 1492128000, 1492214400, 1492300800,
> 1492387200, 1492473600, 1492560000, 1492646400, 1492732800,
> 1492819200, ...}
Another approach is to capture the value with a regular expression and parse it with json.loads:
import requests
import json
import re
r = requests.get('...')
m = re.search(r'var history_data\s+=\s+(.*)', r.text)
print(json.loads(m.group(1)))
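A minimal offline sketch of the regex-plus-JSON approach follows. The page source, the variable name item_details, and its values are all hypothetical, chosen only so the example runs without a network request. The pattern stops at the closing brace so that the trailing semicolon is not handed to json.loads:

```python
import json
import re

# Hypothetical page source; the variable name and values are made up.
page_source = """
<script>
var item_details = {"id": 1029025, "owners": ["alice", "bob"]};
</script>
"""

# Capture the object literal only, leaving out the trailing semicolon.
match = re.search(r'var\s+item_details\s*=\s*(\{.*?\})\s*;', page_source, re.DOTALL)
if match is None:
    raise ValueError("variable not found in page source")

item_details = json.loads(match.group(1))
print(item_details["owners"])
```

This only works when the JavaScript literal happens to be valid JSON (double-quoted keys and strings, no trailing commas); otherwise something like ast.literal_eval or a JavaScript-aware parser may be needed.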

How to use web scraping to get visible text on the webpage?

This is the link of the webpage I want to scrape:
https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html
I have also applied additional filters by clicking on the encircled heading (first screenshot).
This is how the webpage looks after clicking on the heading (second screenshot).
I want to get the names of all the places displayed on the webpage, but I am having trouble because the URL does not change when the filter is applied.
I am using python urllib for this.
Here is my code:
from urllib.request import urlopen
url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
You can use bs4 (BeautifulSoup), a Python module for extracting data from web pages. This will get the text from the site:
from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)
If you would like to get something other than the full text, such as elements with a certain tag, you can also use bs4:
soup.find_all('p') # getting all p tags
soup.find_all('p', class_='Title') # getting all p tags with a class of Title
Find what class and tag all of the place names have, and then use the above to get all the place names.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
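As a rough sketch of that last step, the example below collects names from a small inline HTML string. The markup, the a tag, and the class name place-name are hypothetical; the real page will use different tags and classes, which you would find with the browser's inspector:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's tags and class names will differ.
html = """
<div class="listing"><a class="place-name" href="/r1">Cafe One</a></div>
<div class="listing"><a class="place-name" href="/r2">Cafe Two</a></div>
<p class="footer">About us</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only the text of elements carrying the (assumed) name class.
names = [a.get_text(strip=True) for a in soup.find_all("a", class_="place-name")]
print(names)
```

Note that urlopen only returns the HTML the server sends for that URL. If the filter is applied by scripts in the browser without changing the URL, the filtered results may not be present in that HTML at all.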

Unable to extract date value from website with Python and Beautiful Soup

I'm trying to extract a date from a website. I want the date/time when a news article was published.
This is my code:
from bs4 import BeautifulSoup
import requests
url = "http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=911"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
date_tag = 'div#middle p' # this gives me all the paragraphs
date = soup.select(date_tag)
print(date)
You can also try with this website:
url = 'http://www.embrach.ch/de/aktuell/aktuellesinformationen/?action=showinfo&info_id=1098080'
Please check out the URL; that is the website I want to scrape, and the date/time I want to get is: 13:05:28 26.11.2020
This is my CSS selector. It only gives me paragraphs, but the date/time is not in a paragraph; it is in a font tag.
date_tag = 'div#middle p'
But when I set my CSS selector to:
date_tag = 'div#middle font'
I get []
Is it possible to extract data that is not in any child tag?
If you grab those elements, you'll notice that the date is the next sibling node of the <h1> tag. So first get the <div id="middle"> tag, then within it get the <h1> tag, and from that <h1> tag take .nextSibling (there is also .previousSibling for text placed before a given tag), which is the text node holding the date. After that it is just a matter of some string manipulation.
Code:
import requests
from bs4 import BeautifulSoup
url = "http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=911"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
date = soup.find_all('div',{'id':'middle'})
print(date)
for each in date:
    print(each.find('h1').nextSibling.split(':',1)[-1].strip())
Output:
13:05:28 26.11.2020
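The same sibling lookup can be tried offline. In this sketch the markup, the heading text, and the "Publiziert" label are assumptions that imitate the layout described above, and .next_sibling is the current spelling of .nextSibling in bs4. The last step, parsing the string into a datetime, is an optional addition:

```python
from datetime import datetime
from bs4 import BeautifulSoup

# Hypothetical markup imitating the layout: bare text right after the <h1>.
html = ('<div id="middle"><h1>News title</h1>'
        'Publiziert: 13:05:28 26.11.2020<p>Body text</p></div>')
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("div", id="middle").find("h1")
raw = h1.next_sibling                       # the text node after the <h1>
published = raw.split(":", 1)[-1].strip()   # drop the label before the first colon
print(published)

# Optional: turn the string into a datetime object.
parsed = datetime.strptime(published, "%H:%M:%S %d.%m.%Y")
print(parsed.isoformat())
```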
You would have to extract the entire text as well, since it is all part of the same element. Alternatively, you can take the h1 element, since its value is basically the same except for a few minutes, which I assume doesn't matter too much. If you need help with selecting the h1 element, let me know.

HTML parsing , nested div issue using BeautifulSoup

I am trying to extract a specific nested div class and the value that goes with its h3 label (the salary value).
So I have tried searching by class:
soup.find_all('div',{'class':"vac_display_field"}
which returns an empty list.
Snippet code:
<div class="vac_display_field">
<h3>
Salary
</h3>
<div class="vac_display_field_value">
£27,951 - £30,859
</div>
</div>
First make sure you've instantiated your BeautifulSoup object correctly. Should look something like this:
from bs4 import BeautifulSoup
import requests
url = 'https://www.civilservicejobs.service.gov.uk/csr/index.cgi?SID=cGFnZWNsYXNzPUpvYnMmb3duZXJ0eXBlPWZhaXImY3NvdXJjZT1jc3FzZWFyY2gmcGFnZWFjdGlvbj12aWV3dmFjYnlqb2JsaXN0JnNlYXJjaF9zbGljZV9jdXJyZW50PTEmdXNlcnNlYXJjaGNvbnRleHQ9MjczMzIwMTcmam9ibGlzdF92aWV3X3ZhYz0xNTEyMDAwJm93bmVyPTUwNzAwMDAmcmVxc2lnPTE0NzcxNTIyODItYjAyZmM4ZTgwNzQ2ZTA2NmY5OWM0OTBjMTZhMWNlNjhkZDMwZDU4NA=='
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser') # the 'html.parser' part is optional.
Your code for finding the div tags looks correct (it is missing a closing parenthesis, however). If for some reason it still doesn't work, try calling find_all() this way:
soup.find_all('div', class_='vac_display_field')
If you inspect the page's source, you'll find that the div tag you need is the second from the top.
Your code can reflect that using simple index notation:
Salary_info = soup.find_all(class_='vac_display_field')[1]
Then output the text:
print(Salary_info.get_text())
HTH.
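Relying on a fixed index can break if the page layout changes. One alternative, sketched here against the snippet from the question, is to pair each h3 label with its value div and look the salary up by name:

```python
from bs4 import BeautifulSoup

# The snippet from the question, parsed as an inline string.
html = """
<div class="vac_display_field">
<h3>
Salary
</h3>
<div class="vac_display_field_value">
£27,951 - £30,859
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

fields = {}
for field in soup.find_all("div", class_="vac_display_field"):
    label = field.find("h3")
    value = field.find("div", class_="vac_display_field_value")
    if label is not None and value is not None:
        fields[label.get_text(strip=True)] = value.get_text(strip=True)

print(fields["Salary"])
```

class_ matches whole class names, so the inner vac_display_field_value div is not picked up by the outer find_all call.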

Webscraper in Python - How do I extract exact text I need?

Good day
I am trying to write my first webscraper. I have managed to write the following:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get("http://www.sharenet.co.za/v3/quickshare.php?scode=BTI")
r = s.post("http://www.sharenet.co.za/v3/quickshare.php?scode=BTI")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.find_all("td", class_="dataCell"))
I am trying to extract a share price. When I inspect the element, this is the HTML code:
<td class="dataCell" align="right">85221</td>
Image of share price table
Basically, my issue is that I can search for all the tags but can't extract the exact tag I want.
Thanks in advance for any help.
Tags have a get_text() method, and find_all returns a list of tags, so loop over the result:
for cell_tag in soup.find_all("td"):
    print(cell_tag.get_text())
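To narrow the result to the price cells and get numbers rather than strings, something like the following may work. The table markup is a hypothetical stand-in built around the td from the question, and the second cell is invented to show how non-numeric cells get skipped:

```python
from bs4 import BeautifulSoup

# Stand-in markup built around the <td> from the question; "n/a" cell is invented.
html = ('<table><tr>'
        '<td class="dataCell" align="right">85221</td>'
        '<td class="dataCell" align="right">n/a</td>'
        '</tr></table>')
soup = BeautifulSoup(html, "html.parser")

values = []
for cell in soup.find_all("td", class_="dataCell"):
    text = cell.get_text(strip=True)
    if text.isdigit():          # skip cells that are not plain integers
        values.append(int(text))

print(values)
```

If several cells share the dataCell class, you would still need some extra hint (position in the row, a neighbouring label cell) to pick out the one holding the share price.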
