Unable to extract date value from website with Python and Beautiful Soup - python

I'm trying to extract the date from a website. I want the date/time when a news article was published.
This is my code:
from bs4 import BeautifulSoup
import requests
url = "http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=911"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
date_tag = 'div#middle p' # this gives me all the paragraphs
date = soup.select(date_tag)
print(date)
You can also try with this website:
url = 'http://www.embrach.ch/de/aktuell/aktuellesinformationen/?action=showinfo&info_id=1098080'
Please check out the URL; that's the website I want to scrape, and the date/time I want to get is: 13:05:28 26.11.2020
This CSS selector only gives me paragraphs, but the date/time is not in a paragraph; it's in a font tag.
date_tag = 'div#middle p'
But when I set my CSS selector to:
date_tag = 'div#middle font'
I get []
Is it possible to extract data that's not in any child tag?

If you grab those elements, you'll notice that the date is the next sibling node of the <h1> tag. So get the <div id="middle"> tag, find the <h1> tag within it, and then take that tag's .nextSibling (spelled .next_sibling in current Beautiful Soup; there's also .previousSibling if the text is placed before a tag element), which is the text. Then it's just a matter of some string manipulation.
Code:
import requests
from bs4 import BeautifulSoup
url = "http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=911"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
date = soup.find_all('div',{'id':'middle'})
print(date)
for each in date:
    print(each.find('h1').nextSibling.split(':',1)[-1].strip())
Output:
13:05:28 26.11.2020
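A minimal, self-contained sketch of the sibling approach above, run against an inline snippet rather than the live site. The markup and the "Datum:" label are assumptions made for illustration (the real page may differ), and extract_published is a hypothetical helper name. It uses .next_sibling and, as an extra step not in the answer, parses the text into a datetime.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical markup: the date is a bare text node right after the <h1>,
# not wrapped in any child tag of its own.
html = (
    '<div id="middle"><h1>News title</h1>'
    'Datum: 13:05:28 26.11.2020'
    '<p>Body text</p></div>'
)


def extract_published(html):
    """Return the publication datetime, or None if the structure is missing."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.select_one("div#middle h1")
    if heading is None or heading.next_sibling is None:
        return None
    # Drop the label before the first colon, keep "HH:MM:SS DD.MM.YYYY".
    raw = str(heading.next_sibling).split(":", 1)[-1].strip()
    return datetime.strptime(raw, "%H:%M:%S %d.%m.%Y")


published = extract_published(html)
print(published)  # 2020-11-26 13:05:28
```

If the node after the heading is another tag rather than plain text, strptime raises ValueError, which is a reasonable signal that the page layout is not what this sketch assumes.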

You would have to extract the entire text as well, since it is all part of the same element. What you can do instead is take the element, since it is basically the same apart from a few minutes, which I assume doesn't matter too much. If you need help with selecting the h1 element, let me know.

Related

Scrape <div<span from HTML-page

I am trying to create a simple weather forecast with Python in Eclipse. So far I have written this:
from bs4 import BeautifulSoup
import requests
def weather_forecast():
    url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/1-92416/Norge/Vestland/Bergen/Bergen'
    r = requests.get(url) # Get request for contents of the page
    print(r.content) # Outputs HTML code for the page
    soup = BeautifulSoup(r.content, 'html5lib') # Parse the data with BeautifulSoup(HTML-string, html-parser)
    min_max = soup.select('min-max.temperature') # Select all spans with a "min-max-temperature" attribute
    print(min_max.prettify())
    table = soup.find('div', attrs={'daily-weather-list-item__temperature'})
    print(table.prettify())
From an HTML page with elements that look like this:
I have found the path to the first temperature in the HTML page's elements, but when I try to execute my code and print to see if I have done it correctly, nothing is printed. My goal is to print a table with dates and corresponding temperatures, which seems like an easy task, but I do not know how to properly name the attribute or how to scrape them all from the HTML page in one iteration.
The <span> holds two temperatures, one minimum and one maximum; here they just happen to be the same.
I want to go into each <div class="daily-weather-list-item__temperature">, collect the two temperatures, and add them to a dictionary. How do I do this?
I have looked at this question on stackoverflow but I couldn't figure it out:
Python BeautifulSoup - Scraping Div Spans and p tags - also how to get exact match on div name
You could use a dictionary comprehension. Loop over all the forecasts which have class daily-weather-list-item, then extract date from the datetime attribute of the time tags, and use those as keys; associate the keys with the maxmin info.
import requests
from bs4 import BeautifulSoup
def weather_forecast():
    url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/1-92416/Norge/Vestland/Bergen/Bergen'
    r = requests.get(url) # Get request for contents of the page
    soup = BeautifulSoup(r.content, 'html5lib')
    # Map each day's datetime attribute to its min/max temperature text
    temps = {i.select_one('time')['datetime']: i.select_one('.min-max-temperature').get_text(strip=True)
             for i in soup.select('.daily-weather-list-item')}
    return temps
print(weather_forecast())
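Since the question asks for the two temperatures separately, here is a hedged sketch that splits the min/max text into numbers. It runs on an inline snippet, not the live page: the class names come from the thread, but the inner layout of the span ("6° / 4°") and the helper name parse_forecast are assumptions for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names mentioned in the thread.
html = """
<ol>
  <li class="daily-weather-list-item">
    <time datetime="2020-11-26">Thursday</time>
    <span class="min-max-temperature">6° / 4°</span>
  </li>
  <li class="daily-weather-list-item">
    <time datetime="2020-11-27">Friday</time>
    <span class="min-max-temperature">5° / 5°</span>
  </li>
</ol>
"""


def parse_forecast(html):
    """Map each date to a dict holding its max and min temperature."""
    soup = BeautifulSoup(html, "html.parser")
    forecast = {}
    for item in soup.select(".daily-weather-list-item"):
        day = item.select_one("time")["datetime"]
        text = item.select_one(".min-max-temperature").get_text(" ", strip=True)
        # Pull every integer out of the text; skip items with no numbers.
        temps = [int(t) for t in re.findall(r"-?\d+", text)]
        if temps:
            forecast[day] = {"max": max(temps), "min": min(temps)}
    return forecast


print(parse_forecast(html))
```

The regular expression only recognises an ASCII minus sign, so a page that renders negative values with a different character would need that pattern adjusted.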

How to scrape a website using selected words if present?

I have used Beautifulsoup to scrape a website. My current code helps me to get the website content in HTML format. I used soup to find the word if it is present or not but I am not able to get the paragraph it belongs to.
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
"https://manychat.com/")
soup = BeautifulSoup(page.content, 'html.parser')
# Extract title of page
page_title = soup.title.text
# Extract body of page
page_body = soup.body
# Extract head of page
page_head = soup.head
# print the result
print(page_body, page_head)
thirdParty = soup.find(text = 'Facebook')
Usually, the areas you're interested in searching are of a common kind, like <div> with a common class. So, you have Soup return all of the <div>s with that class, and you search the div text for your word.
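A small sketch of that idea, assuming the blocks of interest are <div> elements sharing a class. The markup, the "feature" class, and the helper name blocks_containing are all hypothetical; on a real page you would first inspect the HTML to find the actual tag and class.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; only the "feature" blocks are searched.
html = """
<body>
  <div class="feature"><p>Connect with customers on Facebook Messenger.</p></div>
  <div class="feature"><p>Send SMS campaigns.</p></div>
  <div class="footer"><p>Follow us on Facebook.</p></div>
</body>
"""


def blocks_containing(html, word, tag="div", css_class="feature"):
    """Return the text of every matching block that mentions the word."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for block in soup.find_all(tag, class_=css_class):
        text = block.get_text(" ", strip=True)
        if word.lower() in text.lower():  # case-insensitive substring match
            matches.append(text)
    return matches


found = blocks_containing(html, "Facebook")
print(found)  # ['Connect with customers on Facebook Messenger.']
```

Unlike soup.find(text='Facebook'), which only matches a string that is exactly "Facebook", this returns the surrounding block text, which is what the question is after.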

How to use web scraping to get visible text on the webpage?

This is the link of the webpage I want to scrape:
https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html
I have also applied additional filters by clicking on the encircled heading.
This is how the webpage looks after clicking on the heading.
I want to get names of all the places displayed on the webpage but I seem to be having trouble with it as the url doesn't get changed on applying the filter.
I am using python urllib for this.
Here is my code:
from urllib.request import urlopen
url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
You can use bs4, a Python module that lets you pull specific content out of webpages. This will get the text from the site:
from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)
If you would like to get something other than the full text, such as the elements with a certain tag, you can also use bs4:
soup.find_all('p') # get all p tags
soup.find_all('p', class_='Title') # get all p tags with a class of Title
Find what class and tag all of the place names have, and then use the above to get all the place names.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
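As a rough illustration of the last step, the snippet below collects the text of every element with a given tag and class. The markup is invented, and "Title" is just the placeholder class from the answer, not the class the real page uses.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; "Title" stands in for whatever class the names carry.
html = """
<div>
  <p class="Title">Restaurant A</p>
  <p class="Address">Street 1</p>
  <p class="Title">Restaurant B</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all returns every matching tag; get_text extracts the visible text.
names = [p.get_text(strip=True) for p in soup.find_all("p", class_="Title")]
print(names)  # ['Restaurant A', 'Restaurant B']
```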

Why is BeautifulSoup's findAll returning an empty list when I search by class?

I am trying to web-scrape using an h2 tag, but BeautifulSoup returns an empty list.
<h2 class="iCIMS_InfoMsg iCIMS_InfoField_Job">
from urllib.request import urlopen
from bs4 import BeautifulSoup
html=urlopen("https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job")
bs0bj=BeautifulSoup(html,"lxml")
nameList=bs0bj.findAll("h2",{"class":"iCIMS_InfoMsg iCIMS_InfoField_Job"})
print(nameList)
The content is inside an iframe and updated via js (so not present in initial request). You can use the same link the page is using to obtain iframe content (the iframe src). Then extract the string from the script tag that has the info and load with json, extract the description (which is html) and pass back to bs to then select the h2 tags. You now have the rest of the info stored in the second soup object as well if required.
import requests
from bs4 import BeautifulSoup as bs
import json
r = requests.get('https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job?mobile=false&width=1140&height=500&bga=true&needsRedirect=false&jan1offset=0&jun1offset=60&in_iframe=1')
soup = bs(r.content, 'lxml')
script = soup.select_one('[type="application/ld+json"]').text
data = json.loads(script)
soup = bs(data['description'], 'lxml')
headers = [item.text for item in soup.select('h2')]
print(headers)
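The same two-stage parse can be tried offline. The sketch below builds a stand-in page whose JSON-LD script carries an HTML description; the page content and the helper name job_headers are assumptions, and html.parser is used instead of lxml only to avoid an extra dependency.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical stand-in for the iframe response: a JSON-LD script whose
# "description" field is itself a fragment of HTML.
description = (
    '<h2 class="iCIMS_InfoMsg iCIMS_InfoField_Job">Overview</h2><p>Some text</p>'
    '<h2 class="iCIMS_InfoMsg iCIMS_InfoField_Job">Responsibilities</h2><p>More text</p>'
)
page = (
    '<html><head><script type="application/ld+json">'
    + json.dumps({"description": description})
    + "</script></head><body></body></html>"
)


def job_headers(page_html):
    """Parse the page, decode the JSON-LD, then parse the embedded HTML."""
    outer = BeautifulSoup(page_html, "html.parser")
    script = outer.select_one('script[type="application/ld+json"]')
    if script is None or script.string is None:
        return []
    data = json.loads(script.string)
    inner = BeautifulSoup(data.get("description", ""), "html.parser")
    return [h2.get_text(strip=True) for h2 in inner.select("h2")]


print(job_headers(page))  # ['Overview', 'Responsibilities']
```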
The answer lies in two things:
JavaScript-rendered content: it is added after document.onload.
In particular, the content managed by JS comes after a comment and is, indeed, rendered by JS. The line where the block starts is: "< ! - -BEGIN ICIMS - - >" (spaces added so that it does not render blank).
As you can imagine, the h2 with the iCIMS class DOESN'T exist WHEN you call the bs4 methods.
The solution?
IMHO the best way to achieve what you want is to use Selenium to get a fully rendered web page.
See also:
Web-scraping JavaScript page with Python

Retrieve content field from html span

I have the following html code inside an object:
<span itemprop="price" content="187">187,00 €</span>
My idea is to get the content of the span object (the price). In order to do so, I am doing the following:
import requests
from lxml import html
tree = html.fromstring(res.content)
prices = tree.xpath('//span[@class="price"]/text()')
print(float(prices[0].split()[0].replace(',','.')))
Here, res.content contains the span object shown above. As you can see, I am getting the price from 187,00 € (after some modifications) when it would be easier to get it from the "content" attribute of the span. I have tried using:
tree.xpath('//span[@class="price"]/content()')
But it does not work. Is there a way to retrieve this data? I am open to using any other libraries.
You can use the BeautifulSoup library for html parsing:
from bs4 import BeautifulSoup as soup
d = soup('<span itemprop="price" content="187">187,00 €</span>', 'html.parser')
content = d.find('span')['content']
Output:
'187'
To be even more specific, you can provide the itemprop value:
content = d.find('span', {'itemprop':'price'})['content']
To get the content between the tags, use .text:
content = d.find('span', {'itemprop':'price'}).text
Output:
'187,00\xa0€'
You can try:
prices = tree.xpath('//span[@class="price"]')
for price in prices:
    print(price.get("content"))
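With lxml, an attribute can also be selected directly in the XPath expression using @name. The sketch below applies that to the span from the question; it matches on itemprop because that is the attribute the shown snippet actually carries, and the wrapping <div> is added only for illustration.

```python
from lxml import html

# The span from the question, wrapped in a div for the sake of the example.
snippet = '<div><span itemprop="price" content="187">187,00 €</span></div>'

tree = html.fromstring(snippet)
# "/@content" selects the attribute value itself rather than the element.
values = tree.xpath('//span[@itemprop="price"]/@content')
price = float(values[0]) if values else None
print(price)  # 187.0
```

Reading the attribute avoids the comma and currency-symbol clean-up that parsing the visible text requires.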
