Scrape <div><span> from an HTML page - python

I am trying to create a simple weather forecast with Python in Eclipse. So far I have written this:
from bs4 import BeautifulSoup
import requests
def weather_forecast():
    url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/1-92416/Norge/Vestland/Bergen/Bergen'
    r = requests.get(url) # Get request for contents of the page
    print(r.content) # Outputs HTML code for the page
    soup = BeautifulSoup(r.content, 'html5lib') # Parse the data with BeautifulSoup(HTML-string, html-parser)
    min_max = soup.select('min-max.temperature') # Select all spans with a "min-max-temperature" attribute
    print(min_max.prettify())
    table = soup.find('div', attrs={'daily-weather-list-item__temperature'})
    print(table.prettify())
The HTML page contains elements that look like this (screenshot not included):
I have found the path to the first temperature in the page's HTML elements, but when I run my code and print the result to check it, nothing is printed. My goal is to print a table of dates and their corresponding temperatures. This seems like an easy task, but I do not know how to name the attribute correctly or how to scrape all of the values from the page in one pass.
The <span> stores two temperatures, one minimum and one maximum; here they just happen to be the same.
I want to go into each <div class="daily-weather-list-item__temperature">, collect the two temperatures, and add them to a dictionary. How do I do this?
I have looked at this question on Stack Overflow, but I couldn't figure it out:
Python BeautifulSoup - Scraping Div Spans and p tags - also how to get exact match on div name

You could use a dictionary comprehension. Loop over all the forecasts, which have the class daily-weather-list-item, extract the date from the datetime attribute of each time tag, and use those dates as keys; associate each key with the min/max temperature text.
import requests
from bs4 import BeautifulSoup
def weather_forecast():
    url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/1-92416/Norge/Vestland/Bergen/Bergen'
    r = requests.get(url) # Get request for contents of the page
    soup = BeautifulSoup(r.content, 'html5lib')
    # Map each forecast's date to its min/max temperature text
    temps = {i.select_one('time')['datetime']: i.select_one('.min-max-temperature').get_text(strip=True)
             for i in soup.select('.daily-weather-list-item')}
    return temps

print(weather_forecast())
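To try the same idea without contacting the live site, here is a minimal sketch that runs against a hand-written HTML fragment. The fragment only imitates the class names mentioned above; the real markup on yr.no may differ, and the dates, temperatures, and the name parse_forecast are made up for illustration. It collects the two temperatures of each day separately, as the question asked, and uses the built-in html.parser so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the class names discussed above;
# the real page may be structured differently.
SAMPLE_HTML = """
<ol>
  <li class="daily-weather-list-item">
    <time datetime="2020-11-20">Friday</time>
    <div class="daily-weather-list-item__temperature">
      <span class="min-max-temperature">
        <span class="temperature">7</span>/<span class="temperature">4</span>
      </span>
    </div>
  </li>
  <li class="daily-weather-list-item">
    <time datetime="2020-11-21">Saturday</time>
    <div class="daily-weather-list-item__temperature">
      <span class="min-max-temperature">
        <span class="temperature">5</span>/<span class="temperature">5</span>
      </span>
    </div>
  </li>
</ol>
"""


def parse_forecast(html):
    """Return {date: [first temperature, second temperature]} for each day."""
    soup = BeautifulSoup(html, "html.parser")
    forecast = {}
    for item in soup.select(".daily-weather-list-item"):
        date = item.select_one("time")["datetime"]
        # Each inner span with class "temperature" holds one value
        temps = [t.get_text(strip=True)
                 for t in item.select(".min-max-temperature .temperature")]
        forecast[date] = temps
    return forecast


forecast = parse_forecast(SAMPLE_HTML)
for date, temps in forecast.items():
    print(date, temps)
```

Keeping the two values in a list, rather than as one joined string, makes it easier to convert them to numbers later.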

Related

Awkward problem with iterating over the list and extracting only the last listed link from the page [BS4]

I am trying to scrape the website, and there are 12 pages with X links on them - I just want to extract all the links, and store them for later usage.
But there is an awkward problem with extracting links from the pages. To be precise, my output contains only the last link from each of the pages.
I know that this description may sound confusing, so let me show you the code and images:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time
#here I tried to make a loop for generating page's URLs, and store URLs in the list "issues"
archive = '[redacted URL]'
issues =[]
#i am going for issues 163-175
for i in range(163,175):
    url_of_issue = archive + '/' + str(i)
    issues.append(url_of_issue)
#now, I want to extract links from generated pages
#idea is simple - loop iterates over the list of URLs/pages and from each issue page get URLS of the listed papers, storing them in the list "paper_urls"
paper_urls =[]
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
    paper_urls.append(ahrefTags)
    print(paper_urls)
    time.sleep(5)
But the problem is that my output looks like [redacted].
Instead of ~80 links, I'm getting this! It looks like my script gets only the last listed link from every generated URL (from the list named "issues" in the code). How can I fix it? I have no idea what the problem could be.

Were you perhaps missing an indentation when appending to paper_urls?
paper_urls =[]
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
        paper_urls.append(ahrefTags) # added missing indentation
    print(paper_urls)
    time.sleep(5)
The whole code, after moving the print outside the loop, would look like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time
#here I tried to make a loop for generating page's URLs, and store URLs in the list "issues"
archive = '[redacted URL]'
issues =[]
#range(163,175) covers issues 163-174, because the end value is excluded
for i in range(163,175):
    url_of_issue = archive + '/' + str(i)
    issues.append(url_of_issue)
#now, I want to extract links from generated pages
#idea is simple - loop iterates over the list of URLs/pages and from each issue page get URLS of the listed papers, storing them in the list "paper_urls"
paper_urls =[]
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
        paper_urls.append(ahrefTags)
        #print(ahrefTags) #uncomment if you wish to print each and every link by itself
    #time.sleep(5) #uncomment if you wish to add a delay between each request
print(paper_urls)
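The effect of that one indentation level can be reproduced offline. The sketch below runs both variants against two small made-up HTML pages; the page contents and the helper names collect_last_only and collect_all are assumptions for illustration and are not part of the original script.

```python
from bs4 import BeautifulSoup

# Made-up pages standing in for the issue pages; no site is contacted.
PAGES = [
    '<div class="obj_article_summary"><div class="title"><a href="/paper/1">A</a></div></div>'
    '<div class="obj_article_summary"><div class="title"><a href="/paper/2">B</a></div></div>',
    '<div class="obj_article_summary"><div class="title"><a href="/paper/3">C</a></div></div>'
    '<div class="obj_article_summary"><div class="title"><a href="/paper/4">D</a></div></div>',
]


def collect_last_only(pages):
    """Append outside the inner loop: only the last link of each page survives."""
    urls = []
    for html in pages:
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.select(".obj_article_summary .title a"):
            href = a["href"]
        urls.append(href)  # runs once per page, after the inner loop has ended
    return urls


def collect_all(pages):
    """Append inside the inner loop: every link is kept."""
    urls = []
    for html in pages:
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.select(".obj_article_summary .title a"):
            urls.append(a["href"])  # runs once per link
    return urls


print(collect_last_only(PAGES))
print(collect_all(PAGES))
```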

Python Beautiful Soup not pulling all the data

I'm trying to pull specific issuer data from a page on the Luxembourg Stock Exchange website using Beautiful Soup, selecting the element by its class and ID.
The example link I'm using is here: https://www.bourse.lu/security/XS1338503920/234821
And the data I'm trying to pull is the name under 'Issuer' stored as text; in this case it's 'BNP Paribas Issuance BV'.
I've tried using the class vignette-description-content-text, but it can't seem to find any data, as when looking through the soup, not all of the html is being pulled.
I've found that my current code only pulls some of the html, and I don't know how to expand the data it's pulling.
import requests
from bs4 import BeautifulSoup
URL = "https://www.bourse.lu/security/XS1338503920/234821"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='ResultsContainer', class_="vignette-description-content-text")
I have found similar problems and followed guides shown in link 1, link 2 and link 3, but the example html used seems very different to the webpage I'm looking to scrape.
Is there something I'm missing to pull and scrape the data?

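Based on your code, I suspect you are trying to get the element that has class=vignette-description-content-text and id=ResultsContainer.
class_ is the correct keyword for filtering by class. If combining it with id does not return what you expect, you can make both conditions explicit in a filter function.
Try this: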
import requests
from bs4 import BeautifulSoup
URL = "https://www.bourse.lu/security/XS1338503920/234821"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
def applyFilter(element):
    # Keep only elements that carry both the expected id and the expected class
    if element.has_attr('id') and element.has_attr('class'):
        if "vignette-description-content-text" in element['class'] and element['id'] == "ResultsContainer":
            return True
    return False
results = soup.find_all(applyFilter)
for result in results:
    # Each result is an element here
    print(result.get_text(strip=True))
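As a quick offline check of the filter-function technique, here is a minimal sketch against a hand-written fragment. The fragment is hypothetical: it reuses the id and class names from the question, but the real page may not contain such an element in the HTML that requests receives. The name has_id_and_class is also made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the first div satisfies both conditions.
SAMPLE_HTML = """
<div id="ResultsContainer" class="vignette-description-content-text">BNP Paribas Issuance BV</div>
<div id="Other" class="vignette-description-content-text">not this one</div>
<div id="ResultsContainer2">no class here</div>
"""


def has_id_and_class(tag):
    """Return True only for tags with the expected id and class."""
    return (
        tag.get("id") == "ResultsContainer"
        and "vignette-description-content-text" in tag.get("class", [])
    )


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = [tag.get_text(strip=True) for tag in soup.find_all(has_id_and_class)]
print(matches)
```

If the same filter returns an empty list on the downloaded page, the element is probably not present in that HTML at all, which is worth checking by searching the raw response text for the class name.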

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the page source, the data is in CDATA format and is passed to a jQuery call. I have managed to write the code below, which gets a large amount of text, where index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and print "redshift":"0.06", but I don't know how. What is the best way to do this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print(soup.find_all('script')[21])

It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)

Try simply:
soup.select_one('div.field-redshift > div.value>b').text

If you view the page source of the URL, you will find two script elements that contain CDATA, but only the one you are interested in also contains jQuery, so you can select the script element on that basis. After that, you need to do some cleaning to strip the CDATA markers and the jQuery call. Then, with the help of the json library, convert the JSON data to a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
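The chain of replace calls depends on the exact wrapper text around the JSON. As an alternative, here is a minimal sketch that captures the JSON object with a regular expression instead. This is a swapped-in technique, not the method used above, and the script snippet is a shortened, hand-written imitation of the page source, so the real script may need a different pattern.

```python
import json
import re

from bs4 import BeautifulSoup

# Shortened, hand-written imitation of the script element on the page.
SAMPLE_HTML = """
<html><head>
<script type="text/javascript">
<!--//--><![CDATA[//><!--
jQuery.extend(Drupal.settings, {"objectFlot": {"plotMain1": {"params": {"redshift": "0.06"}}}});
//--><!]]>
</script>
</head></html>
"""

# Capture everything between "jQuery.extend(Drupal.settings," and the closing ");"
pattern = re.compile(r"jQuery\.extend\(Drupal\.settings,\s*(\{.*\})\s*\);", re.DOTALL)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
settings = None
for script in soup.find_all("script"):
    match = pattern.search(script.string or "")
    if match:
        settings = json.loads(match.group(1))
        break

print(settings["objectFlot"]["plotMain1"]["params"]["redshift"])
```

Leaving settings as None when nothing matches makes a missing script easy to detect before indexing into the dictionary.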

Getting None when scraping for operating income from SEC EDGAR document

I'm trying to obtain the latest quarter's operating income/loss from a quarterly filing.
Desired output highlighted in green: financial statement
Here's the URL of the document that I'm trying to scrape: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm
If you'd like to see the data point visually, it is in PART I, Item 1. Financial Statements, Operating income.
The HTML code for the figure that I'm trying to get:
<ix:nonfraction id="fact-identifier-125" name="us-gaap:OperatingIncomeLoss" contextref="FD2019Q3QTD" unitref="usd" decimals="-6" scale="6" format="ixt:numdotdecimal" data-original-id="d305292495e1903-wk-Fact-6250FB76089207E7F73CB52756E0D8D0" continued-taxonomy="false" enabled-taxonomy="true" highlight-taxonomy="false" selected-taxonomy="false" hover-taxonomy="false" onclick="Taxonomies.clickEvent(event, this)" onkeyup="Taxonomies.clickEvent(event, this)" onmouseenter="Taxonomies.enterElement(event, this);" onmouseleave="Taxonomies.leaveElement(event, this);" tabindex="18" isadditionalitemsonly="false">11,544</ix:nonfraction>
The code that I used to try to obtain this data point (11,544):
from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm'
response = requests.get(url)
content = BeautifulSoup(response.content, 'html.parser')
operatingincomeloss = content.find('ix:nonfraction', attrs={"name": "us-gaap:OperatingIncomeLoss", "contextref":"FD2019Q3QTD"})
print (operatingincomeloss)
I also tried with
operatingincomeloss = content.find('ix:nonfraction', attrs={"name": "us-gaap:OperatingIncomeLoss"})
Eventually, I want to loop through all the relevant filings to pull this data point. Currently, I'm just getting None. When I Ctrl+F through content, I can't find the ix:nonfraction tag either.

The page is loaded via JavaScript. I looked up the XHR request the page makes and extracted the required data from the document it returns.
import requests
from bs4 import BeautifulSoup
r = requests.get(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.select("#d305292495e1903-wk-Fact-6250FB76089207E7F73CB52756E0D8D0"):
    print(item.text)
Output:
11,544
Updated:
import requests
from bs4 import BeautifulSoup
r = requests.get(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.findAll("ix:nonfraction", {'contextref': 'FD2019Q3QTD', 'name': 'us-gaap:OperatingIncomeLoss'}):
    print(item.text)

As @αԋɱҽԃ αмєяιcαη said, the page is loaded via JavaScript.
This code requests the same document that the page loads through its XHR request.
Of the attributes you used, I kept only name, because contextref changes for each element.
You can also change the name attribute if you want to loop through other elements.
Since you said you want to loop through this tag, the code below prints every value it returns.
Code:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm')
soup = BeautifulSoup(res.text, 'html.parser')
for data in soup.find_all('ix:nonfraction', {'name': 'us-gaap:OperatingIncomeLoss'}):
    print(data.text)
Output:
11,544
12,612
48,305
54,780
7,442
7,496
26,329
26,580
3,687
3,892
14,371
15,044
3,221
3,414
12,142
15,285
1,795
1,765
7,199
7,193
1,155
1,127
4,811
4,980
17,300
17,694
64,852
69,082
11,544
12,612
48,305
54,780
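Because the same tag name appears many times, it can help to key each value by its contextref and convert the text to a number. The following is a minimal sketch against a hand-written fragment: the second contextref (FD2018Q3QTD), the NetIncomeLoss row, and the helper name operating_income_by_context are assumptions made for illustration, not values confirmed from the filing.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the tags in the filing; the second
# contextref and the NetIncomeLoss row are made up for illustration.
SAMPLE_HTML = """
<table>
  <tr>
    <td><ix:nonfraction name="us-gaap:OperatingIncomeLoss" contextref="FD2019Q3QTD">11,544</ix:nonfraction></td>
    <td><ix:nonfraction name="us-gaap:OperatingIncomeLoss" contextref="FD2018Q3QTD">12,612</ix:nonfraction></td>
    <td><ix:nonfraction name="us-gaap:NetIncomeLoss" contextref="FD2019Q3QTD">10,044</ix:nonfraction></td>
  </tr>
</table>
"""


def operating_income_by_context(html):
    """Map each contextref to its operating income as an integer."""
    soup = BeautifulSoup(html, "html.parser")
    values = {}
    for tag in soup.find_all("ix:nonfraction", {"name": "us-gaap:OperatingIncomeLoss"}):
        # Strip the thousands separator before converting: "11,544" -> 11544
        values[tag["contextref"]] = int(tag.get_text(strip=True).replace(",", ""))
    return values


print(operating_income_by_context(SAMPLE_HTML))
```

Keying by contextref also collapses values that are repeated in the document, since a repeated context simply overwrites the same dictionary entry.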

Beautiful soup with find all only gives the last result

I'm trying to retrieve all the products from a page using Beautiful Soup. The page is paginated, so I wrote a loop to retrieve every page.
But when I move on to the next step and call find_all() on the tags, it only gives me the data from the last page.
If I try it with one isolated page it works fine, so I guess the problem is with getting the HTML from all the pages.
My code is the following:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import urllib3 as ur
http = ur.PoolManager()
base_url = 'https://www.kiwoko.com/tienda-de-perros-online.html'
for x in range (1,int(33)+1):
    dog_products_http = http.request('GET', base_url+'?p='+str(x))
    soup = BeautifulSoup(dog_products_http.data, 'html.parser')
    print (soup.prettify)
and once it has finished:
soup.find_all('li', {'class': 'item product product-item col-xs-12 col-sm-6 col-md-4'})
As I said, if I do not use the for loop and only retrieve one page (example: https://www.kiwoko.com/tienda-de-perros-online.html?p=10), it works fine and gives me the 36 products.
I copied the printed "soup" into a Word file and searched for the class to see if there was a problem, and all 1,153 products I'm looking for are there.
So I think the soup is right, but since I am searching "more than one HTML document", I suspect that find_all is not working as I expect.
What could be the problem?

You do want your find inside the loop. Alternatively, here is a way to copy the Ajax call the page makes, which lets you return more items per request, calculate the number of pages dynamically, and make requests for all products.
I reuse the connection with a Session for efficiency.
from bs4 import BeautifulSoup as bs
import requests, math
results = []
with requests.Session() as s:
    r = s.get('https://www.kiwoko.com/tienda-de-perros-online.html?p=1&product_list_limit=54&isAjax=1&_=1560702601779').json()
    soup = bs(r['categoryProducts'], 'lxml')
    results.append(soup.select('.product-item-details'))
    product_count = int(soup.select_one('.toolbar-number').text)
    pages = math.ceil(product_count / 54)
    if pages > 1:
        for page in range(2, pages + 1):
            r = s.get('https://www.kiwoko.com/tienda-de-perros-online.html?p={}&product_list_limit=54&isAjax=1&_=1560702601779'.format(page)).json()
            soup = bs(r['categoryProducts'], 'lxml')
            results.append(soup.select('.product-item-details'))
results = [result for item in results for result in item]
print(len(results))
# parse out from results what you want, as this is a list of tags, or do in loop above
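The first point, keeping the find inside the loop, can be sketched offline as follows. The pages are made up, fetch_page is a hypothetical stand-in for the real HTTP request, and the class name is shortened to product-item; the point is only that find_all runs once per page and its results are accumulated before soup is reassigned.

```python
import math

from bs4 import BeautifulSoup

PER_PAGE = 2        # made-up page size
TOTAL_PRODUCTS = 5  # made-up product count

# Made-up pages keyed by page number; no site is contacted.
FAKE_PAGES = {
    1: '<ul><li class="product-item">A</li><li class="product-item">B</li></ul>',
    2: '<ul><li class="product-item">C</li><li class="product-item">D</li></ul>',
    3: '<ul><li class="product-item">E</li></ul>',
}


def fetch_page(page_number):
    """Hypothetical stand-in for the real request for page `page_number`."""
    return FAKE_PAGES[page_number]


# Same page arithmetic as above: round up so the partial last page is included
pages = math.ceil(TOTAL_PRODUCTS / PER_PAGE)

products = []
for page_number in range(1, pages + 1):
    soup = BeautifulSoup(fetch_page(page_number), "html.parser")
    # find_all runs on every page, so nothing is lost when soup is reassigned
    products.extend(li.get_text(strip=True)
                    for li in soup.find_all("li", class_="product-item"))

print(len(products), products)
```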
