How to get CData from html using beautiful soup - python

I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from a jQuery. I have managed to write the below code which gets a large amount of text, where the index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06" but dont know how. what is the best way to solve this.
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print soup.find_all('script')[21]

It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)

Try simply:
soup.select_one('div.field-redshift > div.value>b').text

If you view the Page Source of the URL, you will find that there are two script elements that are having CDATA. But the script element in which you are interested has jQuery in it. So you have to select the script element based on this knowledge. After that, you need to do some cleaning to get rid of CDATA tags and jQuery. Then with the help of json library, convert JSON data to Python Dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
if 'CDATA' in script.text and 'jQuery' in script.text:
scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])

Related

Is there a way I can extract a list from a javascript document?

There is a website where I need to obtain the owners of this item from an online-game item and from research, I need to do some 'web scraping' to get this data. But, the information is in a Javascript document/code, not an easily parseable HTML document like bs4 shows I can easily extract information from. So, I need to get a variable in this javascript document (contains a list of owners of the item I'm looking at) and make it into a usable list/json/string I can implement in my program. Is there a way I can do this? if so, how can I?
I've attached an image of the variable I need when viewing the page source of the site I'm on.
My current code:
from bs4 import BeautifulSoup
html = requests.get('https://www.rolimons.com/item/1029025').content #the item webpage
soup = BeautifulSoup(html, "lxml")
datas = soup.find_all("script")
print(data) #prints the sections of the website content that have ja
IMAGE LINK
To scrape javascript variable, can't use only BeautifulSoup. Regular expression (re) is required.
Use ast.literal_eval to convert string representation of dict to a dict.
from bs4 import BeautifulSoup
import requests
import re
import ast
html = requests.get('https://www.rolimons.com/item/1029025').content #the item webpage
soup = BeautifulSoup(html, "lxml")
ownership_data = re.search(r'ownership_data\s+=\s+.*;', soup.text).group(0)
ownership_data_dict = ast.literal_eval(ownership_data.split('=')[1].strip().replace(';', ''))
print(ownership_data_dict)
Output:
> {'id': 1029025, 'num_points': 1616, 'timestamps': [1491004800,
> 1491091200, 1491177600, 1491264000, 1491350400, 1491436800,
> 1491523200, 1491609600, 1491696000, 1491782400, 1491868800,
> 1491955200, 1492041600, 1492128000, 1492214400, 1492300800,
> 1492387200, 1492473600, 1492560000, 1492646400, 1492732800,
> 1492819200, ...}
import requests
import json
import re
r = requests.get('...')
m = re.search(r'var history_data\s+=\s+(.*)', r.text)
print(json.loads(m.group(1)))

Finding tables returns [] with bs4

I am trying to scrape a table from this url: https://cryptoli.st/lists/fixed-supply
I gather that the table I want is in the div class "dataTables_scroll". I use the following code and it only returns an empty list:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = requests.get("https://cryptoli.st/lists/fixed-supply")
soup = bs(url.content, 'lxml')
table = soup.find_all("div", {"class": "dataTables_scroll"})
print(table)
Any help would be most appreciated.
Thanks!
The reason is that the response you get from requests.get() does not contain table data in it.
It might be loaded on client-side(by javascript).
What can you do about this? Using a selenium webdriver is a possible solution. You can "wait" until the table is loaded and becomes interactive, then get the page content with selenium, pass the context to bs4 to do the scraping.
You can check the response by writing it to a file:
f = open("demofile.html", "w", encoding='utf-8')
f.write(soup.prettify())
f.close()
and you will be able to see "...Loading..." where the table is expected.
I believe the data is loaded from a script tag. I have to go to work so can't spend more time working out how to appropriately recreate the a dataframe from the "|" delimited data at present, but the following may serve as a starting point for others, as it extracts the relevant entries from the script tag for the table body.
import requests, re
import ast
r = requests.get('https://cryptoli.st/lists/fixed-supply').text
s = re.search(r'cl\.coinmainlist\.dataraw = (\[.*?\]);', r, flags = re.S).group(1)
data = ast.literal_eval(s)
data = [i.split('|') for i in data]
print(data)

Python Beautiful Soup not pulling all the data

I'm currently looking to pull specific issuer data from URL html with a specific class and ID from the Luxembourg Stock Exchange using Beautiful Soup.
The example link I'm using is here: https://www.bourse.lu/security/XS1338503920/234821
And the data I'm trying to pull is the name under 'Issuer' stored as text; in this case it's 'BNP Paribas Issuance BV'.
I've tried using the class vignette-description-content-text, but it can't seem to find any data, as when looking through the soup, not all of the html is being pulled.
I've found that my current code only pulls some of the html, and I don't know how to expand the data it's pulling.
import requests
from bs4 import BeautifulSoup
URL = "https://www.bourse.lu/security/XS1338503920/234821"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='ResultsContainer', class_="vignette-description-content-text")
I have found similar problems and followed guides shown in link 1, link 2 and link 3, but the example html used seems very different to the webpage I'm looking to scrape.
Is there something I'm missing to pull and scrape the data?
Based on your code, I suspect you are trying to get element which has class=vignette-description-content-text and id=ResultsContaine.
The class_ is correct way to use ,but not with the id
Try this:
import requests
from bs4 import BeautifulSoup
URL = "https://www.bourse.lu/security/XS1338503920/234821"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
def applyFilter(element):
if element.has_attr('id') and element.has_attr('class'):
if "vignette-description-content-text" in element['class'] and element['id'] == "ResultsContainer":
return True
results = soup.find_all(applyFilter)
for result in results:
#Each result is an element here

BeautifulSoup/Python Problems Parsing Websites

I'm sure this may have been asked in the past but I am attempting to parse a website (hopefully somehow automate it to parse multiple websites at once eventually) but it's not working properly. I may be having issues grabbing appropriate tags or something but essentially I want to go to this website and pull off all of the items from the lists created (possibly with hrefs intact or in a separate document) and stick them into a file where I can share in an easy-to-read format. So far this is my code:
url = "http://catalog.apu.edu/academics/college-liberal-arts-sciences/math-physics-statistics/applied-mathematics-bs/" `
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
results = soup.find_all('div', class_"tab_content")
for element in results:
title_elem = element.find('h1')
h2_elem = element.find('h2')
h3_elem = element.find('h3')
href_elem = element.find('href')
if None in (title_elem, h2_elem, h3_elem, href_elem):
continue
print(title_elem.text.strip())
print(h2_elem.text.strip())
print(h3_elem.text.strip())
print(href_elem.text.strip())
print()
I even attempted to write this for a table but I get the same type of output, which are a bunch of empty elements:
for table in soup.find_all('table'):
for subtable in table.find_all('table'):
print(subtable)
Does anyone have any insight as to why this may be the case? If possible I would also not be opposed to regex parsing, but the main goal here is to go into this site (and hopefully others like it) and take the entire table/lists/descriptions of the individual programs for each major and write the information into an easy-to-read file
Similar approach in that I also selected to combine bs4 with pandas but I tested for the presence of the hyperlink class.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'http://catalog.apu.edu/academics/college-liberal-arts-sciences/math-physics-statistics/applied-mathematics-bs/'
r = requests.get(url)
soup = bs(r.content, 'lxml')
for table in soup.select('.sc_courselist'):
tbl = pd.read_html(str(table))[0]
links_column = ['http://catalog.apu.edu' + i.select_one('.bubblelink')['href'] if i.select_one('.bubblelink') is not None else '' for i in table.select('td:nth-of-type(1)')]
tbl['Links'] = links_column
print(tbl)
With BeautifulSoup, an alternative to find/find_all is select_one/select. The latter two apply css selectors with select_one returning the first match for the css selector passed in, and select returning a list of all matches. "." is a class selector, meaning it will select attributes with the specified class e.g. sc_courselist or bubblelink. bubblelink is the class of the element with the desired hrefs. These are within the first column of each table which is selected using td:nth-of-type(1).

Finding name and codes of all airports

I am trying to scrape data to get the text I need. I want to find the line that says aberdeen and all lines after it which contain the airport info. Here is a pic of the html hierarchy:
I am trying to locate the text elements inside the class "i1" with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
But I am not getting the values I expect at all. Here is a link to the data if curious. I am new to scraping obviously.
The problem is your BeautifulSoup parser:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
If what you want is the text elements, you can use:
soup.get_text()
Note: this will give you all the text elements.
why are people suggesting selenium? this doesnt dynamically load the data ... requests + re is all you need, you dont even need beautiful soup
data = requests.get('http://www.airportcodes.org/').content
cities_and_codes =re.findall("([A-Za-z, ]+)\(([A-Z]{3})\)",data)
just look for any alphanumeric characters (including also comma and space)
followed by exactly 3 uppercase letters in parenthesis

Categories

Resources