Parsing a HTML Table gets empy soup with beautifulsoup and request - python

I'm trying to get all the table in this url = "https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats" in a DataFrame (821 rows in total, need all the table). The code I'm using is this:
import requests
from bs4 import BeautifulSoup
import json
url = "https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup) # It doesn't print anything
My idea is to get the info in soup and then look for the tag <script> jQuery.extend(Drupal.settings, {"basePath": ... and get inside the followig json link https://www.timeshighereducation.com/sites/default/files/the_data_rankings/life_sciences_rankings_2020_0__a2e62a5137c61efeef38fac9fb83a262.json where is all the data in the table. I already have a function to read this json link, but first need to find the info in soup and then get json link. Need to be in this way because I have to read many tables and get the json link by inspectioning manually is not an option for me.

You want the following regex pattern which finds the desired string after "url"
from bs4 import BeautifulSoup as bs
import requests
import re
with requests.Session() as s:
s.headers = {'User-Agent':'Mozilla/5.0'}
r = s.get('https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats')
url = re.search('"url":"(.*?)"', r.text).groups(0)[0].replace('\/','/')
data = s.get(url).json()
print(data)

Related

how to put web scraped data into a list

this is the code I used to get the data from a website with all the wordle possible words, im trying to put them in a list so I can create a wordle clone but I get a weird output when I do this. please help
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
word_list = list(soup)
It do not need BeautifulSoup, simply split the text of the response:
import requests
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
requests.get(url).text.split()
Or if you like to do it wit BeautifulSoup anyway:
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.text.split()
Output:
['women',
'nikau',
'swack',
'feens',
'fyles',
'poled',
'clags',
'starn',...]

how can I get the names in this html code by python?

I want to get both of names "Justin Cutroni" and "Krista Seiden" without the tags
this is my html code that I want to get the names by python3:
I used beautifulsoup but I don't know how to get deep in the html codes and get the names.
import requests
from bs4 import BeautifulSoup as bs
web_pages = ["https://maktabkhooneh.org/learn/"]
def find_lessons(web_page):
# Load the webpage content
r = requests.get(web_page)
# Convert to a beautiful soup object
soup = bs(r.content, features="html.parser")
table = soup.select('div[class="course-card__title"]')
data = [x.text.split(';')[-1].strip() for x in table]
return data
find_teachers(web_pages[0])
You are looking at course-card__title, when it appears you want is course-card__teacher. When you're using requests, it's often more useful to look at the real HTML (using wget or curl) rather than the object model, as in your image.
What you have pretty much works with that change:
import requests
from bs4 import BeautifulSoup as bs
web_pages = ["https://maktabkhooneh.org/learn/"]
def find_teachers(web_page):
# Load the webpage content
r = requests.get(web_page)
soup = bs(r.content, features="html.parser")
table = soup.select('div[class="course-card__teacher"]')
return [x.text.strip() for x in table]
print(find_teachers(web_pages[0]))

Read data from URL / XML with python

this is my first question.
Im trying to learn some python, so.. i have this problem
how i can get data from this url that shows info in XML:
import requests
from bs4 import BeautifulSoup
url = 'http://windte1910.acepta.com/v01/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49'
document = requests.get(url)
soup = BeautifulSoup(document.content, "lxml-xml")
print(soup)
output:
Output
but i wanna get access to this type of data, < RUTEmisor> data for example:
linkurl_invoice
hope guys you can try to advice me with the code and how to read xml docs.
By examining the URL you gave, it seems that the data is actually held a few links away at the following URL: http://windte1910.acepta.com/depot/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49
As such, you can access it directly as follows:
import requests
from bs4 import BeautifulSoup
url = 'http://windte1910.acepta.com/depot/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49'
document = requests.get(url)
soup = BeautifulSoup(document.content, "lxml-xml")
print(soup.find('RUTEmisor').text)

How to scrape a specific table form website using python (beautifulsoup4 and requests or any other library)?

https://en.wikipedia.org/wiki/Economy_of_the_European_Union
Above is the link to website and I want to scrape table: Fortune top 10 E.U. corporations by revenue (2016).
Please, share the code for the same:
import requests
from bs4 import BeautifulSoup
def web_crawler(url):
page = requests.get(url)
plain_text = page.text
soup = BeautifulSoup(plain_text,"html.parser")
tables = soup.findAll("tbody")[1]
print(tables)
soup = web_crawler("https://en.wikipedia.org/wiki/Economy_of_the_European_Union")
following what #FanMan said , this is simple code to help you get started, keep in mind that you will need to clean it and also perform the rest of the work on your own.
import requests
from bs4 import BeautifulSoup
url='https://en.wikipedia.org/wiki/Economy_of_the_European_Union'
r=requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
temp_datastore=list()
for text in soup.findAll('p'):
w=text.findAll(text=True)
if(len(w)>0):
temp_datastore.append(w)
Some documentation
beautiful soup:https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests: http://docs.python-requests.org/en/master/user/intro/
urllib: https://docs.python.org/2/library/urllib.html
You're first issue is that your url is not properly defined. After that you need to find the table to extract and it's class. In this case the class was "wikitable" and it was a the first table. I have started your code for you so it gives you the extracted data from the table. Web-scraping is good to learn but if your are just starting to program, practice with some simpler stuff first.
import requests
from bs4 import BeautifulSoup
def webcrawler():
url = "https://en.wikipedia.org/wiki/Economy_of_the_European_Union"
page = requests.get(url)
soup = BeautifulSoup(page.text,"html.parser")
tables = soup.findAll("table", class_='wikitable')[0]
print(tables)
webcrawler()

How to get text following a table/span with BeautifulSoup and Python?

I need to get the text 2,585 shown in the screenshot below. I very new to coding, but this is what i have so far:
import urllib2
from bs4 import BeautifulSoup
url= 'insertURL'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
span = soup.find('span', id='d21475972e793-wk-Fact -8D34B98C76EF518C788A2177E5B18DB0')
print (span.text)
Any info is helpful!! Thanks.
Website HTML
3 things, your using requests not urllib2. Your selecting XML with namespaces so you need to use xml as the parser. The element you want is not span it is ix:nonFraction. Here is a working example using another web-page (you just need to point it at your page and use the commented line).
# Using requests no need for urllib2.
import requests
from bs4 import BeautifulSoup
# Using this page as an example.
url= 'https://www.sec.gov/Archives/edgar/data/27904/000002790417000004/0000027904-17-000004.txt'
r = requests.get(url)
data = r.text
# use xml as the parser.
soup = BeautifulSoup(data, 'xml')
ix = soup.find('ix:nonFraction', id="Fact-7365D69E1478B0A952B8159A2E39B9D8-wk-Fact-7365D69E1478B0A952B8159A2E39B9D8")
# Your original code for your page.
# ix = soup.find('ix:nonFraction', id='d21475972e793-wk-Fact-8D34B98C76EF518C788A2177E5B18DB0')
print (ix.text)

Categories

Resources