I'm trying to get some data using the NCBI API. I am using requests to make the connection to the API.
What I'm stuck on is how to convert the XML object that requests returns into something that I can parse.
Here's my code for the function so far:
def getNCBIid(speciesName):
    import requests
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
    url = base_url + "esearch.fcgi?db=assembly&term=(%s[All Fields])&usehistory=y&api_key=f1e800ad255b055a691c7cf57a576fe4da08" % speciesName
    # xml object
    api_request = requests.get(url)
You would use something like BeautifulSoup for this ('this' being 'convert and parse the XML object').
What you are calling your XML object is still the response object; you need to extract the content from that object first.
from bs4 import BeautifulSoup

def getNCBIid(speciesName):
    import requests
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
    url = base_url + "esearch.fcgi?db=assembly&term=(%s[All Fields])&usehistory=y&api_key=f1e800ad255b055a691c7cf57a576fe4da08" % speciesName
    # xml object <--- this is still just your response object
    api_request = requests.get(url)
    # grab the response content
    xml_content = api_request.content
    # parse with Beautiful Soup
    soup = BeautifulSoup(xml_content, 'xml')
    # from here you would access desired elements
    # here are the docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
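A quick offline sketch of what that element access can look like. The XML below is a trimmed, hypothetical esearch response (a real one would come from api_request.content), assuming the standard eSearch layout where the record UIDs sit in <Id> elements under <IdList>:

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical esearch response so this runs offline;
# in the function above you would pass api_request.content instead.
sample_xml = """<?xml version="1.0"?>
<eSearchResult>
  <Count>1</Count>
  <IdList>
    <Id>52508</Id>
  </IdList>
</eSearchResult>"""

# the 'xml' feature requires lxml to be installed
soup = BeautifulSoup(sample_xml, 'xml')

# each <Id> under <IdList> is one record UID
ids = [id_tag.text for id_tag in soup.find_all('Id')]
print(ids)
```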
I am trying to webscrape a government site that uses frameset.
Here is the URL - https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm
I've tried using splinter/selenium
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
browser.visit(url)
time.sleep(10)
full_xpath_frame = '/html/frameset/frameset/frame[2]'
tree = browser.find_by_xpath(full_xpath_frame)
for i in tree:
    print(i.text)
It just returns an empty string.
I've tried using the requests library.
import requests
from lxml import html
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
# get response object
response = requests.get(url)
# get byte string
data = response.content
print(data)
And it returns this:
b"<html>\r\n<head>\r\n<meta http-equiv='Content-Type'\r\ncontent='text/html; charset=iso-8859-1'>\r\n<title>Lake_ County Election Results</title>\r\n</head>\r\n<FRAMESET rows='20%, *'>\r\n<FRAME src='titlebar.htm' scrolling='no'>\r\n<FRAMESET cols='20%, *'>\r\n<FRAME src='menu.htm'>\r\n<FRAME src='Lake_ElecSumm_all.htm' name='reports'>\r\n</FRAMESET>\r\n</FRAMESET>\r\n<body>\r\n</body>\r\n</html>\r\n"
I've also tried using Beautiful Soup, and it gave me the same thing. Is there another Python library I can use to get the data that's inside the second table?
Thank you for any feedback.
As mentioned, you could go for the frames and their src attributes:
BeautifulSoup(r.text, 'html.parser').select('frame')[1].get('src')
or directly to the menu.htm:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/menu.htm')
link_list = ['https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults'+a.get('href') for a in BeautifulSoup(r.text, 'html.parser').select('a')]
for link in link_list[:1]:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    # ...scrape what is needed
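To see how the frame srcs resolve, here is an offline sketch that parses the frameset markup from the question and joins each relative src against the page URL with urljoin:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"

# frameset markup returned in the question, trimmed to the frames
frameset_html = """<FRAMESET rows='20%, *'>
<FRAME src='titlebar.htm' scrolling='no'>
<FRAMESET cols='20%, *'>
<FRAME src='menu.htm'>
<FRAME src='Lake_ElecSumm_all.htm' name='reports'>
</FRAMESET>
</FRAMESET>"""

soup = BeautifulSoup(frameset_html, "html.parser")

# resolve each relative frame src against the frameset page's URL
frame_urls = [urljoin(page_url, f.get("src")) for f in soup.select("frame")]
print(frame_urls)
```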
I want to get both names, "Justin Cutroni" and "Krista Seiden", without the tags.
This is the HTML that I want to pull the names from with Python 3:
I used BeautifulSoup, but I don't know how to drill down into the HTML and extract the names.
import requests
from bs4 import BeautifulSoup as bs
web_pages = ["https://maktabkhooneh.org/learn/"]
def find_lessons(web_page):
    # Load the webpage content
    r = requests.get(web_page)
    # Convert to a Beautiful Soup object
    soup = bs(r.content, features="html.parser")
    table = soup.select('div[class="course-card__title"]')
    data = [x.text.split(';')[-1].strip() for x in table]
    return data
find_teachers(web_pages[0])
You are looking at course-card__title, when it appears what you want is course-card__teacher. When you're using requests, it's often more useful to look at the raw HTML (using wget or curl) rather than the object model, as in your image.
What you have pretty much works with that change:
import requests
from bs4 import BeautifulSoup as bs
web_pages = ["https://maktabkhooneh.org/learn/"]
def find_teachers(web_page):
    # Load the webpage content
    r = requests.get(web_page)
    soup = bs(r.content, features="html.parser")
    table = soup.select('div[class="course-card__teacher"]')
    return [x.text.strip() for x in table]

print(find_teachers(web_pages[0]))
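To see what the attribute selector matches, here is an offline sketch against hypothetical card markup (the class names mirror the page; the contents are made up to match the question):

```python
from bs4 import BeautifulSoup as bs

# hypothetical card markup; the real page's structure may differ
html = """
<div class="course-card__title">Google Analytics</div>
<div class="course-card__teacher"> Justin Cutroni </div>
<div class="course-card__teacher"> Krista Seiden </div>
"""

soup = bs(html, features="html.parser")

# [class="course-card__teacher"] matches the class attribute exactly
names = [x.text.strip() for x in soup.select('div[class="course-card__teacher"]')]
print(names)
```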
I'm trying to get the whole table at this url = "https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats" into a DataFrame (821 rows in total; I need all of them). The code I'm using is this:
import requests
from bs4 import BeautifulSoup
import json
url = "https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup) # It doesn't print anything
My idea is to get the info in soup, then look for the tag <script> jQuery.extend(Drupal.settings, {"basePath": ... and pull out the following JSON link https://www.timeshighereducation.com/sites/default/files/the_data_rankings/life_sciences_rankings_2020_0__a2e62a5137c61efeef38fac9fb83a262.json, where all the data in the table lives. I already have a function to read this JSON link, but first I need to find the info in soup and get the JSON link from it. It has to work this way because I have to read many tables, and getting each JSON link by manual inspection is not an option for me.
You want the following regex pattern, which finds the desired string after "url":
from bs4 import BeautifulSoup as bs
import requests
import re
with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats')
    url = re.search('"url":"(.*?)"', r.text).group(1).replace('\\/', '/')
    data = s.get(url).json()
    print(data)
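Offline, the regex step looks like this. The script fragment below is a hypothetical, shortened version of the page's inline Drupal.settings (the real one comes from r.text), with the escaped slashes that the replace() call undoes:

```python
import re

# hypothetical, shortened Drupal.settings fragment with an escaped-slash URL
script_text = ('jQuery.extend(Drupal.settings, {"basePath":"\\/",'
               '"url":"https:\\/\\/www.timeshighereducation.com\\/sites\\/default'
               '\\/files\\/the_data_rankings\\/life_sciences_rankings_2020_0.json"});')

# grab the quoted string right after "url", then unescape the slashes
json_url = re.search('"url":"(.*?)"', script_text).group(1).replace('\\/', '/')
print(json_url)
```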
I think I made a mistake somewhere in this code, because I'm not getting the correct JSON when I print it; in fact I get nothing. When I index into the script list I do get the JSON, but with .text nothing appears. I want the JSON by itself.
CODE:
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
import requests
import selenium.webdriver as webdriver
base_url = 'https://www.instagram.com/{}'
search = input('Enter the instagram account: ')
final_url = base_url.format(quote_plus(search))
response = requests.get(final_url)
print(response.status_code)
if response.ok:
    html = response.text
    bs_html = BeautifulSoup(html)
    scripts = bs_html.select('script[type="application/ld+json"]')
    print(scripts[0].text)
Change the line print(scripts[0].text) to print(scripts[0].string).
scripts[0] is a Beautiful Soup Tag object, and its string contents can be accessed through the .string property.
Source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string
If you want to then turn the string into a json so that you can access the data, you can do something like this:
...
import json

if response.ok:
    html = response.text
    bs_html = BeautifulSoup(html)
    scripts = bs_html.select('script[type="application/ld+json"]')
    json_output = json.loads(scripts[0].string)
Then, for example, if you run print(json_output['name']) you should be able to access the name on the account.
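A small offline sketch of that last step, using a minimal, made-up ld+json payload in place of scripts[0].string:

```python
import json

# minimal, hypothetical ld+json payload; the real one comes from scripts[0].string
ld_json = '{"@type": "Person", "name": "example_account", "url": "https://www.instagram.com/example_account"}'

json_output = json.loads(ld_json)
print(json_output['name'])
```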
I have a series of XML files at an HTTPS URL below. I need to get the latest XML file from the URL.
I tried to modify this piece of code, but it does not work. Please help.
from bs4 import BeautifulSoup
import urllib.request
import requests
url = 'https://www.oasis.oati.com/cgi-bin/webplus.dll?script=/woa/woa-planned-outages-report.html&Provider=MISO'
response = requests.get(url, verify=False)
#html = urllib.request.urlopen(url,verify=False)
soup = BeautifulSoup(response)
I suppose BeautifulSoup does not read the response object directly. And if I use the urlopen function, it throws an SSL error.
BeautifulSoup does not understand requests' Response instances directly - grab .content and pass it to the "soup" to parse:
soup = BeautifulSoup(response.content, "html.parser") # you can also use "lxml" or "html5lib" instead of "html.parser"
BeautifulSoup understands "file-like" objects as well, which means that, once you figure out your SSL error issue, you can do:
data = urllib.request.urlopen(url)
soup = BeautifulSoup(data, "html.parser")
I did not frame my question correctly in the first place. After further research, I found out that I was really trying to extract all the URLs within the referenced url tags. With some more background on Beautiful Soup, I would use soup.find_all('a').
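A sketch of that approach against a hypothetical listing (the real page's markup and filename scheme are assumptions here); with date-stamped filenames, max() on the hrefs picks the latest report:

```python
from bs4 import BeautifulSoup

# hypothetical report listing; hrefs and the date format are assumptions
listing_html = """
<a href="outages_20230101.xml">outages_20230101.xml</a>
<a href="outages_20230103.xml">outages_20230103.xml</a>
<a href="outages_20230102.xml">outages_20230102.xml</a>
"""

soup = BeautifulSoup(listing_html, "html.parser")

# soup.find_all('a') gives every link; keep only the .xml ones
xml_links = [a.get("href") for a in soup.find_all("a") if a.get("href", "").endswith(".xml")]

# YYYYMMDD-stamped names sort lexicographically, so max() is the latest
latest = max(xml_links)
print(latest)
```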