Extract Tag from XML - Python

I'm very new to Python and am attempting my first web scraping project. I'm trying to extract the data following a tag within an XML data source. I've attached an image of the data I'm working with. My issue is that no matter what tag I try to extract, I get no results. I am able to return the entire data source, so I know the connection is not the issue.
My ultimate goal is to loop through all of the data and return the data following a particular tag. I think if I can understand why I'm unable to print a single tag, I should be able to figure out how to loop through all of the data. I've looked through similar posts, but I think the tree in my set of data is particularly troublesome (that, and my inexperience).
My Code:
from bs4 import BeautifulSoup
import requests
#Assign URL to scrape
URL = "http://api.powertochoose.org/api/PowerToChoose/plans?zip_code=78364"
#Fetch the raw HTML Data
Data = requests.get(URL)
Soup = BeautifulSoup(Data.text, "html.parser")
tags = Soup.find_all('fact_sheet')
print(tags)

Check the response of your example first: it is JSON, not XML, so no BeautifulSoup is needed here. Simply iterate over the data list to pick out your fact_sheets:
for plan in Data.json()['data']:
    print(plan['fact_sheet'])
Out:
https://rates.cleanskyenergy.com:8443/rates/DownloadDoc?path=a70e9298-5537-481a-985c-c7a005b2e4f3.html&id_plan=223344
https://texpo-prod-api.eroms.works/api/v1/document/ViewProductDocument?type=efl&rateCode=SRCPLF24PTC&lang=en
https://www.txu.com/Handlers/PDFGenerator.ashx?comProdId=TCXSIMVL1212AR&lang=en&formType=EnergyFactsLabel&custClass=3&tdsp=AEPTCC
https://signup.myvaluepower.com/Home/EFL?productId=32653&Promo=16410
https://docs.cloud.flagshippower.com/EFL?term=36&duns=007924772&product=galleon&lang=en&code=FPSPTC2
...
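For reference, a self-contained version of this approach (same endpoint as above; it assumes the response keeps its data list of plan dicts with a fact_sheet key):
import requests

URL = "http://api.powertochoose.org/api/PowerToChoose/plans?zip_code=78364"
# The endpoint returns JSON by default, so no HTML/XML parsing is needed.
response = requests.get(URL)
response.raise_for_status()
# Each plan dict carries its fact sheet link under 'fact_sheet'.
for plan in response.json()['data']:
    print(plan['fact_sheet'])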

As you've already realized by now, you're getting the data as JSON, so doing something like:
fact_sheet_links = [d['fact_sheet'] for d in Data.json()['data']]
would get you the data you want.
But if you'd prefer to work with the XML, you can add an Accept header to the request:
Data = requests.get(URL, headers={ 'Accept': 'application/xml' })
and get an XML response. When I did this, Soup.find_all('fact_sheet') still did not work (although I've seen this method used in some tutorials, so it might be a version problem, and it might still work for you), but it did work when I used find_all with a lambda:
tags = Soup.find_all(lambda t: 'fact_sheet' in t.name)
With that change, your code prints the matching tags. That just gives you the tags, though, so if you want a list of their contents instead, one way would be to use a list comprehension:
fact_sheet_links = [t.text for t in tags]
which leaves you with just the link text.
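Putting the XML route together, here is a minimal end-to-end sketch (same endpoint and lambda workaround as above):
import requests
from bs4 import BeautifulSoup

URL = "http://api.powertochoose.org/api/PowerToChoose/plans?zip_code=78364"
# Ask the API for XML instead of its default JSON.
data = requests.get(URL, headers={'Accept': 'application/xml'})
soup = BeautifulSoup(data.text, 'html.parser')
# Match any tag whose name contains 'fact_sheet', then collect the text.
tags = soup.find_all(lambda t: 'fact_sheet' in t.name)
fact_sheet_links = [t.text for t in tags]
print(fact_sheet_links)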

Related

Python Printing Multiple Items - Web Scraping with XPath

I am very new to Python and programming in general, so please forgive my lack of insight. I have managed to web-scrape some data with XPath.
#Dependencies
from lxml import html
import requests
#URL
url = 'https://web.archive.org/web/20171004082203/https://www.yellowpages.com/houston-tx/air-conditioning-service-repair'
#Use Requests to retrieve html
resp = requests.get(url)
#Create Tree from Request Response
tree = html.fromstring(resp.content)
#Create Tree element
elements = tree.xpath('//*[starts-with(@id,"lid-")]/div/div/div[2]/div[2]/div[2]/a[1]')
# Scrape for URL and split for just business url
websites = elements[0].attrib['href'].split("http://")[1]
The output of this code returns a single website URL. However, I would like to print all the business URLs, to eventually put them into a Pandas data frame.
How can I retrieve elements[0], elements[1], elements[2]... in one variable or expression?
I am sure there is an iterative function or list comprehension for this, but I cannot wrap my brain around it. I'm thinking something like this:
Can I create a function to iterate through the 'elements[0]' and return all my values?
Any help is greatly appreciated, Thanks!
Here is a quick fix, working from your code, that will get the websites from this particular site; it stores them all in the websites list. That said, if you're working on a web scraper, you'd probably be better served working with Beautiful Soup.
#Create Tree element
elements = tree.xpath('//*[starts-with(@id,"lid-")]/div/div/div[2]/div[2]/div[2]/a[1]')
websites = []
for element in elements:
    try:
        websites.append("http" + element.attrib['href'].split("http")[2])
    except (KeyError, IndexError):
        continue
for website in websites:
    print(website, '\n')
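Since you asked about a list comprehension, the same extraction can also be written as one expression (a sketch equivalent to the loop above, filtering out hrefs that lack a second embedded URL):
# Same logic as the loop above, as a list comprehension.
websites = [
    "http" + element.attrib['href'].split("http")[2]
    for element in elements
    if len(element.attrib.get('href', '').split("http")) > 2
]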

Need help parsing the PHP/HTML file using Python

I would like to parse the URL https://www.horsedeathwatch.com/index.php and dump the data into a Pandas data frame.
Columns like horse/date/course/cause of death.
I tried pandas read_html to read this URL directly, and it didn't find the table even though the page has a table tag.
I tried using:
import requests
from bs4 import BeautifulSoup

url = 'https://www.horsedeathwatch.com/index.php'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#print(page.text)
soup = BeautifulSoup(page.content, 'lxml')
and then the find_all('tr') method, but for some reason I can't get it to work.
The second thing I would like to do: each horse (the first column in the web page table) has a hyperlink with additional attributes.
Any suggestions on how I can retrieve those additional attributes into a Pandas data frame?
Looking at the site, I can see the data is loaded using a POST request to /loaddata.php, passing the page number. Combining this with pandas.read_html:
import requests
import pandas
res = requests.post('https://www.horsedeathwatch.com/loaddata.php', data={'page': '3'})
html = pandas.read_html(res.content)
Although BeautifulSoup would perhaps give you a richer data structure: if you want to extract the further attributes for each horse, you would need to get each anchor element's href and perform another request. That one is a GET request, and you need to parse the response content from <div class="view">.
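A rough sketch of that second step (the POST endpoint and the view div are as described above; the relative href layout on the anchors is an assumption you should verify):
import requests
from bs4 import BeautifulSoup

# Load one page of the table via the POST endpoint described above.
res = requests.post('https://www.horsedeathwatch.com/loaddata.php', data={'page': '3'})
soup = BeautifulSoup(res.content, 'lxml')
for link in soup.find_all('a', href=True):
    # Follow each horse's link with a GET request (assumes relative hrefs).
    detail = requests.get('https://www.horsedeathwatch.com/' + link['href'])
    detail_soup = BeautifulSoup(detail.content, 'lxml')
    # Per the answer above, the extra attributes live in <div class="view">.
    view = detail_soup.find('div', class_='view')
    if view is not None:
        print(view.get_text(' ', strip=True))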

Beautiful Soup can't find this HTML

Python3 - Beautiful Soup 4
I'm trying to parse the weather graph out of the website:
https://www.wunderground.com/forecast/us/ny/new-york-city
But when I grab the weather graph HTML, Beautiful Soup seems to grab everything around it.
I am new to Beautiful Soup. I think it is not able to grab this because either it can't parse the custom tags the site uses, or because the JavaScript that populates the graph hasn't loaded or isn't parsable by BS (at least the way I'm using it).
As far as my code goes, it's extremely basic
import requests, bs4
url = 'https://www.wunderground.com/forecast/us/ny/new-york-city'
requrl = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
requrl.raise_for_status()
bs = bs4.BeautifulSoup(requrl.text, features="html.parser")
a = str(bs)
x = 'weather-graph'
print(a[a.find(x):])
#Also tried a.find('weather-graph') which returns -1
I have verified that each piece of the code works in other scenarios. The last line should find that string and print out everything after it.
I tried setting x to many different pieces of the HTML in and around the graph, but got nothing of substance.
There is an API you can use, the same one the page itself calls. I don't know if the key expires. You may need to do some ordering on the output, but you can do that by the datetime field:
import requests

r = requests.get('https://api.weather.com/v1/geocode/40.765/-73.981/forecast/hourly/240hour.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e').json()
for i in r['forecasts']:
    print(i)
If you're unsure, I will happily update to show you how to build a dataframe and order it.
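For example, a minimal sketch of that dataframe step (the 'fcst_valid_local' field name is an assumption about the response schema; check the keys of a forecast dict first):
import pandas as pd
import requests

r = requests.get('https://api.weather.com/v1/geocode/40.765/-73.981/forecast/hourly/240hour.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e').json()
# Build a dataframe from the list of forecast dicts and sort it by the
# datetime field ('fcst_valid_local' is assumed, not confirmed).
df = pd.DataFrame(r['forecasts'])
df = df.sort_values('fcst_valid_local').reset_index(drop=True)
print(df.head())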

Scraping a constantly changing integer from a website

I am trying to extract numeric data from a website. I tried using a simple web scraper to retrieve the data:
from mechanize import Browser
from bs4 import BeautifulSoup
mech = Browser()
url = "http://www.oanda.com/currency/live-exchange-rates/"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
data1 = soup.find(id='EUR_USD-b-int')
print data1
This kind of approach normally would give the line of data from the website, including the contents of the element I am trying to extract. However, it gives everything but the contents, which is the part I need. I have tried .contents and it returns []. I've also tried .child and it returns None. Does anyone know another method that could work? I have looked through the Beautiful Soup documentation but I can't seem to find a solution.
The value on this page is updated using Javascript by making a request to
GET http://www.oanda.com/lfr/rates_lrrr?tstamp=1392757175089&lrrr_inverts=1
Referer: http://www.oanda.com/currency/live-exchange-rates/
(Be aware that I was blocked 4 times just looking at this; they are extremely block-happy. This is because they sell this data commercially as a subscription service.)
The request is made, and the response parsed, in http://www.oanda.com/jslib/wl/lrrr/liverates.js. The response is "encrypted" with RC4 (http://en.wikipedia.org/wiki/RC4).
The RC4 decrypt method comes from http://www.oanda.com/wandacache/rc4-ea63ca8c97e3cbcd75f72603d4e99df48eb46f66.js. It looks like this file is refreshed often, so you'll need to grab the latest link from the homepage and extract the var key=<value> to fully decrypt the value.
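For reference, a generic RC4 decrypt in Python (the standard KSA/PRGA routine from the algorithm's description; the key itself still has to be pulled from the rc4-*.js file, and how the response bytes are framed, such as hex or base64, is an assumption you'd need to verify):
def rc4(key: bytes, data: bytes) -> bytes:
    # Key-scheduling algorithm (KSA)
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation algorithm (PRGA), XORed over the data
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) % 256])
    return bytes(out)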

How can I iterate over specific elements in HTML file and replace them?

I need to do a seemingly simple thing in Python which turned out to be quite complex. What I need to do is:
Open an HTML file.
Match all instances of a specific HTML element, for example table.
For each instance, extract the element as a string, pass that string to an external command which will do some modifications, and finally replace the original element with a new string returned from the external command.
I can't simply do a re.sub(), because in each case the replacement string is different and based on the original string.
Any suggestions?
You could use Beautiful Soup to do this, although for what you need, something simpler like lxml.etree would work fine.
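A minimal sketch of the lxml.etree route (transform() stands in for your external command, and 'page.html' is a placeholder filename):
from lxml import etree, html

def transform(markup):
    # Placeholder for the external command that rewrites the element.
    return markup.replace('<td>', '<td class="modified">')

doc = html.parse('page.html').getroot()
for table in doc.findall('.//table'):
    # Serialize the element, transform it, parse the result back in,
    # and swap it into the tree in place of the original.
    new_markup = transform(etree.tostring(table, encoding='unicode'))
    new_elem = html.fragment_fromstring(new_markup)
    table.getparent().replace(table, new_elem)
print(etree.tostring(doc, encoding='unicode'))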
Sounds like you want BeautifulSoup. Likely, you'd want to do something like:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
tables = soup.find_all('table')
for table in tables:
    contents = str(table.contents)
    new_contents = transform(contents)
    table.replaceWith(new_contents)
Alternatively, you may be looking for something closer to soup.replace_with.
EDIT: Updated to the eventual solution.
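If you go the replace_with route, a small sketch (parsing the transformed string back into soup so it lands in the tree as real markup rather than escaped text; transform() is again your external command):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
for table in soup.find_all('table'):
    new_markup = transform(str(table))
    # Parse the transformed markup and swap it in for the original tag.
    table.replace_with(BeautifulSoup(new_markup, 'html.parser'))
print(soup)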
I have found that parsing HTML via BeautifulSoup or any other such parser gets complex when you need to handle different pages with different structures, which are sometimes not well-formed, use JavaScript manipulation, etc. The best solution in this case is to directly access the browser DOM and query and modify nodes. You can easily do that in a headless browser like PhantomJS.
For example, here is a PhantomJS script:
var page = require('webpage').create();
page.content = '<html><body><table><tr><td>1</td><td>2</td></tr></table></body></html>';
page.evaluate(function () {
    var elems = document.getElementsByTagName('td');
    for (var i = 0; i < elems.length; i++) {
        elems[i].innerHTML = '!' + elems[i].innerHTML + '!';
    }
});
console.log(page.content);
phantom.exit();
It changes all the td text, and the output is:
<html><head></head><body><table><tbody><tr><td>!1!</td><td>!2!</td></tr></tbody></table></body></html>
