Simple examples on how to use html2json - python

I am brand new to Python. I have about two weeks of experience, so I don't understand a lot of what I am seeing online. I have to download HTML text from multiple similar pages and convert it to JSON, and it must include the HTML links; it cannot be just the table contents. I have figured out how to use Beautiful Soup to download the code off a website, and I have been able to pull the relevant portions into a Python list of 547 similar groupings (547 in this one file so far). The next step is to convert it to JSON. I have "a", "tr", and "td" tags, "class", "data-href", and "href" (attributes?), and the text and links associated with those. I think html2json will work for me, but I cannot figure out how to use it. I installed it, and it is in my library. I understand that collect(html, template) does the conversion, but nowhere can I find an explanation of how to build the template. I have only found two pages that describe it, both in unfamiliar terms and with no actual examples. I can't even find anything that explains whether a template uses (), {}, or [], or whether it does or does not use = for assignment. Can someone provide an example of a few lines of HTML and what the template actually looks like in Python? Here is my code so far:
import requests
from bs4 import BeautifulSoup
from html2json import collect

page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

# each result row is a <tr class="gsc_a_tr">
gs_results = soup.find_all('tr', class_='gsc_a_tr')

gs_links = []
for i in gs_results:
    gs_links.append(i)

template = I_HAVE_NO_IDEA_WHAT_GOES_HERE
collect(gs_links, template)
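For reference, here is a minimal sketch of a fallback that skips html2json entirely and builds the JSON by hand with BeautifulSoup plus the standard json module. The row markup and field names below are illustrative assumptions shaped like a Scholar result row, not your actual data:

import json
from bs4 import BeautifulSoup

# illustrative HTML, shaped like one result row (an assumption, not real data)
html = '''
<tr class="gsc_a_tr" data-href="/citations?view_op=view_citation&citation_for_view=X">
  <td class="gsc_a_t"><a href="/citations?view_op=view_citation&citation_for_view=X">Paper title</a></td>
  <td class="gsc_a_c"><a href="/scholar?cites=123">42</a></td>
</tr>
'''

soup = BeautifulSoup(html, 'lxml')
records = []
for row in soup.find_all('tr', class_='gsc_a_tr'):
    record = {'data-href': row.get('data-href')}
    # keep every link's text AND its href, as the question requires
    record['links'] = [{'text': a.get_text(strip=True), 'href': a.get('href')}
                       for a in row.find_all('a')]
    records.append(record)

print(json.dumps(records, indent=2))

json.dumps turns the list of dicts straight into a JSON string, with every link's text and href preserved.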

Related

scraping data from web page but missing content

I am downloading the verb conjugations to aid my learning. However, one thing I can't seem to get from this web page is the English translation near the top of the page.
The code I have is below. When I print results_eng it prints the section I want, but there is no English translation. What am I missing?
import requests
from bs4 import BeautifulSoup
URL = 'https://conjugator.reverso.net/conjugation-portuguese-verb-ser.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results_eng = soup.find(id='list-translations')
eng = results_eng.find_all('p', class_='context_term')
On a normal website you should be able to get the text of a paragraph with get_text(), but in this case the page is a search result, which means it is probably pulling the data from a database, and the text is not in the paragraph itself. At least that's what I can come up with, since I tried that function and got an empty string back. Can you try another website and see what happens?
P.S.: I'm a beginner, sorry if I'm guessing wrong.
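One way to test that guess: fetch the page with requests and check whether the expected English text is present in the raw HTML at all. A minimal sketch (assuming 'to be' is the translation the page shows for "ser"):

import requests

URL = 'https://conjugator.reverso.net/conjugation-portuguese-verb-ser.html'
page = requests.get(URL)
# if this prints False, the translation is injected by JavaScript after load,
# and requests alone will never see it
print('to be' in page.text)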

Python/Requests/Beautiful Soup Basic Scrape

Hope you're all well. I wrote a basic webscrape of an HTML site earlier today, along similar lines. I was following a tutorial, and as you'll be able to see from my code, I'm a bit of a greenhorn at coding in Python. Hoping for a bit of guidance on scraping this site.
As you can see from the commented-out line
#print(results.prettify())
I am able to successfully print out the entire contents of the webpage. What I'd like to do, however, is whittle down what I am printing so that only the relevant content remains; there is a lot on the page that I don't want, and I'd like to massage it out. Does anyone have any thoughts on why the for loop at the bottom of the code is not sequentially grabbing the paragraphs in the xlmins unit of the HTML and printing them out? Please see the code below.
import requests
from bs4 import BeautifulSoup

URL = "http://www.gutenberg.org/files/7142/7142-h/7142-h.htm"
page = requests.get(URL)
# create a Beautiful Soup object that will parse the page
soup = BeautifulSoup(page.content, 'html.parser')
# grab the element carrying the xmlns attribute (the top-level <html> tag)
results = soup.find(xmlns='http://www.w3.org/1999/xhtml')
#print(results.prettify())
job_elems = results.find_all('p', xlmins="http://www.w3.org/1999/xhtml")
for job in job_elems:
    paragraph = job.find("p", xlmins='http://www.w3.org/1999/xhtml')
    print(paragraph.text.strip)
No <p> tag contains the attribute xlmins='http://www.w3.org/1999/xhtml'; only the top-level html tag carries that namespace attribute (and there it is spelled xmlns). Remove that filter and you'll get all the paragraphs:
job_elems = results.find_all('p')
for job in job_elems:
    print(job.text.strip())
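Putting the fix together, a minimal end-to-end version (same URL as above; it prints only the first few paragraphs as a sanity check):

import requests
from bs4 import BeautifulSoup

URL = "http://www.gutenberg.org/files/7142/7142-h/7142-h.htm"
soup = BeautifulSoup(requests.get(URL).content, 'html.parser')
# no attribute filter: just take every <p> in the document
for p in soup.find_all('p')[:5]:
    print(p.get_text(strip=True))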

Python (Selenium/BeautifulSoup) Search Result Dynamic URL

Disclaimer: This is my first foray into web scraping
I have a list of URLs corresponding to search results, e.g.,
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
I'm trying to use Selenium to access the HTML of the result as follows:
for url in detail_urls:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
However, when I comb through the resulting prettified soup, I notice that the components I need are missing. Upon looking back at the page loading process, I see that the URL redirects a few times as follows:
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
https://www.vinelink.com/#/searchResults/id/offender/34003/33/2662
https://www.vinelink.com/#/searchResults/1
Does anyone have a tip on how to access the final search results data?
Update: After further exploration this seems like it might have to do with the scripts being executed to retrieve the relevant data for display... there are many search results-related scripts referenced in the page_source; is there a way to determine which is relevant?
I am able to Inspect the information I need in the browser's developer tools.
Once you have your soup variable with the HTML, follow the code below.
import json
data = soup.find('search-result')['data']
print(data)
Output:
{"offender_sid":154070373,"siteId":34003,"siteDesc":"NC_STATE","first_name":"WESLEY","last_name":"ADAMS","middle_initial":"CHURCHILL","alias_first_name":null,"alias_last_name":null,"alias_middle_initial":null,"oid":"2662","date_of_birth":"1965-11-21","agencyDesc":"Durham County Detention Center","age":53,"race":2,"raceDesc":"African American","gender":null,"genderDesc":null,"status_detail":"Durham County Detention Center","agency":33,"custody_status_cd":1,"custody_detail_cd":33,"custody_status_description":"In Custody","aliasFlag":false,"registerValid":true,"detailAgLink":false,"linkedCases":false,"registerMessage":"","juvenile_flg":0,"vineLinkInd":1,"vineLinkAgAccessCd":2,"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"},{"rel":"self","href":"//www.vinelink.com/VineAppWebService/api/offender/?offSid=154070373&lang=en_US"}],"actions":[{"name":"register","template":"//www.vinelink.com/VineAppWebService/api/register/{json data}","method":"POST"}]}
Now treat each value like a dict.
Next:
info = json.loads(data)
print(info['first_name'], info['last_name'])
#This prints the first and last name, but you can get other fields the same way: just use the key, like 'date_of_birth' or 'siteId'. You can also assign them to variables.
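If the 'search-result' element is missing from page_source because the page's JavaScript hasn't rendered yet, an explicit wait may help. A sketch under those assumptions (the ChromeDriver setup is illustrative; the tag name comes from the answer above):

import json
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.vinelink.com/#/searchResults/id/offender/34003/33/2662')
# block until the custom element is actually in the DOM (up to 15 s)
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.TAG_NAME, 'search-result')))
soup = BeautifulSoup(driver.page_source, 'html.parser')
info = json.loads(soup.find('search-result')['data'])
print(info['first_name'], info['last_name'])
driver.quit()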

Trouble parsing HTML using BeautifulSoup

I'm trying to use BeautifulSoup to parse some HTML in Python. Specifically, I'm trying to create two arrays of soup objects: one for the dates of postings on a website, and one for the postings themselves. However, when I use findAll on the div class that matches the postings, only the initial tag is returned, not the text inside the tag. On the other hand, my code works just fine for the dates. What is going on?
# store all texts of posts
texts = soup.findAll("div", {"class":"quote"})
# store all dates of posts
dates = soup.findAll("div", {"class":"datetab"})
The first line above returns only
<div class="quote">
which is not what I want. The second line returns
<div class="datetab">Feb<span>2</span></div>
which IS what I want (pre-refining).
I have no idea what I'm doing wrong. Here is the website I'm trying to parse. This is for homework, and I'm really really desperate.
Which version of BeautifulSoup are you using? Version 3.1.0 performs significantly worse with real-world HTML (read: invalid HTML) than 3.0.8. This code works with 3.0.8:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://harvardfml.com/")
soup = BeautifulSoup(page)
for incident in soup.findAll('span', { "class" : "quote" }):
    print incident.contents
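That answer dates from the Python 2 / BeautifulSoup 3 era. A rough bs4 / Python 3 equivalent, for anyone reading now (a sketch; the site and its markup may well have changed since):

import requests
from bs4 import BeautifulSoup

page = requests.get("http://harvardfml.com/")
soup = BeautifulSoup(page.text, 'html.parser')
for incident in soup.find_all('span', class_='quote'):
    # .contents gives the child nodes; get_text() gives just the text
    print(incident.get_text(strip=True))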
That site is powered by Tumblr, and Tumblr has an API.
There's a Python wrapper for the Tumblr API that you can use to read posts.
from tumblr import Api

api = Api('harvardfml.com')
freq = {}
posts = api.read()
for post in posts:
    # do something with each post here
    pass
As for your bogus findAll: without the actual source code of your program, it is hard to tell what is wrong.

Decomposing HTML to link text and target

Given an HTML link like
<a href="http://www.example.com/">texttxt</a>
how can I isolate the URL and the text?
Updates
I'm using Beautiful Soup, and am unable to figure out how to do that.
I did
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
links = soup.findAll('a')
for link in links:
    print "link content:", link.content, " and attr:", link.attrs
i get
link content: None and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root/support.asp')]
...
Why am I missing the content?
edit: elaborated on 'stuck' as advised :)
Use Beautiful Soup. Doing it yourself is harder than it looks; you'll be better off using a tried and tested module.
EDIT:
I think you want:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
By the way, it's a bad idea to try opening the URL there, as if it goes wrong it could get ugly.
EDIT 2:
This should show you all the links in a page:
import urlparse, urllib
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()
soup = BeautifulSoup(source)
for item in soup.findAll('a'):
    try:
        link = urlparse.urlparse(item['href'].lower())
    except:
        # Not a valid link
        pass
    else:
        print link
Here's a code example, showing getting the attributes and contents of the links:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.attrs, link.contents
Looks like you have two issues there:

1. It's link.contents, not link.content.
2. attrs is a dictionary, not a string. It holds key/value pairs for each attribute in an HTML element. link.attrs['href'] will get you what you appear to be looking for, but you'd want to wrap that in a check in case you come across an a tag without an href attribute (see the sketch below).
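A minimal sketch of that wrap-it-in-a-check, using bs4 (where attrs is indeed a dict) and Tag.get(), which returns None instead of raising when the attribute is missing:

from bs4 import BeautifulSoup

html = '<a href="/some/page">a link</a><a name="anchor-only">no href here</a>'
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    href = link.get('href')  # None when the tag has no href attribute
    if href is None:
        continue  # skip anchors that aren't real links
    print(href, link.contents)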
Though I suppose the others might be correct in pointing you to Beautiful Soup, they might not be, and using an external library might be massively over the top for your purposes. Here's a regex which will do what you ask:
/<a\s+[^>]*?href="([^"]*)".*?>(.*?)<\/a>/
Here's what it matches:

'<a href="url">text</a>'
// Parts: "url", "text"

'<a href="url">text<span>something</span></a>'
// Parts: "url", "text<span>something</span>"
If you wanted to get just the text (e.g. "textsomething" in the second example above), I'd just run another regex over it to strip anything between angle brackets, as in the sketch below.
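For what it's worth, a sketch of that regex in use from Python, including the follow-up tag-stripping pass described above (the sample anchor is illustrative):

import re

html = '<a href="http://www.example.com/">text<span>something</span></a>'
link_re = re.compile(r'<a\s+[^>]*?href="([^"]*)".*?>(.*?)</a>', re.S)
for url, inner in link_re.findall(html):
    text = re.sub(r'<[^>]*>', '', inner)  # strip anything between angle brackets
    print(url, text)  # http://www.example.com/ textsomething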
