I'm creating a small Amazon Alexa skill called JokePro, and I have made a website that I can upload jokes to directly. The jokes go into a .txt file in the database and are then loaded onto the (admittedly crude) page from there.
I am looking to randomly select lines from the joke file, which is displayed directly on the page with an <object> tag.
How would I go about scraping the text given by the <object> tag?
http://jokepro.dx.am
import bs4
import requests

source = requests.get("http://jokepro.dx.am/")
bs4call = bs4.BeautifulSoup(source.text, "html.parser")
parsed = bs4call.find('pre')  # I've replaced 'pre' with 'object' as well
Any help would be appreciated.
If I understand you correctly, you want to load the text file referenced by the <object> tag and then select a random line from it:
import bs4
import requests
import random
url = "http://jokepro.dx.am/"
source = requests.get(url)
bs4call = bs4.BeautifulSoup(source.text, "html.parser")
obj = bs4call.find('object')
text = requests.get(url + obj['data']).text
# print(text) # <-- to print the textfile
print(random.choice(text.splitlines()))
This prints (for example):
want to know a REALLY good joke? A high school student making this application in a week!
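One caveat about url + obj['data']: that concatenation only works while the data attribute holds a relative path and the base URL ends with a slash. A slightly more robust variant uses urljoin, which resolves both relative and absolute values. A minimal sketch, assuming Python 3 (on Python 2 the import comes from urlparse instead):

import bs4
import random
import requests
from urllib.parse import urljoin

url = "http://jokepro.dx.am/"
source = requests.get(url)
soup = bs4.BeautifulSoup(source.text, "html.parser")
obj = soup.find('object')

# resolve the <object> data attribute against the page URL,
# whether it is relative ("jokes.txt") or already absolute
jokes_url = urljoin(url, obj['data'])
text = requests.get(jokes_url).text
print(random.choice(text.splitlines()))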
I am supposed to write code that goes to a web site and gets its title, so here is the code I have:
import urllib.request

def findTitle(url):
    urllib.request.Request(url)
    # open url
    urllib.request.urlopen(url)
    urllib.request.urlopen(url).read().decode('utf-8')
    # set same variable equal to the end of <title> tag
    endTitlePos = url.find("<title>")
    # set variable equal to starting position of <title> tag
    startTitlePos = url.find("<title>", endTitlePos)
    startTitlePos += len("<title>")
    # set new variable equal to </title>
    TitleContent = url.find("</title>", startTitlePos)
    # return slice of output between the two variables
    title = url[startTitlePos:endTitlePos]
    content_list = []
    content_list.append(title)
    return content_list

def main():
    url = "https://google.com/search"
    print(findTitle(url))

main()
We are using Google as an example. It's supposed to just print "google", but currently it prints "['//google.com/searc']". I am just curious what I am missing here; it seems very simple, but I don't know why it's printing the URL rather than the title. Also, how do I turn it from a list into a string?
There are several alternatives for getting data from web pages. The best is to use BeautifulSoup. In your case the string split() method works well:
import urllib.request

def findTitle(url):
    webpage = urllib.request.urlopen(url).read()
    title = str(webpage).split('<title>')[1].split('</title>')[0]
    return title
>>> print(findTitle('http://www.google.com'))
Google
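Since BeautifulSoup was mentioned, here is roughly what that version would look like (a minimal sketch, assuming the bs4 package is installed; find('title') returns the first <title> tag and .string gives its text):

import urllib.request
import bs4

def findTitle(url):
    # parse the page instead of splitting the raw string
    webpage = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(webpage, "html.parser")
    return soup.find('title').string

print(findTitle('http://www.google.com'))  # prints: Google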
I am trying to write a program that will collect specific information from an ebay product page and write that information to a text file. To do this I'm using BeautifulSoup and Requests and I'm working with Python 2.7.9.
I've been mostly using this tutorial (Easy Web Scraping with Python) with a few modifications. So far everything works as intended until it writes to the text file. The information is written, just not in the format that I would like.
What I'm getting is this:
{'item_title': u'Old Navy Pink Coat M', 'item_no': u'301585876394', 'item_price': u'US $25.00', 'item_img': 'http://i.ebayimg.com/00/s/MTYwMFgxMjAw/z/Sv0AAOSwv0tVIoBd/$_35.JPG'}
What I was hoping for was something that would be a bit easier to work with.
For example :
New Shirt 5555555555 US $25.00 http://ImageURL.jpg
In other words I want just the scraped text and not the brackets, the 'item_whatever', or the u'.
After a bit of research I suspect my problem has to do with the encoding of the information as it's being written to the text file, but I'm not sure how to fix it.
So far I have tried:
def collect_data():
    with open('writetest001.txt','w') as x:
        for product_url in get_links():
            get_info(product_url)
            data = "'{0}','{1}','{2}','{3}'".format(item_data['item_title'],'item_price','item_no','item_img')
            x.write(str(data))
I hoped that it would make the data easier to format in the way I want, but it only resulted in "NameError: global name 'item_data' is not defined" being displayed in IDLE.
I have also tried using .split() and .decode('utf-8') in various positions, but I only get AttributeErrors or the written output does not change.
Here is the code for the program itself.
import requests
import bs4

# Main URL for Harvesting
main_url = 'http://www.ebay.com/sch/Coats-Jackets-/63862/i.html?LH_BIN=1&LH_ItemCondition=1000&_ipg=24&rt=nc'

# Harvests Links from "Main" Page
def get_links():
    r = requests.get(main_url)
    data = r.text
    soup = bs4.BeautifulSoup(data)
    return [a.attrs.get('href') for a in soup.select('div.gvtitle a[href^=http://www.ebay.com/itm]')]

print "Harvesting Now... Please Wait...\n"
print "Harvested:", len(get_links()), "URLs"
# print (get_links())
print "Finished Harvesting... Scraping will Begin Shortly...\n"

# Scrapes Select Information from each page
def get_info(product_url):
    item_data = {}
    r = requests.get(product_url)
    data = r.text
    soup = bs4.BeautifulSoup(data)
    # Fixes the 'Details about ' problem in the Title
    for tag in soup.find_all('span', {'class': 'g-hdn'}):
        tag.decompose()
    item_data['item_title'] = soup.select('h1#itemTitle')[0].get_text()
    # Grabs the Price; if the item is on sale, grabs the sale price
    try:
        item_data['item_price'] = soup.select('span#prcIsum')[0].get_text()
    except IndexError:
        item_data['item_price'] = soup.select('span#mm-saleDscPrc')[0].get_text()
    item_data['item_no'] = soup.select('div#descItemNumber')[0].get_text()
    item_data['item_img'] = soup.find('img', {'id': 'icImg'})['src']
    return item_data

# Collects information from each page and writes it to a text file
write_it = open("writetest003.txt", "w", "utf-8")

def collect_data():
    for product_url in get_links():
        write_it.write(str(get_info(product_url)) + '\n')

collect_data()
write_it.close()
You were on the right track.
You need a local variable to assign the results of get_info to. The variable item_data you tried to reference only exists within the scope of the get_info function. You can use the same variable name though, and assign the results of the function to it.
There was also a syntax issue in the section you tried with respect to how you're formatting the items.
Replace the section you tried with this:
for product_url in get_links():
    item_data = get_info(product_url)
    data = "{0},{1},{2},{3}".format(*(item_data[item] for item in ('item_title', 'item_price', 'item_no', 'item_img')))
    x.write(data)
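On the encoding worry from the question: in Python 2 the built-in open() does not take an encoding argument (its third parameter is a buffer size), so writing non-ASCII scraped fields can raise a UnicodeEncodeError. One way around that is io.open, sketched below under the assumption that get_links() and get_info() are the question's own functions:

import io

# io.open accepts an encoding in both Python 2 and 3;
# unicode strings are then transparently written as UTF-8
with io.open('writetest001.txt', 'w', encoding='utf-8') as x:
    for product_url in get_links():
        item_data = get_info(product_url)
        data = u"{0},{1},{2},{3}\n".format(*(item_data[item] for item in ('item_title', 'item_price', 'item_no', 'item_img')))
        x.write(data)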
I'm currently working on an arcbot, and I'm trying to make a "!urbandictionary" command that scrapes the meaning of a term, specifically the first meaning that Urban Dictionary provides. If there's another solution, e.g. another dictionary site with a better API, that's also fine. Here's my code:
if Command.lower() == '!urban':
    dictionary = Argument[1]  # this is the term which the user provides, e.g. "scrape"
    dictionaryscrape = urllib2.urlopen('http://www.urbandictionary.com/define.php?term='+dictionary).read()  # plain html of the site
    scraped = getBetweenHTML(dictionaryscrape, '<div class="meaning">','</div>')  # Here's my problem: I'm not sure if it scrapes the first meaning or not
    messages.main(scraped,xSock,BotID)  # Sends the meaning of the provided word (Argument[0])
How do I correctly scrape the meaning of a word from Urban Dictionary?
Just get the text from the meaning class:
import requests
from bs4 import BeautifulSoup
word = "scrape"
r = requests.get("http://www.urbandictionary.com/define.php?term={}".format(word))
soup = BeautifulSoup(r.content)
print(soup.find("div",attrs={"class":"meaning"}).text)
Gassing and breaking your car repeatedly really fast so that the front and rear bumpers "scrape" the pavement; while going hyphy
There is an unofficial API here, apparently:
`http://api.urbandictionary.com/v0/define?term={word}`
From https://github.com/zdict/zdict/wiki/Urban-dictionary-API-documentation
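For completeness, that endpoint can be queried directly instead of scraping the HTML. A minimal sketch (the response layout is taken from the documentation linked above and may change, since the API is unofficial):

import requests

word = "scrape"
r = requests.get("http://api.urbandictionary.com/v0/define", params={"term": word})
data = r.json()

# the JSON response contains a "list" of definition objects; take the first one if present
if data.get("list"):
    print(data["list"][0]["definition"])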
So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the URLs also look like this, meaning that there are multiple, separated statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen
script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")
def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('">', start_link+1)
    end_link = page.find('<', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

for line in input:
    pg = urlopen(line).read()
    statute = get_legal(pg)
    output.write(get_legal(pg))
This gives me the desired statute name in the "legalfile" output .txt file. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
    # this is where Legal Authority: starts in the code
    end_link = ""
    legal = ""
    start_link = page.find('Legal Authority:')
    while (end_link != '</a> '):
        start_legal = page.find('">', start_link+1)
        end_link = page.find('<', start_legal+1)
        end2 = page.find('</a> ', end_link+1)
        legal += page[start_legal+2: end_link]
        if
            break
    return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using the requests library to fetch the page content here; it is just a recommended and very easy-to-use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup

# fetch the content of the page with the requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)

# parse the html
html = BeautifulSoup(response.content)

# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})

def fetch_parent_tag(tags):
    # fetch the parent <td> tag of the first <a> tag
    # whose "previous sibling" is the <b>Legal Authority:</b> tag.
    for tag in tags:
        sibling = tag.findPreviousSibling()
        if not sibling:
            continue
        if sibling.getText() == 'Legal Authority:':
            return tag.findParent()

# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')

for tag in tags_you_want:
    print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran an SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have made a few million HTTP requests to a webservice (and they were OK with that), each fetching about 1 kB. This would have taken a long time and would have been quite inconvenient (requiring some error handling, since some of these requests would always time out), and it would have been non-atomic due to paging. Instead, I was mailed a DVD.
I'd imagine that the Office of Management and Budget could be similarly accommodating.
I have the following sample code where I download a pdf from the European Parliament website on a given legislative proposal:
EDIT: I ended up just getting the link and feeding it to Adobe's online conversion tool (see the code below):
import mechanize
import urllib2
import re
from BeautifulSoup import *
adobe = "http://www.adobe.com/products/acrobat/access_onlinetools.html"
url = "http://www.europarl.europa.eu/oeil/search_reference_procedure.jsp"
def get_pdf(soup2):
    link = soup2.findAll("a", "com_acronym")
    new_link = []
    amendments = []
    for i in link:
        if "REPORT" in i["href"]:
            new_link.append(i["href"])
    if new_link == None:
        print "No A number"
    else:
        for i in new_link:
            page = br.open(str(i)).read()
            bs = BeautifulSoup(page)
            text = bs.findAll("a")
            for i in text:
                if re.search("PDF", str(i)) != None:
                    pdf_link = "http://www.europarl.europa.eu/" + i["href"]
            pdf = urllib2.urlopen(pdf_link)
            name_pdf = "%s_%s.pdf" % (y,p)
            localfile = open(name_pdf, "w")
            localfile.write(pdf.read())
            localfile.close()

            br.open(adobe)
            br.select_form(name = "convertFrm")
            br.form["srcPdfUrl"] = str(pdf_link)
            br["convertTo"] = ["html"]
            br["visuallyImpaired"] = ["notcompatible"]
            br.form["platform"] = ["Macintosh"]
            pdf_html = br.submit()
            soup = BeautifulSoup(pdf_html)
page = range(1,2)  # can be set to 400 to get every document for a given year
year = range(1999,2000)  # can be set to 2011 to get documents from all years

for y in year:
    for p in page:
        br = mechanize.Browser()
        br.open(url)
        br.select_form(name = "byReferenceForm")
        br.form["year"] = str(y)
        br.form["sequence"] = str(p)
        response = br.submit()
        soup1 = BeautifulSoup(response)
        test = soup1.find(text="No search result")
        if test != None:
            print "%s %s No page skipping..." % (y,p)
        else:
            print "%s %s Writing dossier..." % (y,p)
            for i in br.links(url_regex="file.jsp"):
                link = i
            response2 = br.follow_link(link).read()
            soup2 = BeautifulSoup(response2)
            get_pdf(soup2)
In the get_pdf() function I would like to convert the PDF file to text in Python so I can parse the text for information about the legislative procedure. Can anyone explain to me how this can be done?
Thomas
Sounds like you found a solution, but if you ever want to do it without a web service, or you need to scrape data based on its precise location on the PDF page, can I suggest my library, pdfquery? It basically turns the PDF into an lxml tree that can be spit out as XML, or parsed with XPath, PyQuery, or whatever else you want to use.
To use it, once you had the file saved to disk you would run pdf = pdfquery.PDFQuery(name_pdf), or pass in a urllib file object directly if you didn't need to save it. To get XML out to parse with BeautifulSoup, you could do pdf.tree.tostring().
If you don't mind using JQuery-style selectors, there's a PyQuery interface with positional extensions, which can be pretty handy. For example:
balance = pdf.pq(':contains("Your balance is")').text()
strings_near_the_bottom_of_page_23 = [el.text for el in pdf.pq('LTPage[page_label=23] :in_bbox(0, 0, 600, 200)')]
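Tied back to the question's get_pdf() function, the flow would look roughly like this (a minimal sketch, assuming pdfquery is installed and name_pdf is the file saved by the question's code; here the XML is serialized with plain lxml rather than a pdfquery helper):

import pdfquery
from lxml import etree

# load the PDF that the question's code saved as name_pdf
pdf = pdfquery.PDFQuery(name_pdf)
pdf.load()

# serialize the layout tree to XML, e.g. to inspect it or feed it to BeautifulSoup
xml_bytes = etree.tostring(pdf.tree, pretty_print=True)
with open(name_pdf.replace(".pdf", ".xml"), "w") as f:
    f.write(xml_bytes)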
It's not exactly magic. I suggest
downloading the PDF file to a temp directory,
calling out to an external program to extract the text into a (temp) text file,
reading the text file.
For the text extraction there are a number of command-line utilities, and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out, use subprocess.Popen or subprocess.call(); see the sketch below.
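A rough sketch of those three steps, using the poppler pdftotext utility as the external extractor (an assumption; any of the linked tools with a similar command-line interface would do, and the URL below is only a hypothetical stand-in for whatever get_pdf() found):

import os
import subprocess
import tempfile
import urllib2

pdf_link = "http://www.europarl.europa.eu/some_report.pdf"  # hypothetical URL; in practice this comes from get_pdf()

# 1. download the PDF into a temp directory
tmp_dir = tempfile.mkdtemp()
pdf_path = os.path.join(tmp_dir, "report.pdf")
with open(pdf_path, "wb") as f:
    f.write(urllib2.urlopen(pdf_link).read())

# 2. call out to the external converter, which writes a plain-text file
txt_path = os.path.join(tmp_dir, "report.txt")
subprocess.call(["pdftotext", pdf_path, txt_path])

# 3. read the extracted text back in and parse it
with open(txt_path) as f:
    text = f.read()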