web2py button should do different things - python

I'm new to web2py and websites in general.
I would like to upload xml-files with different numbers of questions in it.
I'm using bs4 to parse the uploaded file and then I want to do different things: if there is only one question in the xml file I would like to go to a new site and if there are more questions in it I want to go to another site.
So this is my code:
def import_file():
form = SQLFORM.factory(Field('file','upload', requires = IS_NOT_EMPTY(), uploadfolder = os.path.join(request.folder, 'uploads'), label='file:'))
if form.process().accepted:
soup = BeautifulSoup('file', 'html.parser')
questions = soup.find_all(lambda tag:tag.name == "question" and tag["type"] != "category")
# now I want to check the length of the list to redirect to the different URL's, but it doesn't work, len(questions) is 0.
if len(questions) == 1:
redirect(URL('import_questions'))
elif len(questions) > 1:
redirect(URL('checkboxes'))
return dict(form=form, message=T("Please upload the file"))
Does anybody know what I can do, to check the length of the list, after uploading the xml-file?

BeautifulSoup expects a string or a file-like object, but you're passing 'file'. Instead, you should use:
with open(form.vars.file) as f:
soup = BeautifulSoup(f, 'html.parser')
However this is not a web2py-specific problem.
Hope this helps.

Related

Cannot find the attribute in HTML source but the Python program runs correctly

I am scraping a YouTube page and find a open program codes online. The code runs and returns correct results. However, as I learn the code sentence by sentence, i find that I could not find the attribute in the source code. I searched for it in page source, inspect element view and copied and paste the raw code in word. Nowhere could I find it.
How did this happen?
Codes below:
soup=BeautifulSoup(result.text,"lxml")
# cannot find yt-lockup-meta-info anywhere......
view_element=soup.find_all("ul",class_="yt-lockup-meta-info")
totalview=0
for objects in view_element:
view_list=obj.findChildren()
for element in view_list:
if element.string.endwith("views"):
videoviews=element.text.replace("views","").replace(",","")
totalview=totalview+int(videoviews)
print(videoviews)
print("----------------------")
print("Total_Views"+str(totalview))
The attribute I searched for is "yt-lockup-meta-info".
The page source is here.
The original page.
I see a few problems, which I think might be cleared up if I saw the full code. However there are some things that need fixed within this block.
For example, this line should read:
for obj in view_element:
instead of:
for objects in view_element:
You are only referencing one "obj", not multiple objects when traversing through "view_element".
Also, there is no need to search for the word "views" when there is a class you can search directly.
Here is how I would address this problem. Hope this helps.
#Go to website and convert page source to Soup
response = requests.get('https://www.youtube.com/results?search_query=web+scraping+youtube')
soup = BeautifulSoup(response.text, 'lxml')
f.close()
videos = soup.find_all('ytd-video-renderer') #Find all videos
total_view_count = 0
for video in videos:
video_meta = video.find('div', {'id': 'metadata'}) #The text under the video title
view_count_text = video_meta.find_all('span', {'class': 'ytd-video-meta-block'})[0].text.replace('views', '').strip() #The view counter
#Converts view count to integer
if 'K' in view_count_text:
video_view_count = int(float(view_count_text.split('K')[0])*1000)
elif 'M' in view_count_text:
video_view_count = int(float(view_count_text.split('M')[0])*1000000)
elif 'B' in view_count_text:
video_view_count = int(float(view_count_text.split('B')[0])*1000000000)
else:
video_view_count = int(view_count_text)
print(video_view_count)
total_view_count += video_view_count
print(total_view_count)

Python-How would i go about getting block of text from an html document

https://www.cpms.osd.mil/Content/AF%20Schedules/survey-sch/111/111R-03Apr2003.html
This is the page I am trying to parse. It's from a government site which in my experience are not known for keeping up their certificates, so you are are going to be warned about it not being safe by your browser. All I want is this part,http://imgur.com/a/BL14W.
edit: Sorry, for the lack of information. I started asking this question then I got called away at work. It's no excuse but when I came back it was time to go home so, I just kinda hit submit.
I have already tried doing it more "manually" but apparently not all of the documents came out exactly the same. Here is what I tried:
def table_parser(page):
file = open(page)
table = []
num = 0
for line in file:
if 'Grade' in line:
num += 1
if num > 0:
num += 1
if 3 <= num < 21:
line = line.rstrip()
if line != '':
split_line = line.split(' ')
split_line = [x for x in split_line if x != '']
strip_line = split_line[:16]
table.append(strip_line)
WG = []
WL = []
WS = []
for l in table:
WG.append((l[1:6]))
WL.append(l[6:11])
WS.append(l[11:16])
file.close()
# Return 3 lists for the 3 charts I want
return WG, WL, WS
This is what I used that got the about half of the 65k files I started with mostly right. I passed the returned lists into csv writers to store them till I can get them all cleaned up. I know there is probably a better way but I came up with this before I could wrap my head around BeautifulSoup. I don't necessarily want the code to do this, just pointers on where to start. I tried to find documentation on BeautifulSoup but I couldn't figure out where to start for what I need.
Your question is a little vague so I'll try my best to help you.
1. Install Beautiful Soup 4
To get a block of text from a webpage,you will need to use the external library BeautifulSoup4 (BS4). Once downloaded and installed to your computer, first import BS4 using the following from bs4 import BeautifulSoupand import urllib.request. Then simply setup BS4 using soup = BeautifulSoup("", "html.parser").
2. Download Webpage
Downloading a webpage is simple, just use site_download = urllib.request.urlopen(url). In your case, simply replace "url" with the url you provided here. Then we need to read what we've downloaded using site_read = site_download.read().decode('utf-8') followed by soup = BeautifulSoup(site_read, "html.parser").
3. Get Block of Text
You can get text in many different ways, so I'll show you a few examples.
To get the first instance of < P > tag (paragraph) text:
text = soup.find("p")
text = getText()
To get all instances of the < P > tag:
text = soup.findAll("p")
text = getText()
To get text from a specific class:
text = soup.find(attrs={"class": "class_name_here"})
text = getText()
4. Further Info
More information on how to get different types of tags and other things you can do with BS4 can be found HERE.

Extract hyperlinks from PDF in Python

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the pdf.
I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it does not extract the hyperlinks.
For example, I have text that says Check this link out, with a link attached to it. I am able to extract the words Check this link out, but what I really need is the hyperlink itself, not the words.
How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.
I have looked at itextsharp, but haven't used it. I'm running on Ubuntu, and would appreciate any help.
slightly modified version of Ashwin's Answer:
import PyPDF2
PDFFile = open("file.pdf",'rb')
PDF = PyPDF2.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'
for page in range(pages):
print("Current Page: {}".format(page))
pageSliced = PDF.getPage(page)
pageObject = pageSliced.getObject()
if key in pageObject.keys():
ann = pageObject[key]
for a in ann:
u = a.getObject()
if uri in u[ank].keys():
print(u[ank][uri])
This is an old question, but it seems a lot of people look at it (including me while trying to answer this question), so I am sharing the answer I came up with. As a side note, it helps a lot to learn how to use the Python debugger (pdb) so you can inspect these objects on-the-fly.
It is possible to get the hyperlinks using PDFMiner. The complication is (like with so much about PDFs), there is really no relationship between the link annotations and the text of the link, except that they are both located at the same region of the page.
Here is the code I used to get links on a PDFPage
annotationList = []
if page.annots:
for annotation in page.annots.resolve():
annotationDict = annotation.resolve()
if str(annotationDict["Subtype"]) != "/Link":
# Skip over any annotations that are not links
continue
position = annotationDict["Rect"]
uriDict = annotationDict["A"].resolve()
# This has always been true so far.
assert str(uriDict["S"]) == "/URI"
# Some of my URI's have spaces.
uri = uriDict["URI"].replace(" ", "%20")
annotationList.append((position, uri))
Then I defined a function like:
def getOverlappingLink(annotationList, element):
for (x0, y0, x1, y1), url in annotationList:
if x0 > element.x1 or element.x0 > x1:
continue
if y0 > element.y1 or element.y0 > y1:
continue
return url
else:
return None
which I used to search the annotationList I previously found on the page to see if any hyperlink occupies the same region as a LTTextBoxHorizontal that I was inspecting on the page.
In my case, since PDFMiner was consolidating too much text together in the text box, I walked through the _objs attribute of each text box and looked though all of the LTTextLineHorizontal instances to see if they overlapped any of the annotation positions.
I think using PyPDF you could do that. If you want to extract the links from PDF. I am not sure where I got this from but it resides in my code as a part of something else. Hope this helps:
PDFFile = open('File Location','rb')
PDF = pyPdf.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'
for page in range(pages):
pageSliced = PDF.getPage(page)
pageObject = pageSliced.getObject()
if pageObject.has_key(key):
ann = pageObject[key]
for a in ann:
u = a.getObject()
if u[ank].has_key(uri):
print u[ank][uri]
This I hope should give the links in your PDF.
P.S: I haven't extensively tried this.
import pikepdf
pdf_file = pikepdf.Pdf.open("pdf.pdf")
urls = []
for page in pdf_file.pages:
for annots in page.get("/Annots"):
url=annots.get("/A").get("/URI")
if url is not None:
urls.append(url)
urls.append(" ; ")
print(urls)
You will get a semicolon separated list of links in the given PDF
The hyperlink will actually be an annotation, so you need to process the annotation rather than 'extract the text'. I suspect that you are going to need to use a library such as itextsharp, or MuPDF, or Ghostscript if you are really desperate (and comfortable programming in PostScript).
I'd have thought it relatvely easy to process the annotations looking for type LNK though.
Here's a version that creates a list of URLs in the simplest way I could find:
import PyPDF2
pdf = PyPDF2.PdfFileReader('filename.pdf')
urls = []
for page in range(pdf.numPages):
pdfPage = pdf.getPage(page)
try:
for item in (pdfPage['/Annots']):
urls.append(item['/A']['/URI'])
except KeyError:
pass

How do I know when I'm done crawling a domain?

I've written a function in Python that gets all the links on a page.
Then, I run that function for all of the links that first function returned.
My question is, if I were to keep on doing this using CNN as my starting point, how would I know when I had crawled all (or most) of CNN's webpages?
Here's the code for the crawler.
base_url = "http://www.cnn.com"
title = "cnn"
my_file = open(title+".txt","w")
def crawl(site):
seed_url = site
br = Browser()
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.open(seed_url)
link_bank = []
for link in br.links():
if link.url[0:4] == "http":
link_bank.append(link.url)
if link.url[0] == "/":
url = link.url
if url.find(".com") == -1:
if url.find(".org") == -1:
link_bank.append(base_url+link.url)
else:
link_bank.append(link.url)
else:
link_bank.append(link.url)
if link.url[0] == "#":
link_bank.append(base_url+link.url)
link_bank = list(set(link_bank))
for link in link_bank:
my_file.write(link+"\n")
return link_bank
my_file.close()
I did not specifically look into your code, but you should look up how to implement a breadth-first-search, and additionally store already visited URLs in a set. If you find a new URL in the currently visited page, append it to the list of URLs to visit, if it wasn't in the set already.
You might need to ignore the query string (everything after the question mark in a URL).
The first thing coming into my mind is to have a set of visited links. Each time you are requesting a link, add a link to a set. Before requesting a link, check if it is not in the set.
Another point is that you are actually reinventing the wheel here, Scrapy web-scraping framework has link extracting mechanism built-in - it's worth using.
Hope that helps.

Extracting parts of a webpage with python

So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the urls also look like this, meaning that there are multiple separated statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen
script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")
def get_legal(page):
# this is where Legal Authority: starts in the code
start_link = page.find('Legal Authority:')
start_legal = page.find('">', start_link+1)
end_link = page.find('<', start_legal+1)
legal = page[start_legal+2: end_link]
return legal
for line in input:
pg = urlopen(line).read()
statute = get_legal(pg)
output.write(get_legal(pg))
Giving me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
# this is where Legal Authority: starts in the code
end_link = ""
legal = ""
start_link = page.find('Legal Authority:')
while (end_link != '</a> '):
start_legal = page.find('">', start_link+1)
end_link = page.find('<', start_legal+1)
end2 = page.find('</a> ', end_link+1)
legal += page[start_legal+2: end_link]
if
break
return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using requests library to fetch page content here - this is just a recommended and very easy to use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup
# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)
# parse the html
html = BeautifulSoup(response.content)
# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})
def fetch_parent_tag(tags):
# fetch the parent <td> tag of the first <a> tag
# whose "previous sibling" is the <b>Legal Authority:</b> tag.
for tag in tags:
sibling = tag.findPreviousSibling()
if not sibling:
continue
if sibling.getText() == 'Legal Authority:':
return tag.findParent()
# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')
for tag in tags_you_want:
print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran a SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have done some million HTTP requests to a webservice (and they were ok) each fetching about 1k bytes. This would have taken long, and would have been quite inconvenient (requiring some error-handling, since some of these requests would always time out) (and non-atomic due to paging). I was mailed a DVD.
I'd imagine that the Office of Management and Budget could possibly be similar accomodating.

Categories

Resources