Can anyone help me with extracting data from a site using Python? Here is the situation:
I have a folder whose name is a set of numbers (the ID of an item), and I have to use that ID to open a page and then scrape info from the page into a text file. The URL looks like this: http://www.somesite.com/pic.mhtml?id=[ID]. I need to extract the picture link (the picture link always ends in ID.jpg), write it to the text file, and then rename that text file to the name of the picture. The picture is always in the title tags. Thanks in advance.
What you need is a data scraper - http://www.crummy.com/software/BeautifulSoup/ will help you pull data off of websites. You can then load that data into a variable, write it to a file, or do anything you normally do with data.
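For example, a minimal sketch of that approach with requests and BeautifulSoup (the somesite.com URL, the item ID, and the title-tag assumption come straight from the question; nothing here is a tested endpoint):

import requests
from bs4 import BeautifulSoup

item_id = "12345"  # placeholder: the item ID taken from the folder name
url = "http://www.somesite.com/pic.mhtml?id=%s" % item_id

soup = BeautifulSoup(requests.get(url).text, "html.parser")

# The question says the picture link always ends in <ID>.jpg and sits in the title,
# so look for a matching word in the <title> text.
title = soup.title.get_text() if soup.title else ""
picture_link = next((word for word in title.split()
                     if word.endswith(item_id + ".jpg")), None)

if picture_link:
    # name the text file after the picture, as the question describes
    out_name = picture_link.split("/")[-1] + ".txt"
    with open(out_name, "w") as f:
        f.write(picture_link)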
You could try parsing the HTML source for images. Try something like this:
import re
import urllib

class Parser(object):
    # group 2 is the image URL up to the ID, group 3 is the file extension
    __rx = r'(url|src)="(http://www\.page\.com/path/\?ID=\d*)\.(jpeg|jpg|gif|png)"'

    def __crawl(self, url):
        images = []
        code = urllib.urlopen(url).read()
        for line in code.split('\n'):
            imagesearch = re.search(self.__rx, line)
            if imagesearch:
                image = '%s.%s' % (imagesearch.group(2), imagesearch.group(3))
                images.append(image)
        return images
It's untested; you may want to check the regex.
I'm creating a small Amazon Alexa skill called JokePro, and I have made a website that I can upload jokes to directly. The jokes go into a txt file in the database and are then loaded onto the crude page from there.
I am looking to randomly select lines from the joke file, which is displayed on the page with an object tag.
How would I go about scraping the text given by the object tag?
http://jokepro.dx.am
import bs4
import requests

source = requests.get("http://jokepro.dx.am/")
bs4call = bs4.BeautifulSoup(source.text, "html.parser")
parsed = bs4call.find('pre')  # I've replaced 'pre' with 'object' as well
Any help would be appreciated.
If I understand you correctly, you want to load the text file referenced by the <object> tag and then select a random line from it:
import bs4
import requests
import random
url = "http://jokepro.dx.am/"
source = requests.get(url)
bs4call = bs4.BeautifulSoup(source.text, "html.parser")
obj = bs4call.find('object')
text = requests.get(url + obj['data']).text
# print(text) # <-- to print the textfile
print( random.choice(text.splitlines()) )
This prints (for example):
want to know a REALLY good joke? A high school student making this application in a week!
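One small caveat: the concatenation url + obj['data'] assumes the data attribute is a relative path. If it might also be an absolute URL, urljoin (reusing url and obj from the snippet above) is a safer way to build the request URL:

from urllib.parse import urljoin

text = requests.get(urljoin(url, obj['data'])).text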
I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the pdf.
I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it does not extract the hyperlinks.
For example, I have text that says Check this link out, with a link attached to it. I am able to extract the words Check this link out, but what I really need is the hyperlink itself, not the words.
How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.
I have looked at itextsharp, but haven't used it. I'm running on Ubuntu, and would appreciate any help.
A slightly modified version of Ashwin's answer:
import PyPDF2

PDFFile = open("file.pdf", 'rb')
PDF = PyPDF2.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):
    print("Current Page: {}".format(page))
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()
    if key in pageObject.keys():
        ann = pageObject[key]
        for a in ann:
            u = a.getObject()
            if uri in u[ank].keys():
                print(u[ank][uri])
This is an old question, but it seems a lot of people look at it (including me while trying to answer this question), so I am sharing the answer I came up with. As a side note, it helps a lot to learn how to use the Python debugger (pdb) so you can inspect these objects on-the-fly.
It is possible to get the hyperlinks using PDFMiner. The complication is (like with so much about PDFs), there is really no relationship between the link annotations and the text of the link, except that they are both located at the same region of the page.
Here is the code I used to get the links on a PDFPage:
annotationList = []
if page.annots:
    for annotation in page.annots.resolve():
        annotationDict = annotation.resolve()
        if str(annotationDict["Subtype"]) != "/Link":
            # Skip over any annotations that are not links
            continue
        position = annotationDict["Rect"]
        uriDict = annotationDict["A"].resolve()
        # This has always been true so far.
        assert str(uriDict["S"]) == "/URI"
        # Some of my URI's have spaces.
        uri = uriDict["URI"].replace(" ", "%20")
        annotationList.append((position, uri))
Then I defined a function like:
def getOverlappingLink(annotationList, element):
    for (x0, y0, x1, y1), url in annotationList:
        if x0 > element.x1 or element.x0 > x1:
            continue
        if y0 > element.y1 or element.y0 > y1:
            continue
        return url
    else:
        return None
which I used to search the annotationList I previously found on the page to see if any hyperlink occupies the same region as a LTTextBoxHorizontal that I was inspecting on the page.
In my case, since PDFMiner was consolidating too much text together in the text box, I walked through the _objs attribute of each text box and looked though all of the LTTextLineHorizontal instances to see if they overlapped any of the annotation positions.
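To make that last step concrete, a rough sketch of the pairing loop might look like this (assuming layout is the LTPage returned by PDFMiner's layout analysis for the same page whose annotations were collected into annotationList above):

from pdfminer.layout import LTTextBoxHorizontal, LTTextLineHorizontal

# Walk each horizontal text box, then each text line inside it, and pair the
# line with any link annotation whose rectangle overlaps its bounding box.
for element in layout:
    if not isinstance(element, LTTextBoxHorizontal):
        continue
    for line in element._objs:
        if not isinstance(line, LTTextLineHorizontal):
            continue
        url = getOverlappingLink(annotationList, line)
        if url is not None:
            print(line.get_text().strip(), '->', url)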
I think you could do that with pyPdf if you want to extract the links from a PDF. I am not sure where I got this from, but it resides in my code as part of something else. Hope this helps:
import pyPdf

PDFFile = open('File Location', 'rb')
PDF = pyPdf.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()
    if pageObject.has_key(key):
        ann = pageObject[key]
        for a in ann:
            u = a.getObject()
            if u[ank].has_key(uri):
                print u[ank][uri]
This should, I hope, give you the links in your PDF.
P.S.: I haven't tried this extensively.
import pikepdf

pdf_file = pikepdf.Pdf.open("pdf.pdf")
urls = []
for page in pdf_file.pages:
    annots = page.get("/Annots")
    if annots is None:  # skip pages without annotations
        continue
    for annot in annots:
        url = annot.get("/A").get("/URI")
        if url is not None:
            urls.append(url)
            urls.append(" ; ")
print(urls)
You will get a semicolon-separated list of links in the given PDF.
The hyperlink will actually be an annotation, so you need to process the annotation rather than 'extract the text'. I suspect that you are going to need to use a library such as itextsharp, or MuPDF, or Ghostscript if you are really desperate (and comfortable programming in PostScript).
I'd have thought it relatively easy to process the annotations looking for type LNK, though.
Here's a version that creates a list of URLs in the simplest way I could find:
import PyPDF2

pdf = PyPDF2.PdfFileReader('filename.pdf')
urls = []
for page in range(pdf.numPages):
    pdfPage = pdf.getPage(page)
    try:
        for item in (pdfPage['/Annots']):
            urls.append(item['/A']['/URI'])
    except KeyError:
        pass
I am working with the Python web.py framework, and I have an anchor tag with HTML code like the following:
<p><a href = "/edit.py?tr=%d"%1>Edit</a></p>
So when I click this link it goes to the edit.py file in my project directory, but as you can see I am passing some values after edit.py, like /edit.py?tr=%d"%1. I will actually pass these values dynamically later on.
After redirecting to edit.py, how do I access the values after the ? from the py file?
My intention is to edit the record after it has been saved to the database.
You can get them using web.input, e.g.
def GET(self):
    data = web.input()
    tr = data.tr
Documentation is available here: http://webpy.org/cookbook/input
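For context, a minimal web.py sketch showing where that GET handler lives (the /edit.py route and the tr parameter mirror the question; the actual record-editing logic is left out):

import web

urls = ('/edit.py', 'Edit')

class Edit:
    def GET(self):
        data = web.input(tr=None)   # query-string parameters end up here
        tr = data.tr                # e.g. '1' for a request to /edit.py?tr=1
        return "editing record %s" % tr

if __name__ == "__main__":
    web.application(urls, globals()).run()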
So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the urls also look like this, meaning that there are multiple separated statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen

script, urlfile, legalfile = argv

input = open(urlfile, "r")
output = open(legalfile, "w")

def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('">', start_link+1)
    end_link = page.find('<', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

for line in input:
    pg = urlopen(line).read()
    statute = get_legal(pg)
    output.write(get_legal(pg))
This gives me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
    # this is where Legal Authority: starts in the code
    end_link = ""
    legal = ""
    start_link = page.find('Legal Authority:')
    while (end_link != '</a> '):
        start_legal = page.find('">', start_link+1)
        end_link = page.find('<', start_legal+1)
        end2 = page.find('</a> ', end_link+1)
        legal += page[start_legal+2: end_link]
        if
            break
    return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links), I thought I could use that fact as the end marker to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using the requests library to fetch page content here; it's just a recommended and very easy-to-use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup

# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)

# parse the html
html = BeautifulSoup(response.content)

# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})

def fetch_parent_tag(tags):
    # fetch the parent <td> tag of the first <a> tag
    # whose "previous sibling" is the <b>Legal Authority:</b> tag.
    for tag in tags:
        sibling = tag.findPreviousSibling()
        if not sibling:
            continue
        if sibling.getText() == 'Legal Authority:':
            return tag.findParent()

# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')

for tag in tags_you_want:
    print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran a SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have done some million HTTP requests to a webservice (and they were ok) each fetching about 1k bytes. This would have taken long, and would have been quite inconvenient (requiring some error-handling, since some of these requests would always time out) (and non-atomic due to paging). I was mailed a DVD.
I'd imagine that the Office of Management and Budget could be similarly accommodating.
I'm a newbie to Python and still in the process of learning as I go ...
I have a webserver which has a list of images to load onto a Device Under Test (DUT) ...
The requirement is:
if the image is already present on the server, proceed with loading the image onto the DUT.
if the image is not present on the server, then download the image and then upgrade the DUT.
I have written the following code, but I'm not quite happy with the way I have written it, because I have a feeling it could have been done better using some other method(s).
Please suggest the areas where I could have done better and the techniques to do so.
I appreciate your time in reading this and your valuable suggestions.
import urllib2

url = 'http://localhost/test'
filename = 'Image60.txt'  # image to Verify

def Image_Upgrade():
    print 'proceeding with Image upgrade !!!'

def Image_Download():
    print 'Proceeding with Image Download !!!'

resp = urllib2.urlopen(url)
flag = False
list_of_files = []
for contents in resp.readlines():
    if 'Image' in contents:
        # The content output would have html tags, so removing the tags to pick only image name
        c = (((contents.split('href='))[-1]).split('>')[0]).strip('"')
        if c != filename:
            list_of_files.append(c)
        else:
            Image_Upgrade()
            flag = True
if flag == False:
    Image_Download()
Thanks,
Vijay Swaminathan