I'm a newbie to Python and still learning as I go ...
I have a web server which has a list of images to load onto a Device Under Test (DUT) ...
Requirement is:
If the image is already present on the server, proceed with loading the image onto the DUT.
If the image is not present on the server, then download the image and then upgrade the DUT.
I have written the following code, but I'm not quite happy with the way I have written it, because I have a feeling it could have been done better using some other method(s).
Please suggest the areas where I could have done better and the techniques to do so.
I appreciate your time in reading this and your valuable suggestions.
import urllib2

url = 'http://localhost/test'
filename = 'Image60.txt'  # image to verify

def Image_Upgrade():
    print 'proceeding with Image upgrade !!!'

def Image_Download():
    print 'Proceeding with Image Download !!!'

resp = urllib2.urlopen(url)
flag = False
list_of_files = []
for contents in resp.readlines():
    if 'Image' in contents:
        # The listing is HTML, so strip the tags to pick out only the image name
        c = contents.split('href=')[-1].split('>')[0].strip('"')
        if c != filename:
            list_of_files.append(c)
        else:
            Image_Upgrade()
            flag = True
if not flag:
    Image_Download()
Thanks,
Vijay Swaminathan
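One way the flow could be tightened: collect the file names first, then branch once on membership. A minimal sketch, keeping the Python 2 idioms of the question and assuming the server returns a plain HTML directory index:

import urllib2

url = 'http://localhost/test'
filename = 'Image60.txt'

def image_upgrade():
    print 'proceeding with Image upgrade !!!'

def image_download():
    print 'proceeding with Image Download !!!'

resp = urllib2.urlopen(url)
# pull the bare name out of each href="..." attribute in the listing
names = [line.split('href=')[-1].split('>')[0].strip('"')
         for line in resp if 'Image' in line]
if filename in names:
    image_upgrade()
else:
    image_download()

This drops the flag variable entirely: the membership test replaces the per-line comparison, which is usually easier to read and to extend.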
I am trying to access the following link through my script and download the chart that comes up.
I was implementing it using the accepted response here, but when I try to open the file I get the error: The file "test.png" could not be opened because it is empty.
Here is my code snippet:
import urllib

image_element = driver.find_element_by_id('chartImg')
src = image_element.get_attribute("src")
if src:
    urllib.urlretrieve(str(src), "test.png")
Next I tried to debug further and changed my code to:
if src:
    a, b = urllib.urlretrieve(str(src), "test.png")
    print a, b.items()
which gives me the following output:
test.png
[('date', 'Sat, 19 Nov 2016 01:19:20 GMT'), ('connection', 'Keep-Alive'), ('content-length', '0'), ('server', 'BigIP')]
Does anyone know why 'content-length' is '0'? I think this is the reason the downloaded file is empty.
I think the reason for this is that the image you are scraping does not contain an extension. If you run this code, for example:
src = "http://i.imgur.com/2C7Csq6.png"
urllib.urlretrieve(src, "test.png")
The PNG file works, and it is the exact same image. I've tried looking for ways to do this without having to upload to an image-sharing service where it would provide an extension, but haven't found anything. I've also tried adding .png to the original src string, but that didn't work either. My guess is this is a website-specific problem. Hopefully you can find a workaround; good luck!
I found a workaround... take a screenshot:
image_element = driver.find_element_by_id('chartImg')
src = image_element.get_attribute("src")
if src:
    driver.get(src)
    driver.save_screenshot('screen.png')
I don't know if there is a better way, but this does the job.
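For what it's worth, a likely reason the urlretrieve download came back empty is that urllib fetches the src without the browser's session cookies, so the server returns nothing. A hedged sketch that copies the Selenium cookies into a requests session before downloading (assuming requests is installed and the chart is session-protected):

import requests

src = image_element.get_attribute("src")
if src:
    # reuse the browser's cookies so the server sees the same session
    session = requests.Session()
    for cookie in driver.get_cookies():
        session.cookies.set(cookie['name'], cookie['value'])
    resp = session.get(src)
    resp.raise_for_status()
    with open('test.png', 'wb') as f:
        f.write(resp.content)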
My goal is to connect to the YouTube API and download the URLs of specific music producers. I found the following script at this link: https://www.youtube.com/watch?v=_M_wle0Iq9M. In the video the code works beautifully, but when I try it on Python 2.7 it gives me KeyError: 'items'.
I know KeyErrors can occur when there is an incorrect use of a dictionary or when a key doesn't exist.
I have tried going to the Google developers site for YouTube to make sure that 'items' exists, and it does.
I am also aware that using get() may be helpful for my problem, but I am not sure. Any suggestions for fixing my KeyError using the following code, or for how to improve my code to reach my main goal of downloading the URLs (I have a YouTube API key)?
Here is the code:
# these modules help with HTTP requests to YouTube
import urllib
import urllib2
import json

API_KEY = open("/Users/ereyes/Desktop/APIKey.rtf", "r")
API_KEY = API_KEY.read()

searchTerm = raw_input('Search for a video:')
searchTerm = urllib.quote_plus(searchTerm)

url = 'https://www.googleapis.com/youtube/v3/search?part=snippet&q=' + searchTerm + '&key=' + API_KEY

response = urllib.urlopen(url)
videos = json.load(response)

videoMetadata = []  # declaring our list

for video in videos['items']:  # cycle through the JSON response's items
    if video['id']['kind'] == 'youtube#video':  # make sure the item we are looking at is a video
        videoMetadata.append(video['snippet']['title'] +  # get the title and put it into the list
                             "\nhttp://youtube.com/watch?v=" + video['id']['videoId'])

videoMetadata.sort()  # sort our list alphabetically

print("\nSearch Results:\n")  # print out search results
for metadata in videoMetadata:
    print(metadata + "\n")

raw_input('Press Enter to Exit')
The problem is most likely a combination of using an RTF file instead of a plain text file for the API key, and confusion over whether to use urllib or urllib2, since you imported both.
Personally, I would recommend requests, but I think you need to read() the contents of the response to get a string:
response = urllib.urlopen(url).read()
You can check that by printing the response variable.
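Since requests came up: here is a minimal sketch of the same search with it, assuming the key has been moved into a plain text file (the file name here is hypothetical; the endpoint and parameters are the ones from the question):

import requests

with open("/Users/ereyes/Desktop/apikey.txt") as f:  # hypothetical plain-text key file
    api_key = f.read().strip()

resp = requests.get(
    "https://www.googleapis.com/youtube/v3/search",
    params={"part": "snippet", "q": "some producer", "key": api_key},
)
videos = resp.json()  # requests parses the JSON into a dict for you
for video in videos.get("items", []):  # .get() avoids the KeyError if the call failed
    if video["id"]["kind"] == "youtube#video":
        print(video["snippet"]["title"])
        print("http://youtube.com/watch?v=" + video["id"]["videoId"])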
Forgive me if I come straight out with it, but Python is driving me nuts over something that seemed quite simple.
In a nutshell
I'm writing an extension for a music video scraper which is responsible for getting the fanart backdrop.
Here is the URL:
github.com/MViDLibraryToolKit/.../APICaller
So I was able to call the Fanart.tv API and receive the right JSON response. My problem is that I can't manage to collect the URLs under the element "artistbackground".
I searched the internet and found a very similar post here at Stack Overflow, but unluckily it concerned Python 2, API v2, and a different category at fanart.tv, so I was not able to make use of it. Here it was.
Anyway, here is my poor try at collecting the URLs into a list:
# --------------------- response processing
# output for debugging
# print(fanartTVresp)
# http://webservice.fanart.tv/v3/music/albums/ba853904-ae25-4ebb-89d6-c44cfbd71bd2?api_key=fdadba00cfaaf3621eaa748669256a9e&client_key=dce01d75553d7e3fbc2ad742aaf5d371

# list to be filled
url_list = []
# load the web response
json_response = json.loads(fanartTVresp)
# loop through the artistbackground element
for artistbackground in json_response:
    url = urllib.parse.quote(['url'], ':/')
    if url:
        url_list.append(url)
print(url_list)
The libs I loaded:
import musicbrainzngs
import urllib
import json
import socket
from pprint import pprint
from urllib.parse import quote
The rest of the code you can find at my GitHub link. Please help me, it's driving me crazy ^^
Kind regards
P.S. Please excuse my English, I'm from Germany :)
I think I finally got it.
import os

# URL list for background images
url_list = []
# set only for debugging / the value comes from the PowerShell runtime later
location = os.path.abspath('C:/temp')
# decode the JSON response
json_response = json.loads(fanartTVresp.decode())
# "artistbackground" holds a list of dicts, one per image
bgitems = json_response["artistbackground"]
# iterate the items and collect each "url" value
for bgitem in bgitems:
    url_list.append(bgitem["url"])
print(url_list)
After getting some hours of sleep I realized that json.loads deserializes the response into regular Python objects. Correct me if I'm wrong.
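For the record, json.loads does exactly that: JSON objects and arrays come back as plain dicts and lists, so the "url" values are ordinary dict lookups. A tiny illustration with a made-up response body shaped like the "artistbackground" element above:

import json

# hypothetical body in the shape fanart.tv returns for artistbackground
body = '{"artistbackground": [{"id": "1", "url": "http://example.com/a.jpg"}]}'
data = json.loads(body)
print(type(data))                          # <class 'dict'>
print(data["artistbackground"][0]["url"])  # http://example.com/a.jpg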
Anyway, it finally works!
I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the pdf.
I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it does not extract the hyperlinks.
For example, I have text that says Check this link out, with a link attached to it. I am able to extract the words Check this link out, but what I really need is the hyperlink itself, not the words.
How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.
I have looked at itextsharp, but haven't used it. I'm running on Ubuntu, and would appreciate any help.
Slightly modified version of Ashwin's answer:
import PyPDF2

PDFFile = open("file.pdf", 'rb')
PDF = PyPDF2.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):
    print("Current Page: {}".format(page))
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()
    if key in pageObject.keys():
        ann = pageObject[key]
        for a in ann:
            u = a.getObject()
            if uri in u[ank].keys():
                print(u[ank][uri])
This is an old question, but it seems a lot of people look at it (including me while trying to answer this question), so I am sharing the answer I came up with. As a side note, it helps a lot to learn how to use the Python debugger (pdb) so you can inspect these objects on-the-fly.
It is possible to get the hyperlinks using PDFMiner. The complication is (like with so much about PDFs), there is really no relationship between the link annotations and the text of the link, except that they are both located at the same region of the page.
Here is the code I used to get links on a PDFPage:
annotationList = []
if page.annots:
    for annotation in page.annots.resolve():
        annotationDict = annotation.resolve()
        if str(annotationDict["Subtype"]) != "/Link":
            # Skip over any annotations that are not links
            continue
        position = annotationDict["Rect"]
        uriDict = annotationDict["A"].resolve()
        # This has always been true so far.
        assert str(uriDict["S"]) == "/URI"
        # Some of my URIs have spaces.
        uri = uriDict["URI"].replace(" ", "%20")
        annotationList.append((position, uri))
Then I defined a function like:
def getOverlappingLink(annotationList, element):
    for (x0, y0, x1, y1), url in annotationList:
        if x0 > element.x1 or element.x0 > x1:
            continue
        if y0 > element.y1 or element.y0 > y1:
            continue
        return url
    else:
        return None
which I used to search the annotationList I previously found on the page, to see if any hyperlink occupies the same region as the LTTextBoxHorizontal I was inspecting on the page.
In my case, since PDFMiner was consolidating too much text together in the text box, I walked through the _objs attribute of each text box and looked through all of the LTTextLineHorizontal instances to see if they overlapped any of the annotation positions; the sketch below shows that walk.
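A sketch of that walk, hedged: annotationList and getOverlappingLink are the ones defined above, layout is the page layout you get from a PDFPageAggregator, and _objs is a PDFMiner internal that may change between versions:

from pdfminer.layout import LTTextBoxHorizontal, LTTextLineHorizontal

for element in layout:  # layout comes from PDFPageAggregator.get_result()
    if not isinstance(element, LTTextBoxHorizontal):
        continue
    for line in element._objs:  # walk the box's internal text lines
        if isinstance(line, LTTextLineHorizontal):
            url = getOverlappingLink(annotationList, line)
            if url:
                print(line.get_text().strip(), "->", url)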
I think you could do that using PyPDF, if you want to extract the links from the PDF. I am not sure where I got this from, but it resides in my code as part of something else. Hope this helps:
import pyPdf

PDFFile = open('File Location', 'rb')
PDF = pyPdf.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()
    if pageObject.has_key(key):
        ann = pageObject[key]
        for a in ann:
            u = a.getObject()
            if u[ank].has_key(uri):
                print u[ank][uri]
This, I hope, should give you the links in your PDF.
P.S.: I haven't tested this extensively.
import pikepdf

pdf_file = pikepdf.Pdf.open("pdf.pdf")
urls = []
for page in pdf_file.pages:
    # default to an empty list so pages without annotations are skipped
    for annot in page.get("/Annots", []):
        url = annot.get("/A").get("/URI")
        if url is not None:
            urls.append(url)
            urls.append(" ; ")
print(urls)
You will get a semicolon-separated list of the links in the given PDF.
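If one string is what you actually want, collecting plain URLs and joining once reads a bit cleaner than appending separator entries; a minor variant of the loop above:

urls = []
for page in pdf_file.pages:
    for annot in page.get("/Annots", []):
        target = annot.get("/A")
        if target is not None and target.get("/URI") is not None:
            urls.append(str(target.get("/URI")))
print(" ; ".join(urls))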
The hyperlink will actually be an annotation, so you need to process the annotation rather than 'extract the text'. I suspect that you are going to need to use a library such as itextsharp, or MuPDF, or Ghostscript if you are really desperate (and comfortable programming in PostScript).
I'd have thought it relatively easy to process the annotations looking for subtype /Link, though.
Here's a version that creates a list of URLs in the simplest way I could find:
import PyPDF2

pdf = PyPDF2.PdfFileReader('filename.pdf')
urls = []
for page in range(pdf.numPages):
    pdfPage = pdf.getPage(page)
    try:
        for item in pdfPage['/Annots']:
            urls.append(item['/A']['/URI'])
    except KeyError:
        pass
Can anyone help me with "extracting" stuff from a site using Python? Here is the info:
I have a folder named with a set of numbers (they are the ID of an item) and I have to use that ID to open a page and then "scrape" info from the page into my notepad... It's like this: http://www.somesite.com/pic.mhtml?id=[ID]... I need to extract the picture link (the picture link always has ID.jpg at the end) and write it to the text file, then rename that txt file with the name of the picture... The picture name is always in the title tags... Thanks in advance...
What you need is a data scraper - http://www.crummy.com/software/BeautifulSoup/ will help you pull data off of websites. You can then load that data into a variable, write it to a file, or do anything you normally do with data.
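A minimal sketch with BeautifulSoup, hedged: the URL pattern and the claim that the picture name sits in the page's title tag come from the question, and the item ID here is hypothetical:

import urllib2
from bs4 import BeautifulSoup

item_id = '12345'  # hypothetical ID taken from the folder name
html = urllib2.urlopen('http://www.somesite.com/pic.mhtml?id=%s' % item_id).read()
soup = BeautifulSoup(html, 'html.parser')

# per the question, the picture name is always in the title tags
picture_name = soup.title.string.strip()
with open('links.txt', 'a') as f:
    f.write(picture_name + '\n')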
You could try parsing the HTML source for images.
Try something like this:
import re
import urllib

class Parser(object):
    # matches url="..." or src="..." attributes that point at image files
    __rx = r'(url|src)="(http://www\.page\.com/path/\?ID=\d*)\.(jpeg|jpg|gif|png)"'

    def __crawl(self, url):
        images = []
        code = urllib.urlopen(url).read()
        for line in code.split('\n'):
            imagesearch = re.search(self.__rx, line)
            if imagesearch:
                image = '%s.%s' % (imagesearch.group(2), imagesearch.group(3))
                images.append(image)
        return images
It's untested; you may want to check the regex.