I was doing some API stuff the other day and ran across this (maybe) weird behavior in Python. Basically, I have a script in a secret folder, ./secret_stuff/token_request.py, that constructs and returns a new API access token. It looks something like this:
token_request.py
import json
import requests

def generate_token():
    url = json.load(open('./secrets.json'))['endpoints']['token'] # reads and stores relevant endpoint for token request from supporting file
    data = json.load(open('./secrets.json'))['token_request_data'] # loads token request data from supporting file
    r = requests.post(url=url, params=data).json()
    return r['access_token']

if __name__ == '__main__':
    generate_token()
which runs fine when I execute it using CLI...
I import the external script as a module, then call it at the top of my actual program, which works with the API, data, etc.
api_ingestor.py
import json
import re
import requests
import secret_stuff.token_request as tr
token = tr.generate_token()
endpoint = json.load(open('./secret_stuff/secrets.json'))['endpoints']['KeyMessageData']
and get a FileNotFound error thrown from the tr.generate_token() call, on the line where it assigns url from the .json file in its own directory, which is a subdirectory of the project.
The directory structure looks like this:
.
|
+-- secret_stuff/
|   |
|   +-- secrets.json
|   |
|   +-- token_request.py
|
+-- api_ingestor.py
And if I make this change to token_request.py in the line where the url and data are assigned:
def generate_token():
    url = json.load(open('./secret_stuff/secrets.json'))['endpoints']['token'] # reads and stores relevant endpoint for token request from supporting file
    data = json.load(open('./secret_stuff/secrets.json'))['token_request_data'] # loads token request data from supporting file
then it works in api_ingestor.py! No FileNotFound error!
But now it doesn't work by itself when run through the CLI, because there is no secret_stuff subdirectory inside its own directory.
Can anyone explain this behavior to me?
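A relative path like './secrets.json' is resolved against the current working directory of the process, not against the location of the script, so the file that gets opened depends on where you run the program from. Use something like the following to make it always use the directory where the script is located: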
import os
secret_file = os.path.join(os.path.dirname(__file__), "secrets.json")
url = json.load(open(secret_file))['endpoints']['token'] # reads and stores relevant endpoint for token request from supporting file
data = json.load(open(secret_file))['token_request_data'] # loads token request data from supporting file
This will use the directory where the script is located instead of the current working directory.
It would be much nicer to read the JSON into a structure once and pull the elements from it separately, rather than have json.load() read the whole file twice.
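As a minimal sketch of both suggestions combined (resolve the path from the module's own location, and parse the file only once): the helper names load_token_config and default_secrets_path are made up for illustration, and the key names are the ones used in the question.

```python
import json
from pathlib import Path


def default_secrets_path(module_file):
    """Path to secrets.json in the same directory as the given module file."""
    return Path(module_file).resolve().parent / "secrets.json"


def load_token_config(secrets_path):
    """Read the secrets file once and return (token endpoint URL, request data)."""
    with open(secrets_path, encoding="utf-8") as fh:
        secrets = json.load(fh)  # parse the file a single time
    return secrets["endpoints"]["token"], secrets["token_request_data"]
```

Inside token_request.py this would be used as url, data = load_token_config(default_secrets_path(__file__)), which should behave the same whether the script is run directly or imported from api_ingestor.py.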
I'm trying to do an NLP task. For that purpose I need a considerable amount of Readme.md files from GitHub. This is what I am trying to do:
For a given number n, I want to list the first n GitHub repositories (And Their URLs) based on the number of their stars.
I want to download the Readme.md file from those URLs.
I want to save the Readme.md Files on my hard drive, each in a separate folder. The folder name should be the name of the repository.
I'm not acquainted with crawling and web scraping, but I am relatively good with python. I'll be thankful if you can give me some help on how to accomplish this steps. Any help would be appreciated.
My effort: I've searched a little, and I found a website (gitstar-ranking.com) that ranks GitHub repos based on their stars. But that does not solve my problem because it is again a scraping task to get the name or the URL of those repos from this website.
Here's my attempt using the suggestion from @Luke. I changed the minimum stars to 500 since we don't need 5 million results (>500 still yields 66513 results).
You might not need the ssl workaround (the ssl._create_unverified_context() call), but since I'm behind a proxy, it's a pain to do it properly.
The script finds files called readme.md in any combination of lower- and uppercase but nothing else. It saves the file as README.md (uppercase) but this can be adjusted by using the actual filename.
import urllib.request
import urllib.error
import json
import ssl
import os
import time

n = 5  # number of fetched READMEs
url = 'https://api.github.com/search/repositories?q=stars:%3E500&sort=stars'
request = urllib.request.urlopen(url)
page = request.read().decode()
api_json = json.loads(page)
repos = api_json['items'][:n]

for repo in repos:
    full_name = repo['full_name']
    print('fetching readme from', full_name)

    # find readme url (the file name is matched case-insensitively)
    contents_url = repo['url'] + '/contents'
    request = urllib.request.urlopen(contents_url)
    page = request.read().decode()
    contents_json = json.loads(page)
    readme_url = [file['download_url'] for file in contents_json if file['name'].lower() == 'readme.md'][0]

    # download readme contents
    try:
        context = ssl._create_unverified_context()  # prevent ssl problems
        request = urllib.request.urlopen(readme_url, context=context)
    except urllib.error.HTTPError as error:
        print(error)
        continue  # if the url can't be opened, there's no use to try to download anything
    readme = request.read().decode()

    # create folder named after repo's name and save readme.md there
    try:
        os.mkdir(repo['name'])
    except OSError as error:
        print(error)
    with open(repo['name'] + '/README.md', 'w', encoding="utf-8") as f:
        f.write(readme)
    print('ok')

    # only 10 requests per min for unauthenticated requests
    if n >= 9:  # n + 1 initial request
        time.sleep(6)
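The script above only matches files named readme.md and raises an IndexError when a repository has none. A minimal sketch of a more forgiving lookup follows; pick_readme_url is a hypothetical helper, and it assumes each entry of the contents listing carries the name, type and download_url fields.

```python
def pick_readme_url(contents):
    """Return the download URL of the first top-level README file, or None."""
    for entry in contents:
        name = entry.get("name", "")
        # drop one extension so README, README.md and readme.rst all match
        stem = name.rsplit(".", 1)[0] if "." in name else name
        if entry.get("type") == "file" and stem.lower() == "readme":
            return entry.get("download_url")
    return None
```

In the loop, readme_url = pick_readme_url(contents_json) followed by a check for None would let the script skip repositories without a README instead of crashing.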
I have a table in SharePoint that I want to convert into a Pandas DataFrame. I've largely used the question "Get SharePoint List with Python" to try and frame a solution. I'm having issues, however.
Here is what I have so far...
import pandas as pd
from shareplum import Site
from requests_ntlm import HttpNtlmAuth
url = 'https://share.corporation.com/sites/group/subgroup/'
username = 'username'
password = 'password'
cred = HttpNtlmAuth(username, password)
site = Site(url, auth=cred, verify_ssl=False)
Up to this point, I can run the code without an error being thrown. However, when I run this bit:
sp_list = site.List('Q22020') # this creates SharePlum object
ShareplumRequestError: Shareplum HTTP Post Failed : 500 Server Error: Internal Server Error for url: https://share.corporation.com/sites/group/subgroup/_vti_bin/lists.asmx
I'm actually not entirely sure that my site.List('Q22020') is even correct.
However, following the instructions from this video: https://www.youtube.com/watch?v=dvFbVPDQYyk
When I manually enter the following url into my browser, it does generate an xml file, so I believe it's correct: https://share.corporation.com/sites/group/subgroup/_vti_bin/ListData.svc/Q22020
A friend passed me this code earlier. It returns a DataFrame with your SharePoint list contents.
import pandas as pd
from office365.runtime.auth.client_credential import ClientCredential
from office365.sharepoint.client_context import ClientContext

def dataframeSP(lista):
    # lista is the title of the SharePoint list to read
    sp_list = lista
    sp_lists = ctx.web.lists
    s_list = sp_lists.get_by_title(sp_list)
    l_items = s_list.get_items()
    ctx.load(l_items)
    ctx.execute_query()
    # column names come from the property keys of the first item
    columnas = list(pd.DataFrame.from_dict(l_items[0].properties.items()).iloc[:, 0])
    valores = list()
    for item in l_items:
        # one row of property values per list item
        data = list(pd.DataFrame.from_dict(item.properties.items()).iloc[:, 1])
        valores.append(data)
    resultado = pd.DataFrame(valores, columns=columnas)
    return resultado

client_id = "########"
client_secret = "##############"
site_url = "https://YOURSHAREPOINT.sharepoint.com/YOURLIST"
ctx = ClientContext(site_url).with_credentials(ClientCredential(client_id, client_secret))
# call the function with the list title to get the DataFrame
listaSP = dataframeSP("THE NAME OF YOUR SHAREPOINT LIST")
Try:
https://share.corporation.com/sites/group/subgroup/Lists/Q22020/_vti_bin/lists.asmx
If not, go to the list on the web and have a look at the URL once you are looking at a view of the 'Q22020' list. Your "url" parameter may be incorrect.
I had the same problem and followed the same logic of getting the list name from URL. However, I found that the list name actually had a space in it, despite the URL not showing it. Adding the space solved the issue.
Using your example, if the URL is https://share.corporation.com/sites/group/subgroup/_vti_bin/ListData.svc/Q22020
but the list is actually 'Q2 2020', then you would change your code to:
sp_list = site.List('Q2 2020')
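If you are not sure where the missing space goes, one option is to compare the name from the URL against the real list titles with whitespace ignored. This is only a sketch: match_list_title is a hypothetical helper, and it assumes you already have the titles from somewhere (for example from SharePlum's GetListCollection(), if your version provides it, or from the site contents page).

```python
def match_list_title(url_name, titles):
    """Return the title that equals url_name once whitespace is removed, or None."""
    def squash(text):
        # remove all whitespace and ignore case
        return "".join(text.split()).lower()

    wanted = squash(url_name)
    for title in titles:
        if squash(title) == wanted:
            return title
    return None
```

The returned title is what would be passed to site.List().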
Python - 2.7.5
Google Chrome
First off I am self taught coder and will accept any critique and/or suggestions to any of my posted codes below. This issue has been a joy to work through because I love challenging myself but I am afraid I have hit a brick wall and need some guidance. I will be as detailed as possible below to fully explain the overall picture of my script and then show where I am at with the actual issue that is explained in the title.
I am putting together a script that will go out and download data automatically, unzip it, and export it to a GDB. We serve a wide region of users and have a very large enterprise SDE setup containing a large amount of public data that we have to search for and update for our end users. Most of our data is updated monthly by local government entities, and we have to search for the data manually, download, unzip, QAQC, etc. I want to put together a script that will automate the first part of this process by downloading all my data for me and exporting it to a local GDB; from there I can QAQC everything and upload it to our SDE for our users to access.
The process has been pretty straightforward until I got to the issue I have before me. My script searches a webpage for specific keywords, finds the relevant link, and begins the download. For this post I will use two examples, one that works and one that is currently giving me issues. What works is my function for searching and downloading the Metro GIS dataset, and my current process for this is shown below. So far, all the http websites I have included use the function posted below. As shown for Metro, I plan on having a defined function for each group of data.
import requests, zipfile, StringIO, time, arcpy, urllib2, urlparse
from BeautifulSoup import BeautifulSoup

arcpy.env.overwriteOutput = True
workPath = -- #The output GDB
timestr = time.strftime("%Y%m%d")
gdbName = "GlobalSDEUpdate_" + timestr
gdbPath = workPath + "\\" + gdbName + ".gdb"

class global_DataFinder(object):
    def __init__(self):
        object.__init__(self)
        self.gdbSetup()
        self.metro()

    def gdbSetup(self):
        arcpy.CreateFileGDB_management(workPath, gdbName)

    def fileDownload(self, key, url, dlPath, dsName):
        page = urllib2.urlopen(url).read()
        urlList = []
        soup = BeautifulSoup(page)
        soup.prettify()
        for link in soup.findAll('a', href = True):
            if not 'http://' in link['href']:
                if urlparse.urljoin(url, link['href']) not in urlList:
                    zipDL = urlparse.urljoin(url, link['href'])
                    if zipDL.endswith(".zip"):
                        if key in zipDL:
                            urlList.append(zipDL)
        for x in urlList:
            print x
            r = requests.get(x, stream=True)
            z = zipfile.ZipFile(StringIO.StringIO(r.content))
            z.extractall(dlPath)
        arcpy.CreateFeatureDataset_management(gdbPath, dsName)
        arcpy.env.workspace = dlPath
        shpList = []
        for shp in arcpy.ListFeatureClasses():
            shpList.append(shp)
        arcpy.FeatureClassToGeodatabase_conversion(shpList, (gdbPath + "\\" + dsName))
        del shpList[:]

    def metro(self):
        key = "METRO_GIS_Data_Layers"
        url = "http://www.ridemetro.org/Pages/NewsDownloads.aspx"
        dlPath = -- #Where my zipfiles output to
        dsName = "Metro"
        self.fileDownload(key, url, dlPath, dsName)

global_DataFinder()
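The link-filtering step of fileDownload can be isolated into a small function that is easy to test without arcpy or a network connection. The following is a sketch in Python 3 using only the standard library rather than BeautifulSoup; find_zip_links and the collector class are hypothetical names, and unlike the original it does not skip hrefs that already contain 'http://'.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_zip_links(html, base_url, key):
    """Return unique absolute .zip URLs on the page whose URL contains key."""
    collector = _HrefCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links against the page URL
        if absolute.endswith(".zip") and key in absolute and absolute not in links:
            links.append(absolute)
    return links
```

Keeping this step separate from the download and geodatabase steps means each data source can reuse it with its own key and URL.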
As you can see above, this is the method I started with, using Metro as my first testing point, and it is currently working great. I was hoping all my sites going forward would work like this, but when I got to FEMA I ran into an issue.
The website National Flood Hazard Layer (NFHL) Status hosts floodplain data for many counties across the country, available for free to anyone who wishes to use it. When you arrive at the website you can search for the county you want, the table then shows the search results, and you can simply click to download the county you desire. When checking the page source I noticed that the table is in an iframe.
When accessing the iframe source link through Chrome and checking the png source url this is what you get - https://hazards.fema.gov/femaportal/NFHL/searchResult
Now here is where my problem lies. Unlike with http sites, I have quickly learned that accessing a secured https site and scraping the page is different, especially when it's using JavaScript to show the table. I have spent hours searching through forums and tried different Python packages like selenium, mechanize, requests, urllib, and urllib2, and I always seem to hit a dead end before I can securely establish a connection, parse the webpage, and search for my county's zipfile. The code below shows the closest I have gotten, followed by the error I am getting.
(I always test in a separate script and bring the code over to my main script once it works, which is why the snippet below is separate from my original.)
import urllib2, httplib, socket, ssl
from BeautifulSoup import BeautifulSoup

url = "http://www.floodmaps.fema.gov/NFHL/status.shtml"

def test():
    page = urllib2.urlopen(url).read()
    urlList = []
    soup = BeautifulSoup(page)
    soup.prettify()
    for link in soup.findAll("iframe", src=True):
        r = urllib2.urlopen(link['src'])
        iFrame = link['src']
        print iFrame

def connect_patched(self):
    "Connect to a host on a given (SSL) port."
    sock = socket.create_connection((self.host, self.port),
                                    self.timeout, self.source_address)
    if self._tunnel_host:
        self.sock = sock
        self._tunnel()
    self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file,
                                ssl_version=ssl.PROTOCOL_SSLv2)

httplib.HTTPSConnection.connect = connect_patched
test()
Error I get when running this test
urllib2.URLError: urlopen error [Errno 6] _ssl.c:504: TLS/SSL connection has been closed
I am hoping a more experienced coder can see what I have done and tell me if my current methods are the way to go and if so how to get past this final error and parse the datatable properly.
Working Edits with @crmackey
import requests
import os
import zipfile
from pyquery import PyQuery
from requests.packages.urllib3.exceptions import InsecureRequestWarning, InsecurePlatformWarning, SNIMissingWarning
import httplib

# force httplib to use HTTP/1.0
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

# disable ssl warnings (we are not verifying SSL certificates at this time...future enhancement?)
for warning in [SNIMissingWarning, InsecurePlatformWarning, InsecureRequestWarning]:
    requests.packages.urllib3.disable_warnings(warning)

def download_zips(out_path):
    url = 'http://www.floodmaps.fema.gov/NFHL/status.shtml'
    download_prefix = 'https://hazards.fema.gov/femaportal/NFHL'
    pq = PyQuery(requests.get(url, verify=False).content) #verify param important for SSL
    src = pq.find('iframe').attr('src')
    pq = PyQuery(requests.get(src, verify=False).content)
    table = pq.find('table')
    for a in table.find('a'):
        href = a.attrib.get('href')
        print href
        url = '/'.join([download_prefix, href])
        print url
        r = requests.get(url, stream=True, verify=False)
        out_zip = os.path.join(out_path, href.split('=')[-1])
        with open(out_zip, 'wb') as f:
            for chunk in r.iter_content(1024 * 16): #grab 16 KB at a time
                if chunk:
                    f.write(chunk)
        print 'downloaded zip: "{}"'.format(href.split('=')[-1])

out_path = r"C:\Users\barr\Desktop\Test"
download_zips(out_path)
All I added was the httplib import and the HTTPConnection changes at the top. That allowed me to connect to the site using your script. Now here is the current problem: I am only getting one zip file in my out_path, and that zip file is empty. I checked the printed output in the debug window, and it shows that it's trying to download the TERRITORY OF THE VIRGIN ISLAND zip file from the table, so it looks like it's trying, but it's not downloading anything. After it outputs that one empty zip file, the script finishes without any further error messages. I temporarily removed your lines that unzip the file because they were returning an error, since the zip file was empty.
I was able to get the zip files downloaded by using the requests module and also opted for using PyQuery instead of Beautiful Soup. I think the issue you were facing has to do with the SSL cert validation, where the requests module will allow you to skip checking the certificate if you set the verify parameter to False.
The function below will download all the zip files and unzip them, from there, you can import the shapefiles into your geodatabase:
import requests
import os
import zipfile
from pyquery import PyQuery
from requests.packages.urllib3.exceptions import InsecureRequestWarning, InsecurePlatformWarning, SNIMissingWarning

# disable ssl warnings (we are not verifying SSL certificates at this time...future enhancement?)
for warning in [SNIMissingWarning, InsecurePlatformWarning, InsecureRequestWarning]:
    requests.packages.urllib3.disable_warnings(warning)

def download_zips(out_path):
    url = 'http://www.floodmaps.fema.gov/NFHL/status.shtml'
    download_prefix = 'https://hazards.fema.gov/femaportal/NFHL'
    pq = PyQuery(requests.get(url, verify=False).content) #verify param important for SSL
    src = pq.find('iframe').attr('src')
    pq = PyQuery(requests.get(src, verify=False).content)
    table = pq.find('table')
    for a in table.find('a'):
        href = a.attrib.get('href')
        url = '/'.join([download_prefix, href])
        r = requests.get(url, stream=True, verify=False)
        out_zip = os.path.join(out_path, href.split('=')[-1])
        with open(out_zip, 'wb') as f:
            for chunk in r.iter_content(1024 * 16): #grab 16 KB at a time
                if chunk:
                    f.write(chunk)
        print 'downloaded zip: "{}"'.format(href.split('=')[-1])
        # do more stuff like unzip?
        unzipped = out_zip.split('.zip')[0]
        with zipfile.ZipFile(out_zip, 'r') as f:
            f.extractall(unzipped)
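Since the edit above reports an empty zip file that made the unzip step fail, it may help to check each download before extracting it. A minimal sketch in Python 3 follows; extract_if_valid is a hypothetical helper, not part of the answer's code.

```python
import os
import zipfile


def extract_if_valid(zip_path):
    """Extract zip_path into a folder next to it; return the folder, or None if unusable."""
    # an empty or truncated download is not a readable zip archive
    if os.path.getsize(zip_path) == 0 or not zipfile.is_zipfile(zip_path):
        return None
    target_dir = os.path.splitext(zip_path)[0]  # strip the .zip extension
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target_dir)
    return target_dir
```

A None result flags the file for a retry or for manual inspection, so one bad download does not stop the rest of the loop.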
My goal is to connect to the YouTube API and download the URLs of specific music producers. I found the following script in this video: https://www.youtube.com/watch?v=_M_wle0Iq9M. In the video the code works beautifully, but when I try it on Python 2.7 it gives me KeyError: 'items'.
I know a KeyError can occur when a dictionary is used incorrectly or when a key doesn't exist.
I have checked the Google developers site for YouTube to make sure that 'items' exists in the response, and it does.
I am also aware that using get() may help with my problem, but I am not sure. Any suggestions for fixing the KeyError in the following code, or for improving the code to reach my main goal of downloading the URLs (I have a YouTube API key)?
Here is the code:
#these modules help with HTTP requests to YouTube
import urllib
import urllib2
import json

API_KEY = open("/Users/ereyes/Desktop/APIKey.rtf","r")
API_KEY = API_KEY.read()
searchTerm = raw_input('Search for a video:')
searchTerm = urllib.quote_plus(searchTerm)
url = 'https://www.googleapis.com/youtube/v3/search?part=snippet&q='+searchTerm+'&key='+API_KEY
response = urllib.urlopen(url)
videos = json.load(response)
videoMetadata = [] #declaring our list
for video in videos['items']: #"for loop" cycles through the json response and searches in items
    if video['id']['kind'] == 'youtube#video': #makes sure that the item we are looking at is a video
        videoMetadata.append(video['snippet']['title']+ # getting title of video and putting into list
                             "\nhttp://youtube.com/watch?v="+video['id']['videoId'])
videoMetadata.sort() # sorts our list alphabetically
print ("\nSearch Results:\n") #print out search results
for metadata in videoMetadata:
    print (metadata)+"\n"
raw_input('Press Enter to Exit')
The problem is most likely a combination of two things: the API key is stored in an RTF file instead of a plain text file, and you import both urllib and urllib2 without being clear about which one to use.
Personally, I would recommend requests, but with urllib I think you need to read() the contents of the response to get a string:
response = urllib.urlopen(url).read()
You can check that by printing the response variable
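To make a failed request easier to diagnose than a bare KeyError, the parsed response can be checked before looping over it. This is a sketch in Python 3; extract_video_links is a hypothetical helper, and it assumes a failed request comes back as JSON with an "error" object containing a "message", which is worth confirming by printing the raw response.

```python
def extract_video_links(response):
    """Return watch URLs for the video items in a parsed search response."""
    if "items" not in response:
        # surface whatever the API reported instead of failing on the missing key
        error = response.get("error", {})
        message = error.get("message", "no 'items' key in response")
        raise ValueError("YouTube API request failed: " + message)
    links = []
    for item in response["items"]:
        video_id = item.get("id", {})
        if video_id.get("kind") == "youtube#video":  # skip channels and playlists
            links.append("http://youtube.com/watch?v=" + video_id["videoId"])
    return links
```

If the key file really is RTF, the text read from it contains formatting markup, so the request would be rejected and this function would report the API's message rather than KeyError: 'items'.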