GAE Python: Parsing compressed XML exceeds memory

I am trying to fetch and parse an XML file into a database. The XML is GZIP-compressed, and the GZIP file is ~8 MB. When I run the code locally, the memory used by pythonw.exe builds up to a level where the entire system (Windows 7) stops responding, and when I run it online it exceeds the memory limit on Google App Engine. I'm not sure if the file is too big or if I am doing something wrong. Any help would be very much appreciated!
from google.appengine.ext import webapp
from google.appengine.api.urlfetch import fetch
from xml.dom.minidom import parseString
import gzip
import base64
import StringIO

class ParseCatalog(webapp.RequestHandler):
    user = xxx
    password = yyy
    catalog = fetch('url',
                    headers={"Authorization":
                             "Basic %s" % base64.b64encode(user + ':' + password)}, deadline=600)
    xmlstring = StringIO.StringIO(catalog.content)
    gz = gzip.GzipFile(fileobj=xmlstring)
    gzcontent = gz.read()
    contentxml = parseString(gzcontent)
    items = contentxml.getElementsByTagName("Product")
    for item in items:
        entry = DatabaseEntry()
        entry.name = str(item.getElementsByTagName("Manufacturer")[0].firstChild.data)
        entry.put()
UPDATE
So I tried to follow BasicWolf's suggestion to switch to LXML but am having problems importing it. I downloaded the LXML 2.3 library and put it in the folder of my app (I know this is not ideal, but it's the only way I know how to include a 3rd-party library). Also, I added the following to my app.yaml:
libraries:
- name: lxml
  version: "2.3"
Then I wrote the following code to test if it parses:
import lxml

class ParseCatalog(webapp.RequestHandler):
    user = xxx
    password = yyy
    catalog = fetch('url',
                    headers={"Authorization":
                             "Basic %s" % base64.b64encode(user + ':' + password)}, deadline=600)
    items = etree.iterparse(catalog.content)

    def get(self):
        for elem in items:
            self.response.out.write(str(elem.tag))
However this is resulting in the following error:
ImportError: cannot import name etree
I have checked other questions on this error and it seems that the fact that I run on Windows 7 might play a role. I also tried to install the pre-compiled binary packages from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml but that didn't change anything either.

What do you expect? First you read the whole response into memory, then you unzip it into memory, then you construct a DOM tree, still in memory.
Here are some improvements:
- del every buffer variable the moment you don't need it.
- Get rid of the DOM parser and use event-driven parsing with lxml to save memory (see the sketch below).
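A minimal sketch of that event-driven approach, assuming the gzipped response body is passed in as a byte string (the "Product" and "Manufacturer" tag names and DatabaseEntry come from the question's code):

import gzip
import StringIO
from lxml import etree

def parse_products(gz_bytes):
    # Wrap the gzipped bytes in a file-like object and let iterparse
    # stream through it instead of building a full DOM tree.
    gz = gzip.GzipFile(fileobj=StringIO.StringIO(gz_bytes))
    for event, elem in etree.iterparse(gz, tag="Product"):
        name = elem.findtext("Manufacturer")
        # ... store the record here (e.g. the DatabaseEntry model from the question) ...
        # Free the parsed element (and already-processed siblings) to keep memory flat.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]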

Related

Crawl and download Readme.md files from GitHub using python

I'm trying to do an NLP task. For that purpose I need a considerable number of Readme.md files from GitHub. This is what I am trying to do:
1. For a given number n, list the first n GitHub repositories (and their URLs), ranked by their number of stars.
2. Download the Readme.md file from each of those URLs.
3. Save each Readme.md file on my hard drive, in a separate folder named after the repository.
I'm not acquainted with crawling and web scraping, but I am relatively good with Python. I'd be thankful for some help on how to accomplish these steps. Any help would be appreciated.
My effort: I've searched a little, and I found a website (gitstar-ranking.com) that ranks GitHub repos based on their stars. But that does not solve my problem because it is again a scraping task to get the name or the URL of those repos from this website.
Here's my attempt using the suggestion from @Luke. I changed the minimum stars to 500 since we don't need 5 million results (>500 still yields 66513 results).
You might not need the ssl workaround (the _create_unverified_context call below), but since I'm behind a proxy, it's a pain to do it properly.
The script finds files called readme.md in any combination of lower- and uppercase but nothing else. It saves the file as README.md (uppercase) but this can be adjusted by using the actual filename.
import urllib.request
import json
import ssl
import os
import time

n = 5  # number of fetched READMEs
url = 'https://api.github.com/search/repositories?q=stars:%3E500&sort=stars'

request = urllib.request.urlopen(url)
page = request.read().decode()
api_json = json.loads(page)
repos = api_json['items'][:n]

for repo in repos:
    full_name = repo['full_name']
    print('fetching readme from', full_name)

    # find the readme url (match the name case-insensitively)
    contents_url = repo['url'] + '/contents'
    request = urllib.request.urlopen(contents_url)
    page = request.read().decode()
    contents_json = json.loads(page)
    readme_url = [file['download_url'] for file in contents_json
                  if file['name'].lower() == 'readme.md'][0]

    # download readme contents
    try:
        context = ssl._create_unverified_context()  # prevent ssl problems
        request = urllib.request.urlopen(readme_url, context=context)
    except urllib.error.HTTPError as error:
        print(error)
        continue  # if the url can't be opened, there's no use trying to download anything
    readme = request.read().decode()

    # create a folder named after the repo and save README.md there
    try:
        os.mkdir(repo['name'])
    except OSError as error:
        print(error)
    with open(repo['name'] + '/README.md', 'w', encoding="utf-8") as f:
        f.write(readme)
    print('ok')

    # only 10 requests per min for unauthenticated search requests
    if n >= 9:  # n + 1 including the initial request
        time.sleep(6)
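If the rate limit becomes a problem, the same requests can be sent with a GitHub personal access token, which raises the limit for authenticated requests. A small sketch (GITHUB_TOKEN is a hypothetical placeholder for a token created in the GitHub account settings):

import urllib.request

GITHUB_TOKEN = 'xxxxxxxxxxxxxxxx'  # hypothetical personal access token

request = urllib.request.Request(
    url,
    headers={'Authorization': 'token ' + GITHUB_TOKEN})
page = urllib.request.urlopen(request).read().decode()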

Python - How to search for a zip file that resides in an iframe on https

Python - 2.7.5
Google Chrome
Python 2.7.5
Google Chrome
First off, I am a self-taught coder and will accept any critique and/or suggestions on the code posted below. This issue has been a joy to work through because I love challenging myself, but I'm afraid I have hit a brick wall and need some guidance. I will be as detailed as possible below to fully explain the overall picture of my script and then show where I am at with the actual issue described in the title.
I am putting together a script that will go out and download data automatically, unzip it, and export it to a GDB. We serve a wide region of users and have a very large enterprise SDE setup containing a large amount of public data that we have to search and update for our end users. Most of our data is updated monthly by local government entities, and we have to go out and search for the data manually, download, unzip, QAQC, etc. I want to put a script together that automates the first part of this process by downloading all my data for me and exporting it to a local GDB; from there I can QAQC everything and upload it to our SDE for our users to access.
The process has been pretty straightforward so far, until I got to the issue I have before me. My script will search a webpage for specific keywords, find the relevant link, and begin the download. For this post I will use two examples, one that works and one that is currently giving me issues. What works is my function for searching for and downloading the Metro GIS dataset, shown below. So far all the http websites I have included use this function. As shown with Metro, I plan to have a dedicated function for each group of data.
import requests, zipfile, StringIO, time, arcpy, urllib2, urlparse
from BeautifulSoup import BeautifulSoup

arcpy.env.overwriteOutput = True

workPath = --  # the output GDB
timestr = time.strftime("%Y%m%d")
gdbName = "GlobalSDEUpdate_" + timestr
gdbPath = workPath + "\\" + gdbName + ".gdb"

class global_DataFinder(object):
    def __init__(self):
        object.__init__(self)
        self.gdbSetup()
        self.metro()

    def gdbSetup(self):
        arcpy.CreateFileGDB_management(workPath, gdbName)

    def fileDownload(self, key, url, dlPath, dsName):
        page = urllib2.urlopen(url).read()
        urlList = []
        soup = BeautifulSoup(page)
        soup.prettify()
        for link in soup.findAll('a', href=True):
            if not 'http://' in link['href']:
                if urlparse.urljoin(url, link['href']) not in urlList:
                    zipDL = urlparse.urljoin(url, link['href'])
                    if zipDL.endswith(".zip"):
                        if key in zipDL:
                            urlList.append(zipDL)
        for x in urlList:
            print x
            r = requests.get(x, stream=True)
            z = zipfile.ZipFile(StringIO.StringIO(r.content))
            z.extractall(dlPath)
        arcpy.CreateFeatureDataset_management(gdbPath, dsName)
        arcpy.env.workspace = dlPath
        shpList = []
        for shp in arcpy.ListFeatureClasses():
            shpList.append(shp)
        arcpy.FeatureClassToGeodatabase_conversion(shpList, (gdbPath + "\\" + dsName))
        del shpList[:]

    def metro(self):
        key = "METRO_GIS_Data_Layers"
        url = "http://www.ridemetro.org/Pages/NewsDownloads.aspx"
        dlPath = --  # where my zip files output to
        dsName = "Metro"
        self.fileDownload(key, url, dlPath, dsName)

global_DataFinder()
As you can see above, this is the method I started with, using Metro as my first test, and it is currently working great. I was hoping all my sites going forward would work like this, but when I got to FEMA I ran into an issue.
The National Flood Hazard Layer (NFHL) Status website hosts floodplain data for many counties across the country, available for free to anyone who wishes to use it. When you arrive at the website, you can search for the county you want, the table queries out the search results, and then you can simply click and download the county you desire. When checking the source, this is what I came across and noticed it's in an iframe.
When accessing the iframe source link through Chrome and checking the png source url this is what you get - https://hazards.fema.gov/femaportal/NFHL/searchResult
Now here is where my problem lies. Unlike the http sites, I have quickly learned that accessing a secured https site and scraping the page is different, especially when it's using JavaScript to show the table. I have spent hours searching through forums and tried different Python packages like selenium, mechanize, requests, urllib, urllib2, and I seem to always hit a dead end before I can securely establish a connection, parse the webpage, and search for my county's zipfile. The code below shows the closest I have gotten, followed by the error I am getting.
(I always test in a separate script and then, when it works, I bring it over to my main script, so that's why the code snippet below is separate from my original.)
import urllib2, httplib, socket, ssl
from BeautifulSoup import BeautifulSoup

url = "http://www.floodmaps.fema.gov/NFHL/status.shtml"

def test():
    page = urllib2.urlopen(url).read()
    urlList = []
    soup = BeautifulSoup(page)
    soup.prettify()
    for link in soup.findAll("iframe", src=True):
        r = urllib2.urlopen(link['src'])
        iFrame = link['src']
        print iFrame

def connect_patched(self):
    "Connect to a host on a given (SSL) port."
    sock = socket.create_connection((self.host, self.port),
                                    self.timeout, self.source_address)
    if self._tunnel_host:
        self.sock = sock
        self._tunnel()
    self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file,
                                ssl_version=ssl.PROTOCOL_SSLv2)

httplib.HTTPSConnection.connect = connect_patched

test()
Error I get when running this test
urllib2.URLError: urlopen error [Errno 6] _ssl.c:504: TLS/SSL connection has been closed
I am hoping a more experienced coder can see what I have done and tell me if my current methods are the way to go, and if so, how to get past this final error and parse the data table properly.
Working Edits with @crmackey
import requests
import os
import zipfile
from pyquery import PyQuery
from requests.packages.urllib3.exceptions import InsecureRequestWarning, InsecurePlatformWarning, SNIMissingWarning
import httplib

httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

# disable ssl warnings (we are not verifying SSL certificates at this time...future enhancement?)
for warning in [SNIMissingWarning, InsecurePlatformWarning, InsecureRequestWarning]:
    requests.packages.urllib3.disable_warnings(warning)

def download_zips(out_path):
    url = 'http://www.floodmaps.fema.gov/NFHL/status.shtml'
    download_prefix = 'https://hazards.fema.gov/femaportal/NFHL'
    pq = PyQuery(requests.get(url, verify=False).content)  # verify param important for SSL
    src = pq.find('iframe').attr('src')
    pq = PyQuery(requests.get(src, verify=False).content)
    table = pq.find('table')
    for a in table.find('a'):
        href = a.attrib.get('href')
        print href
        url = '/'.join([download_prefix, href])
        print url
        r = requests.get(url, stream=True, verify=False)
        out_zip = os.path.join(out_path, href.split('=')[-1])
        with open(out_zip, 'wb') as f:
            for chunk in r.iter_content(1024 * 16):  # grab 16 KB at a time
                if chunk:
                    f.write(chunk)
        print 'downloaded zip: "{}"'.format(href.split('=')[-1])

out_path = r"C:\Users\barr\Desktop\Test"
download_zips(out_path)
All I added was the httplib import and the two HTTPConnection lines at the top. That allowed me to connect to the site using your script. Now here is the current problem: I am only getting one zip file in my out_path, and the zip file is empty. I checked the printed source in the debug window, and it shows it's trying to download the TERRITORY OF THE VIRGIN ISLAND zip file from the table, so it looks like it's trying but not downloading anything. After it outputs that one empty zip file, the script finishes and raises no further error messages. I temporarily removed your lines that unzipped the file because they were returning an error since the folder was empty.
I was able to get the zip files downloaded by using the requests module, and I also opted to use PyQuery instead of Beautiful Soup. I think the issue you were facing has to do with SSL certificate validation; the requests module will let you skip checking the certificate if you set the verify parameter to False.
The function below will download all the zip files and unzip them; from there, you can import the shapefiles into your geodatabase:
import requests
import os
import zipfile
from pyquery import PyQuery
from requests.packages.urllib3.exceptions import InsecureRequestWarning, InsecurePlatformWarning, SNIMissingWarning

# disable ssl warnings (we are not verifying SSL certificates at this time...future enhancement?)
for warning in [SNIMissingWarning, InsecurePlatformWarning, InsecureRequestWarning]:
    requests.packages.urllib3.disable_warnings(warning)

def download_zips(out_path):
    url = 'http://www.floodmaps.fema.gov/NFHL/status.shtml'
    download_prefix = 'https://hazards.fema.gov/femaportal/NFHL'
    pq = PyQuery(requests.get(url, verify=False).content)  # verify param important for SSL
    src = pq.find('iframe').attr('src')
    pq = PyQuery(requests.get(src, verify=False).content)
    table = pq.find('table')
    for a in table.find('a'):
        href = a.attrib.get('href')
        url = '/'.join([download_prefix, href])
        r = requests.get(url, stream=True, verify=False)
        out_zip = os.path.join(out_path, href.split('=')[-1])
        with open(out_zip, 'wb') as f:
            for chunk in r.iter_content(1024 * 16):  # grab 16 KB at a time
                if chunk:
                    f.write(chunk)
        print 'downloaded zip: "{}"'.format(href.split('=')[-1])

        # do more stuff like unzip
        unzipped = out_zip.split('.zip')[0]
        with zipfile.ZipFile(out_zip, 'r') as zf:
            zf.extractall(unzipped)
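Usage is then a single call with the folder where the zips should land, as in the working edits above (the path below is just a placeholder):

out_path = r"C:\path\to\output"  # placeholder output folder
download_zips(out_path)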

How do I submit data to a web form in python?

I'm trying to automate the process of creating an account for something, let's call it X, but I can't figure out what to do.
I saw this code somewhere:
import urllib
import urllib2
import webbrowser

data = urllib.urlencode({'q': 'Python'})
url = 'http://duckduckgo.com/html/'
full_url = url + '?' + data
response = urllib2.urlopen(full_url)
with open("results.html", "w") as f:
    f.write(response.read())
webbrowser.open("results.html")
But I can't figure out how to modify it for my use.
I would highly recommend using Selenium + WebDriver for this, since your question appears to be UI- and browser-based. You can install Selenium via 'pip install selenium' in most cases. Here are a couple of good references to get started:
- http://selenium-python.readthedocs.io/
- https://pypi.python.org/pypi/selenium
Also, if this process needs to drive the browser headlessly, look into including PhantomJS (via GhostDriver), which can be downloaded from the phantomjs.org website.
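As a rough sketch of what that looks like (the URL and the field names 'username', 'password', and 'submit' are hypothetical placeholders; the real names come from inspecting the target form, and newer Selenium versions use find_element(By.NAME, ...) instead of the methods shown here):

from selenium import webdriver

driver = webdriver.Chrome()  # or webdriver.PhantomJS() for a headless run
driver.get('http://example.com/signup')  # hypothetical signup page

# fill in the form fields (field names are hypothetical placeholders)
driver.find_element_by_name('username').send_keys('my_user')
driver.find_element_by_name('password').send_keys('my_password')

# submit the form
driver.find_element_by_name('submit').click()

driver.quit()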

Finding Untagged Posts on Tumblr (Python 3.3 Coding Assistance Requested)

I'm extremely new to coding in general; I delved into this project in order to help my friend tag her fifteen thousand and some-odd posts on Tumblr. We've finally finished, but she wants to be sure that we haven't missed anything...
So, I've scoured the internet, trying to find a coding solution. I came across a script found here, that allegedly does exactly what we need -- so I downloaded Python, and...It doesn't work.
More specifically, when I click on the script, a black box appears for about half a second and then disappears. I haven't been able to screenshot the box to find out exactly what it says, but I believe it says there's a syntax error. At first, I tried with Python 2.4; it didn't seem to find the json module the creator uses, so I switched to Python 3.3 -- the most recent version for Windows, and this is where the syntax errors occur.
#!/usr/bin/python
import urllib2
import json

hostname = "(Redacted for Privacy)"
api_key = "(Redacted for Privacy)"
url = "http://api.tumblr.com/v2/blog/" + hostname + "/posts?api_key=" + api_key

def api_response(url):
    req = urllib2.urlopen(url)
    return json.loads(req.read())

jsonresponse = api_response(url)
post_count = jsonresponse["response"]["total_posts"]
increments = (post_count + 20) / 20
for i in range(0, increments):
    jsonresponse = api_response(url + "&offset=" + str((i * 20)))
    posts = jsonresponse["response"]["posts"]
    for i in range(0, len(posts)):
        if not posts[i]["tags"]:
            print posts[i]["post_url"]
print("All finished!")
So, uhm, my question is this: if this code has a syntax error that could be fixed so it can be used to find the untagged posts on Tumblr, what might that error be?
If this code is outdated (either via Tumblr or via Python updates), might someone with a little free time be willing to help create a new script to find untagged posts on Tumblr? Searching Tumblr, this seems to be a semi-common problem.
In case it matters, Python is installed in C:\Python33.
Thank you for your assistance.
when I click on the script, a black box appears for about half a second and then disappears
At the very least, you should be able to run a Python script from the command line e.g., do Exercise 0 from "Learn Python The Hard Way".
"Finding Untagged Posts on Tumblr" blog post contains Python 2 script (look at import urllib2 in the source. urllib2 is renamed to urllib.request in Python 3). It is easy to port the script to Python 3:
#!/usr/bin/env python3
"""Find untagged tumblr posts.

Python 3 port of the script from
http://www.alexwlchan.net/2013/08/untagged-tumblr-posts/
"""
import json
from itertools import count
from urllib.request import urlopen

hostname, api_key = "(Redacted for Privacy)", "(Redacted for Privacy)"
url = "https://api.tumblr.com/v2/blog/{blog}/posts?api_key={key}".format(
    blog=hostname, key=api_key)

for offset in count(step=20):
    r = json.loads(urlopen(url + "&offset=" + str(offset)).read().decode())
    posts = r["response"]["posts"]
    if not posts:  # no more posts
        break
    for post in posts:
        if not post["tags"]:  # no tags
            print(post["post_url"])
Here's the same functionality implemented using the official Python Tumblr API v2 Client (a Python 2-only library):
#!/usr/bin/env python
from itertools import count

import pytumblr  # $ pip install pytumblr

hostname, api_key = "(Redacted for Privacy)", "(Redacted for Privacy)"
client = pytumblr.TumblrRestClient(api_key, host="https://api.tumblr.com")

for offset in count(step=20):
    posts = client.posts(hostname, offset=offset)["posts"]
    if not posts:  # no more posts
        break
    for post in posts:
        if not post["tags"]:  # no tags
            print(post["post_url"])
Tumblr has an API. You probably would have much better success using it.
https://code.google.com/p/python-tumblr/

Locally execute python file that is located on a web server

I am working on an open-source project called RubberBand, which allows you to do what the title says: locally execute a Python file that is located on a web server. However, I have run into a problem: if a comma is located in a string (e.g. "http:"), it will return an error.
'''
RubberBand Version 1.0.1 'Indigo-Charlie'
http://www.lukeshiels.com/rubberband

CHANGE-LOG:
Changed Error Messages.
Changed Whole Code Into one function, rather than three.
Changed Importing required libraries into one line instead of two
'''
#Edit Below this line
import httplib, urlparse

def executeFromURL(url):
    if (url == None):
        print "!# RUBBERBAND_ERROR: No URL Specified #!"
    else:
        CORE = None
        good_codes = [httplib.OK, httplib.FOUND, httplib.MOVED_PERMANENTLY]
        host, path = urlparse.urlparse(url)[1:3]
        try:
            conn = httplib.HTTPConnection(host)
            conn.request('HEAD', path)
            CORE = conn.getresponse().status
        except StandardError:
            CORE = None
        if(CORE in good_codes):
            exec(url)
        else:
            print "!# RUBBERBAND_ERROR: File Does Not Exist On WEBSERVER #!"
RubberBand in three lines without error checking:
import requests
def execute_from_url(url):
    exec(requests.get(url).content)
You should use a return statement in your if (url == None): block, as there is no point in carrying on with your function.
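A minimal sketch of that early return, applied to the function from the question:

def executeFromURL(url):
    if url is None:
        print "!# RUBBERBAND_ERROR: No URL Specified #!"
        return  # nothing more to do without a URL
    # ... the rest of the function body follows, without needing an else: block ...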
Whereabouts in your code is the error? Is there a full traceback? URIs with commas parse fine with the urlparse module.
Is it perhaps httplib.ResponseNotReady when calling CORE = conn.getresponse().status?
Never mind that error message; that was me quickly testing your code and re-using the same connection object. I can't see what would be erroneous in your code.
I would suggest checking this question: Can I use commas in a URL? My suggestion is to avoid commas in URLs.
This seems to work well for me:
import urllib
(fn,hd) = urllib.urlretrieve('http://host.com/file.py')
execfile(fn)
I prefer to use standard libraries, because I'm using Python bundled with third-party software (Abaqus), which makes it a real headache to add packages.
