cURL method in Python for JSON feed [duplicate] - python

This question already has answers here:
How to download a file over HTTP?
(30 answers)
Closed 7 years ago.
While building a flask website, I'm using an external JSON feed to feed the local mongoDB with content. This feed is parsed and fed while repurposing keys from the JSON to keys in Mongo.
One of the available keys from the feed is called "img_url" and contains, guess what, an url to an image.
Is there a way, in Python, to mimic a php style cURL? I'd like to grab that key, download the image, and store it somewhere locally while keeping other associated keys, and have that as an entry to my db.
Here is my script up to now:
import json
import sys
import urllib2
from datetime import datetime
import pymongo
import pytz
from utils import slugify
# from utils import logger
client = pymongo.MongoClient()
db = client.artlogic
def fetch_artworks():
# logger.debug("downloading artwork data from Artlogic")
AL_artworks = []
AL_artists = []
url = "http://feeds.artlogic.net/artworks/artlogiconline/json/"
while True:
f = urllib2.urlopen(url)
data = json.load(f)
AL_artworks += data['rows']
# logger.debug("retrieved page %s of %s of artwork data" % (data['feed_data']['page'], data['feed_data']['no_of_pages']))
# Stop we are at the last page
if data['feed_data']['page'] == data['feed_data']['no_of_pages']:
break
url = data['feed_data']['next_page_link']
# Now we have a list called ‘artworks’ in which all the descriptions are stored
# We are going to put them into the mongoDB database,
# Making sure that if the artwork is already encoded (an object with the same id
# already is in the database) we update the existing description instead of
# inserting a new one (‘upsert’).
# logger.debug("updating local mongodb database with %s entries" % len(artworks))
for artwork in AL_artworks:
# Mongo does not like keys that have a dot in their name,
# this property does not seem to be used anyway so let us
# delete it:
if 'artworks.description2' in artwork:
del artwork['artworks.description2']
# upsert int the database:
db.AL_artworks.update({"id": artwork['id']}, artwork, upsert=True)
# artwork['artist_id'] is not functioning properly
db.AL_artists.update({"artist": artwork['artist']},
{"artist_sort": artwork['artist_sort'],
"artist": artwork['artist'],
"slug": slugify(artwork['artist'])},
upsert=True)
# db.meta.update({"subject": "artworks"}, {"updated": datetime.now(pytz.utc), "subject": "artworks"}, upsert=True)
return AL_artworks
if __name__ == "__main__":
fetch_artworks()

First, you might like the requests library.
Otherwise, if you want to stick to the stdlib, it will be something in the lines of:
def fetchfile(url, dst):
fi = urllib2.urlopen(url)
fo = open(dst, 'wb')
while True:
chunk = fi.read(4096)
if not chunk: break
fo.write(chunk)
fetchfile(
data['feed_data']['next_page_link'],
os.path.join('/var/www/static', uuid.uuid1().get_hex()
)
With the correct exceptions catching (i can develop if you want, but i'm sure the documentation will be clear enough).
You could put the fetchfile() into a pool of async jobs to fetch many files at once.
https://docs.python.org/2/library/json.html
https://docs.python.org/2/library/urllib2.html
https://docs.python.org/2/library/tempfile.html
https://docs.python.org/2/library/multiprocessing.html

Related

Python flask server to retrieve certain records

I have this following python code for a Flask server. I am trying to have this part of the code list all my vehicles that match the horsepower that I put in through my browser. I want it to return all the car names that match the horsepower, but what I have doesn't seem to be working? It returns nothing. I know the issue is somewhere in the "for" statement, but I don't know how to fix it.
This is my first time doing something like this and I've been trying multiple things for hours. I can't figure it out. Could you please help?
from flask import Flask
from flask import request
import os, json
app = Flask(__name__, static_folder='flask')
#app.route('/HORSEPOWER')
def horsepower():
horsepower = request.args.get('horsepower')
message = "<h3>HORSEPOWER "+str(horsepower)+"</h3>"
path = os.getcwd() + "/data/vehicles.json"
with open(path) as f:
data = json.load(f)
for record in data:
horsepower=int(record["Horsepower"])
if horsepower == record:
car=record["Car"]
return message
The following example should meet your expectations.
from flask import Flask
from flask import request
import os, json
app = Flask(__name__)
#app.route('/horsepower')
def horsepower():
# The type of the URL parameters are automatically converted to integer.
horsepower = request.args.get('horsepower', type=int)
# Read the file which is located in the data folder relative to the
# application root directory.
path = os.path.join(app.root_path, 'data', 'vehicles.json')
with open(path) as f:
data = json.load(f)
# A list of names of the data sets is created,
# the performance of which corresponds to the parameter passed.
cars = [record['Car'] for record in data if horsepower == int(record["Horsepower"])]
# The result is then output separated by commas.
return f'''
<h3>HORSEPOWER {horsepower}</h3>
<p>{','.join(cars)}<p>
'''
There are many different ways of writing the loop. I used a short variant in the example. In more detail, you can use these as well.
cars = []
for record in data:
if horsepower == int(record['Horsepower']):
cars.append(record['Car'])
As a tip:
Pay attention to when you overwrite the value of a variable by using the same name.

Python load variables into an array from file

I'm making a reddit bot and I've made it store the comment ids in an array so it doesn't reply to the same comment twice, however if I close the program the array is cleared.
I'm looking for a way to keep the array, such as storing it in an external file and reading it, thanks!
Here's my code:
import praw
import time
import random
import pickle
#logging into the Reddit API
r = praw.Reddit(user_agent="Random Number machine by /u/---")
print("Logging in...")
r.login(---,---, disable_warning=True)
print("Logged in.")
wordsToMatch = ["+randomnumber","+random number","+ randomnumber","+ random number"] #Words which the bot looks for.
cache = [] #If a comment ID is stored here, the bot will not reply back to the same post.
def run_bot():
print("Start of new loop.")
subreddit = r.get_subreddit(---) #Decides which sub-reddit to search for comments.
comments = subreddit.get_comments(limit=100) #Grabbing comments...
print(cache)
for comment in comments:
comment_text = comment.body.lower() #Stores the comment in a variable and lowers it.
isMatch = any(string in comment_text for string in wordsToMatch) #If the bot matches a comment with the wordsToMatch array.
if comment.id not in cache and isMatch: #If a comment is found and the ID isn't in the cache.
print("Comment found: {}".format(comment.id)) #Prints the following line to console, develepors see this only.
#comment.reply("Hey, I'm working!")
#cache.append(comment.id)
while True:
run_bot()
time.sleep(5)
What you're looking for is called serialization. You can use json or yaml or even pickle. They all have very similar APIs:
import json
a = [1,2,3,4,5]
with open("/tmp/foo.json", "w") as fd:
json.dump(a, fd)
with open("/tmp/foo.json") as fd:
b = json.load(fd)
assert b == a
foo.json:
$ cat /tmp/foo.json
[1, 2, 3, 4, 5]
json and yaml only work with basic types like strings, numbers, lists, and dictionaries. Pickle has more flexibility allowing you to serialize more complex types like classes. json is generally used for communication between programs, while yaml tends to be used when the input is needed to be edited/read by humans as well.
For your case, you probably want json. If you want to make it pretty, there are options to the json library to indent the output.

Getting file input into Python script for praw script

So I have a simple reddit bot set up which I wrote using the praw framework. The code is as follows:
import praw
import time
import numpy
import pickle
r = praw.Reddit(user_agent = "Gets the Daily General Thread from subreddit.")
print("Logging in...")
r.login()
words_to_match = ['sdfghm']
cache = []
def run_bot():
print("Grabbing subreddit...")
subreddit = r.get_subreddit("test")
print("Grabbing thread titles...")
threads = subreddit.get_hot(limit=10)
for submission in threads:
thread_title = submission.title.lower()
isMatch = any(string in thread_title for string in words_to_match)
if submission.id not in cache and isMatch:
print("Match found! Thread ID is " + submission.id)
r.send_message('FlameDraBot', 'DGT has been posted!', 'You are awesome!')
print("Message sent!")
cache.append(submission.id)
print("Comment loop finished. Restarting...")
# Run the script
while True:
run_bot()
time.sleep(20)
I want to create a file (text file or xml, or something else) using which the user can change the fields for the various information being queried. For example I want a file with lines such as :
Words to Search for = sdfghm
Subreddit to Search in = text
Send message to = FlameDraBot
I want the info to be input from fields, so that it takes the value after Words to Search for = instead of the whole line. After the information has been input into the file and it has been saved. I want my script to pull the information from the file, store it in a variable, and use that variable in the appropriate functions, such as:
words_to_match = ['sdfghm']
subreddit = r.get_subreddit("test")
r.send_message('FlameDraBot'....
So basically like a config file for the script. How do I go about making it so that my script can take input from a .txt or another appropriate file and implement it into my code?
Yes, that's just a plain old Python config, which you can implement in an ASCII file, or else YAML or JSON.
Create a subdirectory ./config, put your settings in ./config/__init__.py
Then import config.
Using PEP-18 compliant names, the file ./config/__init__.py would look like:
search_string = ['sdfghm']
subreddit_to_search = 'text'
notify = ['FlameDraBot']
If you want more complicated config, just read the many other posts on that.

How can I store URLs from Amazon S3 bucket in Django sqlite db to display & comment on?

I've been using Django for a couple of days & setup a basic blog from a tutorial with django comments.
I've got a totally separate python script that generates screenshots and uploads them to Amazon S3, now I'd like my django app to display all the images in the bucket and use a comment system on the images. Preferably I'd do this by just storing the URLs in my sqlite db, which I've got hard-coded currently to display all images in the db and has comments enabled on these.
My model:
(Does this need a foreign key to the django comments or is that just part of the Django Magic?!)
class Image(models.Model):
imgUrl=models.CharField(max_length=200)
meta=models.CharField(max_length=300)
def __unicode__(self):
return self.imgUrl
My bucket structure:
https://s3-eu-west-1.amazonaws.com/bucket/revision/process/images.png
Almost all the tutorials and packages I'm finding are based on upload/download rather than a simple for keys in bucket type approach that I want.
One of my problems is understanding how I can integrate my Boto functions with Django if I'm using Base.html. In an earlier tutorial I had an index page which had a view and could call functions from there. But base doesn't need that so I'm starting to get a little lost.
haven't looked up if boto api changed, but this is how it worked last time i looked
from boto.s3.connection import S3Connection
from boto.s3.key import Key
import s3config
conn = S3Connection(s3config.passwd, s3config.secret)
bucket = conn.get_bucket(s3config.bucket)
s3_path = '/some/path/in/your/bucket'
keys = bucket.list(s3_path)
# or if you want all keys:
# keys = bucket.get_all_keys()
for key in keys:
print key
# here you can download or do other stuff
# with the keys like get some metadata
print key.name
print key.etag
print key.size
print key.last_modified
#s3config.py
passwd = 'BLABALBALABALA'
secret = 'xvdwv3efefefefefef'
bucket = 'name-of-your-bucket'
Update:
Amazon s3 is a key value store, where key is a string. So nothing prevents you from putting in keys like:
/this/string/key/looks/like/a/unix/path
/folder/images/fileA.jpg
/folder/images/fileB.jpg
/folder/images/folderX/fileX1.jpg
now bucket.list(prefix="/folder/images/") would yield the latter three.
Look here for further details:
http://readthedocs.org/docs/boto/en/latest/ref/s3.html#boto-s3-bucket
http://docs.amazonwebservices.com/AmazonS3/latest/API/RESTBucketGET.html?r=8270
This is my code to store result from s3 to mysql by boto, django.
from demo.models import Movies
import boto
from boto.s3.key import Key
import string
from django.db import connection, transaction
def movietitle(b):
key = b.get_key('netflix/movie_titles.txt')
content = key.get_contents_as_string()
line = content.split('\n')
args = []
for imovie in line:
if len(imovie) > 0:
imovie = imovie.split(',')
movieid = imovie[0]
year = imovie[1]
title = imovie[2]
iargs = [string.atoi(movieid),title,year]
args.append(iargs)
cursor = connection.cursor()
sql = "insert into demo_movies(MovieID,MovieName,ReleaseYear) values(%s,%s,%s)"
cursor.executemany(sql,args)
transaction.commit_unless_managed()
cursor.close()

Delete all data for a kind in Google App Engine

I would like to wipe out all data for a specific kind in Google App Engine. What is the
best way to do this?
I wrote a delete script (hack), but since there is so much data is
timeout's out after a few hundred records.
I am currently deleting the entities by their key, and it seems to be faster.
from google.appengine.ext import db
class bulkdelete(webapp.RequestHandler):
def get(self):
self.response.headers['Content-Type'] = 'text/plain'
try:
while True:
q = db.GqlQuery("SELECT __key__ FROM MyModel")
assert q.count()
db.delete(q.fetch(200))
time.sleep(0.5)
except Exception, e:
self.response.out.write(repr(e)+'\n')
pass
from the terminal, I run curl -N http://...
You can now use the Datastore Admin for that: https://developers.google.com/appengine/docs/adminconsole/datastoreadmin#Deleting_Entities_in_Bulk
If I were a paranoid person, I would say Google App Engine (GAE) has not made it easy for us to remove data if we want to. I am going to skip discussion on index sizes and how they translate a 6 GB of data to 35 GB of storage (being billed for). That's another story, but they do have ways to work around that - limit number of properties to create index on (automatically generated indexes) et cetera.
The reason I decided to write this post is that I need to "nuke" all my Kinds in a sandbox. I read about it and finally came up with this code:
package com.intillium.formshnuker;
import java.io.IOException;
import java.util.ArrayList;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.labs.taskqueue.QueueFactory;
import com.google.appengine.api.labs.taskqueue.TaskOptions.Method;
import static com.google.appengine.api.labs.taskqueue.TaskOptions.Builder.url;
#SuppressWarnings("serial")
public class FormsnukerServlet extends HttpServlet {
public void doGet(final HttpServletRequest request, final HttpServletResponse response) throws IOException {
response.setContentType("text/plain");
final String kind = request.getParameter("kind");
final String passcode = request.getParameter("passcode");
if (kind == null) {
throw new NullPointerException();
}
if (passcode == null) {
throw new NullPointerException();
}
if (!passcode.equals("LONGSECRETCODE")) {
response.getWriter().println("BAD PASSCODE!");
return;
}
System.err.println("*** deleting entities form " + kind);
final long start = System.currentTimeMillis();
int deleted_count = 0;
boolean is_finished = false;
final DatastoreService dss = DatastoreServiceFactory.getDatastoreService();
while (System.currentTimeMillis() - start < 16384) {
final Query query = new Query(kind);
query.setKeysOnly();
final ArrayList<Key> keys = new ArrayList<Key>();
for (final Entity entity: dss.prepare(query).asIterable(FetchOptions.Builder.withLimit(128))) {
keys.add(entity.getKey());
}
keys.trimToSize();
if (keys.size() == 0) {
is_finished = true;
break;
}
while (System.currentTimeMillis() - start < 16384) {
try {
dss.delete(keys);
deleted_count += keys.size();
break;
} catch (Throwable ignore) {
continue;
}
}
}
System.err.println("*** deleted " + deleted_count + " entities form " + kind);
if (is_finished) {
System.err.println("*** deletion job for " + kind + " is completed.");
} else {
final int taskcount;
final String tcs = request.getParameter("taskcount");
if (tcs == null) {
taskcount = 0;
} else {
taskcount = Integer.parseInt(tcs) + 1;
}
QueueFactory.getDefaultQueue().add(
url("/formsnuker?kind=" + kind + "&passcode=LONGSECRETCODE&taskcount=" + taskcount).method(Method.GET));
System.err.println("*** deletion task # " + taskcount + " for " + kind + " is queued.");
}
response.getWriter().println("OK");
}
}
I have over 6 million records. That's a lot. I have no idea what the cost will be to delete the records (maybe more economical not to delete them). Another alternative would be to request a deletion for the entire application (sandbox). But that's not realistic in most cases.
I decided to go with smaller groups of records (in easy query). I know I could go for 500 entities, but then I started receiving very high rates of failure (re delete function).
My request from GAE team: please add a feature to delete all entities of a kind in a single transaction.
Presumably your hack was something like this:
# Deleting all messages older than "earliest_date"
q = db.GqlQuery("SELECT * FROM Message WHERE create_date < :1", earliest_date)
results = q.fetch(1000)
while results:
db.delete(results)
results = q.fetch(1000, len(results))
As you say, if there's sufficient data, you're going to hit the request timeout before it gets through all the records. You'd have to re-invoke this request multiple times from outside to ensure all the data was erased; easy enough to do, but hardly ideal.
The admin console doesn't seem to offer any help, as (from my own experience with it), it seems to only allow entities of a given type to be listed and then deleted on a page-by-page basis.
When testing, I've had to purge my database on startup to get rid of existing data.
I would infer from this that Google operates on the principle that disk is cheap, and so data is typically orphaned (indexes to redundant data replaced), rather than deleted. Given there's a fixed amount of data available to each app at the moment (0.5 GB), that's not much help for non-Google App Engine users.
Try using App Engine Console then you dont even have to deploy any special code
I've tried db.delete(results) and App Engine Console, and none of them seems to be working for me. Manually removing entries from Data Viewer (increased limit up to 200) didn't work either since I have uploaded more than 10000 entries. I ended writing this script
from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
import wsgiref.handlers
from mainPage import YourData #replace this with your data
class CleanTable(webapp.RequestHandler):
def get(self, param):
txt = self.request.get('table')
q = db.GqlQuery("SELECT * FROM "+txt)
results = q.fetch(10)
self.response.headers['Content-Type'] = 'text/plain'
#replace yourapp and YouData your app info below.
self.response.out.write("""
<html>
<meta HTTP-EQUIV="REFRESH" content="5; url=http://yourapp.appspot.com/cleanTable?table=YourData">
<body>""")
try:
for i in range(10):
db.delete(results)
results = q.fetch(10, len(results))
self.response.out.write("<p>10 removed</p>")
self.response.out.write("""
</body>
</html>""")
except Exception, ints:
self.response.out.write(str(inst))
def main():
application = webapp.WSGIApplication([
('/cleanTable(.*)', CleanTable),
])
wsgiref.handlers.CGIHandler().run(application)
The trick was to include redirect in html instead of using self.redirect. I'm ready to wait overnight to get rid of all the data in my table. Hopefully, GAE team will make it easier to drop tables in the future.
The official answer from Google is that you have to delete in chunks spread over multiple requests. You can use AJAX, meta refresh, or request your URL from a script until there are no entities left.
The fastest and efficient way to handle bulk delete on Datastore is by using the new mapper API announced on the latest Google I/O.
If your language of choice is Python, you just have to register your mapper in a mapreduce.yaml file and define a function like this:
from mapreduce import operation as op
def process(entity):
yield op.db.Delete(entity)
On Java you should have a look to this article that suggests a function like this:
#Override
public void map(Key key, Entity value, Context context) {
log.info("Adding key to deletion pool: " + key);
DatastoreMutationPool mutationPool = this.getAppEngineContext(context)
.getMutationPool();
mutationPool.delete(value.getKey());
}
One tip. I suggest you get to know the remote_api for these types of uses (bulk deleting, modifying, etc.). But, even with the remote api, batch size can be limited to a few hundred at a time.
Unfortunately, there's no way to easily do a bulk delete. Your best bet is to write a script that deletes a reasonable number of entries per invocation, and then call it repeatedly - for example, by having your delete script return a 302 redirect whenever there's more data to delete, then fetching it with "wget --max-redirect=10000" (or some other large number).
With django, setup url:
url(r'^Model/bdelete/$', v.bulk_delete_models, {'model':'ModelKind'}),
Setup view
def bulk_delete_models(request, model):
import time
limit = request.GET['limit'] or 200
start = time.clock()
set = db.GqlQuery("SELECT __key__ FROM %s" % model).fetch(int(limit))
count = len(set)
db.delete(set)
return HttpResponse("Deleted %s %s in %s" % (count,model,(time.clock() - start)))
Then run in powershell:
$client = new-object System.Net.WebClient
$client.DownloadString("http://your-app.com/Model/bdelete/?limit=400")
If you are using Java/JPA you can do something like this:
em = EntityManagerFactoryUtils.getTransactionalEntityManager(entityManagerFactory)
Query q = em.createQuery("delete from Table t");
int number = q.executeUpdate();
Java/JDO info can be found here: http://code.google.com/appengine/docs/java/datastore/queriesandindexes.html#Delete_By_Query
Yes you can:
Go to Datastore Admin, and then select the Entitiy type you want to delete and click Delete.
Mapreduce will take care of deleting!
On a dev server, one can cd to his app's directory then run it like this:
dev_appserver.py --clear_datastore=yes .
Doing so will start the app and clear the datastore. If you already have another instance running, the app won't be able to bind to the needed IP and therefore fail to start...and to clear your datastore.
You can use the task queues to delete chunks of say 100 objects.
Deleting objects in GAE shows how limited the Admin capabilities are in GAE. You have to work with batches on 1000 entities or less. You can use the bulkloader tool that works with csv's but the documentation does not cover java.
I am using GAE Java and my strategy for deletions involves having 2 servlets, one for doing the actually delete and another to load the task queues. When i want to do a delete, I run the queue loading servlet, it loads the queues and then GAE goes to work executing all the tasks in the queue.
How to do it:
Create a servlet that deletes a small number of objects.
Add the servlet to your task queues.
Go home or work on something else ;)
Check the datastore every so often ...
I have a datastore with about 5000 objects that i purge every week and it takes about 6 hours to clean out, so i run the task on Friday night.
I use the same technique to bulk load my data which happens to be about 5000 objects, with about a dozen properties.
This worked for me:
class ClearHandler(webapp.RequestHandler):
def get(self):
self.response.headers['Content-Type'] = 'text/plain'
q = db.GqlQuery("SELECT * FROM SomeModel")
self.response.out.write("deleting...")
db.delete(q)
Thank you all guys, I got what I need. :D
This may be useful if you have lots db models to delete, you can dispatch it in your terminal. And also, you can manage the delete list in DB_MODEL_LIST yourself.
Delete DB_1:
python bulkdel.py 10 DB_1
Delete All DB:
python bulkdel.py 11
Here is the bulkdel.py file:
import sys, os
URL = 'http://localhost:8080'
DB_MODEL_LIST = ['DB_1', 'DB_2', 'DB_3']
# Delete Model
if sys.argv[1] == '10' :
command = 'curl %s/clear_db?model=%s' % ( URL, sys.argv[2] )
os.system( command )
# Delete All DB Models
if sys.argv[1] == '11' :
for model in DB_MODEL_LIST :
command = 'curl %s/clear_db?model=%s' % ( URL, model )
os.system( command )
And here is the modified version of alexandre fiori's code.
from google.appengine.ext import db
class DBDelete( webapp.RequestHandler ):
def get( self ):
self.response.headers['Content-Type'] = 'text/plain'
db_model = self.request.get('model')
sql = 'SELECT __key__ FROM %s' % db_model
try:
while True:
q = db.GqlQuery( sql )
assert q.count()
db.delete( q.fetch(200) )
time.sleep(0.5)
except Exception, e:
self.response.out.write( repr(e)+'\n' )
pass
And of course, you should map the link to model in a file(like main.py in GAE), ;)
In case some guys like me need it in detail, here is part of main.py:
from google.appengine.ext import webapp
import utility # DBDelete was defined in utility.py
application = webapp.WSGIApplication([('/clear_db',utility.DBDelete ),('/',views.MainPage )],debug = True)
To delete all entities in a given kind in Google App Engine you only need to do as follows:
from google.cloud import datastore
query = datastore.Client().query(kind = <KIND>)
results = query.fetch()
for result in results:
datastore.Client().delete(result.key)
In javascript, the following will delete all the entries for on page:
document.getElementById("allkeys").checked=true;
checkAllEntities();
document.getElementById("delete_button").setAttribute("onclick","");
document.getElementById("delete_button").click();
given that you are on the admin-page (.../_ah/admin) with the entities you want to delete.

Categories

Resources