Performance issues formatting using .json() - python

I am trying to load data from a file located at some URL. I use requests to get it (this happens plenty fast). However, it takes about 10 minutes to use r.json() to format part of the dictionary. How can I speed this up?
import requests

match_list = []
for i in range(1, 11):
    r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches%d.json' % i)
    print('matches %d of 10 loaded' % i)
    match_list.append(r.json()['matches'])
    print('list %d of 10 created' % i)
match_histories = {}
match_histories['matches'] = match_list
I know that there is a related question here: Performance problem transforming JSON data, but I don't see how I can apply that to my case. Thanks! (I'm using Python 3.)
Edit:
I have been given quite a few suggestions that seem promising, but with each I hit a roadblock.
I would like to try cjson, but I cannot install it (pip can't find MS Visual C++ 10.0; I tried an installation route that goes through Lua, but I need cl on my path to begin with).
json.loads(r.content) causes a TypeError in Python 3.
I'm not sure how to get ijson working.
ujson seems to take about as long as json
json.loads(r.text.encode('utf-8').decode('utf-8')) takes just as long too

The built-in JSON parser isn't particularly fast. I tried another parser, python-cjson, like so:
import requests
import cjson
r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
print cjson.decode(r.content)
The whole program took 3.7 seconds on my laptop, including fetching the data and formatting the output for display.
Edit: Wow, we were all on the wrong track. json isn't slow; Requests's charset detection is painfully slow. Try this instead:
import requests
import json
r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
r.encoding = 'UTF-8'
print json.loads(r.text)
The json.loads part takes 1.5s on my same laptop. That's still slower than cjson.decode (at only .62s), but may be fast enough that you won't care if this isn't something you run very frequently. Caveat: I've only benchmarked this on Python2 and it might be different on Python3.
Edit 2: It seems cjson doesn't install in Python3. That's OK: json.loads in this version only takes .54 seconds. Charset detection is still glacial, though, and commenting out the r.encoding = 'UTF-8' line still makes the test script run in O(eternal) time. If you can count on those files always being UTF-8 encoded, I think the performance secret is to put that information in your script so that it doesn't have to figure this out at runtime. With that boost, you don't need to bother with supplying your own JSON parser. Just run:
import requests
r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
r.encoding = 'UTF-8'
print r.json()

It looks like requests uses simplejson to decode the JSON. If you just get the raw data with r.content and then use the built-in Python json library, json.loads(r.content) works very quickly. It raises an error for invalid JSON, but that's better than hanging for a long time.
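Since the question mentions Python 3 (where the edit notes json.loads(r.content) raised a TypeError), note that json.loads only accepts bytes from Python 3.6 onwards. A hedged sketch of the same idea with an explicit decode:
import json
import requests

r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
# Decode the raw bytes ourselves so requests never runs charset detection
# and json.loads() gets the str it expects on older Python 3 releases.
data = json.loads(r.content.decode('utf-8'))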

I would recommend using a streaming JSON parser (take a look at ijson). A streaming approach will improve memory efficiency for the parsing step, but your program may still be sluggish since you are storing a rather large dataset in memory.
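Since the question's edit says it's unclear how to get ijson working, here is a minimal sketch of the streaming approach, assuming the top-level object has a "matches" array as in the files above:
import io

import ijson
import requests

r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
match_list = []
# 'matches.item' walks the "matches" array one element at a time instead of
# materialising the whole document at once.
for match in ijson.items(io.BytesIO(r.content), 'matches.item'):
    match_list.append(match)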

Well, that's a pretty big file you have there, and pure Python code (I suspect the requests library doesn't use C bindings for JSON parsing) is often rather slow. Do you really need all the data? If you only need some parts of it, maybe you can find a faster way to extract it, or use a different API if one is available.
You could also try to use a faster JSON library by using a library like ujson: https://pypi.python.org/pypi/ujson
I didn't try this one myself but it claims to be fast. You can then just call ujson.loads(r.text) to obtain your data.
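A minimal sketch of that, combined with the encoding hint from the earlier answer (the question's edit notes ujson alone didn't help, which fits the theory that charset detection, not parsing, is the bottleneck):
import requests
import ujson

r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
r.encoding = 'UTF-8'   # skip requests' slow charset detection
data = ujson.loads(r.text)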

Related

How to get the complete response from a telnet command using python?

I am trying to use Python 3.10.9 on Windows to create a telnet session using telnetlib, but I have trouble reading the complete response.
I create a telnet session like
session = telnetlib.Telnet(host, port, timeout)
and then I write a command like
session.write(command + b"\n")
and then I wait some really long time (like 5 seconds) before I try to read the response using
session.read_some()
but I only get half of the response back!
The complete response is e.g.
Invalid arguments
Usage: $IMU,START,<SAMPLING_RATE>,<OUTPUT_RATE>
where SAMPLING_RATE = [1 : 1000] in Hz
OUTPUT_RATE = [1 : SAMPLING_RATE] in Hz
but all I read is the following:
b'\x1b[0GInvalid arguments\r\n\r\nUsage: $IMU,START,<'
More than half of the response is missing! How to read the complete response in a non-blocking way?
Other strange read methods:
read_all: blocking
read_eager: same issue
read_very_eager: sometimes works, sometimes not. Seems to contain a repetition of the message ...
read_lazy: does not read anything
read_very_lazy: does not read anything
I have not the slightest idea what all these different read methods are for. The documentation is not helping at all.
But read_very_eager seems to work sometimes, and other times I get a response like
F
FI
FIL
FILT
FILTE
FILTER
and so on. But I am reading only once, not adding the output myself!
Maybe there is a simpler-to-use module I can use instead of telnetlib?
Have you tried read_all(), or any of the other read_* options available? The available functions are listed in the telnetlib documentation.
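If a single non-blocking read keeps returning partial output, one workaround is to keep reading until the device has been quiet for a while. A rough sketch (host, port, timeout and command are the placeholders from the question):
import telnetlib
import time

session = telnetlib.Telnet(host, port, timeout)
session.write(command + b"\n")

response = b""
quiet = 0.0
while quiet < 2.0:                      # give up after ~2 s of silence
    chunk = session.read_very_eager()   # returns b"" if nothing is buffered yet
    if chunk:
        response += chunk
        quiet = 0.0
    else:
        time.sleep(0.1)
        quiet += 0.1

print(response.decode(errors="replace"))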

How do I get the snapshot length of a .pcap file using dpkt?

I am trying to get the snapshot length of a .pcap file. I have gone to the man page for pcap and pcap_snapshot but have not been able to get the function to work.
I am running Fedora 20 in a VM, and the code is written in Python.
First I try to import the header that the man page says to include, but I get a syntax error on the import and on pcap_snapshot().
I am new at Python, so I imagine it's something simple, but I'm not sure what it is. Any help is much appreciated!
import <pcap/pcap.h>
import dpkt
myPcap = open('mycapture.pcap')
myFile = dpkt.pcap.Reader(myPcap)
print "Snapshot length = ", myFile.pcap_snapshot()
Don't read the man page first unless you're writing code in C, C++, or Objective-C.
If you're not using a C-flavored language, you'll need to use a wrapper for libpcap, and should read the documentation for the wrapper first, as you won't be calling the C functions from libpcap, you'll be calling functions from the wrapper. If you try to import a C-language header file, such as pcap/pcap.h, in Python, that will not work. If you try to directly call a C-language function, such as pcap_snapshot(), that won't work, either.
Dpkt is not a wrapper; it is, instead, a library to parse packets and to read pcap files, with the code to read pcap files being independent of libpcap. Therefore, it won't offer wrappers for libpcap APIs such as pcap_snapshot().
Dpkt's documentation is, well, rather limited. A quick look at its pcap.py module seems to suggest that
print "Snapshot length = ", myFile.snaplen
would work; give that a try.
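For completeness, a minimal sketch of that suggestion (it assumes, as the answer above does, that dpkt.pcap.Reader exposes the snapshot length as a snaplen attribute):
import dpkt

# Open the capture in binary mode; dpkt parses the pcap file header itself.
with open('mycapture.pcap', 'rb') as f:
    reader = dpkt.pcap.Reader(f)
    print("Snapshot length = %d" % reader.snaplen)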

Output python into python-readable format

We're using a Python-based application which reads a configuration file containing a couple of arrays:
Example layout of config file:
array1 = [
    'bob',
    'sue',
    'jayne'
]
Currently changes to the configuration are done by hand, but I've written a little interface to streamline the process (mainly to avoid errors).
It currently reads in the existing configuration using a simple import. However, what I'm not sure how to do is get my script to write its output as valid Python, so that the main application can read it again.
How can I can dump the array back into the file, but in valid python?
Cheers!
I'd suggest JSON or YAML (less verbose than JSON) for configuration files. That way, the configuration file becomes more readable for the less pythonate ;) It's also easier to throw adequate errors, e.g. if the configuration is incomplete.
To save python objects you can always use pickle.
Generally, using repr() will create a string that can be re-evaluated. But pprint does a little nicer output:
from pprint import pprint

outf.write("array1 = ")
pprint(array1, outf)   # outf is an already-open writable file object
repr(array1) (and writing that into the file) would be a very simple solution, and it should work here.
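A small sketch of the repr/pprint route, writing the list back out as importable Python source (the config file name here is just an example):
from pprint import pformat

array1 = ['bob', 'sue', 'jayne']

# pformat() produces valid Python literal syntax, so the application can
# keep loading the file with a plain import.
with open('config.py', 'w') as outf:
    outf.write('array1 = %s\n' % pformat(array1))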

Downloading files from an http server in python

Using urllib2, we can get the http response from a web server. If that server simply holds a list of files, we could parse through the files and download each individually. However, I'm not sure what the easiest, most pythonic way to parse through the files would be.
When you get a whole http response of the generic file server list, through urllib2's urlopen() method, how can we neatly download each file?
Urllib2 might be OK to retrieve the list of files. For downloading large amounts of binary files, PycURL (http://pycurl.sourceforge.net/) is a better choice. This works for my IIS-based file server:
import re
import urllib2
import pycurl

url = "http://server.domain/"
path = "path/"
# Matches the file names inside the anchor tags of an IIS directory listing;
# adjust the pattern to your server's listing format.
pattern = '<A HREF="/%s(.*?)">' % path

response = urllib2.urlopen(url + path).read()
for filename in re.findall(pattern, response):
    with open(filename, "wb") as fp:
        curl = pycurl.Curl()
        curl.setopt(pycurl.URL, url + path + filename)
        curl.setopt(pycurl.WRITEDATA, fp)
        curl.perform()
        curl.close()
You can use urllib.urlretrieve (in Python 3.x: urllib.request.urlretrieve):
import urllib
urllib.urlretrieve('http://site.com/', filename='filez.txt')
This should work :)
And this is a function that can do the same thing (using urllib):
def download(url):
    webFile = urllib.urlopen(url)
    localFile = open(url.split('/')[-1], 'wb')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()
Can you guarantee that the URL you're requesting is a directory listing? If so, can you guarantee the format of the directory listing?
If so, you could use lxml to parse the returned document and find all of the elements that hold the path to a file, then iterate over those elements and download each file.
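A rough sketch of that idea, assuming the listing is a simple HTML index whose anchor hrefs point at the files (the URL is illustrative):
import urllib2
from lxml import html

url = 'http://server.domain/path/'
page = html.fromstring(urllib2.urlopen(url).read())

# Grab every linked name and save it next to the script.
for href in page.xpath('//a/@href'):
    data = urllib2.urlopen(url + href).read()
    with open(href.rsplit('/', 1)[-1], 'wb') as fp:
        fp.write(data)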
1. Download the index file. If it's really huge, it may be worth reading a chunk at a time; otherwise it's probably easier to just grab the whole thing into memory.
2. Extract the list of files to get. If the list is XML or HTML, use a proper parser; else if there is much string processing to do, use a regex; else use simple string methods. Again, you can parse it all at once or incrementally. Incrementally is somewhat more efficient and elegant, but unless you are processing many tens of thousands of lines it's probably not critical.
3. For each file, download it and save it to a file. If you want to try to speed things up, you could try running multiple download threads (see the sketch below); another (significantly faster) approach might be to delegate the work to a dedicated downloader program like Aria2 (http://aria2.sourceforge.net/). Note that Aria2 can be run as a service and controlled via XMLRPC; see http://sourceforge.net/apps/trac/aria2/wiki/XmlrpcInterface#InteractWitharia2UsingPython
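A bare-bones sketch of the threaded variant mentioned in step 3 (the URLs are placeholders):
import threading
import urllib2

def fetch(url):
    # Save each file under its own basename in the current directory.
    data = urllib2.urlopen(url).read()
    with open(url.rsplit('/', 1)[-1], 'wb') as fp:
        fp.write(data)

urls = ['http://server.domain/path/a.bin',
        'http://server.domain/path/b.bin']

threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()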
My suggestion would be to use BeautifulSoup (which is an HTML/XML parser) to parse the page for a list of files. Then, pycURL would definitely come in handy.
Another method, after you've got the list of files, is to use urllib.urlretrieve in a way similar to wget in order to simply download the file to a location on your filesystem.
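A hedged sketch of that combination, using the BeautifulSoup 3 API to match the urllib/urllib2-era code elsewhere in this thread (the URL is illustrative):
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://server.domain/path/'
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Fetch every linked file from the listing page, wget-style.
for link in soup.findAll('a'):
    name = link['href']
    urllib.urlretrieve(url + name, name)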
This is a non-conventional way, but it works:
fPointer = open(picName, 'wb')
self.curl.setopt(self.curl.WRITEFUNCTION, fPointer.write)
urllib.urlretrieve(link, picName)  # the more conventional way
Here's an untested solution:
import urllib2
response = urllib2.urlopen('http://server.com/file.txt')
urls = response.read().replace('\r', '').split('\n')
for file in urls:
    print 'Downloading ' + file
    response = urllib2.urlopen(file)
    handle = open(file, 'w')
    handle.write(response.read())
    handle.close()
It's untested, and it probably won't work. This is assuming you have an actual list of files inside of another file. Good luck!

Embed pickle (or arbitrary) data in python script

In Perl, the interpreter kind of stops when it encounters a line with
__END__
in it. This is often used to embed arbitrary data at the end of a Perl script. In this way the Perl script can fetch and store data that it stores 'in itself', which allows for some quite nice possibilities.
In my case I have a pickled object that I want to store somewhere. While I can use a file.pickle file just fine, I was looking for a more compact approach (to distribute the script more easily).
Is there a mechanism that allows for embedding arbitrary data inside a python script somehow?
With pickle you can also work directly on strings.
s = pickle.dumps(obj)
pickle.loads(s)
If you combine that with """ (triple-quoted strings) you can easily store any pickled data in your file.
If the data is not particularly large (many K) I would just .encode('base64') it and include that in a triple-quoted string, with .decode('base64') to get back the binary data, and a pickle.loads() call around it.
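Putting those two suggestions together, a small sketch of embedding a base64-encoded pickle in a triple-quoted string (the object here is just a stand-in):
import base64
import pickle

obj = {'answer': 42}

# Step 1 (run once): produce the text you paste into the script.
payload = base64.b64encode(pickle.dumps(obj)).decode('ascii')

# Step 2 (in the distributed script): the pasted text lives in a
# triple-quoted string and is decoded back at run time.
DATA = """%s""" % payload
restored = pickle.loads(base64.b64decode(DATA))
assert restored == obj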
In Python, you can use """ (triple-quoted strings) to embed long runs of text data in your program.
In your case, however, don't waste time on this.
If you have an object you've pickled, you'd be much, much happier dumping that object as Python source and simply including the source.
The repr function, applied to most objects, will emit a Python source-code version of the object. If you implement __repr__ for all of your custom classes, you can trivially dump your structure as Python source.
If, on the other hand, your pickled structure started out as Python code, just leave it as Python code.
I made this code. You run something like python comp.py foofile.tar.gz, and it creates decomp.py with foofile.tar.gz's contents embedded in it. I don't think this is really portable to Windows, though, because of the Popen calls.
import base64
import sys
import subprocess

inf = open(sys.argv[1], "r+b").read()
outs = base64.b64encode(inf)

decomppy = '''#!/usr/bin/python
import base64

def decomp(data):
    fname = "%s"
    outf = open(fname, "w+b")
    outf.write(base64.b64decode(data))
    outf.close()

# You can put the rest of your code here.
# Like this, to unzip an archive:
#import subprocess
#subprocess.Popen("tar xzf " + fname, shell=True)
#subprocess.Popen("rm " + fname, shell=True)
''' % (sys.argv[1])

taildata = '''uudata = """%s"""
decomp(uudata)
''' % (outs)

outpy = open("decomp.py", "w+b")
outpy.write(decomppy)
outpy.write(taildata)
outpy.close()
subprocess.Popen("chmod +x decomp.py", shell=True)
