I am going around in circles and have tried so many different ways that I guess my core understanding is wrong. I would be grateful for help in understanding my encoding/decoding issues.
import urllib2
result = urllib2.urlopen("https://graph.facebook.com/163146530455639")
rawdata = result.read().decode('utf-8')
print "HEADER: " + str(result.info())
print "I want this to work ", rawdata.find('http://www.facebook.com')
print "I dont want this to work ", rawdata.find('http:\/\/www.facebook.com')
I guess what I'm getting isn't UTF-8 even though the header seems to say it is. Or, as a newbie to Python, I'm doing something dumb. :(
Thanks for any help,
Phil
You're getting JSON back from Facebook, so the easiest thing to do is use the built-in json module to decode it (provided you're using Python 2.6+; otherwise you'll have to install simplejson).
import json
import urllib2
result = urllib2.urlopen("https://graph.facebook.com/163146530455639")
rawdata = result.read()
jsondata = json.loads(rawdata)
print jsondata['link']
gives you:
u'http://www.facebook.com/GrosvenorCafe'
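Incidentally, the \/ sequences in the raw response are just JSON's optional escaping of forward slashes, which is why the literal search in the question behaved the way it did; json.loads removes them. A standalone illustration:
import json

# JSON may escape "/" as "\/"; json.loads undoes that escaping
raw = '{"link": "http:\\/\\/www.facebook.com\\/GrosvenorCafe"}'
print json.loads(raw)['link']  # prints http://www.facebook.com/GrosvenorCafe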
I am trying to load data from a file located at some URL. I use requests to get it (this happens plenty fast). However, it takes about 10 minutes to use r.json() to format part of the dictionary. How can I speed this up?
match_list = []
for i in range(1, 11):
    r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches%d.json' % i)
    print('matches %d of 10 loaded' % i)
    match_list.append(r.json()['matches'])
    print('list %d of 10 created' % i)
match_histories = {}
match_histories['matches'] = match_list
I know that there is a related question here: Performance problem transforming JSON data, but I don't see how I can apply that to my case. Thanks! (I'm using Python 3.)
Edit:
I have been given quite a few suggestions that seem promising, but with each I hit a roadblock.
I would like to try cjson, but I cannot install it (pip can't find MS Visual C++ 10.0; I tried an alternative install using Lua, but that needs cl on my path to begin with).
json.loads(r.content) causes a TypeError in Python 3.
I'm not sure how to get ijson working.
ujson seems to take about as long as json.
json.loads(r.text.encode('utf-8').decode('utf-8')) takes just as long too.
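(For reference, here is one way to time just the parsing step so the download doesn't muddy the comparison; a minimal sketch using the standard json module as the example parser:)
import time
import json
import requests

r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
raw = r.content.decode('utf-8')  # decode once, outside the timed region

start = time.perf_counter()
data = json.loads(raw)
print('parse took %.2fs' % (time.perf_counter() - start))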
The built-in JSON parser isn't particularly fast. I tried another parser, python-cjson, like so:
import requests
import cjson
r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
print cjson.decode(r.content)
The whole program took 3.7 seconds on my laptop, including fetching the data and formatting the output for display.
Edit: Wow, we were all on the wrong track. json isn't slow; Requests's charset detection is painfully slow. Try this instead:
import requests
import json
r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
r.encoding = 'UTF-8'
print json.loads(r.text)
The json.loads part takes 1.5s on the same laptop. That's still slower than cjson.decode (at only 0.62s), but may be fast enough that you won't care if this isn't something you run very frequently. Caveat: I've only benchmarked this on Python 2; it might be different on Python 3.
Edit 2: It seems cjson doesn't install on Python 3. That's OK: json.loads in this version takes only 0.54 seconds. Charset detection is still glacial, though, and commenting out the r.encoding = 'UTF-8' line still makes the test script run in O(eternal) time. If you can count on those files always being UTF-8 encoded, I think the performance secret is to put that information in your script so that it doesn't have to be figured out at runtime. With that boost, you don't need to bother with supplying your own JSON parser. Just run:
import requests
r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
r.encoding = 'UTF-8'
print r.json()
It looks like requests uses simplejson to decode the JSON. If you just get the data with r.content and then use the built-in Python json library, json.loads(r.content) works very quickly. It does raise an error for invalid JSON, but that's better than hanging for a long time.
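A note on the TypeError from the question's edit: before Python 3.6, json.loads only accepts str, so the bytes from r.content must be decoded first (3.6+ accepts bytes directly). A version-safe sketch, assuming the files are UTF-8:
import json
import requests

r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
# r.content is raw bytes, so Requests' slow charset detection never runs
data = json.loads(r.content.decode('utf-8'))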
I would recommend using a streaming JSON parser (take a look at ijson). A streaming approach will improve your memory efficiency for the parsing step, but your program may still be sluggish since you are storing a rather large dataset in memory. A minimal sketch of that approach follows, assuming the top-level object holds a 'matches' array as the question's code suggests:
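import ijson
import requests

# stream=True avoids loading the whole response body into memory at once
r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json', stream=True)
r.raw.decode_content = True  # let urllib3 undo any gzip/deflate transfer encoding

# yield one match at a time instead of materializing the full document
for match in ijson.items(r.raw, 'matches.item'):
    pass  # process each match here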
Well, that's a pretty big file you have there, and pure Python code (I suspect the requests library doesn't use C bindings for JSON parsing) is often rather slow. Do you really need all the data? If you only need some parts of it, maybe you can find a faster way to extract them, or use a different API if one is available.
You could also try a faster JSON library such as ujson: https://pypi.python.org/pypi/ujson
I didn't try this one myself, but it claims to be fast. You can then just call ujson.loads(r.text) to obtain your data.
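A minimal sketch of that suggestion, combined with the fixed encoding from the earlier answer so charset detection doesn't dominate:
import requests
import ujson

r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
r.encoding = 'UTF-8'  # skip Requests' slow charset detection
data = ujson.loads(r.text)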
I'm quite new to Python and am trying to learn as much as I can by watching videos/reading tutorials.
I was following this video on how to take data from Quandl. I know there is already a Python module for this, but I wanted to learn how to take it from the website if necessary. My issue is that when I try to emulate the code around 9:50 and print the result, Python doesn't split the lines in the CSV file. I understand he's using Python 2.x, while I'm using 3.4.
Here's the code I use:
import urllib
from urllib.request import urlopen

def grabQuandl(ticker):
    endLink = 'sort_order=desc'  # without authtoken
    try:
        salesRev = urllib.request.urlopen('https://www.quandl.com/api/v1/datasets/SEC/' + ticker + '_SALESREVENUENET_Q.csv?&' + endLink).read()
        print(salesRev)
    except Exception as e:
        print('failed the main quandl loop for reason of', str(e))

grabQuandl('AAPL')
And this is what gets printed:
b'Date,Value\n2009-06-27,8337000000.0\n2009-12-26,15683000000.0\n2010-03-27,13499000000.0\n2010-06-26,15700000000.0\n2010-09-25,20343000000.0\n2010-12-25,26741000000.0\n2011-03-26,24667000000.0\n2011-06-25,28571000000.0\n2011-09-24,28270000000.0\n2011-12-31,46333000000.0\n2012-03-31,39186000000.0\n2012-06-30,35023000000.0\n2012-09-29,35966000000.0\n2012-12-29,54512000000.0\n2013-03-30,43603000000.0\n2013-06-29,35323000000.0\n2013-09-28,37472000000.0\n2013-12-28,57594000000.0\n2014-03-29,45646000000.0\n2014-06-28,37432000000.0\n2014-09-27,42123000000.0\n2014-12-27,74599000000.0\n2015-03-28,58010000000.0\n'
I get that the \n is some sort of line splitter, but it's not working like in the video. I've googled for possible solutions, such as using a for loop or read().split(), but at best they simply remove the \n. I can't get the output into a table like in the video. What am I doing wrong?
.read() gives you back a byte string; when you print it directly, you get the result you got. Notice the b before the opening quote: it indicates a byte string.
You should decode the string before printing it (or directly when calling .read()). An example:
import urllib
from urllib.request import urlopen

def grabQuandl(ticker):
    endLink = 'sort_order=desc'  # without authtoken
    try:
        salesRev = urllib.request.urlopen('https://www.quandl.com/api/v1/datasets/SEC/' + ticker + '_SALESREVENUENET_Q.csv?&' + endLink).read().decode('utf-8')
        print(salesRev)
    except Exception as e:
        print('failed the main quandl loop for reason of', str(e))

grabQuandl('AAPL')
The above decodes the returned data as UTF-8; you can use whatever encoding you want (whatever encoding the data actually is).
Example to show the print behavior -
>>> s = b'asd\nbcd\n'
>>> print(s)
b'asd\nbcd\n'
>>> print(s.decode('utf-8'))
asd
bcd
>>> type(s)
<class 'bytes'>
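From there, turning the decoded text into a table is easy with the csv module; a small sketch, assuming salesRev holds the decoded CSV text from the function above:
import csv
import io

reader = csv.reader(io.StringIO(salesRev))
header = next(reader)  # ['Date', 'Value']
for date, value in reader:
    print(date, float(value))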
I'm trying to send a response to a web page with multilingual characters in Python 3, but it always comes out like this:
"\\xd8\\xa7\\xd9\\x84\\xd9\\x82\\xd8\\xa7\\xd9\\x85\\xd9\\x88\\xd8\\xb3 \\xd8\\xa7\\xd9\\x84\\xd8\\xb9\\xd8\\xb1\\xd8\\xa8\\xd9\\x8a Espa\\xc3\\xb1a".
When the correct answer is this:
القاموس العربي España.
This is the code:
s="القاموس العربي España".encode(encoding='UTF-8')
Where could my mistake be?
I found it! It was a mess with the JSON responder, which I was writing with ensure_ascii=True while the response was being sent as JSON, not as HTML. By using ensure_ascii=False, the system prints any JSON answer correctly.
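A standalone sketch showing the difference:
import json

s = "القاموس العربي España"
print(json.dumps(s, ensure_ascii=True))   # non-ASCII escaped as \uXXXX sequences
print(json.dumps(s, ensure_ascii=False))  # characters pass through unescaped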
I capture screen of my pygame program like this
data = pygame.image.tostring(pygame.display.get_surface(),"RGB")
How can I convert it into a base64 string (WITHOUT having to save it to the HDD)? It's important that there is no saving to the HDD. I know I can save it to a file and then just encode the file to base64, but I can't seem to encode "on the fly".
thanks
If you want, you can save it to a StringIO, which is basically a virtual file stored as a string.
However, I'd really recommend using the base64 module, which has a method called base64.b64encode. It handles your 'on the fly' requirement well.
Code example:
import base64

# raw RGB bytes of the current display surface
data = pygame.image.tostring(pygame.display.get_surface(), "RGB")
# encode those bytes as base64, entirely in memory
base64data = base64.b64encode(data)
Happy coding!
Actually, pygame.image.tostring() is a pretty strange function (I really don't understand the binary string it returns; I can't find anything that can process it correctly).
There seems to be an enhancement issue about this on the pygame Bitbucket:
(https://bitbucket.org/pygame/pygame/issue/48/add-optional-format-argument-to)
I got around it like this:
import cStringIO
import base64

data = cStringIO.StringIO()
pygame.image.save(pygame.display.get_surface(), data)
data = base64.b64encode(data.getvalue())
So in the end you get a valid and correct base64 string, and it seems to work. Not sure about the format yet, though; I will add more info tomorrow.
I'm trying to convert some PHP code into Python and am using pycurl. I've gotten most of it to be accepted, but when it gets to result = pycurl.exec(Moe), it keeps throwing a syntax error. I guess that I'm not filling out the exec field correctly, but I can't seem to figure out where it is going wrong.
from urllib2 import urlopen
from ClientForm import ParseResponse
import cgi, cgitb
import webbrowser
import curl, pycurl
Moe = pycurl.Curl(wsdl)
Moe.setopt(pycurl.POST, 1)
Moe.setopt(pycurl.HTTPHEADER, ["Content-Type: text/xml"])
Moe.setopt(pycurl.HTTPAUTH, pycurl.BASIC)
Moe.setopt(pycurl.USERPWD, "userid:password")
Moe.setopt(pycurl.POSTFIELDS, Larry)
Moe.setopt(pycurl.SSL_VERIFYPEER, 0)
Moe.setopt(pycurl.SSLCERT, pemlocation)
Moe.setopt(pycurl.SSLKEY, keylocation)
Moe.setopt(pycurl.SSLKEYPASSWD, keypassword)
Moe.setopt(pycurl.RETURNTRANSFER, 1)
result = pycurl.exec(Moe)
pycurl.close(Moe)
Use Moe.perform() to execute your request.
PS:
Moe.setopt(pycurl.POSTFIELDS, Larry)
Is Larry actually a variable? If it's a string, quote it.
exec is a reserved word in Python, so you cannot have a function with that name. Try reading the pycurl documentation to see what function you should be calling.
I haven't used pycurl myself, but maybe you want to call Moe.perform()?
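One detail worth adding: Moe.perform() returns None, and pycurl has no RETURNTRANSFER option (that's a PHP-ism), so the response body has to be captured through a write target. A minimal sketch in the question's Python 2 style, reusing its wsdl variable and showing only the capture-related options:
import StringIO  # use io.BytesIO on Python 3
import pycurl

buf = StringIO.StringIO()
Moe = pycurl.Curl()                          # Curl() takes no constructor arguments
Moe.setopt(pycurl.URL, wsdl)                 # the URL is set as an option
Moe.setopt(pycurl.WRITEFUNCTION, buf.write)  # response body is written here
# ... the question's other setopt calls go here ...
Moe.perform()                                # executes the transfer; returns None
result = buf.getvalue()
Moe.close()                                  # close() is a method on the handle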