YouTube API UnicodeEncodeError in Python 3.4 - python

I was exploring the YouTube Data API and found that improperly encoded results were holding me back. I got good results until I retrieved a set that includes unmapped characters in the titles. My code is now (cleaned up a little for you fine folks):
import urllib.request
import urllib.parse
import json
import datetime

# Look for videos published up to THIS MANY hours ago
IntHoursToSub = 2
RightNow = datetime.datetime.utcnow()
StartedAgo = datetime.timedelta(hours=-(IntHoursToSub))
HourAgo = RightNow + StartedAgo
HourAgo = str(HourAgo).replace(" ", "T")
HourAgo = HourAgo[:HourAgo.find(".")] + "Z"

# Get API Key from your safe place and set up the API link
YouTubeAPIKey = open('YouTubeAPIKey.txt', 'r').read()
locuURL = "https://www.googleapis.com/youtube/v3/search"
values = {"key": YouTubeAPIKey,
          "part": "snippet",
          "publishedAfter": HourAgo,
          "relevanceLanguage": "en",
          "regionCode": "us",
          "maxResults": "50",
          "type": "live"}
postData = urllib.parse.urlencode(values)
fullURL = locuURL + "?" + postData

# Set up response holder and handle exceptions
respData = ""
try:
    req = urllib.request.Request(fullURL)
    respData = urllib.request.urlopen(req).read().decode()
except Exception as e:
    print(str(e))
#print(respData)

# Read JSON response and iterate through for video names/URLs
jsonData = json.loads(respData)
for object in jsonData["items"]:
    if object["id"]["kind"] == "youtube#video":
        print(object["snippet"]["title"], "https://www.youtube.com/watch?v=" + object["id"]["videoId"])
The error was:
Traceback (most recent call last):
File "C:/Users/Chad LaFarge/PycharmProjects/APIAccess/YouTubeAPI.py", line 33, in <module>
print(respData)
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u25bb' in position 11737: character maps to <undefined>
UPDATE
MJY called it! Starting from the PyCharm menu bar: File -> Settings... -> Editor -> File Encodings, then set "IDE Encoding", "Project Encoding", and "Default encoding for properties files" ALL to UTF-8, and now it works like a charm.
Many thanks!

Check sys.stdout.encoding.
If it is not UTF-8, the problem is not in the YouTube API but in the encoding used for your console output.
Check things such as the PYTHONIOENCODING environment variable and your terminal and locale settings.
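For illustration, a minimal sketch of checking that and printing defensively; the title variable below is a hypothetical stand-in for a string taken from the API response:
import sys

print(sys.stdout.encoding)  # 'cp1252' here would explain the UnicodeEncodeError

# Workaround sketch: replace characters the console cannot map instead of raising.
title = "Live stream \u25bb test"  # hypothetical title containing the offending character
encoding = sys.stdout.encoding or "utf-8"
print(title.encode(encoding, errors="replace").decode(encoding))
Alternatively, setting the environment variable PYTHONIOENCODING=utf-8 before starting Python forces UTF-8 output without touching the code or the IDE settings.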

Related

Convert base64 encoded google service account key to JSON file using Python

Hello, I'm trying to convert a Google service account JSON key (contained in a base64-encoded field named privateKeyData in the file foo.json - more context here) into the actual JSON key file (I need that format because Ansible only accepts that).
The foo.json file is obtained using this Google Python API method.
What I'm trying to do (though I am using Python) is also described in this thread, which by the way does not work for me (tried on OS X and Linux).
#!/usr/bin/env python3
import json
import base64

with open('/tmp/foo.json', 'r') as f:
    ymldict = json.load(f)

b64encodedCreds = ymldict['privateKeyData']
b64decodedBytes = base64.b64decode(b64encodedCreds, validate=True)
outputStr = b64decodedBytes
print(outputStr)

# issue
outputStr = b64decodedBytes.decode('UTF-8')
print(outputStr)
yields
./test.py
b'0\x82\t\xab\x02\x01\x030\x82\td\x06\t*\x86H\x86\xf7\r\x01\x07\x01\xa0\x82\tU\x04\x82\tQ0\x82\tM0\x82\x05q\x06\t*\x86H\x86\xf7\r\x01\x07\x01\xa0\x82\x05b\x04\x82\x05^0\x82\x05Z0\x82\x05V\x06\x0b*\x86H\x86\xf7\r\x01\x0c\n\x01\x02\xa0\x82\x#TRUNCATING HERE
Traceback (most recent call last):
File "./test.py", line 17, in <module>
outputStr = b64decodedBytes.decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 1: invalid start byte
I think I have run out of ideas and have now spent more than a day on this :(
What am I doing wrong?
Your base64 decoding logic looks fine to me. The problem you are facing is probably due to a character encoding mismatch. The response body you received after calling create (your foo.json file) is probably not encoded with UTF-8. Check out the response header's Content-Type field. It should look something like this:
Content-Type: text/javascript; charset=Shift_JIS
Try decoding your base64-decoded string with the encoding used in the content type:
b64decodedBytes.decode('Shift_JIS')
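A minimal sketch of that suggestion, assuming (purely for illustration) that the Content-Type header reported Shift_JIS; substitute whatever charset your response actually declares:
# Sketch: decode with the charset declared by the response instead of UTF-8.
charset = 'Shift_JIS'  # assumption: taken from the Content-Type header shown above
try:
    outputStr = b64decodedBytes.decode(charset)
    print(outputStr)
except UnicodeDecodeError:
    # If this also fails, the decoded payload may not be text in that charset at all.
    print('payload is not valid ' + charset + ' text')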

Not able to read usa.gov data in Python or R

Please go through the archival data: USA GOV Sample Data.
Now I want to read this file in R, but I get the error below:
result = fromJSON(textFileName)
Error in fromJSON(textFileName) : unexpected character 'u'
When I try to read it in Python, I get this error:
import json
records = [json.loads(line) for line in open(path)]
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
    codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4088: character maps to <undefined>
Can someone please help me with how I can read this kind of file?
I couldn't get the code the OP provided in the question to work on my system either (Windows/RStudio/Jupyter). I dug around and found this for R, adapting it to this case:
library(jsonlite)
out <- lapply(readLines("usagov_bitly_data2013-05-17-1368817803"), fromJSON)
df <- data.frame(Reduce(rbind, out))
Although the error I got in R is curiously different from yours.
result = fromJSON("usagov_bitly_data2013-05-17-1368817803")
#Error in parse_con(txt, bigint_as_char) : parse error: trailing garbage
# [ 34.730400, -86.586098 ] } { "a": "Mozilla\/5.0 (Windows N
# (right here) ------^
For Python, as mentioned by juanpa, it seems to be a matter of encoding. The following code works for me.
import json
import os
path=os.path.abspath("usagov_bitly_data2013-05-17-1368817803")
print(path)
file = open(path, encoding="utf8")
records = [json.loads(line) for line in file]
Solution in R:
library(jsonlite)
# if you have a local file
conn <- gzcon(file("usagov_bitly_data2013-05-17-1368817803.gz", "rb"))
# if you read it from URL
conn <- gzcon(url("http://1usagov.measuredvoice.com/bitly_archive/usagov_bitly_data2013-05-17-1368817803.gz"))
data <- stream_in(conn)

Python scraping TOR, script "To Russia, with love"

I am trying to scrape with BS4 via Tor, using the "To Russia With Love" tutorial from the Stem project.
I've rewritten the code a bit, using among other things this answer, and it now looks like this:
import io
import pycurl
import stem.process
from stem.util import term

SOCKS_PORT = 7000

def query(url):
    output = io.BytesIO()
    query = pycurl.Curl()
    query.setopt(pycurl.URL, url)
    query.setopt(pycurl.PROXY, 'localhost')
    query.setopt(pycurl.PROXYPORT, SOCKS_PORT)
    query.setopt(pycurl.PROXYTYPE, pycurl.PROXYTYPE_SOCKS5_HOSTNAME)
    query.setopt(pycurl.WRITEFUNCTION, output.write)
    try:
        query.perform()
        return output.getvalue()
    except pycurl.error as exc:
        return "Unable to reach %s (%s)" % (url, exc)

def print_bootstrap_lines(line):
    if "Bootstrapped " in line:
        print(term.format(line, term.Color.BLUE))

print(term.format("Starting Tor:\n", term.Attr.BOLD))

tor_process = stem.process.launch_tor_with_config(
    tor_cmd = '/Applications/TorBrowser.app/Contents/MacOS/Tor/tor.real',
    config = {
        'SocksPort': str(SOCKS_PORT),
        'ExitNodes': '{ru}',
        'GeoIPFile': r'/Applications/TorBrowser.app/Contents/Resources/TorBrowser/Tor/geoip',
        'GeoIPv6File': r'/Applications/TorBrowser.app/Contents/Resources/TorBrowser/Tor/geoip6'
    },
    init_msg_handler = print_bootstrap_lines,
)

print(term.format("\nChecking our endpoint:\n", term.Attr.BOLD))
print(term.format(query("https://www.atagar.com/echo.php"), term.Color.BLUE))
I am able to establish a Tor circuit, but at "checking our endpoint" I receive the following error:
Checking our endpoint:
Traceback (most recent call last):
File "<ipython-input-804-68f8df2c050b>", line 40, in <module>
print(term.format(query('https://www.atagar.com/echo.php'), term.Color.BLUE))
File "/Applications/anaconda/lib/python3.6/site-packages/stem/util/term.py", line 139, in format
if RESET in msg:
TypeError: a bytes-like object is required, not 'str'
What should I change to see the endpoint?
I've temporarily solved it by replacing the last line of the above code with
test=requests.get('https://www.atagar.com/echo.php')
soup = BeautifulSoup(test.content, 'html.parser')
print(soup)
but I'd like to know how to get the 'original' line working.
You must be using Python 3, while that code was written for Python 2. In Python 2, str and bytes are the same thing, and in Python 3, str is Python 2's unicode. You have to add a b directly before a string literal to make it a byte string in Python 3, e.g.:
b"this is a byte string"

Python 3.5 / Pastebin "Bad API request, invalid api_option"

I'm working on a Twitch IRC bot, and one of the components I wanted was the ability for the bot to save quotes to a Pastebin paste on close and then retrieve the same quotes on startup.
I've started with the saving part, but have hit a roadblock: I can't seem to get a valid post, and I can't figure out a working method.
#!/usr/bin/env python3
import urllib.parse
import urllib.request

# --------------------------------------------- Pastebin Requisites --------------------------------------------------
pastebin_key = 'my pastebin key'  # developer api key, required. GET: http://pastebin.com/api
pastebin_password = 'password'  # password for pastebin_username
pastebin_postexp = 'N'  # N = never expire
pastebin_private = 0  # 0 = Public 1 = unlisted 2 = Private
pastebin_url = 'http://pastebin.com/api/api_post.php'
pastebin_username = 'username'  # user corresponding with key

# --------------------------------------------- Value clean up --------------------------------------------------
pastebin_password = urllib.parse.quote(pastebin_password, safe='/')
pastebin_username = urllib.parse.quote(pastebin_username, safe='/')

# --------------------------------------------- Pastebin Functions --------------------------------------------------
def post(title, content):  # used for posting a new paste
    pastebin_vars = {'api_option': 'paste', 'api_user_key': pastebin_username, 'api_paste_private': pastebin_private,
                     'api_paste_name': title, 'api_paste_expire_date': pastebin_postexp, 'api_dev_key': pastebin_key,
                     'api_user_password': pastebin_password, 'api_paste_code': content}
    try:
        str_to_paste = ', '.join("{!s}={!r}".format(key, val) for (key, val) in pastebin_vars.items())  # dict to str :D
        str_to_paste = str_to_paste.replace(":", "")  # remove :
        str_to_paste = str_to_paste.replace("'", "")  # remove '
        str_to_paste = str_to_paste.replace(")", "")  # remove )
        str_to_paste = str_to_paste.replace(", ", "&")  # replace dividers with &
        urllib.request.urlopen(pastebin_url, urllib.parse.urlencode(pastebin_vars)).read()
        print('did that work?')
    except:
        print("post submit failed :(")
        print(pastebin_url + "?" + str_to_paste)  # print the output for test

post("test", "stuff")
I'm open to importing more libraries and such; I'm just not really sure what I'm doing wrong after working on this for two days straight :S
import urllib.parse
import urllib.request

PASTEBIN_KEY = 'xxx'
PASTEBIN_URL = 'https://pastebin.com/api/api_post.php'
PASTEBIN_LOGIN_URL = 'https://pastebin.com/api/api_login.php'
PASTEBIN_LOGIN = 'my_login_name'
PASTEBIN_PWD = 'yyy'

def pastebin_post(title, content):
    login_params = dict(
        api_dev_key=PASTEBIN_KEY,
        api_user_name=PASTEBIN_LOGIN,
        api_user_password=PASTEBIN_PWD
    )
    data = urllib.parse.urlencode(login_params).encode("utf-8")
    req = urllib.request.Request(PASTEBIN_LOGIN_URL, data)
    with urllib.request.urlopen(req) as response:
        pastebin_vars = dict(
            api_option='paste',
            api_dev_key=PASTEBIN_KEY,
            api_user_key=response.read(),
            api_paste_name=title,
            api_paste_code=content,
            api_paste_private=2,
        )
    return urllib.request.urlopen(PASTEBIN_URL, urllib.parse.urlencode(pastebin_vars).encode('utf8')).read()

rv = pastebin_post("This is my title", "These are the contents I'm posting")
print(rv)
Combining two different answers above gave me this working solution.
First, your try/except block is throwing away the actual error. You should almost never use a "bare" except clause without capturing or re-raising the original exception. See this article for a full explanation.
Once you remove the try/except, you will see the underlying error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "paste.py", line 42, in post
urllib.request.urlopen(pastebin_url, urllib.parse.urlencode(pastebin_vars)).read()
File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 461, in open
req = meth(req)
File "/usr/lib/python3.4/urllib/request.py", line 1112, in do_request_
raise TypeError(msg)
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.
This means you're trying to pass a unicode string into a function that's expecting bytes. When you do I/O (like reading/writing files on disk, or sending/receiving data over HTTP) you typically need to encode any unicode strings as bytes. See this presentation for a good explanation of unicode vs. bytes and when you need to encode and decode.
Next, this line:
urllib.request.urlopen(pastebin_url, urllib.parse.urlencode(pastebin_vars)).read()
is throwing away the response, so you have no way of knowing the result of your API call. Assign this to a variable or return it from your function so you can then inspect the value. It will either be a URL to the paste, or an error message from the API.
Next, I think your code is sending a lot of unnecessary parameters to the API and your str_to_paste statements aren't necessary.
I was able to make a paste using the following, much simpler, code:
import urllib.parse
import urllib.request

PASTEBIN_KEY = 'my-api-key'  # developer api key, required. GET: http://pastebin.com/api
PASTEBIN_URL = 'http://pastebin.com/api/api_post.php'

def post(title, content):  # used for posting a new paste
    pastebin_vars = dict(
        api_option='paste',
        api_dev_key=PASTEBIN_KEY,
        api_paste_name=title,
        api_paste_code=content,
    )
    return urllib.request.urlopen(PASTEBIN_URL, urllib.parse.urlencode(pastebin_vars).encode('utf8')).read()
Here it is in use:
>>> post("test", "hello\nworld.")
b'http://pastebin.com/v8jCkHDB'
I didn't know about pastebin until now. I read their api and tried it for the first time, and it worked perfectly fine.
Here's what I did:
I logged in to fetch the api_user_key.
Included that in the posting along with api_dev_key.
Checked the website, and the post was there.
Here's the code:
import urllib.parse
import urllib.request

def post(url, params):
    data = urllib.parse.urlencode(params).encode("utf-8")
    req = urllib.request.Request(url, data)
    with urllib.request.urlopen(req) as response:
        return response.read()

# Logging in to fetch api_user_key
login_url = "http://pastebin.com/api/api_login.php"
login_params = {"api_dev_key": "<the dev key they gave you>",
                "api_user_name": "<username goes here>",
                "api_user_password": "<password goes here>"}
api_user_key = post(login_url, login_params)

# Posting some random text
post_url = "http://pastebin.com/api/api_post.php"
post_params = {"api_dev_key": "<the dev key they gave you>",
               "api_option": "paste",
               "api_paste_code": "<head>Testing</head>",
               "api_paste_private": "0",
               "api_paste_name": "testing.html",
               "api_paste_expire_date": "10M",
               "api_paste_format": "html5",
               "api_user_key": api_user_key}
response = post(post_url, post_params)
Only the first three parameters are needed for posting something, the rest are optional.
FYI, the API doesn't seem to accept plain HTTP requests as of writing this, so make sure the URLs use https:// (e.g. https://pastebin.com/...).

UnicodeEncodeError Python 2.7

I am using Tweepy for authentication, and I am trying to print tweet text, but I keep getting a UnicodeEncodeError. I tried some methods but was unable to solve it.
# -*- coding: utf-8 -*-
import tweepy

consumer_key = ""
consumer_secret = ""
access_token = ''
access_token_secret = ''

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

public_tweets = api.home_timeline()
for tweet in public_tweets:
    print tweet.text.decode("utf-8")+'\n'
Error:
(venv) C:\Users\e2sn7cy\Documents\GitHub\Tweepy>python tweepyoauth.py
Throwback to my favourite! Miss this cutie :) #AdityaRoyKapur https://t.co/sxm8g1qhEb/n
Cristiano Ronaldo: 3 hat-tricks in his last 3 matches.
Lionel Messi: 3 trophies in his last 3 matches. http://t.co/For1It4QxF/n
How to Bring the Outdoors in With Indoor Gardens http://t.co/efQjwcszDo http://t.co/1NLxSzHxlI/n
Traceback (most recent call last):
File "tweepyoauth.py", line 17, in <module>
print tweet.text.decode("utf-8")+'/n'
File "C:\myPython\venv\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-7: ordinal not in range(128)
This line print tweet.text.decode("utf-8")+'/n' is the cause.
You decode tweet.text as UTF-8 into a unicode string. Fine up to here.
But next you try to concatenate it with a byte string '/n' (BTW, I think you really wanted \n), and Python tries to convert the unicode string to an ASCII byte string, giving the error.
You should concatenate with a unicode string to obtain a unicode string without conversion:
print tweet.text.decode("utf-8") + u'\n'
If this is not enough, it could be because your environment cannot directly print unicode strings. Then you should explicitly encode it in the native charset of your system:
print (tweet.text.decode("utf-8") + u'\n').encode('cp850')
[here replace 'cp850' (my charset) with the charset on your system]
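If you do not know the console charset in advance, a small sketch (my addition, assuming tweet.text is already a unicode object as Tweepy returns) is to ask Python for it and replace unmappable characters so printing never raises:
import sys

out_encoding = sys.stdout.encoding or 'ascii'  # fall back when output is piped
for tweet in public_tweets:
    # 'replace' turns characters the console cannot map into '?' instead of raising
    print (tweet.text + u'\n').encode(out_encoding, 'replace')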
