My code works fine for English text, but doesn't work for for Russian search_text. How can I fix it?
Error text
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 41-46: Body ('Москва') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.
My code
import requests
# search_text = "London" # OK: for english text
search_text = "Москва" # ERROR: 'latin-1' codec can't encode characters in position 41-46: Body ('Москва')
headers = {
'cookie': 'bci=6040686626671285074; _statid=a741e249-8adb-4c9a-8344-6e7e8360700a; viewport=762; _hd=h; tmr_lvid=ea50ffe34e269b16d061756e9a17b263; tmr_lvidTS=1609852383671; AUTHCODE=VCmGBS9d9sIxDnxN-hzApvPxPoLNADWCZLYyW8JOTcolv2dJjwH7ALYd8dNP9ljxZZuLvoKsDXgozEUt-PjSwXYEDt4syizx1I2LS58gb49kCFae-5uIap--mtLsff2ZqGbFqK5r7buboZ0_3; JSESSIONID=adca48748b8f0c58a926f5e4948f42c0c0aa9463798a9240.1f3566ed; LASTSRV=ok.ru; msg_conf=2468555756792551; TZ=6; _flashVersion=0; CDN=; nbp=; tmr_detect=0%7C1609852395541; cudr=0; klos=0; tmr_reqNum=4; TZD=6.200; TD=200',
}
data = '''{\n "id": 24,\n "parameters": {\n "query": "''' + search_text + '''"\n }\n}'''
response = requests.post('https://ok.ru/web-api/v2/search/suggestCities', headers=headers, data=data)
json_data = response.json()
print(json_data['result'][0]['id'])
I tried
city_name = city_name.encode('utf-8')
but received TypeError: must be str, not bytes
Try adding this after the line where you create the data variable before you post the request
data=data.encode() #will produce bytes object encoded with utf-8
in my case:
this worked
import json
import http.client
f = open('YOURFILE.json', encoding="utf8")
data = json.load(f)
message = json.dumps(data['MESSAGE'],ensure_ascii=False).encode('utf-8').decode('unicode-escape')
this is json file content:
{
"MESSAGE": "یو تی اف 8 کاراکتر"
I have code that reads a list of numbers that are parameters for a python order:
def search(number):
url = "http://localhost:8080/sistem/checkNumberStatus"
payload = '{\n\"SessionName\":\"POC\",\n\"celFull\":\"'+number+'\"\n}'
headers = {
'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data = payload)
object = open('numbers_wpp.txt', 'r')
for numbers in object:
search(number)
But, when I print the payload of my json the output is:
{
"SessionName":"POC",
"celFull":"5512997708936
"
}
{
"SessionName":"POC",
"celFull":"5512997709337
"
}
{
"SessionName":"POC",
"celFull":"5512992161195"
}
When reading the file with 3 numbers, the quotes in the celFull attribute closed correctly only in the last loop (last number), while the first two were broken into quotes. This break is giving error in queries.
If numbers_wpp.txt is a file that has each number on a different line, the following code would iterate over the file line by line.
for number in object:
search(number) # <--- numbers is actually a line in your file.
Since each number is on a new line, the preceding line has a '\n' at the end.
So
123
456
Is actually
123\n456
So number is actually 123\n. This causes your payload to have a \n before the quote closes.
You could fix this by calling number.strip() which strips whitespace from either end of the string.
Alternatively, you should consider not using a handcrafted json string and let the requests library do that for you. Documentation
def search(number):
url = "http://localhost:8080/sistem/checkNumberStatus"
payload = {"SessionName": "POC", "celFull": number} # <-- python dict
response = requests.post(url, data=payload)
object = open('numbers_wpp.txt', 'r')
for line in object:
number = int(line.strip()) # <-- convert to an integer
search(number)
I get a decoding error trying to decode my json data obtained from wikidata. I've read it may be because I'm trying to decode bytes instead of UTF-8 but I've tried to decode into UTF-8 and I can't seem to find the way to do so... Here's my method code (the parameter is a string and the return a boolean):
def es_enfermedad(candidato):
url = 'https://query.wikidata.org/sparql'
query = """
SELECT ?item WHERE {
?item rdfs:label ?nombre.
?item wdt:P31 ?tipo.
VALUES ?tipo {wd:Q12135}
FILTER(LCASE(?nombre) = "%s"#en)
}
""" % (candidato)
r = requests.get(url, params = {'format': 'json', 'query': query})
data = r.json()
return len(data['results']['bindings']) > 0
Using requests library to execute http GET that return JSON response i'm getting this error when response string contains unicode char:
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 20 (char 19)
Execute same http request with Postman the json output is:
{ "value": "VILLE D\u0019ANAUNIA" }
My python code is:
data = requests.get(uri, headers=HEADERS).text
json_data = json.loads(data)
Can I remove or replace all Unicode chars before executing conversion with json.loads(...)?
It is likely to be caused by a RIGHT SINGLE QUOTATION MARK U+2019 (’). For reasons I cannot guess, the high order byte has been dropped leaving you with a control character which should be escaped in a correct JSON string.
So the correct way would be to control what exactly the API returns. If id does return a '\u0019' control character, you should contact the API owner because the problem should be there.
As a workaround, you can try to limit the problem for your processing by filtering out non ascii or control characters:
data = requests.get(uri, headers=HEADERS).text
data = ''.join((i for i in data if 0x20 <= ord(i) < 127)) # filter out unwanted chars
json_data = json.loads(data)
You should get {'value': 'VILLE DANAUNIA'}
Alternatively, you can replace all unwanted characters with spaces:
data = requests.get(uri, headers=HEADERS).text
data = ''.join((i if 0x20 <= ord(i) < 127 else ' ' for i in data))
json_data = json.loads(data)
You would get {'value': 'VILLE D ANAUNIA'}
The code below works on python 2.7:
import json
d = json.loads('{ "value": "VILLE D\u0019ANAUNIA" }')
print(d)
The code below works on python 3.7:
import json
d = json.loads('{ "value": "VILLE D\u0019ANAUNIA" }', strict=False)
print(d)
Output:
{u'value': u'VILLE D\x19ANAUNIA'}
Another point is that requests get return the data as json:
r = requests.get('https://api.github.com/events')
r.json()
I'm working with text contained in JS variables on a webpage and extracting strings using regex, then turning it into JSON objects in python using json.loads().
The issue I'm having is the unquoted "keys". Right now, I'm doing a series of replacements (code below) to "" each key in each string, but what I want is to dynamically identify any unquoted keys before passing the string into json.loads().
Example 1 with no space after : character
json_data1 = '[{storeName:"testName",address:"12345 Road",address2:"Suite 500",city:"testCity",storeImage:"http://www.testLink.com",state:"testState",phone:"999-999-9999",lat:99.9999,lng:-99.9999}]'
Example 2 with space after : character
json_data2 = '[{storeName: "testName",address: "12345 Road",address2: "Suite 500",city: "testCity",storeImage: "http://www.testLink.com",state: "testState",phone: "999-999-9999",lat: 99.9999,lng: -99.9999}]'
Example 3 with space after ,: characters
json_data3 = '[{storeName: "testName", address: "12345 Road", address2: "Suite 500", city: "testCity", storeImage: "http://www.testLink.com", state: "testState", phone: "999-999-9999", lat: 99.9999, lng: -99.9999}]'
Example 4 with space after : character and newlines
json_data4 = '''[
{
storeName: "testName",
address: "12345 Road",
address2: "Suite 500",
city: "testCity",
storeImage: "http://www.testLink.com",
state: "testState",
phone: "999-999-9999",
lat: 99.9999, lng: -99.9999
}]'''
I need to create pattern that identifies which are keys and not random string values containing characters such as the string link in storeImage. In other words, I want to dynamically find keys and double-quote them to use json.loads() and return a valid JSON object.
I'm currently replacing each key in the text this way
content = re.sub('storeName:', '"storeName":', content)
content = re.sub('address:', '"address":', content)
content = re.sub('address2:', '"address2":', content)
content = re.sub('city:', '"city":', content)
content = re.sub('storeImage:', '"storeImage":', content)
content = re.sub('state:', '"state":', content)
content = re.sub('phone:', '"phone":', content)
content = re.sub('lat:', '"lat":', content)
content = re.sub('lng:', '"lng":', content)
Returned as string representing valid JSON
json_data = [{"storeName": "testName", "address": "12345 Road", "address2": "Suite 500", "city": "testCity", "storeImage": "http://www.testLink.com", "state": "testState", "phone": "999-999-9999", "lat": 99.9999, "lng": -99.9999}]
I'm sure there is a better way of doing this but I haven't been able to find or come up with a regex pattern to handle these. Any help is greatly appreciated!
Something like this should do the job: ([{,]\s*)([^"':]+)(\s*:)
Replace for: \1"\2"\3
Example: https://regex101.com/r/oV0udR/1
That repetition is of course unnecessary. You could put everything into a single regex:
content = re.sub(r"\b(storeName|address2?|city|storeImage|state|phone|lat|lng):", r'"\1":', content)
\1 contains the match within the first (in this case, only) set of parentheses, so "\1": surrounds it with quotes and adds back the colon.
Note the use of a word boundary anchor to make sure we match only those exact words.
Regex: (\w+)\s?:\s?("?[^",]+"?,?)
Regex demo
import re
text = 'storeName: "testName", '
text = re.sub('(\w+)\s?:\s?("?[^",]+"?,?)', "\"\g<1>\":\g<2>", text)
print(text)
Output: "storeName":"testName",