sanitize unicode from json - python

How do I properly remove the Unicode characters so I can load the JSON?
data = json.loads(json_string)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 72 (char 71)
{"user": {"user_id": 455830511, "username": "dualipa_384", "name": "Dua\xa0Lipa", "private": false, "verified_user": false, "avatar_url": "https://uploads.cdn.triller.co/v1/avatars/455830511/1619366527_avatar.jpg", "profile_cover_url": "None", "dm_registered": true, "storefront_url": "None", "creator_status": false, "contributor_status": false, "user_uuid": "bce20042-a143-4caf-adbc-6b39bbb2d30a", "about_me": "Go stream my new album Future Nostalgia The Moonlight Edition❤️\ndualipa.co/weregood-video", "auto_confirmed": true, "instagram_handle": "#dualipa", "instagram_verified": false, "soundcloud_url": "None", "button_text": "None", "button_text_color": "None", "button_background_color": "None", "button_url": "None", "follower_count": 0, "followed_count": 55, "verified": true, "failed_age_validation": false, "has_snaps": false, "profile_type": "public", "blocking_user": false, "blocked_by_user": false, "followed_by_me": "false", "follower_of_me": "false", "subscription": {"is_subscribed": false}}, "status": true}
I have tried the following, but it did not work:
json_string = json_string.replace(u'\xa0', u'')
json_string = unicodedata.normalize("NFKD", json_string)

There is a newline character within a string. JSON does not allow line breaks within strings. Replace the line break with an escape sequence:
json.loads(json_string.replace('\n', r'\n'))
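Putting both fixes together as a rough sketch (the sample string below is my own, cut down to reproduce the two problems: a literal \xa0, which is a Python escape and not a valid JSON one, and a raw line break inside a string):

```python
import json

# Hypothetical sample: a literal \xa0 escape and a raw newline in a string value.
json_string = '{"name": "Dua\\xa0Lipa", "about_me": "Go stream\ndualipa.co/weregood-video"}'

cleaned = (json_string
           .replace('\\xa0', ' ')   # drop the invalid \xa0 escape
           .replace('\n', r'\n'))   # escape the raw line break
data = json.loads(cleaned)
print(data["name"])  # Dua Lipa
```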

This is how it worked for me:
import json
import unicodedata
json_string = json.loads(json.dumps(json_string))
json_string = json_string.replace("\"false\"", "\"False\"").replace("false", "\"False\"").replace("true", "\"True\"").replace("\n", " ")
json_string = unicodedata.normalize("NFKD", json_string)
json_string = json_string.replace(u'\xa0', u'')
json_string = json_string.replace('\n', r'\n')
data = json.loads(json_string)
print(data)


parse weird yaml file uploaded to server with python

I have a config server where we read the service config from.
In there we have a yaml file that I need to read but it has a weird format on the server looking like:
{
  "document[0].Name": "os",
  "document[0].Rules.Rule1": false,
  "document[0].Rules.Rule2": true,
  "document[0].MinScore": 100,
  "document[0].MaxScore": 100,
  "document[0].ClusterId": 22,
  "document[0].Enabled": true,
  "document[0].Module": "device",
  "document[0].Description": "",
  "document[0].Modified": 1577880000000,
  "document[0].Created": 1577880000000,
  "document[0].RequiredReview": false,
  "document[0].Type": "NO_CODE",
  "document[1].Name": "rule with params test",
  "document[1].Rules.Rule": false,
  "document[1].MinScore": 100,
  "document[1].MaxScore": 100,
  "document[1].ClusterId": 29,
  "document[1].Enabled": true,
  "document[1].Module": "device",
  "document[1].Description": "rule with params test",
  "document[1].Modified": 1577880000000,
  "document[1].Created": 1577880000000,
  "document[1].RequiredReview": false,
  "document[1].Type": "NO_CODE",
  "document[1].ParametersRules[0].Features.feature1.op": ">",
  "document[1].ParametersRules[0].Features.feature1.value": 10,
  "document[1].ParametersRules[0].Features.feature2.op": "==",
  "document[1].ParametersRules[0].Features.feature2.value": true,
  "document[1].ParametersRules[0].Features.feature3.op": "range",
  "document[1].ParametersRules[0].Features.feature3.value[0]": 4,
  "document[1].ParametersRules[0].Features.feature3.value[1]": 10,
  "document[1].ParametersRules[0].Features.feature4.op": "!=",
  "document[1].ParametersRules[0].Features.feature4.value": "None",
  "document[1].ParametersRules[0].DecisionType": "all",
  "document[1].ParametersRules[1].Features.feature5.op": "<",
  "document[1].ParametersRules[1].Features.feature5.value": 1000,
  "document[1].ParametersRules[1].DecisionType": "any"
}
and this is how the dict is supposed to look (it might not be perfect, I did it by hand):
[
  {
    "Name": "os",
    "Rules": { "Rule1": false, "Rule2": true },
    "MinScore": 100,
    "MaxScore": 100,
    "ClusterId": 22,
    "Enabled": true,
    "Module": "device",
    "Description": "",
    "Modified": 1577880000000,
    "Created": 1577880000000,
    "RequiredReview": false,
    "Type": "NO_CODE"
  },
  {
    "Name": "rule with params test",
    "Rules": { "Rule": false },
    "MinScore": 100,
    "MaxScore": 100,
    "ClusterId": 29,
    "Enabled": true,
    "Module": "device",
    "Description": "rule with params test",
    "Modified": 1577880000000,
    "Created": 1577880000000,
    "RequiredReview": false,
    "Type": "NO_CODE",
    "ParametersRules": [
      {
        "Features": {
          "feature1": { "op": ">", "value": 10 },
          "feature2": { "op": "==", "value": true },
          "feature3": { "op": "range", "value": [4, 10] },
          "feature4": { "op": "!=", "value": "None" }
        },
        "DecisionType": "all"
      },
      {
        "Features": { "feature5": { "op": "<", "value": 1000 } },
        "DecisionType": "any"
      }
    ]
  }
]
I don't have a way to change how the file is uploaded to the server (it's a different team and quite the headache), so I need to parse it using Python.
My thought is that someone probably encountered it before so there must be a package that solves it, and I hoped that someone here might know.
Thanks.
I have a sample; I hope it'll help you:
import yaml
import os

file_dir = os.path.dirname(os.path.abspath(__file__))
with open(f"{file_dir}/file.json") as f:
    config = yaml.full_load(f)  # JSON is valid YAML, so this parses the JSON file
with open(f"{file_dir}/meta.yaml", "w+") as yaml_file:
    yaml.dump(config, yaml_file, allow_unicode=True)  # this one turns your JSON file into YAML
Your current output is:
- ClusterId: 22
  Created: 1577880000000
  Description: ''
  Enabled: true
  MaxScore: 100
  MinScore: 100
  Modified: 1577880000000
  Module: device
  Name: os
  RequiredReview: false
  Rules:
    Rule1: false
    Rule2: true
  Type: NO_CODE
- ClusterId: 29
  Created: 1577880000000
  Description: rule with params test
  Enabled: true
  MaxScore: 100
  MinScore: 100
  Modified: 1577880000000
  Module: device
  Name: rule with params test
  ParametersRules:
  - DecisionType: all
    Features:
      feature1:
        op: '>'
        value: 10
      feature2:
        op: ==
        value: true
      feature3:
        op: range
        value:
        - 4
        - 10
      feature4:
        op: '!='
        value: None
  - DecisionType: any
    Features:
      feature5:
        op: <
        value: 1000
  RequiredReview: false
  Rules:
    Rule: false
  Type: NO_CODE
Here is my approach so far. It's far from perfect, but I hope it gives you an idea of how to tackle it.
from __future__ import annotations  # can be removed in Python 3.10+


def clean_value(o: str | bool | int) -> str | bool | int | None:
    """handle int, None, or bool values encoded as a string"""
    if isinstance(o, str):
        lowercase = o.lower()
        if lowercase.isnumeric():
            return int(o)
        elif lowercase == 'none':
            return None
        elif lowercase in ('true', 'false'):
            return lowercase == 'true'
            # return eval(o.capitalize())
    return o


# noinspection PyUnboundLocalVariable
def process(o: dict):
    # final return list
    docs_list = []

    doc: dict[str, list | dict | str | bool | int | None]
    doc_idx: int

    def add_new_doc(new_idx: int):
        """Push new item to result list, and increment index."""
        nonlocal doc_idx, doc
        doc_idx = new_idx
        doc = {}
        docs_list.append(doc)

    # add initial `dict` object to return list
    add_new_doc(0)

    for k, v in o.items():
        doc_id, key, *parts = k.split('.')
        doc_id: str
        key: str
        parts: list[str]

        curr_doc_idx = int(doc_id.rsplit('[', 1)[1].rstrip(']'))
        if curr_doc_idx > doc_idx:
            add_new_doc(curr_doc_idx)

        if not parts:
            final_val = clean_value(v)
        elif key in doc:
            # For example, when we encounter `document[0].Rules.Rule2`, but we've already encountered
            # `document[0].Rules.Rule1` - so in this case, we add value to the existing dict.
            final_val = temp_dict = doc[key]
            temp_dict: dict
            for p in parts[:-1]:
                temp_dict = temp_dict.setdefault(p, {})
            temp_dict[parts[-1]] = clean_value(v)
        else:
            final_val = temp_dict = {}
            for p in parts[:-1]:
                temp_dict = temp_dict[p] = {}
            temp_dict[parts[-1]] = clean_value(v)

        doc[key] = final_val

    return docs_list
if __name__ == '__main__':
    import json
    from pprint import pprint

    j = """{
    "document[0].Name": "os",
    "document[0].Rules.Rule1": false,
    "document[0].Rules.Rule2": "true",
    "document[0].MinScore": 100,
    "document[0].MaxScore": 100,
    "document[0].ClusterId": 22,
    "document[0].Enabled": true,
    "document[0].Module": "device",
    "document[0].Description": "",
    "document[0].Modified": 1577880000000,
    "document[0].Created": 1577880000000,
    "document[0].RequiredReview": false,
    "document[0].Type": "NO_CODE",
    "document[1].Name": "rule with params test",
    "document[1].Rules.Rule": false,
    "document[1].MinScore": 100,
    "document[1].MaxScore": 100,
    "document[1].ClusterId": 29,
    "document[1].Enabled": true,
    "document[1].Module": "device",
    "document[1].Description": "rule with params test",
    "document[1].Modified": 1577880000000,
    "document[1].Created": 1577880000000,
    "document[1].RequiredReview": false,
    "document[1].Type": "NO_CODE",
    "document[1].ParametersRules[0].Features.feature1.op": ">",
    "document[1].ParametersRules[0].Features.feature1.value": 10,
    "document[1].ParametersRules[0].Features.feature2.op": "==",
    "document[1].ParametersRules[0].Features.feature2.value": true,
    "document[1].ParametersRules[0].Features.feature3.op": "range",
    "document[1].ParametersRules[0].Features.feature3.value[0]": 4,
    "document[1].ParametersRules[0].Features.feature3.value[1]": 10,
    "document[1].ParametersRules[0].Features.feature4.op": "!=",
    "document[1].ParametersRules[0].Features.feature4.value": "None",
    "document[1].ParametersRules[0].DecisionType": "all",
    "document[1].ParametersRules[1].Features.feature5.op": "<",
    "document[1].ParametersRules[1].Features.feature5.value": 1000,
    "document[1].ParametersRules[1].DecisionType": "any"
    }"""

    d: dict[str, str | bool | int | None] = json.loads(j)
    result = process(d)
    pprint(result)
Result:
[{'ClusterId': 22,
  'Created': 1577880000000,
  'Description': '',
  'Enabled': True,
  'MaxScore': 100,
  'MinScore': 100,
  'Modified': 1577880000000,
  'Module': 'device',
  'Name': 'os',
  'RequiredReview': False,
  'Rules': {'Rule1': False, 'Rule2': True},
  'Type': 'NO_CODE'},
 {'ClusterId': 29,
  'Created': 1577880000000,
  'Description': 'rule with params test',
  'Enabled': True,
  'MaxScore': 100,
  'MinScore': 100,
  'Modified': 1577880000000,
  'Module': 'device',
  'Name': 'rule with params test',
  'ParametersRules[0]': {'DecisionType': 'all',
                         'Features': {'feature1': {'value': 10},
                                      'feature2': {'op': '==', 'value': True},
                                      'feature3': {'op': 'range',
                                                   'value[0]': 4,
                                                   'value[1]': 10},
                                      'feature4': {'op': '!=', 'value': None}}},
  'ParametersRules[1]': {'DecisionType': 'any',
                         'Features': {'feature5': {'value': 1000}}},
  'RequiredReview': False,
  'Rules': {'Rule': False},
  'Type': 'NO_CODE'}]
Of course, one of the problems is that it doesn't account for indexed paths like document[1].ParametersRules[0].Features.feature1.op, which should ideally create a new sub-list to add values to.
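As a rough sketch of how those indexed keys could be folded into real lists after the fact (merge_indexed_keys and its regex are my own, not part of the code above):

```python
import re

_INDEXED = re.compile(r'^(.+)\[(\d+)\]$')

def merge_indexed_keys(obj):
    """Recursively fold keys like 'value[0]', 'value[1]' into a list under 'value'."""
    if not isinstance(obj, dict):
        return obj
    result = {}
    for key, val in obj.items():
        val = merge_indexed_keys(val)
        m = _INDEXED.match(key)
        if m:
            name, idx = m.group(1), int(m.group(2))
            lst = result.setdefault(name, [])
            # pad so out-of-order indices still land in the right slot
            while len(lst) <= idx:
                lst.append(None)
            lst[idx] = val
        else:
            result[key] = val
    return result

print(merge_indexed_keys({'value[0]': 4, 'value[1]': 10, 'op': 'range'}))
# {'value': [4, 10], 'op': 'range'}
```

Running it over the output of process() would turn 'ParametersRules[0]'/'ParametersRules[1]' into a single 'ParametersRules' list.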

Is there a way to insert JSON into a Postgres database using psycopg2?

I'm trying to insert the following data into a postgres database
{
  "id": 131739425477632000,
  "user_name": "KithureKindiki",
  "content": "#Fchurii You're right, Francis.",
  "deleted": 1,
  "created": "2011-11-02 14:28:21",
  "modified": "2019-01-10 13:05:42",
  "tweet": "{\"contributors\": null, \"truncated\": false, \"text\": \"#Fchurii You're right, Francis.\", \"is_quote_status\": false, \"in_reply_to_status_id\": 131738250736971778, \"id\": 131739425477632000, \"favorite_count\": 0, \"source\": \"Twitter Web Client\", \"retweeted\": false, \"coordinates\": null, \"entities\": {\"symbols\": [], \"user_mentions\": [{\"indices\": [0, 8], \"id_str\": \"284946979\", \"screen_name\": \"Fchurii\", \"name\": \"Francis Gachuri\", \"id\": 284946979}], \"hashtags\": [], \"urls\": []}, \"in_reply_to_screen_name\": \"Fchurii\", \"in_reply_to_user_id\": 284946979, \"retweet_count\": 0, \"id_str\": \"131739425477632000\", \"favorited\": false, \"user\": {\"follow_request_sent\": false, \"has_extended_profile\": false, \"profile_use_background_image\": true, \"contributors_enabled\": false, \"id\": 399935104, \"verified\": false, \"translator_type\": \"none\", \"profile_text_color\": \"333333\", \"profile_image_url_https\": \"https://pbs.twimg.com/profile_images/538310980468764672/xpJnlD_-_normal.jpeg\", \"profile_sidebar_fill_color\": \"DDEEF6\", \"entities\": {\"description\": {\"urls\": []}}, \"followers_count\": 23555, \"profile_sidebar_border_color\": \"C0DEED\", \"id_str\": \"399935104\", \"default_profile_image\": false, \"listed_count\": 17, \"is_translation_enabled\": false, \"utc_offset\": null, \"statuses_count\": 246, \"description\": \"Majority Leader, The Senate of the Republic of Kenya\", \"friends_count\": 244, \"location\": \"\", \"profile_link_color\": \"1DA1F2\", \"profile_image_url\": \"http://pbs.twimg.com/profile_images/538310980468764672/xpJnlD_-_normal.jpeg\", \"notifications\": false, \"geo_enabled\": false, \"profile_background_color\": \"C0DEED\", \"profile_background_image_url\": \"http://abs.twimg.com/images/themes/theme1/bg.png\", \"screen_name\": \"KithureKindiki\", \"lang\": \"en\", \"following\": false, \"profile_background_tile\": false, \"favourites_count\": 11, \"name\": \"Kithure Kindiki\", \"url\": null, \"created_at\": \"Fri Oct 28 08:09:57 +0000 2011\", \"profile_background_image_url_https\": \"https://abs.twimg.com/images/themes/theme1/bg.png\", \"time_zone\": null, \"protected\": false, \"default_profile\": true, \"is_translator\": false}, \"geo\": null, \"in_reply_to_user_id_str\": \"284946979\", \"lang\": \"en\", \"created_at\": \"Wed Nov 02 14:28:21 +0000 2011\", \"in_reply_to_status_id_str\": \"131738250736971778\", \"place\": null}",
  "politician_id": 41,
  "approved": 1,
  "reviewed": 1,
  "reviewed_at": "2019-01-10 13:05:42",
  "review_message": null,
  "retweeted_id": null,
  "retweeted_content": null,
  "retweeted_user_name": null
}
using the following code
qwery = f"INSERT INTO deleted_tweets(id,user_name,content,deleted,created,modified,tweet,politician_id,approved,reviewed,reviewed_at,review_message,retweeted_id,retweeted_content,retweeted_user_name) VALUES {row['id'], row['user_name'], row['content'], bool(row['deleted']), row['created'], row['modified'],row['tweet'],row['politician_id'],bool(row['approved']), bool(row['reviewed']),row['reviewed_at'],row['review_message'],row['retweeted_id'],row['retweeted_content'],row['retweeted_user_name']}"
qwery = qwery.replace('None', 'null')
cursor.execute(qwery)
However, I get the following error
*** psycopg2.errors.SyntaxError: syntax error at or near "re"
LINE 1: ... null, "truncated": false, "text": "#Fchurii You\'re right, ...
I know this is due to the single quote, but I'm not sure how to overcome it. I've tried adding backslashes to the string, something like \"text\": \"#Fchurii You\\'re right, Francis.\", but I'm still getting the same error. Any ideas on how to bypass this?
Try:
query = """INSERT INTO deleted_tweets (id, user_name, content, deleted, created, modified, tweet,
                                       politician_id, approved, reviewed, reviewed_at, review_message,
                                       retweeted_id, retweeted_content, retweeted_user_name)
           VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
data = [row['id'], row['user_name'], row['content'], bool(row['deleted']), row['created'], row['modified'], row['tweet'], row['politician_id'], bool(row['approved']), bool(row['reviewed']), row['reviewed_at'], row['review_message'], row['retweeted_id'], row['retweeted_content'], row['retweeted_user_name']]
cursor.execute(query, data)
You need one %s placeholder per column, and psycopg2 converts Python None to SQL NULL on its own, so there is no need to replace it by hand. Passing the values separately also means quotes inside the text can't break the SQL.

Save csv in python with json list

I am trying to parse multiple JSON files in Python from a folder and save them to a single CSV.
This is my 'json' file format:
{
  "width": 4032,
  "height": 3024,
  "ispano": false,
  "objects": [
    {
      "key": "vERA48mAToOV36JrGge-8w",
      "label": "regulatory--no-heavy-goods-vehicles--g2",
      "bbox": {
        "xmin": 1702.96875,
        "ymin": 812.84765625,
        "xmax": 2181.375,
        "ymax": 1304.54296875
      },
      "properties": {
        "barrier": false,
        "occluded": false,
        "out-of-frame": false,
        "exterior": false,
        "ambiguous": false,
        "included": false,
        "direction-or-information": false,
        "highway": false,
        "dummy": false
      }
    },
    {
      "key": "MXdgK-YrQrSrATvLYkJ7kQ",
      "label": "information--dead-end--g1",
      "bbox": {
        "xmin": 1283.625,
        "ymin": 488.7421875,
        "xmax": 1739.390625,
        "ymax": 1050.57421875
      },
      "properties": {
        "barrier": false,
        "occluded": false,
        "out-of-frame": false,
        "exterior": false,
        "ambiguous": false,
        "included": false,
        "direction-or-information": false,
        "highway": false,
        "dummy": false
      }
    }
  ]
}
I don't need all of the information, so I go through each sub-dictionary. This is how I extract the data in Python:
import pandas as pd
import glob
import json
import os
from datetime import datetime
import csv

data = []
root = glob.glob("./labels/*.json")
for single_file in root:
    with open(single_file, "r") as f:
        json_file = json.load(f)
I iterate over each sub-dict like this and append to a list:
for sub_list in json_file["objects"]:
    print(sub_list)
    lst = []
    count = 0
    for key, val in sub_list.items():
        #print(val)
        lst.append([
            sub_list["key"],
            sub_list["label"],
            sub_list["bbox"]["xmin"],
            sub_list["bbox"]["ymin"],
            sub_list["bbox"]["xmax"],
            sub_list["bbox"]["ymax"]
        ])
    #print(lst)
    # Add headers
    lst.insert(0, ["key","label","xmin","ymin","xmax","ymax"])
    dir = "./"
    with open(os.path.join(dir, "test.csv"), "w", newline="") as d:
        writer = csv.writer(d)
        #writer.writerow(lst)
        writer.writerows(lst)
        count += 1
print('updated csv')
It saves a CSV file named 'test.csv', but only with the information of the last object, not from all of the JSON files.
I want to save a CSV which includes the mentioned information from all of the JSON files.
I want the CSV like this:
| file_name | key | label | xmin | ymin | xmax | ymax |
It includes the corresponding file_name, key, label, xmin, ymin, xmax, ymax.
Could you please help me to solve my problem?
You can just write each row to the file as you iterate over the objects:
import glob
import json
import csv

with open('test.csv', 'w', newline='') as f_csv:
    csv_output = csv.writer(f_csv)
    csv_output.writerow(["file_name", "key", "label", "xmin", "ymin", "xmax", "ymax"])
    for single_file in glob.glob("*.json"):
        print(single_file)
        with open(single_file) as f_json:
            json_data = json.load(f_json)
        for object in json_data["objects"]:
            csv_output.writerow([
                single_file,
                object["key"],
                object["label"],
                object["bbox"]["xmin"],
                object["bbox"]["ymin"],
                object["bbox"]["xmax"],
                object["bbox"]["ymax"]
            ])
Giving you test.csv as follows:
file_name,key,label,xmin,ymin,xmax,ymax
test1.json,vERA48mAToOV36JrGge-8w,regulatory--no-heavy-goods-vehicles--g2,1702.96875,812.84765625,2181.375,1304.54296875
test1.json,MXdgK-YrQrSrATvLYkJ7kQ,information--dead-end--g1,1283.625,488.7421875,1739.390625,1050.57421875
test2.json,vERA48mAToOV36JrGge-8w,regulatory--no-heavy-goods-vehicles--g3,1702.96875,812.84765625,2181.375,1304.54296875
test2.json,MXdgK-YrQrSrATvLYkJ7kQ,information--dead-end--g1,1283.625,488.7421875,1739.390625,1050.57421875

POST requests from Python issue

I have been doing a lot of reading on this website regarding POST requests from Python to an API. But despite all the recommendations to use the json library within Python, I still can't quite get my head around it.
My current predicament is that I need to make an API call, grab certain fields, and post them to another API.
An example of the information I receive from my initial API request:
{
  "metadata": {
    "configurationVersions": [
      3
    ],
    "clusterVersion": "1.174.168.20190814-173650"
  },
  "id": "5c1547a6-61ca-4dc3-8971-ec8d2f542592",
  "name": "Registration",
  "enabled": false,
  "dataType": "STRING",
  "dataSources": [
    {
      "enabled": true,
      "source": "POST_PARAMETER",
      "valueProcessing": {
        "splitAt": "",
        "trim": false
      },
      "parameterName": "f=register",
      "scope": {
        "tagOfProcessGroup": "Production"
      }
    }
  ],
  "normalization": "ORIGINAL",
  "aggregation": "FIRST",
  "confidential": true,
  "skipPersonalDataMasking": true
}
After this call, I extract the data in the following way:
def ReqOutput(output):
    x = ""
    out = ()
    inReq = ["name","enabled","dataType","dataSources","normalization","aggregation","confidential","skipPersonalDataMasking"]
    for i in output.items():
        for item in inReq:
            if item in i:
                x = x + str(i)
                out = out + i
    return json.dumps(out)
As recommended in other threads, I used the json.dumps method to convert my Python tuple to JSON. However, I feel like it is not working as intended.
Pre json.dumps output:
('name', 'Registration', 'enabled', False, 'dataType', 'STRING', 'dataSources', [{'enabled': True, 'source': 'POST_PARAMETER', 'valueProcessing': {'splitAt': '', 'trim': False}, 'parameterName': 'f=register', 'scope': {'tagOfProcessGroup': 'Production'}}], 'normalization', 'ORIGINAL', 'aggregation', 'FIRST', 'confidential', True, 'skipPersonalDataMasking', True)
Post json.dumps output:
["name", "Registration", "enabled", false, "dataType", "STRING", "dataSources", [{"enabled": true, "source": "POST_PARAMETER", "valueProcessing": {"splitAt": "", "trim": false}, "parameterName": "f=register", "scope": {"tagOfProcessGroup": "Production"}}], "normalization", "ORIGINAL", "aggregation", "FIRST", "confidential", true, "skipPersonalDataMasking", true]
I then try and POST this to another API using:
def PostRequest(data):
    postURL = "XXXX"
    headers = {'Content-type': 'application/json'}
    r = requests.post(postURL, data=data, headers=headers)
    print(r.text)
Where I am finally met with the error:
{"error":{"code":400,"message":"Could not map JSON at '' near line 1 column 1"}}
Try getting rid of the for loops in favor of a dict comprehension:
def ReqOutput(output):
    inReq = ["name","enabled","dataType","dataSources","normalization","aggregation","confidential","skipPersonalDataMasking"]
    out = {key: val for key, val in output.items() if key in inReq}
    return json.dumps(out)
This is more readable and will always give you a dict with the attributes from output that are in inReq. The reason your JSON looks like that is that serializing a tuple gives you an Array-like structure. If what you want is an Object structure, you should serialize a dict-like object.
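A small sketch of the round trip (build_payload is my own name, and the response dict is trimmed for illustration). With requests you can also pass the dict directly via the json= keyword, which serialises it and sets the Content-Type header for you:

```python
import json

def build_payload(output, wanted):
    """Keep only the wanted keys - the result serialises to a JSON Object, not an Array."""
    return {key: val for key, val in output.items() if key in wanted}

# Hypothetical response, trimmed for illustration.
response_data = {"name": "Registration", "enabled": False, "dataType": "STRING", "id": "5c1547a6"}
payload = build_payload(response_data, ["name", "enabled", "dataType"])
body = json.dumps(payload)
print(body)

# To POST it:
#     requests.post(post_url, json=payload)   # requests sets Content-Type: application/json
```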

TypeError: string indices must be integers

I tried to get data from JSON, but instead of the data I get a TypeError:
'TypeError: string indices must be integers'
Here is my code:
api.authenticate(LOGIN, CONN)
profil = api.get_profile("abc")
data = json.dumps(profil, indent=4)
print(data["login"])
JSON file:
{
  "public_email": "",
  "violation_url": null,
  "is_blocked": false,
  "links_published": 0,
  "login": "abc",
  "links_added": 1,
  "gg": "",
  "signup_date": "2014-10-26 21:15:41",
}
I've been looking for a solution (Google, SO) but I cannot find one that works for me.
There is an extra , at the end of your JSON file. Delete it, so the file looks like this:
{
  "public_email": "",
  "violation_url": null,
  "is_blocked": false,
  "links_published": 0,
  "login": "abc",
  "links_added": 1,
  "gg": "",
  "signup_date": "2014-10-26 21:15:41"
}
Then load the JSON file with json.load (not json.dumps):
with open('abc.json') as data_file:
    data = json.load(data_file)
print(data["login"])
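To see why the original error happens, a minimal sketch (the profil dict here is a stand-in for whatever get_profile returns): json.dumps produces a str, and indexing a str with "login" raises exactly that TypeError.

```python
import json

profil = {"login": "abc", "links_added": 1}  # stand-in for get_profile's return value

text = json.dumps(profil, indent=4)  # a str - text["login"] would raise the TypeError
data = json.loads(text)              # parse it back into a dict
print(data["login"])  # abc
```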
