Regex to reformat improper JSON data

I have some data that are not properly saved in an old database. I am moving the system to a new database and reformatting the old data as well. The old data looks like this:
a:10:{
s:7:"step_no";s:1:"1";
s:9:"YOUR_NAME";s:14:"Firtname Lastname";
s:11:"CITIZENSHIP"; s:7:"Indian";
s:22:"PROPOSE_NAME_BUSINESS1"; s:12:"ABC Limited";
s:22:"PROPOSE_NAME_BUSINESS2"; s:15:"XYZ Investment";
s:22:"PROPOSE_NAME_BUSINESS3";s:0:"";
s:22:"PROPOSE_NAME_BUSINESS4";s:0:"";
s:23:"PURPOSE_NATURE_BUSINESS";s:15:"Some dummy content";
s:15:"CAPITAL_COMPANY";s:24:"20 Million Capital";
s:14:"ANOTHER_AMOUNT";s:0:"";
}
I want the new data to be in proper JSON format so I can read it in Python, just like this:
data = {
"step_no": "1",
"YOUR_NAME":"Firtname Lastname",
"CITIZENSHIP":"Indian",
"PROPOSE_NAME_BUSINESS1":"ABC Limited",
"PROPOSE_NAME_BUSINESS2":"XYZ Investment",
"PROPOSE_NAME_BUSINESS3":"",
"PROPOSE_NAME_BUSINESS4":"",
"PURPOSE_NATURE_BUSINESS":"Some dummy content",
"CAPITAL_COMPANY":"20 Million Capital",
"ANOTHER_AMOUNT":""
}
I am thinking that using a regex to strip out the unwanted parts and reformatting the content using the names in caps would work, but I don't know how to go about this.

Regexes would be the wrong approach here. There is no need, and the format is a little more complex than you assume it is.
You have data in the PHP serialize format. You can trivially deserialise it in Python with the phpserialize library:
import json
import phpserialize
def fixup_php_arrays(o):
    if isinstance(o, dict):
        if isinstance(next(iter(o), None), int):
            # PHP has no lists, only mappings; produce a list for
            # a dictionary with integer keys to 'repair'
            return [fixup_php_arrays(o[i]) for i in range(len(o))]
        return {k: fixup_php_arrays(v) for k, v in o.items()}
    return o
json.dumps(fixup_php_arrays(phpserialize.loads(yourdata, decode_strings=True)))
Note that PHP strings are byte strings, not Unicode text, so in Python 3 in particular you have to decode your keys and values before you can re-encode them to JSON. The decode_strings=True flag takes care of this for you. The default codec is UTF-8; pass in a charset argument to pick a different one.
PHP also uses arrays for sequences, so you may have to convert any decoded dict object with integer keys to a list first, which is what the fixup_php_arrays() function does.
Demo (with repaired data; in the question, many string lengths were off and whitespace had been added):
>>> import phpserialize, json
>>> from pprint import pprint
>>> data = b'a:10:{s:7:"step_no";s:1:"1";s:9:"YOUR_NAME";s:18:"Firstname Lastname";s:11:"CITIZENSHIP";s:6:"Indian";s:22:"PROPOSE_NAME_BUSINESS1";s:11:"ABC Limited";s:22:"PROPOSE_NAME_BUSINESS2";s:14:"XYZ Investment";s:22:"PROPOSE_NAME_BUSINESS3";s:0:"";s:22:"PROPOSE_NAME_BUSINESS4";s:0:"";s:23:"PURPOSE_NATURE_BUSINESS";s:18:"Some dummy content";s:15:"CAPITAL_COMPANY";s:18:"20 Million Capital";s:14:"ANOTHER_AMOUNT";s:0:"";}'
>>> pprint(phpserialize.loads(data, decode_strings=True))
{'ANOTHER_AMOUNT': '',
 'CAPITAL_COMPANY': '20 Million Capital',
 'CITIZENSHIP': 'Indian',
 'PROPOSE_NAME_BUSINESS1': 'ABC Limited',
 'PROPOSE_NAME_BUSINESS2': 'XYZ Investment',
 'PROPOSE_NAME_BUSINESS3': '',
 'PROPOSE_NAME_BUSINESS4': '',
 'PURPOSE_NATURE_BUSINESS': 'Some dummy content',
 'YOUR_NAME': 'Firstname Lastname',
 'step_no': '1'}
>>> print(json.dumps(phpserialize.loads(data, decode_strings=True), sort_keys=True, indent=4))
{
    "ANOTHER_AMOUNT": "",
    "CAPITAL_COMPANY": "20 Million Capital",
    "CITIZENSHIP": "Indian",
    "PROPOSE_NAME_BUSINESS1": "ABC Limited",
    "PROPOSE_NAME_BUSINESS2": "XYZ Investment",
    "PROPOSE_NAME_BUSINESS3": "",
    "PROPOSE_NAME_BUSINESS4": "",
    "PURPOSE_NATURE_BUSINESS": "Some dummy content",
    "YOUR_NAME": "Firstname Lastname",
    "step_no": "1"
}
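The demo above only contains string keys, so the list repair never kicks in. As a small illustration, the following sketch applies fixup_php_arrays() to a hand-written dictionary that mimics what a decoded PHP array with sequential integer keys might look like. The sample data and the key name tags are made up for the example, and no phpserialize call is involved.

```python
import json


def fixup_php_arrays(o):
    """Recursively turn dicts keyed 0..n-1 into lists."""
    if isinstance(o, dict):
        if isinstance(next(iter(o), None), int):
            return [fixup_php_arrays(o[i]) for i in range(len(o))]
        return {k: fixup_php_arrays(v) for k, v in o.items()}
    return o


# Hypothetical decoded data: 'tags' stands in for a PHP list-style array.
decoded = {"step_no": "1", "tags": {0: "first", 1: "second"}}
repaired = fixup_php_arrays(decoded)
print(json.dumps(repaired))
```

The nested dictionary with keys 0 and 1 comes out as a JSON array, while the outer mapping with string keys is left as an object.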

Related

Parse Json without quotes in Python

I am trying to parse JSON input as a string in Python. I am not able to parse it as a list or dict since the JSON input is not in a proper format (due to limitations in the middleware, I can't do much about that).
{
"Records": "{Output=[{_fields=[{Entity=ABC , No=12345, LineNo= 1, EffDate=20200630}, {Entity=ABC , No=567, LineNo= 1, EffDate=20200630}]}"
}
I tried json.loads and ast.literal_eval (invalid syntax error).
How can I load this?

The sad answer is: the contents of your "Records" field are simply not JSON. No amount of ad-hoc patching (= to :, adding quotes) will change that. You have to find out the language/format specification for what the producing system emits and write/find a proper parser for that particular format.
As a crutch, and only if the example above already captures all the variability you might see in production data, a much simpler approach based on regular expressions (see the re package or edd's pragmatic answer) might be sufficient.

If the producer of the data is consistent, you can start with something like the following, that aims to bridge the JSON gap.
import re
import json
source = {
"Records": "{Output=[{_fields=[{Entity=ABC , No=12345, LineNo= 1, EffDate=20200630}, {Entity=ABC , No=567, LineNo= 1, EffDate=20200630}]}"
}
s = source["Records"]
# We'll start by removing any extraneous white spaces
s2 = re.sub(r'\s', '', s)
# Surrounding any word with "
s3 = re.sub(r'(\w+)', r'"\g<1>"', s2)
# Replacing = with :
s4 = re.sub('=', ':', s3)
# Lastly, fixing missing closing ], }
## Note that }} is an escaped } for f-string.
s5 = f"{s4}]}}"
>>> json.loads(s5)
{'Output': [{'_fields': [{'Entity': 'ABC', 'No': '12345', 'LineNo': '1', 'EffDate': '20200630'}, {'Entity': 'ABC', 'No': '567', 'LineNo': '1', 'EffDate': '20200630'}]}]}
Follow up with some robust testing and have a nice polished ETL with your favorite tooling.

As I understand it, you are trying to parse the value of the Records item in the dictionary as JSON. Unfortunately, you cannot.
The string in that value is not JSON. You must write a parser yourself that first converts the string into a JSON string, according to the format the string is written in. (Unfortunately, we don't know which "middleware" you are talking about.)
tl;dr: Parse it into a JSON string, then parse the JSON into a Python dictionary. Read this to find out more about JSON (JavaScript Object Notation) rules.

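There you go. This code quotes the keys, which brings the data closer to valid JSON (values such as ABC stay unquoted and the quotes are single quotes, so the inner record still needs more work before json.loads will accept it):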
notjson = """{
"Records": "{Output=[{_fields=[{Entity=ABC , No=12345, LineNo= 1, EffDate=20200630}, {Entity=ABC , No=567, LineNo= 1, EffDate=20200630}]}"
}"""
notjson = notjson.replace("=","':") #adds a singlequote and makes it more valid
notjson = notjson.replace("{","{'")
notjson = notjson.replace(", ",", '")
notjson = notjson.replace("}, '{","}, {")
json = "{" + notjson[2:]
print(json)
print(notjson)

Python parse string environment variables

I would like to parse environment variables from a string.
For example:
envs = parse_env('name="John Doe" age=21 gender=male')
print(envs)
# outputs: {"name": "John Doe", "age": 21, "gender": "male"}
What is the best and most minimalistic way to achieve this?
Thank you.

If you can dictate that your values will never contain the special characters that you use in your input format (namely = and the space character), this is very easy to do with split:
>>> def parse_env(envs):
...     pairs = [pair.split("=") for pair in envs.split(" ")]
...     return {var: val for var, val in pairs}
...
>>> parse_env("name=John_Doe age=21 gender=male")
{'name': 'John_Doe', 'age': '21', 'gender': 'male'}
If your special characters mean different things in different contexts (e.g. a = can be the separator between a var and value OR it can be part of a value), the problem is harder; you'll need to use some kind of state machine (e.g. a regex) to break the string into tokens in a way that takes into account the different ways that a character might be used in the string.
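For shell-style input like the example in the question, one possible middle ground is the standard library's shlex module, which already implements a tokenizer that understands quoted strings. This is only a sketch: it assumes POSIX shell quoting rules apply to the input, and it leaves every value as a string (so age comes back as '21', not 21).

```python
import shlex


def parse_env(envs):
    """Parse 'key=value' pairs, honouring shell-style quotes."""
    result = {}
    for token in shlex.split(envs):
        # Split on the first '=' only, so a value may itself contain '='.
        key, _, value = token.partition("=")
        result[key] = value
    return result


print(parse_env('name="John Doe" age=21 gender=male'))
```

Converting values such as the age to integers would be a separate step on top of this, since the input format itself carries no type information.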

You can use python-dotenv to parse a .env file with your variables. Your .env file should look something like:
NAME="John Doe"
AGE=21
GENDER="male"
Then you should be able to run:
from pathlib import Path
from dotenv import load_dotenv
env_path = Path(".") / ".env"  # assumed location of the .env file
load_dotenv(dotenv_path=env_path)
Good luck!

Pyspark dataframe corrupted record when reading from python dictionary(json) got from requests, encoding problem

I am making a REST API call with the Requests library.
response = requests.get("https://urltomaketheapicall", headers={'authorization': 'bearer {0}'.format("7777777777777777777777777777")}, timeout=5)
When I do response.json()
I get a key with these values
{'devices': '....iPhone\xa05S, iPhone\xa06, iPhone\xa06\xa0Plus, iPhone\xa06S'}
When I do print(response.encoding) I get None
When I do print(type(data['devices'])) I get <class 'str'>
If I do print(data['devices']) I get '....iPhone 5S, iPhone 6, iPhone 6 Plus, iPhone 6S' without the special characters.
Now if I do
new_dict={}
new_val = data['devices']
new_dict["devices"] = new_val
print(new_dict["devices"])
I will get the special characters in the new dictionary as well.
Any ideas?
I want to get rid of the special characters because I need to read this JSON into a PySpark dataframe, and with those characters I get a _corrupted_record:
rd= spark.sparkContext.parallelize([data])
df = spark.read.json(rd)
I want to avoid solutions like .replace("\\xa0"," ")

\xa0 is a no-break space (U+00A0). It is simply part of the string. It shows up as an escape only because you're dumping the repr of an entire dict; it prints as a proper no-break space if you print the individual string:
>>> print({'a': '\xa0'})
{'a': '\xa0'}
>>> print('\xa0')
 
>>>
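One plausible explanation for the _corrupted_record, offered here as an assumption rather than something confirmed in the question: if the dict reaches the JSON reader as its str() form instead of as JSON text, that text contains the Python-only escape \xa0, which is not a valid JSON escape. The sketch below uses only the standard library and made-up sample data to show the difference between str() and json.dumps().

```python
import json

# Made-up sample resembling the response in the question.
data = {"devices": "iPhone\xa05S, iPhone\xa06"}

as_repr = str(data)         # Python repr: single quotes and a \xa0 escape
as_json = json.dumps(data)  # valid JSON: double quotes and a \u00a0 escape

print(as_repr)
print(as_json)

# The json.dumps() form round-trips through a JSON parser unchanged.
restored = json.loads(as_json)
print(restored == data)
```

Under that assumption, passing json.dumps(data) rather than the dict itself to parallelize would be the thing to try; the no-break spaces themselves stay in the data.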

Replace single quotes with double quotes but leave ones within double quotes untouched

The ultimate goal, and the origin of the problem, is to have a field that is compatible with Redshift's json_extract_path_text.
This is how it looks right now:
{'error': "Feed load failed: Parameter 'url' must be a string, not object", 'errorCode': 3, 'event_origin': 'app', 'screen_id': '6118964227874465', 'screen_class': 'Promotion'}
To extract the field I need from the string in Redshift, I replaced single quotes with double quotes.
This particular record gives an error because there are single quotes inside the value of error. If those get replaced as well, the string becomes invalid JSON.
So what I need is:
{"error": "Feed load failed: Parameter 'url' must be a string, not object", "errorCode": 3, "event_origin": "app", "screen_id": "6118964227874465", "screen_class": "Promotion"}

There are several ways; one is to use the regex module with
"[^"]*"(*SKIP)(*FAIL)|'
See a demo on regex101.com.
In Python:
import regex as re
rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|\'')
new_string = rx.sub('"', old_string)
With the original re module, you'd need to use a function and see if the group has been matched or not - (*SKIP)(*FAIL) lets you avoid exactly that.
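A minimal sketch of that function-based variant with the standard re module follows. The helper name fix_quotes and the shortened sample string are made up for illustration; like the pattern above, it assumes double-quoted strings contain no escaped double quotes.

```python
import re

# Alternative 1 captures a whole double-quoted string;
# alternative 2 matches a single quote outside of such a string.
rx = re.compile(r'("[^"]*")|\'')


def fix_quotes(old_string):
    # Keep double-quoted strings as they are; turn any other ' into ".
    return rx.sub(lambda m: m.group(1) if m.group(1) else '"', old_string)


old_string = """{'error': "Parameter 'url' must be a string", 'errorCode': 3}"""
new_string = fix_quotes(old_string)
print(new_string)
```

The replacement function checks whether group 1 took part in the match, which is the bookkeeping that (*SKIP)(*FAIL) makes unnecessary in the regex module.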

I tried a regex approach but found it too complicated and slow. So I wrote a simple "bracket parser" which keeps track of the current quotation mode. It cannot handle multiple levels of nesting; you'd need a stack for that. For my use case, converting str(dict) to proper JSON, it works:
example input:
{'cities': [{'name': "Upper Hell's Gate"}, {'name': "N'zeto"}]}
example output:
{"cities": [{"name": "Upper Hell's Gate"}, {"name": "N'zeto"}]}
python unit test
def testSingleToDoubleQuote(self):
    jsonStr='''
    {
        "cities": [
        {
            "name": "Upper Hell's Gate"
        },
        {
            "name": "N'zeto"
        }
        ]
    }
    '''
    listOfDicts=json.loads(jsonStr)
    dictStr=str(listOfDicts)
    if self.debug:
        print(dictStr)
    jsonStr2=JSONAble.singleQuoteToDoubleQuote(dictStr)
    if self.debug:
        print(jsonStr2)
    self.assertEqual('''{"cities": [{"name": "Upper Hell's Gate"}, {"name": "N'zeto"}]}''',jsonStr2)
singleQuoteToDoubleQuote
def singleQuoteToDoubleQuote(singleQuoted):
    '''
    convert a single quoted string to a double quoted one
    Args:
        singleQuoted(string): a single quoted string e.g. {'cities': [{'name': "Upper Hell's Gate"}]}
    Returns:
        string: the double quoted version of the string e.g.
    see
    - https://stackoverflow.com/questions/55600788/python-replace-single-quotes-with-double-quotes-but-leave-ones-within-double-q
    '''
    cList=list(singleQuoted)
    inDouble=False
    inSingle=False
    for i,c in enumerate(cList):
        #print ("%d:%s %r %r" %(i,c,inSingle,inDouble))
        if c=="'":
            if not inDouble:
                inSingle=not inSingle
                cList[i]='"'
        elif c=='"':
            inDouble=not inDouble
    doubleQuoted="".join(cList)
    return doubleQuoted

Python3 .format() align usage

Can anyone help me to change the writing of these lines?
I want to get my code to be more elegant using .format(), but I don't really know how to use it.
print("%3s %-20s %12s" %("Id", "State", "Population"))
print("%3d %-20s %12d" %
      (state["id"],
       state["name"],
       state["population"]))

Your format is easily translated to the str.format() formatting syntax:
print("{:>3s} {:20s} {:>12s}".format("Id", "State", "Population"))
print("{id:3d} {name:20s} {population:12d}".format(**state))
Note that left-alignment is achieved by prefixing the width with <, not -. Strings are left-aligned by default, so the header strings need a > while the < can be omitted; otherwise the formats are closely related.
This extracts the values directly from the state dictionary by using the keys in the format itself.
You may as well just use the actual output result of the first format directly:
print(" Id State                  Population")
Demo:
>>> state = {'id': 15, 'name': 'New York', 'population': 19750000}
>>> print("{:>3s} {:20s} {:>12s}".format("Id", "State", "Population"))
 Id State                  Population
>>> print("{id:3d} {name:20s} {population:12d}".format(**state))
 15 New York                 19750000

You can write:
print("{id:>3s} {state:20s} {population:>12s}".format(id='Id', state='State', population='Population'))
print("{id:>3d} {state:20s} {population:>12d}".format(id=state['id'], state=state['name'], population=state['population']))
Note that you have to use > to right-align as the items are left-aligned by default. You can also name the items in the formatted string which makes it more readable to see what value goes where.
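On Python 3.6 or newer, the same format specifications also work inside f-strings, which some find more readable than .format(). The sketch below is not part of either answer; it reuses the sample state dictionary from the demo above.

```python
state = {"id": 15, "name": "New York", "population": 19750000}

# Same width/alignment specs as in the .format() calls, written inline.
header = f"{'Id':>3s} {'State':20s} {'Population':>12s}"
row = f"{state['id']:3d} {state['name']:20s} {state['population']:12d}"

print(header)
print(row)
```

Both lines come out 37 characters wide (3 + 1 + 20 + 1 + 12), so the columns line up exactly as with the % and .format() versions.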
