I need to export my DF into a specific JSON format, but I'm struggling to format it in the right way.
I'd like to create a shop_details subsection that shows the shop's city and location if known; otherwise it should be left empty.
Code for my DF:
from pandas import DataFrame
import numpy as np

# use real NaN values (np.nan), not the string 'NaN', so fillna() can catch them
Data = {'item_type': ['Iphone', 'Computer', 'Computer'],
        'purch_price': [1200, 700, 700],
        'sale_price': [1150, np.nan, np.nan],
        'city': [np.nan, 'Los Angeles', 'San Jose'],
        'location': [np.nan, '1st street', '2nd street']
        }
df = DataFrame(Data)
DF looks like this:
item_type purch_price sale_price city location
0 Iphone 1200 1150 NaN NaN
1 Computer 700 NaN Los Angeles 1st street
2 Computer 700 NaN San Jose 2nd street
The output format should look like below:
[{
"item_type": "Iphone",
"purch_price": "1200",
"sale_price": "1150",
"shop_details": []
},
{
"item_type": "Computer",
"purch_price": "700",
"sale_price": "600",
"shop_details": [{
"city": "Los Angeles",
"location": "1st street"
},
{
"city": "San Jose",
"location": "2nd street"
}
]
}
]
import json

df = df.fillna('')

def shop_details(row):
    if row['city'] != '' and row['location'] != '':
        return [{'city': row['city'], 'location': row['location']}]
    else:
        return []

df['shop_details'] = df.apply(shop_details, axis=1)
df = df.drop(['city', 'location'], axis=1)
print(json.dumps(df.to_dict('records')))
The only problem is that this does not group by item_type, but you should do some of the work. ;)
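To add the grouping, one option is groupby; below is an illustrative sketch (it assumes real NaN values rather than the string 'NaN', and casts the prices to strings to match the desired output):

```python
import json

import pandas as pd

df = pd.DataFrame({
    'item_type': ['Iphone', 'Computer', 'Computer'],
    'purch_price': [1200, 700, 700],
    'sale_price': [1150, None, None],
    'city': [None, 'Los Angeles', 'San Jose'],
    'location': [None, '1st street', '2nd street'],
}).fillna('')

records = []
for (item, price), group in df.groupby(['item_type', 'purch_price']):
    # collect the known shops; rows with empty city/location are skipped
    shops = [{'city': c, 'location': l}
             for c, l in zip(group['city'], group['location'])
             if c != '' and l != '']
    sale = group['sale_price'].iloc[0]
    records.append({
        'item_type': item,
        'purch_price': str(price),
        'sale_price': str(int(sale)) if sale != '' else '',
        'shop_details': shops,
    })

print(json.dumps(records, indent=2))
```

Grouping on both item_type and purch_price keeps rows of the same item type but different prices apart.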
You can write the DataFrame to a JSON file like below, though note that to_json alone will not produce the nested shop_details grouping. Thanks
from pandas import DataFrame
Data = {'item_type': ['Iphone','Computer','Computer'],
'purch_price': [1200,700,700],
'sale_price': [1150,'NaN','NaN'],
'city': ['NaN','Los Angeles','San Jose'],
'location': ['NaN','1st street', '2nd street']
}
df = DataFrame(Data, columns= ['item_type', 'purch_price', 'sale_price', 'city','location' ])
Export = df.to_json('path where you want to export your json file')
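For reference, the orient parameter of to_json controls the output shape; orient='records' gives a flat list of row objects, but nested keys like shop_details still have to be built before exporting. A minimal illustration:

```python
import json

import pandas as pd

df = pd.DataFrame({
    'item_type': ['Iphone', 'Computer'],
    'purch_price': [1200, 700],
})

# orient='records' serializes each row as one JSON object
flat = json.loads(df.to_json(orient='records'))
print(flat)
```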
I want to transform a dictionary in Python, from dictionary 1 into dictionary 2, as follows.
transaction = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "239.95 USD DHL.COM TEXAS USA",
}
I want to transform the above dictionary to the following
transaction = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "DHL.COM TEXAS USA",
"amount": 239.95,
"currency": "USD",
"location": "TEXAS USA",
"merchant_name": "DHL"
}
I tried the following but it did not work
dic1 = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "239.95 USD DHL.COM TEXAS USA"
}
print(type(dic1))
copiedDic = dic1.copy()
print("copiedDic = ",copiedDic)
updatekeys = ['amount', 'currency', 'merchant_name', 'location', 'trans_category']
for key in dic1:
    if key == 'description':
        list_words = dic1[key].split(" ")
        newdict = {updatekeys[i]: x for i, x in enumerate(list_words)}
        copiedDic.update(newdict)
print(copiedDic)
I got the following result:
{
'trans_time': '14/07/2015 10:03:20',
'trans_type': 'DEBIT',
'description': '239.95 USD DHL.COM TEXAS USA',
'amount': '239.95',
'currency': 'USD',
'merchant_name': 'DHL.COM',
'location': 'TEXAS',
'trans_category': 'USA'
}
My Intended output should look like this:
transaction = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "DHL.COM TEXAS USA",
"amount": 239.95,
"currency": "USD",
"location": "TEXAS USA",
"merchant_name": "DHL"
}
I think it would be easier to turn the value into a list of words and parse that. Here, the word list aaa is created from the string transaction['description']. Where a field spans more than one word (list element), join is used to turn the slice back into a string. The amount is converted from string to float, and for merchant_name the segment before the dot is taken.
transaction = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "239.95 USD DHL.COM TEXAS USA",
}
aaa = transaction['description'].split()
transaction['description'] = ' '.join(aaa[2:])
transaction['amount'] = float(aaa[0])
transaction['currency'] = aaa[1]
transaction['location'] = ' '.join(aaa[3:])
transaction['merchant_name'] = aaa[2].partition('.')[0]
print(transaction)
Output
{
'trans_time': '14/07/2015 10:03:20',
'trans_type': 'DEBIT',
'description': 'DHL.COM TEXAS USA',
'amount': 239.95,
'currency': 'USD',
'location': 'TEXAS USA',
'merchant_name': 'DHL'}
If you want to transform the dictionary in place, you do not need the copy of the original dictionary.
Just do something like this:
values = transaction["description"].split(' ')
# a plain index-to-key mapping reproduces the wrong result above;
# the multi-word fields need joins over slices instead
transaction["amount"] = float(values[0])
transaction["currency"] = values[1]
transaction["merchant_name"] = values[2].partition('.')[0]
transaction["location"] = ' '.join(values[3:])
transaction["description"] = ' '.join(values[2:])
I'm trying to create a dataframe using the following JSON structure -
{
"tables" : {
"name" : "PrimaryResult",
"columns" : [
{
"name" : "EmployeeID",
"type" : "Int"
},
{
"name" : "EmployeeName",
"type" : "String"
},
{
"name" : "DepartmentName",
"type" : "String"
}
],
"rows" : [
[
123,
"John Doe",
"IT"
],
[
234,
"Jane Doe",
"HR"
]
]
}
}
I tried a few of the suggestions from How to create pandas DataFrame from nested Json with list and How to parse nested JSON objects in spark sql?, but I'm still confused. Essentially the output should look somewhat like below -
+----------+------------+--------------+
|EmployeeId|EmployeeName|DepartmentName|
+----------+------------+--------------+
| 123| John Doe| IT|
| 234| Jane Doe| HR|
+----------+------------+--------------+
I'm trying to refrain from using pandas, as it shows a lot of memory issues when the data is huge (not sure if there is a way to handle them).
Please help.
See the logic below -
import json

# `js` holds the raw JSON string shown in the question
data = [json.loads(js)]
print(data)
# Output
[{'tables': {'name': 'PrimaryResult', 'columns': [{'name': 'EmployeeID', 'type': 'Int'}, {'name': 'EmployeeName', 'type': 'String'}, {'name': 'DepartmentName', 'type': 'String'}], 'rows': [[123, 'John Doe', 'IT'], [234, 'Jane Doe', 'HR']]}}]
Now fetch the columns as below -
columns = []
for i in range(len(data[0]['tables']['columns'])):
    columns.append(data[0]['tables']['columns'][i]['name'])
print(columns)
#Output
['EmployeeID', 'EmployeeName', 'DepartmentName']
Create a dictionary of columns and rows as below -
dict_JSON = {}
dict_JSON["columns"] = columns
dict_JSON["data"] = data[0]['tables']['rows']
print(dict_JSON)
#Output
{'columns': ['EmployeeID', 'EmployeeName', 'DepartmentName'], 'data': [[123, 'John Doe', 'IT'], [234, 'Jane Doe', 'HR']]}
Now, once you have this dictionary, create a pandas dataframe and from there create the Spark dataframe as below -
import pandas as pd
pdf = pd.read_json(json.dumps(dict_JSON), orient='split')
df = spark.createDataFrame(pdf)
df.show()
+----------+------------+--------------+
|EmployeeID|EmployeeName|DepartmentName|
+----------+------------+--------------+
| 123| John Doe| IT|
| 234| Jane Doe| HR|
+----------+------------+--------------+
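Since the question mentions avoiding pandas, the column names and rows can also be pulled straight out of the parsed JSON and handed to Spark, skipping the pandas round-trip. A sketch (the SparkSession `spark` is assumed to exist, so the Spark lines are commented out):

```python
import json

def columns_and_rows(js: str):
    """Extract column names and row lists from the nested JSON."""
    parsed = json.loads(js)
    table = parsed["tables"]
    return [c["name"] for c in table["columns"]], table["rows"]

js = '''{"tables": {"name": "PrimaryResult",
  "columns": [{"name": "EmployeeID", "type": "Int"},
              {"name": "EmployeeName", "type": "String"},
              {"name": "DepartmentName", "type": "String"}],
  "rows": [[123, "John Doe", "IT"], [234, "Jane Doe", "HR"]]}}'''

columns, rows = columns_and_rows(js)
# With a SparkSession available:
# df = spark.createDataFrame(rows, schema=columns)
# df.show()
```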
My dataframe is
fname lname city state code
Alice Lee Athens Alabama PXY
Nor Xi Mesa Arizona ABC
The output of json should be
{
"Employees":{
"Alice Lee":{
"code":"PXY",
"Address":"Athens, Alabama"
},
"Nor Xi":{
"code":"ABC",
"Address":"Mesa, Arizona"
}
}
}
df.to_json() gives no hierarchy to the json. Can you please suggest what am I missing? Is there a way to combine columns and give them a 'keyname' while writing json in pandas?
Thank you.
Try:
names = df[["fname", "lname"]].apply(" ".join, axis=1)
addresses = df[["city", "state"]].apply(", ".join, axis=1)
codes = df["code"]
out = {"Employees": {}}
for n, a, c in zip(names, addresses, codes):
out["Employees"][n] = {"code": c, "Address": a}
print(out)
Prints:
{
"Employees": {
"Alice Lee": {"code": "PXY", "Address": "Athens, Alabama"},
"Nor Xi": {"code": "ABC", "Address": "Mesa, Arizona"},
}
}
We can populate a new dataframe with columns being "code" and "Address", and index being "full_name" where the latter two are generated from the dataframe's columns with string addition:
new_df = pd.DataFrame({"code": df["code"],
"Address": df["city"] + ", " + df["state"]})
new_df.index = df["fname"] + " " + df["lname"]
which gives
>>> new_df
code Address
Alice Lee PXY Athens, Alabama
Nor Xi ABC Mesa, Arizona
We can now call to_dict with orient="index":
>>> d = new_df.to_dict(orient="index")
>>> d
{"Alice Lee": {"code": "PXY", "Address": "Athens, Alabama"},
"Nor Xi": {"code": "ABC", "Address": "Mesa, Arizona"}}
To match your output, we wrap d with a dictionary:
>>> {"Employees": d}
{
"Employees":{
"Alice Lee":{
"code":"PXY",
"Address":"Athens, Alabama"
},
"Nor Xi":{
"code":"ABC",
"Address":"Mesa, Arizona"
}
}
}
records = json.loads(df.to_json(orient='records'))  # avoid shadowing the json module
employees = {}
employees['Employees'] = [{obj['fname'] + ' ' + obj['lname']: {'code': obj['code'], 'Address': obj['city'] + ', ' + obj['state']}} for obj in records]
This outputs -
{
'Employees': [
{
'Alice Lee': {
'code': 'PXY',
'Address': 'Athens, Alabama'
}
},
{
'Nor Xi': {
'code': 'ABC',
'Address': 'Mesa, Arizona'
}
}
]
}
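If you want a single mapping keyed by the full name (as in the question) instead of a list of one-key dicts, a dict comprehension over the records gives that shape. A sketch with the question's data inlined:

```python
import pandas as pd

df = pd.DataFrame({
    "fname": ["Alice", "Nor"],
    "lname": ["Lee", "Xi"],
    "city": ["Athens", "Mesa"],
    "state": ["Alabama", "Arizona"],
    "code": ["PXY", "ABC"],
})

# one dict per employee, keyed by "fname lname"
employees = {"Employees": {
    f"{r['fname']} {r['lname']}": {"code": r["code"],
                                   "Address": f"{r['city']}, {r['state']}"}
    for r in df.to_dict(orient="records")
}}
print(employees)
```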
You can solve this using df.iterrows():
employee_dict = {}
for row in df.iterrows():
    # row[0] is the index number, row[1] is the data for that index
    row_data = row[1]
    employee_name = row_data.fname + ' ' + row_data.lname
    employee_dict[employee_name] = {'code': row_data.code,
                                    'Address': row_data.city + ', ' + row_data.state}
json_data = {'Employees': employee_dict}
Result:
{'Employees': {'Alice Lee': {'code': 'PXY', 'Address': 'Athens, Alabama'},
'Nor Xi': {'code': 'ABC', 'Address': 'Mesa, Arizona'}}}
I'm reading a dataframe and converting it into a json file. I'm using python 3 and 0.25.3 version of pandas for it. I already got some help from you guys (Manipulating data of Pandas dataframe), but I have some questions about the code and how it works.
My dataframe:
id label id_customer label_customer part_number number_client
6 Sao Paulo CUST-99992 Brazil 7897 982
6 Sao Paulo CUST-99992 Brazil 888 12
92 Hong Kong CUST-88888 China 147 288
Code:
import pandas as pd
data = pd.read_excel(path)
data[["part_number","number_client"]] = data[["part_number","number_client"]].astype(str)
f = lambda x: x.split('_')[0]
j =(data.groupby(["id","label","id_customer","label_customer"])['part_number','number_client']
.apply(lambda x: x.rename(columns=f).to_dict('r')).reset_index(name='Number')
.groupby(["id", "label"])[ "id_customer", "label_customer", "Number"]
.apply(lambda x: x.rename(columns=f).to_dict('r')).reset_index(name='Customer')
.to_json(orient='records'))
print (j)
Json I'm getting:
[{
"id": 6,
"label": "Sao Paulo",
"Customer": [{
"id": "CUST-99992",
"label": "Brazil",
"number": [{
"part": "7897",
"client": "982"
},
{
"part": "888",
"client": "12"
}
]
}]
},
{
"id": 92,
"label": "Hong Kong",
"Customer": [{
"id": "CUST-88888",
"label": "China",
"number": [{
"part": "147",
"client": "288"
}]
}]
}
]
1st Question: The lambda and apply functions split my column names wherever a _ is found. This is just a piece of my dataframe, and there are some column names I'd like to preserve, e.g. I want to get part_number and number_client instead of part and client in my JSON structure. How can I fix this?
2nd Question: I can have different lists with the same key name. E.g. in the Customer list I have a part_number key, but I can also have a key with the same name and a different value inside another list, e.g. part_number inside a test list.
3rd Question: In my complete dataframe, I have a column called Additional_information containing simple text. I need a structure like this:
...
"Additional_information": [{
"text": "testing"
},
{
"text": "testing again"
}
]
for a dataframe like this:
id label id_customer label_customer part_number number_client Additional_information
6 Sao Paulo CUST-99992 Brazil 7897 982 testing
6 Sao Paulo CUST-99992 Brazil 7897 982 testing again
What should I change?
1st Question:
You can write a custom function for the rename, e.g.:
def f(x):
    vals = ['part_number', 'number_client']
    if x in vals:
        return x
    else:
        return x.split('_')[0]
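A quick sanity check of this rename (condensed into one line here; the protected names are taken from the question):

```python
def f(x):
    # keep the listed column names untouched; shorten everything else at the first '_'
    vals = ['part_number', 'number_client']
    return x if x in vals else x.split('_')[0]

print(f('part_number'), f('id_customer'), f('label_customer'))
```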
2nd Question
If I understand correctly, the keys in the final JSON are created from the columns of the original DataFrame, and also from the name parameter of reset_index in my solution. If you want some other logic for changing the keys (column names), the first solution can be adapted.
3rd Question
In the original solution, to_json is changed to to_dict so that the final list of dicts can be modified, e.g. to append the text info; json.dumps is then used for the JSON output in the last step:
import json
def f(x):
vals = ['part_number', 'number_client']
if x in vals:
return x
else:
return x.split('_')[0]
d =(data.groupby(["id","label","id_customer","label_customer"])['part_number','number_client']
.apply(lambda x: x.rename(columns=f).to_dict('r')).reset_index(name='Number')
.groupby(["id", "label"])[ "id_customer", "label_customer", "Number"]
.apply(lambda x: x.rename(columns=f).to_dict('r')).reset_index(name='Customer')
.to_dict(orient='records'))
#print (d)
d1 = (data[['Additional_information']].rename(columns={'Additional_information':'text'})
.to_dict(orient='records'))
d1 = {'Additional_information':d1}
print (d1)
{'Additional_information': [{'text': 'testing'}, {'text': 'testing again'}]}
d.append(d1)
#print (d)
j = json.dumps(d)
#print (j)
I'm working with CSV files. My goal is to write a JSON format with the CSV file information. Specifically, I want to get a format similar to miserables.json.
Example:
{"source": "Napoleon", "target": "Myriel", "value": 1},
According to the information I have, the format would be:
[
{
"source": "Germany",
"target": "Mexico",
"value": 1
},
{
"source": "Germany",
"target": "USA",
"value": 2
},
{
"source": "Brazil",
"target": "Argentina",
"value": 3
}
]
However, with the code I used, the output looks as follows:
[
{
"source": "Germany",
"target": "Mexico",
"value": 1
},
{
"source": null,
"target": "USA",
"value": 2
}
][
{
"source": "Brazil",
"target": "Argentina",
"value": 3
}
]
The null source should be Germany; this is one of the main problems, and there are more countries with the same issue. Apart from that, the information is correct. I just want to remove the multiple lists inside the output and replace null with the correct country.
This is the code I used, with pandas and collections:
import json
import pandas
from collections import Counter
from pandas import DataFrame, Series

csvdata = pandas.read_csv('file.csv', low_memory=False, encoding='latin-1')
countries = csvdata['country'].tolist()
newcountries = list(set(countries))
for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frequency = Counter(bills)
    sourceTemp = []
    value = []
    country = element
    for k, v in frequency.items():
        sourceTemp.append(k)
        value.append(int(v))
    forceData = {'source': Series(country), 'target': Series(sourceTemp), 'value': Series(value)}
    dfForce = DataFrame(forceData)
    jsondata = dfForce.to_json(orient='records', force_ascii=False, default_handler=callable)
    parsed = json.loads(jsondata)
    newData = json.dumps(parsed, indent=4, ensure_ascii=False, sort_keys=True)
    # since to_json doesn't have an append mode, this is written to a txt file
    savetxt = open('data.txt', 'a')
    savetxt.write(newData)
    savetxt.close()
Any suggestion to solve this problem are appreciate!
Thanks
Consider removing the Series() around the scalar value, country. When it is wrapped in Series() and the dictionary of series is upsized into a dataframe, NaN (later converted to null in json) is forced into that series to match the lengths of the other series. You can see this by printing out the dfForce dataframe:
from pandas import Series
from pandas import DataFrame
country = 'Germany'
sourceTemp = ['Mexico', 'USA', 'Argentina']
value = [1, 2, 3]
forceData = {'source': Series(country),
'target': Series(sourceTemp),
'value': Series(value)}
dfForce = DataFrame(forceData)
# source target value
# 0 Germany Mexico 1
# 1 NaN USA 2
# 2 NaN Argentina 3
To resolve, simply keep country as scalar in dictionary of series:
forceData = {'source': country,
'target': Series(sourceTemp),
'value': Series(value)}
dfForce = DataFrame(forceData)
# source target value
# 0 Germany Mexico 1
# 1 Germany USA 2
# 2 Germany Argentina 3
By the way, you do not need a dataframe object to output JSON. Simply use a list of dictionaries. Consider the following, using an OrderedDict collection (to maintain the order of keys). This way the growing list is dumped into the text file in a single write instead of by appending, which would render invalid JSON, as opposite-facing adjacent square brackets ...][... are not allowed.
from collections import OrderedDict
...
data = []
for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frequency = Counter(bills)
    for k, v in frequency.items():
        inner = OrderedDict()
        inner['source'] = element
        inner['target'] = k
        inner['value'] = int(v)
        data.append(inner)
newData = json.dumps(data, indent=4)
with open('data.json', 'w') as savetxt:
    savetxt.write(newData)
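For illustration, here is the same pattern as a self-contained sketch with toy data (the country/target pairs below are made up; in the real code they come from the CSV):

```python
import json
from collections import Counter, OrderedDict

# stand-in for the (country, target) pairs read from the CSV
pairs = [
    ("Germany", "Mexico"), ("Germany", "USA"), ("Germany", "USA"),
    ("Brazil", "Argentina"),
]

data = []
for country in sorted({c for c, _ in pairs}):
    frequency = Counter(t for c, t in pairs if c == country)
    for target, value in frequency.items():
        inner = OrderedDict()
        inner["source"] = country
        inner["target"] = target
        inner["value"] = int(value)
        data.append(inner)

# one dump of the whole list keeps the JSON valid (no ...][...)
print(json.dumps(data, indent=4))
```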