Write json format using pandas Series and DataFrame - python

I'm working with CSV files. My goal is to write a JSON format from the CSV file information. Specifically, I want to get a format similar to miserables.json
Example:
{"source": "Napoleon", "target": "Myriel", "value": 1},
According to the information I have, the format would be:
[
    {
        "source": "Germany",
        "target": "Mexico",
        "value": 1
    },
    {
        "source": "Germany",
        "target": "USA",
        "value": 2
    },
    {
        "source": "Brazil",
        "target": "Argentina",
        "value": 3
    }
]
However, with the code I used, the output looks as follows:
[
    {
        "source": "Germany",
        "target": "Mexico",
        "value": 1
    },
    {
        "source": null,
        "target": "USA",
        "value": 2
    }
][
    {
        "source": "Brazil",
        "target": "Argentina",
        "value": 3
    }
]
The null source should be Germany. This is one of the main problems, because more countries have the same issue. Apart from that, the information is correct. I just want to remove the several nested lists in the output and replace null with the correct country.
This is the code I wrote using pandas and collections:
csvdata = pandas.read_csv('file.csv', low_memory=False, encoding='latin-1')
countries = csvdata['country'].tolist()
newcountries = list(set(countries))
for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frequency = Counter(bills)
    sourceTemp = []
    value = []
    country = element
    for k, v in frequency.items():
        sourceTemp.append(k)
        value.append(int(v))
    forceData = {'source': Series(country), 'target': Series(sourceTemp), 'value': Series(value)}
    dfForce = DataFrame(forceData)
    jsondata = dfForce.to_json(orient='records', force_ascii=False, default_handler=callable)
    parsed = json.loads(jsondata)
    newData = json.dumps(parsed, indent=4, ensure_ascii=False, sort_keys=True)
    # since to_json doesn't have an append mode, this is written to a txt file
    savetxt = open('data.txt', 'a')
    savetxt.write(newData)
    savetxt.close()
Any suggestions to solve this problem are appreciated!
Thanks

Consider removing the Series() around the scalar value, country. By wrapping a scalar in a Series and then upsizing the dictionary of series into a dataframe, you force NaN (later converted to null in JSON) into that series to match the lengths of the other series. You can see this by printing out the dfForce dataframe:
from pandas import Series
from pandas import DataFrame

country = 'Germany'
sourceTemp = ['Mexico', 'USA', 'Argentina']
value = [1, 2, 3]

forceData = {'source': Series(country),
             'target': Series(sourceTemp),
             'value': Series(value)}
dfForce = DataFrame(forceData)

#     source     target  value
# 0  Germany     Mexico      1
# 1      NaN        USA      2
# 2      NaN  Argentina      3
To resolve it, simply keep country as a scalar in the dictionary of series:
forceData = {'source': country,
             'target': Series(sourceTemp),
             'value': Series(value)}
dfForce = DataFrame(forceData)

#     source     target  value
# 0  Germany     Mexico      1
# 1  Germany        USA      2
# 2  Germany  Argentina      3
By the way, you do not need a dataframe object to output JSON. Simply use a list of dictionaries. Consider the following, using an ordered dictionary (to maintain the order of keys). This way the growing list is dumped to a file in one shot, without appending, which would render invalid JSON, as opposite-facing adjacent square brackets ...][... are not allowed.
from collections import OrderedDict
...

data = []
for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frequency = Counter(bills)
    for k, v in frequency.items():
        inner = OrderedDict()
        inner['source'] = element
        inner['target'] = k
        inner['value'] = int(v)
        data.append(inner)

newData = json.dumps(data, indent=4)
with open('data.json', 'w') as savetxt:
    savetxt.write(newData)
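For what it's worth, the counting itself can also be pushed into pandas with a groupby, avoiding the per-country loop and Counter entirely. A minimal sketch, assuming columns named country and target as in the question (the sample frame below is made up):

```python
import json
import pandas as pd

# Made-up sample standing in for the CSV from the question (assumed columns).
csvdata = pd.DataFrame({
    'country': ['Germany', 'Germany', 'Germany', 'Brazil', 'Brazil', 'Brazil'],
    'target':  ['Mexico', 'USA', 'USA', 'Argentina', 'Argentina', 'Argentina'],
})

# Count how often each (country, target) pair occurs.
counts = (csvdata.groupby(['country', 'target'])
                 .size()
                 .reset_index(name='value'))

# Cast to plain int so json.dumps accepts the values
# (numpy integers are not JSON serializable).
records = [{'source': s, 'target': t, 'value': int(v)}
           for s, t, v in counts.itertuples(index=False)]

newData = json.dumps(records, indent=4)
```

This produces one flat list of records, so the output is a single valid JSON array.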

Related

How to add duplicate columns together after converting from excel to json in python?

I have an Excel file in this format:

Name  Question  Answer
N1    Q1        a1
N2    Q2        a2
N3    Q3        a3
N4    Q4        a4
N3    Q5        a3
Here some names are the same, and their corresponding answers are also the same. I want to convert this into JSON in a format where all the entries with the same name are merged.
[
    {
        "name": "N1",
        "exampleSentences": ["Q1"],
        "defaultReply": {
            "text": ["a1"],
            "type": "text"
        }
    },
    {
        "name": "N2",
        "exampleSentences": ["Q2"],
        "defaultReply": {
            "text": ["a2"],
            "type": "text"
        }
    },
    {
        "name": "N3",
        "exampleSentences": ["Q3", "Q5"],
        "defaultReply": {
            "text": ["a3"],
            "type": "text"
        }
    },
    {
        "name": "N4",
        "exampleSentences": ["Q4"],
        "defaultReply": {
            "text": ["a4"],
            "type": "text"
        }
    }
]
Here is the code that I wrote:
# Import the required python modules
import pandas as pd
import math
import json
import csv

# Define the name of the Excel file
fileName = "FAQ_eng"

# Read the Excel file
df = pd.read_excel("{}.xlsx".format(fileName))

intents = []
intentNames = df["Name"]

# Loop through the list of Names and create a new intent for each row
for index, name in enumerate(intentNames):
    if name is not None:
        exampleSentences = []
        defaultReplies = []
        if df["Question"][index] is not None and df["Question"][index] is not float:
            try:
                exampleSentences = df["Question"][index]
                exampleSentences = [exampleSentences]
                defaultReplies = df["Answer"][index]
                defaultReplies = [defaultReplies]
            except:
                continue
        intents.append({
            "name": name,
            "exampleSentences": exampleSentences,
            "defaultReply": {
                "text": defaultReplies,
                "type": "text"
            }
        })

# Write the list of created intents into a JSON file
with open("{}.json".format(fileName), "w", encoding="utf-8") as outputFile:
    json.dump(intents, outputFile, ensure_ascii=False)
My code adds another JSON entry

{
    "name": "N3",
    "exampleSentences": ["Q5"],
    "defaultReply": {
        "text": ["a3"],
        "type": "text"
    }
}

instead of merging Q3 and Q5. What should I do?
The problem in your code is that you iterate through the items but never check whether the current name has already appeared in a previous iteration. You can avoid this if you use an initially empty dictionary d, storing key-value pairs of the form d[name] = {"exampleSentences": [question], "text": [answer]}. You can then iterate over df["Name"] like below:
intentNames = df["Name"]
d = {}

# Loop through intentNames and create the dictionary
for index, name in enumerate(intentNames):
    question = df["Question"][index]
    answer = df["Answer"][index]
    if name not in d:
        d[name] = {"exampleSentences": [question], "text": [answer]}
    else:
        d[name]["exampleSentences"].append(question)
Then you can use the created dictionary to create the json file with the expected output like below:
# create the json array file
intents = []
for k, v in d.items():
    intents.append({
        "name": k,
        "exampleSentences": v['exampleSentences'],
        "defaultReply": {
            "text": v['text'],
            "type": "text"
        }
    })

# Write the list of created intents into a JSON file
with open("{}.json".format(fileName), "w", encoding="utf-8") as outputFile:
    json.dump(intents, outputFile, ensure_ascii=False)
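The same merge can also be written with a pandas groupby, which collects all questions per name in one pass. A sketch assuming the Name/Question/Answer columns from the question (the sample frame below is made up):

```python
import json
import pandas as pd

# Stand-in for the Excel sheet from the question (assumed columns).
df = pd.DataFrame({
    'Name':     ['N1', 'N2', 'N3', 'N4', 'N3'],
    'Question': ['Q1', 'Q2', 'Q3', 'Q4', 'Q5'],
    'Answer':   ['a1', 'a2', 'a3', 'a4', 'a3'],
})

intents = []
# sort=False keeps names in first-appearance order, like the dict approach.
for name, group in df.groupby('Name', sort=False):
    intents.append({
        'name': name,
        'exampleSentences': group['Question'].tolist(),
        # Duplicate names share one answer, so keep only the first.
        'defaultReply': {'text': [group['Answer'].iloc[0]], 'type': 'text'},
    })

output = json.dumps(intents, ensure_ascii=False, indent=4)
```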

Excel to JSON format with python

I have an Excel sheet in the format below.
I want to convert this Excel sheet into JSON format using Python: each JSON object pairs one row's cell values with the column headings, in the following format.
{
    "Records": [
        {
            "RecordId": "F1",
            "Assets": [
                {
                    "AssetId": "A1",
                    "Support": "S11"
                },
                {
                    "AssetId": "A2",
                    "Support": "S12"
                },
                {
                    "AssetId": "A3",
                    "Support": "S13"
                }
            ]
        },
        {
            "RecordId": "F2",
            "Assets": [
                {
                    "AssetId": "A1",
                    "Support": "S21"
                },
                {
                    "AssetId": "A2",
                    "Support": "S22"
                },
                {
                    "AssetId": "A3",
                    "Support": "S23"
                }
            ]
        }
    ]
}
I have written some code, but it does not work as I expected:
import json
import pandas as pd

df = pd.read_excel(r'test.xlsx', sheet_name='Sheet2')

# initialize data
data = [0 for i in range(len(df))]
datac = [0 for c in range(len(df.columns))]
newset = dict()

for i in range(len(df)):
    # data[i] = r'{"'+str(df.columns.values[0])+'": "' +str(df.loc[i][0])+'", '+str(df.columns.values[1])+'": "' +str(df.loc[i][1])+'", '+str(df.columns.values[2])+'": "' +str(df.loc[i][2])+'"}'
    # data[i] = {str(df.columns.values[1]): str(df.loc[i][0]), str(df.columns.values[1]): str(df.loc[i][1]), str(df.columns.values[2]): str(df.loc[i][2])}
    for c in range(1, len(df.columns)):
        # data[i] = {str('RecordId'): str(df.loc[i][0]), str('Assets'): [{"AssetId": str(df.columns.values[c]), "Support": str(df.loc[i][c])}]}
        datac[c] = {"AssetId": str(df.columns.values[c]), "Support": str(df.loc[i][c])}
        data[i] = {'RecordId': str(df.loc[i][0]), 'Assets': datac[c]}
    print(data[i])

output_lines = [json.dumps(line) + ",\n" for line in data]
output_lines[-1] = output_lines[-1][:-2]  # remove ",\n" from last line

with open(r'Savedwork.json', 'w') as json_file:
    json_file.writelines(output_lines)
What you need is the iterrows() method: it iterates over the dataframe's rows as (index, series) pairs. The columns attribute gives you the list of column names, so you can iterate over the columns in the series and access them by name.
import json
import pandas as pd

df = pd.read_excel('test.xlsx')

recs = []
for i, row in df.iterrows():
    rec = {
        'RecordId': row.iloc[0],  # first cell of the row, by position
        'Assets': [{'AssetId': c, 'Support': row[c]} for c in df.columns[1:]]
    }
    recs.append(rec)

out = {'Records': recs}
(yes, it could all be done in a single list comprehension, but abusing those hinders readability)
Also, you don't need to json.dumps individual lines and then assemble them with commas and newlines; don't work at the text level. Build a dictionary with the entire data, and then json.dump that:
print(json.dumps(out, indent=4))
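To write the same structure to a file instead of printing it, pass the file handle to json.dump directly (sketch, using a tiny made-up stand-in for the out dictionary built above):

```python
import json

# Tiny stand-in for the out dictionary (assumed shape from the question).
out = {'Records': [{'RecordId': 'F1',
                    'Assets': [{'AssetId': 'A1', 'Support': 'S11'}]}]}

# json.dump serializes the whole structure in one call;
# no manual commas or line assembly needed.
with open('Savedwork.json', 'w') as json_file:
    json.dump(out, json_file, indent=4)
```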
You can create the dicts directly in pandas.
First set the first column with F1, F2 as index:
df.set_index(0, inplace = True)
df.index.name = None
Then create the dicts in pandas with dict keys as column names, export it to a dict and save it to json:
import json

df = df.apply(lambda x: [{"AssetId": x.name, "Support": i} for i in x], axis=1).reset_index().rename(columns={'index': 'RecordId', 0: 'Assets'})
json_data = {"Records": df.to_dict('records')}

with open(r'Savedwork.json', 'w') as fp:
    json.dump(json_data, fp)
Another solution is to take a snapshot of the entire workbook in JSON format and reorganize it from there. Using the collect function of XLtoy, it is possible to do that via the command line; this approach allows you more degrees of freedom.
[I'm the main developer of XLtoy]

how to extract columns for dictionary that do not have keys

I have tried resources on how to transform a dict into a data frame, but the problem is that this is a weird dict.
It is not like key: {}, key: {}, etc.
The data has lots of items, but the goal is to extract only the stuff inside the dict {}; getting the dates as well would be a plus.
data:
id,client,source,status,request,response,queued,created_at,updated_at
54252,sdf,https://asdasdadadad,,"{
"year": "2010",
"casa": "aca",
"status": "p",
"Group": "57981",
}",,1,"2020-05-02 11:06:17","2020-05-02 11:06:17"
54252,msc-lp,https://discover,,"{
"year": "27",
"casa": "Na",
"status": "p",
"Group": "57981",
}"
My attempts:

# attempt 1
with open('data.csv') as fd:
    pairs = (line.split(None) for line in fd)
    res = {int(pair[0]): pair[1] for pair in pairs if len(pair) == 2 and pair[0].isdigit()}

# attempt 2
import json
# reading the JSON data using json.load()
file = 'data.json'
with open(file) as train_file:
    dict_train = json.load(train_file)
# converting json dataset from dictionary to dataframe
train = pd.DataFrame.from_dict(dict_train, orient='index')
train.reset_index(level=0, inplace=True)

# attempt 3
df = pd.read_csv("data.csv")
df = df.melt(id_vars=["index", "Date"], var_name="variables", value_name="values")
Nothing works, because the data is oddly shaped.
Expected output: every key inside the dictionary becomes one column of the df.

Date                 year  casa  status  Group
2020-05-02 11:06:17  2010  aca   p       57981
2020-05-02 11:06:17  27    Na    p       57981
First, format the data into a valid CSV structure:
id,client,source,status,request,response,queued,created_at,updated_at
54252,sdf,https://asdasdadadad,,'{ "ag": "2010", "ca": "aca", "ve": "p", "Group": "57981" }',,1,"2020-05-02 11:06:17","2020-05-02 11:06:17"
54252,msc-lp,https://discover,,'{ "ag": "27", "ca": "Na", "ve": "p", "Group": "57981" }',,1,"2020-05-02 11:06:17","2020-05-02 11:06:17"
This should work for the worst-case scenario as well; check it out.
import json
import pandas as pd

def parse_column(data):
    try:
        return json.loads(data)
    except Exception as e:
        print(e)
        return None

df = pd.read_csv('tmp.csv', converters={"request": parse_column}, quotechar="'")
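Once the request column holds real dicts, the flat table from the expected output can be built with pd.json_normalize. A sketch assuming the created_at column carries the dates; the frame below is a made-up stand-in for the parsed CSV:

```python
import pandas as pd

# Stand-in for the parsed frame: 'request' already holds dicts, as produced
# by the parse_column converter above (keys assumed from the question).
df = pd.DataFrame({
    'request': [
        {'year': '2010', 'casa': 'aca', 'status': 'p', 'Group': '57981'},
        {'year': '27', 'casa': 'Na', 'status': 'p', 'Group': '57981'},
    ],
    'created_at': ['2020-05-02 11:06:17', '2020-05-02 11:06:17'],
})

# Expand each dict into its own columns, then put the date in front.
flat = pd.json_normalize(df['request'].tolist())
flat.insert(0, 'Date', df['created_at'])
```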

convert csv file to multiple nested json format

I have written code to convert a CSV file to nested JSON format. I have multiple columns to be nested, hence I assign them separately for each column. The problem is that I'm getting two fields for the same column in the JSON output.
import csv
import json
from collections import OrderedDict

csv_file = 'data.csv'
json_file = csv_file + '.json'

def main(input_file):
    csv_rows = []
    with open(input_file, 'r') as csvfile:
        reader = csv.DictReader(csvfile, delimiter='|')
        for row in reader:
            row['TYPE'] = 'REVIEW',  # adding new key, value
            row['RAWID'] = 1,
            row['CUSTOMER'] = {
                "ID": row['CUSTOMER_ID'],
                "NAME": row['CUSTOMER_NAME']
            }
            row['CATEGORY'] = {
                "ID": row['CATEGORY_ID'],
                "NAME": row['CATEGORY']
            }
            del (row["CUSTOMER_NAME"], row["CATEGORY_ID"],
                 row["CATEGORY"], row["CUSTOMER_ID"])  # deleting since fields occurring twice
            csv_rows.append(row)
    with open(json_file, 'w') as f:
        json.dump(csv_rows, f, sort_keys=True, indent=4, ensure_ascii=False)
        f.write('\n')
The output is as below:
[
    {
        "CATEGORY": {
            "ID": "1",
            "NAME": "Consumers"
        },
        "CATEGORY_ID": "1",
        "CUSTOMER_ID": "41",
        "CUSTOMER": {
            "ID": "41",
            "NAME": "SA Port"
        },
        "CUSTOMER_NAME": "SA Port",
        "RAWID": [
            1
        ]
    }
]
I'm getting two entries for the fields I assigned using row['']. Is there any way to get rid of this? I want only one entry per field in each record.
Also, how can I convert the keys to lower case after reading with csv.DictReader()? In my CSV file all the columns are upper case, hence I use the same names when assigning, but I want to convert all of them to lower case.
In order to convert the keys to lower case, it is simpler to generate a new dict per row; incidentally, that is also enough to get rid of the duplicate fields:
for row in reader:
    orow = collections.OrderedDict()
    orow['type'] = 'REVIEW'  # adding new key, value (no trailing comma, which would create a tuple)
    orow['rawid'] = 1
    orow['customer'] = {
        "id": row['CUSTOMER_ID'],
        "name": row['CUSTOMER_NAME']
    }
    orow['category'] = {
        "id": row['CATEGORY_ID'],
        "name": row['CATEGORY']
    }
    csv_rows.append(orow)
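For the lower-casing part specifically: rows from csv.DictReader are ordinary dicts, so a dict comprehension can lower-case every key right after reading. A small self-contained sketch with made-up, pipe-delimited sample data standing in for data.csv:

```python
import csv
import io

# Hypothetical sample standing in for the pipe-delimited data.csv.
sample = ("CUSTOMER_ID|CUSTOMER_NAME|CATEGORY_ID|CATEGORY\n"
          "41|SA Port|1|Consumers\n")

rows = []
for row in csv.DictReader(io.StringIO(sample), delimiter='|'):
    # Rebuild each row with every key lower-cased.
    rows.append({k.lower(): v for k, v in row.items()})
```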

filter json file with python

How do I filter a JSON file to show only the information I need?
To start off, I want to say I'm fairly new to Python and to working with JSON, so sorry if this question was asked before and I overlooked it.
I have a JSON file that looks like this:
[
    {
        "Store": 417,
        "Item": 10,
        "Name": "Burger",
        "Modifiable": true,
        "Price": 8.90,
        "LastModified": "09/02/2019 21:30:00"
    },
    {
        "Store": 417,
        "Item": 15,
        "Name": "Fries",
        "Modifiable": false,
        "Price": 2.60,
        "LastModified": "10/02/2019 23:00:00"
    }
]
I need to filter this file to only show Item and Price, like
[
    {
        "Item": 10,
        "Price": 8.90
    },
    {
        "Item": 15,
        "Price": 2.60
    }
]
I have code that looks like this:

import json

# Transform json input to python objects
with open("StorePriceList.json") as input_file:
    input_dict = json.load(input_file)

# Filter python objects with list comprehensions
output_dict = [x for x in input_dict if ]  # missing logical test here

# Transform python object back into json
output_json = json.dumps(output_dict)

# Show json
print(output_json)
What logical test should I be doing here to achieve that?
Using a dict comprehension, it would be:
output_dict = [{k:v for k,v in x.items() if k in ["Item", "Price"]} for x in input_dict]
You can also do it like this :)
>>> [{key: d[key] for key in ['Item', 'Price']} for d in input_dict] # you should rename it to `input_list` rather than `input_dict` :)
[{'Item': 10, 'Price': 8.9}, {'Item': 15, 'Price': 2.6}]
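If you also want an actual filter condition (the part left blank in the question), it combines naturally with the key projection. A small sketch using the Modifiable flag from the sample data:

```python
# Sample records copied from the question (trimmed to the relevant keys).
input_dict = [
    {"Store": 417, "Item": 10, "Name": "Burger", "Modifiable": True, "Price": 8.90},
    {"Store": 417, "Item": 15, "Name": "Fries", "Modifiable": False, "Price": 2.60},
]

# Keep only modifiable items, then project each one down to Item and Price.
output_dict = [{k: x[k] for k in ("Item", "Price")}
               for x in input_dict if x["Modifiable"]]
```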
import json

with open('data.json', 'r') as f:
    qe = json.load(f)

results = []
for item in qe:  # the JSON top level is a list, so iterate it directly
    query = f'{item["Item"]} {item["Price"]}'
    results.append(query)
    print(query)
