I want to read a .xlsx file, do some things with the data, and convert it to a dict to save it in a .json file. For that I use Python 3 and pandas.
This is the code:
import pandas as pd
import json

xls = pd.read_excel(
    io="20codmun.xlsx",
    converters={
        "CODAUTO": str,
        "CPRO": str,
        "CMUN": str,
        "DC": str
    }
)
print(xls)
#print(xls.columns.values)

outDict = {}
print(len(xls["NOMBRE"])) # 8131 rows
for i in range(len(xls.index)):
    codauto = xls["CODAUTO"][i]
    cpro = xls["CPRO"][i]
    cmun = xls["CMUN"][i]
    dc = xls["DC"][i]
    aemetId = cpro + cmun
    outDict[xls["NOMBRE"][i]] = {
        "CODAUTO": codauto,
        "CPRO": cpro,
        "CMUN": cmun,
        "DC": dc,
        "AEMET_ID": aemetId
    }
print(i) # 8130
print(len(outDict)) # 8114 entries, SOME ENTRIES ARE LOST!!!!!
#print(outDict["Petrer"])
with open("data.json", "w") as outFile:
    json.dump(outDict, outFile)
I add here the source of the .xlsx file (Spanish government). Select "Fichero con todas las provincias". You have to delete the first row.
As you can see, the pandas.DataFrame has 8131 rows and the final loop index is 8130, but the length of the final dict is 8114, so some data is lost!
You can check that "Aljucén" is in the .xlsx file but not in the .json one. Edit: solved using json.dump(outDict, outFile, ensure_ascii=False)
I have analyzed the file and it seems some "NOMBRE" values are duplicated. Try executing xls["NOMBRE"].value_counts() and you will see that, for example, "Sada" appears twice. You will also see that there are exactly 8114 unique values.
As you are using the city name as the dictionary key, when the key is duplicated you are overwriting the previous value in the dict.
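To see exactly which names collide, a quick check (a sketch, assuming xls is the DataFrame loaded in the question):
counts = xls["NOMBRE"].value_counts()
print(counts[counts > 1])  # only the names that appear more than once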
I agree with gontxomde that if column "NOMBRE" has non-unique values, then it may lead to keys being overwritten in the new dictionary.
To make a proof of concept I made a minimal example based on your approach:
import pandas as pd

feature_str = ['a', 'b', 'c']
df = pd.DataFrame({"NOMBRE": [1, 1, 3],
                   "CODAUTO": feature_str,
                   "CPRO": feature_str,
                   "CMUN": feature_str,
                   "DC": feature_str
                   })
outDict = {}
print(len(df["NOMBRE"])) # 3 rows
for i in range(len(df.index)):
    codauto = df["CODAUTO"][i]
    cpro = df["CPRO"][i]
    cmun = df["CMUN"][i]
    dc = df["DC"][i]
    aemetId = cpro + cmun
    outDict[df["NOMBRE"][i]] = {
        "CODAUTO": codauto,
        "CPRO": cpro,
        "CMUN": cmun,
        "DC": dc,
        "AEMET_ID": aemetId
    }
print(outDict)
print(outDict)
Which yields:
{1: {'CODAUTO': 'b', 'CPRO': 'b', 'CMUN': 'b', 'DC': 'b', 'AEMET_ID': 'bb'},
3: {'CODAUTO': 'c', 'CPRO': 'c', 'CMUN': 'c', 'DC': 'c', 'AEMET_ID': 'cc'}}
If I may suggest: instead of iterating over the index of the DataFrame, it would be better to use DataFrame methods:
df.set_index("NOMBRE") \
.to_dict(orient='index')
If you used this on a dataset with unique values in NOMBRE, it would yield the same result as the loop you created. Additionally, in case you had duplicates, it would raise a ValueError:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [15], in <module>
----> 1 df.set_index("NOMBRE").to_dict(orient='index')
File ~/.pyenv/versions/3.8.7/envs/jupyter/lib/python3.8/site-packages/pandas/core/frame.py:1607, in DataFrame.to_dict(self, orient, into)
1605 elif orient == "index":
1606 if not self.index.is_unique:
-> 1607 raise ValueError("DataFrame index must be unique for orient='index'.")
1608 return into_c(
1609 (t[0], dict(zip(self.columns, t[1:])))
1610 for t in self.itertuples(name=None)
1611 )
1613 else:
ValueError: DataFrame index must be unique for orient='index'.
If you have duplicated values in xls["NOMBRE"], each new duplicate will overwrite the previous one. So you need to choose a strategy to deal with duplicates, e.g. do you want different entries, like Sada and Sada(2)? Or do you want a single key Sada with the data from all the duplicates?
For the first strategy:
for i in range(len(xls.index)):
    entry = {
        "CODAUTO": xls["CODAUTO"][i],
        "CPRO": xls["CPRO"][i],
        "CMUN": xls["CMUN"][i],
        "DC": xls["DC"][i],
        "AEMET_ID": xls["CPRO"][i] + xls["CMUN"][i]
    }
    name = xls["NOMBRE"][i]
    # if it's the first time the value appears, just do the "normal" thing
    if name not in outDict:
        outDict[name] = entry
    # if the name was seen before, append the duplicate number to the key
    else:
        for n in range(2, xls["NOMBRE"].value_counts()[name] + 1):
            if name + "(" + str(n) + ")" not in outDict:
                outDict[name + "(" + str(n) + ")"] = entry
                break
For the second case, there are good solutions here and here.
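As an illustration of the second strategy, a minimal sketch using groupby (it collects every row sharing a NOMBRE under a single key, as a list of records):
grouped = {
    name: group.drop(columns="NOMBRE").to_dict(orient="records")
    for name, group in xls.groupby("NOMBRE")
}
A duplicated name like Sada then maps to a list with one dict per municipality.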
I am working on a file with a lot of data structures, but I cannot figure out an efficient way to handle all of them. My idea is to read line by line and find braces in pairs. Is there an efficient way to match the braces so that I can then handle each type with specific logic?
Here is the file I am facing:
.....
# some header info that can be discarded
object node {
    name R2-12-47-3_node_453;
    phases ABCN;
    voltage_A 7200+0.0j;
    voltage_B -3600-6235j;
    voltage_C -3600+6235j;
    nominal_voltage 7200;
    bustype SWING;
}
...
# a lot of objects node
object triplex_meter {
    name R2-12-47-3_tm_403;
    phases AS;
    voltage_1 120;
    voltage_2 120;
    voltage_N 0;
    nominal_voltage 120;
}
....
# a lot of object triplex_meter
object triplex_line {
    groupid Triplex_Line;
    name R2-12-47-3_tl_409;
    phases AS;
    from R2-12-47-3_tn_409;
    to R2-12-47-3_tm_409;
    length 30;
    configuration triplex_line_configuration_1;
}
...
# a lot of object triplex_meter
#some nested objects...awh...
So my question is: is there a way to quickly match "{" and "}" so that I can focus on the type inside?
I am expecting some logic like after parsing the file:
if obj_type == "node":
# to do 1
elif obj_type == "triplex_meter":
# to do 2
It seems easy to deal with this structure, but I am not sure exactly where to get started.
Code with comments
file = """
object node {
name R2-12-47-3_node_453
phases ABCN
voltage_A 7200+0.0j
voltage_B - 3600-6235j
voltage_C - 3600+6235j
nominal_voltage 7200
bustype SWING
}
object triplex_meter {
name R2-12-47-3_tm_403
phases AS
voltage_1 120
voltage_2 120
voltage_N 0
nominal_voltage 120
}
object triplex_line {
groupid Triplex_Line
name R2-12-47-3_tl_409
phases AS
from R2-12-47-3_tn_409
to R2-12-47-3_tm_409
length 30
configuration triplex_line_configuration_1
}"""
# New python dict
data = {}
# Generate a list with all objects taken from file
x = file.replace('\n', '').replace(' - ', ' ').strip().split('object ')
for i in x:
    # Exclude null items in the list to avoid errors
    if i != '':
        # Hard split
        a, b = i.split('{')
        c = b.split(' ')
        # Generate a new list with non null elements
        d = [e.replace('}', '') for e in c if e != '' and e != ' ']
        # Needing a sub dict here for paired values
        sub_d = {}
        # Iterating over list to get paired values
        for index in range(len(d)):
            # We are working with paired values, so we only unpack even indexes
            if index % 2 == 0:
                # Inserting paired values in sub_dict
                sub_d[d[index]] = d[index + 1]
        # Inserting sub_dict in main dict "data" using object name
        data[a.strip()] = sub_d
print(data)
Output
{'node': {'name': 'R2-12-47-3_node_453', 'phases': 'ABCN', 'voltage_A': '7200+0.0j', 'voltage_B': '3600-6235j', 'voltage_C': '3600+6235j', 'nominal_voltage': '7200', 'bustype': 'SWING'}, 'triplex_meter': {'name': 'R2-12-47-3_tm_403', 'phases': 'AS', 'voltage_1': '120', 'voltage_2': '120', 'voltage_N': '0', 'nominal_voltage': '120'}, 'triplex_line': {'groupid': 'Triplex_Line', 'name': 'R2-12-47-3_tl_409', 'phases': 'AS', 'from': 'R2-12-47-3_tn_409', 'to': 'R2-12-47-3_tm_409', 'length': '30', 'configuration': 'triplex_line_configuration_1'}}
You can now use the python dict how you want.
For e.g.
print(data['triplex_meter']['name'])
EDIT
If you have lots of "triplex_meter" objects in your file, group them in a Python list before inserting them into the main dict, as sketched below.
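A minimal sketch of that grouping, reusing the same parsing as above (collections.defaultdict keeps one list per object type, so repeated types no longer overwrite each other):
from collections import defaultdict

data = defaultdict(list)
for i in file.replace('\n', '').replace(' - ', ' ').strip().split('object '):
    if i:
        a, b = i.split('{')
        d = [e.replace('}', '') for e in b.split(' ') if e not in ('', ' ')]
        # pair even-indexed tokens (keys) with odd-indexed tokens (values)
        data[a.strip()].append(dict(zip(d[::2], d[1::2])))
print(data['triplex_meter'])  # a list of dicts, one per triplex_meter object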
I have an example json data file which has the following structure:
{
    "Header": {
        "Code1": "abc",
        "Code2": "def",
        "Code3": "ghi",
        "Code4": "jkl"
    },
    "TimeSeries": {
        "2020-11-25T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        },
        "2020-11-26T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        }
    }
}
When I parse this into Databricks with the command:
df = spark.read.json("/FileStore/test.txt")
I get as output 2 objects: Header and TimeSeries. With the TimeSeries I want to be able to flatten the structure so it has the following schema:
Date
UnitPrice
Amount
As the date field is a key, I am currently only able to access it by iterating through the column names and then using them in dot-notation dynamically:
from pyspark.sql.functions import lit

def flatten_json(data):
    columnlist = data.select("TimeSeries.*")
    count = 0
    for name in columnlist.columns:
        df1 = data.select("Header.*").withColumn("Timeseries", lit(columnlist.columns[count])).withColumn("join", lit("a"))
        df2 = data.select("TimeSeries." + columnlist.columns[count] + ".*").withColumn("join", lit("a"))
        if count == 0:
            df3 = df1.join(df2, on=['join'], how="inner")
        else:
            df3 = df3.union(df1.join(df2, on=['join'], how="inner"))
        count = count + 1
    return df3
This is far from ideal. Does anyone know a better method to create the described dataframe?
The idea:
Step 1: Extract Header and TimeSeries separately.
Step 2: For each field in the TimeSeries object, extract the Amount and UnitPrice, together with the name of the field, stuff them into a struct.
Step 3: Merge all these structs into an array column, and explode it.
Step 4: Extract Timeseries, Amount and UnitPrice from the exploded column.
Step 5: Cross join with the Header row.
import pyspark.sql.functions as F

header_df = df.select("Header.*")
timeseries_df = df.select("TimeSeries.*")

fieldNames = enumerate(timeseries_df.schema.fieldNames())
cols = [
    F.struct(
        F.lit(name).alias("Timeseries"),
        F.col(name).getItem("Amount").alias("Amount"),
        F.col(name).getItem("UnitPrice").alias("UnitPrice")
    ).alias("ts_" + str(idx))
    for idx, name in fieldNames
]

combined = F.explode(F.array(*cols)).alias("comb")
timeseries = timeseries_df.select(combined).select('comb.Timeseries', 'comb.Amount', 'comb.UnitPrice')
result = header_df.crossJoin(timeseries)
result.show(truncate=False)
Output:
+-----+-----+-----+-----+-------------------------+------+---------+
|Code1|Code2|Code3|Code4|Timeseries |Amount|UnitPrice|
+-----+-----+-----+-----+-------------------------+------+---------+
|abc |def |ghi |jkl |2020-11-25T03:00:00+00:00|10000 |1000 |
|abc |def |ghi |jkl |2020-11-26T03:00:00+00:00|10000 |1000 |
+-----+-----+-----+-----+-------------------------+------+---------+
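As a side note, steps 2 to 4 can also be written without enumerating the field names by hand. A rough sketch of an alternative (not the method above; it assumes every inner struct has the same Amount/UnitPrice shape and that the numbers fit LongType): round-trip the TimeSeries struct through JSON to get a map column, then explode it.
from pyspark.sql.functions import explode, from_json, to_json
from pyspark.sql.types import LongType, MapType, StringType, StructField, StructType

entry_type = StructType([
    StructField("Amount", LongType()),
    StructField("UnitPrice", LongType()),
])
# struct -> JSON string -> map<timestamp string, entry struct>
ts_map = from_json(to_json(df["TimeSeries"]), MapType(StringType(), entry_type))
# exploding a map yields one row per key/value pair
timeseries = df.select(explode(ts_map).alias("Timeseries", "entry")) \
    .select("Timeseries", "entry.Amount", "entry.UnitPrice")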
I have the following complex data that I would like to parse in PySpark:
records = '[{"segmentMembership":{"ups":{"FF6KCPTR6AQ0836R":{"lastQualificationTime":"2021-01-16 22:05:11.074357","status":"exited"},"QMS3YRT06JDEUM8O":{"lastQualificationTime":"2021-01-16 22:05:11.074357","status":"realized"},"8XH45RT87N6ZV4KQ":{"lastQualificationTime":"2021-01-16 22:05:11.074357","status":"exited"}}},"_aepgdcdevenablement2":{"emailId":{"address":"stuff#someemail.com"},"person":{"name":{"firstName":"Name2"}},"identities":{"customerid":"PH25PEUWOTA7QF93"}}},{"segmentMembership":{"ups":{"FF6KCPTR6AQ0836R":{"lastQualificationTime":"2021-01-16 22:05:11.074457","status":"realized"},"D45TOO8ZUH0B7GY7":{"lastQualificationTime":"2021-01-16 22:05:11.074457","status":"realized"},"QMS3YRT06JDEUM8O":{"lastQualificationTime":"2021-01-16 22:05:11.074457","status":"existing"}}},"_aepgdcdevenablement2":{"emailId":{"address":"stuff4#someemail.com"},"person":{"name":{"firstName":"TestName"}},"identities":{"customerid":"9LAIHVG91GCREE3Z"}}}]'
df = spark.read.json(sc.parallelize([records]))
df.show()
df.printSchema()
The problem I am having is with the segmentMembership object. The JSON object looks like this:
"segmentMembership": {
"ups": {
"FF6KCPTR6AQ0836R": {
"lastQualificationTime": "2021-01-16 22:05:11.074357",
"status": "exited"
},
"QMS3YRT06JDEUM8O": {
"lastQualificationTime": "2021-01-16 22:05:11.074357",
"status": "realized"
},
"8XH45RT87N6ZV4KQ": {
"lastQualificationTime": "2021-01-16 22:05:11.074357",
"status": "exited"
}
}
}
The annoying thing with this is that the key values ("FF6KCPTR6AQ0836R", "QMS3YRT06JDEUM8O", "8XH45RT87N6ZV4KQ") end up being defined as columns in PySpark.
In the end, if the status of the segment is "exited", I was hoping to get the results as follows.
+--------------------+----------------+---------+------------------+
|address |customerid |firstName|segment_id |
+--------------------+----------------+---------+------------------+
|stuff#someemail.com |PH25PEUWOTA7QF93|Name2 |[8XH45RT87N6ZV4KQ]|
|stuff4#someemail.com|9LAIHVG91GCREE3Z|TestName |[8XH45RT87N6ZV4KQ]|
+--------------------+----------------+---------+------------------+
After loading the data into a dataframe (above), I tried the following:
from pyspark.sql.functions import array, lit, udf
from pyspark.sql.types import ArrayType, StringType

dfx = df.select("_aepgdcdevenablement2.emailId.address", "_aepgdcdevenablement2.identities.customerid", "_aepgdcdevenablement2.person.name.firstName", "segmentMembership.ups")
dfx.show(truncate=False)

seg_list = array(*[lit(k) for k in ["8XH45RT87N6ZV4KQ", "QMS3YRT06JDEUM8O"]])
print(seg_list)

# if v["status"] in ['existing', 'realized']
def confusing_compare(ups, seg_list):
    seg_id_filtered_d = dict((k, ups[k]) for k in seg_list if k in ups)
    # This is the line I am having a problem with.
    # seg_id_status_filtered_d = {key for key, value in seg_id_filtered_d.items() if v["status"] in ['existing', 'realized']}
    return list(seg_id_filtered_d)

final_conf_dx_pred = udf(confusing_compare, ArrayType(StringType()))

result_df = dfx.withColumn("segment_id", final_conf_dx_pred(dfx.ups, seg_list)).select("address", "customerid", "firstName", "segment_id")
result_df.show(truncate=False)
I am not able to check the status field within the value field of the dict.
You can actually do that without using a UDF. Here I'm using all the segment names present in the schema and filtering out those with status = 'exited'. You can adapt it depending on which segments and statuses you want.
First, using the schema fields, get the list of all segment names like this:
from pyspark.sql.functions import array, col, expr, lit, when

segment_names = df.select("segmentMembership.ups.*").schema.fieldNames()
Then, by looping through the list created above and using the when function, you can create columns that hold either the segment name or null, depending on the status:
active_segments = [
    when(col(f"segmentMembership.ups.{c}.status") != lit("exited"), lit(c))
    for c in segment_names
]
Finally, add a new array column segments and use the filter function to remove null elements from the array (which correspond to status 'exited'):
dfx = df.withColumn("segments", array(*active_segments)) \
    .withColumn("segments", expr("filter(segments, x -> x is not null)")) \
    .select(
        col("_aepgdcdevenablement2.emailId.address"),
        col("_aepgdcdevenablement2.identities.customerid"),
        col("_aepgdcdevenablement2.person.name.firstName"),
        col("segments").alias("segment_id")
    )
dfx.show(truncate=False)
#+--------------------+----------------+---------+------------------------------------------------------+
#|address |customerid |firstName|segment_id |
#+--------------------+----------------+---------+------------------------------------------------------+
#|stuff#someemail.com |PH25PEUWOTA7QF93|Name2 |[QMS3YRT06JDEUM8O] |
#|stuff4#someemail.com|9LAIHVG91GCREE3Z|TestName |[D45TOO8ZUH0B7GY7, FF6KCPTR6AQ0836R, QMS3YRT06JDEUM8O]|
#+--------------------+----------------+---------+------------------------------------------------------+
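As noted above, this can be adapted to other statuses. For instance, if you only want the segments whose status is "exited" (as in the expected output in the question), flipping the condition should be enough, e.g.:
active_segments = [
    when(col(f"segmentMembership.ups.{c}.status") == lit("exited"), lit(c))
    for c in segment_names
]
The rest of the code stays unchanged.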
I am trying to compare two JSON files and then write another JSON with the column names and the differences as yes or no. I am using pandas and numpy.
Below are sample files. In reality these JSON files are dynamic, meaning we don't know upfront how many keys there will be.
Input files:
fut.json:
[
    {
        "AlarmName": "test",
        "StateValue": "OK"
    }
]
curr.json:
[
    {
        "AlarmName": "test",
        "StateValue": "OK"
    }
]
Below is the code I have tried:
import json

import pandas as pd
import numpy as np

with open(r"c:\csv\fut.json", 'r+') as f:
    data_b = json.load(f)
with open(r"c:\csv\curr.json", 'r+') as f:
    data_a = json.load(f)

df_a = pd.json_normalize(data_a)
df_b = pd.json_normalize(data_b)

_, df_a = df_b.align(df_a, fill_value=np.NaN)
_, df_b = df_a.align(df_b, fill_value=np.NaN)

with open(r"c:\csv\report.json", 'w') as _file:
    for col in df_a.columns:
        df_temp = pd.DataFrame()
        df_temp[col + '_curr'], df_temp[col + '_fut'], df_temp[col + '_diff'] = df_a[col], df_b[col], np.where((df_a[col] == df_b[col]), 'No', 'Yes')
        #[df_temp.rename(columns={c:'Missing'}, inplace=True) for c in df_temp.columns if df_temp[c].isnull().all()]
        df_temp.fillna('Missing', inplace=True)
        with pd.option_context('display.max_colwidth', -1):
            _file.write(df_temp.to_json(orient='records'))
Expected output:
[
    {
        "AlarmName_curr": "test",
        "AlarmName_fut": "test",
        "AlarmName_diff": "No"
    },
    {
        "StateValue_curr": "OK",
        "StateValue_fut": "OK",
        "StateValue_diff": "No"
    }
]
Coming output: I am not able to parse it in a JSON validator. Below is the problem: those ][ should be replaced by ',' to get valid JSON; I don't know why it prints like that.
[{"AlarmName_curr":"test","AlarmName_fut":"test","AlarmName_diff":"No"}][{"StateValue_curr":"OK","StateValue_fut":"OK","StateValue_diff":"No"}]
Edit1:
I tried the below as well:
_file.write(df_temp.to_json(orient='records', lines=True))
Now I get JSON which is again not parsable: the ',' is missing, and unless I manually add ',' between the two dicts and '[' ']' at the beginning and end, it does not parse.
[{"AlarmName_curr":"test","AlarmName_fut":"test","AlarmName_diff":"No"}{"StateValue_curr":"OK","StateValue_fut":"OK","StateValue_diff":"No"}]
Honestly pandas is overkill for this... however:
load the dataframes as you did
concat them as columns and rename the columns
do the comparison and map the booleans to the desired Yes/No
to_json() returns a string, so use json.loads() to get it back into a list/dict; filter columns to get to your required format
import json
import pandas as pd

data_b = [
    {
        "AlarmName": "test",
        "StateValue": "OK"
    }
]
data_a = [
    {
        "AlarmName": "test",
        "StateValue": "OK"
    }
]

df_a = pd.json_normalize(data_a)
df_b = pd.json_normalize(data_b)

df = pd.concat([df_a, df_b], axis=1)
df.columns = [c + "_curr" for c in df_a.columns] + [c + "_fut" for c in df_b.columns]
# "No" when the values match, "Yes" when they differ
df["AlarmName_diff"] = df["AlarmName_curr"] != df["AlarmName_fut"]
df["StateValue_diff"] = df["StateValue_curr"] != df["StateValue_fut"]
df = df.replace({True: "Yes", False: "No"})

js = json.loads(df.loc[:, [c for c in df.columns if c.startswith("Alarm")]].to_json(orient="records"))
js += json.loads(df.loc[:, [c for c in df.columns if c.startswith("State")]].to_json(orient="records"))
js
output
[{'AlarmName_curr': 'test', 'AlarmName_fut': 'test', 'AlarmName_diff': 'No'},
 {'StateValue_curr': 'OK', 'StateValue_fut': 'OK', 'StateValue_diff': 'No'}]
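Since the question says the keys are dynamic, a hedged generalization of the same idea (reusing df and df_a from above) builds the _diff columns and the output list from whatever columns are present:
js = []
for c in df_a.columns:
    # recompute the diff per column pair and map to Yes/No
    df[c + "_diff"] = (df[c + "_curr"] != df[c + "_fut"]).map({True: "Yes", False: "No"})
    js += json.loads(df[[c + "_curr", c + "_fut", c + "_diff"]].to_json(orient="records"))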
I use openpyxl to read data from excel files to provide a json file at the end. The problem is that I cannot figure out an algorithm to do a hierarchical organisation of the json (or python dictionary).
The data form is like the following: rows of id, name and value, e.g. "1 first 10", "1.1 ab 25", "1.1.1 abc 16", and so on.
The output should be like this:
{
    'id': '1',
    'name': 'first',
    'value': 10,
    'children': [
        {
            'id': '1.1',
            'name': 'ab',
            'value': 25,
            'children': [
                {
                    'id': '1.1.1',
                    'name': 'abc',
                    'value': 16,
                    'children': []
                }
            ]
        },
        {
            'id': '1.2',
            ...
    ]
}
Here is what I have come up with, but I can't go beyond '1.1', because '1.1.1' and '1.1.1.1' and so on will be at the same level as '1.1'.
from openpyxl import load_workbook
from json import dumps

wb = load_workbook('resources.xlsx')
sheet = wb.get_sheet_by_name(wb.get_sheet_names()[0])
resources = {}
prev_dict = {}
list_rows = [row for row in sheet.rows]
for nrow in range(len(list_rows)):
    id = str(list_rows[nrow][0].value)
    val = {
        'id': id,
        'name': list_rows[nrow][1].value,
        'value': list_rows[nrow][2].value,
        'children': []
    }
    if id[:-2] == str(list_rows[nrow - 1][0].value):
        prev_dict['children'].append(val)
    else:
        resources[nrow] = val
        prev_dict = resources[nrow]
print(dumps(resources))
You need to access your data by ID, so the first step is to create a dictionary where the IDs are the keys. For easier data manipulation, the string "1.2.3" is converted to the tuple ("1", "2", "3"). (Lists are not allowed as dict keys.) This makes the computation of a parent key very easy (key[:-1]).
With this preparation, we could simply populate the children list of each item's parent. But before doing that a special ROOT element needs to be added. It is the parent of top-level items.
That's all. The code is below.
Note #1: It expects that every item has a parent. That's why 1.2.2 was added to the test data. If it is not the case, handle the KeyError where noted.
Note #2: The result is a list.
import json
testdata="""
1 first 20
1.1 ab 25
1.1.1 abc 16
1.2 cb 18
1.2.1 cbd 16
1.2.1.1 xyz 19
1.2.2 NEW -1
1.2.2.1 poz 40
1.2.2.2 pos 98
2 second 90
2.1 ezr 99
"""
datalist = [line.split() for line in testdata.split('\n') if line]
datadict = {tuple(item[0].split('.')): {'id': item[0],
                                        'name': item[1],
                                        'value': item[2],
                                        'children': []}
            for item in datalist}
ROOT = ()
datadict[ROOT] = {'children': []}
for key, value in datadict.items():
    if key != ROOT:
        # KeyError here = parent does not exist
        datadict[key[:-1]]['children'].append(value)
result = datadict[ROOT]['children']
print(json.dumps(result, indent=4))
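As a follow-up to Note #1, a hedged variant of the population loop that attaches orphans (items whose parent key is missing) under ROOT instead of raising KeyError:
for key, value in datadict.items():
    if key != ROOT:
        # fall back to ROOT when the parent key does not exist
        parent = datadict.get(key[:-1], datadict[ROOT])
        parent['children'].append(value)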