Manipulating data of a dataframe in Pandas - python

I'm reading a dataframe and converting it into a JSON file. I'm using Python 3 and pandas 0.25.3. I already got some help from you guys (Manipulating data of Pandas dataframe), but I have some questions about the code and how it works.
My dataframe:
id  label      id_customer  label_customer  part_number  number_client
6   Sao Paulo  CUST-99992   Brazil          7897         982
6   Sao Paulo  CUST-99992   Brazil          888          12
92  Hong Kong  CUST-88888   China           147          288
Code:
import pandas as pd

data = pd.read_excel(path)
data[["part_number", "number_client"]] = data[["part_number", "number_client"]].astype(str)

f = lambda x: x.split('_')[0]
j = (data.groupby(["id", "label", "id_customer", "label_customer"])['part_number', 'number_client']
         .apply(lambda x: x.rename(columns=f).to_dict('r')).reset_index(name='Number')
         .groupby(["id", "label"])["id_customer", "label_customer", "Number"]
         .apply(lambda x: x.rename(columns=f).to_dict('r')).reset_index(name='Customer')
         .to_json(orient='records'))
print(j)
JSON I'm getting:
[{
    "id": 6,
    "label": "Sao Paulo",
    "Customer": [{
        "id": "CUST-99992",
        "label": "Brazil",
        "number": [{
            "part": "7897",
            "client": "982"
        },
        {
            "part": "888",
            "client": "12"
        }]
    }]
},
{
    "id": 92,
    "label": "Hong Kong",
    "Customer": [{
        "id": "CUST-88888",
        "label": "China",
        "number": [{
            "part": "147",
            "client": "288"
        }]
    }]
}]
1st Question: The lambda passed to rename is splitting my column names wherever a _ is found. That is just a piece of my dataframe, and there are some columns whose names I'd like to preserve, e.g. I want to get part_number and number_client instead of part and client in my JSON structure. How can I fix this?
2nd Question: I can have different lists with the same key name, e.g. in the Customer list I have a part_number key, but the same key name can also appear inside another list with another value, e.g. part_number inside a test list.
3rd Question: In my complete dataframe I have a column called Additional_information that holds simple text. I need to get a structure like this:
...
"Additional_information":[{
{
"text": "testing",
}
},
{
"text": "testing again",
}
]
for a dataframe like this:
id  label      id_customer  label_customer  part_number  number_client  Additional_information
6   Sao Paulo  CUST-99992   Brazil          7897         982            testing
6   Sao Paulo  CUST-99992   Brazil          7897         982            testing again
What should I change?

1st Question:
You can write a custom function for rename, e.g.:
def f(x):
    vals = ['part_number', 'number_client']
    if x in vals:
        return x
    else:
        return x.split('_')[0]
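To see what the renamer does on its own, here is a quick sketch (the toy DataFrame is only for illustration):
import pandas as pd

def f(x):
    vals = ['part_number', 'number_client']
    if x in vals:
        return x
    else:
        return x.split('_')[0]

df = pd.DataFrame(columns=['part_number', 'number_client', 'id_customer'])
print(df.rename(columns=f).columns.tolist())
# ['part_number', 'number_client', 'id']
Any name listed in vals is kept verbatim; everything else is cut at the first underscore.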
2nd Question
If I understand correctly, the keys in the final JSON are created from the columns of the original DataFrame, and also from the name parameter of reset_index in my solution. If you want some other logic for changing keys (column names), it is possible to adapt the first solution.
3rd Question
In the original solution, to_json is changed to to_dict so the final list of dicts can be modified, e.g. to append the text info; json.dumps then produces the JSON in the last step:
import json

def f(x):
    vals = ['part_number', 'number_client']
    if x in vals:
        return x
    else:
        return x.split('_')[0]

d = (data.groupby(["id", "label", "id_customer", "label_customer"])['part_number', 'number_client']
         .apply(lambda x: x.rename(columns=f).to_dict('r')).reset_index(name='Number')
         .groupby(["id", "label"])["id_customer", "label_customer", "Number"]
         .apply(lambda x: x.rename(columns=f).to_dict('r')).reset_index(name='Customer')
         .to_dict(orient='records'))
#print (d)

d1 = (data[['Additional_information']].rename(columns={'Additional_information': 'text'})
                                      .to_dict(orient='records'))
d1 = {'Additional_information': d1}
print(d1)
{'Additional_information': [{'text': 'testing'}, {'text': 'testing again'}]}

d.append(d1)
#print (d)

j = json.dumps(d)
#print (j)
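If the text should instead be nested per (id, label) group like Customer, one possible sketch reusing the same groupby/apply pattern (this placement is an assumption, not part of the original answer):
# Build a per-group list of {'text': ...} dicts (assumed desired placement).
info = (data.groupby(["id", "label"])
            .apply(lambda x: x[['Additional_information']]
                              .rename(columns={'Additional_information': 'text'})
                              .to_dict('records'))
            .reset_index(name='Additional_information'))
# info can then be merged with the grouped result on ["id", "label"].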

Related

Python Pandas json_normalize with multiple lists of dicts

I'm trying to flatten a JSON file that was originally converted from XML using xmltodict(). There are multiple fields that may contain a list of dictionaries. I've tried using record_path with meta data to no avail, but I have not been able to get it to work when there are multiple fields that may have other nested fields. It's expected that some fields will be empty for any given record.
I have tried searching for another topic and couldn't find my specific problem with multiple nested fields. Can anyone point me in the right direction?
Thanks for any help that can be provided!
Sample base Python (without the record path)
import pandas as pd
import json

with open('./example.json', encoding="UTF-8") as json_file:
    json_dict = json.load(json_file)

df = pd.json_normalize(json_dict['WIDGET'])
print(df)
df.to_csv('./test.csv', index=False)
Sample JSON
{
    "WIDGET": [
        {
            "ID": "6",
            "PROBLEM": "Electrical",
            "SEVERITY_LEVEL": "1",
            "TITLE": "Battery's Missing",
            "CATEGORY": "User Error",
            "LAST_SERVICE": "2020-01-04T17:39:37Z",
            "NOTICE_DATE": "2022-01-01T08:00:00Z",
            "FIXABLE": "1",
            "COMPONENTS": {
                "WHATNOTS": {
                    "WHATNOT1": "Battery Compartment",
                    "WHATNOT2": "Whirlygig"
                }
            },
            "DIAGNOSIS": "Customer needs to put batteries in the battery compartment",
            "STATUS": "0",
            "CONTACT_TYPE": {
                "CALL": "1"
            }
        },
        {
            "ID": "1004",
            "PROBLEM": "Electrical",
            "SEVERITY_LEVEL": "4",
            "TITLE": "Flames emit from unit",
            "CATEGORY": "Dangerous",
            "LAST_SERVICE": "2015-06-04T21:40:12Z",
            "NOTICE_DATE": "2022-01-01T08:00:00Z",
            "FIXABLE": "0",
            "DIAGNOSIS": "A demon seems to have possessed the unit and his expelling flames from it",
            "CONSEQUENCE": "Could burn things",
            "SOLUTION": "Call an exorcist",
            "KNOWN_PROBLEMS": {
                "PROBLEM": [
                    {
                        "TYPE": "RECALL",
                        "NAME": "Bad Servo",
                        "DESCRIPTION": "Bad servo's shipped in initial product"
                    },
                    {
                        "TYPE": "FAILURE",
                        "NAME": "Operating outside normal conditions",
                        "DESCRIPTION": "Device failed when customer threw into wood chipper"
                    }
                ]
            },
            "STATUS": "1",
            "REPAIR_BULLETINS": {
                "BULLETIN": [
                    {
                        "#id": "4",
                        "#text": "Known target of the occult"
                    },
                    {
                        "#id": "5",
                        "#text": "Not meant to be thrown into wood chippers"
                    }
                ]
            },
            "CONTACT_TYPE": {
                "CALL": "1"
            }
        }
    ]
}
Sample CSV
ID,PROBLEM,SEVERITY_LEVEL,TITLE,CATEGORY,LAST_SERVICE,NOTICE_DATE,FIXABLE,DIAGNOSIS,STATUS,COMPONENTS.WHATNOTS.WHATNOT1,COMPONENTS.WHATNOTS.WHATNOT2,CONTACT_TYPE.CALL,CONSEQUENCE,SOLUTION,KNOWN_PROBLEMS.PROBLEM,REPAIR_BULLETINS.BULLETIN
6,Electrical,1,Battery's Missing,User Error,2020-01-04T17:39:37Z,2022-01-01T08:00:00Z,1,Customer needs to put batteries in the battery compartment,0,Battery Compartment,Whirlygig,1,,,,
1004,Electrical,4,Flames emit from unit,Dangerous,2015-06-04T21:40:12Z,2022-01-01T08:00:00Z,0,A demon seems to have possessed the unit and his expelling flames from it,1,,,1,Could burn things,Call an exorcist,"[{'TYPE': 'RECALL', 'NAME': 'Bad Servo', 'DESCRIPTION': ""Bad servo's shipped in initial product""}, {'TYPE': 'FAILURE', 'NAME': 'Operating outside normal conditions', 'DESCRIPTION': 'Device failed when customer threw into wood chipper'}]","[{'#id': '4', '#text': 'Known target of the occult'}, {'#id': '5', '#text': 'Not meant to be thrown into wood chippers'}]"
I have attempted to extract the data and turn it into a nested dictionary (instead of a nested list), so that pd.json_normalize() can work:
for row in range(len(json_dict['WIDGET'])):
    try:
        lis = json_dict['WIDGET'][row]['KNOWN_PROBLEMS']['PROBLEM']
        del json_dict['WIDGET'][row]['KNOWN_PROBLEMS']['PROBLEM']
        for i, item in enumerate(lis):
            json_dict['WIDGET'][row]['KNOWN_PROBLEMS'][str(i)] = item

        lis = json_dict['WIDGET'][row]['REPAIR_BULLETINS']['BULLETIN']
        del json_dict['WIDGET'][row]['REPAIR_BULLETINS']['BULLETIN']
        for i, item in enumerate(lis):
            json_dict['WIDGET'][row]['REPAIR_BULLETINS'][str(i)] = item
    except KeyError:
        continue

df = pd.json_normalize(json_dict['WIDGET']).T
print(df)
Rather than manually adding the varying keys from the larger dataset, here's a way to extract them automatically by identifying them as type list (provided they are nested by 2 levels only):
linkage = []
for item in json_dict['WIDGET']:
    for k1 in item.keys():  # get keys from first level
        if isinstance(item[k1], str):
            continue
        #print(item[k1])
        for k2 in item[k1].keys():  # get keys from second level
            if isinstance(item[k1][k2], str):
                continue
            #print(item[k1][k2])
            if isinstance(item[k1][k2], list):
                linkage.append((k1, k2))
print(linkage)
# [('KNOWN_PROBLEMS', 'PROBLEM'), ('REPAIR_BULLETINS', 'BULLETIN')]

for row in range(len(json_dict['WIDGET'])):
    for link in linkage:
        try:
            lis = json_dict['WIDGET'][row][link[0]][link[1]]
            del json_dict['WIDGET'][row][link[0]][link[1]]  # delete original dict value (which is a list)
            for i, item in enumerate(lis):
                json_dict['WIDGET'][row][link[0]][str(i)] = item  # replace list with dict value (which is a dict)
        except KeyError:
            continue

df = pd.json_normalize(json_dict['WIDGET']).T
print(df)
Output:
                                                          0                               1
ID                                                        6                            1004
PROBLEM                                          Electrical                      Electrical
SEVERITY_LEVEL                                            1                               4
TITLE                                     Battery's Missing           Flames emit from unit
CATEGORY                                         User Error                       Dangerous
LAST_SERVICE                           2020-01-04T17:39:37Z            2015-06-04T21:40:12Z
NOTICE_DATE                            2022-01-01T08:00:00Z            2022-01-01T08:00:00Z
FIXABLE                                                   1                               0
DIAGNOSIS                     Customer needs to put batt...   A demon seems to have poss...
STATUS                                                    0                               1
COMPONENTS.WHATNOTS.WHATNOT1            Battery Compartment                             NaN
COMPONENTS.WHATNOTS.WHATNOT2                      Whirlygig                             NaN
CONTACT_TYPE.CALL                                         1                               1
CONSEQUENCE                                             NaN               Could burn things
SOLUTION                                                NaN                Call an exorcist
KNOWN_PROBLEMS.0.TYPE                                   NaN                          RECALL
KNOWN_PROBLEMS.0.NAME                                   NaN                       Bad Servo
KNOWN_PROBLEMS.0.DESCRIPTION                            NaN   Bad servo's shipped in ini...
KNOWN_PROBLEMS.1.TYPE                                   NaN                         FAILURE
KNOWN_PROBLEMS.1.NAME                                   NaN   Operating outside normal c...
KNOWN_PROBLEMS.1.DESCRIPTION                            NaN   Device failed when custome...
REPAIR_BULLETINS.0.#id                                  NaN                               4
REPAIR_BULLETINS.0.#text                                NaN      Known target of the occult
REPAIR_BULLETINS.1.#id                                  NaN                               5
REPAIR_BULLETINS.1.#text                                NaN   Not meant to be thrown int...

How to remove delimiting pipes from my JSON column and split them into different columns with their respective values

"description": ID|100|\nName|Sam|\nCity|New York City|\nState|New York|\nContact|1234567890|\nEmail|1234#yahoo.com|
This is what my JSON looks like. I wanted to convert this JSON file to an Excel sheet, splitting the nested column into separate columns, and used pandas for it, but couldn't achieve it. The output I want in my Excel sheet is:
ID   Name  City           State     Contact     Email
100  Sam   New York City  New York  1234567890  1234#yahoo.com
I want to remove those pipes and the solution should be in pandas. Please help me out with this.
The list of dict column looks like:
"assignees": [{
"id": 1234,
"username": "xyz",
"name": "XYZ",
"state": "active",
"avatar_url": "aaaaaaaaaaaaaaa",
"web_url": "bbbbbbbbbbb"
},
{
"id": 5678,
"username": "abcd",
"name": "ABCD",
"state": "active",
"avatar_url": "hhhhhhhhhhh",
"web_url": "mmmmmmmmm"
}
],
This could be one way:
import pandas as pd

df = pd.read_json('Sample.json')
df2 = pd.DataFrame()
for i in df.index:
    desc = df['description'][i]
    attributes = desc.split("\n")
    d = {}
    for attrib in attributes:
        if not attrib.startswith('-----'):  # skip separator rows
            kv = attrib.split("|")
            if len(kv) > 1:
                d[kv[0]] = kv[1]
    df2 = df2.append(d, ignore_index=True)

print(df2)
df2.to_csv("output.csv")
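On pandas 2.0+, DataFrame.append no longer exists; a minimal sketch of the same idea that collects plain dicts first and builds the frame once (file names assumed from the answer above):
import pandas as pd

df = pd.read_json('Sample.json')

rows = []
for desc in df['description']:
    fields = {}
    for line in desc.split('\n'):
        parts = line.split('|')
        if len(parts) > 1 and not parts[0].startswith('-----'):  # skip separator rows
            fields[parts[0]] = parts[1]
    rows.append(fields)

df2 = pd.DataFrame(rows)
df2.to_excel('output.xlsx', index=False)  # writing .xlsx requires openpyxl
Building the list first is also much faster than appending to a DataFrame row by row.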

Updating nested documents in mongodb using pymongo

I have a mongodb collection that looks something like the following:
{
    "_id": ObjectId("123456789"),
    "continent_name": "Europe",
    "continent_id": "001",
    "countries": [
        {
            "country_name": "France",
            "country_id": "011",
            "cities": [
                {"city_name": "Paris", "city_id": "101"},
                {"city_name": "Marseille", "city_id": "102"}
            ]
        },
        {
            "country_name": "England",
            "country_id": "012",
            "cities": [
                {"city_name": "London", "city_id": "201"},
                {"city_name": "Bath", "city_id": "202"}
            ]
        }
    ]
}
And so on for other continents>countries>cities.
I'm unclear on what approach to take when updating this collection.
Let's say I run my data collection again and discover a new city in England, resulting in an array for [London, Bath, Manchester], how do I update the value of Europe>England>[] pythonically without touching France?
Is it possible to search where(continent=Europe && country=England)?
My current working theory is to do something like the following:
def mongo_add_document(continent, country, cities):
    data = {
        "continent_name": continent["name"],
        "continent_id": continent["id"],
        "countries": [
            {
                "country_name": country["name"],
                "country_id": country["id"],
                "cities": [
                    {"city_name": city["name"], "city_id": city["id"]} for city in cities
                ]
            }
        ]
    }
    cities.find_one_and_update(
        {"continent_id": continent["id"]},
        data,
        upsert=True
    )
But my concern is this will overwrite other countries in the continent document.
db.getCollection('cities').updateOne(
    {"continent_name": "Europe", "countries.country_name": "England", "countries.cities.city_name": {$ne: "Manchester"}},
    {$push: {"countries.$.cities": {"city_name": "Manchester", "city_id": "whatever"}}}
)
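Since the question is about pymongo, the same update translated to Python (a sketch; the connection string, database name, and city_id value are assumptions):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
cities = client["mydb"]["cities"]                  # assumed database/collection names

# The $ne clause makes the push idempotent (no duplicate Manchester),
# and the positional $ operator targets only the matched England entry.
cities.update_one(
    {
        "continent_name": "Europe",
        "countries.country_name": "England",
        "countries.cities.city_name": {"$ne": "Manchester"},
    },
    {"$push": {"countries.$.cities": {"city_name": "Manchester", "city_id": "203"}}},
)
Because the filter requires the continent and country to already exist, France and its cities are left untouched.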

Creating pandas dataframe from accessing specific keys of nested dictionary

How can the dictionary below be converted to the expected dataframe?
{
    "getArticleAttributesResponse": {
        "attributes": [{
            "articleId": {
                "id": "2345",
                "locale": "en_US"
            },
            "keyValuePairs": [{
                "key": "tags",
                "value": "[{\"displayName\": \"Nice\", \"englishName\": \"Pradeep\", \"refKey\": \"Key2\"}, {\"displayName\": \"Family Sharing\", \"englishName\": \"Sarvendra\", \"refKey\": \"Key1\", \"meta\": {\"customerDisplayable\": [false]}}]"
            }]
        }]
    }
}
Expected dataframe:
id    displayName     englishName  refKey
2345  Nice            Pradeep      Key2
2345  Family Sharing  Sarvendra    Key1
df1 = pd.DataFrame(d['getArticleAttributesResponse']['attributes']).explode('keyValuePairs')
df2 = (pd.concat([df1[col].apply(pd.Series) for col in df1], axis=1)
         .assign(value=lambda x: x.value.apply(eval)).explode('value'))
df = pd.concat([df2[col].apply(pd.Series) for col in df2], axis=1)
OUTPUT:
0 0 display englishName reference source
0 1234 tags Unarchived Unarchived friend monster
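Note that eval will fail on the JSON literal false inside value; a sketch that parses the string with json.loads instead and builds the expected frame directly (assuming value is valid JSON once the stray brace is removed):
import json
import pandas as pd

rows = []
for attr in d['getArticleAttributesResponse']['attributes']:
    art_id = attr['articleId']['id']
    for kv in attr['keyValuePairs']:
        for tag in json.loads(kv['value']):  # value holds a JSON-encoded list of tag dicts
            rows.append({'id': art_id, **tag})

df = pd.DataFrame(rows)[['id', 'displayName', 'englishName', 'refKey']]
print(df)
#      id     displayName englishName refKey
# 0  2345            Nice     Pradeep   Key2
# 1  2345  Family Sharing   Sarvendra   Key1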

Write json format using pandas Series and DataFrame

I'm working with CSV files. My goal is to write a JSON format with the CSV file information. Specifically, I want to get a format similar to miserables.json.
Example:
{"source": "Napoleon", "target": "Myriel", "value": 1},
According to the information I have, the format would be:
[
    {
        "source": "Germany",
        "target": "Mexico",
        "value": 1
    },
    {
        "source": "Germany",
        "target": "USA",
        "value": 2
    },
    {
        "source": "Brazil",
        "target": "Argentina",
        "value": 3
    }
]
However, with the code I used, the output looks as follows:
[
    {
        "source": "Germany",
        "target": "Mexico",
        "value": 1
    },
    {
        "source": null,
        "target": "USA",
        "value": 2
    }
][
    {
        "source": "Brazil",
        "target": "Argentina",
        "value": 3
    }
]
The null source should be Germany. This is one of the main problems, because there are more cities with that issue. Apart from this, the information is correct. I just want to remove the multiple lists inside the output and replace null with the correct country.
This is the code I used, with pandas and collections:
import json
import pandas
from pandas import DataFrame, Series
from collections import Counter

csvdata = pandas.read_csv('file.csv', low_memory=False, encoding='latin-1')
countries = csvdata['country'].tolist()
newcountries = list(set(countries))

for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frequency = Counter(bills)
    sourceTemp = []
    value = []
    country = element
    for k, v in frequency.items():
        sourceTemp.append(k)
        value.append(int(v))
    forceData = {'source': Series(country), 'target': Series(sourceTemp), 'value': Series(value)}
    dfForce = DataFrame(forceData)
    jsondata = dfForce.to_json(orient='records', force_ascii=False, default_handler=callable)
    parsed = json.loads(jsondata)
    newData = json.dumps(parsed, indent=4, ensure_ascii=False, sort_keys=True)
    # since to_json doesn't have an append mode, this is written to a txt file
    savetxt = open('data.txt', 'a')
    savetxt.write(newData)
    savetxt.close()
Any suggestions to solve this problem are appreciated!
Thanks
Consider removing the Series() around the scalar value, country. By wrapping it, when the dictionary of series is upsized into a dataframe, NaN (later converted to null in JSON) is forced into the series to match the lengths of the other series. You can see this by printing out the dfForce dataframe:
from pandas import Series
from pandas import DataFrame

country = 'Germany'
sourceTemp = ['Mexico', 'USA', 'Argentina']
value = [1, 2, 3]

forceData = {'source': Series(country),
             'target': Series(sourceTemp),
             'value': Series(value)}
dfForce = DataFrame(forceData)
#     source     target  value
# 0  Germany     Mexico      1
# 1      NaN        USA      2
# 2      NaN  Argentina      3
To resolve, simply keep country as a scalar in the dictionary of series:
forceData = {'source': country,
             'target': Series(sourceTemp),
             'value': Series(value)}
dfForce = DataFrame(forceData)
#     source     target  value
# 0  Germany     Mexico      1
# 1  Germany        USA      2
# 2  Germany  Argentina      3
By the way, you do not need a dataframe object to output JSON. Simply use a list of dictionaries. Consider the following, using an OrderedDict collection (to maintain the order of keys). This way the growing list is dumped into a text file without appending, which would render invalid JSON, since opposite-facing adjacent square brackets ...][... are not allowed.
from collections import OrderedDict
...

data = []
for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frequency = Counter(bills)
    for k, v in frequency.items():
        inner = OrderedDict()
        inner['source'] = element
        inner['target'] = k
        inner['value'] = int(v)
        data.append(inner)

newData = json.dumps(data, indent=4)
with open('data.json', 'w') as savetxt:
    savetxt.write(newData)
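For reference, the same counting can also be done with a single groupby and one to_json call (a sketch against the csvdata frame above; the indent argument requires pandas >= 1.0):
records = (csvdata.groupby(['country', 'target']).size()
                  .reset_index(name='value')
                  .rename(columns={'country': 'source'})
                  .to_json(orient='records', indent=4))

with open('data.json', 'w') as fh:
    fh.write(records)
Writing the file once, in 'w' mode, sidesteps the invalid ...][... output entirely.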
