I have a dataframe df with a column 'ColA' whose values are dictionaries. How do I count the keys in this column using Python?
df = pd.DataFrame({
'ColA': [{
"a": 10,
"b": 5,
"c": [1, 2, 3],
"d": 20
}, {
"f": 1,
"b": 3,
"c": [0],
"x": 71
}, {
"a": 1,
"m": 99,
"w": [8, 6],
"x": 88
}, {
"a": 9,
"m": 99,
"c": [3],
"x": 55
}]
})
Here I want to count how often each key occurs, like this, and then visualise the frequencies in a chart.
Expected Answers :
a=3,
b=2,
c=3,
d=1,
f=1,
x=3,
m=2,
w=1
Try this: Series.explode turns each element of a list-like into its own row, Series.value_counts returns the counts of the unique values, and Series.plot creates a plot from the resulting series.
df.ColA.apply(lambda x : list(x.keys())).explode().value_counts()
a 3
c 3
x 3
b 2
m 2
f 1
d 1
w 1
Name: ColA, dtype: int64
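As a cross-check on those counts, here is a minimal sketch that uses collections.Counter from the standard library instead of explode/value_counts. It assumes the dictionaries are available as a plain list (the name rows is illustrative); iterating over df['ColA'] should yield the same dictionaries.

```python
from collections import Counter
from itertools import chain

# Same dictionaries as in the question, as a plain list.
rows = [
    {"a": 10, "b": 5, "c": [1, 2, 3], "d": 20},
    {"f": 1, "b": 3, "c": [0], "x": 71},
    {"a": 1, "m": 99, "w": [8, 6], "x": 88},
    {"a": 9, "m": 99, "c": [3], "x": 55},
]

# Iterating over a dict yields its keys, so chaining the rows
# produces one flat stream of keys for Counter to tally.
key_counts = Counter(chain.from_iterable(rows))

for key, count in key_counts.most_common():
    print(key, count)
```

For the chart, wrapping the counts in a Series and calling pd.Series(key_counts).plot(kind="bar") is one option; that requires matplotlib to be installed.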
I have the following data frame:
data = [
{"id": 1, "parent_id": -1, "level": 1, "name": "Company"},
{"id": 2, "parent_id": 1, "level": 2, "name": "Bakery"},
{"id": 3, "parent_id": 1, "level": 2, "name": "Frozen"},
{"id": 4, "parent_id": 2, "level": 3, "name": "Bread"},
{"id": 5, "parent_id": 2, "level": 3, "name": "Pastry"},
{"id": 6, "parent_id": 3, "level": 3, "name": "Ice Cream"},
{"id": 7, "parent_id": 3, "level": 3, "name": "Sorbet"},
]
df = pd.DataFrame(data)
that looks like this:
id parent_id level name
0 1 -1 1 Company
1 2 1 2 Bakery
2 3 1 2 Frozen
3 4 2 3 Bread
4 5 2 3 Pastry
5 6 3 3 Ice Cream
6 7 3 3 Sorbet
I'm trying to represent the data as a dictionary like this:
data = {
"Company": {
"Bakery": [
"Bread",
"Pastry",
],
"Frozen": [
"Ice Cream",
"Sorbet",
],
},
}
Heavily struggling with achieving this result, so any help is appreciated! I've tried various for-loops but getting muddled up!
This is what I came up with (this code assumes consistency between parent_ids and levels and that all parent_ids exist):
# to store the final result
result = {}
# to store references of dictionaries by their ids
by_id = {}
for d in sorted(data, key=lambda d: d['level']):
    new_dict = {}
    if d['parent_id'] == -1:
        result[d['name']] = new_dict
    else:
        by_id[d['parent_id']][d['name']] = new_dict
    by_id[d['id']] = new_dict
At this point:
>>> result
{'Company': {'Bakery': {'Bread': {}, 'Pastry': {}}, 'Frozen': {'Ice Cream': {}, 'Sorbet': {}}}}
Now to convert empty dictionaries to a list of items, we use a recursive function:
def transform_dicts_to_lists(r):
    if any(r.values()):
        for k, v in r.items():
            r[k] = transform_dicts_to_lists(v)
        return r
    else:
        return list(r.keys())
result = transform_dicts_to_lists(result)
>>> result
{'Company': {'Bakery': ['Bread', 'Pastry'], 'Frozen': ['Ice Cream', 'Sorbet']}}
You can avoid final processing if you know that the maximum level is always 3.
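A sketch of that shortcut, assuming there are exactly three levels and that every parent_id refers to an existing row. The lookup tables name_of and parent_of are illustrative helpers, not part of the code above.

```python
data = [
    {"id": 1, "parent_id": -1, "level": 1, "name": "Company"},
    {"id": 2, "parent_id": 1, "level": 2, "name": "Bakery"},
    {"id": 3, "parent_id": 1, "level": 2, "name": "Frozen"},
    {"id": 4, "parent_id": 2, "level": 3, "name": "Bread"},
    {"id": 5, "parent_id": 2, "level": 3, "name": "Pastry"},
    {"id": 6, "parent_id": 3, "level": 3, "name": "Ice Cream"},
    {"id": 7, "parent_id": 3, "level": 3, "name": "Sorbet"},
]

# Lookup tables so a level-3 row can find its parent and grandparent names.
name_of = {d["id"]: d["name"] for d in data}
parent_of = {d["id"]: d["parent_id"] for d in data}

result = {}
# Sorting by level guarantees that parents are created before their children.
for d in sorted(data, key=lambda d: d["level"]):
    if d["level"] == 1:
        result[d["name"]] = {}
    elif d["level"] == 2:
        result[name_of[d["parent_id"]]][d["name"]] = []
    else:
        middle_id = d["parent_id"]
        top_id = parent_of[middle_id]
        result[name_of[top_id]][name_of[middle_id]].append(d["name"])

print(result)
```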
index print_type_solid print_type_floral cluster
A 10 10 2
B 20 20 2
A 10 10 3
B 20 20 3
C 25 30 3
Can someone help me convert the above dataframe into the following nested dictionary, where the cluster becomes the top-level key, each print_type_x column becomes a second-level key, and the values are as shown in the expected output below?
{
"2" :{
"print_type_solid" : {
"A": 10,
"B": 20
},
"print_type_floral" : {
"A": 10,
"B": 20
}
},
"3" :{
"print_type_solid" : {
"A": 10,
"B": 20,
"C": 25,
},
"print_type_floral" : {
"A": 10,
"B": 20,
"C": 30,
}
}
}
I tried this :
from collections import defaultdict
d = defaultdict()
d2 = {}
for k1, s in dct.items():
    for k2, v in s.items():
        for k3, r in v.items():
            d.setdefault(k3, {})[k2] = r
    d2[k1] = d
But I'm getting this :
{
"2" :{
"print_type_solid" : {
"A": 10,
"B": 20,
"C": 25
},
"print_type_floral" : {
"A": 10,
"B": 20,
"C": 30
}
},
"3" :{
"print_type_solid" : {
"A": 10,
"B": 20,
"C": 25,
},
"print_type_floral" : {
"A": 10,
"B": 20,
"C": 30,
}
}
}
And this is wrong because I'm getting C also in the dictionary for cluster 2.
You can use df.iterrows() to iterate your dataframe row-wise. To create the dictionary you can use this:
import pandas as pd
df = pd.DataFrame({"index": list("ABABC"),
                   "print_type_solid": [10, 20, 10, 20, 25],
                   "print_type_floral": [10, 20, 10, 20, 30],
                   "cluster": [2, 2, 3, 3, 3]})
print(df)
d = {}
pts = "print_type_solid"
ptf = "print_type_floral"
for idx, row in df.iterrows():
    key = d.setdefault(row["cluster"], {})
    key_pts = key.setdefault(pts, {})
    key_pts[row["index"]] = row[pts]
    key_ptf = key.setdefault(ptf, {})
    key_ptf[row["index"]] = row[ptf]
from pprint import pprint
pprint(d)
Output:
# df
index print_type_solid print_type_floral cluster
0 A 10 10 2
1 B 20 20 2
2 A 10 10 3
3 B 20 20 3
4 C 25 30 3
# dict
{2: {'print_type_floral': {'A': 10, 'B': 20},
'print_type_solid': {'A': 10, 'B': 20}},
3: {'print_type_floral': {'A': 10, 'B': 20, 'C': 30},
'print_type_solid': {'A': 10, 'B': 20, 'C': 25}}}
You could also use collections.defaultdict, but for so few data points it is not needed.
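For comparison, a minimal sketch of the collections.defaultdict variant. It assumes the same sample dataframe and converts the cluster to a string so that the keys match the "2" and "3" in the question; the names nested and result are illustrative.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({"index": list("ABABC"),
                   "print_type_solid": [10, 20, 10, 20, 25],
                   "print_type_floral": [10, 20, 10, 20, 30],
                   "cluster": [2, 2, 3, 3, 3]})

# cluster -> print type -> index label -> value;
# missing levels are created on first access.
nested = defaultdict(lambda: defaultdict(dict))
for row in df.to_dict(orient="records"):
    for col in ("print_type_solid", "print_type_floral"):
        nested[str(row["cluster"])][col][row["index"]] = row[col]

# Convert back to plain dicts for printing or serialising.
result = {cluster: dict(cols) for cluster, cols in nested.items()}
print(result)
```

Because each cluster gets its own inner dictionaries, "C" ends up only under cluster "3".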
I have json like this:
json = {
"b": 22,
"x": 12,
"a": 2,
"c": 4
}
When I generate an Excel file from this JSON like this:
import pandas as pd
df = pd.read_json(json_text)
file_name = 'test.xls'
file_path = "/tmp/" + file_name
df.to_excel(file_path, index=False)
print("path to excel " + file_path)
Pandas does its own ordering in the Excel file like this:
pandas_json = {
"a": 2,
"b": 22,
"c": 4,
"x": 12
}
I don't want this. I need the ordering which exists in the json. Please give me some advice how to do this.
UPDATE:
If I have JSON like this:
json = [
{"b": 22, "x":12, "a": 2, "c": 4},
{"b": 22, "x":12, "a": 2, "c": 2},
{"b": 22, "x":12, "a": 4, "c": 4},
]
pandas will generate its own ordering like this:
pandas_json = [
{"a": 2, "b":22, "c": 4, "x": 12},
{"a": 2, "b":22, "c": 2, "x": 12},
{"a": 4, "b":22, "c": 4, "x": 12},
]
How can I make pandas preserve my own ordering?
You can read the json as OrderedDict which will help to retain original order:
import json
from collections import OrderedDict
import pandas as pd
json_ = """{
"b": 22,
"x": 12,
"a": 2,
"c": 4
}"""
data = json.loads(json_, object_pairs_hook=OrderedDict)
pd.DataFrame.from_dict(data,orient='index')
0
b 22
x 12
a 2
c 4
Edit, updated json also works:
j="""[{"b": 22, "x":12, "a": 2, "c": 4},
{"b": 22, "x":12, "a": 2, "c": 2},{"b": 22, "x":12, "a": 4, "c": 4}]"""
data = json.loads(j, object_pairs_hook=OrderedDict)
pd.DataFrame.from_dict(data).to_json(orient='records')
'[{"b":22,"x":12,"a":2,"c":4},{"b":22,"x":12,"a":2,"c":2},
{"b":22,"x":12,"a":4,"c":4}]'
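A related sketch, assuming Python 3.7 or later, where plain dicts keep insertion order: json.loads then returns the keys in document order even without the OrderedDict hook. The variable names here are illustrative.

```python
import json

json_text = '{"b": 22, "x": 12, "a": 2, "c": 4}'
records_text = (
    '[{"b": 22, "x": 12, "a": 2, "c": 4},'
    ' {"b": 22, "x": 12, "a": 2, "c": 2},'
    ' {"b": 22, "x": 12, "a": 4, "c": 4}]'
)

# Keys come back in the order they appear in the JSON text.
column_order = list(json.loads(json_text))

# The same holds for every object in a JSON array.
records = json.loads(records_text)
record_orders = [list(record) for record in records]

print(column_order)
print(record_orders)
```

If the column order still needs to be pinned explicitly, passing it along, as in pd.DataFrame(records, columns=column_order), is one option.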
I have a json file that I'm trying to read into Pandas. The file looks like this:
{"0": {"a": 0, "b": "some_text", "c": "other_text"},
"1": {"a": 1, "b": "some_text1", "c": "other_text1"},
"2": {"a": 2, "b": "some_text2", "c": "other_text2"}}
When I do:
df = pd.read_json("my_file.json")
df = df.transpose()
df.head()
I see:
a b c
0 0 some_text other_text
1 1 some_text1 other_text1
10 10 some_text2 other_text2
So the dataframe's index and column a have somehow gotten mangled in the process. What am I doing incorrectly?
Thanks!
I have a CSV file in a format similar to this
order_id, customer_name, item_1_id, item_1_quantity, Item_2_id, Item_2_quantity, Item_3_id, Item_3_quantity
1, John, 4, 1, 24, 4, 16, 1
2, Paul, 8, 3, 41, 1, 33, 1
3, Andrew, 1, 1, 34, 4, 8, 2
I want to export to json, currently I am doing this.
df = pd.read_csv('simple.csv')
print ( df.to_json(orient = 'records') )
And the output is
[
{
"Item_2_id": 24,
"Item_2_quantity": 4,
"Item_3_id": 16,
"Item_3_quantity": 1,
"customer_name": "John",
"item_1_id": 4,
"item_1_quantity": 1,
"order_id": 1
},
......
However, I would like the output to be
[
{
"customer_name": "John",
"order_id": 1,
"items": [
{ "id": 4, "quantity": 1 },
{ "id": 24, "quantity": 4 },
{ "id": 16, "quantity": 1 },
]
},
......
Any suggestions on a good way to do this?
In this particular project, there will not be more than 5 items per order.
Try the following:
import pandas as pd
import json
output_lst = []
## specify the first row as header
df = pd.read_csv('simple.csv', header=0)
## iterate through all the rows
for index, row in df.iterrows():
    ## named "order" rather than "dict" to avoid shadowing the built-in
    order = {}
    items_lst = []
    ## column_list is a list of column headers
    column_list = df.columns.values
    for i, col_name in enumerate(column_list):
        ## for the first 2 columns simply copy the value into the dictionary
        if i < 2:
            element = row[col_name]
            if isinstance(element, str):
                ## strip if it is a string type value
                element = element.strip()
            order[col_name] = element
        elif "_id" in col_name:
            ## i+1 is used assuming that the item_quantity comes right after the corresponding item_id for each item
            item_dict = {"id": row[col_name], "quantity": row[column_list[i+1]]}
            items_lst.append(item_dict)
    order["items"] = items_lst
    output_lst.append(order)
print(json.dumps(output_lst))
If you run the above code with the simple.csv described in the question, you get the following output:
[
{
"order_id": 1,
"items": [
{
"id": 4,
"quantity": 1
},
{
"id": 24,
"quantity": 4
},
{
"id": 16,
"quantity": 1
}
],
" customer_name": "John"
},
{
"order_id": 2,
"items": [
{
"id": 8,
"quantity": 3
},
{
"id": 41,
"quantity": 1
},
{
"id": 33,
"quantity": 1
}
],
" customer_name": "Paul"
},
{
"order_id": 3,
"items": [
{
"id": 1,
"quantity": 1
},
{
"id": 34,
"quantity": 4
},
{
"id": 8,
"quantity": 2
}
],
" customer_name": "Andrew"
}
]
Source DF:
In [168]: df
Out[168]:
order_id customer_name item_1_id item_1_quantity Item_2_id Item_2_quantity Item_3_id Item_3_quantity
0 1 John 4 1 24 4 16 1
1 2 Paul 8 3 41 1 33 1
2 3 Andrew 1 1 34 4 8 2
Solution:
In [169]: %paste
import re
x = df[['order_id','customer_name']].copy()
x['id'] = \
pd.Series(df.loc[:, df.columns.str.contains(r'item_.*?_id',
flags=re.I)].values.tolist(),
index=df.index)
x['quantity'] = \
pd.Series(df.loc[:, df.columns.str.contains(r'item_.*?_quantity',
flags=re.I)].values.tolist(),
index=df.index)
x.to_json(orient='records')
## -- End pasted text --
Out[169]: '[{"order_id":1,"customer_name":"John","id":[4,24,16],"quantity":[1,4,1]},{"order_id":2,"customer_name":"Paul","id":[8,41,33],"quantity":[3,1,1]},{"order_id":3,"customer_name":"Andrew","id":[1,34,8],"quantity":[1,4,2]}]'
Intermediate helper DF:
In [82]: x
Out[82]:
order_id customer_name id quantity
0 1 John [4, 24, 16] [1, 4, 1]
1 2 Paul [8, 41, 33] [3, 1, 1]
2 3 Andrew [1, 34, 8] [1, 4, 2]
j = df.set_index(['order_id','customer_name']) \
.groupby(lambda x: x.split('_')[-1], axis=1) \
.agg(lambda x: x.values.tolist()) \
.reset_index() \
.to_json(orient='records')
Beautified result:
import json
In [122]: print(json.dumps(json.loads(j), indent=2))
[
{
"order_id": 1,
"customer_name": "John",
"id": [
4,
24,
16
],
"quantity": [
1,
4,
1
]
},
{
"order_id": 2,
"customer_name": "Paul",
"id": [
8,
41,
33
],
"quantity": [
3,
1,
1
]
},
{
"order_id": 3,
"customer_name": "Andrew",
"id": [
1,
34,
8
],
"quantity": [
1,
4,
2
]
}
]
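The result above keeps ids and quantities in two parallel lists, whereas the question asks for an "items" list of id/quantity pairs. A minimal post-processing sketch, assuming j holds the JSON string shown earlier (it is repeated here so the snippet runs on its own):

```python
import json

# JSON string with parallel "id" and "quantity" lists per order.
j = ('[{"order_id":1,"customer_name":"John","id":[4,24,16],"quantity":[1,4,1]},'
     '{"order_id":2,"customer_name":"Paul","id":[8,41,33],"quantity":[3,1,1]},'
     '{"order_id":3,"customer_name":"Andrew","id":[1,34,8],"quantity":[1,4,2]}]')

orders = json.loads(j)
for order in orders:
    # Remove the two parallel lists and pair them up position by position.
    ids = order.pop("id")
    quantities = order.pop("quantity")
    order["items"] = [{"id": i, "quantity": q} for i, q in zip(ids, quantities)]

print(json.dumps(orders, indent=2))
```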