I have data as:
[{'name': 'A', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'B', 'subsets': ['B_1', 'B_A'], 'cluster': 2},
{'name': 'C', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'D', 'subsets': ['D_1', 'D_2', 'D_3', 'D_4'], 'cluster': 1}]
I need to represent it as:
Cluster Number   Subset                          Name
0                ['X_1', 'X_A', 'X_B']           A, C
1                ['D_1', 'D_2', 'D_3', 'D_4']    D
2                ['B_1', 'B_A']                  B
For the sake of completeness, it's worth mentioning that in your case you can build the dataframe without json_normalize and apply the same groupby as shown in the other answer:
import pandas as pd
data = [{'name': 'A', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'B', 'subsets': ['B_1', 'B_A'], 'cluster': 2},
{'name': 'C', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'D', 'subsets': ['D_1', 'D_2', 'D_3', 'D_4'], 'cluster': 1}]
df = (pd.DataFrame(data)
        .groupby('cluster')
        .agg({'subsets': 'first', 'name': ', '.join})
        .reset_index()
        .set_index('cluster')
        .rename_axis('Cluster Number'))
subsets name
Cluster Number
0 [X_1, X_A, X_B] A, C
1 [D_1, D_2, D_3, D_4] D
2 [B_1, B_A] B
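If you also want the column headers exactly as in the question, a final rename (purely cosmetic) finishes the job:
df = df.rename(columns={'subsets': 'Subset', 'name': 'Name'})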
You can use json_normalize + groupby on "cluster", applying ', '.join to "name" and 'first' to "subsets":
df = pd.json_normalize(data).groupby('cluster').agg({'subsets':'first','name':', '.join}).reset_index()
Output:
cluster subsets name
0 0 [X_1, X_A, X_B] A, C
1 1 [D_1, D_2, D_3, D_4] D
2 2 [B_1, B_A] B
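As a variation, on pandas 0.25+ you can use named aggregation to produce the question's headers directly, so no rename step is needed (a sketch of the same groupby):
df = (pd.json_normalize(data)
        .groupby('cluster')
        .agg(Subset=('subsets', 'first'), Name=('name', ', '.join))
        .rename_axis('Cluster Number')
        .reset_index())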
It's hard to explain what I'm trying to do, so I'll give an example. In the example below, I am trying to get df3. I have done it with the code below, but it is very "anti-pandas" and I am looking for a better (faster, cleaner, more pandas-esque) way to do it:
import pandas as pd
df1 = pd.DataFrame({"begin": [{"a", "b"}, {"b"}, {"c"}], "end": [{"x"}, {"z", "y"}, {"z"}]})
df2 = pd.DataFrame(
{"a": [10, 10, 15], "b": [15, 20, 30], "c": [8, 12, 10], "x": [1, 2, 3], "y": [1, 3, 4], "z": [1, 3, 1]}
)
df3 = df1.copy()
for i in range(len(df1)):
    for j in range(len(df1.loc[i])):
        df3.at[i, df1.columns[j]] = []
        for v in df1.loc[i][j]:
            df3.at[i, df1.columns[j]].append({"letter": v, "value": df2.loc[i][v]})
print(df3)
Here's my goal (which this code does, just probably not in the best way):
begin end
0 [{'letter': 'b', 'value': 15}, {'letter': 'a', 'value': 10}]  [{'letter': 'x', 'value': 1}]
1 [{'letter': 'b', 'value': 20}]  [{'letter': 'y', 'value': 3}, {'letter': 'z', 'value': 3}]
2 [{'letter': 'c', 'value': 10}] [{'letter': 'z', 'value': 1}]
Here is one way to approach the problem using pandas:
# Reshape and explode the dataframe
s = df1.stack().explode().reset_index(name='letter')
# Map the values corresponding to the letters
s['value'] = s.set_index(['level_0', 'letter']).index.map(df2.stack())
# Assign list of records
s['records'] = s[['letter', 'value']].to_dict('records')
# Pivot with aggfunc as list
s = s.pivot_table('records', 'level_0', 'level_1', aggfunc=list)
print(s)
level_1 begin end
level_0
0 [{'letter': 'a', 'value': 10}, {'letter': 'b', 'value': 15}] [{'letter': 'x', 'value': 1}]
1 [{'letter': 'b', 'value': 20}] [{'letter': 'z', 'value': 3}, {'letter': 'y', 'value': 3}]
2 [{'letter': 'c', 'value': 10}] [{'letter': 'z', 'value': 1}]
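If the leftover level_0/level_1 axis labels bother you, clearing the axis names restores df1's plain look (a purely cosmetic step):
s.index.name = None
s.columns.name = None
print(s)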
I have a table like this:
Group   Item
A       a, b, c
B       b, c, d
And I want to convert it to this:
Item   Group
a      A
b      A, B
c      A, B
d      B
What is the best way to achieve this?
Thank you!!
If you are working in pandas, you can use explode to unpack the items, and a tolist lambda for the grouping stage.
Here is the documentation for the explode method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html
import pandas as pd
df = pd.DataFrame(data={'Group': ['A', 'B'], 'Item': [['a','b','c'], ['b','c','d']]})
Exploding
df.explode('Item').reset_index(drop=True).to_dict(orient='records')
[{'Group': 'A', 'Item': 'a'},
{'Group': 'A', 'Item': 'b'},
{'Group': 'A', 'Item': 'c'},
{'Group': 'B', 'Item': 'b'},
{'Group': 'B', 'Item': 'c'},
{'Group': 'B', 'Item': 'd'}]
Exploding and then using the tolist lambda
df.explode('Item').groupby('Item')['Group'].apply(lambda x: x.tolist()).reset_index().to_dict(orient='records')
[{'Item': 'a', 'Group': ['A']},
{'Item': 'b', 'Group': ['A', 'B']},
{'Item': 'c', 'Group': ['A', 'B']},
{'Item': 'd', 'Group': ['B']}]
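If you want the comma-separated strings shown in the question rather than lists, a small variation swaps the lambda for ', '.join:
df.explode('Item').groupby('Item')['Group'].agg(', '.join).reset_index().to_dict(orient='records')
[{'Item': 'a', 'Group': 'A'},
 {'Item': 'b', 'Group': 'A, B'},
 {'Item': 'c', 'Group': 'A, B'},
 {'Item': 'd', 'Group': 'B'}]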
Not the most efficient, but very short:
>>> table = {'A': ['a', 'b', 'c'], 'B': ['b', 'c', 'd']}
>>> reversed_table = {v: [k for k, vs in table.items() if v in vs] for v in set(v for vs in table.values() for v in vs)}
>>> print(reversed_table)
{'b': ['A', 'B'], 'c': ['A', 'B'], 'd': ['B'], 'a': ['A']}
With dictionaries, you would typically approach it like this:
table = {'A': ['a', 'b', 'c'], 'B': ['b', 'c', 'd']}
revtable = dict()
for v, keys in table.items():
    for k in keys:
        revtable.setdefault(k, []).append(v)
print(revtable)
# {'a': ['A'], 'b': ['A', 'B'], 'c': ['A', 'B'], 'd': ['B']}
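The same inversion reads a bit cleaner with collections.defaultdict, which spares the setdefault call (an equivalent alternative, not a different algorithm):
from collections import defaultdict

table = {'A': ['a', 'b', 'c'], 'B': ['b', 'c', 'd']}
revtable = defaultdict(list)
for group, items in table.items():
    for item in items:
        revtable[item].append(group)
print(dict(revtable))
# {'a': ['A'], 'b': ['A', 'B'], 'c': ['A', 'B'], 'd': ['B']}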
Assuming that your tables are in the form of a pandas dataframe, you could try something like this:
import pandas as pd
import numpy as np
# Create initial dataframe
data = {'Group': ['A', 'B'], 'Item': [['a','b','c'], ['b','c','d']]}
df = pd.DataFrame(data=data)
Group Item
0 A [a, b, c]
1 B [b, c, d]
# Expand number of rows based on list column ("Item") contents
list_col = 'Item'
df = pd.DataFrame({
    col: np.repeat(df[col].values, df[list_col].str.len())
    for col in df.columns.drop(list_col)}
).assign(**{list_col: np.concatenate(df[list_col].values)})[df.columns]
Group Item
0 A a
1 A b
2 A c
3 B b
4 B c
5 B d
*The snippet above is taken from here, which includes a more detailed explanation of the code.
# Perform groupby operation
df = df.groupby('Item')['Group'].apply(list).reset_index(name='Group')
Item Group
0 a [A]
1 b [A, B]
2 c [A, B]
3 d [B]
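For what it's worth, on pandas 0.25+ the row-expansion snippet above can be replaced by the built-in explode used in the earlier answer, shrinking the whole approach to a sketch like:
df = pd.DataFrame(data=data)
df = df.explode('Item').groupby('Item')['Group'].apply(list).reset_index(name='Group')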
I have a list of dicts like this:
[{'ID': 'a', 'Number': 2}, {'ID': 'b', 'Number': 5} , {'ID': 'a', 'Number': 6}, {'ID': 'a', 'Number': 8}, {'ID': 'c', 'Number': 3}]
I want to remove the dicts that have the same ID and keep only the one with the smallest Number. The expected result should be:
[{'ID': 'a', 'Number': 2}, {'Id': 'b', 'Number': 5}, {'ID': 'c', 'Number': 3}]
The most efficient solution is a single pass with a temporary lookup dictionary whose keys are the IDs and whose values are the dict with the lowest Number seen so far for that ID.
l = [{'ID': 'a', 'Number': 2},
{'ID': 'b', 'Number': 5}, # note that I corrected a typo Id --> ID
{'ID': 'a', 'Number': 6},
{'ID': 'a', 'Number': 8},
{'ID': 'c', 'Number': 3}]
lookup_dict = {}
for d in l:
    if d['ID'] not in lookup_dict or d['Number'] < lookup_dict[d['ID']]['Number']:
        lookup_dict[d['ID']] = d
output = list(lookup_dict.values())
which gives output as:
[{'ID': 'a', 'Number': 2}, {'ID': 'b', 'Number': 5}, {'ID': 'c', 'Number': 3}]
A piece of advice: given your final data structure, you may be better off representing this final data as a dictionary keyed by ID, since the IDs are now unique. This would allow for more convenient data access.
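To illustrate that advice, the lookup pass already gives you such a mapping almost for free (a sketch, assuming Number is the only payload you need per ID):
best_by_id = {k: v['Number'] for k, v in lookup_dict.items()}
print(best_by_id)
# {'a': 2, 'b': 5, 'c': 3}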
My dataframe df is like:
col_1 col_2 col_3
A Product 1
B product 2
C Offer 1
D Product 1
What I want is to convert these columns to JSON, where each row's col_2 and col_1 become a key-value pair. I have tried the following:
df['col_1_2'] = df.apply(lambda row: {row['col_2']:row['col_1']}, axis=1)
df['final_col']=df[['col_1_2','col_3']].to_dict('r')
The first row of df['final_col'] is:
{'col_1_2': {'product': A}, 'value': 1.0},
But what I want is:
{'product': A, 'value': 1.0}
Add the missing 'value' key, filled from col_3, inside the lambda:
df['final_col'] = df.apply(lambda row: {row['col_2']:row['col_1'], 'value':row['col_3']},
axis=1)
print (df)
col_1 col_2 col_3 final_col
0 A Product 1 {'Product': 'A', 'value': 1}
1 B product 2 {'product': 'B', 'value': 2}
2 C Offer 1 {'Offer': 'C', 'value': 1}
3 D Product 1 {'Product': 'D', 'value': 1}
If need output in list:
L = [{b:a, 'value':c} for a,b,c in zip(df['col_1'], df['col_2'], df['col_3'])]
print (L)
[{'Product': 'A', 'value': 1},
{'product': 'B', 'value': 2},
{'Offer': 'C', 'value': 1},
{'Product': 'D', 'value': 1}]
Or json:
import json
j = json.dumps([{b:a, 'value':c} for a,b,c in zip(df['col_1'], df['col_2'], df['col_3'])])
print (j)
[{"Product": "A", "value": 1},
{"product": "B", "value": 2},
{"Offer": "C", "value": 1},
{"Product": "D", "value": 1}]
I want to convert the pandas dataframe below
data = pd.DataFrame([[1,2], [5,6]], columns=['10+', '20+'], index=['A', 'B'])
data.index.name = 'City'
data.columns.name= 'Age Group'
print(data)
Age Group 10+ 20+
City
A 1 2
B 5 6
into an array of dictionaries, like:
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]
I am able to get the expected result above using the following loops:
result = []
cols_name = data.columns.name
index_names = data.index.name
for index in data.index:
    for col in data.columns:
        result.append({cols_name: col, index_names: index, 'count': data.loc[index, col]})
Is there a better way of doing this? Since my original data will have a large number of records, the for loops will take too much time.
I think you can use stack with reset_index to reshape, and finally to_dict:
print (data.stack().reset_index(name='count'))
City Age Group count
0 A 10+ 1
1 A 20+ 2
2 B 10+ 5
3 B 20+ 6
print (data.stack().reset_index(name='count').to_dict(orient='records'))
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]
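An equivalent reshape uses melt, if you prefer working from a flat frame (note the records come out column by column rather than row by row, a sketch under that caveat):
result = (data.reset_index()
              .melt(id_vars='City', var_name='Age Group', value_name='count')
              .to_dict(orient='records'))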