Pandas : create a list based on index of a Dataframe - python

I am trying to obtain a list from a Dataframe based on a common value of the index.
In the example below I am trying to obtain the lists for 'type' and 'xx' based on 'date'.
Here is the Dataframe:
import pandas as pd
import numpy as np
idx = [np.array(['Jan', 'Jan', 'Feb', 'Mar', 'Mar', 'Mar']),np.array(['A1', 'A2', 'A2', 'A1', 'A3', 'A4'])]
data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3}]
df = pd.DataFrame(data, index=idx, columns=['xx'])
df.index.names=['date','type']
df.reset_index(inplace=True)
df=df.set_index(['date'])
Which looks like this:
     type  xx
date
Jan    A1   1
Jan    A2   5
Feb    A2   3
Mar    A1   2
Mar    A3   7
Mar    A4   3
What I am trying to do is to create these two lists:
#list_type
[['A1', 'A2'], ['A2'], ['A1', 'A3', 'A4']]
#list_xx
[['1', '5'], ['3'], ['2', '7', '3']]
As you can see, the elements of the lists are constructed based on a common date.
I would really value an efficient way of doing this in Python.

Use GroupBy.agg with list and then convert DataFrame to dictionary of lists by DataFrame.to_dict:
d = df.groupby(level=0, sort=False).agg(list).to_dict('list')
print (d)
{'type': [['A1', 'A2'], ['A2'], ['A1', 'A3', 'A4']], 'xx': [[1, 5], [3], [2, 7, 3]]}
print (d['type'])
[['A1', 'A2'], ['A2'], ['A1', 'A3', 'A4']]
print (d['xx'])
[[1, 5], [3], [2, 7, 3]]
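If only the two lists are needed, the dictionary step can be skipped. The following is a minimal sketch, assuming the same data as in the question (rebuilt here in a shorter way); the names `grouped` and `list_xx_str` are illustrative and not from the original post. The last line converts the numbers to strings, since the question shows `list_xx` with quoted values.

```python
import pandas as pd

# Same data as in the question, built directly with 'date' as the index.
df = pd.DataFrame(
    {
        "date": ["Jan", "Jan", "Feb", "Mar", "Mar", "Mar"],
        "type": ["A1", "A2", "A2", "A1", "A3", "A4"],
        "xx": [1, 5, 3, 2, 7, 3],
    }
).set_index("date")

# sort=False keeps the groups in order of first appearance (Jan, Feb, Mar).
grouped = df.groupby(level=0, sort=False)

list_type = grouped["type"].agg(list).tolist()
list_xx = grouped["xx"].agg(list).tolist()

# Optional: string values, as written in the question's desired output.
list_xx_str = [[str(v) for v in group] for group in list_xx]

print(list_type)
print(list_xx)
print(list_xx_str)
```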

Related

Some weird transformation to pandas dataframe

My dataframe:
df = pd.DataFrame({'a':['A', 'B'], 'b':[{5:1, 11:2}, {5:3}]})
Expected output (each key is expanded into 'n' consecutive keys; for example, in row 1, key 5 (with value 2) is expanded to 5 and 6. This change also needs to be reflected in column 'a'):
df_expected = pd.DataFrame({'a':['A1', 'A2', 'A1', 'A2', 'B1', 'B2', 'B3'], 'key':[5, 6, 11, 12, 5, 6, 7]})
My present state:
df['key']=df.apply(lambda x: x['b'].keys(), axis=1)
df['value']=df.apply(lambda x: max(x['b'].values()), axis=1)
df = df.loc[df.index.repeat(df.value)]
I am stuck here. What should the next step be?
This will do your transform, outside of pandas.
d = {'a':['A', 'B'], 'b':[{5:1, 11:2}, {5:3}]}
out = { 'a':[], 'b':[] }
for a,b in zip(d['a'],d['b']):
    n = max(b.values())
    for k in b:
        for i in range(n):
            out['a'].append(f'{a}{i+1}')
            out['b'].append(k+i)
print(out)
Output:
{'a': ['A1', 'A2', 'A1', 'A2', 'B1', 'B2', 'B3'], 'b': [5, 6, 11, 12, 5, 6, 7]}
First you need to preprocess your input dictionary like this:
import pandas as pd
d = {'a':['A', 'B'], 'b':[{5:2, 11:2}, {5:3}]} # Assuming 5:2 instead of 5:1.
res = {"a": [], "keys": []}
for idx, i in enumerate(d['b']):
    res['a'].extend([f"{d['a'][idx]}{k}" for j in i for k in range(1,i[j]+1) ])
    res['keys'].extend([k for j in i for k in range(j, j+i[j])])
df = pd.DataFrame(res)
Output:
{'a': ['A1', 'A2', 'A1', 'A2', 'B1', 'B2', 'B3'], 'keys': [5, 6, 11, 12, 5, 6, 7]}
For a pandas solution:
df2 = (df.drop(columns='b')
         .join(pd.json_normalize(df['b'])
                 .rename_axis(columns='key')
                 .stack().reset_index(-1, name='repeat')
              )
         .loc[lambda d: d.index.repeat(d.pop('repeat'))]
      )
g = df2.groupby(['a', 'key']).cumcount()
df2['a'] += g.add(1).astype(str)
df2['key'] += g
print(df2)
Output:
a key
0 A1 5
0 A1 11
0 A2 6
0 A2 12
0 A3 7
0 A3 13
1 B1 5
1 B2 6
1 B3 7
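For comparison, here is a sketch of another pandas route that reproduces `df_expected` from the question. It assumes, as the question's own attempt does, that the repeat count for a row is the largest value in that row's dictionary. The names `out`, `n` and `offset` are illustrative, and `explode` is swapped in for the `json_normalize`/`stack` step.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B'], 'b': [{5: 1, 11: 2}, {5: 3}]})

out = (
    df.assign(
        key=df['b'].map(list),                        # list of dict keys per row
        n=df['b'].map(lambda m: max(m.values())),     # assumed repeat count per row
    )
    .explode('key')           # one row per (a, key) pair
    .drop(columns='b')
    .reset_index(drop=True)   # unique index, so repeating by label is safe
)

# Repeat each (a, key) row n times.
out = out.loc[out.index.repeat(out['n'].to_numpy())]

# Position of each repeated row within its (a, key) group: 0, 1, 2, ...
offset = out.groupby(level=0).cumcount()

out['a'] = out['a'] + (offset + 1).astype(str)
out['key'] = out['key'].astype(int) + offset
out = out[['a', 'key']].reset_index(drop=True)

print(out)
```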

how to append 'new columns' at pivot table..? (pandas)

import numpy as np
import math
import pandas as pd
# making an example DataFrame
data = pd.DataFrame({'cust_id': ['c1', 'c1', 'c1', 'c2', 'c2', 'c2', 'c3', 'c3', 'c3',
                                 'c1', 'c1', 'c1', 'c2', 'c2', 'c2', 'c3', 'c3', 'c3'],
                     'step_seq': ['123', '123', '123', '123', '123', '123', '123', '123', '123',
                                  '456','456','456','456','456','456','456','456','456'],
                     'grade' : ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B',
                                'C','C','C','C','C','C','C','C','D'],
                     'pch_amt': [1, 2, 3, 4, 5, 6, 7, 8, 9,
                                 1, 2, 3, 4, 5, 6, 7, 8, 9]})
print(data)
data = pd.pivot_table(data, index='step_seq', columns='pch_amt', values='grade', aggfunc=np.sum)
a = data.iloc[0,:].tolist()
b = set(a)
len(b)
for i in range(len(data.index)):
    a = data.iloc[i,:].tolist()
    print(a)
    b = set(a)
    # Question 1 related
    print(b)
    print(len(b))
    data.loc[i,'Number of types']=len(b)
data
# Question 2 related
Before asking, thank you for all your help.
I have two questions about the code above:
Q1) Why does the second set contain 'nan', and how can I remove it?
Q2) How can I append 'Number of types' as a column of the pivot table?
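A possible explanation and workaround, offered as a sketch rather than a definitive answer: in the loop, `data.loc[i, 'Number of types']` uses the integer `i` as a row label, but the pivot table's index holds the strings '123' and '456'. pandas therefore adds a new row labelled 0 and a new column, so by the second iteration row '456' already contains a NaN in that new column. Counting distinct values per row with `nunique(axis=1)` avoids the loop. Here `table` is an illustrative name, the data is rebuilt in a shorter form, and `aggfunc='first'` is substituted for `np.sum` on the assumption that each (step_seq, pch_amt) pair occurs once, as it does in this example.

```python
import pandas as pd

# Same example data as in the question, written more compactly.
data = pd.DataFrame({
    'cust_id': ['c1', 'c1', 'c1', 'c2', 'c2', 'c2', 'c3', 'c3', 'c3'] * 2,
    'step_seq': ['123'] * 9 + ['456'] * 9,
    'grade': ['A'] * 6 + ['B'] * 3 + ['C'] * 8 + ['D'],
    'pch_amt': list(range(1, 10)) * 2,
})

# Assumption: one grade per (step_seq, pch_amt) cell, so 'first' is enough.
table = pd.pivot_table(data, index='step_seq', columns='pch_amt',
                       values='grade', aggfunc='first')

# Number of distinct grades in each row; NaN is ignored by nunique.
# The count is computed before the new column is attached.
table['Number of types'] = table.nunique(axis=1)

print(table)
```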

Can I make 4 new columns aggregating 4 previous ones?

I have a data set like this:
data = ({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]})
Where Dan in A has the corresponding number 3 in B, and where Dan in C has the corresponding number 6 in D.
I would like to create 2 new columns, one with the name Dan and the other with 9 (3+6).
Desired Output
data = ({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12], 'E': ['Dan', 'Tom', 'Mary'], 'F': [9, 7, 9], 'G': ['John', 'Mike'], 'H': [1, 12]})
The names John and Mike, which appear only once, should go into two separate columns with their values unchanged.
I have tried using some for loops and .loc, but I am not anywhere close.
Thanks!
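data = pd.DataFrame(data)  # the question's data is a plain dict; convert it first
df = data[['A','B']]
_df = data[['C','D']].copy()
_df.columns = ['A','B']
# as_index=False already returns 'A' and 'B' as columns, so no reset_index() is needed
df = pd.concat([df,_df]).groupby(['A'],as_index=False)['B'].sum()
df.columns = ['E','F']
data = data.merge(df,how='left',left_on=['A'],right_on=['E'])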
You can also join on column C instead; that is something you have to choose. Alternatively, if you want just columns E and F, skip the last line.
You can try this:
import pandas as pd
data = {'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]}
df=pd.DataFrame(data)
df=df.rename(columns={"C": "A", "D": "B"})
df=df.stack().reset_index(0, drop=True).rename_axis("index").reset_index()
df=df.pivot(index=df.index//2, columns="index")
df.columns=map(lambda x: x[1], df.columns)
df=df.groupby("A", as_index=False).sum()
Outputs:
>>> df
A B
0 Dan 9
1 John 1
2 Mary 9
3 Mike 12
4 Tom 7
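The question also asks to separate names that occur in both A and C (columns E and F) from names that occur only once (columns G and H). A minimal sketch of that split, assuming the question's data; the names `long`, `stats`, `both` and `single` are illustrative:

```python
import pandas as pd

data = pd.DataFrame({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5],
                     'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]})

# Stack the two (name, value) column pairs on top of each other.
long = pd.concat(
    [
        data[['A', 'B']].set_axis(['name', 'value'], axis=1),
        data[['C', 'D']].set_axis(['name', 'value'], axis=1),
    ],
    ignore_index=True,
)

# Total and number of occurrences per name, in order of first appearance.
stats = long.groupby('name', sort=False)['value'].agg(['sum', 'count'])

both = stats[stats['count'] > 1]      # names found more than once (E, F)
single = stats[stats['count'] == 1]   # names found only once (G, H)

print(both['sum'])
print(single['sum'])
```

Because the two groups have different lengths, keeping them as separate objects is usually simpler than forcing them into the original four-row frame.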

Remove element from every list in a column in pandas dataframe based on another column

I'd like to remove, from each list in column B, the value found in column A of the same row. How can I do this?
Given:
df = pd.DataFrame({
'A': ['a1', 'a2', 'a3', 'a4'],
'B': [['a1', 'a2'], ['a1', 'a2', 'a3'], ['a1', 'a3'], []]
})
I want:
result = pd.DataFrame({
'A': ['a1', 'a2', 'a3', 'a4'],
'B': [['a1', 'a2'], ['a1', 'a2', 'a3'], ['a1', 'a3'], []],
'Output': [['a2'], ['a1', 'a3'], ['a1'], []]
})
One way of doing that is applying a filtering function to each row via DataFrame.apply:
df['Output'] = df.apply(lambda x: [i for i in x.B if i != x.A], axis=1)
Another solution using iterrows():
for i,value in df.iterrows():
    try:
        value['B'].remove(value['A'])
    except ValueError:
        pass
print(df)
Output:
A B
0 a1 [a2]
1 a2 [a1, a3]
2 a3 [a1]
3 a4 []
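Note that the `iterrows()` version changes the lists in column B in place and `list.remove` drops only the first match. A sketch of a variant that leaves B untouched and writes to a separate `Output` column, using a plain list comprehension over the two columns instead of a row-wise `apply`:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a1', 'a2', 'a3', 'a4'],
    'B': [['a1', 'a2'], ['a1', 'a2', 'a3'], ['a1', 'a3'], []]
})

# For each row, keep every element of B that differs from the value in A.
df['Output'] = [[x for x in b if x != a] for a, b in zip(df['A'], df['B'])]

print(df)
```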

Compare nested list values within columns of a dataframe

How can I compare the lists in two columns of a dataframe, check whether the elements of one list are in the other list, and create another column with the missing elements?
The dataframe looks something like this:
df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
'D': ['d1', 'd2', 'd3']})
I want to compare if elements of column C are in column B and output the missing values to column E, the desired output is:
df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
'D': ['d1', 'd2', 'd3'],
'E': ['b2', ['b1','b2'],'']})
Like your previous related question, you can use a list comprehension. As a general rule, you shouldn't force multiple different types of output, e.g. list or str, depending on result. Therefore, I have chosen lists throughout in this solution.
df['E'] = [list(set(x) - set(y)) for x, y in zip(df['B'], df['C'])]
print(df)
A B C D E
0 a1 [b1, b2] [c1, b1] d1 [b2]
1 a2 [b1, b2, b3] [b3] d2 [b1, b2]
2 a3 [b2] [b2, b1] d3 []
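Sets do not guarantee element order, so the order inside each result list may vary. If the original order of column B matters, a sketch like the following keeps it, assuming the same dataframe as in the question:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
                   'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
                   'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
                   'D': ['d1', 'd2', 'd3']})

def ordered_difference(left, right):
    """Return the elements of left that are not in right, in left's order."""
    right_set = set(right)  # set lookup keeps the membership test cheap
    return [item for item in left if item not in right_set]

df['E'] = [ordered_difference(b, c) for b, c in zip(df['B'], df['C'])]

print(df['E'].tolist())
```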
def Desintersection(i):
    Output = [b for b in df['B'][i] if b not in df['C'][i]]
    if(len(Output) == 0):
        return ''
    elif(len(Output) == 1):
        return Output[0]
    else:
        return Output
df['E'] = df.index.map(Desintersection)
df
The same approach as in my previous answer:
(df.B.map(set)-df.C.map(set)).map(list)
Out[112]:
0 [b2]
1 [b2, b1]
2 []
dtype: object
I agree with @jpp that you shouldn't mix the types so much: when you try to apply the same function to the new E column, it will fail because it expects each element to be a list.
This would work on E, as it converts single str values to [str] before comparison.
import pandas as pd
df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
'D': ['d1', 'd2', 'd3']})
def difference(df, A, B):
    elements_to_list = lambda x: [n if isinstance(n, list) else [n] for n in x]
    diff = [list(set(a).difference(set(b))) for a, b in zip(elements_to_list(df[A]), elements_to_list(df[B]))]
    diff = [d if d else "" for d in diff] # replace empty lists with empty strings
    return [d if len(d) != 1 else d[0] for d in diff] # return with single values extracted from the list
df['E'] = difference(df, "B", "C")
df['F'] = difference(df, "B", "E")
print(list(df['E']))
print(list(df['F']))
['b2', ['b2', 'b1'], '']
['b1', 'b3', 'b2']
