How can I make this kind of aggregation in pandas? - python

I have a dataframe with categorical and numerical columns, and I want to aggregate the numerical columns (max, min, sum...) according to the values of the categorical ones, so I need a new column for each value that each categorical column can take.
A toy example makes this easier to understand.
Say that I have this dataframe:
import pandas as pd
df = pd.DataFrame({
'ref' : [1, 1, 1, 2, 2, 3],
'value_type' : ['A', 'B', 'A', 'C', 'C', 'A'],
'amount' : [100, 50, 20, 300, 150, 70]
}).set_index(['ref'])
value_type amount
ref
1 A 100
1 B 50
1 A 20
2 C 300
2 C 150
3 A 70
And I want to group the amounts on the values of value_type, grouping also for each reference. The result in this case (supposing that only the sum was needed) will be this one:
df_result = pd.DataFrame({
'ref' : [1, 2, 3],
'sum_amount_A' : [120, 0, 70],
'sum_amount_B' : [50, 0, 0],
'sum_amount_C' : [0, 450, 0]
}).set_index('ref')
sum_amount_A sum_amount_B sum_amount_C
ref
1 120 50 0
2 0 0 450
3 70 0 0
I have tried something that works, but it is extremely inefficient: it takes several minutes to process approximately 30,000 rows.
This is what I have done (df_final is a dataframe with a single row for each value of the index ref):
import numpy as np
df_grouped = df.groupby(['ref'])
for ref in df_grouped.groups:
    df_aux = df.loc[[ref]]
    column = 'value_type' # I have more columns, but for illustration one is enough
    for value in df_aux[column].unique():
        # rows of this ref whose categorical column equals the current value
        df_aux_column_value = df_aux.loc[df_aux[column] == value]
        df_final.at[ref, 'sum_' + column + '_' + str(value)] = np.sum(df_aux_column_value['amount'])
I'm sure there must be better ways of doing this aggregation... Thanks in advance!!
EDIT:
The answer given is correct when there is only one column to group by. In the real dataframe I have several categorical columns, and I want to compute the aggregate functions for the values of each column separately. That is, I don't want an aggregated value for each combination of values across the columns, only for each column on its own.
Let's make an example.
import pandas as pd
df = pd.DataFrame({
'ref' : [1, 1, 1, 2, 2, 3],
'sexo' : ['Hombre', 'Hombre', 'Hombre', 'Mujer', 'Mujer', 'Hombre'],
'lugar_trabajo' : ['Campo', 'Ciudad', 'Campo', 'Ciudad', 'Ciudad', 'Campo'],
'dificultad' : ['Alta', 'Media', 'Alta', 'Media', 'Baja', 'Alta'],
'amount' : [100, 50, 20, 300, 150, 70]
}).set_index(['ref'])
This dataframe looks like this:
sexo lugar_trabajo dificultad amount
ref
1 Hombre Campo Alta 100
1 Hombre Ciudad Media 50
1 Hombre Campo Alta 20
2 Mujer Ciudad Media 300
2 Mujer Ciudad Baja 150
3 Hombre Campo Alta 70
If I group by several columns, or make a pivot table (which in a way is equivalent, as far as I know), doing this:
df.pivot_table(index='ref',columns=['sexo','lugar_trabajo','dificultad'],values='amount',aggfunc=[np.sum,np.min,np.max,len], dropna=False)
I will get a dataframe with 48 columns (because I have 3 * 2 * 2 different values, and 4 agg functions).
One way to achieve the result that I want is this:
df_agregado = pd.DataFrame(df.index).set_index('ref')
for col in ['sexo','lugar_trabajo','dificultad']:
    df_agregado = pd.concat([df_agregado, df.pivot_table(index='ref',columns=[col],values='amount',aggfunc=[np.sum,np.min,np.max,len])],axis=1)
I run each group-by on its own and concatenate the results. This way I get 28 columns (2 * 4 + 3 * 4 + 2 * 4). It works and it's fast, but it's not very elegant. Is there another way of getting this result?

A more efficient way is to use pandas built-in functions instead of for loops. There are two main steps.
First, group not only by the index but by the index and the column together:
res = df.groupby(['ref','value_type']).sum()
print(res)
The output is like this at this step:
                amount
ref value_type
1   A              120
    B               50
2   C              450
3   A               70
Second, you need to unstack the multi index, as follows:
df2 = res.unstack(level='value_type',fill_value=0)
The output will be your desired output:
           amount
value_type      A   B    C
ref
1             120  50    0
2               0   0  450
3              70   0    0
As an optional step you can use droplevel to flatten it:
df2.columns = df2.columns.droplevel()
value_type A B C
ref
1 120 50 0
2 0 0 450
3 70 0 0
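For the multi-column case described in the question's EDIT, one possible approach (a sketch, not the only way) is to melt the categorical columns into long form first, so that a single pivot_table call aggregates each column's values separately instead of their combinations. The names cat_cols, long, 'column' and 'category' are assumptions introduced here for illustration, and the flattened names such as sum_sexo_Hombre are just one possible naming scheme.

```python
import pandas as pd

df = pd.DataFrame({
    'ref': [1, 1, 1, 2, 2, 3],
    'sexo': ['Hombre', 'Hombre', 'Hombre', 'Mujer', 'Mujer', 'Hombre'],
    'lugar_trabajo': ['Campo', 'Ciudad', 'Campo', 'Ciudad', 'Ciudad', 'Campo'],
    'dificultad': ['Alta', 'Media', 'Alta', 'Media', 'Baja', 'Alta'],
    'amount': [100, 50, 20, 300, 150, 70],
}).set_index('ref')

# Hypothetical name: the categorical columns to aggregate on, one at a time
cat_cols = ['sexo', 'lugar_trabajo', 'dificultad']

# Long form: one row per (original row, categorical column) pair
long = df.reset_index().melt(
    id_vars=['ref', 'amount'],
    value_vars=cat_cols,
    var_name='column',      # which categorical column the row came from
    value_name='category',  # the value that column had
)

# One pivot for everything; combinations across columns never appear
out = long.pivot_table(
    index='ref',
    columns=['column', 'category'],
    values='amount',
    aggfunc=['sum', 'min', 'max', 'count'],
)

# Flatten the (aggfunc, column, category) MultiIndex into plain names
out.columns = ['_'.join(col) for col in out.columns]
print(out.shape)
```

Cells with no matching rows are left as NaN here; passing fill_value=0 to pivot_table would fill them, although a 0 for min or max can be misleading.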

Related

how to apply multiplication within pandas dataframe

Please advise how to get the following output:
import numpy as np
import pandas as pd
df1 = pd.DataFrame([['1, 2', '2, 2','3, 2','1, 1', '2, 1','3, 1']])
df2 = pd.DataFrame([[1, 2, 100, 'x'], [3, 4, 200, 'y'], [5, 6, 300, 'x']])
# relabel rows and columns so that both start at 1
df22 = df2.rename(index = lambda x: x + 1).set_axis(np.arange(1, len(df2.columns) + 1), axis=1)
# look up the cell addressed by a 'row, col' string
f = lambda x: df22.loc[tuple(map(int, x.split(',')))]
df = df1.applymap(f)
print (df)
Output:
0 1 2 3 4 5
0 2 4 6 1 3 5
df1 holds 'addresses' into df2 in row, col format ('1, 2' is the first row, second column, which is 2; '2, 2' is 4; '3, 2' is 6; etc.).
I need to combine each looked-up value with the 3rd and 4th columns of its row to get something like (2*100x, 4*200y, 6*300x, 1*100x, 3*200y, 5*300x).
The output should be 5000 (the sum of the x's and y's) and 0.28 (1400/5000, the share of the y's).
It's not clear to me why you need df1 and df... Maybe your question is lacking some details?
You can compute your values directly:
df22['val'] = (df22[1] + df22[2])*df22[3]
Output:
1 2 3 4 val
1 1 2 100 x 300
2 3 4 200 y 1400
3 5 6 300 x 3300
From there it's straightforward to compute the sums (total and grouped by column 4):
total = df22['val'].sum() # 5000
y_sum = df22.groupby(4).sum().loc['y', 'val'] # 1400
print(y_sum/total) # 0.28
Edit: if df1 doesn't necessarily contain all members of columns 1 and 2, you could loop through it (it's not clear in your question why df1 is a Dataframe or if it can have more than one row, therefore I flattened it):
df22['val'] = 0
for c in df1.to_numpy().flatten():
    i, j = map(int, c.split(','))
    df22.loc[i, 'val'] += df22.loc[i, j]*df22.loc[i, 3]
This gives you the same output as above for your example but will ignore values that are not in df1.
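As a loop-free variant of the same idea, the addresses in df1 can be parsed once into integer positions and used for NumPy fancy indexing. This is only a sketch under the assumption that every address is a valid 'row, col' string with 1-based positions; the names addresses, picked, weights, labels and share are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([['1, 2', '2, 2', '3, 2', '1, 1', '2, 1', '3, 1']])
df2 = pd.DataFrame([[1, 2, 100, 'x'], [3, 4, 200, 'y'], [5, 6, 300, 'x']])

# Parse every 'row, col' address and shift to 0-based positions
addresses = np.array(
    [[int(part) for part in cell.split(',')] for cell in df1.to_numpy().ravel()]
) - 1
rows, cols = addresses[:, 0], addresses[:, 1]

# Values at the addressed cells, plus the weight and label of each row
picked = df2.to_numpy()[rows, cols].astype(float)
weights = df2[2].to_numpy()[rows]   # third column (100, 200, 300)
labels = df2[3].to_numpy()[rows]    # fourth column ('x' or 'y')

vals = pd.Series(picked * weights)
total = vals.sum()
share = vals.groupby(labels).sum() / total  # fraction of the total per label

print(total, share['y'])
```

Like the loop above, this only counts cells that are actually listed in df1.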

Fill a dataframe with Cartesian product of variably shaped input lists

I want to create a script that fills a dataframe with the Cartesian product of the parameters I want to vary in a series of experiments.
My first thought was to use the product function of itertools; however, it seems to require a fixed set of input lists.
The output I'm looking for can be generated using this sample:
import itertools
import numpy as np
import pandas as pd
cols = ['temperature','pressure','power']
l1 = [1, 100, 50.0 ]
l2 = [1000, 10, np.nan]
l3 = [0, 100, np.nan]
data = []
for val in itertools.product(l1,l2,l3): #use itertools to get the Cartesian product of the lists
    data.append(val) #collect each variation as a tuple
df = pd.DataFrame(data, columns=cols).dropna(axis=0) #make a dataframe from the collected tuples (dropping rows with NaN values)
However, I would like instead to extract the parameters from dataframes of arbitrary shape and then fill up a dataframe with the product, like so (code doesn't work):
data = [{'parameter':'temperature','value1':1,'value2':100,'value3':50},
{'parameter':'pressure','value1':1000,'value2':10},
{'parameter':'power','value1':0,'value2':100},
]
df = pd.DataFrame(data)
l = []
cols = []
for i in range(df.shape[0]):
    l.append(df.iloc[i][1:].to_list()) #store the values of each df row to a separate list
    cols.append(df.iloc[i][0]) #store the first value of the row as column header
data = []
for val in itertools.product(l): #ask itertools to parse a list of lists
    data.append(val)
df2 = pd.DataFrame(data, columns=cols).dropna(0)
Can you recommend a way about this? My goal is creating the final dataframe, so it's not a requirement to use itertools.
Another alternative without product (nothing wrong with product, though) could be to use .join() with how="cross" to produce successive cross-products:
df2 = df.T.rename(columns=df.iloc[:, 0]).drop(df.columns[0])
df2 = (
    df2.iloc[:, [0]]
    .join(df2.iloc[:, [1]], how="cross")
    .join(df2.iloc[:, [2]], how="cross")
    .dropna(axis=0)
)
Result:
temperature pressure power
0 1 1000 0
1 1 1000 100
3 1 10 0
4 1 10 100
9 100 1000 0
10 100 1000 100
12 100 10 0
13 100 10 100
18 50.0 1000 0
19 50.0 1000 100
21 50.0 10 0
22 50.0 10 100
A more compact version with product:
from itertools import product
df2 = pd.DataFrame(
    product(*df.set_index("parameter", drop=True).itertuples(index=False)),
    columns=df["parameter"]
).dropna(axis=0)
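Another option, sketched here under the assumption that missing values only pad the shorter parameter rows, is to drop the NaN entries per parameter before building the product, so that no invalid combinations are generated in the first place. pd.MultiIndex.from_product accepts any number of input lists; the names params and grid are introduced here for illustration.

```python
import pandas as pd

data = [
    {'parameter': 'temperature', 'value1': 1, 'value2': 100, 'value3': 50},
    {'parameter': 'pressure', 'value1': 1000, 'value2': 10},
    {'parameter': 'power', 'value1': 0, 'value2': 100},
]
df = pd.DataFrame(data)

# One list of valid values per parameter, with the NaN padding removed
params = {
    row['parameter']: row.drop('parameter').dropna().tolist()
    for _, row in df.iterrows()
}

# Cartesian product of however many parameters there are
grid = pd.MultiIndex.from_product(
    list(params.values()), names=list(params)
).to_frame(index=False)

print(grid)
```

Because the NaN values are removed up front, the resulting index runs from 0 to 11 without gaps, unlike the filtered results above.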

pandas replace all values of a column with a column values that increment by n starting at 0

Say I have a dataframe like so that I have read in from a file (note: *.ene is a txt file)
df = pd.read_fwf('filename.ene')
TS DENSITY STATS
1
2
3
1
2
3
I would like to only change the TS column. I wish to replace all the column values of 'TS' with the values from range(0,751,125). The desired output should look like so:
TS DENSITY STATS
0
125
250
500
625
750
I'm a bit lost and would like some insight regarding the code to do such a thing in a general format.
I used a for loop to store the values into a list:
K=(6*125)+1
m = []
for i in range(0,K,125):
    m.append(i)
I thought to use .replace like so:
df['TS']=df['TS'].replace(old_value, m, inplace=True)
but was not sure what to put in place of old_value to select all the values of the 'TS' column or if this would even work as a method.
It's pretty straightforward: if you're replacing all the data in the column, you just need to assign the list, which must have the same length as the dataframe:
df['TS'] =m
Example:
import pandas as pd
data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(data, index=[0, 1, 2], columns=['a', 'b', 'c'])
print(df)
# a b c
# 0 10 20 30
# 1 40 50 60
# 2 70 80 90
df['a'] = [1,2,3]
print(df)
# a b c
# 0 1 20 30
# 1 2 50 60
# 2 3 80 90
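Since the assigned values must match the number of rows, it may be safer to derive them from the length of the dataframe than from a hard-coded upper bound (range(0, 751, 125) yields seven values, while the sample above shows six rows). A minimal sketch, using a small stand-in dataframe instead of the file read with read_fwf; the name step is introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Stand-in for the dataframe read from the file (assumed to have six rows)
df = pd.DataFrame({'TS': [1, 2, 3, 1, 2, 3]})

step = 125
# 0, 125, 250, ... with exactly one value per row
df['TS'] = np.arange(len(df)) * step

print(df['TS'].tolist())
```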

Pandas DataFrame groupby apply and re-expand along grouped axis

Say I have a dataframe
A B C D
2019-01-01 1 10 100 12
2019-01-02 2 20 200 23
2019-01-03 3 30 300 34
And an array to group the columns by
array([0, 1, 0, 2])
I wish to group the dataframe by the array (on the column axis), apply a function, then return a Series with length of the number of columns, containing the result of the applied function on each column.
So, for the above (with the applied function taking the group's sum), would want to output:
A 606
B 60
C 606
D 69
dtype: int64
My best attempt:
func = lambda a: np.full(a.shape[1], np.sum(a.values))
df.groupby(groups, axis=1).apply(func)
0 [606, 606]
1 [60]
2 [69]
dtype: object
(in this example the applied function returns equal values inside a group, but this can't be guaranteed for the real case)
I can not see how to do this with pandas grouping syntax, unless I am missing something. Could anyone lend a hand, thanks!
Try this:
import numpy as np
import pandas as pd
groups = [0, 1, 0, 2]
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [10, 20, 30],
                   'C': [100, 200, 300],
                   'D': [12, 23, 34]})
temp = df.apply(sum).to_frame()
temp.index = pd.MultiIndex.from_arrays(
    np.stack([temp.index, groups]),
    names=("df columns", "groups")
)
temp_filter = temp.groupby(level=1).agg(sum)
result = temp.join(temp_filter, rsuffix='0'). \
    set_index(temp.index.get_level_values(0))["00"]
# df columns
# A 606
# B 60
# C 606
# D 69
# Name: 00, dtype: int64
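The approach above relies on the function being a sum. For the more general case mentioned in the question, where the applied function may return different values within a group, a plain loop over the groups with a boolean column mask is one option. This is only a sketch: apply_by_column_group is a hypothetical helper name, and it assumes that func returns one numeric value per column of the sub-frame it receives.

```python
import numpy as np
import pandas as pd


def apply_by_column_group(df, groups, func):
    """Apply func to each group of columns and scatter the results back.

    Hypothetical helper: func receives the sub-frame of one group and is
    assumed to return one numeric value per column of that sub-frame.
    """
    groups = np.asarray(groups)
    result = np.empty(df.shape[1], dtype=float)
    for g in np.unique(groups):
        mask = groups == g                     # columns belonging to group g
        result[mask] = func(df.loc[:, mask])   # write back in column order
    return pd.Series(result, index=df.columns)


df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [10, 20, 30], 'C': [100, 200, 300], 'D': [12, 23, 34]},
    index=pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03']),
)
groups = [0, 1, 0, 2]

# The function from the question: the group's total, repeated per column
func = lambda a: np.full(a.shape[1], np.sum(a.values))

result = apply_by_column_group(df, groups, func)
print(result)
```

The result is a float Series here because the output buffer is allocated as float; cast it afterwards if integers are required.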

Array in DataFrame Panda Python

I have this DataFrame, where each entry of the column ArraysDate contains several elements. I want to be able to index the elements and loop over them with a for loop, as with an array in Java. I have not found a solution; could you suggest some ideas?
For example, for CustomerNumber = 4, ArraysDate has 4 elements, which I would refer to as i1, i2, i3, i4 in calculations on ArraysDate.
Thank you.
CustomerNumber ArraysDate
1 [ 1 13 ]
2 [ 3 ]
3 [ 0 ]
4 [ 2 60 30 40]
If I understand correctly, you want to get an array of data from 'ArraysDate' based on column 'CustomerNumber'.
Basically, you can use loc
import pandas as pd
data = {'c': [1, 2, 3, 4], 'date': [[1,2],[3],[0],[2,60,30,40]]}
df = pd.DataFrame(data)
df.loc[df['c']==4, 'date']
df.loc[df['c']==4, 'date'] = df.loc[df['c']==4, 'date'].apply(lambda i: sum(i))
Result:
[2, 60, 30, 40]
c date
0 1 [1, 2]
1 2 [3]
2 3 [0]
3 4 132
You can use the lambda to sum all items in the array per row.
Step 1: Create a dataframe
import pandas as pd
import numpy as np
d = {'ID': [[1,2,3],[1,2,43]]}
df = pd.DataFrame(data=d)
Step 2: Sum the items in the array
df['ID2']=df['ID'].apply(lambda x: sum(x))
df
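If the goal is to address the individual elements (i1, i2, ...) rather than only sum them, two options can be sketched: iterate with enumerate, or explode the lists into one row per element. Both assume that ArraysDate holds list-like values; the column name position and the variable names long and seen are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerNumber': [1, 2, 3, 4],
    'ArraysDate': [[1, 13], [3], [0], [2, 60, 30, 40]],
})

# Option 1: plain loops, with a 1-based position for every element
seen = []
for customer, dates in zip(df['CustomerNumber'], df['ArraysDate']):
    for position, value in enumerate(dates, start=1):
        seen.append((customer, position, value))

# Option 2: one row per element, numbered within each customer
long = df.explode('ArraysDate').reset_index(drop=True)
long['position'] = long.groupby('CustomerNumber').cumcount() + 1

print(long[long['CustomerNumber'] == 4])
```

The long form makes it possible to use ordinary column operations and group-bys on the elements instead of writing nested loops.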
