Efficient data manipulation with pandas based on 2 dataframes - python

Here's my code with 2 dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([[1, 2, 3, 5, 2], [2, 2, 3, 5, 2], [3, 2, 3, 5, 2], [10, 2, 3, 5, 2]]),
                   columns=['ID', 'itemX_2', 'itemK_3', 'itemC_5', 'itemH_2'])
df2 = pd.DataFrame(np.array([[1,1,1, 2,2,2, 3,3,3, 10,10,10], [2,3,5, 2,3,5, 2,3,5, 2,3,5], [20,40,60, 80,100,200, 220,240,260, 500,505,520]]).T,
                   columns=['ID', 'Item_id', 'value_to_assign'])
Based on df2, I want to modify df1.
Expected output:
df_expected_output = pd.DataFrame(np.array([[1, 20, 40, 60, 20], [2, 80, 100, 200, 80], [3, 220, 240, 260, 220], [10, 500, 505, 520, 500]]),
                                  columns=['ID', 'itemX_2', 'itemK_3', 'itemC_5', 'itemH_2'])
I have done it by iterating over the columns with some operations. However, my real dataframes have many more columns and rows, so it's pretty slow. Does anyone know how to do it in a fast, efficient way? Thanks

Here is one solution: pivot df2 to have a format similar to df1, then replace column by column, matching on the number after the last '_'.
df2_pivot = df2.pivot(index='ID', columns='Item_id', values='value_to_assign').rename_axis(None, axis=1)
df3 = df1.set_index('ID')
for c in df3:
    df3[c] = df2_pivot[int(c.rsplit('_', 1)[-1])]
Or, using a dictionary comprehension for the second part:
df3 = pd.DataFrame({c: df2_pivot[int(c.rsplit('_', 1)[-1])]
                    for c in df1.columns[1:]},
                   index=df1['ID']).reset_index()
output:
>>> df3.reset_index()
   ID  itemX_2  itemK_3  itemC_5  itemH_2
0   1       20       40       60       20
1   2       80      100      200       80
2   3      220      240      260      220
3  10      500      505      520      500

Another way would be:
Stack the original df whose values are to be replaced.
Grab the index and split the second level to get the value after '_'.
Using pd.Index.map, map these (ID, item id) pairs to values from df2.
Create a dataframe with the mapped values as data and the stacked MultiIndex as index, then unstack.
s = df1.set_index("ID").stack()                        # long Series indexed by (ID, column)
i = s.index.map(lambda x: (x[0], x[1].split("_")[1]))  # (ID, item id) pairs
v = i.map(df2.set_index(["ID", df2['Item_id'].map(str)])['value_to_assign'])
out = pd.DataFrame({"value": v}, index=s.index)['value'].unstack().reset_index()
print(out)
   ID  itemX_2  itemK_3  itemC_5  itemH_2
0   1       20       40       60       20
1   2       80      100      200       80
2   3      220      240      260      220
3  10      500      505      520      500

DataFrame.replace
We can use pivot to reshape df2 so that we can easily use the replace method to substitute the values in df1 (pivot requires keyword arguments in pandas >= 2.0):
df1.set_index('ID').T.replace(df2.pivot(index='Item_id', columns='ID', values='value_to_assign')).T
    itemX_2  itemK_3  itemC_5  itemH_2
ID
1        20       40       60       20
2        80      100      200       80
3       220      240      260      220
10      500      505      520      500
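To get ID back as a regular column, as in df_expected_output, append a reset_index() (a small usage note, not part of the original answer):
out = df1.set_index('ID').T.replace(df2.pivot(index='Item_id', columns='ID', values='value_to_assign')).T.reset_index()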

You can iterate over the columns of df1 and perform a pd.merge:
for col in df1.columns:
    if col == 'ID':
        continue
    df_temp = pd.merge(df1.loc[:, ['ID', col]], df2, on='ID')
    df1[col] = df_temp[df_temp[col] == df_temp['Item_id']]['value_to_assign'].reset_index(drop=True)
output:
   ID  itemX_2  itemK_3  itemC_5  itemH_2
0   1       20       40       60       20
1   2       80      100      200       80
2   3      220      240      260      220
3  10      500      505      520      500
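Since the question is specifically about speed, a fully vectorized route is also worth sketching: melt df1 to long form, look everything up with a single merge against df2, then pivot back to wide. This is a sketch rather than one of the posted answers; like the replace approach, it matches on df1's cell values rather than on the column-name suffix, and it assumes df1 and df2 as defined in the question.
long_df = df1.melt(id_vars='ID', var_name='col', value_name='Item_id')
long_df = long_df.merge(df2, on=['ID', 'Item_id'], how='left')
out = (long_df.pivot(index='ID', columns='col', values='value_to_assign')
              [df1.columns[1:]]   # restore the original column order
              .reset_index())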

Create calculated field within dataset using Python

I have a dataset, df, where I would like to create columns that display the output of a subtraction calculation:
Data
count  power  id  p_q122  p_q222  c_q122  c_q222
100    1000   aa  200     300     10      20
100    2000   bb  400     500     5       10
Desired
cnt  pwr   id  p_q122  avail1  p_q222  avail2  c_q122  count1  c_q222  count2
100  1000  aa  200     800     300     700     10      90      20      80
100  2000  bb  400     1600    500     1500    5       95      10      90
Doing
df['avail1'] = df['power'] - df['p_q122']
df['avail2'] = df['power'] - df['p_q222']
I am looking for a more elegant way that produces the desired output. Any suggestion is appreciated.
We can perform 2D subtraction with numpy:
pd.DataFrame(
    df['power'].to_numpy()[:, None] - df.filter(like='p_').to_numpy()
).rename(columns=lambda i: f'avail{i + 1}')
   avail1  avail2
0     800     700
1    1600    1500
The benefit here is that no matter how many p_ columns there are, all of them will be subtracted from the power column.
We can concat all of the computations with df like:
df = pd.concat([
    df,
    # power calculations
    pd.DataFrame(
        df['power'].to_numpy()[:, None] - df.filter(like='p_').to_numpy()
    ).rename(columns=lambda i: f'avail{i + 1}'),
    # count calculations
    pd.DataFrame(
        df['count'].to_numpy()[:, None] - df.filter(like='c_').to_numpy()
    ).rename(columns=lambda i: f'count{i + 1}'),
], axis=1)
which gives df:
   count  power  id  p_q122  p_q222  ...  c_q222  avail1  avail2  count1  count2
0    100   1000  aa     200     300  ...      20     800     700      90      80
1    100   2000  bb     400     500  ...      10    1600    1500      95      90
[2 rows x 11 columns]
If we have many column groups to do, we can build the list of DataFrames programmatically as well:
df = pd.concat([df, *(
    pd.DataFrame(
        df[col].to_numpy()[:, None] - df.filter(like=filter_prefix).to_numpy()
    ).rename(columns=lambda i: f'{new_prefix}{i + 1}')
    for col, filter_prefix, new_prefix in [
        ('power', 'p_', 'avail'),
        ('count', 'c_', 'count')
    ]
)], axis=1)
Setup and imports:
import pandas as pd
df = pd.DataFrame({
    'count': [100, 100], 'power': [1000, 2000], 'id': ['aa', 'bb'],
    'p_q122': [200, 400], 'p_q222': [300, 500], 'c_q122': [10, 5],
    'c_q222': [20, 10]
})
Try:
df['avail1'] = df['power'].sub(df['p_q122'])
df['avail2'] = df['power'].sub(df['p_q222'])
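If you prefer the plain column-arithmetic style of the snippet above but need it to scale to any number of columns, a loop over the filtered columns does the same job (a sketch, assuming the p_/c_ naming convention from the question):
for i, col in enumerate(df.filter(like='p_').columns, start=1):
    df[f'avail{i}'] = df['power'].sub(df[col])
for i, col in enumerate(df.filter(like='c_').columns, start=1):
    df[f'count{i}'] = df['count'].sub(df[col])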

How can I make this kind of aggregation in pandas?

I have a dataframe that has categorical columns and numerical columns, and I want some aggregation of the values in the numerical columns (max, min, sum...) depending on the values of the categorical ones (so I have to create new columns for each value that each categorical column can take).
To make it more understandable, it's better to give a toy example.
Say that I have this dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'ref' : [1, 1, 1, 2, 2, 3],
    'value_type' : ['A', 'B', 'A', 'C', 'C', 'A'],
    'amount' : [100, 50, 20, 300, 150, 70]
}).set_index(['ref'])
     value_type  amount
ref
1             A     100
1             B      50
1             A      20
2             C     300
2             C     150
3             A      70
And I want to aggregate the amounts over the values of value_type, grouping also by each reference. The result in this case (supposing that only the sum is needed) will be this one:
df_result = pd.DataFrame({
    'ref' : [1, 2, 3],
    'sum_amount_A' : [120, 0, 70],
    'sum_amount_B' : [50, 0, 0],
    'sum_amount_C' : [0, 450, 0]
}).set_index('ref')
     sum_amount_A  sum_amount_B  sum_amount_C
ref
1             120            50             0
2               0             0           450
3              70             0             0
I have tried something that works, but it's extremely inefficient: it takes several minutes to process approx. 30,000 rows.
What I have done is this (I have a dataframe with a single row for each index ref, called df_final):
df_grouped = df.groupby(['ref'])
for ref in df_grouped.groups:
    df_aux = df.loc[[ref]]
    column = 'value_type'  # I have more columns, but for illustration one is enough
    for value in df_aux[column].unique():
        df_aux_column_value = df_aux.loc[df_aux[column] == value]
        df_final.at[ref, 'sum_' + column + '_' + str(value)] = np.sum(df_aux_column_value['amount'])
I'm sure there should be better ways of doing this aggregation... Thanks in advance!!
EDIT:
The answer given is correct when there is only one column to group by. In the real dataframe I have several columns on which I want to calculate some agg functions, but on the values of each column separately. I mean that I don't want an aggregated value for each combination of the columns' values, but only for each column by itself.
Let's make an example.
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'ref' : [1, 1, 1, 2, 2, 3],
    'sexo' : ['Hombre', 'Hombre', 'Hombre', 'Mujer', 'Mujer', 'Hombre'],
    'lugar_trabajo' : ['Campo', 'Ciudad', 'Campo', 'Ciudad', 'Ciudad', 'Campo'],
    'dificultad' : ['Alta', 'Media', 'Alta', 'Media', 'Baja', 'Alta'],
    'amount' : [100, 50, 20, 300, 150, 70]
}).set_index(['ref'])
This dataframe looks like this:
       sexo lugar_trabajo dificultad  amount
ref
1    Hombre         Campo       Alta     100
1    Hombre        Ciudad      Media      50
1    Hombre         Campo       Alta      20
2     Mujer        Ciudad      Media     300
2     Mujer        Ciudad       Baja     150
3    Hombre         Campo       Alta      70
If I group by several columns, or make a pivot table (which in a way is equivalent, as far as I know), like this:
df.pivot_table(index='ref', columns=['sexo', 'lugar_trabajo', 'dificultad'], values='amount', aggfunc=[np.sum, np.min, np.max, len], dropna=False)
I will get a dataframe with 48 columns (because I have 3 * 2 * 2 different value combinations, and 4 agg functions).
A way to achieve the result that I want is this:
df_agregado = pd.DataFrame(index=df.index.unique())  # one row per ref
for col in ['sexo', 'lugar_trabajo', 'dificultad']:
    df_agregado = pd.concat([df_agregado, df.pivot_table(index='ref', columns=[col], values='amount', aggfunc=[np.sum, np.min, np.max, len])], axis=1)
I do each groupby alone and concat all of them. This way I get 28 columns (2 * 4 + 3 * 4 + 2 * 4). It works and it's fast, but it's not very elegant. Is there another way of getting this result?
The more efficient way is to use pandas' built-in functions instead of for loops. There are two main steps you should take.
First, you need to group not only by the index, but by both the index and the column:
res = df.groupby(['ref','value_type']).sum()
print(res)
The output is like this at this step:
                amount
ref value_type
1   A              120
    B               50
2   C              450
3   A               70
Second, you need to unstack the multi index, as follows:
df2 = res.unstack(level='value_type',fill_value=0)
The output will be your desire output:
           amount
value_type      A   B    C
ref
1             120  50    0
2               0   0  450
3              70   0    0
As an optional step you can use droplevel to flatten it:
df2.columns = df2.columns.droplevel()
value_type    A   B    C
ref
1           120  50    0
2             0   0  450
3            70   0    0
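For the multi-column case described in the EDIT, one way to avoid the per-column loop is to melt the categorical columns into long form first, so a single pivot_table covers all of them at once. This is a sketch, not part of the answer above; it uses 'count' in place of len (equivalent here for non-null values) and fill_value=0 to match the zeros in the expected output.
long = df.reset_index().melt(id_vars=['ref', 'amount'],
                             value_vars=['sexo', 'lugar_trabajo', 'dificultad'],
                             var_name='variable', value_name='category')
out = long.pivot_table(index='ref', columns=['variable', 'category'],
                       values='amount', aggfunc=['sum', 'min', 'max', 'count'],
                       fill_value=0)
out.columns = ['_'.join(map(str, c)) for c in out.columns]  # flatten the 28 columns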

Pandas to create new rows from each existing row

A short data frame, and I want to create new rows from the existing rows.
What it does now is multiply each row (all columns at once) by a single random number between 3 and 5:
import pandas as pd
import random
data = {'Price': [59, 98, 79],
        'Stock': [53, 60, 60],
        'Delivery': [11, 7, 6]}
df = pd.DataFrame(data)
for row in range(df.shape[0]):
    new_row = round(df.loc[row] * random.randint(3, 5))
    new_row.name = 'new row'
    df = df.append([new_row])
print(df)
         Price  Stock  Delivery
0           59     53        11
1           98     60         7
2           79     60         6
new row    295    265        55
new row    294    180        21
new row    316    240        24
Is it possible for it to multiply each row by different random numbers per cell? For example:
the 1st row's 3 cells multiplied by (random) [3, 4, 5]
the 2nd row's 3 cells multiplied by (random) [4, 4, 3], etc.?
Thank you.
Change random.randint to NumPy's random.choice in your for loop (note that range's upper bound is exclusive, so use 6 to keep 5 as a possible value):
np.random.choice(range(3, 6), 3)
Use np.random.randint(3, 6, size=3). Actually, you can do it all at once:
df * np.random.randint(3,6, size=df.shape)
You may also generate the multiplication coefficients with the same shape as df independently, and then concat the element-wise product df * mul with the original df:
N.B. This method avoids the notoriously slow .append(). Benchmark: 10,000 rows finished almost instantly with this method, while .append() took 40 seconds!
import numpy as np
np.random.seed(111) # reproducibility
mul = np.random.randint(3, 6, df.shape) # 6 not inclusive
df_new = pd.concat([df, df * mul], axis=0).reset_index(drop=True)
Output:
print(df_new)
   Price  Stock  Delivery
0     59     53        11
1     98     60         7
2     79     60         6
3    177    159        33
4    294    300        28
5    395    300        30
print(mul) # check the coefficients
array([[3, 3, 3],
       [3, 5, 4],
       [5, 5, 5]])

pandas replace all values of a column with a column values that increment by n starting at 0

Say I have a dataframe like so that I have read in from a file (note: *.ene is a txt file)
df = pd.read_fwf('filename.ene')
TS DENSITY STATS
1
2
3
1
2
3
I would like to only change the TS column. I wish to replace all the column values of 'TS' with the values from range(0,751,125). The desired output should look like so:
TS DENSITY STATS
0
125
250
500
625
750
I'm a bit lost and would like some insight regarding the code to do such a thing in a general format.
I used a for loop to store the values into a list:
K = (6 * 125) + 1
m = []
for i in range(0, K, 125):
    m.append(i)
I thought to use .replace like so:
df['TS']=df['TS'].replace(old_value, m, inplace=True)
but was not sure what to put in place of old_value to select all the values of the 'TS' column or if this would even work as a method.
It's pretty straightforward. If you're replacing all the data, you just need to do:
df['TS'] = m
Example:
import pandas as pd
data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(data, index=[0, 1, 2], columns=['a', 'b', 'c'])
print(df)
# a b c
# 0 10 20 30
# 1 40 50 60
# 2 70 80 90
df['a'] = [1,2,3]
print(df)
# a b c
# 0 1 20 30
# 1 2 50 60
# 2 3 80 90
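One caveat to the direct assignment above: the replacement must supply exactly one value per row, and range(0, 751, 125) yields seven values, which would raise a ValueError on a six-row frame. A small sketch that derives evenly spaced values (steps of 125) from the frame's own length instead:
import numpy as np
import pandas as pd
df = pd.DataFrame({'TS': [1, 2, 3, 1, 2, 3]})  # stand-in for the data read from 'filename.ene'
df['TS'] = np.arange(len(df)) * 125            # 0, 125, 250, 375, 500, 625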

Pandas DataFrame groupby apply and re-expand along grouped axis

Say I have a dataframe
            A   B    C   D
2019-01-01  1  10  100  12
2019-01-02  2  20  200  23
2019-01-03  3  30  300  34
And an array to group the columns by
array([0, 1, 0, 2])
I wish to group the dataframe by the array (on the column axis), apply a function, then return a Series with length of the number of columns, containing the result of the applied function on each column.
So, for the above (with the applied function taking the group's sum), would want to output:
A    606
B     60
C    606
D     69
dtype: int64
My best attempt:
func = lambda a: np.full(a.shape[1], np.sum(a.values))
df.groupby(groups, axis=1).apply(func)
0    [606, 606]
1          [60]
2          [69]
dtype: object
(in this example the applied function returns equal values inside a group, but this can't be guaranteed for the real case)
I cannot see how to do this with pandas grouping syntax, unless I am missing something. Could anyone lend a hand? Thanks!
Try this:
import numpy as np
import pandas as pd
groups = [0, 1, 0, 2]
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [10, 20, 30],
                   'C': [100, 200, 300],
                   'D': [12, 23, 34]})
temp = df.apply(sum).to_frame()                # per-column sums
temp.index = pd.MultiIndex.from_arrays(
    np.stack([temp.index, groups]),
    names=("df columns", "groups")
)
temp_filter = temp.groupby(level=1).agg(sum)   # total per group
result = temp.join(temp_filter, rsuffix='0'). \
    set_index(temp.index.get_level_values(0))["00"]
# df columns
# A    606
# B     60
# C    606
# D     69
# Name: 00, dtype: int64
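For the specific case where the applied function is a plain sum over all of a group's values, a shorter route is to sum the columns first and then broadcast each group's total back with transform. This is a sketch, not the answer above, and it only works because summing distributes over the column sums:
import numpy as np
import pandas as pd
groups = np.array([0, 1, 0, 2])
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30],
                   'C': [100, 200, 300], 'D': [12, 23, 34]})
col_sums = df.sum()                                  # per-column totals
result = col_sums.groupby(groups).transform('sum')   # group total, per column
# A    606
# B     60
# C    606
# D     69
# dtype: int64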
