apply function only within the same row index?

I have a DataFrame with a sorted two-level index, and I want to apply diff to col3 only within each col1 group, in the order given by col2.
import pandas as pd
mini_df = pd.DataFrame({'col1': ['A', 'B', 'C', 'A'], 'col2': [1,2,3,4], 'col3': [1,4,7,3]})
mini_df = mini_df.set_index(['col1', 'col2']).sort_index()
mini_df['diff'] = mini_df.col3.diff(1)
This gives me
           col3  diff
col1 col2
A    1        1   NaN
     4        3   2.0
B    2        4   1.0
C    3        7   3.0
The above applies diff row by row across the whole column, ignoring col1.
What I want is
           col3  diff
col1 col2
A    1        1   NaN
     4        3   2.0
B    2        4   NaN
C    3        7   NaN

You'll want to use groupby to apply diff to each group:
mini_df = pd.DataFrame({'col1': ['A', 'B', 'C', 'A'], 'col2': [1,2,3,4], 'col3': [1,4,7,3]})
mini_df = mini_df.set_index(['col1', 'col2']).sort_index()
mini_df['diff'] = mini_df.groupby(level='col1')['col3'].diff()

Since you have already done the heavy lifting of sorting, you can diff the whole column and keep only the values whose previous row belongs to the same group. You can't shift a non-datetime index, so either build a Series from the index level or use np.roll. Note that np.roll wraps around, which would give the wrong answer for a DataFrame that contains a single group.
import pandas as pd
s = pd.Series(mini_df.index.get_level_values('col1'))
mini_df['diff'] = mini_df.col3.diff().where(s.eq(s.shift(1)).values)
           col3  diff
col1 col2
A    1        1   NaN
     4        3   2.0
B    2        4   NaN
C    3        7   NaN
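As a quick sanity check, the sketch below rebuilds the example frame and compares the two approaches from the answers above. The names by_group, labels, same_group and by_mask are illustrative only and do not come from the original answers.

```python
import pandas as pd

mini_df = pd.DataFrame({'col1': ['A', 'B', 'C', 'A'],
                        'col2': [1, 2, 3, 4],
                        'col3': [1, 4, 7, 3]})
mini_df = mini_df.set_index(['col1', 'col2']).sort_index()

# Approach 1: diff within each col1 group.
by_group = mini_df.groupby(level='col1')['col3'].diff()

# Approach 2: diff the whole column, then keep only the rows whose
# predecessor carries the same col1 label.
labels = pd.Series(mini_df.index.get_level_values('col1'))
same_group = labels.eq(labels.shift(1)).to_numpy()
by_mask = mini_df['col3'].diff().where(same_group)

# Both Series should hold NaN, 2.0, NaN, NaN for this data.
print(by_group.equals(by_mask))
```

The mask approach relies on the frame already being sorted by col1; the groupby version does not need that assumption.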

Related

Pandas how to perform outer merge with specific order of adding rows

I have two data frames:
df:
col1 col2
0 x 1
1 a 2
2 b 3
3 c 4
and
df2:
col1 col2
0 x 1
1 a 2
2 f 6
3 c 4
I want to obtain a data frame in which each new row from df2 is inserted after the row with the same index from df, like this:
col1 col2
0 x 1
1 a 2
2 b 3
3 f 6
4 c 4

df1 = pd.DataFrame({
'col1': ['x', 'a', 'b', 'c'],
'col2': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
'col1': ['x', 'a', 'f', 'c'],
'col2': [1, 2, 6, 4]
})
To get your output, I concatenated the two dataframes, dropped the duplicated rows, and sorted by the index as requested; reset_index(drop=True) then renumbers the index in order.
df = pd.concat([df1, df2]).drop_duplicates().sort_index().reset_index(drop=True)
# Output
col1 col2
0 x 1
1 a 2
2 b 3
3 f 6
4 c 4
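One caveat with the approach above: after the concat, the rows for b and f share index label 2, and the default sort algorithm of sort_index is not documented as stable. The sketch below is the same pipeline with kind='mergesort', which requests a stable sort so that the df1 row is kept ahead of the df2 row. The name merged is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['x', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
df2 = pd.DataFrame({'col1': ['x', 'a', 'f', 'c'], 'col2': [1, 2, 6, 4]})

merged = (
    pd.concat([df1, df2])          # index becomes 0,1,2,3,0,1,2,3
    .drop_duplicates()             # drops df2 rows identical to df1 rows
    .sort_index(kind='mergesort')  # stable: ties keep concat order
    .reset_index(drop=True)        # renumber 0..n-1
)
print(merged)
```

On a frame this small the default sort will most likely give the same result; the explicit kind only removes the reliance on that.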

Confused about the meaning of calculating sum in Pandas Pivot Table

pd.pivot_table(df1,index=['Staff'],columns=['Products'],values=['Price'],margins=True)
Results:
Price
Products A B C All
Staff
Staff_A 14472.000000 15877.777778 14982.352941 14890.196078
Staff_B 14775.000000 15620.000000 16330.000000 15815.789474
Staff_C 15293.333333 14262.500000 15214.285714 15000.000000
All 14779.545455 15231.818182 15426.470588 15099.000000
The result is displayed correctly, but I do not understand where the values in the All column and the All row come from. Can anyone explain how they are calculated? I thought each one should be the sum of the first three columns added together.

To illustrate:
import pandas as pd
import numpy as np
Create dummy dataframe:
df = pd.DataFrame({
'id': ['a', 'b', 'c', 'a', 'b', 'c'],
'stat': ['col1', 'col1', 'col1', 'col2', 'col2', 'col2'],
'val': [2, 4, 6, 6, 8, 10],
})
This will give us:
>>> df
id stat val
0 a col1 2
1 b col1 4
2 c col1 6
3 a col2 6
4 b col2 8
5 c col2 10
Pivoting without passing the aggfunc argument (it defaults to the mean):
table = pd.pivot_table(df, index='id', columns='stat', values='val', margins=True)
The above will give us:
>>> table
stat col1 col2 All
id
a 2 6 4.0
b 4 8 6.0
c 6 10 8.0
All 4 8 6.0
Notice that the All column gives the mean of each row and the All row gives the mean of each column.
Pivoting and using aggfunc=np.sum:
new_table = pd.pivot_table(df, index='id', columns='stat', values='val', margins=True, aggfunc=np.sum)
This will give us:
>>> new_table
stat col1 col2 All
id
a 2 6 8
b 4 8 12
c 6 10 16
All 12 24 36
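In the dummy data above every cell holds exactly one value, so it cannot show one detail that matters for the original question: the All margins are aggregated from the underlying rows, not from the already-aggregated cells. The sketch below uses made-up data (sales, staff, product and price are hypothetical names) in which one cell is backed by two rows, so the two numbers differ.

```python
import pandas as pd

# Staff A has two sales of product X but only one of product Y.
sales = pd.DataFrame({
    'staff':   ['A', 'A', 'A', 'B', 'B'],
    'product': ['X', 'X', 'Y', 'X', 'Y'],
    'price':   [10, 20, 40, 30, 50],
})

table = pd.pivot_table(sales, index='staff', columns='product',
                       values='price', margins=True)

# Mean of the two displayed cells for staff A: (15 + 40) / 2
mean_of_cells = table.loc['A', ['X', 'Y']].mean()
# Mean of the three raw rows for staff A: (10 + 20 + 40) / 3
mean_of_rows = sales.loc[sales['staff'] == 'A', 'price'].mean()

print(table)
print(table.loc['A', 'All'], mean_of_cells, mean_of_rows)
```

This is why, with the default mean, the All values in the question do not equal the average (or the sum) of the three product columns next to them.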

Unexpected row getting changed in pandas loc assignment

I want to copy a portion of a pandas dataframe onto a different portion, overwriting the existing values there. I am using .loc but more rows are changing than the ones I am referencing.
My example:
df = pd.DataFrame({
'col1': ['A', 'B', 'C', 'D', 'E'],
'col2': range(1, 6),
'col3': range(6, 11)
})
print(df)
col1 col2 col3
0 A 1 6
1 B 2 7
2 C 3 8
3 D 4 9
4 E 5 10
I want to write the values of col2 and col3 from the C and D rows onto the A and B rows. Using .loc:
df.loc[0:2, ["col2", "col3"]] = df.loc[2:4, ["col2", "col3"]].values
print(df)
col1 col2 col3
0 A 3 8
1 B 4 9
2 C 5 10
3 D 4 9
4 E 5 10
This does what I want for rows A and B, but row C has also changed. I expect only the first two rows to change, i.e. my expected output is
col1 col2 col3
0 A 3 8
1 B 4 9
2 C 3 8
3 D 4 9
4 E 5 10
Why did the C row also change, and how may I do this with only changing the first two rows?

Unlike list slicing, pandas.DataFrame.loc slicing includes both endpoints. From the documentation:
Warning: Note that contrary to usual python slices, both the start and the stop are included
So you should do:
df.loc[0:1, ["col2", "col3"]] = df.loc[2:3, ["col2", "col3"]].values

In addition, you can pass an explicit list of labels; this way the rows need not be consecutive:
df.loc[[0,1], ["col2", "col3"]] = df.loc[[2,3], ["col2", "col3"]].values
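To make the difference concrete, here is a minimal sketch that contrasts label-based slicing with .loc against position-based slicing with .iloc on the frame from the question, then applies the corrected assignment. The variable cols is introduced only for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'D', 'E'],
    'col2': range(1, 6),
    'col3': range(6, 11),
})

print(len(df.loc[0:2]))   # label slice: includes rows 0, 1 and 2
print(len(df.iloc[0:2]))  # positional slice: rows 0 and 1 only

cols = ['col2', 'col3']
# Copy the values of rows C and D onto rows A and B, nothing else.
df.loc[0:1, cols] = df.loc[2:3, cols].to_numpy()
print(df)
```

Converting the right-hand side to a plain array (with .values or .to_numpy()) matters here: without it, pandas would align on the index labels rather than assign by position.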

You went too far with the indices:
df.loc[0:1, ["col2", "col3"]] = df.loc[2:3, ["col2", "col3"]].values

Sum column in one dataframe based on row value of another dataframe

Say, I have one data frame df:
a b c d e
0 1 2 dd 5 Col1
1 2 3 ee 9 Col2
2 3 4 ff 1 Col4
There's another dataframe df2:
Col1 Col2 Col3
0 1 2 4
1 2 3 5
2 3 4 6
I need to add a column sum in the first dataframe, wherein it sums values of columns in the second dataframe df2, based on values of column e in df1.
Expected output
a b c d e Sum
0 1 2 dd 5 Col1 6
1 2 3 ee 9 Col2 9
2 3 4 ff 1 Col4 0
The Sum value in the last row is 0 because Col4 doesn't exist in df2.
What I tried: writing some lambdas with the apply function, but I wasn't able to get it working.
I'd greatly appreciate the help. Thank you.

Try
df['Sum']=df.e.map(df2.sum()).fillna(0)
df
Out[89]:
a b c d e Sum
0 1 2 dd 5 Col1 6.0
1 2 3 ee 9 Col2 9.0
2 3 4 ff 1 Col4 0.0
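For readers who want to see why the one-liner above works, the sketch below splits it into steps on the data from the question. df2.sum() returns a Series indexed by df2's column names, and map looks each value of column e up in that Series; names with no match become NaN, which fillna(0) replaces. The name column_totals and the final astype(int) are additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1], 'e': ['Col1', 'Col2', 'Col4']})
df2 = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [2, 3, 4], 'Col3': [4, 5, 6]})

# Per-column totals of df2, indexed by column name.
column_totals = df2.sum()

# Look up each name in 'e'; 'Col4' is not in df2, so it maps to NaN -> 0.
df['Sum'] = df['e'].map(column_totals).fillna(0).astype(int)
print(df)
```

The cast back to int is optional; it only avoids the 6.0 / 9.0 / 0.0 float display caused by the intermediate NaN.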

Try this. Using apply, the following solution sums the values of the named column if it is present in df2 and returns 0 if no such column exists in df2.
df1.loc[:,"sum"]=df1.loc[:,"e"].apply(lambda x: df2.loc[:,x].sum() if(x in df2.columns) else 0)

Use .iterrows() to iterate over a data frame, yielding the index and the values of each row.
A nested for loop can then be used to grab the needed values from the second dataframe and apply them to the first:
import pandas as pd
df1 = pd.DataFrame(data={'a': [1,2,3], 'b': [2,3,4], 'c': ['dd', 'ee', 'ff'], 'd': [5,9,1], 'e': ['Col1','Col2','Col3']})
df2 = pd.DataFrame(data={'Col1': [1,2,3], 'Col2': [2,3,4], 'Col3': [4,5,6]})
df1['Sum'] = 0  # placeholder column, filled in row by row below
for index, row in df1.iterrows():
    total = 0  # named total to avoid shadowing the built-in sum()
    for _, row2 in df2.iterrows():
        total += row2[row['e']]  # value from the df2 column named in df1's 'e'
    df1.loc[index, 'Sum'] = total  # .loc avoids chained assignment
Output:
a b c d e Sum
0 1 2 dd 5 Col1 6
1 2 3 ee 9 Col2 9
2 3 4 ff 1 Col3 15

for loop for searching value in dataframe and updating values next to it

I want python to perform updating of values next to a value found in both dataframes (somewhat similar to VLOOKUP in MS Excel). So, for
import pandas as pd
df1 = pd.DataFrame(data = {'col1':['a', 'b', 'd'], 'col2': [1, 2, 4], 'col3': [2, 3, 4]})
df2 = pd.DataFrame(data = {'col1':['a', 'f', 'c', 'd']})
In [3]: df1
Out[3]:
col1 col2 col3
0 a 1 2
1 b 2 3
2 d 4 4
In [4]: df2
Out[4]:
col1
0 a
1 f
2 c
3 d
Outcome must be the following:
In [6]: df3 = *somecode*
df3
Out[6]:
col1 col2 col3
0 a 1 2
1 f
2 c
3 d 4 4
The main part is that I want some sort of "for loop" to do this.
So, for instance, Python searches for the first value of col1 in df2, finds it in df1, updates col2 and col3 respectively, then moves on to the next value.

First, a for loop in pandas is best avoided if a vectorized solution exists.
I think a merge with a left join is what you need; the on parameter can be omitted if col1 is the only column shared by both DataFrames:
df3 = df2.merge(df1, how='left')
print (df3)
col1 col2 col3
0 a 1.0 2.0
1 f NaN NaN
2 c NaN NaN
3 d 4.0 4.0

Try this. A simple left join will solve your problem:
pd.merge(df2,df1,how='left',on=['col1'])
col1 col2 col3
0 a 1.0 2.0
1 f NaN NaN
2 c NaN NaN
3 d 4.0 4.0
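If it helps to see which rows of df2 actually found a match in df1 (the VLOOKUP-style "hit or miss"), merge accepts indicator=True, which adds a _merge column. The sketch below is one way to use it on the data from the question; the name matched is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'd'], 'col2': [1, 2, 4], 'col3': [2, 3, 4]})
df2 = pd.DataFrame({'col1': ['a', 'f', 'c', 'd']})

# Left join keeps every row of df2, in df2's order.
df3 = df2.merge(df1, how='left', on='col1', indicator=True)

# '_merge' is 'both' where col1 was found in df1, 'left_only' otherwise.
matched = df3['_merge'].eq('both')
print(df3)
print(matched.tolist())
```

The unmatched rows hold NaN in col2 and col3, which is also why those columns are displayed as floats after the merge.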
