Setting with enlargement - updating transaction DF - python

Looking for a way to achieve the following updates on a dataframe:
dfb is the base dataframe that I want to update with dft transactions.
Any common index rows should be updated with values from dft.
Indexes only in dft should be appended to dfb.
Looking at the documentation, setting with enlargement looked perfect, but then I realized it only works with a single row. Is it possible to use setting with enlargement for this update, or is there another method you could recommend?
dfb = pd.DataFrame(data={'A': [11,22,33], 'B': [44,55,66]}, index=[1,2,3])
dfb
Out[70]:
A B
1 11 44
2 22 55
3 33 66
dft = pd.DataFrame(data={'A': [0,2,3], 'B': [4,5,6]}, index=[3,4,5])
dft
Out[71]:
A B
3 0 4
4 2 5
5 3 6
# Updated dfb should look like this:
dfb
Out[75]:
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6

You can use combine_first, then convert the float columns back to int with astype (the indexes present in only one frame introduce NaN, which upcasts to float):
dft = dft.combine_first(dfb).astype(int)
print (dft)
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
Another solution: find the indexes present in both DataFrames with Index.intersection, drop them from the first DataFrame dfb, and then use concat:
idx = dfb.index.intersection(dft.index)
print (idx)
Int64Index([3], dtype='int64')
dfb = dfb.drop(idx)
print (dfb)
A B
1 11 44
2 22 55
print (pd.concat([dfb, dft]))
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
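An alternative sketch, using DataFrame.reindex to enlarge the base frame and DataFrame.update to overwrite the shared rows (same sample frames as above):

```python
import pandas as pd

dfb = pd.DataFrame({'A': [11, 22, 33], 'B': [44, 55, 66]}, index=[1, 2, 3])
dft = pd.DataFrame({'A': [0, 2, 3], 'B': [4, 5, 6]}, index=[3, 4, 5])

# Enlarge dfb to cover every index from either frame, then overwrite
# the rows dft provides; update() aligns on index and columns.
dfb = dfb.reindex(dfb.index.union(dft.index))
dfb.update(dft)
dfb = dfb.astype(int)  # reindex introduced NaN, which upcast to float
print(dfb)
```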

How to group dataframe by column and receive new column for every group

I have the following dataframe:
df = pd.DataFrame({'timestamp' : [10,10,10,20,20,20], 'idx': [1,2,3,1,2,3], 'v1' : [1,2,4,5,1,9], 'v2' : [1,2,8,5,1,2]})
timestamp idx v1 v2
0 10 1 1 1
1 10 2 2 2
2 10 3 4 8
3 20 1 5 5
4 20 2 1 1
5 20 3 9 2
I'd like to group data by timestamp and calculate the following cumulative statistic:
np.sum(v1*v2) for every timestamp. I'd like to see the following result:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
I'm trying to do the following:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)
df.loc[:, 'stat'] = df.groupby('timestamp').apply(calc_some_stat)
But the stat column comes out as all NaN values. What is wrong in my code?
We want groupby transform here, not groupby apply:
df['stat'] = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
If we really want to use the function we need to join back to scale up the aggregated DataFrame:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df = df.join(
    df.groupby('timestamp').apply(calc_some_stat)
      .rename('stat'),  # Needed to use join but also sets the col name
    on='timestamp'
)
df:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
The issue is that groupby apply is producing summary information:
timestamp
10 37
20 44
dtype: int64
This does not assign back to the DataFrame naturally as there are only 2 rows when the initial DataFrame has 6. We either need to use join to scale these 2 rows up to align with the original DataFrame, or we can avoid all of this using groupby transform which is designed to produce a:
like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values
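To make the shape difference concrete, here is a small sketch using the sample frame from the question: apply collapses each group to one value, while transform returns a result aligned with the original rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'timestamp': [10, 10, 10, 20, 20, 20],
                   'idx': [1, 2, 3, 1, 2, 3],
                   'v1': [1, 2, 4, 5, 1, 9],
                   'v2': [1, 2, 8, 5, 1, 2]})

# apply collapses each group: the result has 2 rows, one per timestamp
agg = df.groupby('timestamp').apply(lambda d: np.sum(d.v1 * d.v2))
print(len(agg))       # 2

# transform keeps the original shape: 6 rows, aligned with df's index
stat = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
print(stat.tolist())  # [37, 37, 37, 44, 44, 44]
```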

How can I only swap the values of two columns and leave the rest as it is in a dataframe?

I'm using Pandas to create a dataframe by extracting a file which is located in an SFTP location. Sample df looks like below:
So what I'm trying to achieve here is to swap only the values between the columns PROCESS_FLAG and AD_WINDTOUCH_ID and leave the rest of the dataframe's columns and rows as they are. Please note I don't want to switch the column names, just the values. The only approach I found was reindexing, which doesn't work as expected. If any of you have run into this scenario, any help would be appreciated.
You can use tuple unpacking:
df.PROCESS_FLAG, df.AD_WINDTOUCH_ID = df.AD_WINDTOUCH_ID.copy(), df.PROCESS_FLAG.copy()
SAMPLE
Given a dataframe like this:
>>> df = pd.DataFrame({"A": np.random.randint(0,10,100), "B": np.random.randint(0,10,100)})
>>> df
Out[164]:
A B
0 1 9
1 4 7
2 3 8
3 6 8
4 5 0
.. .. ..
95 1 9
96 3 8
97 2 3
98 3 4
99 3 1
[100 rows x 2 columns]
>>> df.A, df.B = df.B.copy(), df.A.copy()
>>> df
Out[166]:
A B
0 9 1
1 7 4
2 8 3
3 8 6
4 0 5
.. .. ..
95 9 1
96 8 3
97 3 2
98 4 3
99 1 3
[100 rows x 2 columns]
To rename and reorder:
import pandas as pd
d = {'a': [1,2,3,4], 'b':[1,2,5,6]}
df = pd.DataFrame(d)
df.rename(columns={'a': 'b','b':'a'}, inplace=True)
df = df[['a','b']]
print(df)
a b
0 1 1
1 2 2
2 5 3
3 6 4
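Another common sketch, assuming the column names from the question (the sample data here is invented for illustration): select the two columns in swapped order and assign through .values, which bypasses label alignment:

```python
import pandas as pd

df = pd.DataFrame({'PROCESS_FLAG': ['Y', 'N', 'Y'],
                   'AD_WINDTOUCH_ID': [101, 102, 103],
                   'OTHER': ['a', 'b', 'c']})

# Selecting the two columns in swapped order and assigning through .values
# prevents pandas from realigning them back by column label.
df[['PROCESS_FLAG', 'AD_WINDTOUCH_ID']] = df[['AD_WINDTOUCH_ID', 'PROCESS_FLAG']].values
print(df)
```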

Sort part of DataFrame in Python Panda, return new column with order depending on row values

My first question here, I hope this is understandable.
I have a Panda DataFrame:
order_numbers x_closest_autobahn
0 34 3
1 11 3
2 5 3
3 8 12
4 2 12
I would like to get a new column with the order_number per closest_autobahn in ascending order:
order_numbers x_closest_autobahn order_number_autobahn_x
2 5 3 1
1 11 3 2
0 34 3 3
4 2 12 1
3 8 12 2
I have tried:
df['order_number_autobahn_x'] = ([df.loc[(df['x_closest_autobahn'] == 3)]].sort_values(by=['order_numbers'], ascending=True, inplace=True))
I have looked at slicing, sort_values and reset_index
df.sort_values(by=['order_numbers'], ascending=True, inplace=True)
df = df.reset_index() # reset index to the order after sort
df['order_numbers_index'] = df.index
but I can't seem to get the DataFrame I am looking for.
Use DataFrame.sort_values on both columns, then build the counter with GroupBy.cumcount:
df = df.sort_values(['x_closest_autobahn','order_numbers'])
df['order_number_autobahn_x'] = df.groupby('x_closest_autobahn').cumcount().add(1)
print (df)
order_numbers x_closest_autobahn order_number_autobahn_x
2 5 3 1
1 11 3 2
0 34 3 3
4 2 12 1
3 8 12 2
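If you want the counter without reordering the rows, a sketch using GroupBy.rank on the same sample data should also work:

```python
import pandas as pd

df = pd.DataFrame({'order_numbers': [34, 11, 5, 8, 2],
                   'x_closest_autobahn': [3, 3, 3, 12, 12]})

# rank() numbers each value within its group without sorting the frame,
# so the original row order is preserved.
df['order_number_autobahn_x'] = (df.groupby('x_closest_autobahn')['order_numbers']
                                   .rank(method='first')
                                   .astype(int))
print(df)
```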

Map two pandas dataframe and add a column to the first dataframe

I have posted two sample dataframes. I would like to map one column of a dataframe against the index of a column in another dataframe and place the looked-up values back in the first dataframe, as shown below.
A = np.array([0,1,1,3,5,2,5,4,2,0])
B = np.array([55,75,86,98,100,111])
df1 = pd.Series(A, name='data').to_frame()
df2 = pd.Series(B, name='values_for_replacement').to_frame()
The below is the first dataframe df1
data
0 0
1 1
2 1
3 3
4 5
5 2
6 5
7 4
8 2
9 0
And the below is the second dataframe df2
values_for_replacement
0 55
1 75
2 86
3 98
4 100
5 111
The below is the output needed (Mapped with respect to the index of the df2)
data new_data
0 0 55
1 1 75
2 1 75
3 3 98
4 5 111
5 2 86
6 5 111
7 4 100
8 2 86
9 0 55
I would kindly like to know how one can achieve this using some pandas functions like map.
Looking forward to some answers. Many thanks in advance.
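One way to do this with map, as hinted at in the question: Series.map accepts another Series and looks each value up by that Series' index, which is exactly the lookup described above. A minimal sketch with the sample data:

```python
import numpy as np
import pandas as pd

A = np.array([0, 1, 1, 3, 5, 2, 5, 4, 2, 0])
B = np.array([55, 75, 86, 98, 100, 111])
df1 = pd.Series(A, name='data').to_frame()
df2 = pd.Series(B, name='values_for_replacement').to_frame()

# map with a Series argument treats it as a lookup table keyed by its index
df1['new_data'] = df1['data'].map(df2['values_for_replacement'])
print(df1)
```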

Dynamically reshape the dataframe in pandas

I have a dataframe with 4 columns and 4 rows, and I need to reshape it into 2 columns and 4 rows. The 2 new columns are the result of adding col1 + col3 and col2 + col4. I do not wish to create any other in-memory object for it.
I am trying
df['A','B'] = df['A']+df['C'],df['B']+df['D']
Can it be achieved by using drop function only? Is there any other simpler method for this?
The dynamic way of summing two columns at a time is to use groupby:
df.groupby(np.arange(len(df.columns)) % 2, axis=1).sum()
Out[11]:
0 1
0 2 4
1 10 12
2 18 20
3 26 28
You can use rename afterwards if you want to change the column names, but that would require some logic.
Consider the sample dataframe df
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))
df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
One line of code
pd.DataFrame(
    df.values.reshape(4, 2, 2).transpose(0, 2, 1).sum(2),
    columns=df.columns[:2]
)
A B
0 2 4
1 10 12
2 18 20
3 26 28
Another line of code
df.iloc[:, :2] + df.iloc[:, 2:4].values
A B
0 2 4
1 10 12
2 18 20
3 26 28
Yet another
df.assign(A=df.A + df.C, B=df.B + df.D).drop(['C', 'D'], axis=1)
A B
0 2 4
1 10 12
2 18 20
3 26 28
This works for me:
df['A'], df['B'] = df['A'] + df['C'], df['B'] + df['D']
df = df.drop(['C','D'], axis=1)
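Note that groupby(..., axis=1) used above is deprecated in newer pandas versions. A version-stable sketch that pairs the first half of the columns with the second half positionally, generalizing the iloc answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))

# Pair the first half of the columns with the second half positionally;
# .values on the right-hand side avoids alignment by column label.
half = df.shape[1] // 2
result = df.iloc[:, :half] + df.iloc[:, half:].values
print(result)
```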
