I have a dataframe like the following:
import pandas as pd
df = pd.DataFrame([(1,2,3,4,5,6),
                   (1,2,3,4,5,6),
                   (1,2,3,4,5,6)], columns=['a','b','c','d','e','f'])
Out:
   a  b  c  d  e  f
0  1  2  3  4  5  6
1  1  2  3  4  5  6
2  1  2  3  4  5  6
and I want to get the sum of all the elements of the dataframe, always excluding one column. In this example the desired outcome would be:
60 57 54 51 48 45
I have found a solution that seems to do the job, but I'm pretty sure there must be a more efficient way to do the same:
for x in df.columns:
    print(df.drop(columns=x).sum().sum())
Use DataFrame.rsub to subtract each value from its row sum df.sum(axis=1), then call sum for the total per column:
s = df.rsub(df.sum(axis=1), axis=0).sum()
print(s)
a 60
b 57
c 54
d 51
e 48
f 45
dtype: int64
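To see why this works (a quick sketch on the example frame, not part of the original answer): rsub turns each cell into its row sum minus the cell value, so the per-column sum of that intermediate frame is the grand total minus that column's own sum.
row_sums = df.sum(axis=1)           # every row sums to 21 here
inter = df.rsub(row_sums, axis=0)   # each cell becomes 21 - value
print(inter)
#     a   b   c   d   e   f
# 0  20  19  18  17  16  15
# 1  20  19  18  17  16  15
# 2  20  19  18  17  16  15
print(inter.sum())                  # a 60, b 57, ..., f 45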
Simply subtract each column's sum from the total sum. This can be done in a single statement with an assignment expression (the walrus operator, Python 3.8+), which computes the column sums only once:
out = (s := df.sum()).sum() - s
Output:
a 60
b 57
c 54
d 51
e 48
f 45
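The walrus operator requires Python 3.8+; an equivalent two-line version (same logic, just spelled out) computes the column sums once and reuses them:
s = df.sum()       # per-column sums: a 3, b 6, ..., f 18; total 63
out = s.sum() - s  # 63 - 3 = 60, 63 - 6 = 57, ...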
Yet another way: Subtract the column sums from the total sum of the DataFrame using broadcasting:
out = df.sum().sum() - df.sum()
Output:
a 60
b 57
c 54
d 51
e 48
f 45
dtype: int64
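For a rough comparison of the original loop against the vectorized one-liner (illustrative only; absolute timings depend on your machine and data size):
import timeit
loop = lambda: [df.drop(columns=c).sum().sum() for c in df.columns]
vec = lambda: df.sum().sum() - df.sum()
print(timeit.timeit(loop, number=100))  # one drop plus a full double sum per column
print(timeit.timeit(vec, number=100))   # a single pass over the columns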
I have a dataframe that uses MultiIndex for both index and columns.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.MultiIndex.from_product([[1,2], [1,2,3], [4,5]], names=['i','j','k']),
                  columns=pd.MultiIndex.from_product([[1,2], [1,2]], names=['x','y']))
for c in df.columns:
    df[c] = np.random.randint(100, size=len(df))
x       1       2
y       1   2   1   2
i j k
1 1 4  10  13   0  76
    5  92  37  52  40
  2 4  88  77  50  22
    5  75  31  19   1
  3 4  61  23   5  47
    5  43  68  10  21
2 1 4  23  15  17   5
    5  47  68   6  94
  2 4   0  12  24  54
    5  83  27  46  19
  3 4   7  22   5  15
    5   7  10  89  79
I want to group the values by a name in the index and by a name in the columns.
For each such group, we will have a 2D array of numbers (rather than a Series). I want to aggregate std() of all entries in that 2D array.
For example, let's say I groupby ['i', 'x'], one group would be with values of i=1 and x=1. I want to compute std for each of these 2D arrays and produce a DataFrame with i values as index and x values as columns.
What is the best way to achieve this?
If I do stack() to get x as an index, I will still be computing several std() instead of one as there will still be multiple columns.
You can use nested list comprehensions. For your example, with the given kind of DataFrame (not the same, as the values are random; you may want to fix a seed value so that results are comparable) and i and x as the indices of interest, it would work like this:
# get values of the top level row index (sorted, so the index order is deterministic;
# a plain set would be rejected by the DataFrame constructor)
rows = sorted(set(df.index.get_level_values(0)))
# get values of the top level column index
columns = sorted(set(df.columns.get_level_values(0)))

# for every sub-dataframe (every combination of top-level indices),
# compute the sample standard deviation (1 degree of freedom) across all values
df_groupSD = pd.DataFrame([[df.loc[(row,)][(col,)].values.std(ddof=1)
                            for col in columns] for row in rows],
                          index=rows, columns=columns)

# show result
display(df_groupSD)
Output:
           1          2
1  31.455115  25.433812
2  29.421699  33.748962
There may be better ways, of course.
You can use stack to move the 'y' level of the columns into the index and then group by i only:
print(df.stack(level='y').groupby(['i']).std())
x 1 2
i
1 32.966811 23.933462
2 28.668825 28.541835
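To see why this yields a single std per (i, x) pair: after stack(level='y') the only remaining column level is x, so each group formed by i holds all 12 (j, k, y) combinations per x column, and std() (pandas' default ddof=1) reduces exactly that flattened set. A quick shape check, assuming the random frame from the question:
stacked = df.stack(level='y')
print(stacked.shape)        # (24, 2): y moved into the index, x columns remain
print(stacked.index.names)  # ['i', 'j', 'k', 'y']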
Try the following code, which groups by the first index level (i) and, within each group, stacks the y column level into the index before taking the std of the remaining x columns:
df.groupby(level=0).apply(lambda grp: grp.stack().std())
I am new to Pandas.
My dataset:
df
A B
10 1
15 2
65 3
54 2
51 2
96 1
I am trying to add a new column C with the median of the A values within each group defined by column B.
Expected result:
df
 A  B   C
10  1  53
15  2  51
65  3  65
54  2  51
51  2  51
96  1  53
What I've tried:
df['C'] = df.groupby('B')['A'].transform('median')
I do get an answer, but because the DataFrame is big I am unsure whether my code performs correctly. Could someone tell me if I am using the right way to achieve this?
You can use:
df['C'] = df.groupby('B')['A'].transform('median')
As provided in comments.
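For intuition (a minimal sketch using the example data, not part of the original answer): groupby('B')['A'].median() yields one median per group, and transform broadcasts those medians back onto every original row, which is why the result can be assigned directly to a new column.
print(df.groupby('B')['A'].median())
# B
# 1    53.0
# 2    51.0
# 3    65.0
# Name: A, dtype: float64
df['C'] = df.groupby('B')['A'].transform('median')  # aligned with df's index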
I already have a solution, but it is very slow (13 minutes for 800 rows). Here is an example of the dataframe:
import pandas as pd
d = {'col1': [20,23,40,41,48,49,50,50], 'col2': [39,32,42,50,63,68,68,69]}
df = pd.DataFrame(data=d)
df
In a new column, I want to calculate how many of the previous values of col2 (for example, the previous three) are greater than or equal to the value of col1 in the current row. The first rows, where there is not enough history, should just be marked with 'x'.
This is my slow code:
start_at_nr = 3  # row at which to start calculating
df["overlap_count"] = ""  # create new column

for row in range(len(df)):
    if row <= start_at_nr - 1:
        df["overlap_count"].loc[row] = "x"
    else:
        df["overlap_count"].loc[row] = (
            df["col2"].loc[row - start_at_nr:row - 1] >=
            df["col1"].loc[row]).sum()
df
I hope to obtain a faster solution. Thank you for your time!
This is the result I obtain:
col1 col2 overlap_count
0 20 39 x
1 23 32 x
2 40 42 x
3 41 50 1
4 48 63 1
5 49 68 2
6 50 68 3
7 50 69 3
IIUC, you can do:
import numpy as np

df['overlap_count'] = 0
for i in range(1, start_at_nr+1):
    # compare the current col1 with col2 from i rows earlier
    df['overlap_count'] += df['col1'].le(df['col2'].shift(i))

# mask the first few rows, which lack enough history
df.iloc[:start_at_nr, -1] = np.nan
Output:
col1 col2 overlap_count
0 20 39 NaN
1 23 32 NaN
2 40 42 NaN
3 41 50 1.0
4 48 63 1.0
5 49 68 2.0
6 50 68 3.0
7 50 69 3.0
Takes about 11 ms for 800 rows and start_at_nr=3.
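To see how the shifts line up (an illustrative sketch on the example data): shift(i) moves col2 down by i rows, so each row of the shifted series holds the col2 value from i rows earlier, and le() compares it against the current col1.
print(df['col2'].shift(1).tolist())
# [nan, 39.0, 32.0, 42.0, 50.0, 63.0, 68.0, 68.0]
# e.g. row 3: col1 = 41; the previous three col2 values are 42, 32, 39,
# and only 42 >= 41, hence overlap_count = 1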
You basically compare the current value of col1 to the previous 3 rows of col2, starting the comparison from row 3. You may use shift as follows:
n = 3
s = ((pd.concat([df.col2.shift(x) for x in range(1, n+1)], axis=1) >= df.col1.values[:, None])
     .sum(1)[n:])
or
s = (pd.concat([df.col2.shift(x) for x in range(1, n+1)], axis=1).ge(df.col1, axis=0)
     .sum(1)[n:])
Out[65]:
3 1
4 1
5 2
6 3
7 3
dtype: int64
To get your desired output, assign it back to df and fill the masked rows with 'x' using fillna:
n = 3
s = (pd.concat([df.col2.shift(x) for x in range(1, n+1)], axis=1).ge(df.col1, axis=0)
     .sum(1)[n:])
df_final = df.assign(overlap_count=s).fillna('x')
Out[68]:
col1 col2 overlap_count
0 20 39 x
1 23 32 x
2 40 42 x
3 41 50 1
4 48 63 1
5 49 68 2
6 50 68 3
7 50 69 3
You could do it with .apply() in a single statement as follows. I have used a convenience function process_row(), which is also included below.
df.assign(OVERLAP_COUNT=(df.reset_index(drop=False).rename(
    columns={'index': 'ID'})).apply(
        lambda x: process_row(x, df, offset=3), axis=1))
For More Speed:
In case you need more speed and are processing a lot of rows, you may consider using the swifter library. All you have to do is:
install swifter: pip install swifter.
import the library as import swifter.
replace any .apply() with .swifter.apply() in the code-block above.
Solution in Detail
#!pip install -U swifter
#import swifter
import numpy as np
import pandas as pd
d = {'col1': [20,23,40,41,48,49,50,50], 'col2': [39,32,42,50,63,68,68,69]}
df = pd.DataFrame(data=d)
def process_row(x, df, offset=3):
    # count how many of the previous `offset` values of col2 are >= this row's col1
    value = (df.loc[x.ID - offset:x.ID - 1, 'col2'] >= df.loc[x.ID, 'col1']).sum() if (x.ID >= offset) else 'x'
    return value

# Use df.swifter.apply() for faster processing, instead of df.apply()
df.assign(OVERLAP_COUNT=(df.reset_index(drop=False, inplace=False).rename(
    columns={'index': 'ID'}, inplace=False)).apply(
        lambda x: process_row(x, df, offset=3), axis=1))
Output:
col1 col2 OVERLAP_COUNT
0 20 39 x
1 23 32 x
2 40 42 x
3 41 50 1
4 48 63 1
5 49 68 2
6 50 68 3
7 50 69 3
In Python, I have a pandas data frame df.
ID Ref Dist
A 0 10
A 0 10
A 1 20
A 1 20
A 2 30
A 2 30
A 3 5
A 3 5
B 0 8
B 0 8
B 1 40
B 1 40
B 2 7
B 2 7
I want to group by ID and Ref, and take the first row of the Dist column in each group.
ID Ref Dist
A 0 10
A 1 20
A 2 30
A 3 5
B 0 8
B 1 40
B 2 7
And I want to sum up the Dist column in each ID group.
ID Sum
A 65
B 55
I tried this for the first step, but it just gives me a Series of Dist values keyed by the original row index, so I cannot move on to the second step.
df.groupby(['ID', 'Ref'])['Dist'].head(1)
It'd be wonderful if somebody helps me for this.
Thank you!
I believe this is what you're looking for.
For the first step, use first() since you want the first row of each group. Once you've done that, use reset_index() so you can group by ID afterwards and sum up Dist:
df.groupby(['ID','Ref'])['Dist'].first()\
.reset_index().groupby(['ID'])['Dist'].sum()
ID
A 65
B 55
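A small variant (just a stylistic alternative, not from the original answer) skips the reset_index round-trip by grouping the intermediate Series on its ID index level directly:
df.groupby(['ID', 'Ref'])['Dist'].first().groupby(level='ID').sum()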
Just drop_duplicates before the groupby. The default behavior is to keep the first duplicate row, which is what you want.
df.drop_duplicates(['ID', 'Ref']).groupby('ID').Dist.sum()
#A 65
#B 55
#Name: Dist, dtype: int64
I'm trying to take an existing DataFrame and append a new column.
Let's say I have this DataFrame (just some random numbers):
a b c d e
0 2.847674 0.890958 -1.785646 -0.648289 1.178657
1 -0.865278 0.696976 1.522485 -0.248514 1.004034
2 -2.229555 -0.037372 -1.380972 -0.880361 -0.532428
3 -0.057895 -2.193053 -0.691445 -0.588935 -0.883624
And I want to create a new column 'f' that holds the dot product of each row with a 'costs' vector, for instance [1,0,0,0,0]. So, for row zero, the output in column f should be 2.847674.
Here's the function I currently use:
def addEstimate(df, costs):
    row_iterator = df.iterrows()
    for i, row in row_iterator:
        df.ix[i, 'f'] = np.dot(costs, df.ix[i])
I'm doing this with a 15-element vector, over ~20k rows, and I'm finding that this is super-duper slow (half an hour). I suspect that using iterrows and ix is inefficient, but I'm not sure how to correct this.
Is there a way that I can apply this to the entire DataFrame at once, rather than looping through rows? Or do you have other suggestions to speed this up?
You can create the new column with df['f'] = df.dot(costs).
dot is already a DataFrame method: applying it to the DataFrame as a whole will be much quicker than looping over the DataFrame and applying np.dot to individual rows.
For example:
>>> df # an example DataFrame
a b c d e
0 0 1 2 3 4
1 12 13 14 15 16
2 24 25 26 27 28
3 36 37 38 39 40
>>> costs = [1, 0, 0, 0, 2]
>>> df['f'] = df.dot(costs)
>>> df
a b c d e f
0 0 1 2 3 4 8
1 12 13 14 15 16 44
2 24 25 26 27 28 80
3 36 37 38 39 40 116
Pandas has a dot function as well. Does
df['dotproduct'] = df.dot(costs)
do what you are looking for?