I'm trying to take differences of consecutive values in one DataFrame column, while preserving an ordering given by another column, for example:
import pandas as pd
df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
                   "B": [2, 1, 3, 3, 2, 1, 1, 2, 3, 4],
                   "C": [2.1, 2.0, 2.2, 1.2, 1.1, 1.0, 3.0, 3.1, 3.2, 3.3]})
In [1]: df
Out[1]:
   A  B    C
0  1  2  2.1
1  1  1  2.0
2  1  3  2.2
3  2  3  1.2
4  2  2  1.1
5  2  1  1.0
6  3  1  3.0
7  3  2  3.1
8  3  3  3.2
9  4  4  3.3
I would like to:
- for each distinct value of column A (1, 2, 3, and 4)
- sort by column B and take consecutive differences of column C

without a loop, to get something like this:
In [2]: df2
Out[2]:
   A  B    C  Diff
0  1  2  2.1   0.1
2  1  3  2.2   0.1
3  2  3  1.2   0.1
4  2  2  1.1   0.1
7  3  2  3.1   0.1
8  3  3  3.2   0.1
I have run a number of operations:
df2 = df.groupby(by='A').apply(lambda x: x.sort_values(by=['B'])['C'].diff())
df3 = pd.DataFrame(df2)
df3.reset_index(inplace=True)
df4 = df3.set_index('level_1')
df5 = df.copy()
df5['diff'] = df4['C']
and got what I wanted:
df5
Out[1]:
A B C diff
0 1 2 2.1 0.1
1 1 1 2.0 NaN
2 1 3 2.2 0.1
3 2 3 1.2 0.1
4 2 2 1.1 0.1
5 2 1 1.0 NaN
6 3 1 3.0 NaN
7 3 2 3.1 0.1
8 3 3 3.2 0.1
9 4 4 3.3 NaN
but is there a more efficient way of doing so?
(NaN values can be easily removed so I'm not fussy about that part)
A little unclear on what the expected result is (why are there fewer rows?).
For taking the consecutive differences you probably want to use Series.diff() (see docs here)
df['Diff'] = df.C.diff()
You can use the periods keyword if you want some (positive or negative) lag when taking the differences.
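A minimal sketch on the frame above (the new column names are just for illustration):

df['Diff_lag2'] = df.C.diff(periods=2)   # difference with the value two rows earlier
df['Diff_next'] = df.C.diff(periods=-1)  # difference with the next row (negative lag)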
I don't see where the sort comes into effect, but for that you probably want to use Series.sort_values() (see docs here)
EDIT
Based on your updated information, I believe this may be what you are looking for:
df.sort_values(by=['B', 'C'], inplace=True)
df['diff'] = df.C.diff()
EDIT 2
Based on your new updated information about the calculation, you want to:
- groupby by A (see docs on DataFrame.groupby() here)
- sort (each group) by B (or presort by A then B, prior to groupby)
- calculate differences of C (and discard the first record of each group, since its difference is undefined).
The following code achieves that:
df.sort_values(by=['A','B'], inplace=True)
df['Diff'] = df.groupby('A').apply(lambda x: x['C'].diff()).values
df2 = df.dropna()
Explanation of the code:
The first line sorts the dataframe.
The second line has a few things going on:
- first the groupby (which generates a grouped DataFrame; see the helpful pandas page on split-apply-combine if you're new to groupby),
- then obtain the differences of C for each group,
- then "flatten" the grouped result back into a plain array with .values,
- which we assign to df['Diff'] (this is why we needed to presort the dataframe, so the assignment lines up; otherwise we would have to merge the series on A and B).
The third line just removes the NAs and assigns that to df2.
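A possibly more concise variant (a sketch; it relies on groupby diff returning a Series aligned on the original index, so no .values flattening is needed):

df = df.sort_values(['A', 'B'])
df['Diff'] = df.groupby('A')['C'].diff()  # aligned on df's index
df2 = df.dropna()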
EDIT 3
I think my EDIT 2 version may be what you are looking for: a bit more concise, with less auxiliary data generated. However, you can also improve your version of the solution a little:
df3.reset_index(level=0, inplace=True) # no need to reset and then set again
df5 = df.copy() # only if you don't want to change df
df5['diff'] = df3.C # else, just do df.insert(2, 'diff', df3.C)
Related
I have a column with acceleration values, and I'm trying to put the difference between consecutive values into a new column.
Here’s what I want as output :
A B
0 a b-a
1 b c-b
2 c d-c
3 d …-d
…
I’m currently doing like that
l=[]
for i in range(len(df)):
l.append(df.values[i+1][0]-df.values[i][0])
df[1]=l
That’s very slow to process.
I have over a million lines, and this in 20 different csv files. Is there a way to do it faster ?
IIUC, you can use diff:
df = pd.DataFrame({'A': [0,2,1,10]})
df['B'] = -df['A'].diff(-1)  # diff(-1) is current minus next; negate to get next minus current
output:
A B
0 0 2.0
1 2 -1.0
2 1 9.0
3 10 NaN
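Since the data is spread over 20 CSV files, the same vectorized diff can be applied per file. A hedged sketch (the glob pattern, the column name 'A', and the in-place overwrite are assumptions):

import glob
import pandas as pd

for path in glob.glob('data/*.csv'):  # hypothetical location of the CSV files
    df = pd.read_csv(path)
    df['B'] = -df['A'].diff(-1)       # vectorized: next value minus current
    df.to_csv(path, index=False)      # overwrite in place; adjust as needed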
What is the proper way to go from a df like this:
>>>df
treatment mutation_1 mutation_2 resistance frequency
0 a hpc abc 1.2 3
1 a awd jda 2.1 4
2 b abc hpc 1.2 5
To this:
mutation_1 mutation_2 resistance frequency_a frequency_b
0 hpc abc 1.2 3 5
1 awd jda 2.1 4 0
Please notice that the order in columns a & b does not matter.
Edit: Changed column names in my example for clarity
Edit2: I added the resistance column which is important for me to keep.
First you want to sort the columns of interest horizontally (so the order within a pair doesn't matter), then pivot. Since the resistance column must be kept, include it in the pivot index:
import numpy as np

cols = ['mutation_1', 'mutation_2']
df[cols] = np.sort(df[cols], axis=1)

(df.pivot_table(index=cols + ['resistance'],    # keep resistance
                columns='treatment',
                values='frequency')
   .rename(columns=lambda x: f'frequency_{x}')  # rename as needed
   .reset_index())                              # reset index to columns
Output:
treatment mutation_1 mutation_2  resistance  frequency_a  frequency_b
0                abc        hpc         1.2          3.0          5.0
1                awd        jda         2.1          4.0          NaN
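If you want 0 instead of NaN for missing combinations, as in the desired output in the question, pivot_table() also accepts fill_value; a minimal sketch reusing cols from above:

(df.pivot_table(index=cols + ['resistance'],
                columns='treatment',
                values='frequency',
                fill_value=0)                   # missing cells become 0 instead of NaN
   .rename(columns=lambda x: f'frequency_{x}')
   .reset_index())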
Is there a Pandas function equivalent to the MS Excel fill handle?
It fills data down or extends a series if more than one cell is selected. My specific application is filling down with a set value in a specific column from a specific row in the dataframe, not necessarily filling a series.
This simple function essentially does what I want. I think it would be nice if ffill could be modified to fill in this way...
def fill_down(df, col, val, start, end=0, interval=1):
    if not end:
        end = len(df)
    for i in range(start, end, interval):
        # positional indexing on the frame itself avoids chained-assignment issues
        df.iloc[i, df.columns.get_loc(col)] += val
    return df
As others commented, there isn't a GUI for pandas, but ffill gives the functionality you're looking for. You can also use ffill with groupby for more powerful functionality. For example:
>>> df
      A  B
0  12.0  1
1   NaN  1
2   4.0  2
3   NaN  2
>>> df.A = df.groupby('B').A.ffill()
>>> df
      A  B
0  12.0  1
1  12.0  1
2   4.0  2
3   4.0  2
Edit: If you don't have NaN's, you could always create the NaN's where you want to fill down. For example:
>>> df
Out[8]:
A B
0 1 2
1 3 3
2 4 5
>>> df.replace(3, np.nan)
Out[9]:
A B
0 1.0 2.0
1 NaN NaN
2 4.0 5.0
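For the specific operation in the question's fill_down (adding a set value to one column from a given row down), a vectorized slice assignment avoids the Python loop entirely. A minimal sketch with a made-up frame:

import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0], 'B': [1, 1, 2, 2]})

# add 10 to column 'A' from positional row 2 through the end
df.iloc[2:, df.columns.get_loc('A')] += 10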
I have:
df = pd.DataFrame({'A':[1, 2, -3],'B':[1,2,6]})
df
A B
0 1 1
1 2 2
2 -3 6
Q: How do I get:
A
0 1
1 2
2 1.5
using groupby() and aggregate()?
Something like,
df.groupby([0,1], axis=1).aggregate('mean')
So basically, groupby along axis=1, using row indexes 0 and 1 for grouping (without using transpose).
Are you looking for this?
df.mean(1)
Out[71]:
0 1.0
1 2.0
2 1.5
dtype: float64
If you do want groupby
df.groupby(['key']*df.shape[1],axis=1).mean()
Out[72]:
key
0 1.0
1 2.0
2 1.5
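Since the question asked for aggregate(), the same grouping works with it as well; this is equivalent to the .mean() call above:

df.groupby(['key'] * df.shape[1], axis=1).aggregate('mean')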
Grouping keys can come in four forms; I will mention only the first and third, which are relevant to your question. The following is from "Python for Data Analysis":
Each grouping key can take many forms, and the keys do not have to be all of the same type:
• A list or array of values that is the same length as the axis being grouped
• A dict or Series giving a correspondence between the values on the axis being grouped and the group names
So you can pass an array the same length as your columns axis (the grouping axis), or a dict like the following:
df.groupby({x: 'mean' for x in df.columns}, axis=1).mean()
mean
0 1.0
1 2.0
2 1.5
Given a dataframe df as follows:
A B C
0 1 1 2
1 2 2 3
2 -3 6 1
you can use

df.groupby(by=lambda x: df[x].loc[0], axis=1).mean()

to get the desired output:
1 2
0 1.0 2.0
1 2.0 3.0
2 1.5 1.0
Here, the function lambda x: df[x].loc[0] maps columns A and B to 1 and column C to 2, and that mapping is used to form the groups.
You can also use any function defined outside the groupby statement instead of the lambda, as sketched below.
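For instance, a named function equivalent to the lambda (a sketch; the name first_row_value is just for illustration):

def first_row_value(col):
    # group columns by their value in the first row (A, B -> 1; C -> 2)
    return df[col].loc[0]

df.groupby(by=first_row_value, axis=1).mean()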
try this:
df["A"] = np.mean(dff.loc[:,["A","B"]],axis=1)
df.drop(columns=["B"],inplace=True)
A
0 1.0
1 2.0
2 1.5
(Please note that there's a question Pandas: group by and Pivot table difference, but this question is different.)
Suppose you start with a DataFrame
df = pd.DataFrame({'a': ['x'] * 2 + ['y'] * 2, 'b': [0, 1, 0, 1], 'val': range(4)})
>>> df
Out[18]:
a b val
0 x 0 0
1 x 1 1
2 y 0 2
3 y 1 3
Now suppose you want to make the index a, the columns b, the values in a cell val, and specify what to do if there are two or more values in a resulting cell:
b 0 1
a
x 0 1
y 2 3
Then you can do this either through
df.val.groupby([df.a, df.b]).sum().unstack()
or through
pd.pivot_table(df, index='a', columns='b', values='val', aggfunc='sum')
So it seems to me that there's a simple correspondence between the two (given one, you could almost write a script to transform it into the other). I also thought of more complex cases with hierarchical indices / columns, but I still see no difference.
Is there something I've missed?
Are there operations that can be performed using one and not the other?
Are there, perhaps, operations easier to perform using one over the other?
If not, why not deprecate pivot_table? groupby seems much more general.
If I understood the source code for pivot_table(index, columns, values, aggfunc) correctly, it's a tuned-up equivalent of:
df.groupby(index + columns).agg(aggfunc).unstack(columns)
plus:
- margins (subtotals and grand totals, as @ayhan has already said; see the sketch after the demo below)
- pivot_table() also removes extra multi-levels from the columns axis (see the example below)
- a convenient dropna parameter: do not include columns whose entries are all NaN
Demo (I took this DF from the pivot_table() docstring):
In [40]: df
Out[40]:
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7
In [41]: df.pivot_table(index=['A','B'], columns='C', values='D', aggfunc=[np.sum,np.mean])
Out[41]:
sum mean
C large small large small
A B
bar one 4.0 5.0 4.0 5.0
two 7.0 6.0 7.0 6.0
foo one 4.0 1.0 2.0 1.0
two NaN 6.0 NaN 3.0
Pay attention to the extra top-level column D in the groupby version below, which pivot_table() strips away:
In [42]: df.groupby(['A','B','C']).agg([np.sum, np.mean]).unstack('C')
Out[42]:
D
sum mean
C large small large small
A B
bar one 4.0 5.0 4.0 5.0
two 7.0 6.0 7.0 6.0
foo one 4.0 1.0 2.0 1.0
two NaN 6.0 NaN 3.0
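As for the margins point above, a minimal sketch on the same demo frame (margins=True appends an 'All' row and column of grand totals; the overall sum of D here is 33):

df.pivot_table(index=['A', 'B'], columns='C', values='D',
               aggfunc='sum', margins=True)  # adds 'All' subtotal row/column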
why not deprecate pivot_table? groupby seems much more general.
IMO, because it's very easy to use and very convenient!
;)