I face a problem in pandas where I perform many changes on my data, but eventually I don't know which change caused the final state of a value in the column.
For example, I change volumes like this, and I run many checks like this one:
# Last check
for i in range(5):
    df_gp.tail(1).loc[(df_gp['volume']<df_gp['volume'].shift(1)) | (df_gp['volume']<0.4), ['new_volume']] = df_gp['new_volume']*1.1
I want to update not only the 'new_volume' column, but also the 'commentary' column when the conditions are fulfilled.
Is it possible to add this somewhere, so that 'commentary' is updated at the same time as 'new_volume'?
Thanks!
Yes, it is possible with assign, but in my opinion it is less readable; it is better to update each column separately using a boolean mask cached in a variable:
import pandas as pd

df_gp = pd.DataFrame({'volume':[.1,.3,.5,.7,.1,.7],
                      'new_volume':[5,3,6,9,2,4],
                      'commentary':list('aaabbb')})
print (df_gp)
volume new_volume commentary
0 0.1 5 a
1 0.3 3 a
2 0.5 6 a
3 0.7 9 b
4 0.1 2 b
5 0.7 4 b
#create boolean mask and assign to variable for reuse
m = (df_gp['volume']<df_gp['volume'].shift(1)) | (df_gp['volume']<0.4)
#update both columns with assign and write back only the filtered rows
c = ['commentary','new_volume']
df_gp.loc[m, c] = df_gp.loc[m, c].assign(new_volume=df_gp['new_volume']*1.1,
                                         commentary='updated')
print (df_gp)
volume new_volume commentary
0 0.1 5.5 updated
1 0.3 3.3 updated
2 0.5 6.0 a
3 0.7 9.0 b
4 0.1 2.2 updated
5 0.7 4.0 b
#multiply filtered column by scalar
df_gp.loc[m, 'new_volume'] *= 1.1
#assign new value to filtered rows
df_gp.loc[m, 'commentary'] = 'updated'
print (df_gp)
volume new_volume commentary
0 0.1 5.5 updated
1 0.3 3.3 updated
2 0.5 6.0 a
3 0.7 9.0 b
4 0.1 2.2 updated
5 0.7 4.0 b
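If you prefer a single conditional expression per column, numpy.where should do the same job (a sketch, reusing the mask m from above):
import numpy as np

#where m is True, scale the volume and flag the row; otherwise keep the old values
df_gp['new_volume'] = np.where(m, df_gp['new_volume']*1.1, df_gp['new_volume'])
df_gp['commentary'] = np.where(m, 'updated', df_gp['commentary'])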
I am trying to pivot a pandas table composed of 3 columns, where process_id identifies the process that generates a series of scalar values and forms part of the resultant dataframe's columns (one per process), as the following describes:
Input
time scalar process_id
1 0.5 A
1 0.6 B
2 0.7 A
2 1.5 B
3 1.6 A
3 1.9 B
Resultant:
time scalar_A scalar_B
1 0.5 0.6
2 0.7 1.5
3 1.6 1.9
I have tried using unstack (after setting time and process_id as a MultiIndex), however this causes the columns and the process_id that generated them to be nested:
df.set_index(['time', 'process_id'], inplace=True)
df = df.unstack(level=-1)
How would one most efficiently/effectively achieve this?
Thanks
It's actually already covered by the pd.DataFrame.pivot method:
new_df = df.pivot(index='time', columns='process_id', values='scalar').reset_index()
Output:
process_id time A B
0 1 0.5 0.6
1 2 0.7 1.5
2 3 1.6 1.9
And if you want to rename your columns:
new_df = df.pivot(index='time', columns='process_id', values='scalar')
new_df.columns = [f'scalar_{i}' for i in new_df.columns]
new_df = new_df.reset_index()
Output:
time scalar_A scalar_B
0 1 0.5 0.6
1 2 0.7 1.5
2 3 1.6 1.9
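As a side note, the renaming step can probably be shortened with DataFrame.add_prefix, which prepends a string to every column label:
new_df = df.pivot(index='time', columns='process_id', values='scalar').add_prefix('scalar_').reset_index()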
I have a sensor. For some reason, the sensor likes to record data like this:
>df
obs count
-0.3 3
0.9 2
1.4 5
i.e. it first records observations and makes a count table out of them. What I would like to do is convert this df into a series of raw observations. For example, I would like to end up with: [-0.3,-0.3,-0.3,0.9,0.9,1.4,1.4 ....]
A similar question was asked for Excel.
If your dataframe structure is like this one (or similar):
obs count
0 -0.3 3
1 0.9 2
2 1.4 5
This is an option, using numpy.repeat:
import numpy as np
import pandas as pd

df = pd.DataFrame({'obs':[-0.3, 0.9, 1.4], 'count':[3, 2, 5]})
times = df['count']
df2 = pd.DataFrame({'obs': np.repeat(df['obs'].values, times)})
print(df2)
obs
0 -0.3
1 -0.3
2 -0.3
3 0.9
4 0.9
5 1.4
6 1.4
7 1.4
8 1.4
9 1.4
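If you would rather stay within pandas, Series.repeat should give an equivalent result (a sketch; reset_index renumbers the duplicated index labels):
#repeat each observation according to its count, then renumber the index
obs = df['obs'].repeat(df['count']).reset_index(drop=True)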
Good morning (beginner here).
I have the following pandas dataframe:
My goal is to take the first time a new ID appears and set the VALUE column to 1000 * the DELTA of that row. For all consecutive rows of that ID, the VALUE is the VALUE of the row above * the DELTA of the current row.
I tried by getting all unique ID values:
a=stocks2.ID.unique()
a.tolist()
It works; unfortunately, I do not really know how to iterate in the way I described. Any kind of help or tip would be greatly appreciated!
A way to do it would be as follows. Example dataframe:
import pandas as pd

df = pd.DataFrame({'ID':[1,1,5,3,3], 'delta':[0.3,0.5,0.2,2,4]}).assign(value=[2,5,4,2,3])
print(df)
ID delta value
0 1 0.3 2
1 1 0.5 5
2 5 0.2 4
3 3 2.0 2
4 3 4.0 3
Fill value using the delta and value from the row above:
df['value'] = df.shift(1).delta * df.shift(1).value
Groupby to get the indices where the first ID appears:
w = df.groupby('ID', as_index=False).nth(0).index.values
And compute the values for value using the indices in w:
df.loc[w,'value'] = df.loc[w,'delta'] * 1000
Which gives for this example:
ID delta value
0 1 0.3 300.0
1 1 0.5 0.6
2 5 0.2 200.0
3 3 2.0 2000.0
4 3 4.0 4.0
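Note that if the definition is read recursively, i.e. each row uses the previous row's updated VALUE rather than its original one, the recurrence unrolls to a cumulative product of delta within each ID, so a vectorized sketch (starting from the original df) could be:
#value[0] = 1000*delta[0]; value[i] = value[i-1]*delta[i]
#which is 1000 * cumprod(delta) within each ID group
df['value'] = df.groupby('ID')['delta'].cumprod() * 1000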
I'm trying to take a difference of consecutive numbers in one of dataframe columns, while preserving an order in another columns, for example:
import pandas as pd
df = pd.DataFrame({"A": [1,1,1,2,2,2,3,3,3,4],
"B": [2,1,3,3,2,1,1,2,3,4],
"C": [2.1,2.0,2.2,1.2,1.1,1.0,3.0,3.1,3.2,3.3]})
In [1]: df
Out[1]:
A B C
0 1 2 2.1
1 1 1 2.0
2 1 3 2.2
3 2 3 1.2
4 2 2 1.1
5 2 1 1.0
6 3 1 3.0
7 3 2 3.1
8 3 3 3.2
9 4 4 3.3
I would like to:
- for each distinctive element of column A (1, 2, 3, and 4)
- sort column B and take consecutive differences of column C
without a loop, to get something like that
In [2]: df2
Out[2]:
A B C Diff
0 1 2 2.1 0.1
2 1 3 2.2 0.1
3 2 3 1.2 0.1
4 2 2 1.1 0.1
7 3 2 3.1 0.1
8 3 3 3.2 0.1
I have run a number of operations:
df2 = df.groupby(by='A').apply(lambda x: x.sort_values(by = ['B'])['C'].diff())
df3 = pd.DataFrame(df2)
df3.reset_index(inplace=True)
df4 = df3.set_index('level_1')
df5 = df.copy()
df5['diff'] = df4['C']
and got what I wanted:
df5
Out[1]:
A B C diff
0 1 2 2.1 0.1
1 1 1 2.0 NaN
2 1 3 2.2 0.1
3 2 3 1.2 0.1
4 2 2 1.1 0.1
5 2 1 1.0 NaN
6 3 1 3.0 NaN
7 3 2 3.1 0.1
8 3 3 3.2 0.1
9 4 4 3.3 NaN
but is there a more efficient way of doing so?
(NaN values can be easily removed so I'm not fussy about that part)
A little unclear on what is expected as the result (why are there fewer rows?).
For taking the consecutive differences you probably want to use Series.diff() (see docs here):
df['Diff'] = df.C.diff()
You can use the periods keyword if you want some (positive or negative) lag when taking the differences.
I don't see where the sort part comes into effect, but for that you probably want to use Series.sort_values() (see docs here).
EDIT
Based on your updated information, I believe this may be what you are looking for:
df.sort_values(by=['B', 'C'], inplace=True)
df['diff'] = df.C.diff()
EDIT 2
Based on your new updated information about the calculation, you want to:
- groupby by A (see docs on DataFrame.groupby() here)
- sort (each group) by B (or presort by A then B, prior to groupby)
- calculate differences of C (and dismiss the first record since it will be missing).
The following code achieves that:
df.sort_values(by=['A','B'], inplace=True)
df['Diff'] = df.groupby('A').apply(lambda x: x['C'].diff()).values
df2 = df.dropna()
Explanation of the code:
The first line sorts the dataframe.
The second line has a bunch of things going on:
- first the groupby (which generates a grouped DataFrame; see the helpful pandas page on split-apply-combine if you're new to groupby),
- then taking the differences of C for each group,
- then "flattening" the grouped result into a plain array with .values,
- which we assign to df['Diff'] (that is why we needed to presort the dataframe, so this assignment lines up correctly... if not we would have to merge the series on A and B).
The third line just removes the NAs and assigns the result to df2.
EDIT 3
I think my EDIT 2 version may be what you are looking for; it is a bit more concise and generates less auxiliary data. However, you can also improve your version of the solution a little:
df3.reset_index(level=0, inplace=True) # no need to reset and then set again
df5 = df.copy() # only if you don't want to change df
df5['diff'] = df3.C # else, just do df.insert(2, 'diff', df3.C)
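For completeness, selecting the column before the groupby avoids both the apply and the .values flattening, since the resulting series keeps the original index and aligns on assignment (a sketch):
#sort so differences are taken in B order within each A group
df = df.sort_values(['A', 'B'])
df['Diff'] = df.groupby('A')['C'].diff()
df2 = df.dropna()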
I am attempting to bin data in one dataframe according to bins defined in a second dataframe. I am thinking that some combination of pd.cut and pd.merge might get me there?
This is basically the form each dataframe is currently in:
df = pd.DataFrame({'id':['a', 'b', 'c', 'd', 'e'],
                   'bin':[1, 2, 3, 3, 2],
                   'perc':[0.1,0.9,0.3,0.7,0.5]})
df2 = pd.DataFrame({'bin':[1, 1, 1, 2, 2, 2, 3, 3, 3],
                    'result':['low', 'medium','high','low', 'medium','high','low', 'medium','high'],
                    'cut_min':[0,0.2,0.6,0,0.3,0.7,0,0.4,0.8],
                    'cut_max':[0.2,0.6,1,0.3,0.7,1,0.4,0.8,1]})
df:
bin id perc
1 a 0.1
2 b 0.9
3 c 0.3
3 d 0.7
2 e 0.5
And this is the table with the bins, df2:
bin cut_max cut_min result
1 0.2 0.0 low
1 0.6 0.2 medium
1 1.0 0.6 high
2 0.3 0.0 low
2 0.7 0.3 medium
2 1.0 0.7 high
3 0.4 0.0 low
3 0.8 0.4 medium
3 1.0 0.8 high
I would like to match the bin, and find the appropriate result in df2 using the cut_min and cut_max that encompasses the perc value in df. So, I would like the resulting table to look like this:
bin id perc result
1 a 0.1 low
2 b 0.9 high
3 c 0.3 low
3 d 0.7 medium
2 e 0.5 medium
I originally wrote this in a SQL query which accomplished the task quite simply with a join:
select
df.id
, df.bin
, df.perc
, df2.result
from df
inner join df2
on df.bin = df2.bin
and df.perc >= df2.cut_min
and df.perc < df2.cut_max
If anyone knows a good way to do this using pandas, it would be greatly appreciated! (This is actually the first time I haven't been able to find a solution just by searching on Stack Overflow, so my apologies if anything above wasn't explained well enough!)
First merge df and df2 on the bin column, and then select the rows where cut_min <= perc < cut_max:
In [95]: result = pd.merge(df, df2, on='bin').query('cut_min <= perc < cut_max'); result
Out[95]:
bin id perc cut_max cut_min result
0 1 a 0.1 0.2 0.0 low
5 2 b 0.9 1.0 0.7 high
7 2 e 0.5 0.7 0.3 medium
9 3 c 0.3 0.4 0.0 low
13 3 d 0.7 0.8 0.4 medium
In [97]: result = result[['bin', 'id', 'perc', 'result']]
In [98]: result.sort_values('id')
Out[98]:
bin id perc result
0 1 a 0.1 low
5 2 b 0.9 high
9 3 c 0.3 low
13 3 d 0.7 medium
7 2 e 0.5 medium
The bitstring module has a class called BitArray which you can initialize with a byte array:
(you will need to pip install bitstring)
from bitstring import BitArray
BitArray(bytes = <byte_array>)
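A minimal usage sketch (the byte value here is just an illustrative placeholder):
from bitstring import BitArray

b = BitArray(bytes=b'\x0f\xa0')  #hypothetical example bytes
print(b.bin)  #binary representation: '0000111110100000'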