Pandas accumulation without for loop - python

Hi, I have a pandas DataFrame like:
1. 1
2. 2
3. 3
4. 4
And the output is something like
1. 1
2. 3
3. 6
4. 10
where each output value is the current value plus the previous output (3 = 1 + 2, 6 = 3 + 3, 10 = 6 + 4, etc.).
Can I do this without a for loop?

You need Series.cumsum:
print (df)
col
1.0 1
2.0 2
3.0 3
4.0 4
df['col1'] = df.col.cumsum()
print (df)
col col1
1.0 1 1
2.0 2 3
3.0 3 6
4.0 4 10
If you need to overwrite column col:
df.col = df.col.cumsum()
print (df)
col
1.0 1
2.0 3
3.0 6
4.0 10
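For reference, a minimal self-contained sketch of the same idea (the column name col and the sample values are taken from the question above):
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4]})
df['col1'] = df['col'].cumsum()  # running total: 1, 3, 6, 10
print(df)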

Related

Find the time difference between consecutive rows of two columns for a given value in third column

Let's say we want to compute the variable D in the dataframe below, based on the time values in variables B and C.
Here, the second row of D is C2 - B1, a difference of 4 minutes; the third row is C3 - B2 = 4 minutes, and so on.
There is no reference value for the first row of D, so it is NA.
Issue:
We also want an NA value in the first row where the category value in variable A changes from 1 to 2. In other words, the value -183 must be replaced by NA.
A B C D
1 5:43:00 5:24:00 NA
1 6:19:00 5:47:00 4
1 6:53:00 6:23:00 4
1 7:29:00 6:55:00 2
1 8:03:00 7:31:00 2
1 8:43:00 8:05:00 2
2 6:07:00 5:40:00 -183
2 6:42:00 6:11:00 4
2 7:15:00 6:45:00 3
2 7:53:00 7:17:00 2
2 8:30:00 7:55:00 2
2 9:07:00 8:32:00 2
2 9:41:00 9:09:00 2
2 10:17:00 9:46:00 5
2 10:52:00 10:20:00 3
You can use:
# Compute delta
df['D'] = (pd.to_timedelta(df['C']).sub(pd.to_timedelta(df['B'].shift()))
           .dt.total_seconds().div(60))
# Fill nan
df.loc[df['A'].ne(df['A'].shift()), 'D'] = np.nan
Output:
>>> df
A B C D
0 1 5:43:00 5:24:00 NaN
1 1 6:19:00 5:47:00 4.0
2 1 6:53:00 6:23:00 4.0
3 1 7:29:00 6:55:00 2.0
4 1 8:03:00 7:31:00 2.0
5 1 8:43:00 8:05:00 2.0
6 2 6:07:00 5:40:00 NaN
7 2 6:42:00 6:11:00 4.0
8 2 7:15:00 6:45:00 3.0
9 2 7:53:00 7:17:00 2.0
10 2 8:30:00 7:55:00 2.0
11 2 9:07:00 8:32:00 2.0
12 2 9:41:00 9:09:00 2.0
13 2 10:17:00 9:46:00 5.0
14 2 10:52:00 10:20:00 3.0
You can use the difference between datetime columns in pandas.
Given
df['B_dt'] = pd.to_datetime(df['B'])
df['C_dt'] = pd.to_datetime(df['C'])
the following is possible:
>>> df['D'] = (df.groupby('A')
               .apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.seconds / 60)
               .reset_index(drop=True))
You can always drop these new columns later.
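As a possible variant (a sketch, not taken from either answer above): shifting B within each group of A yields NaN at every category boundary directly, so no separate fill step and no helper columns are needed:
import pandas as pd

# Sample data reconstructed from the question (first rows of each category only)
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': ['5:43:00', '6:19:00', '6:53:00', '6:07:00', '6:42:00'],
                   'C': ['5:24:00', '5:47:00', '6:23:00', '5:40:00', '6:11:00']})

# Shift B only within each group of A, so the first row of every category gets NaT
prev_b = pd.to_timedelta(df['B']).groupby(df['A']).shift()
df['D'] = (pd.to_timedelta(df['C']) - prev_b).dt.total_seconds() / 60
print(df)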

How to apply function vertically in df

I want to add consecutive values of a column, going from top to bottom.
def add(x, y):
    return x, y
df = pd.DataFrame({'A':[1,2,3,4,5]})
df['add'] = df.apply(lambda row : add(row['A'], axis = 1)
I tried using apply but it's not working.
The desired output is basically adding consecutive values of column A (1+2, 2+3, ...):
A add
0 1 1
1 2 3
2 3 5
3 4 7
4 5 9
You can apply rolling.sum on a moving window of size 2:
df.A.rolling(2, min_periods=1).sum()
0 1.0
1 3.0
2 5.0
3 7.0
4 9.0
Name: A, dtype: float64
Try this instead:
>>> df['add'] = (df + df.shift()).fillna(df)['A']
>>> df
A add
0 1 1.0
1 2 3.0
2 3 5.0
3 4 7.0
4 5 9.0
>>>
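A compact variant (a sketch, not from the answers above) that works directly on the single column, using shift with a fill value so the first row keeps its own value:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
# Add each value to the one directly above it; fill_value=0 handles the first row
df['add'] = df['A'] + df['A'].shift(fill_value=0)
print(df)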

Finding the mean of a specific column and keeping all rows that have specific mean values

I have this dataframe.
from pandas import DataFrame
import pandas as pd
df = pd.DataFrame({'name': ['A','D','M','T','B','C','D','E','A','L'],
                   'id': [1,1,1,2,2,3,3,3,3,5],
                   'rate': [3.5,4.5,2.0,5.0,4.0,1.5,2.0,2.0,1.0,5.0]})
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 C 3 1.5
6 D 3 2.0
7 E 3 2.0
8 A 3 1.0
9 L 5 5.0
df = df.groupby('id')['rate'].mean()
What I want is this:
1) Find the mean rate for every 'id'.
2) Give the number of ids (the length) whose mean is >= 3.
3) Give back all rows of the dataframe whose id has a mean >= 3.
Expected output:
Number of ids (length) where mean >= 3: 3
>> dataframe where (mean of id >= 3)
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 L 5 5.0
Use GroupBy.transform to get per-group means broadcast to the same length as the original DataFrame, which makes it possible to filter by boolean indexing:
df = df[df.groupby('id')['rate'].transform('mean') >=3]
print (df)
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
9 L 5 5.0
Detail:
print (df.groupby('id')['rate'].transform('mean'))
0 3.333333
1 3.333333
2 3.333333
3 4.500000
4 4.500000
5 1.625000
6 1.625000
7 1.625000
8 1.625000
9 5.000000
Name: rate, dtype: float64
Alternative solution with DataFrameGroupBy.filter:
df = df.groupby('id').filter(lambda x: x['rate'].mean() >=3)
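The question also asks for the number of ids whose mean is >= 3; a small sketch for that part, computed on the original DataFrame from the question (before any filtering):
# Per-id means, then count how many of them are >= 3
means = df.groupby('id')['rate'].mean()
print('Number of ids where mean >= 3:', (means >= 3).sum())  # 3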

Calculating formula based on multiple columns in Pandas Dataframe - but without creating many intermediate columns

I have been trying to calculate the "True Range" formula based on a Pandas dataframe containing stock ticker prices history.
This is the formula:
TR = max[(high - low), abs(high - close_prev), abs(low - close_prev)]
I have high, low and close as columns in the dataframe.
When I try it like this, I get an "invalid character in identifier" error, which is not very helpful. I tried many changes and combinations of the following expression, but was not successful.
df['TR']=((df['high']-df['low']), (df['high'] - df['adjclose'].shift(1)).abs(),(df['low'] - df['adjclose'].shift(1))).max(axis=1)
I know this can be achieved with three separate intermediate columns and then taking the max of them, but I want to avoid that and do it directly.
Is there a way out?
Use concat with max:
df['TR'] = pd.concat([(df['high'] - df['low']),
                      (df['high'] - df['adjclose'].shift(1)).abs(),
                      (df['low'] - df['adjclose'].shift(1))], axis=1).max(axis=1)
Sample:
df = pd.DataFrame({'high':[4,5,4,5,5,4],
                   'low':[7,8,9,4,2,3],
                   'adjclose':[1,3,5,7,1,0]})
print (df)
adjclose high low
0 1 4 7
1 3 5 8
2 5 4 9
3 7 5 4
4 1 5 2
5 0 4 3
df['TR'] = pd.concat([(df['high']-df['low']),
                      (df['high'] - df['adjclose'].shift(1)).abs(),
                      (df['low'] - df['adjclose'].shift(1))], axis=1).max(axis=1)
print (df)
adjclose high low TR
0 1 4 7 -3.0
1 3 5 8 7.0
2 5 4 9 6.0
3 7 5 4 1.0
4 1 5 2 3.0
5 0 4 3 3.0
Detail:
print (pd.concat([(df['high']-df['low']),
                  (df['high'] - df['adjclose'].shift(1)).abs(),
                  (df['low'] - df['adjclose'].shift(1))], axis=1))
0 1 2
0 -3 NaN NaN
1 -3 4.0 7.0
2 -5 1.0 6.0
3 1 0.0 -1.0
4 3 2.0 -5.0
5 1 3.0 2.0
A NumPy solution behaves differently, because the max of a row containing NaN is again NaN:
df['TR1'] = np.max(np.c_[(df['high']-df['low']),
                         (df['high'] - df['adjclose'].shift(1)).abs(),
                         (df['low'] - df['adjclose'].shift(1))], axis=1)
print (df)
adjclose high low TR1
0 1 4 7 NaN
1 3 5 8 7.0
2 5 4 9 6.0
3 7 5 4 1.0
4 1 5 2 3.0
5 0 4 3 3.0
print (np.c_[(df['high']-df['low']),
             (df['high'] - df['adjclose'].shift(1)).abs(),
             (df['low'] - df['adjclose'].shift(1))])
[[-3. nan nan]
[-3. 4. 7.]
[-5. 1. 6.]
[ 1. 0. -1.]
[ 3. 2. -5.]
[ 1. 3. 2.]]
It can also be done with:
df['TR'] = list(map(max, zip((df['high'] - df['low']),
                             (df['high'] - df['adjclose'].shift(1)).abs(),
                             (df['low'] - df['adjclose'].shift(1)))))
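Another possible sketch (not from the answers above) chains np.maximum; note that the NaN produced by shift(1) in the first row propagates here, as in the np.c_ version:
import numpy as np
import pandas as pd

df = pd.DataFrame({'high': [4, 5, 4, 5, 5, 4],
                   'low': [7, 8, 9, 4, 2, 3],
                   'adjclose': [1, 3, 5, 7, 1, 0]})

# Element-wise maximum of the three terms; row 0 becomes NaN because of shift(1)
df['TR2'] = np.maximum(df['high'] - df['low'],
                       np.maximum((df['high'] - df['adjclose'].shift(1)).abs(),
                                  df['low'] - df['adjclose'].shift(1)))
print(df)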

How to remove multiple headers from a dataframe and keep just the first - python

I'm working with a CSV file that contains multiple header rows, all repeated, as in this example:
1 2 3 4
0 POSITION_T PROB ID
1 2.385 2.0 1
2 POSITION_T PROB ID
3 3.074 6.0 3
4 6.731 8.0 4
6 POSITION_T PROB ID
7 12.508 2.0 1
8 12.932 4.0 2
9 12.985 4.0 2
I want to remove the duplicated headers to get a final document like this:
0 POSITION_T PROB ID
1 2.385 2.0 1
3 3.074 6.0 3
4 6.731 8.0 4
7 12.508 2.0 1
8 12.932 4.0 2
9 12.985 4.0 2
The way I am trying to remove these headers is by using:
df1 = [df!='POSITION_T'][df!='PROB'][df!='ID']
But that produces the error TypeError: Could not compare ['TRACK_ID'] with block values.
Any ideas? Thanks in advance!
Filtering out by field value:
df = pd.read_table('yourfile.csv', header=None, delim_whitespace=True, skiprows=1)
df.columns = ['0','POSITION_T','PROB','ID']
del df['0']
# filtering out the rows with `POSITION_T` value in corresponding column
df = df[df.POSITION_T.str.contains('POSITION_T') == False]
print(df)
The output:
POSITION_T PROB ID
1 2.385 2.0 1
3 3.074 6.0 3
4 6.731 8.0 4
6 12.508 2.0 1
7 12.932 4.0 2
8 12.985 4.0 2
This is not ideal! The best way to deal with this would be to handle it in the file parsing.
mask = df.iloc[:, 0] == 'POSITION_T'
d1 = df[~mask]
d1.columns = df.loc[mask.idxmax].values
d1 = d1.apply(pd.to_numeric, errors='ignore')
d1
POSITION_T PROB ID
1
1 2.385 2.0 1
3 3.074 6.0 3
4 6.731 8.0 4
7 12.508 2.0 1
8 12.932 4.0 2
9 12.985 4.0 2
If the columns are a MultiIndex, to keep only the bottom-level column names:
df.columns = [multicols[-1] for multicols in df.columns]
past_data=pd.read_csv("book.csv")
past_data = past_data[past_data.LAT.astype(str).str.contains('LAT') == False]
print(past_data)
Replace the CSV file name (here: book.csv).
Replace the variable name (here: past_data).
Replace LAT with any one of your column names.
That's all; the duplicated headers will be removed.
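A more generic sketch of the same filtering idea (the file name, separator, and header position are assumptions, not from the answers above): read the file once with its first header row, drop any later row that merely repeats the column names, then convert the remaining values back to numbers:
import pandas as pd

# Assumptions: 'yourfile.csv' is whitespace-separated and its first line holds the real header
df = pd.read_csv('yourfile.csv', sep=r'\s+')
df = df[~(df.astype(str) == list(df.columns)).all(axis=1)]  # drop repeated header rows
df = df.apply(pd.to_numeric)                                # restore numeric dtypes
print(df)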
