Pandas accumulation without for loop - python

Hi, I have a pandas DataFrame like:
1. 1
2. 2
3. 3
4. 4
And the output is something like
1. 1
2. 3
3. 6
4. 10
where each output value is the current value plus the previous output (3 = 1 + 2, 6 = 3 + 3, 10 = 6 + 4, etc.).
Can I do this without a for loop?

You need Series.cumsum:
print (df)
col
1.0 1
2.0 2
3.0 3
4.0 4
df['col1'] = df.col.cumsum()
print (df)
col col1
1.0 1 1
2.0 2 3
3.0 3 6
4.0 4 10
If you need to overwrite column col:
df.col = df.col.cumsum()
print (df)
col
1.0 1
2.0 3
3.0 6
4.0 10
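For reference, a minimal self-contained sketch of the same idea (the column name col and the sample values are taken from the question above):
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4]})
df['col1'] = df['col'].cumsum()  # running total: 1, 3, 6, 10
print(df)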

Related

Find the time difference between consecutive rows of two columns for a given value in third column

Let's say we want to compute the variable D in the dataframe below, based on the time values in variables B and C.
Here, the second row of D is C2 - B1, a difference of 4 minutes; the third row is C3 - B2 = 4 minutes, and so on.
There is no reference value for the first row of D, so it is NA.
Issue:
We also want an NA value in the first row where the category value in variable A changes from 1 to 2. In other words, the value -183 must be replaced by NA.
A B C D
1 5:43:00 5:24:00 NA
1 6:19:00 5:47:00 4
1 6:53:00 6:23:00 4
1 7:29:00 6:55:00 2
1 8:03:00 7:31:00 2
1 8:43:00 8:05:00 2
2 6:07:00 5:40:00 -183
2 6:42:00 6:11:00 4
2 7:15:00 6:45:00 3
2 7:53:00 7:17:00 2
2 8:30:00 7:55:00 2
2 9:07:00 8:32:00 2
2 9:41:00 9:09:00 2
2 10:17:00 9:46:00 5
2 10:52:00 10:20:00 3
You can use:
# Compute delta
df['D'] = (pd.to_timedelta(df['C']).sub(pd.to_timedelta(df['B'].shift()))
           .dt.total_seconds().div(60))
# Fill nan
df.loc[df['A'].ne(df['A'].shift()), 'D'] = np.nan
Output:
>>> df
A B C D
0 1 5:43:00 5:24:00 NaN
1 1 6:19:00 5:47:00 4.0
2 1 6:53:00 6:23:00 4.0
3 1 7:29:00 6:55:00 2.0
4 1 8:03:00 7:31:00 2.0
5 1 8:43:00 8:05:00 2.0
6 2 6:07:00 5:40:00 NaN
7 2 6:42:00 6:11:00 4.0
8 2 7:15:00 6:45:00 3.0
9 2 7:53:00 7:17:00 2.0
10 2 8:30:00 7:55:00 2.0
11 2 9:07:00 8:32:00 2.0
12 2 9:41:00 9:09:00 2.0
13 2 10:17:00 9:46:00 5.0
14 2 10:52:00 10:20:00 3.0
You can use the difference between datetime columns in pandas.
Given
df['B_dt'] = pd.to_datetime(df['B'])
df['C_dt'] = pd.to_datetime(df['C'])
the following is possible:
>>> df['D'] = (df.groupby('A')
               .apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.seconds / 60)
               .reset_index(drop=True))
You can always drop these new columns later.
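As a possible variant (a sketch, not taken from either answer above): shifting B within each group of A yields NaN at every category boundary directly, so no separate fill step and no helper columns are needed:
import pandas as pd

# Sample data reconstructed from the question (first rows of each category only)
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': ['5:43:00', '6:19:00', '6:53:00', '6:07:00', '6:42:00'],
                   'C': ['5:24:00', '5:47:00', '6:23:00', '5:40:00', '6:11:00']})

# Shift B only within each group of A, so the first row of every category gets NaT
prev_b = pd.to_timedelta(df['B']).groupby(df['A']).shift()
df['D'] = (pd.to_timedelta(df['C']) - prev_b).dt.total_seconds() / 60
print(df)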

How to apply function vertically in df

I want to add consecutive values of a column, going from top to bottom.
def add(x, y):
    return x, y
df = pd.DataFrame({'A':[1,2,3,4,5]})
df['add'] = df.apply(lambda row : add(row['A'], axis = 1)
I tried using apply but it's not working.
The desired output is basically adding consecutive values of column A (1+2, 2+3, ...):
A add
0 1 1
1 2 3
2 3 5
3 4 7
4 5 9
You can apply rolling.sum on a moving window of size 2:
df.A.rolling(2, min_periods=1).sum()
0 1.0
1 3.0
2 5.0
3 7.0
4 9.0
Name: A, dtype: float64
Try this instead:
>>> df['add'] = (df + df.shift()).fillna(df)['A']
>>> df
A add
0 1 1.0
1 2 3.0
2 3 5.0
3 4 7.0
4 5 9.0
>>>
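A compact variant (a sketch, not from the answers above) that works directly on the single column, using shift with a fill value so the first row keeps its own value:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
# Add each value to the one directly above it; fill_value=0 handles the first row
df['add'] = df['A'] + df['A'].shift(fill_value=0)
print(df)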

Finding the mean of a specific column and keeping all rows that have specific mean values

I have this dataframe.
from pandas import DataFrame
import pandas as pd
df = pd.DataFrame({'name': ['A','D','M','T','B','C','D','E','A','L'],
                   'id': [1,1,1,2,2,3,3,3,3,5],
                   'rate': [3.5,4.5,2.0,5.0,4.0,1.5,2.0,2.0,1.0,5.0]})
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 C 3 1.5
6 D 3 2.0
7 E 3 2.0
8 A 3 1.0
9 L 5 5.0
df = df.groupby('id')['rate'].mean()
What I want is this:
1) Find the mean rate for every 'id'.
2) Give the number of ids (the length) whose mean is >= 3.
3) Give back all rows of the dataframe whose id has a mean >= 3.
Expected output:
Number of ids (length) where mean >= 3: 3
>> dataframe where (mean of id >= 3)
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 L 5 5.0
Use GroupBy.transform to get per-group means broadcast to the same length as the original DataFrame, which makes it possible to filter by boolean indexing:
df = df[df.groupby('id')['rate'].transform('mean') >=3]
print (df)
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
9 L 5 5.0
Detail:
print (df.groupby('id')['rate'].transform('mean'))
0 3.333333
1 3.333333
2 3.333333
3 4.500000
4 4.500000
5 1.625000
6 1.625000
7 1.625000
8 1.625000
9 5.000000
Name: rate, dtype: float64
Alternative solution with DataFrameGroupBy.filter:
df = df.groupby('id').filter(lambda x: x['rate'].mean() >=3)
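The question also asks for the number of ids whose mean is >= 3; a small sketch for that part, computed on the original DataFrame from the question (before any filtering):
# Per-id means, then count how many of them are >= 3
means = df.groupby('id')['rate'].mean()
print('Number of ids where mean >= 3:', (means >= 3).sum())  # 3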

Calculating formula based on multiple columns in Pandas Dataframe - but without creating many intermediate columns

I have been trying to calculate the "True Range" formula based on a Pandas dataframe containing stock ticker prices history.
This is the formula:
TR = max[(high - low), abs(high - close_prev), abs(low - close_prev)]
I have high, low and close as columns in the dataframe.
When I try it like this, I get an "invalid character in identifier" error, which is not very helpful. I tried many changes and combinations of the following expression, but was not successful.
df['TR']=((df['high']-df['low']), (df['high'] - df['adjclose'].shift(1)).abs(),(df['low'] - df['adjclose'].shift(1))).max(axis=1)
I know this can be achieved with three separate intermediate columns and then taking the max of them, but I want to avoid that and do it directly.
Is there a way out?
Use concat with max:
df['TR'] = pd.concat([(df['high'] - df['low']),
                      (df['high'] - df['adjclose'].shift(1)).abs(),
                      (df['low'] - df['adjclose'].shift(1))], axis=1).max(axis=1)
Sample:
df = pd.DataFrame({'high':[4,5,4,5,5,4],
                   'low':[7,8,9,4,2,3],
                   'adjclose':[1,3,5,7,1,0]})
print (df)
adjclose high low
0 1 4 7
1 3 5 8
2 5 4 9
3 7 5 4
4 1 5 2
5 0 4 3
df['TR'] = pd.concat([(df['high']-df['low']),
                      (df['high'] - df['adjclose'].shift(1)).abs(),
                      (df['low'] - df['adjclose'].shift(1))], axis=1).max(axis=1)
print (df)
adjclose high low TR
0 1 4 7 -3.0
1 3 5 8 7.0
2 5 4 9 6.0
3 7 5 4 1.0
4 1 5 2 3.0
5 0 4 3 3.0
Detail:
print (pd.concat([(df['high']-df['low']),
                  (df['high'] - df['adjclose'].shift(1)).abs(),
                  (df['low'] - df['adjclose'].shift(1))], axis=1))
0 1 2
0 -3 NaN NaN
1 -3 4.0 7.0
2 -5 1.0 6.0
3 1 0.0 -1.0
4 3 2.0 -5.0
5 1 3.0 2.0
A NumPy solution behaves differently, because the max of a row containing NaN is again NaN:
df['TR1'] = np.max(np.c_[(df['high']-df['low']),
                         (df['high'] - df['adjclose'].shift(1)).abs(),
                         (df['low'] - df['adjclose'].shift(1))], axis=1)
print (df)
adjclose high low TR1
0 1 4 7 NaN
1 3 5 8 7.0
2 5 4 9 6.0
3 7 5 4 1.0
4 1 5 2 3.0
5 0 4 3 3.0
print (np.c_[(df['high']-df['low']),
             (df['high'] - df['adjclose'].shift(1)).abs(),
             (df['low'] - df['adjclose'].shift(1))])
[[-3. nan nan]
[-3. 4. 7.]
[-5. 1. 6.]
[ 1. 0. -1.]
[ 3. 2. -5.]
[ 1. 3. 2.]]
It can also be done with:
df['TR'] = list(map(max, zip((df['high'] - df['low']),
                             (df['high'] - df['adjclose'].shift(1)).abs(),
                             (df['low'] - df['adjclose'].shift(1)))))
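Another possible sketch (not from the answers above) chains np.maximum; note that the NaN produced by shift(1) in the first row propagates here, as in the np.c_ version:
import numpy as np
import pandas as pd

df = pd.DataFrame({'high': [4, 5, 4, 5, 5, 4],
                   'low': [7, 8, 9, 4, 2, 3],
                   'adjclose': [1, 3, 5, 7, 1, 0]})

# Element-wise maximum of the three terms; row 0 becomes NaN because of shift(1)
df['TR2'] = np.maximum(df['high'] - df['low'],
                       np.maximum((df['high'] - df['adjclose'].shift(1)).abs(),
                                  df['low'] - df['adjclose'].shift(1)))
print(df)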

How to remove multiple headers from a dataframe and keep just the first - python

I'm working with a CSV file that contains multiple header rows, all repeated, as in this example:
1 2 3 4
0 POSITION_T PROB ID
1 2.385 2.0 1
2 POSITION_T PROB ID
3 3.074 6.0 3
4 6.731 8.0 4
6 POSITION_T PROB ID
7 12.508 2.0 1
8 12.932 4.0 2
9 12.985 4.0 2
I want to remove the duplicated headers to get a final document like this:
0 POSITION_T PROB ID
1 2.385 2.0 1
3 3.074 6.0 3
4 6.731 8.0 4
7 12.508 2.0 1
8 12.932 4.0 2
9 12.985 4.0 2
The way I am trying to remove these headers is by using:
df1 = [df!='POSITION_T'][df!='PROB'][df!='ID']
But that produces the error TypeError: Could not compare ['TRACK_ID'] with block values.
Any ideas? Thanks in advance!
Filtering out by field value:
df = pd.read_table('yourfile.csv', header=None, delim_whitespace=True, skiprows=1)
df.columns = ['0','POSITION_T','PROB','ID']
del df['0']
# filtering out the rows with `POSITION_T` value in corresponding column
df = df[df.POSITION_T.str.contains('POSITION_T') == False]
print(df)
The output:
POSITION_T PROB ID
1 2.385 2.0 1
3 3.074 6.0 3
4 6.731 8.0 4
6 12.508 2.0 1
7 12.932 4.0 2
8 12.985 4.0 2
This is not ideal! The best way to deal with this would be to handle it in the file parsing.
mask = df.iloc[:, 0] == 'POSITION_T'
d1 = df[~mask]
d1.columns = df.loc[mask.idxmax].values
d1 = d1.apply(pd.to_numeric, errors='ignore')
d1
POSITION_T PROB ID
1
1 2.385 2.0 1
3 3.074 6.0 3
4 6.731 8.0 4
7 12.508 2.0 1
8 12.932 4.0 2
9 12.985 4.0 2
If the columns are a MultiIndex, to keep only the bottom-level column names:
df.columns = [multicols[-1] for multicols in df.columns]
past_data=pd.read_csv("book.csv")
past_data = past_data[past_data.LAT.astype(str).str.contains('LAT') == False]
print(past_data)
Replace the CSV file name (here: book.csv).
Replace the variable name (here: past_data).
Replace LAT with any one of your column names.
That's all; the duplicated headers will be removed.
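A more generic sketch of the same filtering idea (the file name, separator, and header position are assumptions, not from the answers above): read the file once with its first header row, drop any later row that merely repeats the column names, then convert the remaining values back to numbers:
import pandas as pd

# Assumptions: 'yourfile.csv' is whitespace-separated and its first line holds the real header
df = pd.read_csv('yourfile.csv', sep=r'\s+')
df = df[~(df.astype(str) == list(df.columns)).all(axis=1)]  # drop repeated header rows
df = df.apply(pd.to_numeric)                                # restore numeric dtypes
print(df)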
