I want to calculate the distance that my cars have driven. I have all the coordinates that the cars need to go to. Some cars park earlier than others, and this messes up my calculation.
I have this:
cars = pd.DataFrame({'x': [3, 3, 3, 3, 3, 3, 3, 3, 3],
                     'y': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                     'x_goal_1': [3, 3, 3, 3, 3, 3, 3, 3, 3],
                     'y_goal_1': [10, 10, 10, 10, 10, 10, 10, 10, 10],
                     'x_goal_2': [17, 24, 31, 31, 17, 17, 38, 38, 31],
                     'y_goal_2': [10, 10, 10, 10, 10, 10, 10, 10, 10],
                     'x_goal_3': [17, 24, 31, 31, 17, 17, 38, 38, 31],
                     'y_goal_3': [17, 3, 3, 3, 17, 17, 17, 17, 3],
                     'x_goal_4': [None, 27, 35, 28, 14, 18, 42, 43, None],
                     'y_goal_4': [None, 3, 3, 3, 17, 17, 17, 17, None],
                     'z': [3, 4, 5, 6, 7, 8, 9, 12, 22]})
cars['moved_tot'] = (
abs(cars['x']-cars['x_goal_1']) + abs(cars['y']-cars['y_goal_1']) +
abs(cars['x_goal_1']-cars['x_goal_2']) + abs(cars['y_goal_1']-cars['y_goal_2']) +
abs(cars['x_goal_2']-cars['x_goal_3']) + abs(cars['y_goal_2']-cars['y_goal_3']) +
abs(cars['x_goal_3']-cars['x_goal_4']) + abs(cars['y_goal_3']-cars['y_goal_4']) )
I then get:
x y x_goal_1 y_goal_1 ... x_goal_4 y_goal_4 z moved_tot
0 3 1 3 10 ... NaN NaN 3 NaN
1 3 2 3 10 ... 27.0 3.0 4 39.0
2 3 3 3 10 ... 35.0 3.0 5 46.0
3 3 4 3 10 ... 28.0 3.0 6 44.0
4 3 5 3 10 ... 14.0 17.0 7 29.0
5 3 6 3 10 ... 18.0 17.0 8 26.0
6 3 7 3 10 ... 42.0 17.0 9 49.0
7 3 8 3 10 ... 43.0 17.0 12 49.0
8 3 9 3 10 ... NaN NaN 22 NaN
For the first row I want moved_tot to be 30, and for the last row I want 36. I want the calculation to ignore a value when it is None (that is, when that car has already parked). How do I do this?
With help from David S (thank you!) I figured out how to do it:
cars['moved_tot'] = (
    abs(cars['x'] - cars['x_goal_1']).fillna(0) + abs(cars['y'] - cars['y_goal_1']).fillna(0) +
    abs(cars['x_goal_1'] - cars['x_goal_2']).fillna(0) + abs(cars['y_goal_1'] - cars['y_goal_2']).fillna(0) +
    abs(cars['x_goal_2'] - cars['x_goal_3']).fillna(0) + abs(cars['y_goal_2'] - cars['y_goal_3']).fillna(0) +
    abs(cars['x_goal_3'] - cars['x_goal_4']).fillna(0) + abs(cars['y_goal_3'] - cars['y_goal_4']).fillna(0)
)
You could just replace the NaN values with 0 to avoid getting NaN in the result column, like:
cars['moved_tot'] = (abs(cars['x']-cars['x_goal_1'].fillna(0)) + abs(cars['y']-cars['y_goal_1'].fillna(0)) +
abs(cars['x_goal_1'].fillna(0)-cars['x_goal_2'].fillna(0)) + abs(cars['y_goal_1'].fillna(0)-cars['y_goal_2'].fillna(0)) +
abs(cars['x_goal_2'].fillna(0)-cars['x_goal_3'].fillna(0)) + abs(cars['y_goal_2'].fillna(0)-cars['y_goal_3'].fillna(0)) +
abs(cars['x_goal_3'].fillna(0)-cars['x_goal_4'].fillna(0)) + abs(cars['y_goal_3'].fillna(0)-cars['y_goal_4'].fillna(0)) )
If instead you want a term to contribute 0 whenever a NaN is present in it, just move the .fillna(0) outside the abs(), as in the snippet above.
Just use cars = cars.fillna(0) or cars.fillna(0, inplace=True) to avoid repeating .fillna(0) after each abs.
If you don't want to change the original DataFrame, use cars_ = cars.fillna(0) and then replace cars with cars_ in the moved_tot calculation.
Besides, you could make use of the column order to avoid writing out the repeated column names:
cars_ = cars.fillna(0)
cars_['moved_tot'] = 0
for i in range(len(cars.columns) - 3):
    print(cars_.columns[i], '-', cars_.columns[i + 2])
    cars_['moved_tot'] += abs(cars_[cars_.columns[i]] - cars_[cars_.columns[i + 2]])
# Output of the print calls, followed by print(cars_):
x - x_goal_1
y - y_goal_1
x_goal_1 - x_goal_2
y_goal_1 - y_goal_2
x_goal_2 - x_goal_3
y_goal_2 - y_goal_3
x_goal_3 - x_goal_4
y_goal_3 - y_goal_4
x y x_goal_1 y_goal_1 x_goal_2 y_goal_2 x_goal_3 y_goal_3 x_goal_4 y_goal_4 z moved_tot
0 3 1 3 10 17 10 17 17 0.0 0.0 3 64.0
1 3 2 3 10 24 10 24 3 27.0 3.0 4 39.0
2 3 3 3 10 31 10 31 3 35.0 3.0 5 46.0
3 3 4 3 10 31 10 31 3 28.0 3.0 6 44.0
4 3 5 3 10 17 10 17 17 14.0 17.0 7 29.0
5 3 6 3 10 17 10 17 17 18.0 17.0 8 26.0
6 3 7 3 10 38 10 38 17 42.0 17.0 9 49.0
7 3 8 3 10 38 10 38 17 43.0 17.0 12 49.0
8 3 9 3 10 31 10 31 3 0.0 0.0 22 70.0
You could even do the sum in one line
cars_['moved_tot'] = pd.concat([abs(cars_[cars_.columns[i]] - cars_[cars_.columns[i+2]]) for i in range(len(cars.columns) - 3)], axis=1).sum(1)
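One hedged caveat about filling the whole frame with 0: replacing the missing goal coordinates with 0 changes the totals for cars that parked early, which is why rows 0 and 8 come out as 64.0 and 70.0 in the output above rather than the 30 and 36 asked for. A minimal variant of the column loop that fills after taking the absolute difference (assuming the x_goal_k/y_goal_k naming from the example) keeps those rows at 30 and 36:
# Build the coordinate-column order explicitly, then fill NaN after the
# subtraction so a leg with a missing goal contributes 0 to the total.
cols = ['x', 'y'] + [f'{axis}_goal_{k}' for k in range(1, 5) for axis in ('x', 'y')]
cars['moved_tot'] = sum(
    abs(cars[cols[i]] - cars[cols[i + 2]]).fillna(0) for i in range(len(cols) - 2)
)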
Compare each row of column A with the previous row:
if it is greater, reassign it to the value of the previous row;
if it is less, leave the value unchanged.
The problem is that each comparison is made against the original value.
What I want is to compare against the previous row after reassignment.
import pandas as pd
import numpy as np
d={'A':[16,19,18,15,13,16]}
df = pd.DataFrame(d)
df['A_changed']=np.where(df.A>df.A.shift(),df.A.shift(),df.A)
df
A A_changed
0 16 16.0
1 19 16.0
2 18 18.0
3 15 15.0
4 13 13.0
5 16 13.0
expected output
A A_changed
0 16 16.0
1 19 16.0
2 18 16.0
3 15 15.0
4 13 13.0
5 16 13.0
Are you trying to do cummin?
df['compare_min'] = df['A'].cummin()
Output:
A compare compare_min
0 5 5.0 5
1 14 5.0 5
2 12 12.0 5
3 15 12.0 5
4 13 13.0 5
5 16 13.0 5
df['b'] = [10, 11, 12, 5, 8, 2]
df['compare_min_b'] = df['b'].cummin()
Output:
A compare compare_min b compare_min_b
0 5 5.0 5 10 10
1 14 5.0 5 11 10
2 12 12.0 5 12 10
3 15 12.0 5 5 5
4 13 13.0 5 8 5
5 16 13.0 5 2 2
Update: using your example, this is exactly what cummin does:
d={'A':[16,19,18,15,13,16]}
df = pd.DataFrame(d)
df['A_change'] = df['A'].cummin()
df
Output:
A A_changed A_change
0 16 16.0 16
1 19 16.0 16
2 18 18.0 16
3 15 15.0 15
4 13 13.0 13
5 16 13.0 13
Here is why your code will not work:
d={'A':[16,19,18,15,13,16]}
df = pd.DataFrame(d)
df['A_shift'] = df['A'].shift()
df
Output:
A A_shift
0 16 NaN
1 19 16.0
2 18 19.0
3 15 18.0
4 13 15.0
5 16 13.0
Look at the output of the shifted column: what you want is to keep the cumulative minimum instead of just comparing A to the shifted A. That is why index 2 is not giving you what you expected.
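For intuition, here is a minimal sketch (the A_loop and A_cummin column names are just illustrative) showing that the rule "compare with the previous row after reassignment" is exactly a running minimum, which is what cummin computes:
import pandas as pd

df = pd.DataFrame({'A': [16, 19, 18, 15, 13, 16]})

# Explicit form of the rule: if a value is greater than the previous
# (already adjusted) value, take that previous value instead.
adjusted = []
for value in df['A']:
    if adjusted and value > adjusted[-1]:
        value = adjusted[-1]
    adjusted.append(value)

df['A_loop'] = adjusted            # 16, 16, 16, 15, 13, 13
df['A_cummin'] = df['A'].cummin()  # identical running minimum
print(df)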
I have a pandas DataFrame like this:
x y z time
0 0.730110 4.091428 7.833503 1618237788537
1 0.691825 4.024428 7.998608 1618237788537
2 0.658325 3.998107 8.195119 1618237788537
3 0.658325 4.002893 8.408080 1618237788537
4 0.677468 4.017250 8.561220 1618237788537
I want to add a column called computed to this DataFrame. Its values are computed as follows:
row 0: (0.730110 - 0)^2 + (4.091428 - 0)^2 + (7.833503 - 0)^2
row 1: (0.691825 - 0.730110)^2 + (4.024428 - 4.091428)^2 + (7.998608 - 7.833503)^2
etc.
How can I do that, please?
TL;DR:
df['computed'] = df.diff().pow(2).sum(axis=1)
df.at[0, 'computed'] = df.loc[0].pow(2).sum()
Step by step:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [1, 1, 2, 3, 5, 8], 'c': [1, 4, 9, 16, 25, 36]})
df
# a b c
# 0 1 1 1
# 1 2 1 4
# 2 3 2 9
# 3 4 3 16
# 4 5 5 25
# 5 6 8 36
df.diff()
# a b c
# 0 NaN NaN NaN
# 1 1.0 0.0 3.0
# 2 1.0 1.0 5.0
# 3 1.0 1.0 7.0
# 4 1.0 2.0 9.0
# 5 1.0 3.0 11.0
df.diff().pow(2)
# a b c
# 0 NaN NaN NaN
# 1 1.0 0.0 9.0
# 2 1.0 1.0 25.0
# 3 1.0 1.0 49.0
# 4 1.0 4.0 81.0
# 5 1.0 9.0 121.0
df.diff().pow(2).sum(axis=1)
# 0 0.0
# 1 10.0
# 2 27.0
# 3 51.0
# 4 86.0
# 5 131.0
df['computed'] = df.diff().pow(2).sum(axis=1)
df
# a b c computed
# 0 1 1 1 0.0
# 1 2 1 4 10.0
# 2 3 2 9 27.0
# 3 4 3 16 51.0
# 4 5 5 25 86.0
# 5 6 8 36 131.0
df.at[0, 'computed'] = df.loc[0].pow(2).sum()
df
# a b c computed
# 0 1 1 1 3.0
# 1 2 1 4 10.0
# 2 3 2 9 27.0
# 3 4 3 16 51.0
# 4 5 5 25 86.0
# 5 6 8 36 131.0
Relevant documentation and related questions:
Difference between rows with .diff();
Square each cell with .pow(2);
Sum by row with .sum(axis=1);
How to calculate sum of Nth power of each cell for a column in dataframe?;
Set value for particular cell in pandas DataFrame?.
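Applied to the DataFrame from the question, you would presumably restrict the sum to the coordinate columns so that time does not enter the computation. A sketch using the rows shown in the question:
import pandas as pd

df = pd.DataFrame({
    'x': [0.730110, 0.691825, 0.658325, 0.658325, 0.677468],
    'y': [4.091428, 4.024428, 3.998107, 4.002893, 4.017250],
    'z': [7.833503, 7.998608, 8.195119, 8.408080, 8.561220],
    'time': [1618237788537] * 5,
})

coords = df[['x', 'y', 'z']]                       # leave 'time' out of the sum
df['computed'] = coords.diff().pow(2).sum(axis=1)
df.at[0, 'computed'] = coords.loc[0].pow(2).sum()  # row 0: squared distance from the origin
print(df)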
I have a dataframe:
import numpy as np
import pandas as pd
np.random.seed(18)
df = pd.DataFrame(np.random.randint(0,50,size=(10, 2)), columns=list('AB'))
df['Min'] = np.nan
n = 3 # can be changed
I need to fill column 'Min' with the minimum of the next n entries of column 'B'.
Currently I do it using iteration:
for row in range(0, df.shape[0] - n):
    low = []
    for i in range(1, n + 1):
        low.append(df.loc[df.index[row + i], 'B'])
    df.loc[df.index[row], 'Min'] = min(low)
But it is quite a slow process. Is there a more efficient way, please? Thank you.
Use rolling with min and then shift:
df['Min'] = df['B'].rolling(n).min().shift(-n)
print (df)
A B Min
0 42 19 2.0
1 5 49 2.0
2 46 2 17.0
3 8 24 17.0
4 34 17 11.0
5 5 21 4.0
6 47 42 1.0
7 10 11 NaN
8 36 4 NaN
9 43 1 NaN
If performance is important, use this solution:
def rolling_window(a, window):
    # Strided 2-D view of `a` in which each row is a length-`window` slice.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

arr = rolling_window(df['B'].values, n).min(axis=1)
df['Min'] = np.concatenate([arr[1:], [np.nan] * n])
print (df)
A B Min
0 42 19 2.0
1 5 49 2.0
2 46 2 17.0
3 8 24 17.0
4 34 17 11.0
5 5 21 4.0
6 47 42 1.0
7 10 11 NaN
8 36 4 NaN
9 43 1 NaN
Jez's got it. Just as another option, you can also do a forward rolling through the Series (as suggested by Andy here):
df.B[::-1].rolling(3).min()[::-1].shift(-1)
0 2.0
1 2.0
2 17.0
3 17.0
4 11.0
5 4.0
6 1.0
7 NaN
8 NaN
9 NaN
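On pandas 1.1 or newer, the same forward-looking window can also be expressed with a FixedForwardWindowIndexer instead of shifting or reversing the result. A minimal sketch, assuming the seeded example above:
import numpy as np
import pandas as pd

np.random.seed(18)
df = pd.DataFrame(np.random.randint(0, 50, size=(10, 2)), columns=list('AB'))
n = 3

# Shift B up by one row so the window starts at the *next* entry, then take
# the minimum over a forward-looking window of size n.
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=n)
df['Min'] = df['B'].shift(-1).rolling(window=indexer, min_periods=n).min()
print(df)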
This is my code:
import os
import pandas as pd
import numpy as np
import pylab as pl
from sklearn import tree
os.chdir('C:/Users/Shinelon/Desktop/ch13')
w=pd.read_table('cup98lrn.txt',sep=',',low_memory=False)
w1=(w.loc[:,['AGE','AVGGIFT','CARDGIFT','CARDPM12','CARDPROM','CLUSTER2','DOMAIN','GENDER','GEOCODE2','HIT',
'HOMEOWNR','HPHONE_D','INCOME','LASTGIFT','MAXRAMNT',
'MDMAUD_F','MDMAUD_R','MINRAMNT','NGIFTALL','NUMPRM12',
'RAMNTALL',
'RFA_2A','RFA_2F','STATE','TIMELAG','TARGET_B']]).dropna(how='any')
x=w1.loc[:,['AGE','AVGGIFT','CARDGIFT','CARDPM12','CARDPROM','CLUSTER2','DOMAIN','GENDER','GEOCODE2','HIT',
'HOMEOWNR','HPHONE_D','INCOME','LASTGIFT','MAXRAMNT',
'MDMAUD_F','MDMAUD_R','MINRAMNT','NGIFTALL','NUMPRM12',
'RAMNTALL',
'RFA_2A','RFA_2F','STATE','TIMELAG']]
y=w1.loc[:,['TARGET_B']]
clf=tree.DecisionTreeClassifier(min_samples_split=1000,min_samples_leaf=400,max_depth=10)
print(w1.head())
clf=clf.fit(x,y)
but an error appears that I don't understand, even though I have used sklearn.tree before. Running D:\python3.6\python.exe C:/Users/Shinelon/Desktop/ch13/.idea/13.4.py prints:
AGE AVGGIFT CARDGIFT CARDPM12 CARDPROM CLUSTER2 DOMAIN GENDER \
1 46.0 15.666667 1 6 12 1.0 S1 M
3 70.0 6.812500 7 6 27 41.0 R2 F
4 78.0 6.864865 8 10 43 26.0 S2 F
6 38.0 7.642857 8 4 26 53.0 T2 F
11 75.0 12.500000 2 6 8 23.0 S2 M
GEOCODE2 HIT ... MDMAUD_R MINRAMNT NGIFTALL NUMPRM12 RAMNTALL \
1 A 16 ... X 10.0 3 13 47.0
3 C 2 ... X 2.0 16 14 109.0
4 A 60 ... X 3.0 37 25 254.0
6 D 0 ... X 3.0 14 9 107.0
11 B 3 ... X 10.0 2 12 25.0
RFA_2A RFA_2F STATE TIMELAG TARGET_B
1 G 2 CA 18.0 0
3 E 4 CA 9.0 0
4 F 2 FL 14.0 0
6 E 1 IN 4.0 0
11 F 2 IN 3.0 0
This is the print(w1.head()) result.
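The traceback itself is not shown above, so this is only a guess: the printed head shows that w1 still contains string-valued columns (DOMAIN, GENDER, GEOCODE2, HOMEOWNR, MDMAUD_F, MDMAUD_R, RFA_2A, STATE), and scikit-learn's DecisionTreeClassifier only accepts numeric features. A hedged sketch of one way to encode them before fitting, continuing from the code above:
# Assumption: the failure comes from the non-numeric columns visible in
# print(w1.head()). One-hot encode them so the tree receives only numbers.
x_encoded = pd.get_dummies(x)          # expands object columns into 0/1 indicator columns
clf = tree.DecisionTreeClassifier(min_samples_split=1000,
                                   min_samples_leaf=400,
                                   max_depth=10)
clf = clf.fit(x_encoded, y.values.ravel())   # ravel() passes y as a 1-D array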
I'm working with a Pandas Dataframe that looks like this:
0 Data
1
2
3
4 5
5
6
7
8 21
9
10 2
11
12
13
14
15
I'm trying to fill the blanks with the next valid values using df.fillna(method='backfill'). This works, but then I also need to add each valid value to the next valid value, working from the bottom up, like this:
0 Data
1 28
2 28
3 28
4 28
5 23
6 23
7 23
8 23
9 2
10 2
11
12
13
14
15
I can get this to work by looping over it, but is there a method within pandas that can do this?
Thanks a lot!
You could reverse the df, then fillna(0) and then cumsum and reverse again:
In [12]:
df = df[::-1].fillna(0).cumsum()[::-1]
df
Out[12]:
Data
0 28.0
1 28.0
2 28.0
3 28.0
4 23.0
5 23.0
6 23.0
7 23.0
8 2.0
9 2.0
10 0.0
11 0.0
12 0.0
13 0.0
14 0.0
Here we use slicing notation to reverse the df, then replace all NaN with 0, perform cumsum, and reverse back.
Another simple way to do that: df.sum() - df.fillna(0).cumsum() + df.fillna(0) (adding df.fillna(0) back keeps each row's own value in its running total, so it matches the output above).
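A quick check (with a hypothetical reconstruction of the question's column, NaN standing in for the blank cells) that the corrected one-liner matches the reverse-cumsum answer above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': [np.nan, np.nan, np.nan, 5, np.nan, np.nan, np.nan,
                            21, np.nan, 2, np.nan, np.nan, np.nan, np.nan, np.nan]})

reverse_cumsum = df[::-1].fillna(0).cumsum()[::-1]
one_liner = df.sum() - df.fillna(0).cumsum() + df.fillna(0)
print(reverse_cumsum['Data'].equals(one_liner['Data']))  # True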