Comparing row values using the shift function - Python

I'm learning pandas and I came across the following method to compare rows in a DataFrame.
Here I'm using the np.where and shift() functions to compare values within a column.
import pandas as pd
import numpy as np
# Initialise the data as a dict of Series.
d = {'col' : pd.Series([10, 30, 20, 40, 70, 60])}
# Create the DataFrame.
df = pd.DataFrame(d)
df['Relation'] = np.where(df['col'] > df['col'].shift(), "Greater", "Less")
df
The output appears as follows:
   col Relation
0   10     Less
1   30  Greater
2   20     Less
3   40  Greater
4   70  Greater
5   60     Less
I'm confused about row 3: why does it appear as Greater? 40 is less than 70, so it should appear as Less. What am I doing wrong here?

Because 40 is compared with 20, since shift() shifts the index by 1:
df['Relation'] = np.where(df['col'] > df['col'].shift(), "Greater", "Less")
df['shifted'] = df['col'].shift()
df['m'] = df['col'] > df['col'].shift()
print (df)
   col Relation  shifted      m
0   10     Less      NaN  False
1   30  Greater     10.0   True
2   20     Less     30.0  False
3   40  Greater     20.0   True   <- here
4   70  Greater     40.0   True
5   60     Less     70.0  False
Maybe you want to shift by -1:
df['Relation'] = np.where(df['col'] > df['col'].shift(-1), "Greater", "Less")
df['shifted'] = df['col'].shift(-1)
df['m'] = df['col'] > df['col'].shift(-1)
print (df)
   col Relation  shifted      m
0   10     Less     30.0  False
1   30  Greater     20.0   True
2   20     Less     40.0  False
3   40     Less     70.0  False
4   70  Greater     60.0   True
5   60     Less      NaN  False
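Note that in the original shift() version the first row compares 10 against NaN (there is no previous value), and any comparison with NaN is False, so it falls into the "Less" branch (with shift(-1) the same happens in the last row). If you want to label that case explicitly, one option is np.select; a small sketch, not part of the question, with example label names:
conditions = [df['col'].shift().isna(),
              df['col'] > df['col'].shift()]
choices = ['First', 'Greater']
df['Relation'] = np.select(conditions, choices, default='Less')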

Related

2-dimensional bins from a pandas DataFrame based on 3 columns

I'm trying to create 2-dimensional bins from a pandas DataFrame based on 3 columns. Here is a snippet from my DataFrame:
Scatters N z Dist_first
---------------------------------------
0 0 0 0.096144 2.761508
1 1 0 -8.229910 17.403039
2 2 0 0.038125 21.466233
3 3 0 -2.050480 29.239867
4 4 0 -1.620470 NaN
5 5 0 -1.975930 NaN
6 6 0 -11.672200 NaN
7 7 0 -16.629000 26.554049
8 8 0 0.096002 NaN
9 9 0 0.176049 NaN
10 10 0 0.176005 NaN
11 11 0 0.215408 NaN
12 12 0 0.255889 NaN
13 13 0 0.301834 27.700308
14 14 0 -29.593600 9.155065
15 15 1 -2.582290 NaN
16 16 1 0.016441 2.220946
17 17 1 -17.329100 NaN
18 18 1 -5.442320 34.520919
19 19 1 0.001741 39.579189
For my result, each Dist_first should be binned with all z <= 0 values of lower index (within the same group N) than the distance itself. Scatters is a copy of the index, left over from an earlier stage of my code, which is not relevant here; nonetheless I ended up using it instead of the index in the example below. The bins for the distances and the z values are in 10 m and 0.1 m steps, respectively, and I can obtain a result by looping through groups of the DataFrame:
# create new column for maximal possible distances per group N
for j in range(N.groupby('N')['Dist_first'].count().max()):
    N[j+1] = N.loc[N[N['Dist_first'].notna()].groupby('N')['Scatters'].nlargest(j+1).groupby('N').min()]['Dist_first']
    # fill nans with zeros to allow
    N[j+1] = N[j+1].fillna(0)
    # make sure no value is repeated
    if j+1 > 1:
        N[j+1] = N[j+1] - N[list(np.arange(j)+1)].sum(axis=1)
# and set all values <= 0 to NaN
N[N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] <= 0] = np.nan
# backwards fill to make sure every distance gets all necessary depths
N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] = N.set_index('N').groupby('N').bfill().set_index('Scatters')[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)]
# bin the result(s)
for j in range(N.groupby('N')['Dist_first'].count().max()):
    binned = N[N['z'] >= 0].groupby([pd.cut(N[N['z'] >= 0]['z'], bins_v, include_lowest=True), pd.cut(N[N['z'] >= 0][j+1], bins_h, include_lowest=True)])
    binned = binned.size().unstack()
    ## rename
    binned.index = N_v.index; binned.columns = N_h.index
    ## and sum up with earlier chunks
    V = V + binned
This bit of code works just fine and the result for the small snippet of the data I've shared looks like this:
Distance [m] 0.0 10.0 20.0 30.0 40.0
Depth [m]
----------------------------------------------------
0.0 1 1 1 4 2
0.1 1 2 2 4 0
0.2 0 3 0 3 0
0.3 0 2 0 2 0
0.4 0 0 0 0 0
However, the whole datasets are excessively large (> 300 million rows each) and looping through all rows is not an option. Therefore I'm looking for a vectorized solution.
I suggest you calculate the criteria in extra columns and then use a standard pandas binning function, like qcut. It can be applied separately along the two binning dimensions. Not the most elegant, but definitely vectorized.
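As a rough sketch of that idea (ignoring the "lower index within the group" constraint and the sign filtering, and with made-up bin edges bins_z and bins_d), the 2-D binning step itself can be done in one vectorized pass with pd.cut and groupby; df stands for the question's DataFrame (N in the code above):
import numpy as np
import pandas as pd

bins_z = np.arange(0, 0.6, 0.1)   # depth bins in 0.1 m steps (assumed edges)
bins_d = np.arange(0, 50, 10)     # distance bins in 10 m steps (assumed edges)
# assign every row to a depth bin and a distance bin, then count per 2-D cell
counts = (df.groupby([pd.cut(df['z'], bins_z, include_lowest=True),
                      pd.cut(df['Dist_first'], bins_d, include_lowest=True)])
            .size()
            .unstack())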

How to sum columns from two different size datasets in pandas

I have two datasets. The first one (df1) contains more than 200,000 rows, and the second one (df2) only two. I need to create a new column, df1['column_2'], which is the sum of df1['column_1'] and df2['column_1'].
When I try to make df1['column_2'] = df1['column_1'] + df2['column_1'] I get an error "A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
How can I sum values of different datasets with different amount of rows?
Will be thankful for any help!
Screenshot of my notebook: https://prnt.sc/p1d6ze
I tried your code and it works with no error, using Pandas 0.25.0
and Python 3.7.0.
If you use older versions, consider upgrading.
For the test I used df1 with 10 rows (shorter):
column_1
0 10
1 20
2 30
3 40
4 50
5 60
6 70
7 80
8 90
9 100
and df2 with 2 rows (just as in your post):
column_1
0 3
1 5
Your instruction df1['column_2'] = df1['column_1'] + df2['column_1']
gives the following result:
column_1 column_2
0 10 13.0
1 20 25.0
2 30 NaN
3 40 NaN
4 50 NaN
5 60 NaN
6 70 NaN
7 80 NaN
8 90 NaN
9 100 NaN
So that:
Elements with "overlapping" index values are summed.
Other elements (with no corresponding index in df2) are NaN.
Because of the presence of NaN values, this column is coerced to float.
An alternative form of this instruction, using .loc[...], is:
df1['column_2'] = df1.loc[:, 'column_1'] + df2.loc[:, 'column_1']
It also works on my computer.
Or maybe you want to "multiply" (replicate) df2 to the length of df1
before summing? If yes, run:
df1['column_2'] = df1.column_1 + df2.column_1.values.tolist() * 5
In this case 5 is the number of times df2 should be "multiplied".
This time no index alignment takes place and the result is:
column_1 column_2
0 10 13
1 20 25
2 30 33
3 40 45
4 50 53
5 60 65
6 70 73
7 80 85
8 90 93
9 100 105
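If df2 really should just be repeated to the length of df1, the factor does not need to be hard-coded. A minimal sketch, assuming len(df1) is an exact multiple of len(df2):
import numpy as np

reps = len(df1) // len(df2)     # 10 // 2 == 5 in this example
# tile df2's values to df1's length and add positionally (no index alignment)
df1['column_2'] = df1['column_1'] + np.tile(df2['column_1'].values, reps)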
Reindex is applied to the DataFrame that has fewer records than the other, for example y here:
Subtraction:
import pandas as pd
x = pd.DataFrame([(100,200),(300,400),(100,111)], columns=['a','b'])
y = pd.DataFrame([(1,2),(3,4)], columns=['a','b'])
z = x - y.reindex_like(x).fillna(0)
Addition:
import pandas as pd
x = pd.DataFrame([(100,200),(300,400),(100,111)], columns=['a','b'])
y = pd.DataFrame([(1,2),(3,4)], columns=['a','b'])
z = x + y.reindex_like(x).fillna(0)
Multiplication:
import pandas as pd
x = pd.DataFrame([(100,200),(300,400),(100,111)], columns=['a','b'])
y = pd.DataFrame([(1,2),(3,4)], columns=['a','b'])
z = x * y.reindex_like(x).fillna(1)
I have discovered that I cannot do df_1['column_3'] = df_1['column_1'] + df_1['column_2'] if df_1 is a slice from the original dataframe df. So I solved my question by writing a function:
def new_column(dataframe):
    if dataframe['column'] == 'value_1':
        dataframe['new_column'] = (dataframe['column_1']
                                   - df_2[df_2['column'] == 'value_1']['column_1'].values[0])
    else:
        dataframe['new_column'] = (dataframe['column_1']
                                   - df_2[df_2['column'] == 'value_2']['column_1'].values[0])
    return dataframe

dataframe = df_1.apply(new_column, axis=1)
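For larger frames, the same result can usually be obtained without a row-wise apply. A sketch using the same placeholder names ('column', 'column_1', 'value_1'/'value_2'), assuming the values in df_2['column'] are unique:
# build a lookup Series: value of 'column' -> corresponding 'column_1' value in df_2
lookup = df_2.set_index('column')['column_1']
df_1 = df_1.copy()   # work on a real copy, which also avoids the SettingWithCopy warning
df_1['new_column'] = df_1['column_1'] - df_1['column'].map(lookup)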

Merge Pandas DataFrame where Values are not exactly alike

I have two DataFrames:
First one (sp_df)
X Y density keep mass size
10 20 33 False 23 23
3 2 52 True 5 5
1.2 3 35 False 25 52
Second one (ep_df)
X Y density keep mass size
2.1 1.1 55 True 4.0 4.4
1.1 2.9 60 False 24.8 54.8
9.0 25.0 33 False 22.0 10.0
Now I need to merge them on their X/Y position into something like this:
X-SP Y-SP density-SP ........ X-EP Y-EP density-EP......
1.5 2.0 30 1.0 2.4 28.7
So with the Data shown above you would get something like this:
X-SP Y-SP density-SP keep-SP mass-SP size-SP X-EP Y-EP density-EP keep-EP mass-EP size-EP
3 2 52 True 5 5 2.1 1.1 55 True 4.0 4.4
1.2 3 35 False 25 52 1.1 2.9 60 False 24.8 54.8
10 20 33 False 23 23 9.0 25.0 33 False 22.0 10.0
My problem is that those values are not exactly alike. So I need some kind of comparison to decide which rows in the two DataFrames are most likely to be the same. Unfortunately, I have no idea how to get this done.
Any tips or advice? Thanks in advance.
You can merge the two DataFrames as a Cartesian product. This makes a DataFrame with each row of the first DataFrame joined with every row of the second DataFrame. Then remove the rows whose X values differ too much between the two DataFrames. I hope the following code helps:
import pandas as pd
#cartesian_product
sp_df['key'] = 1
ep_df['key'] = 1
df = pd.merge(sp_df, ep_df, on='key', suffixes=['_sp', '_ep'])
del df['key']
## taking difference and removing rows
## with difference more than 1
df['diff'] = df['X_sp'] - df['X_ep']
drop=df.index[df["diff"] >= 1].tolist()
df=df.drop(df.index[drop])
df
Edited code:
#cartesian_product
sp_df['key'] = 1
ep_df['key'] = 1
df = pd.merge(sp_df, ep_df, on='key', suffixes=['_sp', '_ep'])
del df['key']
## taking difference and removing rows
## with difference more than 1
df['diff'] = df['X_sp'] - df['X_ep']
drop=df.index[df["diff"] >= 1.01].tolist()
drop_negative=df.index[df["diff"] <= 0 ].tolist()
droped_values=drop+drop_negative
df=df.drop(df.index[droped_values])
df
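Building on the same Cartesian-product idea, you could also keep, for every row of sp_df, the ep_df row with the smallest combined X/Y distance instead of using a fixed threshold. A sketch, assuming both frames fit in memory after the cross join; sp_id is a helper column introduced here:
import pandas as pd

sp = sp_df.reset_index().rename(columns={'index': 'sp_id'})   # keep a row id for sp_df
sp['key'] = 1
ep = ep_df.copy()
ep['key'] = 1
pairs = pd.merge(sp, ep, on='key', suffixes=['_sp', '_ep']).drop(columns='key')
# squared Euclidean distance between the two positions
pairs['dist'] = (pairs['X_sp'] - pairs['X_ep'])**2 + (pairs['Y_sp'] - pairs['Y_ep'])**2
# for every sp row, keep the ep row with the smallest distance
nearest = pairs.loc[pairs.groupby('sp_id')['dist'].idxmin()].drop(columns=['sp_id', 'dist'])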

Remove subseries (rows in a data frame) which meet a condition

I have a data frame with a time series (column 1) and a column with values (column 2), which are features of each subseries of the time series.
How to remove subseries which meet a condition?
The picture illustrates what I want to do. I want to remove the orange rows:
I tried using loops to create an additional column with features that indicate which rows to remove, but this solution is very computationally expensive (I have 10 million records in a column). Code (slow solution):
import numpy as np
import pandas as pd

# sample data (smaller than actual df)
# length of df = 100; should be 10000000 in the actual data frame
time_ser = 100*[25]
max_num = 20
distance = np.random.uniform(0, max_num, 100)
to_remove = 100*[np.nan]
data_dict = {'time_ser': time_ser,
             'distance': distance,
             'to_remove': to_remove}
df = pd.DataFrame(data_dict)
subser_size = 3
maxdist = 18
# loop which creates an additional column which indicates which indexes should be removed.
# Takes first value in a subseries and checks if it meets the condition.
# If it does, all values in subseries (i.e. rows) should be removed ('wrong').
for i, d in zip(range(len(df)), df.distance):
    if d >= maxdist:
        df.to_remove.iloc[i:i+subser_size] = 'wrong'
    else:
        df.to_remove.iloc[i] = 'good'
You can use a list comprehension to create an array of indexes with numpy.concatenate, and numpy.unique to remove duplicates.
Then use drop, or loc if you need a new column:
np.random.seed(123)
time_ser = 100*[25]
max_num = 20
distance = np.random.uniform(0, max_num, 100)
to_remove = 100*[np.nan]
data_dict = {'time_ser': time_ser,
             'distance': distance,
             'to_remove': to_remove}
df = pd.DataFrame(data_dict)
print (df)
distance time_ser to_remove
0 13.929384 25 NaN
1 5.722787 25 NaN
2 4.537029 25 NaN
3 11.026295 25 NaN
4 14.389379 25 NaN
5 8.462129 25 NaN
6 19.615284 25 NaN
7 13.696595 25 NaN
8 9.618638 25 NaN
9 7.842350 25 NaN
10 6.863560 25 NaN
11 14.580994 25 NaN
subser_size = 3
maxdist = 18
print (df.index[df['distance'] >= maxdist])
Int64Index([6, 38, 47, 84, 91], dtype='int64')
arr = [np.arange(i, min(i+subser_size,len(df))) for i in df.index[df['distance'] >= maxdist]]
idx = np.unique(np.concatenate(arr))
print (idx)
[ 6 7 8 38 39 40 47 48 49 84 85 86 91 92 93]
df = df.drop(idx)
print (df)
distance time_ser to_remove
0 13.929384 25 NaN
1 5.722787 25 NaN
2 4.537029 25 NaN
3 11.026295 25 NaN
4 14.389379 25 NaN
5 8.462129 25 NaN
9 7.842350 25 NaN
10 6.863560 25 NaN
11 14.580994 25 NaN
...
...
If need values in column:
df['to_remove'] = 'good'
df.loc[idx, 'to_remove'] = 'wrong'
print (df)
distance time_ser to_remove
0 13.929384 25 good
1 5.722787 25 good
2 4.537029 25 good
3 11.026295 25 good
4 14.389379 25 good
5 8.462129 25 good
6 19.615284 25 wrong
7 13.696595 25 wrong
8 9.618638 25 wrong
9 7.842350 25 good
10 6.863560 25 good
11 14.580994 25 good
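For the 10-million-row case the index array can be avoided entirely with a rolling maximum over the boolean condition. A minimal vectorized sketch under the same assumptions (every row within subser_size rows of a triggering row is dropped), starting again from the full df built above:
mask = df['distance'] >= maxdist
# True for a row if it, or any of the previous subser_size-1 rows, met the condition
wrong = mask.astype(int).rolling(subser_size, min_periods=1).max().astype(bool)
df['to_remove'] = np.where(wrong, 'wrong', 'good')
df_clean = df[~wrong]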

Pandas column addition/subtraction

I am using a pandas/python dataframe. I am trying to do a lag subtraction.
I am currently using:
newCol = df.col - df.col.shift()
This leads to a NaN in the first spot:
NaN
45
63
23
...
First question: Is this the best way to do a subtraction like this?
Second: if I want to add a column (with the same number of rows) to this new column, is there a way to make all the NaNs 0s for the calculation?
Ex:
col_1 =
Nan
45
63
23
col_2 =
10
10
10
10
new_col =
10
55
73
33
and NOT
NaN
55
73
33
Thank you.
I think your method of computing lags is just fine:
import pandas as pd
df = pd.DataFrame(range(4), columns = ['col'])
print(df['col'] - df['col'].shift())
# 0 NaN
# 1 1
# 2 1
# 3 1
# Name: col
print(df['col'] + df['col'].shift())
# 0 NaN
# 1 1
# 2 3
# 3 5
# Name: col
If you wish NaN plus (or minus) a number to be the number (not NaN), use the add (or sub) method with fill_value = 0:
print(df['col'].sub(df['col'].shift(), fill_value = 0))
# 0 0
# 1 1
# 2 1
# 3 1
# Name: col
print(df['col'].add(df['col'].shift(), fill_value = 0))
# 0 0
# 1 1
# 2 3
# 3 5
# Name: col
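Applied to the exact columns from the question, the same fill_value idea gives the desired 10 / 55 / 73 / 33. A small sketch with the values copied from the question:
import pandas as pd

col_1 = pd.Series([float('nan'), 45, 63, 23])
col_2 = pd.Series([10, 10, 10, 10])
new_col = col_1.add(col_2, fill_value=0)   # NaN is treated as 0, so the first element is 10.0
# an equivalent alternative: col_1.fillna(0) + col_2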
