Pandas column addition/subtraction - python

I am using a pandas/python dataframe. I am trying to do a lag subtraction.
I am currently using:
newCol = df.col - df.col.shift()
This leads to a NaN in the first spot:
NaN
45
63
23
...
First question: Is this the best way to do a subtraction like this?
Second: if I want to add a column (same number of rows) to this new column, is there a way to treat all the NaNs as 0 for the calculation?
Ex:
col_1 =
NaN
45
63
23
col_2 =
10
10
10
10
new_col =
10
55
73
33
and NOT
NaN
55
73
33
Thank you.

I think your method of computing lags is just fine:
import pandas as pd
df = pd.DataFrame(range(4), columns = ['col'])
print(df['col'] - df['col'].shift())
# 0 NaN
# 1 1
# 2 1
# 3 1
# Name: col
print(df['col'] + df['col'].shift())
# 0 NaN
# 1 1
# 2 3
# 3 5
# Name: col
If you wish NaN plus (or minus) a number to be the number (not NaN), use the add (or sub) method with fill_value = 0:
print(df['col'].sub(df['col'].shift(), fill_value = 0))
# 0 0
# 1 1
# 2 1
# 3 1
# Name: col
print(df['col'].add(df['col'].shift(), fill_value = 0))
# 0 0
# 1 1
# 2 3
# 3 5
# Name: col
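Applied to the second part of the question, a minimal sketch using the same fill_value idea on the two example columns (values taken from the example above):
import numpy as np
import pandas as pd
col_1 = pd.Series([np.nan, 45, 63, 23])   # the lagged column with the leading NaN
col_2 = pd.Series([10, 10, 10, 10])
# treat the leading NaN as 0 only for this addition
new_col = col_1.add(col_2, fill_value=0)
print(new_col)
# 0    10.0
# 1    55.0
# 2    73.0
# 3    33.0
# dtype: float64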

Related

How to fill missing values in a column by random sampling another column by other column values

I have missing values in one column that I would like to fill by random sampling from a source distribution:
import pandas as pd
import numpy as np
source = pd.DataFrame({'age': 5*[21],
                       'location': [0,0,1,1,1],
                       'x': [1,2,3,4,4]})
source
age location x
0 21 0 1
1 21 0 2
2 21 1 3
3 21 1 4
4 21 1 4
target = pd.DataFrame({'age': 5*[21],
                       'location': [0,0,0,1,2],
                       'x': 5*[np.nan]})
target
age location x
0 21 0 NaN
1 21 0 NaN
2 21 0 NaN
3 21 1 NaN
4 21 2 NaN
Now I need to fill in the missing values of x in the target dataframe by choosing, with replacement, a random value of x from the rows of the source dataframe that have the same values for age and location as the row with the missing x. If there is no value of x in source with the same age and location as the missing value, it should be left as missing.
Expected output:
age location x
0 21 0 1 with probability 0.5 2 otherwise
1 21 0 1 with probability 0.5 2 otherwise
2 21 0 1 with probability 0.5 2 otherwise
3 21 1 3 with probability 0.33 4 otherwise
4 21 2 NaN
I can loop through all the missing combinations of age and location and slice the source dataframe and then take a random sample, but my dataset is large enough that it takes quite a while to do.
Is there a better way?
You can create a MultiIndex in both DataFrames and then, in a custom function passed to GroupBy.transform, replace the NaNs using numpy.random.choice:
source = pd.DataFrame({'age': 5*[21],
                       'location': [0,0,1,1,1],
                       'x': [1,2,3,4,4]})
target = pd.DataFrame({'age': 5*[21],
                       'location': [0,0,0,1,2],
                       'x': 5*[np.nan]})

cols = ['age', 'location']
source1 = source.set_index(cols)['x']
target1 = target.set_index(cols)['x']

def f(x):
    try:
        a = source1.loc[x.name].to_numpy()
        m = x.isna()
        x[m] = np.random.choice(a, size=m.sum())
        return x
    except KeyError:
        return np.nan
target1 = target1.groupby(level=[0,1]).transform(f).reset_index()
print (target1)
age location x
0 21 0 1.0
1 21 0 2.0
2 21 0 2.0
3 21 1 3.0
4 21 2 NaN
You can create a common grouper and perform a merge:
cols = ['age', 'location']
(target[cols]
 .assign(group=target.groupby(cols).cumcount())   # compute subgroup for duplicates
 .merge((# below: assigns a random row group
         source.assign(group=source.sample(frac=1).groupby(cols, sort=False).cumcount())
               .groupby(cols+['group'], as_index=False)   # get one row per group
               .first()
         ),
        on=cols+['group'], how='left')   # merge
 # .drop('group', axis=1)                # column kept for clarity, uncomment to remove
)
output:
age location group x
0 20 0 0 0.339955
1 20 0 1 0.700506
2 21 0 0 0.777635
3 22 1 0 NaN

2-dimensional bins from a pandas DataFrame based on 3 columns

I'm trying to create 2-dimensional bins from a pandas DataFrame based on 3 columns. Here a snippet from my DataFrame:
Scatters N z Dist_first
---------------------------------------
0 0 0 0.096144 2.761508
1 1 0 -8.229910 17.403039
2 2 0 0.038125 21.466233
3 3 0 -2.050480 29.239867
4 4 0 -1.620470 NaN
5 5 0 -1.975930 NaN
6 6 0 -11.672200 NaN
7 7 0 -16.629000 26.554049
8 8 0 0.096002 NaN
9 9 0 0.176049 NaN
10 10 0 0.176005 NaN
11 11 0 0.215408 NaN
12 12 0 0.255889 NaN
13 13 0 0.301834 27.700308
14 14 0 -29.593600 9.155065
15 15 1 -2.582290 NaN
16 16 1 0.016441 2.220946
17 17 1 -17.329100 NaN
18 18 1 -5.442320 34.520919
19 19 1 0.001741 39.579189
For my result, each Dist_first should be binned together with all "z <= 0" values of lower index within the same group "N" than the distance itself. "Scatters" is a copy of the index left over from an operation in an earlier stage of my code which is not relevant here; nonetheless I came to use it instead of the index in the example below. The bins for the distances and the z values are in 10 m and 0.1 m steps, respectively, and I can obtain a result by looping through groups of the DataFrame:
# create new column for maximal possible distances per group N
for j in range(N.groupby('N')['Dist_first'].count().max()):
    N[j+1] = N.loc[N[N['Dist_first'].notna()].groupby('N')['Scatters'].nlargest(j+1).groupby('N').min()]['Dist_first']
    # fill nans with zeros to allow
    N[j+1] = N[j+1].fillna(0)
    # make sure no value is repeated
    if j+1 > 1:
        N[j+1] = N[j+1] - N[list(np.arange(j)+1)].sum(axis=1)

# and set all values <= 0 to NaN
N[N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] <= 0] = np.nan

# backwards fill to make sure every distance gets all necessary depths
N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] = N.set_index('N').groupby('N').bfill().set_index('Scatters')[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)]

# bin the result(s)
for j in range(N.groupby('N')['Dist_first'].count().max()):
    binned = N[N['z'] >= 0].groupby([pd.cut(N[N['z'] >= 0]['z'], bins_v, include_lowest=True),
                                     pd.cut(N[N['z'] >= 0][j+1], bins_h, include_lowest=True)])
    binned = binned.size().unstack()
    ## rename
    binned.index = N_v.index; binned.columns = N_h.index
    ## and sum up with earlier chunks
    V = V + binned
This bit of code works just fine and the result for the small snippet of the data I've shared looks like this:
Distance [m] 0.0 10.0 20.0 30.0 40.0
Depth [m]
----------------------------------------------------
0.0 1 1 1 4 2
0.1 1 2 2 4 0
0.2 0 3 0 3 0
0.3 0 2 0 2 0
0.4 0 0 0 0 0
However, the whole datasets are excessively large (> 300 million rows each) and looping through all rows is not an option. Therefore I'm looking for a vectorized solution.
I suggest calculating the criteria in extra columns and then using a standard pandas binning function, like qcut. It can be applied separately along the two binning dimensions. Not the most elegant, but definitely vectorized.
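To illustrate just the vectorized binning step (not the full per-group, lower-index pairing described in the question), here is a sketch using pd.cut, groupby and unstack; the bin edges bins_v and bins_h are assumptions matching the 0.1 m depth and 10 m distance steps:
import numpy as np
import pandas as pd
# assumed bin edges (adjust to the real data range)
bins_v = np.arange(0.0, 0.6, 0.1)    # depth (z) bins, 0.1 m steps
bins_h = np.arange(0.0, 50.0, 10.0)  # distance bins, 10 m steps
sub = N[N['z'] >= 0]                 # N is the DataFrame from the question
V = (sub.groupby([pd.cut(sub['z'], bins_v, include_lowest=True),
                  pd.cut(sub['Dist_first'], bins_h, include_lowest=True)])
        .size()
        .unstack(fill_value=0))
print(V)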

identifying patterns in CSV columns/rows in python

Hi all and thanks for help in advance.
The problem I am trying to solve is as follows:
I have two columns within one CSV file: Column A and Column B.
There are certain patterns that need to be present in my data in each row underneath column A and B.
For example, if there is a "1" in row 1 of column A, there must be a "5" adjacent to it in row 1 of column B.
If there were a "1" in row 2 of column A and a "2" adjacent to it in row 2 of column B, I would need this to be flagged and printed out as "does not follow pattern".
The rules go as follows:
Any time there's a 1 in column A, there should be a 5 next to it in column B.
Any time there's a 3 in column A, there should be a 6 next to it in column B.
Any time there's a 2 in column A, there should be a 4 next to it in column B.
Any time these rules are not followed, the output should say "pattern not followed".
Here is where I am at on the code, I can't seem to figure out a way of doing this data check.
import numpy as np
import pandas as pd
import os # filepaths
import glob
import getpass # Login information
unane = getpass.getuser()
# Paths:
path2proj = os.path.join('C:', os.sep, 'Users', unane, 'Documents', 'Home','Expts','TSP', '')
path2data = os.path.join(path2proj,'Data','')
path2asys = os.path.join(path2proj,'Analysis', '')
path2figs = os.path.join(path2asys, 'figures', '')
path2hddm = os.path.join(path2asys, 'modeling', '')
df = pd.read_csv(path2data + '001_2012_Dec_19_0932_PST_train.csv')
os.chdir(path2data)
# extension = 'csv'
# all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
all_filenames = glob.glob("*.csv")
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
combined_csv.to_csv("combined_csv.csv",index=False, encoding='utf-8-sig')
df = pd.read_csv(path2data + 'combined_csv.csv')
df['left_stim_number'].equals(df['right_stim_number'])
df = pd.read_csv(path2data + 'combined_csv.csv')
df1 = pd.DataFrame(df, columns=['left_stim_number'])
df2 = pd.DataFrame(df, columns=['right_stim_number'])
df1['match'] = np.where(df1['left_stim_number']== df2['right_stim_number'], True, False)
# Checking to see if there are any errors as all should add up to 7
df1['add'] = np.where(df1['left_stim_number']== df2['right_stim_number'], 0, df1['left_stim_number'] + df2['right_stim_number'])
# def see_correct(df):
# if df1['add'] == ['7']:
# return 1
# else:
# return 0
# df1.tail(10)
combined_csv.isna().sum()
combined_csv.dropna()
df.loc[df['left_stim_number'] != df['right_stim_number'],:]
---
Example of CSV data
Column A (left_stim_number)    Column B (right_stim_number)
1 5
1 5
3 6
1 5
3 6
2 4
2 4
2 4
1 5
Since we don't have an example I'll make one up - a pandas DataFrame with two columns of integers.
import numpy as np
import pandas as pd
np.random.seed(2)
df = pd.DataFrame({'colA': np.random.randint(0, 10, 100),
                   'colB': np.random.randint(0, 10, 100)})
>>> df.head()
colA colB
0 8 7
1 8 1
2 6 9
3 2 2
4 8 1
There might be more concise ways to do this, but this makes it pretty clear what is happening. It uses a lot of boolean indexing.
Your rules exclude any row where colA is not 1, 2, or 3. Rows where colB is not 4, 5, or 6 are also excluded. You can make a mask for all the excluded rows.
mask = ~df.colA.isin([1,2,3]) | ~df.colB.isin([4,5,6])
>>> df[mask].head()
colA colB
0 8 7
1 8 1
2 6 9
3 2 2
4 8 1
>>>
You can use the mask to assign "pattern not followed" to a new column for all those rows.
df.loc[mask,'colC'] = 'pattern not followed'
>>> df.head()
colA colB colC
0 8 7 pattern not followed
1 8 1 pattern not followed
2 6 9 pattern not followed
3 2 2 pattern not followed
4 8 1 pattern not followed
You can also use the mask to find all the rows that might match your criteria. Notice colC is NaN for these rows.
>>> df[~mask]
colA colB colC
13 3 5 NaN
35 2 6 NaN
39 1 5 NaN
61 2 5 NaN
62 1 5 NaN
65 1 6 NaN
69 1 5 NaN
70 2 4 NaN
77 3 5 NaN
92 1 6 NaN
98 2 5 NaN
>>>
Set colC to True for the rows that meet the criteria.
df.loc[(df.colA == 1) & (df.colB == 5),'colC'] = True
df.loc[(df.colA == 3) & (df.colB == 6),'colC'] = True
df.loc[(df.colA == 2) & (df.colB == 4),'colC'] = True
That leaves some outliers.
>>> df.loc[df.colC.isna()]
colA colB colC
13 3 5 NaN
35 2 6 NaN
61 2 5 NaN
65 1 6 NaN
77 3 5 NaN
92 1 6 NaN
98 2 5 NaN
Which can be fixed with:
df.loc[df.colC.isna(),'colC'] = 'pattern not followed'
Looking back at all of the above, only the last four operations are needed.
df.loc[(df.colA == 1) & (df.colB == 5),'colC'] = True
df.loc[(df.colA == 3) & (df.colB == 6),'colC'] = True
df.loc[(df.colA == 2) & (df.colB == 4),'colC'] = True
df.loc[df.colC.isna(),'colC'] = 'pattern not followed'
>>> df.loc[df.colC == True]
colA colB colC
39 1 5 True
62 1 5 True
69 1 5 True
70 2 4 True
>>>
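As noted above, there are more concise ways to do this; one possible sketch encodes the rules in a dict and maps colA onto the required colB value (column names colA/colB as in the made-up example above):
rules = {1: 5, 3: 6, 2: 4}   # required colB value for each colA value
# rows whose (colA, colB) pair satisfies a rule get True; everything else,
# including colA values outside the rules (which map to NaN), is flagged
df['colC'] = 'pattern not followed'
df.loc[df['colA'].map(rules) == df['colB'], 'colC'] = True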
If the text in the CSV file looks like this:
4,9
8,3
4,6
2,4
7,5
1,3
.
.
.
The Dataframe can be made with -
df = pd.read_csv('data.csv',names=['colA','colB'])

Remove subseries (rows in a data frame) which meet a condition

I have a data frame with a time series (column 1) and a column with values (column 2), which are features of subseries of the time series.
How to remove subseries which meet a condition?
The picture illustrates what I want to do. I want to remove the orange rows:
I tried to use loops to create an additional column with flags that indicate which rows to remove, but this solution is very computationally expensive (I have 10 million records in the column). Code (slow solution):
import numpy as np
import pandas as pd

# sample data (smaller than actual df)
# length of df = 100; should be 10000000 in the actual data frame
time_ser = 100*[25]
max_num = 20
distance = np.random.uniform(0, max_num, 100)
to_remove = 100*[np.nan]

data_dict = {'time_ser': time_ser,
             'distance': distance,
             'to_remove': to_remove}
df = pd.DataFrame(data_dict)

subser_size = 3
maxdist = 18

# loop which creates an additional column indicating which indexes should be removed.
# It takes the first value of a subseries and checks if it meets the condition.
# If it does, all values in the subseries (i.e. rows) should be removed ('wrong').
for i, d in zip(range(len(df)), df.distance):
    if d >= maxdist:
        df.to_remove.iloc[i:i+subser_size] = 'wrong'
    else:
        df.to_remove.iloc[i] = 'good'
You can use a list comprehension to create the index ranges, numpy.concatenate to join them into one array, and numpy.unique to remove duplicates.
Then use drop, or loc if you need a new column:
np.random.seed(123)

time_ser = 100*[25]
max_num = 20
distance = np.random.uniform(0, max_num, 100)
to_remove = 100*[np.nan]

data_dict = {'time_ser': time_ser,
             'distance': distance,
             'to_remove': to_remove}
df = pd.DataFrame(data_dict)
print (df)
distance time_ser to_remove
0 13.929384 25 NaN
1 5.722787 25 NaN
2 4.537029 25 NaN
3 11.026295 25 NaN
4 14.389379 25 NaN
5 8.462129 25 NaN
6 19.615284 25 NaN
7 13.696595 25 NaN
8 9.618638 25 NaN
9 7.842350 25 NaN
10 6.863560 25 NaN
11 14.580994 25 NaN
subser_size = 3
maxdist = 18
print (df.index[df['distance'] >= maxdist])
Int64Index([6, 38, 47, 84, 91], dtype='int64')
arr = [np.arange(i, min(i+subser_size,len(df))) for i in df.index[df['distance'] >= maxdist]]
idx = np.unique(np.concatenate(arr))
print (idx)
[ 6 7 8 38 39 40 47 48 49 84 85 86 91 92 93]
df = df.drop(idx)
print (df)
distance time_ser to_remove
0 13.929384 25 NaN
1 5.722787 25 NaN
2 4.537029 25 NaN
3 11.026295 25 NaN
4 14.389379 25 NaN
5 8.462129 25 NaN
9 7.842350 25 NaN
10 6.863560 25 NaN
11 14.580994 25 NaN
...
...
If you need the values in a column instead (applied to the original df, before the drop):
df['to_remove'] = 'good'
df.loc[idx, 'to_remove'] = 'wrong'
print (df)
distance time_ser to_remove
0 13.929384 25 good
1 5.722787 25 good
2 4.537029 25 good
3 11.026295 25 good
4 14.389379 25 good
5 8.462129 25 good
6 19.615284 25 wrong
7 13.696595 25 wrong
8 9.618638 25 wrong
9 7.842350 25 good
10 6.863560 25 good
11 14.580994 25 good
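For reference, a sketch of how the same 'good'/'wrong' marking could be produced without building the index array, using a rolling maximum over the start condition (assuming the original df, subser_size and maxdist from above):
bad_start = (df['distance'] >= maxdist).astype(int)
# a row is 'wrong' if it, or one of the subser_size-1 rows before it, starts a bad subseries
is_wrong = bad_start.rolling(subser_size, min_periods=1).max().astype(bool)
df['to_remove'] = np.where(is_wrong, 'wrong', 'good')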

Pandas Boolean indexing with two dataframes

I have two pandas dataframes:
df1
'A' 'B'
0 0
0 2
1 1
1 1
1 3
df2
'ID' 'value'
0 62
1 70
2 76
3 4674
4 3746
I want to assign df2.value as a new column D to df1, but only where df1.A == 0.
df1.B and df2.ID are supposed to be the identifiers.
Example output:
df1
'A' 'B' 'D'
0 0 62
0 2 76
1 1 NaN
1 1 NaN
1 3 NaN
I tried the following:
df1['D'][ df1.A == 0 ] = df2['value'][df2.ID == df1.B]
However, since df2 and df1 don't have the same length, I get a ValueError:
ValueError: Series lengths must match to compare
This is quite certainly due to the boolean indexing in the last part: [df2.ID == df1.B]
Does anyone know how to solve the problem without needing to iterate over the dataframe(s)?
Thanks a bunch!
==============
Edit in reply to @EdChum: it worked perfectly with the example data, but I have issues with my real data. df1 is a huge dataset. df2 looks like this:
df2
ID value
0 1 1.00000
1 2 1.00000
2 3 1.00000
3 4 1.00000
4 5 1.00000
5 6 1.00000
6 7 1.00000
7 8 1.00000
8 9 0.98148
9 10 0.23330
10 11 0.56918
11 12 0.53251
12 13 0.58107
13 14 0.92405
14 15 0.00025
15 16 0.14863
16 17 0.53629
17 18 0.67130
18 19 0.53249
19 20 0.75853
20 21 0.58647
21 22 0.00156
22 23 0.00000
23 24 0.00152
24 25 1.00000
After doing the merge, the output is the following: first 0.98148 repeated 133 times, then 0.00025 repeated 47 times, and then more sequences of values from df2, until finally a sequence of NaN entries appears...
Out[91]: df1
A B D
0 1 3 0.98148
1 0 9 0.98148
2 0 9 0.98148
3 0 7 0.98148
5 1 21 0.98148
7 1 12 0.98148
... ... ... ...
2592 0 2 NaN
2593 1 17 NaN
2594 1 16 NaN
2596 0 17 NaN
2597 0 6 NaN
Any idea what might have happened here? They are all int64.
==============
Here are two csv with data that reproduces the problem.
df1:
https://owncloud.tu-berlin.de/public.php?service=files&t=2a7d244f55a5772f16aab364e78d3546
df2:
https://owncloud.tu-berlin.de/public.php?service=files&t=6fa8e0c2de465cb4f8a3f8890c325eac
To reproduce:
import pandas as pd
df1 = pd.read_csv("../../df1.csv")
df2 = pd.read_csv("../../df2.csv")
df1['D'] = df1[df1.A == 0].merge(df2,left_on='B', right_on='ID', how='left')['value']
Slightly tricky, this one. There are two steps here: first select only the rows in df where 'A' is 0, then merge the other df onto this where 'B' and 'ID' match, using a 'left' merge, and finally select the 'value' column from the result and assign it to the df:
In [142]:
df['D'] = df[df.A == 0].merge(df1, left_on='B',right_on='ID', how='left')['value']
df
Out[142]:
A B D
0 0 0 62
1 0 2 76
2 1 1 NaN
3 1 1 NaN
4 1 3 NaN
Breaking this down will show what is happening:
In [143]:
# boolean mask on condition
df[df.A == 0]
Out[143]:
A B D
0 0 0 62
1 0 2 76
In [144]:
# merge using 'B' and 'ID' columns
df[df.A == 0].merge(df1, left_on='B',right_on='ID', how='left')
Out[144]:
A B D ID value
0 0 0 62 0 62
1 0 2 76 2 76
After all the above you can then assign directly:
df['D'] = df[df.A == 0].merge(df1, left_on='B',right_on='ID', how='left')['value']
This works because the result aligns with the left-hand side index, so any missing values will automatically be assigned NaN.
EDIT
Another method, and one that seems to work for your real data, is to use map to perform the lookup for you. map accepts a dict or Series as a param and will look up the corresponding value; in this case you need to set the index to the 'ID' column, which reduces your df to one with just the 'value' column:
df['D'] = df[df.A==0]['B'].map(df1.set_index('ID')['value'])
So the above performs boolean indexing as before, then calls map on the 'B' column to look up the corresponding 'value' in the other df after we set its index to 'ID'.
Update
I looked at your data and my first method, and I can see why it fails: the alignment to the left-hand side df fails, so you get 1192 values in a continuous run and then the rest of the rows are NaN up to row 2500.
What does work is if you apply the same mask to the left hand side like so:
df1.loc[df1.A==0, 'D'] = df1[df1.A == 0].merge(df2,left_on='B', right_on='ID', how='left')['value']
So this masks the rows on the left-hand side correctly and assigns the result of the merge.
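For completeness, the same left-hand mask also works with the map approach; a sketch in the question's df1/df2 naming (rather than the df/df1 naming used earlier in this answer):
df1.loc[df1.A == 0, 'D'] = df1.loc[df1.A == 0, 'B'].map(df2.set_index('ID')['value'])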
