Find a subset of columns based on another dataframe with NaN values? - python

I'm attempting to get the mean values on one data frame between certain time points that are marked as events in a second data frame.
This is a follow up to this question, where now I have missing/NaN values: Find a subset of columns based on another dataframe?
import pandas as pd
import numpy as np
#example
example_g = [["4/20/21 4:20", 302, 0, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN],
             ["2/17/21 9:20", 135, 1, 1.4, 1.8, 2, 8, 10],
             ["2/17/21 9:20", 111, 4, 5, 5.1, 5.2, 5.3, 5.4]]
example_g_table = pd.DataFrame(example_g, columns=['Date_Time', 'CID', 0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
#Example Timestamps
example_s = [["4/20/21 4:20", 302, 0.0, 0.2, np.NaN],
             ["2/17/21 9:20", 135, 0.0, 0.1, 0.4],
             ["2/17/21 9:20", 111, 0.3, 0.4, 0.5]]
example_s_table = pd.DataFrame(example_s, columns=['Date_Time', 'CID', "event_1", "event_2", "event_3"])
df = pd.merge(left=example_g_table,right=example_s_table,on=['Date_Time','CID'],how='left')
def func(df):
    event_2 = df['event_2']
    event_3 = df['event_3']
    start = event_2 + 2  # this assumes the column labelled 0 sits at position 2 (counting from 0), column 1 at position 3, and so on
    end = event_3 + 2    # same offset as above
    total = sum(df.iloc[start:end+1])  # the key line: sum the values of the columns in the range start to end
    avg = total/(end-start+1)  # (end-start+1) is the number of columns in the range
    return avg
df['avg'] = df.apply(func, axis=1)
I get the following error:
cannot do positional indexing on Index with these indexers [nan] of type float
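The failure can be reproduced with a plain positional slice whose bounds are NaN (illustrative snippet, not part of my original code):
row = df.iloc[0]             # the 4/20/21 row, where event_3 is NaN
row.iloc[np.NaN:np.NaN + 1]  # raises a similar TypeError, because .iloc needs integer positions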
I have attempted making sure that columns are floats and have tried removing the int() command within the definitions of the events.
How can I perform the same calculations as before where possible, while skipping any values that are NaN?

Regarding your question, check whether this solution works for you:
def func(row):
    try:
        event_2 = row['event_2']
        event_3 = row['event_3']
        start = int(event_2 + 2)
        end = int(event_3 + 2) + 1
        list_row = row.tolist()[start:end]
        list_row = [x for x in list_row if x == x]
        return sum(list_row)/(end-start)
    except Exception as e:
        return np.NaN
df['avg'] = df.apply(lambda x: func(x), axis=1)
I simplified the function and convert the start and end parameters to integers before slicing the row. When applying the function row by row I use a lambda, and in the average calculation I remove all NaN values.
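The x == x test in the list comprehension works because NaN is the only value that compares unequal to itself; a minimal illustration:
import numpy as np
vals = [1.4, np.NaN, 1.8]
[x for x in vals if x == x]  # [1.4, 1.8] -- the NaN is dropped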

You can check if the event values are NaN and if any of the event value is NaN, just return NaN from the function, else return the required value.
You can also modify the function a bit to calculate the values between any two given events, i.e. not necessarily event 2 and event 3. Also, the data you provided in the previous question had integer column labels for the event values, but this time you have float labels like 0.1, 0.2, 0.3, etc. You can simply store the event-value columns in a list, in increasing order, so that they can be accessed via the index values coming from the events columns of the second dataframe.
Additionally, you can directly use np.mean instead of calculating the sum and dividing it manually. The modified version of the function will look like this:
eventCols = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]  # columns holding the values for events
def getMeanValue(row, eN1=2, eN2=3):
    if pd.isna([row[f'event_{eN1}'], row[f'event_{eN2}']]).any():
        return float('nan')
    else:
        requiredEventCols = eventCols[int(row[f'event_{eN1}']):int(row[f'event_{eN2}'] + 1)]
        return np.mean(row[requiredEventCols])
Now, you can apply this function to the dataframe with axis=1:
df['avg'] = df.apply(getMeanValue,axis=1)
Date_Time CID 0.0 0.1 0.2 ... 0.5 event_1 event_2 event_3 avg
0 4/20/21 4:20 302 0 NaN NaN ... NaN 0 2 NaN NaN
1 2/17/21 9:20 135 1 1.4 1.8 ... 10.0 0 1 4.0 3.30
2 2/17/21 9:20 111 4 5.0 5.1 ... 5.4 3 4 5.0 5.35
[3 rows x 12 columns]
Additionally, if needed, you can pass the two event numbers; the defaults are 2 and 3, which means the value is calculated between event_2 and event_3.
Average between event_1 and event_2:
df['avg'] = df.apply(getMeanValue,axis=1, eN1=1, eN2=2)
Date_Time CID 0.0 0.1 0.2 ... 0.5 event_1 event_2 event_3 avg
0 4/20/21 4:20 302 0 NaN NaN ... NaN 0 2 NaN 0.00
1 2/17/21 9:20 135 1 1.4 1.8 ... 10.0 0 1 4.0 1.20
2 2/17/21 9:20 111 4 5.0 5.1 ... 5.4 3 4 5.0 5.25
[3 rows x 12 columns]
Average between event_1 and event_3:
df['avg'] = df.apply(getMeanValue,axis=1, eN1=1, eN2=3)
Date_Time CID 0.0 0.1 0.2 ... 0.5 event_1 event_2 event_3 avg
0 4/20/21 4:20 302 0 NaN NaN ... NaN 0 2 NaN NaN
1 2/17/21 9:20 135 1 1.4 1.8 ... 10.0 0 1 4.0 2.84
2 2/17/21 9:20 111 4 5.0 5.1 ... 5.4 3 4 5.0 5.30
[3 rows x 12 columns]

The format of your data is hard to work with. I would spend some time to rearrange it into a less wide format, then do the work needed.
Here is a quick example, but I did not spend any time making this readable:
base = example_g_table.set_index(['Date_Time','CID']).stack().to_frame()
data = example_s_table.set_index(['Date_Time','CID']).stack().reset_index().set_index(['Date_Time','CID', 0])
base['events'] = data
base = base.reset_index()
base = base.rename(columns={'level_2': 'local_index', 0: 'values'})
This produces a frame that looks something like this:
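(Illustrative reconstruction from the example data; the original answer showed this frame as an image.)
      Date_Time  CID  local_index  values   events
0  4/20/21 4:20  302          0.0     0.0  event_1
1  2/17/21 9:20  135          0.0     1.0  event_1
2  2/17/21 9:20  135          0.1     1.4  event_2
3  2/17/21 9:20  135          0.2     1.8      NaN
..           ...  ...          ...     ...      ...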
In this format calculating the result is not so hard.
import numpy as np
from functools import partial

def mean_two_events(event1, event2, columns_to_mean, df):
    event_1 = df['events'] == event1
    event_2 = df['events'] == event2
    if any(event_1) and any(event_2):
        return df.loc[event_1.idxmax():event_2.idxmax()][columns_to_mean].mean()
    else:
        return np.nan

mean_event2_and_event3 = partial(mean_two_events, 'event_2', 'event_3', 'values')
mean_event1_and_event3 = partial(mean_two_events, 'event_1', 'event_3', 'values')
base.groupby(['Date_Time','CID']).apply(mean_event2_and_event3).reset_index()
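The other partial is applied the same way, for the span between event_1 and event_3:
base.groupby(['Date_Time','CID']).apply(mean_event1_and_event3).reset_index()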
Good luck!
Edit:
Here is an alternative solution that filters out the values BEFORE the groupby.
base['events'] = base.groupby(['Date_Time','CID']).events.ffill()
# This calculates all periods up until the next event. The shift makes the first value of the next event included as well.
# The problem with this approach is that more complex logic will be needed if you need to calculate values between events that
# are not adjacent, i.e. this won't work if you want to calculate between event_1 and event_3.
base['time_periods_to_include'] = ((base.events == 'event_2') | (base.groupby(['Date_Time','CID']).events.shift() == 'event_2'))
# Now we can simply do:
filtered_base = base[base['time_periods_to_include']]
filtered_base.groupby(['Date_Time','CID']).values.mean()
# The benefit is that you can now easily do:
filtered_base.groupby(['Date_Time','CID']).values.rolling(5).mean()

Related

select rows in a grouped dataframe before the row which does not satisfy a condition (python)

I have a dataframe with some features. I want to group by 'id' feature. Then for each group I want to identify the row which has 'speed' feature value greater than a threshold and select all the rows before this one.
For example, my threshold is 1.5 for 'speed' feature and my input is:
id    speed    ...
1     1.2      ...
1     1.9      ...
1     1.0      ...
5     0.9      ...
5     1.3      ...
5     3.5      ...
5     0.4      ...
And my desired output is:
id    speed    ...
1     1.2      ...
5     0.9      ...
5     1.3      ...
This should get you the desired results:
# Create sample data
df = pd.DataFrame({'id': [1, 1, 1, 5, 5, 5, 5],
                   'speed': [1.2, 1.9, 1.0, 0.9, 1.3, 9.5, 0.4]})
df
output:
id speed
0 1 1.2
1 1 1.9
2 1 1.0
3 5 0.9
4 5 1.3
5 5 9.5
6 5 0.4
ther = 1.5
s = df.speed.shift(-1).ge(ther)
df[s]
Output:
id speed
0 1 1.2
4 5 1.3
It took me an hour to figure out, but I got what you need. You need to REVERSE the dataframe and use .cumsum() (cumulative sum) on the grouped ids to find the values after the speed threshold you set. Then drop the speeds above the threshold, along with the rows that do not satisfy the condition. Finally, reverse the dataframe back:
# Create sample data
df = pd.DataFrame({'id': [1, 1, 1, 5, 5, 5, 5],
                   'speed': [1.2, 1.9, 1.0, 0.9, 1.3, 9.5, 0.4]})
# Reverse the dataframe
df = df.iloc[::-1]
thre = 1.5
# Find rows with speed more than threshold
df = df.assign(ge=df.speed.ge(thre))
# Groupby and cumsum to get the rows that are after the threshold in with same id
df.insert(0, 'beforethre', df.groupby('id')['ge'].cumsum())
# Drop speed more than threshold
df['ge'] = df['ge'].replace(True, np.nan)
# Drop rows that don't have any speed more than threshold or after threshold
df['beforethre'] = df['beforethre'].replace(0, np.nan)
df = df.dropna(axis=0).drop(['ge', 'beforethre'], axis=1)
# Reverse back the dataframe
df = df.iloc[::-1]
# Voila!
df
Output:
id speed
0 1 1.2
3 5 0.9
4 5 1.3
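For reference, a more compact variant (not from either answer) avoids reversing the frame: per group, keep only the rows that come before the first speed above the threshold.
# Using the same sample data as above
df = pd.DataFrame({'id': [1, 1, 1, 5, 5, 5, 5],
                   'speed': [1.2, 1.9, 1.0, 0.9, 1.3, 9.5, 0.4]})
thre = 1.5
# Cumulative count of exceedances per id; rows with a count of 0 come before the first exceedance.
before_first = df.groupby('id')['speed'].transform(lambda s: s.ge(thre).cumsum().eq(0))
df[before_first]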

Pandas: how to multiply each element of a Series to each element of a column in a Dataframe

I am trying to find a solution to do the following operation using either numpy or pandas:
For instance, the result matrix has [0, 0, 0] as its first column which is a result of [a x a] elementwise, more specifically it is equal to: [0 x 0.5, 0 x 0.4, 0 x 0.1].
If there is no method for such a problem, I might just expand the series to a dataframe by duplicating its values so that I can multiply two dataframes.
input data:
series = pd.Series([0, 10, 0, 100, 1], index=list('abcde'))
df = pd.DataFrame([[0.5, 0.4, 0.2, 0.7, 0.8],
                   [0.4, 0.5, 0.1, 0.1, 0.5],
                   [0.1, 0.9, 0.8, 0.3, 0.8]], columns=list('abcde'))
This is actually very simple. Because the Series' index aligns with the DataFrame's columns, you only need to do:
series*df
output:
a b c d e
0 0.0 4.0 0.0 70.0 0.8
1 0.0 5.0 0.0 10.0 0.5
2 0.0 9.0 0.0 30.0 0.8
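If you prefer to spell the alignment out, DataFrame.mul takes an axis argument and gives the same result:
df.mul(series, axis='columns')  # same as series * df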

lambda function referencing a column value not specified in function

I have a situation where I want to use the results of a groupby in my training set to fill in results for my test set.
I don't think there's a straightforward way to do this in pandas, so I'm trying to use the apply method on the column in my test set.
MY SITUATION:
I want to use the average values from my MSZoning column to infer the missing value for my LotFrontage column.
If I use the groupby method on my training set I get this:
train.groupby('MSZoning')['LotFrontage'].agg(['mean', 'count'])
giving.....
Now, I want to use these values to impute missing values on my test set, so I can't just use the transform method.
Instead, I created a function that I wanted to pass into the apply method, which can be seen here:
def fill_MSZoning(row):
    if row['MSZoning'] == 'C':
        return 69.7
    elif row['MSZoning'] == 'FV':
        return 59.49
    elif row['MSZoning'] == 'RH':
        return 58.92
    elif row['MSZoning'] == 'RL':
        return 74.68
    else:
        return 52.4
I call the function like this:
test['LotFrontage'] = test.apply(lambda x: x.fillna(fill_MSZoning), axis=1)
Now, the results for the LotFrontage column are the same as the Id column, even though I didn't specify this.
Any idea what is happening?
You can do it like this:
import pandas as pd
import numpy as np

## creating dummy data
np.random.seed(100)
raw = {
    "group": np.random.choice("A B C".split(), 10),
    "value": [np.nan if np.random.rand() > 0.8 else np.random.choice(100) for _ in range(10)]
}
df = pd.DataFrame(raw)
display(df)

## calculate mean
means = df.groupby("group").mean()
display(means)
Fill With Group Mean
## fill with mean value
def fill_group_mean(x):
    group_mean = means["value"].loc[x["group"].max()]
    return x["value"].mask(x["value"].isna(), group_mean)

r = df.groupby("group").apply(fill_group_mean)
r.reset_index(level=0)
Output
group value
0 A NaN
1 A 24.0
2 A 60.0
3 C 9.0
4 C 2.0
5 A NaN
6 C NaN
7 B 83.0
8 C 91.0
9 C 7.0
group value
0 A 42.00
1 A 24.00
2 A 60.00
5 A 42.00
7 B 83.00
3 C 9.00
4 C 2.00
6 C 27.25
8 C 91.00
9 C 7.00
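A more compact alternative (not in the original answer) fills the gaps in place with a per-group transform:
df['value'] = df['value'].fillna(df.groupby('group')['value'].transform('mean'))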

Function cannot interpret NaN value

I am trying to get rid of NaN values in a dataframe.
Instead of filling NaN with averages or doing ffill, I wanted to fill missing values according to the distribution of values inside a column.
In other words, if a column has 120 rows, of which 20 are NaN, 80 contain 1.0 and 20 contain 0.0, I want to fill about 80% of the NaN values with 1. Note that the column contains floats.
I made a function to do so:
def fill_cr_hist(x):
    if x is pd.np.nan:
        r = random.random()
        if r > 0.80:
            return 0.0
        else:
            return 1.0
    else:
        return x
However when I call the function it does not change NaN values.
df['Credit_History'] = df['Credit_History'].apply(fill_cr_hist)
I tried filling NaN values with pd.np.nan, but it didn't change anything.
df['Credit_History'].fillna(value=pd.np.nan, inplace=True)
df['Credit_History'] = df['Credit_History'].apply(fill_cr_hist)
The other function I wrote is almost identical and works fine. In that case the column contains strings.
def fill_self_emp(x):
    if x is pd.np.nan:
        r = random.random()
        if r > 0.892442:
            return 'Yes'
        else:
            return 'No'
    else:
        return x
ser = pd.Series([1, 1, np.nan, 0, 0, 1, np.nan, 1, 1, np.nan, 0, 0, np.nan])
Use value_counts with normalize=True to get a list of probabilities corresponding to your values. Then generate values randomly according to the given probability distribution and use fillna to fill NaNs.
p = ser.value_counts(normalize=True).sort_index().tolist()
u = np.sort(ser.dropna().unique())
ser = ser.fillna(pd.Series(np.random.choice(u, len(ser), p=p)))
This solution should work for any number of numeric/categorical values, not just 0s and 1s. If data is a string type, use pd.factorize and convert to numeric.
Details
First, compute the probability distribution:
ser.value_counts(normalize=True).sort_index()
0.0 0.444444
1.0 0.555556
dtype: float64
Get a list of unique values, sorted in the same way:
np.sort(ser.dropna().unique())
array([0., 1.])
Finally, generate random values with specified probability distribution.
pd.Series(np.random.choice(u, len(ser), p=p))
0 0.0
1 0.0
2 1.0
3 0.0
4 0.0
5 0.0
6 1.0
7 1.0
8 0.0
9 0.0
10 1.0
11 0.0
12 1.0
dtype: float64
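For the string case mentioned above, a rough sketch (illustrative data and names, not from the original answer): factorize to integer codes, fill the missing codes with the same trick, then map back to the labels.
import numpy as np
import pandas as pd

s = pd.Series(['Yes', 'No', None, 'Yes', None, 'No', 'Yes'])
codes, labels = pd.factorize(s)  # missing values get code -1
num = pd.Series(codes, index=s.index).replace(-1, np.nan)
p = num.value_counts(normalize=True).sort_index().tolist()
u = np.sort(num.dropna().unique())
filled = num.fillna(pd.Series(np.random.choice(u, len(num), p=p), index=num.index))
result = pd.Series(np.asarray(labels)[filled.astype(int)], index=s.index)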

Applying a function to a pandas col

I would like to map the function GetPermittedFAR over my dataframe (df) so that, by testing whether a value in the column zonedist1 equals a certain value, I can build new columns such as df['FAR_Permitted'], etc.
I have tried various means of map() etc. but haven't gotten this to work. I feel this should be a pretty simple thing to do?
Ideally, I would use a simple list comprehension / lambda as I have many of these test conditional values resulting in col data to create.
import pandas as pd
import numpy as np
def GetPermittedFAR():
    if df['zonedist1'] == 'R7-3':
        df['FAR_Permitted'] = 0.5
        df['Building Height Max'] = 35
    if df['zonedist1'] == 'R3-2':
        df['FAR_Permitted'] = 0.5
        df['Building Height Max'] = 35
    if df['zonedist1'] == 'R1-1':
        df['FAR_Permitted'] = 0.7
        df['Building Height Max'] = 100
    # etc... an if statement for each unique value in 'zonedist1'
df = pd.DataFrame({'zonedist1': ['R7-3', 'R3-2', 'R1-1',
                                 'R1-2', 'R2', 'R2A', 'R2X',
                                 'R1-1', 'R7-3', 'R3-2', 'R7-3',
                                 'R3-2', 'R1-1', 'R1-2']})
df = df.apply(lambda x: GetPermittedFAR(), axis=1)
How about using pd.merge()?
Let df be your dataframe
In [612]: df
Out[612]:
zonedist1
0 R7-3
1 R3-2
2 R1-1
3 R1-2
4 R2
5 R2A
6 R2X
Let merge be another dataframe holding the conditions:
In [613]: merge
Out[613]:
zonedist1 FAR_Permitted Building Height Max
0 R7-3 0.5 35
1 R3-2 0.5 35
Then, merge df with merge using how='left':
In [614]: df.merge(merge, how='left')
Out[614]:
zonedist1 FAR_Permitted Building Height Max
0 R7-3 0.5 35
1 R3-2 0.5 35
2 R1-1 NaN NaN
3 R1-2 NaN NaN
4 R2 NaN NaN
5 R2A NaN NaN
6 R2X NaN NaN
Later you can replace NaN values.
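For example, the remaining NaNs can be filled with defaults of your choosing (illustrative values):
result = df.merge(merge, how='left')
result = result.fillna({'FAR_Permitted': 0.0, 'Building Height Max': 0})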
