I am trying to get rid of NaN values in a dataframe.
Instead of filling NaN with averages or doing ffill, I wanted to fill missing values according to the distribution of values inside a column.
In other words, if a column has 120 rows, 20 are NaN, 80 contain 1.0 and 20 contain 0.0, I want to fill 80% of the NaN values with 1.0 and the remaining 20% with 0.0. Note that the column contains floats.
I made a function to do so:
def fill_cr_hist(x):
    if x is pd.np.nan:
        r = random.random()
        if r > 0.80:
            return 0.0
        else:
            return 1.0
    else:
        return x
However, when I call the function, it does not change the NaN values.
df['Credit_History'] = df['Credit_History'].apply(fill_cr_hist)
I tried filling the NaN values with pd.np.nan first, but it didn't change anything.
df['Credit_History'].fillna(value=pd.np.nan, inplace=True)
df['Credit_History'] = df['Credit_History'].apply(fill_cr_hist)
The other function I wrote is almost identical and works fine; in that case the column contains strings.
def fill_self_emp(x):
    if x is pd.np.nan:
        r = random.random()
        if r > 0.892442:
            return 'Yes'
        else:
            return 'No'
    else:
        return x
ser = pd.Series([
    1, 1, np.nan, 0, 0, 1, np.nan, 1, 1, np.nan, 0, 0, np.nan])
Use value_counts with normalize=True to get a list of probabilities corresponding to your values. Then generate values randomly according to the given probability distribution and use fillna to fill NaNs.
p = ser.value_counts(normalize=True).sort_index().tolist()
u = np.sort(ser.dropna().unique())
ser = ser.fillna(pd.Series(np.random.choice(u, len(ser), p=p)))
This solution should work for any number of numeric/categorical values, not just 0s and 1s. If data is a string type, use pd.factorize and convert to numeric.
Details
First, compute the probability distribution:
ser.value_counts(normalize=True).sort_index()
0.0 0.444444
1.0 0.555556
dtype: float64
Get a list of unique values, sorted in the same way:
np.sort(ser.dropna().unique())
array([0., 1.])
Finally, generate random values with specified probability distribution.
pd.Series(np.random.choice(u, len(ser), p=p))
0 0.0
1 0.0
2 1.0
3 0.0
4 0.0
5 0.0
6 1.0
7 1.0
8 0.0
9 0.0
10 1.0
11 0.0
12 1.0
dtype: float64
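Applied to the question's Credit_History column, the same idea would look something like this (a sketch, not tested against the asker's data; the random series is given the dataframe's index so that fillna aligns correctly even when the index is not the default RangeIndex):
p = df['Credit_History'].value_counts(normalize=True).sort_index().tolist()
u = np.sort(df['Credit_History'].dropna().unique())
# random draws following the observed distribution of the non-NaN values
fill_values = pd.Series(np.random.choice(u, len(df), p=p), index=df.index)
df['Credit_History'] = df['Credit_History'].fillna(fill_values)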
So I am trying to forward fill a column with the limit being the value in another column. This is the code I run and I get this error message.
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['NM'] = [0, 0, 1, np.nan, np.nan, np.nan, 0]
df['length'] = [0, 0, 2, 0, 0, 0, 0]
print(df)
NM length
0 0.0 0
1 0.0 0
2 1.0 2
3 NaN 0
4 NaN 0
5 NaN 0
6 0.0 0
df['NM'] = df['NM'].fillna(method='ffill', limit=df['length'])
print(df)
ValueError: Limit must be an integer
The dataframe I want looks like this:
NM length
0 0.0 0
1 0.0 0
2 1.0 2
3 1.0 0
4 1.0 0
5 NaN 0
6 0.0 0
Thanks in advance for any help you can provide!
I do not think you want to use ffill for this instance.
Rather, I would recommend filtering to the rows where length is greater than 0, then iterating through those rows to write that row's NM value into the next length rows.
for row in df.loc[df.length.gt(0)].reset_index().to_dict(orient='records'):
    df.loc[row['index']+1:row['index']+row['length'], 'NM'] = row['NM']
To better break this down:
Get the rows containing change information, being sure to include the index:
df.loc[df.length.gt(0)].reset_index().to_dict(orient='records')
Iterate through them; I prefer to_dict for performance reasons on large datasets. It is a habit.
Set the NM values of the following rows to the NM value of the row with the defined length.
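On the example frame, the only row with length greater than 0 is index 2 (NM 1.0, length 2), so the loop writes 1.0 into rows 3 and 4 and leaves row 5 as NaN, matching the desired output above.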
You can first group the dataframe by the length column before filling. The only issue is that for the first group in your example the limit would be 0, which causes an error, so we make sure it is at least 1 with max. This might cause unexpected results if there are NaN values before the first non-zero value in length, but from the given data it is not clear whether that can happen.
# make groups
m = df.length.gt(0).cumsum()
# fill the column
df["NM"] = df.groupby(m).apply(
lambda f: f.NM.fillna(
method="ffill",
limit=max(f.length.iloc[0], 1))
).values
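For the example data, m comes out as [0, 0, 1, 1, 1, 1, 1], so rows 2 through 6 form a single group whose limit is max(2, 1) = 2; the forward fill then reaches rows 3 and 4 but leaves row 5 as NaN, which is exactly the desired output.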
I am trying to find a solution to do the following operation using either numpy or pandas: multiply each column of a dataframe by the matching value of a series, elementwise.
For instance, the result matrix has [0, 0, 0] as its first column, which is the result of multiplying column a by the series value at index a elementwise; more specifically, it is equal to [0 x 0.5, 0 x 0.4, 0 x 0.1].
If there is no built-in way to solve such a problem, I might just expand the series into a dataframe by duplicating its values and multiply the two dataframes.
input data:
series = pd.Series([0,10,0,100,1], index=list('abcde'))
df = pd.DataFrame([[0.5,0.4,0.2,0.7,0.8],
                   [0.4,0.5,0.1,0.1,0.5],
                   [0.1,0.9,0.8,0.3,0.8]
                   ], columns=list('abcde'))
This is actually very simple. Because the Series' index aligns with the DataFrame's columns, you only need to do:
series*df
output:
a b c d e
0 0.0 4.0 0.0 70.0 0.8
1 0.0 5.0 0.0 10.0 0.5
2 0.0 9.0 0.0 30.0 0.8
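If you prefer to make the alignment explicit, DataFrame.mul gives the same result, since a Series argument is aligned against the columns by default:
df.mul(series)  # equivalent to series * df; axis='columns' is the default for a Series argument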
I am looking for an elegant way to select values under 15 and change them to 1, and then change the next closest number in each column to 2. Any suggestions would be great. I can subset accordingly but am stuck on dynamically adapting the next closest number.
df I have:
df = pd.DataFrame(data={'a': [1,1,13,23,40],
                        'b': [89.87,1,12,4,8],
                        'c': [45,12,901,12,29]}).astype(float)
df I want:
expected = pd.DataFrame(data={'a': [1,1,1,2,40],
                              'b': [2,1,1,1,1],
                              'c': [45,1,901,1,2]}).astype(float)
You can use boolean masks together with DataFrame.mask:
mask = df.lt(15) # values lower than 15
mask2 = df.eq(df.mask(mask).min()) # min values, excluding values below 15
df.mask(mask, 1).mask(mask2, 2) # replacing mask with 1, mask2 with 2
output:
a b c
0 1.0 2.0 45.0
1 1.0 1.0 1.0
2 1.0 1.0 901.0
3 2.0 1.0 1.0
4 40.0 1.0 2.0
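To unpack mask2 on this example: df.mask(mask) replaces every value below 15 with NaN, so the remaining per-column minima are 23.0 in a, 89.87 in b and 29.0 in c; mask2 marks exactly those cells, which is why they end up as 2 in the output.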
I'm attempting to get the mean values on one data frame between certain time points that are marked as events in a second data frame.
This is a follow up to this question, where now I have missing/NaN values: Find a subset of columns based on another dataframe?
import pandas as pd
import numpy as np
#example
example_g = [["4/20/21 4:20", 302, 0, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN],
             ["2/17/21 9:20", 135, 1, 1.4, 1.8, 2, 8, 10],
             ["2/17/21 9:20", 111, 4, 5, 5.1, 5.2, 5.3, 5.4]]
example_g_table = pd.DataFrame(example_g,columns=['Date_Time','CID', 0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
#Example Timestamps
example_s = [["4/20/21 4:20", 302, 0.0, 0.2, np.NaN],
             ["2/17/21 9:20", 135, 0.0, 0.1, 0.4],
             ["2/17/21 9:20", 111, 0.3, 0.4, 0.5]]
example_s_table = pd.DataFrame(example_s,columns=['Date_Time','CID', "event_1", "event_2", "event_3"])
df = pd.merge(left=example_g_table,right=example_s_table,on=['Date_Time','CID'],how='left')
def func(df):
    event_2 = df['event_2']
    event_3 = df['event_3']
    start = event_2 + 2 # this assumes that the column called 0 will be the third (and starting at 0, it'll be the called 2), column 1 will be the third column, etc
    end = event_3 + 2 # same as above
    total = sum(df.iloc[start:end+1]) # this line is the key. It takes the sum of the values of columns in the range of start to finish
    avg = total/(end-start+1) #(end-start+1) gets the count of things in our range
    return avg
df['avg'] = df.apply(func,axis=1)
I get the following error:
cannot do positional indexing on Index with these indexers [nan] of type float
I have tried making sure that the columns are floats and have tried removing the int() conversion within the definitions of the events.
How can I perform the same calculations as before where possible, while skipping any values that are NaN?
Regarding your question, check whether this solution works for you:
def func(row):
    try:
        event_2 = row['event_2']
        event_3 = row['event_3']
        start = int(event_2 + 2)
        end = int(event_3 + 2) + 1
        list_row = row.tolist()[start:end]
        list_row = [x for x in list_row if x == x]
        return sum(list_row)/(end-start)
    except Exception as e:
        return np.NaN

df['avg'] = df.apply(lambda x: func(x), axis=1)
I reduced the function and convert the start and end parameters to integers before taking the subset. When calling the function over the rows I use a lambda, and in the avg calculation I remove all NaN values.
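The x == x check works because NaN is the only float value that is not equal to itself, so NaN entries drop out of the list comprehension; math.isnan(x) or pd.isna(x) would be an equivalent, more explicit test:
[v for v in [1.4, float('nan'), 8.0] if v == v]  # -> [1.4, 8.0]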
You can check if the event values are NaN and if any of the event value is NaN, just return NaN from the function, else return the required value.
You can also modify the function a bit to calculate the values between any two given events, i.e. not necessarily event 2 and event 3. Also, the data you provided in the previous question had integer event values, but this time you have float values like 0.1, 0.2, 0.3, etc. You can store the columns holding the event values in a list, in increasing order, so they can be accessed via the index values coming from the event columns of the second dataframe.
Additionally, you can directly use np.mean instead of calculating the sum and dividing it manually. The modified version of the function will look like this:
eventCols = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5] # Columns having the value for events
def getMeanValue(row, eN1=2, eN2=3):
    if pd.isna([row[f'event_{eN1}'], row[f'event_{eN2}']]).any():
        return float('nan')
    else:
        requiredEventCols = eventCols[int(row[f'event_{eN1}']):int(row[f'event_{eN2}']+1)]
        return np.mean(row[requiredEventCols])
Now, you can apply this function on the dataframe with axis=1:
df['avg'] = df.apply(getMeanValue,axis=1)
Date_Time CID 0.0 0.1 0.2 ... 0.5 event_1 event_2 event_3 avg
0 4/20/21 4:20 302 0 NaN NaN ... NaN 0 2 NaN NaN
1 2/17/21 9:20 135 1 1.4 1.8 ... 10.0 0 1 4.0 3.30
2 2/17/21 9:20 111 4 5.0 5.1 ... 5.4 3 4 5.0 5.35
[3 rows x 12 columns]
Additionally, if needed, you can also pass the two event numbers; the default values are 2 and 3, which means the value is calculated between event_2 and event_3.
Average between event_1 and event_2:
df['avg'] = df.apply(getMeanValue,axis=1, eN1=1, eN2=2)
Date_Time CID 0.0 0.1 0.2 ... 0.5 event_1 event_2 event_3 avg
0 4/20/21 4:20 302 0 NaN NaN ... NaN 0 2 NaN 0.00
1 2/17/21 9:20 135 1 1.4 1.8 ... 10.0 0 1 4.0 1.20
2 2/17/21 9:20 111 4 5.0 5.1 ... 5.4 3 4 5.0 5.25
[3 rows x 12 columns]
Average between event_1 and event_3:
df['avg'] = df.apply(getMeanValue,axis=1, eN1=1, eN2=3)
Date_Time CID 0.0 0.1 0.2 ... 0.5 event_1 event_2 event_3 avg
0 4/20/21 4:20 302 0 NaN NaN ... NaN 0 2 NaN NaN
1 2/17/21 9:20 135 1 1.4 1.8 ... 10.0 0 1 4.0 2.84
2 2/17/21 9:20 111 4 5.0 5.1 ... 5.4 3 4 5.0 5.30
[3 rows x 12 columns]
The format of your data is hard to work with. I would spend some time rearranging it into a less wide format, then do the work needed.
Here is a quick example, but I did not spend any time making this readable:
base = example_g_table.set_index(['Date_Time','CID']).stack().to_frame()
data = example_s_table.set_index(['Date_Time','CID']).stack().reset_index().set_index(['Date_Time','CID', 0])
base['events'] = data
base = base.reset_index()
base = base.rename(columns={'level_2': 'local_index', 0: 'values'})
This produces a long-format frame with one row per (Date_Time, CID, original column label): the measurement ends up in the values column and the matching event name (event_1, event_2, ...) in the events column, with NaN in events for rows that do not mark an event.
In this format calculating the result is not so hard.
import numpy as np
from functools import partial

def mean_two_events(event1, event2, columns_to_mean, df):
    event_1 = df['events'] == event1
    event_2 = df['events'] == event2
    if any(event_1) and any(event_2):
        return df.loc[event_1.idxmax():event_2.idxmax()][columns_to_mean].mean()
    else:
        return np.nan

mean_event2_and_event3 = partial(mean_two_events, 'event_2', 'event_3', 'values')
mean_event1_and_event3 = partial(mean_two_events, 'event_1', 'event_3', 'values')
base.groupby(['Date_Time','CID']).apply(mean_event2_and_event3).reset_index()
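On the example data this should give NaN for CID 302 (its reshaped rows never contain event_2) and roughly 3.30 and 5.35 for CIDs 135 and 111, in line with the averages shown in the other answer.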
Good luck!
Edit:
Here is an alternative solution that filters out the values BEFORE the groupby.
base['events'] = base.groupby(['Date_Time','CID']).events.ffill()
# This calculates all periods up until the next event. The shift makes the first values of the next event included as well.
# The problem with this approach is that more complex logic will be needed if you need to calculate values between events that
# are not adjacent, i.e. this won't work if you want to calculate between event_1 and event_3.
base['time_periods_to_include'] = ((base.events == 'event_2') | (base.groupby(['Date_Time','CID']).events.shift() == 'event_2'))
# Now we can simply do:
filtered_base = base[base['time_periods_to_include']]
filtered_base.groupby(['Date_Time','CID']).values.mean()
# The benefit is that you can now easily do:
filtered_base.groupby(['Date_Time','CID']).values.rolling(5).mean()
How do I compare values to the next or previous items in a loop?
I need to summarize consecutive repetitions of occurrences in columns.
After that I need to create a "frequency table", so dfoutput should look like the desired output shown at the bottom.
This code doesn't work because I can't compare one item to the next.
Maybe there is another, simple way to do this without looping?
sumrep=0
df = pd.DataFrame(data = {'1' : [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],'2' : [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})
df.index= [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] # It will be easier to assign repetitions in output df - index will be equal to number of repetitions
dfoutput = pd.DataFrame(0,index=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],columns=['1','2'])
#example for column 1
for val1 in df.columns[1]:
    if val1 == 1 and val1 == 0: #can't find the way to check the NEXT val1 (one row below) in column 1 :/
        if sumrep==0:
            dfoutput.loc[1,1]=dfoutput.loc[1,1]+1 #count only SINGLE occurrences of values and assign them to row number 1 in dfoutput
        if sumrep>0:
            dfoutput.loc[sumrep,1]=dfoutput.loc[sumrep,1]+1 #count repeated occurrences greater than 1 and assign them to the proper row in dfoutput
            sumrep=0
    elif val1 == 1 and df[val1+1]==1 :
        sumrep=sumrep+1
Desired output table for column 1 - dfoutput:
I don't understand why there is no simple method to move around a dataframe, like the OFFSET function in VBA in Excel :/
You can use the function defined here to perform fast run-length-encoding:
import numpy as np
def rlencode(x, dropna=False):
    """
    Run length encoding.
    Based on http://stackoverflow.com/a/32681075, which is based on the rle
    function from R.

    Parameters
    ----------
    x : 1D array_like
        Input array to encode
    dropna : bool, optional
        Drop all runs of NaNs.

    Returns
    -------
    start positions, run lengths, run values
    """
    where = np.flatnonzero
    x = np.asarray(x)
    n = len(x)
    if n == 0:
        return (np.array([], dtype=int),
                np.array([], dtype=int),
                np.array([], dtype=x.dtype))
    starts = np.r_[0, where(~np.isclose(x[1:], x[:-1], equal_nan=True)) + 1]
    lengths = np.diff(np.r_[starts, n])
    values = x[starts]
    if dropna:
        mask = ~np.isnan(values)
        starts, lengths, values = starts[mask], lengths[mask], values[mask]
    return starts, lengths, values
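For intuition, running rlencode on the question's first column gives something like this:
starts, lengths, values = rlencode([0,0,1,0,1,1,0,1,1,0,1,1,1,1,0])
# starts  -> [0, 2, 3, 4, 6, 7, 9, 10, 14]
# lengths -> [2, 1, 1, 2, 1, 2, 1, 4, 1]
# values  -> [0, 1, 0, 1, 0, 1, 0, 1, 0]
# The runs of 1s have lengths [1, 2, 2, 4]: one run of length 1, two of length 2 and one of length 4,
# which is exactly what ends up in column '1' of the frequency table below.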
With this function your task becomes a lot easier:
import pandas as pd
from collections import Counter
from functools import partial
def get_frequency_of_runs(col, value=1, index=None):
    _, lengths, values = rlencode(col)
    return pd.Series(Counter(lengths[np.where(values == value)]), index=index)

df = pd.DataFrame(data={'1': [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],
                        '2': [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})

df.apply(partial(get_frequency_of_runs, index=df.index)).fillna(0)
# 1 2
# 0 0.0 0.0
# 1 1.0 2.0
# 2 2.0 1.0
# 3 0.0 0.0
# 4 1.0 1.0
# 5 0.0 0.0
# 6 0.0 0.0
# 7 0.0 0.0
# 8 0.0 0.0
# 9 0.0 0.0
# 10 0.0 0.0
# 11 0.0 0.0
# 12 0.0 0.0
# 13 0.0 0.0
# 14 0.0 0.0