Found a solution using .fillna
As you can guess, my title is already confusing, and so am I!
I have a dataframe like this
Index   Values
0       NaN
1       NaN
...     ...
230     350.21
231     350.71
...     ...
1605    922.24
Between index 230 and 1605 I have values, but not for the first 230 entries (indices 0–229). So I calculated the slope to approximate the missing data and stored it in 'slope'.
Y1 = df['Values'].min()        # smallest existing value (350.21)
X1ID = df['Values'].idxmin()   # its index (230)
Y2 = df['Values'].max()        # largest existing value (922.24)
X2ID = df['Values'].idxmax()   # its index (1605)
slope = (Y2 - Y1) / (X2ID - X1ID)
In essence I want to take the .min of Values, subtract the slope, and insert the new value at the index before the previous .min. However, I am completely lost. I tried something like this:
df['Values2'] = df['Values'].min().apply(lambda x: x.min() - slope)
But that is obviously rubbish. I would greatly appreciate some advice.
EDIT
So after trying multiple ways I found a crude solution that at least works for me.
loopcounter = 0
missingValue = []
missingindex = []
missingindex.append(loopcounter)
missingValue.append(Y1)
for minValue in missingValue:
    minValue = minValue - slope
    missingValue.append(minValue)
    loopcounter += 1
    missingindex.append(loopcounter)
    if loopcounter == 230:
        break
del missingValue[0]
missingValue.reverse()
del missingindex[-1]
First I created two lists: one for the missing values and one for the indices.
Afterwards I added my minimum value (Y1) to the value list and started my loop.
I wanted the loop to stop after 230 iterations (the number of missing values).
Each iteration subtracts the slope from the latest item in the list, starting with the minimum value, while also appending the counter to the missingindex list.
Deleting the first value and reversing the list puts the values in the correct order.
missValue = dict(zip(missingindex, missingValue))
I then combined the two lists into a dictionary.
df['Values'] = df['Values'].fillna(missValue)
Afterwards I used .fillna to fill the missing values in my dataframe from the dictionary.
This worked for me. I know it's probably not the most elegant solution...
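For reference, the same backfill can also be done without my loop. A minimal vectorized sketch (my rough alternative, assuming the Y1, X1ID, and slope variables computed above):

import numpy as np

# indices 0 .. X1ID-1 are the missing entries; step counts run from X1ID down to 1
steps = np.arange(X1ID, 0, -1)
df.loc[:X1ID - 1, 'Values'] = Y1 - steps * slope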
I would like to thank everyone that invested their time in trying to help me, thanks a lot.
Check this. However, I feel you would have to put this in a loop, as the insertion and the min calculation have to be redone each time.
import pandas as pd
import numpy as np

df = pd.DataFrame(columns=('Values',),
                  data=[np.nan, np.nan, 350.21, 350.71, 922.24])
Y1 = df['Values'].min()
X1ID = df['Values'].idxmin()
Y2 = df['Values'].max()
X2ID = df['Values'].idxmax()
slope = (Y2 - Y1)/(X2ID - X1ID)
line = pd.DataFrame(data=[Y1 - slope], columns=('Values',), index=[X1ID])
df2 = pd.concat([df.loc[:X1ID - 1], line, df.loc[X1ID:]]).reset_index(drop=True)
print(df2)
The insert logic is taken from: Is it possible to insert a row at an arbitrary position in a dataframe using pandas?
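A rough sketch of such a loop (my own take; for simplicity it keeps the initial slope fixed and overwrites the leading NaNs in place rather than inserting new rows, which matches the original goal):

# keep extrapolating one step below the current minimum until row 0 is filled
while df['Values'].idxmin() > 0:
    Y1 = df['Values'].min()
    X1ID = df['Values'].idxmin()
    df.loc[X1ID - 1, 'Values'] = Y1 - slope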
I think you can use loc with interpolate:
print(df)
Values
Index
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
229 NaN
230 350.21
231 350.71
1605 922.24
# add value 0 at index 0
df.at[0, 'Values'] = 0
# add value Y1 - slope (349.793978) at the last NaN index
df.at[X1ID - 1, 'Values'] = Y1 - slope
print(df)
Values
Index
0 0.000000
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
229 349.793978
230 350.210000
231 350.710000
1605 922.240000
print(df.loc[0:X1ID-1, 'Values'])
Index
0 0.000000
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
229 349.793978
Name: Values, dtype: float64
# filter values by index and interpolate
df.loc[0:X1ID-1, 'Values'] = df.loc[0:X1ID-1, 'Values'].interpolate(method='linear')
print(df)
Values
Index
0 0.000000
1 49.970568
2 99.941137
3 149.911705
4 199.882273
5 249.852842
6 299.823410
229 349.793978
230 350.210000
231 350.710000
1605 922.240000
I will revise this a little bit:
df['Values2'] = df['Values']
df.loc[df.Values2.isnull(), 'Values2'] = Y1 - slope
EDIT
Or try to put this in a loop like below. This will recursively fill in all values until it reaches the end of the series:
def fix_rec(series):
    Y1 = series.min()
    X1ID = series.idxmin()
    Y2 = series.max()
    X2ID = series.idxmax()
    slope = (Y2 - Y1) / (X2ID - X1ID)
    if X1ID == 0:  # termination condition
        return series
    series.loc[X1ID - 1] = Y1 - slope
    return fix_rec(series)
Call it like this:
df['Values2'] = df['Values']
fix_rec(df['Values2'])
I hope that helps!
Related
I'm a biology student who is fairly new to Python, and I was hoping someone might be able to help with a problem I have yet to solve.
With some earlier code I created a pandas dataframe that looks like the example below:
Distance.   No. of values   Mean rSquared
1           500             0.6
2           80              0.3
3           40              0.4
4           30              0.2
5           50              0.2
6           30              0.1
I can provide my previous code to create this dataframe, but I didn't think it was particularly relevant.
I need to sum the 'No. of values' column until I reach a value >= 100, and then combine the data of the rows involved, taking the weighted average of the distance and mean rSquared values, as seen in the example below:
Mean Distance          No. of values      Mean rSquared
1                      500                0.6
(80*2+40*3)/120        (80+40) = 120      (80*0.3+40*0.4)/120
(30*4+50*5+30*6)/110   (30+50+30) = 110   (30*0.2+50*0.2+30*0.1)/110
etc...
I know pandas has its .cumsum function, which I might be able to implement in a for loop with an if statement that checks the upper limit and resets the sum back to 0 when it is greater than or equal to the upper limit. However, I haven't a clue how to average the adjacent columns.
Any help would be appreciated!
You can use this code snippet to solve your problem.
import pandas as pd

# First, compute some weighted values
df.loc[:, "weighted_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "weighted_mean_rSquared"] = df["Mean rSquared"] * df["No. of values"]

min_threshold = 100
indexes = []
temp_sum = 0

# placeholder for the final result
final_df = pd.DataFrame()
columns = ["Distance", "No. of values", "Mean rSquared"]

# resetting the index to make 'df' usable in the following loop
df = df.reset_index(drop=True)

# main loop to check and compute the desired output
for index, _ in df.iterrows():
    temp_sum += df.iloc[index]["No. of values"]
    indexes.append(index)

    # once the sum reaches 'min_threshold', compute the weighted averages
    if temp_sum >= min_threshold:
        temp_distance = df.iloc[indexes]["weighted_distance"].sum() / temp_sum
        temp_mean_rSquared = df.iloc[indexes]["weighted_mean_rSquared"].sum() / temp_sum

        # create a temporary dataframe and concatenate it with 'final_df'
        temp_df = pd.DataFrame([[temp_distance, temp_sum, temp_mean_rSquared]], columns=columns)
        final_df = pd.concat([final_df, temp_df])

        # reset the accumulators
        temp_sum = 0
        indexes = []
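For the sample dataframe above, final_df works out to roughly the following (computed from the example numbers; exact formatting and dtypes may differ):

   Distance  No. of values  Mean rSquared
0  1.000000          500.0       0.600000
0  2.333333          120.0       0.333333
0  5.000000          110.0       0.172727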
NumPy has a function, numpy.frompyfunc, that you can use to get a cumulative value based on a threshold.
Here's how to implement it. With that, you can figure out the indexes where the value goes over the threshold, and use those to calculate the Mean Distance and Mean rSquared for the values in your original dataframe.
I also leveraged @sujanay's idea of calculating the weighted values first.
import pandas as pd
import numpy as np

c = ['Distance', 'No. of values', 'Mean rSquared']
d = [[1, 500, 0.6], [2, 80, 0.3], [3, 40, 0.4],
     [4, 30, 0.2], [5, 50, 0.2], [6, 30, 0.1]]
df = pd.DataFrame(d, columns=c)

# calculate the weighted distance and weighted mean rSquared first
df.loc[:, "w_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "w_mean_rSqrd"] = df["Mean rSquared"] * df["No. of values"]

# use numpy.frompyfunc to set up the threshold condition
sumvals = np.frompyfunc(lambda a, b: a + b if a <= 100 else b, 2, 1)

# assign cumulative values that restart once the threshold is passed
df['cumvals'] = sumvals.accumulate(df['No. of values'], dtype=object)

# find all records whose cumulative value is >= 100
idx = df.index[df['cumvals'] >= 100].tolist()

# if the last row is not in idx, add it to the list
if (len(df) - 1) not in idx:
    idx += [len(df) - 1]

# iterate through idx and calculate Mean Distance and Mean rSquared per set
i = 0
for j in idx:
    df.loc[j, 'Mean Distance'] = (df.iloc[i:j+1]["w_distance"].sum() / df.loc[j, 'cumvals']).round(2)
    df.loc[j, 'New Mean rSquared'] = (df.iloc[i:j+1]["w_mean_rSqrd"].sum() / df.loc[j, 'cumvals']).round(2)
    i = j + 1

print(df)
The output of this will be:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
1 2 80 ... NaN NaN
2 3 40 ... 2.33 0.33
3 4 30 ... NaN NaN
4 5 50 ... NaN NaN
5 6 30 ... 5.00 0.17
If you want to extract only the records that are non NaN, you can do:
final_df = df[df['Mean Distance'].notnull()]
This will result in:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
2 3 40 ... 2.33 0.33
5 6 30 ... 5.00 0.17
I looked up BEN_YO's implementation of numpy.frompyfunc. The original SO post can be found here: Restart cumsum and get index if cumsum more than value.
If you figure out the grouping first, pandas' groupby functionality will do a lot of the remaining work for you. A loop is appropriate for getting the grouping (unless somebody has a clever one-liner):
>>> groups = []
>>> group = 0
>>> cumsum = 0
>>> for n in df["No. of values"]:
... if cumsum >= 100:
... cumsum = 0
... group = group + 1
... cumsum = cumsum + n
... groups.append(group)
>>>
>>> groups
[0, 1, 1, 2, 2, 2]
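(As an aside, here is a sketch of a shorter variant using itertools.accumulate; my own attempt, and the loop above is arguably clearer:)
>>> from itertools import accumulate
>>> vals = df["No. of values"].tolist()
>>> # running sums that restart after crossing the threshold
>>> cums = list(accumulate(vals, lambda a, b: b if a >= 100 else a + b))
>>> # the group id increases each time the previous running sum had crossed it
>>> groups = [0]
>>> for prev in cums[:-1]:
...     groups.append(groups[-1] + (1 if prev >= 100 else 0))
...
>>> groups
[0, 1, 1, 2, 2, 2]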
Before doing the grouped operations you need to use the No. of values information to get the weighting in:
df[["Distance.", "Mean rSquared"]] = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)
Now get the sums like this:
>>> sums = df.groupby(groups)["No. of values"].sum()
>>> sums
0 500
1 120
2 110
Name: No. of values, dtype: int64
And finally the weighted group averages like this:
>>> df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
Distance. Mean rSquared
0 1.000000 0.600000
1 2.333333 0.333333
2 5.000000 0.172727
I have a column that I'm trying to smooth out. Most of the data creates a smooth chart, but sometimes I get a random spike, and I want to reduce the impact of that spike.
My thought was to take each outlier and just make it the mean of the values around it, but I'm struggling and not getting the result I want.
Here's what I'm doing right now:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(5, 1)), columns=list('A'))

def aDetection(inputs):
    median = inputs["A"].median()
    std = inputs["A"].std()
    outliers = (inputs["A"] - median).abs() > std
    print("outliers")
    print(outliers)
    inputs[outliers]["A"] = np.nan  # this isn't working
    inputs[outliers] = np.nan       # works, but wipes out the entire row
    inputs['A'].fillna(median, inplace=True)
    print("modified:")
    print(inputs)

print("original")
print(df)
aDetection(df)
original
A
0 4
1 86
2 40
3 99
4 97
outliers
0 True
1 False
2 True
3 False
4 False
Name: A, dtype: bool
modified:
A
0 86.0
1 86.0
2 86.0
3 99.0
4 97.0
For one, it seems to change all rows, not just the single column. But the bigger problem is that all the outliers in my example are replaced with 86. I realize this is because I used a single value for the entire column, but I would like the mean of the values surrounding each missing entry.
For a single column, you can do your task with the following one-liner
(folded into two lines for readability):
df.A = df.A.mask((df.A - df.A.median()).abs() > df.A.std(),
pd.concat([df.A.shift(), df.A.shift(-1)], axis=1).mean(axis=1))
Details:
(df.A - df.A.median()).abs() > df.A.std() - computes outliers.
df.A.shift() - computes a Series of previous values.
df.A.shift(-1) - computes a Series of following values.
pd.concat(...) - creates a DataFrame from both the above Series.
mean(axis=1) - computes means by rows.
mask(...) - takes original values of A column for non-outliers
and the value from concat for outliers.
The result is:
A
0 86.0
1 86.0
2 92.5
3 99.0
4 97.0
If you want to apply this mechanism to all columns of your DataFrame,
then:
Change the above code to a function:
def replOutliers(col):
    return col.mask((col - col.median()).abs() > col.std(),
                    pd.concat([col.shift(), col.shift(-1)], axis=1).mean(axis=1))
Apply it (to each column):
df = df.apply(replOutliers)
I want to create a new column in my table by applying an equation, but there are two possible equations for the new column:
(1) frequency = (total x 100) / hour
(2) frequency = (total x 1000000) / km_length
The table is similar to this:
type  hour  km_length  total
A     1     -          1
B     -     2          1
The calculation for the "frequency" column depends on which of the hour and km_length columns has a value.
I then expect the table to look like this:
type  hour  km_length  total  frequency
A     1     -          1      100
B     -     2          1      500000
I have tried using np.nan_to_num before, but it did not produce the table I expected.
Is there any way I can do this in Python? Looking forward to any help, thank you.
We can use np.where for assigning values based on a condition:
df[["hour", "km_length"]] = df[["hour", "km_length"]].apply(pd.to_numeric, errors="coerce")
df["frequency"] = np.where(
df["km_length"].isna(),
df["total"] * 100 / df["hour"],
df["total"] * 1_000_000 / df["km_length"]
)
type hour km_length total frequency
0 A 1.0 NaN 1 100.0
1 B NaN 2.0 1 500000.0
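If more equations are added later, np.select generalizes the same pattern. A small sketch (a hypothetical extension, not needed for the two-equation case):

conditions = [df["km_length"].isna(), df["hour"].isna()]
choices = [df["total"] * 100 / df["hour"],
           df["total"] * 1_000_000 / df["km_length"]]
df["frequency"] = np.select(conditions, choices, default=np.nan)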
Make your values numeric, then compute both. Because a missing value indicates which equation to use, and because division involving NaN results in NaN, do both calculations and use .fillna to determine the correct resulting value.
df[['hour', 'km_length']] = df[['hour', 'km_length']].apply(pd.to_numeric, errors='coerce')
s1 = df['total'].divide(df['hour']).multiply(100)
s2 = df['total'].divide(df['km_length']).multiply(10**6)
df['frequency'] = s1.fillna(s2)
type hour km_length total frequency
0 A 1.0 NaN 1 100.0
1 B NaN 2.0 1 500000.0
You can store the data in a NumPy array.

import numpy as np

# each row holds [hour, km_length, total, frequency];
# NaN marks a missing value, and frequency starts at 0
table = np.array([[1.0, np.nan, 1.0, 0.0],
                  [np.nan, 2.0, 1.0, 0.0]])

for row in table:
    # dividing by NaN does not raise an error, so check for it explicitly
    if not np.isnan(row[0]):   # hour is present -> equation (1)
        row[3] = (row[2] * 100) / row[0]
    else:                      # otherwise use km_length -> equation (2)
        row[3] = (row[2] * 1000000) / row[1]

print(table)
This should print the desired table.
I have a dataframe df with NaN values, and I want to dynamically replace them with the average of the previous and next non-missing values.
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
For example, A[3] is NaN so its value should be (-0.120211-0.788073)/2 = -0.454142. A[4] then should be (-0.454142-0.788073)/2 = -0.621108.
Therefore, the result dataframe should look like:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621108 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260202
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Is this a good way to deal with the missing values? I can't simply replace them with the average of each column because my data is a time series and tends to increase over time. (The initial value may be $0 and the final value might be $100000, so the average is $50000, which can be much bigger/smaller than the NaN values.)
You can see the logic behind this kind of average: it is a geometric progression.
s = df.isnull().cumsum()
t1 = df[(s == 1).shift(-1).fillna(False)].stack().reset_index(level=0, drop=True)
t2 = df.lookup(s.idxmax() + 1, s.idxmax().index)
df.fillna(t1 / (2**s) + t2 * (1 - 0.5**s) * 2 / 2)
Out[212]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621107 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260201
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Explanation:
1st NaN: x/2 + y/2 = 1st
2nd NaN: 1st/2 + y/2 = 2nd
3rd NaN: 2nd/2 + y/2 = 3rd
In general, the nth NaN is x/(2**n) + (y/2)*(1 - (1/2)**n)/(1 - 1/2) = x/(2**n) + y*(1 - (1/2)**n), where x is the last value before the gap and y is the first value after it; this is the key.
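A quick numeric sanity check of that closed form against column A of the example, where x = -0.120211 (the last value before the gap) and y = -0.788073 (the first value after it); my own check, not part of the original answer:

x, y = -0.120211, -0.788073
for n in (1, 2):
    print(x / 2**n + y * (1 - 0.5**n))  # -0.454142, then approximately -0.621108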
I had a similar problem.
The following code worked for me.
def fill_nan_with_mean_from_prev_and_next(df):
    NANrows = pd.isnull(df).any(axis=1).to_numpy().nonzero()[0]
    null_df = df.isnull()
    for row in NANrows:
        for colum in range(0, df.shape[1]):
            if null_df.iloc[row][colum]:
                df.iloc[row][colum] = (df.iloc[row - 1][colum] + df.iloc[row + 1][colum]) / 2
    return df
Maybe it helps someone too.
As Ben.T mentioned above, if you have another group of NaNs in the same column, you can consider this lazy solution :)
for column in df:
    for ind, row in df[[column]].iterrows():
        if ~np.isnan(row[column]):
            previous = row[column]
        else:
            indx = ind + 1
            while np.isnan(df.loc[indx, column]):
                indx += 1
            next = df.loc[indx, column]
            previous = df[column][ind] = (previous + next) / 2
I am trying to merge two pandas tables where I find all rows in df2 which have coordinates close to each row in df1. Example follows.
df1:
x y val
0 0 1 A
1 1 3 B
2 2 9 C
df2:
x y val
0 1.2 2.8 a
1 0.9 3.1 b
2 2.0 9.5 c
desired result:
x y val_x val_y
0 0 1 A NaN
1 1 3 B a
2 1 3 B b
3 2 9 C c
Each row in df1 can have 0, 1, or many corresponding entries in df2, and a match means the squared Euclidean distance is below a threshold:
(x1 - x2)^2 + (y1 - y2)^2 < 1
In general the input dataframes have different sizes, even though they don't in this example. I can get close by iterating over the rows in df1 and finding the close values in df2, but I am not sure what to do from there:
for i, row in df1.iterrows():
df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0]
# ?? What now?
Any help would be very much appreciated. I made this example in an IPython notebook, which you can view/access here: http://nbviewer.ipython.org/gist/anonymous/49a3d821420c04169f02
I found an answer, though I am not really happy with having to loop over the rows in df1. In this case there are only a few hundred rows so I can deal with it, but it won't scale well. Solution:
df2_list = []
df1['merge_row'] = df1.index.values  # make a column to merge on, using the index values
for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0]
    df2_subset['merge_row'] = i  # add the merge key
    df2_list.append(df2_subset)
df2_found = pd.concat(df2_list)
result = pd.merge(df1, df2_found, on='merge_row', how='left')
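For larger frames, a tree-based lookup should scale much better than the row loop. A sketch using scipy's cKDTree (my own alternative, assuming the same df1/df2 as above; note that query_ball_point includes points at distance exactly 1, a slightly looser boundary than the strict < 1 criterion):

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# build a KD-tree on df2's coordinates; query_ball_point returns,
# for each df1 row, the list of df2 indices within radius 1
tree = cKDTree(df2[['x', 'y']].to_numpy())
matches = tree.query_ball_point(df1[['x', 'y']].to_numpy(), r=1.0)

rows = []
for i, js in enumerate(matches):
    for j in (js or [None]):  # keep unmatched df1 rows with a NaN partner
        rows.append({'x': df1.at[i, 'x'],
                     'y': df1.at[i, 'y'],
                     'val_x': df1.at[i, 'val'],
                     'val_y': df2.at[j, 'val'] if j is not None else np.nan})
result = pd.DataFrame(rows)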