Something weird happened to me today. I needed to create a list based on a sequence of if statements. My dataframe looks something like this:
prom_lect4b_rbd prom_lect2m_rbd prom_lect8b_rbd prom_lect6b_rbd
100 np.nan 80 200
np.nan np.nan 40 1000
np.nan np.nan np.nan 90
230 100 80 100
Columns are ordered according to their priority. The list (or column) I'm trying to create takes, from each row, the first value that is not NaN. So, in this case I want a column that looks like this:
simce_final_lect
100
40
90
230
I tried the following:
cols=[simces.prom_lect4b_rbd, simces.prom_lect2m_rbd, simces.prom_lect8b_rbd, simces.prom_lect6b_rbd]
simce_final_lect=[j if np.isnan(j)==False else k if np.isnan(k)==False
else l if np.isnan(l)==False else m if np.isnan(m)==False
else np.nan for j,k,l,m in zip(cols[0],cols[1],cols[2],cols[3])]
And that just copies two values (out of 8752) to the list. But if I limit my zip to just j,k,l, it works perfectly:
simce_final_lect=[j if np.isnan(j)==False else k if np.isnan(k)==False
else l if np.isnan(l)==False
else np.nan for j,k,l in zip(cols[0],cols[1],cols[2])]
Do you know what is happening? Alternatively, is there a more efficient solution to my problem?
You can use bfill(axis=1) and select the first col.
df.bfill(axis=1).iloc[:,0]
0 100.0
1 40.0
2 90.0
3 230.0
Name: prom_lect4b_rbd, dtype: float64
## For list
df.bfill(axis=1).iloc[:,0].tolist()
[100.0, 40.0, 90.0, 230.0]
Use first_valid_index():
df.apply(lambda x: x[x.first_valid_index()], axis=1)
Yields:
0 100.0
1 40.0
2 90.0
3 230.0
dtype: float64
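Either answer can be assigned straight back as the new column. A minimal sketch, assuming the frame is called simces and holds the four columns from the question:
import numpy as np
import pandas as pd

simces = pd.DataFrame({
    'prom_lect4b_rbd': [100, np.nan, np.nan, 230],
    'prom_lect2m_rbd': [np.nan, np.nan, np.nan, 100],
    'prom_lect8b_rbd': [80, 40, np.nan, 80],
    'prom_lect6b_rbd': [200, 1000, 90, 100],
})

# take the first non-NaN value per row, scanning columns left to right
simces['simce_final_lect'] = simces.bfill(axis=1).iloc[:, 0]
print(simces['simce_final_lect'].tolist())  # [100.0, 40.0, 90.0, 230.0]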
(I think) I'm looking to apply a transform to a column, by finding the % change when compared to one static column.
My first attempt looks like this (without a transform):
from pandas import DataFrame
from numpy import random
df = DataFrame(random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
print(df)
for col in df.columns:
    # Find the % increase/decrease of "col" compared to column A
    df[col] = df[["A", col]].pct_change(axis=1)[col]

print(df)
...however the resulting df is all NaN's when I'm expecting it to be in % increase/decrease format.
So as an example, it starts by comparing column A with column A, that's fine, all values SHOULD be the same. Then the next iteration it should be column B compared to column A. We should see %'s in column B in the end. Then same for C and D. I'm just new to transforms/changing values of a column in place and not sure how to do it.
Subtract column A from the dataframe then divide by column A to calculate pct_change:
df.sub(df['A'], axis=0).div(df['A'], axis=0)
The above expression can be further simplified to:
df.div(df['A'], axis=0).sub(1)
A B C D
0 0.0 -0.821429 1.535714 0.500000
1 0.0 0.491525 0.508475 -0.745763
2 0.0 -0.452055 0.013699 -0.452055
3 0.0 2.187500 0.062500 0.812500
4 0.0 -0.632184 -0.839080 0.114943
5 0.0 -0.042105 -0.378947 -0.157895
6 0.0 -0.553191 -0.734043 -0.319149
...
98 0.0 -0.604651 -0.325581 -0.418605
99 0.0 0.649123 -0.964912 -0.631579
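If you want to convince yourself that the vectorised one-liner matches the column-wise pct_change from the question, here is a quick self-contained check (the seed and sample size are made up for illustration):
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, just for reproducibility
df = pd.DataFrame(rng.integers(1, 100, size=(100, 4)), columns=list('ABCD'))

vectorised = df.div(df['A'], axis=0).sub(1)

looped = df.copy()
for col in df.columns[1:]:
    looped[col] = df[["A", col]].pct_change(axis=1)[col]

# column A keeps its raw values in the loop version, so compare B..D only
print(np.allclose(vectorised[['B', 'C', 'D']], looped[['B', 'C', 'D']]))  # True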
I think the problem is that you cannot have two columns 'A' in df[["A", col]]. If you change df.columns to df.columns[1:], it runs without errors.
for col in df.columns[1:]:
    # Find the % increase/decrease of "col" compared to column A
    df[col] = df[["A", col]].pct_change(axis=1)[col]

print(df)
Result:
A B C D
0 33 -0.974288 1.757576 0.575758
1 74 -1.010044 -0.945946 -0.797297
2 62 -1.015869 0.064516 -0.145161
3 53 -0.998932 0.377358 0.075472
4 97 -1.010203 -0.948454 -0.340206
.. .. ... ... ...
95 88 -0.998838 -0.102273 -0.704545
96 59 -1.009193 -0.983051 -0.525424
97 52 -1.011464 0.134615 -0.903846
98 0 inf inf inf
99 33 -0.979798 -0.181818 -0.303030
I have a column that I'm trying to smooth out the results. Most of the data creates a smooth chart but sometimes I get a random spike. I want to reduce the impact of the spike.
My thought was to take the outlier and just make it the mean of the values around it, but I'm struggling and not getting the result I want.
Here's what I'm doing right now:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(5, 1)), columns=list('A'))

def aDetection(inputs):
    median = inputs["A"].median()
    std = inputs["A"].std()
    outliers = (inputs["A"] - median).abs() > std
    print("outliers")
    print(outliers)
    inputs[outliers]["A"] = np.nan  # this isn't working.
    inputs[outliers] = np.nan  # works but wipes out entire row
    inputs['A'].fillna(median, inplace=True)
    print("modified:")
    print(inputs)

print("original")
print(df)
aDetection(df)
original
A
0 4
1 86
2 40
3 99
4 97
outliers
0 True
1 False
2 True
3 False
4 False
Name: A, dtype: bool
modified:
A
0 86.0
1 86.0
2 86.0
3 99.0
4 97.0
For one, it seems to change the entire row, not just the single column. But the bigger problem is that all the outliers in my example become 86. I realize this is because I fill with the median of the entire column, but I would instead like the mean of the values surrounding the missing data.
For a single column, you can do your task with the following one-liner
(for readability folded into 2 lines):
df.A = df.A.mask((df.A - df.A.median()).abs() > df.A.std(),
                 pd.concat([df.A.shift(), df.A.shift(-1)], axis=1).mean(axis=1))
Details:
(df.A - df.A.median()).abs() > df.A.std() - computes outliers.
df.A.shift() - computes a Series of previous values.
df.A.shift(-1) - computes a Series of following values.
pd.concat(...) - creates a DataFrame from both the above Series.
mean(axis=1) - computes means by rows.
mask(...) - takes original values of A column for non-outliers
and the value from concat for outliers.
The result is:
A
0 86.0
1 86.0
2 92.5
3 99.0
4 97.0
If you want to apply this mechanism to all columns of your DataFrame,
then:
Change the above code to a function:
def replOutliers(col):
    return col.mask((col - col.median()).abs() > col.std(),
                    pd.concat([col.shift(), col.shift(-1)], axis=1).mean(axis=1))
Apply it (to each column):
df = df.apply(replOutliers)
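A self-contained sketch of that flow, with the question's sample column hard-coded for illustration:
import pandas as pd

def replOutliers(col):
    # outliers: values more than one standard deviation away from the median
    is_outlier = (col - col.median()).abs() > col.std()
    # replacement: mean of the previous and the next value in the same column
    neighbours = pd.concat([col.shift(), col.shift(-1)], axis=1).mean(axis=1)
    return col.mask(is_outlier, neighbours)

df = pd.DataFrame({'A': [4, 86, 40, 99, 97]})
print(df.apply(replOutliers))
#       A
# 0  86.0
# 1  86.0
# 2  92.5
# 3  99.0
# 4  97.0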
I'm using pandas's example to do what I want to do:
>>> s = pd.Series([90, 91, 85])
>>> s
0 90
1 91
2 85
dtype: int64
then the pct_change() is applied to this series:
>>> s.pct_change()
0 NaN
1 0.011111
2 -0.065934
dtype: float64
okay, fair enough, but Percentage Increase = [ (Final Value - Starting Value) / |Starting Value| ] × 100
so the results should actually be [NaN, 1.11111%, -6.59341%].
How would I get this ×100 part that pct_change() doesn't apply for me?
You can simply multiply the result by 100 to get what you want:
In [712]: s.pct_change().mul(100)
Out[712]:
0 NaN
1 1.111111
2 -6.593407
dtype: float64
If you want the result to be a list of these values, do this:
In [714]: l = s.pct_change().mul(100).tolist()
In [715]: l
Out[715]: [nan, 1.1111111111111072, -6.593406593406592]
Try chaining methods:
.pct_change().multiply(100)
after the desired df operation. You could chain more methods before or after.
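A minimal self-contained version of that chain, using the series from the question:
import pandas as pd

s = pd.Series([90, 91, 85])
# chain pct_change with mul (or multiply) to express the change in percent
pct = s.pct_change().mul(100)
print(pct.round(5).tolist())  # [nan, 1.11111, -6.59341]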
I am combining two dataframe columns from an Excel file into a new column, but the combined values change to decimal numbers. Here is my code:
My dataframe that I wish to combine:
cable_block pair
1 10
1 11
3 123
3 222
I insert a new column to have those two combined with a delimiter of /, so here is my code:
df['new_col'] = df[['cable_block', 'pair']].apply(lambda x: '/'.join(x.astype(str)), axis=1)
The result I get is:
cable_block pair new_col
1 10 1.0/10.0
1 11 1.0/11.0
3 123 3.0/123.0
3 222 3.0/222.0
After searching, I found good answers here by Psidom and Skirrebattie. So I tried:
df['new_col'] = df['new_col'].applymap(str)
and
df['new_col'] = df['new_col'].astype(str)
But it doesn't work the way it should. Looking at the code, it should work, and I find it weird that it doesn't.
Is there another work around?
First, to remove the trailing .0 ensure that data is int:
df = df.astype(int)
Then you can do:
df['cable_block'].astype(str) + '/' + df['pair'].astype(str)
0 1/10
1 1/11
2 3/123
3 3/222
dtype: object
Another option to ensure a correct formatting could be:
df.apply(lambda x: "%d/%d" %(x['cable_block'], x['pair']), axis=1)
0 1/10
1 1/11
2 3/123
3 3/222
dtype: object
Why not use astype?
df.astype(str).apply('/'.join,1)
Out[604]:
0 1/10
1 1/11
2 3/123
3 3/222
dtype: object
df['cable_block'].astype(int).astype(str) + '/' + df['pair'].astype(int).astype(str)
The data in your dataframe is probably floats, not ints.
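To see where the trailing .0 comes from and how the int cast removes it, a small sketch (the float columns here mimic what an Excel import can produce; the values mirror the question):
import pandas as pd

# Excel imports often hand you float columns, which is where the '.0' comes from
df = pd.DataFrame({'cable_block': [1.0, 1.0, 3.0, 3.0],
                   'pair': [10.0, 11.0, 123.0, 222.0]})

df['new_col'] = (df['cable_block'].astype(int).astype(str)
                 + '/' + df['pair'].astype(int).astype(str))
print(df['new_col'].tolist())  # ['1/10', '1/11', '3/123', '3/222']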
You can use a list comprehension and f-strings:
df['new_col'] = [f'{cable_block}/{pair}' for cable_block, pair in df.values]
print(df)
cable_block pair new_col
0 1 10 1/10
1 1 11 1/11
2 3 123 3/123
3 3 222 3/222
The approach compares reasonably well versus the alternatives:
df = pd.concat([df]*10000, ignore_index=True)
%timeit df['cable_block'].astype(str) + '/' + df['pair'].astype(str) # 62.8 ms
%timeit [f'{cable_block}/{pair}' for cable_block, pair in df.values] # 85.1 ms
%timeit list(map('/'.join, map(list, df.values.astype(str)))) # 157 ms
%timeit df.astype(str).apply('/'.join,1) # 1.11 s
I have a dataframe with 3 columns: x, y, time. There are a few thousand rows.
What I want to do is retrieve the row with the minimum time but I would like that the minimum should not be 0.
e.g.
x y time
240 1 28.5
240 2 19.3
240 240 0
240 19 9.7
So far what I've tried were the following:
df.loc[df['time'] > 0].min()
# this gives me a series and I want a row
# x 225.000000
# y 0.000000
# time 1.066606
df['time'].drop_duplicates().nsmallest(1)
# 225 0.0
I have also tried something with groupby:
df.loc[df.groupby('id_x', sort=False)['time'].idxmin()]
I had problems subsetting this one as well, as I usually got a series.
Try this:
In [69]: df.loc[df.time>0, 'time'].idxmin()
Out[69]: 3
or
In [72]: df.loc[[df.loc[df.time>0, 'time'].idxmin()]]
Out[72]:
x y time
3 240 19 9.7
You can filter out 0 values with query, get the index of the minimal value with idxmin, and finally select with loc:
s = df.loc[df.query('time != 0')['time'].idxmin()]
print (s)
x 240.0
y 19.0
time 9.7
Name: 3, dtype: float64
df = df.loc[[df.query('time != 0')['time'].idxmin()]]
print (df)
x y time
3 240 19 9.7
You don't need groupby at all. Here's an option with mask/where + loc + idxmin:
df.loc[[df.time.mask(df.time.eq(0)).idxmin()]]
Or,
df.loc[[df.time.where(df.time.ne(0)).idxmin()]]
x y time
3 240 19 9.7
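A quick end-to-end check of the where variant, with the question's sample data hard-coded:
import pandas as pd

df = pd.DataFrame({'x': [240, 240, 240, 240],
                   'y': [1, 2, 240, 19],
                   'time': [28.5, 19.3, 0, 9.7]})

# turn zeros into NaN so idxmin skips them, then select that row by label
row = df.loc[[df['time'].where(df['time'].ne(0)).idxmin()]]
print(row)
#      x   y  time
# 3  240  19   9.7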