Performing operations on a column with NaNs without removing them - python

I currently have a data frame like so:
treated    control
9.5        9.6
10         5
6          0
6          6
I want to get a log 2 ratio between treated and control, i.e. log2(treated/control). However, math.log2() breaks due to the 0 values in the control column (a zero division). Ideally, I would like to get the log 2 ratio using method chaining, e.g. a df.assign(), and simply put NaNs where it is not possible, like so:
treated    control    log_2_ratio
9.5        9.6        -0.0151
10         5          1.0
6          0          nan
6          6          0.0
I have managed to do this in an extremely round-about way (sketched below), where I have:
made a ratio column which is treated/control
done new_df = df.dropna() on this dataframe
applied the log 2 ratio to this
left-joined it back to the original df
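For reference, a rough sketch of that round-about route, using the column names from the example above (the inf → NaN replace is needed because pandas division by zero gives inf rather than NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({"treated": [9.5, 10, 6, 6], "control": [9.6, 5, 0, 6]})

# 1. make a ratio column; division by zero gives inf, so turn that into NaN
tmp = df.assign(ratio=(df["treated"] / df["control"]).replace([np.inf, -np.inf], np.nan))
# 2. drop the rows whose ratio cannot be logged
new_df = tmp.dropna(subset=["ratio"])
# 3. apply log 2 to the surviving rows
new_df = new_df.assign(log_2_ratio=np.log2(new_df["ratio"]))
# 4. left-join the new column back onto the original df via the index
result = df.join(new_df["log_2_ratio"])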
As always, any help is very much appreciated :)

You need to replace the inf with nan:
df.assign(log_2_ratio=np.log2(df['treated'].div(df['control'])).replace(np.inf, np.nan))
Output:
treated control log_2_ratio
0 9.5 9.6 -0.015107
1 10.0 5.0 1.000000
2 6.0 0.0 NaN
3 6.0 6.0 0.000000
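As a side note, if the treated column can also contain zeros, the ratio there is 0 and np.log2(0) gives -inf, which the single replace above does not catch; a small variant that handles both signs:
df.assign(log_2_ratio=np.log2(df['treated'].div(df['control'])).replace([np.inf, -np.inf], np.nan))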

To avoid the subsequent replacement you can go through an explicit condition instead (bearing in mind that treated * control is 0 exactly when either value is 0, so the condition is falsy for the rows that cannot be logged).
df.assign(log_2_ratio=lambda x: np.where(x.treated * x.control, np.log2(x.treated/x.control), np.nan))
Out[22]:
treated control log_2_ratio
0 9.5 9.6 -0.015107
1 10.0 5.0 1.000000
2 6.0 0.0 NaN
3 6.0 6.0 0.000000

Stick with the numpy log functions and you'll get an inf in the cells where the divide doesn't work. That seems like a better choice than nan anyway.
>>> df["log_2_ratio"] = np.log2(df.treated/df.control)
>>> df
treated control log_2_ratio
0 9.5 9.6 -0.015107
1 10.0 5.0 1.000000
2 6.0 0.0 inf
3 6.0 6.0 0.000000

Related

Pandas group by x, sort by y, select z, aggregating in case of multiple maximum values

Suppose I have a dataframe df:
import pandas as pd

df = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
                   'amount':   [2, 4, 5, 1, 2, 3, 5, 5, 5],
                   'x':        [2, 5, 8, 3, 6, 9, 3, 1, 0]})
group_id amount x
0 1 2 2
1 1 4 5
2 1 5 8
3 2 1 3
4 2 2 6
5 3 3 9
6 3 5 3
7 3 5 1
8 3 5 0
I want to group it by group_id, then pick x corresponding to the largest amount. The part which I cannot figure out is how to deal with cases where there are multiple rows with the maximum amount, for example the last 3 rows in the df above.
In such cases, I would like to aggregate the values of x using the mean, median or mode of x. I am trying to find a solution in which I can plug in any of these three aggregation methods.
I have seen many questions here which solve the problem without dealing with multiple maximum values. For example, I could do something like this:
df.sort_values('amount', ascending=False).groupby('group_id').first().x
But I do not know how to implement different aggregation approaches.
EDIT: The second part of this question is here.
If I understand your question right, you can use a custom function with GroupBy.apply:
out = df.groupby("group_id").apply(
    lambda x: pd.Series(
        {
            "mean": (d := x.loc[x["amount"] == x["amount"].max(), "x"]).mean(),
            "median": d.median(),
            "mode": d.mode()[0],
        }
    )
)
print(out)
Prints:
mean median mode
group_id
1 8.000000 8.0 8.0
2 6.000000 6.0 6.0
3 1.333333 1.0 0.0
Or .describe():
out = df.groupby("group_id").apply(
    lambda x: x.loc[x["amount"] == x["amount"].max(), "x"].describe()
)
print(out)
Prints:
x count mean std min 25% 50% 75% max
group_id
1 1.0 8.000000 NaN 8.0 8.0 8.0 8.0 8.0
2 1.0 6.000000 NaN 6.0 6.0 6.0 6.0 6.0
3 3.0 1.333333 1.527525 0.0 0.5 1.0 2.0 3.0
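For what it's worth, a sketch of an alternative that avoids the Python-level apply: keep only the rows carrying each group's maximum amount via transform('max'), then aggregate x directly (same df as above; mean shown, median/mode work the same way):
# rows whose amount equals their group's maximum
top = df[df['amount'] == df.groupby('group_id')['amount'].transform('max')]
# aggregate x over those rows, per group
out = top.groupby('group_id')['x'].mean()
print(out)
# group_id
# 1    8.000000
# 2    6.000000
# 3    1.333333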

Forward fill on custom value in pandas dataframe

I am looking to perform a forward fill on some dataframe columns.
The ffill method replaces missing values (NaN) with the previous filled value.
In my case, I would like to perform a forward fill, with the difference that I don't want to do it on NaN but on a specific value (say "*").
Here's an example
import pandas as pd
import numpy as np
d = [{"a": 1, "b": 10},
     {"a": 2, "b": "*"},
     {"a": 3, "b": "*"},
     {"a": 4, "b": "*"},
     {"a": np.nan, "b": 50},
     {"a": 6, "b": 60},
     {"a": 7, "b": 70}]
df = pd.DataFrame(d)
with df being
a b
0 1.0 10
1 2.0 *
2 3.0 *
3 4.0 *
4 NaN 50
5 6.0 60
6 7.0 70
The expected result should be
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
If I replace "*" with np.nan and then ffill, that would also apply the forward fill to column a, whose NaN should stay untouched.
Since my data has hundreds of columns, I was wondering if there is a more efficient way than looping over all columns, checking whether each one contains "*", then replacing and forward filling.
You can use df.mask with df.isin and df.replace:
df.mask(df.isin(['*']), df.replace('*', np.nan).ffill())
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
I think you're going in the right direction, but here's a complete solution. What I'm doing is 'marking' the original NaN values, then replacing "*" with NaN, using ffill, and then putting the original NaN values back.
df = df.replace(np.nan, "<special>").replace("*", np.nan).ffill().replace("<special>", np.nan)
output:
a b
0 1.0 10.0
1 2.0 10.0
2 3.0 10.0
3 4.0 10.0
4 NaN 50.0
5 6.0 60.0
6 7.0 70.0
And here's an alternative solution that does the same thing, without the 'special' marking:
original_nan = df.isna()
df = df.replace("*", np.nan).ffill()
df[original_nan] = np.nan

Dataframe operation

I have a dataframe populated with different zeros and values different than zero. For each row I want to apply the following condition:
If the value in the given cell is different than zero AND the value in the cell to the right is zero, then put the same value in the cell to the right.
The example would be the following:
This is the one of the rows in the dataframe now:
[0,0,0,20,0,0,0,33,3,0,5,0,0,0,0,0]
The function would convert it to the following:
[0,0,0,20,20,20,20,33,3,3,5,5,5,5,5,5]
I want to apply this to the whole dataframe.
Your help would be much appreciated!
Thank you.
Since you imply you are using Pandas, I would leverage a bit of the built-in muscle in the library.
import pandas as pd
import numpy as np
s = pd.Series([0,0,0,20,0,0,0,33,3,0,5,0,0,0,0,0])
s.replace(0, np.nan, inplace=True)
s = s.ffill()
Output:
0 NaN
1 NaN
2 NaN
3 20.0
4 20.0
5 20.0
6 20.0
7 33.0
8 3.0
9 3.0
10 5.0
11 5.0
12 5.0
13 5.0
14 5.0
15 5.0
dtype: float64
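The snippet above handles a single row as a Series; assuming the values sit across the columns of each row (as in the example), one sketch for the whole dataframe is to forward fill along axis=1 and then restore the leading zeros:
import numpy as np
import pandas as pd

# the example row from the question, as a one-row dataframe
df = pd.DataFrame([[0, 0, 0, 20, 0, 0, 0, 33, 3, 0, 5, 0, 0, 0, 0, 0]])

out = (df.replace(0, np.nan)   # treat zeros as missing
         .ffill(axis=1)        # carry the last non-zero value to the right
         .fillna(0))           # leading zeros, with nothing to their left, stay 0
print(out.astype(int).values.tolist())
# [[0, 0, 0, 20, 20, 20, 20, 33, 3, 3, 5, 5, 5, 5, 5, 5]]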

dataframe previous rows mean

I have the following dataframe and want to create a new column 'measurement_mean'. I expect the value of each row in the new column to be the mean of all the previous 'Measurement' values.
How can I do this?
Measurement
0 2.0
1 4.0
2 3.0
3 0.0
4 100.0
5 3.0
6 2.0
7 1.0
Use pandas.Series.expanding
df['measurement_mean'] = df.Measurement.expanding().mean()
df['measurement_mean'] = df.Measurement.cumsum()/(df.index+1)
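Both one-liners give the same result on the data above (the cumsum variant relies on the default 0..n-1 RangeIndex); a quick self-contained sketch:
import pandas as pd

df = pd.DataFrame({'Measurement': [2.0, 4.0, 3.0, 0.0, 100.0, 3.0, 2.0, 1.0]})
df['measurement_mean'] = df.Measurement.expanding().mean()
print(df['measurement_mean'].round(2).tolist())
# [2.0, 3.0, 3.0, 2.25, 21.8, 18.67, 16.29, 14.38]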

Fastest way to compare all rows of a DataFrame

I have written a program (in Python 3.6) that tries to map the columns of a user's csv/excel file to a template xls I have. So far so good, but part of this process has to be processing the user's data, which are contacts. For example, I want to delete duplicates, merge data, etc. To do this I need to compare every row to all other rows, which is costly. Every user's csv I read has roughly 2000-4000 rows, but I want it to be efficient for even more rows. I have stored the data in a pd.DataFrame.
Is there a more efficient way to do the comparisons beside brute force?
Thanks
First, what code have you tried?
But to delete duplicates, this is very easy in pandas. Example below:
import pandas as pd
import numpy as np
# Creating the Test DataFrame below -------------------------------
dfp = pd.DataFrame({'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 5, np.nan],
                    'B': [1, 0, 3, 5, 0, 0, np.nan, 9, 0, 0],
                    'C': ['AA1233445', 'A9875', 'rmacy', 'Idaho Rx', 'Ab123455',
                          'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
                    'D': [123456, 123456, 1234567, 12345678, 12345, 12345,
                          12345678, 123456789, 1234567, np.nan],
                    'E': ['Assign', 'Unassign', 'Assign', 'Ugly', 'Appreciate',
                          'Undo', 'Assign', 'Unicycle', 'Assign', 'Unicorn']})
print(dfp)
#Output Below----------------
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign
1 NaN 0.0 A9875 123456.0 Unassign
2 3.0 3.0 rmacy 1234567.0 Assign
3 4.0 5.0 Idaho Rx 12345678.0 Ugly
4 5.0 0.0 Ab123455 12345.0 Appreciate
5 5.0 0.0 TV192837 12345.0 Undo
6 3.0 NaN RX 12345678.0 Assign
7 1.0 9.0 Ohio Drugs 123456789.0 Unicycle
8 5.0 0.0 RX12345 1234567.0 Assign
9 NaN 0.0 USA Pharma NaN Unicorn
# Show the records whose value in column A duplicates an earlier row
# (these are the rows that would be dropped); keep='first' treats the
# first occurrence as not duplicated.
df2 = dfp[dfp.duplicated(['A'], keep='first')]
#output
A B C D E
1 NaN 0.0 A9875 123456.0 Unassign
5 5.0 0.0 TV192837 12345.0 Undo
6 3.0 NaN RX 12345678.0 Assign
8 5.0 0.0 RX12345 1234567.0 Assign
9 NaN 0.0 USA Pharma NaN Unicorn
If you want a new dataframe with no dupes, checked across all columns, use the tilde. The ~ operator inverts the boolean mask (a logical NOT), so it keeps only the rows that are not flagged as duplicates; see the pandas documentation for DataFrame.duplicated.
df2 = dfp[~dfp.duplicated(keep='first')]
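For the everyday case, pandas also has a dedicated method, DataFrame.drop_duplicates, which does the same thing as the tilde filter (continuing with dfp from the example above):
# equivalent to dfp[~dfp.duplicated(keep='first')]
df2 = dfp.drop_duplicates(keep='first')

# or de-duplicate on a subset of columns, e.g. column A only
df2 = dfp.drop_duplicates(subset=['A'], keep='first')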
