Use .apply to recode nan rows into a different value - python

I am trying to create a new groupid column based on the original groupid, which takes the values 0 and 1. I used the following code, but it failed to recode the NaN rows to 2.
final['groupid2'] = final['groupid'].apply(lambda x: 2 if x == np.nan else x)
I also tried the following code, but it raised an AttributeError:
final['groupid2'] = final['groupid'].apply(lambda x: 2 if x.isnull() else x)
Could someone please explain why this is the case? Thanks

Use pd.isnull to check scalars if you need apply:
final = pd.DataFrame({'groupid': [1, 0, np.nan],
                      'B': [400, 500, 600]})
final['groupid2'] = final['groupid'].apply(lambda x: 2 if pd.isnull(x) else x)
print (final)
groupid B groupid2
0 1.0 400 1.0
1 0.0 500 0.0
2 NaN 600 2.0
Details:
The value x inside the lambda is a scalar, because Series.apply loops over each value of the column. That is why the Series method isnull() fails with an AttributeError on a plain float.
For easier debugging, you can rewrite the lambda as a named function:
def f(x):
    print (x)
    print (pd.isnull(x))
    return 2 if pd.isnull(x) else x

final['groupid2'] = final['groupid'].apply(f)
1.0
False
0.0
False
nan
True
But Series.fillna is better here:
final['groupid2'] = final['groupid'].fillna(2)
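As for why the first attempt fails: NaN is not equal to anything, not even to itself, so the test x == np.nan is always False. A minimal sketch of this behaviour:

import numpy as np

# NaN compares unequal to everything, including itself,
# so an equality test can never detect missing values.
print(np.nan == np.nan)   # False
print(np.isnan(np.nan))   # True  (use np.isnan or pd.isnull instead)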

Related

Compare columns of two dataframes with custom functions

Given the following two dataframes:
df1 = pd.DataFrame(data={'unicorn': ['blue', 'red', 'piNk'], 'size': [3, 4, 6]})
df2 = pd.DataFrame(data={'unicorn': ['red'], 'size': [2]})
df1:
unicorn size
0 blue 3
1 red 4
2 piNk 6
df2 (always has one row):
unicorn size
0 red 2
How can I compare the rows of both dataframes column-wise using custom comparison functions like this (simplified):
def unicorn_comparison(str1, str2) -> float:
    return 100.0 if str1 == str2 else 0.0
and
def size_comparison(nr1, nr2) -> float:
    return 100.0 if nr1 < nr2 else 0.0
Expected result:
unicorn size
0 0.0 0.0
1 100.0 0.0
2 0.0 0.0
Since df2 always has a single row, don't use a DataFrame (2D) but a Series (1D).
ser = df2.loc[0]
Then, assuming you just want a comparison, use vectorized code (not a custom function):
out = df1.eq(ser)*100
If you really need to use non-vectorized functions and have to compare all combinations, use:
def unicorn_comparison(str1, str2) -> float:
    return 100.0 if str1 == str2 else 0.0

def size_comparison(nr1, nr2) -> float:
    return 100.0 if nr1 < nr2 else 0.0

funcs = {'unicorn': unicorn_comparison,
         'size': size_comparison,
        }
out = df1.apply(lambda c: c.apply(lambda s: funcs[c.name](s, ser[c.name])))
output:
unicorn size
0 0 0
1 100 0
2 0 0
Another way: first, add the df2 column you want to df1.
df1['unicorn2'] = df2['unicorn']
Then you can use an apply loop and run whatever comparison logic you want inside the applied function.
def function(x):
    # your logic
    return x
df1_result = df1.apply(function)
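For completeness, a hedged sketch of that row-wise apply idea, reusing the question's data and comparison functions (nothing beyond the apply pattern comes from the original answer):

import pandas as pd

def unicorn_comparison(str1, str2) -> float:
    return 100.0 if str1 == str2 else 0.0

def size_comparison(nr1, nr2) -> float:
    return 100.0 if nr1 < nr2 else 0.0

df1 = pd.DataFrame({'unicorn': ['blue', 'red', 'piNk'], 'size': [3, 4, 6]})
df2 = pd.DataFrame({'unicorn': ['red'], 'size': [2]})

row = df2.loc[0]  # the single comparison row
out = df1.apply(lambda r: pd.Series({'unicorn': unicorn_comparison(r['unicorn'], row['unicorn']),
                                     'size': size_comparison(r['size'], row['size'])}),
                axis=1)
print(out)
#    unicorn  size
# 0      0.0   0.0
# 1    100.0   0.0
# 2      0.0   0.0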
for col in df1:
    df1[col] = (df1[col] == df2[col].loc[0]).replace({True: 100, False: 0})
This will overwrite your df1, or you can make a copy of it first.

Avoiding writing a dataframe with a large number of columns

I have a dataframe that looks like this:
student school class answer question
a scl first True x
a scl first False y
a scl first True y
b scl first False x
c scl sec False y
c scl sec True z
d scl sec True x
d scl sec True z
e scl third True z
e scl third False z
Note that it is possible to answer a question multiple times. Note also that not everyone may answer the same set of questions. I want to see which class performed better per question. So for each question, a ranking of the classes, one time when I consider only the first answer of a student, and one time overall.
What I did so far is just a ranking of the classes independent of what question was answered:
#only the first answer is considered
df1 = df.drop_duplicates(subset=["student", "scl", "class", "question"], keep="first")
(df1.groupby(['school', 'class'])
['answer'].mean()
.rename('ClassRanking')
.sort_values(ascending=False)
.reset_index()
)
#all the answers are considered
(df.groupby(['school', 'class'])
['answer'].mean()
.rename('ClassRanking')
.sort_values(ascending=False)
.reset_index()
)
So I do indeed have a ranking of the classes. But I don't know how to compare the classes per question, because I don't want to create a dataframe with 50 columns when I have 50 classes.
Edit:
I would imagine a dataframe like this, but this is a bit ugly when I have 50 classes:
df_all=
question class_first_res class_sec_res class_third_res
x 0.5 1 None
y 0.5 0 None
z None 1 0.5
df_first_attempt=
question class_first_res class_sec_res class_third_res
x 0.5 1 None
y 0 0 None
z None 1 1
If I understood you correctly:
df_first = df.drop_duplicates(subset=['student', 'class', 'question'], keep='first').groupby(['class', 'question'])['answer'].apply(lambda x: x.sum()/len(x)).reset_index()
df_first = df_first.sort_values(by=['question']).rename(columns={'answer': 'ClassRanking'})
df_first = df_first.pivot_table(index='question', columns='class', values='ClassRanking').reset_index().rename_axis(None, axis=1)
df_overall = df.groupby(['class', 'question'])['answer'].apply(lambda x: x.sum()/len(x)).reset_index()
df_overall = df_overall.sort_values(by=['question']).rename(columns={'answer': 'ClassRanking'})
df_overall = df_overall.pivot_table(index='question', columns='class', values='ClassRanking').reset_index().rename_axis(None, axis=1)
df_first:
question first sec third
0 x 0.5 1.0 NaN
1 y 0.0 0.0 NaN
2 z NaN 1.0 1.0
df_overall:
question first sec third
0 x 0.5 1.0 NaN
1 y 0.5 0.0 NaN
2 z NaN 1.0 0.5
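As a hedged aside, not part of the original answer: since 'answer' is boolean, x.sum()/len(x) is simply the mean, so each pipeline can also be written with .mean(), e.g. for the overall case:

df_overall = (df.groupby(['class', 'question'])['answer'].mean()
                .rename('ClassRanking')
                .reset_index()
                .pivot_table(index='question', columns='class', values='ClassRanking')
                .reset_index()
                .rename_axis(None, axis=1))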
You could try this.
pd.pivot_table(df, index="class", columns="question", values="answer")
It is similar to your examples, just with rows and columns swapped, but the content is the same.
On the other hand, if you want a ranking of all the classes based on their average success on the questions, you could do this:
pd.pivot_table(df, index="question", columns="class", values="answer").mean()

pandas "where" function does not appear to short-circuit

I'm probably misunderstanding how this works.
I was surprised that, given this dataframe:
   A    B           C    D
0  0  9.0  Nonnumeric  9.0
1  2  9.0       Num0a  9.0
This DOES appear to short circuit (--GOOD!):
dfzero["B"] = pd.DataFrame.where(
cond = dfzero["A"] != 0,
self = 1/dfzero["A"],
other = 0)
But this does NOT (--BAD!):
(it raises an error, as there is no short-circuit):
df["D"] = pd.DataFrame.where(
cond = df["C"].str.len() == 5,
self = df["C"].str[-2:].apply(lambda x: int(x, 16)),
other = 0)
The error is:
self = (df["C"].str[-2:].apply(lambda x: int(x, 16))),
ValueError: invalid literal for int() with base 16: 'ic'
No, even the first method does NOT short circuit. Both of the operands must first be evaluated before the result is computed. Meaning, this is computed,
i = dfzero["A"] != 0
i
0 False
1 True
Name: A, dtype: bool
And so is this:
j = 1 / dfzero['A']
j
0 inf
1 0.500000
Name: A, dtype: float64
The expression is effectively:
pd.DataFrame.where(i, j, 0)
It's the same for the second. The behaviour is consistent.
Were you expecting a ZeroDivisionError? You won't get that with numpy or pandas, because these libraries assume you know what you're doing when you compute such quantities.
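A minimal sketch of that point (the Series here is illustrative, not from the question):

import pandas as pd

s = pd.Series([0.0, 2.0])

# Both operands of .where are built eagerly; dividing by zero yields inf,
# it never raises ZeroDivisionError.
recip = 1 / s
print(recip)                     # [inf, 0.5]

# .where then only selects between the two already-computed operands.
print(recip.where(s != 0, 0))    # [0.0, 0.5]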
Your option here is to precompute the mask, and then compute the result for those rows only.
m = df["C"].str.len() == 5
df['D'] = df.loc[m, 'C'].str[-2:].apply(lambda x: int(x, 16))
df
A B C D
0 0 9.0 Nonnumeric NaN
1 2 9.0 Num0a 10.0
If you want to fill in the NaNs, use df.loc[~m, 'D'] = fill_value.
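For instance, a hedged one-liner for that fill step, using 0 as the fill value:

df.loc[~m, 'D'] = 0    # or equivalently here: df['D'] = df['D'].fillna(0)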

Python Pandas: Passing arguments to a function in agg()

I am trying to reduce data in a pandas dataframe by using different kind of functions and argument values. However, I did not manage to change the default arguments in the aggregation functions. Here is an example:
>>> df = pd.DataFrame({'x': [1,np.nan,2,1],
... 'y': ['a','a','b','b']})
>>> df
x y
0 1.0 a
1 NaN a
2 2.0 b
3 1.0 b
Here is an aggregation function, for which I would like to test different values of b:
>>> def translate_mean(x, b=10):
... y = [elem + b for elem in x]
... return np.mean(y)
In the following code, I can use this function with the default b value, but I would like to pass other values:
>>> df.groupby('y').agg(translate_mean)
x
y
a NaN
b 11.5
Any ideas?
Just pass them as keyword arguments to agg (this works with apply, too).
df.groupby('y').agg(translate_mean, b=4)
Out:
x
y
a NaN
b 5.5
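As a hedged alternative that is not part of the original answer, you can also bind the argument up front with functools.partial, so agg (or apply) receives a one-argument callable:

from functools import partial

# Equivalent to agg(translate_mean, b=4): the keyword is fixed before aggregation.
df.groupby('y').agg(partial(translate_mean, b=4))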
Maybe you can try using apply in this case:
df.groupby('y').apply(lambda x: translate_mean(x['x'], 20))
Now the result is:
y
a NaN
b 21.5
In case you have multiple columns and want to apply a different function and different parameters to each column, you can use lambda functions with agg.
For example:
>>> df = pd.DataFrame({'x': [1,np.nan,2,1],
...                    'y': ['a','a','b','b'],
...                    'z': [0.1,0.2,0.3,0.4]})
>>> df
x y z
0 1.0 a 0.1
1 NaN a 0.2
2 2.0 b 0.3
3 1.0 b 0.4
>>> def translate_mean(x, b=10):
... y = [elem + b for elem in x]
... return np.mean(y)
To group by column 'y' and apply translate_mean with b=10 to column 'x' and b=25 to column 'z', you can try this:
df_res = df.groupby(by='y').agg({
    'x': lambda x: translate_mean(x, 10),
    'z': lambda x: translate_mean(x, 25)})
Hopefully, it helps.

Funny results with pandas argsort

I think I have hit on a bug in pandas. I was hoping to get some help either verifying the bug or helping me figure out where my logic error is located in my code.
My code is as follows:
import pandas, numpy, StringIO

def sq_fixer(sr):
    sr = sr.where(sr != '20200229')
    ranks = sr.argsort().astype(float)
    ranks[ranks == -1] = numpy.nan
    return ','.join(ranks.astype(numpy.str))

def correct_date(sr):
    date_fixer = lambda x: pandas.datetime(x.year - 100, x.month, x.day) if x > pandas.datetime.now() else x
    sr = pandas.to_datetime(sr).apply(date_fixer).astype(pandas.datetime)
    return sr
txt = '''ID,RUN_START_DATE,PUSHUP_START_DATE,SITUP_START_DATE,PULLUP_START_DATE
1,2013-01-24,2013-01-02,,2013-02-03
2,2013-01-30,2013-01-21,2013-01-13,2013-01-06
3,2013-01-29,2013-01-28,2013-01-01,2013-01-29
4,2013-02-16,2013-02-12,2013-01-04,2013-02-11
5,2013-01-06,2013-02-07,2013-02-25,2013-02-12
6,2013-01-26,2013-01-28,2013-02-12,2013-01-10
7,2013-01-26,,2013-01-12,2013-01-30
8,2013-01-03,2013-01-24,2013-01-19,2013-01-02
9,2013-01-22,2013-01-13,2013-02-03,
10,2013-02-06,2013-01-16,2013-02-07,2013-01-11
3347,,2008-02-27,2008-04-10,2008-02-13
3588,2004-09-12,,2004-11-06,2004-09-06
3784,2003-02-22,,2003-06-21,2003-02-19
593,2009-04-03,,2009-06-01,2009-04-01
4148,2003-03-21,2002-09-20,2003-04-01,2003-01-01
4299,2004-05-24,2004-07-23,,2004-04-22
4590,2005-05-05,2005-12-05,2005-04-05,
4830,2001-06-12,2000-10-12,2001-07-28,2001-01-28
4941,2006-11-08,2006-12-19,2006-07-19,2007-02-24
1416,2004-04-03,2004-05-19,2004-02-06,
1580,2008-12-20,,2009-03-19,2008-12-19
1661,2005-10-03,2005-10-26,2005-09-12,2006-02-19
1759,2001-10-18,,2002-01-17,2001-10-17
1858,2003-04-14,2003-05-17,,2002-12-17
1972,2003-06-01,2003-07-14,2002-12-14,
5905,2000-11-18,2001-01-13,,2000-11-04
2052,2002-06-11,,2002-08-23,2001-12-12
2165,2006-10-01,,2007-02-27,2006-09-30
2218,2007-09-19,,2008-02-06,2007-09-09
2350,2000-08-08,,2000-09-22,2000-01-08
2432,2001-08-22,,2001-09-25,2000-12-16
2611,2005-05-07,,2005-06-05,2005-03-26
2612,2005-05-06,,2005-05-26,2005-04-11
7378,2009-08-07,2009-01-30,2010-01-20,2009-06-08
7550,2006-04-08,,2006-06-01,2006-04-01 '''
df = pandas.read_csv(StringIO.StringIO(txt))
sequence_array = ['RUN_START_DATE', 'PUSHUP_START_DATE', 'SITUP_START_DATE', 'PULLUP_START_DATE']
xsequence_array = ['X_RUN_START_DATE', 'X_PUSHUP_START_DATE', 'X_SITUP_START_DATE', 'X_PULLUP_START_DATE']
df[sequence_array] = df[sequence_array].apply(correct_date, axis=1)
fix_day = lambda x: x if x > 0 else 29
fix_month = lambda x: x if x > 0 else 02
fix_year = lambda x: x if x > 0 else 2020
for col in sequence_array:
    xcol = 'X_{0}'.format(col)
    df[xcol] = ['{0:04d}{1:02d}{2:02d}'.format(fix_year(c.year), fix_month(c.month), fix_day(c.day)) for c in df[col]]
df['X_AS_SEQUENCE'] = df[xsequence_array].apply(sq_fixer, axis=1)
When I run the code most of the results are correct. Take for example index 6:
In [31]: df.ix[6]
Out[31]:
ID 7
RUN_START_DATE 2013-01-26 00:00:00
PUSHUP_START_DATE NaN
SITUP_START_DATE 2013-01-12 00:00:00
PULLUP_START_DATE 2013-01-30 00:00:00
X_RUN_START_DATE 20130126
X_PUSHUP_START_DATE 20200229
X_SITUP_START_DATE 20130112
X_PULLUP_START_DATE 20130130
X_AS_SEQUENCE 1.0,nan,0.0,2.0
However, certain indices seem to throw pandas.argsort() for a loop. Take for example index 10:
In [32]: df.ix[10]
Out[32]:
ID 3347
RUN_START_DATE NaN
PUSHUP_START_DATE 2008-02-27 00:00:00
SITUP_START_DATE 2008-04-10 00:00:00
PULLUP_START_DATE 2008-02-13 00:00:00
X_RUN_START_DATE 20200229
X_PUSHUP_START_DATE 20080227
X_SITUP_START_DATE 20080410
X_PULLUP_START_DATE 20080213
X_AS_SEQUENCE nan,2.0,0.0,1.0
The argsort should return nan,1.0,2.0,0.0 instead of nan,2.0,0.0,1.0.
I have been on this for three days. At this point I am not sure if it is me or a bug. I am not sure how to backtrace it to get an answer. Any help would be most appreciated!
You might be interpreting the result of argsort incorrectly. argsort does not give the ranking of the values. Use the rank method if you want to rank the values.
The values in the Series returned by argsort give the corresponding positions of the original values after dropping the NaNs. In your case, since you convert 20200229 to NaN, you are argsorting NaN, 20080227, 20080410, 20080213. The non-NaN values are
nonnan = [20080227, 20080410, 20080213]
The result, NaN, 2, 0, 1 says:
argsort    sorted values
NaN        NaN
2          nonnan[2] = 20080213
0          nonnan[0] = 20080227
1          nonnan[1] = 20080410
So it looks OK to me.
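The zero-based ranking the question expected (nan, 1.0, 2.0, 0.0) is what rank gives you directly. A minimal hedged sketch using the values from index 10:

import numpy as np
import pandas as pd

# The row from index 10, with 20200229 already replaced by NaN.
sr = pd.Series([np.nan, 20080227.0, 20080410.0, 20080213.0])

# rank() gives 1-based ordinal positions and keeps NaN as NaN: nan, 2.0, 3.0, 1.0
print(sr.rank())

# Subtract 1 for zero-based ranks, matching the expected nan, 1.0, 2.0, 0.0
print(sr.rank() - 1)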
If you want to sort a Series, just use the sort_values() or rank() functions:
In [2]: a=pd.Series([3,2,1])
In [3]: a
Out[3]:
0 3
1 2
2 1
dtype: int64
In [4]: a.sort_values()
Out[4]:
2 1
1 2
0 3
dtype: int64
If you use argsort(), it gives you, for each position of the sorted series, the index of the element in the original series that belongs there.
In this case the smallest value 1 sits at original position 2, then 2 at position 1, then 3 at position 0:
In [5]: a.argsort()
Out[5]:
0 2
1 1
2 0
dtype: int64
