pandas "where" function does not appear to short-circuit

pandas "where" function does not appear to short-circuit - python

I'm probably misunderstanding how this works.
I was surprised that, given this dataframe:
A B C D
0 9.0 Nonnumeric 9.0
2 9.0 Num0a 9.0
This DOES appear to short circuit (--GOOD!):
dfzero["B"] = pd.DataFrame.where(
cond = dfzero["A"] != 0,
self = 1/dfzero["A"],
other = 0)
But this does NOT (--BAD!):
(gives a divide by zero error, as there is no short-circuit):
df["D"] = pd.DataFrame.where(
cond = df["C"].str.len() == 5,
self = df["C"].str[-2:].apply(lambda x: int(x, 16)),
other = 0)
The error is:
self = (df["C"].str[-2:].apply(lambda x: int(x, 16))),
ValueError: invalid literal for int() with base 16: 'ic'

No, even the first method does NOT short circuit. Both of the operands must first be evaluated before the result is computed. Meaning, this is computed,
i = dfzero["A"] != 0
i
0 False
1 True
Name: A, dtype: bool
And so is this:
j = 1 / dfzero['A']
j
0 inf
1 0.500000
Name: A, dtype: float64
The expression is effectively:
pd.DataFrame.where(i, j, 0)
It's the same for the second. The behaviour is consistent.
Were you expecting a ZeroDivisionError? You won't get that with numpy or pandas, because these libraries assume you know what you're doing when you compute such quantities.
Your option here is to precompute the mask, and then compute the result for those rows only.
m = df["C"].str.len() == 5
df['D'] = df.loc[m, 'C'].str[-2:].apply(lambda x: int(x, 16))
df
A B C D
0 0 9.0 Nonnumeric NaN
1 2 9.0 Num0a 10.0
If you want to fill in the NaNs, use df.loc[~m, 'D'] = fill_value.

Related

Data problem: identifying data rows where colleagues have reached a consensus

I have a table that shows the results of four colleagues trying to classify several objects as either a, b, c or d. If the colleagues were able to agree on the classification, or if only one colleague is able to classify the object, then in a new column I want to show the colleague's classification. If the colleagues disagree, I want to create a separate dataframe that displays
those objects. For each object, at max only two colleagues are assigned to try classify it, so there won't be a situation where three colleagues cannot agree on the classification.
It is easy to show an object's classification if only one colleague is able to identify it, but I am struggling when there are two. I can only get as far as the following given my noob python skills.
The end result I am looking for, is 'a' for the first row, 'b' for third, and 'd' for fourth. The second row would be singled out for manual classification by a more experienced colleague.
df_test = pd.DataFrame({'check1':['a','a','unknown','d'],
'check2':['unknown','b','unknown','unknown'],
'check3':['unknown','unknown','c','d'],
'check4':['unknown','unknown','c','unknown']})
cols = ['check_ind','check1_ind','check2_ind','check3_ind','check4_ind']
for col in cols:
df_test[col] = 0
checks = [('check1','check1_ind'),('check2','check2_ind'),('check3','check3_ind'),('check4','check4_ind')]
rows = df_test.shape[0]
for r in range(rows):
for c in checks:
if df_test.iloc[r, df_test.columns.get_loc(c[0])] != 'unknown':
df_test.iloc[r, df_test.columns.get_loc(c[1])] = 1
sumcolumn = df_test['check1_ind'] + df_test['check2_ind'] + df_test['check3_ind'] + df_test['check4_ind']
df_test['body_check'] = sumcolumn

df.replace('unknown', np.nan, inplace=True)
df.apply(lambda x: x.dropna().unique()[0] if x.nunique() == 1 else 'No Consensus', axis=1)
Output:
0 a
1 No Consensus
2 c
3 d
dtype: object
In use:
df['consensus'] = df.apply(lambda x: x.dropna().unique()[0] if x.nunique() == 1 else np.nan, axis=1)
print(df)
...
check1 check2 check3 check4 consensus
0 a NaN NaN NaN a
1 a b NaN NaN NaN
2 NaN NaN c c c
3 d NaN d NaN d

Something like this should do the trick:
def function(series):
val_counts = series.value_counts()
if val_counts.size > 1:
return 'No Consensus'
else:
return val_counts.index[0]
df_test.replace({'unknown': np.nan}).apply(function, axis=1)

For an efficient, vectorial, approach, use mode:
df2 = (df_test
.mask(df_test.eq('unknown'))
.mode(1)
# ensure having a "1" column
.reindex(columns=[0,1])
)
print(df2)
# 0 1
# 0 a NaN
# 1 a b
# 2 c NaN
# 3 d NaN
m = df2[1].notna()
df_test['consensus'] = df2[0].mask(m, 'No consensus')
print(df_test)
Output:
check1 check2 check3 check4 consensus
0 a unknown unknown unknown a
1 a b unknown unknown No consensus
2 unknown unknown c c c
3 d unknown d unknown d

Pandas: Determine if a string in one column is a substring of a string in another column

Consider these series:
>>> a = pd.Series('abc a abc c'.split())
>>> b = pd.Series('a abc abc a'.split())
>>> pd.concat((a, b), axis=1)
0 1
0 abc a
1 a abc
2 abc abc
3 c a
>>> unknown_operation(a, b)
0 False
1 True
2 True
3 False
The desired logic is to determine if the string in the left column is a substring of the string in the right column. pd.Series.str.contains does not accept another Series, and pd.Series.isin checks if the value exists in the other series (not in the same row specifically). I'm interested to know if there's a vectorized solution (not using .apply or a loop), but it may be that there isn't one.

Let us try with numpy defchararray which is vectorized
from numpy.core.defchararray import find
find(df['1'].values.astype(str),df['0'].values.astype(str))!=-1
Out[740]: array([False, True, True, False])

IIUC,
df[1].str.split('', expand=True).eq(df[0], axis=0).any(axis=1) | df[1].eq(df[0])
Output:
0 False
1 True
2 True
3 False
dtype: bool

I tested various functions with a randomly generated Dataframe of 1,000,000 5 letter entries.
Running on my machine, the averages of 3 tests showed:
zip > v_find > to_list > any > apply
0.21s > 0.79s > 1s > 3.55s > 8.6s
Hence, i would recommend using zip:
[x[0] in x[1] for x in zip(df['A'], df['B'])]
or vectorized find (as proposed by BENY)
np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
My test-setup:
def generate_string(length):
return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))
A = [generate_string(5) for x in range(n)]
B = [generate_string(5) for y in range(n)]
df = pd.DataFrame({"A": A, "B": B})
to_list = pd.Series([a in b for a, b in df[['A', 'B']].values.tolist()])
apply = df.apply(lambda s: s["A"] in s["B"], axis=1)
v_find = np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
any = df["B"].str.split('', expand=True).eq(df["A"], axis=0).any(axis=1) | df["B"].eq(df["A"])
zip = [x[0] in x[1] for x in zip(df['A'], df['B'])]

Why does Python function return 1.0 (float) when `return 1` is specified?

I have a lot of strings, some of which consist of 1 sentence and some consisting of multiple sentences. My goal is to determine which one-sentence strings end with an exclamation mark '!'.
My code gives a strange result. Instead of returning '1' if found, it returns 1.0. I have tried: return int(1) but that does not help. I am fairly new to coding and do not understand, why is this and how can I get 1 as an integer?
'Sentences'
0 [This is a string., And a great one!]
1 [It's a wonderful sentence!]
2 [This is yet another string!]
3 [Strange strings have been written.]
4 etc. etc.
e = df['Sentences']
def Single(s):
if len(s) == 1: # Select the items with only one sentence
count = 0
for k in s: # loop over every sentence
if (k[-1]=='!'): # check if sentence ends with '!'
count = count+1
if count == 1:
return 1
else:
return ''
df['Single'] = e.apply(Single)
This returns the the correct result, except that there should be '1' instead of '1.0'.
'Single'
0 NaN
1 1.0
2 1.0
3
4 etc. etc.
Why does this happen?

The reason is np.nan is considered float. This makes the series of type float. You cannot avoid this unless you want your column to be of type Object [i.e. anything]. This is inefficient and inadvisable, and I refuse to show you how to do this.
If there is an alternative value you can use instead of np.nan, e.g. 0, then there is a workaround. You can replace NaN values with 0 and then convert to int:
s = pd.Series([1, np.nan, 2, 3])
print(s)
# 0 1.0
# 1 NaN
# 2 2.0
# 3 3.0
# dtype: float64
s = s.fillna(0).astype(int)
print(s)
# 0 1
# 1 0
# 2 2
# 3 3
# dtype: int32

Use astype(int)
Ex:
df['Single'] = e.apply(Single).astype(int)

apply function on groups of k elements of a pandas Series

I have a pandas Series:
0 1
1 5
2 20
3 -1
Lets say I want to apply mean() on every two elements, so I get something like this:
0 3.0
1 9.5
Is there an elegant way to do this?

You can use groupby by index divide by k=2:
k = 2
print (s.index // k)
Int64Index([0, 0, 1, 1], dtype='int64')
print (s.groupby([s.index // k]).mean())
name
0 3.0
1 9.5

You can do this:
(s.iloc[::2].values + s.iloc[1::2])/2
if you want you can also reset the index afterwards, so you have 0, 1 as the index, using:
((s.iloc[::2].values + s.iloc[1::2])/2).reset_index(drop=True)

If you are using this over large series and many times, you'll want to consider a fast approach. This solution uses all numpy functions and will be fast.
Use reshape and construct new pd.Series
consider the pd.Series s
s = pd.Series([1, 5, 20, -1])
generalized function
def mean_k(s, k):
pad = (k - s.shape[0] % k) % k
nan = np.repeat(np.nan, pad)
val = np.concatenate([s.values, nan])
return pd.Series(np.nanmean(val.reshape(-1, k), axis=1))
demonstration
mean_k(s, 2)
0 3.0
1 9.5
dtype: float64
mean_k(s, 3)
0 8.666667
1 -1.000000
dtype: float64

Funny results with pandas argsort

I think I have hit on a bug in pandas. I was hoping to get some help either verifying the bug or helping me figure out where my logic error is located in my code.
My code is as follows:
import pandas, numpy, StringIO
def sq_fixer(sr):
sr = sr.where(sr != '20200229')
ranks = sr.argsort().astype(float)
ranks[ranks == -1] = numpy.nan
return ','.join(ranks.astype(numpy.str))
def correct_date(sr):
date_fixer = lambda x: pandas.datetime(x.year -100, x.month, x.day) if x > pandas.datetime.now() else x
sr = pandas.to_datetime(sr).apply(date_fixer).astype(pandas.datetime)
return sr
txt = '''ID,RUN_START_DATE,PUSHUP_START_DATE,SITUP_START_DATE,PULLUP_START_DATE
1,2013-01-24,2013-01-02,,2013-02-03
2,2013-01-30,2013-01-21,2013-01-13,2013-01-06
3,2013-01-29,2013-01-28,2013-01-01,2013-01-29
4,2013-02-16,2013-02-12,2013-01-04,2013-02-11
5,2013-01-06,2013-02-07,2013-02-25,2013-02-12
6,2013-01-26,2013-01-28,2013-02-12,2013-01-10
7,2013-01-26,,2013-01-12,2013-01-30
8,2013-01-03,2013-01-24,2013-01-19,2013-01-02
9,2013-01-22,2013-01-13,2013-02-03,
10,2013-02-06,2013-01-16,2013-02-07,2013-01-11
3347,,2008-02-27,2008-04-10,2008-02-13
3588,2004-09-12,,2004-11-06,2004-09-06
3784,2003-02-22,,2003-06-21,2003-02-19
593,2009-04-03,,2009-06-01,2009-04-01
4148,2003-03-21,2002-09-20,2003-04-01,2003-01-01
4299,2004-05-24,2004-07-23,,2004-04-22
4590,2005-05-05,2005-12-05,2005-04-05,
4830,2001-06-12,2000-10-12,2001-07-28,2001-01-28
4941,2006-11-08,2006-12-19,2006-07-19,2007-02-24
1416,2004-04-03,2004-05-19,2004-02-06,
1580,2008-12-20,,2009-03-19,2008-12-19
1661,2005-10-03,2005-10-26,2005-09-12,2006-02-19
1759,2001-10-18,,2002-01-17,2001-10-17
1858,2003-04-14,2003-05-17,,2002-12-17
1972,2003-06-01,2003-07-14,2002-12-14,
5905,2000-11-18,2001-01-13,,2000-11-04
2052,2002-06-11,,2002-08-23,2001-12-12
2165,2006-10-01,,2007-02-27,2006-09-30
2218,2007-09-19,,2008-02-06,2007-09-09
2350,2000-08-08,,2000-09-22,2000-01-08
2432,2001-08-22,,2001-09-25,2000-12-16
2611,2005-05-07,,2005-06-05,2005-03-26
2612,2005-05-06,,2005-05-26,2005-04-11
7378,2009-08-07,2009-01-30,2010-01-20,2009-06-08
7550,2006-04-08,,2006-06-01,2006-04-01 '''
df = pandas.read_csv(StringIO.StringIO(txt))
sequence_array = ['RUN_START_DATE', 'PUSHUP_START_DATE', 'SITUP_START_DATE', 'PULLUP_START_DATE']
xsequence_array = ['X_RUN_START_DATE', 'X_PUSHUP_START_DATE', 'X_SITUP_START_DATE', 'X_PULLUP_START_DATE']
df[sequence_array] = df[sequence_array].apply(correct_date, axis=1)
fix_day = lambda x: x if x > 0 else 29
fix_month = lambda x: x if x > 0 else 02
fix_year = lambda x: x if x > 0 else 2020
for col in sequence_array:
xcol = 'X_{0}'.format(col)
df[xcol] = ['{0:04d}{1:02d}{2:02d}'.format(fix_year(c.year), fix_month(c.month), fix_day(c.day)) for c in df[col]]
df['X_AS_SEQUENCE'] = df[xsequence_array].apply(sq_fixer, axis=1)
When I run the code most of the results are correct. Take for example index 6:
In [31]: df.ix[6]
Out[31]:
ID 7
RUN_START_DATE 2013-01-26 00:00:00
PUSHUP_START_DATE NaN
SITUP_START_DATE 2013-01-12 00:00:00
PULLUP_START_DATE 2013-01-30 00:00:00
X_RUN_START_DATE 20130126
X_PUSHUP_START_DATE 20200229
X_SITUP_START_DATE 20130112
X_PULLUP_START_DATE 20130130
X_AS_SEQUENCE 1.0,nan,0.0,2.0
However, certain indices seem to throw pandas.argsort() for a loop. Take for example index 10:
In [32]: df.ix[10]
Out[32]:
ID 3347
RUN_START_DATE NaN
PUSHUP_START_DATE 2008-02-27 00:00:00
SITUP_START_DATE 2008-04-10 00:00:00
PULLUP_START_DATE 2008-02-13 00:00:00
X_RUN_START_DATE 20200229
X_PUSHUP_START_DATE 20080227
X_SITUP_START_DATE 20080410
X_PULLUP_START_DATE 20080213
X_AS_SEQUENCE nan,2.0,0.0,1.0
The argsort should return nan,1.0,2.0,0.0 instead of nan,2.0,0.0,1.0.
I have been on this for three days. At this point I am not sure if it is me or a bug. I am not sure how to backtrace it to get an answer. Any help would be most appreciated!

You might be interpreting the result of argsort incorrectly. argsort does not give the ranking of the values. Use the rank method if you want to rank the values.
The values in the Series returned by argsort give the corresponding positions of the original values after dropping the NaNs. In your case, since you convert 20200229 to NaN, you are argsorting NaN, 20080227, 20080410, 20080213. The non-NaN values are
nonnan = [20080227, 20080410, 20080213]
The result, NaN, 2, 0, 1 says:
argsort sorted values
NaN NaN
2 nonnan[2] = 20080213
0 nonnan[0] = 20080227
1 nonnan[1] = 20080410
So it looks OK to me.

if you want to sort a Series, just use sort_values() or rank() function:
In [2]: a=pd.Series([3,2,1])
In [3]: a
Out[3]:
0 3
1 2
2 1
dtype: int64
In [4]: a.sort_values()
Out[4]:
2 1
1 2
0 3
dtype: int64
if you use argsort(), this will give you the position of each element in the sorted series,
in this case, 1 should be in the 0 position and 2 should be in the 1 position and 3 should be in the 2 position
In [5]: a.argsort()
Out[5]:
0 2
1 1
2 0
dtype: int64

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

pandas "where" function does not appear to short-circuit - python

Related

Data problem: identifying data rows where colleagues have reached a consensus

Pandas: Determine if a string in one column is a substring of a string in another column

Why does Python function return 1.0 (float) when `return 1` is specified?

apply function on groups of k elements of a pandas Series

Funny results with pandas argsort

Categories

Resources