Funny results with pandas argsort - python
I think I have hit a bug in pandas. I am hoping for help either verifying the bug or finding the logic error in my code.
My code is as follows:
import pandas, numpy, StringIO
def sq_fixer(sr):
    sr = sr.where(sr != '20200229')
    ranks = sr.argsort().astype(float)
    ranks[ranks == -1] = numpy.nan
    return ','.join(ranks.astype(numpy.str))
def correct_date(sr):
    date_fixer = lambda x: pandas.datetime(x.year - 100, x.month, x.day) if x > pandas.datetime.now() else x
    sr = pandas.to_datetime(sr).apply(date_fixer).astype(pandas.datetime)
    return sr
txt = '''ID,RUN_START_DATE,PUSHUP_START_DATE,SITUP_START_DATE,PULLUP_START_DATE
1,2013-01-24,2013-01-02,,2013-02-03
2,2013-01-30,2013-01-21,2013-01-13,2013-01-06
3,2013-01-29,2013-01-28,2013-01-01,2013-01-29
4,2013-02-16,2013-02-12,2013-01-04,2013-02-11
5,2013-01-06,2013-02-07,2013-02-25,2013-02-12
6,2013-01-26,2013-01-28,2013-02-12,2013-01-10
7,2013-01-26,,2013-01-12,2013-01-30
8,2013-01-03,2013-01-24,2013-01-19,2013-01-02
9,2013-01-22,2013-01-13,2013-02-03,
10,2013-02-06,2013-01-16,2013-02-07,2013-01-11
3347,,2008-02-27,2008-04-10,2008-02-13
3588,2004-09-12,,2004-11-06,2004-09-06
3784,2003-02-22,,2003-06-21,2003-02-19
593,2009-04-03,,2009-06-01,2009-04-01
4148,2003-03-21,2002-09-20,2003-04-01,2003-01-01
4299,2004-05-24,2004-07-23,,2004-04-22
4590,2005-05-05,2005-12-05,2005-04-05,
4830,2001-06-12,2000-10-12,2001-07-28,2001-01-28
4941,2006-11-08,2006-12-19,2006-07-19,2007-02-24
1416,2004-04-03,2004-05-19,2004-02-06,
1580,2008-12-20,,2009-03-19,2008-12-19
1661,2005-10-03,2005-10-26,2005-09-12,2006-02-19
1759,2001-10-18,,2002-01-17,2001-10-17
1858,2003-04-14,2003-05-17,,2002-12-17
1972,2003-06-01,2003-07-14,2002-12-14,
5905,2000-11-18,2001-01-13,,2000-11-04
2052,2002-06-11,,2002-08-23,2001-12-12
2165,2006-10-01,,2007-02-27,2006-09-30
2218,2007-09-19,,2008-02-06,2007-09-09
2350,2000-08-08,,2000-09-22,2000-01-08
2432,2001-08-22,,2001-09-25,2000-12-16
2611,2005-05-07,,2005-06-05,2005-03-26
2612,2005-05-06,,2005-05-26,2005-04-11
7378,2009-08-07,2009-01-30,2010-01-20,2009-06-08
7550,2006-04-08,,2006-06-01,2006-04-01 '''
df = pandas.read_csv(StringIO.StringIO(txt))
sequence_array = ['RUN_START_DATE', 'PUSHUP_START_DATE', 'SITUP_START_DATE', 'PULLUP_START_DATE']
xsequence_array = ['X_RUN_START_DATE', 'X_PUSHUP_START_DATE', 'X_SITUP_START_DATE', 'X_PULLUP_START_DATE']
df[sequence_array] = df[sequence_array].apply(correct_date, axis=1)
fix_day = lambda x: x if x > 0 else 29
fix_month = lambda x: x if x > 0 else 02
fix_year = lambda x: x if x > 0 else 2020
for col in sequence_array:
    xcol = 'X_{0}'.format(col)
    df[xcol] = ['{0:04d}{1:02d}{2:02d}'.format(fix_year(c.year), fix_month(c.month), fix_day(c.day)) for c in df[col]]
df['X_AS_SEQUENCE'] = df[xsequence_array].apply(sq_fixer, axis=1)
When I run the code, most of the results are correct. Take, for example, index 6:
In [31]: df.ix[6]
Out[31]:
ID 7
RUN_START_DATE 2013-01-26 00:00:00
PUSHUP_START_DATE NaN
SITUP_START_DATE 2013-01-12 00:00:00
PULLUP_START_DATE 2013-01-30 00:00:00
X_RUN_START_DATE 20130126
X_PUSHUP_START_DATE 20200229
X_SITUP_START_DATE 20130112
X_PULLUP_START_DATE 20130130
X_AS_SEQUENCE 1.0,nan,0.0,2.0
However, certain indices seem to throw Series.argsort() for a loop. Take, for example, index 10:
In [32]: df.ix[10]
Out[32]:
ID 3347
RUN_START_DATE NaN
PUSHUP_START_DATE 2008-02-27 00:00:00
SITUP_START_DATE 2008-04-10 00:00:00
PULLUP_START_DATE 2008-02-13 00:00:00
X_RUN_START_DATE 20200229
X_PUSHUP_START_DATE 20080227
X_SITUP_START_DATE 20080410
X_PULLUP_START_DATE 20080213
X_AS_SEQUENCE nan,2.0,0.0,1.0
The argsort should return nan,1.0,2.0,0.0 instead of nan,2.0,0.0,1.0.
I have been on this for three days. At this point I am not sure whether it is me or a bug, and I am not sure how to trace it back to find out. Any help would be most appreciated!
You might be interpreting the result of argsort incorrectly. argsort does not give the ranking of the values. Use the rank method if you want to rank the values.
The values in the Series returned by argsort give the corresponding positions of the original values after dropping the NaNs. In your case, since you convert 20200229 to NaN, you are argsorting NaN, 20080227, 20080410, 20080213. The non-NaN values are
nonnan = [20080227, 20080410, 20080213]
The result, NaN, 2, 0, 1 says:
argsort    sorted values
NaN        NaN
2          nonnan[2] = 20080213
0          nonnan[0] = 20080227
1          nonnan[1] = 20080410
So it looks OK to me.
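For comparison, here is a minimal sketch (not from the original answer) of the rank approach mentioned at the top; subtracting 1 converts rank's default 1-based ranks into the 0-based sequence the question expected:

s = pandas.Series([numpy.nan, 20080227, 20080410, 20080213])
s.rank() - 1
# 0    NaN
# 1    1.0
# 2    2.0
# 3    0.0
# dtype: float64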
If you want to sort a Series, just use the sort_values() or rank() methods:
In [2]: a=pd.Series([3,2,1])
In [3]: a
Out[3]:
0 3
1 2
2 1
dtype: int64
In [4]: a.sort_values()
Out[4]:
2 1
1 2
0 3
dtype: int64
If you use argsort(), the value at each position tells you where the element that belongs there in sorted order sits in the original series: the smallest value, 1, is at original position 2, the next value, 2, is at original position 1, and the largest value, 3, is at original position 0:
In [5]: a.argsort()
Out[5]:
0 2
1 1
2 0
dtype: int64
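For contrast, a quick sketch (not in the original answer) of rank() on the same series; it gives each element's 1-based position in the sorted order, which is what people usually expect argsort to produce:

In [6]: a.rank()
Out[6]:
0    3.0
1    2.0
2    1.0
dtype: float64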
Related
Multiplying values from a string column in Pandas
I have a column with land dimensions in Pandas. It looks like this:

df.LotSizeDimensions.value_counts(dropna=False)
40.00X150.00           2
57.00X130.00           2
27.00X117.00           2
63.00X135.00           2
37.00X108.00           2
65.00X134.00           2
57.00X116.00           2
33x124x67x31x20x118    1
55.00X160.00           1
63.00X126.00           1
36.00X105.50           1

In rows where there is only one X, I would like to create a separate column that multiplies the two values. In rows where there is more than one X, I would like to return a zero. This is the code I came up with:

def dimensions_split(df: pd.DataFrame):
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip()
    df.LotSizeDimensions = df.LotSizeDimensions.str.upper()
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip('`"M')
    if df.LotSizeDimensions.count('X') > 1:
        return 0
    df['LotSize'] = map(int(df.LotSizeDimensions.str.split("X", 1).str[0]) * int(df.LotSizeDimensions.str.split("X", 1).str[1]))

It is coming back with the following error:

TypeError: cannot convert the series to <class 'int'>

I would also like to add a line so that if there are any non-numeric characters other than X, a zero is returned.
The idea: first strip and uppercase the LotSizeDimensions column to get a clean Series, then use Series.str.split with expand=True to split it into a two-column DataFrame, and multiply those columns where there is exactly one X, returning 0 otherwise:

s = df.LotSizeDimensions.str.strip('`"M ').str.upper()
df1 = s.str.split('X', expand=True).astype(float)
# for general data:
# df1 = s.str.split('X', expand=True).apply(lambda x: pd.to_numeric(x, errors='coerce'))
df['LotSize'] = np.where(s.str.count('X').eq(1), df1[0] * df1[1], 0)

print(df)
      LotSizeDimensions  LotSize
0          40.00X150.00   6000.0
1          57.00X130.00   7410.0
2          27.00X117.00   3159.0
3          37.00X108.00   3996.0
4          63.00X135.00   8505.0
5          65.00X134.00   8710.0
6          57.00X116.00   6612.0
7   33x124x67x31x20x118      0.0
8          55.00X160.00   8800.0
9          63.00X126.00   7938.0
10         36.00X105.50   3798.0
I get this using a list comprehension:

import pandas as pd

df = pd.DataFrame(['40.00X150.00', '57.00X130.00', '27.00X117.00', '37.00X108.00',
                   '63.00X135.00', '65.00X134.00', '57.00X116.00', '33x124x67x31x20x118',
                   '55.00X160.00', '63.00X126.00', '36.00X105.50'])
df[1] = [float(str_data.strip().split("X")[0]) * float(str_data.strip().split("X")[1])
         if len(str_data.strip().split("X")) == 2 else None
         for str_data in df[0]]
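Neither answer directly handles the question's follow-up requirement (return zero when a string contains non-numeric characters other than X). A minimal sketch building on the first answer's commented-out pd.to_numeric variant; this is an assumption about the intent, not code from either answer:

s = df.LotSizeDimensions.str.strip('`"M ').str.upper()
# errors='coerce' turns any token that is not a number into NaN
parts = s.str.split('X', expand=True).apply(lambda c: pd.to_numeric(c, errors='coerce'))
# keep only rows with exactly one X where both pieces parsed as numbers
ok = s.str.count('X').eq(1) & parts[0].notna() & parts[1].notna()
df['LotSize'] = np.where(ok, parts[0] * parts[1], 0)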
pandas "where" function does not appear to short-circuit
I'm probably misunderstanding how this works. I was surprised that, given this dataframe:

A  B    C           D
0  9.0  Nonnumeric  9.0
2  9.0  Num0a       9.0

this DOES appear to short-circuit (good!):

dfzero["B"] = pd.DataFrame.where(
    cond=dfzero["A"] != 0,
    self=1 / dfzero["A"],
    other=0)

but this does NOT (bad!); it raises an error, as there is no short-circuit:

df["D"] = pd.DataFrame.where(
    cond=df["C"].str.len() == 5,
    self=df["C"].str[-2:].apply(lambda x: int(x, 16)),
    other=0)

The error is:

self = (df["C"].str[-2:].apply(lambda x: int(x, 16))),
ValueError: invalid literal for int() with base 16: 'ic'
No, even the first method does NOT short-circuit. Both of the operands must be evaluated before the result is computed. Meaning, this is computed:

i = dfzero["A"] != 0

i
0    False
1     True
Name: A, dtype: bool

And so is this:

j = 1 / dfzero['A']

j
0         inf
1    0.500000
Name: A, dtype: float64

The expression is effectively:

pd.DataFrame.where(i, j, 0)

It's the same for the second case; the behaviour is consistent. Were you expecting a ZeroDivisionError? You won't get that with numpy or pandas, because these libraries assume you know what you're doing when you compute such quantities.

Your option here is to precompute the mask, and then compute the result for those rows only:

m = df["C"].str.len() == 5
df['D'] = df.loc[m, 'C'].str[-2:].apply(lambda x: int(x, 16))

df
   A    B           C     D
0  0  9.0  Nonnumeric   NaN
1  2  9.0       Num0a  10.0

If you want to fill in the NaNs, use df.loc[~m, 'D'] = fill_value.
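A quick sketch (not from the original answer) of the no-exception behaviour described above: floating-point division by zero in numpy yields inf and emits a RuntimeWarning rather than raising ZeroDivisionError:

import numpy as np

# array([inf, 0.5]) plus "RuntimeWarning: divide by zero encountered" -- no exception
np.array([1.0, 2.0]) / np.array([0.0, 4.0])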
How to add a new column to a table formed from conditional statements?
I have a very simple query. I have a csv that looks like this:

ID  X   Y
1   10  3
2   20  23
3   21  34

and I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise. My code so far is:

import pandas as pd

data = pd.read_csv("XYZ.csv")

for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
You can do this without using a loop by using ge (which means greater than or equal to) and casting the boolean array to int using astype:

In [119]: df['Z'] = (df['X'].ge(df['Y'])).astype(int)
          df
Out[119]:
   ID   X   Y  Z
0   1  10   3  1
1   2  20  23  0
2   3  21  34  0

Regarding your attempt:

for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0

it wouldn't work. Firstly, you're using Data, not data. Even with that fixed, you'd be comparing a scalar against an entire column, which is ambiguous, and you'd be assigning to the whole Z column each time, overwriting it. You need to access the index label, which your loop doesn't; you can use iteritems to do this:

In [125]: for idx, x in df["X"].iteritems():
              if x >= df['Y'].loc[idx]:
                  df.loc[idx, 'Z'] = 1
              else:
                  df.loc[idx, 'Z'] = 0
          df
Out[125]:
   ID   X   Y  Z
0   1  10   3  1
1   2  20  23  0
2   3  21  34  0

But really this is unnecessary, as there is a vectorised method here.
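A side note not in the original answer: Series.iteritems was deprecated and later removed in favour of Series.items, so in recent pandas versions the explicit loop would be written as:

# equivalent loop for pandas versions where iteritems no longer exists
for idx, x in df["X"].items():
    df.loc[idx, 'Z'] = 1 if x >= df['Y'].loc[idx] else 0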
Firstly, your code is just fine except that you capitalized your dataframe name as Data instead of data. However, for efficient code, EdChum has a great answer above. Or another vectorised method, similar in efficiency but with code that is easier to remember:

import numpy as np

data['Z'] = np.where(data.X >= data.Y, 1, 0)
Collapsing identical adjacent rows in a Pandas Series
Basically if a column of my pandas dataframe looks like this: [1 1 1 2 2 2 3 3 3 1 1] I'd like it to be turned into the following: [1 2 3 1]
You can write a simple function that loops through the elements of your series, only storing the first element in each run. As far as I know, there is no tool built into pandas to do this, but it is not a lot of code to do it yourself:

import pandas

example_series = pandas.Series([1, 1, 1, 2, 2, 3])

def collapse(series):
    last = ""
    seen = []
    for element in series:
        if element != last:
            last = element
            seen.append(element)
    return seen

collapse(example_series)

The code above iterates through each element of the series and checks whether it is the same as the last element seen. If it is not, it is stored; if it is, it is ignored. If you need to handle the return value as a series, change the last line of the function to:

return pandas.Series(seen)
You could write a function that does the following:

x = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
y = x - x.shift(1)
y[0] = 1
result = x[y != 0]
You can use DataFrame's diff and indexing:

>>> df = pd.DataFrame([1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1])
>>> df[df[0].diff() != 0]
    0
0   1
2   2
6   3
10  1
>>> df[df[0].diff() != 0].values.ravel()  # if you need an array
array([1, 2, 3, 1])

The same works for a Series:

>>> s = pd.Series([1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1])
>>> s[s.diff() != 0].values
array([1, 2, 3, 1])
You can use shift to create a boolean mask that compares each row against the previous row:

In [67]: s = pd.Series([1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5])
         s[s != s.shift()]
Out[67]:
0     1
2     2
6     3
10    4
12    5
dtype: int64
pandas fancy indexing and merging back
What's the simplest way of merging changes back into a pandas dataframe after filtering via fancy indexing? For example, define a dataframe with two columns x and y, select all the rows where x is an even integer, and then set the corresponding values in y to 0:

d = pd.DataFrame({'x': range(10), 'y': range(11, 21)})
d[d.x % 2 == 0]['y'] = 0

The "fancy indexing" boolean query makes a copy of the dataframe, so the changes are never propagated back to the original dataframe. Is there a better way of performing this operation? My current solution is to define a temporary dataframe w based on the boolean fancy indexing, set the corresponding values in y to 0 in w, and then merge w back into d using the index. There must be a more efficient (and hopefully more direct) way of doing this:

w = d[d.x % 2 == 0]
w.y = 0
Use DataFrame.ix[]:

In [21]: d
Out[21]:
   x   y
0  0  11
1  1  12
2  2  13

In [22]: d.ix[d.x % 2 == 0, 'y'] = -5

In [23]: d
Out[23]:
   x   y
0  0  -5
1  1  12
2  2  -5
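Note that ix has since been deprecated and removed from pandas; in modern versions the same in-place assignment is written with loc:

d.loc[d.x % 2 == 0, 'y'] = 0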