Pandas DataFrame check column value in other column - python

I have test_df with columns 'MonthAbbr' and 'PromoInterval'
Example data:
1017174 Jun Mar,Jun,Sept,Dec
1017175 Mar Mar,Jun,Sept,Dec
1017176 Feb Mar,Jun,Sept,Dec
1017177 Feb Feb,May,Aug,Nov
1017178 Jan Feb,May,Aug,Nov
1017179 Jan Mar,Jun,Sept,Dec
1017180 Jan Mar,Jun,Sept,Dec
I want to add an indicator column that is 1 if MonthAbbr is contained in PromoInterval for the current row, and 0 otherwise.
Is there a more efficient way than the following?
for ind in test_df.index:
    test_df.set_value(ind, 'IsPromoInThisMonth',
                      test_df.MonthAbbr.astype(str)[ind] in test_df.PromoInterval.astype(str)[ind])

This is a bit faster:
%%timeit
test_df['IsPromoInThisMonth'] = [x in y for x, y in zip(test_df['MonthAbbr'],
                                                        test_df['PromoInterval'])]
1000 loops, best of 3: 317 µs per loop
Than your approach:
%%timeit
for ind in test_df.index:
    test_df.set_value(ind, 'IsPromoInThisMonth',
                      test_df.MonthAbbr.astype(str)[ind] in test_df.PromoInterval.astype(str)[ind])
1000 loops, best of 3: 1.44 ms per loop
UPDATE
Using a function with apply is slower than the list comprehension:
%%timeit
test_df['IsPromoInThisMonth'] = test_df.apply(lambda x: x[0] in x[1], axis=1)
1000 loops, best of 3: 804 µs per loop
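Since the question asks for a 0/1 indicator rather than True/False, the comprehension result can simply be cast to int afterwards (a small follow-on sketch, not part of the original answer):
# build the boolean mask row by row, then cast to int for 0/1 values
test_df['IsPromoInThisMonth'] = [m in p for m, p in
                                 zip(test_df['MonthAbbr'], test_df['PromoInterval'])]
test_df['IsPromoInThisMonth'] = test_df['IsPromoInThisMonth'].astype(int)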

Related

Infer year from day of week and date with python datetime

I have data of the form Thu Jun 22 09:43:06 and I would like to infer the year from it so I can use datetime to calculate the time between two dates. Is there a way to use datetime to infer the year given the above data?
No, but if you know the range (for example 2010..2017), you can just iterate over years to see if Jun 22 falls on Thursday:
import datetime

def find_year(start_year, end_year, month, day, week_day):
    for y in range(start_year, end_year + 1):
        if datetime.datetime(y, month, day, 0, 0).weekday() == week_day:
            yield y

# weekday() is 0..6 starting from Monday, so 3 stands for Thursday
print(list(find_year(2010, 2017, 6, 22, 3)))
[2017]
For longer ranges, though, there might be more than one result:
print(list(find_year(2000,2017, 6, 22, 3)))
[2000, 2006, 2017]
You could also use pd.date_range to generate a lookup table:
import pandas as pd

calendar = pd.date_range('2017-01-01', '2020-12-31')
dow = {i: d for i, d in enumerate(('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))}
moy = {i: d for i, d in enumerate(('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'), 1)}
# map 'Dow Mon dd' strings to the year they occur in
lup = {'{} {} {:>2d}'.format(dow[d.weekday()], moy[d.month], d.day): str(d.year)
       for d in calendar}
date = 'Tue Jun 25'
print(lup[date])
# 2019
print(pd.Timestamp(date + ' ' + lup[date]))
# 2019-06-25 00:00:00
Benchmarking it in ipython, there's some decent speedup once the table is generated, but the overhead of generating the table may not be worth it unless you have a lot of dates to confirm.
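The gen_lookup helper used in the session below isn't defined in the answer; a plausible sketch, assuming it simply wraps the table construction above and collects every matching year into a list (which would explain the ['2017'] output):
def gen_lookup(start, end):
    # hypothetical helper: map 'Dow Mon dd' strings to a list of matching years
    lup = {}
    for d in pd.date_range(start, end):
        key = '{} {} {:>2d}'.format(dow[d.weekday()], moy[d.month], d.day)
        lup.setdefault(key, []).append(str(d.year))
    return lup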
In [28]: lup = gen_lookup('1-1-2010', '12-31-2017')
In [29]: date = 'Thu Jun 22'
In [30]: lup[date]
Out[30]: ['2017']
In [32]: list(find_year(2010, 2017, 6, 22, 3))
Out[32]: [2017]
In [33]: %timeit lup = gen_lookup('1-1-2010', '12-31-2017')
13.8 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [34]: %timeit yr = lup[date]
54.1 ns ± 0.547 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [35]: %timeit yr = find_year(2010, 2017, 6, 22, 3)
248 ns ± 3.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

How to use str methods inside pandas query()

There appears to be a right and a wrong way to use str methods inside pandas query(). Why does the first query work as expected while the second one fails?
>>> import pandas
>>> data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
... 'year': [2012, 2012, 2013, 2014, 2014],
... 'coverage': [25, 94, 57, 62, 70]}
>>> df = pandas.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
>>> print(df.query('name.str.slice(0,1)=="J"'))
coverage name year
Cochice 25 Jason 2012
Maricopa 62 Jake 2014
>>>
>>> print(df.query('name.str.startswith("J")'))
<lines omitted>
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Try this trick:
In [62]: df.query("name.str.startswith('J').values")
Out[62]:
coverage name year
Cochice 25 Jason 2012
Maricopa 62 Jake 2014
Alternatively, you can specify engine='python':
In [63]: df.query("name.str.startswith('J')", engine='python')
Out[63]:
coverage name year
Cochice 25 Jason 2012
Maricopa 62 Jake 2014
Timing for a 500K-row DF:
In [68]: df = pd.concat([df] * 10**5, ignore_index=True)
In [69]: df.shape
Out[69]: (500000, 3)
In [70]: %timeit df.query("name.str.startswith('J')", engine='python')
1 loop, best of 3: 583 ms per loop
In [71]: %timeit df.query("name.str.startswith('J').values")
1 loop, best of 3: 587 ms per loop
In [72]: %timeit df[df.name.str.startswith('J')]
1 loop, best of 3: 571 ms per loop
In [74]: %timeit df.query('name.str.slice(0,1)=="J"')
1 loop, best of 3: 482 ms per loop
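For reference, the same filter can be written without query() at all, using the str accessor directly (a small sketch, not part of the original answer):
# plain boolean-mask equivalents of the query() calls above
df[df['name'].str[0] == 'J']           # compare the first character
df[df['name'].str.startswith('J')]     # startswith works fine outside query()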

Remove rows where a specific column has a blank entry

I have a dataframe Df which looks like:
date XNGS BBG FX
16/11/2007 19.41464766 0.6819 19.41464766
19/11/2007 19.34059332 0.6819 19.34059332
20/11/2007 19.49080536 0.6739 19.49080536
21/11/2007 19.2399259 0.673 19.2399259
22/11/2007 0.6734
23/11/2007 19.2009794 0.674 19.2009794
I would like to remove any rows where XNGS is empty. In this example I would like to remove the row with the date index 22/11/2007. So the resulting Df would look like:
date XNGS BBG FX
16/11/2007 19.41464766 0.6819 19.41464766
19/11/2007 19.34059332 0.6819 19.34059332
20/11/2007 19.49080536 0.6739 19.49080536
21/11/2007 19.2399259 0.673 19.2399259
23/11/2007 19.2009794 0.674 19.2009794
The dataframe changes a lot so the fix needs to be dynamic. I have tried:
Df = Df[Df.XNGS != ""]
and
Df.dropna(subset=["XNGS"])
but they don't work. What can I try next?
Safe Option
canonical dropna after replace
df.replace({'XNGS': {'': np.nan}}).dropna(subset=['XNGS'])
date XNGS BBG FX
0 16/11/2007 19.414648 0.6819 19.414648
1 19/11/2007 19.340593 0.6819 19.340593
2 20/11/2007 19.490805 0.6739 19.490805
3 21/11/2007 19.239926 0.6730 19.239926
5 23/11/2007 19.200979 0.6740 19.200979
Less Safe, but Cool
Empty strings evaluate to False
df[df.XNGS.values.astype(bool)]
date XNGS BBG FX
0 16/11/2007 19.414648 0.6819 19.414648
1 19/11/2007 19.340593 0.6819 19.340593
2 20/11/2007 19.490805 0.6739 19.490805
3 21/11/2007 19.239926 0.6730 19.239926
5 23/11/2007 19.200979 0.6740 19.200979
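Why the cast works: empty strings are falsy in Python, so astype(bool) turns blank entries into False and everything else into True (a quick illustration):
bool('')       # False
bool('19.2')   # True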
Timing
small data
%timeit (df.replace({'XNGS': {'': np.nan}}).dropna(subset=['XNGS']))
1000 loops, best of 3: 1.39 ms per loop
%timeit df[df.XNGS.values.astype(bool)]
1000 loops, best of 3: 192 µs per loop
large data
df = pd.concat([df] * 10000, ignore_index=True)
%timeit (df.replace({'XNGS': {'': np.nan}}).dropna(subset=['XNGS']))
100 loops, best of 3: 10.5 ms per loop
%timeit df[df.XNGS.values.astype(bool)]
100 loops, best of 3: 2.11 ms per loop
What about query?
Df.query('XNGS != ""', inplace=True)
or
Df = Df.query('XNGS != ""')
A long way of doing it is:
df["column name"].fillna(9999, inplace=True)
df = df[df["column name"]!= 9999]

Does Indexing make Slicing of a pandas dataframe faster?

I have a pandas dataframe holding more than a million records. One of its columns is a datetime. A sample of my data looks like the following:
time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
...
I need to efficiently get the records within a specific period. The following naive way is very time consuming:
new_df = df[(df["time"] > start_time) & (df["time"] < end_time)]
I know that on a DBMS like MySQL, indexing by the time field is effective for getting records in a specified time period.
My questions are:
Does pandas indexing, such as df.index = df.time, make the slicing process faster?
If the answer to Q1 is 'No', what is the common, efficient way to get records within a specific time period in pandas?
Let's create a dataframe with 1 million rows and time the performance. The index is a Pandas Timestamp.
df = pd.DataFrame(np.random.randn(1000000, 3),
                  columns=list('ABC'),
                  index=pd.DatetimeIndex(start='2015-1-1', freq='10s', periods=1000000))
Here are the results sorted from fastest to slowest (tested on the same machine with both v. 0.14.1 (don't ask...) and the most recent version 0.17.1):
%timeit df2 = df['2015-2-1':'2015-3-1']
1000 loops, best of 3: 459 µs per loop (v. 0.14.1)
1000 loops, best of 3: 664 µs per loop (v. 0.17.1)
%timeit df2 = df.ix['2015-2-1':'2015-3-1']
1000 loops, best of 3: 469 µs per loop (v. 0.14.1)
1000 loops, best of 3: 662 µs per loop (v. 0.17.1)
%timeit df2 = df.loc[(df.index >= '2015-2-1') & (df.index <= '2015-3-1'), :]
100 loops, best of 3: 8.86 ms per loop (v. 0.14.1)
100 loops, best of 3: 9.28 ms per loop (v. 0.17.1)
%timeit df2 = df.loc['2015-2-1':'2015-3-1', :]
1 loops, best of 3: 341 ms per loop (v. 0.14.1)
1000 loops, best of 3: 677 µs per loop (v. 0.17.1)
Here are the timings with the Datetime index as a column:
df.reset_index(inplace=True)
%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1')]
100 loops, best of 3: 12.6 ms per loop (v. 0.14.1)
100 loops, best of 3: 13 ms per loop (v. 0.17.1)
%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1'), :]
100 loops, best of 3: 12.8 ms per loop (v. 0.14.1)
100 loops, best of 3: 12.7 ms per loop (v. 0.17.1)
All of the above indexing techniques produce the same dataframe:
>>> df2.shape
(250560, 3)
It appears that either of the first two methods is best in this situation, and the fourth method works just as well with the latest version of Pandas.
I've never dealt with a data set that large, but maybe you can try recasting the time column as a datetime index and then slicing directly. Something like this.
timedata.txt (extended from your example):
time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
2015-05-01 10:00:05,112,223,335
2015-05-01 10:00:08,112,223,336
2015-05-01 10:00:13,112,223,337
2015-05-01 10:00:21,112,223,338
df = pd.read_csv('timedata.txt')
df.time = pd.to_datetime(df.time)
df = df.set_index('time')
print(df['2015-05-01 10:00:02':'2015-05-01 10:00:14'])
x y z
time
2015-05-01 10:00:03 112 223 334
2015-05-01 10:00:05 112 223 335
2015-05-01 10:00:08 112 223 336
2015-05-01 10:00:13 112 223 337
Note that in the example the times used for slicing are not in the column, so this will work for the case where you only know the time interval.
If your data has a fixed time period you can create a datetime index, which may provide more options. I didn't want to assume your time period was fixed, so I constructed this for the more general case.
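If the sampling period really is fixed, the index could be built directly with pd.date_range instead of parsing the time column (a minimal sketch, assuming a fixed 3-second step, which the question does not actually state):
import numpy as np
import pandas as pd

# hypothetical: one million rows sampled every 3 seconds
idx = pd.date_range('2015-05-01 10:00:00', periods=1000000, freq='3s')
df = pd.DataFrame(np.random.randn(1000000, 3), columns=['x', 'y', 'z'], index=idx)

# slice directly on the DatetimeIndex
subset = df['2015-05-01 10:00:02':'2015-05-01 10:00:14']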

Faster alternatives to numpy.argmax/argmin which is slow

I am using a lot of argmin and argmax in Python.
Unfortunately, these functions are very slow.
I have done some searching around, and the best I can find is here:
http://lemire.me/blog/archives/2008/12/17/fast-argmax-in-python/
def fastest_argmax(array):
    array = list(array)
    return array.index(max(array))
Unfortunately, this solution is still only half as fast as np.max, and I think I should be able to find something as fast as np.max.
x = np.random.randn(10)
%timeit np.argmax( x )
10000 loops, best of 3: 21.8 us per loop
%timeit fastest_argmax( x )
10000 loops, best of 3: 20.8 us per loop
As a note, I am applying this to a Pandas DataFrame Groupby, e.g.:
%timeit grp2[ 'ODDS' ].agg( [ fastest_argmax ] )
100 loops, best of 3: 8.8 ms per loop
%timeit grp2[ 'ODDS' ].agg( [ np.argmax ] )
100 loops, best of 3: 11.6 ms per loop
Where grp2[ 'ODDS' ].head() looks like this:
EVENT_ID SELECTION_ID
104601100 4367029 682508 3.05
682509 3.15
682510 3.25
682511 3.35
5319660 682512 2.04
682513 2.08
682514 2.10
682515 2.12
682516 2.14
5510310 682520 4.10
682521 4.40
682522 4.50
682523 4.80
682524 5.30
5559264 682526 5.00
682527 5.30
682528 5.40
682529 5.50
682530 5.60
5585869 682533 1.96
682534 1.97
682535 1.98
682536 2.02
682537 2.04
6064546 682540 3.00
682541 2.74
682542 2.76
682543 2.96
682544 3.05
104601200 4916112 682548 2.64
682549 2.68
682550 2.70
682551 2.72
682552 2.74
5315859 682557 2.90
682558 2.92
682559 3.05
682560 3.10
682561 3.15
5356995 682564 2.42
682565 2.44
682566 2.48
682567 2.50
682568 2.52
5465225 682573 1.85
682574 1.89
682575 1.91
682576 1.93
682577 1.94
5773661 682588 5.00
682589 4.40
682590 4.90
682591 5.10
6013187 682592 5.00
682593 4.20
682594 4.30
682595 4.40
682596 4.60
104606300 2489827 683438 4.00
683439 3.90
683440 3.95
683441 4.30
683442 4.40
3602724 683446 2.16
683447 2.32
Name: ODDS, Length: 65, dtype: float64
It turns out that np.argmax is blazingly fast, but only with the native numpy arrays. With foreign data, almost all the time is spent on conversion:
In [194]: print platform.architecture()
('64bit', 'WindowsPE')
In [5]: x = np.random.rand(10000)
In [57]: l=list(x)
In [123]: timeit numpy.argmax(x)
100000 loops, best of 3: 6.55 us per loop
In [122]: timeit numpy.argmax(l)
1000 loops, best of 3: 729 us per loop
In [134]: timeit numpy.array(l)
1000 loops, best of 3: 716 us per loop
I called your function "inefficient" because it first converts everything to list, then iterates through it 2 times (effectively, 3 iterations + list construction).
I was going to suggest something like this that only iterates once:
def imax(seq):
    it = iter(seq)
    im = 0
    try:
        m = next(it)
    except StopIteration:
        raise ValueError("the sequence is empty")
    for i, e in enumerate(it, start=1):
        if e > m:
            m = e
            im = i
    return im
But your version turns out to be faster because, although it iterates several times, it does so in C rather than Python code. C is just that much faster, even considering that a great deal of time is spent on conversion, too:
In [158]: timeit imax(x)
1000 loops, best of 3: 883 us per loop
In [159]: timeit fastest_argmax(x)
1000 loops, best of 3: 575 us per loop
In [174]: timeit list(x)
1000 loops, best of 3: 316 us per loop
In [175]: timeit max(l)
1000 loops, best of 3: 256 us per loop
In [181]: timeit l.index(0.99991619010758348) #the greatest number in my case, at index 92
100000 loops, best of 3: 2.69 us per loop
So the key to speeding this up further is knowing which format your sequence data is natively in (e.g. whether you can omit the conversion step or use/write other functionality native to that format).
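For instance, if the data starts out as a plain Python list but is queried repeatedly, converting it to a NumPy array once keeps the fast C argmax without paying the conversion cost on every call (a sketch, not from the original answer):
import numpy as np

l = list(np.random.rand(10000))

arr = np.asarray(l)      # pay the list-to-array conversion cost once
idx = np.argmax(arr)     # repeated argmax calls on arr stay fast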
Btw, you're likely to get some speedup by using aggregate(max_fn) instead of agg([max_fn]).
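To illustrate that suggestion with the question's own groupby (just a sketch; grp2 and fastest_argmax come from the question above):
# passing the function directly avoids building a one-column DataFrame result
grp2['ODDS'].aggregate(fastest_argmax)   # instead of grp2['ODDS'].agg([fastest_argmax])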
For those that came for a short numpy-free snippet that returns the index of the first minimum value:
def argmin(a):
    return min(range(len(a)), key=lambda x: a[x])

a = [6, 5, 4, 1, 1, 3, 2]
argmin(a)  # returns 3
Can you post some code? Here is the result on my pc:
x = np.random.rand(10000)
%timeit np.max(x)
%timeit np.argmax(x)
output:
100000 loops, best of 3: 7.43 µs per loop
100000 loops, best of 3: 11.5 µs per loop
