There appears to be a right and a wrong way to use str methods inside pandas query(). Why does the first query work as expected while the second one fails:
>>> import pandas
>>> data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
... 'year': [2012, 2012, 2013, 2014, 2014],
... 'coverage': [25, 94, 57, 62, 70]}
>>> df = pandas.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
>>> print(df.query('name.str.slice(0,1)=="J"'))
coverage name year
Cochice 25 Jason 2012
Maricopa 62 Jake 2014
>>>
>>> print(df.query('name.str.startswith("J")'))
<lines omitted>
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Try this trick — hand query() a plain NumPy array by tacking .values onto the expression:
In [62]: df.query("name.str.startswith('J').values")
Out[62]:
coverage name year
Cochice 25 Jason 2012
Maricopa 62 Jake 2014
Alternatively, you can specify engine='python':
In [63]: df.query("name.str.startswith('J')", engine='python')
Out[63]:
coverage name year
Cochice 25 Jason 2012
Maricopa 62 Jake 2014
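For comparison, the same filter can also be written as a plain boolean mask without query() (this is the df[df.name.str.startswith('J')] variant that shows up in the timings below):
df[df['name'].str.startswith('J')]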
Timing, for a 500K-row DataFrame:
In [68]: df = pd.concat([df] * 10**5, ignore_index=True)
In [69]: df.shape
Out[69]: (500000, 3)
In [70]: %timeit df.query("name.str.startswith('J')", engine='python')
1 loop, best of 3: 583 ms per loop
In [71]: %timeit df.query("name.str.startswith('J').values")
1 loop, best of 3: 587 ms per loop
In [72]: %timeit df[df.name.str.startswith('J')]
1 loop, best of 3: 571 ms per loop
In [74]: %timeit df.query('name.str.slice(0,1)=="J"')
1 loop, best of 3: 482 ms per loop
Suppose I have the following DataFrame:
df = pd.DataFrame({'city': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
'year': [2013, 2014, 2016, 2015, 2016, 2013, 2016, 2017, 2018],
'value': [10, 12, 16, 20, 21, 11, 15, 13, 16]})
And I want to find, for each city and year, the percentage change of value compared to the year before. My final dataframe would be:
city year value
a 2013 NaN
a 2014 0.20
a 2016 NaN
b 2015 NaN
b 2016 0.05
c 2013 NaN
d 2016 NaN
d 2017 -0.14
d 2018 0.23
I tried to group by city and then use apply, but it didn't work:
df.groupby('city').apply(lambda x: x.sort_values('year')['value'].pct_change()).reset_index()
It didn't work because I couldn't keep the year column, and also because this approach assumed I had all years for all cities, which is not true.
EDIT: I'm not very concerned with efficiency, so any solution that solves the problem is valid for me.
Let's try a lazy groupby(): use pct_change for the changes and diff to detect the year jumps:
groups = df.sort_values('year').groupby(['city'])
df['pct_chg'] = (groups['value'].pct_change()
.where(groups['year'].diff()==1)
)
Output:
city year value pct_chg
0 a 2013 10 NaN
1 a 2014 12 0.200000
2 a 2016 16 NaN
3 b 2015 20 NaN
4 b 2016 21 0.050000
5 c 2013 11 NaN
6 d 2016 15 NaN
7 d 2017 13 -0.133333
8 d 2018 16 0.230769
Although @Quang's answer is much more elegant and concise, I'll add another approach using indexing.
sorted_df = df.sort_values(by=['city', 'year'])
sorted_df.loc[((sorted_df.year.diff() == 1) &
(sorted_df.city == sorted_df.city.shift(1))), 'pct_chg'] = sorted_df.value.pct_change()
My approach is faster, as you can see from the runs on your df below, but the syntax is not as pretty (rows that are not a one-year continuation within the same city simply keep NaN in pct_chg).
%timeit #mine
1.44 ms ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit #@Quang's
2.23 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Hello! I have a dataframe with year (1910–2014), name, and count (the number of occurrences of each name) as columns. I want to create a new dataframe that shows the name with the highest occurrence per year, and I'm not entirely sure how to do this. Thanks!
Vectorized way
group = df.groupby('year')
df.loc[group['count'].agg('idxmax')]
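A minimal sketch of how that behaves, on made-up sample data along the lines of the next answer:
import pandas as pd

df = pd.DataFrame({'year': [1910, 1910, 1910, 1920, 1920, 1920],
                   'name': ['Virginia', 'Mary', 'Elizabeth', 'Virginia', 'Mary', 'Elizabeth'],
                   'count': [848, 420, 747, 1048, 221, 147]})

group = df.groupby('year')
# idxmax returns the row label of the maximum count within each year,
# and .loc pulls those rows back out of the original frame
print(df.loc[group['count'].agg('idxmax')])
#    year      name  count
# 0  1910  Virginia    848
# 3  1920  Virginia   1048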
Try this:
d = {'year': [1910, 1910, 1910,1920,1920,1920], 'name': ["Virginia", "Mary", "Elizabeth","Virginia", "Mary", "Elizabeth"], 'count': [848, 420, 747, 1048, 221, 147]}
df = pd.DataFrame(data=d)
df_results = pd.DataFrame(columns=df.columns)
years = pd.unique(df['year'])
for year in years:
    tmp_df = df.loc[df['year'] == year]
    # sort descending by count so the most frequent name for this year comes first
    tmp_df = tmp_df.sort_values(by='count', ascending=False)
    df_results = df_results.append(tmp_df.iloc[0])
I suppose groupby & apply is a good approach:
df = pd.DataFrame({
'Year': ['1910', '1910', '1911', '1911', '1911', '2014', '2014'],
'Name': ['Mary', 'Virginia', 'Elizabeth', 'Mary', 'Ann', 'Virginia', 'Elizabeth'],
'Count': [848, 270, 254, 360, 451, 81, 380]
})
df
Out:
Year Name Count
0 1910 Mary 848
1 1910 Virginia 270
2 1911 Elizabeth 254
3 1911 Mary 360
4 1911 Ann 451
5 2014 Virginia 81
6 2014 Elizabeth 380
df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(1))
Out:
Year Name Count
Year
1910 0 1910 Mary 848
1911 4 1911 Ann 451
2014 6 2014 Elizabeth 380
You can also change head(1) to head(n) to get the n most frequent names per year:
df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(2))
Out:
Year Name Count
Year
1910 0 1910 Mary 848
1 1910 Virginia 270
1911 4 1911 Ann 451
3 1911 Mary 360
2014 6 2014 Elizabeth 380
5 2014 Virginia 81
If you don't like the additional index level, drop it via .reset_index(level=0, drop=True):
top_names = df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(1))
top_names.reset_index(level=0, drop=True)
Out:
Year Name Count
0 1910 Mary 848
4 1911 Ann 451
6 2014 Elizabeth 380
Another way of doing this is to sort the values by Count and de-duplicate the Year column (it's faster, too):
df.sort_values('Count', ascending=False).drop_duplicates(['Year'])
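If you also want the result back in year order (an assumption about the desired output), just chain another sort:
df.sort_values('Count', ascending=False).drop_duplicates(['Year']).sort_values('Year')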
Timing results are below; you can try each method, see how much time it takes, and pick accordingly:
%timeit df.sort_values('Count', ascending=False).drop_duplicates(['Year'])
result: 917 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.loc[df.groupby('Year')['Count'].agg('idxmax')]
result: 1.06 ms ± 5.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.loc[df.groupby('Year')['Count'].idxmax(), :]
result: 1.13 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have data of the form Thu Jun 22 09:43:06, and I would like to infer the year from it so that I can use datetime to calculate the time between two dates. Is there a way to use datetime to infer the year from data like this?
No, but if you know the range (for example 2010..2017), you can just iterate over years to see if Jun 22 falls on Thursday:
import datetime

def find_year(start_year, end_year, month, day, week_day):
    for y in range(start_year, end_year + 1):
        if datetime.datetime(y, month, day, 0, 0).weekday() == week_day:
            yield y

# weekday is 0..6 starting from Monday, so 3 stands for Thursday
print(list(find_year(2010, 2017, 6, 22, 3)))
[2017]
For longer ranges, though, there might be more than one result:
print(list(find_year(2000,2017, 6, 22, 3)))
[2000, 2006, 2017]
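Once you have settled on a year, you can build the full timestamp from the original string; a small sketch, assuming the 'Thu Jun 22 09:43:06' format from the question:
import datetime

stamp = 'Thu Jun 22 09:43:06'
year = 2017  # the year inferred above
dt = datetime.datetime.strptime('{} {}'.format(stamp, year), '%a %b %d %H:%M:%S %Y')
print(dt)  # 2017-06-22 09:43:06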
You could also use pd.date_range to generate a lookup table:
import pandas as pd

calendar = pd.date_range('2017-01-01', '2020-12-31')
dow = {i: d for i, d in enumerate(('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))}
moy = {i: d for i, d in enumerate(('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'), 1)}
lup = {'{} {} {:>2d}'.format(dow[d.weekday()], moy[d.month], d.day): str(d.year) for d in calendar}
date = 'Tue Jun 25'
print(lup[date])
# 2019
print(pd.Timestamp(date + ' ' + lup[date]))
# 2019-06-25 00:00:00
Benchmarking it in IPython, there's some decent speedup once the table is generated, but the overhead of generating the table may not be worth it unless you have a lot of dates to confirm.
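The gen_lookup used below isn't shown; a minimal sketch of what it might look like, just wrapping the table-building code above into a function that collects every matching year, is:
import pandas as pd

def gen_lookup(start, end):
    # hypothetical reconstruction: map 'Thu Jun 22'-style strings to all years
    # in the range on which that weekday/month/day combination occurs
    dow = ('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun')
    moy = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
           'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
    lup = {}
    for d in pd.date_range(start, end):
        key = '{} {} {:>2d}'.format(dow[d.weekday()], moy[d.month - 1], d.day)
        lup.setdefault(key, []).append(str(d.year))
    return lup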
In [28]: lup = gen_lookup('1-1-2010', '12-31-2017')
In [29]: date = 'Thu Jun 22'
In [30]: lup[date]
Out[30]: ['2017']
In [32]: list(find_year(2010, 2017, 6, 22, 3))
Out[32]: [2017]
In [33]: %timeit lup = gen_lookup('1-1-2010', '12-31-2017')
13.8 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [34]: %timeit yr = lup[date]
54.1 ns ± 0.547 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [35]: %timeit yr = find_year(2010, 2017, 6, 22, 3)
248 ns ± 3.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
When reading a large HDF file with pandas.read_hdf() I get extremely slow read times. My HDF has 50 million rows: 3 integer columns and 2 string columns. Writing it with to_hdf() in table format with indexing took almost 10 minutes; while that is also slow, I am not too concerned, as read speed is more important.
I have tried saving in fixed and table formats, with and without compression; however, the read time ranges between 2 and 5 minutes. By comparison, read_csv() on the same data takes 4 minutes.
I have also tried reading the HDF using PyTables directly. This is much faster, at 6 seconds, and that is the speed I would like to see.
import tables

h5file = tables.open_file("data.h5", "r")
table = h5file.root.data.table.read()
I noticed that all the speed comparisons in the documentation use only numeric data, and running those myself I achieved similar performance.
Is there something I can do to optimise read performance?
Edit
Here is a sample of the data
col_A col_B col_C col_D col_E
30649671 1159660800 10217383 0 10596000 LACKEY
26198715 1249084800 0921720 0 0 KEY CLIFTON
19251910 752112000 0827092 104 243000 WEMPLE
47636877 1464739200 06247715 0 0 FLOYD
14121495 1233446400 05133815 0 988000 OGU ALLYN CH 9
41171050 1314835200 7C140009 0 39000 DEBERRY A
45865543 1459468800 0314892 76 254000 SABRINA
13387355 970358400 04140585 19 6956000 LA PERLA
4186815 849398400 02039719 0 19208000 NPU UNIONSPIELHAGAN1
32666568 733622400 10072006 0 1074000 BROWN
And info on the dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52046850 entries, 0 to 52046849
Data columns (total 5 columns):
col_A int64
col_B object
col_C int64
col_D int64
col_E object
dtypes: int64(3), object(2)
memory usage: 1.9+ GB
Here is a small demo:
Generating sample DF (1M rows):
import numpy as np
import pandas as pd

N = 10**6
df = pd.DataFrame({
'n1': np.random.randint(10**6, size=N),
'n2': np.random.randint(10**6, size=N),
'n3': np.random.randint(10**6, size=N),
's1': pd.util.testing.rands_array(10, size=N),
's2': pd.util.testing.rands_array(40, size=N),
})
Let's write it to disk in CSV, HDF5 (fixed, table, and table + data_columns=True), and Feather formats:
df.to_csv(r'c:/tmp/test.csv', index=False)
df.to_hdf(r'c:/tmp/test_fix.h5', 'a')
df.to_hdf(r'c:/tmp/test_tab.h5', 'a', format='t')
df.to_hdf(r'c:/tmp/test_tab_idx.h5', 'a', format='t', data_columns=True)
import feather
feather.write_dataframe(df, 'c:/tmp/test.feather')
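The question also mentions compression; for reference, that is just extra keyword arguments on to_hdf (this variant is not part of the timings below):
df.to_hdf(r'c:/tmp/test_tab_blosc.h5', 'a', format='t', complib='blosc', complevel=9)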
Reading:
In [2]: %timeit pd.read_csv(r'c:/tmp/test.csv')
1 loop, best of 3: 4.48 s per loop
In [3]: %timeit pd.read_hdf(r'c:/tmp/test_fix.h5','a')
1 loop, best of 3: 1.24 s per loop
In [4]: %timeit pd.read_hdf(r'c:/tmp/test_tab.h5','a')
1 loop, best of 3: 5.65 s per loop
In [5]: %timeit pd.read_hdf(r'c:/tmp/test_tab_idx.h5','a')
1 loop, best of 3: 5.6 s per loop
In [6]: %timeit feather.read_dataframe(r'c:/tmp/test.feather')
1 loop, best of 3: 589 ms per loop
Conditional reading: let's select only those rows where n2 <= 100000 (this works because the file was written in table format with data_columns=True, so those columns can be used in a where filter):
In [7]: %timeit pd.read_hdf(r'c:/tmp/test_tab_idx.h5','a', where="n2 <= 100000")
1 loop, best of 3: 1.18 s per loop
The less data we need to select (after filtering), the faster it is:
In [8]: %timeit pd.read_hdf(r'c:/tmp/test_tab_idx.h5','a', where="n2 <= 100000 and n1 > 500000")
1 loop, best of 3: 763 ms per loop
In [10]: %timeit pd.read_hdf(r'c:/tmp/test_tab_idx.h5','a', where="n2 <= 100000 and n1 > 500000 and n3 < 50000")
1 loop, best of 3: 379 ms per loop
UPDATE: for pandas versions 0.20.0+ we can write and read the Feather format directly (thanks @jezrael for the hint):
In [3]: df.to_feather(r'c:/tmp/test2.feather')
In [4]: %timeit pd.read_feather(r'c:/tmp/test2.feather')
1 loop, best of 3: 583 ms per loop
Example of generated DF:
In [13]: df
Out[13]:
n1 n2 n3 s1 s2
0 719458 808047 792611 Fjv4CoRv2b 2aWQTkutPlKkO38fRQh2tdh1BrnEFavmIsDZK17V
1 526092 950709 804869 dfG12EpzVI YVZzhMi9sfazZEW9e2TV7QIvldYj2RPHw0TXxS2z
2 109107 801344 266732 aoyBuHTL9I ui0PKJO8cQJwcvmMThb08agWL1UyRumYgB7jjmcw
3 873626 814409 895382 qQQms5pTGq zvf4HTaKCISrdPK98ROtqPqpsG4WhSdEgbKNHy05
4 212776 596713 924623 3YXa4PViAn 7Y94ykHIHIEnjKvGphYfAWSINRZtJ99fCPiMrfzl
5 375323 401029 973262 j6QQwYzfsK PNYOM2GpHdhrz9NCCifRsn8gIZkLHecjlk82o44Y
6 232655 937230 40883 NsI5Y78aLT qiKvXcAdPVbhWbXnyD3uqIwzS7ZsCgssm9kHAETb
7 69010 438280 564194 N73tQaZjey ttj1IHtjPyssyADMYiNScflBjN4SFv5bk3tbz93o
8 988081 8992 968871 eb9lc7D22T sb3dt1Ndc8CUHyvsFJgWRrQg4ula7KJ76KrSSqGH
9 127155 66042 881861 tHSBB3RsNH ZpZt5sxAU3zfiPniSzuJYrwtrytDvqJ1WflJ4vh3
... ... ... ... ... ...
999990 805220 21746 355944 IMCMWuf97L bj7tSrgudA5wLvWkWVQyNVamSGmFGOeQlIUoKXK3
999991 232596 293850 741881 JD0SVS5uob kWeP8DEw19rwxVN3XBBcskibMRGxfoToNO9RDeCT
999992 532752 733958 222003 9X4PopnltN dKhsdKFK1EfAATBFsB5hjKZzQWERxzxGEQZWAvSe
999993 308623 717897 703895 Fg0nuq63hA kHzRecZoaG5tAnLbtlq1hqtfd2l5oEMFbJp4NjhC
999994 841670 528518 70745 vKQDiAzZNf M5wdoUNfkdKX2VKQEArvBLYl5lnTNShjDLwnb8VE
999995 986988 599807 901853 r8iHjo39NH 72CfzCycAGoYMocbw3EbUbrV4LRowFjSDoDeYfT5
999996 384064 429184 203230 EJy0mTAmdQ 1jfUQCj2SLIktVqIRHfYQW2QYfpvhcWCbRLO5wqL
999997 967270 565677 146418 KWp2nH1MbM hzhn880cuEpjFhd5bd7vpgsjjRNgaViANW9FHwrf
999998 130864 863893 5614 L28QGa22f1 zfg8mBidk8NTa3LKO4rg31Z6K4ljK50q5tHHq8Fh
999999 528532 276698 553870 0XRJwqBAWX 0EzNcDkGUFklcbKELtcr36zPCMu9lSaIDcmm0kUX
[1000000 rows x 5 columns]
I have test_df with columns 'MonthAbbr' and 'PromoInterval'.
Example output
1017174 Jun Mar,Jun,Sept,Dec
1017175 Mar Mar,Jun,Sept,Dec
1017176 Feb Mar,Jun,Sept,Dec
1017177 Feb Feb,May,Aug,Nov
1017178 Jan Feb,May,Aug,Nov
1017179 Jan Mar,Jun,Sept,Dec
1017180 Jan Mar,Jun,Sept,Dec
I want to add an indicator column for whether the month is in the promo interval: it should be 1 if MonthAbbr is in PromoInterval for the current row, and 0 otherwise.
Is there a more efficient way than this?
for ind in test_df.index:
    test_df.set_value(ind, 'IsPromoInThisMonth',
                      test_df.MonthAbbr.astype(str)[ind] in (test_df.PromoInterval.astype(str)[ind]))
This is a bit faster:
%%timeit
test_df['IsPromoInThisMonth'] = [x in y for x, y in zip(test_df['MonthAbbr'],
test_df['PromoInterval'])]
1000 loops, best of 3: 317 µs per loop
Than your approach:
%%timeit
for ind in test_df.index:
    test_df.set_value(ind, 'IsPromoInThisMonth',
                      test_df.MonthAbbr.astype(str)[ind] in (test_df.PromoInterval.astype(str)[ind]))
1000 loops, best of 3: 1.44 ms per loop
UPDATE
Using a function with apply is slower than the list comprehension:
%%timeit
test_df['IsPromoInThisMonth'] = test_df.apply(lambda x: x[0] in x[1], axis=1)
1000 loops, best of 3: 804 µs per loop
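One small note: the list comprehension produces booleans; if you need the literal 1/0 indicator the question asks for, cast the column afterwards:
test_df['IsPromoInThisMonth'] = test_df['IsPromoInThisMonth'].astype(int)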