Given a DataFrame
>>> df
x y z
0 1 a 7
1 2 b 5
2 3 c 7
I would like to find the index of the column by name, e.g., x -> 0, z -> 2, &c.
I can do
>>> list(df.columns).index('y')
1
but it seems backwards (the pandas.indexes.base.Index class should probably be able to do it without circling back to list).
You can use Index.get_loc:
print (df.columns.get_loc('z'))
2
Another solution is Index.searchsorted, which gives the correct position only when the column labels are sorted (as they are here):
print (df.columns.searchsorted('z'))
2
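As a hedged illustration of the difference between the two calls, the sketch below rebuilds the example frame and then reorders its columns (the name `shuffled` is made up for this example). `get_loc` still reports the true position, while `searchsorted` only reports where the label would be inserted if the labels were sorted. `get_indexer` is included as a way to look up several names at once.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"], "z": [7, 5, 7]})

# Position of a single column, and of several columns at once.
pos_z = df.columns.get_loc("z")
pos_many = df.columns.get_indexer(["x", "z"])

# Reorder the columns so the labels are no longer sorted.
shuffled = df[["z", "x", "y"]]
true_pos = shuffled.columns.get_loc("z")
# searchsorted assumes sorted labels, so this is an insertion point,
# not necessarily the position of "z".
insert_pos = shuffled.columns.searchsorted("z")

print(pos_z, list(pos_many), true_pos, insert_pos)
```

If the column order is not guaranteed to be sorted, `get_loc` is the safer choice.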
Timings:
In [86]: %timeit (df.columns.get_loc('z'))
The slowest run took 13.42 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.99 µs per loop
In [87]: %timeit (df.columns.searchsorted('z'))
The slowest run took 10.46 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.48 µs per loop
I have a pandas DataFrame with 1.5 million rows and 8 columns. I want to combine a few columns to create a new column. I know how to do this, but I want to know which approach is faster and more efficient. I am reproducing my code here:
import pandas as pd
import numpy as np
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
Now here is what I want to achieve
df['D']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The other alternative is to use the apply functionality of pandas
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
I want to know which method takes less time when we have 1.5 million rows and have to combine 8 columns.
The first method is faster because it is vectorized:
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
print (df)
#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
#similar timings with mul function
#df['D1']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
print (df)
In [54]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
The slowest run took 10.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 950 µs per loop
In [55]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.2 ms per loop
In [56]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 928 ms per loop
Another test on a 1.5M-row DataFrame shows that the apply method is very slow:
#[1500000 rows x 6 columns]
df = pd.concat([df]*500000).reset_index(drop=True)
In [62]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
10 loops, best of 3: 34.8 ms per loop
In [63]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
10 loops, best of 3: 31.5 ms per loop
In [64]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 47.3 s per loop
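Before comparing timings it helps to confirm that the three approaches compute the same thing. The following is a minimal sketch on the 3-row frame from the question; the variable names `vectorized`, `with_mul`, `row_wise` and `same` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["A", "B", "C"], data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Vectorized arithmetic on whole columns.
vectorized = 0.5 * df["A"] + 0.3 * df["B"] + 0.2 * df["C"]
# The same computation spelled with Series.mul.
with_mul = df["A"].mul(0.5) + df["B"].mul(0.3) + df["C"].mul(0.2)
# Row-by-row evaluation; axis=1 passes each row to the lambda.
row_wise = df.apply(
    lambda row: 0.5 * row["A"] + 0.3 * row["B"] + 0.2 * row["C"], axis=1
)

same = np.allclose(vectorized, with_mul) and np.allclose(vectorized, row_wise)
print(same)
```

The results agree; only the cost differs, because `apply` with `axis=1` calls the Python lambda once per row.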
Using @jezrael's setup
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
df = pd.concat([df]*30000).reset_index(drop=True)
It is far more efficient to use a dot product:
np.array([[.5, .3, .2]]).dot(df.values.T).T
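A hedged variant of the same idea is sketched below. It selects the three input columns explicitly, on the assumption that the frame may already hold other columns (such as a previously computed D) that would otherwise break the shapes. The name `weights` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["A", "B", "C"], data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# One weight per input column, in the same order as the selection below.
weights = np.array([0.5, 0.3, 0.2])

# (n_rows, 3) matrix times a length-3 vector gives one value per row.
combined = df[["A", "B", "C"]].to_numpy() @ weights
df["D"] = combined

print(df)
```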
I have a dataframe df which looks like:
id location grain
0 BBG.XETR.AD.S XETR 16.545
1 BBG.XLON.VB.S XLON 6.2154
2 BBG.XLON.HF.S XLON NaN
3 BBG.XLON.RE.S XLON NaN
4 BBG.XLON.LL.S XLON NaN
5 BBG.XLON.AN.S XLON 3.215
6 BBG.XLON.TR.S XLON NaN
7 BBG.XLON.VO.S XLON NaN
In reality this dataframe will be much larger. I would like to iterate over this dataframe returning the 'grain' value, but I am only interested in the rows that have a value (not NaN) in the 'grain' column. So, as I iterate over the dataframe, only the following values should be returned:
16.545
6.2154
3.215
I can iterate over the dataframe using:
for staticidx, row in df.iterrows():
    value = row['grain']
But this returns a value for all rows including those with a NaN value. Is there a way to either remove the NaN rows from the dataframe or skip the rows in the dataframe where grain equals NaN?
Many thanks
You can specify a list of columns in dropna on which to subset the data:
subset : array-like
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include
>>> df.dropna(subset=['grain'])
id location grain
0 BBG.XETR.AD.S XETR 16.5450
1 BBG.XLON.VB.S XLON 6.2154
5 BBG.XLON.AN.S XLON 3.2150
This:
df[pd.notnull(df['grain'])]
Or this:
df['grain'].dropna()
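To tie these options back to the original loop, here is a small sketch that rebuilds the sample frame and iterates only over the non-NaN grain values. The names `values` and `rows` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["BBG.XETR.AD.S", "BBG.XLON.VB.S", "BBG.XLON.HF.S", "BBG.XLON.RE.S",
           "BBG.XLON.LL.S", "BBG.XLON.AN.S", "BBG.XLON.TR.S", "BBG.XLON.VO.S"],
    "location": ["XETR"] + ["XLON"] * 7,
    "grain": [16.545, 6.2154, np.nan, np.nan, np.nan, 3.215, np.nan, np.nan],
})

# Iterate over the column itself after dropping NaN; no iterrows needed.
values = []
for grain in df["grain"].dropna():
    values.append(grain)

# Keep whole rows when the other columns are needed as well.
rows = df.dropna(subset=["grain"])

print(values)
print(rows)
```

Iterating over the filtered Series avoids building a row object for every record, which `iterrows` does.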
Let's compare different methods (on an 800K-row DataFrame):
In [21]: df = pd.concat([df] * 10**5, ignore_index=True)
In [22]: df.shape
Out[22]: (800000, 3)
In [23]: %timeit df.grain[~pd.isnull(df.grain)]
The slowest run took 5.33 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 17.1 ms per loop
In [24]: %timeit df.ix[df.grain.notnull(), 'grain']
10 loops, best of 3: 23.9 ms per loop
In [25]: %timeit df[pd.notnull(df['grain'])]
10 loops, best of 3: 35.9 ms per loop
In [26]: %timeit df.grain.ix[df.grain.notnull()]
100 loops, best of 3: 17.4 ms per loop
In [27]: %timeit df.dropna(subset=['grain'])
10 loops, best of 3: 56.6 ms per loop
In [28]: %timeit df.grain[df.grain.notnull()]
100 loops, best of 3: 17 ms per loop
In [30]: %timeit df['grain'].dropna()
100 loops, best of 3: 16.3 ms per loop
Consider this performance test in IPython under Python 3:
Create a range, a range_iterator and a generator
In [1]: g1 = range(1000000)
In [2]: g2 = iter(range(1000000))
In [3]: g3 = (i for i in range(1000000))
Measure time for summing using python native sum
In [4]: %timeit sum(g1)
10 loops, best of 3: 47.4 ms per loop
In [5]: %timeit sum(g2)
The slowest run took 374430.34 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 123 ns per loop
In [6]: %timeit sum(g3)
The slowest run took 1302907.54 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 128 ns per loop
Not sure if I should worry about the warning. The range version timing is very long (why?), but the range_iterator and the generator are similar.
Now let's use numpy.sum
In [7]: import numpy as np
In [8]: %timeit np.sum(g1)
10 loops, best of 3: 174 ms per loop
In [9]: %timeit np.sum(g2)
The slowest run took 8.47 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.51 µs per loop
In [10]: %timeit np.sum(g3)
The slowest run took 9.59 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 446 ns per loop
g1 and g3 became ~3.5x slower, but the range_iterator g2 is now ~50 times slower compared to the native sum. g3 wins.
In [11]: type(g1)
Out[11]: range
In [12]: type(g2)
Out[12]: range_iterator
In [13]: type(g3)
Out[13]: generator
Why is there such a penalty for range_iterator with numpy.sum? Should such objects be avoided? Does this generalize: do "home made" generators always beat other objects with numpy?
EDIT 1: I realized that the np.sum does not evaluate the range_iterator but returns another range_iterator object. So this comparison is not good. Why doesn't it get evaluated?
EDIT 2: I also realized that numpy.sum keeps the range in integer form and accordingly gives the wrong results on my sum due to integer overflow.
In [12]: sum(range(1000000))
Out[12]: 499999500000
In [13]: np.sum(range(1000000))
Out[13]: 1783293664
In [14]: np.sum(range(1000000), dtype=float)
Out[14]: 499999500000.0
Intermediate conclusion - don't use numpy.sum on non numpy objects...?
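The wrong total above is consistent with accumulating in a 32-bit integer. The sketch below reproduces it deterministically by forcing the accumulator dtype, assuming (this is a guess about the asker's platform) that the default NumPy integer there was 32 bits wide. The variable names are illustrative.

```python
import numpy as np

values = np.arange(1_000_000, dtype=np.int32)

# Accumulate in 32 bits: the true total does not fit, so it wraps silently.
wrapped = int(np.sum(values, dtype=np.int32))
# Accumulate in 64 bits: the total fits.
exact = int(np.sum(values, dtype=np.int64))
# Python integers have arbitrary precision.
builtin = sum(range(1_000_000))

print(wrapped, exact, builtin)
```

Passing an explicit 64-bit `dtype` keeps the exact integer result without switching to float.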
Did you look at the results of repeated sums on the iter?
95:~/mypy$ g2=iter(range(10))
96:~/mypy$ sum(g2)
Out[96]: 45
97:~/mypy$ sum(g2)
Out[97]: 0
98:~/mypy$ sum(g2)
Out[98]: 0
Why the 0s? Because g2 can be used only once. The same goes for the generator expression.
Or look at it with list
100:~/mypy$ g2=iter(range(10))
101:~/mypy$ list(g2)
Out[101]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
102:~/mypy$ list(g2)
Out[102]: []
In Python 3, range is a range object, not a list. It is an iterable, not an iterator: it produces a fresh iterator each time it is used.
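The following sketch puts the three objects from the question side by side. It also suggests why the %timeit figures for the iterator and the generator are in nanoseconds: after the first loop they are exhausted, so the remaining loops sum nothing.

```python
# A range can be iterated any number of times.
r = range(10)
range_first, range_second = sum(r), sum(r)

# An iterator over a range is consumed by the first pass.
it = iter(range(10))
iter_first, iter_second = sum(it), sum(it)

# A generator expression behaves the same way.
gen = (i for i in range(10))
gen_first, gen_second = sum(gen), sum(gen)

print(range_first, range_second)
print(iter_first, iter_second)
print(gen_first, gen_second)
```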
As for np.sum, np.sum(range(10)) has to make an array first.
When operating on a list, the Python sum is quite fast, faster than np.sum on the same:
116:~/mypy$ %%timeit x=list(range(10000))
...: sum(x)
1000 loops, best of 3: 202 µs per loop
117:~/mypy$ %%timeit x=list(range(10000))
...: np.sum(x)
1000 loops, best of 3: 1.62 ms per loop
But operating on an array, np.sum does much better
118:~/mypy$ %%timeit x=np.arange(10000)
...: sum(x)
100 loops, best of 3: 5.92 ms per loop
119:~/mypy$ %%timeit x=np.arange(10000)
...: np.sum(x)
<caching warning>
100000 loops, best of 3: 18.6 µs per loop
Another timing - various ways of making an array. fromiter can be faster than np.array; but the builtin arange is much better.
124:~/mypy$ timeit np.array(range(100000))
10 loops, best of 3: 39.2 ms per loop
125:~/mypy$ timeit np.fromiter(range(100000),int)
100 loops, best of 3: 12.9 ms per loop
126:~/mypy$ timeit np.arange(100000)
The slowest run took 6.93 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 106 µs per loop
Use range if you intend to work with lists; but use numpy's own range if you need to work with arrays. There is an overhead when creating arrays, so they are more valuable when working with large ones.
==================
On the question of how np.sum handles an iterator - it doesn't. Look at what np.array does to such an object:
In [12]: np.array(iter(range(10)))
Out[12]: array(<range_iterator object at 0xb5998f98>, dtype=object)
It produces a single element array with dtype object.
fromiter will evaluate this iterable:
In [13]: np.fromiter(iter(range(10)),int)
Out[13]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.array follows some complicated rules when it comes to converting the input to an array. It's designed to work primarily with a list of numbers or nested lists of equal length.
If you have questions of how a np function handles a non-array object, first check what np.array does to that object.
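A minimal sketch of that check for the iterator case is shown here. The variable names are illustrative, and the behaviour shown matches the output quoted above.

```python
import numpy as np

# np.array does not consume the iterator; it wraps the object itself.
wrapped = np.array(iter(range(10)))

# np.fromiter consumes the iterator and builds a numeric array.
evaluated = np.fromiter(iter(range(10)), dtype=int)

# Materializing to a list first also works, at the cost of the list.
as_list = np.array(list(iter(range(10))))

print(wrapped.dtype, wrapped.shape)
print(evaluated)
print(as_list)
```

Because `wrapped` holds a single Python object, a reduction such as np.sum has nothing numeric to add up, which matches EDIT 1 in the question.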
After experimenting with timing various types of lookups on a Pandas (0.17.1) DataFrame I am left with a few questions.
Here is the set up...
import pandas as pd
import numpy as np
import itertools
letters = [chr(x) for x in range(ord('a'), ord('z'))]
letter_combinations = [''.join(x) for x in itertools.combinations(letters, 3)]
df1 = pd.DataFrame({
'value': np.random.normal(size=(1000000)),
'letter': np.random.choice(letter_combinations, 1000000)
})
df2 = df1.sort_values('letter')
df3 = df1.set_index('letter')
df4 = df3.sort_index()
So df1 looks something like this...
print(df1.head(5))
>>>
letter value
0 bdh 0.253778
1 cem -1.915726
2 mru -0.434007
3 lnw -1.286693
4 fjv 0.245523
Here is the code to test differences in lookup performance...
print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df1[df1.letter == 'ben']
%timeit df1[df1.letter == 'amy']
%timeit df1[df1.letter == 'abe']
print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df2[df2.letter == 'ben']
%timeit df2[df2.letter == 'amy']
%timeit df2[df2.letter == 'abe']
print('~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df3.loc['ben']
%timeit df3.loc['amy']
%timeit df3.loc['abe']
print('~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df4.loc['ben']
%timeit df4.loc['amy']
%timeit df4.loc['abe']
And the results...
~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 193 ms per loop
~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 4.66 times longer than the fastest. This could mean that an intermediate result is being cached
10 loops, best of 3: 40.9 ms per loop
10 loops, best of 3: 41 ms per loop
10 loops, best of 3: 40.9 ms per loop
~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 1621.00 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 259 µs per loop
1000 loops, best of 3: 242 µs per loop
1000 loops, best of 3: 243 µs per loop
Questions...
It's pretty clear why the lookup on the sorted index is so much faster, binary search to get O(log(n)) performance vs O(n) for a full array scan. But, why is the lookup on the sorted non-indexed df2 column SLOWER than the lookup on the unsorted non-indexed column df1?
What is up with the message "The slowest run took x times longer than the fastest. This could mean that an intermediate result is being cached"? Surely the results aren't being cached. Is it because the created index is lazy and isn't actually reindexed until needed? That would explain why it appears only on the first call to .loc[].
Why isn't an index sorted by default? Is the fixed cost of the sort too high?
The disparity in these %timeit results
In [273]: %timeit df1[df1['letter'] == 'ben']
10 loops, best of 3: 36.1 ms per loop
In [274]: %timeit df2[df2['letter'] == 'ben']
10 loops, best of 3: 108 ms per loop
also shows up in the pure NumPy equality comparisons:
In [275]: %timeit df1['letter'].values == 'ben'
10 loops, best of 3: 24.1 ms per loop
In [276]: %timeit df2['letter'].values == 'ben'
10 loops, best of 3: 96.5 ms per loop
Under the hood, Pandas' df1['letter'] == 'ben' calls a Cython function
which loops through the values of the underlying NumPy array,
df1['letter'].values. It is essentially doing the same thing as
df1['letter'].values == 'ben' but with different handling of NaNs.
Moreover, notice that simply accessing the items in df1['letter'] in
sequential order can be done more quickly than doing the same for df2['letter']:
In [11]: %timeit [item for item in df1['letter']]
10 loops, best of 3: 49.4 ms per loop
In [12]: %timeit [item for item in df2['letter']]
10 loops, best of 3: 124 ms per loop
The differences in time within each of these three sets of %timeit tests are
roughly the same. I think that is because they all share the same cause.
Since the letter column holds strings, the NumPy arrays df1['letter'].values and
df2['letter'].values have dtype object and therefore they hold
pointers to the memory location of the arbitrary Python objects (in this case strings).
Consider the memory locations of the strings stored in the DataFrames df1 and
df2. In CPython, id returns the memory location of the object:
memloc = pd.DataFrame({'df1': list(map(id, df1['letter'])),
'df2': list(map(id, df2['letter'])), })
df1 df2
0 140226328244040 140226299303840
1 140226328243088 140226308389048
2 140226328243872 140226317328936
3 140226328243760 140226230086600
4 140226328243368 140226285885624
The strings in df1 (after the first dozen or so) tend to appear sequentially
in memory, while sorting causes the strings in df2 (taken in order) to be
scattered in memory:
In [272]: diffs = memloc.diff(); diffs.head(30)
Out[272]:
df1 df2
0 NaN NaN
1 -952.0 9085208.0
2 784.0 8939888.0
3 -112.0 -87242336.0
4 -392.0 55799024.0
5 -392.0 5436736.0
6 952.0 22687184.0
7 56.0 -26436984.0
8 -448.0 24264592.0
9 -56.0 -4092072.0
10 -168.0 -10421232.0
11 -363584.0 5512088.0
12 56.0 -17433416.0
13 56.0 40042552.0
14 56.0 -18859440.0
15 56.0 -76535224.0
16 56.0 94092360.0
17 56.0 -4189368.0
18 56.0 73840.0
19 56.0 -5807616.0
20 56.0 -9211680.0
21 56.0 20571736.0
22 56.0 -27142288.0
23 56.0 5615112.0
24 56.0 -5616568.0
25 56.0 5743152.0
26 56.0 -73057432.0
27 56.0 -4988200.0
28 56.0 85630584.0
29 56.0 -4706136.0
Most of the strings in df1 are 56 bytes apart:
In [16]: diffs['df1'].value_counts()
Out[16]:
56.0 986109
120.0 13671
-524168.0 215
-56.0 1
-12664712.0 1
41136.0 1
-231731080.0 1
Name: df1, dtype: int64
In [20]: len(diffs['df1'].value_counts())
Out[20]: 7
In contrast the strings in df2 are scattered all over the place:
In [17]: diffs['df2'].value_counts().head()
Out[17]:
-56.0 46
56.0 44
168.0 39
-112.0 37
-392.0 35
Name: df2, dtype: int64
In [19]: len(diffs['df2'].value_counts())
Out[19]: 837764
When these objects (strings) are located sequentially in memory, their values
can be retrieved more quickly. This is why the equality comparisons performed by
df1['letter'].values == 'ben' can be done faster than those in df2['letter'].values
== 'ben'. The lookup time is smaller.
This memory accessing issue also explains why there is no disparity in the
%timeit results for the value column.
In [5]: %timeit df1[df1['value'] == 0]
1000 loops, best of 3: 1.8 ms per loop
In [6]: %timeit df2[df2['value'] == 0]
1000 loops, best of 3: 1.78 ms per loop
df1['value'] and df2['value'] are NumPy arrays of dtype float64. Unlike object
arrays, their values are packed together contiguously in memory. Sorting df1
with df2 = df1.sort_values('letter') causes the values in df2['value'] to be
reordered, but since the values are copied into a new NumPy array, the values
are located sequentially in memory. So accessing the values in df2['value'] can
be done just as quickly as those in df1['value'].
(1) pandas currently has no knowledge of the sortedness of a column.
If you want to take advantage of sorted data, you could use df2.letter.searchsorted. See @unutbu's answer for an explanation of what's actually causing the difference in time here.
(2) The hash table that sits underneath the index is lazily created, then cached.
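As a sketch of point (1), the snippet below uses Series.searchsorted on a sorted column to find the block of matching rows by binary search and compares it with the boolean-mask lookup. The data is a small made-up stand-in for the question's frame, and `lo`, `hi`, `fast` and `slow` are illustrative names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame({
    "letter": rng.choice(["abe", "amy", "ben", "cat"], 1000),
    "value": rng.normal(size=1000),
})
# searchsorted is only meaningful on sorted data, so sort first.
df2 = df1.sort_values("letter").reset_index(drop=True)
is_sorted = df2["letter"].is_monotonic_increasing

# Binary search for the first and one-past-last position of 'ben'.
lo = df2["letter"].searchsorted("ben", side="left")
hi = df2["letter"].searchsorted("ben", side="right")
fast = df2.iloc[lo:hi]

# Full scan with a boolean mask, for comparison.
slow = df2[df2["letter"] == "ben"]

print(is_sorted, len(fast), fast.equals(slow))
```

pandas does not check sortedness here, so keeping the column sorted is the caller's responsibility.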
I want to combine 2 separate DataFrames of the following shape in Python pandas:
Df1=
A B
1 1 2
2 3 4
3 5 6
Df2 =
C D
1 a b
2 c d
3 e f
I want to have as follows:
df =
A B C D
1 1 2 a b
2 3 4 c d
3 5 6 e f
I am using the following code:
dat = df1.join(df2)
But the problem is that my actual data frame has more than 2 million rows, so this takes too long and consumes a huge amount of memory.
Is there any way to do it faster and memory efficient?
Thank you in advance for helping.
If I've read your question correctly, your indexes align exactly and you just need to combine columns into a single DataFrame. If that's right then it turns out that copying over a column from one DataFrame to another is the fastest way to go ([92] and [93]). f is my DataFrame in the example below:
In [85]: len(f)
Out[86]: 343720
In [87]: a = f.loc[:, ['date_val', 'price']]
In [88]: b = f.loc[:, ['red_date', 'credit_spread']]
In [89]: %timeit c = pd.concat([a, b], axis=1)
100 loops, best of 3: 7.11 ms per loop
In [90]: %timeit c = pd.concat([a, b], axis=1, ignore_index=True)
100 loops, best of 3: 10.8 ms per loop
In [91]: %timeit c = a.join(b)
100 loops, best of 3: 6.47 ms per loop
In [92]: %timeit a['red_date'] = b['red_date']
1000 loops, best of 3: 1.17 ms per loop
In [93]: %timeit a['credit_spread'] = b['credit_spread']
1000 loops, best of 3: 1.16 ms per loop
I also tried to copy both columns at once but for some strange reason it was more than two times slower than copying each column individually.
In [94]: %timeit a[['red_date', 'credit_spread']] = b[['red_date', 'credit_spread']]
100 loops, best of 3: 5.09 ms per loop