Given a DataFrame with multiple columns, how do we select values from specific columns by row to create a new Series?
df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [10, 20, 30, 40],
                   "C": [100, 200, 300, 400]})
columns_to_select = ["B", "A", "A", "C"]
Goal:
[10, 2, 3, 400]
One method that works is to use an apply statement.
df["cols"] = columns_to_select
df.apply(lambda x: x[x.cols], axis=1)
Unfortunately, this is not a vectorized operation and takes a long time on a large dataset. Any ideas would be appreciated.
Pandas approach:
In [22]: df['new'] = df.lookup(df.index, columns_to_select)
In [23]: df
Out[23]:
A B C new
0 1 10 100 10
1 2 20 200 2
2 3 30 300 3
3 4 40 400 400
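A note not in the original answer: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer pandas, a rough equivalent (a sketch, assuming the original three-column df and unique column labels) is:
import numpy as np
# Positional column index for each row's requested label, then NumPy fancy indexing
col_pos = df.columns.get_indexer(columns_to_select)
df['new'] = df.to_numpy()[np.arange(len(df)), col_pos]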
NumPy way
Here's a vectorized NumPy way using advanced indexing -
# Extract array data
In [10]: a = df.values
# Get integer based column IDs
In [11]: col_idx = np.searchsorted(df.columns, columns_to_select)
# Use NumPy's advanced indexing to extract relevant elem per row
In [12]: a[np.arange(len(col_idx)), col_idx]
Out[12]: array([ 10, 2, 3, 400])
If the column names of df are not sorted, we need to use the sorter argument with np.searchsorted. The code to extract col_idx for such a generic df would be:
# https://stackoverflow.com/a/38489403/ @Divakar
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]
So, col_idx would be obtained like so -
col_idx = column_index(df, columns_to_select)
Further optimization
Profiling revealed that the bottleneck was np.searchsorted processing strings, a known NumPy weak spot. To overcome that, and exploiting the special case of the column names being single characters, we can convert them to their integer character codes and feed those to searchsorted for much faster processing.
Thus, an optimized version of getting the integer based column IDs, for the case where the column names are single letters and sorted, would be -
def column_index_singlechar_sorted(df, query_cols):
    # Convert single-character labels to their byte codes for a fast numeric searchsorted
    # (np.fromstring is deprecated on newer NumPy; frombuffer on encoded bytes is equivalent)
    c0 = np.frombuffer(''.join(df.columns).encode(), dtype=np.uint8)
    c1 = np.frombuffer(''.join(query_cols).encode(), dtype=np.uint8)
    return np.searchsorted(c0, c1)
This gives us a modified version of the solution, like so -
a = df.values
col_idx = column_index_singlechar_sorted(df, columns_to_select)
out = pd.Series(a[np.arange(len(col_idx)), col_idx])
Timings -
In [149]: # Setup df with 26 uppercase column letters and many rows
...: import string
...: df = pd.DataFrame(np.random.randint(0,9,(1000000,26)))
...: s = list(string.ascii_uppercase[:df.shape[1]])
...: df.columns = s
...: idx = np.random.randint(0,df.shape[1],len(df))
...: columns_to_select = np.take(s, idx).tolist()
# With df.lookup from @MaxU's soln
In [150]: %timeit pd.Series(df.lookup(df.index, columns_to_select))
10 loops, best of 3: 76.7 ms per loop
# With proposed one from this soln
In [151]: %%timeit
...: a = df.values
...: col_idx = column_index_singlechar_sorted(df, columns_to_select)
...: out = pd.Series(a[np.arange(len(col_idx)), col_idx])
10 loops, best of 3: 59 ms per loop
Given that df.lookup solves the generic case, it's probably the better choice, but the other optimizations shown in this post could be handy as well!
Related
Consider the following example
import pandas as pd
import numpy as np
myidx = pd.date_range('2016-01-01','2017-01-01')
data = pd.DataFrame({'value': range(len(myidx))}, index=myidx)
data.head()
Out[16]:
value
2016-01-01 0
2016-01-02 1
2016-01-03 2
2016-01-04 3
2016-01-05 4
This problem is related to expanding each row in a dataframe
I absolutely need to improve the performance of something that is intuitively very simple: I need to "enlarge" the dataframe so that each index value gets "enlarged" by a couple of days (2 days before, 2 days after).
To do this task, I have the following function
def expand_onerow(df, ndaysback=2, nhdaysfwd=2):
    new_index = pd.date_range(pd.to_datetime(df.index[0]) - pd.Timedelta(days=ndaysback),
                              pd.to_datetime(df.index[0]) + pd.Timedelta(days=nhdaysfwd),
                              freq='D')
    newdf = df.reindex(index=new_index, method='nearest')  # New df with expanded index
    return newdf
Now either using iterrows or the (supposedly) faster itertuples gives poor results.
%timeit pd.concat([expand_onerow(data.loc[[x],:], ndaysback = 2, nhdaysfwd = 2) for x ,_ in data.iterrows()])
1 loop, best of 3: 574 ms per loop
%timeit pd.concat([expand_onerow(data.loc[[x.Index],:], ndaysback = 2, nhdaysfwd = 2) for x in data.itertuples()])
1 loop, best of 3: 643 ms per loop
Any ideas how to speed up the generation of final dataframe? I have millions of obs in my real dataframe, and the index dates are not necessarily consecutive as they are in this example.
head(10) on the final dataframe
Out[21]:
value
2015-12-30 0
2015-12-31 0
2016-01-01 0
2016-01-02 0
2016-01-03 0
2015-12-31 1
2016-01-01 1
2016-01-02 1
2016-01-03 1
2016-01-04 1
Thanks!
When using NumPy/pandas, the key to speed is usually applying vectorized functions to the largest arrays/NDFrames possible. The main reason your original code is slow is that it calls expand_onerow once per row; the rows are tiny and there are millions of them. To make it faster, we need to express the calculation in terms of functions applied to whole DataFrames, or at least whole columns, so that more time is spent in fast C or Fortran code and less in slower Python code.
In this case, the result can be obtained by making copies of data and shifting the index of the whole DataFrame by i days:
new = df.copy()
new.index = df.index + pd.Timedelta(days=i)
dfs.append(new)
and then concatenating the shifted copies:
pd.concat(dfs)
import pandas as pd
import numpy as np

myidx = pd.date_range('2016-01-01', '2017-01-01')
data = pd.DataFrame({'value': range(len(myidx))}, index=myidx)

def expand_onerow(df, ndaysback=2, nhdaysfwd=2):
    new_index = pd.date_range(pd.to_datetime(df.index[0]) - pd.Timedelta(days=ndaysback),
                              pd.to_datetime(df.index[0]) + pd.Timedelta(days=nhdaysfwd),
                              freq='D')
    newdf = df.reindex(index=new_index, method='nearest')  # New df with expanded index
    return newdf

def orig(df, ndaysback=2, ndaysfwd=2):
    # operate on the df argument (rather than the global data) so the function is self-contained
    return pd.concat([expand_onerow(df.loc[[x], :], ndaysback=ndaysback, nhdaysfwd=ndaysfwd)
                      for x, _ in df.iterrows()])

def alt(df, ndaysback=2, ndaysfwd=2):
    dfs = [df]
    for i in range(-ndaysback, ndaysfwd + 1):
        if i != 0:
            new = df.copy()
            new.index = df.index + pd.Timedelta(days=i)
            # you could instead use
            # new = df.set_index(df.index + pd.Timedelta(days=i))
            # but it made the timeit result a bit slower
            dfs.append(new)
    return pd.concat(dfs)
Notice that alt has a Python loop with (essentially) 4 iterations. orig has a Python loop (in the form of a list comprehension) with len(df) iterations. Making fewer function calls and applying vectorized functions to bigger array-like objects is how alt gains speed over orig.
Here is a benchmark comparing orig and alt on data:
In [40]: %timeit orig(data)
1 loop, best of 3: 1.15 s per loop
In [76]: %timeit alt(data)
100 loops, best of 3: 2.22 ms per loop
In [77]: 1150/2.22
Out[77]: 518.018018018018
So alt is over 500x faster than orig on a 367-row DataFrame. For small-to-medium sized DataFrames, the speed advantage tends to grow as len(data) gets larger, because alt's Python loop still has 4 iterations while orig's loop gets longer. At some point, however, for really large DataFrames, I would expect the speed advantage to level off at some constant factor -- I don't know how large that would be, except that it should be greater than 500x.
This checks that the two functions, orig and alt, produce the same result (but in a different order):
result = alt(data)
expected = orig(data)
result = result.reset_index().sort_values(by=['index','value']).reset_index(drop=True)
expected = expected.reset_index().sort_values(by=['index','value']).reset_index(drop=True)
assert expected.equals(result)
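For completeness, here is a sketch, not from the original answer, that builds the expanded index directly with NumPy broadcasting (it assumes a single 'value' column, as in the example); it produces the same rows as alt, just grouped by original row rather than by shift:
def expand_numpy(df, ndaysback=2, ndaysfwd=2):
    offsets = np.arange(-ndaysback, ndaysfwd + 1)
    # Pair every original timestamp with every day offset via broadcasting, then flatten
    new_index = (df.index.values[:, None] + offsets * np.timedelta64(1, 'D')).ravel()
    # Repeat each row's value once per offset so values stay aligned with new_index
    new_values = np.repeat(df['value'].values, len(offsets))
    return pd.DataFrame({'value': new_values}, index=new_index)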
Given a dataframe a with 3 columns, A, B, C, and 3 rows of numerical values: how does one sort all the rows with a comparison operator using only the product A[i]*B[i]? It seems that the pandas sort only takes columns and then a sort method.
I would like to use a comparison function like below.
f = lambda i,j: a['A'][i]*a['B'][i] < a['A'][j]*a['B'][j]
There are at least two ways:
Method 1
Say you start with
In [175]: df = pd.DataFrame({'A': [1, 2], 'B': [1, -1], 'C': [1, 1]})
You can add a column which is your sort key
In [176]: df['sort_val'] = df.A * df.B
Finally sort by it and drop it
In [190]: df.sort_values('sort_val').drop(columns='sort_val')
Out[190]:
A B C
1 2 -1 1
0 1 1 1
Method 2
Use numpy.argsort and then use .iloc on the resulting positions (.ix has since been removed from pandas):
In [197]: import numpy as np
In [198]: df.iloc[np.argsort(df.A * df.B).values]
Out[198]:
A B C
0 1 1 1
1 2 -1 1
Another way, adding it here because this is the first result at Google:
df.loc[(df.A * df.B).sort_values().index]
This works well for me and is pretty straightforward. @Ami Tavory's answer gave strange results for me with a categorical index; I'm not sure that's the cause, though.
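A small generalization, as a sketch not taken from any of the answers above: the same idea wrapped in a helper that sorts rows by an arbitrary row-wise key, where key is any function mapping the DataFrame to a Series aligned with its index:
def sort_rows_by(df, key):
    # key(df) must return a Series aligned with df.index
    return df.loc[key(df).sort_values().index]

sorted_df = sort_rows_by(df, lambda d: d.A * d.B)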
Just adding, on top of @srs's super elegant answer, an iloc option with some time comparisons against loc and the naive solution.
(iloc is preferred when your index is position-based, vs label-based for loc.)
import numpy as np
import pandas as pd
N = 10000
df = pd.DataFrame({
'A': np.random.randint(low=1, high=N, size=N),
'B': np.random.randint(low=1, high=N, size=N)
})
%%timeit -n 100
df['C'] = df['A'] * df['B']
df.sort_values(by='C')
naive: 100 loops, best of 3: 1.85 ms per loop
%%timeit -n 100
df.loc[(df.A * df.B).sort_values().index]
loc: 100 loops, best of 3: 2.69 ms per loop
%%timeit -n 100
df.iloc[(df.A * df.B).sort_values().index]
iloc: 100 loops, best of 3: 2.02 ms per loop
df['C'] = df['A'] * df['B']
df1 = df.sort_values(by='C')
df2 = df.loc[(df.A * df.B).sort_values().index]
df3 = df.iloc[(df.A * df.B).sort_values().index]
print(np.array_equal(df1.index, df2.index))
print(np.array_equal(df2.index, df3.index))
testing results (comparing the entire index order) between all options:
True
True
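One caveat not mentioned above: the iloc variant only works here because df uses the default RangeIndex, so the labels returned by .sort_values().index coincide with positional offsets. With a non-default index, stick to loc, or convert labels to positions explicitly, e.g. (a sketch):
order = df.index.get_indexer((df.A * df.B).sort_values().index)
df.iloc[order]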
Is there a way to split a Numpy Array in a vectorized manner based upon character count for each element?
Input:
In [1]: import numpy as np
In [2]: y = np.array([ 'USC00013160194806SNOW','USC00013160194806SNOW','USC00013160194806SNOW' ])
In [3]: y
Out[3]:
array(['USC00013160194806SNOW', 'USC00013160194806SNOW',
'USC00013160194806SNOW'],
dtype='|S21')
I want each element of the array split according to a certain number of characters.
Desired Output:
In [3]: y
Out[3]:
array(['USC00013160', 'USC00013160',
'USC00013160'],
dtype='|S21')
I've executed this using standard Python loops, but I'm dealing with millions of values, so I'm trying to figure out the fastest method.
You can create a view using a data type with the same size as y's dtype that has subfields corresponding to the parts that you want. For example,
In [22]: y
Out[22]:
array(['USC00013160194806SNOW', 'USC00013160194806SNOW',
'USC00013160194806SNOW'],
dtype='|S21')
In [23]: dt = np.dtype([('part1', 'S11'), ('part2', 'S6'), ('part3', 'S4')])
In [24]: v = y.view(dt)
In [25]: v['part1']
Out[25]:
array(['USC00013160', 'USC00013160', 'USC00013160'],
dtype='|S11')
In [26]: v['part2']
Out[26]:
array(['194806', '194806', '194806'],
dtype='|S6')
In [27]: v['part3']
Out[27]:
array(['SNOW', 'SNOW', 'SNOW'],
dtype='|S4')
Note that these are all views of the same data in y. If you modify them in place, you are also modifying y. For example,
In [32]: v3 = v['part3']
In [33]: v3
Out[33]:
array(['SNOW', 'SNOW', 'SNOW'],
dtype='|S4')
Change v3[1] to 'RAIN':
In [34]: v3[1] = 'RAIN'
In [35]: v3
Out[35]:
array(['SNOW', 'RAIN', 'SNOW'],
dtype='|S4')
Now see that y[1] is also changed:
In [36]: y
Out[36]:
array(['USC00013160194806SNOW', 'USC00013160194806RAIN',
'USC00013160194806SNOW'],
dtype='|S21')
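One practical note, not in the original answer: the structured dtype's total size must match y's itemsize for the view to be allowed. A quick sanity check:
# 11 + 6 + 4 = 21 bytes, matching the '|S21' itemsize
assert dt.itemsize == y.dtype.itemsize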
One possible solution I've found is to complete the operation using a pandas Series, but I'm wondering if this can be done using only NumPy array slicing methods. If not, that's fine; I'm mostly curious about best practice.
Starting Pandas Series:
In [33]: x = pd.read_csv("data.txt", delimiter='\n', dtype=str, squeeze=True)
In [34]: x
Out[34]:
0 USC00013160194807SNOW
1 USC00013160194808SNOW
2 USC00013160194809SNOW
3 USC00013160194810SNOW
4 USC00013160194811SNOW
dtype: object
Vectorized String Processing based on Character Count:
In [37]: k = x.str[0:11]
Output:
In [38]: k
Out[38]:
0 USC00013160
1 USC00013160
2 USC00013160
3 USC00013160
4 USC00013160
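As an aside, not from the original post: if only the leading prefix is needed, as in the desired output above, a plain NumPy cast to a shorter fixed-width dtype also truncates in a vectorized way. A sketch using the bytes array y from earlier:
prefix = y.astype('S11')
# array([b'USC00013160', b'USC00013160', b'USC00013160'], dtype='|S11')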
I have a dataframe in which all values are of the same variety (e.g. a correlation matrix -- but where we expect a unique maximum). I'd like to return the row and the column of the maximum of this matrix.
I can get the max across rows or columns by changing the first argument of
df.idxmax()
however I haven't found a suitable way to return the row/column index of the max of the whole dataframe.
For example, I can do this in numpy:
>>>npa = np.array([[1,2,3],[4,9,5],[6,7,8]])
>>>np.where(npa == np.amax(npa))
(array([1]), array([1]))
But when I try something similar in pandas:
>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>df.where(df == df.max().max())
a b c
d NaN NaN NaN
e NaN 9 NaN
f NaN NaN NaN
At a second level, what I actually want to do is to return the rows and columns of the top n values, e.g. as a Series.
E.g. for the above I'd like a function which does:
>>>topn(df,3)
b e
c f
b f
dtype: object
>>>type(topn(df,3))
pandas.core.series.Series
or even just
>>>topn(df,3)
(['b','c','b'],['e','f','f'])
a la numpy.where()
I figured out the first part:
npa = df.values
indx, cols = np.where(npa == np.amax(npa))  # np.where returns (row indices, column indices)
([df.columns[c] for c in cols], [df.index[i] for i in indx])
Now I need a way to get the top n. One naive idea is to copy the array and iteratively replace the top values with NaN, grabbing the indices as you go. That seems inefficient. Is there a better way to get the top n values of a NumPy array? Fortunately, as shown here, there is, through argpartition, but we have to use flattened indexing.
def topn(df, n):
    npa = df.values
    topn_ind = np.argpartition(npa, -n, axis=None)[-n:]  # flattened indices of the top n, unsorted
    topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1]  # sort in descending order of value
    cols, indx = np.unravel_index(topn_ind, npa.shape, order='F')  # unflatten, using column-major ordering
    return ([df.columns[c] for c in cols], [df.index[i] for i in indx])
Trying this on the example:
>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])
As desired. Mind you the sorting was not originally asked for, but provides little overhead if n is not large.
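One caveat worth noting, not in the original post: argpartition flattens in C order while unravel_index is called with Fortran order, so the implicit row/column swap only lines up for square frames like a correlation matrix. A sketch that avoids the trick and works for any shape:
def topn_any_shape(df, n):
    a = df.values
    flat_idx = np.argpartition(a, -n, axis=None)[-n:]        # unsorted top-n, C-order flat indices
    flat_idx = flat_idx[np.argsort(a.flat[flat_idx])][::-1]  # sort descending by value
    rows, cols = np.unravel_index(flat_idx, a.shape)         # C order, matching a.flat
    return ([df.columns[c] for c in cols], [df.index[r] for r in rows])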
What you want to use is stack:
df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
df = df.stack()
df = df.sort_values(ascending=False)  # Series.sort was removed; sort_values returns the sorted Series
df.head(4)
e b 9
f c 8
b 7
a 6
dtype: int64
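On modern pandas, a more compact spelling of the same idea (a sketch) is Series.nlargest on the stacked frame:
# Rebuild the frame, since df was reassigned above, then take the 4 largest directly
frame = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]], columns=list('abc'), index=list('def'))
frame.stack().nlargest(4)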
I guess for what you are trying to do a DataFrame might not be the best choice, since the idea of the columns in the DataFrame is to hold independent data.
>>> def topn(df, n):
        # pull the data out of the DataFrame
        # and flatten it to an array (column-major, to match cols/idxs below)
        vals = df.values.flatten(order='F')
        # next we sort the array and store the sort mask
        p = np.argsort(vals)
        # create two arrays with the column names and indexes
        # in the same order as vals
        cols = np.array([[col] * len(df.index) for col in df.columns]).flatten()
        idxs = np.array([list(df.index) for idx in df.index]).flatten()
        # sort and return cols, and idxs
        return cols[p][:-(n + 1):-1], idxs[p][:-(n + 1):-1]
>>> topn(df,3)
(array(['b', 'c', 'b'],
dtype='|S1'),
array(['e', 'f', 'f'],
dtype='|S1'))
>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop
@watsonic's solution takes slightly less:
%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop
but both are way faster than stack:
def topStack(df, n):
    df = df.stack()
    df = df.sort_values(ascending=False)
    return df.head(n)
%timeit(topStack(df,3))
1000 loops, best of 3: 1.91 ms per loop
When I sum a DataFrame, it returns a Series:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])
In [3]: df
Out[3]:
a b c
0 1 2 3
1 2 3 3
In [4]: s = df.sum()
In [5]: type(s)
Out[5]: pandas.core.series.Series
I know I can construct a new DataFrame from this Series. But is there a more "pandasic" way?
I'm going to go ahead and say... "No", I don't think there is a direct way to do it, the pandastic way (and pythonic too) is to be explicit:
pd.DataFrame(df.sum(), columns=['sum'])
or, more elegantly, using a dictionary (be aware that this copies the summed array):
pd.DataFrame({'sum': df.sum()})
As @root notes, it's faster to use:
pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])
(As the zen of python states: "practicality beats purity", so if you care about this time, use this).
However, perhaps the most pandastic way is to just use the Series! :)
Some %timeits for your tiny example:
In [11]: %timeit pd.DataFrame(df.sum(), columns=['sum'])
1000 loops, best of 3: 356 us per loop
In [12]: %timeit pd.DataFrame({'sum': df.sum()})
1000 loops, best of 3: 462 us per loop
In [13]: %timeit pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])
1000 loops, best of 3: 205 us per loop
and for a slightly larger one:
In [21]: df = pd.DataFrame(np.random.randn(100000, 3), columns=list('abc'))
In [22]: %timeit pd.DataFrame(df.sum(), columns=['sum'])
100 loops, best of 3: 7.99 ms per loop
In [23]: %timeit pd.DataFrame({'sum': df.sum()})
100 loops, best of 3: 8.3 ms per loop
In [24]: %timeit pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])
100 loops, best of 3: 2.47 ms per loop
Often it is necessary not only to convert the sum of the columns into a dataframe, but also to transpose the resulting dataframe. There is also a method for this:
df.sum().to_frame().transpose()
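With the example df from this question, that would look roughly like:
   a  b  c
0  3  5  6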
I am not sure about earlier versions, but as of pandas 0.18.1 one can use pandas.Series.to_frame method.
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])
s = df.sum().to_frame(name='sum')
type(s)
>>> pandas.core.frame.DataFrame
The name argument is optional and defines the column name.
You can use agg for simple operations like sum, have a look at how compact this is:
df.agg(['sum'])
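For the same example df, df.agg(['sum']) returns a one-row DataFrame labeled 'sum', roughly:
     a  b  c
sum  3  5  6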
df.sum().to_frame() should do what you want.
See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.to_frame.html.
Storing aggregate results directly via df.sum().to_frame() is not always the best option, particularly when you want to keep the aggregated labels and their sums as separate columns: df.sum().to_frame() puts the labels in the index and the sums in a single column.
Try the version below for a cleaner layout.
a = df.sum()
sums = list(a)           # the column sums
labels = list(a.index)   # the column names
Series_Dict = {"Agg_Value": labels, "Agg_Sum": sums}
Agg_DF = pd.DataFrame(Series_Dict)
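For the example df above, Agg_DF would look roughly like this, with labels and sums kept as separate columns:
  Agg_Value  Agg_Sum
0         a        3
1         b        5
2         c        6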