pandas sort lambda function - python

Given a DataFrame a with 3 columns, A, B, C, and 3 rows of numerical values, how does one sort all the rows with a comparison operator using only the product A[i]*B[i]? It seems that the pandas sort only takes columns and then a sort method.
I would like to use a comparison function like the one below.
f = lambda i,j: a['A'][i]*a['B'][i] < a['A'][j]*a['B'][j]
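(For reference, such a comparator can be applied literally by sorting the index with functools.cmp_to_key and reindexing - a sketch only, since the vectorized answers below are much faster:)
import functools
import pandas as pd
a = pd.DataFrame({'A': [1, 2, 3], 'B': [3, -1, 2], 'C': [1, 1, 1]})
# cmp_to_key expects a three-way comparator, so return the signed difference of the products
cmp = lambda i, j: (a['A'][i] * a['B'][i]) - (a['A'][j] * a['B'][j])
a_sorted = a.loc[sorted(a.index, key=functools.cmp_to_key(cmp))]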

There are at least two ways:
Method 1
Say you start with
In [175]: df = pd.DataFrame({'A': [1, 2], 'B': [1, -1], 'C': [1, 1]})
You can add a column which is your sort key:
In [176]: df['sort_val'] = df.A * df.B
Finally, sort by it and drop it:
In [190]: df.sort_values('sort_val').drop('sort_val', axis=1)
Out[190]:
   A  B  C
1  2 -1  1
0  1  1  1
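If you'd rather not mutate df, the same idea fits in one expression with assign (a sketch, assuming a reasonably recent pandas that supports drop(columns=...)):
df.assign(sort_val=df.A * df.B).sort_values('sort_val').drop(columns='sort_val')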
Method 2
Use numpy.argsort and then use .iloc on the resulting positions:
In [197]: import numpy as np
In [198]: df.iloc[np.argsort(df.A * df.B).values]
Out[198]:
   A  B  C
0  1  1  1
1  2 -1  1

Another way, added here because this is the first result on Google:
df.loc[(df.A * df.B).sort_values().index]
This works well for me and is pretty straightforward. @Ami Tavory's answer gave strange results for me with a categorical index, though I'm not sure that was the cause.
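The same pattern generalizes to any derived sort key; a small helper (a sketch, not from the original answers) keeps it readable:
def sort_by_key(frame, key_series):
    # Reorder frame rows by the ascending values of key_series, which must share frame's index.
    return frame.loc[key_series.sort_values().index]
sorted_df = sort_by_key(df, df.A * df.B)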

Just adding to @srs's super elegant answer: an iloc option, with some time comparisons against loc and the naive solution.
(iloc is preferred when your index is position-based, vs label-based for loc.)
import numpy as np
import pandas as pd
N = 10000
df = pd.DataFrame({
    'A': np.random.randint(low=1, high=N, size=N),
    'B': np.random.randint(low=1, high=N, size=N)
})
%%timeit -n 100
df['C'] = df['A'] * df['B']
df.sort_values(by='C')
naive: 100 loops, best of 3: 1.85 ms per loop
%%timeit -n 100
df.loc[(df.A * df.B).sort_values().index]
loc: 100 loops, best of 3: 2.69 ms per loop
%%timeit -n 100
df.iloc[(df.A * df.B).sort_values().index]
iloc: 100 loops, best of 3: 2.02 ms per loop
df['C'] = df['A'] * df['B']
df1 = df.sort_values(by='C')
df2 = df.loc[(df.A * df.B).sort_values().index]
df3 = df.iloc[(df.A * df.B).sort_values().index]
print(np.array_equal(df1.index, df2.index))
print(np.array_equal(df2.index, df3.index))
Testing results (comparing the entire index order) across all options:
True
True

Related

Adding column of weights to pandas DF by a condition on the DF's column

What's the most pythonic way to add a column (of weights) to an existing Pandas DataFrame "df" by a condition on df's column?
Small example:
df = pd.DataFrame({'A' : [1, 2, 3], 'B' : [4, 5, 6]})
df
Out[110]:
   A  B
0  1  4
1  2  5
2  3  6
I'd like to add a "weight" column where df['weight'] = 20 if df['B'] >= 6, else df['weight'] = 1.
So my output will be:
   A  B  weight
0  1  4       1
1  2  5       1
2  3  6      20
Approach #1
Here's one with type-conversion and scaling (the boolean mask converts to 0/1, so *19 + 1 maps False to 1 and True to 20) -
df['weight'] = (df['B'] >= 6)*19+1
Approach #2
Another, possibly faster, one using the underlying array data -
df['weight'] = (df['B'].values >= 6)*19+1
Approach #3
Leverage multiple cores with the numexpr module -
import numexpr as ne
val = df['B'].values
df['weight'] = ne.evaluate('(val >= 6)*19+1')
Timings on 500k rows (as mentioned by the OP) with random data in the range [0,9), for the vectorized methods posted thus far -
In [149]: np.random.seed(0)
...: df = pd.DataFrame({'B' : np.random.randint(0,9,(500000))})
# @jpp's soln
In [150]: %timeit df['weight1'] = np.where(df['B'] >= 6, 20, 1)
100 loops, best of 3: 3.57 ms per loop
# @jpp's soln with array data
In [151]: %timeit df['weight2'] = np.where(df['B'].values >= 6, 20, 1)
100 loops, best of 3: 3.27 ms per loop
In [154]: %timeit df['weight3'] = (df['B'] >= 6)*19+1
100 loops, best of 3: 2.73 ms per loop
In [155]: %timeit df['weight4'] = (df['B'].values >= 6)*19+1
1000 loops, best of 3: 1.76 ms per loop
In [156]: %%timeit
...: val = df['B'].values
...: df['weight5'] = ne.evaluate('(val >= 6)*19+1')
1000 loops, best of 3: 1.14 ms per loop
One last one...
With the output being 1 or 20, we can safely use the lower-precision uint8 for a further speedup over the ones already discussed, like so -
In [208]: %timeit df['weight6'] = (df['B'].values >= 6)*np.uint8(19)+1
1000 loops, best of 3: 428 µs per loop
You can use numpy.where for a vectorised solution:
df['weight'] = np.where(df['B'] >= 6, 20, 1)
Result:
   A  B  weight
0  1  4       1
1  2  5       1
2  3  6      20
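If more than two weight values were ever needed, numpy.select generalizes this pattern (a sketch with hypothetical extra cutoffs, not part of the original question):
import numpy as np
conditions = [df['B'] >= 8, df['B'] >= 6]   # evaluated in order
choices = [50, 20]                          # hypothetical weights for each condition
df['weight'] = np.select(conditions, choices, default=1)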
Here's a method using df.apply
df['weight'] = df.apply(lambda row: 20 if row['B'] >= 6 else 1, axis=1)
Output:
In [6]: df
Out[6]:
   A  B  weight
0  1  4       1
1  2  5       1
2  3  6      20

Pandas: Select values from specific columns of a DataFrame by row

Given a DataFrame with multiple columns, how do we select values from specific columns by row to create a new Series?
df = pd.DataFrame({"A":[1,2,3,4],
"B":[10,20,30,40],
"C":[100,200,300,400]})
columns_to_select = ["B", "A", "A", "C"]
Goal:
[10, 2, 3, 400]
One method that works is to use an apply statement.
df["cols"] = columns_to_select
df.apply(lambda x: x[x.cols], axis=1)
Unfortunately, this is not a vectorized operation and takes a long time on a large dataset. Any ideas would be appreciated.
Pandas approach:
In [22]: df['new'] = df.lookup(df.index, columns_to_select)
In [23]: df
Out[23]:
   A   B    C  new
0  1  10  100   10
1  2  20  200    2
2  3  30  300    3
3  4  40  400  400
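Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a roughly equivalent sketch using Index.get_indexer and the underlying array (assuming unique column labels):
import numpy as np
col_idx = df.columns.get_indexer(columns_to_select)      # positional column ids
df['new'] = df.to_numpy()[np.arange(len(df)), col_idx]   # one picked element per row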
NumPy way
Here's a vectorized NumPy way using advanced indexing -
# Extract array data
In [10]: a = df.values
# Get integer based column IDs
In [11]: col_idx = np.searchsorted(df.columns, columns_to_select)
# Use NumPy's advanced indexing to extract relevant elem per row
In [12]: a[np.arange(len(col_idx)), col_idx]
Out[12]: array([ 10, 2, 3, 400])
If the column names of df are not sorted, we need to use the sorter argument of np.searchsorted. The code to extract col_idx for such a generic df would be:
# https://stackoverflow.com/a/38489403/ @Divakar
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]
So, col_idx would be obtained like so -
col_idx = column_index(df, columns_to_select)
Further optimization
Profiling revealed that the bottleneck was processing strings with np.searchsorted (NumPy is generally not great with strings). So, to overcome that, and exploiting the special case of the column names being single letters, we can quickly convert them to numbers and then feed those to searchsorted for much faster processing.
Thus, an optimized version of getting the integer-based column IDs, for the case where the column names are single letters and sorted, would be -
def column_index_singlechar_sorted(df, query_cols):
    # Encode the single-letter column names as uint8 codes and search on the numbers
    c0 = np.frombuffer(''.join(df.columns).encode(), dtype=np.uint8)
    c1 = np.frombuffer(''.join(query_cols).encode(), dtype=np.uint8)
    return np.searchsorted(c0, c1)
This gives us a modified version of the solution, like so -
a = df.values
col_idx = column_index_singlechar_sorted(df, columns_to_select)
out = pd.Series(a[np.arange(len(col_idx)), col_idx])
Timings -
In [149]: # Setup df with 26 uppercase column letters and many rows
...: import string
...: df = pd.DataFrame(np.random.randint(0,9,(1000000,26)))
...: s = list(string.ascii_uppercase[:df.shape[1]])
...: df.columns = s
...: idx = np.random.randint(0,df.shape[1],len(df))
...: columns_to_select = np.take(s, idx).tolist()
# With df.lookup from @MaxU's soln
In [150]: %timeit pd.Series(df.lookup(df.index, columns_to_select))
10 loops, best of 3: 76.7 ms per loop
# With proposed one from this soln
In [151]: %%timeit
...: a = df.values
...: col_idx = column_index_singlechar_sorted(df, columns_to_select)
...: out = pd.Series(a[np.arange(len(col_idx)), col_idx])
10 loops, best of 3: 59 ms per loop
Given that df.lookup solves the generic case, it's probably the better choice, but the other possible optimizations shown in this post could be handy as well!

python pandas - input values into new column

I have a small dataframe below of spending of 4 persons.
There is an empty column called 'Grade'.
I would like to rate those who spent more than $100 grade A, and grade B for those less than $100.
What is the most efficient method of filling up column 'Grade', assuming it is a big dataframe?
import pandas as pd
df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})
df['Grade']=''
You can use numpy.where:
df['Grade'] = np.where(df['Spending'] > 100, 'A', 'B')
print(df)
  Customer  Spending Grade
0      Bob       130     A
1      Ken        22     B
2    Steve       313     A
3      Joe        46     B
Timings:
df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})
#[400000 rows x 2 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
In [129]: %timeit df['Grade'] = np.where(df['Spending'] > 100, 'A', 'B')
10 loops, best of 3: 21.6 ms per loop
In [130]: %timeit df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis = 1)
1 loop, best of 3: 7.08 s per loop
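As an aside, pandas.cut can express the same banding and extends naturally to more grades by adding bin edges and labels (a sketch, not part of the original question):
import numpy as np
# Spending <= 100 falls in the first band ('B'), above 100 in the second ('A');
# the result is a categorical column whose values match the np.where version.
df['Grade'] = pd.cut(df['Spending'], bins=[-np.inf, 100, np.inf], labels=['B', 'A'])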
Another option is a lambda function with apply (though, as the timings above show, this is far slower than the vectorized np.where approach):
df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1)

Pandas DataFrame filtering

Let's say I have a DataFrame with four columns, each of which has a threshold value against which I'd like to compare the DataFrame's values.
I would simply like, element-wise, the minimum of the DataFrame's value and the corresponding column threshold.
For example:
df = pd.DataFrame(np.random.randn(100,4), columns=list('ABCD'))
>>> df.head()
          A         B         C         D
0 -2.060410 -1.390896 -0.595792 -0.374427
1  0.660580  0.726795 -1.326431 -1.488186
2 -0.955792 -1.852701 -0.895178 -1.353669
3 -1.002576 -0.321210  1.711597 -0.063274
4  1.217197  0.202063 -1.407561  0.940371
thresholds = pd.Series({'A': 1, 'B': 1.1, 'C': 1.2, 'D': 1.3})
This solution works (A4 and C3 were filtered), but there must be an easier way:
df_filtered = df.lt(thresholds).multiply(df) + df.gt(thresholds).multiply(thresholds)
>>> df_filtered.head()
          A         B         C         D
0 -2.060410 -1.390896 -0.595792 -0.374427
1  0.660580  0.726795 -1.326431 -1.488186
2 -0.955792 -1.852701 -0.895178 -1.353669
3 -1.002576 -0.321210  1.200000 -0.063274
4  1.000000  0.202063 -1.407561  0.940371
Ideally, I'd like to use .loc to filter in place, but I haven't managed to figure it out. I'm using Pandas 0.14.1 (and can't upgrade).
RESPONSE: Below are timed tests of my initial proposal against the alternatives:
%%timeit
df.lt(thresholds).multiply(df) + df.gt(thresholds).multiply(thresholds)
1000 loops, best of 3: 990 µs per loop
%%timeit
np.minimum(df, thresholds) # <--- Simple, fast, and returns DataFrame!
10000 loops, best of 3: 110 µs per loop
%%timeit
df[df < thresholds].fillna(thresholds, inplace=True)
1000 loops, best of 3: 1.36 ms per loop
This is pretty fast (and returns a dataframe):
np.minimum(df, [1.0, 1.1, 1.2, 1.3])
A pleasant surprise that numpy is so amenable to this without any reshaping or explicit conversions...
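In more recent pandas versions, DataFrame.clip can express the element-wise cap directly by aligning the thresholds on the columns (a sketch; likely not available on 0.14.1, which the OP is stuck on):
df_filtered = df.clip(upper=thresholds, axis=1)   # caps each column at its threshold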
How about:
df[df < thresholds].fillna(thresholds)

In pandas, how can I get a DataFrame as the output when I sum the DataFrame?

When I sum a DataFrame, it returns a Series:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])
In [3]: df
Out[3]:
   a  b  c
0  1  2  3
1  2  3  3
In [4]: s = df.sum()
In [5]: type(s)
Out[5]: pandas.core.series.Series
I know I can construct a new DataFrame from this Series. But is there a more "pandasic" way?
I'm going to go ahead and say... "No", I don't think there is a direct way to do it; the pandastic way (and pythonic too) is to be explicit:
pd.DataFrame(df.sum(), columns=['sum'])
or, more elegantly, using a dictionary (be aware that this copies the summed array):
pd.DataFrame({'sum': df.sum()})
As @root notes, it's faster to use:
pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])
(As the zen of python states: "practicality beats purity", so if you care about this time, use this).
However, perhaps the most pandastic way is to just use the Series! :)
Some %timeits for your tiny example:
In [11]: %timeit pd.DataFrame(df.sum(), columns=['sum'])
1000 loops, best of 3: 356 us per loop
In [12]: %timeit pd.DataFrame({'sum': df.sum()})
1000 loops, best of 3: 462 us per loop
In [13]: %timeit pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])
1000 loops, best of 3: 205 us per loop
and for a slightly larger one:
In [21]: df = pd.DataFrame(np.random.randn(100000, 3), columns=list('abc'))
In [22]: %timeit pd.DataFrame(df.sum(), columns=['sum'])
100 loops, best of 3: 7.99 ms per loop
In [23]: %timeit pd.DataFrame({'sum': df.sum()})
100 loops, best of 3: 8.3 ms per loop
In [24]: %timeit pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])
100 loops, best of 3: 2.47 ms per loop
Often it is necessary not only to convert the sum of the columns into a dataframe, but also to transpose the resulting dataframe. There is also a method for this:
df.sum().to_frame().transpose()
I am not sure about earlier versions, but as of pandas 0.18.1 one can use pandas.Series.to_frame method.
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])
s = df.sum().to_frame(name='sum')
type(s)
>>> pandas.core.frame.DataFrame
The name argument is optional and defines the column name.
You can use agg for simple operations like sum; have a look at how compact this is:
df.agg(['sum'])
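For the example df above (column sums 3, 5 and 6), this returns a one-row DataFrame with 'sum' as the index label:
out = df.agg(['sum'])
# out:
#      a  b  c
# sum  3  5  6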
df.sum().to_frame() should do what you want.
See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.to_frame.html.
Converting via df.sum().to_frame(), or storing aggregate results directly in a DataFrame like that, is not ideal when you want the aggregated labels and the aggregated sums as separate columns, since df.sum().to_frame() keeps them together as index and values. Try the version below for a cleaner separation.
a = df.sum()
sums = list(a)          # the column sums
labels = list(a.index)  # the column names
Series_Dict = {"Agg_Value": labels, "Agg_Sum": sums}
Agg_DF = pd.DataFrame(Series_Dict)
