Most efficient way to convert values of column in Pandas DataFrame - python

I have a pd.DataFrame (DF_test in the sample code below).
I want to apply a cutoff to the values to push them into binary digits; my cutoff in this case is 0.85. I want the resulting dataframe to look like DF_want below.
The script I wrote to do this is easy to understand but for large datasets it is inefficient. I'm sure Pandas has some way of taking care of these types of transformations.
Does anyone know of an efficient way to convert a column of floats to a column of integers using a threshold?
My extremely naive way of doing such a thing:
import numpy as np
import pandas as pd

DF_test = pd.DataFrame(np.array([list("abcde"),list("pqrst"),[0.12,0.23,0.93,0.86,0.33]]).T, columns=["c1","c2","value"])
DF_want = pd.DataFrame(np.array([list("abcde"),list("pqrst"),[0,0,1,1,0]]).T, columns=["c1","c2","value"])

threshold = 0.85

# Empty dataframe to append rows
DF_naive = pd.DataFrame()

for i in range(DF_test.shape[0]):
    # Get first 2 columns
    first2cols = list(DF_test.ix[i][:-1])
    # Check if value is greater than threshold
    binary_value = [int((bool(float(DF_test.ix[i][-1]) > threshold)))]
    # Create series object
    SR_row = pd.Series(first2cols + binary_value, name=i)
    # Add to empty dataframe container
    DF_naive = DF_naive.append(SR_row)

# Relabel columns
DF_naive.columns = DF_test.columns
DF_naive.head()
# this matches the sample DF_want

You can use np.where to set your desired value based on a boolean condition:
In [18]:
DF_test['value'] = np.where(DF_test['value'] > threshold, 1,0)
DF_test
Out[18]:
c1 c2 value
0 a p 0
1 b q 0
2 c r 1
3 d s 1
4 e t 0
Note that because your data comes from a heterogeneous NumPy array, the 'value' column contains strings rather than floats:
In [58]:
DF_test.iloc[0]['value']
Out[58]:
'0.12'
So you'll need to convert the dtype to float first: DF_test['value'] = DF_test['value'].astype(float)
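If some entries might not parse cleanly, a hedged alternative sketch (not from the original answer) is pd.to_numeric, which can coerce unparseable entries to NaN instead of raising:
# Sketch: convert the string 'value' column to numeric; bad entries become NaN
DF_test['value'] = pd.to_numeric(DF_test['value'], errors='coerce')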
You can compare the timings:
In [16]:
%timeit np.where(DF_test['value'] > threshold, 1,0)
1000 loops, best of 3: 297 µs per loop
In [17]:
%%timeit
DF_naive = pd.DataFrame()
for i in range(DF_test.shape[0]):
    # Get first 2 columns
    first2cols = list(DF_test.ix[i][:-1])
    # Check if value is greater than threshold
    binary_value = [int((bool(float(DF_test.ix[i][-1]) > threshold)))]
    # Create series object
    SR_row = pd.Series(first2cols + binary_value, name=i)
    # Add to empty dataframe container
    DF_naive = DF_naive.append(SR_row)
10 loops, best of 3: 39.3 ms per loop
The np.where version is over 100x faster; admittedly your code is doing a lot of unnecessary work, but you get the point.

Since bool is a subclass of int, i.e. True == 1 and False == 0, you can convert a Boolean series to its integer form:
DF_test['value'] = (DF_test['value'] > threshold).astype(int)
Generally, including most uses in computation or indexing, the int conversion is not necessary and you may wish to forgo it altogether.
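For example (a minimal sketch reusing DF_test from above, after converting 'value' to float), the raw Boolean series can be used directly for filtering or counting without the int cast:
mask = DF_test['value'] > threshold
DF_test[mask]   # rows above the cutoff
mask.sum()      # number of rows above the cutoff (True counts as 1)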

Related

how to vectorize a for loop on pandas dataframe?

I am working with a dataset of about 200,000 rows. In one column of the pandas DataFrame some values are empty lists, while most of them are lists with several values (see the example data below).
What I want to do is replace the empty lists with this list:
[[close*0.95,close*0.94]]
where close is the close value in that row. The for loop that I use is this one:
for i in range(1,len(data3.index)):
    close = data3.close[data3.index==data3.index[i]].values[0]
    sell_list = data3.sell[data3.index==data3.index[i]].values[0]
    buy_list = data3.buy[data3.index==data3.index[i]].values[0]
    if len(sell_list)== 0:
        data3.loc[data3.index[i],"sell"].append([[close*1.05,close*1.06]])
    if len(buy_list)== 0:
        data3.loc[data3.index[i],"buy"].append([[close*0.95,close*0.94]])
I tried to make it work with multithreading, but since I need to read the whole table before the next step I can't split the data. I hope you can help me write some kind of lambda function to apply to the DataFrame, or something similar; I am not very skilled at this. Thanks for reading!
The expected output in the "buy" column for a row that currently has an empty list should look like [[[11554, 11566]]].
Example data:
import pandas as pd
df = pd.DataFrame({'close': [11763, 21763, 31763], 'buy':[[], [[21763, 21767]], []]})
close buy
0 11763 []
1 21763 [[[21763, 21767]]]
2 31763 []
You could do it like this:
# Create mask (a bit faster than df['buy'].apply(len) == 0).
# Assumes there are no NaNs in the column. If you have NaNs, use pd.apply.
m = [len(l) == 0 for l in df['buy'].tolist()]
# Create triple nested lists and assign.
df.loc[m, 'buy'] = list(df.loc[m, ['close', 'close']].mul([0.95, 0.94]).to_numpy()[:, None][:, None])
print(df)
Result:
close buy
0 11763 [[[11174.85, 11057.22]]]
1 21763 [[[21763, 21767]]]
2 31763 [[[30174.85, 29857.219999999998]]]
Some explanation:
m is a boolean mask that selects the rows of the DataFrame with an empty list in the 'buy' column:
m = [len(l) == 0 for l in df['buy'].tolist()]
# Or (a bit slower):
# apply the len() function to all lists in the column
m = df['buy'].apply(len) == 0
print(m)
0 True
1 False
2 True
Name: buy, dtype: bool
We can use this mask to select where to calculate the values.
df.loc[m, ['close', 'close']].mul([0.95, 0.94]) duplicates the 'close' column and calculates the vectorised product of each (close, close) pair with (0.95, 0.94) to obtain (close*0.95, close*0.94) in each row of the resulting array.
[:, None][:, None] is just a trick to create two additional axes on the resulting array. This is required since you want triple nested lists ([[[]]]).
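To illustrate the shape trick in isolation, here is a small standalone sketch with made-up numbers (not taken from the answer):
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])  # shape (2, 2): one (close*0.95, close*0.94) pair per row
b = a[:, None][:, None]                 # shape (2, 1, 1, 2)
list(b)[0]                              # array([[[1., 2.]]]) -> a triple nested list for that row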

Groupby in python pandas: Fast Way

I want to improve the time of a groupby in python pandas.
I have this code:
df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
The objective is to count how many contracts a client has in a month and add this information in a new column (Nbcontrats).
Client: client code
Month: month of data extraction
Contrat: contract number
I want to improve the time. Below I am only working with a subset of my real data:
%timeit df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
1 loops, best of 3: 391 ms per loop
df.shape
Out[309]: (7464, 61)
How can I improve the execution time?
Here's one way to proceed :
Slice out the relevant columns (['Client', 'Month']) from the input dataframe into a NumPy array. This is mostly a performance-focused idea as we would be using NumPy functions later on, which are optimized to work with NumPy arrays.
Convert the two columns data from ['Client', 'Month'] into a single 1D array, which would be a linear index equivalent of it considering elements from the two columns as pairs. Thus, we can assume that the elements from 'Client' represent the row indices, whereas 'Month' elements are the column indices. This is like going from 2D to 1D. But, the issue would be deciding the shape of the 2D grid to perform such a mapping. To cover all pairs, one safe assumption would be assuming a 2D grid whose dimensions are one more than the max along each column because of 0-based indexing in Python. Thus, we would get linear indices.
Next up, we tag each linear index based on its uniqueness among the others. I think these tags would correspond to the keys obtained with groupby. We also need the counts of each group/unique key along the entire length of that 1D array. Finally, indexing into the counts with those tags maps the respective count onto each element.
That's the whole idea about it! Here's the implementation -
# Save relevant columns as a NumPy array for performing NumPy operations afterwards
arr_slice = df[['Client', 'Month']].values
# Get linear indices equivalent of those columns
lidx = np.ravel_multi_index(arr_slice.T,arr_slice.max(0)+1)
# Get unique IDs corresponding to each linear index (i.e. group) and grouped counts
unq,unqtags,counts = np.unique(lidx,return_inverse=True,return_counts=True)
# Index counts with the unique tags to map across all elements with the counts
df["Nbcontrats"] = counts[unqtags]
Runtime test
1) Define functions :
def original_app(df):
    df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)

def vectorized_app(df):
    arr_slice = df[['Client', 'Month']].values
    lidx = np.ravel_multi_index(arr_slice.T, arr_slice.max(0)+1)
    unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
    df["Nbcontrats"] = counts[unqtags]
2) Verify results :
In [143]: # Let's create a dataframe with 100 unique IDs and of length 10000
...: arr = np.random.randint(0,100,(10000,3))
...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
...: df1 = df.copy()
...:
...: # Run the function on the inputs
...: original_app(df)
...: vectorized_app(df1)
...:
In [144]: np.allclose(df["Nbcontrats"],df1["Nbcontrats"])
Out[144]: True
3) Finally time them :
In [145]: # Let's create a dataframe with 100 unique IDs and of length 10000
...: arr = np.random.randint(0,100,(10000,3))
...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
...: df1 = df.copy()
...:
In [146]: %timeit original_app(df)
1 loops, best of 3: 645 ms per loop
In [147]: %timeit vectorized_app(df1)
100 loops, best of 3: 2.62 ms per loop
With the DataFrameGroupBy.size method:
df.set_index(['Client', 'Month'], inplace=True)
df['Nbcontrats'] = df.groupby(level=(0,1)).size()
df.reset_index(inplace=True)
The most work goes into assigning the result back into a column of the source DataFrame.
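A hedged alternative sketch (assuming a reasonably recent pandas) that avoids the set_index/reset_index round trip is to let groupby broadcast the group size back to each row directly:
df['Nbcontrats'] = df.groupby(['Client', 'Month'])['Contrat'].transform('size')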

Pandas: Filter dataframe for values that are too frequent or too rare

On a pandas dataframe, I know I can groupby on one or more columns and then filter values that occur more/less than a given number.
But I want to do this on every column on the dataframe. I want to remove values that are too infrequent (let's say that occur less than 5% of times) or too frequent. As an example, consider a dataframe with following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.
import pandas as pd
import string
import numpy as np
vals = [(c, np.random.choice(list(string.ascii_lowercase), 100, replace=True)) for c in
        ('city of origin', 'city of destination', 'distance, type of transport (air/car/foot)',
         'time of day, price-interval')]
df = pd.DataFrame(dict(vals))
>> df.head()
city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
0 f p a n
1 k b a f
2 q s n j
3 h c g u
4 w d m h
If this is a big dataframe, it makes sense to remove rows that have spurious items, for example, if time of day = night occurs only 3% of the time, or if foot mode of transport is rare, and so on.
I want to remove all such values from all columns (or a list of columns). One idea I have is to do a value_counts on every column, transform and add one column for each value_counts; then filter based on whether they are above or below a threshold. But I think there must be a better way to achieve this?
This procedure will go through each column of the DataFrame and eliminate rows where the given category is less than a given threshold percentage, shrinking the DataFrame on each loop.
This answer is similar to the one provided by Ami Tavory, but with a few subtle differences:
It normalizes the value counts so you can just use a percentile threshold.
It calculates counts just once per column instead of twice. This results in faster execution.
Code:
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
Code timing:
df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), [1000000, 4], replace=True),
                   columns=list('ABCD'))

%%timeit df=df2.copy()
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
1 loops, best of 3: 485 ms per loop

%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
1 loops, best of 3: 688 ms per loop
I would go with one of the following:
Option A
m = 0.03 * len(df)
df[np.all(
    df.apply(
        lambda c: c.isin(c.value_counts()[c.value_counts() > m].index).as_matrix()),
    axis=1)]
Explanation:
m = 0.03 * len(df) is the threshold (it's nice to take the constant out of the complicated expression)
df[np.all(..., axis=1)] retains the rows where some condition was obtained across all columns.
df.apply(...).as_matrix applies a function to all columns, and makes a matrix of the results.
c.isin(...) checks, for each column item, whether it is in some set.
c.value_counts()[c.value_counts() > m].index is the set of all values in a column whose count is above m.
Option B
m = 0.03 * len(df)
for c in df.columns:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
The explanation is similar to the one above.
Tradeoffs:
Personally, I find B more readable.
B creates a new DataFrame for each filtering of a column; for large DataFrames, it's probably more expensive.
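A third variant, sketched here as my own addition rather than one of the options above, builds a single combined mask and filters once at the end. Note it is not exactly equivalent to B: counts are computed on the full DataFrame rather than on the progressively shrinking one.
m = 0.03 * len(df)
mask = pd.Series(True, index=df.index)
for c in df.columns:
    counts = df[c].value_counts()
    mask &= df[c].isin(counts[counts > m].index)
df_filtered = df[mask]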
I am new to Python and using Pandas. I came up with the following solution below. Maybe other people might have a better or more efficient approach.
Assuming your DataFrame is DF, you can use the following code below to filter out all infrequent values. Just be sure to update the col and bin_freq variable. DF_Filtered is your new filtered DataFrame.
# Column you want to filter
col = 'time of day'
# Set your frequency to filter out. Currently set to 5%
bin_freq = float(5)/float(100)
DF_Filtered = pd.DataFrame()
for i in DF[col].unique():
    counts = DF[DF[col]==i].count()[col]
    total_counts = DF[col].count()
    freq = float(counts)/float(total_counts)
    if freq > bin_freq:
        DF_Filtered = pd.concat([DF[DF[col]==i], DF_Filtered])
print(DF_Filtered)
DataFrames support clip_lower(threshold, axis=None) and clip_upper(threshold, axis=None), which clip (rather than remove) all values below or above (respectively) a certain threshold.
We can also replace all the rare categories with one label, say "Rare", and remove them later if this doesn't add value to prediction.
# function that finds the labels present more than a certain percentage/threshold
def get_freq_labels(df, var, rare_perc):
    df = df.copy()
    tmp = df.groupby(var)[var].count() / len(df)
    return tmp[tmp > rare_perc].index

vars_cat = [val for val in data.columns if data[val].dtype=='O']

for var in vars_cat:
    # find the frequent categories
    frequent_cat = get_freq_labels(data, var, 0.05)
    # replace rare categories by the string "Rare"
    data[var] = np.where(data[var].isin(frequent_cat), data[var], 'Rare')
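A short usage sketch (the DataFrame data here is hypothetical, only to show the shape of the result):
import numpy as np
import pandas as pd

data = pd.DataFrame({'transport': ['car'] * 10 + ['air'] * 9 + ['foot']})
frequent_cat = get_freq_labels(data, 'transport', 0.05)
data['transport'] = np.where(data['transport'].isin(frequent_cat), data['transport'], 'Rare')
print(data['transport'].value_counts())   # 'foot' appears in only 5% of rows, so it becomes 'Rare'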

changing height (feet and inches) to an integer in python pandas

I have a pandas dataframe that contains height information and I can't seem to figure out how to convert the somewhat unstructured information into an integer.
I figured the best way to approach this was to use regex but the main problem I'm having is that when I attempt to simplify a problem to use regex I usually take the first item in the dataframe (7 ' 5.5'') and try to use regex specifically on it. It seemed impossible for me to put this data in a string because of the quotes. So, I'm really confused on how to approach this problem.
here is my dataframe:
HeightNoShoes HeightShoes
0 7' 5.5" NaN
1 6' 11" 7' 0.25"
2 6' 7.75" 6' 9"
3 6' 5.5" 6' 6.75"
4 5' 11" 6' 0"
Output should be in inches:
HeightNoShoes HeightShoes
0 89.5 NaN
1 83 84.25
2 79.75 81
3 77.5 78.75
4 71 72
My next option would be writing this to csv and using excel, but I would prefer to learn how to do it in python/pandas. any help would be greatly appreciated.
The previous answer is a good solution to the problem without using regular expressions. I will post this in case you are curious about how to approach it using your first idea (regexes).
It is possible to solve this using your approach of using a regular expression. In order to put the data you have (such as 7' 5.5") into a string in Python, you can escape the quote.
For example:
py_str = "7' 5.5\""
This, combined with a regular expression, will allow you to extract the information you need from the input data to calculate the output data. The input data consists of an integer (feet) followed by ', a space, and then a floating point number (inches). This float consists of one or more digits and then, optionally, a . and more digits. Here is a regular expression that can extract the feet and inches from the input data: ([0-9]+)' ([0-9]*\.?[0-9]+)"
The first group of the regex retrieves the feet and the second retrieves the inches. Here is an example of a function in python that returns a float, in inches, based on input data such as "7' 5.5\"", or NaN if there is no valid match:
Code:
import re

r = re.compile(r"([0-9]+)' ([0-9]*\.?[0-9]+)\"")

def get_inches(el):
    m = r.match(el)
    if m is None:
        return float('NaN')
    else:
        return int(m.group(1))*12 + float(m.group(2))
Example:
>>> get_inches("7' 5.5\"")
89.5
You could apply that regular expression to the elements in the data. However, the solution of mapping your own function over the data works well. I thought you might want to see how you could approach this using your original idea.
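For example, a minimal sketch of that application (assuming the question's DataFrame is named df; the HeightShoes column contains NaN, so it would need a guard against non-string values first):
df['HeightNoShoes'] = df['HeightNoShoes'].apply(get_inches)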
One possible method without using regex is to write your own function and just apply it to the column/Series of your choosing.
Code:
import pandas as pd

df = pd.read_csv("test.csv")

def parse_ht(ht):
    # format: 7' 0.0"
    ht_ = ht.split("' ")
    ft_ = float(ht_[0])
    in_ = float(ht_[1].replace("\"", ""))
    return (12*ft_) + in_

print(df["HeightNoShoes"].apply(lambda x: parse_ht(x)))
Output:
0 89.50
1 83.00
2 79.75
3 77.50
4 71.00
Name: HeightNoShoes, dtype: float64
Not perfectly elegant, but it does the job with minimal fuss. Best of all, it's easy to tweak and understand.
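One caveat worth noting (my addition, not part of the original answer): parse_ht assumes a string input, so applying it to the HeightShoes column, which contains NaN, would raise. A small hedged guard:
import numpy as np

def parse_ht_safe(ht):
    # pass NaN (or anything that is not a string) through as NaN
    if not isinstance(ht, str):
        return np.nan
    return parse_ht(ht)

df["HeightShoes"] = df["HeightShoes"].apply(parse_ht_safe)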
Comparison versus the accepted solution:
In [9]: import re
In [10]: r = re.compile(r"([0-9]+)' ([0-9]*\.?[0-9]+)\"")
    ...: def get_inches2(el):
    ...:     m = r.match(el)
    ...:     if m == None:
    ...:         return float('NaN')
    ...:     else:
    ...:         return int(m.group(1))*12 + float(m.group(2))
    ...:
In [11]: %timeit get_inches("7' 5.5\"")
100000 loops, best of 3: 3.51 µs per loop
In [12]: %timeit parse_ht("7' 5.5\"")
1000000 loops, best of 3: 1.24 µs per loop
parse_ht is a little more than twice as fast.
First create the dataframe of height values
Let's first set up a Pandas dataframe to match the question, then convert the values given in feet and inches to a numerical value using apply. NOTE: The questioner asks if the values can be converted to integers; however, the first value in the 'HeightNoShoes' column is 7' 5.5". Since this string value is expressed in half inches, it will be converted first to a float value. You can then use the round function to round it before typecasting the values as integers.
# libraries
import numpy as np
import pandas as pd

# height data
no_shoes = ['''7' 5.5"''',
            '''6' 11"''',
            '''6' 7.75"''',
            '''6' 5.5" ''',
            '''5' 11"''']

shoes = [np.nan,
         '''7' 0.25"''',
         '''6' 9"''',
         '''6' 6.75"''',
         '''6' 0"''']

# put height data into a Pandas dataframe
height_data = pd.DataFrame({'HeightNoShoes': no_shoes, 'HeightShoes': shoes})
height_data.head()
Next use a function to convert feet to float values
Here is a function that converts feet and inches to a float value.
def feet_to_float(cell_string):
    try:
        split_strings = cell_string.replace('"','').replace("'",'').split()
        float_value = float(split_strings[0])*12 + float(split_strings[1])
    except:
        float_value = np.nan
    return float_value
Next, apply the function to each column in the dataframe.
# obtain a copy of the height data
df = height_data.copy()
for col in df.columns:
    print(col)
    df[col] = df[col].apply(feet_to_float)
df.head()
Here is a function to convert float values to integer values with NaN values in the Pandas column
If you would like to convert the dataframe to integer values with a NaN value in one column you can use the following function and code. Note, that the function rounds the values first before typecasting them as integers. Typecasting the float values as integers before rounding them will just truncate the values.
def float_to_int(cell_value):
    try:
        return int(round(cell_value, 0))
    except:
        return cell_value

for col in df.columns:
    df[col] = df[col].apply(float_to_int)
Note: Pandas displays columns that contain both NaN values and integers as float values.
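As a hedged aside (assuming a recent pandas version; this is not part of the original answer), the nullable Int64 dtype can hold integers alongside missing values, which avoids the float display:
df['HeightShoes'] = df['HeightShoes'].round().astype('Int64')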
Here is the code to convert a single column in the dataframe to a numerical value.
df = height_data.copy()
df['HeightNoShoes'] = df['HeightNoShoes'].apply(feet_to_float)
df.head()
This is how to convert the single column of float values to integers. Note, that it's important to round the values first. Typecasting the values as integers before rounding them will incorrectly truncate the values.
df['HeightNoShoes'] = round(df['HeightNoShoes'],0).astype(int)
df.head()
There are NaN values in the second Pandas column labeled 'HeightShoes'. Both the feet_to_float and float_to_int functions found above should be able to handle these.
df = height_data.copy()
df['HeightShoes'] = df['HeightShoes'].apply(feet_to_float)
df['HeightShoes'] = df['HeightShoes'].apply(float_to_int)
df.head()
This may also serve the purpose
def inch_to_cm(x):
    if x is np.NaN:
        return x
    else:
        ft, inc = x.split("'")
        inches = inc[1:-1]
        return ((12 * int(ft)) + float(inches)) * 2.54

df['Height'] = df['Height'].apply(inch_to_cm)
Here is a way using str.extract()
(df.stack()
.str.extract(r"(\d+)' (\d+\.?\d*)")
.rename({0:'feet',1:'inches'},axis=1)
.astype(float)
.assign(feet = lambda x: x['feet'].mul(12))
.sum(axis=1)
.unstack())
Output:
HeightNoShoes HeightShoes
0 89.50 NaN
1 83.00 84.25
2 79.75 81.00
3 77.50 78.75
4 71.00 72.00

Constructing a waterfall algorithm from multiple columns in a Pandas Data Frame

Suppose I have a multi-column data frame and I wish to implement a waterfall-style algorithm: take the value from the first column if it is present, otherwise look at the second, then the third, and so on, and if the value is missing in the last column take a default (say zero). I have a way of doing this by adding up a series of vector operations (see below), but it doesn't seem to scale well to more columns. And, of course, I could do it with nested loops through the rows (very unpythonic -- right?).
frame = pd.DataFrame(np.arange(15).reshape((5,3)),index=['a','b','c','d','e'],columns=['X','Y', 'Z'])
#Make some missing values
frame['X'].ix[0:2] = None
frame['Y'].ix[1:4] = None
frame['Z'].ix[3:5] = None
#This is my kludgy waterfall for the three column case.
frame['Waterfall'] = frame['X'].fillna(0) + frame['Y'].fillna(0) * frame['X'].isnull() + frame['Z'].fillna(0) * (frame['X'].isnull() & frame['Y'].isnull())
I am hoping for a solution to this problem that scales well to waterfalls of arbitrary length. If it could be more Pythonic, that would be great. Ideally, it would be a function that takes an ordered list of column labels and a dataframe as arguments and returns the desired values.
Thank you for your help.
First of all, don't use None as your missing data value. That forces all your columns to the object dtype, which will be slow. Use nan instead (this makes everything doubles, so just be careful with floating point stuff).
I'd use the bfill method for fillna():
In [26]: frame.fillna(method='bfill', axis=1)['X'].fillna(0)
Out[26]:
a 1
b 5
c 6
d 9
e 12
Name: X, dtype: float64
performance:
In [27]: %timeit frame['X'].fillna(0) + frame['Y'].fillna(0) * frame['X'].isnull() + frame['Z'].fillna(0) * (frame['X'].isnull() & frame['Y'].isnull())
1000 loops, best of 3: 776 µs per loop
In [28]: %timeit frame.fillna(method='bfill', axis=1)['X']
10000 loops, best of 3: 138 µs per loop
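To get the function the question asks for, here is a sketch built on the same bfill idea (the name waterfall and the default argument are mine):
def waterfall(df, cols, default=0):
    # back-fill across the ordered columns, take the first one, and fall back to the
    # default for rows that are missing in every column
    return df[cols].bfill(axis=1).iloc[:, 0].fillna(default)

frame['Waterfall'] = waterfall(frame, ['X', 'Y', 'Z'])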
