Iteratively subset and calculate means of pandas DataFrame

Iteratively subset and calculate means of pandas DataFrame - python

I have a rather large DataFrame (~30k rows, ~30k columns), for which I am trying to iteratively create two subsets based on every columns values, and store the ratio arrays for each column:
for col in df.columns:
high_subset = df.query(col>cutoff_vals['high'][col]).mean(axis=0)
low_subset = df.query(col<cutoff_vals['low'][col]).mean(axis=0)
ratios = high_subset / low_subset
///
store_ratios_for_col
I have the low_cutoff and high_cuttoff values precomputed and stored in a dictionary cutoff_vals. I would like to be able to store the ratio array for every column, which should result in a NxN array of ratios (N == number of columns).
Is there a more efficient method to iterate through the columns, subset them, and perform math/comparisons on the results Series?
I understand that using something like Dask or Ray-project may help, but thought there may be a clever vectorization or built in pandas trick first.

Use gt to compare all the columns, then .where to mask:
cutoffs = pd.DataFrame(cutoff_vals)
highs = df.where(df.gt(cutoffs['high'])).mean()
lows = df.where(df.lt(cutoffs['low'])).mean()
# ratios for all columns
# get any with ratios[col_name]
ratios = highs / lows

Related

Python pandas datafram drop colums where correlation abovethreshold

I have a pandas dataframe (adjusted_data) containing many independent variables and a target variable called RainTomorrow. I found out how I can get the correlation between the independent variables and the target variable by using:
adjusted_data.corr()['RainTomorrow'][:].abs()
I would like to create a new dataframe (adjusted_data_narrowed) that only consists of columns where the correlation value is above a certain threshold. What is the best way to do that?

This is all you need:
df2 = adjusted_data.corr()['RainTomorrow'][:].abs()
df2[df2>0.05]

The following should work. It might not be the best solution or the most convenient, but I think it will do for now. You can replace the threshold with the correlation that you want. For example if you only want columns where the correlation is higher than 0.5 or lower than -0.5, use 0.5 instead of threshold.
from itertools import combinations
corr = adjust_data.corr()
passed = set()
for (r,c) in combinations(corr.columns, 2):
if (abs(corr.loc[r,c]) >= threshold):
passed.add(r)
passed.add(c)
passed = sorted(passed)
corr = corr.loc[passed, passed]
corr is now your correlation matrix where you can see which column does not meet your requirement. Now you can filter your dataframe via:
df_adjusted = df_adjusted[corr.columns]

How to select n equally spaced rows from Dask dataframe?

I have a number of parquet files, where all of the chunks together are too big to fit into memory. I would like to load them into a dask dataframe, compute some results (cumsum) and then display the cumsum as a plot. For this reason I wanted to select equally spaced subset of data (some k rows) from the cumsum row, and then plot this subset. How would I do that?

You could try:
slices = 10 # or whatever
slice_point = int(df.shape[0]/slices)
for i in range(slices):
current_sliced_df = df.loc[i*slice_point:(i+1)*slice_point]
and do whatever you want with the current slice

I think that using df[serie].sample(...)(doc) would allow you to avoid to code the way of selecting a representative subset of rows.

How to get around slow groupby for a sparse matrix?

I have a large matrix (~200 million rows) describing a list of actions that occurred every day (there are ~10000 possible actions). My final goal is to create a co-occurrence matrix showing which actions happen during the same days.
Here is an example dataset:
data = {'date': ['01', '01', '01', '02','02','03'],
'action': [100, 101, 989855552, 100, 989855552, 777]}
df = pd.DataFrame(data, columns = ['date','action'])
I tried to create a sparse matrix with pd.get_dummies, but unravelling the matrix and using groupby on it is extremely slow, taking 6 minutes for just 5000 rows.
# Create a sparse matrix of dummies
dum = pd.get_dummies(df['action'], sparse = True)
df = df.drop(['action'], axis = 1)
df = pd.concat([df, dum], axis = 1)
# Use groupby to get a single row for each date, showing whether each action occurred.
# The groupby command here is the bottleneck.
cols = list(df.columns)
del cols[0]
df = df.groupby('date')[cols].max()
# Create a co-occurrence matrix by using dot-product of sparse matrices
cooc = df.T.dot(df)
I've also tried:
getting the dummies in non-sparse format;
using groupby for aggregation;
going to sparse format before matrix multiplication.
But I fail in step 1, since there is not enough RAM to create such a large matrix.
I would greatly appreciate your help.

I came up with an answer using only sparse matrices based on this post. The code is fast, taking about 10 seconds for 10 million rows (my previous code took 6 minutes for 5000 rows and was not scalable).
The time and memory savings come from working with sparse matrices until the very last step when it is necessary to unravel the (already small) co-occurrence matrix before export.
## Get unique values for date and action
date_c = CategoricalDtype(sorted(df.date.unique()), ordered=True)
action_c = CategoricalDtype(sorted(df.action.unique()), ordered=True)
## Add an auxiliary variable
df['count'] = 1
## Define a sparse matrix
row = df.date.astype(date_c).cat.codes
col = df.action.astype(action_c).cat.codes
sparse_matrix = csr_matrix((df['count'], (row, col)),
shape=(date_c.categories.size, action_c.categories.size))
## Compute dot product with sparse matrix
cooc_sparse = sparse_matrix.T.dot(sparse_matrix)
## Unravel co-occurrence matrix into dense shape
cooc = pd.DataFrame(cooc_sparse.todense(),
index = action_c.categories, columns = action_c.categories)

There are a couple of fairly straightforward simplifications you can consider.
One of them is that you can call max() directly on the GroupBy object, you don't need the fancy index on all columns, since that's what it returns by default:
df = df.groupby('date').max()
Second is that you can disable sorting of the GroupBy. As the Pandas reference for groupby() says:
sort : bool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
So try that as well:
df = df.groupby('date', sort=False).max()
Third is you can also use a simple pivot_table() to produce the same result.
df = df.pivot_table(index='date', aggfunc='max')
Yet another approach is going back to your "actions" DataFrame, turning that into a MultiIndex and using it for a simple Series, then using unstack() on it, that should get you the same result, without having to use the get_dummies() step (but not sure whether this will drop some of the sparseness properties you're currently relying on.)
actions_df = pd.DataFrame(data, columns = ['date', 'action'])
actions_index = pd.MultiIndex.from_frame(actions_df, names=['date', ''])
actions_series = pd.Series(1, index=actions_index)
df = actions_series.unstack(fill_value=0)
Your supplied sample DataFrame is quite useful for checking that these are all equivalent and produce the same result, but unfortunately not that great for benchmarking it... I suggest you take a larger dataset (but still smaller than your real data, like 10x smaller or perhaps 40-50x smaller) and then benchmark the operations to check how long they take.
If you're using Jupyter (or another IPython shell), you can use the %timeit command to benchmark an expression.
So you can enter:
%timeit df.groupby('date').max()
%timeit df.groupby('date', sort=False).max()
%timeit df.pivot_table(index='date', aggfunc='max')
%timeit actions_series.unstack(fill_value=0)
And compare results, then scale up and check whether the whole run will complete in an acceptable amount of time.

Finding euclidean distance from multiple mean vectors

This is what I am trying to do - I was able to do steps 1 to 4. Need help with steps 5 onward
Basically for each data point I would like to find euclidean distance from all mean vectors based upon column y
take data
separate out non numerical columns
find mean vectors by y column
save means
subtract each mean vector from each row based upon y value
square each column
add all columns
join back to numerical dataset and then join non numerical columns
import pandas as pd
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric=df.select_dtypes(include='number')#, exclude=None)[source]
df_non_numeric=df.select_dtypes(exclude='number')
means=df_numeric.groupby('class').mean()
For each row of means, subtract that row from each row of df_numeric. then take square of each column in the output and then for each row add all columns. Then join this data back to df_numeric and df_non_numeric
--------------update1
added code as below. My questions have changed and updated questions are at the end.
def calculate_distance(row):
return (np.sum(np.square(row-means.head(1)),1))
def calculate_distance2(row):
return (np.sum(np.square(row-means.tail(1)),1))
df_numeric2=df_numeric.drop("class",1)
#np.sum(np.square(df_numeric2.head(1)-means.head(1)),1)
df_numeric2['distance0']= df_numeric.apply(calculate_distance, axis=1)
df_numeric2['distance1']= df_numeric.apply(calculate_distance2, axis=1)
print(df_numeric2)
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
could anyone confirm that these is a correct way to achieve the results? i am mainly concerned about the last two statements. Would the second last statement do a correct join? would the final statement assign the original class? i would like to confirm that python wont do the concat and class assignment in a random order and that python would maintain the order in which rows appear
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]

I think this is what you want
import pandas as pd
import numpy as np
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric=df.select_dtypes(include='number')#, exclude=None)[source]
# Make df_non_numeric a copy and not a view
df_non_numeric=df.select_dtypes(exclude='number').copy()
# Subtract mean (calculated using the transform function which preserves the
# number of rows) for each class to create distance to mean
df_dist_to_mean = df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')
# Finally calculate the euclidean distance (hypotenuse)
df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
df_non_numeric['class'] = df_numeric['class']
# If you want a separate dataframe named 'final' with the end result
df_final = df_non_numeric.copy()
print(df_final)
It is probably possible to write this even denser but this way you'll see whats going on.

I'm sure there is a better way to do this but I iterated through depending on the class and follow the exact steps.
Assigned the 'class' as the index.
Rotated so that the 'class' was in the columns.
Performed that operation of means that corresponded with df_numeric
Squared the values.
Summed the rows.
Concatenated the dataframes back together.
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
#print (df)
df_numeric=df.select_dtypes(include='number')#, exclude=None)[source]
df_non_numeric=df.select_dtypes(exclude='number')
means=df_numeric.groupby('class').mean().T
import numpy as np
# Changed index
df_numeric.index = df_numeric['class']
df_numeric.drop('class' , axis = 1 , inplace = True)
# Rotated the Numeric data sideways so the class was in the columns
df_numeric = df_numeric.T
#Iterated through the values in means and seen which df_Numeric values matched
store = [] # Assigned an empty array
for j in means:
sto = df_numeric[j]
if type(sto) == type(pd.Series()): # If there is a single value it comes out as a pd.Series type
sto = sto.to_frame() # Need to convert ot dataframe type
store.append(sto-j) # append the various values to the array
values = np.array(store)**2 # Squaring the values
# Summing the rows
summed = []
for i in values:
summed.append((i.sum(axis = 1)))
df_new = pd.concat(summed , axis = 1)
df_new.T

How to maintain or recover Dataframe indexing after running Pairwise Distance function?

I'm using sklearn's pairwise distance function, which saved my life when computing a huge matrix, but the problem I'm having is that I lose my indices.
Specifically, I initially have a huge dataframe of 17000 x 300, which I break down into 4 different dataframes based on some class condition.
The 4 separate dataframes keep the original indices, but after I run the pairwise distance function on one of those dataframes, it gives me back a 2d array with correct values but the indices have been reset from 0 up.
How do I keep or recover the original indices?
distance1 = pair.pairwise_distances(df1, metric='euclidean')

You can create a DataFrame with matching indices using the DataFrame constructor taking the index parameter:
pd.DataFrame(distance1, index=df1.index)
Furthermore, if you would like to concatenate it horizontally to your existing DataFrame, you can use
pd.concat((df1, pd.DataFrame(distance1, index=df1.index)), axis=1)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Iteratively subset and calculate means of pandas DataFrame - python

Use gt to compare all the columns, then .where to mask: cutoffs = pd.DataFrame(cutoff_vals) highs = df.where(df.gt(cutoffs['high'])).mean() lows = df.where(df.lt(cutoffs['low'])).mean() # ratios for all columns # get any with ratios[col_name] ratios = highs / lows

Related

Python pandas datafram drop colums where correlation abovethreshold

How to select n equally spaced rows from Dask dataframe?

How to get around slow groupby for a sparse matrix?

Finding euclidean distance from multiple mean vectors

How to maintain or recover Dataframe indexing after running Pairwise Distance function?

Categories

Resources