I'm working on some code that generates features from a dataframe and adds these features as columns to the dataframe.
The trouble is I'm working with a time series so that for any given tuple, I need (let's say) 5 of the previous tuples to generate the corresponding feature for that tuple.
lookback_period = 5
df['feature1'] = np.zeros(len(df))  # preallocate
for index, row in df.iterrows():
    if index < lookback_period:
        continue
    window = df[index - lookback_period:index]  # the previous 5 rows
    some_int = SomeFxn(window)
    df.loc[index, 'feature1'] = some_int  # writing via row would only modify a copy
Is there a way to execute this code without explicitly looping through every row and then slicing?
One way is to create several lagged columns using df['column_name'].shift() such that all the necessary information is contained in each row, but this quickly gets intractable for my computer's memory since the dataset is large (millions of rows).
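For reference, the shift()-based approach mentioned above would look roughly like this (the column name value and the lookback of 5 are made up for illustration):

import numpy as np
import pandas as pd

# hypothetical single-column frame standing in for the real data
df = pd.DataFrame({'value': np.arange(10, dtype=float)})

lookback_period = 5
# one extra column per lag step, so each row carries its previous 5 values
for lag in range(1, lookback_period + 1):
    df[f'value_lag{lag}'] = df['value'].shift(lag)

This makes each row self-contained, but it also multiplies the memory footprint roughly by the window size, which is exactly the problem described above.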
Can't you use apply on your dataframe, e.g.
df['feature1'] = df.apply(someRowFunction, axis=1)
where someRowFunction accepts the full row, and you can perform whatever row-based slicing and logic you want.
--- updated ---
Since we do not have much information about the dataframe and the required/expected output, I have based this answer on the information from the comments.
Let's define a function that takes a DataFrame slice (based on the current row index and the lookback) together with the current row, and returns the sum of the slice's first column plus the current row's b value.
def someRowFunction(slice, row):
    if slice.shape[0] == 0:
        return 0
    return slice[slice.columns[0]].sum() + row.b
import pandas as pd

d = {'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0], 'b': [0, 9, 8, 7, 6, 5, 4, 3, 2, 1]}
df = pd.DataFrame(data=d)
lookback = 5
df['c'] = df.apply(lambda current_row: someRowFunction(df[current_row.name - lookback:current_row.name], current_row), axis=1)
We can get the row index inside apply from the row's name attribute, and with it retrieve the required slice. The above results in the following:
print(df)
a b c
0 1 0 0
1 2 9 0
2 3 8 0
3 4 7 0
4 5 6 0
5 6 5 20
6 7 4 24
7 8 3 28
8 9 2 32
9 0 1 36
I am trying to drop pandas columns in the following way. I have a list of columns to drop, and this list will be used many times in my notebook. I also have 2 columns which are only referenced once:
drop_cols=['var1','var2']
df = df.drop(columns={'var0',drop_cols})
So basically, I want to drop all columns from the list drop_cols in addition to a hard-coded "var0" column, all in one swoop. This gives an error. How do I resolve it?
df = df.drop(columns=drop_cols+['var0'])
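For illustration, a quick sketch on a small made-up frame showing that the concatenation builds one flat list of labels before the drop:

import pandas as pd

df = pd.DataFrame({'var0': [1], 'var1': [2], 'var2': [3], 'var3': [4]})
drop_cols = ['var1', 'var2']

df = df.drop(columns=drop_cols + ['var0'])  # drop_cols + ['var0'] == ['var1', 'var2', 'var0']
print(df.columns.tolist())                  # ['var3']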
From what I gather, you have a set of columns you wish to drop from several different dataframes while at the same time adding another unique column that should also be dropped from one dataframe. The command you have used is close, but you can't create a concatenated list in the way you are trying to do it. This is how I would approach the problem.
Given a Dataframe of the form:
V0 V1 V2 V3
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
define a function to merge column names:
def mergeNames(spc_col, multi_cols):
    rslt = [spc_col]
    rslt.extend(multi_cols)
    return rslt
Then with
drop_cols = ['V1', 'V2']
df.drop(columns=mergeNames('V0', drop_cols))
yields:
V3
0 4
1 8
2 12
I would like to automate the selection of values in one column, Step_ID.
Instead of defining which Step_ID values I would like to filter (as shown in the code below), I would like to specify that the first Step_ID and the last Step_ID are to be excluded.
df = df.set_index(['Step_ID'])
df.loc[df.index.isin(['Step_2','Step_3','Step_4','Step_5','Step_6','Step_7','Step_8','Step_9','Step_10','Step_11','Step_12','Step_13','Step_14','Step_15','Step_16','Step_17','Step_18','Step_19','Step_20','Step_21','Step_22','Step_23','Step_24'])]
Is there any option to exclude the first and last value in the column? In this example, Step_1 and Step_25.
Or to include all values except the first and the last? In this example, Step_2 - Step_24.
The reason for this is that the files have different numbers of 'Step_ID' values.
So that I don't have to redefine it all the time, I would like a solution that simplifies this filtering. It is necessary to exclude the first and last value in the 'Step_ID' column, but the number of Step_IDs is always different.
Given Step_1 - Step_X, I need Step_2 - Step_(X-1).
Use:
df = pd.DataFrame({
'Step_ID': ['Step_1','Step_1','Step_2','Step_2','Step_3','Step_4','Step_5',
'Step_6','Step_6'],
'B': list(range(9))})
print (df)
Step_ID B
0 Step_1 0
1 Step_1 1
2 Step_2 2
3 Step_2 3
4 Step_3 4
5 Step_4 5
6 Step_5 6
7 Step_6 7
8 Step_6 8
Select all index values except the first and last ones, which are extracted by slicing df.index[[0, -1]]:
df = df.set_index(['Step_ID'])
df = df.loc[~df.index.isin(df.index[[0, -1]].tolist())]
print (df)
B
Step_ID
Step_2 2
Step_2 3
Step_3 4
Step_4 5
Step_5 6
I use the pandas read_excel function to work with data. I have two Excel files with 70k rows and 3 columns (the first column is a date), and it only takes 4-5 seconds to combine them, align the data, delete any rows with incomplete data and return a new dataframe (df) with 50k rows and 4 columns, where the date is the index.
Then, I use the code below to perform some calculations and add another 2 columns to my df:
for i, row in df.iterrows():
    df["new_column1"] = df["column1"] - 2 * df["column4"]
    df["new_column2"] = df["column1"] - 2.5 * df["column4"]
It takes approx 30 seconds for the above code to execute, even though the calculations are simple. Is this normal, or is there a way to speed up the execution? (I am on Win 10, 16GB RAM and an i7-8565U processor.)
I am not particularly interested in adding more columns to my dataframe - getting the two new columns as lists would suffice.
Thanks.
Note that the code in your loop uses neither row nor i.
So drop the for ... loop and execute just:
df["new_column1"] = df["column1"] - 2 * df["column4"]
df["new_column2"] = df["column1"] - 2.5 * df["column4"]
It is enough to execute the above code only once, not in a loop.
Your code unnecessarily performs these operations multiple times (actually, as many times as your DataFrame has rows), and this is why it takes so long.
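As a side note (not part of the original answer): since the question mentions that getting the two results as lists would suffice, here is a minimal sketch of that, using a small made-up frame with the same column names:

import pandas as pd

# stand-in for the real 50k-row frame
df = pd.DataFrame({"column1": [10.0, 20.0, 30.0],
                   "column4": [1.0, 2.0, 3.0]})

new_column1 = (df["column1"] - 2 * df["column4"]).tolist()    # [8.0, 16.0, 24.0]
new_column2 = (df["column1"] - 2.5 * df["column4"]).tolist()  # [7.5, 15.0, 22.5]

The computation is still fully vectorized; only the final .tolist() converts the result out of pandas.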
Edit following question as of 18:59Z
To perform vectorized operations like "check one column and do something to another column", use the following pattern, based on boolean indexing.
Assume that the source df contains:
column1 column4
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17
7 8 18
Then if you want to:
select rows with an even value in column1,
and add some value (e.g. 200) to column4,
run:
df.loc[df.column1 % 2 == 0, 'column4'] += 200
In this example:
df.column1 % 2 == 0 - provides boolean indexing over rows,
column4 - selects the particular column,
+= 200 - performs the actual operation.
The result is:
column1 column4
0 1 11
1 2 212
2 3 13
3 4 214
4 5 15
5 6 216
6 7 17
7 8 218
But there are more complex cases, where the condition involves calling some custom code or you want to update several columns.
In such cases you should use either iterrows or apply, but these operations execute much more slowly.
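As an illustration (not from the original answer), here is a minimal apply-based sketch that calls custom per-row code and fills two columns at once; the column names and the helper function are made up:

import pandas as pd

df = pd.DataFrame({'column1': [1, 2, 3, 4], 'column4': [11, 12, 13, 14]})

def custom_logic(row):
    # arbitrary per-row logic that is hard to vectorize
    flag = 'even' if row['column1'] % 2 == 0 else 'odd'
    score = row['column4'] + (200 if flag == 'even' else 0)
    return pd.Series({'flag': flag, 'score': score})

# returning a Series from the function yields one new column per key
df[['flag', 'score']] = df.apply(custom_logic, axis=1)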
I have seen a variant of this question that keeps the top n rows of each group in a pandas dataframe, where the solutions use n as an absolute number rather than a percentage (here: Pandas get topmost n records within each group). However, in my dataframe each group has a different number of rows, and I want to keep the top n% of rows of each group. How would I approach this problem?
You can construct a Boolean series of flags with groupby and use it to filter. First, let's create an example dataframe and look at the number of rows for each unique value in the first series:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
1 2
0
0 1 1
0 1 1
0 1 0
1 1 1
1 1 0
As you can see, the resultant dataframe has only three rows with index 0 and two with index 1, in each case half the number in the original dataframe.
Here is another option which builds on some of the answers in the post you mentioned.
First of all, here is a quick function to either round up or round down. If we want the top 30% of rows of a dataframe 8 rows long, then we would try to take 2.4 rows, so we will need to either round up or down.
My preferred option is to round up. This is because, for example, if we were to take 50% of the rows but had one group which only had one row, we would still keep that one row. I kept this separate so that you can change the rounding as you wish.
def round_func(x, up=True):
    '''Function to round up or round down a float'''
    if up:
        return int(x + 1)
    else:
        return int(x)
Next I make a dataframe to work with and set a parameter p to be the fraction of the rows from each group that we should keep. The rest follows, and I have commented it so that hopefully you can follow.
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4], 'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})
p = 0.30  # top fraction to keep from each group

df_top = df.groupby('id').apply(                        # group by the ids
    lambda x: x.reset_index()['value'].nlargest(        # in each group take the top rows by column 'value'
        round_func(x.count().max() * p)))               # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1)   # make the dataframe nice again
df looked like this
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
df_top looks like this
id value
0 1 3
1 2 4
2 2 3
3 3 1
4 4 1
I have an indexed dataframe which has 77000 rows.
I want to group every 7000 rows under a higher-level MultiIndex, making 11 groups.
I know that I can loop through all the index values, build tuples, and assign them via the pd.MultiIndex.from_tuples method.
Is there an elegant way to do this simple thing?
You could use the pd.qcut function to create a new column that you can add to the index.
Here is an example that creates five groups/chunks:
df = pd.DataFrame({'data':range(1,10)})
df['chunk'] = pd.qcut(df.data, 5, labels=range(1,6))
df.set_index('chunk', append=True, inplace=True)
df
data
index chunk
0 1 1
1 1 2
2 2 3
3 2 4
4 3 5
5 4 6
6 4 7
7 5 8
8 5 9
You would do df['chunk'] = pd.qcut(df.index, 11) to get your chunks assigned to your dataframe.
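Applied to the case in the question, a minimal sketch (assuming a default RangeIndex over the 77000 rows):

import numpy as np
import pandas as pd

df = pd.DataFrame({'data': np.arange(77000)})  # stand-in for the real frame

df['chunk'] = pd.qcut(df.index, 11, labels=range(11))  # 11 equal-sized bins of 7000 consecutive rows
df.set_index('chunk', append=True, inplace=True)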
The code below creates an ordered column in the range 0-10, which is tiled up to the length of your DataFrame. Since you want to group based on your old index plus your new folds, you first need to reset the index before performing a groupby.
groups = 11
folds = list(range(groups)) * (len(df) // groups + 1)  # list(...) is needed in Python 3
df['folds'] = folds[:len(df)]
gb = df.reset_index().groupby(['old_index', 'folds'])
Where old_index is obviously the name of your index.
If you prefer to have sequential groups (e.g. the first 7k rows, the next 7k rows, etc.), then you can do the following :
df['fold'] = [i // (len(df) // groups) for i in range(len(df))]
Note: The // operator is for floor division to truncate any remainder.
Another way is to use the integer division // assuming that your dataframe has the default integer index:
import pandas as pd
import numpy as np
# data
# ===============================================
df = pd.DataFrame(np.random.randn(10), columns=['col'])
df
# processing
# ===============================================
df['chunk'] = df.index // 5
df.set_index('chunk', append=True)
col
chunk
0 0 2.0955
1 0 -1.2891
2 0 -0.3313
3 0 0.1508
4 0 -1.0215
5 1 0.6051
6 1 -0.3227
7 1 -0.6394
8 1 -0.7355
9 1 0.5949