Split dataframe into chunks and add them to a multiindex

I have an indexed dataframe which has 77000 rows.
I want to group every 7000 rows into a higher dimension multiindex, making 11 groups of higher dimension index.
I know that I can loop through all the index values, build a tuple for each one, and assign the result with the pd.MultiIndex.from_tuples method.
Is there an elegant way to do this simple thing?

You could use the pd.qcut function to create a new column that you can add to the index.
Here is an example that creates five groups/chunks:
df = pd.DataFrame({'data':range(1,10)})
df['chunk'] = pd.qcut(df.data, 5, labels=range(1,6))
df.set_index('chunk', append=True, inplace=True)
df
             data
index chunk
0     1         1
1     1         2
2     2         3
3     2         4
4     3         5
5     4         6
6     4         7
7     5         8
8     5         9
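For your dataframe, df['chunk'] = pd.qcut(df.index, 11) assigns every row to one of 11 equally sized chunks. This assumes the index is numeric; pass labels=False if you want integer codes 0-10 instead of interval labels.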
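For the 77000-row case from the question, a minimal sketch could look like the following. The frame and its data column are made up for illustration, and the sketch bins row positions (np.arange(len(df))) rather than the index values, so it does not rely on the index being numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 77000-row frame from the question.
df = pd.DataFrame({"data": np.arange(77000)})

# Bin row positions into 11 quantile-based chunks; labels=False yields codes 0-10.
df["chunk"] = pd.qcut(np.arange(len(df)), 11, labels=False)

# Append the chunk as an extra index level and count the rows per chunk.
df = df.set_index("chunk", append=True)
sizes = df.groupby(level="chunk").size()
print(sizes)
```

Because 77000 divides evenly by 11, each chunk should hold 7000 rows; with other lengths the quantile bins differ in size by at most a row or so.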

The code below creates an ordered column in the range 0-10, which is tiled up to the length of your DataFrame. Since you want to group based on your old index plus your new folds, you first need to reset the index before performing a groupby.
groups = 11
folds = list(range(groups)) * (len(df) // groups + 1)  # list() is needed in Python 3, where a range cannot be multiplied
df['folds'] = folds[:len(df)]
gb = df.reset_index().groupby(['old_index', 'folds'])
Here old_index is the name of your index.
If you prefer to have sequential groups (e.g. the first 7k rows, the next 7k rows, etc.), then you can do the following :
df['fold'] = [i // (len(df) // groups) for i in range(len(df))]
Note: The // operator is for floor division to truncate any remainder.
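The two labelling schemes above can also be written with NumPy arithmetic on row positions. This is only a sketch on a made-up 10-row frame with 3 groups; the column names fold_round_robin and fold_sequential are illustrative. It rounds the chunk size up, which is an assumption added here so that a length that does not divide evenly still produces exactly `groups` labels.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with a named, non-integer index.
df = pd.DataFrame(
    {"value": range(10)},
    index=pd.Index(list("abcdefghij"), name="old_index"),
)
groups = 3
positions = np.arange(len(df))

# Tiled labels 0, 1, 2, 0, 1, 2, ... (same idea as repeating range(groups)).
df["fold_round_robin"] = positions % groups

# Sequential labels: ceiling division keeps the label count at `groups`.
chunk_size = -(-len(df) // groups)
df["fold_sequential"] = positions // chunk_size

print(df)
```

With floor division for the chunk size (10 // 3 == 3), the last row would get a fourth label, which is why the sketch rounds up.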

Another way is to use the integer division // assuming that your dataframe has the default integer index:
import pandas as pd
import numpy as np
# data
# ===============================================
df = pd.DataFrame(np.random.randn(10), columns=['col'])
df
# processing
# ===============================================
df['chunk'] = df.index // 5
df.set_index('chunk', append=True)
            col
  chunk
0 0      2.0955
1 0     -1.2891
2 0     -0.3313
3 0      0.1508
4 0     -1.0215
5 1      0.6051
6 1     -0.3227
7 1     -0.6394
8 1     -0.7355
9 1      0.5949
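If the index is not the default integer index, or if the chunk should be the outer ("higher") level as the question asks, one possible variant is to compute the chunk from row positions and rebuild the index with pd.MultiIndex.from_arrays. The date index and the chunk size of 5 below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a non-integer (datetime) index.
df = pd.DataFrame(
    {"col": np.arange(10)},
    index=pd.date_range("2020-01-01", periods=10, name="date"),
)

# Chunk number derived from row position, not from the index values.
chunk = np.arange(len(df)) // 5

# Put the chunk in front of the existing index as the outer level.
df.index = pd.MultiIndex.from_arrays([chunk, df.index], names=["chunk", "date"])

# Selecting on the outer level returns one whole chunk.
second_chunk = df.loc[1]
print(second_chunk)
```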

Related

Selecting first n columns and last n columns with pandas

I am trying to select the first 2 columns and the last 2 column from a data frame by index with pandas and save it on the same dataframe.
is there a way to do that in one step?

You can use iloc and pass it the positions of the columns you want. Listing -2 before -1 keeps the last two columns in their original order:
df.iloc[:,[0,1,-2,-1]]

You are looking for iloc:
df = pd.DataFrame([[1,2,3,4,5], [2,3,4,5,6], [3,4,5,6,7]], columns=['a','b','c','d','e'])
df.iloc[:,:2] # Grabs all rows and first 2 columns
df.iloc[:,-2:] # Grabs all rows and last 2 columns
pd.concat([df.iloc[:,:2],df.iloc[:,-2:]],axis=1) # Puts them together column-wise (side by side)
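For a variable n, one way to build the position list in a single step is NumPy's np.r_ helper, which concatenates slices into one array. This is a sketch on the same example frame; the names n, positions and selected are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]],
    columns=["a", "b", "c", "d", "e"],
)
n = 2

# np.r_[0:n, -n:0] expands to the positions [0, 1, -2, -1] for n == 2.
positions = np.r_[0:n, -n:0]
selected = df.iloc[:, positions]
print(selected)
```

Note that if 2 * n exceeds the number of columns, some columns are selected twice.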

df = pd.DataFrame([[1,2,3,4,5], [2,3,4,5,6], [3,4,5,6,7]], columns=['a','b','c','d','e'])
df[['a','b','d','e']]
result
a b d e
0 1 2 4 5
1 2 3 5 6
2 3 4 6 7

Define column values to be selected / disselected as default

I would like to automate selecting of values in one column - Step_ID.
Instead of listing which Step_ID values I want to keep (as in the code below), I would like to specify that the first and the last Step_ID are excluded.
df = df.set_index(['Step_ID'])
df.loc[df.index.isin(['Step_2','Step_3','Step_4','Step_5','Step_6','Step_7','Step_8','Step_9','Step_10','Step_11','Step_12','Step_13','Step_14','Step_15','Step_16','Step_17','Step_18','Step_19','Step_20','Step_21','Step_22','Step_23','Step_24'])]
Is there any option to exclude the first and last value in the column? In this example, Step_1 and Step_25.
Or to include all values except the first and the last? In this example, Step_2 to Step_24.
The reason is that the files contain different numbers of Step_ID values.
Since I don't want to redefine the list every time, I would like a solution that simplifies this filtering. The first and last value in the column 'Step_ID' must always be excluded, but the number of Step_IDs differs from file to file.
Given Step_1 to Step_X, I need Step_2 to Step_(X-1).

Use:
df = pd.DataFrame({
'Step_ID': ['Step_1','Step_1','Step_2','Step_2','Step_3','Step_4','Step_5',
'Step_6','Step_6'],
'B': list(range(9))})
print (df)
Step_ID B
0 Step_1 0
1 Step_1 1
2 Step_2 2
3 Step_2 3
4 Step_3 4
5 Step_4 5
6 Step_5 6
7 Step_6 7
8 Step_6 8
Select all rows except those whose index equals the first or the last index value, which are extracted with df.index[[0, -1]]:
df = df.set_index(['Step_ID'])
df = df.loc[~df.index.isin(df.index[[0, -1]].tolist())]
print (df)
         B
Step_ID
Step_2   2
Step_2   3
Step_3   4
Step_4   5
Step_5   6
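A variant that leaves Step_ID as a regular column could look like the sketch below. It assumes that "first" and "last" mean the first and last distinct values in order of appearance, which is what Series.unique() returns; the names steps and middle are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Step_ID": ["Step_1", "Step_1", "Step_2", "Step_2", "Step_3",
                "Step_4", "Step_5", "Step_6", "Step_6"],
    "B": list(range(9)),
})

# Distinct step names in order of appearance.
steps = df["Step_ID"].unique()

# Keep every row whose step is neither the first nor the last one.
middle = df[~df["Step_ID"].isin([steps[0], steps[-1]])]
print(middle)
```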

What do the following commands do in Pandas?

I was just looking at some code for Random Forests, and came across these two lines.
Let's assume I have a pandas dataframe 'df' that consists of 12 columns.
What will the following code return
X = df.iloc[:,0:11].values
Y = df.iloc[:, 12].values

To generate a dataframe to consider:
>>> df = pd.DataFrame(np.random.randint(10, size=(5, 2)),
columns=['Col 1', 'Col 2'])
If we print the dataframe, you get:
>>> print(df)
Col 1 Col 2
0 8 4
1 6 4
2 7 5
3 9 6
4 1 5
To determine what the : does, let's consider
>>> print(df.iloc[:,0])
0 8
1 6
2 7
3 9
4 1
which appears to produce every single row in the 0-th column.
Let's try another example:
>>> print(df.iloc[0:3,0])
0 8
1 6
2 7
It looks like that gives the rows at position 0 through position 2 in the 0-th column.
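So, from playing with those examples, you can infer that : returns the full dimension. In your example, it returns all rows since the : comes first. The 0:11 returns the columns at positions 0 through 10, i.e. the first 11 columns. The 12 selects the column at position 12, which is the 13th column because positions start at 0; on a dataframe with only 12 columns this raises an IndexError.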

X = df.iloc[:,0:11].values
The above line returns all rows of the first 11 columns (positions 0 through 10; the end of the slice, 11, is excluded) as a NumPy array.
Y = df.iloc[:, 12].values
The above line returns the values of the 13th column of the dataframe (not the 12th, because positions start at 0) as a NumPy array.
Example :
Sample dataframe:
# Name the columns A1, B2, ..., N14 (letter plus position) for convenient tracking.
df = pd.DataFrame(np.random.randint(0,120,size=(5, 14)), columns=[f'{letter}{i}' for i, letter in enumerate('ABCDEFGHIJKLMN', start=1)])
df
X = df.iloc[:,0:11]#.values
X
Y = df.iloc[:, 12].values
Y
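The position arithmetic can be checked on deterministic data. The sketch below uses two made-up frames, wide (14 columns) and narrow (12 columns, as in the question), to show which columns the two expressions pick and what happens when position 12 does not exist.

```python
import numpy as np
import pandas as pd

# 14 columns: position 12 exists (it is the 13th column).
wide = pd.DataFrame(np.arange(28).reshape(2, 14))
X = wide.iloc[:, 0:11].values   # positions 0..10, i.e. the first 11 columns
Y = wide.iloc[:, 12].values     # the single column at position 12

# 12 columns: valid positions are 0..11, so position 12 is out of bounds.
narrow = pd.DataFrame(np.arange(24).reshape(2, 12))
try:
    narrow.iloc[:, 12]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(X.shape, Y, out_of_bounds)
```

Slices are more forgiving than single positions: narrow.iloc[:, 0:20] simply returns all 12 columns.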

How to keep only the top n% rows of each group of a pandas dataframe?

I have seen a variant of this question that keeps the top n rows of each group in a pandas dataframe, where the solutions use n as an absolute number rather than a percentage (see "Pandas get topmost n records within each group"). However, in my dataframe each group has a different number of rows, and I want to keep the top n% of the rows of each group. How would I approach this problem?

You can construct a Boolean series of flags and filter before you groupby. First let's create an example dataframe and look at the number of rows for each unique value in the first series:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
   1  2
0
0  1  1
0  1  1
0  1  0
1  1  1
1  1  0
As you can see, the resulting dataframe has only three rows with index 0 and two rows with index 1, in each case half the number in the original dataframe.

Here is another option which builds on some of the answers in the post you mentioned
First of all, here is a quick function to round either up or down. If we want the top 30% of the rows of a dataframe that is 8 rows long, we would try to take 2.4 rows, so we need to round either up or down.
My preferred option is to round up. This is because, for example, if we were to take 50% of the rows but had one group with only one row, we would still keep that one row. I kept this separate so that you can change the rounding as you wish.
import math
def round_func(x, up=True):
    '''Function to round up or round down a float'''
    if up:
        return math.ceil(x)  # int(x+1) would overshoot when x is already a whole number
    else:
        return int(x)
Next I make a dataframe to work with and set a parameter p to be the fraction of the rows from each group that we should keep. Everything follows and I have commented it so that hopefully you can follow.
import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
p = 0.30 # top fraction to keep. Currently set to 30%
df_top = df.groupby('id').apply( # group by the ids
lambda x: x.reset_index()['value'].nlargest( # in each group take the top rows by column 'value'
round_func(x.count().max()*p))) # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1) # make the dataframe nice again
df looked like this
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
df_top looks like this
id value
0 1 3
1 2 4
2 2 3
3 3 1
4 4 1
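The same result can be sketched without apply, by sorting once and comparing each row's rank within its group against a per-group quota. This is an alternative formulation, not the method from the answer above; it assumes "top" means largest value and rounds the quota up with np.ceil. The names ordered and keep are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   "value": [1, 2, 3, 1, 2, 3, 4, 1, 1]})
p = 0.30  # fraction of rows to keep per group

# Sort so that the largest values come first inside each group.
ordered = df.sort_values(["id", "value"], ascending=[True, False])
g = ordered.groupby("id")

# Rows to keep per group, rounded up so small groups keep at least one row.
keep = np.ceil(g["value"].transform("size") * p)

# cumcount() numbers the rows 0, 1, 2, ... within each group.
df_top = ordered[g.cumcount() < keep]
print(df_top)
```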

Avoiding looping through pandas dataframe for feature generation

I'm working on some code that generates features from a dataframe and adds these features as columns to the dataframe.
The trouble is I'm working with a time series so that for any given tuple, I need (let's say) 5 of the previous tuples to generate the corresponding feature for that tuple.
lookback_period = 5
df['feature1'] = np.zeros(len(df)) # preallocate
for index, row in df.iterrows():
    if index < lookback_period:
        continue
    slice = df[index - lookback_period:index]
    some_int = SomeFxn(slice)
    df.at[index, 'feature1'] = some_int  # assigning to row would only change a copy
Is there a way to execute this code without explicitly looping through every row and then slicing?
One way is to create several lagged columns using df['column_name'].shift() such that all the necessary information is contained in each row, but this quickly gets intractable for my computer's memory since the dataset is large (millions of rows).

I don't have enough reputation to comment so will just post it here.
Can't you use apply for your dataframe e.g.
df['feature1'] = df.apply(someRowFunction, axis=1)
where someRowFunction will accept the full row and you can perform whatever row based slice and logic you want to do.
--- updated ---
As we do not have much information about the dataframe and the required/expected output, I just based the answer on the information from the comments
Let's define a function that will take a DataFrame slice (based on current row index and lookback) and the row and will return sum of the first column of the slice and value of the current row.
def someRowFunction(slice, row):
    if slice.shape[0] == 0:
        return 0
    return slice[slice.columns[0]].sum() + row.b
d={'a':[1,2,3,4,5,6,7,8,9,0],'b':[0,9,8,7,6,5,4,3,2,1]}
df=pd.DataFrame(data=d)
lookback = 5
df['c'] = df.apply(lambda current_row: someRowFunction(df[current_row.name -lookback:current_row.name],current_row),axis=1)
Inside apply, the row's index label is available through its name attribute, which lets us retrieve the required slice. The code above produces the following:
print(df)
a b c
0 1 0 0
1 2 9 0
2 3 8 0
3 4 7 0
4 5 6 0
5 6 5 20
6 7 4 24
7 8 3 28
8 9 2 32
9 0 1 36
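For this particular feature (a sum over the previous rows), the explicit slicing can be avoided with a rolling window. The sketch below reproduces column c from the example above; it assumes the feature can be expressed as a rolling aggregation, which may not hold for an arbitrary SomeFxn.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
                   "b": [0, 9, 8, 7, 6, 5, 4, 3, 2, 1]})
lookback = 5

# Sum of the previous `lookback` values of a: roll over the window, then
# shift by one so the current row is excluded. Early rows become NaN.
prev_sum = df["a"].rolling(window=lookback).sum().shift(1)

# Add the current row's b; rows without a full window get 0, as in the example.
df["c"] = (prev_sum + df["b"]).fillna(0).astype(int)
print(df)
```

For a custom function that works on a one-dimensional window, df['a'].rolling(lookback).apply(func, raw=True) is a possible starting point, though it still calls Python code once per window.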
