I have a data frame like this,
df
col1 col2 col3
1 2 3
2 5 6
7 8 9
10 11 12
11 12 13
13 14 15
14 15 16
Now I want to create multiple data frames from the above whenever the col1 difference between two consecutive rows is more than 1.
So the result data frames will look like,
df1
col1 col2 col3
1 2 3
2 5 6
df2
col1 col2 col3
7 8 9
df3
col1 col2 col3
10 11 12
11 12 13
df4
col1 col2 col3
13 14 15
14 15 16
I can do this using a for loop and storing the indices, but that will increase execution time. I'm looking for some pandas shortcuts or a Pythonic way to do this most efficiently.
You could define a custom grouper by taking the diff, checking when it is greater than 1, and taking the cumsum of the boolean series. Then group by the result and build a dictionary from the groupby object:
d = dict(tuple(df.groupby(df.col1.diff().gt(1).cumsum())))
print(d[0])
col1 col2 col3
0 1 2 3
1 2 5 6
print(d[1])
col1 col2 col3
2 7 8 9
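Since the question names the pieces df1 through df4, you can also unpack the dictionary directly; groupby sorts the integer keys, which here follow row order (a small sketch for this example's four groups):
df1, df2, df3, df4 = d.values()   # one frame per group, in order of appearance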
A more detailed break-down:
df.assign(difference=(diff := df.col1.diff()),
          condition=(gt1 := diff.gt(1)),
          grouper=gt1.cumsum())
col1 col2 col3 difference condition grouper
0 1 2 3 NaN False 0
1 2 5 6 1.0 False 0
2 7 8 9 5.0 True 1
3 10 11 12 3.0 True 2
4 11 12 13 1.0 False 2
5 13 14 15 2.0 True 3
6 14 15 16 1.0 False 3
You can also peel off the target column and work with it as a Series, rather than grouping the whole frame as in the answer above. That keeps everything smaller. It runs faster on this example, but I don't know how the two approaches will scale, depending on how many times you're splitting.
import numpy as np

row_bool = df['col1'].diff() > 1
split_inds, = np.where(row_bool)
# prepend 0 and append len(df) so every slice has both a start and an end
split_inds = np.insert(arr=split_inds, obj=[0, len(split_inds)], values=[0, len(df)])
df_list = []
for n in range(len(split_inds) - 1):
    df_list.append(df.iloc[split_inds[n]:split_inds[n + 1], :])
df_tup = tuple(df_list)
(Just throwing it in a tuple of dataframes afterward, but the dictionary approach might be better?)
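The same split can also be written as a dictionary comprehension over consecutive index pairs, reusing split_inds from above (a sketch):
# pair each start index with the next one and slice
d2 = {n: df.iloc[start:stop]
      for n, (start, stop) in enumerate(zip(split_inds, split_inds[1:]))}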
I have a dataframe as such:
Col1 Col2 Col3 ... Col64 Col1 Volume Col2 Volume ... Col64 Volume Col1 Value Col2 Value ... Col64 Value
2 3 4 5 5 7 9 3 5
3 4 5 11 8 6 5 6 5
5 3 4 6 10 11 5 3 4
I want to multiply Col1 with Col1 Volume, divide by Col1 Value, and place the result in a new column called 'Col1 Result'.
Similarly, multiply Col2 with Col2 Volume, divide by Col2 Value, and place the result in a new column called 'Col2 Result'.
I wish to do this for every row of those columns.
Output should be as such and these columns should be appended to the existing dataframe.
Col1 Result Col2 Result
3.33 4.2
6 4.8
16.6 8.25
...
How can I perform this operation? It also has to be a row-wise (1-to-1) operation; that is, the first row of Col1 should be multiplied by the first row of Col1 Volume and divided by the first row of Col1 Value.
Doing it manually would take a lot of time.
Use DataFrame.filter to get all columns ending with Volume and Value (the $ in the regex anchors the end of the string), strip those substrings from the column names, then select the matching columns from df, multiply and divide, add the ' Result' suffix with DataFrame.add_suffix, replace missing values with 0, and join back to the original DataFrame:
df1 = df.filter(regex='Volume$').rename(columns=lambda x: x.replace(' Volume',''))
df2 = df.filter(regex='Value$').rename(columns=lambda x: x.replace(' Value',''))
df = df.join(df[df1.columns].mul(df1).div(df2).add_suffix(' Result').fillna(0))
print (df)
Col1 Col2 Col3 Col64 Col1 Volume Col2 Volume Col64 Volume \
0 2 3 4 5 5 7 9
1 3 4 5 11 8 6 5
Col1 Value Col2 Value Col64 Value Col1 Result Col2 Result Col64 Result
0 3 5 7 3.333333 4.2 6.428571
1 6 5 7 4.000000 4.8 7.857143
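If the regex feels opaque, a plain loop over the base column names does the same arithmetic; this is a minimal sketch assuming every base column 'ColN' has matching 'ColN Volume' and 'ColN Value' columns:
# base columns are the ones without a space in the name, e.g. 'Col1'
base_cols = [c for c in df.columns if ' ' not in c]
for c in base_cols:
    df[f'{c} Result'] = df[c] * df[f'{c} Volume'] / df[f'{c} Value']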
I have two dataframes, first one is:
col1 col2 col3
1 14 2 6
2 12 3 3
3 9 4 2
Second one is:
col4 col5 col6
2 14 2 6
3 12 3 3
I want to concatenate them, keeping the index values from the second one and the row values from the first one.
The result will be like this:
col1 col2 col3
2 12 3 3
3 9 4 2
My solution was pd.concat([df2, df1], axis=1).drop(df2.columns, axis=1), but I believe there is a more efficient way to do this.
You can use the index from df2 with loc on df1:
df1.loc[df2.index]
Output:
col1 col2 col3
2 12 3 3
3 9 4 2
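Note that .loc raises a KeyError if any label in df2.index is missing from df1. If that can happen, reindex keeps the missing labels as NaN rows instead (a sketch):
df1.reindex(df2.index)                      # missing labels become NaN rows
df1.loc[df1.index.intersection(df2.index)]  # or keep only the shared labels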
So I have a dataframe and I would like to be able to compare each value with other values in its row and column at the same time. For example, I have something like this:
Col1 Col2 Col3 NumCol
Row1 1 4 7 16
Row2 2 5 8 13
Row3 3 6 9 30
NumRow 28 14 10
For each value that isn't in the NumRow or NumCol, I would like to compare the NumCol and NumRow values in the same column/row as it.
I would like it to return the value of the first instance where NumCol is larger than NumRow in each row.
So the result would be this:
Row1 4
Row2 8
Row3 3
I have no clue on how to even begin this, but is there a way to do this elegantly without using for loops to loop through the whole dataframe to find these values?
First we flatten the dataframe (here df is your original dataframe):
df2 = (df.fillna('NumRow')
         .set_index('NumCol')
         .transpose()
         .set_index('NumRow')
         .stack()
         .reset_index(name='value')
       )
df2
Output:
NumRow NumCol value
0 28 16.0 1
1 28 13.0 2
2 28 30.0 3
3 14 16.0 4
4 14 13.0 5
5 14 30.0 6
6 10 16.0 7
7 10 13.0 8
8 10 30.0 9
Now for each row of the new dataframe df2 we have the corresponding number from NumRow, the corresponding number from NumCol, and the number from within the 'body' of the original dataframe df.
Next we apply the condition, group by NumCol, and within each group find the first row where the condition is satisfied, reporting the corresponding value:
df3 = (df2.assign(cond=df2['NumCol'] > df2['NumRow'])
          .groupby('NumCol')
          .apply(lambda d: d[d['cond']].iloc[0])['value']
       )
df3.index = df3.index.map(dict(zip(df['NumCol'],df.index)))
df3.sort_index()
Output:
NumCol
Row1 4
Row2 8
Row3 3
Name: value, dtype: int64
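A broadcasting alternative avoids the reshape entirely. This is a sketch assuming the same layout (body rows Row1..Row3, a NumCol column, a NumRow row) and that every row has at least one column where the condition holds:
import numpy as np
import pandas as pd

body = df.drop(index='NumRow', columns='NumCol')     # the 3x3 body of values
num_col = df.loc[body.index, 'NumCol'].to_numpy()    # per-row NumCol values
num_row = df.loc['NumRow', body.columns].to_numpy()  # per-column NumRow values

mask = num_col[:, None] > num_row[None, :]           # NumCol > NumRow per cell
first = mask.argmax(axis=1)                          # first True in each row
result = pd.Series(body.to_numpy()[np.arange(len(body)), first],
                   index=body.index)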
I have a data frame like this,
col1 col2 col3
1 2 3
2 3 4
4 2 3
7 2 8
8 3 4
9 3 3
15 1 12
Now I want to group the rows where the difference between two consecutive col1 values is less than 3, sum the other column values, and create another column (col4) holding the last col1 value of each group.
So the final data frame will look like,
col1 col2 col3 col4
1 7 10 4
7 8 15 9
Using a for loop to do this is tedious; I'm looking for some pandas shortcuts to do it most efficiently.
You can do a named aggregation on groupby:
(df.groupby(df.col1.diff().ge(3).cumsum(), as_index=False)
   .agg(col1=('col1', 'first'),
        col2=('col2', 'sum'),
        col3=('col3', 'sum'),
        col4=('col1', 'last'))
)
Output:
col1 col2 col3 col4
0 1 7 10 4
1 7 8 15 9
2 15 1 12 15
Update: without named aggregation you can do something like this:
groups = df.groupby(df.col1.diff().ge(3).cumsum())
new_df = groups.agg({'col1':'first', 'col2':'sum','col3':'sum'})
new_df['col4'] = groups['col1'].last()
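For intuition, the grouper both versions share is just a cumulative count of the break points; on the sample col1 above it evaluates to:
key = df.col1.diff().ge(3).cumsum()
print(key.tolist())   # [0, 0, 0, 1, 1, 1, 2] -> three groups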
I have a data frame with more than 100 columns. I need to lag 60 of them, and I know the column names for which I need the lag. Is there a way to lag them in batch or in just a few lines?
Say I have a dataframe like below:
col1 col2 col3 col4 col5 col6 ... col100
1 2 3 4 5 6 8
3 9 15 19 21 23 31
The only way I know is to do it one by one, i.e. run df['col1_lag'] = df['col1'].shift(1) for each column.
That seems like too much for so many columns. Is there a better way to do this? Thanks in advance.
Use shift with add_suffix for the new DataFrame and join it to the original:
df1 = df.join(df.shift().add_suffix('_lag'))
#alternative
#df1 = pd.concat([df, df.shift().add_suffix('_lag')], axis=1)
print (df1)
col1 col2 col3 col4 col5 col6 col100 col1_lag col2_lag col3_lag \
0 1 2 3 4 5 6 8 NaN NaN NaN
1 3 9 15 19 21 23 31 1.0 2.0 3.0
col4_lag col5_lag col6_lag col100_lag
0 NaN NaN NaN NaN
1 4.0 5.0 6.0 8.0
If you want to lag only some of the columns, you can filter them by a list:
cols = ['col1','col3','col5']
df2 = df.join(df[cols].shift().add_suffix('_lag'))
print (df2)
col1 col2 col3 col4 col5 col6 col100 col1_lag col3_lag col5_lag
0 1 2 3 4 5 6 8 NaN NaN NaN
1 3 9 15 19 21 23 31 1.0 3.0 5.0
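If you need more than one lag per column, the same pattern extends with a list comprehension; a sketch, where the lags list is illustrative:
import pandas as pd

lags = [1, 2]  # hypothetical: lag by 1 and by 2 rows
df3 = df.join(pd.concat([df[cols].shift(k).add_suffix(f'_lag{k}') for k in lags],
                        axis=1))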