How to delete all rows below a specific row in pandas (Python)?

How can I delete all the rows below the row that contains "Exercises" in pandas (Python)?
Data:
2021.08.16 19:37:15 146242975 XAUEUR buy 0.02 1 517.04 1 517.19 1 519.54 2021.08.16 20:38:30 1 517.15 - 0.12 0.00 0.22
2021.08.16 19:37:15 146242976 XAUEUR buy 0.02 1 517.04 1 517.19 1 522.04 2021.08.16 20:38:30 1 517.15 - 0.12 0.00 0.22
Exercises
2021.08.16 01:02:11 146037881 XAUUSD buy 0.18 / 0.18 market 1 777.72 1 781.47 2021.08.16 01:02:11 filled TP1
...

df = pd.DataFrame({'num': [1, 2, 3, 4, 'Exercises', 6, 7, 8]})
# First find the row index by filtering on the column value
my_index = df.index[df['num'] == 'Exercises'].tolist()[0]  # there may be multiple matches, so take the first one with [0]
# my_index = 4
# Then slice the DataFrame up to (but not including) that row into a new df
df_new = df[:my_index]  # add +1 to my_index if you want to keep the 'Exercises' row as well

I used the loc function.
df = pd.DataFrame({'col':['2021.08.16 19:37:15 146242975','2021.08.16 19:37:15 146242976','Exercises','2021.08.16 01:02:11 146037881'],'values':['a','b','c','d']})
df2 = df.set_index('col')
df2.loc[:'Exercises'][:-1].reset_index()
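As a side note, a mask-based variant (just a sketch, using the toy 'num' column from the first answer) keeps everything above the marker row and simply returns the whole frame if the marker is absent:
mask = df['num'].eq('Exercises').cummax()  # True from the marker row onward
df_new = df[~mask]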

Related

Calculate running total based off original value in pandas

I wish to take an initial value of 1000, multiply it by the first value in the 'Change' column, then multiply that result by the second value in the 'Change' column, and so on.
I could do this using a loop as follows:
changes = [0.97,1.02,1.1,0.88,1.01 ]
df = pd.DataFrame()
df['Change'] = changes
df['Total'] = np.nan
df['Total'][0] = 1000*df['Change'][0]
for i in range(1, len(df)):
    df['Total'][i] = df['Total'][i-1] * df['Change'][i]
Output:
Change Total
0 0.97 970.000000
1 1.02 989.400000
2 1.10 1088.340000
3 0.88 957.739200
4 1.01 967.316592
But this will be too slow for a large dataset. Is there any way to do this without loops?
Thanks
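For reference, a vectorised sketch (using only the example data above): the running total is just the cumulative product of 'Change' scaled by the starting value of 1000.
import pandas as pd
changes = [0.97, 1.02, 1.1, 0.88, 1.01]
df = pd.DataFrame({'Change': changes})
df['Total'] = 1000 * df['Change'].cumprod()  # reproduces the looped output above
print(df)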

Sequential Calculation of Pandas Column without For Loop

I have the sample dataframe below
perc 2018_norm
0 0.009069 27.799849
1 0.011384 0.00
2 -0.000592 0.00
3 -0.002667 0.00
The value in the first row of 2018_norm comes from another DataFrame. I then want to fill 2018_norm from the second row down to the end of the DataFrame, computing each value from the previous row's 2018_norm and the percentage change in the perc column. I can currently achieve this with a for loop, giving the following result:
perc 2018_norm
0 0.009069 27.799849
1 0.011384 28.116324
2 -0.000592 28.099667
3 -0.002667 28.024713
4 -0.006538 27.841490
For loops over DataFrames are just slow, so I know I am missing something basic, but my Google searching hasn't yielded what I am looking for.
I've tried variations of y1df['2018_norm'].iloc[1:] = (y1df['perc'] * y1df['2018_norm'].shift(1)) + y1df['2018_norm'].shift(1) that just yield:
perc 2018_norm
0 0.009069 27.799849
1 0.011384 28.116324
2 -0.000592 0.00
3 -0.002667 0.00
4 -0.006538 0.00
What am I missing?
EDIT: To clarify, a basic for loop with df.iloc was not preferable; a for loop with iterrows sped up the computation substantially, so that approach is a great solution for my use case. Wen-Ben's response also directly answers the question I didn't mean to ask in my original post.
You can use df.iterrows() to loop much more quickly through a pandas data frame:
for idx, row in y1df.iterrows():
    if idx > 0:  # skip the first row
        y1df.loc[idx, '2018_norm'] = (1 + row['perc']) * y1df.loc[idx-1, '2018_norm']
print(y1df)
perc 2018_norm
0 0.009069 27.799849
1 0.011384 28.116322
2 -0.000592 28.099678
3 -0.002667 28.024736
This is just cumprod:
# cumulative growth factor per row, shifted down one row and scaled by the starting value
s = (df.perc.shift(-1).fillna(1) + 1).cumprod().shift().fillna(1) * df['2018_norm'].iloc[0]
df['2018_norm'] = s
df
Out[390]:
perc 2018_norm
0 0.009069 27.799849
1 0.011384 28.116322
2 -0.000592 28.099678
3 -0.002667 28.024736
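For reference, an equivalent but perhaps more readable formulation (a sketch; it reproduces the cumprod answer above rather than adding anything new):
growth = (1 + df['perc']).cumprod()  # cumulative growth factor per row
df['2018_norm'] = df['2018_norm'].iloc[0] * growth / growth.iloc[0]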

Pandas dataframe to nested counter dictionary

I've seen a lot of questions on how to convert pandas dataframes to nested dictionaries, but none of them deal with aggregating the information. I may even be able to do what I need within pandas, but I'm stuck.
Input
I have a dataframe that looks like this:
FeatureID gene Target pos bc_count
0 1_1_1 NRAS_3 TAGCAC 0 0.42
1 1_1_1 NRAS_3 TGCACA 1 1.00
2 1_1_1 NRAS_3 GCACAA 2 0.50
3 1_1_1 NRAS_3 CACAAA 3 2.00
4 1_1_1 NRAS_3 CAGAAA 3 0.42
# create df as below
import pandas as pd
df = pd.DataFrame([
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "TAGCAC", "pos": 0, "bc_count": 0.42},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "TGCACA", "pos": 1, "bc_count": 1.00},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "GCACAA", "pos": 2, "bc_count": 0.50},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "CACAAA", "pos": 3, "bc_count": 2.00},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "CAGAAA", "pos": 3, "bc_count": 0.42},
])
The problem
I need to break apart the Target column of each row into tuples of (position, letter, count), where the starting position is given in the pos column, the position increases by one for each subsequent letter of the string, and the count is that row's bc_count value.
For example, in the first row, the desired list of tuples would be:
[(0, "T", 0.42), (1,"A", 0.42), (2,"G", 0.42), (3,"C", 0.42), (4,"A", 0.42), (5,"C", 0.42)]
What I've tried
I've written code that breaks the Target column of each row into tuples of (position, nucleotide, count) and adds them as a new column to the dataframe:
def index_target(row):
    count_list = [(row.pos + x, y, row.bc_count) for x, y in enumerate(row.Target)]
    return count_list

df['pos_count'] = df.apply(index_target, axis=1)
This returns a list of tuples for each row, based on that row's Target column.
I need to take every row in df and, for each position in each target, sum the counts, which is why I thought of using a dictionary as a counter:
position[letter] += bc_count
I've tried creating a defaultdict, but it is appending each list of tuples separately instead of summing the counts for each position:
from collections import defaultdict
d = defaultdict(dict)  # also tried defaultdict(list) here
for x, y, z in row.pos_count:
    d[x][y] += z
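For reference, a minimal working variant of that counter idea (a sketch, assuming the pos_count column built by index_target above) sums rather than overwrites by using nested defaultdicts of float:
from collections import defaultdict
counts = defaultdict(lambda: defaultdict(float))
for tuples in df['pos_count']:
    for pos, letter, cnt in tuples:
        counts[pos][letter] += cnt  # accumulate bc_count per position and letter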
Desired Output
For each feature in the dataframe, the numbers below represent the sum of the individual bc_count values at each position, and X marks positions where there is a tie so no single letter can be returned as the max:
pos A T G C
0 25 80 25 57
1 32 19 100 32
2 27 18 16 27
3 90 90 90 90
4 10 42 37 18
consensus= TGXXT
This may not be the most elegant solution, but I think it might accomplish what you need:
new_df = pd.DataFrame(
    df.apply(
        # this lambda is basically the same thing you're doing,
        # but we create a pd.Series with it
        lambda row: pd.Series(
            [(row.pos + i, c, row.bc_count) for i, c in enumerate(row.Target)]
        ),
        axis=1
    ).stack().tolist(),
    columns=["pos", "nucl", "count"]
)
Where new_df looks like this:
pos nucl count
0 0 T 0.42
1 1 A 0.42
2 2 G 0.42
3 3 C 0.42
4 4 A 0.42
5 5 C 0.42
6 1 T 1.00
7 2 G 1.00
8 3 C 1.00
9 4 A 1.00
Then I would pivot this to get the aggregated counts:
nucleotide_count_by_pos = new_df.pivot_table(
    index="pos",
    columns="nucl",
    values="count",
    aggfunc="sum",
    fill_value=0
)
Where nucleotide_count_by_pos looks like:
nucl A C G T
pos
0 0.00 0.00 0.00 0.42
1 0.42 0.00 0.00 1.00
2 0.00 0.00 1.92 0.00
3 0.00 4.34 0.00 0.00
4 4.34 0.00 0.00 0.00
And then to get the consensus:
def get_consensus(row):
    max_value = row.max()
    nuc = row.idxmax()
    if (row == max_value).sum() == 1:
        return nuc
    else:
        return "X"

consensus = ''.join(nucleotide_count_by_pos.apply(get_consensus, axis=1).tolist())
Which in the case of your example data would be:
'TTGCACAAA'
I'm not sure how to get your exact desired output, but I created a list d containing the tuples you described and built a dataframe from it. Hopefully it provides some direction for what you want to create:
d = []
for t, c, p in zip(df.Target, df.bc_count, df.pos):
    d.extend([(p + i, c, ch) for i, ch in enumerate(t)])
df_new = pd.DataFrame(d, columns=['pos', 'count', 'val'])
df_new = df_new.groupby(['pos', 'val']).agg({'count': 'sum'}).reset_index()
df_new.pivot(index='pos', columns='val', values='count')

Remove strongly correlated columns from DataFrame [duplicate]

This question already has answers here:
How to calculate correlation between all columns and remove highly correlated ones using pandas?
(28 answers)
Closed 1 year ago.
I have a DataFrame like this
dict_ = {'Date':['2018-01-01','2018-01-02','2018-01-03','2018-01-04','2018-01-05'],'Col1':[1,2,3,4,5],'Col2':[1.1,1.2,1.3,1.4,1.5],'Col3':[0.33,0.98,1.54,0.01,0.99]}
df = pd.DataFrame(dict_, columns=dict_.keys())
I then calculate the Pearson correlation between the columns and filter out columns that are correlated above my threshold of 0.95:
def trimm_correlated(df_in, threshold):
    df_corr = df_in.corr(method='pearson', min_periods=1)
    df_not_correlated = ~(df_corr.mask(np.eye(len(df_corr), dtype=bool)).abs() > threshold).any()
    un_corr_idx = df_not_correlated.loc[df_not_correlated[df_not_correlated.index] == True].index
    df_out = df_in[un_corr_idx]
    return df_out
which yields
uncorrelated_factors = trimm_correlated(df, 0.95)
print(uncorrelated_factors)
Col3
0 0.33
1 0.98
2 1.54
3 0.01
4 0.99
So far I am happy with the result, but I would like to keep one column from each correlated pair, so in the above example I would like to keep Col1 or Col2, to get something like this:
Col1 Col3
0 1 0.33
1 2 0.98
2 3 1.54
3 4 0.01
4 5 0.99
Also on a side note, is there any further evaluation I can do to determine which of the correlated columns to keep?
thanks
You can use np.tril() instead of np.eye() for the mask, so that each correlated pair is only flagged once and the first column of the pair is kept:
def trimm_correlated(df_in, threshold):
    df_corr = df_in.corr(method='pearson', min_periods=1)
    df_not_correlated = ~(df_corr.mask(np.tril(np.ones([len(df_corr)]*2, dtype=bool))).abs() > threshold).any()
    un_corr_idx = df_not_correlated.loc[df_not_correlated[df_not_correlated.index] == True].index
    df_out = df_in[un_corr_idx]
    return df_out
Output:
Col1 Col3
0 1 0.33
1 2 0.98
2 3 1.54
3 4 0.01
4 5 0.99
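For reference, a common equivalent formulation (a sketch, not part of the original answer) builds an explicit list of columns to drop from the upper triangle of the correlation matrix, which likewise keeps the first column of each correlated pair:
import numpy as np
import pandas as pd

def drop_correlated(df_in, threshold=0.95):
    corr = df_in.corr(method='pearson').abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df_in.drop(columns=to_drop)

print(drop_correlated(df[['Col1', 'Col2', 'Col3']]))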
Use this directly on the dataframe to list the most highly correlated column pairs.
import pandas as pd
import numpy as np

def correl(X_train):
    cor = X_train.corr()
    corrm = np.corrcoef(X_train.transpose())
    corr = corrm - np.diagflat(corrm.diagonal())
    print("max corr:", corr.max(), ", min corr: ", corr.min())
    c1 = cor.stack().sort_values(ascending=False).drop_duplicates()
    high_cor = c1[c1.values != 1]
    # change this value to get more correlation results
    thresh = 0.9
    display(high_cor[high_cor > thresh])

correl(X)
output:
max corr: 0.9821068918331252 , min corr: -0.2993837739125243
count_rech_2g_8 sachet_2g_8 0.982107
count_rech_2g_7 sachet_2g_7 0.979492
count_rech_2g_6 sachet_2g_6 0.975892
arpu_8 total_rech_amt_8 0.946617
arpu_3g_8 arpu_2g_8 0.942428
isd_og_mou_8 isd_og_mou_7 0.938388
arpu_2g_6 arpu_3g_6 0.933158
isd_og_mou_6 isd_og_mou_8 0.931683
arpu_3g_7 arpu_2g_7 0.930460
total_rech_amt_6 arpu_6 0.930103
isd_og_mou_7 isd_og_mou_6 0.926571
arpu_7 total_rech_amt_7 0.926111
dtype: float64

Merging dataframes based on ranges that are defined by columns

I have two dataframes. One has some probability brackets.
df1 = pd.DataFrame({'ProbabilityBrackets': [0, 0.50, 0.75, 1.0, 0.75, 0.90, 1.0, 0],
                    'Group': pd.Categorical(["test", "test", "test", "test", "train", "train", "train", "train"]),
                    'Destination': pd.Categorical(["-", "A", "B", "C", "AA", "BB", "CC", "-"])})
Destination Group ProbabilityBrackets
0 - test 0.00
1 A test 0.50
2 B test 0.75
3 C test 1.00
4 AA train 0.75
5 BB train 0.90
6 CC train 1.00
7 - train 0.00
The other dataframe has some random numbers and the group column.
df2 = pd.DataFrame({'randomnumbers': [0.2, 0.15, 0.78, 0.35],
                    'Group': pd.Categorical(["test", "train", "test", "train"])})
Group randomnumbers
0 test 0.20
1 train 0.15
2 test 0.78
3 train 0.35
Now I need to merge the two dataframes by Group and by the probability brackets. Merging by Group is trivial; the challenging part is merging based on ProbabilityBrackets and randomnumbers: a random number in df2 should be mapped to the smallest probability bracket that is larger than it. E.g., test 0.2 in df2 maps to test 0.5 in df1, and test 0.78 in df2 maps to test 1.0 in df1.
I did it as follows, which works correctly:
for group in ['test', 'train']:
    brackets = df1[df1['Group'] == group].sort_values(by='ProbabilityBrackets')['ProbabilityBrackets'].unique()
    bracketlabels = brackets[1:]  # remove the first element of the list (e.g., remove 0 from (0, 0.5, 1))
    # assign random numbers to the brackets so that we can easily merge them with df1
    df2.loc[df2['Group'] == group, 'ProbabilityBrackets'] = pd.cut(df2['randomnumbers'], brackets, labels=bracketlabels)
df3 = df2.merge(df1, on=['Group', 'ProbabilityBrackets'], how='left')
This generates the following output, which is what I want, but it is slower than I'd like because I have thousands of groups in my dataset. Is there a faster, more pythonic way to do it?
Group randomnumbers ProbabilityBrackets Destination
0 test 0.20 0.50 A
1 train 0.15 0.75 AA
2 test 0.78 1.00 C
3 train 0.35 0.75 AA
You can try this.
# Step 1
df_m = df2.merge(df1, on="Group", how="outer")
# Step 2
df_m["diff"] = df_m["randomnumbers"] - df_m["ProbabilityBrackets"]
# Step 3
df_m_filtered = df_m[df_m["diff"] < 0].set_index(["Destination", "ProbabilityBrackets"])
# Step 4
df_desired = df_m_filtered.groupby(["Group", "randomnumbers"])["diff"].nlargest(1).reset_index()
index Group randomnumbers Destination ProbabilityBrackets diff
0 0 test 0.20 A 0.50 -0.30
1 1 test 0.78 C 1.00 -0.22
2 2 train 0.15 AA 0.75 -0.60
3 3 train 0.35 AA 0.75 -0.40
Explanation:
Begin with an outer merge
Calculate differences between randomnumbers and ProbabilityBrackets
Filter the results with the condition df_m["diff"] < 0, as we are only interested in rows whose randomnumbers is smaller than ProbabilityBrackets
Group by ["Group", "randomnumbers"] and take the row with the largest diff (i.e., the bracket closest to the random number) within each group.
Comparing “Group” for every element in df2 to every element in df1 is a lot of unnecessary string comparisons. You could instead try putting all the elements of df1 into a dictionary with Group as the key and having lists of (ProbabilityBrackets, Destination) tuples as the values. When inserting each element from df1, insert the tuple into the list maintaining the sort by ProbabilityBracket so that you don’t have to sort it again. Then you can retrieve the appropriate (ProbabilityBracket, Destination) for each element in df2 by looking in the dictionary by Group and performing a binary search on the list by ProbabilityBracket.
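A rough sketch of that idea (hypothetical code, assuming the df1/df2 from the question and ignoring the zero brackets) could look like this:
import bisect
from collections import defaultdict
import pandas as pd

lookup = defaultdict(list)
for _, row in df1[df1['ProbabilityBrackets'] > 0].iterrows():
    # insert while keeping each group's list sorted by bracket
    bisect.insort(lookup[row['Group']], (row['ProbabilityBrackets'], row['Destination']))

def match(group, rnd):
    brackets = lookup[group]
    i = bisect.bisect_left(brackets, (rnd,))  # first bracket >= rnd
    return brackets[i]

pairs = [match(g, r) for g, r in zip(df2['Group'], df2['randomnumbers'])]
result = pd.DataFrame(pairs, index=df2.index, columns=['ProbabilityBrackets', 'Destination'])
df2 = df2.join(result)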
This is another way of doing it, taking some cues from #JasonR.
Explanation:
- We create a dictionary mapping each group to a list of (Destination, ProbabilityBrackets) tuples, to avoid looping over df1 multiple times.
- Next, for each row of df2 we look up its Group in the dictionary and assign the result based on the given criteria.
from collections import defaultdict

# remove the zero-probability rows
df1 = df1[df1['ProbabilityBrackets'] > 0]

df_dict = defaultdict(list)
# create a dictionary mapping each group to a list of tuples
for index, row in df1.iterrows():
    df_dict[row['Group']].append((row['Destination'], row['ProbabilityBrackets']))

# this calculates the output
for index, row in df2.iterrows():
    d = df_dict[row['Group']]
    randnum = row['randomnumbers']
    # find the smallest probability bracket above the random number
    low = 10000
    tuple_ix = 10000
    for ix, (i, j) in enumerate(d):
        sub = j - randnum
        if sub > 0 and sub < low:
            low = sub
            tuple_ix = ix
    combination = d[tuple_ix]
    df2.loc[index, 'ProbabilityBracket'] = combination[1]
    df2.loc[index, 'Destination'] = combination[0]
Group randomnumbers ProbabilityBracket Destination
0 test 0.20 0.50 A
1 train 0.15 0.75 AA
2 test 0.78 1.00 C
3 train 0.35 0.75 AA
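As another possibility (a sketch, only checked against the toy example above): pd.merge_asof with direction='forward' expresses "smallest bracket at or above the random number" directly, provided the zero brackets are dropped and both frames are sorted on the join keys.
import pandas as pd

# convert the categorical 'by' key to plain strings to be safe, and sort the 'on' keys
left = df2.assign(Group=lambda d: d['Group'].astype(str)).sort_values('randomnumbers')
right = (df1[df1['ProbabilityBrackets'] > 0]
         .assign(Group=lambda d: d['Group'].astype(str))
         .sort_values('ProbabilityBrackets'))

df3 = pd.merge_asof(left, right,
                    left_on='randomnumbers', right_on='ProbabilityBrackets',
                    by='Group', direction='forward')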
