How to delete an outlier from a np.where condition - python

I have a dataframe with an outlier, which I spotted with a boxplot. I then caught its value with np.where, but I don't know how to delete that value and its whole row from my dataframe so that I can get rid of the outlier.
This is the code I used for it so far:
sns.boxplot(x=df_cor_inc['rt'].astype(float))
outlier = np.where(df_cor_inc['rt'].astype(float)>50000)
Any help would be great. Thanks.

No need for np.where, a simple boolean mask will do the trick:
df_cor_inc = df_cor_inc[df_cor_inc['rt'] <= 50000]
Also, why are you casting df_cor_inc['rt'] as float? Is it not already numeric?
If you want to reset the indices of your dataframe, tack on a .reset_index(drop=True).
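For example, a minimal sketch with invented values (the real df_cor_inc comes from your data; the 50000 threshold is the one from the question):
import pandas as pd

# Invented example data; 'rt' stands in for the reaction-time column from the question.
df_cor_inc = pd.DataFrame({'rt': [350.0, 420.0, 61000.0, 510.0]})

# Keep only the rows at or below the threshold and renumber the index.
df_cor_inc = df_cor_inc[df_cor_inc['rt'] <= 50000].reset_index(drop=True)
print(df_cor_inc)   # the 61000.0 row is gone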

Try this:
df_cor_inc[np.where(df_cor_inc['rt'].astype(float) > 50000, False, True)]

Related

Pandas conditional row values based on another column

[Picture of the dataframe]
Hi! I've been trying to figure out how to calculate wallet balances of ERC-20 tokens, but can't get this to work. The idea is simple: when the "Status" column's value is "Sending", the value should be negative, and when it is "Receiving", it should be positive. Lastly, I would use groupby and calculate sums by token symbol. The problem is that I can't get the conditional statement to work. What would be a good way to do this? I've tried loop iterations, but they don't seem to work.
Assuming that df is the dataframe you presented, it's enough to select proper slice and multiply values by -1:
df.loc[df['Status'] == 'Sending', 'Value'] *= -1
And then grouping:
df = df.groupby(['Symbol']).sum().reset_index()
Looping in pandas is not a good idea: you can perform operations in a more optimal, vectorised manner, so try to avoid it.
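For instance, a rough sketch with made-up rows (the column names follow the question; the token symbols and amounts are invented):
import pandas as pd

# Invented example mirroring the Status / Symbol / Value columns from the question.
df = pd.DataFrame({
    'Status': ['Sending', 'Receiving', 'Sending', 'Receiving'],
    'Symbol': ['DAI', 'DAI', 'USDC', 'USDC'],
    'Value':  [10.0, 25.0, 5.0, 5.0],
})

# Outgoing transfers become negative, then the balances are summed per token.
df.loc[df['Status'] == 'Sending', 'Value'] *= -1
balances = df.groupby('Symbol')['Value'].sum().reset_index()
print(balances)   # DAI 15.0, USDC 0.0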

Is there a way to overwrite Nan values in a pandas dataframe with values of the previous row?

I am working with a dataframe called tabla_combinada that looks like this:
[Structure of the dataframe used]
What I am attempting to do is to get rid of the Nan values in the 'End Meter' column and replace it with the value of the same column in the previous row. I tried to implement the following code:
counter = 0
for x in tabla_combinada['End Meter']:
    if math.isnan(x):
        x = tabla_combinada['End Meter'][counter-1]
        tabla_combinada['End Meter'][counter-1] = tabla_combinada['Start Meter'][counter]
    counter = counter + 1
This is not working for me, in the first place I am getting the following warning:
A value is trying to be set on a copy of a slice from a DataFrame.
But what bugs me is that the dataframe does not change at all. I do understand the cause of the warning, and I suspect this is not the optimal approach to the problem. I guess there is a proper way to do this with loc, but I couldn't figure out how to tell it to replace the NaN with the value of the previous row.
Sorry for the long question and thanks in advance.
All you need to do is this:
tabla_combinada['End Meter'] = tabla_combinada['End Meter'].fillna(method='ffill')
This will propagate the last non-null value forward. Note that the result has to be assigned back: fillna does not modify the column in place by default.
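A quick sketch of the behaviour on an invented toy Series:
import pandas as pd
import numpy as np

# Invented values; each NaN is replaced by the last value seen above it.
s = pd.Series([10.0, np.nan, np.nan, 42.0, np.nan])
print(s.fillna(method='ffill').tolist())   # [10.0, 10.0, 10.0, 42.0, 42.0]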

pandas string noise canceling

Say I have a list of drinks:
drinks=['coke','water','milk','yoghourt']
And I have a pandas series containing some of the items mixed with other noisy strings
s = pd.Series(['cokeabc', np.nan, np.nan, 'water coke', np.nan, 'milk and yoghourt', 'only water'])
My purpose is to filter out the noise first, fill in the missing values based on another column, and then call get_dummies on the s column.
My try was:
buff = []
for i in material:
    if df['drink'].str.contains(i):
        buff.append(i)
kvkl['drink'] = ' '.join(buff)
but df['drink'].str.contains(i) returns the whole column of bools
should I try apply()?
You can easily make your code work by adding .any() at the end of the condition:
buff = []
for i in material:
    if df['drink'].str.contains(i).any():
        buff.append(i)
kvkl['drink'] = ' '.join(buff)
This checks whether any cell is True and delivers the expected result.
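For example, on a toy column (the drinks list comes from the question; the column values are invented):
import pandas as pd

drinks = ['coke', 'water', 'milk', 'yoghourt']
col = pd.Series(['cokeabc', 'water coke', 'only water'])

# Keep each drink that appears somewhere in the column.
buff = [i for i in drinks if col.str.contains(i).any()]
print(' '.join(buff))   # coke water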
OK, I figured it out:
def drink_format(mtr):
    drinks = ['coke', 'water', 'milk', 'yoghourt']
    buff = []
    for i in drinks:
        if i in mtr:
            buff.append(i)
    return ' '.join(buff)

s = s.map(drink_format)
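If the series still contains NaN, as in the example above, one option (my assumption, not part of the original answer) is to fill the gaps with an empty string before mapping with the drink_format function defined above; str.get_dummies then gives the indicator columns mentioned in the question:
import pandas as pd
import numpy as np

s = pd.Series(['cokeabc', np.nan, np.nan, 'water coke', np.nan,
               'milk and yoghourt', 'only water'])

# Assumption: fill NaN with '' so drink_format always receives a string.
cleaned = s.fillna('').map(drink_format)
dummies = cleaned.str.get_dummies(sep=' ')
print(dummies.columns.tolist())   # ['coke', 'milk', 'water', 'yoghourt']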

pandas: set one cell value equal to another

I want to set a cell of pandas dataframe equal to another. For example:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude']
However, when I checked
station_dim.loc[station_dim.nlc==573,'longitude']
It returns NaN
Besides directly setting station_dim.loc[station_dim.nlc==573,'longitude'] to a number, what other options do I have? And why doesn't this method work?
Take a look at get_value, or use .values:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude'].values[0]
For the assignment to work, note that .loc[] returns a pd.Series, and the index of that pd.Series would need to align with your df, which it probably doesn't. So either extract the value directly using .get_value() (you need to get the index position first) or use .values, which returns a np.array, and take the first value of that array.
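A small sketch of the alignment problem (the nlc codes are taken from the question; the longitude values are invented):
import pandas as pd

# Invented frame with the two nlc codes from the question.
station_dim = pd.DataFrame({'nlc': [573, 5152],
                            'longitude': [0.0, -0.1278]})

# Plain .loc-to-.loc assignment aligns on the index: the right-hand Series has
# index 1, the target row has index 0, so NaN gets written.
station_dim.loc[station_dim.nlc == 573, 'longitude'] = \
    station_dim.loc[station_dim.nlc == 5152, 'longitude']

# .values drops the index, so the scalar is assigned as intended.
station_dim.loc[station_dim.nlc == 573, 'longitude'] = \
    station_dim.loc[station_dim.nlc == 5152, 'longitude'].values[0]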

Pandas: Get duplicated indexes

Given a dataframe, I want to get the duplicated indexes, which do not have duplicate values in the columns, and see which values are different.
Specifically, I have this dataframe:
import pandas as pd
wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed
alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed', header=None, index_col=3)
In [74]: alt_exon_repeatmasker.index.is_unique
Out[74]: False
Some of the indexes have duplicate values in the 9th column (the type of DNA repetitive element at that location), and I want to know what the different types of repetitive elements are for individual locations (each index = a genome location).
I'm guessing this will require some kind of groupby and hopefully some groupby ninja can help me out.
To simplify even further, if we only have the index and the repeat type:
genome_location1 MIR3
genome_location1 AluJb
genome_location2 Tigger1
genome_location3 AT_rich
So, as output, I'd like to see all duplicate indexes and their repeat types, like this:
genome_location1 MIR3
genome_location1 AluJb
EDIT: added toy example
Also useful and very succinct:
df[df.index.duplicated()]
Note that this only returns one of the duplicated rows, so to see all the duplicated rows you'll want this:
df[df.index.duplicated(keep=False)]
df.groupby(level=0).filter(lambda x: len(x) > 1)['type']
We added the filter method for exactly this kind of operation. You can also use masking and transform for equivalent results, but this is faster and a little more readable too.
Important:
The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. The issue (and a related issue with transform on Series) was fixed for version 0.13, which should be released any day now.
Clearly, nonunique indexes are the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try that on a Series with a nonunique index, it too will fail.
There is no good reason why filter and transform should not be applied to nonunique indexes; it was just poorly implemented at first.
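On the toy example from the question, both approaches give the duplicated locations (a sketch; df is rebuilt here from the toy rows above):
import pandas as pd

# Toy rows from the question, with the genome location as the index.
df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'])

# filter keeps whole groups with more than one row ...
print(df.groupby(level=0).filter(lambda x: len(x) > 1)['type'])

# ... and the transform-based mask selects the same rows.
print(df[df.groupby(level=0)['type'].transform(len) > 1])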
Even faster and better:
df.index.get_duplicates()
As of 9/21/18, pandas raises FutureWarning: 'get_duplicates' is deprecated and will be removed in a future release, and instead suggests the following:
df.index[df.index.duplicated()].unique()
>>> df[df.groupby(level=0).transform(len)['type'] > 1]
type
genome_location1 MIR3
genome_location1 AluJb
More succinctly:
df[df.groupby(level=0).type.count() > 1]
FYI, for a MultiIndex:
df[df.groupby(level=[0,1]).type.count() > 1]
This gives you the index values along with a preview of the duplicated rows:
def dup_rows_index(df):
    dup = df[df.duplicated()]
    print('Duplicated index loc:', dup.index.tolist())
    return dup
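For example, on an invented frame with one repeated row:
import pandas as pd

# Invented data: the last row duplicates the one before it.
df = pd.DataFrame({'a': [1, 2, 2], 'b': ['x', 'y', 'y']})
dup_rows_index(df)   # prints: Duplicated index loc: [2]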
