Remove values above/below standard deviation - python

I have a dataset of 18 columns and 15 million rows. Each column contains outliers, and I want to remove values that are more than 2 standard deviations from the column mean. My code doesn't seem to change anything in the data, though.
Thank you.
import pandas as pd
import random as r
import numpy as np
df = pd.read_csv('D:\\Project\\database\\3-Last\\LastCombineHalf.csv')
df[df.apply(lambda x :(x-x.mean()).abs()<(2*x.std()) ).all(1)]
df.to_csv('D:\\Project\\database\\3-Last\\Removal.csv', index=False)

Perhaps because you didn't assign the results back to df?
From:
df[df.apply(lambda x :(x-x.mean()).abs()<(2*x.std()) ).all(1)]
To:
df = df[df.apply(lambda x :(x-x.mean()).abs()<(2*x.std()) ).all(1)]
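The same filter can also be written with a column-wise mask instead of apply; a minimal sketch, assuming every column is numeric:
mask = (df - df.mean()).abs() < 2 * df.std()   # True where a value is within 2 std of its column mean
df = df[mask.all(axis=1)]                      # keep rows where every column passes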

Mean of every 15 rows of a dataframe in python

I have a dataframe of shape (1500, 11). I need to take every chunk of 15 rows and compute the mean of each of the 11 columns over that chunk, so my final dataframe should have dimensions 100 x 11. How do I do this in Python?
The following should work:
dfnew = df[:0]                              # empty frame with the same columns
for i in range(100):
    df2 = df.iloc[i*15:i*15+15, :]          # the i-th chunk of 15 rows
    x = pd.Series(dict(df2.mean()))         # column means of that chunk
    dfnew = dfnew.append(x, ignore_index=True)
print(dfnew)
I don't know much about pandas, so I've written my next solution in pure NumPy, without any Python loops, which makes it very efficient; the result is then converted back to a pandas DataFrame:
import pandas as pd, numpy as np
df = pd.DataFrame([[i + j for j in range(11)] for i in range(1500)])
a = df.values
a = a.reshape((a.shape[0] // 15, 15, a.shape[1]))  # shape (100, 15, 11): one block per chunk of 15 rows
a = np.mean(a, axis=1)                             # mean within each chunk
df = pd.DataFrame(a)
print(df)
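For reference, the same chunked mean can also be written in pandas alone; a minimal sketch, assuming the default RangeIndex so that integer-dividing the index by 15 labels each chunk:
dfnew = df.groupby(df.index // 15).mean()   # 100 rows, one per chunk of 15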
You can also build the result with pandas.DataFrame directly.
Use a for loop to compute the means, keeping a counter that is reset every 15 entries; a concrete sketch is given after the documentation link below.
columns = [col1, col2, ..., col11]
for column, values in df.items():
    # compute the running mean
    # every 15 entries, save it and reset the counter
You can then create the new dataframe with pd.DataFrame().
I'd recommend reading the documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
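A minimal sketch of that counter approach, assuming the 1500 x 11 frame from the question is already loaded as df:
chunks = []
current = []
for _, row in df.iterrows():
    current.append(row)
    if len(current) == 15:                            # every 15 entries, save the chunk mean and reset
        chunks.append(pd.DataFrame(current).mean())
        current = []
dfnew = pd.DataFrame(chunks).reset_index(drop=True)   # 100 x 11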

Randomly select rows from DataFrame Pandas

Okay this is somewhat tricky. I have a DataFrame of people and I want to randomly select 27% of them. I want to create a new Boolean column in that DataFrame that shows if that person was randomly selected.
Anyone have any idea how to do this?
The built-in sample method provides a frac argument giving the fraction of rows contained in the sample.
If your DataFrame of people is people_df:
percent_sampled = 27
sample_df = people_df.sample(frac = percent_sampled/100)
people_df['is_selected'] = people_df.index.isin(sample_df.index)
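If you don't need the intermediate sample_df, the same idea fits on one line (a sketch, reusing the people_df name from above):
people_df['is_selected'] = people_df.index.isin(people_df.sample(frac=0.27).index)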
n = len(df)
idx = np.arange(n)
np.random.shuffle(idx)                          # shuffles in place; shuffle returns None, so don't assign it
selected_idx = idx[:int(0.27 * n)]              # first 27% of the shuffled positions
selected_df = df[df.index.isin(selected_idx)]   # assumes a default integer index
Defining a dataframe with the numbers 0-99 in column 0 and shuffling them:
import random
import pandas as pd
import numpy as np
a = pd.DataFrame(range(100))
random.shuffle(a[0])
Using random.sample to choose 27 random numbers from the list, WITHOUT repetition (replace 27 with int(0.27*len(a[0])) if you want to define it as a percentage):
choices = random.sample(list(a[0]),27)
Using np.where to assign boolean values to new column in dataframe:
a['Bool'] = np.where(a[0].isin(choices),True,False)

Select python data-frame columns with similar names

I have a data frame named df1 like this:
as_id  TCGA_AF_2687  TCGA_AF_2689_Norm  TCGA_AF_2690  TCGA_AF_2691_Norm
   31             1                  5             9                  2
I want to select all the columns that end with "Norm". I have tried the code below:
import os
print os.getcwd()
os.chdir('E:/task')
import pandas as pd
df1 = pd.read_table('haha.txt')
Norms = []
for s in df1.columns:
    if s.endswith('Norm'):
        Norms.append(s)
print Norms
but I only get a list of names. What can I do to select the actual columns, including their values, rather than just the column names? I know it may be a silly question, but I am a beginner and would really appreciate the help.
df1[Norms] will get the actual columns from df1.
As a matter of fact the whole code can be simplified to
import os
import pandas as pd
os.chdir('E:/task')
df1 = pd.read_table('haha.txt')
norm_df = df1[[column for column in df1.columns if column.endswith('Norm')]]
One can also use the filter higher-order function:
newdf = df[list(filter(lambda x: x.endswith("Norm"),df.columns))]
print(newdf)
Output:
TCGA_AF_2689_Norm TCGA_AF_2691_Norm
0 5 2
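pandas also has its own DataFrame.filter method for matching column names; a minimal sketch, assuming the same df1:
norm_df = df1.filter(regex='Norm$')   # keep only columns whose names end with 'Norm'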

pandas sample based on criteria

I would like to use the pandas sample function, but with a criterion, without grouping or filtering the data.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=5, size=(10000, 2)),columns=['a', 'b'])
print df.sample(n=100)
This will sample 100 rows, but what if I want to sample, for example, 50 rows where df['a'] contains 1, or 50 rows each for 0 and 1?
You can use the == operator to make a list* of boolean values, and when that list is passed to the indexer ([]) it filters the rows. You can then use n=50 to take a sample of 50 rows.
New code
df[df['a']==1].sample(n=50)
Full code
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=5, size=(10000, 2)),columns=['a', 'b'])
print(df[df['a']==1].sample(n=50))
*"List" isn't literally a list in this context, but it is a good word for explaining how it works. It's technically a boolean Series that maps each row to a True/False value.
More obscure DataFrame sampling
If you want to sample all 50 where a is 1 or 0:
print(df[(df['a']==1) | (df['a']==0)].sample(n=50))
And if you want to sample 50 of each:
df1 = df[df['a']==1].sample(n=50)
df0 = df[df['a']==0].sample(n=50)
print(pd.concat([df1,df0]))
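In newer pandas (1.1+), the per-value sampling can also be done in a single call with groupby; a sketch assuming the same df:
print(df[df['a'].isin([0, 1])].groupby('a').sample(n=50))   # 50 rows for a==0 and 50 for a==1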

In Python, given that there is a matrix, how do I print a slice of it?

Let me first outline the overall context of the problem at hand, through the following code snippet.
import pandas as pd
df = pd.read_csv("abc.csv")
df.as_matrix
The desired matrix [100 rows x 785 columns] is output.
I am having difficulty outputting (using print()) a row of the above matrix.
I tried the following, but in vain:
print(df[0])
print(df[:, 0])
The return value of as_matrix() is an array of arrays, so the following code should work:
import pandas as pd
df = pd.read_csv("abc.csv")
matrix = df.as_matrix()
print(matrix[0])    # output the first row
print(matrix[3:5])  # output rows 3 up to (and including) 4
I think you are missing the parentheses:
df.as_matrix()[0]
Or you can use:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html
.head(n=5) Returns first n rows
df.head(1).as_matrix()
To get the row at position A (iloc is position-based, not label-based):
df.iloc[A, :785]
In general, you can use df.iloc to slice from a pandas dataframe by position. For example, to slice the first 100 rows and first 785 columns you can do the following:
import pandas as pd
df = pd.read_csv("abc.csv")
df = df.iloc[:100, :785]
df.as_matrix()
If you want to slice the first row after converting to a matrix, you are working with a 2-D NumPy array, so you can do it as follows:
print(df.as_matrix()[0,:])
Here is a working example:
from io import StringIO  # on Python 2: from StringIO import StringIO
import pandas as pd
st = """
col1|col2|col3
1|2|3
4|5|6
7|8|9
"""
df = pd.read_csv(StringIO(st), sep="|")
print("print first row from a matrix")
print(df.as_matrix()[0,:])
print("print one column")
print(df.iloc[:2,1])
print("print a slice")
print(df.iloc[:2,:])
print("print one row")
print(df.iloc[1,:])
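Note that as_matrix() has since been deprecated and removed in newer pandas versions; .values or .to_numpy() returns the same array, and the same slicing applies (a sketch):
print(df.to_numpy()[0, :])   # first row as a NumPy array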
