How to remove outliers in a text dataframe? - python

I'm writing a program that reads a text file and sorts the data into name, job, company and location fields in the form of a pandas dataframe. The location field is the same for all of the rows except for one or two outliers. I want to remove these rows from the df and put them in a separate list.
Example:
Name Job Company Location
1. n1 j1 c1 l
2. n2 j2 c2 l
3. n3 j3 c3 x
4. n4 j4 c4 l
Is there a way to remove only the row with location 'x' (row 3)?

I would extract the two groups into separate DataFrames:
same_df = df.query('location == "<onethatisthesame>"')
Then I would repeat this with != to get the others:
other_df = df.query('location != "<onethatisthesame>"')
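A self-contained sketch of that split, using the example table from the question (column names assumed from the header shown above):

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    "Name": ["n1", "n2", "n3", "n4"],
    "Job": ["j1", "j2", "j3", "j4"],
    "Company": ["c1", "c2", "c3", "c4"],
    "Location": ["l", "l", "x", "l"],
})

# Rows with the common location
same_df = df.query('Location == "l"')
# Outlier rows, kept separately
other_df = df.query('Location != "l"')

print(same_df)
print(other_df["Name"].tolist())  # ['n3']
```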

You can use:
import pandas as pd
# df = df[df['location'] == yourRepeatedValue]
df = pd.DataFrame(columns = ['location'] )
df.at[1, 'location'] = 'mars'
df.at[2, 'location'] = 'pluto'
df.at[3, 'location'] = 'mars'
print(df)
df = df[df['location'] == 'mars']
print(df)
This will create a new DataFrame that contains only yourRepeatedValue.
In the example, the new df won't contain any rows whose location differs from 'mars'.
The output would be:
location
1 mars
2 pluto
3 mars
location
1 mars
3 mars

Related

Extract values within the quotes signs into two separate columns with python

How can I extract the values within the quote signs into two separate columns with python? The dataframe is given below:
df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'", "'FRH03';'29340'",
"'FRH05';'29350'", "'FRG02';'29360'"], columns = ['postcode'])
df
postcode
0 'FRH02';'29290'
1 'FRH01';'29300'
2 'FRT02';'29310'
3 'FRH03';'29340'
4 'FRH05';'29350'
5 'FRG02';'29360'
I would like to get an output like the one below:
postcode1 postcode2
FRH02 29290
FRH01 29300
FRT02 29310
FRH03 29340
FRH05 29350
FRG02 29360
I have tried several str.extract codes but haven't been able to figure this out. Thanks in advance.
Finishing Quang Hoang's solution that he left in the comments:
import pandas as pd
df = pd.DataFrame(["'FRH02';'29290'",
                   "'FRH01';'29300'",
                   "'FRT02';'29310'",
                   "'FRH03';'29340'",
                   "'FRH05';'29350'",
                   "'FRG02';'29360'"],
                  columns=['postcode'])
# Remove the quotes and split the strings, which results in a Series made up of 2-element lists
postcodes = df['postcode'].str.replace("'", "").str.split(';')
# Unpack the transposed postcodes into 2 new columns
df['postcode1'], df['postcode2'] = zip(*postcodes)
# Delete the original column
del df['postcode']
print(df)
Output:
postcode1 postcode2
0 FRH02 29290
1 FRH01 29300
2 FRT02 29310
3 FRH03 29340
4 FRH05 29350
5 FRG02 29360
You can use Series.str.split, stripping the quotes from each piece:
p1 = []
p2 = []
for row in df['postcode'].str.split(';'):
    p1.append(row[0].strip("'"))
    p2.append(row[1].strip("'"))
df2 = pd.DataFrame()
df2["postcode1"] = p1
df2["postcode2"] = p2

groupby and sum two columns and set as one column in pandas

I have the following data frame:
import pandas as pd
data = pd.DataFrame()
data['Home'] = ['A','B','C','D','E','F']
data['HomePoint'] = [3,0,1,1,3,3]
data['Away'] = ['B','C','A','E','D','D']
data['AwayPoint'] = [0,3,1,1,0,0]
I want to group by the columns ['Home', 'Away'], renamed as Team, and sum HomePoint and AwayPoint into a single column named Points:
Team Points
A 4
B 0
C 4
D 1
E 4
F 3
How can I do it?
I tried a different approach using the following post:
Link
But I was not able to get the format that I wanted.
Greatly appreciate your advice.
Thanks
Zep.
A simple way is to create two new Series indexed by the teams:
home = pd.Series(data.HomePoint.values, data.Home)
away = pd.Series(data.AwayPoint.values, data.Away)
Then, the result you want is:
home.add(away, fill_value=0).astype(int)
Note that home + away does not work, because team F never played away, so would result in NaN for them. So we use Series.add() with fill_value=0.
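Run end-to-end on the question's data, this looks as follows (a sketch; note D appears twice in Away, so the away points are summed per team first to keep the index unique before aligning):

```python
import pandas as pd

data = pd.DataFrame({'Home': ['A', 'B', 'C', 'D', 'E', 'F'],
                     'HomePoint': [3, 0, 1, 1, 3, 3],
                     'Away': ['B', 'C', 'A', 'E', 'D', 'D'],
                     'AwayPoint': [0, 3, 1, 1, 0, 0]})

home = pd.Series(data.HomePoint.values, data.Home)
# Sum duplicate away appearances so the index is unique before aligning
away = pd.Series(data.AwayPoint.values, data.Away).groupby(level=0).sum()

# fill_value=0 covers teams that never played away (F)
points = home.add(away, fill_value=0).astype(int)
print(points)
```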
A complicated way is to use DataFrame.melt():
goo = data.melt(['HomePoint', 'AwayPoint'], var_name='At', value_name='Team')
goo.HomePoint.where(goo.At == 'Home', goo.AwayPoint).groupby(goo.Team).sum()
Or from the other perspective:
ooze = data.melt(['Home', 'Away'])
ooze.value.groupby(ooze.Home.where(ooze.variable == 'HomePoint', ooze.Away)).sum()
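The first melt variant, run on the question's data (a sketch; the expected per-team totals follow from the table above):

```python
import pandas as pd

data = pd.DataFrame({'Home': ['A', 'B', 'C', 'D', 'E', 'F'],
                     'HomePoint': [3, 0, 1, 1, 3, 3],
                     'Away': ['B', 'C', 'A', 'E', 'D', 'D'],
                     'AwayPoint': [0, 3, 1, 1, 0, 0]})

# id_vars keeps both point columns on every melted row
goo = data.melt(['HomePoint', 'AwayPoint'], var_name='At', value_name='Team')
# Pick the point column matching where the team played, then sum per team
points = goo.HomePoint.where(goo.At == 'Home', goo.AwayPoint).groupby(goo.Team).sum()
print(points)
```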
You can concatenate, pairwise, columns of your input dataframe. Then use groupby.sum.
# calculate number of pairs
n = len(data.columns) // 2
# create list of pairwise dataframes
df_lst = [data.iloc[:, 2*i:2*(i+1)].set_axis(['Team', 'Points'], axis=1, inplace=False)
          for i in range(n)]
# concatenate list of dataframes
df = pd.concat(df_lst, axis=0)
# perform groupby
res = df.groupby('Team', as_index=False)['Points'].sum()
print(res)
Team Points
0 A 4
1 B 0
2 C 4
3 D 1
4 E 4
5 F 3

find gene name from liste to dataframe

I actually need to know whether certain genes appear in my results. To do so, I have a list of my genes' names and a dataframe with the same names.
For example:
liste = ["gene1","gene2","gene3","gene4","gene5"]
and a dataframe:
name1 name2
gene1_0035 gene1_0042
gene56_0042 gene56_0035
gene4_0042 gene4_0035
gene2_0035 gene2_0042
gene57_0042 gene57_0035
then I did:
df = pd.read_csv("dataframe_not_max.txt", sep='\t')
df = df.drop(columns=(['Unnamed: 0', 'Unnamed: 0.1']))
#print(df)
print(list(df.columns.values))
name1 = df.ix[:,1]
name2 = df.ix[:,2]
liste = []
for record in SeqIO.parse(data, "fasta"):
    liste.append(record.id)
print(liste)
print(len(liste))
count = 0
for a, b in zip(name1, name2):
    if a in liste:
        count += 1
    if b in liste:
        count += 1
print(count)
What I want is to count how many times a gene from my list is found in my dataframe, but the IDs are not exactly the same: the list entries lack the _number suffix after the gene name, so the `if a in liste` test does not recognize the ID.
Is it possible to say something like:
if a_without_number in liste:
In the above example it would be:
count = 3, because only gene1, gene2 and gene4 are present in both the list and the dataframe.
Here is a more complicated example to see if your script indeed works for my data:
Let's say I have a dataframe such:
cluster_name qseqid sseqid pident_x
15 cluster_016607 EOG090X00GO_0035_0035 EOG090X00GO_0042_0035
16 cluster_016607 EOG090X00GO_0035_0035 EOG090X00GO_0042_0042
18 cluster_016607 EOG090X00GO_0035_0042 EOG090X00GO_0042_0035
19 cluster_016607 EOG090X00GO_0035_0042 EOG090X00GO_0042_0042
29 cluster_015707 EOG090X00LI_0035_0035 EOG090X00LI_0042_0042
30 cluster_015707 EOG090X00LI_0035_0035 EOG090X00LI_0042_0035
34 cluster_015707 EOG090X00LI_0042_0035 g1726.t1_0035_0042
37 cluster_015707 EOG090X00LI_0042_0042 g1726.t1_0035_0042
and a list : ["EOG090X00LI_","EOG090X00GO_","EOG090X00BA_"]
here I get 6 but I should get 2, because only 2 sequences (EOG090X00LI and EOG090X00GO) are in my data
in fact, I want to count each sequence only once when it appears, even if it is paired with something else, for example EOG090X00LI vs seq123454
I do not know if that is clear?
For the example I used:
df=pd.read_csv("test_busco_augus.csv",sep=',')
#df=df.drop(columns=(['Unnamed: 0', 'Unnamed: 0.1']))
print(df)
print(list(df.columns.values))
name1=df.ix[:,3]
name2=df.ix[:,4]
liste=["EOG090X00LI_","EOG090X00GO_","EOG090X00BA_"]
print(liste)
#get boolean mask for each column
m1 = name1.str.contains('|'.join(liste))
m2 = name2.str.contains('|'.join(liste))
#chain masks and count Trues
a = (m1 & m2).sum()
print (a)
Using isin:
df.apply(lambda x : x.str.split('_').str[0], 1).isin(liste).sum(1).eq(2).sum()
Out[923]: 3
Adding value_counts:
df.apply(lambda x : x.str.split('_').str[0], 1).isin(liste).sum(1).value_counts()
Out[925]:
2 3
0 2
dtype: int64
Adjusted for updated OP
find where sum is equal to 1
df.stack().str.split('_').str[0].isin(liste).sum(level=0).eq(1).sum()
2
Old Answer
stack and str accessor
You can use split on '_' to scrape the first portion then use isin to determine membership. I also use stack and all with the parameter level=0 to see if membership is True for all columns
df.stack().str.split('_').str[0].isin(liste).all(level=0).sum()
3
applymap
df.applymap(lambda x: x.split('_')[0] in liste).all(1).sum()
3
sum/all with generators
sum(all(x.split('_')[0] in liste for x in r) for r in df.values)
3
Too many maps
sum(map(lambda r: all(map(lambda x: x.split('_')[0] in liste, r)), df.values))
3
I think you need:
#add _ to end of values
liste = [record.id + '_' for record in SeqIO.parse(data, "fasta")]
#liste = ["gene1_","gene2_","gene3_","gene4_","gene5_"]
#get boolean mask for each column
m1 = df['name1'].str.contains('|'.join(liste))
m2 = df['name2'].str.contains('|'.join(liste))
#chain masks and count Trues
a = (m1 & m2).sum()
print (a)
3
EDIT:
liste=["EOG090X00LI","EOG090X00GO","EOG090X00BA"]
#extract each values before _, remove duplicates and compare by liste
a = name1.str.split('_').str[0].drop_duplicates().isin(liste)
b = name2.str.split('_').str[0].drop_duplicates().isin(liste)
#compare a with a for equal and sum Trues
c = a.eq(b).sum()
print (c)
2
You could convert your dataframe to a series (combining all columns) using stack(), then search for your gene names in liste followed by an underscore _ using Series.str.match():
s = df.stack()
sum([s.str.match(i+'_').any() for i in liste])
Which returns 3
Details:
df.stack() returns the following Series:
0 name1 gene1_0035
name2 gene1_0042
1 name1 gene56_0042
name2 gene56_0035
2 name1 gene4_0042
name2 gene4_0035
3 name1 gene2_0035
name2 gene2_0042
4 name1 gene57_0042
name2 gene57_0035
Since all your genes are followed by an underscore in that series, you just need to see if gene_name followed by _ is in that Series. s.str.match(i+'_').any() returns True if that is the case. Then, you get the sum of True values, and that is your count.

pandas dataframe contains list

I have currently run the following script which uses Fuzzylogic to replace some common words from the list. Dataframe df1 contains my default list of possible values. Dataframe df2 is the main dataframe where transformations/changes are undertaken after referring to Dataframe df1. The code is as follows:
df1 = pd.DataFrame(['one','two','three','four','five','tsst'])
df2 = pd.DataFrame({'not_shifted':[np.nan,'one','too','three','fours','five','six',np.nan,'test']})
# Drop nan value
df2=pd.DataFrame(df2['not_shifted'].fillna(value=''))
df2['not_shifted'] = df2['not_shifted'].map(lambda x: difflib.get_close_matches(x, df1[0]))
The problem is that the output is a dataframe which contains square brackets. To make matters worse, none of the texts within df2['not_shifted'] are viewable/recallable:
Out[421]:
not_shifted
0 []
1 [one]
2 [two]
3 [three]
4 [four]
5 [five]
6 []
7 []
8 [tsst]
Please help.
df2.not_shifted.apply(lambda x: x[0] if len(x) != 0 else "") or simply df2.not_shifted.str[0], as suggested by @Psidom.
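A runnable sketch of the `.str[0]` approach on the question's own data (empty lists become NaN under `.str[0]`, hence the trailing `fillna('')`):

```python
import difflib

import numpy as np
import pandas as pd

df1 = pd.DataFrame(['one', 'two', 'three', 'four', 'five', 'tsst'])
df2 = pd.DataFrame({'not_shifted': [np.nan, 'one', 'too', 'three', 'fours',
                                    'five', 'six', np.nan, 'test']})
df2 = pd.DataFrame(df2['not_shifted'].fillna(value=''))
df2['not_shifted'] = df2['not_shifted'].map(lambda x: difflib.get_close_matches(x, df1[0]))

# Unwrap the single-element lists; empty lists become NaN, so fill with ''
df2['not_shifted'] = df2['not_shifted'].str[0].fillna('')
print(df2)
```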
def replace_all(eg):
    rep = {"[": "",
           "]": "",
           "u": "",
           "}": "",
           "'": "",
           '"': "",
           "frozenset": ""}
    for i, j in rep.items():
        eg = eg.replace(i, j)
    return eg

for each in df.columns:
    df[each] = df[each].apply(lambda x: replace_all(str(x)))

Reading excel and storing data with xlrd

I have this data in excel sheet
FT_NAME FC_NAME C_NAME
FT_NAME1 FC1 C1
FT_NAME2 FC21 C21
FC22 C22
FT_NAME3 FC31 C31
FC32 C32
FT_NAME4 FC4 C4
where column names are
FT_NAME,FC_NAME,C_NAME
and I want to store these values in a data structure for further use. Currently I am trying to store them in a list of lists, but could not do so with the following code:
i = 4
oc = sheet.cell(i, 8).value
fcl, ocl = [], []
while oc:
    ft = sheet.cell(i, 6).value
    fc = sheet.cell(i, 7).value
    oc = sheet.cell(i, 8).value
    if ft:
        self.foreign_tables.append(ft)
        fcl.append(fc)
        ocl.append(oc)
        self.foreign_col.append(fcl)
        self.own_col.append(ocl)
        fcl, ocl = [], []
    else:
        fcl.append(fc)
        ocl.append(oc)
    i += 1
I expect output as:
ft=[FT_NAME1,FT_NAME2,FT_NAME3,FT_NAME4]
fc=[FC1, [FC21,FC22],[FC31,FC32],FC4]
oc=[C1,[C21,C22],[C31,C32],C4]
Could anyone please help with a better pythonic solution?
You can use pandas. It reads the data into a DataFrame which is essentially a big dictionary.
import pandas as pd
data = pd.read_excel('file.xlsx', 'Sheet1')
data = data.fillna(method='pad')
print(data)
it gives the following output:
FT_NAME FC_NAME C_NAME
0 FT_NAME1 FC1 C1
1 FT_NAME2 FC21 C21
2 FT_NAME2 FC22 C22
3 FT_NAME3 FC31 C31
4 FT_NAME3 FC32 C32
5 FT_NAME4 FC4 C4
To get the sublist structure try using this function:
def group(data):
    output = []
    names = list(set(data['FT_NAME'].values))
    names.sort()
    output.append(names)
    headernames = list(data.columns)
    headernames.pop(0)
    for ci in list(headernames):
        column_group = []
        column_data = data[ci].values
        for name in names:
            column_group.append(list(column_data[data['FT_NAME'].values == name]))
        output.append(column_group)
    return output
If you call it like this:
ft, fc, oc = group(data)
print(ft)
print(fc)
print(oc)
you get the following output:
['FT_NAME1', 'FT_NAME2', 'FT_NAME3', 'FT_NAME4']
[['FC1'], ['FC21', 'FC22'], ['FC31', 'FC32'], ['FC4']]
[['C1'], ['C21', 'C22'], ['C31', 'C32'], ['C4']]
which is what you want except for the single element now also being in a list.
It is not the cleanest method but it gets the job done.
Hope it helps.
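For comparison, a shorter sketch of the same grouping using `groupby(...).agg(list)` on the padded frame (column names as in the output above):

```python
import pandas as pd

# The frame as it looks after fillna(method='pad')
data = pd.DataFrame({
    'FT_NAME': ['FT_NAME1', 'FT_NAME2', 'FT_NAME2',
                'FT_NAME3', 'FT_NAME3', 'FT_NAME4'],
    'FC_NAME': ['FC1', 'FC21', 'FC22', 'FC31', 'FC32', 'FC4'],
    'C_NAME':  ['C1', 'C21', 'C22', 'C31', 'C32', 'C4'],
})

# Collect each group's values into lists, keeping row order within groups
grouped = data.groupby('FT_NAME', sort=True).agg(list)
ft = list(grouped.index)
fc = grouped['FC_NAME'].tolist()
oc = grouped['C_NAME'].tolist()
print(ft)
print(fc)
print(oc)
```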
