pandas dataframe contains list - python

I have run the following script, which uses fuzzy matching (via difflib) to replace some common words with values from a reference list. DataFrame df1 contains my default list of possible values. DataFrame df2 is the main dataframe, where transformations/changes are made after referring to df1. The code is as follows:
import difflib
import numpy as np
import pandas as pd

df1 = pd.DataFrame(['one','two','three','four','five','tsst'])
df2 = pd.DataFrame({'not_shifted':[np.nan,'one','too','three','fours','five','six',np.nan,'test']})
# Replace NaN with empty strings so get_close_matches always receives a string
df2 = pd.DataFrame(df2['not_shifted'].fillna(value=''))
df2['not_shifted'] = df2['not_shifted'].map(lambda x: difflib.get_close_matches(x, df1[0]))
The problem is that the output is a dataframe whose cells hold lists, shown with square brackets. To make matters worse, none of the text within df2['not_shifted'] is directly usable as strings:
Out[421]:
not_shifted
0 []
1 [one]
2 [two]
3 [three]
4 [four]
5 [five]
6 []
7 []
8 [tsst]
Please help.

Use df2.not_shifted.apply(lambda x: x[0] if len(x) != 0 else "") or simply df2.not_shifted.str[0], as solved by Psidom.
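For context, a minimal sketch of how those one-liners behave on the df2 built above (the result column best_match is a hypothetical name): .str[0] yields NaN for rows whose list is empty, so fillna('') is appended to match the apply version.

# take the first close match per cell, or '' when the list is empty
df2['best_match'] = df2['not_shifted'].apply(lambda m: m[0] if len(m) != 0 else '')

# same result via the .str accessor; empty lists become NaN, so fill them
df2['best_match'] = df2['not_shifted'].str[0].fillna('')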

Another workaround is to clean up the stringified values directly:

def replace_all(eg):
    # strip list/set formatting left over from str(x)
    rep = {"[": "",
           "]": "",
           "u": "",
           "}": "",
           "'": "",
           '"': "",
           "frozenset": ""}
    for i, j in rep.items():
        eg = eg.replace(i, j)
    return eg

for each in df.columns:
    df[each] = df[each].apply(lambda x: replace_all(str(x)))


Cannot set a DataFrame with multiple columns to the single column total_servings

I am a beginner getting familiar with pandas.
It throws an error when I try to create a new column this way:
drinks['total_servings'] = drinks.loc[: ,'beer_servings':'wine_servings'].apply(calculate,axis=1)
Below is my code, and I get the following error for line number 9:
"Cannot set a DataFrame with multiple columns to the single column total_servings"
Any help or suggestion would be appreciated :)
import pandas as pd

drinks = pd.read_csv('drinks.csv')

def calculate(drinks):
    return drinks['beer_servings'] + drinks['spirit_servings'] + drinks['wine_servings']

print(drinks)
drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
drinks
In your code, when the function calculate is called with axis=1, each row of the DataFrame is passed to it as the argument. Here, calculate is returning a DataFrame with multiple columns, but you are trying to assign that to a single column, which is not possible. You can try updating your code to this:
def calculate(each_row):
    return each_row['beer_servings'] + each_row['spirit_servings'] + each_row['wine_servings']

drinks['total_servings'] = drinks.apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
print(drinks)
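As a side note, a vectorized sketch of the same total (using the column names above) avoids apply entirely:

# sum the three serving columns across each row
cols = ['beer_servings', 'spirit_servings', 'wine_servings']
drinks['total_servings'] = drinks[cols].sum(axis=1)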
I suppose the reason is the misleading argument name inside the calculate method: the parameter is called drinks, the same name as the DataFrame, yet what actually gets passed in is each row.
When the argument is a row, it is a Series, and the sum of its elements is a scalar. Meanwhile drinks itself is a DataFrame, and the sum of its columns would be a Series object.
The sample code below shows that this method works:

import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 1, 1],
    "B": [2, 2, 2, 2, 2],
    "C": [3, 3, 3, 3, 3]
})

def calculate(to_calc_df):
    return to_calc_df["A"] + to_calc_df["B"] + to_calc_df["C"]

df["total"] = df.loc[:, "A":"C"].apply(calculate, axis=1)
print(df)
Result
A B C total
0 1 2 3 6
1 1 2 3 6
2 1 2 3 6
3 1 2 3 6
4 1 2 3 6

Extract values within the quote signs into two separate columns with python

How can I extract the values within the quote signs into two separate columns with python? The dataframe is given below:
df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'", "'FRH03';'29340'",
"'FRH05';'29350'", "'FRG02';'29360'"], columns = ['postcode'])
df
postcode
0 'FRH02';'29290'
1 'FRH01';'29300'
2 'FRT02';'29310'
3 'FRH03';'29340'
4 'FRH05';'29350'
5 'FRG02';'29360'
I would like to get an output like the one below:
postcode1 postcode2
FRH02 29290
FRH01 29300
FRT02 29310
FRH03 29340
FRH05 29350
FRG02 29360
I have tried several str.extract patterns but haven't been able to figure this out. Thanks in advance.
Finishing Quang Hoang's solution that he left in the comments:
import pandas as pd
df = pd.DataFrame(["'FRH02';'29290'",
"'FRH01';'29300'",
"'FRT02';'29310'",
"'FRH03';'29340'",
"'FRH05';'29350'",
"'FRG02';'29360'"],
columns = ['postcode'])
# Remove the quotes and split the strings, which results in a Series made up of 2-element lists
postcodes = df['postcode'].str.replace("'", "").str.split(';')
# Unpack the transposed postcodes into 2 new columns
df['postcode1'], df['postcode2'] = zip(*postcodes)
# Delete the original column
del df['postcode']
print(df)
Output:
postcode1 postcode2
0 FRH02 29290
1 FRH01 29300
2 FRT02 29310
3 FRH03 29340
4 FRH05 29350
5 FRG02 29360
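Since the question mentions str.extract, here is a regex-based sketch of the same result (two capture groups, one per quoted value):

# capture the two quoted values in one pass
df[['postcode1', 'postcode2']] = df['postcode'].str.extract(r"'([^']*)';'([^']*)'")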
You can use Series.str.split:
p1 = []
p2 = []
for row in df['postcode'].str.split(';'):
    # strip the surrounding single quotes from each piece
    p1.append(row[0].strip("'"))
    p2.append(row[1].strip("'"))

df2 = pd.DataFrame()
df2["postcode1"] = p1
df2["postcode2"] = p2

More pythonic way to remove rows where one value begins with another row's value in a pandas dataframe

I am processing a pandas dataframe and want to remove rows whose "Full Path" is already contained in another "Full Path" of the dataframe.
In the example below I want to remove rows 1, 2, 3 and 4 because c:/dir/ "contains" them (we are talking about file system paths here):
Full Path Value
0 c:/dir/ x
1 c:/dir/sub1/ x
2 c:/dir/sub2/ x
3 c:/dir/sub2/a x
4 c:/dir/sub2/b x
5 c:/anotherdir/ x
6 c:/anotherdir_A/ x
7 c:/anotherdir_C/ x
Rows 6 & 7 are kept because the path is not contained in 5 (a in b in my code below).
The code I came up with is the following, res is the initial dataframe:
to_drop = []
for index, row in res.iterrows():
    a = row['Full Path']
    for idx, row2 in res.iterrows():
        b = row2['Full Path']
        if a != b and a in b:
            to_drop.append(idx)

res2 = res.loc[~res.index.isin(to_drop)]
It works but the code does not feel 100% pythonic to me. I am quite sure there is a more elegant/clever way to do this. Any idea?
pd.concat([df, df['Full Path'].str.extract(r'(.*:\/.*?\/)')], axis=1) \
    .drop_duplicates([0]) \
    .drop(columns=0)
You can use .str.extract with a regex to pull out the base directory, concat the extract back onto the original df, drop duplicates of the base directory, and finally drop the extracted helper column.
Edit: an alternative if the paths are not in order:
df[df['Full Path'] == df['Full Path'].str.extract(r'(.*:\/.*?\/)', expand=False)]
The time complexity of this is in the tank (no matter how you turn it, you have to check every path against every other path), but here is a single-line solution using str.startswith:
df = pd.DataFrame({'Full Path': ['c:/dir/', 'c:/dir/sub/',
                                 'c:/anotherdir/dir', 'c:/anotherdir/'],
                   'Value': ['A', 'B', 'C', 'D']})
print(df[[any(a.startswith(b) if a != b else False for a in df['Full Path'])
          for b in df['Full Path']]])
Output:
Full Path Value
0 c:/dir/ A
3 c:/anotherdir/ D
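For comparison, a quick sketch applying the str.extract filter from the previous answer to this same sample frame keeps the same two rows:

# keep only rows whose full path equals their extracted base directory
base = df['Full Path'].str.extract(r'(.*:\/.*?\/)', expand=False)
print(df[df['Full Path'] == base])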

groupby and sum two columns and set as one column in pandas

I have the following data frame:
import pandas as pd
data = pd.DataFrame()
data['Home'] = ['A','B','C','D','E','F']
data['HomePoint'] = [3,0,1,1,3,3]
data['Away'] = ['B','C','A','E','D','D']
data['AwayPoint'] = [0,3,1,1,0,0]
I want to stack the columns ['Home', 'Away'] into a single column named Team, and sum HomePoint and AwayPoint into a column named Points, like this:
Team Points
A 4
B 0
C 4
D 1
E 4
F 3
How can I do it?
I tried different approaches using the following post:
Link
But I was not able to get the format that I wanted.
I would greatly appreciate your advice.
Thanks,
Zep
A simple way is to create two new Series indexed by the teams:
home = pd.Series(data.HomePoint.values, data.Home)
away = pd.Series(data.AwayPoint.values, data.Away)
Then, the result you want is:
home.add(away, fill_value=0).astype(int)
Note that home + away does not work, because team F never played away, so would result in NaN for them. So we use Series.add() with fill_value=0.
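One detail worth noting (a sketch adjusting the above): team D appears twice in the Away column, so away has duplicate index labels; collapsing each Series by index first keeps the addition well defined.

# sum duplicate team labels first, then combine home and away totals
home = pd.Series(data.HomePoint.values, data.Home).groupby(level=0).sum()
away = pd.Series(data.AwayPoint.values, data.Away).groupby(level=0).sum()
points = home.add(away, fill_value=0).astype(int).rename_axis('Team').rename('Points')
print(points)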
A complicated way is to use DataFrame.melt():
goo = data.melt(['HomePoint', 'AwayPoint'], var_name='At', value_name='Team')
goo.HomePoint.where(goo.At == 'Home', goo.AwayPoint).groupby(goo.Team).sum()
Or from the other perspective:
ooze = data.melt(['Home', 'Away'])
ooze.value.groupby(ooze.Home.where(ooze.variable == 'HomePoint', ooze.Away)).sum()
You can concatenate, pairwise, the columns of your input dataframe, then use groupby.sum:
# calculate number of (team, points) column pairs
n = len(data.columns) // 2

# create list of pairwise dataframes
df_lst = [data.iloc[:, 2*i:2*(i+1)].set_axis(['Team', 'Points'], axis=1, inplace=False)
          for i in range(n)]

# concatenate list of dataframes
df = pd.concat(df_lst, axis=0)

# perform groupby
res = df.groupby('Team', as_index=False)['Points'].sum()
print(res)
Team Points
0 A 4
1 B 0
2 C 4
3 D 1
4 E 4
5 F 3

find gene names from a list in a dataframe

I need to know whether certain genes appear in my results. To do so, I have one list with my genes' names and a dataframe with similar names:
For example:
liste = ["gene1","gene2","gene3","gene4","gene5"]
and a dataframe:
name1 name2
gene1_0035 gene1_0042
gene56_0042 gene56_0035
gene4_0042 gene4_0035
gene2_0035 gene2_0042
gene57_0042 gene57_0035
then I did:
df = pd.read_csv("dataframe_not_max.txt", sep='\t')
df = df.drop(columns=(['Unnamed: 0', 'Unnamed: 0.1']))
#print(df)
print(list(df.columns.values))

name1 = df.ix[:, 1]
name2 = df.ix[:, 2]

liste = []
for record in SeqIO.parse(data, "fasta"):
    liste.append(record.id)
print(liste)
print(len(liste))

count = 0
for a, b in zip(name1, name2):
    if a in liste:
        count += 1
    if b in liste:
        count += 1
print(count)
What I want is to count how many times a gene from my list appears in my dataframe, but the IDs do not match exactly, since the list entries lack the _number suffix after the gene name, so the if a in liste test does not recognize the IDs.
Is it possible to say something like:
if a without_number in liste:
In the above example it would be:
count = 3, because only gene1, gene2 and gene4 are present in both the list and the dataframe.
Here is a more complicated example to see whether your script indeed works for my data.
Let's say I have a dataframe such as:
cluster_name qseqid sseqid pident_x
15 cluster_016607 EOG090X00GO_0035_0035 EOG090X00GO_0042_0035
16 cluster_016607 EOG090X00GO_0035_0035 EOG090X00GO_0042_0042
18 cluster_016607 EOG090X00GO_0035_0042 EOG090X00GO_0042_0035
19 cluster_016607 EOG090X00GO_0035_0042 EOG090X00GO_0042_0042
29 cluster_015707 EOG090X00LI_0035_0035 EOG090X00LI_0042_0042
30 cluster_015707 EOG090X00LI_0035_0035 EOG090X00LI_0042_0035
34 cluster_015707 EOG090X00LI_0042_0035 g1726.t1_0035_0042
37 cluster_015707 EOG090X00LI_0042_0042 g1726.t1_0035_0042
and a list : ["EOG090X00LI_","EOG090X00GO_","EOG090X00BA_"]
Here I get 6, but I should get 2, because only 2 of the sequences, EOG090X00LI and EOG090X00GO, are in my data.
In fact, I want to count a sequence only when it appears once in a row, even if the pairing is, for example, EOG090X00LI vs seq123454.
I hope that is clear.
For this example I used:
df=pd.read_csv("test_busco_augus.csv",sep=',')
#df=df.drop(columns=(['Unnamed: 0', 'Unnamed: 0.1']))
print(df)
print(list(df.columns.values))
name1=df.ix[:,3]
name2=df.ix[:,4]
liste=["EOG090X00LI_","EOG090X00GO_","EOG090X00BA_"]
print(liste)
#get boolean mask for each column
m1 = name1.str.contains('|'.join(liste))
m2 = name2.str.contains('|'.join(liste))
#chain masks and count Trues
a = (m1 & m2).sum()
print (a)
Using isin
df.apply(lambda x: x.str.split('_').str[0], 1).isin(liste).sum(1).eq(2).sum()
Out[923]: 3
Adding value_counts
df.apply(lambda x: x.str.split('_').str[0], 1).isin(liste).sum(1).value_counts()
Out[925]:
2 3
0 2
dtype: int64
Adjusted for the updated OP
Find where the sum is equal to 1:
df.stack().str.split('_').str[0].isin(liste).sum(level=0).eq(1).sum()
2
Old Answer
stack and str accessor
You can use split on '_' to scrape the first portion then use isin to determine membership. I also use stack and all with the parameter level=0 to see if membership is True for all columns
df.stack().str.split('_').str[0].isin(liste).all(level=0).sum()
3
applymap
df.applymap(lambda x: x.split('_')[0] in liste).all(1).sum()
3
sum/all with generators
sum(all(x.split('_')[0] in liste for x in r) for r in df.values)
3
Two many map
sum(map(lambda r: all(map(lambda x: x.split('_')[0] in liste, r)), df.values))
3
I think you need:
#add _ to end of values
liste = [record.id + '_' for record in SeqIO.parse(data, "fasta")]
#liste = ["gene1_","gene2_","gene3_","gene4_","gene5_"]
#get boolean mask for each column
m1 = df['name1'].str.contains('|'.join(liste))
m2 = df['name2'].str.contains('|'.join(liste))
#chain masks and count Trues
a = (m1 & m2).sum()
print (a)
3
EDIT:
liste=["EOG090X00LI","EOG090X00GO","EOG090X00BA"]
#extract each values before _, remove duplicates and compare by liste
a = name1.str.split('_').str[0].drop_duplicates().isin(liste)
b = name2.str.split('_').str[0].drop_duplicates().isin(liste)
#compare a with b for equality and sum Trues
c = a.eq(b).sum()
print (c)
2
You could convert your dataframe to a series (combining all columns) using stack(), then search for your gene names in liste followed by an underscore _ using Series.str.match():
s = df.stack()
sum([s.str.match(i+'_').any() for i in liste])
Which returns 3
Details:
df.stack() returns the following Series:
0 name1 gene1_0035
name2 gene1_0042
1 name1 gene56_0042
name2 gene56_0035
2 name1 gene4_0042
name2 gene4_0035
3 name1 gene2_0035
name2 gene2_0042
4 name1 gene57_0042
name2 gene57_0035
Since all your genes are followed by an underscore in that series, you just need to see if gene_name followed by _ is in that Series. s.str.match(i+'_').any() returns True if that is the case. Then, you get the sum of True values, and that is your count.
