I'm stuck on a challenge in a Python/pandas script.
My data is a gene expression table, which is organized as follows:
Basically, the row at index 0 contains the two conditions studied, while the row at index 1 has the information about the genes identified across the samples.
Then, I would like to produce a table with rows 0 and 1 joined together, as follows:
I've tried a lot of things, such as generating a list from index 0 to join with index 1...
Save me, guys, please!
Thank you
Assuming your first set of column names is in row 0 and your second set is in row 1, try this:
df.columns = [f'{c1}.{c2}'.strip('.') for c1, c2 in zip(df.loc[0], df.loc[1])]
df = df.loc[2:]  # drop the two header rows from the data
Should look like this
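For context, here is a minimal self-contained sketch of that approach; the data is made up for illustration, laid out like the question's table with two header rows:
import pandas as pd

raw = pd.DataFrame([
    ['', '', 'Condition1', 'Condition2'],       # row 0: the conditions
    ['Gene', 'Anova', 'Sample 1', 'Sample 4'],  # row 1: per-column labels
    ['HK1', 0.05, 1213, 0],                     # first data row
])

raw.columns = [f'{c1}.{c2}'.strip('.') for c1, c2 in zip(raw.loc[0], raw.loc[1])]
raw = raw.loc[2:]
# Columns are now: ['Gene', 'Anova', 'Condition1.Sample 1', 'Condition2.Sample 4']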
Following the OP's comment, I changed the add_suffix function.
Construct the dataframe:
import pandas as pd

s1 = "Gene name,Description,Foldchange,Anova,Sample 1,Sample 2,Sample 3,Sample 4,Sample 5,Sample 6".split(",")
s2 = "HK1,Hexokinase,Infinity,0.05,1213,1353,14356,0,0,0".split(",")
df = pd.DataFrame(s2).T
df.columns = s1
Define a function (adjust it to your situation):
def add_suffix(x):
    try:
        flag = int(x[-1])          # last character of the column name
    except ValueError:
        return x                   # not a numbered sample column
    if flag <= 4:
        return x + '.Condition1'
    else:
        return x + '.Condition2'
And then assign the columns:
cols = df.columns.to_series().apply(add_suffix)
df.columns = cols
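With the sample data above, the non-sample columns pass through unchanged (their last character is not a digit), so the result is:
print(df.columns.tolist())
# ['Gene name', 'Description', 'Foldchange', 'Anova',
#  'Sample 1.Condition1', 'Sample 2.Condition1', 'Sample 3.Condition1',
#  'Sample 4.Condition1', 'Sample 5.Condition2', 'Sample 6.Condition2']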
I have 2 dataframes:
DF A:
and DF B:
I need to check every row of DFA['item'] to see if it contains any of the values in DFB['original'], and if it does, add a new column DFA['my'] holding the corresponding value from DFB['my'].
So here is the result I need:
I thought of converting DFB['original'] into a list and then using regex, but that way I won't get the matching value from the 'my' column.
OK, maybe not the best solution, but it seems to be working.
I did a Cartesian join and then kept the records that contain the data needed:
dfa['join'] = 1
dfb['join'] = 1
dfFull = dfa.merge(dfb, on='join').drop('join', axis=1)
dfFull['match'] = dfFull.apply(lambda x: x['original'] in x['item'], axis=1)  # bracket access: x.item would resolve to the Series.item method
dfFull[dfFull['match']]
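On pandas 1.2+, the dummy join column can be avoided with a cross merge. A sketch of the same idea, with made-up sample data:
import pandas as pd

dfa = pd.DataFrame({'item': ['red apple pie', 'pear tart', 'plain bread']})
dfb = pd.DataFrame({'original': ['apple', 'pear'], 'my': ['A', 'B']})

dfFull = dfa.merge(dfb, how='cross')  # Cartesian product, no dummy key needed
dfFull['match'] = dfFull.apply(lambda x: x['original'] in x['item'], axis=1)
print(dfFull[dfFull['match']])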
I am trying to check if the last cell in a pandas DataFrame column contains a 1 or a 2 (these are the only options). If it is a 1, I would like to delete the whole row; if it is a 2, however, I would like to keep it.
import pandas as pd
df1 = pd.DataFrame({'number':[1,2,1,2,1], 'name': ['bill','mary','john','sarah','tom']})
df2 = pd.DataFrame({'number':[1,2,1,2,1,2], 'name': ['bill','mary','john','sarah','tom','sam']})
In the above example I would want to delete the last row of df1 (so the final row is 'sarah'), whereas df2 I would want to keep exactly as it is.
So far I have thought to try the following, but I am getting an error:
if df1['number'].tail(1) == 1:
    df = df.drop(-1)
DataFrame.drop removes rows based on labels (the actual values of the index). While this is possible with df1.drop(df1.index[-1]), it is problematic with a duplicated index. The last row can instead be selected with iloc, or a single value read with .iat:
if df1['number'].iat[-1] == 1:
    df1 = df1.iloc[:-1, :]
You can check if the value of number in the last row is equal to one:
check = df1['number'].tail(1).values == 1
# Or check entire row with
# check = 1 in df1.tail(1).values
If that condition holds, you can select all rows, except the last one and assign back to df1:
if check:
    df1 = df1.iloc[:-1, :]
if df1['number'].tail(1).item() == 1:  # .item() avoids the ambiguous-truth-value error of testing a Series
    df1.drop(df1.index[-1], inplace=True)
You can use the same tail function:
df.drop(df.tail(n).index, inplace=True)  # drop last n rows
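Putting that together for this question's condition (drop the last row only when its number is 1):
if df1['number'].iat[-1] == 1:
    df1.drop(df1.tail(1).index, inplace=True)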
I am removing a number of records in a pandas data frame which contain diverse combinations of NaN across its 4 columns. I have created a function called complete_cases to provide the indexes of rows that meet the following condition: all columns in the row are NaN.
I have tried the function below:
def complete_cases(dataframe):
    indx = [x for x in list(dataframe.index)
            if dataframe.loc[x, :].isna().sum() == len(dataframe.columns)]
    return indx
I am wondering whether this is optimal or whether there is a better way to do it.
Absolutely. All you need to do is
df.dropna(axis=0, how='any', inplace=True)
This will remove all rows that have at least one missing value, and updates the data frame "inplace".
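Note that the condition stated in the question (every column in the row is NaN) corresponds to how='all' rather than how='any'. A small sketch with made-up data:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan, np.nan],
                   'b': [2, 3, np.nan]})

print(df.dropna(how='all'))  # drops only the last row, where every column is NaN
print(df.dropna(how='any'))  # drops both rows that contain at least one NaN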
I'd recommend using loc, isna, and any with the 'columns' axis, like this:
df.loc[df.isna().any(axis='columns')]
This way you'll filter only the rows that have at least one missing value, i.e. the complement of R's complete.cases.
A possible solution:
1. Count the number of NaN values in each row, saving the count in a new column.
2. Filter the rows of the data frame based on this new column.
3. Remove the (now) unnecessary column.
It is possible to do this with a lambda function. For example, if you want to remove rows that have 10 NaN values:
df['count'] = df.apply(lambda x: 0 if x.isna().sum() == 10 else 1, axis=1)
df = df[df['count'] != 0]  # bracket access: df.count is the method, not the column
del df['count']
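A vectorized variant of the same idea (keeping the made-up threshold of 10) that avoids apply and the helper column entirely:
df = df[df.isna().sum(axis=1) != 10]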
I want to iterate through the rows of a DataFrame and assign values to a new DataFrame. I've accomplished that task indirectly like this:
#first I read the data from df1 and assign it to df2 if something happens
counter = 0                                  #line1
for index, row in df1.iterrows():            #line2
    value = row['df1_col']                   #line3
    value2 = row['df1_col2']                 #line4
    #try unzipping a file (pseudo code)
    df2.loc[counter, 'df2_col'] = value      #line5
    counter += 1                             #line6
    #except
    print("Error, could not unzip {}")       #line7

#then I set the desired index for df2
df2 = df2.set_index(['df2_col'])             #line8
Is there a way to assign the values to the index of df2 directly at line5? Sorry, my original question was unclear: I'm creating an index based on whether something happens.
There are a bunch of ways to do this. According to your code, all you've done is create an empty df2 dataframe with an index of values from df1.df1_col. You could do this directly like this:
df2 = pd.DataFrame([], df1.df1_col)
#                  ^   ^
#                  |   |
#  specifies no data   |
#                      defines the index
If you are concerned about having to filter df1 then you can do:
# cond is some boolean mask representing a condition to filter on.
# I'll make one up for you.
cond = df1.df1_col > 10
df2 = pd.DataFrame([], df1.loc[cond, 'df1_col'])
No need to iterate, you can do:
df2.index = df1['df1_col']
If you really want to iterate, save it to a list and set the index.
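A minimal sketch of that iterative variant; unzip_ok is a hypothetical stand-in for the "something happens" check:
import pandas as pd

idx = []
for _, row in df1.iterrows():
    if unzip_ok(row['df1_col2']):   # hypothetical: e.g. the unzip succeeded
        idx.append(row['df1_col'])

df2 = pd.DataFrame([], index=pd.Index(idx, name='df2_col'))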
Is it possible to select the negation of a given list of columns from a pandas DataFrame? For instance, say I have the following dataframe
T1_V2 T1_V3 T1_V4 T1_V5 T1_V6 T1_V7 T1_V8
1 15 3 2 N B N
4 16 14 5 H B N
1 10 10 5 N K N
and I want to get all columns except column T1_V6. I would normally do that this way:
df = df[["T1_V2","T1_V3","T1_V4","T1_V5","T1_V7","T1_V8"]]
My question is whether there is a way to do this the other way around, something like this:
df = df[!["T1_V6"]]
Do:
df[df.columns.difference(["T1_V6"])]
Notes from comments:
This will sort the columns. If you don't want them sorted, call difference with sort=False.
difference won't raise an error if the dropped column name doesn't exist. If you want an error when the column doesn't exist, use drop as suggested in the other answers: df.drop(columns=["T1_V6"])
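For example, to keep the columns in their original order (Index.difference accepts sort in recent pandas versions):
df[df.columns.difference(["T1_V6"], sort=False)]  # sort=False avoids alphabetical reordering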
For completeness, you can also easily use drop for this:
df.drop(["T1_V6"], axis=1)
Another way to exclude columns that you don't want:
df[df.columns[~df.columns.isin(['T1_V6'])]]
I would suggest using DataFrame.drop():
columns_to_exclude = ['T1_V6']
old_dataframe = ...  # has all columns
new_dataframe = old_dataframe.drop(columns_to_exclude, axis=1)
You could use inplace to make changes to the original dataframe itself
old_dataframe.drop(columns_to_exclude, axis=1, inplace=True)
# old_dataframe is changed
You can use a list comprehension:
df = df[[col for col in df.columns if col != 'T1_V6']]