I am trying to check whether the last cell in a pandas DataFrame column contains a 1 or a 2 (these are the only options). If it is a 1, I would like to delete the whole row; if it is a 2, I would like to keep it.
import pandas as pd
df1 = pd.DataFrame({'number':[1,2,1,2,1], 'name': ['bill','mary','john','sarah','tom']})
df2 = pd.DataFrame({'number':[1,2,1,2,1,2], 'name': ['bill','mary','john','sarah','tom','sam']})
In the above example, I would want to delete the last row of df1 (so the final row is 'sarah'); in df2, however, I would want to keep it exactly as it is.
So far, I have tried the following, but I am getting an error:
if df1['number'].tail(1) == 1:
    df = df.drop(-1)
DataFrame.drop removes rows based on labels (the actual values of the index), not positions. While it is possible to do this with df1.drop(df1.index[-1]), that is problematic with a duplicated index, because every row carrying that label would be dropped. The last row can instead be selected positionally with iloc, or a single value with .iat:
if df1['number'].iat[-1] == 1:
    df1 = df1.iloc[:-1, :]
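To see the duplicated-index caveat concretely, here is a minimal sketch with a hypothetical frame:

import pandas as pd

# A frame with the index label 0 duplicated
df = pd.DataFrame({'number': [1, 2, 1]}, index=[0, 1, 0])

# Dropping by label removes BOTH rows labelled 0
print(df.drop(df.index[-1]))  # only the row labelled 1 survives

# The positional slice removes only the final row
print(df.iloc[:-1, :])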
You can check if the value of number in the last row is equal to one:
check = df1['number'].tail(1).values == 1
# Or check entire row with
# check = 1 in df1.tail(1).values
If that condition holds, you can select all rows except the last one and assign back to df1:
if check:
    df1 = df1.iloc[:-1, :]
if df1.tail(1).number.values[0] == 1:
    # drop by label; works with the default RangeIndex
    df1.drop(len(df1) - 1, inplace=True)
You can use the same tail function:
df.drop(df.tail(n).index, inplace=True)  # drop last n rows
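Applied to the original question, that could look like the following sketch, which drops the last row only when its number is 1:

if df1['number'].tail(1).values[0] == 1:
    df1.drop(df1.tail(1).index, inplace=True)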
I need to fix a large Excel database where, in some columns, cells are blank and all the data in the row is shifted one cell to the right.
For example:
In this example, I need a script that detects that the first cell of the last row is blank and then moves all the values one cell to the left.
I'm trying to do it with this function. Vencli_col is the dataset, df1 and df2 are copies. In df2 I drop column 12, which is where the error originates. I index the rows where the error happens and then I try to replace them with the values from df2.
import numpy as np

df1 = vencli_col.copy()
df2 = vencli_col.copy()
df2 = df1.drop(columns=['Column12'])
df2['droppedcolumn'] = np.nan

i = 0
col = []
for k, value in vencli_col.iterrows():
    i += 1
    if str(value['Column12']) == '' or str(value['Column12']) == str(np.nan):
        col.append(i + 1)

for j in col:
    df1.iloc[j] = df2.iloc[j]

df1.head(25)
You could do something like the following. It is not very pretty, but it does the trick.
# Select the column names that are correct and the ones that are shifted
# This assumes the error column is the second one, as in your image
correct_cols = df.columns[1:-1]
shifted_cols = df.columns[2:]

# Get the indexes of the rows that are NaN or ""
df = df.fillna("")
shifted_indexes = df[df["col1"] == ""].index

# Shift the data 1 column to the left
# It has to be converted to NumPy, because otherwise the column labels
# prevent the values from being copied into the destination columns
df.loc[shifted_indexes, correct_cols] = df.loc[shifted_indexes, shifted_cols].to_numpy()
EDIT: I just realised there is an easier way using df.shift():
columns_to_shift = df.columns[1:]
shifted_indexes = df[df["col1"] == ""].index
df.loc[shifted_indexes, columns_to_shift] = df.loc[shifted_indexes, columns_to_shift].shift(-1, axis=1)
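Here is a minimal, self-contained sketch of the shift approach; the column names and the blank marker are assumptions based on the description:

import pandas as pd

# Hypothetical frame: the second row was shifted one cell to the right,
# leaving 'col1' blank
df = pd.DataFrame({'id':   [1, 2],
                   'col1': ['a', ''],
                   'col2': ['b', 'a'],
                   'col3': ['c', 'b']})

columns_to_shift = df.columns[1:]             # everything after the id column
shifted_indexes = df[df['col1'] == ''].index  # rows with a blank 'col1'

# Move each affected row one cell to the left; the last column becomes NaN
df.loc[shifted_indexes, columns_to_shift] = \
    df.loc[shifted_indexes, columns_to_shift].shift(-1, axis=1)
print(df)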
I have a dataframe with 2415 columns and I want to drop consecutive duplicate columns. That is, if column 1 and column 2 have the same values, I want to drop column 2.
I wrote the code below, but it doesn't seem to work:
for i in (0, len(df.columns)-1):
    if (df[i].tolist() == df[i+1].tolist()):
        df = df.drop([i+1], axis=1)
    else:
        df = df
You need to select the column name from the index, and iterate over a range rather than the tuple (0, len(df.columns)-1). Try this:
columns = df.columns
drops = []
for i in range(len(columns) - 1):
    if df[columns[i]].tolist() == df[columns[i+1]].tolist():
        drops.append(columns[i+1])
df = df.drop(drops, axis=1)
Let us try shift:
df.loc[:,~df.eq(df.astype(object).shift(axis=1)).all()]
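A quick sketch of how that one-liner behaves on hypothetical data (column b duplicates column a, so it is dropped):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [1, 2], 'c': [3, 4]})

# Keep only columns that differ from the column immediately to their left
result = df.loc[:, ~df.eq(df.astype(object).shift(axis=1)).all()]
print(result.columns.tolist())  # ['a', 'c']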
I'm facing a challenge with a python/pandas script.
My data is a gene expression table, which is organized as follows:
Basically, index 0 contains the two conditions studied, while index 1 has the information about the genes identified across the samples.
Then, I would like to produce a table with index 0 and index 1 joined together, as follows:
I've tried a lot of things, such as generating a list from index 0 to join with index 1...
Save me, guys, please!
Thank you
Assuming your first set of column names is in row 0 and your second set is in row 1, try this:
df.columns = [f'{c1}.{c2}'.strip('.') for c1,c2 in zip(df.loc[0], df.loc[1])]
df.loc[2:]
The result should look like the desired table above.
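Here is a small reproducible sketch; the header rows are assumptions based on the description:

import pandas as pd

# Hypothetical frame where rows 0 and 1 hold the two header levels
df = pd.DataFrame([['', 'Condition1', 'Condition2'],
                   ['Gene name', 'Sample 1', 'Sample 2'],
                   ['HK1', 1213, 14356]])

df.columns = [f'{c1}.{c2}'.strip('.') for c1, c2 in zip(df.loc[0], df.loc[1])]
df = df.loc[2:]
print(df.columns.tolist())
# ['Gene name', 'Condition1.Sample 1', 'Condition2.Sample 2']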
Following the OP's comment, I changed the add_suffix function.
Construct the dataframe:
s1 = "Gene name,Description,Foldchange,Anova,Sample 1,Sample 2,Sample 3,Sample 4,Sample 5,Sample 6".split(",")
s2 = "HK1,Hexokinase,Infinity,0.05,1213,1353,14356,0,0,0".split(",")
df = pd.DataFrame(s2).T
df.columns = s1
Define a function (change the function according to your situation):
def add_suffix(x):
    try:
        flag = int(x[-1])
    except (ValueError, IndexError):
        return x
    if flag <= 4:
        return x + '.Condition1'
    else:
        return x + '.Condition2'
and then assign the columns:
cols = df.columns.to_series().apply(add_suffix)
df.columns = cols
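With the frame constructed above, the resulting column names should come out like this (a sketch of the expected output):

print(df.columns.tolist())
# ['Gene name', 'Description', 'Foldchange', 'Anova',
#  'Sample 1.Condition1', 'Sample 2.Condition1', 'Sample 3.Condition1',
#  'Sample 4.Condition1', 'Sample 5.Condition2', 'Sample 6.Condition2']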
I have 2 csv files with the exact same rows, as below:
asas,asafdfdd,fgffgdvnufg,rterrtrrtr,wewewtyuhe,yuuiiyuyuy,uiuiui9u
absas,a2assafdfdd,fgffgedkfg,rtdfrtrrtr,wewewuikjhe,yuuiuiyouyuy,ui7u8iuiu
asbas,asasdfdfdd,fgffgfpoftg,rtrjktrrtr,wewewuyihe,yuyuyyupuy,uiu7iuiu
asabs,asafddffdd,fgffg2floig,rtrtrcxcrtr,weweyjunwe,yuyuyumy,uiui6uiu
asasbb,asafddfdd,fgffgdfkfg,rtrtrjkhrtr,wewewdfxe,yuyuyuny,uiui5uiu
absbas,asafdrtfdd,fgffgvbfg,rtrt3rrcxvtr,wewedfcwe,yuycuyuy,uiu4iuiu
I read these 2 csv files into 2 dataframes named df1 and df2 respectively. When I do result = (df1 == df2), I get another dataframe, result, holding True/False values for each comparison (in this case, True everywhere).
Now, with the code below, the first row is displayed even though there isn't a False value in that tuple.
for row in result.itertuples():
    if False in row:
        print(row)
Why is this? Do I need to do something different here?
The whole code is here for reference:
import pandas as pd

df1 = pd.read_csv('test3.csv', header=None)
df2 = pd.read_csv('test4.csv', header=None)

result = (df1 == df2)
print(result)

for row in result.itertuples():
    if False in row:
        print(row)
It's because an itertuples() row includes the index as its first element, and the first row's index value is zero. In Python, False == 0 evaluates to True, so False in a tuple that contains 0 is True.
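You can see the comparison Python performs with a tiny sketch:

row = (0, True, True)  # shape of an itertuples() row: (index, col0, col1, ...)
print(False in row)    # True, because False == 0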
If you shift your index values:
result.index += 1
Then run your loop... it won't print.
That explains why it happens, but I wouldn't do it this way.
I'd do:
for i, row in result.iterrows():
    if not row.all():
        print(row)
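If you only want the rows that contain a mismatch, a vectorized boolean mask avoids the Python-level loop entirely (a sketch, not part of the original answer):

mismatches = result[~result.all(axis=1)]  # rows where at least one cell is False
print(mismatches)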
I'm initializing a DataFrame:
columns = ['Thing','Time']
df_new = pd.DataFrame(columns=columns)
and then writing values to it like this:
counter = 0  # row index for df_new
for t in df.Thing.unique():
    df_temp = df[df['Thing'] == t]  # filtering the df
    df_new.loc[counter, 'Thing'] = t  # writing the filter value to df_new
    df_new.loc[counter, 'Time'] = df_temp['delta'].sum(axis=0)  # summing and adding that value to df_new
    counter += 1  # increment the row index
Is there a better way to add new values to the dataframe each time, without explicitly incrementing the row index with 'counter'?
If I'm interpreting this correctly, I think this can be done in one line:
newDf = df.groupby('Thing')['delta'].sum().reset_index()
By grouping by 'Thing', you have the various "t-filters" from your for-loop. We then apply a sum() to 'delta', but only within the various "t-filtered" groups. At this point, the dataframe has the various values of "t" as the indices, and the sums of the "t-filtered deltas" as a corresponding column. To get to your desired output, we then bump the "t's" into their own column via reset_index().
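A quick sketch with hypothetical data to show what that one line produces:

import pandas as pd

df = pd.DataFrame({'Thing': ['a', 'b', 'a'], 'delta': [1.0, 2.0, 3.0]})

newDf = df.groupby('Thing')['delta'].sum().reset_index()
print(newDf)
#   Thing  delta
# 0     a    4.0
# 1     b    2.0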