I am seeking to drop some rows from a DataFrame when two conditions are met in the same row. I have 5 columns; if two columns (code1 and code2) have equal values AND another column (count) is greater than 1, then that row should be dropped.
I could alternatively keep the rows that meet the condition:
count == 1 OR (as opposed to AND) df_1.code1 != df_1.code2
In terms of the first idea, what I am thinking is:
df_1 = '''drop row if''' [(df_1.count > 1) & (df_1.code1 == df_1.code2)]
Here is what I have so far in terms of the second idea:
df_1 = df_1[df_1.count == 1 or df_1.code1 != df_1.code2]
You can use .loc with a boolean mask to combine multiple conditions. Note that element-wise conditions need | and & rather than or and and, each condition needs its own parentheses, and bracket notation (df_1['count']) is safer than attribute access, because df_1.count is also a built-in DataFrame method:
df_new = df_1.loc[(df_1['count'] == 1) | (df_1['code1'] != df_1['code2']), :]
df.drop(df[(df['code1'] == df['code2']) & (df['count'] > 1)].index, inplace=True)
Breaking it into steps:
df[(df['code1'] == df['code2']) & (df['count'] > 1)] returns the subset of rows from df where the value in code1 equals the value in code2 and the value in count is greater than 1.
.index returns the indexes of those rows.
The last step is calling df.drop(), which expects the indexes to be dropped from the dataframe; using inplace=True means we don't need to re-assign, i.e.
df = df.drop(...).
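As a quick self-contained check (the data here is made up; the column names follow the question):

import pandas as pd

df_1 = pd.DataFrame({'code1': [10, 20, 30],
                     'code2': [10, 25, 30],
                     'count': [2, 3, 1]})

# drop rows where code1 equals code2 AND count is greater than 1
df_1 = df_1.drop(df_1[(df_1['code1'] == df_1['code2']) & (df_1['count'] > 1)].index)
print(df_1)  # the first row is dropped; the other two rows survive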
what I have
I have multiple dataframes stored as CSVs. The dataframes have 2 columns: Col1 and Col2.
what I do
I read the CSV files into separate dataframes, make a list of the dataframes, and then search for the first occurrence of a value in "Col2", but only under the condition that the value is there for at least 10 consecutive rows. When it finds the value in "Col2", I print the corresponding value from "Col1".
where the problem is
When I did it with 5 consecutive rows my script worked; however, when I try it with (at least) 10 rows, this error occurs: IndexError: single positional indexer is out-of-bounds. It happens because some dataframes don't meet the criteria: the value in "Col2" is not there for 10 consecutive rows.
how I want to solve it (I guess)
For the dataframes not meeting the criteria, NaN would be printed instead of the corresponding value from "Col1".
Code 1/2:
import os
import pandas as pd

# current directory csv files
csvs = [x for x in os.listdir('.') if x.endswith('.csv')]

# read the csv files as separate dataframes
dfs = []
for file in csvs:
    df = pd.read_csv(file)
    dfs.append(df)
Code 2/2:
for dff1 in dfs:
    dff1[dff1['Col2'] == 0.000000].groupby((dff1['Col2'] != 0.000000).cumsum()).filter(lambda x: len(x) > 10)
    index = dff1[dff1['Col2'] == 0.000000].groupby((dff1['Col2'] != 0.000000).cumsum()).filter(lambda x: len(x) > 10).iloc[0][0]  # this finds the corresponding value in Col1 if the Col2-value meets the criteria
    print(index)  # this prints the corresponding value from Col1
My code then continues to make one dataframe of the printed values, with the CSV names assigned in a new column...
Maybe there is some trick with the "iloc" or "filter"?
desired output:
11
9
NaN
NaN
6
...
Right now it just stops after the second dataframe, prints only "11, 9", and then raises the error because the third dataframe doesn't meet the criteria.
This could be avoided with a simple if-statement:
import numpy as np

for dff1 in dfs:
    revised_df = dff1[dff1['Col2'] == 0.000000].groupby((dff1['Col2'] != 0.000000).cumsum()).filter(lambda x: len(x) > 10)
    if not revised_df.empty:
        index = revised_df.iloc[0, 0]  # the corresponding value in Col1
        print(index)
    else:
        print(np.nan)
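Since the surrounding script then builds one dataframe of the printed values with the CSV names in a new column, here is a sketch of that collection step, reusing csvs and dfs from Code 1/2 (the column names csv_name and col1_value are made up for illustration):

import numpy as np
import pandas as pd

results = []
for dff1 in dfs:
    revised_df = dff1[dff1['Col2'] == 0.000000].groupby((dff1['Col2'] != 0.000000).cumsum()).filter(lambda x: len(x) > 10)
    results.append(revised_df.iloc[0, 0] if not revised_df.empty else np.nan)

# one row per CSV: the file name and the first matching Col1 value (or NaN)
summary = pd.DataFrame({'csv_name': csvs, 'col1_value': results})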
I have a large Pandas dataframe, and want to replace some values in a subset of the columns based on a condition.
Specifically, I want to replace the values that are greater than one with 1 in every column to the right of the 9th column.
Because the dataframe is so large and growing in both the number of rows and columns over time, I cannot manually specify the names of the columns to change values in. Rather, I just need to specify that column 10 and greater should be inspected for values > 1.
After looking at many different Stack Overflow posts and Pandas documentation, I tried:
df.iloc[df[:,10: ] > 1] = 1
However, this gives me the error “unhashable type: ‘slice’”.
I then tried:
df[df.iloc[:, 10:] > 1] = 1
and
df[df.loc[:, df.columns[10:]] > 1] = 1
as per 2 suggestions in the comments, but both of those give me the error “Cannot do inplace boolean setting on mixed-types with a non np.nan value”.
Does anyone know why I’m getting these errors and/or what I should change about my code to avoid them?
Thank you!
1. DataFrame.where
We can use iloc to select all the columns to the right of the 9th column, then use where to replace the values in the slice of the dataframe wherever the condition x.le(1) is False.
df.iloc[:, 10:] = df.iloc[:, 10:].where(lambda x: x.le(1), 1)
2. DataFrame.clip
Alternatively, we can use clip, defining the upper limit as 1 so that all values greater than 1 in the slice of the dataframe are set to 1.
df.iloc[:, 10:] = df.iloc[:, 10:].clip(upper=1)
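A quick check of the clip approach on a tiny made-up frame (the slice starts at column 1 here, purely to keep the example small; with the real frame it would be iloc[:, 10:]):

import pandas as pd

df = pd.DataFrame({'a': [0, 5], 'b': [2, 1], 'c': [0.5, 3.0]})

# cap every value from column 1 onward at 1; column 'a' is untouched
df.iloc[:, 1:] = df.iloc[:, 1:].clip(upper=1)
print(df)  # 'b' becomes [1, 1], 'c' becomes [0.5, 1.0], 'a' stays [0, 5]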
I have an example dataframe as given below, and am trying to drop the rows whose cluster_num value occurs only once in that column.
df = pd.DataFrame([[1,2,3,4,5],[1,3,4,2,5],[1,3,7,9,10],[2,6,2,7,9],[2,2,4,7,0],[3,1,9,2,7],[4,9,5,1,2],[5,8,4,2,1],[5,0,7,1,2],[6,9,2,5,7]])
df.rename(columns = {0:"cluster_num",1:"value_1",2:"value_2",3:"value_3",4:"value_4"},inplace=True)
# Dropping rows for which cluster_num has only one distinct value
count_dict = df['cluster_num'].value_counts().to_dict()
df['count'] = df['cluster_num'].apply(lambda x : count_dict[x])
df[df['count']>1]
In the above example, the rows where cluster_num is 3, 4, or 6 would be dropped.
Is there a way of doing this without having to create a separate column? I need all 5 initial columns (cluster_num, value_1, value_2, value_3, value_4) in the output. My output dataframe according to the above code is:

   cluster_num  value_1  value_2  value_3  value_4  count
0            1        2        3        4        5      3
1            1        3        4        2        5      3
2            1        3        7        9       10      3
3            2        6        2        7        9      2
4            2        2        4        7        0      2
7            5        8        4        2        1      2
8            5        0        7        1        2      2
I have tried filtering with groupby() and count(), but it did not work out.
groupby/filter
df.groupby('cluster_num').filter(lambda d: len(d) > 1)
duplicated
df[df.duplicated('cluster_num', keep=False)]
groupby/transform
Per @QuangHoang:
df[df.groupby('cluster_num')['cluster_num'].transform('size') >= 2]
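All three produce the same rows on the example frame above, for instance:

# using the df built in the question (the helper 'count' column is not needed)
out = df.groupby('cluster_num').filter(lambda d: len(d) > 1)
print(out['cluster_num'].tolist())  # [1, 1, 1, 2, 2, 5, 5]

# duplicated(..., keep=False) marks every occurrence of a repeated value,
# so df[df.duplicated('cluster_num', keep=False)] selects the same rows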
I have a dataframe with 2415 columns and I want to drop consecutive duplicate columns. That is, if column 1 and column 2 have the same values, I want to drop column 2.
I wrote the code below, but it doesn't seem to work:
for i in (0,len(df.columns)-1):
    if (df[i].tolist() == df[i+1].tolist()):
        df=df.drop([i+1], axis=1)
    else:
        df=df
You need to select the column name from the columns index. Also note that (0, len(df.columns)-1) is a two-element tuple, not a range, so your loop only runs twice. Try this:
columns = df.columns
drops = []
for i in range(len(columns) - 1):
    if df[columns[i]].tolist() == df[columns[i+1]].tolist():
        drops.append(columns[i+1])  # drop the later of the two matching columns
df = df.drop(drops, axis=1)
Let us try shift, comparing each column with the one to its left:
df.loc[:,~df.eq(df.astype(object).shift(axis=1)).all()]
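A small check of the shift approach, with made-up consecutive duplicate columns:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [1, 2], 'c': [3, 4]})

# columns that fully match their left neighbour are dropped
result = df.loc[:, ~df.eq(df.astype(object).shift(axis=1)).all()]
print(result.columns.tolist())  # ['a', 'c'] -- 'b' duplicated 'a'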
I am trying to check whether the last cell in a pandas dataframe column contains a 1 or a 2 (these are the only options). If it is a 1, I would like to delete the whole row; if it is a 2, I would like to keep it.
import pandas as pd
df1 = pd.DataFrame({'number':[1,2,1,2,1], 'name': ['bill','mary','john','sarah','tom']})
df2 = pd.DataFrame({'number':[1,2,1,2,1,2], 'name': ['bill','mary','john','sarah','tom','sam']})
In the above example I would want to delete the last row of df1 (so the final row is 'sarah'), however in df2 I would want to keep it exactly as it is.
So far I have thought to try the following, but I am getting an error:
if df1['number'].tail(1) == 1:
    df = df.drop(-1)
DataFrame.drop removes rows based on labels (the actual values of the indices). While it is possible to do this with df1.drop(df1.index[-1]), it is problematic with a duplicated index. The last row can instead be selected positionally with iloc, or a single value with .iat:
if df1['number'].iat[-1] == 1:
    df1 = df1.iloc[:-1, :]
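To see why dropping by label is risky, consider a (contrived) duplicated index:

import pandas as pd

df1 = pd.DataFrame({'number': [2, 1]}, index=[0, 0])

# drop by label removes BOTH rows labelled 0
print(len(df1.drop(df1.index[-1])))  # 0

# positional slicing removes only the last row
print(len(df1.iloc[:-1, :]))  # 1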
You can check if the value of number in the last row is equal to one:
check = df1['number'].tail(1).values == 1
# Or check entire row with
# check = 1 in df1.tail(1).values
If that condition holds, you can select all rows, except the last one and assign back to df1:
if check:
    df1 = df1.iloc[:-1, :]
if df1.tail(1).number.item() == 1:  # .item() extracts the scalar from the one-row Series
    df1.drop(len(df1)-1, inplace=True)
You can use the same tail function:
df.drop(df.tail(n).index, inplace=True)  # drop last n rows
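As a quick check, applied to df1 and df2 from the question, only df1 loses its last row:

if df1['number'].tail(1).item() == 1:
    df1.drop(df1.tail(1).index, inplace=True)  # drops 'tom'; 'sarah' is now last

if df2['number'].tail(1).item() == 1:
    df2.drop(df2.tail(1).index, inplace=True)  # df2 ends in 2, so nothing changes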