I am trying to select the first 2 columns and the last 2 columns of a data frame by index with pandas and save the result back to the same dataframe.
Is there a way to do that in one step?
You can use the iloc indexer and pass in the column positions. Listing them as [0, 1, -2, -1] keeps the columns in their original order:
df = df.iloc[:, [0, 1, -2, -1]]
You are looking for iloc:
df = pd.DataFrame([[1,2,3,4,5], [2,3,4,5,6], [3,4,5,6,7]], columns=['a','b','c','d','e'])
df.iloc[:,:2] # Grabs all rows and first 2 columns
df.iloc[:,-2:] # Grabs all rows and last 2 columns
pd.concat([df.iloc[:,:2], df.iloc[:,-2:]], axis=1) # Puts them together column-wise
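If you want a single indexing call instead of concat, one sketch (assuming NumPy is available) builds the position list with np.r_:
import numpy as np
df = df.iloc[:, np.r_[0:2, -2:0]] # np.r_[0:2, -2:0] expands to [0, 1, -2, -1]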
df = pd.DataFrame([[1,2,3,4,5], [2,3,4,5,6], [3,4,5,6,7]], columns=['a','b','c','d','e'])
df[['a','b','d','e']]
Result:
a b d e
0 1 2 4 5
1 2 3 5 6
2 3 4 6 7
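If you'd rather not hard-code the names, a small sketch that derives the same list from df.columns:
cols = df.columns[:2].tolist() + df.columns[-2:].tolist() # ['a', 'b', 'd', 'e']
df[cols]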
I have the following dataframe:
A,B,C,D
10,1,2,3
1,4,7,3
10,5,2,3
40,7,9,3
9,9,9,9
I would like to create another dataframe, starting from the previous one, which has only two rows. The selection of these two rows is based on the minimum and maximum values in column "A". I would like to get:
A,B,C,D
1,4,7,3
40,7,9,3
Do you think I should work with something like index.min and index.max, then select only those two rows and append them to a new dataframe? Do you have any other suggestions?
IIUC you can simply subset the dataframe with an OR condition on df.A.min() and df.A.max():
df = df[(df.A == df.A.min()) | (df.A == df.A.max())]
df
A B C D
1 1 4 7 3
3 40 7 9 3
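Note that this keeps every row that ties for the minimum or maximum. If you prefer query syntax, a sketch of the same condition:
df = df.query('A == A.min() or A == A.max()')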
Yes, you can use idxmin/idxmax and then use loc:
df.loc[df['A'].agg(['idxmin', 'idxmax'])]
Output:
A B C D
1 1 4 7 3
3 40 7 9 3
Note that this gives only one row for the min and one for the max. If you want all rows that tie for those values, you should use #CHRD's solution.
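For reference, the same lookup can be spelled without agg by passing the two index labels explicitly:
df.loc[[df['A'].idxmin(), df['A'].idxmax()]]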
I would like to automate the selection of values in one column, Step_ID.
Instead of listing which Step_IDs to keep (as in the code below), I would like to specify that the first Step_ID and the last Step_ID are excluded.
df = df.set_index(['Step_ID'])
df.loc[df.index.isin(['Step_2','Step_3','Step_4','Step_5','Step_6','Step_7','Step_8','Step_9','Step_10','Step_11','Step_12','Step_13','Step_14','Step_15','Step_16','Step_17','Step_18','Step_19','Step_20','Step_21','Step_22','Step_23','Step_24'])]
Is there any option to exclude the first and last value in the column? In this example, Step_1 and Step_25.
Or to include all values except the first and the last? In this example, Step_2 through Step_24.
The reason is that the files have different numbers of Step_IDs, so I would like a solution that simplifies the filtering instead of redefining it every time. The first and last value in the 'Step_ID' column always need to be excluded, but the number of Step_IDs always differs: given Step_1 through Step_X, I need Step_2 through Step_(X-1).
Use:
df = pd.DataFrame({
'Step_ID': ['Step_1','Step_1','Step_2','Step_2','Step_3','Step_4','Step_5',
'Step_6','Step_6'],
'B': list(range(9))})
print (df)
Step_ID B
0 Step_1 0
1 Step_1 1
2 Step_2 2
3 Step_2 3
4 Step_3 4
5 Step_4 5
6 Step_5 6
7 Step_6 7
8 Step_6 8
Select all rows whose index value is not among the first and last index values, extracted with df.index[[0, -1]]:
df = df.set_index(['Step_ID'])
df = df.loc[~df.index.isin(df.index[[0, -1]])]
print (df)
B
Step_ID
Step_2 2
Step_2 3
Step_3 4
Step_4 5
Step_5 6
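If you would rather keep Step_ID as a regular column instead of making it the index, a sketch that masks the original frame on the column values directly:
first, last = df['Step_ID'].iloc[0], df['Step_ID'].iloc[-1] # first and last Step_ID values
df = df[~df['Step_ID'].isin([first, last])]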
I have a big dataframe with many duplicates in it. I want to keep the first and last entry of each duplicate but drop every duplicate in between.
I've already tried to get this done with df.drop_duplicates, using the parameters 'first' and 'last' to get two dataframes and then merging them back into one df so that I have the first and last entries, but that didn't work:
df_first = df
df_last = df
df_first['Path'].drop_duplicates(keep='first', inplace=True)
df_last['Path'].drop_duplicates(keep='last', inplace=True)
Thanks for your help in advance!
Use GroupBy.nth; unlike concatenating the first and last rows of each group, it avoids returning the same row twice when a group has only one member:
df = pd.DataFrame({
'a':[5,3,6,9,2,4],
'Path':list('aaabbc')
})
print(df)
a Path
0 5 a
1 3 a
2 6 a
3 9 b
4 2 b
5 4 c
df = df.groupby('Path').nth([0, -1])
print (df)
a
Path
a 5
a 6
b 9
b 2
c 4
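An alternative sketch using two duplicated masks: a row sits in between exactly when it is neither the first nor the last occurrence of its Path value, and this version also preserves the original index:
mask = df.duplicated('Path', keep='first') & df.duplicated('Path', keep='last')
df = df[~mask]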
Using GroupBy.nth (an updated version of the previous solution) to keep the nth entry of each duplicated group:
def keep_second_dup(duplicate, column_name):
    # Count how many times each value occurs in the column
    duplicate['Count'] = duplicate[column_name].map(duplicate[column_name].value_counts())
    # Rows whose value occurs more than once
    second_duplicate = duplicate[duplicate['Count'] > 1]
    # Rows whose value occurs exactly once
    residual = duplicate[duplicate['Count'] == 1]
    # Take the second entry (position 1) of each duplicated group
    sec = second_duplicate.groupby(column_name).nth([1]).reset_index()
    final_data = pd.concat([sec, residual])
    final_data.drop('Count', axis=1, inplace=True)
    return final_data
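For example, keep_second_dup(df.copy(), 'Path') on the sample frame above would return the second row of each duplicated Path ('a' and 'b') plus the single 'c' row. Passing a copy matters because the function adds a temporary Count column to its argument:
print(keep_second_dup(df.copy(), 'Path'))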
I have two pandas dataframes with names df1 and df2 such that
df1: a b c d
1 2 3 4
5 6 7 8
and
df2: b c
12 13
I want the result be like
result: b c
2 3
6 7
Here it should be noted that a, b, c, d are the column names of the pandas dataframe. The shapes and values of the two dataframes differ. I want to match the column names of df2 against those of df1 and select, for all rows of df1, the columns whose headers match the column names of df2. df2 is only used to pick the specific columns of df1, keeping all rows. I tried the code below, but it only gives me an index.
df1.columns.intersection(df2.columns)
The above code does not give me my result, as it returns only the column headers with no values. I want to write code that takes my two dataframes as input and compares the column headers for the selection. I don't want to hard-code the column names.
I believe you need:
df = df1[df1.columns.intersection(df2.columns)]
Or, as #Zero pointed out in the comments (note that & on Index objects is deprecated in newer pandas in favor of intersection):
df = df1[df1.columns & df2.columns]
Or, use reindex
In [594]: df1.reindex(columns=df2.columns)
Out[594]:
b c
0 2 3
1 6 7
Or equivalently:
In [595]: df1.reindex(df2.columns, axis=1)
Out[595]:
b c
0 2 3
1 6 7
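One caveat with reindex: a column name that exists in df2 but not in df1 is created and filled with NaN rather than dropped. For example, with a hypothetical column 'z':
df1.reindex(columns=['b', 'z']) # 'z' is not in df1, so it comes back as all-NaN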
As an alternative to intersection, you can build a boolean mask with isin; since the mask runs along the column axis, apply it through loc:
df = df1.loc[:, df1.columns.isin(df2.columns)]
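Similarly, DataFrame.filter can do the selection in one call; it keeps only the listed labels and silently ignores any that are missing from df1:
df = df1.filter(items=df2.columns)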
I have the following data frame:
I want to remove duplicated data in the WD column when the rows share the same drug_id.
For example, there are two "crying" entries in the WD column with the same drug_id = 32, so I want to remove one of the rows that has "crying".
How can I do it? I know how to drop duplicate rows, but I do not know how to add this condition to the code below.
df = df.apply(lambda x: x.drop_duplicates())
You can use drop_duplicates with the subset parameter, which restricts the duplicate check to the given columns:
df.drop_duplicates(subset = ["drug_id", "WD"])
If duplicates should be detected regardless of upper/lower case, you could try:
df[~df[['drug_id', 'WD']].astype(str).apply(lambda x: x.str.lower()).duplicated()]
This converts the drug_id and WD columns to lower-case strings, uses the duplicated() method to flag duplicated rows, and then uses the resulting boolean series to filter those rows out.
Example:
df = pd.DataFrame({"A": [1,1,2,2], "B":[1,2,3,4], "C":[1,1,2,3]})
df
# A B C
#0 1 1 1
#1 1 2 1
#2 2 3 2
#3 2 4 3
df.drop_duplicates(subset=['A', 'C'])
# A B C
#0 1 1 1
#2 2 3 2
#3 2 4 3
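By default drop_duplicates keeps the first occurrence; pass keep='last' to keep the last one instead, or keep=False to drop every duplicated row:
df.drop_duplicates(subset=['A', 'C'], keep='last')
# A B C
#1 1 2 1
#2 2 3 2
#3 2 4 3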