I want to select columns with a specific value (say 1) in a specific row (say first row) for Pandas Dataframe
you can use this
df['a'][df['a']==0]
Use iloc with boolean indexing, for performance is better filtering index not DataFrame and then select index (see performance):
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
s = df.iloc[0]
a = s.index[s == 1]
print (a)
Index(['D'], dtype='object')
a = s.index.values[(s == 1)]
print (a)
['D']
You can use iloc to extract a row as a series, then apply your condition:
row = df.iloc[0] # extract first row as series
res = row[res == 1].index # filter for values equal to 1 and get columns via index
Related
I have a panda dataframe (here represented using excel):
Now I would like to delete all dublicates (1) of a specific row (B).
How can I do it ?
For this example, the result would look like that:
You can use duplicated for boolean mask and then set NaNs by loc, mask or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternative if need remove duplicates rows by B column:
df = df.drop_duplicates(subset=['B'])
Sample:
df = pd.DataFrame({
'B': [1,2,1,3],
'A':[1,5,7,9]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3
Let's say I have a DataFrame and don't know the names of all columns. However, I know there's a column called "N_DOC" and I want this to be the first column of the DataFrame - (while keeping all other columns, regardless its order).
How can I do this?
You can reorder the columns of a datframe with reindex:
cols = df.columns.tolist()
cols.remove('N_DOC')
df.reindex(['N_DOC'] + cols, axis=1)
Use DataFrame.insert with DataFrame.pop for extract column:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'N_DOC':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
c = 'N_DOC'
df.insert(0, c, df.pop(c))
Or:
df.insert(0, 'N_DOC', df.pop('N_DOC'))
print (df)
N_DOC A B C E F
0 1 a 4 7 5 a
1 3 b 5 8 3 a
2 5 c 4 9 6 a
3 7 d 5 4 9 b
4 1 e 5 2 2 b
5 0 f 4 3 4 b
Here's a simple, one line, solution using DataFrame masking:
import pandas as pd
# Building sample dataset.
cols = ['N_DOCa', 'N_DOCb', 'N_DOCc', 'N_DOCd', 'N_DOCe', 'N_DOC']
df = pd.DataFrame(columns=cols)
# Re-order columns.
df = df[['N_DOC'] + df.columns.drop('N_DOC').tolist()]
Before:
Index(['N_DOCa', 'N_DOCb', 'N_DOCc', 'N_DOCd', 'N_DOCe', 'N_DOC'], dtype='object')
After:
Index(['N_DOC', 'N_DOCa', 'N_DOCb', 'N_DOCc', 'N_DOCd', 'N_DOCe'], dtype='object')
I have a dataframe like this -
What I want to do is, whenever there is 'X' in Col3, that row should get duplicated and 'X' should be changed to 'Z'. The result must look like this -
I did try a few approaches, but nothing worked!
Can somebody please guide on how to do this.
You can filter first by boolean indexing and set Z to Col3 by DataFrame.assign, join with original with concat, sorting index by DataFrame.sort_index with stabble algo mergesort and last create default RangeIndex by DataFrame.reset_index with drop=True:
df = pd.DataFrame({
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'Col3':list('aXcdXf'),
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
df = (pd.concat([df, df[df['Col3'].eq('X')].assign(Col3 = 'Z')])
.sort_index(kind='mergesort')
.reset_index(drop=True))
print (df)
B C Col3 D E F
0 4 7 a 1 5 a
1 5 8 X 3 3 a
2 5 8 Z 3 3 a
3 4 9 c 5 6 a
4 5 4 d 7 9 b
5 5 2 X 1 2 b
6 5 2 Z 1 2 b
7 4 3 f 0 4 b
I select columns 2 - end from a pandas DataFrame with iloc as
d=c.iloc[:,2:]
now how can I apply a condition to this selection? For example, if column1==1.
You can use DataFrame.iloc if need filter first column select by position, : means here select all rows:
c[c.iloc[:, 0] == 1]
Sample:
c = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (c)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df = c[c.iloc[:, 3] == 1]
print (df)
A B C D E F
0 a 4 7 1 5 a
4 e 5 2 1 2 b
This is referred to as mixed indexing in that you want to index by boolean results in rows and position in columns. I'd use loc in order to take advantage of boolean indexing for the rows. But that implies that you need column names values for the column slice.
d.loc[d.column1 == 1, d.columns[2:]]
If your column names are not unique then you can resort to the dreaded chained index.
d.loc[d.column1 == 1].iloc[:, 2:]
What might also be intuitive is to use query afterwards:
d.iloc[:, 2:].query('column1 == 1')
I have two dataframes. DF and SubDF. SubDF is a subset of DF. I want to extract the rows in DF that are NOT in SubDF.
I tried the following:
DF2 = DF[~DF.isin(SubDF)]
The number of rows are correct and most rows are correct,
ie number of rows in subDF + number of rows in DF2 = number of rows in DF
but I get rows with NaN values that do not exist in the original DF
Not sure what I'm doing wrong.
Note: the original DF does not have any NaN values, and to double check I did DF.dropna() before and the result still produced NaN
You need merge with outer join and boolean indexing, because DataFrame.isin need values and index match:
DF = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (DF)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
SubDF = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]})
print (SubDF)
A B C D E F
0 3 6 9 5 6 3
#return no match
DF2 = DF[~DF.isin(SubDF)]
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
DF2 = pd.merge(DF, SubDF, how='outer', indicator=True)
DF2 = DF2[DF2._merge == 'left_only'].drop('_merge', axis=1)
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
Another way, borrowing the setup from #jezrael:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
sub = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]})
extract_idx = list(set(df.index) - set(sub.index))
df_extract = df.loc[extract_idx]
The rows may not be sorted in the original df order. If matching order is required:
extract_idx = list(set(df.index) - set(sub.index))
idx_dict = dict(enumerate(df.index))
order_dict = dict(zip(idx_dict.values(), idx_dict.keys()))
df_extract = df.loc[sorted(extract_idx, key=order_dict.get)]