Can I use DataFrame.set_index with the position of a column, or does it only work with the column name?
Example:
df4 = df.set_index(0).T instead of df4 = df.set_index('Parametres').T
thank you
To create the new index from the first column, look up its name by position:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
})
print (df.columns[0])
A
df = df.set_index(df.columns[0])
print (df)
B C
A
a 4 7
b 5 8
c 4 9
d 5 4
e 5 2
f 4 3
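Applied to the question, the same idea can be sketched as below. Only the column name 'Parametres' comes from the question; the other columns and all values are made up for illustration. Note that set_index(0) treats 0 as a column label, so it only works when a column is literally named 0 (for example after reading a file with header=None).

```python
import pandas as pd

# Hypothetical data; only the column name 'Parametres' comes from the question.
df = pd.DataFrame({
    'Parametres': ['width', 'height'],
    'run1': [1.0, 2.0],
    'run2': [3.0, 4.0],
})

# set_index expects column labels, so look the label up by position first.
by_position = df.set_index(df.columns[0]).T
by_name = df.set_index('Parametres').T

print(by_position)
```

Both calls should produce the same transposed frame, so the positional lookup can stand in for the hard-coded name.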
Let's say I have a DataFrame and don't know the names of all its columns. However, I know there's a column called "N_DOC", and I want it to be the first column of the DataFrame (while keeping all the other columns, regardless of their order).
How can I do this?
You can reorder the columns of a DataFrame with reindex (it returns a new DataFrame, so assign the result):
cols = df.columns.tolist()
cols.remove('N_DOC')
df = df.reindex(['N_DOC'] + cols, axis=1)
Use DataFrame.insert together with DataFrame.pop, which extracts the column:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'N_DOC':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
c = 'N_DOC'
df.insert(0, c, df.pop(c))
Or:
df.insert(0, 'N_DOC', df.pop('N_DOC'))
print (df)
N_DOC A B C E F
0 1 a 4 7 5 a
1 3 b 5 8 3 a
2 5 c 4 9 6 a
3 7 d 5 4 9 b
4 1 e 5 2 2 b
5 0 f 4 3 4 b
Here's a simple one-line solution that selects the columns with a reordered list of names:
import pandas as pd
# Building sample dataset.
cols = ['N_DOCa', 'N_DOCb', 'N_DOCc', 'N_DOCd', 'N_DOCe', 'N_DOC']
df = pd.DataFrame(columns=cols)
# Re-order columns.
df = df[['N_DOC'] + df.columns.drop('N_DOC').tolist()]
Before:
Index(['N_DOCa', 'N_DOCb', 'N_DOCc', 'N_DOCd', 'N_DOCe', 'N_DOC'], dtype='object')
After:
Index(['N_DOC', 'N_DOCa', 'N_DOCb', 'N_DOCc', 'N_DOCd', 'N_DOCe'], dtype='object')
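The approaches in these answers should all give the same column order. A quick check on a small made-up frame (the helper name move_to_front is hypothetical, not a pandas function):

```python
import pandas as pd

def move_to_front(df, col):
    """Return a new DataFrame with `col` as the first column (hypothetical helper)."""
    rest = [c for c in df.columns if c != col]
    return df[[col] + rest]

df = pd.DataFrame({'A': [1, 2], 'N_DOC': [10, 20], 'B': [3, 4]})

via_selection = move_to_front(df, 'N_DOC')

# Same result with reindex; reindex returns a new object, so keep the result.
cols = df.columns.tolist()
cols.remove('N_DOC')
via_reindex = df.reindex(['N_DOC'] + cols, axis=1)

# insert/pop modifies the frame in place, so work on a copy here.
via_insert = df.copy()
via_insert.insert(0, 'N_DOC', via_insert.pop('N_DOC'))

print(via_selection.columns.tolist())
```

The selection and reindex variants leave the original frame untouched, while insert with pop changes it in place, which is worth keeping in mind when the original order is still needed.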
I want to select columns with a specific value (say 1) in a specific row (say first row) for Pandas Dataframe
You can use this to filter the values of a single column (here, the entries of column 'a' that equal 0):
df['a'][df['a']==0]
Use iloc with boolean indexing. For performance, it is better to filter the index of the row rather than the DataFrame, and then select from the index:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
s = df.iloc[0]
a = s.index[s == 1]
print (a)
Index(['D'], dtype='object')
a = s.index.values[(s == 1)]
print (a)
['D']
You can use iloc to extract a row as a series, then apply your condition:
row = df.iloc[0] # extract first row as series
res = row[row == 1].index # filter for values equal to 1 and get columns via index
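To get the matching columns themselves rather than just their names, the boolean mask built from the first row can be passed to loc. A minimal sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 5],
    'B': [2, 1],
    'C': [1, 7],
})

row = df.iloc[0]            # first row as a Series indexed by column name
mask = row == 1             # True for columns whose first-row value is 1

names = row.index[mask]     # column names only
selected = df.loc[:, mask]  # the matching columns with all their rows

print(names.tolist())
print(selected)
```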
How can I join values in columns with the same name in MultiIndex pandas DataFrame?
data = [['1','1','2','3','4'],['2','5','6','7','8']]
df = pd.DataFrame(data, columns=['id','A','B','A','B'])
df = df.set_index('id')
df.columns = pd.MultiIndex.from_tuples([('result','A'),('result','B'),('student','A'),('student','B')])
df
result student
A B A B
id
1 1 2 3 4
2 5 6 7 8
Desired results:
A B
id
1 "1 3" "2 4"
2 "5 7" "6 8"
I am not completely sure what you are asking. If you have two separate dataframes then you should be able to just use pd.concat.
pd.concat([df1, df2], axis=1)
If you have one dataframe then just drop the top level of the index.
df.columns = df.columns.droplevel(0)
New answer:
To join values by the second level of the MultiIndex in the columns, use groupby with agg:
# select the columns defined in the list
df = df[['result','student']]
df1 = df.astype(str).groupby(level=1, axis=1).agg(' '.join)
print (df1)
A B
id
1 1 3 2 4
2 5 7 6 8
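In recent pandas releases (2.1 and later, as far as I know), groupby(..., axis=1) emits a deprecation warning. A sketch of the same aggregation that avoids axis=1 by transposing, grouping on the rows, and transposing back:

```python
import pandas as pd

data = [['1', '1', '2', '3', '4'], ['2', '5', '6', '7', '8']]
df = pd.DataFrame(data, columns=['id', 'A', 'B', 'A', 'B']).set_index('id')
df.columns = pd.MultiIndex.from_tuples(
    [('result', 'A'), ('result', 'B'), ('student', 'A'), ('student', 'B')]
)

# Transpose so the column levels become row levels, group on the second
# level, join the strings in each group, then transpose back.
df1 = df.astype(str).T.groupby(level=1).agg(' '.join).T

print(df1)
```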
Old answer:
You can use sort_index to sort the columns and then droplevel to remove the first level of the MultiIndex.
However, this produces duplicate column names.
print (df)
result student col
A B A B A B
id
1 1 2 3 4 6 7
2 5 6 7 8 2 1
# select the columns defined in the list
df = df[['result','student']]
print (df)
result student
A B A B
id
1 1 2 3 4
2 5 6 7 8
df = df.sort_index(axis=1, level=1)
df.columns = df.columns.droplevel(0)
print (df)
A A B B
id
1 1 3 2 4
2 5 7 6 8
A better option is to create unique column names with map and join:
df = df.sort_index(axis=1, level=1)
df.columns = df.columns.map('_'.join)
print (df)
result_A student_A result_B student_B
id
1 1 3 2 4
2 5 7 6 8
# alternatively, starting from the MultiIndex frame, select each top-level group and concatenate
df = pd.concat([df['result'],df['student']], axis=1).sort_index(axis=1)
print (df)
A A B B
id
1 1 3 2 4
2 5 7 6 8
I have two dataframes. DF and SubDF. SubDF is a subset of DF. I want to extract the rows in DF that are NOT in SubDF.
I tried the following:
DF2 = DF[~DF.isin(SubDF)]
The number of rows is correct and most rows are correct,
i.e. number of rows in SubDF + number of rows in DF2 = number of rows in DF,
but I get rows with NaN values that do not exist in the original DF.
Not sure what I'm doing wrong.
Note: the original DF does not have any NaN values, and to double check I did DF.dropna() before and the result still produced NaN
You need merge with an outer join and boolean indexing, because DataFrame.isin requires both the values and the index labels to match:
DF = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (DF)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
SubDF = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]})
print (SubDF)
A B C D E F
0 3 6 9 5 6 3
# nothing is excluded, because no value matches at the same index and column label
DF2 = DF[~DF.isin(SubDF)]
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
DF2 = pd.merge(DF, SubDF, how='outer', indicator=True)
DF2 = DF2[DF2._merge == 'left_only'].drop('_merge', axis=1)
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
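The merge-based anti-join can be wrapped in a small helper. This is only a sketch: the function name rows_not_in is hypothetical, it uses a left join rather than the outer join shown in the answer (enough when only the left_only rows are wanted), it compares rows on all shared columns, and it resets the index of the result.

```python
import pandas as pd

def rows_not_in(df, sub):
    """Return rows of `df` whose values do not appear as a row of `sub`."""
    # how='left' keeps every row of df; _merge tells where each row was found.
    # drop_duplicates on sub avoids repeating df rows when sub has duplicates.
    merged = df.merge(sub.drop_duplicates(), how='left', indicator=True)
    keep = merged['_merge'] == 'left_only'
    return merged.loc[keep].drop(columns='_merge').reset_index(drop=True)

DF = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
SubDF = pd.DataFrame({'A': [3], 'B': [6]})

result = rows_not_in(DF, SubDF)
print(result)
```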
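Another way, borrowing the setup from @jezrael. This compares index labels rather than values, so it only works when sub keeps the index labels its rows have in df (as it does when sub was created by filtering df). Here sub is built from scratch, so its index is set explicitly:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
sub = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]}, index=[2])
extract_idx = list(set(df.index) - set(sub.index))
df_extract = df.loc[extract_idx]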
The rows may not be sorted in the original df order. If matching order is required:
extract_idx = list(set(df.index) - set(sub.index))
idx_dict = dict(enumerate(df.index))
order_dict = dict(zip(idx_dict.values(), idx_dict.keys()))
df_extract = df.loc[sorted(extract_idx, key=order_dict.get)]
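If sub shares index labels with df, a boolean mask on the index keeps the original row order without the extra sorting step. A sketch, assuming sub was obtained by filtering df (the small frame here is made up):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
sub = df[df['A'] == 3]          # keeps index label 2 from df

# True for rows of df whose index label is absent from sub.
df_extract = df[~df.index.isin(sub.index)]

print(df_extract)
```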
I have a pandas dataframe with multiple columns and I want to "flatten" it to just two columns - one with column name and the other with values. E.g.
df1 = pd.DataFrame({'A':[1,2],'B':[2,3], 'C':[3,4]})
How can I convert it to look like:
df2 = pd.DataFrame({'column name': ['A','A','B','B','C','C'], 'value': [1,2,2,3,3,4]})
You can use stack to move all column values into a single column, drop the first index level with reset_index, overwrite the column names with the ones you want, and finally sort with sort_values:
In [37]:
df2 = df1.stack().reset_index(level=0, drop=True).reset_index()
df2.columns = ['column name', 'value']
df2.sort_values(['column name', 'value'], inplace=True)
df2
Out[37]:
column name value
0 A 1
3 A 2
1 B 2
4 B 3
2 C 3
5 C 4
You can reshape with stack into a MultiIndex Series, then use reset_index and sort_values:
df2 = df1.stack().reset_index(level=0, drop=True).reset_index().sort_values('index')
df2.columns = ['column name','value']
print (df2)
column name value
0 A 1
3 A 2
1 B 2
4 B 3
2 C 3
5 C 4
One-line solution that renames the index column to column name (the parentheses let the chain span several lines):
df2 = (df1.stack()
    .reset_index(level=0, drop=True)
    .reset_index(name='value')
    .sort_values(['index'])
    .rename(columns={'index':'column name'}))
print (df2)
column name value
0 A 1
3 A 2
1 B 2
4 B 3
2 C 3
5 C 4
If you need to sort by both columns:
df2 = df1.stack().reset_index(level=0, drop=True).reset_index().sort_values(['index',0])
df2.columns = ['column name','value']
print (df2)
column name value
0 A 1
3 A 2
1 B 2
4 B 3
2 C 3
5 C 4
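DataFrame.melt is another option for this reshape and returns the rows already grouped by column name. A minimal sketch on the frame from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [2, 3], 'C': [3, 4]})

# melt turns every column into (name, value) pairs, column by column.
df2 = df1.melt(var_name='column name', value_name='value')

print(df2)
```

Unlike the stack-based versions, the result here gets a fresh 0..5 index in the displayed order, so no sorting is needed.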