Is there any way to get pandas to show the column number and the column name at the same time? I'm dealing with a dataset of more than 30 columns, all with very long names and some varying only slightly from one another. It's an absolute chore to type out the names when writing code (and I would still need to see the column names to know which columns to select).
Thanks.
One possible solution is to create a MultiIndex and then select columns by DataFrame.xs:
import pandas as pd

df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
df.columns = pd.MultiIndex.from_arrays([pd.RangeIndex(len(df.columns)), df.columns])
print (df)
0 1 2 3 4 5
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
print (df.xs(2, level=0, axis=1))
C
0 7
1 8
2 9
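If you prefer to keep the original flat columns, a lighter alternative (a sketch, not part of the answer above) is to print a position-to-name legend once and then select by position with iloc, which preserves the original names:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Print a number -> name legend instead of typing long names.
for i, name in enumerate(df.columns):
    print(i, name)

# Select columns 0 and 2 by position; the original labels are kept.
sub = df.iloc[:, [0, 2]]
print(sub.columns.tolist())  # ['A', 'C']
```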
I have one pandas dataframe, with one row and multiple columns.
I want to get the column number/index of the minimum value in the given row.
The code I found was: df.columns.get_loc('colname')
The above code asks for a column name.
My dataframe doesn't have column names. I want to get the column location of the minimum value.
Use argmin after converting the DataFrame to a NumPy array via values; note that this requires all-numeric data:
import pandas as pd

df = pd.DataFrame({
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[-5,3,6,9,2,-4]
})
print (df)
B C D E
0 4 7 1 -5
1 5 8 3 3
2 4 9 5 6
3 5 4 7 9
4 5 2 1 2
5 4 3 0 -4
df['col'] = df.values.argmin(axis=1)
print (df)
B C D E col
0 4 7 1 -5 3
1 5 8 3 3 2
2 4 9 5 6 0
3 5 4 7 9 1
4 5 2 1 2 2
5 4 3 0 -4 3
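Since the question's one-row frame has no column names, the label-based idxmin gives the same positions here, because the default integer column labels coincide with the positions. A minimal sketch:

```python
import pandas as pd

# One row with default integer column labels (no names).
df = pd.DataFrame([[4, 7, 1, -5]])

# idxmin(axis=1) returns the column label of each row's minimum;
# with default RangeIndex columns, the label equals the position.
pos = df.idxmin(axis=1).iloc[0]
print(pos)  # 3
```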
I have a data frame with a multi index and one column.
Index fields are type and amount, the column is called count
I would like to add a column that multiplies amount and count
df2 = df.groupby(['type','amount']).count().copy()
# I then dropped all columns but one and renamed it to "count"
df2['total_amount'] = df2['count'].multiply(df2['amount'], axis='index')
This doesn't work; I get a KeyError on amount.
How do I access a level of the MultiIndex to use it in calculations?
Use GroupBy.transform, which returns a Series of aggregated values with the same length as the original df, so values may repeat within each group:
count = df.groupby(['type','amount'])['type'].transform('count')
df['total_amount'] = df['amount'].multiply(count, axis='index')
print (df)
A amount C D E type total_amount
0 a 4 7 1 5 a 8
1 b 5 8 3 3 a 5
2 c 4 9 5 6 a 8
3 d 5 4 7 9 b 10
4 e 5 2 1 2 b 10
5 f 4 3 0 4 b 4
Or aggregate first and multiply the count by the amount index level. Sample:
import pandas as pd

df = pd.DataFrame({'A':list('abcdef'),
'amount':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'type':list('aaabbb')})
print (df)
A amount C D E type
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df2 = df.groupby(['type','amount'])['type'].count().to_frame('count')
df2['total_amount'] = df2['count'].mul(df2.index.get_level_values('amount'))
print (df2)
count total_amount
type amount
a 4 2 8
5 1 5
b 4 1 4
5 2 10
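Another way to use index levels in calculations (a sketch, not the answer's own approach) is to turn them back into ordinary columns with reset_index, so the multiplication becomes plain column arithmetic:

```python
import pandas as pd

df = pd.DataFrame({'amount': [4, 5, 4, 5, 5, 4],
                   'type': list('aaabbb')})

df2 = df.groupby(['type', 'amount'])['type'].count().to_frame('count')

# reset_index moves the MultiIndex levels back into regular columns.
df3 = df2.reset_index()
df3['total_amount'] = df3['count'] * df3['amount']
print(df3)
```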
I have a dataset D with columns A through Z, 26 columns in total. I have run some tests and found which columns are useful to me, stored in a Series S.
D #Dataset with columns from A - Z
S
B 0.78
C 1.04
H 2.38
S holds the column names with a value indicating each one's importance. I would now like to keep only those columns in the dataset, e.g. (B, C, D). How can I do it?
If I understand correctly, you can use:
cols = ['B','C','D']
df = df[cols]
Or, if the column names are stored as the Series values:
S = pd.Series(['B','C','D'])
df = df[S]
Sample:
import pandas as pd

df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
S = pd.Series(['B','C','D'])
print (S)
0 B
1 C
2 D
dtype: object
print (df[S])
B C D
0 4 7 1
1 5 8 3
2 6 9 5
Or select by the index values:
S = pd.Series([1,2,3], index=['B','C','D'])
print (S)
B 1
C 2
D 3
dtype: int64
print (df[S.index])
B C D
0 4 7 1
1 5 8 3
2 6 9 5
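Since S carries importance scores, you can also filter the columns by the score itself rather than listing them by hand. A small sketch with hypothetical scores and threshold:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [4, 5], 'C': [7, 8], 'H': [0, 1]})
# Hypothetical importance scores, shaped like the question's S.
S = pd.Series({'B': 0.78, 'C': 1.04, 'H': 2.38})

# Keep only columns whose score clears a chosen threshold.
keep = S[S > 1].index
out = df[keep]
print(out.columns.tolist())  # ['C', 'H']
```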
If I have these columns in a dataframe:
a b
1 5
1 7
2 3
1,2 3
2 5
How do I create column c, where column b is summed using groupings of column a (a string), while keeping the existing dataframe? Some rows can belong to more than one group.
a b c
1 5 15
1 7 15
2 3 11
1,2 3 26
2 5 11
Is there an easy and efficient solution? The dataframe I have is very large.
You first need to split column a, stack it, and join it back to the original DataFrame:
print (df.a.str.split(',', expand=True)
.stack()
.reset_index(level=1, drop=True)
.rename('a'))
0 1
1 1
2 2
3 1
3 2
4 2
Name: a, dtype: object
df1 = (df.drop('a', axis=1)
         .join(df.a.str.split(',', expand=True)
                 .stack()
                 .reset_index(level=1, drop=True)
                 .rename('a')))
print (df1)
b a
0 5 1
1 7 1
2 3 2
3 3 1
3 3 2
4 5 2
Then use transform to get per-group sums without aggregating away rows:
df1['c'] = df1.groupby('a')['b'].transform('sum')
# cast to string so the aggregating join below works
df1['a'] = df1.a.astype(str)
print (df1)
b a c
0 5 1 15
1 7 1 15
2 3 2 11
3 3 1 15
3 3 2 11
4 5 2 11
Last, group by the index and aggregate each column with agg:
print (df1.groupby(level=0)
.agg({'a':','.join,'b':'first' ,'c':sum})
[['a','b','c']] )
a b c
0 1 5 15
1 1 7 15
2 2 3 11
3 1,2 3 26
4 2 5 11
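On newer pandas (0.25+), the split-stack-join step can be written more directly with explode; a sketch of the same computation under that assumption:

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '1', '2', '1,2', '2'],
                   'b': [5, 7, 3, 3, 5]})

# Split the group string into a list and explode one row per group,
# keeping the original index so results map back to the source rows.
df1 = df.assign(a=df['a'].str.split(',')).explode('a')
df1['c'] = df1.groupby('a')['b'].transform('sum')

# Rows in several groups appear several times; sum their group totals
# back onto the original index.
df['c'] = df1.groupby(level=0)['c'].sum()
print(df)
```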
Data in excel:
a b a d
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
Code:
df = pd.read_excel(r"sample.xlsx", sheet_name="Sheet1")
df
a b a.1 d
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
How do I delete the column a.1?
When pandas reads the data from Excel, it automatically renames the second a to a.1.
I tried df.drop("a.1", index=1), but this does not work.
I have a huge Excel file with duplicate names, and I am interested in only a few of the columns.
You need to pass axis=1 for drop to work:
In [100]:
df.drop('a.1', axis=1)
Out[100]:
a b d
0 1 2 4
1 2 3 5
2 3 4 6
3 4 5 7
Or just pass a list of the columns of interest for selection:
In [102]:
cols = ['a','b','d']
df[cols]
Out[102]:
a b d
0 1 2 4
1 2 3 5
2 3 4 6
3 4 5 7
Also works with label-based 'fancy indexing' (.ix has since been removed from pandas; use .loc instead):
In [103]:
df.loc[:, cols]
Out[103]:
a b d
0 1 2 4
1 2 3 5
2 3 4 6
3 4 5 7
If you know the name of the column you want to drop:
df = df[[col for col in df.columns if col != 'a.1']]
and if you have several columns you want to drop:
columns_to_drop = ['a.1', 'b.1', ... ]
df = df[[col for col in df.columns if col not in columns_to_drop]]
More generally, drop all duplicated columns (pandas suffixes repeats with .1, .2, and so on):
df = df.drop(df.filter(regex=r'\.\d').columns, axis=1)
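The same idea can be expressed as a boolean mask over the column names; a sketch that anchors the pattern to the end of the name so legitimately dotted names are not caught by accident:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['a', 'b', 'a.1', 'd'])

# Keep columns whose name does not end in ".<digits>", the suffix
# pandas appends to duplicate names on read.
df = df.loc[:, ~df.columns.str.contains(r'\.\d+$')]
print(df.columns.tolist())  # ['a', 'b', 'd']
```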