Renaming Pandas DataFrame columns that are numbers - python

I have a DataFrame that has integers for column names that looks like this:
1 2 3 4
Red 7 3 2 9
Blue 3 1 6 4
I'd like to rename the columns. I tried the following:
df = df.rename(columns={'1': 'One', '2': 'Two', '3': 'Three', '4': 'Four'})
However, that doesn't change the column names. Do I need to do something different to rename columns whose names are numbers?

The column names are integers, not strings, so you need to remove the quotes from the dictionary keys:
df = df.rename(columns={1: 'One', 2: 'Two', 3: 'Three', 4: 'Four'})
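A minimal sketch of why the quoted keys appear to do nothing, assuming the DataFrame from the question was built with integer column labels (the constructor call below is a reconstruction, not the asker's code). By default, rename skips keys that do not match any existing label instead of raising an error:

```python
import pandas as pd

# Reconstruction of the question's frame, assuming integer column labels.
df = pd.DataFrame(
    [[7, 3, 2, 9], [3, 1, 6, 4]],
    index=["Red", "Blue"],
    columns=[1, 2, 3, 4],
)

# String keys do not match the integer labels, so nothing is renamed.
unchanged = df.rename(columns={"1": "One", "2": "Two", "3": "Three", "4": "Four"})

# Integer keys match, so every column is renamed.
renamed = df.rename(columns={1: "One", 2: "Two", 3: "Three", 4: "Four"})

print(list(unchanged.columns))
print(list(renamed.columns))
```

If a silent no-op is undesirable, passing errors='raise' to rename should turn the unmatched keys into a KeyError.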

Alternatively, you can assign the new names directly to df.columns:
>>> df.columns = ['One', 'Two', 'Three', 'Four']
>>> df
One Two Three Four
0 7 3 6 9
1 3 1 2 4

There are two ways to change column names in a pandas DataFrame.
Assign a list of new names to the df.columns attribute:
df.columns = ['One', 'Two', 'Three', 'Four']
Use the rename() method:
df = df.rename(columns={1: 'One', 2: 'Two', 3: 'Three', 4: 'Four'})

Related

Subsetting a Dataframe by Column Contents in Separate DataFrame - Python 3

How do I subset a DataFrame using the contents of another DataFrame's column?
df1 = pd.DataFrame({"0": ['one', 'two', 'three', 'four'], "Index": [1, 2, 3, 4]})
df2 = pd.DataFrame({"0": ['two', 'two', 'three', 'three']})
a = [i for i in df1['0'] if i in df2['0']]
Results
print(a)
[]
Desired output:
print(a)
0 index
0 two 2
1 three 3
If you really want to keep this unnecessary list comprehension, you need to compare against s.values, because the in operator on a Series checks the index labels, not the values:
>>> [i for i in df1['0'].values if i in df2['0'].values]
['two', 'three']
But to get directly what your desired output implies, just select the rows of df1 whose '0' column value appears in df2's '0' column:
>>> df1[df1['0'].isin(df2['0'])]
0 Index
1 two 2
2 three 3
To reset the index as well, call reset_index(drop=True) on the result.
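Putting the pieces of this answer together, here is a short sketch using the frames from the question. The variable name subset is only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"0": ["one", "two", "three", "four"], "Index": [1, 2, 3, 4]})
df2 = pd.DataFrame({"0": ["two", "two", "three", "three"]})

# `in` on a Series tests the index labels (0..3 here), not the stored values.
print("two" in df2["0"])   # a value, not a label
print(0 in df2["0"])       # a label

# Keep the rows of df1 whose "0" value occurs in df2["0"], then renumber them.
subset = df1[df1["0"].isin(df2["0"])].reset_index(drop=True)
print(subset)
```

The isin call builds a boolean mask over df1, so the duplicates in df2 do not produce duplicate rows in the result.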

Add column to pandas multiindex dataframe

I have a pandas dataframe that looks like this:
import pandas as pd
import numpy as np
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays, columns=['A', 'B', 'C', 'D'])
I want to add a column E such that df.loc[(slice(None),'one'),'E'] = 1 and df.loc[(slice(None),'two'),'E'] = 2, and I want to do this without iterating over ['one', 'two']. I tried the following:
df.loc[(slice(None),slice('one','two')),'E'] = pd.Series([1,2],index=['one','two'])
but it just adds a column E with NaN. What's the right way to do this?
Here is one way, using reindex:
df.loc[:,'E']=pd.Series([1,2],index=['one','two']).reindex(df.index.get_level_values(1)).values
df
A B C D E
bar one -0.856175 -0.383711 -0.646510 0.110204 1
two 1.640114 0.099713 0.406629 0.774960 2
baz one 0.097198 -0.814920 0.234416 -0.057340 1
two -0.155276 0.788130 0.761469 0.770709 2
foo one 1.593564 -1.048519 -1.194868 0.191314 1
two -0.755624 0.678036 -0.899805 1.070639 2
qux one -0.560672 0.317915 -0.858048 0.418655 1
two 1.198208 0.662354 -1.353606 -0.184258 2
Methinks this is a good use case for Index.map:
df['E'] = df.index.get_level_values(1).map({'one':1, 'two':2})
df
A B C D E
bar one 0.956122 -0.705841 1.192686 -0.237942 1
two 1.155288 0.438166 1.122328 -0.997020 2
baz one -0.106794 1.451429 -0.618037 -2.037201 1
two -1.942589 -2.506441 -2.114164 -0.411639 2
foo one 1.278528 -0.442229 0.323527 -0.109991 1
two 0.008549 -0.168199 -0.174180 0.461164 2
qux one -1.175983 1.010127 0.920018 -0.195057 1
two 0.805393 -0.701344 -0.537223 0.156264 2
You can just get it from df.index.codes (named df.index.labels before pandas 0.24):
df['E'] = df.index.codes[1] + 1
print(df)
Output:
A B C D E
bar one 0.746123 1.264906 0.169694 -0.180074 1
two -1.439730 -0.100075 0.929750 0.511201 2
baz one 0.833037 1.547624 -1.116807 0.425093 1
two 0.969887 -0.705240 -2.100482 0.728977 2
foo one -0.977623 -0.800136 -0.361394 0.396451 1
two 1.158378 -1.892137 -0.987366 -0.081511 2
qux one 0.155531 0.275015 0.571397 -0.663358 1
two 0.710313 -0.255876 0.420092 -0.116537 2
Thanks to coldspeed: if you want different values (e.g. x and y), use:
df['E'] = pd.Series(df.index.codes[1]).map({0: 'x', 1: 'y'}).tolist()
print(df)
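For comparison, here is a small sketch that applies two of the approaches above to the same frame. It assumes pandas 0.24 or later, where the integer positions of a MultiIndex are exposed as codes. The column name F is only illustrative:

```python
import numpy as np
import pandas as pd

arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays, columns=list("ABCD"))

# Map the labels of the second index level to the desired values.
df["E"] = df.index.get_level_values(1).map({"one": 1, "two": 2})

# Same result via the integer codes of the second level
# ('one' -> 0, 'two' -> 1 because the level values are sorted).
df["F"] = df.index.codes[1] + 1

print(df[["E", "F"]])
```

The map version states the label-to-value mapping explicitly, while the codes version depends on the sort order of the level values, so the former is probably the safer choice when the labels are not simply consecutive.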

Modify pandas group

I have a DataFrame, which I group.
I would like to add another column to the DataFrame that holds the result of the diff function, computed per group. Something like:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df_grouped = df.groupby('B')
for name, group in df_grouped:
    new_df["D_diff"] = group["D"].diff()
I would like to get, for each group, the difference of column D, and end up with a DataFrame that includes a new column holding the diff calculation.
IIUC, you can use GroupBy.diff (here SeriesGroupBy.diff, since a single column is selected):
df['D_diff'] = df.groupby('B')['D'].diff()
print(df)
A B C D D_diff
0 foo one 1.996084 0.580177 NaN
1 bar one 1.782665 0.042979 -0.537198
2 foo two -0.359840 1.952692 NaN
3 bar three -0.909853 0.119353 NaN
4 foo two -0.478386 -0.970906 -2.923598
5 bar two -1.289331 -1.245804 -0.274898
6 foo one -1.391884 -0.555056 -0.598035
7 foo three -1.270533 0.183360 0.064007
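The same idea with fixed numbers, so the result can be checked by hand. The values in D are made up for illustration; only the grouping column B comes from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "D": [1.0, 3.0, 10.0, 5.0, 14.0, 11.0, 2.0, 9.0],  # illustrative values
})

# diff() runs within each group of B; the result is aligned to the
# original row index, so it can be assigned straight back as a column.
df["D_diff"] = df.groupby("B")["D"].diff()

print(df)
```

The first row of each group has no predecessor inside that group, so it gets NaN, which matches the pattern in the output above.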

compute mean of unique combinations in groupby pandas

I have the following pandas DataFrame:
data = DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' :[2,1,2,1,2,1,2,1]})
that looks like:
A B C
0 foo one 2
1 bar one 1
2 foo two 2
3 bar three 1
4 foo two 2
5 bar two 1
6 foo one 2
7 foo three 1
What I need is to compute the mean of each unique combination of A and B. i.e.:
A B C
foo one 2
foo two 2
foo three 1
mean = 1.66666667
and to get as output the means computed per value of A, i.e.:
foo 1.666667
bar 1
I tried:
data.groupby(['A'], sort=False, as_index=False).mean()
but it returns:
foo 1.8
bar 1
Is there a way to compute the mean of only the unique combinations? How?
This is essentially the same as @S_A's answer, but a bit more concise.
You can calculate the means across A and B with:
In [41]: df.groupby(['A', 'B']).mean()
Out[41]:
C
A B
bar one 1
three 1
two 1
foo one 2
three 1
two 2
And then calculate the mean of these over A with:
In [42]: df.groupby(['A', 'B']).mean().groupby(level='A').mean()
Out[42]:
C
A
bar 1.000000
foo 1.666667
Yes. First group by both the A and B columns, which gives one row per unique combination of A and B. Then group that result by A and take the mean.
You can do this like:
from pandas import DataFrame
data = DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' :[2.0,1,2,1,2,1,2,1]})
data = data.groupby(['A','B'], sort=False, as_index=False).mean()
print(data.groupby('A', sort=False, as_index=False).mean())
Output:
A C
0 foo 1.666667
1 bar 1.000000
When you run data.groupby(['A'], sort=False, as_index=False).mean(), you average every value of column C within each value of A, duplicates included. That's why it returns
foo 1.8 (9/5)
bar 1.0 (3/3)
I hope this answers your question.
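This worked for me (it relies on rows with the same A and B also having the same C, so that they are exact duplicates):
test = data
test = test.drop_duplicates()
test = test.groupby(['A'])[['C']].mean()  # select C explicitly; B is not numeric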
Output:
C
A
bar 1.000000
foo 1.666667
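To make the difference between the two computations concrete, here is a sketch on the question's data. The names per_pair, per_a and plain are only illustrative:

```python
import pandas as pd

data = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": [2, 1, 2, 1, 2, 1, 2, 1],
})

# Step 1: one mean per unique (A, B) combination.
per_pair = data.groupby(["A", "B"])["C"].mean()

# Step 2: average those per-combination means within each A.
per_a = per_pair.groupby(level="A").mean()

# For contrast: the plain per-A mean weights repeated combinations more heavily.
plain = data.groupby("A")["C"].mean()

print(per_a)
print(plain)
```

For foo, the two-step version averages 2, 2 and 1 (one value per combination), while the plain version averages all five foo rows, 2, 2, 2, 2 and 1.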

Re-assignment in Pandas: Copy or view?

Say we have the following dataframe:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : randn(8), 'D' : randn(8)})
shown below:
> df
A B C D
0 foo one 0.846192 0.478651
1 bar one 2.352421 0.141416
2 foo two -1.413699 -0.577435
3 bar three 0.569572 -0.508984
4 foo two -1.384092 0.659098
5 bar two 0.845167 -0.381740
6 foo one 3.355336 -0.791471
7 foo three 0.303303 0.452966
And then I do the following:
df2 = df
df = df[df['C']>0]
If you now look at df and df2 you will see that df2 holds the original data, whereas df was updated to only keep the values where C was greater than 0.
I thought Pandas wasn't supposed to make a copy in an assignment like df2 = df and that it would only make copies with either:
df2 = df.copy(deep=True)
df2 = copy.deepcopy(df)
What happened above then? Did df2 = df make a copy? I presume that the answer is no, so it must have been df = df[df['C']>0] that made a copy, and I presume that, if I didn't have df2=df above, there would have been a copy without any reference to it floating in memory. Is that correct?
Note: I read through Returning a view versus a copy and I wonder if the following:
Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy.
explains this behavior.
df2 = df does not make a copy; it only binds a second name to the same object. It is df = df[df['C'] > 0] that returns a copy and rebinds the name df to it.
Just print out the ids and you'll see:
print(id(df))
df2 = df
print(id(df2))
df = df[df['C'] > 0]
print(id(df))
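The same check can be written with the is operator instead of comparing printed ids. This is a small sketch on a three-row stand-in frame, not the data from the question:

```python
import pandas as pd

df = pd.DataFrame({"C": [1.0, -2.0, 3.0]})  # stand-in data for illustration

df2 = df                   # second name for the same object, no copy
same_before = df2 is df

df = df[df["C"] > 0]       # boolean indexing builds a new DataFrame
same_after = df2 is df     # df now names the new object, df2 the old one

print(same_before, same_after)
print(len(df2), len(df))
```

If df2 = df were left out, the original frame would have no remaining reference after the rebinding, so Python would be free to reclaim it.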
