Say we have the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8), 'D': np.random.randn(8)})
shown below:
> df
A B C D
0 foo one 0.846192 0.478651
1 bar one 2.352421 0.141416
2 foo two -1.413699 -0.577435
3 bar three 0.569572 -0.508984
4 foo two -1.384092 0.659098
5 bar two 0.845167 -0.381740
6 foo one 3.355336 -0.791471
7 foo three 0.303303 0.452966
And then I do the following:
df2 = df
df = df[df['C']>0]
If you now look at df and df2 you will see that df2 holds the original data, whereas df was updated to keep only the rows where C is greater than 0.
I thought Pandas wasn't supposed to make a copy in an assignment like df2 = df and that it would only make copies with either:
df2 = df.copy(deep=True)
df2 = copy.deepcopy(df)
What happened above, then? Did df2 = df make a copy? I presume the answer is no, so it must have been df = df[df['C']>0] that made a copy, and I presume that, if I didn't have df2 = df above, the old data would have been left in memory with no reference to it. Is that correct?
Note: I read through Returning a view versus a copy and I wonder if the following:
Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy.
explains this behavior.
It's not that df2 = df makes a copy; it's that df = df[df['C'] > 0] returns a copy and rebinds the name df to it.
Just print out the ids and you'll see:
print(id(df))
df2 = df
print(id(df2))   # same id as df: the assignment made no copy
df = df[df['C'] > 0]
print(id(df))    # new id: the boolean indexing returned a new object
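To make the point concrete, here is a minimal sketch (assuming a toy frame with just a numeric C column); the is operator compares the same identities that id() exposes:

import numpy as np
import pandas as pd

df = pd.DataFrame({'C': np.random.randn(8)})
df2 = df
print(df2 is df)      # True: both names point at the same object

df = df[df['C'] > 0]  # boolean indexing builds a brand-new DataFrame
print(df is df2)      # False: df is now bound to the copy
print(len(df2))       # still 8: the original object is untouched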
Consider this dataframe:
import pandas as pd
import numpy as np
iterables = [['bar', 'baz', 'foo'], ['one', 'two']]
index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 6), index=['A', 'B', 'C'], columns=index)
print(df)
first bar baz foo
second one two one two one two
A -1.954583 -1.347156 -1.117026 -1.253150 0.057197 -1.520180
B 0.253937 1.267758 -0.805287 0.337042 0.650892 -0.379811
C 0.354798 -0.835234 1.172324 -0.663353 1.145299 0.651343
I would like to drop 'one' from each column, while retaining the rest of the structure.
With the end result looking something like this:
first bar baz foo
second two two two
A -1.347156 -1.253150 -1.520180
B 1.267758 0.337042 -0.379811
C -0.835234 -0.663353 0.651343
Use drop:
df.drop('one', axis=1, level=1)
first bar baz foo
second two two two
A 0.127419 -0.319655 -0.878161
B -0.563335 1.193819 -0.469539
C -1.324932 -0.550495 1.378335
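Side note: since the column levels are named here, drop should also accept the level's name rather than its integer position; a small variation on the same call:

df.drop('one', axis=1, level='second')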
This should work as well:
df.loc[:, df.columns.get_level_values(1) != 'one']
Try:
print(df.loc[:, (slice(None), "two")])
Prints:
first bar baz foo
second two two two
A -1.104831 0.286379 1.121148
B -1.637677 -2.297138 0.381137
C -1.556391 0.779042 2.316628
Use pd.IndexSlice:
indx = pd.IndexSlice
df.loc[:, indx[:, 'two']]
Output:
first bar baz foo
second two two two
A 1.169699 1.434761 0.917152
B -0.732991 -0.086613 -0.803092
C -0.813872 -0.706504 0.227000
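Yet another option along the same lines is xs; a sketch, assuming the same df (drop_level=False keeps the 'second' column level in the result instead of dropping it):

df.xs('two', axis=1, level='second', drop_level=False)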
I've got the following multiindex dataframe:
first              bar                 baz                  foo
second            one       two       one       two        one       two
first second
bar   one         NaN -0.056213  0.988634  0.103149     1.5858 -0.101334
      two    -0.47464 -0.010561  2.679586 -0.080154        <LQ -0.422063
baz   one         <LQ  0.220080  1.495349  0.302883  -0.205234  0.781887
      two    0.638597  0.276678 -0.408217 -0.083598   -1.15187 -1.724097
foo   one    0.275549 -1.088070  0.259929 -0.782472    -1.1825 -1.346999
      two    0.857858  0.783795 -0.655590 -1.969776  -0.964557 -0.220568
I would like to extract the max along one level. Expected result:
first bar baz foo
second
one 0.275549 1.495349 1.5858
two 0.857858 2.679586 -0.964557
Here is what I tried:
df.xs('one', level=1, axis=1).max(axis=0, level=1, skipna=True, numeric_only=False)
And the obtained result:
first baz
second
one 1.495349
two 2.679586
How do I get Pandas to not ignore the whole column if one cell contains a string?
(the dataframe was created like this:)
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
df.loc[('bar', 'one'), ('bar', 'one')] = np.nan
df.loc[('baz', 'one'), ('bar', 'one')] = '<LQ'
df.loc[('bar', 'two'), ('foo', 'one')] = '<LQ'
I guess you would need to replace the non-numeric values with NaN first:
(df.xs('one', level=1, axis=1)
   .apply(pd.to_numeric, errors='coerce')
   .max(level=1, skipna=True)
)
Output (with np.random.seed(1)):
first bar baz foo
second
one 0.900856 1.133769 0.865408
two 1.744812 0.319039 0.901591
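Note: the level argument of max was deprecated in pandas 1.3 and removed in 2.0, so on a recent version the equivalent is to group by the row level explicitly:

(df.xs('one', level=1, axis=1)
   .apply(pd.to_numeric, errors='coerce')
   .groupby(level=1).max()
)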
How do I subset a dataframe with the contents of another dataframes column?
df1 = pd.DataFrame({"0": ['one', 'two', 'three', 'four'], "Index": [1, 2, 3, 4]})
df2 = pd.DataFrame({"0": ['two', 'two', 'three', 'three']})
a = [i for i in df1['0'] if i in df2['0']]
Results
print(a)
[]
Desired output:
print(a)
       0  Index
0    two      2
1  three      3
If you really want to keep this strange, unnecessary list comprehension, you need to test against the underlying values; in on a Series checks the index labels, not the values, which is why you got an empty list:
>>> [i for i in df1['0'].values if i in df2['0'].values]
['two', 'three']
But to get directly what your desired output implies, just select the rows of df1 whose '0' column value is among df2's '0' column values:
>>> df1[df1['0'].isin(df2['0'])]
0 Index
1 two 2
2 three 3
And if you want a fresh integer index as in your desired output, chain on .reset_index(drop=True), as in the sketch below.
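Putting both steps together, using the frames from the question:

a = df1[df1['0'].isin(df2['0'])].reset_index(drop=True)
print(a)

which prints:

       0  Index
0    two      2
1  three      3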
I have a DataFrame, which I group.
I would like to add another column to the DataFrame holding the result of the diff function, applied per group. Something like:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df_grouped = df.groupby('B')
for name, group in df_grouped:
new_df["D_diff"] = group["D"].diff()
I would like to get, for each group, the difference of column D, and end up with a DataFrame that includes a new column with the diff calculation.
IIUC you can use DataFrameGroupBy.diff:
df['D_diff'] = df.groupby('B')['D'].diff()
print (df)
A B C D D_diff
0 foo one 1.996084 0.580177 NaN
1 bar one 1.782665 0.042979 -0.537198
2 foo two -0.359840 1.952692 NaN
3 bar three -0.909853 0.119353 NaN
4 foo two -0.478386 -0.970906 -2.923598
5 bar two -1.289331 -1.245804 -0.274898
6 foo one -1.391884 -0.555056 -0.598035
7 foo three -1.270533 0.183360 0.064007
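The same pattern extends to several columns at once; a sketch, assuming you also want the per-group difference of C (C_diff and D_diff are just illustrative names):

diffs = df.groupby('B')[['C', 'D']].diff()   # one diff per group, aligned to df's index
df['C_diff'] = diffs['C']
df['D_diff'] = diffs['D']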
I have the following pandas dataframe:
data = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' : [2, 1, 2, 1, 2, 1, 2, 1]})
that looks like:
A B C
0 foo one 2
1 bar one 1
2 foo two 2
3 bar three 1
4 foo two 2
5 bar two 1
6 foo one 2
7 foo three 1
What I need is to compute the mean over the unique combinations of A and B. For foo, those combinations are:
A B C
foo one 2
foo two 2
foo three 1
mean = 1.66666667
with the output being the means computed per value of A, i.e.:
foo 1.666667
bar 1
I tried:
data.groupby(['A'], sort=False, as_index=False).mean()
but it returns:
foo 1.8
bar 1
Is there a way to compute the mean of only unique combinations? How ?
This is essentially the same as #S_A's answer, but a bit more concise.
You can calculate the means across A and B with:
In [41]: df.groupby(['A', 'B']).mean()
Out[41]:
C
A B
bar one 1
three 1
two 1
foo one 2
three 1
two 2
And then calculate the mean of these over A with:
In [42]: df.groupby(['A', 'B']).mean().groupby(level='A').mean()
Out[42]:
C
A
bar 1.000000
foo 1.666667
Yes, here is a solution that does what you want. First group by columns A and B, which collapses each unique A/B combination to its mean. Then group that result by A and take the mean again.
You can do this like:
import pandas as pd

data = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' : [2.0, 1, 2, 1, 2, 1, 2, 1]})
data = data.groupby(['A', 'B'], sort=False, as_index=False).mean()
print(data.groupby('A', sort=False, as_index=False).mean())
Output:
A C
0 foo 1.666667
1 bar 1.000000
When you run data.groupby(['A'], sort=False, as_index=False).mean(), it averages all values of the C column per value of the A column. That's why it returns
foo 1.8 (= 9/5)
bar 1.0 (= 3/3)
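To see where those numbers come from, a quick check (this assumes data still holds the original 8-row frame from the question, since the snippet above overwrites the name):

print(data.groupby('A')['C'].agg(['sum', 'count', 'mean']))

     sum  count  mean
A
bar  3.0      3   1.0
foo  9.0      5   1.8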
I think you should find your answer :) :)
This worked for me (it relies on repeated A/B combinations having identical C values, so dropping exact duplicate rows leaves one row per combination):
test = data
test = test.drop_duplicates()
test = test.groupby(['A']).mean()
Output:
C
A
bar 1.000000
foo 1.666667