I have a df with multi-indexed columns, like this:
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data
I want to be able to select all rows where the values in one of the level 1 columns pass a certain test. If there were no multi-index on the columns I would say something like:
data[data['d']<1]
But of course that fails on a multindex. The level 1 indexes are unique, so I don't want to have to specify the level 0 index, just level 1. I'd like to return the table above but missing row 1, where d>1.
If values are unique in second level id necessary convert mask from one column DataFrame to Series - possible solution with DataFrame.squeeze:
np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
print (data.xs('d', axis=1, level=1))
two
0 1.331864
1 0.953490
2 -0.189313
3 0.064969
print (data.xs('d', axis=1, level=1).squeeze())
0 1.331864
1 0.953490
2 -0.189313
3 0.064969
Name: two, dtype: float64
print (data.xs('d', axis=1, level=1).squeeze().lt(1))
0 False
1 True
2 True
3 True
Name: two, dtype: bool
df = data[data.xs('d', axis=1, level=1).squeeze().lt(1)]
Alternative with DataFrame.iloc:
df = data[data.xs('d', axis=1, level=1).iloc[:, 0].lt(1)]
print (df)
one two
a b c d e f
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
3 0.845930 -0.503542 -0.963336 0.064969 -3.205040 1.054969
If working with MultiIndex after select is possible get multiple columns, like here if select by c level:
np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
So first select by DataFrame.xs and compare by DataFrame.lt for <
print (data.xs('c', axis=1, level=1))
two
0 1.481278 0.685609
1 -0.235634 -0.344943
2 0.244509 0.464802
3 -0.963336 1.054969
m = data.xs('c', axis=1, level=1).lt(1)
#alternative
#m = data.xs('c', axis=1, level=1) < 1
print (m)
two
0 False True
1 True True
2 True True
3 True False
And then test if at least one True per rows by DataFrame.any and filter by boolean indexing:
df1 = data[m.any(axis=1)]
print (df1)
one two
a b c a b c
0 -0.217679 0.821455 1.481278 1.331864 -0.361865 0.685609
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
3 0.845930 -0.503542 -0.963336 0.064969 -3.205040 1.054969
Or test if all Trues per row by DataFrame.any with filtering:
df1 = data[m.all(axis=1)]
print (df1)
one two
a b c a b c
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
Using your supplied data, a combination of xs and squeeze can help with the filtering. This works on the assumption that the level 1 entries are unique, as indicated in your question :
np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data
one two
a b c d e f
0 -0.217679 0.821455 1.481278 1.331864 -0.361865 0.685609
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
3 0.845930 -0.503542 -0.963336 0.064969 -3.205040 1.054969
Say you want to filter for d less than 1 :
#squeeze turns it into a series, making it easy to pass to loc via boolean indexing
condition = data.xs('d',axis=1,level=1).lt(1).squeeze()
#or you could use loc :
# condition = data.loc(axis=1)[:,'d'].lt(1).squeeze()
data.loc[condition]
one two
a b c d e f
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
3 0.845930 -0.503542 -0.963336 0.064969 -3.205040 1.054969
I think this can be done using query;
data.query("some_column <1")
and get_level_values
data[data.index.get_level_values('some_column') < 1]
Thanks everyone for your help. As usual with these things, the specific answer to the problem is not as interesting as what you've learned in trying to fix it, and I learned a lot about .query, .xs and much more.
However, I ended up taking a side route to addressing my specific issue - namely that I copied the columns to a new variable, dropped an index, did my calculations, then put the original indexes in place. Eg:
cols = data.columns
data..droplevel(level=1, axis=1)
# do calculations
data.columns = cols
The advantage was that I could top and tail the operation modifying the indexes, but all the data manipulation in between used idioms I'm familiar with.
At some point I'll sit down and read about multi-indexes at length.
Related
I have a large dataset called pop and want to return the only 2 rows that have the same value in column 'J'. I do not know what rows have the same value and do not know what the common value is... I want to return these two rows.
Without knowing the common value, this code is not helpful:
pop.loc[pop['X'] == some_value]
I tried this but it returned the entire dataset:
pop.query('X' == 'X')
Any input is appreciated...
You can do .value_counts() then get the first element, which has been sorted to be the most common value.
I'll use some dummy data here:
In [2]: df = pd.DataFrame(['a', 'b', 'c', 'd', 'b', 'f'], columns=['X'])
In [3]: df
Out[3]:
X
0 a
1 b
2 c
3 d
4 b
5 f
In [4]: wanted_value = df['X'].value_counts().index[0]
In [5]: wanted_value
Out[5]: 'b'
In [6]: df[df['X'] == wanted_value]
Out[6]:
X
1 b
4 b
For reference, df['X'].value_counts() is:
b 2
a 1
c 1
d 1
f 1
Name: X, dtype: int64
Thanks, I figured out another way that seemed a bit easier...
pop['X'].value_counts()
the top value was 21 and showed '2', indicating 21 was the duplicated value; all remaining values indicated '1', no duplicates
pop.loc[pop['X'] == 21]
returned the 2 rows with the duplicated value in column X.
I have a dataframe with a unique index and columns 'users', 'tweet_time' and 'tweet_id'.
I want to count the number of duplicate tweet_time values per user.
users = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']
tweet_times = ['01-01-01 01:00', '02-02-02 02:00', '03-03-03 03:00', '09-09-09 09:00',
'04-04-04 04:00', '04-04-04 04:00', '05-05-05 05:00', '09-09-09 09:00',
'06-06-06 06:00', '06-06-06 06:00', '07-07-07 07:00', '07-07-07 07:00']
d = {'users': users, 'tweet_times': tweet_times}
df = pd.DataFrame(data=d)
Desired Output
A: 0
B: 1
C: 2
I manage to get the desired output (except for the A: 0) using the code below. But is there a more pythonic / efficient way to do this?
# group by both columns
df2 = pd.DataFrame(df.groupby(['users', 'tweet_times']).tweet_id.count())
# filter out values < 2
df3 = df2[df2.tweet_id > 1]
# turn multi-index level 1 into column
df3.reset_index(level=[1], inplace=True)
# final groupby
df3.groupby('users').tweet_times.count()
We can use crosstab to create a frequency table then check for counts greater than 1 to create a boolean mask then sum this mask along axis=1
pd.crosstab(df['users'], df['tweet_times']).gt(1).sum(1)
users
A 0
B 1
C 2
dtype: int64
This works,
df1 = pd.DataFrame(df.groupby(['users'])['tweet_times'].value_counts()).reset_index(level = 0)
df1.groupby('users')['tweet_times'].apply(lambda x: sum(x>1))
users
A 0
B 1
C 2
Name: tweet_times, dtype: int64
you can use a custom boolean with your groupby.
the keep=False returns True when a value is duplicated and false if not.
# df['tweet_times'] = pd.to_datetime(df['tweet_times'],errors='coerce')
df.groupby([df.duplicated(subset=['tweet_times'],keep=False),'users']
).nunique().loc[True]
tweet_times
users
A 0
B 1
C 2
There might be a simpler way, but this is all I can come up with for now :)
df.groupby("users")["tweet_times"].agg(lambda x: x.count() - x.nunique()).rename("count_dupe")
Output:
users
A 0
B 1
C 2
Name: count_dupe, dtype: int64
This looks quite pythonic to me:
df.groupby("users")["tweet_times"].count() - df.groupby("users")["tweet_times"].nunique()
Output:
users
A 0
B 1
C 2
Name: tweet_times, dtype: int64
Assume a dataframe df like the following:
col1 col2
0 a A
1 b A
2 c A
3 c B
4 a B
5 b B
6 a C
7 a C
8 c C
I would like to find those values of col2 where there are duplicate entries a in col1. In this example the result should be ['C]', since for df['col2'] == 'C', col1 has two a as entries.
I tried this approach
df[(df['col1'] == 'a') & (df['col2'].duplicated())]['col2'].to_list()
but this only works, if the a within a block of rows defined by col2 is at the beginning or end of the block, depending on how you define the keep keyword of duplicated(). In this example, it returns ['B', 'C'], which is not what I want.
Use Series.duplicated only for filtered rows:
df1 = df[df['col1'] == 'a']
out = df1.loc[df1['col2'].duplicated(keep=False), 'col2'].unique().tolist()
print (out)
['C']
Another idea is use DataFrame.duplicated by both columns and chain wit hrows match only a:
out = df.loc[df.duplicated(subset=['col1', 'col2'], keep=False) &
(df['col1'] == 'a'), 'col2'].unique().tolist()
print (out)
['C']
You can group your col1 by col2 and count occurrences of 'a'
>>> s = df.col1.groupby(df.col2).sum().str.count('a').gt(1)
>>> s[s].index.values
array(['C'], dtype=object)
A more generalised solution using Groupby.count and index.get_level_values:
In [2632]: x = df.groupby(['col1', 'col2']).col2.count().to_frame()
In [2642]: res = x[x.col2 > 1].index.get_level_values(1).tolist()
In [2643]: res
Out[2643]: ['C']
I have df below:
df = pd.DataFrame({
'ID': ['a', 'a', 'a', 'b', 'c', 'c'],
'V1': [False, False, True, True, False, True],
'V2': ['A', 'B', 'C', 'B', 'B', 'C']
})
I want to achieve the following. For each unique ID, the bottom row is True (this is V1). I want to count how many times each unique value of V2 occurs where V1==True. This part would be achieved by something like:
df.groupby('V2').V1.sum()
However, I also want to add, for each unique value of V2, a column indicating how many times that value occurred after the point where V1==True for the V2 value indicated by the row. I understand this might sound confusing; here's how the output woud look like in this example:
df
V2 V1 A B C
0 A 0 0 0 0
1 B 1 0 0 0
2 C 2 1 2 0
It is important that the solution is general enough to be applicable on a similar case with more unique values than just A, B and C.
UPDATE
As a bonus, I am also interested in how, instead of the count, one can instead return the sum of some value column, under the same conditions, divided by the corresponding "count" in the rows. Example: suppose we now depart from df below instead:
df = pd.DataFrame({
'ID': ['a', 'a', 'a', 'b', 'c', 'c'],
'V1': [False, False, True, True, False, True],
'V2': ['A', 'B', 'C', 'B', 'B', 'C'],
'V3': [1, 2, 3, 4, 5, 6],
})
The output would need to sum V3 for the cases indicated by the counts in the solution by #jezrael, and divide that number by V1. The output would instead look like:
df
V2 V1 A B C
0 A 0 0 0 0
1 B 1 0 0 0
2 C 2 1 3.5 0
First aggregate sum:
df1 = df.groupby('V2').V1.sum().astype(int).reset_index()
print (df1)
V2 V1
0 A 0
1 B 1
2 C 2
Then grouping by ID and create heper column by last value by GroupBy.transform and last, then remove last rows of ID by Series.duplicated and use crosstab for counts, add all possible unique values of V2 and last append to df1 by DataFrame.join:
val = df['V2'].unique()
df['new'] = df.groupby('ID').V2.transform('last')
df = df[df.duplicated('ID', keep='last')]
df = pd.crosstab(df['new'], df['V2']).reindex(columns=val, index=val, fill_value=0)
df = df1.join(df, on='V2')
print (df)
V2 V1 A B C
0 A 0 0 0 0
1 B 1 0 0 0
2 C 2 1 2 0
UPDATE
The updated part of the question should be possible to achieve by changing the crosstab part with pivot table:
df = df.pivot_table(
index='n',
columns='V2',
aggfunc=({
'V3': 'mean'
})
).V3.reindex(columns=v, index=v, fill_value=0)
Note: This question is inspired by the ideas discussed in this other post: DataFrame algebra in Pandas
Say I have two dataframes A and B and that for some column col_name, their values are:
A[col_name] | B[col_name]
--------------| ------------
1 | 3
2 | 4
3 | 5
4 | 6
I want to compute the set difference between A and B based on col_name. The result of this operation should be:
The rows of A where A[col_name] didn't match any entries in B[col_name].
Below is the result for the above example (showing other columns of A as well):
A[col_name] | A[other_column_1] | A[other_column_2]
------------+-------------------|------------------
1 | 'foo' | 'xyz' ....
2 | 'bar' | 'abc'
Keep in mind that some entries in A[col_name] and B[col_name] could hold the value np.NaN. I would like to treat those entries as undefined BUT different, i.e. the set difference should return them.
How can I do this in Pandas? (generalizing to a difference on multiple columns would be great as well)
One way is to use the Series isin method:
In [11]: df1 = pd.DataFrame([[1, 'foo'], [2, 'bar'], [3, 'meh'], [4, 'baz']], columns = ['A', 'B'])
In [12]: df2 = pd.DataFrame([[3, 'a'], [4, 'b']], columns = ['A', 'C'])
Now you can check whether each item in df1['A'] is in of df2['A']:
In [13]: df1['A'].isin(df2['A'])
Out[13]:
0 False
1 False
2 True
3 True
Name: A, dtype: bool
In [14]: df1[~df1['A'].isin(df2['A'])] # not in df2['A']
Out[14]:
A B
0 1 foo
1 2 bar
I think this does what you want for NaNs too:
In [21]: df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [3, 'meh'], [np.nan, 'baz']], columns = ['A', 'B'])
In [22]: df2 = pd.DataFrame([[3], [np.nan]], columns = ['A'])
In [23]: df1[~df1['A'].isin(df2['A'])]
Out[23]:
A B
0 1.0 foo
1 NaN bar
3 NaN baz
Note: For large frames it may be worth making these columns an index (to perform the join as discussed in the other question).
More generally
One way to merge on two or more columns is to use a dummy column:
In [31]: df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [4, 'meh'], [np.nan, 'eurgh']], columns = ['A', 'B'])
In [32]: df2 = pd.DataFrame([[np.nan, 'bar'], [4, 'meh']], columns = ['A', 'B'])
In [33]: cols = ['A', 'B']
In [34]: df2['dummy'] = df2[cols].isnull().any(1) # rows with NaNs in cols will be True
In [35]: merged = df1.merge(df2[cols + ['dummy']], how='left')
In [36]: merged
Out[36]:
A B dummy
0 1 foo NaN
1 NaN bar True
2 4 meh False
3 NaN eurgh NaN
The booleans were present in df2, the True has an NaN in one of the merging columns. Following your spec, we should drop those which are False:
In [37]: merged.loc[merged.dummy != False, df1.columns]
Out[37]:
A B
0 1 foo
1 NaN bar
3 NaN eurgh
Inelegant.
Here is one option that is also not elegant since it pre-maps the NaN values to some other value (0) so that they can be used as an index:
def left_difference(L, R, L_on, R_on, NULL_VALUE):
L[L_on] = L[L_on].fillna(NULL_VALUE)
L.set_index(L_on, inplace=True)
R[R_on] = R[R_on].fillna(NULL_VALUE)
R.set_index(R_on, inplace=True)
# MultiIndex difference:
diff = L.ix[L.index - R.index]
diff = diff.reset_index()
return diff
To make this work peroperly, NULL_VALUE should be a value not used by L_on nor R_on.