Advanced MultiIndex sorting and indexing - python

I have data with more than 100k rows and I need to efficiently regroup it from the left DataFrame into the multiindexed right one, where the indices are sorted by the sum of values in the third column, and within each index the second-column values are sorted by their third-column values. All sortings are descending.
I have no idea how to do it correctly and have already spent a whole day figuring it out.
a    b    c            a    sum  b    c    %
foo  one  1            foo  5    one  3    3/5
foo  two  2                      two  2    2/5
bar  one  1     =>     baz  4    two  3    3/4
baz  one  1                      one  1    1/4
baz  two  3            bar  3    six  2    2/3
foo  one  2                      one  1    1/3
bar  six  2
UPDATE:
The code given by @jezrael works really well, but it outputs like this:
                    %
a    sum  b    c
foo  5    one  3    0.60
          two  2    0.40
          six  NaN  NaN
baz  4    two  3    0.75
          one  1    0.25
          six  NaN  NaN
bar  1    one  1    1.00
          two  NaN  NaN
          six  NaN  NaN
Is it possible to get rid of these rows with NaN?
UPDATE #2:
I've found what causes the NaN rows: the 'category' data type. How it affects the behavior of the code I don't know; I'm just pointing out the cause.
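For reference, a minimal sketch of the cause and a possible fix (my own illustration, not from the answer): when 'a' and 'b' are categorical, groupby emits every combination of categories by default, which is exactly where the all-NaN rows come from; the observed=True parameter of groupby (available since pandas 0.23) keeps only combinations that actually occur.
# sketch: categorical columns make groupby produce the full cartesian
# product of categories, hence the all-NaN rows above
df['a'] = df['a'].astype('category')
df['b'] = df['b'].astype('category')
# observed=True keeps only the (a, b) pairs actually present in the data
df = df.groupby(['a', 'b'], observed=True, as_index=False)['c'].sum()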

I believe you need:
#aggregate sum by a, b columns
df = df.groupby(['a','b'], as_index=False)['c'].sum()
print (df)
a b c
0 bar one 1
1 baz one 1
2 baz two 3
3 foo one 3
4 foo two 2
#create new column at position 1 with the transformed sum per 'a' group
df.insert(1, 'sum', df.groupby('a')['c'].transform('sum'))
#division of columns
df['%'] = df['c'].div(df['sum'])
print (df)
a sum b c %
0 bar 1 one 1 1.00
1 baz 4 one 1 0.25
2 baz 4 two 3 0.75
3 foo 5 one 3 0.60
4 foo 5 two 2 0.40
#sorting by multiple columns and create MultiIndex
df = df.sort_values(['sum','c'], ascending=False).set_index(['a','sum','b', 'c'])
print (df)
                    %
a    sum  b    c
foo  5    one  3    0.60
          two  2    0.40
baz  4    two  3    0.75
          one  1    0.25
bar  1    one  1    1.00
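For completeness, the same steps can be chained into a single expression. A sketch, assuming pandas >= 0.23 so that later assign keyword arguments may reference columns created by earlier ones:
out = (df.groupby(['a', 'b'], as_index=False)['c'].sum()
         .assign(sum=lambda d: d.groupby('a')['c'].transform('sum'),
                 **{'%': lambda d: d['c'] / d['sum']})
         .sort_values(['sum', 'c'], ascending=False)
         .set_index(['a', 'sum', 'b', 'c']))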

Related

How to delete different column names with duplicated values?

Given this DF:
a b c d
1 2 1 4
4 3 4 2
foo bar foo yes
What is the best way to delete identical columns that have different names in a large pandas DF? For example:
a b d
1 2 4
4 3 2
foo bar yes
Column c was removed from the above dataframe because a and c were the same column but with different names. So far I have tried
df = df.iloc[:, ~df.columns.duplicated()]
However, it is not clear to me how to check the row values inside the DF.
Use transpose as below:
df.T.drop_duplicates().T
I tried a straightforward approach - loop through the column names and compare each column with all the others. Use np.all for an exact match. This approach took only 336 ms.
import numpy as np

repeated_columns = []
for i, column in enumerate(df.columns):
    r_columns = df.columns[i+1:]
    for r_c in r_columns:
        if np.all(df[column] == df[r_c]):
            repeated_columns.append(r_c)
new_columns = [x for x in df.columns if x not in repeated_columns]
df[new_columns]
It will give you the following output:
a b d
0 1 2 4
1 4 3 2
2 foo bar yes
df.loc[:,~df.T.duplicated()]
a b d
0 1 2 4
1 4 3 2
2 foo bar yes
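On very wide frames, the double transpose can be slow and upcasts mixed dtypes to object. A hash-first alternative is sketched below; drop_duplicate_columns is a hypothetical helper name (not part of pandas), while pd.util.hash_pandas_object is a real pandas utility. Only columns whose hashes collide are compared element-wise:
import pandas as pd

def drop_duplicate_columns(df):
    # hash every column once instead of comparing all pairs
    seen = {}   # hash -> names of kept columns with that hash
    keep = []
    for col in df.columns:
        h = pd.util.hash_pandas_object(df[col], index=False).sum()
        # collision guard: confirm real equality before dropping
        if any(df[col].equals(df[other]) for other in seen.get(h, [])):
            continue
        seen.setdefault(h, []).append(col)
        keep.append(col)
    return df[keep]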

Remove rows if they exist in the previous group

I have a GroupBy object. I want to remove rows from the current group if the same row exists in the previous group. Let's say this is the (n-1)th group:
A B
0 foo 0
1 baz 1
2 foo 1
3 bar 1
And this is the n-th group:
A B
0 foo 2
1 foo 1
2 baz 1
3 baz 3
After dropping all duplicates, this is the result for the n-th group:
A B
0 foo 2
3 baz 3
EDIT:
I would like to achieve it without a loop if possible
I am using merge with indicator here:
yourdf = dfn.merge(df1, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
yourdf
A B _merge
0 foo 2 left_only
3 baz 3 left_only
#yourdf.drop('_merge', axis=1, inplace=True)
Since it is a GroupBy object, you can do this with a for loop, applying the code above for each consecutive pair of groups (see the sketch below).
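A sketch of that loop, assuming grouped is your GroupBy object and each group is compared against the previous group's original rows; note that merge resets the row index, and prev.drop_duplicates() guards against row multiplication when the previous group contains repeated rows:
import pandas as pd

pieces = []
prev = None
for key, cur in grouped:
    out = cur
    if prev is not None:
        out = (cur.merge(prev.drop_duplicates(), how='left', indicator=True)
                  .loc[lambda x: x['_merge'] == 'left_only']
                  .drop(columns='_merge'))
    pieces.append(out)
    prev = cur   # next iteration compares against this group's original rows
result = pd.concat(pieces, ignore_index=True)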

operations on multiindex dataframe

I would like to add columns into a Pandas multiindex dataframe, which will contain the result of an operation performed on other columns.
I have a dataframe similar to this one:
first   bar       baz
second  one  two  one  two
A       5    2    9    2
B       6    4    7    6
C       5    4    5    1
Now, for each group in the dataframe, I'd like to add a column "three" which equals column "one" minus column "two":
first   bar              baz
second  one  two  three  one  two  three
A       5    2    3      9    2    7
B       6    4    2      7    6    1
C       5    4    1      5    1    4
In reality my dataframe is much larger. I'm struggling to find the answer to this (hopefully) easy question. Any suggestions are appreciated.
Use DataFrame.xs to select the one and two levels and subtract, then create a MultiIndex in the columns with MultiIndex.from_product:
df1 = df.xs('one', axis=1, level=1) - df.xs('two', axis=1, level=1)
df1.columns = pd.MultiIndex.from_product([df1.columns, ['three']])
print (df1)
   bar    baz
   three  three
A  3      7
B  2      1
C  1      4
Then concat to the original, and to change the ordering, reindex by a helper MultiIndex:
mux = pd.MultiIndex.from_product([['bar','baz'], ['one','two','three']],
names=df.columns.names)
df = pd.concat([df, df1], axis=1).reindex(columns=mux)
print (df)
first   bar              baz
second  one  two  three  one  two  three
A       5    2    3      9    2    7
B       6    4    2      7    6    1
C       5    4    1      5    1    4
Create the DataFrame to append by using a MultiIndex:
s = pd.DataFrame([[1, 2], [2, 3], [3, 4]],
                 columns=pd.MultiIndex.from_arrays([['bar', 'baz'], ['three', 'three']]))
s
Out[458]:
   bar    baz
   three  three
0  1      2
1  2      3
2  3      4
Then we do concat:
yourdf = pd.concat([df, s], axis=1).sort_index(level=0, axis=1)
If the order matters, you can reindex, or you may consider factorizing the level.
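Another sketch of my own (not from the answers above): stack the first column level so the subtraction is written once for all groups, then unstack back; the levels return alphabetically sorted, so reindex with the mux helper from the first answer if the exact one/two/three order matters:
# move the first column level ('bar', 'baz') into the index
tmp = df.stack(0)
# plain column arithmetic now covers every group at once
tmp['three'] = tmp['one'] - tmp['two']
# restore the original column layout
df2 = tmp.unstack().swaplevel(axis=1).sort_index(axis=1)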

Get the count for each subgroup in a multiply grouped pandas.DataFrame aggregated on one group

I have a DataFrame with two columns "A" and "B".
A B
0 foo one
1 bar one
2 foo two
3 bar one
4 foo two
5 bar two
6 foo one
7 foo one
8 xyz one
For each group in "A", I'm trying to get the count of each value of "B", i.e. each sub-group of B, but aggregated on the grouping of "A".
The result should look like this:
A B countOne countTwo
0 foo one 3 2
1 bar one 2 1
2 foo two 3 2
3 bar one 2 1
4 foo two 3 2
5 bar two 2 1
6 foo one 3 2
7 foo one 3 2
8 xyz one 1 0
I have tried several approaches to no avail; so far I'm using this approach:
A_grouped = df.groupby(['A', 'B'])['A'].count()
A_grouped_ones = A_grouped[:,'one']
A_grouped_twos = A_grouped[:,'two']
df['countOne'] = df['A'].map(lambda a: A_grouped_ones[a] if a in A_grouped_ones else 0)
df['countTwo'] = df['A'].map(lambda a: A_grouped_twos[a] if a in A_grouped_twos else 0)
However, this seems horribly inefficient to me. Is there a better solution?
You can use unstack with add_prefix for the new DataFrame and join it to the original:
df1 = df.groupby(['A', 'B'])['A'].count().unstack(fill_value=0).add_prefix('count_')
print (df1)
B    count_one  count_two
A
bar  2          1
foo  3          2
xyz  1          0
df = df.join(df1, on='A')
print (df)
A B count_one count_two
0 foo one 3 2
1 bar one 2 1
2 foo two 3 2
3 bar one 2 1
4 foo two 3 2
5 bar two 2 1
6 foo one 3 2
7 foo one 3 2
8 xyz one 1 0
Another alternative is to use size:
df1 = df.groupby(['A', 'B']).size().unstack(fill_value=0).add_prefix('count_')
The difference is that size includes NaN values and count does not - check this answer.
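A further variant (my own sketch, not from the answer): pd.crosstab builds the same zero-filled count table directly, which can then be joined on as above:
counts = pd.crosstab(df['A'], df['B']).add_prefix('count_')
df = df.join(counts, on='A')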

make 2d array from pandas dataframe

I have a pandas dataframe here with two columns: participant names and reaction times (note that one participant has multiple measures of his RT).
ID RT
0 foo 1
1 foo 2
2 bar 3
3 bar 4
4 foo 1
5 foo 2
6 bar 3
7 bar 4
8 bar 4
I would like to get a 2d array from this where every row contains the reaction times for one participant.
[[1, 2, 1, 2],
 [3, 4, 3, 4, 4]]
In case it's not possible to have a shape like that, the following options for obtaining a workable a x b shape would be acceptable to me: fill missing elements with NaN; truncate the longer rows to the size of the shorter rows; or fill the shorter rows with repeats of their mean value.
I would go for whatever is easiest to implement.
I have tried to sort this out using groupby, and I expected it to be very easy, but it all gets terribly messy :(
import pandas as pd
import io

# StringIO (text), not BytesIO, since this is a str literal under Python 3
data = io.StringIO(""" ID RT
0 foo 1
1 foo 2
2 bar 3
3 bar 4
4 foo 1
5 foo 2
6 bar 3
7 bar 4
8 bar 4""")
df = pd.read_csv(data, delim_whitespace=True)
df.groupby("ID").RT.apply(pd.Series.reset_index, drop=True).unstack()
output:
      0  1  2  3    4
ID
bar   3  4  3  4    4
foo   1  2  1  2  NaN
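If you want the ragged list-of-lists from the question rather than the NaN-padded frame, a short sketch (groups come back in sorted key order, so bar precedes foo):
# one list of RTs per participant, in sorted ID order
rows = df.groupby('ID')['RT'].apply(list).tolist()
# [[3, 4, 3, 4, 4], [1, 2, 1, 2]]
For the NaN-padded 2D array, call .values on the unstacked result above.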
