Sum over all unique unordered combinations in a MultiIndex - python

Question
I want to find all unique, unordered combinations of the MultiIndex in which element 'A' is present and sum these rows.
Data
I have the following dataframe:
import pandas as pd

df = pd.DataFrame({'col1': [2, 4, 6, 3], 'col2': [8, -2, 5, 3]},
                  index=pd.MultiIndex.from_tuples([('A', 'B'), ('B', 'A'), ('A', 'C'), ('C', 'A')]))
df.index.names = ['from', 'to']
Output:
         col1  col2
from to
A    B      2     8
B    A      4    -2
A    C      6     5
C    A      3     3
Expected Output
         col1  col2
from to
A    B      6     6
     C      9     8
The first row of the expected output is the sum of the first two input rows. They are added because, once order is ignored, the two index pairs are identical: sorted(['A', 'B']) == sorted(['B', 'A']). Likewise, the second row of the output is the sum of the third and fourth input rows.
Ugly, unreadable code
My solution builds two new columns 'group1' and 'group2', which are ordered for the sole purpose of grouping the data. Yes, I could refactor my two functions into one lambda function to have fewer lines of code. However, I don't mind a few extra lines of code to use apply(). My problem is that the solution is almost unintelligible to any reader (this includes me in a few weeks).
The code produces the expected output, but ... (to quote Raymond Hettinger) There Must Be a Better Way!
def switch_cols(col1: str, col2: str, search_word):
    if col2 == search_word:
        return [col2, col1]
    return [col1, col2]

def apply_switch_cols(df):
    return switch_cols(df['from'], df['to'], search_word='A')

df['group1'] = ''
df['group2'] = ''
df[['group1', 'group2']] = df.reset_index().apply(apply_switch_cols, axis=1).to_list()
df = df.groupby(['group1', 'group2']).sum()
df.index.names = ['from', 'to']

Here is one way:
df = pd.DataFrame({'col1': [2, 4, 6, 3], 'col2': [8, -2, 5, 3]},
                  index=pd.MultiIndex.from_tuples([('A', 'B'), ('B', 'A'), ('A', 'C'), ('C', 'A')]))
df.index.names = ['from', 'to']
df = df.reset_index()
df.index = pd.MultiIndex.from_tuples(df[['from', 'to']].apply(lambda x: sorted(x), axis=1))
# select the numeric columns so the string columns 'from'/'to' are not summed
df[['col1', 'col2']].groupby(level=[0, 1]).sum()
Output:
     col1  col2
A B     6     6
  C     9     8
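An alternative along the same lines (a sketch, starting again from the original df with its two-level index and assuming the levels 'from' and 'to' are always sortable): sort each index pair row-wise with numpy, rebuild the MultiIndex, and group on it.
import numpy as np
import pandas as pd

# sort each ('from', 'to') pair so that, e.g., ('B', 'A') becomes ('A', 'B')
pairs = np.sort(df.index.to_frame().to_numpy(), axis=1)
df.index = pd.MultiIndex.from_arrays([pairs[:, 0], pairs[:, 1]], names=['from', 'to'])
result = df.groupby(level=['from', 'to']).sum()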

Related

List of tuples for each pandas dataframe slice

I need to do something very similar to this question: Pandas convert dataframe to array of tuples
The difference is I need to get not only a single list of tuples for the entire DataFrame, but a list of lists of tuples, sliced based on some column value.
Supposing this is my data set:
    t_id    A      B
   -----  ----  -----
0   AAAA    1    2.0
1   AAAA    3    4.0
2   AAAA    5    6.0
3   BBBB    7    8.0
4   BBBB    9   10.0
...
I want to produce as output:
[[(1,2.0), (3,4.0), (5,6.0)],[(7,8.0), (9,10.0)]]
That is, one list for 'AAAA', another for 'BBBB' and so on.
I've tried with two nested for loops. It seems to work, but it is taking too long (actual data set has ~1M rows):
result = []
for t in df['t_id'].unique():
tuple_list= []
for x in df[df['t_id' == t]].iterrows():
row = x[1][['A', 'B']]
tuple_list.append(tuple(x))
result.append(tuple_list)
Is there a faster way to do it?
You can groupby column t_id, iterate through groups and convert each sub dataframe into a list of tuples:
[g[['A', 'B']].to_records(index=False).tolist() for _, g in df.groupby('t_id')]
# [[(1, 2.0), (3, 4.0), (5, 6.0)], [(7, 8.0), (9, 10.0)]]
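A variant of the same idea (just a sketch; zip often builds plain Python tuples a bit faster than to_records on large frames):
[list(zip(g['A'], g['B'])) for _, g in df.groupby('t_id')]
# [[(1, 2.0), (3, 4.0), (5, 6.0)], [(7, 8.0), (9, 10.0)]]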
I think this should work too:
import pandas as pd
import itertools
df = pd.DataFrame({"A": [1, 2, 3, 1], "B": [2, 2, 2, 2], "C": ["A", "B", "C", "B"]})
tuples_in_df = sorted(tuple(df.to_records(index=False)), key=lambda x: x[0])
output = [[tuple(x)[1:] for x in group] for _, group in itertools.groupby(tuples_in_df, lambda x: x[0])]
print(output)
Out:
[[(2, 'A'), (2, 'B')], [(2, 'B')], [(2, 'C')]]

Return a value in Pandas by index row number and column name?

I have a DataFrame where the index values are identical strings.
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
index=['a', 'a', 'a'], columns=['A', 'B', 'C'])
>>> df
A B C
a 0 2 3
a 0 4 1
a 10 20 30
Let's say I am trying to access the value in col 'B' at the first row. I am using something like this:
>>> df.iloc[0]['B']
2
Reading the post here it seems .at is recommended to be used for efficiency. Is there any better way in my example to return the value by the index row number and column name?
Try iat together with get_indexer:
df.iat[0,df.columns.get_indexer(['B'])[0]]
Out[124]: 2
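If the column label is unique, get_loc is a slightly simpler variant, since it returns a single integer position and no [0] indexing is needed:
df.iat[0, df.columns.get_loc('B')]
# 2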

Remove 'nan' from Dictionary of list

My data contains columns with empty rows that are read by pandas as nan.
I want to create a dictionary of lists from this data. However, some lists contain nan and I want to remove it.
If I use dropna() as in data.dropna().to_dict(orient='list'), this removes every row that contains at least one nan, and therefore I lose data.
Col1 Col2 Col3
a    x    r
b    y    v
c         x
          z
data = pd.read_csv(sys.argv[2], sep = ',')
dict = data.to_dict(orient='list')
Current output:
dict = {Col1: ['a','b','c',nan], Col2: ['x', 'y',nan,nan], Col3: ['r', 'v', 'x', 'z']}
Desired Output:
dict = {Col1: ['a','b','c'], Col2: ['x', 'y'], Col3: ['r', 'v', 'x', 'z']}
My goal: get a dictionary of lists, with nan removed from each list.
Not sure exactly the format you're expecting, but you can use list comprehension and itertuples to do this.
First create some data.
import pandas as pd
import numpy as np
data = pd.DataFrame.from_dict({'Col1': (1, 2, 3), 'Col2': (4, 5, 6), 'Col3': (7, 8, np.nan)})
print(data)
Giving a data frame of:
Col1 Col2 Col3
0 1 4 7.0
1 2 5 8.0
2 3 6 NaN
And then we create the dictionary using the iterator.
dict_1 = {x[0]: [y for y in x[1:] if not pd.isna(y)] for x in data.itertuples(index=True) }
print(dict_1)
>>>{0: [1, 4, 7.0], 1: [2, 5, 8.0], 2: [3, 6]}
To do the same for the columns is even easier:
dict_2 = {data[column].name: [y for y in data[column] if not pd.isna(y)] for column in data}
print(dict_2)
>>>{'Col1': [1, 2, 3], 'Col2': [4, 5, 6], 'Col3': [7.0, 8.0]}
I am not sure if I understand your question correctly, but if what you want is to replace the nan with a value so as not to lose your data, then what you are looking for is the pandas.DataFrame.fillna function. You mentioned the original values are empty rows, so data.fillna('') fills them with empty strings.
EDIT: After you provided the desired output, the answer to your question changes a bit. What you'll need is a dict comprehension combined with a list comprehension to build the dictionary, looping over columns and filtering out nan. I see that Andrew already provided the code to do this in his answer, so have a look there.
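For reference, a compact sketch of that per-column approach (assuming the frame is named data, as in the question): each column drops its own NaN values before being turned into a list, so nothing is lost across columns.
clean = {col: data[col].dropna().tolist() for col in data.columns}
# with the example frame from the first answer: {'Col1': [1, 2, 3], 'Col2': [4, 5, 6], 'Col3': [7.0, 8.0]}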

How can I get the name of grouping columns from a Pandas GroupBy object?

Suppose I have the following dataframe:
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Bar=[1, 2, 3, 4]))
i.e.:
Bar Foo
0 1 A
1 2 A
2 3 B
3 4 B
Then I create a pandas.GroupBy object:
g = df.groupby('Foo')
How can I get, from g, the fact that g is grouped by a column originally named Foo?
If I do g.groups I get:
{'A': Int64Index([0, 1], dtype='int64'),
'B': Int64Index([2, 3], dtype='int64')}
That tells me the values that the Foo column takes ('A' and 'B') but not the original column name.
Now, I can just do something like:
g.first().index.name
But it seems odd that there's not an attribute of g with the group name in it, so I feel like I must be missing something. In particular, if g was grouped by multiple columns, then the above doesn't work:
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Baz=['C', 'D', 'C', 'D'], Bar=[1, 2, 3, 4]))
g = df.groupby(['Foo', 'Baz'])
g.first().index.name # returns None, because it's a MultiIndex
g.first().index.names # returns ['Foo', 'Baz']
For context, I am trying to do some plotting with a grouped dataframe, and I want to be able to label each facet (which is plotting a single group) with the name of that group as well as the group label.
Is there a better way?
Query the names attribute of the GroupBy object's grouper (a BaseGrouper) to get a list of all grouping names:
df.groupby('Foo').grouper.names
Which gives,
['Foo']
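The same attribute also covers the multi-column case from the question (a sketch; grouper is an internal attribute, so newer pandas versions may warn about or deprecate it):
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Baz=['C', 'D', 'C', 'D'], Bar=[1, 2, 3, 4]))
df.groupby(['Foo', 'Baz']).grouper.names
# ['Foo', 'Baz']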

How to turn pandas dataframe row into ordereddict fast

Looking for a fast way to get a row of a pandas dataframe into an OrderedDict without using lists. Lists are fine, but with large data sets it takes too long. I am using the fiona GIS reader, and the rows are OrderedDicts with the schema giving the data types. I use pandas to join data. In many cases the rows will have different types, so I was thinking that turning them into a numpy array with type string might do the trick.
This is implemented in pandas 0.21.0+ in the to_dict function with the into parameter:
import pandas as pd
from collections import OrderedDict

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
print (df)
a b
0 1 2
1 3 4
d = df.to_dict(into=OrderedDict, orient='index')
print (d)
OrderedDict([(0, OrderedDict([('a', 1), ('b', 2)])), (1, OrderedDict([('a', 3), ('b', 4)]))])
Unfortunately you can't just use apply (since it coerces the result back into a DataFrame):
In [1]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
In [2]: df
Out[2]:
a b
0 1 2
1 3 4
In [3]: from collections import OrderedDict
In [4]: df.apply(OrderedDict)
Out[4]:
a b
0 1 2
1 3 4
But you can use a list comprehension with iterrows:
In [5]: [OrderedDict(row) for i, row in df.iterrows()]
Out[5]: [OrderedDict([('a', 1), ('b', 2)]), OrderedDict([('a', 3), ('b', 4)])]
If whatever you're passing the result to can accept a generator rather than a list, this will usually be more efficient:
In [6]: (OrderedDict(row) for i, row in df.iterrows())
Out[6]: <generator object <genexpr> at 0x10466da50>
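For example (a small usage sketch, assuming whatever consumes the rows can take them one at a time), the generator can be looped over directly without ever materializing the full list:
import pandas as pd
from collections import OrderedDict

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])

# each row streams through as an OrderedDict, one at a time
for record in (OrderedDict(row) for _, row in df.iterrows()):
    print(record)
# OrderedDict([('a', 1), ('b', 2)])
# OrderedDict([('a', 3), ('b', 4)])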
