Python - group by multiple columns

I have a list of lists - representing a table with 4 columns and many rows (10000+).
Each sub-list contains 4 variables.
Here is a small part of my table:
['1810569', 'a', 5, '1241.52']
['1437437', 'a', 5, '1123.90']
['1437437', 'b', 5, '1232.43']
['1810569', 'b', 5, '1321.31']
['1810569', 'a', 5, '1993.52']
The first column represents the household ID, and the second represents the member ID within the household.
The fourth column contains weights that I want to sum separately for each member.
For the example above I want the output to be:
['1810569', 'a', 5, '3235.04']
['1437437', 'a', 5, '1123.90']
['1437437', 'b', 5, '1232.43']
['1810569', 'b', 5, '1321.31']
In other words - to sum the weights in lines 1 and 5, since they belong to the same member, while all the other members are distinct.
I saw something about groupby in pandas - but I didn't understand exactly how to use it for my problem.

Assuming the following is your list, then the following would work:
In [192]:
l=[['1810569', 'a', 5, '1241.52'],
['1437437', 'a', 5, '1123.90'],
['1437437', 'b', 5, '1232.43'],
['1810569', 'b', 5, '1321.31'],
['1810569', 'a', 5, '1993.52']]
l
Out[192]:
[['1810569', 'a', 5, '1241.52'],
['1437437', 'a', 5, '1123.90'],
['1437437', 'b', 5, '1232.43'],
['1810569', 'b', 5, '1321.31'],
['1810569', 'a', 5, '1993.52']]
In [201]:
# construct the df and convert the last column to float
import pandas as pd
df = pd.DataFrame(l, columns=['household ID', 'Member ID', 'some col', 'weights'])
df['weights'] = df['weights'].astype(float)
df
Out[201]:
  household ID Member ID  some col  weights
0      1810569         a         5  1241.52
1      1437437         a         5  1123.90
2      1437437         b         5  1232.43
3      1810569         b         5  1321.31
4      1810569         a         5  1993.52
So we can now groupby on the household and member id and call sum on the 'weights' column:
In [200]:
df.groupby(['household ID', 'Member ID'])['weights'].sum().reset_index()
Out[200]:
  household ID Member ID  weights
0      1437437         a  1123.90
1      1437437         b  1232.43
2      1810569         a  3235.04
3      1810569         b  1321.31
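If you need the summed result back in the original list-of-lists format, you can group on the first three columns so they all survive into the output, then convert; a minimal sketch, assuming the same list l as above:

```python
import pandas as pd

l = [['1810569', 'a', 5, '1241.52'],
     ['1437437', 'a', 5, '1123.90'],
     ['1437437', 'b', 5, '1232.43'],
     ['1810569', 'b', 5, '1321.31'],
     ['1810569', 'a', 5, '1993.52']]

df = pd.DataFrame(l, columns=['household ID', 'Member ID', 'some col', 'weights'])
df['weights'] = df['weights'].astype(float)

# Group on the first three columns so they survive into the output,
# then turn the aggregated frame back into a list of lists.
grouped = df.groupby(['household ID', 'Member ID', 'some col'],
                     as_index=False)['weights'].sum()
rows = grouped.values.tolist()
```

Note that groupby sorts the keys, so rows comes back ordered by household and member ID, and the weights are floats rather than the original strings.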

You could do it with a dict, using the first three elements as keys to group the data by:
d = {}
for k, b, c, w in l:
    if (k, b, c) in d:
        d[k, b, c][-1] += float(w)
    else:
        d[k, b, c] = [k, b, c, float(w)]
from pprint import pprint as pp
pp(list(d.values()))
Output:
[['1810569', 'b', 5, 1321.31],
['1437437', 'b', 5, 1232.43],
['1437437', 'a', 5, 1123.9],
['1810569', 'a', 5, 3235.04]]
If you wanted to maintain a first-seen order (on Python < 3.7, where plain dicts don't preserve insertion order):
from collections import OrderedDict
d = OrderedDict()
for k, b, c, w in l:
    if (k, b, c) in d:
        d[k, b, c][-1] += float(w)
    else:
        d[k, b, c] = [k, b, c, float(w)]
from pprint import pprint as pp
pp(list(d.values()))
Output:
[['1810569', 'a', 5, 3235.04],
['1437437', 'a', 5, 1123.9],
['1437437', 'b', 5, 1232.43],
['1810569', 'b', 5, 1321.31]]

Related

Creating dictionary from dataframe with columns as values python

I am trying to create a dictionary from a dataframe.
from pandas import util
df= util.testing.makeDataFrame()
df.index.name = 'name'
                   A         B         C         D
name
qSfQX3rj48  0.184091 -1.195861  0.998988 -0.970523
KSYYLUGiJB -0.998997 -0.387378 -0.303704  0.833731
PmsVVmRbQX -1.510940 -1.062814  0.934954  0.970467
oHjAqjAv1P -1.366054  0.595680 -1.039310 -0.126625
a1cU5c4psT -0.486282 -0.369012 -0.284495 -1.263010
qnqmltdFGR -0.041243 -0.792538  0.234809  0.894919
df.to_dict()
{'A': {'qSfQX3rj48': 0.1840905950693832,
'KSYYLUGiJB': -0.9989969426889559,
'PmsVVmRbQX': -1.5109402125881068,
'oHjAqjAv1P': -1.3660539127241154,
'a1cU5c4psT': -0.48628192605203563,
'qnqmltdFGR': -0.04124312561281138,
...
The to_dict method above uses the column names as keys.
dict_keys(['A', 'B', 'C', 'D'])
How can I set it to a dict where the columns A B C D are the values for the name column? Thus it will have just 1 key.
A B C D
name
qSfQX3rj48 0.184091 -1.195861 0.998988 -0.970523
It should produce a dictionary with a list of values per name:
{'qSfQX3rj48': [0.184091, -1.195861, 0.998988, -0.970523],
 'KSYYLUGiJB': [-0.998997, -0.387378, -0.303704, 0.833731],
 ...
plus one extra entry whose values are the column names:
 'name': ['A', 'B', 'C', 'D']}
d = df.T.to_dict('list')
d[df.index.name] = df.columns.tolist()
 
Example
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['A', 'B', 'C', 'D'],
                  index=['one', 'two', 'three'])
df.index.name = 'name'
df:
       A  B   C   D
name
one    0  1   2   3
two    4  5   6   7
three  8  9  10  11
d:
{'one': [0, 1, 2, 3],
'two': [4, 5, 6, 7],
'three': [8, 9, 10, 11],
'name': ['A', 'B', 'C', 'D']}
Code:
dict([(nm, [a, b, c, d]) for nm, a, b, c, d in zip(df.index, df.A, df.B, df.C, df.D)])
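An equivalent that doesn't require naming every column is a dict comprehension over iterrows; a sketch (iterrows is fine at this size, though slow on very large frames):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['A', 'B', 'C', 'D'],
                  index=['one', 'two', 'three'])
df.index.name = 'name'

# One list of row values per index label, plus the extra 'name' entry,
# regardless of how many columns there are
d = {name: row.tolist() for name, row in df.iterrows()}
d[df.index.name] = df.columns.tolist()
```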

Count and sort pandas dataframe

I have a dataframe with column 'code' which I have sorted based on frequency.
In order to see what each code means, there is also a column 'note'.
For each group of the 'code' column, I display the count and the first note attached to that code:
df.groupby('code')['note'].agg(['count', 'first']).sort_values('count', ascending=False)
Now my question is, how do I display only those rows that have frequency of e.g. >= 30?
Add a query call before you sort. Also, if you only want those rows exactly EQUALing < insert frequency here >, sort_values isn't needed:
df.groupby('code')['note'].agg(['count', 'first']).query('count == 30')
If the question is for all groups with AT LEAST < insert frequency here >, then
(
    df.groupby('code')
      .note.agg(['count', 'first'])
      .query('count >= 30')
      .sort_values('count', ascending=False)
)
Why do I use query? It's a lot easier to pipe and chain with it.
You can just filter your result accordingly (here grp is the grouped result from above):
grp = grp[grp['count'] >= 30]
Example with data
import pandas as pd
df = pd.DataFrame({'code': [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
                   'note': ['A', 'B', 'A', 'A', 'C', 'C', 'C', 'A', 'A',
                            'B', 'B', 'C', 'A', 'B']})
res = df.groupby('code')['note'].agg(['count', 'first']).sort_values('count', ascending=False)
# count first
# code
# 2 5 C
# 3 5 B
# 1 4 A
res2 = res[res['count'] >= 5]
# count first
# code
# 2 5 C
# 3 5 B
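If only the frequencies matter (not the first note), value_counts is a shorter route; a sketch on the same example data, with a threshold of 5 standing in for the 30 in the question:

```python
import pandas as pd

df = pd.DataFrame({'code': [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
                   'note': ['A', 'B', 'A', 'A', 'C', 'C', 'C', 'A', 'A',
                            'B', 'B', 'C', 'A', 'B']})

counts = df['code'].value_counts()   # already sorted descending by count
frequent = counts[counts >= 5]       # keep codes appearing at least 5 times
```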

How to get back the index after groupby in pandas

I am trying to find the record with the maximum value among the first records of each group after groupby, and delete it from the original dataframe.
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})
print(df)
t = df.groupby('item_id').first()  # lost track of the index
desired_row = t[t.cost == t.cost.max()]
#delete this row from df
         cost
item_id
d           5
I need to keep track of desired_row and delete this row from df and repeat the process.
What is the best way to find and delete the desired_row?
I am not sure of a general way, but this will work in your case since you are taking the first item of each group (it would work just as easily on the last). In fact, because of the general split-apply-combine nature of groupby, I don't think this is easily achievable without doing it yourself.
gb = df.groupby('item_id', as_index=False)
>>> gb.groups # Index locations of each group.
{'a': [0, 1], 'b': [2, 3, 4], 'c': [5], 'd': [6]}
# Get the first index location from each group using a dictionary comprehension.
subset = {k: v[0] for k, v in gb.groups.items()}
df2 = df.iloc[list(subset.values())]
# These are the first items in each groupby.
>>> df2
cost item_id
0 1 a
5 1 c
2 1 b
6 5 d
# Exclude any items from above where the cost is equal to the max cost across the first item in each group.
>>> df[~df.index.isin(df2[df2.cost == df2.cost.max()].index)]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Try this:
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})
t = df.drop_duplicates(subset=['item_id'], keep='first')
desired_row = t[t.cost == t.cost.max()]
df[~df.index.isin([desired_row.index[0]])]
Out[186]:
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Or using the negation of isin.
Consider this df with a few more rows:
pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd'],
              'cost': [1, 2, 1, 1, 3, 1, 5, 1, 7]})
df[~df.cost.isin(df.groupby('item_id').first().max().tolist())]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
7 1 d
8 7 d
Overview: create a dataframe from a dictionary, group by item_id and find the max value, enumerate over the grouped series and use the key (a numeric position) to recover the alpha index value, and build a result_df dataframe if you want one.
df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                        'cost': [1, 2, 1, 1, 3, 1, 5]})
grouped = df_temp.groupby(['item_id'])['cost'].max()
rows = []
for key, value in enumerate(grouped):
    index = grouped.index[key]
    rows.append({'item_id': index, 'cost': value})
result_df = pd.DataFrame(rows, columns=['item_id', 'cost'])
print(result_df.head(5))
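Since drop_duplicates keeps the original index, idxmax on its cost column gives the label of the row to delete directly; a sketch combining the answers above:

```python
import pandas as pd

df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})

# First row per item_id, with the original index intact
first = df.drop_duplicates(subset=['item_id'], keep='first')
row_to_drop = first['cost'].idxmax()   # index label in the original df
df2 = df.drop(row_to_drop)
```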

Mapping values into a new dataframe column

I have a dataset (~7000 rows) that I have imported in Pandas for some "data wrangling", but I need some pointers in the right direction to take the next step. My data looks something like the below; it is a description of a structure with several sub-levels. B, D and again B are sub-levels to A. C is a sub-level to B, and so on...
Level, Name
0, A
1, B
2, C
1, D
2, E
3, F
3, G
1, B
2, C
But I want something like the below, with Name and Mother_name on the same row:
Level, Name, Mother_name
1, B, A
2, C, B
1, D, A
2, E, D
3, F, E
3, G, E
1, B, A
2, C, B
If I understand the format correctly, the parent of a name depends on the
nearest prior row whose level is one less than the current row's level.
Your DataFrame has a modest number of rows (~7000). So there is little harm (to
performance) in simply iterating through the rows. If the DataFrame were very
large, you often get better performance if you can use column-wise vectorized Pandas
operations instead of row-wise iteration. However, in this case it appears that
using column-wise vectorized Pandas operations is awkward and
overly-complicated. So I believe row-wise iteration is the best choice here.
Using df.iterrows to perform row-wise iteration, you can simply record the current parents for every level as you go, and fill in the "mother"s as appropriate:
import pandas as pd
df = pd.DataFrame({'level': [0, 1, 2, 1, 2, 3, 3, 1, 2],
                   'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'B', 'C']})
parent = dict()
mother = []
for index, row in df.iterrows():
    parent[row['level']] = row['name']
    mother.append(parent.get(row['level'] - 1))
df['mother'] = mother
print(df)
yields
level name mother
0 0 A None
1 1 B A
2 2 C B
3 1 D A
4 2 E D
5 3 F E
6 3 G E
7 1 B A
8 2 C B
If you can specify the mapping of the two columns in something like a dictionary, then you can just use the map method of the original column.
import pandas
names = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'B', 'C']
# name -> sublevel
sublevel_map = {
    'A': 'A',
    'B': 'A',
    'C': 'B',
    'D': 'A',
    'E': 'D',
    'F': 'E',
    'G': 'E',
}
df = pandas.DataFrame({'Name': names})
df['Sublevel'] = df['Name'].map(sublevel_map)
Which gives you:
Name Sublevel
0 A A
1 B A
2 C B
3 D A
4 E D
5 F E
6 G E
7 B A
8 C B
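Hand-writing sublevel_map doesn't scale to ~7000 rows, but the single pass from the first answer can build the same mapping automatically and then feed it to map. This sketch assumes a repeated name always has the same parent (true in the sample data):

```python
import pandas as pd

df = pd.DataFrame({'level': [0, 1, 2, 1, 2, 3, 3, 1, 2],
                   'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'B', 'C']})

parent_at_level = {}
mother_map = {}
for level, name in zip(df['level'], df['name']):
    parent_at_level[level] = name                      # latest name seen at this level
    mother_map[name] = parent_at_level.get(level - 1)  # parent is one level up

df['mother'] = df['name'].map(mother_map)
```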

Create a matrix from a list of key-value pairs

I have a list of numpy arrays containing name-value pairs, where both names and values are strings. Every name and value can be found multiple times in the list, and I would like to convert it to a binary matrix.
The columns represent the values while the rows represent a key/name, and when a field is set to 1 it represents that particular name value pair.
E.g
I have
A : aa
A : bb
A : cc
B : bb
C : aa
and I want to convert it to:
   aa  bb  cc
A   1   1   1
B   0   1   0
C   1   0   0
I have some code that does this but I was wondering if there is an easier/out of the box way of doing this with numpy or some other library.
This is my code so far:
resources = set(result[:, 1])
resourcesDict = {}
for i, r in enumerate(resources):
    resourcesDict[r] = i
clients = set(result[:, 0])
clientsDict = {}
for i, c in enumerate(clients):
    clientsDict[c] = i
arr = np.zeros((len(clientsDict), len(resourcesDict)), dtype=bool)
for line in result[:, 0:2]:
    arr[clientsDict[line[0]], resourcesDict[line[1]]] = True
and result contains the following:
array([["a","aa"],["a","bb"],..]
I feel that using pandas.DataFrame.pivot is the best way:
>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6]})
>>> df
foo bar baz
0 one A 1
1 one B 2
2 one C 3
3 two A 4
4 two B 5
5 two C 6
Or
you can load your pair list using
>>> df = pd.read_csv('ratings.csv')
Then
>>> df.pivot(index='foo', columns='bar', values='baz')
A B C
one 1 2 3
two 4 5 6
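Note that df.pivot raises an error when an index/column pair occurs more than once, which can happen with the name-value pairs in the question. pd.crosstab handles duplicates out of the box by counting them; a sketch on the example data:

```python
import pandas as pd

kvp = [['A', 'aa'], ['A', 'bb'], ['A', 'cc'], ['B', 'bb'], ['C', 'aa']]
df = pd.DataFrame(kvp, columns=['name', 'value'])

# Rows are names, columns are values, entries count the pair's occurrences
table = pd.crosstab(df['name'], df['value'])
# For a strictly binary matrix (ignoring repeat counts), cap the entries at 1
binary = table.clip(upper=1)
```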
you probably have something like
m_dict = {'A': ['aa', 'bb', 'cc'], 'B': ['bb'], 'C': ['aa']}
I would go like this:
from collections import defaultdict
res = {}
for k, v in m_dict.items():
    res[k] = defaultdict(int)
    for col in v:
        res[k][col] = 1
edit
given your format, it would probably be more along the lines of:
m_array = [['A', 'aa'], ['A', 'bb'], ['A', 'cc'], ['B', 'bb'], ['C', 'aa']]
res = defaultdict(lambda: defaultdict(int))
for k, v in m_array:
    res[k][v] = 1
which both give:
>>> res['A']['aa']
1
>>> res['B']['aa']
0
This is a job for np.unique. It is not clear what format your data is in, but you need to get two 1-D arrays, one with the keys, another with the values, e.g.:
kvp = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
['B', 'bb'], ['C', 'aa']])
keys, values = kvp.T
rows, row_idx = np.unique(keys, return_inverse=True)
cols, col_idx = np.unique(values, return_inverse=True)
out = np.zeros((len(rows), len(cols)), dtype=int)
out[row_idx, col_idx] += 1
>>> out
array([[1, 1, 1],
[0, 1, 0],
[1, 0, 0]])
>>> rows
array(['A', 'B', 'C'],
dtype='|S2')
>>> cols
array(['aa', 'bb', 'cc'],
dtype='|S2')
If you have no repeated key-value pairs, this code will work just fine. If there are repetitions, I would suggest abusing scipy's sparse module:
import scipy.sparse as sps
kvp = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
['B', 'bb'], ['C', 'aa'], ['A', 'bb']])
keys, values = kvp.T
rows, row_idx = np.unique(keys, return_inverse=True)
cols, col_idx = np.unique(values, return_inverse=True)
out = sps.coo_matrix((np.ones_like(row_idx), (row_idx, col_idx))).A
>>> out
array([[1, 2, 1],
[0, 1, 0],
[1, 0, 0]])
d = {'A': ['aa', 'bb', 'cc'], 'C': ['aa'], 'B': ['bb']}
rows = 'ABC'
cols = ('aa', 'bb', 'cc')
print('  ', ' '.join(cols))
for row in rows:
    print(row, ' ', end=' ')
    for col in cols:
        print(' 1' if col in d.get(row, []) else ' 0', end=' ')
    print()
>>> aa bb cc
>>> A 1 1 1
>>> B 0 1 0
>>> C 1 0 0
