Pandas group by either column [duplicate] - python

This question already has an answer here:
Group a pandas dataframe by one column OR another one
(1 answer)
Closed 7 months ago.
I want to find the groups (rather than the grouping variable) in a pandas groupby. Here is an example:
Name Col1 Col2 Col3
John 1 A C
Sam 1 B C
Mike 1 B D
Kate 2 E G
Fred 3 E H
Liz 3 F H
Jane 4 X Y
Henry 4 Z T
If I group them using Col1 and (Col2 or Col3), the corresponding groups will be
output = [['John', 'Sam', 'Mike'], ['Kate'], ['Fred', 'Liz'], ['Jane'], ['Henry']]
because a group consists of people having the same Col1 values, as well as either the same Col2 or the same Col3 value.
I was able to get what I want by creating a graph and finding connected components. Grouping by Col1 first, then finding connected components is another idea. However, I believe there must be a simpler way.
I would also like to do this in a more general case, such as grouping by Col1 and Col2 and (Col3 or Col4) and (Col5 or Col6).

I've had a look around, and this question is effectively a duplicate of this post:
Group a pandas dataframe by one column OR another one. So I cannot - not remotely - take credit for the following solution, but let me just show how you can adjust the impressive answer provided there by @AmiTavory to suit your specific needs:
import pandas as pd
import networkx as nx
import itertools

# the example data from the question
df = pd.DataFrame({
    'Name': ['John', 'Sam', 'Mike', 'Kate', 'Fred', 'Liz', 'Jane', 'Henry'],
    'Col1': [1, 1, 1, 2, 3, 3, 4, 4],
    'Col2': ['A', 'B', 'B', 'E', 'E', 'F', 'X', 'Z'],
    'Col3': ['C', 'C', 'D', 'G', 'H', 'H', 'Y', 'T'],
})

# build a graph with one node per person and an edge between every pair of
# people that agree on Col1 and on either Col2 or Col3
G = nx.Graph()
G.add_nodes_from(df.Name)
G.add_edges_from(
    [(r1[1]['Name'], r2[1]['Name'])
     for (r1, r2) in itertools.product(df.iterrows(), df.iterrows())
     if r1[1].Name < r2[1].Name and
        (r1[1]['Col1'] == r2[1]['Col1'] and
         (r1[1]['Col2'] == r2[1]['Col2'] or r1[1]['Col3'] == r2[1]['Col3']))]
)

# each connected component of the graph is one group; map every name to the
# index of its component
df['group'] = df['Name'].map(
    dict(itertools.chain.from_iterable([[(ee, i) for ee in e]
                                        for (i, e) in enumerate(nx.connected_components(G))])))

# finally, we only need to add this to get the list with nested lists
# containing the names.
output = df.groupby('group')['Name'].apply(list).values.tolist()
output
# [['John', 'Sam', 'Mike'], ['Kate'], ['Fred', 'Liz'], ['Jane'], ['Henry']]
In order to achieve other combinations of and/or, you will just have to rewrite this bit:
(r1[1]['Col1'] == r2[1]['Col1'] and
(r1[1]['Col2'] == r2[1]['Col2'] or r1[1]['Col3'] == r2[1]['Col3']))
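For instance, for the more general case mentioned in the question - Col1 and Col2 and (Col3 or Col4) and (Col5 or Col6) - the condition would become something like the following sketch (assuming those columns exist in df):
(r1[1]['Col1'] == r2[1]['Col1'] and
 r1[1]['Col2'] == r2[1]['Col2'] and
 (r1[1]['Col3'] == r2[1]['Col3'] or r1[1]['Col4'] == r2[1]['Col4']) and
 (r1[1]['Col5'] == r2[1]['Col5'] or r1[1]['Col6'] == r2[1]['Col6']))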

From what I understood of your question, you want a list of what seems to be the index of each individual in your grouped data; for that you will need groups.
So first, let's grab the group names and indexes:
df.groupby('col1').groups
This returns a dict whose keys are the names of each group in the column(s) used, and whose values are the dataframe indexes belonging to each group, which is what you want.
Using these group values, try the following comprehension:
[v.to_list() for v in df.groupby('col1').groups.values()]
That will return the output you want, independently of which column you group by.
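On the example data in the question, setting Name as the index first makes the group values the names themselves; grouping by Col1 alone (note this sketch does not implement the Col2-or-Col3 condition) would give:
[v.to_list() for v in df.set_index('Name').groupby('Col1').groups.values()]
# [['John', 'Sam', 'Mike'], ['Kate'], ['Fred', 'Liz'], ['Jane', 'Henry']]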

Related

Python transform data long to wide

I'm looking to transform some data in Python.
Originally, in column 1 there are various identifiers (A to E in this example) associated with towns in column 2. There is a separate row for each identifier and town association. There can be any number of identifier to town associations.
I'd like to end up with ONE row per identifier and with all the associated towns going horizontally separated by commas.
I tried reshaping from long to wide but I'm having difficulty doing the above; I'd appreciate any suggestions.
Thank you
One way to do it is using groupby. For example, you can group by column 1 and apply a function that joins the unique values of each group (i.e. each code) into a single string.
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'col1': 'A A A A B B C C C D E E E E E'.split(' '),
    'col2': ['Accrington', 'Acle', 'Suffolk', 'Hampshire', 'Lincolnshire',
             'Derbyshire', 'Aldershot', 'Alford', 'Cumbria', 'Hampshire', 'Bath',
             'Alston', 'Greater Manchester', 'Northumberland', 'Cumbria'],
})

def get_towns(town_list):
    # sort the unique towns and join them into a comma-separated string
    return ', '.join(np.unique(town_list))

df.groupby('col1')['col2'].apply(get_towns)
And the result is:
col1
A Accrington, Acle, Hampshire, Suffolk
B Derbyshire, Lincolnshire
C Aldershot, Alford, Cumbria
D Hampshire
E Alston, Bath, Cumbria, Greater Manchester, Nor...
Name: col2, dtype: object
Note: the last line also contains Cumbria, differently from your expected results, as this value also appears with the code E. I guess that was a typo in your question...
Another option is to use .groupby with aggregate because conceptually, this is not a pivoting operation but, well, an aggregation (concatenation) of values. This solution is quite similar to Luca Clissa's answer, but it uses the pandas api instead of numpy.
>>> df.groupby("col1").col2.agg(list)
col1
A [Accrington, Acle, Suffolk, Hampshire]
B [Lincolnshire, Derbyshire]
C [Aldershot, Alford, Cumbria]
D [Hampshire]
E [Bath, Alston, Greater Manchester, Northumberl...
Name: col2, dtype: object
That gives you cells of lists; if you need strings, add a .str.join(", "):
>>> df.groupby("col1").col2.agg(list).str.join(", ")
col1
A Accrington, Acle, Suffolk, Hampshire
B Lincolnshire, Derbyshire
C Aldershot, Alford, Cumbria
D Hampshire
E Bath, Alston, Greater Manchester, Northumberla...
Name: col2, dtype: object
If you want col1 as a normal column instead of an index, add a .reset_index() at the end.
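For instance, a small sketch building on the snippet above:
>>> out = df.groupby("col1").col2.agg(list).str.join(", ").reset_index()
>>> list(out.columns)
['col1', 'col2']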

python, pandas, How to find connections between each group

I am having trouble finding connections between groups based on the associated data (groupby maybe?) in order to create a network.
For each group, if they have the same element, they are connected.
For example, my data frame looks like this:
group_number data
1 a
2 a
2 b
2 c
2 a
3 c
4 a
4 c
So the output would be
Source_group Target_group Frequency
2 1 1 (because a-a)
3 2 1 (because c-c)
4 2 2 (because a-a, c-c)
Of course the (because ...) parts will not be in the output; they are just explanations.
Thank you very much
I thought about your problem. You could do something like the following:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'group_number': [1, 2, 2, 2, 2, 3, 4, 4],
                   'data': ['a', 'a', 'b', 'c', 'a', 'c', 'a', 'c']})

# group the data using a multiindex and convert it to a dictionary
d = defaultdict(dict)
for multiindex, group in df.groupby(['group_number', 'data']):
    d[multiindex[0]][multiindex[1]] = group.data.size

# iterate over the groups twice to compare every group
# with every other group
relationships = []
for key, val in d.items():
    for k, v in d.items():
        if key != k:
            # get the references to the two compared groups
            current_row_rel = {}
            current_row_rel['Source_group'] = key
            current_row_rel['Target_group'] = k
            # this is the important part: you are basically intersecting
            # the keys (elements) of two simple python dictionaries
            current_row_rel['Frequency'] = len(set(val).intersection(v))
            relationships.append(current_row_rel)

# convert the result to a pandas DataFrame for further analysis
df = pd.DataFrame(relationships)
I'm sure that this could be done without converting to a list of dictionaries; I find this solution, however, to be more straightforward.
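For what it's worth, here is a minimal pandas-only sketch (my own suggestion, not part of the original answer): drop duplicate (group, element) rows, self-merge on the shared element, and count the distinct shared elements per pair of groups. Note that it lists every pair of groups sharing at least one element, not only the pairs shown in the question.
import pandas as pd

df = pd.DataFrame({'group_number': [1, 2, 2, 2, 2, 3, 4, 4],
                   'data': ['a', 'a', 'b', 'c', 'a', 'c', 'a', 'c']})

# one row per (group, element) pair
pairs = df.drop_duplicates()

# self-merge on the shared element and keep each unordered pair of groups once
merged = pairs.merge(pairs, on='data')
merged = merged[merged['group_number_x'] > merged['group_number_y']]

# count how many distinct elements each pair of groups shares
result = (merged.groupby(['group_number_x', 'group_number_y'])
                .size()
                .reset_index(name='Frequency')
                .rename(columns={'group_number_x': 'Source_group',
                                 'group_number_y': 'Target_group'}))
print(result)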

Is there a way to allocate sorted values in a dataframe to groups based on alternating elements

I have a Pandas DataFrame like:
COURSE BIB# COURSE 1 COURSE 2 STRAIGHT-GLIDING MEAN PRESTASJON
1 2 20.220 22.535 19.91 21.3775 1.073707
0 1 21.235 23.345 20.69 22.2900 1.077332
This is from a pilot and the DataFrame may be much longer when we perform the real experiment. Now that I have calculated the performance for each BIB#, I want to allocate them into two different groups based on their performance. I have therefore written the following code:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
This sorts values in the DataFrame. Now I want to assign even rows to one group and odd rows to another. How can I do this?
I have no idea what I am looking for. I have looked up in the documentation for the random module in Python but that is not exactly what I am looking for. I have seen some questions/posts pointing to a scikit-learn stratification function but I don't know if that is a good choice. Alternatively, is there a way to create a loop that accomplishes this? I appreciate your help.
Here is a figure to illustrate what I want to accomplish.
How about this:
threshold = 0.5
df1['group'] = df1['PRESTASJON'] > threshold
Or if you want values for your groups:
df['group'] = np.where(df['PRESTASJON'] > threshold, 'A', 'B')  # requires numpy imported as np
Here, 'A' will be assigned to column 'group' if the PRESTASJON value exceeds our threshold, otherwise 'B'.
UPDATE: Per OP's update on the post, if you want to group them alternatively into two groups:
# sort your dataframe based on the PRESTASJON column
df1 = df1.sort_values(by='PRESTASJON')
# create a new column with default value 'A' and assign alternate rows
# (positions 1, 3, 5, ...) to 'B'
df1['group'] = 'A'
df1.iloc[1::2, -1] = 'B'
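An equivalent one-liner sketch for the group column, assuming numpy is imported as np:
df1['group'] = np.where(np.arange(len(df1)) % 2, 'B', 'A')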
Are you splitting the dataframe into alternating rows? If so, you can do:
import numpy as np

df1 = df1.sort_values(by='PRESTASJON', ascending=True)
for i, d in df1.groupby(np.arange(len(df1)) % 2):
    print(f'group {i}')
    print(d)
Another way without groupby:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
mask = np.arange(len(df1)) % 2
group1 = df1.loc[mask == 0]
group2 = df1.loc[mask == 1]

How to filter and drop rows based on groups when condition is specific

So I struggled to even come up with a title for this question. Not sure I can edit the question title, but I would be happy to do so once there is clarity.
I have a data set from an experiment where each row is a point in time for a specific group. [Edited based on better approach to generate data by Daniela Vera below]
df = pd.DataFrame({'x1': np.random.randn(30),'time': [1,2,3,4,5,6,7,8,9,10] * 3,'grp': ['c', 'c', 'c','a','a','b','b','c','c','c'] * 3})
df.head(10)
x1 time grp
0 0.533131 1 c
1 1.486672 2 c
2 1.560158 3 c
3 -1.076457 4 a
4 -1.835047 5 a
5 -0.374595 6 b
6 -1.301875 7 b
7 -0.533907 8 c
8 0.052951 9 c
9 -0.257982 10 c
10 -0.442044 1 c
In the dataset, some people/groups only start to have values after time 5 - in this case group b. However, in the dataset I am working with there are up to 5,000 groups rather than just the 3 groups in this example.
I would like to identify everyone that only has values appearing after time 5, and drop them from the overall dataframe.
I have come up with a solution that works, but I feel like it is very clunky, and wondered if there was something cleaner.
# First I split the data into before and after the time of interest
after = df[df['time'] > 5].copy()
before = df[df['time'] < 5].copy()
# Then I merge the two dataframes and use indicator to find out which ones
# only appear after time 5.
missing = pd.merge(after, before, on='grp', how='outer', indicator=True)
# Then I use groupby and nunique to identify the groups that only appear
# after time 5 and save them as an array
something = missing[missing['_merge'] == 'left_only'].groupby('grp').nunique()
# I extract the list of group ids from the array
something = something.index
# I go back to my main dataframe and make group id the index
df = df.set_index('grp')
# I then apply .drop on the array of group ids
df = df.drop(something)
df = df.reset_index()
Like I said, super clunky. But I just couldn't figure out an alternative. Please let me know if anything isn't clear and I'll happily edit with more details.
I am not sure if I get it, but let's say you have this data:
df = pd.DataFrame({'x1': np.random.randn(30),'time': [1,2,3,4,5,6,7,8,9,10] * 3,'grp': ['c', 'c', 'c','a','a','b','b','c','c','c'] * 3})
In this case, group "b" just has data for times 6 and 7, which are above time 5. You can use this process to get a dictionary with the times at which each group has at least one data point, and also a list called "keep" with the groups that have at least one data point below time 5.
list_groups = ["a", "b", "c"]  # in the general case this could be df.grp.unique()
times_per_group = {}
keep = []
for group in list_groups:
    times_per_group[group] = list(df[df.grp == group].time.unique())
    condition = any([i < 5 for i in list(df[df.grp == group].time.unique())])
    if condition:
        keep.append(group)
Finally, you just keep the groups present in the list "keep":
df = df[df.grp.isin(keep)]
Let me know if I understood your question!
Of course you can simplify the process; the dictionary is just for checking, and you don't actually need the whole code.
If this result is what you're looking for, you can just do:
keep = [group for group in list_groups if any([i<5 for i in list(df[df.grp == group].time.unique())])]
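A further simplification, as a sketch of my own suggestion rather than part of the original answer, uses groupby().transform so nothing has to be looped over:
# keep only rows belonging to groups that have at least one observation before time 5
df = df[df.groupby('grp')['time'].transform('min') < 5]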

How to get the mean of a subset of rows after using groupby?

I want to get the average of a particular subset of rows in one particular column in my dataframe.
I can use
df['C'].iloc[2:9].mean()
to get the mean of just the particular rows I want from my original Dataframe but my problem is that I want to perform this operation after using the groupby operation.
I am building on
df.groupby(["A", "B"])['C'].mean()
whereby there are 11 values returned in 'C' once I group by columns A and B and I get the average of those 11 values. I actually only want to get the average of the 3rd through 9th values though so ideally what I would want to do is
df.groupby(["A", "B"])['C'].iloc[2:9].mean()
This would return those 11 values from column C for every group of A,B and then would find the mean of the 3rd through 9th values but I know I can't do this. The error suggests using the apply method but I can't seem to figure it out.
Any help would be appreciated.
You can use the agg function after the groupby, then subset within each group and take the mean:
# a dummy data frame to demonstrate
df = pd.DataFrame({'A': ['a']*22, 'B': ['b1']*11 + ['b2']*11, 'C': list(range(11))*2})

df.groupby(['A', 'B'])['C'].agg(lambda g: g.iloc[2:9].mean())
# A  B
# a  b1    5.0
#    b2    5.0
# Name: C, dtype: float64
Try this variant:
for key, grp in df.groupby(["A", "B"]):
    print(grp['C'].iloc[2:9].mean())
