How to find connections between each group - python, pandas

I am having trouble finding connections between groups based on their associated data (groupby maybe?) in order to create a network.
For each group, if they share the same element, they are connected.
For example, my data frame looks like this:
group_number data
1 a
2 a
2 b
2 c
2 a
3 c
4 a
4 c
So the output would be:
Source_group Target_group Frequency
2 1 1 (because a-a)
3 2 1 (because c-c)
4 2 2 (because a-a, c-c)
Of course the "(because ...)" parts will not be in the output; they are just an explanation.
Thank you very much

I thought about your problem. You could do something like the following:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'group_number': [1, 2, 2, 2, 2, 3, 4, 4],
                   'data': ['a', 'a', 'b', 'c', 'a', 'c', 'a', 'c']})

# group the data using a multiindex and convert it to a dictionary
d = defaultdict(dict)
for multiindex, group in df.groupby(['group_number', 'data']):
    d[multiindex[0]][multiindex[1]] = group.data.size

# iterate the groups twice to compare every group
# with every other group
relationships = []
for key, val in d.items():
    for k, v in d.items():
        if key != k:
            # get the references to the two compared groups
            current_row_rel = {}
            current_row_rel['Source_group'] = key
            current_row_rel['Target_group'] = k
            # this is important, but at this point
            # you are basically comparing the intersection of two
            # simple python lists
            current_row_rel['Frequency'] = len(set(val).intersection(v))
            relationships.append(current_row_rel)

# convert the result to a pandas DataFrame for further analysis
df = pd.DataFrame(relationships)
I'm sure that this could be done without the need to convert to a list of dictionaries; I just find this solution more straightforward.
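For instance, here is a minimal pandas-only sketch of the same idea using a self-merge on the data column (an assumption-laden alternative, not the answer above; it redefines the input frame so it can be run on its own):
import pandas as pd

df = pd.DataFrame({'group_number': [1, 2, 2, 2, 2, 3, 4, 4],
                   'data': ['a', 'a', 'b', 'c', 'a', 'c', 'a', 'c']})

# pair up groups that share an element, then count the shared distinct elements
pairs = df.drop_duplicates().merge(df.drop_duplicates(), on='data')
pairs = pairs[pairs['group_number_x'] > pairs['group_number_y']]
result = (pairs.groupby(['group_number_x', 'group_number_y'])
               .size()
               .reset_index(name='Frequency')
               .rename(columns={'group_number_x': 'Source_group',
                                'group_number_y': 'Target_group'}))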

Related

Count the number of elements in a list where the list contains the empty string

I'm having difficulties counting the number of elements in a list within a DataFrame's column. My problem comes from the fact that, after importing my input csv file, the rows that are supposed to contain an empty list [] are actually parsed as lists containing the empty string [""]. Here's a reproducible example to make things clearer:
import pandas as pd
df = pd.DataFrame({"ID": [1, 2, 3], "NETWORK": [[""], ["OPE", "GSR", "REP"], ["MER"]]})
print(df)
ID NETWORK
0 1 []
1 2 [OPE, GSR, REP]
2 3 [MER]
Even though one might think that the list for the row where ID = 1 is empty, it's not. It actually contains the empty string [""], which took me a long time to figure out.
So with whatever standard method I try to use to count the number of elements in each list, I get a wrong value of 1 for the rows that are supposed to be empty:
df["COUNT"] = df["NETWORK"].str.len()
print(df)
ID NETWORK COUNT
0 1 [] 1
1 2 [OPE, GSR, REP] 3
2 3 [MER] 1
I searched and tried a lot of things before posting here but I couldn't find a solution to what seems to be a very simple problem. I should also note that I'm looking for a solution that doesn't require me to modify my original input file nor modify the way I'm importing it.
You just need to write a custom apply function that ignores the ''
df['COUNT'] = df['NETWORK'].apply(lambda x: sum(1 for w in x if w!=''))
Another way:
df['NETWORK'].apply(lambda x: len([y for y in x if y]))
Using apply is probably more straightforward. Alternatively, explode, filter, then group by count.
_s = df['NETWORK'].explode()
_s = _s[_s != '']
df['count'] = _s.groupby(level=0).count()
This yields:
NETWORK count
ID
1 [] NaN
2 [OPE, GSR, REP] 3.0
3 [MER] 1.0
Fill NA with zeroes if needed.
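For example, a minimal hedged sketch (assuming the lowercase 'count' column produced above):
df['count'] = df['count'].fillna(0).astype(int)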
df["COUNT"] = df["NETWORK"].apply(lambda x: len(x))
Use a lambda function on each row and in the lambda function return the length of the array

Pandas group by either column [duplicate]

This question already has an answer here:
Group a pandas dataframe by one column OR another one
(1 answer)
Closed 7 months ago.
I want to find the groups (rather than the grouping variable) in a pandas groupby. Here is an example:
Name Col1 Col2 Col3
John 1 A C
Sam 1 B C
Mike 1 B D
Kate 2 E G
Fred 3 E H
Liz 3 F H
Jane 4 X Y
Henry 4 Z T
If I group them using Col1 and (Col2 or Col3), the corresponding groups will be
output = [['John', 'Sam', 'Mike'], ['Kate'], ['Fred', 'Liz'], ['Jane'], ['Henry']]
because a group consists of people having the same Col1 values, as well as either the same Col2 or the same Col3 value.
I was able to get what I want by creating a graph and finding connected components. Grouping by Col1 first, then finding connected components is another idea. However, I believe there must be a simpler way.
I would also like to do this in a more general case, such as grouping by Col1 and Col2 and (Col3 or Col4) and (Col5 or Col6).
I've had a look around, and this question is effectively a duplicate of this post:
Group a pandas dataframe by one column OR another one. So I cannot, not remotely, take credit for the following solution, but let me just show how you can adjust the impressive answer provided there by @AmiTavory to suit your specific needs:
import pandas as pd
import networkx as nx
import itertools

# the question's example data
df = pd.DataFrame({'Name': ['John', 'Sam', 'Mike', 'Kate', 'Fred', 'Liz', 'Jane', 'Henry'],
                   'Col1': [1, 1, 1, 2, 3, 3, 4, 4],
                   'Col2': ['A', 'B', 'B', 'E', 'E', 'F', 'X', 'Z'],
                   'Col3': ['C', 'C', 'D', 'G', 'H', 'H', 'Y', 'T']})

# build a graph with one node per person and an edge between any two people
# that satisfy the Col1 and (Col2 or Col3) condition
G = nx.Graph()
G.add_nodes_from(df.Name)
G.add_edges_from(
    [(r1[1]['Name'], r2[1]['Name'])
     for (r1, r2) in itertools.product(df.iterrows(), df.iterrows())
     if r1[1].Name < r2[1].Name and
     (r1[1]['Col1'] == r2[1]['Col1'] and
      (r1[1]['Col2'] == r2[1]['Col2'] or r1[1]['Col3'] == r2[1]['Col3']))]
)
# label each person with the id of their connected component
df['group'] = df['Name'].map(
    dict(itertools.chain.from_iterable([[(ee, i) for ee in e]
                                        for (i, e) in enumerate(nx.connected_components(G))])))
# finally, we only need to add this to get the list with nested lists
# containing the names
output = df.groupby('group')['Name'].apply(list).values.tolist()
output
# [['John', 'Sam', 'Mike'], ['Kate'], ['Fred', 'Liz'], ['Jane'], ['Henry']]
In order to achieve other combinations of and/or, you will just have to rewrite this bit:
(r1[1]['Col1'] == r2[1]['Col1'] and
(r1[1]['Col2'] == r2[1]['Col2'] or r1[1]['Col3'] == r2[1]['Col3']))
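For example, for the more general case mentioned in the question (Col1 and Col2 and (Col3 or Col4) and (Col5 or Col6)), the condition would become something like the following sketch (Col4, Col5 and Col6 are assumed to exist in your frame):
(r1[1]['Col1'] == r2[1]['Col1'] and
 r1[1]['Col2'] == r2[1]['Col2'] and
 (r1[1]['Col3'] == r2[1]['Col3'] or r1[1]['Col4'] == r2[1]['Col4']) and
 (r1[1]['Col5'] == r2[1]['Col5'] or r1[1]['Col6'] == r2[1]['Col6']))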
From what I understood of your question, you want a list of the indexes of each individual in your grouped data; for that you will need groups.
So first let's grab the group names and indexes:
df.groupby('Col1').groups
This returns a dict whose keys are the names of each group for the column(s) used, and whose values are the DataFrame indexes, which is what you want.
Using these group keys, try the following comprehension:
[v.to_list() for v in df.groupby('Col1').groups.values()]
That will return the desired output regardless of which column you group by.
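For instance, on the question's data this gives something like the following (a hedged sketch; the exact index values depend on the DataFrame's index, and note that grouping by Col1 alone puts Jane and Henry in the same group, unlike the desired output):
groups = df.groupby('Col1').groups
# e.g. {1: [0, 1, 2], 2: [3], 3: [4, 5], 4: [6, 7]}
index_lists = [v.to_list() for v in groups.values()]
# [[0, 1, 2], [3], [4, 5], [6, 7]]
# to get the names instead of the row indexes:
name_lists = df.groupby('Col1')['Name'].apply(list).tolist()
# [['John', 'Sam', 'Mike'], ['Kate'], ['Fred', 'Liz'], ['Jane', 'Henry']]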

How to filter and drop rows based on groups when condition is specific

So I struggled to even come up with a title for this question. Not sure I can edit the question title, but I would be happy to do so once there is clarity.
I have a data set from an experiment where each row is a point in time for a specific group. [Edited based on better approach to generate data by Daniela Vera below]
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': np.random.randn(30),
                   'time': [1,2,3,4,5,6,7,8,9,10] * 3,
                   'grp': ['c','c','c','a','a','b','b','c','c','c'] * 3})
df.head(10)
x1 time grp
0 0.533131 1 c
1 1.486672 2 c
2 1.560158 3 c
3 -1.076457 4 a
4 -1.835047 5 a
5 -0.374595 6 b
6 -1.301875 7 b
7 -0.533907 8 c
8 0.052951 9 c
9 -0.257982 10 c
10 -0.442044 1 c
In the dataset, some people/groups only start to have values after time 5; in this case, group b. However, in the dataset I am working with there are up to 5,000 groups rather than just the 3 groups in this example.
I would like to identify everyone that only has values appearing after time 5, and drop them from the overall dataframe.
I have come up with a solution that works, but I feel like it is very clunky, and wondered if there was something cleaner.
# First I split the data into before and after the time of interest
after = df[df['time'] > 5].copy()
before = df[df['time'] < 5].copy()
# Then I merge the two dataframes and use the indicator to find out
# which groups only appear after time 5
missing = pd.merge(after, before, on='grp', how='outer', indicator=True)
# Then I use groupby and nunique to identify the groups that only appear
# after time 5 and save them as an array
something = missing[missing['_merge'] == 'left_only'].groupby('grp').nunique()
# I extract the list of group ids from the array
something = something.index
# I go back to my main dataframe and make the group id the index
df = df.set_index('grp')
# I then apply .drop on the array of group ids
df = df.drop(something)
df = df.reset_index()
Like I said, super clunky. But I just couldn't figure out an alternative. Please let me know if anything isn't clear and I'll happily edit with more details.
I am not sure if I get it, but let's say you have this data:
df = pd.DataFrame({'x1': np.random.randn(30),'time': [1,2,3,4,5,6,7,8,9,10] * 3,'grp': ['c', 'c', 'c','a','a','b','b','c','c','c'] * 3})
In this case, group "b" just has data for times 6 and 7, which are above time 5. You can use this process to get a dictionary with the times at which each group has at least one data point, and also a list called "keep" with the groups that have at least one data point before time 5.
list_groups = ["a", "b", "c"]
times_per_group = {}
keep = []
for group in list_groups:
    times_per_group[group] = list(df[df.grp == group].time.unique())
    condition = any([i < 5 for i in list(df[df.grp == group].time.unique())])
    if condition:
        keep.append(group)
Finally, you just keep the groups present in the list "keep":
df = df[df.grp.isin(keep)]
Let me know if I understood your question!
Of course you can simplify the process; the dictionary is just there to check, so you don't actually need the whole code.
If this result is what you're looking for, you can just do:
keep = [group for group in list_groups if any([i<5 for i in list(df[df.grp == group].time.unique())])]
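If you prefer a purely pandas route, here is a minimal hedged sketch of the same idea using groupby/transform (assuming the same df as above; adjust the threshold if time 5 itself should count as "before"):
# keep only the groups whose earliest time is before 5
df = df[df.groupby('grp')['time'].transform('min') < 5]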

Relation between dataframes

I have the following dataframes:
df:
points
0 a,b,c
1 a,e
2 d,f,c
and df1:
point relation
a [b,f,e]
b []
c [e]
and I want to keep only the points in the first dataframe that have a relation between them (using the second dataframe).
What I've tried:
I tried converting the second dataframe to a dictionary and searching for the points in the first dataframe, but I had no luck because it was searching the whole dataframe.
I tried a sort of combination of the 2 dataframes into one with only the common elements, but again I had no luck.
Any suggestions out there? Thank you in advance.
Edit
Desired output
points
0 a,b
1 a,e
2 []
code:
my_dict = df1.set_index('point')['relation'].to_dict()  # create dict
my_dict = dict((k, v) for k, v in my_dict.items() if v)  # delete empty []
for key, values in my_dict.items():
    if df['points'].str.contains(key, values).any():
but then I saw with print that the code is not working
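Since the dict-based idea itself seems workable, here is a hedged sketch of one way to finish it (assumptions: df['points'] holds comma-separated strings and df1['relation'] holds lists, as shown above; related_points is a helper name introduced for illustration only):
import pandas as pd

df = pd.DataFrame({'points': ['a,b,c', 'a,e', 'd,f,c']})
df1 = pd.DataFrame({'point': ['a', 'b', 'c'],
                    'relation': [['b', 'f', 'e'], [], ['e']]})

my_dict = df1.set_index('point')['relation'].to_dict()   # create dict
my_dict = {k: v for k, v in my_dict.items() if v}         # delete empty []

def related_points(row):
    # keep a pair of points only if one appears in the other's relation list
    pts = row.split(',')
    keep = set()
    for p in pts:
        for q in pts:
            if p != q and q in my_dict.get(p, []):
                keep.update([p, q])
    return ','.join(sorted(keep)) if keep else '[]'

df['points'] = df['points'].apply(related_points)
# 0    a,b
# 1    a,e
# 2     []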

Pandas really slow join

I am trying to merge 2 dataframes, where I want to use the most recent date's row. Note that the date is not sorted, so it is not possible to use groupby.first() or groupby.last().
Left DataFrame (n=834,570) | Right DataFrame (n=1,592,005)
id_key | id_key date other_vars
1 | 1 2015-07-06 ...
2 | 1 2015-07-07 ...
3 | 1 2014-04-04 ...
Using the groupby/agg example, it takes 8 minutes! When I convert the dates to integers, it takes 6 minutes.
gb = right.groupby('id_key')
gb.agg(lambda x: x.iloc[x.date.argmax()])
I used my own version where I make a dictionary keyed by id, in which I store the date and index of the highest date seen so far. You just iterate over the whole data once, ending up with a dictionary {id_key: (highest_date, index)}.
This way, it is really fast to find just the rows that are necessary.
It only takes 6 seconds to end up with the merged data; about an 85x speedup.
I have to admit I'm very surprised as I thought pandas would be optimised for this. Does anyone have an idea what is going on, and whether the dictionary method should also be an option in pandas? It would also be simple to adapt this to other conditions of course, like sum, min etc.
My code:
# 1. Create dictionary
dc = {}
for ind, (ik, d) in enumerate(zip(right['id_key'], right['date'])):
if ik not in dc:
dc[ik] = (d, ind)
continue
if (d, ind) > dc[ik]:
dc[ik] = (d, ind)
# 2. Collecting indices at once (subsetting was slow), so to only subset once.
# It has the same amount of rows as left
inds = []
for x in left['id_key']:
# using this to append the last value that was given (missing strategy, very very few)
if x in dc:
row = dc[x][1]
inds.append(row)
# 3. Take the values
result = right.iloc[inds]
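For comparison, a hedged sketch of what a vectorized pandas version might look like (assuming right['date'] is a datetime or otherwise orderable column and the index of right is unique):
# take, for each id_key, the row of `right` with the latest date,
# then left-join it onto `left`
latest = right.loc[right.groupby('id_key')['date'].idxmax()]
result = left.merge(latest, on='id_key', how='left')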
