Grouping the dataset by cluster_id attribute - python

I'd like to group my dataframe by cluster id and print all occurrences in each group. My dataframe looks something like this:
Chemical Name,cluster_id
XA323, 0
ZC4-D, 2
XA324, 0
YB1050, 1
ZC5-D, 2
YB1052, 1
I'd like it grouped by cluster_id, like this:
cluster_id
0    XA323
     XA324
1    YB1050
     YB1052
2    ZC4-D
     ZC5-D
NOTE: This is a dummy dataset; my original dataset has around 3000 instances, with a cluster_id distribution of roughly 0: 2700+, 1: 200+, and 2: the remainder.
Thank you.

Following the comments, you can group by cluster_id, use list as the aggregation function, and turn the result into a dict:
df.groupby("cluster_id").agg(list)["Chemical Name"].to_dict()

How to attach a column containing the number of occurrences of values in other columns to an existing Dataframe?

I have a data frame containing hyponym and hypernym pairs extracted from StackOverflow posts. You can see an excerpt from it in the following:
0 1 2 3 4
linq query asmx web service THH 10 a linq query as an asmx web service
application bolt THH 1 my application is a bolt on data visualization...
area r time THH 1 the area of the square is r times
sql query syntax HTH 3 sql like query syntax
...
7379596 rows × 5 columns
Column 0 and column 1 contain the hyponym and hypernym parts of the phrases in column 4. I would like to implement a filter based on statistical features, so I have to count all occurrences of each (0, 1) pair together, as well as all occurrences of the hyponym and hypernym parts individually. Pandas has a method called value_counts(), so the occurrences can be counted with:
df.value_counts([0])
df.value_counts([1])
df.value_counts([0, 1])
This is nice, but the method returns a Pandas Series with far fewer rows than the original DataFrame, so adding a new column like df[5] = df.value_counts([0, 1]) does not work.
I have found a workaround: I created three Pandas Series, one for each occurrence type (pair, hyponym, hypernym), and wrote a small loop to calculate a confidence score for every pair. But since the original dataset is huge (more than 7 million records), this is not an efficient approach (the calculation had not finished after 30 hours). A feasible and hopefully efficient solution would be to use Pandas applymap() for this purpose, but that requires attaching the occurrence counts as columns to the original DataFrame. So I would like a DataFrame like this one:
0 1 2 3 4 5 6 7
sql query anything anything a phrase 1000 800 500
sql query anything anything anotherphrase 1000 800 500
...
Column 5 is the number of occurrences of the hyponym part (sql), column 6 is the number of occurrences of the hypernym part (query), and column 7 is the number of occurrences of the pair (sql, query). As you can see, the pairs are the same but they are extracted from different phrases.
My question is how to do that? How can I attach occurrences as a new column to an existing DataFrame?
Here's a solution that maps the value counts of the combination of two columns to a new column:
# Create an example DataFrame
df = pd.DataFrame({0: ["a", "a", "a", "b"], 1: ["c", "d", "d", "d"]})
# Count the paired occurrences in a new column
df["count"] = df.groupby([0,1])[0].transform('size')
Before editing, I had answered this question with a solution using value_counts and a merge. This original solution is slower and more complicated than the groupby:
# Put the value_counts in a new DataFrame, call them count
vcdf = pd.DataFrame(df[[0, 1]].value_counts(), columns=["count"])
# Merge the df with the vcs
merged = pd.merge(left=df, right=vcdf, left_on=[0, 1], right_index=True)
# Potentially sort index
merged = merged.sort_index()
The resulting DataFrame:
0 1 count
0 a c 1
1 a d 2
2 a d 2
3 b d 1
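To get all three count columns the question asks for (hyponym, hypernym, and pair occurrences), the same transform idea can be repeated per column. A sketch, assuming the columns are still labeled 0 and 1 as in the question:

df[5] = df.groupby([0])[0].transform('size')     # occurrences of the hyponym part (column 0)
df[6] = df.groupby([1])[1].transform('size')     # occurrences of the hypernym part (column 1)
df[7] = df.groupby([0, 1])[0].transform('size')  # occurrences of the (hyponym, hypernym) pair

Because transform returns a result aligned with the original index, these columns attach directly to the full 7-million-row DataFrame without any merging.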

random sample per group, with min_rows

I have a dataframe and I want to sample it. However, while sampling randomly I want at least 1 sample from every value in the column. I also want the original distribution to have an effect (e.g. values with more rows in the original get more rows in the sampled df).
Similar to this and this question, but with minimum sample size per group.
Lets say this is my df:
df = pd.DataFrame(columns=['class'])
df['class'] = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,2]
df_sample = df.sample(n=4)
And when I sample this I want the df_sample to look like:
Class
0
0
1
2
Thank you.
As suggested by @YukiShioriii you could:
1 - sample one row of each group of values
2 - randomly sample over the remaining rows regardless of the values
Following YukiShioriii's and mprouveur's suggestions:
# random_state for reproducibility, remove in production code
sample_size = 4  # total number of rows wanted, as in the question's df.sample(n=4)
sample = df.groupby('class').sample(1, random_state=1)
sample = pd.concat([
    sample,
    df[~df.index.isin(sample.index)]            # only rows that have not been selected
      .sample(n=sample_size - sample.shape[0])  # sample as many more rows as needed
]).sort_index()
Output
class
2 0
4 0
13 1
14 2

Sort values in dataframe, but randomize order of items with same value

I am writing a recommendation system that recommends products based on a score assigned to each product, for example in the following dataframe:
index product_name score
0 prod_1 2
1 prod_2 2
2 prod_3 1
3 prod_4 3
I can of course sort this dataframe by score using sort_values('score', ascending=False); however, this will always result in the following dataframe:
index product_name score
3 prod_4 3
0 prod_1 2
1 prod_2 2
2 prod_3 1
However, I would like to randomly shuffle the order of prod_1 and prod_2, as they have the same score. It doesn't seem like sort_values has any way of achieving this.
The only solution I can come up with is to fetch all possible scores from the dataframe, then make a new dataframe for each score, shuffle those, and then stitch them back together, but it seems like there should be a better way.
What about adding a new column with completely random numbers (using e.g. numpy.random.randint) and then sorting by both?
sort_values(by=["score","rand_col"], ascending=[False,False])
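A minimal sketch of that idea, assuming the dataframe is called df; the helper column name rand_col is arbitrary and is dropped again after sorting (np.random.permutation is used here instead of randint so the tie-breaker values are guaranteed to be unique):

import numpy as np

# assign a unique random rank to every row, so ties on score are broken randomly
df["rand_col"] = np.random.permutation(len(df))
df = (df.sort_values(by=["score", "rand_col"], ascending=[False, False])
        .drop(columns="rand_col"))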

Count each observation a row

I have a pandas df named df, with millions of observations (rows) and only 4 columns.
I'm trying to convert the event_type column into several columns, and add a count to each row for that column.
My df looks like this:
event_type event_time organization_id user_id
0 Applied Saved View 2018-11-22 10:59:57.360 3 0
And I'm looking for this:
Applied_Saved_View event_time organization_id user_id
0 1 2018-11-22 10:59:57.360 3 0
I believe you are looking for pd.get_dummies. I assume you are trying to turn this into categorical data? I have no way of testing without sample data, but see the code below.
df2 = pd.get_dummies(df['event_type'])
new_df = pd.concat([df2,df],axis=1)
I should mention, you should check how many unique values there are in this event_type column, because each of them will become a new column, whether there are 10 or 100,000 unique values.
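Continuing from the snippet above, a sketch of the whole step, assuming you also want to drop the original event_type column and use underscores in the new column names, as in the desired output:

dummies = pd.get_dummies(df['event_type'])
# match the desired column style, e.g. 'Applied Saved View' -> 'Applied_Saved_View'
dummies.columns = [c.replace(' ', '_') for c in dummies.columns]
# replace event_type with the indicator columns
new_df = pd.concat([dummies, df.drop(columns='event_type')], axis=1)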

Pandas: Nesting Dataframes

Hello, I want to store a dataframe in another dataframe's cell.
I have data that looks like this:
I have daily data which consists of date, steps, and calories. In addition, I have minute-by-minute HR data for a specific date. Obviously it would be easy to put the minute-by-minute data in a 2-dimensional list, but I fear that would be harder to analyze later.
What would be the best practice when I want to have both datasets in one dataframe? Is it even possible to nest dataframes?
Any better ideas? Thanks!
Yes, it seems possible to nest dataframes but I would recommend instead rethinking how you want to structure your data, which depends on your application or the analyses you want to run on it after.
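For example, instead of nesting, the minute-by-minute HR data could live in its own long-format dataframe keyed by date and be merged with the daily data when needed. A sketch with hypothetical column names and values:

import pandas as pd

# daily summary: one row per date (hypothetical values)
daily = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02'],
    'steps': [8000, 9500],
    'calories': [2100, 2300],
})

# minute-level HR: one row per minute, keyed by the same date column
hr_minutes = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02'],
    'minute': ['00:00', '00:01', '00:00'],
    'hr': [62, 64, 60],
})

# combine on demand instead of nesting one frame inside the other
combined = hr_minutes.merge(daily, on='date', how='left')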
How to "nest" dataframes into another dataframe
Your dataframe containing your nested "sub-dataframes" won't be displayed very nicely. However, just to show that it is possible to nest your dataframes, take a look at this mini-example:
Here we have 3 random dataframes:
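They could be built like this, for instance (a sketch, assuming numpy random data; the values will differ from the ones printed below on each run):

import numpy as np
import pandas as pd

# three small 3x3 dataframes filled with random numbers
df1, df2, df3 = (pd.DataFrame(np.random.rand(3, 3)) for _ in range(3))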
>>> df1
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
>>> df2
0 1 2
0 0.090917 0.457668 0.598548
1 0.748639 0.729935 0.680409
2 0.301244 0.024004 0.361283
>>> df3
0 1 2
0 0.200375 0.059798 0.665323
1 0.086708 0.320635 0.594862
2 0.299289 0.014134 0.085295
We can make a main dataframe that includes these dataframes as values in individual "cells":
df = pd.DataFrame({'idx':[1,2,3], 'dfs':[df1, df2, df3]})
We can then access these nested dataframes as we would access any value in any other dataframe:
>>> df['dfs'].iloc[0]
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
