Python Pandas: Create Co-occurrence Matrix from two columns - python

I have a DataFrame which looks like this (the columns are filled with ids for a movie and ids for an actor):
   movie  actor  clusterid
0      0      1          2
1      0      2          2
2      1      1          2
3      1      3          2
4      2      2          1
and I want to create a binary co-occurrence matrix from this dataframe which looks like this
                     actor1  actor2  actor3
clusterid 2  movie0       1       1       0
             movie1       1       0       1
clusterid 1  movie2       0       1       0
where my dataframe has (i) a MultiIndex (clusterid, movieid) and (ii) a binary count for the actors who acted in each movie according to my initial dataframe.
I tried:
df.groupby("movie").agg('count').unstack(fill_value=0)
but unfortunately this doesn't expand the dataframe; it only counts the totals. Can something like this be done easily using pandas' built-in functions?
Thank you for any advice

You can create an extra auxiliary column to indicate whether the value exists, and then use pivot_table:
(df.assign(actor="actor" + df.actor.astype(str), indicator=1)
   .pivot_table('indicator', ['clusterid', 'movie'], 'actor', fill_value=0))
Or use the set_index + unstack pattern:
(df.assign(actor="actor" + df.actor.astype(str), indicator=1)
   .set_index(['clusterid', 'movie', 'actor']).indicator.unstack('actor', fill_value=0))
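As a sanity check, the pivot_table variant can be run end to end on the sample data (a minimal sketch; the indicator column name is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    "movie": [0, 0, 1, 1, 2],
    "actor": [1, 2, 1, 3, 2],
    "clusterid": [2, 2, 2, 2, 1],
})

# Prefix actor ids so they become readable column names, add a constant
# indicator column, then pivot it into a binary co-occurrence matrix.
out = (df.assign(actor="actor" + df.actor.astype(str), indicator=1)
         .pivot_table("indicator", ["clusterid", "movie"], "actor", fill_value=0))
print(out)
```

The result is indexed by (clusterid, movie) with one 0/1 column per actor.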

Related

Pandas Dataframe - Add a new Column with value from another row

I am trying to add a new column called ordered_1day_ago to my df.
My DataFrame currently looks like this:
itemID  orderedDate  qty
1       12/2/21      3
2       12/3/21      2
1       12/3/21      2
1       12/4/21      3
I want it to look like this:
itemID  orderedDate  qty  ordered_1day_ago
1       12/2/21      3    0
2       12/3/21      2    0
1       12/3/21      2    3
1       12/4/21      3    2
itemID and orderedDate must be used to carry the qty forward to the next orderedDate if it falls within one day; if it does not, then ordered_1day_ago is 0.
How can we use pandas for this?
This is a complete template solution:
import pandas as pd

# a dict to create the dataframe
d = {
    'itemID': [1, 2, 1, 1],
    'orderedDate': ['12/2/21', '12/3/21', '12/3/21', '12/4/21'],
    'qty': [3, 2, 2, 3]
}

# the old dataframe
df = pd.DataFrame(d)
print(df)

# some function to do what you want, applied row by row
def some_function(row):
    # code goes here
    z = row['itemID'] + row['qty']
    return z

# add the new column given the function above
df['ordered_1day_ago'] = df.apply(some_function, axis=1)

# the new dataframe with the extra column
print(df)
This is the original df:
itemID orderedDate qty
0 1 12/2/21 3
1 2 12/3/21 2
2 1 12/3/21 2
3 1 12/4/21 3
This is the new df with the added (example) column:
itemID orderedDate qty ordered_1day_ago
0 1 12/2/21 3 4
1 2 12/3/21 2 4
2 1 12/3/21 2 3
3 1 12/4/21 3 4
You can amend the function to contain whatever criteria you wish such that the new column ordered_1day_ago contains the results that you wish.
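For the specific one-day lookup in the question, here is a sketch using a self-merge on a shifted date (this assumes orderedDate parses as a date and that each itemID appears at most once per day):

```python
import pandas as pd

df = pd.DataFrame({
    'itemID': [1, 2, 1, 1],
    'orderedDate': pd.to_datetime(['12/2/21', '12/3/21', '12/3/21', '12/4/21']),
    'qty': [3, 2, 2, 3],
})

# Shift each order's date forward one day; merging on (itemID, orderedDate)
# then pairs every row with the same item's order from the previous day.
prev = df.assign(orderedDate=df['orderedDate'] + pd.Timedelta(days=1))
out = df.merge(prev, on=['itemID', 'orderedDate'], how='left', suffixes=('', '_prev'))
out['ordered_1day_ago'] = out['qty_prev'].fillna(0).astype(int)
out = out.drop(columns='qty_prev')
print(out)
```

Rows with no same-item order on the previous day get 0 via the left merge's NaN fill.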

dataframe to presence/absence dataframe with 2 columns as comma separated strings

I have a dataframe with 3 columns. The first one (annotations) is what I want to measure presence/absence on; the latter two (categories, CUI_desc) contain comma-separated strings of factors that I would like to become the columns of a presence/absence dataframe.
Currently, the data looks like this:
annotations                categories                               CUI_desc
heroine                    ['heroic', 'opioid', 'substance_abuse']  ['C0011892___heroin']
heroin                     ['heroic', 'opioid', 'substance_abuse']  ['C0011892___heroin']
he smoked two packs a day  ['opioid', 'substance_abuse']            ['C0439234___year', 'C0748223___QUIT', 'C0028040___nicotine']
And I would like it to look like this:
annotations                heroic  opioid  substance_abuse  C0011892___heroin  C0439234___year  C0748223___QUIT  C0028040___nicotine
heroine                    1       1       1                1                  0                0                0
heroin                     1       1       1                1                  0                0                0
he smoked two packs a day  0       1       1                0                  1                1                1
I used this line of code from a similar question:
from collections import Counter
test = pd.DataFrame({k:Counter(v) for k, v in master.items()}).T.fillna(0).astype(int)
But got an undesired output:
heroine heroin he smoked two packs a day
annotations 1 1 1
categories 0 0 0
CUI_desc 0 0 0
It seems to be counting how many times a certain annotations shows up in my dataframe. This is likely because the above block of code is for a dictionary and not a dataframe.
Edit: OP clarified that each cell is a string so we need to convert it into a list first before calling explode.
Assuming the index is unique:
from ast import literal_eval
categories = pd.get_dummies(master['categories'].apply(literal_eval).explode()).groupby(level=0).sum()
cui_desc = pd.get_dummies(master['CUI_desc'].apply(literal_eval).explode()).groupby(level=0).sum()
pd.concat([master['annotations'], categories, cui_desc], axis=1)
Output:
annotations heroic opioid substance_abuse C0011892___heroin C0028040___nicotine C0439234___year C0748223___QUIT
heroine 1 1 1 1 0 0 0
heroin 1 1 1 1 0 0 0
he smoked two packs a day 0 1 1 0 1 1 1
Here is another approach, using Series.value_counts:
import ast

def row_value_counts(row):
    return row.apply(ast.literal_eval).explode().value_counts()

test = (
    df.set_index("annotations")
      .apply(row_value_counts, axis=1)
      .fillna(0)
      .astype(int)
      .reset_index()
)
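For reference, the get_dummies answer above can be run end to end on data mirroring the question (a sketch, assuming each cell holds the string representation of a Python list, as the OP clarified):

```python
import pandas as pd
from ast import literal_eval

# Sample data mirroring the question; cells are string representations of lists.
master = pd.DataFrame({
    "annotations": ["heroine", "heroin", "he smoked two packs a day"],
    "categories": ["['heroic', 'opioid', 'substance_abuse']",
                   "['heroic', 'opioid', 'substance_abuse']",
                   "['opioid', 'substance_abuse']"],
    "CUI_desc": ["['C0011892___heroin']",
                 "['C0011892___heroin']",
                 "['C0439234___year', 'C0748223___QUIT', 'C0028040___nicotine']"],
})

# Parse each string into a list, explode to one factor per row, one-hot
# encode, then collapse back to one row per original index.
categories = pd.get_dummies(master["categories"].apply(literal_eval).explode()).groupby(level=0).sum()
cui_desc = pd.get_dummies(master["CUI_desc"].apply(literal_eval).explode()).groupby(level=0).sum()
result = pd.concat([master["annotations"], categories, cui_desc], axis=1)
print(result)
```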

Getting maximum values in a column

My dataframe looks like this:
Country Code Duration
A 1 0
A 1 1
A 1 2
A 1 3
A 2 0
A 2 1
A 1 0
A 1 1
A 1 2
I need to get max values from the "Duration" column - not just a single maximum value, but one for each consecutive sequence of numbers in this column. The output might look like this:
Country Code Duration
A 1 3
A 2 1
A 1 2
I could have grouped by "Code", but its values are often repeating, so that's probably not an option. Any help or tips would be much appreciated.
Use idxmax after creating another group key with diff and cumsum:
df.loc[df.groupby([df.Country,df.Code.diff().ne(0).cumsum()]).Duration.idxmax()]
Country Code Duration
3 A 1 3
5 A 2 1
8 A 1 2
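Run against the sample data, that one-liner can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["A"] * 9,
    "Code": [1, 1, 1, 1, 2, 2, 1, 1, 1],
    "Duration": [0, 1, 2, 3, 0, 1, 0, 1, 2],
})

# Each change in Code starts a new run; diff().ne(0).cumsum() labels runs
# 1, 2, 3, ... so repeated Code values in separate runs stay separate.
run = df.Code.diff().ne(0).cumsum()
result = df.loc[df.groupby([df.Country, run]).Duration.idxmax()]
print(result)
```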
First we create a mask to mark the sequences. Then we group by it to create the wanted output:
m = (~df['Code'].eq(df['Code'].shift())).cumsum()
df.groupby(m).agg({'Country': 'first',
                   'Code': 'first',
                   'Duration': 'max'}).reset_index(drop=True)
Country Code Duration
0 A 1 3
1 A 2 1
2 A 1 2
The problem is slightly unclear. However, assuming that order is important, we can move toward a solution.
import pandas as pd

d = pd.read_csv('data.csv')
s = d.Code
d['series'] = s.ne(s.shift()).cumsum()
print(d.groupby(['Country', 'Code', 'series'])['Duration'].max().reset_index())
Returns:
Country Code series Duration
0 A 1 1 3
1 A 1 3 2
2 A 2 2 1
You can then drop the series column.
You might want to check this link; it might be the answer you're looking for:
pandas groupby where you get the max of one column and the min of another column. It goes as:
result = df.groupby(['Code', 'Country']).agg({'Duration':'max'})[['Duration']].reset_index()

How to create pandas dummies based on column values

I would like to create dummies based on column values...
This is what the df looks like
I want to create this
This is my approach so far:
import pandas as pd

df = pd.read_csv('test.csv')
v = df.Values
v_set = set()
for line in v:
    line = line.split(',')
    for x in line:
        if x != "":
            v_set.add(x)
for val in v_set:
    df[val] = ''
By the above code I am able to create columns in my df like this
How do I go about updating the row values to create dummies?
This is where I am having problems.
Thanks in advance.
You could use pandas.Series.str.get_dummies. This will allow you to split the column directly on a delimiter.
df = pd.concat([df.ID, df.Values.str.get_dummies(sep=",")], axis=1)
ID 1 2 3 4
0 1 1 1 0 0
1 2 0 0 1 1
df.Values.str.get_dummies(sep=",") will generate
1 2 3 4
0 1 1 0 0
1 0 0 1 1
Then, we do a pd.concat to glue the dummies back onto df.
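Put together, a minimal runnable sketch (the two-row frame here is an assumption standing in for the image in the question):

```python
import pandas as pd

# Assumed shape of the question's df: an ID column plus a Values column
# of comma-separated codes.
df = pd.DataFrame({"ID": [1, 2], "Values": ["1,2", "3,4"]})

# One 0/1 column per distinct code, split on the comma delimiter.
dummies = df.Values.str.get_dummies(sep=",")
result = pd.concat([df.ID, dummies], axis=1)
print(result)
```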

Pandas - group by id and drop duplicate with threshold

I have the following data:
userid itemid
1 1
1 1
1 3
1 4
2 1
2 2
2 3
I want to drop userids who have viewed the same itemid two or more times.
For example, userid=1 has viewed itemid=1 twice, so I want to drop the entire record of userid=1. However, since userid=2 hasn't viewed any item twice, I will leave userid=2 as it is.
So I want my data to be like the following:
userid itemid
2 1
2 2
2 3
Can someone help me?
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})
You can use duplicated to determine the row level duplicates, then perform a groupby on 'userid' to determine 'userid' level duplicates, then drop accordingly.
To drop without a threshold:
df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]
To drop with a threshold, use keep=False in duplicated, and sum over the Boolean column and compare against your threshold. For example, with a threshold of 3:
df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]
The resulting output for no threshold:
userid itemid
4 2 1
5 2 2
6 2 3
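A runnable sketch of the no-threshold variant on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

# Mark rows that repeat a (userid, itemid) pair, then drop every row of
# any user that has at least one such repeat.
dupe_user = df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')
result = df[~dupe_user]
print(result)
```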
filter was made for this. You can pass a function that returns a boolean that determines whether the group passes the filter or not.

filter and value_counts (most generalizable and intuitive):
df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)

filter and is_unique (a special case when looking for n < 2):
df.groupby('userid').filter(lambda x: x.itemid.is_unique)
userid itemid
4 2 1
5 2 2
6 2 3
Group the dataframe by users and items:
views = df.groupby(['userid','itemid'])['itemid'].count()
#userid itemid
#1 1 2 <=== The offending row
# 3 1
# 4 1
#2 1 1
# 2 1
# 3 1
#Name: itemid, dtype: int64
Find out which users never viewed the same item THRESHOLD or more times:
THRESHOLD = 2
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)
#userid
#1 False
#2 True
#dtype: bool
Combine the results and keep the 'good' rows:
combined = df.merge(pd.DataFrame(viewed).reset_index())
combined[combined[0]][['userid','itemid']]
# userid itemid
#4 2 1
#5 2 2
#6 2 3
# group userid and itemid and get a count
df2 = df.groupby(by=['userid', 'itemid']).apply(lambda x: len(x)).reset_index()

# extract rows where the max userid-itemid count is less than 2
df2 = df2[~df2.userid.isin(df2[df2.iloc[:, -1] > 1]['userid'])][df.columns]
print(df2)
   itemid  userid
3       1       2
4       2       2
5       3       2
If you want to drop at a certain threshold, just use
df2.iloc[:, -1] > threshold
I do not know whether there is a function available in Pandas to do this task. However, I tried to make a workaround to deal with your problem.
Here is the full code.
import pandas as pd

dictionary = {'userid': [1, 1, 1, 1, 2, 2, 2],
              'itemid': [1, 1, 3, 4, 1, 2, 3]}
df = pd.DataFrame(dictionary, columns=['userid', 'itemid'])

selected_user = []
for user in df['userid'].drop_duplicates().tolist():
    items = df.loc[df['userid'] == user]['itemid'].tolist()
    if len(items) == len(set(items)):
        selected_user.append(user)

result = df.loc[df['userid'].isin(selected_user)]
This code produces the following output:
userid itemid
4 2 1
5 2 2
6 2 3
Hope it helps.
