I am working on sequential and frequent pattern mining. I was given a dataset of this type and told to build a sequence from it before processing.
This is a sample taken from the dataset, in table format. The table in .csv format is available at: https://drive.google.com/file/d/1j1rEy4Q600y_oym23cG3m3NNWuNvIcgG/view?usp=sharing
User  Item 1  Item 2  Item 3  Item 4   Item 5  Item 6
A     milk    cake    citrus
B     cheese  milk    bread   cabbage  carrot
A     tea     juice   citrus  salmon
B     apple   orange
B     cake
My first thought is to load the CSV file into a pandas DataFrame, which I have no problem doing. What I want to ask is: how can I get from the DataFrame to a result like this?
Expected result 1: the items a user bought in one transaction are grouped into one tuple.
User  Transactions
A     (milk cake citrus)(tea juice citrus salmon)
B     (cheese milk bread cabbage carrot)(apple orange)(cake)
Expected result 2: the items purchased by each user are not grouped into transactions.
User  Transactions
A     milk, cake, citrus, tea, juice, citrus, salmon
B     cheese, milk, bread, cabbage, carrot, apple, orange, cake
My question is: how do I build these dataframes? I tried the solution from this article: How to group dataframe rows into list in pandas groupby, but it did not work.
In order to get the first result:
out = df.set_index('User').apply(lambda x : tuple(x[x.notna()].tolist()),axis=1).groupby(level=0).agg(list).reset_index(name='Transactions')
Out[95]:
User Transactions
0 A [(milk, cake, citrus), (tea, juice, citrus, sa...
1 B [(cheese, milk, bread, cabbage, carrot), (appl...
The second result is easier than the first:
df.set_index('User').replace('',np.nan).stack().groupby(level=0).agg(','.join)
Out[97]:
User
A milk,cake,citrus,tea,juice,citrus,salmon
B cheese,milk,bread,cabbage,carrot,apple,orange,...
dtype: object
Let's start with the second one:
(df.set_index('User')
.stack()
.groupby(level=0).apply(list)
.rename('Transactions')
.reset_index()
)
output:
User Transactions
0 A [milk, cake, citrus, tea, juice, citrus, salmon]
1 B [cheese, milk, bread, cabbage, carrot, apple, ...
To get the first one, you just need to add a new column that numbers each user's transactions:
(df.assign(group=df.groupby('User').cumcount())
.set_index(['User', 'group'])
.stack()
.groupby(level=[0,1]).apply(tuple)
.groupby(level=0).apply(list)
.rename('Transactions')
.reset_index()
)
output:
User Transactions
0 A [(milk, cake, citrus), (tea, juice, citrus, sa...
1 B [(cheese, milk, bread, cabbage, carrot), (appl...
import pandas as pd
df = pd.read_csv('sampletable.csv')
df['Transactions'] = '(' + df[['Item 1','Item 2','Item 3','Item 4','Item 5','Item 6']].apply(lambda x: x.str.cat(sep=' '), axis=1) + ')'
df = df.groupby(['User'])['Transactions'].apply(lambda x: ''.join(x)).reset_index()
print(df)
output:
User Transactions
0 A (milk cake citrus)(tea juice citrus salmon)
1 B (cheese milk bread cabbage carrot)(apple orange)(cake)
For the second output, use this:
df = pd.read_csv('sampletable.csv')
df['a'] = df[['Item 1','Item 2','Item 3','Item 4','Item 5','Item 6']].apply(lambda x: x.str.cat(sep=', '), axis=1)
df = df.groupby(['User'])['a'].apply(lambda x: ', '.join(x)).reset_index()
print(df)
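The answers above all read the CSV from the question. As a minimal, self-contained sketch, the sample table can be rebuilt inline and both expected results derived from it. The variable names (`baskets`, `grouped`, `flat`, `as_text`) are illustrative and not from the original posts, and the list comprehension is just one alternative to the `apply`/`stack` variants shown above.

```python
import numpy as np
import pandas as pd

# Sample table rebuilt inline (stands in for the CSV from the question)
df = pd.DataFrame({
    "User":   ["A", "B", "A", "B", "B"],
    "Item 1": ["milk", "cheese", "tea", "apple", "cake"],
    "Item 2": ["cake", "milk", "juice", "orange", np.nan],
    "Item 3": ["citrus", "bread", "citrus", np.nan, np.nan],
    "Item 4": [np.nan, "cabbage", "salmon", np.nan, np.nan],
    "Item 5": [np.nan, "carrot", np.nan, np.nan, np.nan],
    "Item 6": [np.nan] * 5,
})

item_cols = [c for c in df.columns if c != "User"]

# One tuple per row (one transaction), skipping empty cells
baskets = pd.Series(
    [tuple(v for v in row if pd.notna(v)) for row in df[item_cols].to_numpy()],
    index=df["User"],
)

# Expected result 1: a list of transaction tuples per user
grouped = baskets.groupby(level=0).agg(list)

# Expected result 2: one flat list of items per user
flat = grouped.apply(lambda txs: [item for tx in txs for item in tx])

# Optional: the "(a b c)(d e)" string layout used in the question
as_text = grouped.apply(lambda txs: "".join("(" + " ".join(tx) + ")" for tx in txs))

print(grouped)
print(flat)
print(as_text)
```

Keeping the transactions as tuples (rather than pre-formatted strings) is probably the more convenient input for a sequence-mining library; the string form is only needed for display.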
So, what I mean by explode is this: I want to transform a dataframe like:
ID | Name | Food | Drink
1 John Apple, Orange Tea , Water
2 Shawn Milk
3 Patrick Chichken
4 Halley Fish Nugget
into this dataframe:
ID | Name | Order Type | Items
1 John Food Apple
2 John Food Orange
3 John Drink Tea
4 John Drink Water
5 Shawn Drink Milk
6 Patrick Food Chichken
I don't know how to make this happen. Any help would be appreciated!
IIUC, stack the columns and then unnest the comma-separated values. I would not change the ID here; I think keeping the original one is better.
s=df.set_index(['ID','Name']).stack()
pd.DataFrame(data=s.str.split(',').sum(),index=s.index.repeat(s.str.split(',').str.len())).reset_index()
Out[289]:
ID Name level_2 0
0 1 John Food Apple
1 1 John Food Orange
2 1 John Drink Tea
3 1 John Drink Water
4 2 Shawn Drink Milk
5 3 Patrick Food Chichken
6 4 Halley Food Fish Nugget
# if you need to rename the column to Item, try the line below
#pd.DataFrame(data=s.str.split(',').sum(),index=s.index.repeat(s.str.split(',').str.len())).rename(columns={0:'Item'}).reset_index()
You can use pd.melt to convert the data from wide to long format. I think this will be easier to understand step by step.
# first split into separate columns
df[['Food1','Food2']] = df.Food.str.split(',', expand=True)
df[['Drink1','Drink2']] = df.Drink.str.split(',', expand=True)
# now melt the df into long format
df = pd.melt(df, id_vars=['Name'], value_vars=['Food1','Food2','Drink1','Drink2'])
# remove unwanted rows and filter data
df = df[df['value'].notnull()].sort_values('Name').reset_index(drop=True)
# rename the column names and values
df.rename(columns={'variable':'Order Type', 'value':'Items'}, inplace=True)
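# strip the trailing digit; regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching
df['Order Type'] = df['Order Type'].str.replace(r'\d', '', regex=True)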
# output
print(df)
Name Order Type Items
0 Halley Food Fish Nugget
1 John Food Apple
2 John Food Orange
3 John Drink Tea
4 John Drink Water
5 Patrick Food Chichken
6 Shawn Drink Milk
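On pandas versions that provide `DataFrame.explode` (0.25 and later), the same reshaping can be sketched without hard-coding how many items each cell holds. This is only an illustration under assumptions: the input frame is rebuilt by hand from the question's table, missing cells are assumed to be `None`, and `long_df` is a made-up name.

```python
import pandas as pd

# Input rebuilt from the question; empty cells are assumed to be missing values
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["John", "Shawn", "Patrick", "Halley"],
    "Food": ["Apple, Orange", None, "Chichken", "Fish Nugget"],
    "Drink": ["Tea , Water", "Milk", None, None],
})

# Wide to long: one row per (person, order type), dropping empty cells
long_df = df.melt(
    id_vars=["ID", "Name"], var_name="Order Type", value_name="Items"
).dropna(subset=["Items"])

# Split the comma-separated strings into lists, then one row per list element
long_df["Items"] = long_df["Items"].str.split(",")
long_df = long_df.explode("Items").reset_index(drop=True)

# Remove the whitespace left around the commas
long_df["Items"] = long_df["Items"].str.strip()

# Stable sort keeps Food before Drink within each person
long_df = long_df.sort_values("ID", kind="stable").reset_index(drop=True)
print(long_df)
```

Like the stack-based answer, this keeps the original ID instead of renumbering the rows.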
In a pandas DataFrame df I have columns like this:
NAME KEYWORD AMOUNT INFO
0 orange fruit 13 from italy
1 potato veggie 7 from germany
2 potato veggie 9 from germany
3 orange fruit 8 from italy
4 potato veggie 6 from germany
Using a groupby on KEYWORD, I want to sum the AMOUNT values per group and keep the first value from each of the other columns, so that the result reads:
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
I tried
df.groupby('KEYWORD').sum()
but this "summarises" over all columns, i.e. I get
NAME KEYWORD AMOUNT INFO
0 orangeorange fruit 21 from italyfrom italy
1 potatopotatopotato veggie 22 from germanyfrom germanyfrom germany
Then I tried to use different functions for different columns:
df.groupby('KEYWORD').agg({'AMOUNT': sum, 'NAME': first, ....})
with
def first(f_arg, *args):
    return f_arg
But unfortunately this gives me a "ValueError: function does not reduce" error.
So I am a bit at a loss. How can I apply sum only to the AMOUNT column, while keeping the others?
Use groupby + agg with a custom aggfunc dict.
f = dict.fromkeys(df.columns.difference(['KEYWORD']), 'first')
f['AMOUNT'] = 'sum'
df = df.groupby('KEYWORD', as_index=False).agg(f)
df
KEYWORD NAME AMOUNT INFO
0 fruit orange 21 from italy
1 veggie potato 22 from germany
dict.fromkeys gives me a nice way of generalising this for N number of columns. If column order matters, add a reindex operation at the end:
df = df.groupby('KEYWORD', as_index=False).agg(f).reindex(columns=df.columns)
df
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
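Use drop_duplicates on the KEYWORD column and then assign the aggregated values. Pass sort=False to groupby so the group order matches the first-occurrence order that drop_duplicates keeps:
df=df.drop_duplicates('KEYWORD').assign(AMOUNT=df.groupby('KEYWORD', sort=False)['AMOUNT'].sum().values)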
print (df)
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
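For a small, fixed set of columns, named aggregation (available since pandas 0.25) is another way to spell "sum one column, keep the first value of the rest". The sketch below rebuilds the question's data inline; `result` is an illustrative name.

```python
import pandas as pd

# Data rebuilt from the question
df = pd.DataFrame({
    "NAME": ["orange", "potato", "potato", "orange", "potato"],
    "KEYWORD": ["fruit", "veggie", "veggie", "fruit", "veggie"],
    "AMOUNT": [13, 7, 9, 8, 6],
    "INFO": ["from italy", "from germany", "from germany", "from italy", "from germany"],
})

result = (
    df.groupby("KEYWORD", as_index=False)
      # output_column=(input_column, aggregation)
      .agg(NAME=("NAME", "first"), AMOUNT=("AMOUNT", "sum"), INFO=("INFO", "first"))
      # restore the original column order
      .reindex(columns=df.columns)
)
print(result)
```

The `dict.fromkeys` approach scales better when there are many columns; the named form is mostly easier to read when there are only a few.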
I have a dataframe in which one column contains a list of tuples. I need to split each list of tuples into corresponding columns. My dataframe df looks like this:
A B
[('Apple',50),('Orange',30),('banana',10)] Winter
[('Orange',69),('WaterMelon',50)] Summer
The expected output should be:
Fruit rate B
Apple 50 winter
Orange 30 winter
banana 10 winter
Orange 69 summer
WaterMelon 50 summer
You can use DataFrame constructor with numpy.repeat and numpy.concatenate:
df1 = pd.DataFrame(np.concatenate(df.A), columns=['Fruit','rate']).reset_index(drop=True)
df1['B'] = np.repeat(df.B.values, df['A'].str.len())
print (df1)
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
Another solution with chain.from_iterable:
from itertools import chain
df1 = pd.DataFrame(list(chain.from_iterable(df.A)), columns=['Fruit','rate']).reset_index(drop=True)
df1['B'] = np.repeat(df.B.values, df['A'].str.len())
print (df1)
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
This should work:
fruits = []
rates = []
seasons = []
def create_lists(row):
    tuples = row['A']
    season = row['B']
    for t in tuples:
        fruits.append(t[0])
        rates.append(t[1])
        seasons.append(season)
df.apply(create_lists, axis=1)
new_df = pd.DataFrame({"Fruit" :fruits, "Rate": rates, "B": seasons})[["Fruit", "Rate", "B"]]
output:
Fruit Rate B
0 Apple 50 winter
1 Orange 30 winter
2 banana 10 winter
3 Orange 69 summer
4 WaterMelon 50 summer
You can do this in a chained operation:
(
df.apply(lambda x: [[k,v,x.B] for k,v in x.A],axis=1)
.apply(pd.Series)
.stack()
.apply(pd.Series)
.reset_index(drop=True)
.rename(columns={0:'Fruit',1:'rate',2:'B'})
)
Out[1036]:
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
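On pandas 0.25 or later, `DataFrame.explode` offers a shorter route: put each tuple on its own row, then expand the tuples into columns. This is a sketch with the question's data rebuilt inline; `exploded` and `out` are illustrative names. Because the tuples never pass through a NumPy array, the rates stay integers here, whereas `np.concatenate` on mixed string/number tuples would typically coerce them to strings.

```python
import pandas as pd

# Data rebuilt from the question
df = pd.DataFrame({
    "A": [
        [("Apple", 50), ("Orange", 30), ("banana", 10)],
        [("Orange", 69), ("WaterMelon", 50)],
    ],
    "B": ["Winter", "Summer"],
})

# One row per tuple; column B is repeated alongside automatically
exploded = df.explode("A").reset_index(drop=True)

# Expand each (fruit, rate) tuple into two columns
out = pd.DataFrame(exploded["A"].tolist(), columns=["Fruit", "rate"])
out["B"] = exploded["B"]
print(out)
```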
I have a dataframe that looks like this:
[image not available]
My aim is to arrive at:
[image not available]
Explanation:
Every customer has made 3 orders
One can buy from as many Categories in each order
Desired state: Get all possible permutations of Categories a customer purchased by order sequence. The second picture would help understand this better
Category1 in desired state represents Categories purchased in first order, Category2 represents Categories purchased in second order and so on.
Code I'm using:
import time
import pandas as pd

start_time = time.time()
df = pd.DataFrame()
for CustomerName in base_df.CustomerName.unique():
    df1 = base_df[(base_df['CustomerName']== CustomerName)][['CustomerName','order_seq','Category']]
    # Cartesian product of the categories bought in each order
    df2 = pd.DataFrame(index=pd.MultiIndex.from_product([subdf['Category'] for p, subdf in df1.groupby('order_seq')], names = df1.order_seq.unique())).reset_index()
    df2['CustomerName'] = CustomerName
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    df = pd.concat([df, df2])
print("--- %s seconds ---" %(time.time() - start_time))
This takes about 10 mins to run on my dataset - Looking for a faster method.
I am working in pandas right now, but pointers for R or SQL are also welcome! Thank you!
Consider a merge of three OrderSequence dataframes, each joined to a distinct CustomerName:
import pandas as pd
df = pd.DataFrame({'CustomerName': [1,1,1,1,1,1,1,2,2,2,3,3,3,3],
'OrderSequence': [1,2,2,2,3,3,3,1,2,3,1,1,2,3],
'Category': ['Food','Food','Clothes','Furniture','Clothes','Food','Toys',
'Clothes','Toys','Food','Furniture','Toys','Food','Food']})
finaldf = pd.DataFrame(df['CustomerName'].drop_duplicates())
for i in range(1,4):
    seqdf = df[df['OrderSequence']==i][['CustomerName', 'Category']].\
                rename(columns={'Category':'Category'+str(i)})
    finaldf = pd.merge(finaldf, seqdf, on=['CustomerName'])
print(finaldf)
# CustomerName Category1 Category2 Category3
# 0 1 Food Food Clothes
# 1 1 Food Food Food
# 2 1 Food Food Toys
# 3 1 Food Clothes Clothes
# 4 1 Food Clothes Food
# 5 1 Food Clothes Toys
# 6 1 Food Furniture Clothes
# 7 1 Food Furniture Food
# 8 1 Food Furniture Toys
# 9 2 Clothes Toys Food
# 10 3 Furniture Food Food
# 11 3 Toys Food Food
Admittedly, the above setup was first thought out in SQL using self joins, then translated to pandas:
SELECT t1.CustomerName, t2.Category AS Category1,
t3.Category AS Category2, t4.Category AS Category3
FROM (SELECT DISTINCT CustomerName FROM DataFrame) AS t1
INNER JOIN DataFrame AS t2
ON t1.CustomerName = t2.CustomerName
INNER JOIN DataFrame AS t3
ON t1.CustomerName = t3.CustomerName
INNER JOIN DataFrame AS t4
ON t1.CustomerName = t4.CustomerName
WHERE (t2.OrderSequence=1) AND (t3.OrderSequence=2) AND (t4.OrderSequence=3);
Okay, it took some work, but I did it. Hope it helps.
import pandas as pd
import numpy as np
from functools import reduce  # reduce is not a builtin in Python 3
df = pd.DataFrame([], columns=['CustomerName','Order Sequence','Category'])
df['CustomerName'] = [1,1,1,1,1,1,1,2,2,2,3,3,3,3]
df['Order Sequence'] = [1,2,2,2,3,3,3,1,2,3,1,1,2,3]
df['Category'] = ['Food','Food','Clothes','Furniture','Clothes','Food','Toys','Clothes','Toys','Food','Furniture','Toys','Food','Food']
df2 = pd.DataFrame([], columns=['CustomerName','Category1','Category2','Category3'])
for CN in sorted(set(df['CustomerName'])):
    df_temp = pd.DataFrame([], columns=['CustomerName','Category1','Category2','Category3'])
    list_OS_1 = []
    list_OS_2 = []
    list_OS_3 = []
    # number of rows per customer = product of the category counts per order
    MMC = reduce(lambda x, y: x*y,df.loc[df['CustomerName']==CN, 'Order Sequence'].value_counts().values)
    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==1)), 'Category'])):
        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==1)), 'Category']:
            list_OS_1.append(CTG)
    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==2)), 'Category'])):
        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==2)), 'Category']:
            list_OS_2.append(CTG)
    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==3)), 'Category'])):
        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==3)), 'Category']:
            list_OS_3.append(CTG)
    df_temp['Category1'] = list_OS_1
    df_temp['Category2'] = list_OS_2
    df_temp['Category3'] = list_OS_3
    df_temp['CustomerName'] = CN
    df2 = pd.concat([df2,df_temp], axis=0)
print (df2)
output:
CustomerName Category1 Category2 Category3
0 1.0 Food Food Clothes
1 1.0 Food Clothes Food
2 1.0 Food Furniture Toys
3 1.0 Food Food Clothes
4 1.0 Food Clothes Food
5 1.0 Food Furniture Toys
6 1.0 Food Food Clothes
7 1.0 Food Clothes Food
8 1.0 Food Furniture Toys
0 2.0 Clothes Toys Food
0 3.0 Furniture Food Food
1 3.0 Toys Food Food
P.S.: it's not dynamic, so it will break if you add or remove categories.
But as long as the data follows the initial format you gave me, it should work.
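Note that the output above repeats the same three rows for customer 1 instead of listing all nine combinations. As a hedged alternative, the per-customer Cartesian product can be sketched with `itertools.product`. The sample data is the same as in the answers above; the sketch assumes every customer has exactly three order sequences (as stated in the question), and `rows` and `result` are illustrative names.

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({
    "CustomerName": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "OrderSequence": [1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 1, 2, 3],
    "Category": ["Food", "Food", "Clothes", "Furniture", "Clothes", "Food", "Toys",
                 "Clothes", "Toys", "Food", "Furniture", "Toys", "Food", "Food"],
})

rows = []
for customer, grp in df.groupby("CustomerName"):
    # One list of categories per order, in order-sequence order
    per_order = [g["Category"].tolist() for _, g in grp.groupby("OrderSequence")]
    # Every way to pick one category from each order
    for combo in product(*per_order):
        rows.append((customer, *combo))

# Assumption: exactly three orders per customer, hence three Category columns
result = pd.DataFrame(
    rows, columns=["CustomerName", "Category1", "Category2", "Category3"]
)
print(result)
```

On this sample the result has 12 rows (9 for customer 1, 1 for customer 2, 2 for customer 3), matching the merge-based answer. Whether it is faster than the merge approach on a large dataset would need to be measured.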