I am working on sequential and frequent pattern mining. I was given this type of dataset for the task, and I am told to build a sequence from the dataset before processing.
Here is a sample of the data, in table format. The table in .csv format is available at: https://drive.google.com/file/d/1j1rEy4Q600y_oym23cG3m3NNWuNvIcgG/view?usp=sharing
User  Item 1  Item 2  Item 3  Item 4   Item 5  Item 6
A     milk    cake    citrus
B     cheese  milk    bread   cabbage  carrot
A     tea     juice   citrus  salmon
B     apple   orange
B     cake
First, I think I have to load the csv file into a pandas DataFrame. I have no problem with that; what I want to ask is: how is it possible to produce results like the following from that DataFrame?
Expected result 1: each group of items bought by one user (one row) is grouped into one tuple:
User  Transactions
A     (milk cake citrus)(tea juice citrus salmon)
B     (cheese milk bread cabbage carrot)(apple orange)(cake)
Expected result 2: the items purchased by each user are not grouped per transaction:
User  Transactions
A     milk, cake, citrus, tea, juice, citrus, salmon
B     cheese, milk, bread, cabbage, carrot, apple, orange, cake
My question is: how do I build those DataFrames? I've tried a solution from this question: How to group dataframe rows into list in pandas groupby, but it was still not successful.
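For reference when running the answers below, here is a minimal sketch that rebuilds the sample table as a DataFrame (an assumption: with pd.read_csv the empty cells would come in as NaN, which is what the snippets below rely on):

import pandas as pd
import numpy as np

# hypothetical reconstruction of sampletable.csv;
# missing items are NaN, as pd.read_csv would produce
df = pd.DataFrame({
    'User':   ['A', 'B', 'A', 'B', 'B'],
    'Item 1': ['milk', 'cheese', 'tea', 'apple', 'cake'],
    'Item 2': ['cake', 'milk', 'juice', 'orange', np.nan],
    'Item 3': ['citrus', 'bread', 'citrus', np.nan, np.nan],
    'Item 4': [np.nan, 'cabbage', 'salmon', np.nan, np.nan],
    'Item 5': [np.nan, 'carrot', np.nan, np.nan, np.nan],
    'Item 6': [np.nan] * 5,
})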
In order to get the first result:
out = (df.set_index('User')
         .apply(lambda x: tuple(x[x.notna()].tolist()), axis=1)
         .groupby(level=0)
         .agg(list)
         .reset_index(name='Transactions'))
Out[95]:
User Transactions
0 A [(milk, cake, citrus), (tea, juice, citrus, sa...
1 B [(cheese, milk, bread, cabbage, carrot), (appl...
For the second result, which is easier than the previous one (np here is numpy, imported as np):
df.set_index('User').replace('', np.nan).stack().groupby(level=0).agg(','.join)
Out[97]:
User
A milk,cake,citrus,tea,juice,citrus,salmon
B cheese,milk,bread,cabbage,carrot,apple,orange,...
dtype: object
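If you need that as a two-column DataFrame like the expected output rather than a Series, Series.reset_index with a name should do it. A minimal sketch (out2 is just a hypothetical name, and ', ' matches the separator shown in the expected result):

out2 = (df.set_index('User')
          .replace('', np.nan)
          .stack()
          .groupby(level=0)
          .agg(', '.join)
          .reset_index(name='Transactions'))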
Let's start with the second one:
(df.set_index('User')
   .stack()
   .groupby(level=0).apply(list)
   .rename('Transactions')
   .reset_index()
)
output:
User Transactions
0 A [milk, cake, citrus, tea, juice, citrus, salmon]
1 B [cheese, milk, bread, cabbage, carrot, apple, ...
To get the first one, one just needs to add a grouping column:
(df.assign(group=df.groupby('User').cumcount())
   .set_index(['User', 'group'])
   .stack()
   .groupby(level=[0, 1]).apply(tuple)
   .groupby(level=0).apply(list)
   .rename('Transactions')
   .reset_index()
)
output:
User Transactions
0 A [(milk, cake, citrus), (tea, juice, citrus, sa...
1 B [(cheese, milk, bread, cabbage, carrot), (appl...
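If you then need the single-string form from expected result 1, i.e. (milk cake citrus)(tea juice citrus salmon), instead of a list of tuples, one more apply can format it. A sketch, assuming the frame above has been assigned to a variable named out (a hypothetical name):

out['Transactions'] = out['Transactions'].apply(
    lambda groups: ''.join('(' + ' '.join(g) + ')' for g in groups)
)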
For the first output:
import pandas as pd
df = pd.read_csv('sampletable.csv')
df['Transactions'] = '(' + df[['Item 1','Item 2','Item 3','Item 4','Item 5','Item 6']].apply(lambda x: x.str.cat(sep=' '), axis=1) + ')'
df = df.groupby(['User'])['Transactions'].apply(lambda x: ''.join(x)).reset_index()
print(df)
output:
User Transactions
0 A (milk cake citrus)(tea juice citrus salmon)
1 B (cheese milk bread cabbage carrot)(apple orange)(cake)
For the second output, use this:
df = pd.read_csv('sampletable.csv')
df['a'] = df[['Item 1','Item 2','Item 3','Item 4','Item 5','Item 6']].apply(lambda x: x.str.cat(sep=', '), axis=1)
df = df.groupby(['User'])['a'].apply(lambda x: ', '.join(x)).reset_index()
print(df)
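which should print something like:
  User                                                          a
0    A             milk, cake, citrus, tea, juice, citrus, salmon
1    B  cheese, milk, bread, cabbage, carrot, apple, orange, cake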
Related
I have a dataframe with information where the rows are not related to each other:
   Fruits  Vegetables  Protein
1  Apple   Spinach     Beef
2  Banana  Cucumber    Chicken
3  Pear    Carrot      Pork
I essentially just want to create a pandas Series with all of that information. I want it to look like this:
All Foods
1 Apple
2 Banana
3 Pear
4 Spinach
5 Cucumber
6 Carrot
7 Beef
8 Chicken
9 Pork
How can I do this in pandas?
Dump into numpy and create a new dataframe:
out = df.to_numpy().ravel(order='F')
pd.DataFrame({'All Foods' : out})
All Foods
0 Apple
1 Banana
2 Pear
3 Spinach
4 Cucumber
5 Carrot
6 Beef
7 Chicken
8 Pork
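The order='F' argument flattens the array column by column (Fortran order), which is what keeps all the fruits ahead of the vegetables and the proteins.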
Just pd.concat them together and reset the index:
all_foods = pd.concat([foods[col] for col in foods.columns]).reset_index(drop=True)
You can unstack the dataframe to get the values and then create a df/series:
df = pd.DataFrame({'Fruits': ['Apple', 'Banana', 'Pear'],
                   'Vegetables': ['Spinach', 'Cucumber', 'Carrot'],
                   'Protein': ['Beef', 'Chicken', 'Pork']})
pd.DataFrame({'All Foods': df.unstack().values})
This should help:
import pandas as pd
# Loading file with fruits, vegetables and protein
dataset = pd.read_csv('/fruit.csv')
# This is where you should apply your code
# Unpivoting (creating one column out of 3 columns)
df_unpivot = pd.melt(dataset, value_vars=['Fruits', 'Vegetables', 'Protein'])
# Renaming column from value to All Foods
df_finalized = df_unpivot.rename(columns={'value': 'All Foods'})
# Printing out "All Foods" column
print(df_finalized["All Foods"])
I have a table like this:
Name  Type Food  Variant and Price
A     Cake       {'Choco': 100, 'Cheese': 100, 'Mix': 125}
B     Drinks     {'Cola': 25, 'Milk': 35}
C     Side dish  {'French Fries': 20}
D     Bread      {None: 10}
I want to use the keys and values of the dictionaries in the Variant and Price column as 2 different columns, but I am still confused. Here is the output that I want:
Name  Type Food  Variant       Price
A     Cake       Choco         100
A     Cake       Cheese        100
A     Cake       Mix           125
B     Drinks     Cola          25
B     Drinks     Milk          35
C     Side dish  French Fries  20
D     Bread      NaN           10
Can anyone help me to figure it out?
Create a list of tuples and then use DataFrame.explode; last, create the 2 columns:
df['Variant and Price'] = df['Variant and Price'].apply(lambda x: list(x.items()))
df = df.explode('Variant and Price').reset_index(drop=True)
df[['Variant','Price']] = df.pop('Variant and Price').to_numpy().tolist()
print (df)
Name Type Food Variant Price
0 A Cake Choco 100
1 A Cake Cheese 100
2 A Cake Mix 125
3 B Drinks Cola 25
4 B Drinks Milk 35
5 C Side dish French Fries 20
6 D Bread None 10
Or create 2 columns and then use DataFrame.explode:
df['Variant'] = df['Variant and Price'].apply(lambda x: list(x.keys()))
df['Price'] = df.pop('Variant and Price').apply(lambda x: list(x.values()))
df = df.explode(['Variant', 'Price']).reset_index(drop=True)
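Note: passing a list of columns to DataFrame.explode requires pandas 1.3 or newer; on older versions, use the first approach.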
I have a list of strings looking like this:
strings = ['apple', 'pear', 'grapefruit']
and I have a data frame containing id and text values like this:
id  value
1   The grapefruit is delicious! But the pear tastes awful.
2   I am a big fan og apple products
3   The quick brown fox jumps over the lazy dog
4   An apple a day keeps the doctor away
Using pandas, I would like to create a filter which gives me only the id and value for those rows whose value contains one or more of the strings, together with a column showing which strings are contained in the value, like this:
id  value                                                     value contains substrings:
1   The grapefruit is delicious! But the pear tastes awful.  grapefruit, pear
2   I am a big fan og apple products                          apple
4   An apple a day keeps the doctor away                      apple
How would I write this using pandas?
Use .str.findall:
df['fruits'] = df['value'].str.findall('|'.join(strings)).str.join(', ')
df[df.fruits != '']
id value fruits
0 1 The grapefruit is delicious! But the pear tast... grapefruit, pear
1 2 I am a big fan og apple products apple
3 4 An apple a day keeps the doctor away apple
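One caveat: str.findall here matches bare substrings, so a term like apple would also be found inside apples or pineapple, and any regex metacharacters in the search terms would be interpreted as regex. If you only want whole-word matches, a word-boundary pattern with escaped terms is a common fix; a sketch:

import re

pattern = r'\b(?:' + '|'.join(map(re.escape, strings)) + r')\b'
df['fruits'] = df['value'].str.findall(pattern).str.join(', ')
df[df.fruits != '']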
My apologies if this has been asked/answered before, but I couldn't find an answer to my problem after some time searching.
Very simply put, I would like to combine multiple columns into one, separated with a comma.
The problem is that some cells are empty (NoneType).
And when combining them I get either:
TypeError: ('sequence item 3: expected str instance, NoneType found', 'occurred at index 0')
or
When .map(str) is added, it literally inserts 'None' for every NoneType value (as kind of expected).
Let's say I have a production dataframe looking like:
0 1 2
1 Rice
2 Beans Rice
3 Milk Beans Rice
4 Sugar Rice
What I would like is a single column with the values
Production
1 Rice
2 Beans, Rice
3 Milk, Beans, Rice
4 Sugar, Rice
With some searching and tweaking I added this code:
testColumn = productionFrame.iloc[::].apply(lambda x: ', '.join(x), axis=1)
Which produces problem 1
or changed it like this:
testColumn = productionFrame.iloc[::].apply(lambda x: ', '.join(x.map(str)), axis=1)
Which produces problem 2
Maybe it's good to add that I'm very new and kinda exploring Pandas/Python right now. So any help or push in the right direction is much appreciated!
pd.Series.str.cat should work here
df
Out[43]:
0 1 2
1 Rice NaN NaN
2 Beans Rice NaN
3 Milk Beans Rice
4 Sugar Rice NaN
df.apply(lambda x: x.str.cat(sep=', '), axis=1)
Out[44]:
1 Rice
2 Beans, Rice
3 Milk, Beans, Rice
4 Sugar, Rice
dtype: object
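This works because Series.str.cat skips NaN values when no na_rep is given, so the empty cells simply drop out of the joined string.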
You can use str.join after transforming NaN values to empty strings:
res = df.fillna('').apply(lambda x: ', '.join(filter(None, x)), axis=1)
print(res)
1 Rice
2 Beans, Rice
3 Milk, Beans, Rice
4 Sugar, Rice
dtype: object
This question is related to another question I had posted.
Pandas - check if a string column in one dataframe contains a pair of strings from another dataframe
My goal is to check if two different columns of a dataframe contain a pair of string values and if the condition is met, then extract one of the values.
I have two dataframes like this:
df1 = pd.DataFrame({'consumption': ['squirrelate apple', 'monkey likesapple',
                                    'monkey banana gets', 'badger/getsbanana',
                                    'giraffe eats grass', 'badger apple.loves',
                                    'elephant is huge', 'elephant/eats/',
                                    'squirrel.digsingrass'],
                    'name': ['apple', 'appleisred', 'banana is tropical',
                             'banana is soft', 'lemon is sour',
                             'washington apples', 'kiwi', 'bananas', 'apples']})
df2 = pd.DataFrame({'food': ['apple', 'apple', 'banana', 'banana'],
                    'creature': ['squirrel', 'badger', 'monkey', 'elephant']})
In [187]:df1
Out[187]:
consumption name
0 squirrelate apple apple
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples
6 elephant is huge kiwi
7 elephant/eats/ bananas
8 squirrel.digsingrass apples
In[188]: df2
Out[188]:
creature food
0 squirrel apple
1 badger apple
2 monkey banana
3 elephant banana
What I want to do is test whether 'apple' occurs in df1['name'] and 'squirrel' occurs in df1['consumption'], and if both conditions are met, extract 'squirrel' from df1['consumption'] into a new column df1['creature']. The result should look like:
Out[189]:
consumption creature name
0 squirrelate apple squirrel apple
1 monkey likesapple NaN appleisred
2 monkey banana gets monkey banana is tropical
3 badger/getsbanana NaN banana is soft
4 giraffe eats grass NaN lemon is sour
5 badger apple.loves badger washington apples
6 elephant is huge NaN kiwi
7 elephant/eats/ elephant bananas
8 squirrel.digsingrass NaN apples
If there were no paired-value constraint, I could have done something simple like:
np.where((df1['consumption'].str.contains(<creature_string>, case = False)) & (df1['name'].str.contains(<food_string>, case = False)), df['consumption'].str.extract(<creature_string>), np.nan)
but I must check for pairs so I tried to make a dictionary of food as keys and creatures as values , then make a string var of all the creatures for a given food key and look for those using str.contains :
unique_food = df2.food.unique()
food_dict = {elem : pd.DataFrame for elem in unique_food}
for key in food_dict.keys():
    food_dict[key] = df2[:][df2.food == key]
# create key:value pairs of food key and creature strings
food_strings = {}
for key, values in food_dict.items():
    food_strings.update({key: '|'.join(map(str, list(food_dict[key]['creature'].unique())))})
In[199]: food_strings
Out[199]: {'apple': 'squirrel|badger', 'banana': 'monkey|elephant'}
The problem is when I try to now apply str.contains:
for key, value in food_strings.items():
    np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) &
             (df1['consumption'].str.contains('('+food_strings[value]+')', case = False)),
             df1['consumption'].str.extract('('+food_strings[value]+')'), np.nan)
I get a KeyError:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-62-7ab718066040> in <module>()
1 for key, value in food_strings.items():
2 np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) &
----> 3 (df1['consumption'].str.contains('('+food_strings[value]+')', case = False)), df1['consumption'].str.extract('('+food_strings[value]+')'), np.nan)
KeyError: 'squirrel|badger'
When I just try the value and not the key, it works for the first key:value pair but not the second:
for key in food_strings.keys():
    df1['test'] = np.where(df1['consumption'].str.contains('('+food_strings[key]+')', case=False),
                           df1['consumption'].str.extract('('+food_strings[key]+')', expand=False),
                           np.nan)
df1
Out[196]:
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred NaN
2 monkey banana gets banana is tropical NaN
3 badger/getsbanana banana is soft badger
4 giraffe eats grass lemon is sour NaN
5 badger apple.loves washington apples badger
6 elephant is huge kiwi NaN
7 elephant/eats/ bananas NaN
8 squirrel.digsingrass apples squirrel
I got the ones matching apple and squirrel|badger but missed banana:monkey|elephant.
Can someone please help?
d1 = df1.dropna()
d2 = df2.dropna()

sump = d1.consumption.values.tolist()
name = d1.name.values.tolist()
cret = d2.creature.values.tolist()
food = d2.food.values.tolist()

# boolean matrix: one row per row of `d1`,
# one column per (creature, food) pair from `d2`
check = np.array([
    [c in s and f in n for c, f in zip(cret, food)]
    for s, n in zip(sump, name)
])

# create a new series with the index of `d1` where we dropped na
# then reindex with `df1.index` prior to `assign`
test = pd.Series(check.dot(d2[['creature']].values).ravel(), d1.index)
test = test.reindex(df1.index, fill_value='')

df1.assign(test=test)
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical monkey
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples badger
6 elephant is huge kiwi
7 elephant/eats/ bananas elephant
8 squirrel.digsingrass apples squirrel
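The dot product does the string assembly here: check is a boolean matrix with one row per row of d1 and one column per (creature, food) pair, and multiplying it by the object-dtype creature column turns each True into the matching creature name and each False into an empty string, which then concatenate row by row.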