Frequent pattern mining in Python


I want to know how to get the absolute support and the relative support of itemsets in Python. At present I have the following:
import pandas as pd
import pyfpgrowth
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from collections import Counter
dataset = [['a', 'b', 'c', 'd'],
['b', 'c', 'e', 'f'],
['a', 'd', 'e', 'f'],
['a', 'e', 'f'],
['b', 'd', 'f']
]
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
print (df)
#print support
print(apriori(df, min_support = 0.0))
#print frequent itemset
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets
print ("frequent itemset at min support = 0.6")
print(frequent_itemsets)
However, I do not know how to return the absolute support and the relative support.

The relative support is already part of your frequent_itemsets DataFrame. You can get it from:
frequent_itemsets['support']
You can calculate the absolute support by multiplying the relative support by the number of baskets (transactions):
frequent_itemsets['support']*len(dataset)
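As a cross-check that does not depend on mlxtend, both kinds of support can also be counted by hand. This is only a minimal sketch, assuming itemsets of at most two items are enough for illustration; `support_counts` and `max_len` are hypothetical names, not part of any library.

```python
from itertools import combinations

dataset = [['a', 'b', 'c', 'd'],
           ['b', 'c', 'e', 'f'],
           ['a', 'd', 'e', 'f'],
           ['a', 'e', 'f'],
           ['b', 'd', 'f']]

def support_counts(transactions, max_len=2):
    """Count in how many transactions each itemset (up to max_len items) occurs."""
    counts = {}
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore repeats within one basket
        for size in range(1, max_len + 1):
            for itemset in combinations(items, size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return counts

absolute = support_counts(dataset)            # absolute support: a count
n_transactions = len(dataset)
relative = {itemset: count / n_transactions   # relative support: a fraction
            for itemset, count in absolute.items()}

print(absolute[('f',)], relative[('f',)])          # 4 0.8
print(absolute[('e', 'f')], relative[('e', 'f')])  # 3 0.6
```

The absolute support is the number of baskets containing the itemset, and dividing by the number of baskets gives the relative support, which is the same relationship the multiplication above relies on.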

Related

How to speed up nested for loop with dataframe?

I have a dataframe like this:
test = pd.DataFrame({'id':['a','C','D','b','b','D','c','c','c'], 'text':['a','x','a','b','b','b','c','c','c']})
Using the following for-loop I can add x to a new_col. This for-loop works fine for the small dataframe. However, for dataframes that have thousands of rows, it will take many hours to process. Any suggestions to speed it up?
for index, row in test.iterrows():
    if row['id'] == 'C':
        if test['id'][index+1] == 'D':
            test['new_col'][index+1] = test['text'][index]

Try using shift() and conditions.
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': ['a', 'C', 'D', 'b', 'b', 'D', 'c', 'c', 'c'],
'text': ['a', 'x', 'a', 'b', 'b', 'b', 'c', 'c', 'c']})
df['temp_col'] = df['id'].shift()
df['new_col'] = np.where((df['id'] == 'D') & (df['temp_col'] == 'C'), df['text'].shift(), "")
del df['temp_col']
print(df)
We can also do it without a temporary column. (Thanks and credit to Prayson 🙂)
df['new_col'] = np.where((df['id'].eq('D')) & (df['id'].shift().eq('C')), df['text'].shift(), "")
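To see what the shift() comparison computes, here is a dependency-free sketch of the same logic on plain lists. It is only an illustration (on a real DataFrame you would use the pandas version above); `previous_ids` and `previous_text` are hypothetical names standing in for the shifted columns.

```python
ids = ['a', 'C', 'D', 'b', 'b', 'D', 'c', 'c', 'c']
text = ['a', 'x', 'a', 'b', 'b', 'b', 'c', 'c', 'c']

# Shifting by one row: each position sees the value of the row before it.
# The first row has no predecessor, so it gets None.
previous_ids = [None] + ids[:-1]
previous_text = [None] + text[:-1]

# Keep the previous row's text only where a 'D' row directly follows a 'C' row.
new_col = [
    prev_text if (cur == 'D' and prev == 'C') else ""
    for cur, prev, prev_text in zip(ids, previous_ids, previous_text)
]
print(new_col)  # ['', '', 'x', '', '', '', '', '', '']
```

Only the first 'D' is filled in, because the second 'D' follows a 'b' rather than a 'C'.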

How to get duplicate values across any, potentially not all, lists

I have a dataset along the lines of:
data = [['a', 'b', 'c'], ['a', 'x', 'y', 'z'], ['a', 'x', 'e', 'f'], ['a']]
I've searched SO and found ways to return the items duplicated across all lists using intersection_update() (so, in this example, 'a'), but I actually want the items that are duplicated across any of the lists, i.e.:
retVal = ['a', 'x']
Since 'a' and 'x' are duplicated at least once among all lists. Is there a built-in for Python 2.7 that can do this?

Use a Counter to determine the number of each item and chain.from_iterable to pass the items from the sublists to the Counter.
from itertools import chain
from collections import Counter
data=[['a', 'b', 'c'], ['a', 'x', 'y', 'z'], ['a', 'x', 'e', 'f'], ['a']]
c = Counter(chain.from_iterable(data))
retVal = [k for k, count in c.items() if count >= 2]
print(retVal)
# ['a', 'x'] on Python 3.7+; the order is arbitrary on Python 2.7
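One caveat: counting over the flattened items also flags an item that is repeated inside a single sublist. If "duplicate" should mean "present in at least two different lists", a possible variation is to deduplicate each sublist first. This is a sketch under that assumption; the sample data below is made up to show the difference.

```python
from collections import Counter

# 'b' occurs twice, but only inside the first sublist.
data = [['a', 'b', 'b'], ['a', 'x', 'y', 'z'], ['a', 'x', 'e', 'f'], ['a']]

# Convert each sublist to a set first so an item is counted once per list.
c = Counter(item for sublist in data for item in set(sublist))
ret_val = sorted(k for k, count in c.items() if count >= 2)
print(ret_val)  # ['a', 'x']
```

The result is sorted only to make the output order predictable.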

Generating a list of random lists

I'm new to Python, so I might be making basic errors; apologies in advance.
Here is the kind of result I'm trying to obtain:
foo = [
["B","C","E","A","D"],
["E","B","A","C","D"],
["D","B","A","E","C"],
["C","D","E","B","A"]
]
So basically, a list of lists of randomly permuted letters, without repeats.
Here is what I get so far:
foo = ['BDCEA', 'BDCEA', 'BDCEA', 'BDCEA']
The main problem is that the permutation is the same every time. This is my code so far:
import random
import numpy as np
letters = ["A", "B", "C", "D", "E"]
nblines = 4
foo = np.repeat(''.join(random.sample(letters, len(letters))), nblines)
Help appreciated. Thanks

The problem with your code is that the line
foo = np.repeat(''.join(random.sample(letters, len(letters))), nblines)
will first create one random permutation, and then repeat that same permutation nblines times. numpy.repeat does not call a function repeatedly; it repeats the elements of an already existing array, which here holds the single string you built with random.sample.
Also, NumPy is primarily designed for numerical arrays, so plain Python lists are a better fit for this task. Here is a short snippet (without NumPy) that produces your desired result:
[random.sample(letters,len(letters)) for i in range(nblines)]
Result: similar to this:
foo = [
["B","C","E","A","D"],
["E","B","A","C","D"],
["D","B","A","E","C"],
["C","D","E","B","A"]
]
I hope this helped ;)
PS: I see that others gave similar answers to this while I was writing it.

np.repeat repeats the same array. Your approach would work if you changed it to:
[''.join(random.sample(letters, len(letters))) for _ in range(nblines)]
Out: ['EBCAD', 'BCEAD', 'EBDCA', 'DBACE']
This is a short way of writing this:
foo = []
for _ in range(nblines):
    foo.append(''.join(random.sample(letters, len(letters))))
foo
Out: ['DBACE', 'CBAED', 'ACDEB', 'ADBCE']
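The difference between evaluating random.sample once and evaluating it on every iteration can be checked directly. In this sketch, list multiplication is used as a NumPy-free stand-in for np.repeat (an assumption made only to keep the example self-contained); the names `once` and `each` are illustrative.

```python
import random

letters = ["A", "B", "C", "D", "E"]
nblines = 4

# Evaluated once: a single permutation is drawn, then repeated nblines times.
once = [''.join(random.sample(letters, len(letters)))] * nblines

# Evaluated per iteration: random.sample is called nblines separate times.
each = [''.join(random.sample(letters, len(letters))) for _ in range(nblines)]

print(all(row == once[0] for row in once))              # True
print(all(sorted(row) == letters for row in each))      # True
```

Every entry of `each` is a valid permutation of the letters, but nothing forces the entries to differ from one another; two rows can still coincide by chance.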

Here's a plain Python solution using a "traditional" style for loop.
from random import shuffle
nblines = 4
letters = list("ABCDE")
foo = []
for _ in range(nblines):
    shuffle(letters)
    foo.append(letters[:])
print(foo)
typical output
[['E', 'C', 'D', 'A', 'B'], ['A', 'B', 'D', 'C', 'E'], ['A', 'C', 'B', 'E', 'D'], ['C', 'A', 'E', 'B', 'D']]
The random.shuffle function shuffles the list in-place. We append a copy of the list to foo using letters[:], otherwise foo would just end up containing 4 references to the one list object.
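The effect of leaving out the copy can be seen in a small side-by-side sketch. The names `aliased` and `copied` are made up for this illustration.

```python
from random import shuffle

letters = list("ABCDE")

aliased = []
copied = []
for _ in range(4):
    shuffle(letters)
    aliased.append(letters)      # the same list object every time
    copied.append(letters[:])    # an independent snapshot of the current order

# Every entry of `aliased` is the one `letters` object, so all rows look
# like the final shuffle; `copied` holds four distinct list objects.
print(all(row is letters for row in aliased))  # True
print(len({id(row) for row in copied}))        # 4
```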
Here's a slightly more advanced version, using a generator function to handle the shuffling. Each time we call next(sh), it shuffles the lst list stored in the generator and returns a copy of it. So we can call next(sh) in a list comprehension to build the list, which is a little neater than using a traditional for loop. Also, list comprehensions can be slightly faster than calling .append in a traditional for loop.
from random import shuffle
def shuffler(seq):
    lst = list(seq)
    while True:
        shuffle(lst)
        yield lst[:]
sh = shuffler('ABCDE')
foo = [next(sh) for _ in range(10)]
for row in foo:
    print(row)
typical output
['C', 'B', 'A', 'E', 'D']
['C', 'A', 'E', 'B', 'D']
['D', 'B', 'C', 'A', 'E']
['E', 'D', 'A', 'B', 'C']
['B', 'A', 'E', 'C', 'D']
['B', 'D', 'C', 'E', 'A']
['A', 'B', 'C', 'E', 'D']
['D', 'C', 'A', 'B', 'E']
['D', 'C', 'B', 'E', 'A']
['E', 'D', 'A', 'C', 'B']

Python permutations of heterogenous list elements

This is the sequence:
l = [['A', 'G'], 'A', ['A', 'C']]
I need the three element sequence back for each permutation
all = ['AAA','GAA','AAC','GAC']
I can't figure this one out! I'm having trouble retaining the permutation order!

You want the product:
from itertools import product
l = [['A', 'G'], 'A', ['A', 'C']]
print(["".join(p) for p in product(*l)])
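This works with the bare 'A' in the middle because product() iterates over each argument, and iterating over a one-character string yields just that character. A longer string in that position would be split into its characters. If that is not what you want, one option is to wrap bare strings in a list first; `as_choices` is a hypothetical helper written for this sketch.

```python
from itertools import product

def as_choices(element):
    """Wrap a bare string so product() treats it as one choice, not as characters."""
    return [element] if isinstance(element, str) else element

# 'AT' should stay together as a single fixed element.
l = [['A', 'G'], 'AT', ['A', 'C']]
result = ["".join(p) for p in product(*map(as_choices, l))]
print(result)  # ['AATA', 'AATC', 'GATA', 'GATC']
```

Note that product() varies the last position fastest, so the results come out in that order.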

Stacked bar chart with differently ordered colors using matplotlib

I am a beginner in Python. I am trying to make a horizontal bar chart with differently ordered colors.
I have a data set like the one below:
dataset = [{'A':19, 'B':39, 'C':61, 'D':70},
{'A':34, 'B':68, 'C':32, 'D':38},
{'A':35, 'B':45, 'C':66, 'D':50},
{'A':23, 'B':23, 'C':21, 'D':16}]
data_orders = [['A', 'B', 'C', 'D'],
['B', 'A', 'C', 'D'],
['A', 'B', 'D', 'C'],
['B', 'A', 'C', 'D']]
The first list contains the numerical data, and the second one contains the order of the items in each row. I need the second list because the order of A, B, C, and D within each row is crucial when presenting this dataset.
Using data like the above, I want to make a stacked bar chart like one I made manually in MS Excel. What I hope to do now is to produce this type of bar chart with Matplotlib from a dataset like the one above, in a more automatic way. I would also like to add a legend to the chart if possible.
I have gotten completely lost trying to do this by myself. Any help would be greatly appreciated.
Thank you very much for your attention!

It's a long program, but it works. I added one extra dummy row so that the number of rows differs from the number of columns:
import numpy as np
from matplotlib import pyplot as plt
dataset = [{'A':19, 'B':39, 'C':61, 'D':70},
{'A':34, 'B':68, 'C':32, 'D':38},
{'A':35, 'B':45, 'C':66, 'D':50},
{'A':23, 'B':23, 'C':21, 'D':16},
{'A':35, 'B':45, 'C':66, 'D':50}]
data_orders = [['A', 'B', 'C', 'D'],
['B', 'A', 'C', 'D'],
['A', 'B', 'D', 'C'],
['B', 'A', 'C', 'D'],
['A', 'B', 'C', 'D']]
colors = ["r","g","b","y"]
names = sorted(dataset[0].keys())
values = np.array([[data[name] for name in order] for data,order in zip(dataset, data_orders)])
lefts = np.insert(np.cumsum(values, axis=1),0,0, axis=1)[:, :-1]
orders = np.array(data_orders)
bottoms = np.arange(len(data_orders))
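for name, color in zip(names, colors):
    idx = np.where(orders == name)
    value = values[idx]
    left = lefts[idx]
    # plt.bar(left=..., orientation="horizontal") only works on old Matplotlib;
    # barh draws horizontal bars, with `left` as each segment's starting x position
    plt.barh(bottoms, value, height=0.8, left=left,
             color=color, label=name)
# barh centers each bar on its y position, so the ticks go at bottoms
plt.yticks(bottoms, ["data %d" % (t+1) for t in bottoms])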
plt.legend(loc="best", bbox_to_anchor=(1.0, 1.00))
plt.subplots_adjust(right=0.85)
plt.show()
[Figure: the resulting stacked horizontal bar chart]
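The key step in the program above is computing where each segment starts: a segment's left edge is the running total of the segments before it in that row's order. Here is a NumPy-free sketch of just that step, using the first two rows of the data; `all_lefts` is a name introduced for this illustration.

```python
from itertools import accumulate

dataset = [{'A': 19, 'B': 39, 'C': 61, 'D': 70},
           {'A': 34, 'B': 68, 'C': 32, 'D': 38}]
data_orders = [['A', 'B', 'C', 'D'],
               ['B', 'A', 'C', 'D']]

all_lefts = []
for data, order in zip(dataset, data_orders):
    # Segment widths in the order this row should be drawn.
    widths = [data[name] for name in order]
    # Each segment starts where the running total of earlier segments ends.
    lefts = [0] + list(accumulate(widths))[:-1]
    all_lefts.append(lefts)
    print(list(zip(order, lefts, widths)))

# all_lefts == [[0, 19, 58, 119], [0, 68, 102, 134]]
```

This mirrors what the np.cumsum / np.insert line does for all rows at once.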

>>> dataset = [{'A':19, 'B':39, 'C':61, 'D':70},
...            {'A':34, 'B':68, 'C':32, 'D':38},
...            {'A':35, 'B':45, 'C':66, 'D':50},
...            {'A':23, 'B':23, 'C':21, 'D':16}]
>>> data_orders = [['A', 'B', 'C', 'D'],
...                ['B', 'A', 'C', 'D'],
...                ['A', 'B', 'D', 'C'],
...                ['B', 'A', 'C', 'D']]
>>> for i, x in enumerate(data_orders):
...     for y in x:
...         # do something here with dataset[i][y] in matplotlib;
...         # printing is just a placeholder so the loop runs
...         print(i, y, dataset[i][y])
