Converting paired table to binary table - python

I wouldn't be posting this if I hadn't done extensive research in an attempt to find the answer. Alas, I have not been able to find any such answer. I have a paired dataset that looks something like this:
PERSON, ATTRIBUTE
person1, a
person1, b
person1, c
person1, d
person2, c
person2, d
person2, x
person3, a
person3, b
person3, e
person3, f
What I want to do is: 1) drop attributes that don't appear more than 10 times, 2) turn it into a binary table that would look something like this:
a b c
person1 1 1 1
person2 0 0 1
person3 1 1 0
So far, I have put together a script to drop the attributes that appear fewer than 10 times; however, it is painfully slow, as it has to go through each attribute, determine its frequency, and find the corresponding x and y values to append to new variables.
import pandas as pd
from collections import Counter
import time

df = pd.read_csv(filepath_or_buffer='sample.csv', sep=',')

# x holds the attributes, y holds the person names
x = df.iloc[:, 1].values
y = df.iloc[:, 0].values

x_vals = []
y_vals = []
counter = Counter(x)

start_time = time.time()
for each in counter:
    if counter[each] >= 10:
        # re-scan the whole attribute column for every qualifying attribute
        for i, j in enumerate(x):
            if j == each:
                print("Adding position: " + str(i))
                x_vals.append(each)
                y_vals.append(y[i])
print("Time took: %s" % (time.time() - start_time))
I would love some help with either: 1) finding a faster way to match attributes that appear at least 10 times and append the values to new variables,
OR
2) an alternative method entirely to get the final binary table. I feel like converting a paired table to a binary table is probably a common occurrence in the data world, yet I couldn't find any code, module, etc. that could help with doing that.
Thanks a million!

I would probably add a dummy column and then call pivot_table:
>>> df = pd.DataFrame({"PERSON": ["p1", "p2", "p3"] * 10, "ATTRIBUTE": np.random.choice(["a","b","c","d","e","f","x"], 30)})
>>> df.head()
ATTRIBUTE PERSON
0 d p1
1 b p2
2 x p3
3 b p1
4 f p2
>>> df["count"] = 1
>>> p = df.pivot_table(index="PERSON", columns="ATTRIBUTE", values="count",
...                    aggfunc=sum, fill_value=0)
>>> p
ATTRIBUTE a b c d e f x
PERSON
p1 1 3 1 1 1 0 3
p2 2 1 1 2 1 2 1
p3 0 4 1 1 2 0 2
And then we can select only the attributes with at least a given number of occurrences (10 in your case; 5 here, to fit my smaller example):
>>> p.loc[:,p.sum() >= 5]
ATTRIBUTE b x
PERSON
p1 3 3
p2 1 1
p3 4 2
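If you only need the 0/1 table, a minimal sketch using pd.crosstab on the original sample.csv should also do it, assuming it has exactly the two columns shown in the question (the >= 10 filter and the clip to 0/1 are the only extra steps):
import pandas as pd

df = pd.read_csv('sample.csv')  # columns: PERSON, ATTRIBUTE

counts = pd.crosstab(df['PERSON'], df['ATTRIBUTE'])  # person x attribute counts
counts = counts.loc[:, counts.sum() >= 10]           # drop attributes seen fewer than 10 times
binary = counts.clip(upper=1)                        # 1 if the pair occurred at all, else 0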

Related

Checking if a list is a subset of another in a pandas Dataframe

So, I have this DataFrame with almost 3 thousand rows, that looks something like this:
CITIES
0 ['A','B']
1 ['A','B','C','D']
2 ['A','B','C']
4 ['X']
5 ['X','Y','Z']
... ...
2670 ['Y','Z']
I would like to remove from the DF all rows where the 'CITIES' list is contained in another row (the order does not matter). In the example above, I would like to remove 0 and 2, since both are contained in 1, and also remove 4 and 2670, since both are contained in 5. I tried something; it kind of worked, but it was really stupid and took almost 10 minutes to compute. This was it:
indexesToRemove = []
for index, row in entrada.iterrows():
    citiesListFixed = row['CITIES']
    for index2, row2 in entrada.iloc[index+1:].iterrows():
        citiesListCurrent = row2['CITIES']
        if set(citiesListFixed) <= set(citiesListCurrent):
            indexesToRemove.append(index)
            break
Is there a more efficient way to do this?
First create a DataFrame of dummies, and then we can use matrix multiplication to see if one row is a complete subset of another row, by checking whether the sum of its multiplication with the other row equals the number of elements in that row. (This is going to be memory intensive.)
import pandas as pd
import numpy as np
df = pd.DataFrame({'Cities': [['A','B'], ['A','B','C','D'], ['A','B','C'],
                              ['X'], ['X','Y','Z'], ['Y','Z']]})
arr = pd.get_dummies(df['Cities'].explode()).groupby(level=0).max().to_numpy()
#[[1 1 0 0 0 0 0]
# [1 1 1 1 0 0 0]
# [1 1 1 0 0 0 0]
# [0 0 0 0 1 0 0]
# [0 0 0 0 1 1 1]
# [0 0 0 0 0 1 1]]
subsets = np.matmul(arr, arr.T)
np.fill_diagonal(subsets, 0) # So same row doesn't exclude itself
mask = ~np.equal(subsets, np.sum(arr, 1)).any(0)
df[mask]
# Cities
#1 [A, B, C, D]
#4 [X, Y, Z]
As it stands, if you have two rows which tie for the longest subset (i.e. two rows with ['A','B','C','D']), both are dropped. If this is not desired, you can first drop_duplicates on 'Cities' (you will need to convert to a hashable type like frozenset) and then apply the above.
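A minimal sketch of that pre-deduplication step, assuming the same df as above (frozenset just makes the lists hashable so duplicated() can compare them):
# Keep the first occurrence of each distinct set of cities,
# then apply the matrix-multiplication approach to what remains.
df = df[~df['Cities'].map(frozenset).duplicated()].reset_index(drop=True)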
A possible and didactic approach would be the following:
import pandas as pd
import numpy as np
import time

start = time.process_time()
lst1 = [0, 1, 2, 3, 4, 2670]
lst2 = [['A','B'], ['A','B','C','D'], ['A','B','C'], ['X'], ['X','Y','Z'], ['Y','Z']]
df = pd.DataFrame(list(zip(lst1, lst2)), columns=['id', 'cities'])
df['contained'] = 0
n = df.shape[0]
for i in range(n):
    for j in range(n):
        if i != j:
            # row i is contained in row j if their intersection equals row i
            if (set(df.loc[i, 'cities']) & set(df.loc[j, 'cities'])) == set(df.loc[i, 'cities']):
                df.loc[i, 'contained'] = 1
print(df)
print("\nTime elapsed:", time.process_time() - start, "seconds")
The time complexity of this solution is O(n²).
You end up with this data frame as a result:
id cities contained
0 0 [A, B] 1
1 1 [A, B, C, D] 0
2 2 [A, B, C] 1
3 3 [X] 1
4 4 [X, Y, Z] 0
5 2670 [Y, Z] 1
Then you just have to exclude the rows where contained == 1.
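For completeness, applied to the df built above, that last step is just (result is my own variable name):
result = df[df['contained'] == 0].drop(columns='contained')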

Python Dataframe calculate Distance via intermediate point

I have a pandas dataframe with distances listed as follows:
data = {'from': ['A','A','A','B','B','D','D','D'],
        'to': ['B','C','D','C','E','B','C','E'],
        'distances': [4,3,1,1,3,4,2,9]}
df = pd.DataFrame.from_dict(data)
I want to enumerate all of the distances for point1 ==> point2 where the route goes via B, i.e.
distance(point1 ==> point2) = distance(point1 ==> B) + distance(B ==> point2)
How can I do this efficiently using python - I assume some kind of pd.merge?
I would then like to reformat the dataframe into the following:
columns = ['From', 'To', 'Distance', 'Distance via B']
If you're looking for routes of length 3 (i.e. two legs), here's a solution. Note that in some cases the direct route (e.g. A to C, distance 3) is shorter than the two-leg route A-B-C (distance 5):
three_route = pd.merge(df, df, left_on="to", right_on="from")
three_route["distance"] = three_route.distances_x + three_route.distances_y
three_route = three_route[["from_x", "to_x", "to_y", "distance"]].rename(
    columns={"from_x": "from", "to_x": "via", "to_y": "to"})
The result is:
from via to distance
0 A B C 5
1 A B E 7
2 D B C 5
3 D B E 7
4 A D B 5
5 A D C 3
6 A D E 10
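To get something close to the requested ['From', 'To', 'Distance', 'Distance via B'] layout, one possible follow-up (a sketch built on the df and three_route above; the new column names are my own) is to keep only the via-B rows and left-join them onto the direct distances:
via_b = (three_route[three_route["via"] == "B"]
         .rename(columns={"distance": "distance via B"})
         [["from", "to", "distance via B"]])

result = (df.rename(columns={"distances": "distance"})
            .merge(via_b, on=["from", "to"], how="left"))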

Identify parent of hierarchical data in a dataframe given ordered index and depth only

Before I begin: I can hack something together to do this on a small scale, but my goal is to apply this to a 200k+ row dataset, so efficiency is a priority and I lack the more... nuanced techniques. :-)
So, I have an ordered data set that represents data from a very complex hierarchical structure. I only have a unique ID, the tree depth, and the fact that it is in order. For example:
a
    b
        c
            d
            e
        f
        g
            h
i
    j
        k
    l
Which is stored as:
ID depth
0 a 0
1 b 1
2 c 2
3 d 3
4 e 3
5 f 2
6 g 2
7 h 3
8 i 0
9 j 1
10 k 2
11 l 1
Here's a line that should generate my example.
df = pd.DataFrame.from_dict({"ID": ["a","b","c","d","e","f","g","h","i","j","k","l"],
                             "depth": [0,1,2,3,3,2,2,3,0,1,2,1]})
What I want is to return either the index of each element's nearest parent node or the parent's unique ID (they'll both work since they're both unique). Something like:
ID depth parent p.idx
0 a 0
1 b 1 a 0
2 c 2 b 1
3 d 3 c 2
4 e 3 c 2
5 f 2 b 1
6 g 2 b 1
7 h 3 g 6
8 i 0
9 j 1 i 8
10 k 2 j 9
11 l 1 i 8
My initial sloppy solution involved adding a column that was index-1, then self matching the data set with idx-1 (left) and idx (right), then identifying the maximum parent idx less than the child index... it didn't scale up well.
Here are a couple of routes to performing this task I've put together that work but aren't very efficient.
The first uses simple loops and includes a break to exit when the first match is identified.
df = pd.DataFrame.from_dict({"ID": ["a","b","c","d","e","f","g","h","i","j","k","l"],
                             "depth": [0,1,2,3,3,2,2,3,0,1,2,1]})
df["parent"] = ""
# loop over the entire dataframe
for i1 in range(len(df.depth)):
    # loop back up from the current row to the top
    for i2 in range(i1):
        # identify the nearest previous row whose depth is 1 less
        if df.depth[i1] - 1 == df.depth[i1-i2-1]:
            # set the parent value and exit the loop
            df.loc[i1, "parent"] = df.ID[i1-i2-1]
            break
df.head(15)
This second merges the dataframe with itself and then uses a groupby to identify the maximum parent row less than each original row:
df = pd.DataFrame.from_dict({"ID": ["a","b","c","d","e","f","g","h","i","j","k","l"],
                             "depth": [0,1,2,3,3,2,2,3,0,1,2,1]})
# Columns for comparison and merging
df["parent_depth"] = df.depth - 1
df["row"] = df.index
# Merge to return ALL elements matching the parent depth of each row
df = df.merge(df[["ID", "depth", "row"]], left_on="parent_depth",
              right_on="depth", how="left", suffixes=('', '_y'))
# Identify the maximum parent row less than the original row
g1 = df[(df.row_y < df.row) | (df.row_y.isnull())].groupby("ID").max()
g1.reset_index(inplace=True)
# clean up
g1.drop(["parent_depth", "row", "depth_y", "row_y"], axis=1, inplace=True)
g1.rename(columns={"ID_y": "parent"}, inplace=True)
g1.head(15)
I'm confident those with more experience can provide more elegant solutions, but since I got something working, I wanted to provide my "solution". Thanks!
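Not from the original post, but for comparison, a sketch of a single-pass approach that should scale much better (assuming the plain df built from the example dict above, before any merge columns were added): walk the rows once while keeping a stack of the current ancestry, pop entries whose depth is not smaller than the current row's, and whatever remains on top is the nearest parent.
parent_id, parent_idx = [], []
stack = []  # (depth, row position, ID) of the current ancestry chain
for pos, (depth, node) in enumerate(zip(df["depth"], df["ID"])):
    while stack and stack[-1][0] >= depth:  # drop siblings and deeper nodes
        stack.pop()
    parent_id.append(stack[-1][2] if stack else "")
    parent_idx.append(stack[-1][1] if stack else None)
    stack.append((depth, pos, node))

df["parent"] = parent_id
df["p.idx"] = parent_idx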

Count Re-occurrence of a value in python

I have a data set which contains something like this:
SNo Cookie
1 A
2 A
3 A
4 B
5 C
6 D
7 A
8 B
9 D
10 E
11 D
12 A
So let's say we have 5 cookies: 'A, B, C, D, E'. Now I want to count whether any cookie has reoccurred after a new cookie was encountered. For example, in the above data, cookie A was encountered again at the 7th place and then at the 12th place. NOTE: we wouldn't count A at the 2nd place as it came consecutively, but at positions 7 and 12 we had seen many new cookies before seeing A again, hence we count those instances. So essentially I want something like this:
Sno Cookie Count
1 A 2
2 B 1
3 C 0
4 D 2
5 E 0
Can anyone give me logic or python code behind this?
One way to do this would be to first get rid of consecutive Cookies, then find where the Cookie has been seen before using duplicated, and finally groupby cookie and get the sum:
no_doubles = df[df.Cookie != df.Cookie.shift()].copy()  # .copy() avoids a SettingWithCopyWarning
no_doubles['dups'] = no_doubles.Cookie.duplicated()
no_doubles.groupby('Cookie').dups.sum()
This gives you:
Cookie
A 2.0
B 1.0
C 0.0
D 2.0
E 0.0
Name: dups, dtype: float64
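If integer counts are preferred over the float output above, one option (a cosmetic addition of my own, not part of the original answer) is to cast the result:
no_doubles.groupby('Cookie').dups.sum().astype(int)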
Start by removing consecutive duplicates, then count the survivors:
no_dups = df[df.Cookie != df.Cookie.shift()] # Borrowed from #sacul
no_dups.groupby('Cookie').count() - 1
# SNo
#Cookie
#A 2
#B 1
#C 0
#D 2
#E 0
pandas.factorize and numpy.bincount
The idea: if immediately repeated values are not to be counted, remove them first, then do a normal count of the values that are left. That count is one more than what is asked for, so subtract one. Concretely: factorize the cookies, filter out the immediate repeats, bincount the remaining codes, and wrap the result in a pandas.Series.
i, r = pd.factorize(df.Cookie)
mask = np.append(True, i[:-1] != i[1:])
cnts = np.bincount(i[mask]) - 1
pd.Series(cnts, r)
A 2
B 1
C 0
D 2
E 0
dtype: int64
pandas.value_counts
Zip the cookies with a lagged copy of themselves, keep only the non-repeats, then count:
c = df.Cookie.tolist()
pd.value_counts([a for a, b in zip(c, [None] + c) if a != b]).sort_index() - 1
A 2
B 1
C 0
D 2
E 0
dtype: int64
defaultdict
from collections import defaultdict

def count(s):
    d = defaultdict(lambda: -1)
    x = None
    for y in s:
        d[y] += y != x
        x = y
    return pd.Series(d)
count(df.Cookie)
A 2
B 1
C 0
D 2
E 0
dtype: int64

Calculate within categories: Equivalent of R's ddply in Python?

I have some R code I need to port to python. However, R's magic data.frame and ddply are keeping me from finding a good way to do this in python.
Sample data (R):
x <- data.frame(d=c(1,1,1,2,2,2),c=c(rep(c('a','b','c'),2)),v=1:6)
Sample computation:
y <- ddply(x, 'd', transform, v2=(v-min(v))/(max(v)-min(v)))
Sample output:
d c v v2
1 1 a 1 0.0
2 1 b 2 0.5
3 1 c 3 1.0
4 2 a 4 0.0
5 2 b 5 0.5
6 2 c 6 1.0
So here's my question for the pythonistas out there: how would you do the same? You have a data structure with a couple of important dimensions.
For each (c) and each (d), compute (v - min(v)) / (max(v) - min(v)) and associate it with the corresponding (d, c) pair.
Feel free to use whatever data structures you want, so long as they're quick on reasonably large datasets (those that fit in memory).
Indeed pandas is the right (and only, I believe) tool for this in Python. It's a bit less magical than plyr but here's how to do this using the groupby functionality:
df = pd.DataFrame({'d': [1., 1., 1., 2., 2., 2.],
                   'c': np.tile(['a', 'b', 'c'], 2),
                   'v': np.arange(1., 7.)})
# in IPython
In [34]: df
Out[34]:
c d v
0 a 1 1
1 b 1 2
2 c 1 3
3 a 2 4
4 b 2 5
5 c 2 6
Now write a small transform function:
def f(group):
    v = group['v']
    group['v2'] = (v - v.min()) / (v.max() - v.min())
    return group
Note that this also handles NAs since the v variable is a pandas Series object.
Now group by the d column and apply f:
In [36]: df.groupby('d').apply(f)
Out[36]:
c d v v2
0 a 1 1 0
1 b 1 2 0.5
2 c 1 3 1
3 a 2 4 0
4 b 2 5 0.5
5 c 2 6 1
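A more compact variant of the same idea (a sketch using groupby().transform(), which applies the rescaling per group and returns a column aligned with the original index):
df["v2"] = df.groupby("d")["v"].transform(lambda v: (v - v.min()) / (v.max() - v.min()))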
Sounds like you want pandas and group by or aggregate.
You can also get better performance if you use numpy and scipy.
Despite some ugly code it will be faster: the pandas way will be slow if the number of groups is very large, and may even be worse than R. This will always be faster than R:
import numpy as np
import numpy.lib.recfunctions
from scipy import ndimage
x = np.rec.fromarrays(([1,1,1,2,2,2],['a','b','c']*2,range(1, 7)), names='d,c,v')
unique, groups = np.unique(x['d'], False, True)
uniques = range(unique.size)
mins = ndimage.minimum(x['v'], groups, uniques)[groups]
maxs = ndimage.maximum(x['v'], groups, uniques)[groups]
x2 = np.lib.recfunctions.append_fields(x, 'v2', (x['v'] - mins)/(maxs - mins + 0.0))
# save as csv (fmt='%s' is needed because the record array mixes strings and numbers)
np.savetxt('file.csv', x2, delimiter=';', fmt='%s')
