I have a data set which contains something like this:
SNo Cookie
1 A
2 A
3 A
4 B
5 C
6 D
7 A
8 B
9 D
10 E
11 D
12 A
So let's say we have 5 cookies: 'A, B, C, D, E'. Now I want to count how many times a cookie has reoccurred after a new cookie was encountered. For example, in the data above, cookie A was encountered again at the 7th position and then again at the 12th. NOTE: we wouldn't count the A at the 2nd place, since it comes immediately after another A, but at positions 7 and 12 we had seen other new cookies before seeing A again, so we count those instances. So essentially I want something like this:
Sno Cookie Count
1 A 2
2 B 1
3 C 0
4 D 2
5 E 0
Can anyone give me the logic or Python code for this?
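For reference, here is the sample data as a DataFrame; the answers below assume something like this is already loaded as df (a minimal sketch, assuming pandas):
import pandas as pd

# Sample data from the question, as a DataFrame named df
df = pd.DataFrame({
    "SNo": range(1, 13),
    "Cookie": list("AAABCDABDEDA"),
})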
One way to do this would be to first get rid of consecutive Cookies, then find where the Cookie has been seen before using duplicated, and finally groupby cookie and get the sum:
no_doubles = df[df.Cookie != df.Cookie.shift()].copy()  # drop consecutive repeats
no_doubles['dups'] = no_doubles.Cookie.duplicated()     # True where the cookie was seen before
no_doubles.groupby('Cookie').dups.sum()
This gives you:
Cookie
A 2.0
B 1.0
C 0.0
D 2.0
E 0.0
Name: dups, dtype: float64
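If you prefer integer counts like in the desired output, you can cast at the end (a small addition to the above):
no_doubles.groupby('Cookie').dups.sum().astype(int)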
Start by removing consecutive duplicates, then count the survivors:
no_dups = df[df.Cookie != df.Cookie.shift()] # Borrowed from #sacul
no_dups.groupby('Cookie').count() - 1
# SNo
#Cookie
#A 2
#B 1
#C 0
#D 2
#E 0
pandas.factorize and numpy.bincount
If immediately repeated values are not counted, then remove them first. Do a normal count of the values on what's left. However, that is one more than what is asked for, so subtract one. Each step is marked in the code below: factorize, filter out immediate repeats, bincount, and produce a pandas.Series.
i, r = pd.factorize(df.Cookie)           # factorize: integer codes i, unique cookies r
mask = np.append(True, i[:-1] != i[1:])  # filter out immediate repeats
cnts = np.bincount(i[mask]) - 1          # bincount, then subtract the first sighting
pd.Series(cnts, r)                       # produce pandas.Series
A 2
B 1
C 0
D 2
E 0
dtype: int64
pandas.value_counts
Zip the cookies with their lagged self, pulling out the non-repeats:
c = df.Cookie.tolist()
pd.value_counts([a for a, b in zip(c, [None] + c) if a != b]).sort_index() - 1
A 2
B 1
C 0
D 2
E 0
dtype: int64
defaultdict
from collections import defaultdict
def count(s):
    d = defaultdict(lambda: -1)   # start every cookie at -1 so its first sighting isn't counted
    x = None
    for y in s:
        d[y] += y != x            # increment only when the cookie differs from the previous row
        x = y
    return pd.Series(d)

count(df.Cookie)
A 2
B 1
C 0
D 2
E 0
dtype: int64
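For comparison, the same "drop immediate repeats, then count and subtract one" idea can be written in plain Python with itertools.groupby (my own sketch, not part of the original answer; it assumes the same df and pandas import):
from itertools import groupby
from collections import Counter

# collapse runs of identical cookies, count what's left, then subtract the first sighting
deduped = [k for k, _ in groupby(df.Cookie)]
pd.Series(Counter(deduped)).sort_index() - 1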
I'm working on the Titanic dataset and figured that determining whether a ticket is unique or not might be predictive. I've written some code below that tells us if the value is unique or not, but I have a feeling that there is a much cleaner way of doing this.
import pandas as pd
import numpy as np

dummy_data = pd.DataFrame({"Ticket": ['A', 'A', 'B', 'B', 'C', 'D']})
dummy_data

counts = dummy_data['Ticket'].value_counts()
a = counts[counts > 1]
b = a.reset_index(name='Quantity')
b = b.rename(columns={'index': 'Tickets'})
b

dummy_data['Ticket_multiple'] = np.where(dummy_data['Ticket'].isin(b['Tickets']), 1, 0)
dummy_data.head(10)
You can use pandas.Series.duplicated with keep=False:
Parameters: keep{‘first’, ‘last’, False}, default ‘first’ :
Method to handle dropping duplicates:
‘first’ : Mark duplicates as True except for the first occurrence.
‘last’ : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
dummy_data['Ticket_multiple'] = dummy_data["Ticket"].duplicated(keep=False).astype(int)
Output :
print(dummy_data)
Ticket Ticket_multiple
0 A 1
1 A 1
2 B 1
3 B 1
4 C 0
5 D 0
Update :
If you need to consider a value duplicated only if it occurs more than N times, use this:
# added a third 'A' to the sample data
dummy_data = pd.DataFrame({"Ticket": ['A', 'A', 'A', 'B', 'B', 'C', 'D']})
N = 2 #threshold
m = dummy_data.groupby("Ticket").transform("size").gt(N)
dummy_data["Ticket_multiple (threshold)"] = m.astype(int)
print(dummy_data)
Ticket Ticket_multiple Ticket_multiple (threshold)
0 A 1 1
1 A 1 1
2 A 1 1
3 B 1 0
4 B 1 0
5 C 0 0
6 D 0 0
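An equivalent way to build the thresholded flag is to map each ticket to its total count and compare against N (a sketch, reusing dummy_data and N from above):
counts = dummy_data["Ticket"].value_counts()
dummy_data["Ticket_multiple (threshold)"] = dummy_data["Ticket"].map(counts).gt(N).astype(int)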
I have a Pandas data frame like this
x y
0 0 a
1 0 b
2 0 c
3 0 d
4 1 e
5 1 f
6 1 g
7 1 h
What I want to do is, for each value of x, to create a series which cumulatively concatenates the strings which have already appeared in y for that value of x. In other words, I want to get a Pandas series like this:
0
1 a,
2 a,b,
3 a,b,c,
4
5 e,
6 e,f,
7 e,f,g,
I can do it using a double for loop:
dat = pd.DataFrame({'x': [0, 0, 0, 0, 1, 1, 1, 1],
                    'y': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})

z = dat['x'].copy()
for i in range(dat.shape[0]):
    z[i] = ''
    for j in range(i):
        if dat['x'][j] == dat['x'][i]:
            z[i] += dat['y'][j] + ","
but I was wondering whether there is a quicker way? It seems that pandas expanding().apply() doesn't work for strings and it is an open issue. But perhaps there is an efficient way of doing it which doesn't involve apply?
You can do it with shift and np.cumsum in a custom function:
def myfun(x):
    y = x.shift()
    return np.cumsum(y.fillna('').add(',').mask(y.isna(), '')).str[:-1]

df.groupby("x")['y'].apply(myfun)
0
1 a
2 a,b
3 a,b,c
4
5 e
6 e,f
7 e,f,g
Name: y, dtype: object
We can group the dataframe by x, then for each group take the cumulative sum of the shifted column y, and update the values in a new column cum_y in dat:
dat['cum_y'] = ''
for _, g in dat.groupby('x'):
    dat['cum_y'].update(g['y'].add(',').cumsum().shift().str[:-1])
>>> dat
x y cum_y
0 0 a
1 0 b a
2 0 c a,b
3 0 d a,b,c
4 1 e
5 1 f e
6 1 g e,f
7 1 h e,f,g
Use GroupBy.transform with a lambda function that shifts the Series, adds ',', takes the cumulative sum, and finally removes the trailing separator:
f = lambda x: (x.shift(fill_value='') + ',').cumsum()
dat['z'] = dat.groupby('x')['y'].transform(f).str.strip(',')
print (dat)
x y z
0 0 a
1 0 b a
2 0 c a,b
3 0 d a,b,c
4 1 e
5 1 f e
6 1 g e,f
7 1 h e,f,g
I would try to use lists here. Unsure about the efficiency, though...
df.assign(y=df['y'].apply(lambda x: [x])).groupby('x')['y'].transform(
    lambda x: x.cumsum()).str.join(',')
It gives (including the current element; see the shifted version below for an exact match with the question's output):
0 a
1 a,b
2 a,b,c
3 a,b,c,d
4 e
5 e,f
6 e,f,g
7 e,f,g,h
Name: y, dtype: object
Can also do:
(df['y'].apply(list)
.groupby(df['x'])
.transform(lambda x: x.cumsum().shift(fill_value=''))
.str.join(',')
)
Output:
0
1 a
2 a,b
3 a,b,c
4
5 e
6 e,f
7 e,f,g
Name: y, dtype: object
I have a huge dataset with a lot of duplicate values, and I want to remove every value whose count in value_counts() is less than 5.
Values with counts like that, and lower, should be removed.
If you want to remove values from the counts Series, use boolean indexing:
y = pd.Series(['a'] * 5 + ['b'] * 2 + ['c'] * 3 + ['d'] * 7)
s = y.value_counts()
out = s[s > 4]
print (out)
d 7
a 5
dtype: int64
If you want to remove values from the original Series, use Series.isin:
y1 = y[y.isin(out.index)]
print (y1)
0 a
1 a
2 a
3 a
4 a
10 d
11 d
12 d
13 d
14 d
15 d
16 d
dtype: object
Thank you Mr. jezrael, your answer is very helpful. I will add a small tip: after gathering the values, this is how you can replace them:
s = y.value_counts()
x = s[s > 5]
for z in y:
    if z not in x:
        y = y.replace([z], 'Other')
    else:
        continue
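A vectorized alternative to that loop, keeping the frequent values and lumping everything else into 'Other' in one pass (a sketch reusing the x from above, i.e. the counts that passed the threshold):
# keep values whose count passed the threshold, replace the rest with 'Other'
y = y.where(y.isin(x.index), 'Other')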
Before I begin: I can hack something together to do this on a small scale, but my goal is to apply this to a 200k+ row dataset, so efficiency is the priority and I lack the more... nuanced techniques. :-)
So, I have an ordered data set that represents data from a very complex hierarchical structure. I only have a unique ID, the tree depth, and the fact that it is in order. For example:
a
  b
    c
      d
      e
    f
    g
      h
i
  j
    k
  l
Which is stored as:
ID depth
0 a 0
1 b 1
2 c 2
3 d 3
4 e 3
5 f 2
6 g 2
7 h 3
8 i 0
9 j 1
10 k 2
11 l 1
Here's a line that should generate my example.
df = pd.DataFrame.from_dict({ "ID":["a","b","c","d","e","f","g","h","i","j","k","l"],
"depth":[0,1,2,3,3,2,2,3,0,1,2,1] })
What I want is to return either the index of each element's nearest parent node or the parent's unique ID (they'll both work since they're both unique). Something like:
ID depth parent p.idx
0 a 0
1 b 1 a 0
2 c 2 b 1
3 d 3 c 2
4 e 3 c 2
5 f 2 b 1
6 g 2 b 1
7 h 3 g 6
8 i 0
9 j 1 i 8
10 k 2 j 9
11 l 1 i 8
My initial sloppy solution involved adding a column that was index-1, then self matching the data set with idx-1 (left) and idx (right), then identifying the maximum parent idx less than the child index... it didn't scale up well.
Here are a couple of routes to performing this task I've put together that work but aren't very efficient.
The first uses simple loops and includes a break to exit when the first match is identified.
df = pd.DataFrame.from_dict({"ID": ["a","b","c","d","e","f","g","h","i","j","k","l"],
                             "depth": [0,1,2,3,3,2,2,3,0,1,2,1]})
df["parent"] = ""

# loop over entire dataframe
for i1 in range(len(df.depth)):
    # loop back up from current row to top
    for i2 in range(i1):
        # identify row where the depth is 1 less
        if df.depth[i1] - 1 == df.depth[i1 - i2 - 1]:
            # Set parent value and exit loop
            df.parent[i1] = df.ID[i1 - i2 - 1]
            break

df.head(15)
This second approach merges the dataframe with itself and then uses a groupby to identify the maximum parent row less than each original row:
df = pd.DataFrame.from_dict({"ID": ["a","b","c","d","e","f","g","h","i","j","k","l"],
                             "depth": [0,1,2,3,3,2,2,3,0,1,2,1]})

# Columns for comparison and merging
df["parent_depth"] = df.depth - 1
df["row"] = df.index

# Merge to return ALL elements matching the parent depth of each row
df = df.merge(df[["ID", "depth", "row"]], left_on="parent_depth", right_on="depth",
              how="left", suffixes=('', '_y'))

# Identify the maximum parent row less than the original row
g1 = df[(df.row_y < df.row) | (df.row_y.isnull())].groupby("ID").max()
g1.reset_index(inplace=True)

# clean up
g1.drop(["parent_depth", "row", "depth_y", "row_y"], axis=1, inplace=True)
g1.rename(columns={"ID_y": "parent"}, inplace=True)
g1.head(15)
I'm confident those with more experience can provide more elegant solutions, but since I got something working, I wanted to provide my "solution". Thanks!
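For what it's worth, here is a single-pass sketch of the rule described in the question (each row's parent is the most recent earlier row that is exactly one level shallower); it is only an illustration I put together, not part of the original answer:
df = pd.DataFrame({"ID": ["a","b","c","d","e","f","g","h","i","j","k","l"],
                   "depth": [0,1,2,3,3,2,2,3,0,1,2,1]})

parents = []
last_id_at_depth = {}                                     # depth -> most recent ID seen at that depth
for row_id, depth in zip(df.ID, df.depth):
    parents.append(last_id_at_depth.get(depth - 1, ""))   # root nodes get an empty parent
    last_id_at_depth[depth] = row_id

df["parent"] = parents
df.head(15)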
I wouldn't be posting this if I hadn't done extensive research in an attempt to find the answer. Alas, I have not been able to find one. I have a paired dataset that looks something like this:
PERSON, ATTRIBUTE
person1, a
person1, b
person1, c
person1, d
person2, c
person2, d
person2, x
person3, a
person3, b
person3, e
person3, f
What I want to do is: 1) drop attributes that don't appear more than 10 times, 2) turn it into a binary table that would look something like this:
a b c
person1 1 1 1
person2 0 0 1
person3 1 1 0
So far, I have put together a script to drop the attributes that appear fewer than 10 times; however, it is painfully slow, as it has to go through each attribute, determine its frequency, and find the corresponding x and y values to append to new variables.
import pandas as pd
import numpy as np
import csv
from collections import Counter
import time

df = pd.read_csv(filepath_or_buffer='sample.csv', sep=',')

x = df.ix[:, 1].values
y = df.ix[:, 0].values
x_vals = []
y_vals = []
counter = Counter(x)

start_time = time.time()
for each in counter:
    if counter[each] >= 10:
        for i, j in enumerate(x):
            if j == each:
                print "Adding position:" + str(i)
                x_vals.append(each)
                y_vals.append(y[i])
print "Time took: %s" % (time.time() - start_time)
I would love some help in 1) finding a faster way to match attributes that appear more than 10 times and appending the values to new variables.
OR
2) An alternative method entirely to get to the final binary table. I feel like converting a paired table to a binary table is probably a common task in the data world, yet I couldn't find any code, module, etc. that could help with doing that.
Thanks a million!
I would probably add a dummy column and then call pivot_table:
>>> df = pd.DataFrame({"PERSON": ["p1", "p2", "p3"] * 10, "ATTRIBUTE": np.random.choice(["a","b","c","d","e","f","x"], 30)})
>>> df.head()
ATTRIBUTE PERSON
0 d p1
1 b p2
2 x p3
3 b p1
4 f p2
>>> df["count"] = 1
>>> p = df.pivot_table(index="PERSON", columns="ATTRIBUTE", values="count",
aggfunc=sum, fill_value=0)
>>> p
ATTRIBUTE a b c d e f x
PERSON
p1 1 3 1 1 1 0 3
p2 2 1 1 2 1 2 1
p3 0 4 1 1 2 0 2
And then we can select only the attributes with enough occurrences, i.e. at least 10 in the question (here at least 5, since my random example is smaller):
>>> p.loc[:,p.sum() >= 5]
ATTRIBUTE b x
PERSON
p1 3 3
p2 1 1
p3 4 2
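Applied to the original PERSON/ATTRIBUTE pairs, pd.crosstab gets to the 0/1 table directly, and rare attributes can be dropped by filtering on the raw counts first (a sketch using the question's sample data and its threshold of 10; on this tiny sample the filter drops every column, but it behaves as intended on the real data):
# paired data as in the question (small sample)
pairs = pd.DataFrame({"PERSON": ["person1"] * 4 + ["person2"] * 3 + ["person3"] * 4,
                      "ATTRIBUTE": list("abcd") + list("cdx") + list("abef")})

counts = pairs["ATTRIBUTE"].value_counts()
frequent = counts[counts > 10].index                                      # attributes seen more than 10 times
binary = pd.crosstab(pairs["PERSON"], pairs["ATTRIBUTE"]).clip(upper=1)   # 1 if the person/attribute pair exists
binary = binary.loc[:, binary.columns.isin(frequent)]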