I want to find the count of previous rows that have a greater value than the current row in a column, and store it in a new column. It would be like a rolling countif that looks back to the beginning of the column. The desired example output below shows the given Value column and the Count column I want to create.
Desired Output:
Value Count
5 0
7 0
4 2
12 0
3 4
4 3
1 6
I plan on using this code with a large dataframe so the fastest way possible is appreciated.
We can use np.subtract.outer from numpy, take the lower triangle, check where the result is less than 0, and sum per row:
import numpy as np

# pairwise differences value[i] - value[j]; keep the lower triangle (j <= i),
# then count per row how many previous values are strictly greater
a = np.sum(np.tril(np.subtract.outer(df.Value.values, df.Value.values), k=0) < 0, axis=1)
# results in array([0, 0, 2, 0, 4, 3, 6])
df['Count'] = a
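Note that subtract.outer materializes an n x n matrix, so memory grows quadratically with the frame length. If that becomes a problem at scale, here is a sketch of an O(n log n) alternative that keeps a running sorted list (the helper name is illustrative, not a pandas built-in):

from bisect import bisect_right, insort

def count_prev_greater(values):
    # for each element, count earlier elements strictly greater than it
    seen, counts = [], []
    for v in values:
        # everything to the right of bisect_right(seen, v) is > v
        counts.append(len(seen) - bisect_right(seen, v))
        insort(seen, v)
    return counts

count_prev_greater([5, 7, 4, 12, 3, 4, 1])  # [0, 0, 2, 0, 4, 3, 6]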
IMPORTANT: this only works with pandas < 1.0.0; the error appears to be a pandas bug. An issue has already been filed at https://github.com/pandas-dev/pandas/issues/35203
We can do this with expanding and apply, using a function that counts the values greater than the last element of the expanding window.
import pandas as pd
import numpy as np
# setup
df = pd.DataFrame([5,7,4,12,3,4,1], columns=['Value'])
# calculate countif
df['Count'] = df.Value.expanding(1).apply(lambda x: np.sum(np.where(x > x[-1], 1, 0))).astype('int')
Input
Value
0 5
1 7
2 4
3 12
4 3
5 4
6 1
Output
Value Count
0 5 0
1 7 0
2 4 2
3 12 0
4 3 4
5 4 3
6 1 6
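If you are on pandas >= 1.0 and hit the bug mentioned above, a possible workaround (a sketch, assuming the failure comes from positional indexing on a Series) is to pass raw=True so the lambda receives a plain numpy array, where x[-1] is valid:

df['Count'] = (df.Value.expanding(1)
                 .apply(lambda x: np.sum(x[:-1] > x[-1]), raw=True)
                 .astype('int'))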
counts = []
for i in range(len(values)):
    c = 0
    # compare the current value with everything before it
    for j in values[:i]:
        if values[i] < j:
            c += 1
    counts.append(c)
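For completeness, assuming values is taken from the example frame's Value column, the result wires back in like so:

values = df.Value.tolist()
# ... run the loop above to build counts ...
df['Count'] = counts  # [0, 0, 2, 0, 4, 3, 6]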
The generator below will do what you need. You may be able to optimize it further if needed.
def generator(data):
    # assumes non-negative integer data
    i = 0
    count_dict = {}
    m = max(data)
    while i < len(data):
        v = data[i]
        count_dict[v] = count_dict[v] + 1 if v in count_dict else 1
        # count previously seen values strictly greater than v
        t = sum(count_dict.get(j, 0) for j in range(v + 1, m + 1))
        i += 1
        yield t

d = [1, 5, 7, 3, 5, 8]
foo = generator(d)
result = [b for b in foo]
print(result)  # [0, 0, 0, 2, 1, 0]
I previously asked a similar but simpler question, which has been resolved: how to merge strings that have substrings in common to produce some groups in a data frame in Python.
But here I have a more advanced version of that question:
I have a sample data:
a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
What I want to do is merge strings if they have substrings in common. So, in this example, the strings 'b,c', 'a' and 'a,c,d,e' should be merged together because they can be linked to each other. 'j,k,l' and 'k,l,m' should be in one group. In the end, I hope to have something like:
group
'b,c', 0
'a', 0
'a,c,d,e', 0
'f,g,h,i', 1
'j,k,l', 2
'k,l,m' 2
So I can have three groups, and there are no common substrings between any two groups.
Now I am trying to build a similarity data frame, in which 1 means two strings have substrings in common. Here is my code:
commonWords = 1
for i in np.arange(a.shape[0]):
    a.loc[:, a.loc[i, 'ACTIVITY']] = 0
for i in a.loc[:, 'ACTIVITY']:
    il = i.split(',')
    for j in a.loc[:, 'ACTIVITY']:
        jl = j.split(',')
        c = [x in il for x in jl]
        c1 = [x for x in c if x == True]
        a.loc[(a.loc[:, 'ACTIVITY'] == i), j] = 1 if len(c1) >= commonWords else 0
a
The result is:
ACTIVITY b,c a a,c,d,e f,g,h,i j,k,l k,l,m
0 b,c 1 0 1 0 0 0
1 a 0 1 1 0 0 0
2 a,c,d,e 1 1 1 0 0 0
3 f,g,h,i 0 0 0 1 0 0
4 j,k,l 0 0 0 0 1 1
5 k,l,m 0 0 0 0 1 1
In this code, commonWords is how many substrings I hope two strings have in common. For example, if commonWords=2, two strings will be merged only if they share two or more substrings. When commonWords=2, the groups should be:
group
'b,c', 0
'a', 1
'a,c,d,e', 2
'f,g,h,i', 3
'j,k,l', 4
'k,l,m' 4
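In case it helps answerers: the similarity matrix above is effectively an adjacency matrix, so the groups I am after are its connected components. A sketch with scipy (assuming the 0/1 matrix produced by my code above):

from scipy.sparse.csgraph import connected_components

adj = a.iloc[:, 1:].to_numpy()                       # the 0/1 similarity matrix
n_groups, labels = connected_components(adj, directed=False)
a['group'] = labels                                  # e.g. [0, 0, 0, 1, 2, 2]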
Use:
import pandas as pd
import numpy as np
from itertools import combinations, chain
from collections import Counter

a = pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
#split values by , to lists
splitted = a['ACTIVITY'].str.split(',')
commonWords=2
#create edges (can only connect two nodes)
L2_nested = [list(combinations(l,commonWords)) for l in splitted]
L2 = list(chain.from_iterable(L2_nested))
#convert values to sets
f1 = [set(k) for k, v in Counter(L2).items() if v >= commonWords]
f2 = [set(x) for x in splitted]
#create new columns for matched sets
for val in f1:
j = ','.join(val)
a[j] = [j if len(val & x) == commonWords else np.nan for x in f2]
print (a)
#forward filling values of new columns and use factorize for groups
new = pd.factorize(a.assign(ACTIVITY = a.index).ffill(axis=1).iloc[:, -1])[0]
a = a[['ACTIVITY']].assign(group = new)
print (a)
ACTIVITY group
0 b,c 0
1 a 1
2 a,c,d,e 2
3 f,g,h,i 3
4 j,k,l 4
5 k,l,m 4
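An alternative sketch, in case the pair-counting above is hard to tune: the transitive merging the question describes is a classic union-find (connected components) problem. The helper below is hypothetical (not from the question) and assumes comma-separated tokens; note it compares all pairs, so it is quadratic in the number of rows:

import pandas as pd

def group_activities(series, common_words=1):
    # union-find over rows that share at least common_words tokens
    tokens = [set(s.split(',')) for s in series]
    parent = list(range(len(tokens)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if len(tokens[i] & tokens[j]) >= common_words:
                parent[find(i)] = find(j)

    return pd.factorize([find(i) for i in range(len(tokens))])[0]

a['group'] = group_activities(a['ACTIVITY'], common_words=1)
# common_words=1 -> [0, 0, 0, 1, 2, 2]; common_words=2 -> [0, 1, 2, 3, 4, 4]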
Before I begin: I can hack something together to do this on a small scale, but my goal is to apply this to a 200k+ row dataset, so efficiency is the priority and I lack the more... nuanced techniques. :-)
So, I have an ordered data set that represents data from a very complex hierarchical structure. I only have a unique ID, the tree depth, and the fact that it is in order. For example:
a
  b
    c
      d
      e
    f
    g
      h
i
  j
    k
  l
Which is stored as:
ID depth
0 a 0
1 b 1
2 c 2
3 d 3
4 e 3
5 f 2
6 g 2
7 h 3
8 i 0
9 j 1
10 k 2
11 l 1
Here's a line that should generate my example.
df = pd.DataFrame.from_dict({ "ID":["a","b","c","d","e","f","g","h","i","j","k","l"],
"depth":[0,1,2,3,3,2,2,3,0,1,2,1] })
What I want is to return either the index of each elements' nearest parent node or the parents' unique ID (they'll both work since they're both unique). Something like:
ID depth parent p.idx
0 a 0
1 b 1 a 0
2 c 2 b 1
3 d 3 c 2
4 e 3 c 2
5 f 2 b 1
6 g 2 b 1
7 h 3 g 6
8 i 0
9 j 1 i 8
10 k 2 j 9
11 l 1 i 8
My initial sloppy solution involved adding a column that was index-1, then self-joining the data set on idx-1 (left) and idx (right), then identifying the maximum parent idx less than the child index... it didn't scale up well.
Here are a couple of routes to performing this task I've put together that work but aren't very efficient.
The first uses simple loops and includes a break to exit when the first match is identified.
df = pd.DataFrame.from_dict({ "ID":["a","b","c","d","e","f","g","h","i","j","k","l"], "depth":[0,1,2,3,3,2,2,3,0,1,2,1] })
df["parent"] = ""

# loop over the entire dataframe
for i1 in range(len(df.depth)):
    # loop back up from the current row to the top
    for i2 in range(i1):
        # identify the row where the depth is one less
        if df.depth[i1] - 1 == df.depth[i1 - i2 - 1]:
            # set the parent value and exit the inner loop
            df.loc[i1, "parent"] = df.ID[i1 - i2 - 1]
            break

df.head(15)
The second merges the dataframe with itself, then uses a groupby to identify the maximum parent row less than each original row:
df = pd.DataFrame.from_dict({ "ID":["a","b","c","d","e","f","g","h","i","j","k","l"], "depth":[0,1,2,3,3,2,2,3,0,1,2,1] })

# columns for comparison and merging
df["parent_depth"] = df.depth - 1
df["row"] = df.index

# merge to return ALL elements matching the parent depth of each row
df = df.merge(df[["ID","depth","row"]], left_on="parent_depth", right_on="depth", how="left", suffixes=('','_y'))

# identify the maximum parent row less than the original row
g1 = df[(df.row_y < df.row) | (df.row_y.isnull())].groupby("ID").max()
g1.reset_index(inplace=True)

# clean up
g1.drop(["parent_depth","row","depth_y","row_y"], axis=1, inplace=True)
g1.rename(columns={"ID_y": "parent"}, inplace=True)
g1.head(15)
I'm confident those with more experience can provide more elegant solutions, but since I got something working, I wanted to provide my "solution". Thanks!
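One more direction: since the rows are in document order and each row's parent is simply the most recent row with a smaller depth, a single pass with a stack (one entry per depth level) runs in linear time. A sketch, assuming the example frame above:

df = pd.DataFrame.from_dict({ "ID":["a","b","c","d","e","f","g","h","i","j","k","l"], "depth":[0,1,2,3,3,2,2,3,0,1,2,1] })

parents, stack = [], []        # stack[d] holds the index of the latest node at depth d
for idx, d in enumerate(df.depth):
    del stack[d:]              # discard nodes at this depth or deeper
    parents.append(stack[-1] if stack else -1)
    stack.append(idx)

df["p.idx"] = parents
df["parent"] = ["" if p < 0 else df.ID[p] for p in parents]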
I have a pandas data frame and I want the average number of consecutive values in each row. For example, for the following data
a b c d e f g h i j k l
p1 0 0 4 4 4 4 4 4 1 4 4 1
p2 0 4 4 0 4 4 0 1 4 4 0 1
so the average number of consecutive 4's for p1 is (6+2)/2 = 4, and for p2 it is (2+2+2)/3 = 2.
Is there also a way to find the min and max number of consecutive values? i.e. max for p1 is 6.
You can transpose your dataframe and use the method suggested in the post below. You will get a dataframe of counts of consecutive numbers, from which you can compute the mean, min and max.
https://stackoverflow.com/a/29643066/12452044
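A minimal sketch of that idea (the shift/cumsum trick from the linked answer) applied to one row, assuming the question's frame indexed by p1/p2; the mean, min and max then come straight off the run lengths:

import pandas as pd

# the question's data, indexed by p1/p2
df = pd.DataFrame([[0,0,4,4,4,4,4,4,1,4,4,1],
                   [0,4,4,0,4,4,0,1,4,4,0,1]],
                  index=['p1', 'p2'], columns=list('abcdefghijkl'))

row = df.loc['p1']
# label each run of equal values, then record its value and its length
runs = row.groupby((row != row.shift()).cumsum()).agg(['first', 'size'])
fours = runs.loc[runs['first'] == 4, 'size']
print(fours.mean(), fours.min(), fours.max())   # 4.0 2 6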
This will work for p1. To get p2, just replace the 0's with 1's wherever you see an iloc call being used.
runs = {0: [], 1: [], 2: [], 3: [], 4: []}
counter = 1
for i in range(len(df.iloc[0]) - 1):
    num = df.iloc[0, i]
    num2 = df.iloc[0, i + 1]
    if num == num2:
        counter += 1
    else:
        runs[num].append(counter)
        counter = 1
# don't forget the final run at the end of the row
runs[df.iloc[0, -1]].append(counter)
Then to get the average number of consecutive 4's:
print(sum(runs[4]) / len(runs[4]))
And to get the max number of consecutive 4's (min works the same way):
print(max(runs[4]))
I have a data set which contains something like this:
SNo Cookie
1 A
2 A
3 A
4 B
5 C
6 D
7 A
8 B
9 D
10 E
11 D
12 A
So let's say we have 5 cookies, 'A,B,C,D,E'. Now I want to count, for each cookie, how many times it reoccurred after a new cookie had been encountered. For example, in the data above, cookie A is encountered again at the 7th place and then at the 12th place. NOTE: we wouldn't count A at the 2nd place, as it appears back-to-back; at positions 7 and 12 we had seen other new cookies before seeing A again, so those instances count. So essentially I want something like this:
Sno Cookie Count
1 A 2
2 B 1
3 C 0
4 D 2
5 E 0
Can anyone give me logic or python code behind this?
One way to do this would be to first get rid of consecutive Cookies, then find where a Cookie has been seen before using duplicated, and finally group by Cookie and take the sum:
no_doubles = df[df.Cookie != df.Cookie.shift()].copy()
no_doubles['dups'] = no_doubles.Cookie.duplicated()
no_doubles.groupby('Cookie').dups.sum()
This gives you:
Cookie
A 2.0
B 1.0
C 0.0
D 2.0
E 0.0
Name: dups, dtype: float64
Start by removing consecutive duplicates, then count the survivors:
no_dups = df[df.Cookie != df.Cookie.shift()] # Borrowed from #sacul
no_dups.groupby('Cookie').count() - 1
# SNo
#Cookie
#A 2
#B 1
#C 0
#D 2
#E 0
pandas.factorize and numpy.bincount

If immediately repeated values are not counted, remove them first, then do a normal counting of values on what's left. However, that is one more than what is asked for, so subtract one. The steps:

1. factorize
2. Filter out immediate repeats
3. bincount
4. Produce a pandas.Series
i, r = pd.factorize(df.Cookie)           # integer codes and the unique cookies
mask = np.append(True, i[:-1] != i[1:])  # True wherever the value changes
cnts = np.bincount(i[mask]) - 1          # occurrences, minus the free first sighting
pd.Series(cnts, r)
A 2
B 1
C 0
D 2
E 0
dtype: int64
pandas.value_counts
Zip the cookies with a lagged copy of themselves, keeping only the non-repeats:
c = df.Cookie.tolist()
pd.value_counts([a for a, b in zip(c, [None] + c) if a != b]).sort_index() - 1
A 2
B 1
C 0
D 2
E 0
dtype: int64
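If the top-level pd.value_counts is not available (newer pandas versions deprecate it in favour of the Series method), the same idea can be written as:

s = df.Cookie.tolist()
pd.Series([a for a, b in zip(s, [None] + s) if a != b]).value_counts().sort_index() - 1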
defaultdict
from collections import defaultdict

def count(s):
    d = defaultdict(lambda: -1)  # start at -1 so the first sighting is free
    x = None
    for y in s:
        d[y] += y != x           # increment only when not an immediate repeat
        x = y
    return pd.Series(d)
count(df.Cookie)
A 2
B 1
C 0
D 2
E 0
dtype: int64