cumulative number of unique elements for pandas dataframe

I have a pandas DataFrame:
id tag
1 A
1 A
1 B
1 C
1 A
2 B
2 C
2 B
I want to add a column that computes the cumulative number of unique tags within each id. More specifically, I would like to have:
id tag count
1 A 1
1 A 1
1 B 2
1 C 3
1 A 3
2 B 1
2 C 2
2 B 2
For a given id, count will be non-decreasing. Thanks for your help!

I think this does what you want:
unique_count = df.drop_duplicates().groupby('id').cumcount() + 1
unique_count.reindex(df.index).ffill()
The +1 is needed because cumcount starts at zero. This only works if the dataframe is sorted by id, since the forward fill would otherwise carry a count from one id into the rows of another. Was that intended? You can always sort beforehand.
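A minimal sketch of the approach above, wrapped in a helper so that a frame with interleaved ids is handled too. The function name `cumulative_unique` and the `mixed` sample frame are illustrative, not part of the original answer, and the sketch assumes the index is unique, because the counts are aligned back by index label.

```python
import pandas as pd

def cumulative_unique(df, group_col, value_col):
    """Running count of distinct value_col entries within each group_col.

    Hypothetical helper; assumes df has a unique index.
    """
    # A stable sort keeps the original row order inside each group.
    ordered = df.sort_values(group_col, kind="stable")
    # Keep only the first appearance of every (group, value) pair.
    firsts = ordered.drop_duplicates([group_col, value_col])
    # cumcount starts at zero, hence the +1.
    counts = firsts.groupby(group_col).cumcount() + 1
    # Repeated values inherit the count of the closest earlier first appearance.
    filled = counts.reindex(ordered.index).ffill().astype(int)
    # Return the counts in the original row order.
    return filled.reindex(df.index)

df = pd.DataFrame({"id": [1, 1, 1, 1, 1, 2, 2, 2],
                   "tag": ["A", "A", "B", "C", "A", "B", "C", "B"]})
df["count"] = cumulative_unique(df, "id", "tag")
print(df)

# Same logic on a frame whose ids are interleaved (made-up data).
mixed = pd.DataFrame({"id": [1, 2, 1, 1, 2, 1, 2, 1],
                      "tag": ["A", "B", "A", "B", "C", "C", "B", "A"]})
mixed["count"] = cumulative_unique(mixed, "id", "tag")
print(mixed)
```

Sorting inside the helper means the caller does not have to remember the sorted-by-id requirement; the price is one extra sort and a final reindex.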

You can find some other approaches in R and Python here
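import pandas as pd
df = pd.DataFrame({'id': [1,1,1,1,1,2,2,2], 'tag': ["A","A","B","C","A","B","C","B"]})
# transform (rather than apply) keeps the original index, so the result assigns back cleanly
df['count'] = df.groupby('id')['tag'].transform(lambda x: (~x.duplicated()).cumsum())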
id tag count
0 1 A 1
1 1 A 1
2 1 B 2
3 1 C 3
4 1 A 3
5 2 B 1
6 2 C 2
7 2 B 2

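How about this:
d['X'] = 1
d.groupby("Col").X.cumsum()
Note that this gives a running count of rows within each group, so repeated values are counted again.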

import pandas as pd
idt=[1,1,1,1,1,2,2,2]
tag=['A','A','B','C','A','B','C','B']
df=pd.DataFrame(tag,index=idt,columns=['tag'])
df=df.reset_index()
print(df)
index tag
0 1 A
1 1 A
2 1 B
3 1 C
4 1 A
5 2 B
6 2 C
7 2 B
df['uCnt']=df.groupby(['index','tag']).cumcount()+1
print(df)
index tag uCnt
0 1 A 1
1 1 A 2
2 1 B 1
3 1 C 1
4 1 A 3
5 2 B 1
6 2 C 1
7 2 B 2
df['uCnt']=df['uCnt']//df['uCnt']**2
print(df)
index tag uCnt
0 1 A 1
1 1 A 0
2 1 B 1
3 1 C 1
4 1 A 0
5 2 B 1
6 2 C 1
7 2 B 0
df['uCnt']=df.groupby(['index'])['uCnt'].cumsum()
print(df)
index tag uCnt
0 1 A 1
1 1 A 1
2 1 B 2
3 1 C 3
4 1 A 3
5 2 B 1
6 2 C 2
7 2 B 2
df=df.set_index('index')
print(df)
tag uCnt
index
1 A 1
1 A 1
1 B 2
1 C 3
1 A 3
2 B 1
2 C 2
2 B 2
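The intermediate steps above (flag the first occurrence of each tag within an id, then take a running total per id) can also be sketched in one expression. This is a condensed variant, not the answer's own code; it uses the `id` and `tag` column names from the question instead of the reset index.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 1, 1, 2, 2, 2],
                   "tag": ["A", "A", "B", "C", "A", "B", "C", "B"]})

# True on the first row where a given (id, tag) pair shows up, False on repeats.
is_first = ~df.duplicated(["id", "tag"])

# Running total of first occurrences, restarted for every id.
df["uCnt"] = is_first.groupby(df["id"]).cumsum()
print(df)
```

Because `duplicated` looks at the whole frame, this variant does not depend on the rows being sorted by id.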

Related

Fill Nan with all the information from previous week

I have a dataframe that looks like:
Week Store End Cap UPC
0 1 1 A 123456.0
1 1 1 B 789456.0
2 1 1 B 546879.0
3 1 1 C 423156.0
4 1 2 A 231567.0
5 1 2 B 456123.0
6 1 2 D 689741.0
7 2 1 A 321654.0
8 2 1 B NaN
9 2 1 C 852634.0
For every row with a NaN UPC, I want to look at the previous week, match on Store and End Cap, and pull in all the matching rows from that week.
In the example above, (2/1/B) matches both the second and third rows, which show (1/1/B), so the desired output looks like this:
Week Store End Cap UPC
0 1 1 A 123456.0
1 1 1 B 789456.0
2 1 1 B 546879.0
3 1 1 C 423156.0
4 1 2 A 231567.0
5 1 2 B 456123.0
6 1 2 D 689741.0
7 2 1 A 321654.0
8 2 1 B 789456.0
9 2 1 B 546879.0
10 2 1 C 852634.0
Both 789456 and 546879 now show up for (2/1/B).
How can I go about doing this?
I tried sorting and forward filling, but that only gets me one of the values, not all of them.

Let's try a self merge after adding 1 to Week:
out = df.merge(df.assign(Week=df['Week'].add(1)),
               on=['Week','Store','End Cap'], how='left', suffixes=('','_y'))
out['UPC'] = out['UPC'].fillna(out['UPC_y'])
out = out.loc[:, df.columns]
print(out)
Week Store End Cap UPC
0 1 1 A 123456.0
1 1 1 B 789456.0
2 1 1 B 546879.0
3 1 1 C 423156.0
4 1 2 A 231567.0
5 1 2 B 456123.0
6 1 2 D 689741.0
7 2 1 A 321654.0
8 2 1 B 789456.0
9 2 1 B 546879.0
10 2 1 C 852634.0
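For reference, a self-contained sketch of the self merge, rebuilding the sample frame from the question. The `_prev` suffix and the `prev` variable are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Week":    [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    "Store":   [1, 1, 1, 1, 2, 2, 2, 1, 1, 1],
    "End Cap": list("ABBCABDABC"),
    "UPC":     [123456.0, 789456.0, 546879.0, 423156.0, 231567.0,
                456123.0, 689741.0, 321654.0, np.nan, 852634.0],
})

# Relabel every row as belonging to the following week, so that week N
# in `prev` holds what was on the shelf in week N - 1.
prev = df.assign(Week=df["Week"] + 1)

# A left merge keeps every original row and repeats it once per match.
out = df.merge(prev, on=["Week", "Store", "End Cap"],
               how="left", suffixes=("", "_prev"))

# Only the missing UPCs are taken from the previous week.
out["UPC"] = out["UPC"].fillna(out["UPC_prev"])
out = out[df.columns]
print(out)
```

One caveat of this sketch: a row that already has a UPC is also repeated if it matches several rows from the previous week, so it may be worth dropping duplicates afterwards if that can happen in your data.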

Python Counting Same Values For Specific Columns

If I have a dataframe:
A B C D
1 1 2 2 1
2 1 1 2 1
3 3 1 0 1
4 2 4 4 4
I want to add columns B and C together and count how many rows have a sum that differs from column D. The desired output is:
A B C B+C D
1 1 2 2 4 1
2 1 1 2 3 1
3 3 1 0 1 1
4 2 4 4 8 4
There are 3 rows where "B+C" and "D" differ.
Could you please help me with this?

You could do something like:
df.B.add(df.C).ne(df.D).sum()
# 3
If you need to add the column:
df['B+C'] = df.B.add(df.C)
diff = df['B+C'].ne(df.D).sum()
print(f'There are {diff} different values compare the "B+C" and "D"')
#There are 3 different values compare the "B+C" and "D"

df.insert(3,'B+C', df['B']+df['C'])
Here 3 is the position at which the new column is inserted.
df.head()
A B C B+C D
0 1 2 2 4 1
1 1 1 2 3 1
2 3 1 0 1 1
3 2 4 4 8 4
After that you can follow the steps from @yatu's answer:
df['B+C'].ne(df['D'])
0 True
1 True
2 False
3 True
dtype: bool
df['B+C'].ne(df['D']).sum()
3
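Putting the two answers together, a small runnable sketch on the question's data. The names `mismatch` and `n_diff` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 3, 2],
                   "B": [2, 1, 1, 4],
                   "C": [2, 2, 0, 4],
                   "D": [1, 1, 1, 4]})

# Insert the sum at column position 3, i.e. between C and D.
df.insert(3, "B+C", df["B"] + df["C"])

# Boolean mask: True where the sum differs from D.
mismatch = df["B+C"].ne(df["D"])

# Summing a boolean mask counts its True entries.
n_diff = int(mismatch.sum())

print(df[mismatch])  # the rows that differ
print(n_diff)
```

Keeping the mask around makes it possible to inspect the differing rows as well as count them.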

Use groupby and merge to create new column in pandas

So I have a pandas dataframe that looks something like this.
name is_something
0 a 0
1 b 1
2 c 0
3 c 1
4 a 1
5 b 0
6 a 1
7 c 0
8 a 1
Is there a way to use groupby and merge to create a new column that gives the number of times a name appears with an is_something value of 1 in the whole dataframe? The updated dataframe would look like this:
name is_something no_of_times_is_something_is_1
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
I know you can just loop through the dataframe to do this but I'm looking for a more efficient way because the dataset I'm working with is quite large. Thanks in advance!

If the is_something column contains only 0 and 1 values, use sum with GroupBy.transform to fill a new column with the aggregated values:
df['new'] = df.groupby('name')['is_something'].transform('sum')
print (df)
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
If other values are possible, first compare with 1, convert to integer, and then use transform with sum:
df['new'] = df['is_something'].eq(1).astype(int).groupby(df['name']).transform('sum')

Or we just map it
df['New']=df.name.map(df.query('is_something ==1').groupby('name')['is_something'].sum())
df
name is_something New
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3

You could do:
df['new'] = df.groupby('name')['is_something'].transform(lambda xs: xs.eq(1).sum())
print(df)
Output
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
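A short sketch comparing the transform and map variants. The extra row for name `d`, which never has `is_something == 1`, is not in the question; it is added here only to show how the two variants behave for such a name.

```python
import pandas as pd

df = pd.DataFrame({
    "name": list("abccabaca") + ["d"],               # "d" is an illustrative extra row
    "is_something": [0, 1, 0, 1, 1, 0, 1, 0, 1, 0],
})

# transform broadcasts the per-name total back onto every row of that name.
df["new"] = (df["is_something"].eq(1).astype(int)
             .groupby(df["name"]).transform("sum"))

# The map variant only knows names that have at least one 1,
# so other names come back as NaN unless they are filled in.
ones_per_name = df.query("is_something == 1").groupby("name")["is_something"].sum()
mapped_raw = df["name"].map(ones_per_name)
df["mapped"] = mapped_raw.fillna(0).astype(int)
print(df)
```

With transform no fill step is needed, which is one reason to prefer it when some names may have no matching rows.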

How to groupby one column and do nothing to other columns in pandas?

I have a dataframe like this:
a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3
How can I group by 'a', leave columns b, c and d untouched, and split the result into several dataframes? Like this:
First groupby column 'a':
a b c d
0 1 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 2 1 1 1
5 2 2 2
6 3 3 3
And then split into different dataframes based on numbers in 'a':
dataframe 1:
a b c d
0 1 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
dataframe 2:
a b c d
0 2 1 1 1
1 2 2 2
2 3 3 3
:
:
:
dataframe n:
a b c d
0 n 1 1 1
1 2 2 2
2 3 3 3

Iterate over each group that df.groupby returns.
for _, g in df.groupby('a'):
    print(g, '\n')
a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4
a b c d
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3
If you want a dict of dataframes, I'd recommend:
df_dict = {idx : g for idx, g in df.groupby('a')}
Here, idx is the unique a value.
A couple of nifty techniques, courtesy of Root:
df_dict = dict(list(df.groupby('a'))) # for a dictionary
And,
idxs, dfs = zip(*df.groupby('a')) # separate lists
idxs
(1, 2)
dfs
( a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4, a b c d
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3)
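A runnable sketch of the dictionary approach on the question's data. Resetting the index of each piece is an addition made here so that every frame starts at 0, as in the question's desired output; the name `frames` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 1, 2, 2, 2],
                   "b": [1, 2, 3, 4, 1, 2, 3],
                   "c": [1, 2, 3, 4, 1, 2, 3],
                   "d": [1, 2, 3, 4, 1, 2, 3]})

# One frame per distinct value of 'a'; the other columns are left untouched.
frames = {key: group.reset_index(drop=True) for key, group in df.groupby("a")}

for key, frame in frames.items():
    print(f"a == {key}")
    print(frame, "\n")
```

Drop the `reset_index` call if the original row labels should be preserved in each piece.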

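Another way is to use np.split:
import numpy as np
# row positions where 'a' changes (Series.nonzero was removed in pandas 1.0, so go through NumPy)
idx = df['a'].diff().fillna(0).to_numpy().nonzero()[0]
dfs = [df.iloc[pos] for pos in np.split(np.arange(len(df)), idx)]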
dfs
Out[210]:
[ a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4, a b c d
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3]
dfs[0]
Out[211]:
a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4

Counting Preceding Entries of a column and creating a new variable of these counts

I have a data frame and I want to count the number of consecutive entries of one column and record the counts in a separate variable. Here is an example:
ID Class
1 A
1 A
2 A
1 B
1 B
1 B
2 B
1 C
1 C
2 A
2 A
2 A
Within each ID group, I want to count the consecutive occurrences of each class, so the output would look like this:
ID Class Counts
1 A 0
1 A 1
2 A 0
1 B 0
1 B 1
1 B 2
2 B 0
1 C 0
1 C 1
2 A 0
2 A 1
2 A 2
I am not looking for the frequency of occurrence of specific entries like here, but rather for the consecutive occurrences of an entry at the ID level.

You can use cumcount, grouping by a Series built from the cumsum of the positions where the concatenated values differ from their shifted values:
# use a separator that does not occur in the data, like _ or ¥
s = df['ID'].astype(str) + '¥' + df['Class']
df['Counts'] = df.groupby(s.ne(s.shift()).cumsum()).cumcount()
print (df)
ID Class Counts
0 1 A 0
1 1 A 1
2 2 A 0
3 1 B 0
4 1 B 1
5 1 B 2
6 2 B 0
7 1 C 0
8 1 C 1
9 2 A 0
10 2 A 1
11 2 A 2
Another solution with ngroup (pandas 0.20.2+):
s = df.groupby(['ID','Class']).ngroup()
df['Counts'] = df.groupby(s.ne(s.shift()).cumsum()).cumcount()
print (df)
ID Class Counts
0 1 A 0
1 1 A 1
2 2 A 0
3 1 B 0
4 1 B 1
5 1 B 2
6 2 B 0
7 1 C 0
8 1 C 1
9 2 A 0
10 2 A 1
11 2 A 2
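A variant sketch of the same run-counting idea that compares the two columns with their shifted copies directly, so no separator or group number is needed. This is an alternative formulation rather than the answer's code; `keys`, `changed` and `run_id` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2],
                   "Class": list("AAABBBBCCAAA")})

keys = df[["ID", "Class"]]

# A new run starts wherever ID or Class differs from the previous row.
changed = keys.ne(keys.shift()).any(axis=1)

# The running total of run starts gives every run its own label.
run_id = changed.cumsum()

# Position of each row inside its run, starting at 0.
df["Counts"] = df.groupby(run_id).cumcount()
print(df)
```

The first row always counts as a change because the shifted frame has no previous row to compare with.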
