I am currently working on a project where I have to measure someone's activity on a site over time, based on whether they edit the site. I have a DataFrame that looks similar to this:
import pandas as pd

df = pd.DataFrame({"x": ["a", "b", "c", "b", "b"],
                   "y": ["red", "blue", "green", "yellow", "red"],
                   "z": [1, 2, 3, 4, 5]})
I want to add a column to the dataframe that counts, for each row, how many times the value in column x (i.e. the number of edits) has occurred so far, using the "z" column as the measure of when the events happened.
E.g. to have an additional column of:
df["activity"] = pd.Series([1,1,1,2,3])
How would I best go about this in Python? I'm not sure what my best approach here is.
groupby and cumcount
df['activity'] = df.groupby('x').cumcount() + 1
df
x y z activity
0 a red 1 1
1 b blue 2 1
2 c green 3 1
3 b yellow 4 2
4 b red 5 3
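Note that cumcount counts rows in the order they appear, so if your rows are not already in chronological order you would probably want to sort by z first. A minimal sketch, assuming z is the time column from your question:
df = df.sort_values("z")                        # put events in time order
df["activity"] = df.groupby("x").cumcount() + 1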
I ultimately want to count the number of months in a given range for each user. For example, user 1 below has one range of data, from April 2021 to June 2021. Where I'm struggling is counting users that have multiple ranges (see users 3 and 4).
I have a pandas df with columns that look like this:
username  Jan_2021  Feb_2021  March_2021  April_2021  May_2021  June_2021  July_2021  Sum_of_Months
user 1    0         0         0           1           1         1          0          3
user 2    0         0         0           0           0         0          1          1
user 3    1         1         1           0           1         1          0          5
user 4    0         1         1           1           0         1          1          5
I'd like to get summary columns giving the number of groups and their lengths. By "num of groups" I mean the number of runs of consecutive 1s, and by "length of group" I mean the number of months in one such run, as if I drew a circle around each block of 1s. For example, user 1 has a length of 3 because there is a 1 in each of the columns April through June 2021:
username  Num_of_groups  Length_of_group
user 1    1              3
user 2    1              1
user 3    2              3,2
user 4    2              3,2
You can try the groupby function from itertools:
from itertools import groupby

# keep only the month columns
df1 = df[[col for col in df.columns if "2021" in col]]
# for each row, the length of every consecutive run of 1s
df["Length_of_group"] = df1.apply(lambda x: [sum(g) for i, g in groupby(x) if i == 1], axis=1)
df["Num_of_groups"] = df["Length_of_group"].apply(len)
Hope this helps.
This solution uses staircase, and works by treating each user's data as a step function (of 1s and 0s).
setup
import pandas as pd
df = pd.DataFrame(
{
"username":["user 1", "user 2", "user 3", "user 4"],
"Jan_2021":[0,0,1,0],
"Feb_2021":[0,0,1,0],
"Mar_2021":[0,0,1,1],
"April_2021":[1,0,0,1],
"May_2021":[1,0,1,0],
"June_2021":[1,0,1,1],
"July_2021":[0,1,1,0],
"Sum_of_Months":[3,1,5,5],
}
)
solution
import staircase as sc
# trim down to month columns only, and transpose to make users correspond to columns, and months correspond to rows
data = df[["Jan_2021", "Feb_2021", "Mar_2021", "April_2021", "May_2021", "June_2021", "July_2021"]].transpose().reset_index(drop=True)
def extract_groups(series):
return (
sc.Stairs.from_values(initial_value=0, values=series) # create step function for each user
.clip(0, len(series)) # clip step function to region of interest
.to_frame() # represent data as start/stop intervals in a dataframe
.query("value==1") # filter for groups of 1s
.eval("dist=end-start") # calculate the length of each "group"
["dist"].to_list() # convert the result from Series to list
)
sfs = data.columns.to_series().apply(lambda c: extract_groups(data[c]))
sfs is a pandas.Series whose values are lists holding the length of each group; the number of groups is just the length of the list. It looks like this:
0 [3]
1 [1]
2 [3, 3]
3 [2, 1]
dtype: object
You can use it to create the data you need, e.g.
df["Num_of_groups"] = sfs.apply(list.__len__)
adds the Num_of_groups column to your original dataframe
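Similarly, since sfs has one list per user in the same row order as df, something like the following should add the group lengths as well (a sketch, mirroring the Length_of_group column from the question):
df["Length_of_group"] = sfs.to_list()  # one list of run lengths per user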
Disclaimer: I am author of staircase
I am going to merge two datasets soon by 3 columns.
The hope is that there are few or no repeated 3-column combinations in the original dataset. I would like to produce something that shows approximately how unique each row is, like a frequency plot (which might not work since I have a very large dataset), or maybe a table that displays the average frequency for each 0.5 million rows, something like that.
Is there a way to determine how unique each row is compared to the other rows?
1 2 3
A 100 B
A 200 B
A 300 B
For the above data frame, I would like to be able to say that each row is unique.
1 2 3
A 200 B
A 200 B
A 100 B
For this data set, rows 1 and 2 are not unique. I don't want to drop one, but I am hoping to quantify/weight the number of non-unique rows.
The problem is my dataframe is 14,000,000 lines long, so I need to think of a way I can show how unique each row is on a set this big.
Assuming you are using pandas, here's one possible way:
import pandas as pd
# Setup, which you can probably skip since you already have the data.
cols = ["1", "2", "3"]
rows = [
["A", 200, "B",],
["A", 200, "B",],
["A", 100, "B",],
]
df1 = pd.DataFrame(rows, columns=cols)
# Save the key column names before adding any new columns.
key_columns = df1.columns.values.tolist()

# Add a helper column of ones.
df1["line"] = 1

# Set the new column to the cumulative sum of the helper column within each key group.
df1["match_count"] = df1.groupby(key_columns)["line"].cumsum()

# Drop the helper column.
df1.drop("line", axis=1, inplace=True)
Print the results:
print(df1)
Output -
1 2 3 match_count
0 A 200 B 1
1 A 200 B 2
2 A 100 B 1
Return only unique rows:
# We only want results where the count is less than 2,
# because we have our key columns saved, we can just return those
# and not worry about 'match_count'
df_unique = df1.loc[df1["match_count"] < 2, key_columns]
print(df_unique)
Output -
1 2 3
0 A 200 B
2 A 100 B
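If, beyond the per-row match_count, you also want an overall summary of how unique the key combinations are across the full 14-million-row frame (the frequency table you mentioned), a short sketch reusing key_columns from above might be enough:
# how many key combinations occur exactly once, twice, three times, ...
combo_sizes = df1.groupby(key_columns).size()
print(combo_sizes.value_counts().sort_index())

# flag rows whose key combination appears more than once, and the fraction of such rows
df1["is_repeated"] = df1.duplicated(subset=key_columns, keep=False)
print(df1["is_repeated"].mean())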
I have this Python pandas DataFrame DF:
import pandas as pd

DICT = {'letter': ['A','B','C','A','B','C','A','B','C'],
        'number': [1,1,1,2,2,2,3,3,3],
        'word'  : ['one','two','three','three','two','one','two','one','three']}
DF = pd.DataFrame(DICT)
Which looks like:
letter number word
0 A 1 one
1 B 1 two
2 C 1 three
3 A 2 three
4 B 2 two
5 C 2 one
6 A 3 two
7 B 3 one
8 C 3 three
And I want to extract the rows
letter number word
A 1 one
B 2 two
C 3 three
First I tried:
DF[(DF['letter'].isin(("A","B","C"))) &
DF['number'].isin((1,2,3)) &
DF['word'].isin(('one','two','three'))]
Of course it didn't work; everything was selected.
Then I tested:
import numpy as np

Bool = DF[['letter','number','word']].isin(("A",1,"one"))
DF[np.all(Bool, axis=1)]
Good, it works! But only for one combination of values...
If we take the next step and give an iterable to .isin():
Bool = DF[['letter','number','word']].isin((("A",1,"one"),
("B",2,"two"),
("C",3,"three")))
Then it fails: the Boolean array is full of False...
What am I doing wrong? Is there a more elegant way to do this selection based on several columns?
(I want to avoid a for loop, because the real DataFrames I'm using are really big, so I'm looking for the fastest way to do the job.)
One idea is to create a new DataFrame with all the triple values and then merge it with the original DataFrame:
L = [("A",1,"one"),
("B",2,"two"),
("C",3,"three")]
df1 = pd.DataFrame(L, columns=['letter','number','word'])
print (df1)
letter number word
0 A 1 one
1 B 2 two
2 C 3 three
df = DF.merge(df1)
print (df)
letter number word
0 A 1 one
1 B 2 two
2 C 3 three
Another idea is to create a list of tuples, convert it to a Series, and then compare with isin:
s = pd.Series(list(map(tuple, DF[['letter','number','word']].values.tolist())),index=DF.index)
df1 = DF[s.isin(L)]
print (df1)
letter number word
0 A 1 one
4 B 2 two
8 C 3 three
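A third variant (just a sketch, assuming a pandas version recent enough to have MultiIndex.from_frame) builds a MultiIndex from the three columns and passes the list of tuples to isin directly, which also keeps the original index:
mask = pd.MultiIndex.from_frame(DF[['letter','number','word']]).isin(L)
print(DF[mask])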
I have seen a variant of this question that keeps the top n rows of each group in a pandas dataframe, where the solutions use n as an absolute number rather than a percentage: Pandas get topmost n records within each group. However, in my dataframe each group has a different number of rows, and I want to keep the top n% of rows of each group. How would I approach this problem?
You can construct a Boolean series of flags with a groupby and use it to filter. First let's create an example dataframe and look at the number of rows for each unique value in the first series:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)

# keep a row if its 1-based position within its group is at most n times the group's size
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
1 2
0
0 1 1
0 1 1
0 1 0
1 1 1
1 1 0
As you can see, the resultant dataframe has only three rows with index 0 and two rows with index 1, in each case half the number in the original dataframe.
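As a small variation, if you would rather keep the original integer index and leave the first series as a regular column, you could replace the set_index/sort_index step with just the mask:
# instead of df.loc[flags].set_index(0).sort_index()
df_top = df.loc[flags]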
Here is another option which builds on some of the answers in the post you mentioned.
First of all, here is a quick function to either round up or round down. If we want the top 30% of rows of a dataframe 8 rows long, then we would try to take 2.4 rows, so we need to round up or down. My preferred option is to round up: if, for example, we were to take 50% of the rows but had one group with only one row, we would still keep that one row. I kept this separate so that you can change the rounding as you wish.
import math

def round_func(x, up=True):
    '''Round a float up or down to an integer.'''
    if up:
        return math.ceil(x)
    return int(x)
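For example:
round_func(2.4)            # -> 3
round_func(2.4, up=False)  # -> 2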
Next I make a dataframe to work with and set a parameter p to be the fraction of rows from each group that we should keep. The rest follows, and I have commented it so that hopefully you can follow along.
import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
p = 0.30 # top fraction of rows to keep from each group
df_top = df.groupby('id').apply( # group by the ids
lambda x: x.reset_index()['value'].nlargest( # in each group take the top rows by column 'value'
round_func(x.count().max()*p))) # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1) # make the dataframe nice again
df looked like this
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
df_top looks like this
id value
0 1 3
1 2 4
2 2 3
3 3 1
4 4 1
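The same idea can also be written by sorting first and then taking the head of each group, which some may find easier to read (a sketch using the same ceiling rounding):
import math

df_top = (df.sort_values('value', ascending=False)
            .groupby('id', group_keys=False)
            .apply(lambda g: g.head(math.ceil(len(g) * p))))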
I have a rather basic pandas question, but I've tried merge and join with no success.
Edit: these are in the same dataframe, which wasn't clear. We are indeed condensing the data.
print df
product_code_shipped quantity product_code
0 A12395 1 A12395
1 H53456 4 D78997
2 A13456 3 E78997
3 A12372 8 A13456
4 E28997 1 D83126
5 B78997 2 C64516
6 C78117 9 B78497
7 B78227 1 H53456
8 B78497 2 J12372
So I want to have just one product code column, holding the unique product codes along with their other data, such as quantity and, say, color. I just want the product codes of the shipped products (color is in another column). How do I do this within the same dataframe?
So I should get
print df2
product_code_shipped quantity product_code color
0 A12395 1 A12395 red
1 H53456 4 H53456 blue
2 B78497 2 B78497 yellow
I'm a little confused by your question, specifically where the "unique product codes" enter in... are we condensing the data? The example does not make that clear. Nonetheless, I'll give it a shot:
Many DataFrame methods rely on the indexes to automatically align data. In your case, it seems convenient to set the index of these DataFrames to the product code. So you'd have this:
In [132]: shipped
Out[132]:
quantity
product_code_shipped
A 1
B 4
C 2
In [133]: info
Out[133]:
color
product_code
A red
B blue
C yellow
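For completeness, here is one way those two indexed frames might be built from the single dataframe described in your edit. This is only a sketch: the color column is assumed, since it doesn't appear in your printout.
shipped = df.set_index('product_code_shipped')[['quantity']]
info = df.set_index('product_code')[['color']]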
Now, join requires no extra parameters; it gives you exactly what (I think) you want.
In [134]: info.join(shipped)
Out[134]:
color quantity
product_code
A red 1
B blue 4
C yellow 2
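If you want product_code back as an ordinary column, as in your desired df2, a reset_index on the joined result should do it:
result = info.join(shipped).reset_index()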
If this doesn't answer your question, please clarify it by giving example input including where color comes from and the exact output that would come from that input.