I ultimately want to count the number of months in a given range for each user. For example, see below: user 1 has one range of data, from April 2021-June 2021. Where I'm struggling is counting users that have multiple ranges (see users 3 & 4).
I have a pandas df with columns that look like this:
username  Jan_2021  Feb_2021  March_2021  April_2021  May_2021  June_2021  July_2021  Sum_of_Months
user 1    0         0         0           1           1         1          0          3
user 2    0         0         0           0           0         0          1          1
user 3    1         1         1           0           1         1          0          5
user 4    0         1         1           1           0         1          1          5
I'd like to be able to get a summary column that gives the number of groups and their lengths.
For example:
By num of groups I mean the number of runs of consecutive 1's, and by length of group I mean the number of months in one run, as if I were to draw a circle around the 1s. For example, user 1's length is 3 because there's a 1 in each of columns April-June 2021.
username  Num_of_groups  Length_of_group
user 1    1              3
user 2    1              1
user 3    2              3,2
user 4    2              3,2
You can try the groupby function from itertools:
from itertools import groupby

# keep only the month columns
df1 = df[[col for col in df.columns if "2021" in col]]
# for each row, sum each run of consecutive 1s
df["Length_of_group"] = df1.apply(lambda x: [sum(g) for i, g in groupby(x) if i == 1], axis=1)
df["Num_of_groups"] = df["Length_of_group"].apply(len)
Hope this helps.
This solution uses staircase, and works by treating each user's data as a step function (of 1s and 0s).
setup
import pandas as pd

df = pd.DataFrame(
    {
        "username": ["user 1", "user 2", "user 3", "user 4"],
        "Jan_2021": [0, 0, 1, 0],
        "Feb_2021": [0, 0, 1, 1],
        "Mar_2021": [0, 0, 1, 1],
        "April_2021": [1, 0, 0, 1],
        "May_2021": [1, 0, 1, 0],
        "June_2021": [1, 0, 1, 1],
        "July_2021": [0, 1, 0, 1],
        "Sum_of_Months": [3, 1, 5, 5],
    }
)
solution
import staircase as sc

# trim down to month columns only, and transpose so that users correspond
# to columns and months correspond to rows
data = df[["Jan_2021", "Feb_2021", "Mar_2021", "April_2021",
           "May_2021", "June_2021", "July_2021"]].transpose().reset_index(drop=True)

def extract_groups(series):
    return (
        sc.Stairs.from_values(initial_value=0, values=series)  # create step function for each user
        .clip(0, len(series))        # clip step function to region of interest
        .to_frame()                  # represent data as start/stop intervals in a dataframe
        .query("value == 1")         # filter for groups of 1s
        .eval("dist = end - start")  # calculate the length of each "group"
        ["dist"].to_list()           # convert the result from Series to list
    )

sfs = data.columns.to_series().apply(lambda c: extract_groups(data[c]))
sfs is a pandas.Series whose values are lists of the group lengths for each user (so the length of each list is the number of groups). It looks like this:
0       [3]
1       [1]
2    [3, 2]
3    [3, 2]
dtype: object
You can use it to create the data you need, e.g.
df["Num_of_groups"] = sfs.apply(list.__len__)
adds the Num_of_groups column to your original dataframe
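If you also want the Length_of_group column in the comma-separated form shown in the question, one possible sketch (assuming the interval lengths come back as numbers):

df["Length_of_group"] = sfs.apply(lambda lengths: ",".join(str(int(d)) for d in lengths))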
Disclaimer: I am author of staircase
I am currently working on a dataset which has information on total sales for each product id and product sub-category. For example, let us consider that there are three products 1, 2 and 3, and three product sub-categories A, B and C, one or more of which may make up each of the products 1, 2 and 3. For instance, I have included a sample table below:
Now, I would like to add a flag column 'Flag' which assigns 1 or 0 to each product id depending on whether that product id contains a record of product sub-category 'C'. If it does contain 'C', then assign 1 to the flag column; otherwise, assign 0. Below is the desired output.
I am currently not able to do this in pandas. Could you help me out? Thank you so much!
Use str.contains to find the IDs that have a record with sub-category 'C', then apply a lambda over the ID column to set the flag for every row of those IDs.
txt="""ID,Sub-category,Sales
1,A,100
1,B,101
1,C,102
2,B,100
2,C,101
3,A,102
3,B,100"""
df = pd.read_table(StringIO(txt), sep=',')
#print(df)
list_id=list(df[df['Sub-category'].str.contains('C')]['ID'])
df['flag']=df['ID'].apply(lambda x: 1 if x in list_id else 0 )
print(df)
output:
ID Sub-category Sales flag
0 1 A 100 1
1 1 B 101 1
2 1 C 102 1
3 2 B 100 1
4 2 C 101 1
5 3 A 102 0
6 3 B 100 0
Try this:
# collect the product ids that have a record with sub-category "C"
ids_with_c = set(dataFrame.loc[dataFrame["Product sub-category"] == "C", "Product Id"])

Flag = []
for i in dataFrame["Product Id"]:
    if i in ids_with_c:
        Flag.append(1)
    else:
        Flag.append(0)
So you have a list called "Flag" and can add it to your dataframe.
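For example:

dataFrame["Flag"] = Flag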
You can add a temporary column, isC, to check for your condition, then count the number of isC hits inside every "Product Id" group (with .groupby(...).transform):
check = (
    df.assign(isC=lambda df: df["Product Sub-category"] == "C")
      .groupby("Product Id").isC.transform("sum")
)
df["Flag"] = (check > 0).astype(int)
I would like to automate selecting values in one column, Step_ID.
Instead of defining which Step_IDs to filter (as in the code below), I would like to specify that the first Step_ID and the last Step_ID are to be excluded.
df = df.set_index(['Step_ID'])
df.loc[df.index.isin(['Step_2','Step_3','Step_4','Step_5','Step_6','Step_7','Step_8','Step_9','Step_10','Step_11','Step_12','Step_13','Step_14','Step_15','Step_16','Step_17','Step_18','Step_19','Step_20','Step_21','Step_22','Step_23','Step_24'])]
Is there any option to exclude the first and last value in the column? In this example, Step_1 and Step_25.
Or to include all values except the first and the last? In this example, Step_2-Step_24.
The reason is that my files have different numbers of Step_IDs, so I would like a solution that simplifies the filtering without redefining the list every time. The first and last value in the column 'Step_ID' must always be excluded, but the number of Step_IDs varies from file to file.
Given Step_1 - Step_X, I need Step_2 - Step_(X-1).
Use:
df = pd.DataFrame({
    'Step_ID': ['Step_1','Step_1','Step_2','Step_2','Step_3','Step_4','Step_5',
                'Step_6','Step_6'],
    'B': list(range(9))})
print (df)
Step_ID B
0 Step_1 0
1 Step_1 1
2 Step_2 2
3 Step_2 3
4 Step_3 4
5 Step_4 5
6 Step_5 6
7 Step_6 7
8 Step_6 8
Select all rows whose index value is not among the first and last index values, which are extracted by slicing with df.index[[0, -1]]:
df = df.set_index(['Step_ID'])
df = df.loc[~df.index.isin(df.index[[0, -1]].tolist())]
print (df)
B
Step_ID
Step_2 2
Step_2 3
Step_3 4
Step_4 5
Step_5 6
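If you prefer not to set the index first, the same idea works directly on the column (a sketch assuming the original, unindexed df):

first, last = df["Step_ID"].iloc[0], df["Step_ID"].iloc[-1]
df = df[~df["Step_ID"].isin([first, last])]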
I have seen a variant of this question that keeps the top n rows of each group in a pandas dataframe, where the solutions use n as an absolute number rather than a percentage: Pandas get topmost n records within each group. However, in my dataframe each group has a different number of rows, and I want to keep the top n% of rows of each group. How would I approach this problem?
You can construct a Boolean series of flags and filter before you groupby. First, let's create an example dataframe and look at the number of rows for each unique value in the first series:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
1 2
0
0 1 1
0 1 1
0 1 0
1 1 1
1 1 0
As you can see, the resultant dataframe only has three rows with index 0 and two with index 1, in each case half the number in the original dataframe.
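An equivalent sketch using groupby.apply with head, if you prefer a one-liner (note that int() rounds the per-group count down):

df_top = df.groupby(0, group_keys=False).apply(lambda g: g.head(int(len(g) * n)))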
Here is another option which builds on some of the answers in the post you mentioned.
First of all, here is a quick function to either round up or round down. If we want the top 30% of rows of a group 8 rows long, then we would try to take 2.4 rows, so we will need to either round up or round down.
My preferred option is to round up. This is because, for example, if we were to take 50% of the rows but had one group with only a single row, we would still keep that row. I kept the rounding separate so that you can change it as you wish.
import math

def round_func(x, up=True):
    '''Round a float up or down to an int'''
    if up:
        return math.ceil(x)  # ceil, so exact integers are not overshot
    else:
        return int(x)
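For example:

round_func(2.4)            # 3
round_func(2.4, up=False)  # 2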
Next I make a dataframe to work with and set a parameter p, the fraction of rows from each group that we should keep. Everything else follows, and I have commented the code so that hopefully you can follow it.
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,4], 'value': [1,2,3,1,2,3,4,1,1]})
p = 0.30  # top fraction of each group to keep

df_top = df.groupby('id').apply(                  # group by the ids
    lambda x: x.reset_index()['value'].nlargest(  # in each group take the top rows by column 'value'
        round_func(x.count().max() * p)))         # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1)  # make the dataframe nice again
df looked like this
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
df_top looks like this
id value
0 1 3
1 2 4
2 2 3
3 3 1
4 4 1
I have a large dataset with columns labelled 1 - 65 (among other titled columns), and I want to find how many of those columns, per row, have a string (of any value) in them. For example, if all of columns 1 - 65 are filled in a particular row, the count for that row should be 65; if only 10 are filled, the count should be 10.
Is there any easy way to do this? I'm currently using the following code, which is taking very long as there are a large number of rows.
array = pd.read_csv(csvlocation, encoding="ISO-8859-1")
for i in range(0, lengthofarray):
    for k in range(1, 66):
        if array[k][i] != "":
            array["count"][i] = array["count"][i] + 1
From my understanding of the post and the subsequent comments, you are interested in knowing the number of strings in each row for column labels 1 through 65. There are two steps: first, subset your data down to columns 1 through 65, then count the number of strings in each row. To do this:
import pandas as pd
import numpy as np
# create sample data
df = pd.DataFrame({'col1': list('abdecde'),
                   'col2': np.random.rand(7)})
# change one val of column two to string for illustration purposes
df.loc[3, 'col2'] = 'b'
# to create the subset of columns, you could use
# subset = [str(num) for num in list(range(1, 66))]
# and then just use df[subset]
# for each row, count the number of columns that have a string value.
# applymap operates elementwise, so we are essentially creating
# a new representation of your data in place, where a 1 means a
# string value was there and a 0 means it was not.
# we then sum along the rows to get the final counts
col_str_counts = np.sum(df.applymap(lambda x: 1 if isinstance(x, str) else 0), axis=1)
# we changed the column two value above, so to check that the count is 2 for that row idx:
col_str_counts[3]
>>> 2
# and for the subset, it would simply become:
# col_str_counts = np.sum(df[subset].applymap(lambda x: 1 if isinstance(x, str) else 0), axis=1)
You should be able to adapt your problem to this example.
Say we have this dataframe
df = pd.DataFrame([["","foo","bar"],["","","bar"],["","",""],["foo","bar","bar"]])
0 1 2
0 foo bar
1 bar
2
3 foo bar bar
Then we create a boolean mask where a cell != "" and sum those values
df['count'] = (df != "").sum(1)
print(df)
0 1 2 count
0 foo bar 2
1 bar 1
2 0
3 foo bar bar 3
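If your empty cells are NaN rather than empty strings, the same pattern works with notna():

df['count'] = df.notna().sum(axis=1)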
import pandas as pd

df = pd.DataFrame([["","foo","bar"],["","","bar"],["","",""],["foo","bar","bar"]])
total_cells = df.size
df['filled_cell_count'] = (df != "").sum(1)
print(df)
0 1 2 filled_cell_count
0 foo bar 2
1 bar 1
2 0
3 foo bar bar 3
fraction_filled = df['filled_cell_count'].sum() / total_cells
print()
print(f"Fraction of filled cells in dataframe: {fraction_filled}")
Fraction of filled cells in dataframe: 0.5
I am new to pandas. I'm trying to sort a column within each group. So far, I have been able to group the first and second column values together and calculate the mean of the third column, but I am still struggling to sort the third column.
This is my input dataframe
This is my dataframe after applying the groupby and mean functions
I used the following line of code to group the input dataframe:
df_o=df.groupby(by=['Organization Group','Department']).agg({'Total Compensation':np.mean})
Please let me know how to sort the last column for each group in the first column using pandas.
It seems you need sort_values:
# to return a DataFrame add parameter as_index=False
df_o = df.groupby(['Organization Group','Department'],
                  as_index=False)['Total Compensation'].mean()
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
Sample:
df = pd.DataFrame({'Organization Group':['a','b','a','a'],
                   'Department':['d','f','a','a'],
                   'Total Compensation':[1,8,9,1]})
print (df)
Department Organization Group Total Compensation
0 d a 1
1 f b 8
2 a a 9
3 a a 1
df_o = df.groupby(['Organization Group','Department'],
                  as_index=False)['Total Compensation'].mean()
print (df_o)
Organization Group Department Total Compensation
0 a a 5
1 a d 1
2 b f 8
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
print (df_o)
Organization Group Department Total Compensation
1 a d 1
0 a a 5
2 b f 8
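If you instead want to keep each Organization Group together and sort Total Compensation within it, put the group column first in the sort keys:

df_o = df_o.sort_values(['Organization Group', 'Total Compensation'])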