I have census data that looks like this for a full month, and I want to find out how many unique inmates there were over the month. The census is taken daily, so the same inmate appears on multiple days.
_id,Date,Gender,Race,Age at Booking,Current Age
1,2016-06-01,M,W,32,33
2,2016-06-01,M,B,25,27
3,2016-06-01,M,W,31,33
My method now is to group the records by day and then add the ones not yet accounted for into a new DataFrame. My question is how to account for two different people with the same info: wouldn't one of them be skipped, since a matching row already exists in the new DataFrame? I'm trying to figure out how many people in total were in the prison during this time.
_id is incremental; for example, here is some data from the second day:
2323,2016-06-02,M,B,20,21
2324,2016-06-02,M,B,44,45
2325,2016-06-02,M,B,22,22
2326,2016-06-02,M,B,38,39
link to the dataset here: https://data.wprdc.org/dataset/allegheny-county-jail-daily-census
You could use df.drop_duplicates(), which returns the DataFrame with only the unique rows, and then count the entries.
Something like this should work:
import pandas as pd

# use _id as the index and parse the Date column as datetimes
df = pd.read_csv('inmates_062016.csv', index_col=0, parse_dates=['Date'])
uniqueDF = df.drop_duplicates()
countUniques = len(uniqueDF.index)
print(countUniques)
Result:
>> 11845
Pandas drop_duplicates Documentation
Inmates June 2016 CSV
The problem with this approach / data is that many distinct inmates can share the same age / gender / race and would be filtered out.
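For instance, two distinct inmates with identical attributes on the same day collapse into one row (made-up rows, not from the dataset):
import pandas as pd

# two different inmates (_id 10 and 11) whose attributes happen to match
demo = pd.DataFrame(
    {'Date': ['2016-06-01', '2016-06-01'], 'Gender': ['M', 'M'],
     'Race': ['B', 'B'], 'Age at Booking': [25, 25], 'Current Age': [25, 25]},
    index=pd.Index([10, 11], name='_id'))
print(len(demo.drop_duplicates()))  # 1, so one real inmate is lost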
I think the trick here is to groupby as much as possible and check the differences in those (small) groups through the month:
inmates = pd.read_csv('inmates.csv')
# group by everything except _id and count number of entries
grouped = inmates.groupby(
    ['Gender', 'Race', 'Age at Booking', 'Current Age', 'Date']).count()
# pivot the dates out and transpose - this give us the number of each
# combination for each day
grouped = grouped.unstack().T.fillna(0)
# get the difference between each day of the month - the assumption here
# being that a negative number means someone left, 0 means that nothing
# has changed and positive means that someone new has come in. As you
# mentioned yourself, that isn't necessarily true
diffed = grouped.diff()
# replace the first day of the month with the grouped numbers to give
# the number in each group at the start of the month
diffed.iloc[0, :] = grouped.iloc[0, :]
# sum only the positive numbers in each row to count those that have
# arrived but ignore those that have left
diffed['total'] = diffed.apply(lambda x: x[x > 0].sum(), axis=1)
# sum total column
diffed['total'].sum() # 3393
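As a quick sanity check of the diff logic on made-up numbers (not from the dataset): a group with daily counts 2, 3, 1, 2 should yield 4 presumed arrivals (2 on day one, 1 on day two, 1 on day four).
import pandas as pd

counts = pd.Series([2, 3, 1, 2])  # daily count for one demographic group
diffed = counts.diff()
diffed.iloc[0] = counts.iloc[0]   # day one: everyone counts as an arrival
print(diffed[diffed > 0].sum())   # 2 + 1 + 1 = 4.0 presumed arrivals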
Related
I'm currently working on a dataset where I am using the pandas rolling function to create features.
The features rely on three columns: a numeric DaysLate column from which the mean is calculated, an InvoiceDate column from which the date is derived, and a customerID column which denotes the customer of a row.
I'm trying to get a rolling mean of DaysLate over the last 30 days, limited to invoices raised to a specific customerID.
The following two snippets work.
Mean of DaysLate for the last five invoices raised for the row's customer
df["CustomerDaysLate_lastfiveinvoices"] = df.groupby("customerID").rolling(window = 5,min_periods = 1).\
DaysLate.mean().reset_index().set_index("level_1").\
sort_index()["DaysLate"]
Mean of DaysLate for all invoices raised in the last 30 days
df = df.sort_values('InvoiceDate')
df["GlobalDaysLate_30days"] = df.rolling(window = '30d', on = "InvoiceDate").DaysLate.mean()
I just can't seem to find the code to get the mean of the last 30 days by customerID. Any help with the above is greatly appreciated.
Set the date column as the index, sort it to ensure ascending order, then group the sorted dataframe by customerID and calculate the 30-day rolling mean for each group:
mean_30d = (
    df
    .set_index('InvoiceDate')  # !important
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()
    .reset_index(name='GlobalDaysLate_30days')
)
# merge the rolling mean back to original dataframe
result = df.merge(mean_30d)
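A hedged end-to-end sketch on made-up data (three invoices, two customers); passing explicit keys to merge avoids relying on it guessing the shared columns:
import pandas as pd

df = pd.DataFrame({
    'customerID': ['A', 'A', 'B'],
    'InvoiceDate': pd.to_datetime(['2023-01-01', '2023-01-20', '2023-02-05']),
    'DaysLate': [2, 4, 1]})
mean_30d = (
    df
    .set_index('InvoiceDate')
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()
    .reset_index(name='GlobalDaysLate_30days')
)
# customer A's second invoice averages both of A's invoices: (2 + 4) / 2 = 3.0
result = df.merge(mean_30d, on=['customerID', 'InvoiceDate'])
print(result)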
I am looking to calculate the unique employee ID count over the last 3 months using pandas. I can calculate the unique employee ID count for the current month, but I am not sure how to do it for the last 3 months.
df['DateM'] = df['Date'].dt.to_period('M')
df.groupby("DateM")["EmpId"].nunique().reset_index().rename(columns={"EmpId":"One Month Unique EMP count"}).sort_values("DateM",ascending=False).reset_index(drop=True)
testdata.xlsx Google Drive link:
https://docs.google.com/spreadsheets/d/1Kaguf72YKIsY7rjYfctHop_OLIgOvIaS/edit?usp=sharing&ouid=117123134308310688832&rtpof=true&sd=true
After using the groupby command above, I get output grouped by month based on the DateM column, which is correct.
Similarly, I'm looking for another column where the unique active-user count based on EmpId is calculated over 3 months.
Sample output:
I tried calculating the same thing using a rolling window, but it didn't help. I also tried creating a period for the last 3 months, and I searched before asking this question. Thanks for your help in advance; otherwise I'll have to calculate it manually.
I don't know if you are looking for 3 consecutive months or something else, because your dates jump from 2022-09 to 2022-10.
I also don't know your purpose, so I give a general solution here: the list of unique empid values for every 3 consecutive months. (If you only want to count uniques for every 3 consecutive months, it is much easier.) Note that this means for 2022-08, the 3 consecutive months counted are 2022-08, 2022-09, and 2022-10, and so on.
# Sort data:
df.sort_values(by='datem', inplace=True, ignore_index=True)
# Create `dfu` which is `df` with unique `empid` for each `datem` only:
dfu = df.groupby(['datem', 'empid']).count().reset_index()
dfu.rename(columns={'date':'count'}, inplace=True)
dfu.sort_values(by=['datem', 'empid'], inplace=True, ignore_index=True)
dfu
# Obtain the list of unique periods:
unique_period = dfu['datem'].unique()
# Create empty dataframe:
dfe = pd.DataFrame(columns=['datem', 'empid', 'start_period'])
for p in unique_period:
    # Create the 3-consecutive-month range:
    tem_range = pd.period_range(start=p, freq='M', periods=3)
    # Extract the rows of `dfu` whose period falls in the wanted range:
    tem_dfu = dfu.loc[dfu['datem'].isin(tem_range), :].copy()
    # Some cleaning (note the assignment, otherwise the dedup result is lost):
    tem_dfu = tem_dfu.drop_duplicates(subset='empid', keep='first')
    tem_dfu.drop(columns='count', inplace=True)
    tem_dfu['start_period'] = p
    # Concat to obtain the desired output:
    dfe = pd.concat([dfe, tem_dfu])
dfe
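If only the count (not the list of IDs) matters, a shorter sketch along the same lines, assuming datem holds monthly periods as above:
# count unique empid over each 3-consecutive-month window
three_month_counts = {
    p: df.loc[df['datem'].isin(pd.period_range(start=p, freq='M', periods=3)),
              'empid'].nunique()
    for p in df['datem'].unique()
}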
Hope this is what you are looking for.
I have a dataset in pandas, with four columns (year, month, day, register).
df_registration
[image: data table]
I want to group the data by year and month, and then count for each month in a year how many 'yes' and how many 'no' there were (or at least how many 'yes').
Image of df with expected result:
[image: desired outcome]
I have tried doing groupby + count:
g = df_registration.groupby(["year", "month"])
monthly_counts = g.aggregate({"register": pd.Series.value_counts })
but the output does not give the desired outcome; it just counts the number of both register values together.
Image of df with failed attempt:
[image: wrong output]
I cannot get it to work as I want...
EDIT: solution
The code from alex smolyakov in the comment works!
counts = df_registration.groupby(["year", "month"])["register"].value_counts()
[image: the code output]
counts = df_registration.groupby(["year", "month"])["register"].value_counts()
Does it help?
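If you want 'yes' and 'no' as separate columns per year/month (as in the desired-outcome image), unstacking the value counts is one option; a sketch based on the same df_registration:
counts = (df_registration.groupby(["year", "month"])["register"]
          .value_counts().unstack(fill_value=0))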
I have a database that includes monthly time series data on around 15 different indicators. The data is all in the same format: year-to-date values and year-to-date growth. January data is missing; each indicator starts with the year-to-date total as of February.
For each indicator I want to turn the year-to-date data into monthly values. The code below does that.
But I want to be able to run this as a loop over all 15 indicators, and then automatically rename each resulting dataframe to include a reference to the category it belongs to. For example, one category of data is sales in value terms, so when I apply the code to that category I want the output df_m to be renamed sales_m, and df_yoy to be renamed sales_yoy.
I thought I could do this by defining a list of the 15 indicators to start with, and then somehow assigning that list to the dataframes produced by the loop, but I can't make that work.
category = ['sales', 'construction']
# split the monthly columns from the year-to-date columns
df_m = df.loc[:, df.columns.str.contains('Monthly')]
df_ytd = df.drop(df.filter(regex='Monthly').columns, axis=1)
# backfill the missing January YTD from February, then halve Jan/Feb
df_ytd = df_ytd.fillna(method='bfill', limit=1)
df_ytd.loc[df_ytd.index.month.isin([1, 2]), :] = df_ytd / 2
# align the column names and fill monthly gaps from the YTD data
df_ytd.columns = df_ytd.columns.str.replace(', YTD', '')
df_m.columns = df_m.columns.str.replace('YTD, ', '').str.replace(', Monthly', '')
df_m = df_m.fillna(df_ytd)
df_yoy = df_m.pct_change(periods=12) * 100
sales_m = df_m
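One common way to get the sales_m / sales_yoy effect without creating variable names dynamically is to keep the outputs in a dict keyed by category. A sketch, assuming each category's columns can be picked out by a name match (the selector below is hypothetical):
def ytd_to_monthly(sub):
    # the same YTD-to-monthly steps as above, applied to one category
    sub_m = sub.loc[:, sub.columns.str.contains('Monthly')]
    sub_ytd = sub.drop(sub.filter(regex='Monthly').columns, axis=1)
    sub_ytd = sub_ytd.fillna(method='bfill', limit=1)
    sub_ytd.loc[sub_ytd.index.month.isin([1, 2]), :] = sub_ytd / 2
    sub_ytd.columns = sub_ytd.columns.str.replace(', YTD', '')
    sub_m.columns = sub_m.columns.str.replace('YTD, ', '').str.replace(', Monthly', '')
    sub_m = sub_m.fillna(sub_ytd)
    return sub_m, sub_m.pct_change(periods=12) * 100

results = {}
for cat in category:
    # hypothetical selector: assumes each column name mentions its category
    sub = df.loc[:, df.columns.str.contains(cat, case=False)]
    results[f'{cat}_m'], results[f'{cat}_yoy'] = ytd_to_monthly(sub)
# results['sales_m'] then plays the role of a variable named sales_m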
I have a dataframe with lots of measurements of temperature values. I want to count the number of measurements in every day of the month. So far, I managed to display the number of measurements, and also to create a new dataframe, containing the unique values of the days.
How can I add the number of measurements to the new dataframe (the one containing all the unique values of days), in a new column?
So far, I have managed this function, which counts the number of measurements in the given day:
def measurements_in_a_day(day, month, year):
    # build the date string, e.g. '1/6/2020'
    full_date = day.format(), '/', month.format(), '/', year.format()
    full_date = ''.join(full_date)
    # flag the rows whose 'day' column matches that date
    seriesObj = data.apply(lambda x: True if x['day'] == full_date else False, axis=1)
    no_of_rows = len(seriesObj[seriesObj == True].index)
    print('Number of rows in dataframe in which date is ', full_date, ' are ', no_of_rows)
The thing is that I have to call this function 3 different times, because the csv file doesn't save the same format for dates. How can I add the count of measurements as a new column in the dataframe created for unique month days?
Did you try using pandas groupby?
Something like data.groupby('day').count() should give you what you want.
df1 = df.groupby('day')['time'].count().reset_index()
df1 = df1.rename(columns={'time': 'count'})
In one line:
df1 = df.groupby('day')['time'].count().reset_index().rename(columns={'time': 'count'})
If you prefer having the days as the index, you can do the following:
df1 = df.groupby('day')['time'].count().rename('count')
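To put the counts into the dataframe of unique days as a new column (the original goal), a left merge is one option; `unique_days` here is a stand-in name for that dataframe:
# hypothetical: `unique_days` has one row per day in a 'day' column
unique_days = unique_days.merge(df1, on='day', how='left')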