I am fairly new to working with pandas. I have a dataframe with individual entries like this:
dfImport:

   id date_created date_closed
0   0   01-07-2020         NaN
1   1   02-09-2020  10-09-2020
2   2   07-03-2019  02-09-2020
I would like to aggregate it so that I get the total number of created and closed objects (a count of ids), grouped by year, quarter and month, like this:
dfInOut:

Year  Qrt  Month      number_created  number_closed
2019    1  March                   1              0
2020    3  July                    1              0
           September               1              2
I guess I'd have to use some combination of crosstab or groupby, but I have tried out a lot of ideas and already done research on the problem, and I can't seem to figure out a way. I guess it's an issue of understanding. Thanks in advance!
Use DataFrame.melt with crosstab:
import pandas as pd

df['date_created'] = pd.to_datetime(df['date_created'], dayfirst=True)
df['date_closed'] = pd.to_datetime(df['date_closed'], dayfirst=True)

# reshape both date columns into one 'value' column and drop missing dates
df1 = df.melt(value_vars=['date_created', 'date_closed']).dropna()

# count occurrences per (Year, Qrt, Month), one column per original date column
df = (pd.crosstab([df1['value'].dt.year.rename('Year'),
                   df1['value'].dt.quarter.rename('Qrt'),
                   df1['value'].dt.month.rename('Month')], df1['variable'])
        [['date_created', 'date_closed']])
print(df)
variable        date_created  date_closed
Year Qrt Month
2019 1   3                 1            0
2020 3   7                 1            0
         9                 1            2
df = df.rename_axis(None, axis=1).reset_index()
print(df)
Year Qrt Month date_created date_closed
0 2019 1 3 1 0
1 2020 3 7 1 0
2 2020 3 9 1 2
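The desired dfInOut shows month names rather than month numbers. As a small sketch on top of the answer above (not part of the original), the numeric Month column can be mapped to names with the standard library's calendar module after the reset_index, which keeps the chronological order produced by the numeric grouping:
import calendar

# Map month numbers (1-12) to names after the crosstab, so rows keep their
# chronological rather than alphabetical order.
df['Month'] = df['Month'].map(lambda m: calendar.month_name[m])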
Let's say I have the dataset:
import pandas as pd

df1 = pd.DataFrame()
df1['number'] = [0, 0, 0, 0, 0]
df1["decade"] = ["1970", "1980", "1990", "2000", "2010"]
print(df1)
#output:
number decade
0 0 1970
1 0 1980
2 0 1990
3 0 2000
4 0 2010
and I want to merge it with another dataset:
df2 = pd.DataFrame()
df2['number'] = [1,1]
df2["decade"] = ["1990", "2010"]
print(df2)
#output:
number decade
0 1 1990
1 1 2010
such that it gets values only for the decades from df2 that have values in them and leaves the others untouched, yielding:
number decade
0 0 1970
1 0 1980
2 1 1990
3 0 2000
4 1 2010
How must one go about doing that in pandas? I've tried things like join, merge, and concat, but they all seem to either not give the desired result or not work because of the different dimensions of the two datasets. Any suggestions regarding which function I should be looking at?
Thank you so much!
You can use pandas.DataFrame.merge with pandas.DataFrame.fillna:
out = (
    df1[["decade"]]
    .merge(df2, on="decade", how="left")
    .fillna({"number": df1["number"]}, downcast="infer")
)
# Output :
print(out)
decade number
0 1970 0
1 1980 0
2 1990 1
3 2000 0
4 2010 1
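An alternative sketch (not from the original answer) is DataFrame.update: align both frames on decade and overwrite df1's values only where df2 has a non-NaN entry. Note that update can upcast integer columns to float, hence the cast back:
# Align on 'decade'; update() only overwrites where df2 provides a value.
out = df1.set_index("decade")
out.update(df2.set_index("decade"))
out = out.astype({"number": int}).reset_index()  # update may upcast to float
print(out)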
What about using apply?
First, create a validation function:
def validation(previous, latest):
    if pd.isna(latest):
        return previous
    else:
        return latest
Then you can use DataFrame.apply to compare the data in df1 to df2. For each row, df2.loc[df2['decade'] == row.decade].number.max() returns NaN when df2 has no matching decade (the max of an empty selection), so validation keeps the previous value in that case:
df1['number'] = df1.apply(
    lambda row: validation(row['number'],
                           df2.loc[df2['decade'] == row.decade].number.max()),
    axis=1)
Your result:
number decade
0 0 1970
1 0 1980
2 1 1990
3 0 2000
4 1 2010
I am trying to add a column to index duplicate rows and order by another column.
Here's the example dataset:
df = pd.DataFrame({'Name': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Score': [9, 10, 10, 8, 7, 8, 8],
                   'Year': [2019, 2018, 2017, 2019, 2018, 2017, 2016]})
I want to use ['Name', 'Score'] for identifying duplicates, then index the duplicates ordered by Year to get the following result. Here, rows 2 and 3 are duplicate rows because they have the same name and score, so I order them by year and assign an index. Does anyone have a good idea how to do this in Python? Thank you so much!
You are looking for cumcount:
df['Index'] = (df.sort_values('Year', ascending=False)
                 .groupby(['Name', 'Score'])
                 .cumcount() + 1)
Output:
Name Score Year Index
0 A 9 2019 1
1 A 10 2018 1
2 A 10 2017 2
3 B 8 2019 1
4 B 7 2018 1
5 B 8 2017 2
6 B 8 2016 3
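Note that sort_values only controls the order in which cumcount numbers the rows; the result is assigned back to df by index alignment, so the original row order is preserved. A quick sanity check against the output above:
# B/8/2016 is the oldest of three (Name, Score) duplicates, so it gets 3.
assert df.loc[6, 'Index'] == 3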
I have a dataframe that looks this:
import pandas as pd
date = ['28-01-2017','29-01-2017','30-01-2017','31-01-2017','01-02-2017','02-02-2017','...']
sales = [1,2,3,4,1,2,'...']
days_left_in_m = [3,2,1,0,29,28,'...']
df_test = pd.DataFrame({'date': date,'days_left_in_m':days_left_in_m,'sales':sales})
df_test
I am trying to find sales for the rest of the month.
So, for 28th of Jan 2017 it will calculate sum of the next 3 days,
for 29th of Jan - sum of the next 2 days and so on...
The outcome should look like the "required" column below.
date days_left_in_m sales required
0 28-01-2017 3 1 10
1 29-01-2017 2 2 9
2 30-01-2017 1 3 7
3 31-01-2017 0 4 4
4 01-02-2017 29 1 3
5 02-02-2017 28 2 2
6 ... ... ... ...
My current solution is really ugly - I use a non-Pythonic loop:
for i in range(lenght_of_t_series):
    days_left = data_in.loc[i].days_left_in_m
    if days_left == 0:
        sales_temp_list.append(0)
    else:
        if (i + days_left) <= lenght_of_t_series:
            sales_temp_list.append(sum(data_in.loc[(i+1):(i+days_left)].sales))
        else:
            sales_temp_list.append(np.nan)
I guess a much better way of doing this would be to use df['sales'].rolling(n).sum()
However, each row has a different window.
Please advise on the best way of doing this...
I think you need DataFrame.sort_values with GroupBy.cumsum. If you do not want to take the current day into account, you can use groupby.shift (see the commented code). First, convert the date column to datetime in order to use Series.dt.month:
df_test['date'] = pd.to_datetime(df_test['date'], format='%d-%m-%Y')
Then we can use:
months = df_test['date'].dt.month
df_test['required'] = (df_test.sort_values('date', ascending=False)
                              .groupby(months)['sales'].cumsum()
                              #.groupby(months).shift(fill_value=0)
                       )
print(df_test)
Output
date days_left_in_m sales required
0 2017-01-28 3 1 10
1 2017-01-29 2 2 9
2 2017-01-30 1 3 7
3 2017-01-31 0 4 4
4 2017-02-01 29 1 3
5 2017-02-02 28 2 2
If you don't want to convert the date column to datetime, use:
months = pd.to_datetime(df_test['date'], format='%d-%m-%Y').dt.month
df_test['required'] = (df_test.sort_values('date', ascending=False)
                              .groupby(months)['sales'].cumsum()
                              #.groupby(months).shift(fill_value=0)
                       )
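A variant sketch (not part of the original answer) gets the same running total without sort_values by reversing each month's sales inside groupby.transform; this assumes the date column has been converted to datetime as above and that rows are already in chronological order within each month, as in the example data:
# Reverse each month's sales, take the cumulative sum, reverse back;
# transform realigns the result on the original index.
months = df_test['date'].dt.month
df_test['required'] = (df_test.groupby(months)['sales']
                              .transform(lambda s: s[::-1].cumsum()[::-1]))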
Currently I have the following Python code:
import pandas as pd

forumposts = pd.DataFrame({'UserId': [1, 1, 2, 3, 2, 1, 3],
                           'FirstPostDate': [2018, 2018, 2017, 2019, 2017, 2018, 2019],
                           'PostDate': [201801, 201802, 201701, 201901, 201801, 201803, 201902]})
data = forumposts.groupby(['UserId', 'PostDate', 'FirstPostDate']).size().reset_index()
rankedUserIdByFirstPostDate = (data.groupby(['UserId', 'FirstPostDate'])
                                   .size()
                                   .reset_index()
                                   .sort_values('FirstPostDate')
                                   .reset_index(drop=True)
                                   .reset_index())
data.loc[:, 'Rank'] = data.merge(rankedUserIdByFirstPostDate, how='left', on='UserId')['index'].values
The code works as intended, but it's complicated. Is there a more pandas-like way of doing this? The intent is the following:
Create a dense rank over the UserId column sorted by the FirstPostDate such that the user with the earliest posting gets rank 0 and the user with the second earliest first post gets rank 1 and so on.
Using forumposts.UserId.rank(method='dense') gives me a ranking, but it's sorted by the order of the UserId.
Use Series.map with a dictionary built from sort_values plus drop_duplicates for the ordering, zipped with np.arange:
import numpy as np

data = (forumposts.groupby(['UserId', 'PostDate', 'FirstPostDate'])
                  .size()
                  .reset_index(name='count'))

# first occurrence of each user, ordered by earliest first post
users = data.sort_values('FirstPostDate').drop_duplicates('UserId')['UserId']
d = dict(zip(users, np.arange(len(users))))
data['Rank'] = data['UserId'].map(d)
print(data)
UserId PostDate FirstPostDate count Rank
0 1 201801 2018 1 1
1 1 201802 2018 1 1
2 1 201803 2018 1 1
3 2 201701 2017 1 0
4 2 201801 2017 1 0
5 3 201901 2019 1 2
6 3 201902 2019 1 2
Another solution:
data['Rank'] = (data.groupby('UserId')['FirstPostDate']
                    .transform('min')
                    .rank(method='dense')
                    .sub(1)
                    .astype(int))
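Since FirstPostDate is already constant within each UserId in this data, a shorter sketch (assuming that invariant holds) can skip the groupby and rank the column directly:
# Users sharing the earliest first-post date share rank 0, and so on.
data['Rank'] = data['FirstPostDate'].rank(method='dense').sub(1).astype(int)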
I have a df with columns date, employee and event. 'event' has the value 1, 3 or 5 if someone exits, or 0, 2 or 4 if someone enters. 'employee' is a private number for each employee. This is the head of df:
employee event registration date
0 4 1 1 2010-10-18 18:11:00
1 17 1 1 2010-10-18 18:15:00
2 6 0 1 2010-10-19 06:28:00
3 8 0 0 2010-10-19 07:04:00
4 15 0 1 2010-10-19 07:34:00
I filtered df so that I only have values from one month (year and month are my variables):
df = df.where(df['date'].dt.year == year).dropna()
df = df.where(df['date'].dt.month == month).dropna()
I want to create a df which shows me the sum of time at work for each employee. Employees come in and go out on the same day, and they can do it a few times in each day.
It seems you need boolean indexing with groupby, where the difference is computed with diff and then summed:
year = 2010
month = 10
df = df[(df['date'].dt.year == year) & (df['date'].dt.month== month)]
A more general solution is to add year and month to the groupby:
df = (df['date'].groupby([df['employee'],
                          df['event'],
                          df['date'].rename('year').dt.year,
                          df['date'].rename('month').dt.month])
                .apply(lambda x: x.diff().sum()))
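As a fuller sketch of the actual computation (an assumption on my part, not from the original answer): if the log alternates entry/exit per employee within a day, you can sort by time, take the gap to the previous row per employee and day, and keep only the gaps that end in an exit:
# Even event codes (0, 2, 4) are entries; odd ones (1, 3, 5) are exits.
df = df.sort_values(['employee', 'date'])
is_exit = df['event'].mod(2).eq(1)

# time elapsed since the previous event for the same employee on the same day
delta = df.groupby(['employee', df['date'].dt.date])['date'].diff()

# keep only entry -> exit spans and total them per employee
worked = delta.where(is_exit)
total_time = worked.groupby(df['employee']).sum()
print(total_time)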