Having an issue with a pandas DataFrame: I am trying to get the "Count" column based on the date. The code should search for each "date_range" value within the "Dates" column, and if it is present, the "Count" should be copied into the "Posts" column for the corresponding date.
E.g.: date_range value = 16/12/2016 - the code searches for 16/12/2016 in the "Dates" column and makes "Posts" equal to the "Count" value of that date; if the date_range value does not appear, Posts should = 0.
Data Example:
Dates Count date_range Posts
0 07/02/2017 1 16/12/2016 (should = 5)
1 01/03/2017 1 17/12/2016
2 15/02/2017 1 18/12/2016
3 23/01/2017 1 19/12/2016
4 28/02/2017 1 20/12/2016
5 09/02/2017 2 21/12/2016
6 20/03/2017 2 22/12/2016
7 16/12/2016 5
My code looks like this:
DateList = df['Dates'].tolist()
for date in df['date_range']:
    if str(date) in DateList:
        df['Posts'] = df['Count']
    else:
        df['Posts'] = 0
However, this maps the wrong values to "Posts".
Hopefully I explained this correctly! Thanks in advance for the help!
You can first create a dict of the matching values and then map the date_range column:
print (df)
Dates Count date_range
0 07/02/2017 1 16/12/2016
1 01/03/2017 1 17/12/2016
2 15/02/2017 1 18/12/2016
3 23/01/2017 1 19/12/2016
4 28/02/2017 1 07/02/2017 <- value changed to create a match
5 09/02/2017 2 21/12/2016
6 20/03/2017 2 22/12/2016
7 16/12/2016 5 22/12/2016
d = df[df['Dates'].isin(df.date_range)].set_index('Dates')['Count'].to_dict()
print (d)
{'16/12/2016': 5, '07/02/2017': 1}
df['Posts'] = df['date_range'].map(d).fillna(0).astype(int)
print (df)
Dates Count date_range Posts
0 07/02/2017 1 16/12/2016 5
1 01/03/2017 1 17/12/2016 0
2 15/02/2017 1 18/12/2016 0
3 23/01/2017 1 19/12/2016 0
4 28/02/2017 1 07/02/2017 1
5 09/02/2017 2 21/12/2016 0
6 20/03/2017 2 22/12/2016 0
7 16/12/2016 5 22/12/2016 0
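As a side note, not part of the original answer: the intermediate dict can be skipped by mapping directly with a Series built from the two columns, assuming the values in Dates are unique:
# Build a Dates -> Count lookup Series and map date_range against it;
# unmatched dates become NaN, which fillna(0) turns into 0.
df['Posts'] = df['date_range'].map(df.set_index('Dates')['Count']).fillna(0).astype(int)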
I have a dataframe and want to create a new column based on other rows of the dataframe. My dataframe looks like this:
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
Now I want to check: if the freq of a row is zero, is there another row with the same ProjektID, Jahr and Week where the freq is not 0? If so, I want a new column "other" whose value is 1, and 0 otherwise.
So, the output should be
MitarbeiterID ProjektID Jahr Monat Week mean freq last other
0 583 83224 2020 1 2 3.875 4 0 0
1 373 17364 2020 1 3 5.00 0 4 1
2 923 19234 2020 1 4 5.00 3 3 0
3 643 17364 2020 1 3 4.00 2 2 0
This time I have no approach, can anyone help?
Thanks!
The following solution tests if the required conditions are True.
import io
import pandas as pd
Data
df = pd.read_csv(io.StringIO("""
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
"""), sep="\s\s+", engine="python")
Make a column other with all values zero.
df['other'] = 0
If ProjektID, Jahr and Week are duplicated and any of the freq values is larger than zero, then the duplicated rows (keep=False also captures the first occurrence of each duplicate) where freq is zero get other set to 1. Change any() to all() if you need all values to be larger than zero.
if (df.loc[df[['ProjektID', 'Jahr', 'Week']].duplicated(), 'freq'] > 0).any():
    df.loc[(df[['ProjektID', 'Jahr', 'Week']].duplicated(keep=False)) & (df['freq'] == 0), 'other'] = 1
else:
    print("Other stays zero")
Output:
MitarbeiterID ProjektID Jahr Monat Week mean freq last other
0 583 83224 2020 1 2 3.875 4 0 0
1 373 17364 2020 1 3 5.00 0 4 1
2 923 19234 2020 1 4 5.00 3 3 0
3 643 17364 2020 1 3 4.00 2 2 0
I think the best way to solve this is not to use pandas too much :-) Converting things to sets and tuples should make it fast enough.
The idea is to build a set of all the triples (ProjektID, Jahr, Week) that appear in the dataset with freq != 0, and then check, for every line with freq == 0, whether its triple belongs to this set. In code, I'm creating a dummy dataset with:
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randint(0, 2, (8, 4)), columns=['id', 'year', 'week', 'freq'])
which in my case randomly gave:
>>> x
id year week freq
0 1 0 0 0
1 0 0 0 1
2 0 1 0 1
3 0 0 1 0
4 0 1 0 0
5 1 0 0 1
6 0 0 1 1
7 0 1 1 0
Now, we want triplets only where freq != 0, so we use
x1 = x.loc[x['freq'] != 0]
triplets = {tuple(row) for row in x1[['id', 'year', 'week']].values}
Note that I'm using x1.values, which is not a pandas DataFrame but rather a numpy array, so each row in there can be converted to a tuple. This is necessary because DataFrame rows, numpy arrays and lists are mutable objects that cannot be put in a set otherwise. Using a set instead of e.g. a list (which doesn't have this restriction) makes the membership test below fast.
Next, we define a boolean variable which is True if a triplet (id, year, week) belongs to the above set:
belongs = x[['id', 'year', 'week']].apply(lambda row: tuple(row) in triplets, axis=1)
We are basically done; this is almost the column you want, except we also need to require freq == 0:
x['other'] = np.logical_and(belongs, x['freq'] == 0).astype(int)
(the final .astype(int) is to get the values 0 and 1, as you asked, instead of False and True). Final result in my case:
>>> x
id year week freq other
0 1 0 0 0 1
1 0 0 0 1 0
2 0 1 0 1 0
3 0 0 1 0 1
4 0 1 0 0 1
5 1 0 0 1 0
6 0 0 1 1 0
7 0 1 1 0 0
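For reference, and not from the original answers: the same logic can also stay inside pandas with a single groupby/transform, assuming freq is non-negative:
# For each (id, year, week) group, check whether any row has freq != 0,
# then flag the rows whose own freq is 0 inside such a group.
has_nonzero = x.groupby(['id', 'year', 'week'])['freq'].transform('max') > 0
x['other'] = (has_nonzero & (x['freq'] == 0)).astype(int)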
Looks like I am too late ...:
# Use (ProjektID, Jahr, Week) as the index so rows with freq == 0 can be
# looked up against the index of the rows with freq != 0.
df.set_index(['ProjektID', 'Jahr', 'Week'], drop=True, inplace=True)
df['other'] = 0
# Where freq == 0, set other to whether the same index triple also appears
# among the rows with freq != 0.
df.other.mask(df.freq == 0,
              df.freq[df.freq == 0].index.isin(df.freq[df.freq != 0].index),
              inplace=True)
df.other = df.other.astype('int')
df.reset_index(drop=False, inplace=True)
I have a dataset that looks like this:
ID date
1 01-01-2012
1 05-02-2012
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 19-05-2012
2 07-08-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 15-04-2013
3 17-05-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I am working with Python and I would like to select the 3 last dates for each ID. Here is the dataset I would like to have:
ID date
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I used this code to select the very last date for each ID:
df_2=df.sort_values(by=['date']).drop_duplicates(subset='ID',keep='last')
But how can I select more than one date (for example the 3 last dates, or 4 last dates, etc)?
You might use groupby and tail the following way to get the last 2 items from each group:
import pandas as pd
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3],'value':['A','B','C','D','E','F','G','H','I']})
df2 = df.groupby('ID').tail(2)
print(df2)
Output:
ID value
1 1 B
2 1 C
4 2 E
5 2 F
7 3 H
8 3 I
Note that for simplicity's sake I used other (already sorted) data for building df.
You can try this:
df.sort_values(by=['date']).groupby('ID').tail(3).sort_values(['ID', 'date'])
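One caveat, not from the original answer: with dd-mm-yyyy strings, sort_values compares lexically (day first), which can misorder dates. A sketch that parses the column first, assuming the format is consistent:
# Parse the strings so sorting is chronological, then keep the last 3 rows per ID.
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')
df.sort_values(by=['date']).groupby('ID').tail(3).sort_values(['ID', 'date'])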
I tried this, but with a non-datetime data type:
a = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3]
b = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
import pandas as pd
import numpy as np
a = np.array([a,b])
df=pd.DataFrame(a.T,columns=['ID','Date'])
# the tail would give you the last n number of elements you are interested in
df_ = df.groupby('ID').tail(3)
df_
Output:
ID Date
2 1 c
3 1 d
4 1 e
7 2 h
8 2 i
9 2 j
12 3 m
13 3 n
14 3 o
I have a dataframe that looks like this:
import pandas as pd
date = ['28-01-2017','29-01-2017','30-01-2017','31-01-2017','01-02-2017','02-02-2017','...']
sales = [1,2,3,4,1,2,'...']
days_left_in_m = [3,2,1,0,29,28,'...']
df_test = pd.DataFrame({'date': date,'days_left_in_m':days_left_in_m,'sales':sales})
df_test
I am trying to find the sales for the rest of the month.
So, for the 28th of Jan 2017 it will calculate the sum of sales up to the end of the month (the current day plus the next 3 days), for the 29th of Jan the current day plus the next 2 days, and so on.
The outcome should look like the "required" column below.
date days_left_in_m sales required
0 28-01-2017 3 1 10
1 29-01-2017 2 2 9
2 30-01-2017 1 3 7
3 31-01-2017 0 4 4
4 01-02-2017 29 1 3
5 02-02-2017 28 2 2
6 ... ... ... ...
My current solution is really ugly - I use a non-pythonic loop:
for i in range(length_of_t_series):
    days_left = data_in.loc[i].days_left_in_m
    if days_left == 0:
        sales_temp_list.append(0)
    else:
        if (i + days_left) <= length_of_t_series:
            sales_temp_list.append(sum(data_in.loc[(i+1):(i+days_left)].sales))
        else:
            sales_temp_list.append(np.nan)
I guess a much better way of doing this would be to use df['sales'].rolling(n).sum()
However, each row has a different window.
Please advise on the best way of doing this...
I think you need DataFrame.sort_values with GroupBy.cumsum.
If you do not want to take the current day into account, you can use groupby.shift (see the commented code).
First, convert the date column to datetime in order to use Series.dt.month:
df_test['date'] = pd.to_datetime(df_test['date'],format = '%d-%m-%Y')
Then we can use:
months = df_test['date'].dt.month
df_test['required'] = (df_test.sort_values('date',ascending = False)
.groupby(months)['sales'].cumsum()
#.groupby(months).shift(fill_value = 0)
)
print(df_test)
Output
date days_left_in_m sales required
0 2017-01-28 3 1 10
1 2017-01-29 2 2 9
2 2017-01-30 1 3 7
3 2017-01-31 0 4 4
4 2017-02-01 29 1 3
5 2017-02-02 28 2 2
If you don't want to convert the date column to datetime, use:
months = pd.to_datetime(df_test['date'],format = '%d-%m-%Y').dt.month
df_test['required'] = (df_test.sort_values('date',ascending = False)
.groupby(months)['sales'].cumsum()
#.groupby(months).shift(fill_value = 0)
)
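A caveat not covered in the original answer: grouping by dt.month alone pools the same month from different years together. With multi-year data, a year-month period keeps them apart; a sketch, assuming the date column has already been converted to datetime as above:
# Group by year-month period so e.g. Jan 2017 and Jan 2018 form separate groups.
months = df_test['date'].dt.to_period('M')
df_test['required'] = (df_test.sort_values('date', ascending=False)
                              .groupby(months)['sales'].cumsum())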
I have this pandas DataFrame
val
datetime attribute_id
2018-01-31 0 4.162565
1 3.305480
2 3.191123
3 3.601398
4 3.277375
6 3.556552
2018-02-28 0 0.593762
1 0.594565
2 0.583355
3 0.611113
4 0.577600
6 0.638904
And I would like to have a column ratio where for each month each attribute is divided by the mean of all other attributes.
For example, for datetime = 2018-01-31, which refers to the month of January, I would like the column ratio to contain the value of attribute 0 (4.162565) divided by the mean of attributes 1,2,3,4 and 6 which is the mean of 3.305480, 3.191123, 3.601398, 3.277375 and 3.556552. This month-wise for each attribute.
datetime and attribute_id are a MultiIndex.
Does someone know how to do this?
You can compute the mean per the first MultiIndex level with GroupBy.transform, which creates a new Series with the same size as the original DataFrame, and then divide the column by it with Series.div:
print (df.groupby(level=0)['val'].transform('mean'))
datetime attribute_id
2018-01-31 0 3.515749
1 3.515749
2 3.515749
3 3.515749
4 3.515749
6 3.515749
2018-02-28 0 0.599883
1 0.599883
2 0.599883
3 0.599883
4 0.599883
6 0.599883
Name: val, dtype: float64
df['result'] = df['val'].div(df.groupby(level=0)['val'].transform('mean'))
print (df)
val result
datetime attribute_id
2018-01-31 0 4.162565 1.183977
1 3.305480 0.940192
2 3.191123 0.907665
3 3.601398 1.024362
4 3.277375 0.932198
6 3.556552 1.011606
2018-02-28 0 0.593762 0.989796
1 0.594565 0.991135
2 0.583355 0.972448
3 0.611113 1.018720
4 0.577600 0.962854
6 0.638904 1.065047
If you need to exclude the current row from the mean, use this solution by unutbu with groupby(level=0) swapped in:
grouped = df.groupby(level=0)
n = grouped['val'].transform('count')    # group size, broadcast to each row
mean = grouped['val'].transform('mean')  # group mean, broadcast to each row
# mean*n is the group sum; removing the current value and dividing by n-1
# gives the mean of the *other* attributes in the same month.
df['ratio'] = df['val'] / ((mean*n - df['val']) / (n-1))
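A quick hand check on the January rows above: the mean of attributes 1, 2, 3, 4 and 6 is (3.305480 + 3.191123 + 3.601398 + 3.277375 + 3.556552) / 5 = 3.386386, so the ratio for attribute 0 comes out as 4.162565 / 3.386386 ≈ 1.2292.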
I have this problem where one of the columns in my df is entered as a string, but I want to convert it to an end-of-month date in Python. For example,
Id Name Date Number
0 1 A 201601 5
1 2 B 201602 6
2 3 C 201603 4
The Date column has the year and month as string. Ideally, my goal is:
Id Name Date Number
0 1 A 01/31/2016 5
1 2 B 02/29/2016 6
2 3 C 03/31/2016 4
I was able to do this in Excel using EOMONTH and string slicing, but when I tried pd.to_datetime in Python, it didn't work. Thanks!
We can use MonthEnd:
from pandas.tseries.offsets import MonthEnd

# If Date is stored as integers, it may be safest to cast first with df.Date.astype(str).
df.Date = (pd.to_datetime(df.Date, format='%Y%m') + MonthEnd(1)).dt.strftime('%m/%d/%Y')
df
Out[1336]:
Id Name Date Number
0 1 A 01/31/2016 5
1 2 B 02/29/2016 6
2 3 C 03/31/2016 4
You can use PeriodIndex:
In [36]: df['Date'] = pd.PeriodIndex(df['Date'].astype(str), freq='M').strftime('%m/%d/%Y')
In [37]: df
Out[37]:
Id Name Date Number
0 1 A 01/31/2016 5
1 2 B 02/29/2016 6
2 3 C 03/31/2016 4
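A side note on both answers, not from the original thread: strftime returns strings, so the Date column ends up as object dtype again. If downstream code needs real timestamps, here is a sketch that keeps datetimes instead, under the same assumptions as above:
# Keep Date as month-end Timestamps instead of formatted strings.
df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m') + pd.offsets.MonthEnd(1)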