Filling Pandas dataframe based on date columns and date range - python

I have a pandas dataframe that looks like this,
id start end
0 1 2020-02-01 2020-04-01
1 2 2020-04-01 2020-04-28
I have two additional parameters, date values x and y. x and y will always be the first day of a month.
I want to expand the above data frame to the one shown below for x = "2020-01-01" and y = "2020-06-01",
id month status
0 1 2020-01 -1
1 1 2020-02 1
2 1 2020-03 2
3 1 2020-04 2
4 1 2020-05 -1
5 1 2020-06 -1
6 2 2020-01 -1
7 2 2020-02 -1
8 2 2020-03 -1
9 2 2020-04 1
10 2 2020-05 -1
11 2 2020-06 -1
The dataframe is expanded so that each id gets one additional row per month between x and y, and a status column is filled in such that:
If the month column value is equal to the month of start, fill status with 1.
If the month column value is greater than the month of start but less than or equal to the month of end, fill it with 2.
If the month column value is less than the month of start, or greater than the month of end, fill status with -1.
I'm trying to solve this in pandas without looping. My current solution uses loops and takes too long on huge datasets.
Is there any pandas functions that can help me here?
Thanks @Code Different for the solution. It solves the issue. However, there is an extension to the problem, where the dataframe can look like this:
id start end
0 1 2020-02-01 2020-02-20
1 1 2020-04-01 2020-05-10
2 2 2020-04-10 2020-04-28
One id can have more than one entry. For the above x and y, which are 6 months apart, I want to have 6 rows for each id in the dataframe. The solution currently creates 6 rows for each row in the dataframe, which is okay but not ideal when dealing with dataframes with millions of ids.

Make sure the start and end columns are of type Timestamp:
# Explode each month between x and y
x = '2020-01-01'
y = '2020-06-01'
df['month'] = [pd.date_range(x, y, freq='MS')] * len(df)
df = df.explode('month').drop_duplicates(['id', 'month'])
# Determine the status
df['status'] = -1
cond = df['start'] == df['month']
df.loc[cond, 'status'] = 1
cond = (df['start'] < df['month']) & (df['month'] <= df['end'])
df.loc[cond, 'status'] = 2
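For the extended problem (several rows per id), note that drop_duplicates runs before the status is computed, so it keeps an arbitrary row per (id, month) pair. Below is a minimal sketch that computes the status per exploded row first and then collapses to one row per (id, month) by keeping the highest status; the max-status rule and the to_period month comparison are my assumptions, not part of the original answer:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2],
                   'start': pd.to_datetime(['2020-02-01', '2020-04-01', '2020-04-10']),
                   'end': pd.to_datetime(['2020-02-20', '2020-05-10', '2020-04-28'])})

x, y = '2020-01-01', '2020-06-01'
df['month'] = [pd.date_range(x, y, freq='MS')] * len(df)
df = df.explode('month')
df['month'] = pd.to_datetime(df['month'])  # explode leaves an object column

# Compare calendar months, since start/end may now fall mid-month
m = df['month'].dt.to_period('M')
s = df['start'].dt.to_period('M')
e = df['end'].dt.to_period('M')
df['status'] = -1
df.loc[m == s, 'status'] = 1
df.loc[(m > s) & (m <= e), 'status'] = 2

# Collapse to exactly one row per (id, month), keeping the "best" status (2 > 1 > -1)
out = df.groupby(['id', 'month'], as_index=False)['status'].max()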

Related

Create a new column based on timestamp values to count the days by steps

I would like to add a column to my dataset that counts days as steps based on the timestamp. That is, for one year there should be 365 "steps": all grouped payments for each account on day 1 should be labeled 1 in this column, all payments on day 2 labeled 2, and so on up to day 365. I would like it to look something like this:
account time steps
0 A 2022.01.01 1
1 A 2022.01.02 2
2 A 2022.01.02 2
3 B 2022.01.01 1
4 B 2022.01.03 3
5 B 2022.01.05 5
I have tried this:
def day_step(x):
    x['steps'] = x.time.dt.day.shift()
    return x

df = df.groupby('account').apply(day_step)
However, it only counts within each month; once a new month begins, it starts again from 1.
How can I fix this to make it provide the step count for the entire year?
Use GroupBy.transform with 'first' or 'min' to get each account's starting date as a Series, subtract it from the time column, convert the timedeltas to days, and add 1:
df['time'] = pd.to_datetime(df['time'])
df['steps1'] = (df['time'].sub(df.groupby('account')['time'].transform('first'))
                          .dt.days
                          .add(1))
print (df)
account time steps steps1
0 A 2022-01-01 1 1
1 A 2022-01-02 2 2
2 A 2022-01-02 2 2
3 B 2022-01-01 1 1
4 B 2022-01-03 3 3
5 B 2022-01-05 5 5
First idea, working only if first row is January 1:
df['steps'] = df['time'].dt.dayofyear
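If the rows are not already sorted by date within each account, 'first' just picks whichever row comes first; a variant with 'min' (shown here as a suggestion, not part of the original answer) anchors on the earliest date instead:
df['steps1'] = (df['time'].sub(df.groupby('account')['time'].transform('min'))
                          .dt.days
                          .add(1))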

Python repeatable cycle for picking only first values equal 1

I have a df with a date index and values 0 or 1. I need to pick the first 1 of every run of 1s from this data frame.
For example:
2019-11-27 0
2019-11-29 0
2019-12-02 0
2019-12-03 1
2019-12-04 1
2019-12-05 1
2020-06-01 0
2020-06-02 0
2020-06-03 1
2020-06-04 1
2020-06-05 1
So I want to get:
2019-12-03 1
2020-06-03 1
Assuming you want the first date with value 1 of the dataframe ordered by date ascending, a window operation might be the best way to do this:
df['PrevValue'] = df['value'].rolling(2).agg(lambda rowset: int(rowset.iloc[0]))
This line of code adds an extra column named "PrevValue" to the dataframe containing the value of the previous row or "NaN" for the first row.
Next, you could query the data as follows:
df_filtered = df.query("value == 1 & PrevValue == 0")
Resulting in the following output:
date value PrevValue
3 2019-12-03 1 0.0
8 2020-06-03 1 0.0
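A simpler equivalent idiom (my suggestion, not from the answer above) compares each value directly with its shifted predecessor; fill_value=0 makes a 1 in the very first row count as the start of a run:
# Keep rows where the value is 1 and the previous value was 0
first_ones = df[(df['value'] == 1) & (df['value'].shift(fill_value=0) == 0)]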
I built a function that can satisfy your requirements. Important note: you should change the col argument to match your column name, otherwise it might cause you problems.
def funfun(df, col="values"):
    '''
    df : dataframe
    col (str) : please insert the name of the column that you want to scan
    '''
    # Note: assumes a default RangeIndex, since rows are addressed by position
    a = []
    c = df.to_dict()
    for i in range(len(c[col]) - 1):
        b = c[col][i], c[col][i + 1]
        if b == (0, 1):
            a.append(df.iloc[i + 1])
    return a

Creating a dataframe column in python, based on the conditions on other columns

I have the following DataFrame (in reality I'm working with around 20 million rows):
shop month day sale
1 7 1 10
1 6 1 8
1 5 1 9
2 7 1 10
2 6 1 8
2 5 1 9
I want another column, "prev month sale", where the value is the sale of the previous month with the same day, e.g.
shop month day sale prev month sale
1 7 1 10 8
1 6 1 8 9
1 5 1 9 9
2 7 1 10 8
2 6 1 8 9
2 5 1 9 9
One solution using .concat(), set_index(), and .loc[]:
# Get index of (shop, previous month, day).
# This will serve as a unique index to look up prev. month sale.
prev = pd.concat((df.shop, df.month - 1, df.day), axis=1)
# Build a MultiIndex from the three columns for the lookup
# (previously converted to a list of tuples: [tuple(i) for i in prev.values])
prev = pd.MultiIndex.from_arrays(prev.values.T)
# Now call .loc on df to look up each prev. month sale.
sale_prev_month = df.set_index(['shop', 'month', 'day']).loc[prev]
# And finally just concat rather than merge/join operation
# because we want to ignore index & mimic a left join.
df = pd.concat((df, sale_prev_month.reset_index(drop=True)), axis=1)
shop month day sale sale
0 1 7 1 10 8.0
1 1 6 1 8 9.0
2 1 5 1 9 NaN
3 2 7 1 10 8.0
4 2 6 1 8 9.0
5 2 5 1 9 NaN
Your new column will be float, not int, because of the presence of NaNs.
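An alternative sketch (my addition, not part of the original answer) is a self-merge on (shop, month + 1, day), which avoids building the MultiIndex. It assumes months are encoded 1-12 within a single year, so a December-to-January wrap-around would need a year column as well:
# Shift each row's month forward by one, then left-join back onto the original
prev = (df.rename(columns={'sale': 'prev month sale'})
          .assign(month=lambda d: d['month'] + 1))
df = df.merge(prev, on=['shop', 'month', 'day'], how='left')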
Update - an attempt with dask
I don't use dask day to day, so this is probably woefully sub-par. It has to work around the fact that dask does not implement pandas' MultiIndex, so you can concatenate your three existing index columns into a single string column and look up on that.
import numpy as np
import pandas as pd
import dask.dataframe as dd

# The three columns that identify a row (this was undefined in the original snippet)
on = ['shop', 'month', 'day']

# Play around with npartitions or chunksize here!
df2 = dd.from_pandas(df, npartitions=10)

# Get a *single* index of unique (shop, month, day) IDs
# Dask doesn't support MultiIndex
empty = pd.Series(np.empty(len(df), dtype='object'))  # Passed to `meta`
current = df2.loc[:, on].apply(lambda row: '_'.join(row.astype(str)), axis=1,
                               meta=empty)
prev = df2.loc[:, on].assign(month=df2['month'] - 1)\
          .apply(lambda row: '_'.join(row.astype(str)), axis=1, meta=empty)
df2 = df2.set_index(current)

# We now have two dask.Series, `current` and `prev`, in the
# concatenated format "shop_month_day".
# We also have a dask.DataFrame, df2, which is indexed by `current`.
# I would think we could just call df2.loc[prev].compute(), but
# that's throwing a KeyError for me, so slightly more expensive:
sale_prev_month = df2.compute().loc[prev.compute()][['sale']]\
                     .reset_index(drop=True)

# Now just concat as before
# Could re-break into dask objects here if you really needed to
df = pd.concat((df, sale_prev_month.reset_index(drop=True)), axis=1)
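(A side note, my observation rather than the author's: dask does implement DataFrame.merge, so the self-merge sketched after the pandas solution above may map onto dask more naturally than the string-index lookup.)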

Sum time in pandas

I have a df with columns date, employee and event. 'event' has a value in [1, 3, 5] if someone exits, or in [0, 2, 4] if someone enters. 'employee' is a private number for each employee. That's the head of df:
employee event registration date
0 4 1 1 2010-10-18 18:11:00
1 17 1 1 2010-10-18 18:15:00
2 6 0 1 2010-10-19 06:28:00
3 8 0 0 2010-10-19 07:04:00
4 15 0 1 2010-10-19 07:34:00
I filtered df so it holds the values from one month (year and month are my variables):
df = df.where(df['date'].dt.year == year).dropna()
df = df.where(df['date'].dt.month== month).dropna()
I want to create a df which shows me the sum of time at work for each employee.
Employees come in and go out on the same day, and they can do it a few times each day.
It seems you need boolean indexing to select the month, then a groupby where the differences from diff are summed:
year = 2010
month = 10
df = df[(df['date'].dt.year == year) & (df['date'].dt.month== month)]
A more general solution is to add year and month to the groupby:
df = df['date'].groupby([df['employee'],
                         df['event'],
                         df['date'].rename('year').dt.year,
                         df['date'].rename('month').dt.month]).apply(lambda x: x.diff().sum())
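If every entry is paired with a later exit, a fully vectorized alternative (my own sketch, with the pairing assumption stated in the comments) is to sum exit timestamps minus entry timestamps per employee:
import numpy as np
import pandas as pd

# +1 for exit events [1, 3, 5], -1 for entry events [0, 2, 4]
sign = np.where(df['event'].isin([1, 3, 5]), 1, -1)

# Assumes every entry has a matching later exit, so that
# sum(exit times) - sum(entry times) equals total time at work
ns = df['date'].astype('int64') * sign  # timestamps as nanoseconds since epoch
total = pd.to_timedelta(ns.groupby(df['employee']).sum(), unit='ns')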

How to efficiently add rows for those data points which are missing from a sequence using pandas?

I have the following time series dataset of the number of sales happening for a day as a pandas data frame.
date, sales
20161224,5
20161225,2
20161227,4
20161231,8
Now, if I have to include the missing data points here (i.e. missing dates) with a constant value (zero) and want to make it look the following way, how can I do this efficiently (assuming the data frame is ~50MB) using Pandas?
date, sales
20161224,5
20161225,2
20161226,0**
20161227,4
20161228,0**
20161229,0**
20161231,8
** Missing rows which have been added to the data frame.
Any help will be appreciated.
You can first cast the date column with to_datetime, then set_index and reindex between the min and max values of the index, reset_index, and if necessary change the format back with strftime:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
df = df.set_index('date')
df = (df.reindex(pd.date_range(df.index.min(), df.index.max()), fill_value=0)
        .reset_index()
        .rename(columns={'index': 'date'}))
print (df)
date sales
0 2016-12-24 5
1 2016-12-25 2
2 2016-12-26 0
3 2016-12-27 4
4 2016-12-28 0
5 2016-12-29 0
6 2016-12-30 0
7 2016-12-31 8
Last, if you need to change the format back:
df.date = df.date.dt.strftime('%Y%m%d')
print (df)
date sales
0 20161224 5
1 20161225 2
2 20161226 0
3 20161227 4
4 20161228 0
5 20161229 0
6 20161230 0
7 20161231 8
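A shorter variant (my suggestion, not from the answer above) uses asfreq, which conforms a DatetimeIndex to a fixed daily frequency and fills the newly created rows:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
df = (df.set_index('date')
        .asfreq('D', fill_value=0)  # insert missing days with sales = 0
        .reset_index())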
