I have a dataframe like this:
DATE MIN_AMOUNT MAX_AMOUNT MIN_DAY MAX_DAY
01/09/2022 10 20 1 2
01/09/2022 15 25 4 5
01/09/2022 30 50 7 10
05/09/2022 10 20 1 2
05/09/2022 15 25 4 5
07/09/2022 15 25 4 5
I want to expand the dataframe to cover the full date range spanned by the DATE column, with forward filling. The desired output is:
DATE MIN_AMOUNT MAX_AMOUNT MIN_DAY MAX_DAY
01/09/2022 10 20 1 2
01/09/2022 15 25 4 5
01/09/2022 30 50 7 10
02/09/2022 10 20 1 2
02/09/2022 15 25 4 5
02/09/2022 30 50 7 10
03/09/2022 10 20 1 2
03/09/2022 15 25 4 5
03/09/2022 30 50 7 10
04/09/2022 10 20 1 2
04/09/2022 15 25 4 5
04/09/2022 30 50 7 10
05/09/2022 10 20 1 2
05/09/2022 15 25 4 5
06/09/2022 10 20 1 2
06/09/2022 15 25 4 5
07/09/2022 15 25 4 5
Could you please help me with this?
First convert the values to datetimes and create a helper counter Series g with GroupBy.cumcount. Reshape with DataFrame.set_index and DataFrame.unstack, use DataFrame.asfreq with method='ffill' to add the missing dates, and reshape back with DataFrame.stack. Then remove the helper level with DataFrame.droplevel, convert the DatetimeIndex back to a column, change the datetime format, and finally restore the same dtypes as the original DataFrame:
import pandas as pd

# parse the dd/mm/yyyy strings to datetimes
df['DATE'] = pd.to_datetime(df['DATE'], dayfirst=True)
# helper counter: 0, 1, 2, ... within each date
g = df.groupby('DATE').cumcount()
df = (df.set_index(['DATE', g])
        .unstack()                    # one row per date, counter in the columns
        .asfreq('D', method='ffill')  # add the missing dates and forward fill
        .stack()                      # back to long format (all-NaN rows are dropped)
        .droplevel(-1)                # remove the helper counter level
        .reset_index()
        .assign(DATE=lambda x: x['DATE'].dt.strftime('%d/%m/%Y'))
        .astype(df.dtypes)            # restore original dtypes (note: this re-parses
                                      # the date strings month-first, as shown below)
     )
print(df)
DATE MIN_AMOUNT MAX_AMOUNT MIN_DAY MAX_DAY
0 2022-01-09 10 20 1 2
1 2022-01-09 15 25 4 5
2 2022-01-09 30 50 7 10
3 2022-02-09 10 20 1 2
4 2022-02-09 15 25 4 5
5 2022-02-09 30 50 7 10
6 2022-03-09 10 20 1 2
7 2022-03-09 15 25 4 5
8 2022-03-09 30 50 7 10
9 2022-04-09 10 20 1 2
10 2022-04-09 15 25 4 5
11 2022-04-09 30 50 7 10
12 2022-05-09 10 20 1 2
13 2022-05-09 15 25 4 5
14 2022-06-09 10 20 1 2
15 2022-06-09 15 25 4 5
16 2022-07-09 15 25 4 5
A couple of merges should help with this, and should still be efficient as the data size increases:
Get the unique dates and build a new dataframe from that:
out = df.DATE.drop_duplicates()
dates = pd.date_range(out.min(), out.max(), freq='D')
dates = pd.DataFrame(dates, columns=['dates'])
Merge dates with out, and subsequently merge the outcome with the original dataframe:
(dates
 .merge(out, left_on='dates', right_on='DATE', how='left')
 # faster to fill on a Series than a DataFrame
 .assign(DATE=lambda df: df.DATE.ffill())
 .merge(df, on='DATE', how='left')
 .drop(columns='DATE')
 .rename(columns={'dates': 'DATE'})
)
DATE MIN_AMOUNT MAX_AMOUNT MIN_DAY MAX_DAY
0 2022-09-01 10 20 1 2
1 2022-09-01 15 25 4 5
2 2022-09-01 30 50 7 10
3 2022-09-02 10 20 1 2
4 2022-09-02 15 25 4 5
5 2022-09-02 30 50 7 10
6 2022-09-03 10 20 1 2
7 2022-09-03 15 25 4 5
8 2022-09-03 30 50 7 10
9 2022-09-04 10 20 1 2
10 2022-09-04 15 25 4 5
11 2022-09-04 30 50 7 10
12 2022-09-05 10 20 1 2
13 2022-09-05 15 25 4 5
14 2022-09-06 10 20 1 2
15 2022-09-06 15 25 4 5
16 2022-09-07 15 25 4 5
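If the dd/mm/yyyy text format is needed in the final output, a strftime at the end restores it. A minimal sketch, assuming the chain above was assigned to a variable named result (a name introduced here purely for illustration):
result['DATE'] = result['DATE'].dt.strftime('%d/%m/%Y')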
I am having issues finding a solution for the cumulative sum for MTD (month-to-date) and YTD (year-to-date).
I need help to get this result.
Use groupby.cumsum, grouping on periods created with to_period:
# ensure datetime
s = pd.to_datetime(df['date'], dayfirst=False)
# group by year
df['ytd'] = df.groupby(s.dt.to_period('Y'))['count'].cumsum()
# group by month
df['mtd'] = df.groupby(s.dt.to_period('M'))['count'].cumsum()
Example (with dummy data):
date count ytd mtd
0 2022-08-26 6 6 6
1 2022-08-27 1 7 7
2 2022-08-28 4 11 11
3 2022-08-29 4 15 15
4 2022-08-30 8 23 23
5 2022-08-31 4 27 27
6 2022-09-01 6 33 6
7 2022-09-02 3 36 9
8 2022-09-03 5 41 14
9 2022-09-04 8 49 22
10 2022-09-05 7 56 29
11 2022-09-06 9 65 38
12 2022-09-07 9 74 47
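For a fully runnable version, the dummy data above can be rebuilt directly from the table (the counts are copied from the output shown, not a new random draw):
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2022-08-26', periods=13, freq='D'),
    'count': [6, 1, 4, 4, 8, 4, 6, 3, 5, 8, 7, 9, 9],
})
s = pd.to_datetime(df['date'])
df['ytd'] = df.groupby(s.dt.to_period('Y'))['count'].cumsum()
df['mtd'] = df.groupby(s.dt.to_period('M'))['count'].cumsum()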
I have a pandas.DataFrame of the following form. I'll show you a simple example (in reality, it consists of hundreds of millions of rows of data).
I want the number to change whenever the letter in column '2' changes. The numbers in the remaining columns (columns 1, 3, ...) should not change.
df=
index 1 2 3
0 0 a100 1
1 1.04 a100 2
2 32 a100 3
3 5.05 a105 4
4 1.01 a105 5
5 155 a105 6
6 3155.26 a105 7
7 354.12 a100 8
8 5680.13 a100 9
9 125.55 a100 10
10 13.32 a100 11
11 5656.33 a156 12
12 456.61 a156 13
13 23.52 a1235 14
14 35.35 a1235 15
15 350.20 a100 16
16 30 a100 17
17 13.50 a100 18
18 323.13 a231 19
19 15.11 a1111 20
20 11.22 a1111 21
Here is my expected result:
df=
index 1 2 3
0 0 0 1
1 1.04 0 2
2 32 0 3
3 5.05 1 4
4 1.01 1 5
5 155 1 6
6 3155.26 1 7
7 354.12 2 8
8 5680.13 2 9
9 125.55 2 10
10 13.32 2 11
11 5656.33 3 12
12 456.61 3 13
13 23.52 4 14
14 35.35 4 15
15 350.20 5 16
16 30 5 17
17 13.50 5 18
18 323.13 6 19
19 15.11 7 20
20 11.22 7 21
How do I solve this problem?
Create consecutive groups by comparing the column to its shifted values for inequality (Series.ne), take the cumulative sum, and then subtract 1:
# if the column label is the string '2'
df['2'] = df['2'].ne(df['2'].shift()).cumsum().sub(1)
# if the column label is the number 2
df[2] = df[2].ne(df[2].shift()).cumsum().sub(1)
print(df)
index 1 2 3
0 0 0.00 0 1
1 1 1.04 0 2
2 2 32.00 0 3
3 3 5.05 1 4
4 4 1.01 1 5
5 5 155.00 1 6
6 6 3155.26 1 7
7 7 354.12 2 8
8 8 5680.13 2 9
9 9 125.55 2 10
10 10 13.32 2 11
11 11 5656.33 3 12
12 12 456.61 3 13
13 13 23.52 4 14
14 14 35.35 4 15
15 15 350.20 5 16
16 16 30.00 5 17
17 17 13.50 5 18
18 18 323.13 6 19
19 19 15.11 7 20
20 20 11.22 7 21
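To see why this works, here is the intermediate result on a small illustrative series:
import pandas as pd

s = pd.Series(['a100', 'a100', 'a105', 'a100'])
# True marks the first row of each consecutive run
print(s.ne(s.shift()).tolist())                  # [True, False, True, True]
# the cumulative sum numbers the runs 1, 1, 2, 3; subtracting 1 starts them at 0
print(s.ne(s.shift()).cumsum().sub(1).tolist())  # [0, 0, 1, 2]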
I have a DataFrame with a column B as follows:
A B
0 0 20.00
1 1 35.00
2 2 75.00
3 3 29.00
4 4 125.00
5 5 16.00
6 6 52.50
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
11 11 NaN
12 12 NaN
13 13 239.91
14 14 22.87
15 15 52.74
16 16 37.20
17 17 27.44
18 18 57.01
19 19 29.88
I want to change the values of the column as follows:
if 0 < B < 10.0, replace the cell value of B with "0 to 10"
if 10.1 < B < 20.0, replace the cell value of B with "10 to 20"
and continue like this until the maximum range is reached.
I have tried
ds['B'] = np.where(ds['B'].between(10.0,20.0), "10 to 20", ds['B'])
But once I perform this operation, the column is filled with the string "10 to 20", so I cannot repeat the operation for the remaining values of the DataFrame. After this step, the DataFrame looks like this:
A B
0 0 10 to 20
1 1 35.0
2 2 75.0
3 3 29.0
4 4 125.0
5 5 10 to 20
6 6 52.5
7 7 nan
8 8 nan
9 9 nan
10 10 nan
11 11 nan
12 12 nan
13 13 239.91
14 14 22.87
15 15 52.74
16 16 37.2
17 17 27.44
18 18 57.01
19 19 29.88
And the following line: ds['B'] = np.where(ds['B'].between(20.0,30.0), "20 to 30", ds['B']) will throw TypeError: '>=' not supported between instances of 'str' and 'float'
How can I code this to change all of the values in the DataFrame to these range strings all at once?
Build your bins and labels and use pd.cut:
# extend the edges one bin past the floored max so the largest value is included
bins = np.arange(0, df["B"].max() // 10 * 10 + 20, 10).astype(int)
labels = [' to '.join(t) for t in zip(bins[:-1].astype(str), bins[1:].astype(str))]
df["B"] = pd.cut(df["B"], bins=bins, labels=labels)
>>> df
A B
0 0 10 to 20
1 1 30 to 40
2 2 70 to 80
3 3 20 to 30
4 4 120 to 130
5 5 10 to 20
6 6 50 to 60
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
11 11 NaN
12 12 NaN
13 13 230 to 240
14 14 20 to 30
15 15 50 to 60
16 16 30 to 40
17 17 20 to 30
18 18 50 to 60
19 19 20 to 30
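One detail worth noting: pd.cut's intervals are right-closed by default, which is why 20.00 lands in "10 to 20" above. If left-closed bins are preferred, pd.cut accepts right=False (a sketch; the labels then describe [low, high) ranges):
df["B"] = pd.cut(df["B"], bins=bins, labels=labels, right=False)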
This can be done with much less code, as it is really just a matter of string formatting.
ds['B'] = ds['B'].apply(lambda x: f'{int(x/10) if x>=10 else ""}0 to {int(x/10)+1}0' if pd.notnull(x) else x)
You can create a custom function that maps each value to a range string (for example, 19.0 maps to "10 to 20") and then apply this function to each row.
I've written the code so that the minimum and maximum of the range generalize to the DataFrame and take on values that are multiples of 10.
import numpy as np
import pandas as pd

## copy and paste your DataFrame
ds = pd.read_clipboard()

# floor to the nearest multiple of 10
ds_min = ds['B'].min() // 10 * 10
# ceiling to the nearest multiple of 10
ds_max = np.ceil(ds['B'].max() / 10) * 10
# all bin edges, 10 apart (np.linspace needs an integer count)
ranges = np.linspace(ds_min, ds_max, int((ds_max - ds_min) / 10) + 1)

def map_value_to_string(value):
    for idx in range(1, len(ranges)):
        low_value, high_value = ranges[idx - 1], ranges[idx]
        if low_value < value <= high_value:
            return f"{int(low_value)} to {int(high_value)}"
    # NaN fails every comparison and falls through, returning None
    return None

ds['B'] = ds['B'].apply(map_value_to_string)
Output:
>>> ds
A B
0 0 10 to 20
1 1 30 to 40
2 2 70 to 80
3 3 20 to 30
4 4 120 to 130
5 5 10 to 20
6 6 50 to 60
7 7 None
8 8 None
9 9 None
10 10 None
11 11 None
12 12 None
13 13 230 to 240
14 14 20 to 30
15 15 50 to 60
16 16 30 to 40
17 17 20 to 30
18 18 50 to 60
19 19 20 to 30
I've been trying to code the Python equivalent of Excel's SUMIF.
Excel:
SUMIF($A$1:$A$20,A1,$C$1:$C$20)
Pandas df:
A C Term
1 10 1
1 20 2
1 10 3
1 10 4
2 30 5
2 30 6
2 30 7
3 20 8
3 10 9
3 10 10
3 10 11
3 10 12
Output df: I want an output df with a 'fwdSum' column, as follows:
A C Term fwdSum
1 10 1 50
1 20 2 50
1 10 3 50
1 10 4 50
2 30 5 90
2 30 6 90
2 30 7 90
3 20 8 60
3 10 9 60
3 10 10 60
3 10 11 60
3 10 12 60
I tried creating another df with groupby and sum, and then merging it back later.
Can anyone please suggest the best way to achieve this?
df['fwdSum'] = df.groupby('A')['C'].transform('sum')
print(df)
Prints:
A C Term fwdSum
0 1 10 1 50
1 1 20 2 50
2 1 10 3 50
3 1 10 4 50
4 2 30 5 90
5 2 30 6 90
6 2 30 7 90
7 3 20 8 60
8 3 10 9 60
9 3 10 10 60
10 3 10 11 60
11 3 10 12 60
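For comparison, the groupby-plus-merge route the question describes would look roughly like this (a sketch; transform does the same job in one step and preserves the row order):
sums = df.groupby('A', as_index=False)['C'].sum().rename(columns={'C': 'fwdSum'})
df = df.merge(sums, on='A', how='left')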
So I have a data frame that is something like this:
Resource 2020-06-01 2020-06-02 2020-06-03
Name1 8 7 8
Name2 7 9 9
Name3 10 10 10
Imagine that the header literally contains all the days of the month, and that there are many more names than just three.
I need to reduce the columns to five: the first column should cover the days from 2020-06-01 to 2020-06-05, and each following column should cover Saturday through Friday of the same week (or end on the last day of the month if it comes before Friday). So for June these would be the weeks:
week 1: 2020-06-01 to 2020-06-05
week 2: 2020-06-06 to 2020-06-12
week 3: 2020-06-13 to 2020-06-19
week 4: 2020-06-20 to 2020-06-26
week 5: 2020-06-27 to 2020-06-30
I have no problem defining these weeks. The problem is grouping the columns based on them.
I couldn't come up with anything.
Does someone have any ideas about this?
I had to use this code to generate your dataframe:
import numpy as np
import pandas as pd

dates = pd.date_range(start='2020-06-01', end='2020-06-30')
df = pd.DataFrame({
'Name1': np.random.randint(1, 10, size=len(dates)),
'Name2': np.random.randint(1, 10, size=len(dates)),
'Name3': np.random.randint(1, 10, size=len(dates)),
})
df = df.set_index(dates).transpose().reset_index().rename(columns={'index': 'Resource'})
Then the solution starts from here.
# Set the first column as index
df = df.set_index(df['Resource'])
# Remove the unused column
df = df.drop(columns=['Resource'])
# Transpose the dataframe
df = df.transpose()
# Output:
Resource Name1 Name2 Name3
2020-06-01 00:00:00 3 2 7
2020-06-02 00:00:00 5 6 8
2020-06-03 00:00:00 2 3 6
...
# Bring "Resource" from index to column
df = df.reset_index()
df = df.rename(columns={'index': 'Resource'})
# Add a "week of year" column (Series.dt.weekofyear was removed in pandas 2.0;
# isocalendar().week is the current equivalent)
df['week_no'] = df['Resource'].dt.isocalendar().week
# You can simply group by the week no column
df.groupby('week_no').sum().reset_index()
# Output:
Resource week_no Name1 Name2 Name3
0 23 38 42 41
1 24 37 30 43
2 25 38 29 23
3 26 29 40 42
4 27 2 8 3
I don't know what you want to do next. If you want the original orientation back, just transpose() it again.
EDIT: The OP clarified that the week should start on Saturday and end on Friday.
# 0: Monday
# 1: Tuesday
# 2: Wednesday
# 3: Thursday
# 4: Friday
# 5: Saturday
# 6: Sunday
# Saturday and Sunday belong to the next custom (Sat-Fri) week, so bump them by 1
df['weekday'] = df['Resource'].dt.weekday.apply(lambda day: 0 if day <= 4 else 1)
df['customised_weekno'] = df['week_no'] + df['weekday']
Output:
Resource Resource Name1 Name2 Name3 week_no weekday customised_weekno
0 2020-06-01 4 7 7 23 0 23
1 2020-06-02 8 6 7 23 0 23
2 2020-06-03 5 9 5 23 0 23
3 2020-06-04 7 6 5 23 0 23
4 2020-06-05 6 3 7 23 0 23
5 2020-06-06 3 7 6 23 1 24
6 2020-06-07 5 4 4 23 1 24
7 2020-06-08 8 1 5 24 0 24
8 2020-06-09 2 7 9 24 0 24
9 2020-06-10 4 2 7 24 0 24
10 2020-06-11 6 4 4 24 0 24
11 2020-06-12 9 5 7 24 0 24
12 2020-06-13 2 4 6 24 1 25
13 2020-06-14 6 7 5 24 1 25
14 2020-06-15 8 7 7 25 0 25
15 2020-06-16 4 3 3 25 0 25
16 2020-06-17 6 4 5 25 0 25
17 2020-06-18 6 8 2 25 0 25
18 2020-06-19 3 1 2 25 0 25
So, you can use customised_weekno for grouping.
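Putting it together, the weekly summary is just a groupby on that column (a sketch, keeping only the name columns from the example):
weekly = (df
    .groupby('customised_weekno')[['Name1', 'Name2', 'Name3']]
    .sum()
    .reset_index()
)
print(weekly)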