Here is a sample dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'VipNo':np.repeat( range(3), 2 ),
'Quantity': np.random.randint(200,size=6),
'OrderDate': np.random.choice( pd.date_range('3/31/2018', periods=365, freq='D'), 6, replace=False)})
print(df)
VipNo Quantity OrderDate
0 0 118 2019-02-16
1 0 49 2019-03-25
2 1 113 2018-05-11
3 1 127 2019-02-18
4 2 124 2018-12-27
5 2 71 2018-05-14
I want to create a new column that shows, for each customer, the quantity purchased between 2018-10-01 and 2019-03-31 as a percentage of the quantity purchased between 2018-03-31 and 2019-03-31. First I want to group by VipNo (each number represents an individual) because a person may have made multiple purchases. My dataset is big, so a customer may have ordered multiple times within both time ranges, and in that case I want to use the sum of those orders.
(df.assign(Quantity6=df['OrderDate'].between("2018-10-01","2019-03-31")*df.Quantity)
.assign(Quantity12=df['OrderDate'].between("2018-03-31","2019-03-31")*df.Quantity)
.groupby('VipNo')[['Quantity6','Quantity12']]
.sum()
.assign(output=lambda x: x['Quantity6']/x['Quantity12'])
)
Quantity6 Quantity12 output
VipNo
0 167 167 1.000000
1 127 240 0.529167
2 124 195 0.635897
This code achieves the goal, and I know I can drop Quantity6 and Quantity12. But all I need is the single column "output", which I want to put into a dataframe I created earlier, and I want to keep the code short. How can I create this output column without having to create the other, unnecessary columns?
Thank you in advance~
Just a few modifications to your code:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'VipNo':np.repeat( range(3), 2 ),
'Quantity': np.random.randint(200,size=6),
'OrderDate': np.random.choice( pd.date_range('3/31/2018', periods=365, freq='D'), 6, replace=False)}
).set_index("VipNo")
(df.assign(Quantity6=df['OrderDate'].between("2018-10-01","2019-03-31")*df.Quantity)
.assign(Quantity12=df['OrderDate'].between("2018-03-31","2019-03-31")*df.Quantity)
.groupby('VipNo')[['Quantity6','Quantity12']]
.sum()
.assign(output=lambda x: x['Quantity6']/x['Quantity12'])
)["output"].to_frame().join(df)
I am trying to expand a dataframe containing a number of columns by creating rows based on the interval between two date columns.
For this I am currently using a method that basically creates a Cartesian product, which works well on small datasets but is very inefficient on large ones.
This method will be used on a DataFrame of roughly 2 million rows by 50 columns, spanning multiple years from min to max date. The resulting dataset will be about 3 million rows, so a more efficient approach is required.
I have not succeeded in finding an alternative method which is less resource intensive.
What would be the best approach for this?
My current method here:
from datetime import date
import pandas as pd
raw_data = {'id': ['aa0', 'aa1', 'aa2', 'aa3'],
'number': [1, 2, 2, 1],
'color': ['blue', 'red', 'yellow', "green"],
'date_start': [date(2022,1,1), date(2022,1,1), date(2022,1,7), date(2022,1,12)],
'date_end': [date(2022,1,2), date(2022,1,4), date(2022,1,9), date(2022,1,14)]}
df = pd.DataFrame(raw_data)
This gives the starting dataframe.
Now to create a set containing all possible dates between the min and max date of the set:
df_d = pd.DataFrame({'date': pd.date_range(df['date_start'].min(), df['date_end'].max() + pd.Timedelta('1d'), freq='1d')})
This results in a frame containing all the possible dates, as expected.
Finally, cross merge the original set with the date set and filter the resulting rows based on each row's start and end date:
df_total = pd.merge(df, df_d,how='cross')
df = df_total[(df_total['date_start']<df_total['date']) & (df_total['date_end']>=df_total['date']) ]
This leads to the final dataframe, which is exactly what is needed.
Efficient Solution
# ensure datetime64 dtype so the .dt accessor works (the sample data uses datetime.date objects)
df[['date_start', 'date_end']] = df[['date_start', 'date_end']].apply(pd.to_datetime)
d = df['date_end'].sub(df['date_start']).dt.days             # days elapsed per row
df1 = df.reindex(df.index.repeat(d))                          # repeat each row that many times
i = df1.groupby(level=0).cumcount() + 1                       # 1, 2, ... within each repeated block
df1['date'] = df1['date_start'] + pd.to_timedelta(i, unit='d')
How does it work?
Subtract start from end to calculate the number of days elapsed, then reindex the dataframe by repeating each index exactly that many times. Now group df1 by index and use cumcount to create a sequential counter, build a timedelta series from that counter, and add it to date_start to get the result.
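To make the steps concrete, here is a rough sketch of the intermediate values for the sample data above (assuming the same df, after the datetime conversion shown in the code):

d = df['date_end'].sub(df['date_start']).dt.days
print(d.tolist())           # [1, 3, 2, 2]  -> repeat counts per row
df1 = df.reindex(df.index.repeat(d))
print(df1.index.tolist())   # [0, 1, 1, 1, 2, 2, 3, 3]
i = df1.groupby(level=0).cumcount() + 1
print(i.tolist())           # [1, 1, 2, 3, 1, 2, 1, 2]  -> day offsets added to date_start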
Result
id number color date_start date_end date
0 aa0 1 blue 2022-01-01 2022-01-02 2022-01-02
1 aa1 2 red 2022-01-01 2022-01-04 2022-01-02
1 aa1 2 red 2022-01-01 2022-01-04 2022-01-03
1 aa1 2 red 2022-01-01 2022-01-04 2022-01-04
2 aa2 2 yellow 2022-01-07 2022-01-09 2022-01-08
2 aa2 2 yellow 2022-01-07 2022-01-09 2022-01-09
3 aa3 1 green 2022-01-12 2022-01-14 2022-01-13
3 aa3 1 green 2022-01-12 2022-01-14 2022-01-14
I don't know if this is an improvement, but here the pd.date_range only gets created from the start and end date of each row. The resulting list gets exploded and joined to the original df.
from datetime import date
import pandas as pd
raw_data = {'id': ['aa0', 'aa1', 'aa2', 'aa3'],
'number': [1, 2, 2, 1],
'color': ['blue', 'red', 'yellow', "green"],
'date_start': [date(2022,1,1), date(2022,1,1), date(2022,1,7), date(2022,1,12)],
'date_end': [date(2022,1,2), date(2022,1,4), date(2022,1,9), date(2022,1,14)]}
df = pd.DataFrame(raw_data)
s = df.apply(lambda x: pd.date_range(x['date_start'], x['date_end'], freq='1d',inclusive='right').date,axis=1).explode()
df.join(s.rename('date'))
Hopefully a basic one for most.
I have created two datasets using random data, one for the days of the year and the other for the energy per day:
import numpy as np
import pandas as pd
np.random.seed(2)
start2018 = pd.Timestamp(2018, 1, 1)   # pd.datetime has been removed from recent pandas versions
end2018 = pd.Timestamp(2018, 12, 31)
dates2018 = pd.date_range(start2018, end2018, freq='d')
synEne2018 = np.random.normal(loc=66.883795, scale=5.448145, size=365)
syn2018data = pd.DataFrame({'Date': [dates2018], 'Total Daily Energy': [synEne2018]})
syn2018data
When I run this code I expect to get the daily energy for each date on a separate row. However, what I get is a single row, similar to below:
Date Total Daily Energy
0 DatetimeIndex(['2018-01-01', '2018-01-02', '20... [64.61323781744713, 66.57724516658102, 55.2454...
Can someone suggest an edit to get this to display as described above?
Remove the square brackets around dates2018 and synEne2018. By wrapping them in square brackets you are turning them into nested lists. Just leave them as they are and you should be good to go.
syn2018data = pd.DataFrame({'Date': dates2018, 'Total Daily Energy': synEne2018})
Prints:
Date Total Daily Energy
0 2018-01-01 64.613238
1 2018-01-02 66.577245
2 2018-01-03 55.245489
3 2018-01-04 75.820228
4 2018-01-05 57.112898
.. ... ...
360 2018-12-27 73.685533
361 2018-12-28 60.096896
362 2018-12-29 65.973035
363 2018-12-30 63.742335
364 2018-12-31 69.150342
[365 rows x 2 columns]
Here is a sample dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'VipNo':np.repeat( range(3), 2 ),
'Quantity': np.random.randint(200,size=6),
'OrderDate': np.random.choice( pd.date_range('1/1/2020', periods=365, freq='D'), 6, replace=False)})
print(df)
So I have a couple of steps to do. I want to create a new column named qtywithin1mon/totalqty. First I want to group by VipNo (each number represents an individual) because a person may have made multiple purchases. Then I want to check whether the order date falls within a certain range (let's say 2020/03/01 - 2020/03/31). If so, I want to divide the quantity purchased in that range by the total quantity this customer purchased. My dataset is big, so a customer may have ordered twice within the time range, and in that case I want the sum of the two orders divided by the total quantity. How can I achieve this goal? I really have no idea where to start.
Thank you so much!
You can create a new column masking quantity within the given date range, then groupby:
start, end = pd.to_datetime(['2020/03/01','2020/03/31'])
(df.assign(QuantitySub=df['OrderDate'].between(start,end)*df.Quantity)
.groupby('VipNo')[['Quantity','QuantitySub']]
.sum()
.assign(output=lambda x: x['QuantitySub']/x['Quantity'])
.drop('QuantitySub', axis=1)
)
With a data frame:
VipNo Quantity OrderDate
0 0 105 2020-01-07
1 0 56 2020-03-04
2 1 167 2020-09-05
3 1 18 2020-05-08
4 2 151 2020-11-01
5 2 14 2020-03-17
The output is:
Quantity output
VipNo
0 161 0.347826
1 185 0.000000
2 165 0.084848
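If the ratio should end up back on the original per-row dataframe under the requested column name, a small sketch (assuming the same df, start and end as above):

ratio = (df.assign(QuantitySub=df['OrderDate'].between(start, end) * df.Quantity)
           .groupby('VipNo')
           .apply(lambda g: g.QuantitySub.sum() / g.Quantity.sum()))
df['qtywithin1mon/totalqty'] = df['VipNo'].map(ratio)   # broadcast the per-customer ratio to each row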
The funky way that you index into pandas dataframes to change values is difficult for me. I can never figure out if I'm changing the value of a dataframe element, or if I'm changing a copy of that value.
I'm also new to python's syntax for operating on arrays, and struggle to turn loops over indexes (like in C++) into vector operations in python.
My problem is that I wish to add a column of pandas.Timestamp values to a dataframe based on values in other columns. Lets say I start with a dataframe like
import pandas as pd
import numpy as np
mydata = np.transpose([ [11, 22, 33, 44, 66, 77],
pd.to_datetime(['2015-02-26', '2015-02-27', '2015-02-25', np.nan, '2015-01-24', '2015-03-24'], errors='coerce'),
pd.to_datetime(['2015-02-24', np.nan, '2015-03-24', '2015-02-26', '2015-02-27', '2015-02-25'], errors='coerce')
])
df = pd.DataFrame(columns=['ID', 'BEFORE', 'AFTER'], data=mydata)
df.head(6)
which returns
ID BEFORE AFTER
0 11 2015-02-26 2015-02-24
1 22 2015-02-27 NaT
2 33 2015-02-25 2015-03-24
3 44 NaT 2015-02-26
4 66 2015-01-24 2015-02-27
5 77 2015-03-24 2015-02-25
I want to find the lesser of the dates BEFORE and AFTER and then make a new column called RELEVANT_DATE with the results. I can then drop BEFORE and AFTER. There are a zillion ways to do this but, for me, almost all of them don't work. The best I can do is this
# fix up NaT's only in specific columns, real data has more columns
futureDate = pd.to_datetime('2099-01-01')
df.fillna({'BEFORE':futureDate, 'AFTER':futureDate}, inplace=True)
# super clunky solution
numRows = np.shape(df)[0]
relevantDate = []
for index in range(numRows):
if df.loc[index, 'AFTER'] >= df.loc[index, 'BEFORE']:
relevantDate.append(df.loc[index, 'BEFORE'])
else:
relevantDate.append(df.loc[index, 'AFTER'])
# add relevant date column to df
df['RELEVANT_DATE'] = relevantDate
# delete irrelevant dates
df.drop(labels=['BEFORE', 'AFTER'], axis=1, inplace=True)
df.head(6)
returning
ID RELEVANT_DATE
0 11 2015-02-24
1 22 2015-02-27
2 33 2015-02-25
3 44 2015-02-26
4 66 2015-01-24
5 77 2015-02-25
This approach is super slow. With a few million rows it takes too long to be useful.
Can you provide a pythonic-style solution for this? Recall that I'm having trouble both with vectorizing these operations AND making sure they get set for real in the DataFrame.
Take the minimum across a row (axis=1). Set the index so you can bring 'ID' along for the ride.
df.set_index('ID').min(axis=1).rename('RELEVANT DATE').reset_index()
ID RELEVANT DATE
0 11 2015-02-24
1 22 2015-02-27
2 33 2015-02-25
3 44 2015-02-26
4 66 2015-01-24
5 77 2015-02-25
Or assign the new column to your existing DataFrame:
df['RELEVANT DATE'] = df[['BEFORE', 'AFTER']].min(axis=1)
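A fuller sketch, assuming BEFORE and AFTER should be proper datetime64 columns (the np.transpose construction above leaves them as object dtype, so an explicit conversion is included). Since min skips NaT by default, the futureDate fillna step is not needed:

df['BEFORE'] = pd.to_datetime(df['BEFORE'])
df['AFTER'] = pd.to_datetime(df['AFTER'])
df['RELEVANT_DATE'] = df[['BEFORE', 'AFTER']].min(axis=1)   # NaT values are skipped
df = df.drop(columns=['BEFORE', 'AFTER'])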
I'm using/learning Pandas to load a CSV-style dataset where I have a time column that can be used as an index. The data is sampled at roughly 100 Hz. Here is a simplified snippet of the data:
Time (sec) Col_A Col_B Col_C
0.0100 14.175 -29.97 -22.68
0.0200 13.905 -29.835 -22.68
0.0300 12.257 -29.32 -22.67
... ...
1259.98 -0.405 2.205 3.825
1259.99 -0.495 2.115 3.735
There are 20 min of data, resulting in about 120,000 rows at 100 Hz. My goal is to select those rows within a certain time range, say 100-200 sec.
Here is what I've figured out
import pandas as pd
df = pd.DataFrame(my_data) # my_data is a numpy array
df.set_index(0, inplace=True)
df.columns = ['Col_A', 'Col_B', 'Col_C']
df.index = pd.to_datetime(df.index, unit='s', origin='1900-1-1') # the date in origin is just a placeholder
My dataset doesn't include the date. How can I avoid setting a fake date like I did above? It feels wrong, and it is also quite annoying when I plot the data against time.
I know there are ways to remove the date from the datetime object, like here.
But my goal is to select rows that fall in a certain time range, which means I need to use pd.date_range(). This function does not seem to work without a date.
It's not the end of the world if I just use a fake date throughout my project. But I'd like to know if there are more elegant ways around it.
I don't see why you need datetime64 objects for this. Your time column is a number, so you can very easily select time intervals with inequalities. You can also plot the columns without issue.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Time': np.arange(0,1200,0.01),
'Col_A': np.random.randint(1,100,120000),
'Col_B': np.random.randint(1,10,120000)})
Select Data between 100 and 200 seconds.
df[df.Time.between(100,200)]
Outputs:
Time Col_A Col_B
10000 100.00 75 9
10001 100.01 23 7
...
19999 199.99 39 7
20000 200.00 25 2
Plotting against time
#First 100 rows just for illustration
df[0:100].plot(x='Time')
Convert to timedelta64
If you really wanted to, you could convert the column to timedelta64[ns]:
df['Time'] = pd.to_datetime(df.Time, unit='s') - pd.to_datetime('1970-01-01')
print(df.head())
# Time Col_A Col_B
#0 00:00:00 67 6
#1 00:00:00.010000 93 1
#2 00:00:00.020000 99 3
#3 00:00:00.030000 18 2
#4 00:00:00.040000 84 3
df.dtypes
#Time timedelta64[ns]
#Col_A int32
#Col_B int32
#dtype: object
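A slightly more direct route to the same timedelta64 column, if that representation is preferred, is pd.to_timedelta (a sketch, starting again from the numeric Time column):

df['Time'] = pd.to_timedelta(df['Time'], unit='s')
# interval selection still works, now against Timedelta bounds
mask = df['Time'].between(pd.Timedelta(seconds=100), pd.Timedelta(seconds=200))
print(df[mask].head())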