I want to add data to Redis.
My goal is to split the data and insert it into Redis little by little.
After inserting a pandas.DataFrame into Redis, I want to be able to add more data later.
I have already inserted the DataFrame into Redis, but I don't know how to keep the existing data and append new data to it.
for example:
log_df_v1 ## DataFrame_v1
session_id connect_date location categories join page_out
0 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:14:24 경기도 4 0 1
1 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:13 경기도 4 0 0
2 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:10 경기도 4 0 0
3 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:10 경기도 4 0 0
4 62de8537-e79f-4d67-8db5-57a26b89a42d 2020-01-01 00:10:52 경기도 3 0 1
Step 1. Store the DataFrame in Redis with SET
import json
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)
log_dic = log_df_v1.to_dict()
log_set = json.dumps(log_dic, ensure_ascii=False).encode('utf-8')
r.set('log_t1', log_set)
True
Step 2. Get the data back from Redis and turn it into a DataFrame
log_get = r.get('log_t1').decode('utf-8')
log_dic = dict(json.loads(log_get))
data_log = pd.DataFrame(log_dic)
data_log
session_id connect_date location categories join page_out
0 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:14:24 경기도 4 0 1
1 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:13 경기도 4 0 0
2 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:10 경기도 4 0 0
3 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:10 경기도 4 0 0
4 62de8537-e79f-4d67-8db5-57a26b89a42d 2020-01-01 00:10:52 경기도 3 0 1
Step 3 (question). I want to add another DataFrame (log_df_v2) to Redis.
However, I need to keep the existing DataFrame (log_df_v1).
log_df_v2 ## DataFrame_v2
session_id connect_date location categories join page_out
20000 f28e7b23-5ad0-460f-b50e-e6fe0b5edff6 2019-12-29 16:03:39 서울특별시 12 0 0
20001 e284ca69-333f-4cb8-84c9-485353a4ed74 2019-12-29 16:03:38 경기도 4 0 1
20002 ea348aa8-aa52-4ee2-84da-f000020c1ecf 2019-12-29 16:03:15 경상북도 1 0 0
20003 36b9795c-d38f-4ec1-8f49-0eae9cecd0b6 2019-12-29 16:03:12 경상북도 1 0 0
20004 f83e403e-16f5-4e31-8265-3ad40d9be969 2019-12-29 16:03:12 경상북도 1 0 0
The result I want:
log_get = r.get('log_t1').decode('utf-8')
log_dic = dict(json.loads(log_get))
data_log = pd.DataFrame(log_dic)
data_log
session_id connect_date location categories join page_out
0 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:14:24 경기도 4 0 1
1 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:13 경기도 4 0 0
2 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:10 경기도 4 0 0
3 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:10 경기도 4 0 0
4 62de8537-e79f-4d67-8db5-57a26b89a42d 2020-01-01 00:10:52 경기도 3 0 1
20000 f28e7b23-5ad0-460f-b50e-e6fe0b5edff6 2019-12-29 16:03:39 서울특별시 12 0 0
20001 e284ca69-333f-4cb8-84c9-485353a4ed74 2019-12-29 16:03:38 경기도 4 0 1
20002 ea348aa8-aa52-4ee2-84da-f000020c1ecf 2019-12-29 16:03:15 경상북도 1 0 0
20003 36b9795c-d38f-4ec1-8f49-0eae9cecd0b6 2019-12-29 16:03:12 경상북도 1 0 0
20004 f83e403e-16f5-4e31-8265-3ad40d9be969 2019-12-29 16:03:12 경상북도 1 0 0
How can I insert log_df_v1 and then append log_df_v2 in Redis?
All I want is to keep the accumulated data in Redis.
Please help me...
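One possible approach for step 3 (a minimal sketch, not the only way): read the stored JSON back into a DataFrame, concatenate it with log_df_v2 using pandas, and write the combined frame back under the same key. This keeps the column-oriented to_dict() format and the key name log_t1 from the question.
import json
import pandas as pd
import redis
r = redis.StrictRedis(host="localhost", port=6379, db=0)
# Read the existing frame back from Redis, exactly as in step 2.
old_df = pd.DataFrame(json.loads(r.get('log_t1').decode('utf-8')))
# Append the new rows and overwrite the key with the combined frame.
combined = pd.concat([old_df, log_df_v2])
r.set('log_t1', json.dumps(combined.to_dict(), ensure_ascii=False).encode('utf-8'))
If the data keeps growing, an alternative is to RPUSH each chunk to a Redis list as its own JSON blob and concatenate the chunks when reading, so the whole history does not have to be re-serialized on every append.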
I have a problem. I want to calculate, from a date (for example 2022-06-01), how many touches the customer with customerId == 1 had in the last 6 months. He had two touches, on 2022-05-25 and 2022-05-20. However, I don't know how to group by customer and count, for each date, how many touches that customer has had within the window (count_from_date). I also get a KeyError.
Dataframe
customerId fromDate
0 1 2022-06-01
1 1 2022-05-25
2 1 2022-05-25
3 1 2022-05-20
4 1 2021-09-05
5 2 2022-06-02
6 3 2021-03-01
7 3 2021-02-01
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
'fromDate': ["2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20", "2021-09-05",
"2022-06-02", "2021-03-01", "2021-02-01"]
}
df = pd.DataFrame(data=d)
print(df)
df_new = df.groupby(['customerId', 'fromDate'], as_index=False)['fromDate'].count()
df_new['count_from_date'] = df_new['fromDate']
df = df.merge(df_new['count_from_date'], how='inner', left_index=True, right_index=True)
(df.set_index(['fromDate']).sort_index().groupby('customerId').apply(lambda s: s['count_from_date'].rolling('180D').sum())- 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
[OUT] KeyError: 'count_from_date'
What I want
customerId fromDate occur_last_6_months
0 1 2022-06-01 3 # 2022-05-25, 2022-05-25, 2022-05-20 = 3
1 1 2022-05-25 1 # 2022-05-20 = 1
2 1 2022-05-25 1 # 2022-05-20 = 1
3 1 2022-05-20 0 # none in the last 6 months
4 1 2021-09-05 0 # none in the last 6 months
5 2 2022-06-02 0 # none in the last 6 months
6 3 2021-03-01 1 # 2021-02-01 = 1
7 3 2021-02-01 0 # none in the last 6 months
If duplicated values (like the second and third rows) may be counted as well, count the matched values in a boolean mask by summing only the True values:
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)
def f(x):
    d1 = x["fromDate"].to_numpy()
    d2 = x["last_month"].to_numpy()
    x['occur_last_6_months'] = ((d2[:, None] <= d1) & (d1 <= d1[:, None])).sum(axis=1) - 1
    return x
df = df.groupby('customerId').apply(f)
print(df)
customerId fromDate last_month occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 2
2 1 2022-05-25 2021-11-25 2
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0
If you need to subtract the full count of duplicated dates instead of subtracting 1, use GroupBy.transform with 'size':
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)
def f(x):
    d1 = x["fromDate"].to_numpy()
    d2 = x["last_month"].to_numpy()
    x['occur_last_6_months'] = ((d2[:, None] <= d1) & (d1 <= d1[:, None])).sum(axis=1)
    return x
df = df.groupby('customerId').apply(f)
s = df.groupby(['customerId', 'fromDate'])['customerId'].transform('size')
df['occur_last_6_months'] -= s
print(df)
customerId fromDate last_month occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 1
2 1 2022-05-25 2021-11-25 1
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0
I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
Piece of DF
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you; if not, please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
prev = dt
holiday_look_back.append((prev, h))
for i in range(1, look_back+1):
prev = prev - pd.Timedelta(days=1)
holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The StateHolidayNew column should have the info you need to start analyzing your data.
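As a hypothetical follow-up (not part of the original answer), one quick way to compare the run-up days with the holidays themselves could be:
# Days shortly before a holiday carry the holiday's code in StateHolidayNew
# but are still "0" in the original StateHoliday column.
before = df[(df["StateHolidayNew"] != "0") & (df["StateHoliday"] == "0")]
during = df[df["StateHoliday"] != "0"]
print(before.groupby("StateHolidayNew")["Sales"].mean())
print(during.groupby("StateHoliday")["Sales"].mean())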
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the values between the different letters that represent the holidays and then use groupby to find the sales for each group. An improvement would be to backfill the holiday codes onto the groups before them, e.g. groups=0.0 would become b_0, which would make it easier to understand which holiday each group belongs to, but I am not sure how to do that (see the sketch after the final DataFrame below).
import numpy as np

df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
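For the group-naming improvement mentioned above, one possible sketch (assuming the df with the groups column built by the code above; the labels b_0, a_1 are only illustrative):
# Backfill each holiday's letter onto the stretch of ordinary days before it,
# then combine it with the numeric group label, so groups=0.0 before holiday "b"
# becomes "b_0". The stretch after the last holiday keeps its numeric label.
mask = df["StateHoliday"].str.isalpha().fillna(False).astype(bool)
letters = df["StateHoliday"].where(mask).bfill()
is_num = ~df["groups"].astype(str).str.isalpha()
rename = is_num & letters.notna()
df.loc[rename, "groups"] = (
    letters[rename] + "_" + df.loc[rename, "groups"].astype(float).astype(int).astype(str)
)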
I am working with two pandas dataframes that I am trying to group by the same date ranges. I want to use this sample df, which we can call 'hours', as the basis for the date ranges (MIN_DATE & MAX_DATE), which I was able to build by just grouping every 5 records by index. This is what the 'hours' dataframe looks like:
HOURS MIN_DATE MAX_DATE
0 93.00 2021-01-05 2021-01-12
1 203.25 2021-01-13 2021-01-19
2 210.00 2021-01-20 2021-01-26
3 185.75 2021-01-27 2021-02-02
4 180.25 2021-02-03 2021-02-09
5 172.25 2021-02-10 2021-02-16
Then I have a separate df that I want to summarize with the same date ranges that I'll call 'models' which looks like this:
MODEL DATE MODEL_1 MODEL_2 MODEL_3 MODEL_4 MODEL_5 MODEL_6
0 2021-01-05 0 2 0 0 0 0
1 2021-01-06 0 0 0 0 3 0
2 2021-01-07 0 0 0 0 0 0
3 2021-01-13 3 0 0 0 0 0
4 2021-01-14 0 0 1 1 1 0
5 2021-01-15 0 0 0 0 0 0
6 2021-01-20 0 0 0 0 0 1
7 2021-01-21 0 3 0 0 0 1
I ultimately am looking for this result:
MIN_DATE MAX_DATE MODEL_1 MODEL_2 MODEL_3 MODEL_4 MODEL_5 MODEL_6
0 2021-01-05 2021-01-12 0 2 0 0 3 0
1 2021-01-13 2021-01-19 3 0 1 1 1 0
2 2021-01-20 2021-01-26 0 3 0 0 0 2
I haven't been able to find a way to use .groupby() on the 'models' data using the MIN_DATE & MAX_DATE from the 'hours' data. Is there a different operation I should be using or is there a way to use those dates to summarize the model data?
Thanks
Try using pd.IntervalIndex and groupby:
# First let's ensure that all DATE columns are datetime dtype:
hours_df[['MIN_DATE', 'MAX_DATE']] = hours_df[['MIN_DATE', 'MAX_DATE']].apply(pd.to_datetime)
model_df['DATE'] = pd.to_datetime(model_df['DATE'])
# Create IntervalIndex using from_arrays
hours_df['interval'] = pd.IntervalIndex.from_arrays(hours_df['MIN_DATE'], hours_df['MAX_DATE'], closed='both')
#set 'interval' as index of hours_df
hours_df = hours_df.set_index('interval')
# groupby and sum
model_df.groupby(hours_df.loc[model_df['DATE']].index).sum()
Output:
MODEL_1 MODEL_2 MODEL_3 MODEL_4 MODEL_5 MODEL_6
interval
[2021-01-05, 2021-01-12] 0 2 0 0 3 0
[2021-01-13, 2021-01-19] 3 0 1 1 1 0
[2021-01-20, 2021-01-26] 0 3 0 0 0 2
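If the exact MIN_DATE / MAX_DATE layout from the question is wanted, a possible follow-up (my addition, not part of the original answer) is to split the interval index back out:
# Same groupby as above, kept in a variable so the interval index can be unpacked.
out = model_df.groupby(hours_df.loc[model_df['DATE']].index).sum()
out.insert(0, 'MIN_DATE', out.index.left)
out.insert(1, 'MAX_DATE', out.index.right)
out = out.reset_index(drop=True)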
Try:
# convert the columns first:
df1["MIN_DATE"] = pd.to_datetime(df1["MIN_DATE"])
df1["MAX_DATE"] = pd.to_datetime(df1["MAX_DATE"])
df2["DATE"] = pd.to_datetime(df2["DATE"])
# convert min_date/max_date to a date range
df1["tmp"] = df1.apply(
lambda x: pd.date_range(x["MIN_DATE"], x["MAX_DATE"]), axis=1
)
# explode + save the index to column "index"
df1 = df1.explode("tmp").reset_index()
# merge + groupby on the saved index
print(
df2.merge(df1, left_on="DATE", right_on="tmp")
.groupby("index")
.agg(
{
"MIN_DATE": "min",
"MAX_DATE": "max",
**{f"MODEL_{i}": "sum" for i in range(1, 7)},
}
)
)
Prints:
MIN_DATE MAX_DATE MODEL_1 MODEL_2 MODEL_3 MODEL_4 MODEL_5 MODEL_6
index
0 2021-01-05 2021-01-12 0 2 0 0 3 0
1 2021-01-13 2021-01-19 3 0 1 1 1 0
2 2021-01-20 2021-01-26 0 3 0 0 0 2
I have a data frame that has yearly revenue in columns (2020 to 2025). I want to shift the revenue in those columns by a given time delta(column Time Shift). The time delta I have is in terms of days. Is there an efficient way to make the shift?
E.g., what I want to achieve is to shift the yearly revenue in those columns by the number of days in the Time Shift column, i.e. 4 days of revenue move from one column to the next (so 1.27 [116/365 * 4] should be shifted from 2022 to 2023 for the 1st row).
Thanks in Advance
Text Input data
Launch Date Launch Date Base Time Shift 2020 2021 2022 2023 2024 2025
2022-06-01 2022-06-01 4 0 0 115.98 122.93 119.22 35.31
2025-02-01 2025-02-01 4 0 0 0 0 0 66.18859318
2022-09-01 2022-09-01 4 49.42 254.86 191.12 248.80 206.53 98.22
2025-01-01 2025-01-01 4 0 0 0 0 14.47 54.24
2022-06-01 2022-06-01 4 0 0 50.25 53.26 51.65 15.30
2025-02-01 2025-02-01 4 0 0 0 0 0 28.67
2022-09-01 2022-09-01 4 148.20 758.22 535.45 676.73 545.42 251.83
2025-01-01 2025-01-01 4 0 0 0 0 38.23 139.07
2022-06-01 2022-06-01 4 0 0 140.78 144.88 136.41 39.23
You can figure out how much to shift per year and then subtract it from the current year and add it to the next year.
Get the column names of interest
ycols = [str(n) for n in range(2020,2026)]
calculate the amount that needs shifting, per year (to the next year):
shift_df = df[ycols].multiply(df['Time_Shift']/365.0, axis=0)
looks like this
2020 2021 2022 2023 2024 2025
-- -------- ------- -------- -------- -------- --------
0 0 0 1.27101 1.34718 1.30652 0.386959
1 0 0 0 0 0 0.725354
2 0.541589 2.79299 2.09447 2.72658 2.26334 1.07638
3 0 0 0 0 0.158575 0.594411
4 0 0 0.550685 0.583671 0.566027 0.167671
5 0 0 0 0 0 0.314192
6 1.62411 8.30926 5.86795 7.41622 5.97721 2.75978
7 0 0 0 0 0.418959 1.52405
8 0 0 1.54279 1.58773 1.4949 0.429918
Now create a copy of df (could use the original if you want of course) and apply the operations:
df2 = df.copy()
df2[ycols] = df2[ycols] - shift_df[ycols]
df2[ycols[1:]] =df2[ycols[1:]] + shift_df[ycols[:-1]].values
The slightly tricky bits here are in the last line -- we use the slices [1:] and [:-1] to line each year up with the previous year's shift, and we use the .values method because otherwise the column labels would not match and the addition could not be done.
After this we get df2:
Launch_Date Launch_Date_Base Time_Shift 2020 2021 2022 2023 2024 2025
-- ------------- ------------------ ------------ -------- ------- -------- ------- -------- --------
0 2022-06-01 2022-06-01 4 0 0 114.709 122.854 119.261 36.2296
1 2025-02-01 2025-02-01 4 0 0 0 0 0 65.4632
2 2022-09-01 2022-09-01 4 48.8784 252.609 191.819 248.168 206.993 99.407
3 2025-01-01 2025-01-01 4 0 0 0 0 14.3114 53.8042
4 2022-06-01 2022-06-01 4 0 0 49.6993 53.227 51.6676 15.6984
5 2025-02-01 2025-02-01 4 0 0 0 0 0 28.3558
6 2022-09-01 2022-09-01 4 146.576 751.535 537.891 675.182 546.859 255.047
7 2025-01-01 2025-01-01 4 0 0 0 0 37.811 137.965
8 2022-06-01 2022-06-01 4 0 0 139.237 144.835 136.503 40.295
As you noticed, the amount shifted out of year 2025 is 'lost', i.e. we do not assign it to any new 2026 column.
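For reference, a minimal end-to-end sketch of the steps above (using only the first two rows of the posted data and the underscore column names from the answer's output, both assumptions on my part):
import pandas as pd
df = pd.DataFrame({
    "Launch_Date": ["2022-06-01", "2025-02-01"],
    "Time_Shift": [4, 4],
    "2020": [0.0, 0.0], "2021": [0.0, 0.0], "2022": [115.98, 0.0],
    "2023": [122.93, 0.0], "2024": [119.22, 0.0], "2025": [35.31, 66.18859318],
})
ycols = [str(n) for n in range(2020, 2026)]
# Amount to move from each year into the next one.
shift_df = df[ycols].multiply(df["Time_Shift"] / 365.0, axis=0)
df2 = df.copy()
df2[ycols] = df2[ycols] - shift_df[ycols]                      # remove from the current year
df2[ycols[1:]] = df2[ycols[1:]] + shift_df[ycols[:-1]].values  # add to the following year
print(df2)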
I have a Pandas dataframe with the following columns:
SecId Date Sector Country
184149 2019-12-31 Utility USA
184150 2019-12-31 Banking USA
187194 2019-12-31 Aerospace FRA
...............
128502 2020-02-12 CommSvcs UK
...............
The SecId & Date columns are the indices. What I want is the following:
SecId Date Aerospace Banking CommSvcs ........ Utility AFG CAN .. FRA .... UK USA ...
184149 2019-12-31 0 0 0 1 0 0 0 0 1
184150 2019-12-31 0 1 0 0 0 0 0 0 1
187194 2019-12-31 1 0 0 0 0 0 1 0 0
................
128502 2020-02-12 0 0 1 0 0 0 0 1 0
................
What is an efficient way to pivot this? The original data is denormalized for each day and can have millions of rows.
You can use get_dummies. You can cast as a categorical dtype beforehand to define what columns will be created.
code:
SECTORS = df.Sector.unique()
df["Sector"] = df.Sector.astype(pd.CategoricalDtype(SECTORS))
COUNTRIES = df.Country.unique()
df["Country"] = df.Country.astype(pd.CategoricalDtype(COUNTRIES))
df2 = pd.get_dummies(data=df, columns=["Sector", "Country"], prefix="", prefix_sep="")
output:
SecId Date Aerospace Banking Utility FRA USA
0 184149 2019-12-31 0 0 1 0 1
1 184150 2019-12-31 0 1 0 0 1
2 187194 2019-12-31 1 0 0 1 0
Try as @BEN_YO suggests:
pd.get_dummies(df,columns=['Sector', 'Country'], prefix='', prefix_sep='')
Output:
SecId Date Aerospace Banking Utility FRA USA
0 184149 2019-12-31 0 0 1 0 1
1 184150 2019-12-31 0 1 0 0 1
2 187194 2019-12-31 1 0 0 1 0
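Since the question mentions millions of rows and that SecId and Date are the indices, a possible memory-saving variant (my assumption, not part of either answer) is to move those columns into the index and ask get_dummies for sparse output:
# Keep SecId / Date as the index and store the mostly-zero indicator columns sparsely.
out = pd.get_dummies(
    df.set_index(['SecId', 'Date']),
    columns=['Sector', 'Country'],
    prefix='', prefix_sep='',
    sparse=True,
)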