I've got a dataframe with data taken from a database like this:
conn = sqlite3.connect('REDB.db')
dataAvg1 = pd.read_sql_query(
"SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, HOUSEINFO.RE_POLOHA, HOUSEINFO.RE_DRUH, HOUSEINFO.RE_TYP, HOUSEINFO.RE_UPLOCHA FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, HOUSEINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=HOUSEINFO.INF_ID",conn
)
dataAvg2 = pd.read_sql_query(
"SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, FLATINFO.RE_DISPOZICE, FLATINFO.RE_DRUH, FLATINFO.RE_PPLOCHA FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, FLATINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=FLATINFO.INF_ID",conn
)
dataAvg3 = pd.read_sql_query(
"SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, LANDINFO.RE_PLOCHA, LANDINFO.RE_DRUH, LANDINFO.RE_SITE, LANDINFO.RE_KOMUNIKACE FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, LANDINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=LANDINFO.INF_ID",conn
)
conn.close()
df2 = [dataAvg1, dataAvg2, dataAvg3]
dfAvg = pd.concat(df2)
dfAvg = dfAvg.reset_index(drop=True)
The main columns are UNIQUE_RE_NUMBER, RE_PRICE and UPDATE_DATE. I would like to count the frequency of price changes each day, ideally by creating a new column called 'Frequency' that holds that number for each day. For example:
UPDATE_DAY UNIQUE_RE_NUMBER RE_PRICE FREQUENCY
1.1.2021 1 500 2
1.1.2021 2 400 2
2.1.2021 1 500 1
2.1.2021 2 450 1
I hope this example is understandable.
Right now I have something like this:
dfAvg['FREQUENCY'] = dfAvg.groupby('UPDATE_DATE')['UPDATE_DATE'].transform('count')
dfAvg.drop_duplicates(subset=['UPDATE_DATE'], inplace=True)
This code counts every price added that day, so when the price of a real estate was 500 on 1.1.2021 and is still 500 the next day, it counts as a "change" in price, even though the price stayed the same, and I don't want to count that. I would like to count only distinct price values for each real estate. Is it possible?
Not sure if this is the most efficient way, but maybe it helps:
def ident_deltas(sdf):
    return sdf.assign(
        DELTA=(sdf.RE_PRICE.shift(1) != sdf.RE_PRICE).astype(int)
    )
def sum_deltas(sdf):
    return sdf.assign(FREQUENCY=sdf.DELTA.sum())
df = (
    df.groupby("UNIQUE_RE_NUMBER").apply(ident_deltas)
    .groupby("UPDATE_DAY").apply(sum_deltas)
    .drop(columns="DELTA")
)
Result for
df =
UPDATE_DAY UNIQUE_RE_NUMBER RE_PRICE
0 2021-01-01 1 500
1 2021-01-01 2 400
2 2021-02-01 1 500
3 2021-02-01 2 450
is
UPDATE_DAY UNIQUE_RE_NUMBER RE_PRICE FREQUENCY
0 2021-01-01 1 500 2
1 2021-01-01 2 400 2
2 2021-02-01 1 500 1
3 2021-02-01 2 450 1
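The same idea can probably be expressed without `apply`, using a per-property `shift` and a per-day `transform`. This is only a minimal sketch built on the small example frame above; it assumes the rows can be sorted by property and date, and (like the answer above) it treats the first observation of each property as a change.

```python
import pandas as pd

# Example frame mirroring the answer's sample data
df = pd.DataFrame({
    "UPDATE_DAY": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-02-01", "2021-02-01"]
    ),
    "UNIQUE_RE_NUMBER": [1, 2, 1, 2],
    "RE_PRICE": [500, 400, 500, 450],
})

# Sort so that shift() compares each price with the previous price
# of the same property
df = df.sort_values(["UNIQUE_RE_NUMBER", "UPDATE_DAY"])
previous_price = df.groupby("UNIQUE_RE_NUMBER")["RE_PRICE"].shift()

# The first observation has no previous price (NaN), so it counts as a change
changed = df["RE_PRICE"].ne(previous_price)

# Number of changed prices per day, broadcast back to every row of that day
df["FREQUENCY"] = changed.astype(int).groupby(df["UPDATE_DAY"]).transform("sum")
df = df.sort_index()
print(df)
```

If the first observation should not count as a change, one option is to combine `changed` with `previous_price.notna()` before summing.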
I have two DataFrames that need to be compared iteratively, and the mismatched rows have to be stored in a CSV. Since the data has historical dates, the comparison needs to be performed per year. How can this be achieved in pandas?
product_1 price_1 Date of purchase
0 computer 1200 2022-01-02
1 monitor 800 2022-01-03
2 printer 200 2022-01-04
3 desk 350 2022-01-05
product_2 price_2 Date of purchase
0 computer 900 2022-01-02
1 monitor 800 2022-01-03
2 printer 300 2022-01-04
3 desk 350 2022-01-05
I would use a split/merge/where
df1['Date of purchase'] = df1['Date of purchase'].apply(lambda x : x.split('-')[0])
df2['Date of purchase'] = df2['Date of purchase'].apply(lambda x : x.split('-')[0])
From there you can merge the two columns using a join or merge
After that you can use an np.where()
merge_df['Check'] = np.where(merge_df['comp_column'] == merge_df['another_comp_column'], True, False)
From there you can just look for where the comp columns didn't match
merge_df.loc[merge_df['Check'] == False]
First, let's solve the problem for any group of dates/years. You could start by merging your data on the date and product names:
df = df1.merge(df2, left_on=["Date of purchase", "product_1"], right_on=["Date of purchase", "product_2"])
# Bonus points if you rename "product_2" and only use `on` instead of `left_on` and `right_on`
After that, you could simply use .loc to find the rows where prices do not match:
print(df.loc[df["price_1"] != df["price_2"]])
product_1 price_1 Date of purchase product_2 price_2
0 computer 1200 2022-01-02 computer 900
2 printer 200 2022-01-04 printer 300
Now, you could process each year by iterating a list of years, querying only the data from that year on df1 and df2 and then using the above procedure to find the price mismatches:
# List available years
years = pd.concat([df1["Date of purchase"].dt.year, df2["Date of purchase"].dt.year], axis=0).unique()
# Rename columns for those bonus points
df1 = df1.rename(columns={"product_1": "product"})
df2 = df2.rename(columns={"product_2": "product"})
# Accumulate your rows in a new dataframe (starting from a list)
output_rows = list()
for year in years:
    # Find data for this `year`
    df1_year = df1.loc[df1["Date of purchase"].dt.year == year]
    df2_year = df2.loc[df2["Date of purchase"].dt.year == year]
    # Apply the procedure described at the beginning
    df = df1_year.merge(df2_year, on=["Date of purchase", "product"])
    # Find rows where prices do not match
    mismatch_rows = df.loc[df["price_1"] != df["price_2"]]
    output_rows.append(mismatch_rows)
# Now, transform your rows into a single dataframe
output_df = pd.concat(output_rows)
Output:
product price_1 Date of purchase price_2
0 computer 1200 2022-01-02 900
2 printer 200 2022-01-04 300
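Since the question also asks for the mismatches to be stored as CSV, here is a minimal sketch of one way that step might look. It assumes the product columns have already been renamed to `product` and that `Date of purchase` is a datetime column; the `csv_by_year` dictionary and the in-memory buffers are purely illustrative and stand in for real output files.

```python
import io
import pandas as pd

dates = pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05"])
products = ["computer", "monitor", "printer", "desk"]
df1 = pd.DataFrame({"product": products, "price_1": [1200, 800, 200, 350],
                    "Date of purchase": dates})
df2 = pd.DataFrame({"product": products, "price_2": [900, 800, 300, 350],
                    "Date of purchase": dates})

# Merge once, then keep only the rows whose prices differ
merged = df1.merge(df2, on=["Date of purchase", "product"])
mismatches = merged.loc[merged["price_1"] != merged["price_2"]]

# Write one CSV per year (to in-memory buffers here; pass a file path
# to to_csv instead to write to disk)
csv_by_year = {}
for year, rows in mismatches.groupby(mismatches["Date of purchase"].dt.year):
    buffer = io.StringIO()
    rows.to_csv(buffer, index=False)
    csv_by_year[int(year)] = buffer.getvalue()

print(csv_by_year[2022])
```

Merging once and grouping the mismatches by year avoids filtering both frames inside the loop, although the result should be the same as the year-by-year procedure above.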
I have this type of data, but in real life it has millions of entries. Product id is always product specific, but occurs several times during its lifetime.
date        product id          revenue  estimated lifetime value
2021-04-16  0061M00001AXc5lQAD  970      2000
2021-04-17  0061M00001AXbCiQAL  159      50000
2021-04-18  0061M00001AXb9AQAT  80       3000
2021-04-19  0061M00001AXbIHQA1  1100     8000
2021-04-20  0061M00001AXbY8QAL  90       4000
2021-04-21  0061M00001AXbQ1QAL  29       30000
2021-04-21  0061M00001AXc5lQAD  30       2000
2021-05-02  0061M00001AXc5lQAD  50       2000
2021-05-05  0061M00001AXc5lQAD  50       2000
I'm looking to create a new column in pandas that indicates when a certain product id has generated more revenue than a specific threshold e.g. 100$, 1000$, marking it as a Win (1). A win may occur only once during the lifecycle of a product. In addition I would want to create another column that would indicate the row where a specific product sales exceeds e.g. 10% of the estimated lifetime value.
What would be the most intuitive approach to achieve this in Python / Pandas?
edit:
dw1k_thresh: if the cumulative sales of a specific product id >= 1000, the column takes a value of 1, otherwise 0. However, 1 can occur only once per product; after that the value is always 0 again. Basically it's just an indicator of the date and transaction at which a product's sales exceed the critical value of 1000.
dw10perc: if the cumulative sales of one product id >= 10% of the estimated lifetime value, the column takes a value of 1, otherwise 0. However, 1 can occur only once per product; after that the value is always 0 again. Basically it's just an indicator of the date and transaction at which a product's sales exceed the critical value of 10% of the estimated lifetime value.
The threshold value is common for all product id's (I'll just replicate the process with different thresholds at a later stage to determine which is the optimal threshold to predict future revenue).
I'm trying to achieve this:
The code I've written so far is trying to establish the cum_rev and dw1k_thresh columns, but unfortunately it doesn't work.
df_final["dw1k_thresh"] = 0
df_final["cum_rev"]= 0
opp_list =set()
for row in df_final["product id"].iteritems():
    opp_list.add(row)
opp_list=list(opp_list)
opp_list=pd.Series(opp_list)
for i in opp_list:
    if i == df_final["product id"].any():
        df_final.cum_rev = df_final.revenue.cumsum()
        for x in df_final.cum_rev:
            if x >= 1000 & df_final.dw1k_thresh.sum() == 0:
                df_final.dw1k_thresh = 1
            else:
                df_final.dw1k_thresh = 0
df_final.head(30)
Cumulative Revenue: Can be calculated fairly simply with groupby and cumsum.
dw1k_thresh: We first check whether cum_rev is at least 1000 and then apply a function that keeps the 1 only on the first such row for each product, so that every later row is 0 again.
dw10_perc: Same approach as dw1k_thresh.
As a first step you would need to remove $ and make sure your columns are of numeric type to perform the comparisons you outlined.
# Imports
import pandas as pd
import numpy as np
# Remove $ sign and convert to numeric
cols = ['revenue','estimated lifetime value']
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True).astype(float)
# Cumulative Revenue
df['cum_rev'] = df.groupby('product id')['revenue'].cumsum()
# Function to be applied on both
def f(df,thresh_col):
    return (df[df[thresh_col]==1].sort_values(['date','product id'], ascending=False)
            .groupby('product id', as_index=False,group_keys=False)
            .apply(lambda x: x.tail(1))
            ).index.tolist()
# dw1k_thresh
df['dw1k_thresh'] = np.where(df['cum_rev'].ge(1000),1,0)
df['dw1k_thresh'] = np.where(df.index.isin(f(df,'dw1k_thresh')),1,0)
# dw10perc
df['dw10_perc'] = np.where(df['cum_rev'] >= 0.10 * df['estimated lifetime value'],1,0)
df['dw10_perc'] = np.where(df.index.isin(f(df,'dw10_perc')),1,0)
Prints:
>>> df
date product id revenue ... cum_rev dw1k_thresh dw10_perc
0 2021-04-16 0061M00001AXc5lQAD 970 ... 970 0 1
1 2021-04-17 0061M00001AXbCiQAL 159 ... 159 0 0
2 2021-04-18 0061M00001AXb9AQAT 80 ... 80 0 0
3 2021-04-19 0061M00001AXbIHQA1 1100 ... 1100 1 1
4 2021-04-20 0061M00001AXbY8QAL 90 ... 90 0 0
5 2021-04-21 0061M00001AXbQ1QAL 29 ... 29 0 0
6 2021-04-21 0061M00001AXc5lQAD 30 ... 1000 1 0
7 2021-05-02 0061M00001AXc5lQAD 50 ... 1050 0 0
8 2021-05-05 0061M00001AXc5lQAD 50 ... 1100 0 0
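As an alternative to the helper `f`, the "only the first crossing" rule can probably be expressed with a second cumulative sum. This is a minimal sketch, not the answer's method; `first_crossing` is a hypothetical helper name, and the code assumes the revenue columns are already numeric and the rows are in chronological order.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-04-16", "2021-04-17", "2021-04-18", "2021-04-19", "2021-04-20",
        "2021-04-21", "2021-04-21", "2021-05-02", "2021-05-05",
    ]),
    "product id": [
        "0061M00001AXc5lQAD", "0061M00001AXbCiQAL", "0061M00001AXb9AQAT",
        "0061M00001AXbIHQA1", "0061M00001AXbY8QAL", "0061M00001AXbQ1QAL",
        "0061M00001AXc5lQAD", "0061M00001AXc5lQAD", "0061M00001AXc5lQAD",
    ],
    "revenue": [970, 159, 80, 1100, 90, 29, 30, 50, 50],
    "estimated lifetime value": [2000, 50000, 3000, 8000, 4000, 30000,
                                 2000, 2000, 2000],
})

# Cumulative revenue per product, in chronological order
df = df.sort_values("date", kind="stable")
df["cum_rev"] = df.groupby("product id")["revenue"].cumsum()

def first_crossing(crossed, keys):
    """Hypothetical helper: 1 on the first True row of each group, else 0."""
    # Running count of True rows per group; it equals 1 on the first True row
    running = crossed.astype(int).groupby(keys).cumsum()
    return (crossed & running.eq(1)).astype(int)

df["dw1k_thresh"] = first_crossing(df["cum_rev"] >= 1000, df["product id"])
df["dw10_perc"] = first_crossing(
    df["cum_rev"] >= 0.10 * df["estimated lifetime value"], df["product id"]
)
print(df[["product id", "cum_rev", "dw1k_thresh", "dw10_perc"]])
```

Because everything here is a vectorized groupby operation, this form may scale better to millions of rows than a per-group `apply`, though that is worth measuring on the real data.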
This question is related to How to fill missing dates and values in partitioned data?, but since the solution doesn't work for BigQuery, I'm posting the question again.
I have the following hypothetical table:
name date val
-------------------------------
A 01/01/2020 1.5
A 01/03/2020 2
A 01/06/2020 5
B 01/02/2020 90
B 01/07/2020 10
I want to fill in the dates in between the gaps and copy over the value from the most recent following date. In addition, I would like to fill in dates that 1) go back to a pre-set MINDATE (let's say it's 12/29/2019) and 2) go up to the current date (let's say it's 01/09/2020) - and for 2) the default values will be 1.
So, the output would be:
name date val
-------------------------------
A 12/29/2019 1.5
A 12/30/2019 1.5
A 12/31/2019 1.5
A 01/01/2020 1.5 <- original
A 01/02/2020 2
A 01/03/2020 2 <- original
A 01/04/2020 5
A 01/05/2020 5
A 01/06/2020 5 <- original
A 01/07/2020 1
A 01/08/2020 1
A 01/09/2020 1
B 12/29/2019 90
B 12/30/2019 90
B 12/31/2019 90
B 01/01/2020 90
B 01/02/2020 90 <- original
B 01/03/2020 10
B 01/04/2020 10
B 01/05/2020 10
B 01/06/2020 10
B 01/07/2020 10 <- original
B 01/08/2020 1
B 01/09/2020 1
The accepted solution in the above question doesn't work in BigQuery.
This should work:
with base as (
select 'A' as name, '01/01/2020' as date, 1.5 as val union all
select 'A' as name, '01/03/2020' as date, 2 as val union all
select 'A' as name, '01/06/2020' as date, 5 as val union all
select 'B' as name, '01/02/2020' as date, 90 as val union all
select 'B' as name, '01/07/2020' as date, 10 as val
),
missing_dates as (
select name,dates as date from
UNNEST(GENERATE_DATE_ARRAY('2019-12-29', '2020-01-09', INTERVAL 1 DAY)) AS dates cross join (select distinct name from base)
), joined as (
select distinct missing_dates.name, missing_dates.date,val
from missing_dates
left join base on missing_dates.name = base.name
and parse_date('%m/%d/%Y', base.date) = missing_dates.date
)
select * except(val),
ifnull(first_value(val ignore nulls) over(partition by name order by date ROWS BETWEEN CURRENT ROW AND
UNBOUNDED FOLLOWING),1) as val
from joined
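For readers who want to check the logic outside BigQuery, the same three steps (build the full date grid per name, take the next available value, default to 1) can be sketched in pandas. This is only an illustration of the algorithm under the question's sample data, not a BigQuery solution; the names `all_dates`, `full_index` and `filled` are made up for the example.

```python
import pandas as pd

base = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["01/01/2020", "01/03/2020", "01/06/2020", "01/02/2020", "01/07/2020"],
        format="%m/%d/%Y",
    ),
    "val": [1.5, 2, 5, 90, 10],
})

# Every (name, date) combination between MINDATE and the current date
all_dates = pd.date_range("2019-12-29", "2020-01-09", freq="D")
full_index = pd.MultiIndex.from_product(
    [base["name"].unique(), all_dates], names=["name", "date"]
)
filled = base.set_index(["name", "date"]).reindex(full_index)

# Take the value of the next known date within each name; dates after the
# last known value have nothing to copy from and fall back to 1
filled["val"] = filled.groupby(level="name")["val"].bfill().fillna(1)
filled = filled.reset_index()
print(filled)
```

The backfill plays the role of `first_value(val ignore nulls)` over the rows from the current one onward, and `fillna(1)` corresponds to the `ifnull(..., 1)` default.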
I have a data frame with these columns:
timestamp,stockname,total volume traded
There are multiple stock names at each time frame
11:00,A,100
11:00,B,500
11:01,A,150
11:01,B,600
11:02,A,200
11:02,B,650
I want to create a ChangeInVol column such that each stock carries its own difference like
timestamp, stock,total volume, change in volume
11:00,A,100,NaN
11:00,B,500,NaN
11:01,A,150,50
11:01,B,600,100
11:02,A,200,50
11:02,B,650,50
If it were a single stock, I could have done
df['ChangeVol'] = df['TotalVol'] - df['TotalVol'].shift(1)
but there are multiple stocks
Need sort_values + DataFrameGroupBy.diff:
# if the rows are not already sorted
df = df.sort_values(['timestamp','stockname'])
df['change in volume'] = df.groupby('stockname')['total volume traded'].diff()
print (df)
timestamp stockname total volume traded change in volume
0 11:00 A 100 NaN
1 11:00 B 500 NaN
2 11:01 A 150 50.0
3 11:01 B 600 100.0
4 11:02 A 200 50.0
5 11:02 B 650 50.0
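To connect this with the single-stock `shift` attempt from the question: a grouped `diff` should give the same result as subtracting a grouped `shift(1)`. The sketch below rebuilds the sample data and compares the two; the variable names `via_diff` and `via_shift` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": ["11:00", "11:00", "11:01", "11:01", "11:02", "11:02"],
    "stockname": ["A", "B", "A", "B", "A", "B"],
    "total volume traded": [100, 500, 150, 600, 200, 650],
})

grouped = df.groupby("stockname")["total volume traded"]

# Difference to the previous row of the same stock
via_diff = grouped.diff()

# The question's shift-based idea, applied per stock instead of globally
via_shift = df["total volume traded"] - grouped.shift(1)

df["change in volume"] = via_diff
print(df)
print(via_diff.equals(via_shift))
```

The first row of each stock has no predecessor, which is why its change is NaN in both variants.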
I have a pandas dataframe (originally generated from a sql query) that looks like:
index AccountId ItemID EntryDate
1 100 1000 1/1/2016
2 100 1000 1/2/2016
3 100 1000 1/3/2016
4 101 1234 9/15/2016
5 101 1234 9/16/2016
etc....
I'd like to get this whittled down to a unique list, returning only the entry with the earliest date available, something like this:
index AccountId ItemID EntryDate
1 100 1000 1/1/2016
4 101 1234 9/15/2016
etc....
Any pointers or direction for a fairly new pandas dev? The unique function doesn't appear to be able to handle these types of rules, and looping through the array and working out which one to drop seems like a lot of trouble for a simple task... Is there a function that I'm missing that does this?
Let's use groupby, idxmin, and .loc:
df_out = df2.loc[df2.groupby('AccountId')['EntryDate'].idxmin()]
print(df_out)
Output:
AccountId ItemID EntryDate
index
1 100 1000 2016-01-01
4 101 1234 2016-09-15
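Note that `idxmin` compares the values of EntryDate, so the column most likely needs to be a real datetime rather than a string such as "9/15/2016". Here is a minimal sketch on the sample data that converts the column first and also shows a `sort_values` + `drop_duplicates` variant; the date format string is an assumption based on the sample rows.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "AccountId": [100, 100, 100, 101, 101],
        "ItemID": [1000, 1000, 1000, 1234, 1234],
        "EntryDate": ["1/1/2016", "1/2/2016", "1/3/2016",
                      "9/15/2016", "9/16/2016"],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="index"),
)

# Parse the dates so that "earliest" is decided chronologically
# (assumes month/day/year, as in the sample rows)
df["EntryDate"] = pd.to_datetime(df["EntryDate"], format="%m/%d/%Y")

# Variant 1: the answer's approach
via_idxmin = df.loc[df.groupby("AccountId")["EntryDate"].idxmin()]

# Variant 2: sort by date and keep the first row per account
via_dedup = (
    df.sort_values("EntryDate")
    .drop_duplicates(subset="AccountId", keep="first")
    .sort_index()
)
print(via_dedup)
```

If uniqueness should be per account and item rather than per account, passing `["AccountId", "ItemID"]` to `groupby` or `subset` is the natural adjustment.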