How to fill missing dates in BigQuery? - python

This question is related to How to fill missing dates and values in partitioned data?, but since the solution doesn't work for BigQuery, I'm posting the question again.
I have the following hypothetical table:
name date val
-------------------------------
A 01/01/2020 1.5
A 01/03/2020 2
A 01/06/2020 5
B 01/02/2020 90
B 01/07/2020 10
I want to fill in the dates in the gaps, copying the value from the nearest following date. In addition, I would like to fill in dates that 1) go back to a pre-set MINDATE (let's say it's 12/29/2019) and 2) go up to the current date (let's say it's 01/09/2020); for 2), the default value will be 1.
So, the output would be:
name date val
-------------------------------
A 12/29/2019 1.5
A 12/30/2019 1.5
A 12/31/2019 1.5
A 01/01/2020 1.5 <- original
A 01/02/2020 2
A 01/03/2020 2 <- original
A 01/04/2020 5
A 01/05/2020 5
A 01/06/2020 5 <- original
A 01/07/2020 1
A 01/08/2020 1
A 01/09/2020 1
B 12/29/2019 90
B 12/30/2019 90
B 12/31/2019 90
B 01/01/2020 90
B 01/02/2020 90 <- original
B 01/03/2020 10
B 01/04/2020 10
B 01/05/2020 10
B 01/06/2020 10
B 01/07/2020 10 <- original
B 01/08/2020 1
B 01/09/2020 1
The accepted solution in the above question doesn't work in BigQuery.

This should work:
with base as (
  select 'A' as name, '01/01/2020' as date, 1.5 as val union all
  select 'A' as name, '01/03/2020' as date, 2 as val union all
  select 'A' as name, '01/06/2020' as date, 5 as val union all
  select 'B' as name, '01/02/2020' as date, 90 as val union all
  select 'B' as name, '01/07/2020' as date, 10 as val
),
missing_dates as (
  select name, dates as date
  from UNNEST(GENERATE_DATE_ARRAY('2019-12-29', '2020-01-09', INTERVAL 1 DAY)) AS dates
  cross join (select distinct name from base)
),
joined as (
  select distinct missing_dates.name, missing_dates.date, val
  from missing_dates
  left join base
    on missing_dates.name = base.name
    and parse_date('%m/%d/%Y', base.date) = missing_dates.date
)
select * except(val),
  ifnull(first_value(val ignore nulls)
           over(partition by name order by date
                ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING), 1) as val
from joined
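For anyone who wants to sanity-check the fill logic outside BigQuery, here is a minimal pandas sketch of the same back-fill, assuming the sample table is loaded into a DataFrame (the MINDATE and current-date bounds are the ones from the question):
import pandas as pd

MINDATE, MAXDATE = "2019-12-29", "2020-01-09"  # pre-set MINDATE and the "current" date

df = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B"],
    "date": ["01/01/2020", "01/03/2020", "01/06/2020", "01/02/2020", "01/07/2020"],
    "val":  [1.5, 2, 5, 90, 10],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

full_range = pd.date_range(MINDATE, MAXDATE, freq="D")
filled = (
    df.set_index("date")
      .groupby("name")["val"]
      # reindex each name to the full range, back-fill from the next known
      # value, then default the trailing dates to 1
      .apply(lambda s: s.reindex(full_range).bfill().fillna(1))
      .rename_axis(["name", "date"])
      .rename("val")
      .reset_index()
)
print(filled)
Here reindex + bfill plays the role of FIRST_VALUE(... IGNORE NULLS) over the following rows, and fillna(1) supplies the default for dates after the last observation.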

Related

SQL - applying two different conditions after group by to select rows from sql table

I want to filter rows based on groups of client id and date.
Within each group: if the latest status (by update date) is 'CO', keep the earliest row (by update date); if the latest status is one of ('NonPay', 'VD', 'Active'), keep the latest row (by update date).
The table (table1) looks like:
rownum clientid date       status updateDate
1      1234     2021-02-01 CO     2021-02-01
2      1234     2021-02-01 CO     2021-01-01
3      1234     2021-02-01 NonPay 2020-12-01
4      1234     2021-02-03 Active 2021-11-01
5      1234     2021-02-03 CO     2021-10-01
6      1234     2021-02-03 CO     2021-09-01
7      1234     2021-02-04 CO     2021-08-01
8      1234     2021-02-04 VD     2021-07-01
9      4567     2019-06-01 Active 2020-12-28
10     4567     2019-06-01 CO     2020-12-20
11     4567     2019-06-01 NonPay 2020-12-10
12     4567     2019-05-03 VD     2020-12-01
13     4567     2019-05-03 Active 2020-11-01
14     4567     2019-05-03 CO     2020-10-01
15     4567     2019-05-03 NP     2020-09-01
16     4567     2019-04-04 CO     2020-08-01
17     4567     2019-04-04 VD     2020-07-01
So the expected result would look like:
rownum clientid date       status updateDate
3      1234     2021-02-01 NonPay 2020-12-01
4      1234     2021-02-03 Active 2021-11-01
8      1234     2021-02-04 VD     2021-07-01
9      4567     2019-06-01 Active 2020-02-01
12     4567     2019-05-03 VD     2020-12-01
17     4567     2019-04-04 VD     2020-07-01
I tried:
select *,
  case when rank_date = 1 and status != 'Active'
       then max(rank_date)
       else min(rank_date) end as selected_rank_date
from (select *,
        rank() over(partition by clientid, date
                    order by updateDate desc) as rank_date
      from table1) t
On top of this I would compare rank_date and selected_rank_date and select the rows where they are equal.
Unfortunately I have not been able to figure out even the first query; I have been trying for a week.
If there is a Python way of doing this, it needs to be efficient, since the table is huge (roughly 1 billion records).
# Import the csv into df and try the code below
import pandas as pd

grp = df.groupby(['clientid', 'date'], axis=0)
li = []
for i, j in grp:
    # sort each group by updateDate (the result must be assigned back)
    j = j.sort_values(by=['updateDate'], ascending=True)
    # drop the 'CO' rows and keep the earliest remaining row of the group
    fil = j['status'] != 'CO'
    j = j.loc[fil, :].reset_index(drop=True)
    li.append(j.loc[0, :])
pd.DataFrame(li)
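Given the roughly 1 billion rows mentioned, a vectorized groupby/transform approach should be much faster than the Python loop above. A hedged sketch of the stated rule, with column names taken from the question and df assumed to be the table already loaded as a DataFrame (treat it as an outline rather than tested code):
# sort so that 'first'/'last' within each (clientid, date) group mean
# earliest/latest updateDate
d = df.sort_values(['clientid', 'date', 'updateDate'])
g = d.groupby(['clientid', 'date'])

last_status = g['status'].transform('last')                    # status of the latest row
is_first = d['updateDate'].eq(g['updateDate'].transform('first'))
is_last = d['updateDate'].eq(g['updateDate'].transform('last'))

# latest status is 'CO'                       -> keep the earliest row
# latest status in ('NonPay', 'VD', 'Active') -> keep the latest row
keep = (
    (last_status.eq('CO') & is_first)
    | (last_status.isin(['NonPay', 'VD', 'Active']) & is_last)
)
result = d[keep]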
One approach to solve your problem in MySQL is the following:
Step 1: get the first and the last updateDate for each partition (client_id, date)
Step 2: get the last updateDate for the group ('NonPay','VD','Active')
Step 3: get the first updateDate for the group ('CO')
Step 4: do a union of the rows for the two groups
Step 1: you can use ROW_NUMBER():
ascending over updateDate: the row where this value equals 1 is the first (earliest) update of the partition
descending over updateDate: the row where this value equals 1 is the last (latest) update of the partition
SELECT *,
ROW_NUMBER() OVER(PARTITION BY clientid, date
ORDER BY updateDate ) AS firstUpdateDate,
ROW_NUMBER() OVER(PARTITION BY clientid, date
ORDER BY updateDate DESC) AS lastUpdateDate
FROM tab
Step 2: when the latest row of a partition carries one of the statuses ('NonPay', 'VD', 'Active'), you want that latest row itself, so you simply keep the last rows (lastUpdateDate = 1) whose status is one of those values.
SELECT rd.rownum,
rd.clientid,
rd.date,
rd.status,
rd.updateDate
FROM ranked_dates rd
WHERE rd.lastUpdateDate = 1
AND rd.status IN ('NonPay', 'VD', 'Active')
Step 3: when the latest row of a partition has status 'CO', you want the earliest row instead; in other words, from all the first rows, you keep only those whose (clientid, date) combination was not already captured in Step 2. You can do this with a left join, keeping the rows for which there is no match in the table generated by Step 2 (i.e. the joined values are NULL).
SELECT rd.rownum,
rd.clientid,
rd.date,
rd.status,
rd.updateDate
FROM ranked_dates rd
LEFT JOIN np_vd_active_status s
ON rd.clientid = s.clientid
AND rd.date = s.date
WHERE rd.firstUpdateDate = 1
AND s.rownum IS NULL
Step 4: apply a UNION of the Step 2 and Step 3 results. If you want to order by the rownum field, you can do that easily with an ORDER BY clause.
Final Query:
WITH ranked_dates AS (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY clientid, date
ORDER BY updateDate ) AS firstUpdateDate,
ROW_NUMBER() OVER(PARTITION BY clientid, date
ORDER BY updateDate DESC) AS lastUpdateDate
FROM tab
), np_vd_active_status AS (
SELECT rd.rownum,
rd.clientid,
rd.date,
rd.status,
rd.updateDate
FROM ranked_dates rd
WHERE rd.lastUpdateDate = 1
AND rd.status IN ('NonPay', 'VD', 'Active')
)
SELECT rd.rownum,
rd.clientid,
rd.date,
rd.status,
rd.updateDate
FROM ranked_dates rd
LEFT JOIN np_vd_active_status s
ON rd.clientid = s.clientid
AND rd.date = s.date
WHERE rd.firstUpdateDate = 1
AND s.rownum IS NULL
UNION
SELECT *
FROM np_vd_active_status

Extracting unique price values from dataframe depending on real estate id

I've got a dataframe with data taken from a database like this:
conn = sqlite3.connect('REDB.db')
dataAvg1 = pd.read_sql_query(
"SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, HOUSEINFO.RE_POLOHA, HOUSEINFO.RE_DRUH, HOUSEINFO.RE_TYP, HOUSEINFO.RE_UPLOCHA FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, HOUSEINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=HOUSEINFO.INF_ID",conn
)
dataAvg2 = pd.read_sql_query(
"SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, FLATINFO.RE_DISPOZICE, FLATINFO.RE_DRUH, FLATINFO.RE_PPLOCHA FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, FLATINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=FLATINFO.INF_ID",conn
)
dataAvg3 = pd.read_sql_query(
"SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, LANDINFO.RE_PLOCHA, LANDINFO.RE_DRUH, LANDINFO.RE_SITE, LANDINFO.RE_KOMUNIKACE FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, LANDINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=LANDINFO.INF_ID",conn
)
conn.close()
df2 = [dataAvg1, dataAvg2, dataAvg3]
dfAvg = pd.concat(df2)
dfAvg = dfAvg.reset_index(drop=True)
The main columns are UNIQUE_RE_NUMBER, RE_PRICE and UPDATE_DATE. I would like to count the frequency of price changes each day. Ideally, I'd create a new column called 'FREQUENCY' and add a number for each day. For example:
UPDATE_DAY UNIQUE_RE_NUMBER RE_PRICE FREQUENCY
1.1.2021 1 500 2
1.1.2021 2 400 2
2.1.2021 1 500 1
2.1.2021 2 450 1
I hope this example is understandable.
Right now I have something like this:
dfAvg['FREQUENCY'] = dfAvg.groupby('UPDATE_DATE')['UPDATE_DATE'].transform('count')
dfAvg.drop_duplicates(subset=['UPDATE_DATE'], inplace=True)
This code counts every price added that day, so when the price of a real estate was 500 on 1.1.2021 and it is also 500 the next day, that counts as a "change" in price, even though the price stayed the same, and I don't want to count that. I would like to count only distinct price values for each real estate. Is that possible?
Not sure if this is the most efficient way, but maybe it helps:
def ident_deltas(sdf):
    return sdf.assign(
        DELTA=(sdf.RE_PRICE.shift(1) != sdf.RE_PRICE).astype(int)
    )

def sum_deltas(sdf):
    return sdf.assign(FREQUENCY=sdf.DELTA.sum())

df = (
    df.groupby("UNIQUE_RE_NUMBER").apply(ident_deltas)
      .groupby("UPDATE_DAY").apply(sum_deltas)
      .drop(columns="DELTA")
)
Result for
df =
UPDATE_DAY UNIQUE_RE_NUMBER RE_PRICE
0 2021-01-01 1 500
1 2021-01-01 2 400
2 2021-02-01 1 500
3 2021-02-01 2 450
is
UPDATE_DAY UNIQUE_RE_NUMBER RE_PRICE FREQUENCY
0 2021-01-01 1 500 2
1 2021-01-01 2 400 2
2 2021-02-01 1 500 1
3 2021-02-01 2 450 1
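For what it's worth, a transform-based variant of the same idea avoids the two apply passes; this is a hedged sketch reusing the column names of the example above (UPDATE_DAY, UNIQUE_RE_NUMBER, RE_PRICE) and assuming df is that same DataFrame:
df = df.sort_values(["UNIQUE_RE_NUMBER", "UPDATE_DAY"])
# 1 where the price differs from that real estate's previous price
# (the first observation of each real estate also counts as a change)
df["DELTA"] = (
    df.groupby("UNIQUE_RE_NUMBER")["RE_PRICE"]
      .transform(lambda s: s.ne(s.shift()))
      .astype(int)
)
# number of real estates whose price changed on each day
df["FREQUENCY"] = df.groupby("UPDATE_DAY")["DELTA"].transform("sum")
df = df.drop(columns="DELTA")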

Calculate aggregate value of column row by row

My apologies for the vague title; it's hard to put what I want into words.
I'm trying to build a filled line chart with the date on the x axis and the total transactions over time on the y axis.
My data
The object is a pandas dataframe.
date | symbol | type | qty | total
----------------------------------------------
2020-09-10 ABC Buy 5 10
2020-10-18 ABC Buy 2 20
2020-09-19 ABC Sell 3 15
2020-11-05 XYZ Buy 10 8
2020-12-03 XYZ Buy 10 9
2020-12-05 ABC Buy 2 5
What I want
date | symbol | type | qty | total | aggregate_total
------------------------------------------------------------
2020-09-10 ABC Buy 5 10 10
2020-10-18 ABC Buy 2 20 10+20 = 30
2020-09-19 ABC Sell 3 15 10+20-15 = 15
2020-11-05 XYZ Buy 10 8 8
2020-12-03 XYZ Buy 10 9 8+9 = 17
2020-12-05 ABC Buy 2 5 10+20-15+5 = 20
Where I am now
I'm working with two nested for loops: one iterating over the symbols, one iterating over the rows. I store the temporary results in lists. I'm still unsure how I will add the results to the final dataframe; I could reorder the dataframe by symbol and date, concatenate the temp lists, and finally assign that combined list to a new column.
The code below is just the inner loop over the rows.
# running totals, seeded with 0 so the first row can index position 0
temp_agg_total = [0]
temp_agg_qty = [0]

af = df.loc[df['symbol'] == 'ABC']
for i in range(0, af.shape[0]):
    # print(af.iloc[0:i, [2, 4]])
    # if type is a buy, we add the last operation to the aggregate
    if af.iloc[i, 2] == "BUY":
        temp_agg_total.append(temp_agg_total[i] + af.iloc[i, 4])
        temp_agg_qty.append(temp_agg_qty[i] + af.iloc[i, 3])
    else:
        temp_agg_total.append(temp_agg_total[i] - af.iloc[i, 4])
        temp_agg_qty.append(temp_agg_qty[i] - af.iloc[i, 3])
# Remove the first element of each list (the seed 0)
temp_agg_total.pop(0)
temp_agg_qty.pop(0)
af = af.assign(agg_total=temp_agg_total,
               agg_qty=temp_agg_qty)
My question
Is there a better way to do this in pandas or numpy? It feels really heavy for something relatively simple.
The presence of the Buy/Sell type of operation complicates things.
Regards
# negate the total of Sell rows
df.loc[df['type'] == 'Sell', 'total'] *= -1
# cumulative sum of the (signed) total per symbol
df['aggregate_total'] = df.groupby('symbol')['total'].cumsum()
Is this what you're looking for?
df['Agg'] = 1
df.loc[df['type'] == 'Sell', 'Agg'] = -1
df['Agg'] = df['Agg'] * df['total']
# cumulate per symbol and keep the result
df['aggregate_total'] = df.groupby('symbol')['Agg'].cumsum()
df["Type_num"] = df["type"].map({"Buy":1,"Sell":-1})
df["Num"] = df.Type_num*df.total
df.groupby(["symbol"],as_index=False)["Num"].cumsum()
pd.concat([df,df.groupby(["symbol"],as_index=False)["Num"].cumsum()],axis=1)
date symbol type qty total Type_num Num CumNum
0 2020-09-10 ABC Buy 5 10 1 10 10
1 2020-10-18 ABC Buy 2 20 1 20 30
2 2020-09-19 ABC Sell 3 15 -1 -15 15
3 2020-11-05 XYZ Buy 10 8 1 8 8
4 2020-12-03 XYZ Buy 10 9 1 9 17
5 2020-12-05 ABC Buy 2 5 1 5 20
The most important thing here is the cumulative sum. The grouping makes sure the cumulative sum is computed separately for each symbol. The renaming and dropping of columns should be easy from there.
The trick is mapping {Buy, Sell} to {1, -1} so that sells are subtracted.

Pandas Time Series: Remove Rows Per ID

I have a Pandas dataframe of the form:
Date ID Temp
2019/03/27 1 23
2019/04/27 2 32
2019/04/27 1 42
2019/04/28 1 41
2019/01/27 2 33
2019/08/27 2 23
What do I need to do?
Select rows that are at least 30 days older than the latest measurement for each ID.
i.e. the latest date for ID = 2 is 2019/08/27, so for ID = 2 I need to select rows that are at least 30 days older; the 2019/08/27 row for ID = 2 will itself be dropped.
Similarly, the latest date for ID = 1 is 2019/04/28, which means I can keep rows for ID = 1 only if the date is earlier than 2019/03/28 (30 days older). So the 2019/04/27 row with ID = 1 will be dropped.
How to do this in Pandas. Any help is greatly appreciated.
Thank you.
Final dataframe will be:
Date ID Temp
2019/03/27 1 23
2019/04/27 2 32
2019/01/27 2 33
In your case, use groupby + transform('last') and filter the original df:
Yourdf = df[df.Date < df.groupby('ID').Date.transform('last') - pd.Timedelta('30 days')].copy()
Date ID Temp
0 2019-03-27 1 23
1 2019-04-27 2 32
4 2019-01-27 2 33
Notice I add .copy() at the end to prevent the SettingWithCopyWarning.
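Here is a self-contained variant of the same idea as a hedged sketch; it parses Date explicitly and uses transform('max') instead of 'last' so the result does not depend on row order:
import pandas as pd

df = pd.DataFrame({
    "Date": ["2019/03/27", "2019/04/27", "2019/04/27", "2019/04/28", "2019/01/27", "2019/08/27"],
    "ID":   [1, 2, 1, 1, 2, 2],
    "Temp": [23, 32, 42, 41, 33, 23],
})
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")

# latest measurement date per ID
latest = df.groupby("ID")["Date"].transform("max")
# keep rows that are more than 30 days older than that latest date
out = df[df["Date"] < latest - pd.Timedelta("30 days")].copy()
print(out)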

How to check date and time of max values in large data set Python

I have data sets that are roughly 30-60,000,000 lines each. Each Name has one or more unique IDs associated with it for every day in the data set. For some OP_DATE and OP_HOUR values, a unique ID can have 0 or blank values for Load1, Load2, and Load3.
I'm looking for a way to find the maximum values of the Load columns over all OP_DATE values, for data that looks like this:
Name ID OP_DATE OP_HOUR OP_TIME Load1 Load2 Load3
OMI 1 2001-01-01 1 1 11 10 12
OMI 1 2001-01-01 2 0.2 1 12 10
.
.
OMI 2A 2001-01-01 1 0.4 5
.
.
OMI 2A 2001-01-01 24 0.6 2 7 12
.
.
Kain 2 01 2002-01-01 1 0.1 6 12
Kain 2 01 2002-01-01 2 0.98 3 14 7
.
.
OMI 1 2018-01-01 1 0.89 12 10 20
.
.
I want to find the maximum values of Load1, Load2, Load3, and find what OP_DATE, OP_TIME and OP_HOUR that it occurred on.
The output I want is:
Name ID max OP_DATE max OP_HOUR max OP_TIME max Load1 max Load2 max Load3
OMI 1 2011-06-11 22 ..... max values on dates
OMI 2A 2012-02-01 12 ..... max values on dates
Kain 2 01 2006-01-01 1..... max values on dates
Is there a way I can do this easily?
I've tried:
unique_MAX = df.groupby(['Name','ID'])['Load1', 'Load2', 'Load3'].max().reset_index()
But this would group only by the dates and give me a total maximum - I'd like the associated dates, hours, and times as well.
To get the full row of information for a given field's max:
Get the index locations for the max of each group you desire
Use those indexes to return the full row at each location
An example for finding the max Load1 for each Name & ID pair:
idx = df.groupby(['Name','ID'])['Load1'].transform(max) == df['Load1']
df[idx]
Out[14]:
name ID dt x y
1 Fred 050 1/2/2018 2 4
4 Dave 001 1/3/2018 6 1
5 Carly 002 1/3/2018 5 7
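Along the same lines, if you want one row per (Name, ID) pair for each Load column, a groupby + idxmax sketch could look like this (column names follow the question, df is assumed to be the loaded data, and the Load columns are assumed numeric; an untested outline):
# one result frame per Load column: the full row where that Load peaks
# within each (Name, ID) group
results = {}
for col in ["Load1", "Load2", "Load3"]:
    idx = df.groupby(["Name", "ID"])[col].idxmax()
    results[col] = df.loc[idx, ["Name", "ID", "OP_DATE", "OP_HOUR", "OP_TIME", col]]

print(results["Load1"])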
