Is it possible to fill_forward only some columns when upsampling using Polars?
For example, whould like to fill in the missing dates in sample dataframe (see code below). 'upsample' and 'forward_fill' works beautifully and is blazing fast with a much larger dataset. The output is as expected in the table below. No issues, all as expected.
However, and the question: is it possible to exclude a column for the forward_fill, so for example it returns blanks in the 'utc_time' column instead of filling the time. I have tried listing the columns in select statement by replacing 'pl.all() with pl.col([...])', but that just removes the column that is not listed.
import polars as pl
from datetime import datetime
df = pl.DataFrame(
{
'utc_created':pl.date_range(low=datetime(2021, 12, 16), high=datetime(2021, 12, 22, 0), interval="2d"),
'utc_time':['21:12:06','21:20:06','17:51:10','03:54:49'],
'sku':[9100000801,9100000801,9100000801,9100000801],
'old':[18,17,16,15],
'new':[17,16,15,14],
'alert_type':['Inventory','Inventory','Inventory','Inventory'],
'alert_level':['Info','Info','Info','Info']
}
)
df = df.upsample(
time_column = 'utc_created',every="1d", by =('sku'))
.select(pl.all()
.forward_fill()
)
Returns:
| utc_created | utc_time | sku | old | new | alert_type | alert_level |
| 2021-12-16 00:00:00 | 21:12:06 | 9100000801 | 18 | 17 | Inventory | Info |
| 2021-12-17 00:00:00 | 21:12:06 | 9100000801 | 18 | 17 | Inventory | Info |
| 2021-12-18 00:00:00 | 21:20:06 | 9100000801 | 17 | 16 | Inventory | Info |
| 2021-12-19 00:00:00 | 21:20:06 | 9100000801 | 17 | 16 | Inventory | Info |
| 2021-12-20 00:00:00 | 17:51:10 | 9100000801 | 16 | 15 | Inventory | Info |
| 2021-12-21 00:00:00 | 17:51:10 | 9100000801 | 16 | 15 | Inventory | Info |
| 2021-12-22 00:00:00 | 03:54:49 | 9100000801 | 15 | 14 | Inventory | Info |
You can use pl.exclude(name).forward_fill() e.g.
.with_column(pl.exclude("utc_time").forward_fill())
Related
I'm trying to find the best practice when it comes to dumping prometheus data to parquet.
My data consists of known metric data monitored but unknown number of clients monitored being: VMs, Hosts, Pods, Containers, etc.
What I want first it to monitor all pods that have a fixed number of labels being tracked.
So I'm planning on structuring my data like this.
| ds | item | label_x | label_n | year | month | day |
|:--------------------|:-------|----------:|----------:|-------:|--------:|------:|
| 2020-06-01 00:00:00 | b | 24 | 37 | 2020 | 6 | 1 |
| 2020-06-02 00:00:00 | b | 17 | 37 | 2020 | 6 | 2 |
| 2020-06-03 00:00:00 | b | 40 | 24 | 2020 | 6 | 3 |
| 2020-06-04 00:00:00 | b | 1 | 22 | 2020 | 6 | 4 |
| 2020-06-05 00:00:00 | b | 34 | 4 | 2020 | 6 | 5 |
Creating an item column for each client I have.
Then I'd save the next client which you come labeled as c.
| ds | item | label_x | label_n | year | month | day |
|:--------------------|:-------|----------:|----------:|-------:|--------:|------:|
| 2020-06-01 00:00:00 | c | 24 | 22 | 2020 | 6 | 1 |
| 2020-06-02 00:00:00 | c | 16 | 12 | 2020 | 6 | 2 |
| 2020-06-03 00:00:00 | c | 1 | 18 | 2020 | 6 | 3 |
| 2020-06-04 00:00:00 | c | 1 | 28 | 2020 | 6 | 4 |
| 2020-06-05 00:00:00 | c | 45 | 4 | 2020 | 6 | 5 |
and just append them.
then to retrieve all items in a month for instance I could just:
filters = [('year', '==', 2020), ('month', '==', 6)]
ddf = dd.read_parquet(db, columns=['item'], filters=filters)
items = ddf[['item']].drop_duplicates().compute()
and grab an unique item by:
filters = [('item', '==', 'b'), ('year', '==', 2020), ('month', '==', 6)]
Is there anything I'm missing?
Regards,
Carlos.
I have a datasource which gives me the following dataframe, pricehistory:
+---------------------+------------+------------+------------+------------+----------+------+
| time | close | high | low | open | volume | red |
+---------------------+------------+------------+------------+------------+----------+------+
| | | | | | | |
| 2020-01-02 10:14:00 | 321.336177 | 321.505186 | 321.286468 | 321.505186 | 311601.0 | True |
| 2020-01-02 11:16:00 | 321.430623 | 321.465419 | 321.395827 | 321.465419 | 42678.0 | True |
| 2020-01-02 11:17:00 | 321.425652 | 321.445536 | 321.375944 | 321.440565 | 39827.0 | True |
| 2020-01-02 11:33:00 | 321.137343 | 321.261614 | 321.137343 | 321.261614 | 102805.0 | True |
| 2020-01-02 12:11:00 | 321.256643 | 321.266585 | 321.241731 | 321.266585 | 25629.0 | True |
| 2020-01-02 12:12:00 | 321.246701 | 321.266585 | 321.231789 | 321.266585 | 40869.0 | True |
| 2020-01-02 13:26:00 | 321.226818 | 321.266585 | 321.226818 | 321.261614 | 44011.0 | True |
| 2020-01-03 10:18:00 | 320.839091 | 320.958392 | 320.828155 | 320.958392 | 103351.0 | True |
| 2020-01-03 10:49:00 | 320.988217 | 321.077692 | 320.988217 | 321.057809 | 84492.0 | True |
| etc... | etc... | etc... | etc... | etc... | etc... | etc. |
+---------------------+------------+------------+------------+------------+----------+------+
Output of pricehistory.dtypes:
close float64
high float64
low float64
open float64
volume float64
red bool
dtype: object
Output of pricehistory.index.dtype:
dtype('<M8[ns]')
Note: This dataframe is large, each row is 1-min of data and spans for months, so there are many time frames to iterate over.
Question:
I have some specific criteria I'd like to use that will become columns in a new dataframe:
High price and time (minute) of each day for the entire dataframe
The first occurrence of 4 downward trending minutes open < close during the day with their respective times
So far, I'm not exactly sure how to pull the time (datetimeindex value) and high price from pricehistory.
For (1) above, I'm using pd.DataFrame(pricehistory.high.groupby(pd.Grouper(freq='D')).max()) which gives me:
+------------+------------+
| time | high |
+------------+------------+
| | |
| 2020-01-02 | 322.956677 |
| 2020-01-03 | 321.753729 |
| 2020-01-04 | NaN |
| 2020-01-05 | NaN |
| 2020-01-06 | 321.843204 |
| etc... | etc... |
+------------+------------+
But this doesn't work because it's only giving me the day and not down to the minute, and using min as the Grouper freq doesn't work because then it's just the max value of each min, which is high.
Desired outcome (note: minutes included):
+---------------------+------------+
| time | high |
+---------------------+------------+
| | |
| 2020-01-02 9:31:00 | 322.956677 |
| 2020-01-03 10:13:11 | 321.753729 |
| 2020-01-04 15:33:12 | 320.991231 |
| 2020-01-06 12:01:23 | 321.843204 |
| etc... | etc... |
+---------------------+------------+
For (2) above, I'm using the following:
pricehistory['red'] = pricehistory['close'].lt(pricehistory['open'])
To make a new column in pricehistory which shows us if there are 4 red minutes in a row.
Then, using new_pricehistory = pricehistory.loc[pricehistory[::-1].rolling(4)['red'].sum().eq(4)], this gives a new dataframe of only the rows where 4 red minutes in a row occur, preferably I'd like to only have the very first occurrence, not all.
Current output:
+---------------------+------------+------------+------------+------------+--------+------+
| time | close | high | low | open | volume | red |
+---------------------+------------+------------+------------+------------+--------+------+
| | | | | | | |
| 2020-01-02 10:14:00 | 321.336177 | 321.505186 | 321.286468 | 321.505186 | 311601 | TRUE |
| 2020-01-03 10:18:00 | 320.839091 | 320.958392 | 320.828155 | 320.958392 | 103351 | TRUE |
| 2020-01-06 10:49:00 | 320.520956 | 320.570665 | 320.501073 | 320.550781 | 71901 | TRUE |
+---------------------+------------+------------+------------+------------+--------+------+
Given you didn't provide data I create some dummy one. By SO policy you should make different question per problem. For now I'm answering the first one.
Generate data
import pandas as pd
import numpy as np
times = pd.date_range(start="2020-06-01", end="2020-06-10", freq="1T")
df = pd.DataFrame({"time":times,
"high":np.random.randn(len(times))})
Question 1
Here I just look for index where the maximum per day occour and filter df accordingly
idx = df.groupby(df["time"].dt.date)["high"].idxmax().values
df[df.index.isin(idx)]
Update: In case you have time as index in your df the solution will be
df = df.set_index("time")
idx = df.groupby(pd.Grouper(freq='D'))["high"].idxmax().values
df[df.index.isin(idx)]
Question 2
import pandas as pd
import numpy as np
# generate data
times = pd.date_range(start="2020-06-01", end="2020-06-10", freq="1T")
df = pd.DataFrame({"time":times,
"open":np.random.randn(len(times))})
df["open"] = np.where(df["open"]<0, -1 * df["open"], df["open"])
df["close"] = df["open"] + 0.01 *np.random.randn(len(times))
df = df.set_index("time")
df["red"] = df['close'].lt(df['open'])
# this function return the first time
# when there are 4 consecutive red
def get_first(ts):
idx = ts.loc[ts[::-1].rolling(4)['red'].sum().ge(4)].index
if idx.empty:
return pd.NaT
else:
return idx[0]
# get first time within group and drop nan
grp = df.groupby(pd.Grouper(freq='D'))\
.apply(get_first).dropna()
df[df.index.isin(grp.values)]
I have about 2 billion records and I want to group data with PySpark and save each grouped data to csv.
Here is my sample Dataframe:
+----+------+---------------------+
| id | name | date |
+----+------+---------------------+
| 1 | a | 2019-12-01 00:00:00 |
+----+------+---------------------+
| 2 | b | 2019-12-01 00:00:00 |
+----+------+---------------------+
| 3 | c | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 4 | a | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 5 | b | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 6 | a | 2020-01-05 00:00:00 |
+----+------+---------------------+
| 7 | b | 2020-01-05 00:00:00 |
+----+------+---------------------+
Then I use groupBy to group them with this code:
df.groupBy([
'name',
year('date').alias('year'),
month('date').alias('month')
]).count()
output:
+------+------+-------+-------+
| name | year | month | count |
+------+------+-------+-------+
| a | 2019 | 12 | 1 |
+------+------+-------+-------+
| b | 2019 | 12 | 1 |
+------+------+-------+-------+
| c | 2020 | 01 | 1 |
+------+------+-------+-------+
| a | 2020 | 01 | 2 |
+------+------+-------+-------+
| b | 2020 | 01 | 2 |
+------+------+-------+-------+
But I want each group elements in Dataframe like this:
+------+------+-------+-----------+
| name | year | month | element |
+------+------+-------+-----------+
| a | 2019 | 12 | Dataframe |
+------+------+-------+-----------+
| b | 2019 | 12 | Dataframe |
+------+------+-------+-----------+
| c | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
| a | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
| b | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
Where "element column" contains grouped Dataframe in each group then I want to map each group and save them to separate csv.
Note: I have tried to use distinct and collect for grouping then select data for each group, but performance is too slow for my huge data. I think groupBy is faster, so I want to use groupBy instead.
How to do it in PySpark ?
you can achive your goal using withcolumn and lit
df.groupBy(['name',year('date').alias('year'),month('date').alias('month')])
.withColumn('element',lit('Dataframe'))
df3=pd.read_excel(r'may_2019.xlsx',sheet_name='Sheet2')
Here is Sample of my Pandas Dataframe:
+--------------------------+
| Col1 |
+--------------------------+
| G | 20 mins | 2015 |
| NR | 2 |
| G | 11 mins | 302 |
| TV-MA | 44 mins | Apr 30 |
| G | 198 |
| TV-MA | Apr 30 |
| NR | 2012 |
| NR | 57 mins |
+--------------------------+
there are some exception in data(i.e: 2,198,302)
Output Desired for Given Sample :
+--------+----------+------+-------+-----+
| Rating | Duration | Year | Month | Day |
+--------+----------+------+-------+-----+
| G | 20 | 2015 | | |
| NR | | 2 | | |
| G | 11 | 302 | | |
| TV-MA | 44 | | Apr | 30 |
| G | | 198 | | |
| TV-MA | | | Jan | 20 |
| NR | | 2012 | | |
| NR | 57 | | | |
+--------+----------+------+-------+-----+
Things I've tried
df5=pd.DataFrame(df3.Col1.str.split("|").tolist(),columns=['r','d','y'])
indx=df5.loc[df5.d.str.contains('\d{4}')].index
df6.loc[indx,['d','y']]=df5.loc[indx,['d','y']].shift(1,axis=1)
then I failed to shift date according to my required table
so I tried to create function but that also not worked.
def split_data(input):
newd=input.split("|")
if len(newd)==3:
df['date']=newd[2]
df['du']=newd[1]
df['rating']=newd[0]
if len(newd)==2:
df['rating']=newd[0]
if re.findall('\d{4}',newd[1]):
df['date']=newd[1]
else:
df['du']=newd[1]
return df
Things I've tried doen't provide a complete solution for all cases.
So Does anyone know how to do it with Pandas?
Looking at your inputs, i would first try reading in the data properly - it seems you fail in defining the separators etc. of the excel file
Given a graphlab SFrame where there's a column with dates, e.g.:
+-------+------------+---------+-----------+
| Store | Date | Sales | Customers |
+-------+------------+---------+-----------+
| 1 | 2015-07-31 | 5263.0 | 555.0 |
| 2 | 2015-07-31 | 6064.0 | 625.0 |
| 3 | 2015-07-31 | 8314.0 | 821.0 |
| 4 | 2015-07-31 | 13995.0 | 1498.0 |
| 3 | 2015-07-20 | 4822.0 | 559.0 |
| 2 | 2015-07-10 | 5651.0 | 589.0 |
| 4 | 2015-07-11 | 15344.0 | 1414.0 |
| 5 | 2015-07-23 | 8492.0 | 833.0 |
| 2 | 2015-07-19 | 8565.0 | 687.0 |
| 10 | 2015-07-09 | 7185.0 | 681.0 |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]
Is there an easy way in graphlab / other python function to convert the Date column to Year|Month|Day?
+-------+------+----+----+---------+-----------+
| Store | YYYY | MM | DD | Sales | Customers |
+-------+------+----+----+---------+-----------+
| 1 | 2015 | 07 | 31 | 5263.0 | 555.0 |
| 2 | 2015 | 07 | 31 | 6064.0 | 625.0 |
| 3 | 2015 | 07 | 31 | 8314.0 | 821.0 |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]
In pandas, I can do this: Which is the fastest way to extract day, month and year from a given date?
But to convert an SFrame into Panda to split date and convert back into SFrame is quite a chore.
You could also do it with the split-datetime method. It gives you a bit more flexibility.
sf.add_columns(sf['Date'].split_datetime(column_name_prefix = ''))
The split_datetime method itself is on the SArray (a single column of the SFrame) and it returns an SFrame which you can then add back to the original data (at basically 0 cost)
A quick and dirty way to do this is
sf['date2'] = sf['Date'].apply(lambda x: x.split('-'))
sf = sf.unpack('date2')
Another option would be to convert the Date column to a datetime type, then use the graphlab.SArray.split_datetime function.