Given a graphlab SFrame where there's a column with dates, e.g.:
+-------+------------+---------+-----------+
| Store | Date | Sales | Customers |
+-------+------------+---------+-----------+
| 1 | 2015-07-31 | 5263.0 | 555.0 |
| 2 | 2015-07-31 | 6064.0 | 625.0 |
| 3 | 2015-07-31 | 8314.0 | 821.0 |
| 4 | 2015-07-31 | 13995.0 | 1498.0 |
| 3 | 2015-07-20 | 4822.0 | 559.0 |
| 2 | 2015-07-10 | 5651.0 | 589.0 |
| 4 | 2015-07-11 | 15344.0 | 1414.0 |
| 5 | 2015-07-23 | 8492.0 | 833.0 |
| 2 | 2015-07-19 | 8565.0 | 687.0 |
| 10 | 2015-07-09 | 7185.0 | 681.0 |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]
Is there an easy way in GraphLab, or another Python function, to convert the Date column into separate Year, Month and Day columns?
+-------+------+----+----+---------+-----------+
| Store | YYYY | MM | DD | Sales | Customers |
+-------+------+----+----+---------+-----------+
| 1 | 2015 | 07 | 31 | 5263.0 | 555.0 |
| 2 | 2015 | 07 | 31 | 6064.0 | 625.0 |
| 3 | 2015 | 07 | 31 | 8314.0 | 821.0 |
+-------+------+----+----+---------+-----------+
[986159 rows x 6 columns]
In pandas, I can do this as shown in "Which is the fastest way to extract day, month and year from a given date?".
But converting an SFrame to a pandas DataFrame just to split the date and then converting back to an SFrame is quite a chore.
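For reference, the pandas equivalent is straightforward with the dt accessor; a minimal sketch with made-up sample data:
import pandas as pd

df = pd.DataFrame({'Date': ['2015-07-31', '2015-07-20']})
df['Date'] = pd.to_datetime(df['Date'])  # parse the strings into datetimes
df['YYYY'] = df['Date'].dt.year
df['MM'] = df['Date'].dt.month
df['DD'] = df['Date'].dt.day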
You could also do it with the split_datetime method, which gives you a bit more flexibility.
sf.add_columns(sf['Date'].split_datetime(column_name_prefix = ''))
The split_datetime method itself is on the SArray (a single column of the SFrame), and it returns an SFrame which you can then add back to the original data (at basically zero cost).
A quick and dirty way to do this is
sf['date2'] = sf['Date'].apply(lambda x: x.split('-'))
sf = sf.unpack('date2')
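If the unpacked columns follow the default naming (date2.0, date2.1, date2.2), they can then be renamed to match the desired layout; a sketch assuming that naming:
sf = sf.rename({'date2.0': 'YYYY', 'date2.1': 'MM', 'date2.2': 'DD'})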
Another option would be to convert the Date column to a datetime type, then use the graphlab.SArray.split_datetime function.
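A minimal sketch of that route, assuming the Date column holds 'YYYY-MM-DD' strings (str_to_datetime and split_datetime are SArray methods in GraphLab Create; the limit argument restricts the output to year, month and day):
import graphlab as gl

sf = gl.SFrame({'Store': [1, 2],
                'Date': ['2015-07-31', '2015-07-31'],
                'Sales': [5263.0, 6064.0]})

# Parse the string dates, then split into separate year/month/day columns.
parsed = sf['Date'].str_to_datetime('%Y-%m-%d')
sf = sf.add_columns(parsed.split_datetime(column_name_prefix='',
                                          limit=['year', 'month', 'day']))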
I need to append a pandas Series as a row to the end of a pandas Dataframe. What makes this tricky is that I am using dates as my index, which are not unique in my case. This is what I want the result to be, with the date values as the index.
+───────────+─────────+──────────────+──────────+────────+
| | counts | day of week | weekend | month |
+───────────+─────────+──────────────+──────────+────────+
| 8/5/2015 | 1111 | 2 | FALSE | 8 |
| 8/5/2015 | 1076 | 3 | FALSE | 8 |
| 8/5/2015 | 1060 | 4 | FALSE | 8 |
| 8/6/2015 | 1540 | 5 | TRUE | 8 |
| 8/7/2015 | 1493 | 6 | TRUE | 8 |
| 8/7/2015 | 1060 | 0 | FALSE | 8 |
| 8/7/2015 | 1113 | 1 | FALSE | 8 |
| 8/8/2015 | 1027 | 2 | FALSE | 8 |
| 8/8/2015 | 1053 | 3 | FALSE | 8 |
| 8/8/2015 | 1051 | 4 | FALSE | 8 |
| 8/8/2015 | 1278 | 5 | TRUE | 8 |
| 8/8/2015 | 1086 | 6 | TRUE | 8 |
+───────────+─────────+──────────────+──────────+────────+
While this was easily possible with the append method, it is being deprecated and I am not sure that concat can replicate all of its functionality. (On a side note, why does the pandas team keep deprecating great functionality?).
My solution involves the loc method:
label_name = len(df)
df.loc[label_name] = series_row
df = df.rename(index={label_name: series_row.name})
In case you don't follow, we insert a new row at the end of the Dataframe. If we stop there, the label name will be an int value, specifically the size of the Dataframe.
df.loc[len(df)] = series_row
+───+─────────+──────────────+──────────+────────+
| | counts | day of week | weekend | month |
+───+─────────+──────────────+──────────+────────+
| 1 | 1111 | 2 | FALSE | 8 |
+───+─────────+──────────────+──────────+────────+
To keep the append method's functionality, we need to rename the label to whatever we want, which in this case is a date string.
df = df.rename(index={label_name: series_row.name})
+───────────+─────────+──────────────+──────────+────────+
| | counts | day of week | weekend | month |
+───────────+─────────+──────────────+──────────+────────+
| 8/5/2015 | 1111 | 2 | FALSE | 8 |
+───────────+─────────+──────────────+──────────+────────+
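Putting it together, a minimal self-contained sketch of the loc + rename approach (the sample values are illustrative):
import pandas as pd

df = pd.DataFrame({'counts': [1076], 'day of week': [3],
                   'weekend': [False], 'month': [8]},
                  index=['8/5/2015'])

# Row to append; its .name becomes the desired (non-unique) index label.
series_row = pd.Series({'counts': 1111, 'day of week': 2,
                        'weekend': False, 'month': 8}, name='8/5/2015')

label_name = len(df)              # temporary integer label
df.loc[label_name] = series_row   # insert the row at the end of the Dataframe
df = df.rename(index={label_name: series_row.name})
For what it's worth, pd.concat([df, series_row.to_frame().T]) also replicates the relevant part of append here: the Series name carries over as the new row's index label.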
I'm trying to find the best practice when it comes to dumping Prometheus data to Parquet.
My data consists of known metrics, but an unknown number of monitored clients: VMs, hosts, pods, containers, etc.
What I want first is to monitor all pods that have a fixed number of tracked labels.
So I'm planning on structuring my data like this.
| ds | item | label_x | label_n | year | month | day |
|:--------------------|:-------|----------:|----------:|-------:|--------:|------:|
| 2020-06-01 00:00:00 | b | 24 | 37 | 2020 | 6 | 1 |
| 2020-06-02 00:00:00 | b | 17 | 37 | 2020 | 6 | 2 |
| 2020-06-03 00:00:00 | b | 40 | 24 | 2020 | 6 | 3 |
| 2020-06-04 00:00:00 | b | 1 | 22 | 2020 | 6 | 4 |
| 2020-06-05 00:00:00 | b | 34 | 4 | 2020 | 6 | 5 |
I create an item column to identify each client I have.
Then I'd save the next client, which would come labeled as c.
| ds | item | label_x | label_n | year | month | day |
|:--------------------|:-------|----------:|----------:|-------:|--------:|------:|
| 2020-06-01 00:00:00 | c | 24 | 22 | 2020 | 6 | 1 |
| 2020-06-02 00:00:00 | c | 16 | 12 | 2020 | 6 | 2 |
| 2020-06-03 00:00:00 | c | 1 | 18 | 2020 | 6 | 3 |
| 2020-06-04 00:00:00 | c | 1 | 28 | 2020 | 6 | 4 |
| 2020-06-05 00:00:00 | c | 45 | 4 | 2020 | 6 | 5 |
and just append them.
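A minimal sketch of that append step with dask, assuming client_df is a pandas frame holding one client's rows and db is the dataset path (both names are illustrative); partition_on and append are dask.dataframe.to_parquet options:
import dask.dataframe as dd

db = 'prometheus_metrics.parquet'  # hypothetical dataset path

# client_df: pandas DataFrame for one client (e.g. item == 'c'), shaped as above.
ddf_c = dd.from_pandas(client_df, npartitions=1)
ddf_c.to_parquet(db, partition_on=['year', 'month'],
                 append=True, ignore_divisions=True)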
Then, to retrieve all items in a month, for instance, I could just:
filters = [('year', '==', 2020), ('month', '==', 6)]
ddf = dd.read_parquet(db, columns=['item'], filters=filters)
items = ddf[['item']].drop_duplicates().compute()
and grab a unique item by:
filters = [('item', '==', 'b'), ('year', '==', 2020), ('month', '==', 6)]
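And the corresponding read-back for that single item would be (again assuming db is the dataset path):
import dask.dataframe as dd

filters = [('item', '==', 'b'), ('year', '==', 2020), ('month', '==', 6)]
ddf_b = dd.read_parquet(db, filters=filters)
df_b = ddf_b.compute()  # materialize that one client's slice as pandas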
Is there anything I'm missing?
Regards,
Carlos.
I have a table presented below.
+------+-------+--------------+------------------------------+------+
| good | store | date_id | map_dates | sale |
+------+-------+--------------+------------------------------+------+
| 1 | 2 | '2019-01-01' | ['2018-07-08'] | 10 |
+------+-------+--------------+------------------------------+------+
| 1 | 2 | '2019-05-06' | ['2019-01-01', '2018-07-08'] | 5 |
+------+-------+--------------+------------------------------+------+
| 5 | 4 | '2019-10-12' | ['2018-12-01'] | 24 |
+------+-------+--------------+------------------------------+------+
| 1 | 2 | '2018-07-08' | [] | 3 |
+------+-------+--------------+------------------------------+------+
| 5 | 4 | '2018-12-01' | [] | 15 |
+------+-------+--------------+------------------------------+------+
I want to group by columns good, store, and include only the dates specified in the map_dates column in the result. For example:
+------+-------+--------------+----------+
| good | store | date_id | sum_sale |
+------+-------+--------------+----------+
| 1 | 2 | '2019-01-01' | 3 |
+------+-------+--------------+----------+
| 1 | 2 | '2019-05-06' | 13 |
+------+-------+--------------+----------+
| 5 | 4 | '2019-10-12' | 15 |
+------+-------+--------------+----------+
How can I do this without using a loop?
First we explode, then we match our values by an inner merge on good, store, map_dates and date_id. Finally we GroupBy.sum:
dfn = df.explode('map_dates')
dfn = dfn.merge(dfn,
                left_on=['good', 'store', 'map_dates'],
                right_on=['good', 'store', 'date_id'],
                suffixes=['', '_sum'])
dfn = dfn.groupby(['good', 'store', 'date_id'])['sale_sum'].sum().reset_index()
   good  store     date_id  sale_sum
0     1      2  2019-01-01         3
1     1      2  2019-05-06        13
2     5      4  2019-10-12        15
I have about 2 billion records and I want to group the data with PySpark and save each group to its own CSV file.
Here is my sample Dataframe:
+----+------+---------------------+
| id | name | date |
+----+------+---------------------+
| 1 | a | 2019-12-01 00:00:00 |
+----+------+---------------------+
| 2 | b | 2019-12-01 00:00:00 |
+----+------+---------------------+
| 3 | c | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 4 | a | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 5 | b | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 6 | a | 2020-01-05 00:00:00 |
+----+------+---------------------+
| 7 | b | 2020-01-05 00:00:00 |
+----+------+---------------------+
Then I use groupBy to group them with this code:
df.groupBy([
'name',
year('date').alias('year'),
month('date').alias('month')
]).count()
output:
+------+------+-------+-------+
| name | year | month | count |
+------+------+-------+-------+
| a | 2019 | 12 | 1 |
+------+------+-------+-------+
| b | 2019 | 12 | 1 |
+------+------+-------+-------+
| c | 2020 | 01 | 1 |
+------+------+-------+-------+
| a | 2020 | 01 | 2 |
+------+------+-------+-------+
| b | 2020 | 01 | 2 |
+------+------+-------+-------+
But I want each group's elements in a Dataframe, like this:
+------+------+-------+-----------+
| name | year | month | element |
+------+------+-------+-----------+
| a | 2019 | 12 | Dataframe |
+------+------+-------+-----------+
| b | 2019 | 12 | Dataframe |
+------+------+-------+-----------+
| c | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
| a | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
| b | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
Where "element column" contains grouped Dataframe in each group then I want to map each group and save them to separate csv.
Note: I have tried to use distinct and collect for grouping then select data for each group, but performance is too slow for my huge data. I think groupBy is faster, so I want to use groupBy instead.
How to do it in PySpark ?
You can achieve your goal using withColumn and lit:
from pyspark.sql.functions import year, month, lit

df.groupBy(['name', year('date').alias('year'), month('date').alias('month')]) \
  .count() \
  .withColumn('element', lit('Dataframe'))
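That said, for the stated goal of writing each group's rows to a separate CSV, a partitioned write is a common alternative that avoids materializing per-group Dataframes; a minimal sketch (the output path is illustrative):
from pyspark.sql.functions import year, month

(df
 .withColumn('year', year('date'))
 .withColumn('month', month('date'))
 .write
 .partitionBy('name', 'year', 'month')   # one output directory per group
 .csv('/tmp/grouped_output', header=True, mode='overwrite'))
Each (name, year, month) combination ends up in its own directory of CSV part files.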
df3 = pd.read_excel(r'may_2019.xlsx', sheet_name='Sheet2')
Here is Sample of my Pandas Dataframe:
+--------------------------+
| Col1 |
+--------------------------+
| G | 20 mins | 2015 |
| NR | 2 |
| G | 11 mins | 302 |
| TV-MA | 44 mins | Apr 30 |
| G | 198 |
| TV-MA | Apr 30 |
| NR | 2012 |
| NR | 57 mins |
+--------------------------+
There are some exceptions in the data (i.e. 2, 198, 302).
Output Desired for Given Sample :
+--------+----------+------+-------+-----+
| Rating | Duration | Year | Month | Day |
+--------+----------+------+-------+-----+
| G | 20 | 2015 | | |
| NR | | 2 | | |
| G | 11 | 302 | | |
| TV-MA | 44 | | Apr | 30 |
| G | | 198 | | |
| TV-MA | | | Jan | 20 |
| NR | | 2012 | | |
| NR | 57 | | | |
+--------+----------+------+-------+-----+
Things I've tried
df5=pd.DataFrame(df3.Col1.str.split("|").tolist(),columns=['r','d','y'])
indx=df5.loc[df5.d.str.contains('\d{4}')].index
df6.loc[indx,['d','y']]=df5.loc[indx,['d','y']].shift(1,axis=1)
Then I failed to shift the dates according to my required table, so I tried to create a function, but that also did not work.
import re

def split_data(input):
    newd = input.split("|")
    if len(newd) == 3:
        df['date'] = newd[2]
        df['du'] = newd[1]
        df['rating'] = newd[0]
    if len(newd) == 2:
        df['rating'] = newd[0]
        if re.findall(r'\d{4}', newd[1]):
            df['date'] = newd[1]
        else:
            df['du'] = newd[1]
    return df
The things I've tried don't provide a complete solution for all cases.
So does anyone know how to do it with Pandas?
Looking at your inputs, I would first try reading in the data properly; it seems the separators etc. of the Excel file are not being handled as intended.
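If the data really does land in a single Col1 column as shown, here is a minimal sketch of splitting it on the '|' delimiter and sorting the pieces into the desired columns (the classification rules are assumptions based on the sample):
import re
import pandas as pd

# Illustrative sample mirroring the question's data.
df3 = pd.DataFrame({'Col1': ['G | 20 mins | 2015',
                             'NR | 2',
                             'TV-MA | 44 mins | Apr 30']})

def classify(cell):
    # Sort the '|'-separated pieces into Rating/Duration/Year/Month/Day.
    pieces = [p.strip() for p in cell.split('|')]
    row = {'Rating': pieces[0], 'Duration': None, 'Year': None,
           'Month': None, 'Day': None}
    for piece in pieces[1:]:
        if 'mins' in piece:
            row['Duration'] = re.search(r'\d+', piece).group()
        elif re.fullmatch(r'\d+', piece):
            row['Year'] = piece
        else:
            m = re.match(r'([A-Za-z]+)\s+(\d+)', piece)
            if m:
                row['Month'], row['Day'] = m.group(1), m.group(2)
    return pd.Series(row)

result = df3['Col1'].apply(classify)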