Splitting a Graphlab SFrame Date column into three columns (Year Month Day) - python

Given a graphlab SFrame where there's a column with dates, e.g.:
+-------+------------+---------+-----------+
| Store | Date | Sales | Customers |
+-------+------------+---------+-----------+
| 1 | 2015-07-31 | 5263.0 | 555.0 |
| 2 | 2015-07-31 | 6064.0 | 625.0 |
| 3 | 2015-07-31 | 8314.0 | 821.0 |
| 4 | 2015-07-31 | 13995.0 | 1498.0 |
| 3 | 2015-07-20 | 4822.0 | 559.0 |
| 2 | 2015-07-10 | 5651.0 | 589.0 |
| 4 | 2015-07-11 | 15344.0 | 1414.0 |
| 5 | 2015-07-23 | 8492.0 | 833.0 |
| 2 | 2015-07-19 | 8565.0 | 687.0 |
| 10 | 2015-07-09 | 7185.0 | 681.0 |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]
Is there an easy way in graphlab, or some other Python function, to convert the Date column to Year|Month|Day?
+-------+------+----+----+---------+-----------+
| Store | YYYY | MM | DD | Sales | Customers |
+-------+------+----+----+---------+-----------+
| 1 | 2015 | 07 | 31 | 5263.0 | 555.0 |
| 2 | 2015 | 07 | 31 | 6064.0 | 625.0 |
| 3 | 2015 | 07 | 31 | 8314.0 | 821.0 |
+-------+------+----+----+---------+-----------+
[986159 rows x 6 columns]
In pandas, I can do this (see "Which is the fastest way to extract day, month and year from a given date?").
But converting an SFrame to pandas, splitting the date, and converting back to an SFrame is quite a chore.

You could also do it with the split_datetime method. It gives you a bit more flexibility.
sf.add_columns(sf['Date'].split_datetime(column_name_prefix=''))
The split_datetime method itself is on the SArray (a single column of the SFrame), and it returns an SFrame, which you can then add back to the original data (at essentially zero cost).

A quick and dirty way to do this is:
sf['date2'] = sf['Date'].apply(lambda x: x.split('-'))
sf = sf.unpack('date2')
Another option would be to convert the Date column to a datetime type, then use the graphlab.SArray.split_datetime function.
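To see what the two approaches above compute for a single value, here is a minimal plain-Python sketch that does not use graphlab at all. It assumes the Date values are strings in YYYY-MM-DD form, as in the sample table; the helper names split_by_dash and split_by_datetime are hypothetical.

```python
from datetime import datetime


def split_by_dash(date_str):
    """String approach: what lambda x: x.split('-') yields for one value."""
    return date_str.split('-')


def split_by_datetime(date_str, fmt='%Y-%m-%d'):
    """Datetime approach: parse first, then read the integer fields."""
    parsed = datetime.strptime(date_str, fmt)
    return parsed.year, parsed.month, parsed.day


print(split_by_dash('2015-07-31'))      # ['2015', '07', '31']
print(split_by_datetime('2015-07-31'))  # (2015, 7, 31)
```

The string split keeps the zero padding and returns strings, while the datetime route returns integers and raises ValueError on malformed dates, which may matter depending on what the new columns are used for.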

Related

How do you append a set of Series with duplicate label names (index) to the end of a DataFrame as rows without using the deprecated append method?

I need to append a pandas Series as a row to the end of a pandas DataFrame. What makes this tricky is that I am using dates as my index, which are not unique in my case. This is the result I want, with the date values as the index.
+───────────+─────────+──────────────+──────────+────────+
| | counts | day of week | weekend | month |
+───────────+─────────+──────────────+──────────+────────+
| 8/5/2015 | 1111 | 2 | FALSE | 8 |
| 8/5/2015 | 1076 | 3 | FALSE | 8 |
| 8/5/2015 | 1060 | 4 | FALSE | 8 |
| 8/6/2015 | 1540 | 5 | TRUE | 8 |
| 8/7/2015 | 1493 | 6 | TRUE | 8 |
| 8/7/2015 | 1060 | 0 | FALSE | 8 |
| 8/7/2015 | 1113 | 1 | FALSE | 8 |
| 8/8/2015 | 1027 | 2 | FALSE | 8 |
| 8/8/2015 | 1053 | 3 | FALSE | 8 |
| 8/8/2015 | 1051 | 4 | FALSE | 8 |
| 8/8/2015 | 1278 | 5 | TRUE | 8 |
| 8/8/2015 | 1086 | 6 | TRUE | 8 |
+───────────+─────────+──────────────+──────────+────────+
While this was easy with the append method, append is being deprecated, and I am not sure that concat can replicate all of its functionality. (On a side note, why does the pandas team keep deprecating great functionality?)
My solution involves the loc method:
label_name = len(df)
df.loc[label_name] = series_row
df = df.rename(index={label_name: series_row.name})
In case you don't follow: we first insert a new row at the end of the DataFrame. If we stop there, the label will be an int value, specifically the size of the DataFrame before the insert.
label_name = len(df)
df.loc[label_name] = series_row
+───+─────────+──────────────+──────────+────────+
| | counts | day of week | weekend | month |
+───+─────────+──────────────+──────────+────────+
| 1 | 1111 | 2 | FALSE | 8 |
+───+─────────+──────────────+──────────+────────+
To keep the append method's functionality, we then rename that label to whatever we want, which in this case is a date string.
df = df.rename(index={label_name: series_row.name})
+───────────+─────────+──────────────+──────────+────────+
| | counts | day of week | weekend | month |
+───────────+─────────+──────────────+──────────+────────+
| 8/5/2015 | 1111 | 2 | FALSE | 8 |
+───────────+─────────+──────────────+──────────+────────+
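On the question of whether concat can do the same job, here is a minimal sketch that appends a Series as a row while keeping duplicate index labels. The sample values are made up for illustration, and the variable names mirror the ones used above.

```python
import pandas as pd

# Hypothetical sample data: a DataFrame indexed by non-unique date strings.
df = pd.DataFrame(
    {'counts': [1111, 1076], 'day of week': [2, 3],
     'weekend': [False, False], 'month': [8, 8]},
    index=['8/5/2015', '8/5/2015'],
)

# The row to append; its name becomes the index label of the new row.
series_row = pd.Series(
    {'counts': 1540, 'day of week': 5, 'weekend': True, 'month': 8},
    name='8/6/2015',
)

# to_frame().T turns the Series into a one-row DataFrame labelled by its name.
result = pd.concat([df, series_row.to_frame().T])

print(result.index.tolist())  # ['8/5/2015', '8/5/2015', '8/6/2015']
```

One thing to watch for: the transposed row has object dtype, so the combined columns may come back as object as well. Calling result.infer_objects() afterwards is one way to recover numeric and boolean dtypes.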

Dump prometheus data to parquet

I'm trying to find the best practice when it comes to dumping prometheus data to parquet.
My data consists of a known set of monitored metrics but an unknown number of monitored clients: VMs, hosts, pods, containers, etc.
What I want first is to monitor all pods that have a fixed number of tracked labels.
So I'm planning on structuring my data like this.
| ds | item | label_x | label_n | year | month | day |
|:--------------------|:-------|----------:|----------:|-------:|--------:|------:|
| 2020-06-01 00:00:00 | b | 24 | 37 | 2020 | 6 | 1 |
| 2020-06-02 00:00:00 | b | 17 | 37 | 2020 | 6 | 2 |
| 2020-06-03 00:00:00 | b | 40 | 24 | 2020 | 6 | 3 |
| 2020-06-04 00:00:00 | b | 1 | 22 | 2020 | 6 | 4 |
| 2020-06-05 00:00:00 | b | 34 | 4 | 2020 | 6 | 5 |
That is, an item column identifies each client I have.
Then I'd save the next client, labeled c:
| ds | item | label_x | label_n | year | month | day |
|:--------------------|:-------|----------:|----------:|-------:|--------:|------:|
| 2020-06-01 00:00:00 | c | 24 | 22 | 2020 | 6 | 1 |
| 2020-06-02 00:00:00 | c | 16 | 12 | 2020 | 6 | 2 |
| 2020-06-03 00:00:00 | c | 1 | 18 | 2020 | 6 | 3 |
| 2020-06-04 00:00:00 | c | 1 | 28 | 2020 | 6 | 4 |
| 2020-06-05 00:00:00 | c | 45 | 4 | 2020 | 6 | 5 |
and just append them.
Then, to retrieve all items in a month, for instance, I could just do:
import dask.dataframe as dd
filters = [('year', '==', 2020), ('month', '==', 6)]
ddf = dd.read_parquet(db, columns=['item'], filters=filters)
items = ddf[['item']].drop_duplicates().compute()
and grab a single item with:
filters = [('item', '==', 'b'), ('year', '==', 2020), ('month', '==', 6)]
Is there anything I'm missing?
Regards,
Carlos.
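As a way of reasoning about the filters in the question, here is a minimal pure-Python sketch of how a flat list of (column, op, value) tuples selects rows: every condition has to hold (logical AND). It only mimics the semantics on in-memory dictionaries and is not how dask or pyarrow evaluate filters internally; the function name row_matches and the sample rows are hypothetical.

```python
import operator

# Comparison operators used in filter tuples (a subset, for illustration).
OPS = {'==': operator.eq, '!=': operator.ne, '<': operator.lt,
       '<=': operator.le, '>': operator.gt, '>=': operator.ge}


def row_matches(row, filters):
    """Return True when the row satisfies every (column, op, value) condition."""
    return all(OPS[op](row[col], value) for col, op, value in filters)


# Hypothetical rows shaped like the tables in the question.
rows = [
    {'item': 'b', 'year': 2020, 'month': 6, 'day': 1},
    {'item': 'c', 'year': 2020, 'month': 6, 'day': 1},
    {'item': 'b', 'year': 2020, 'month': 7, 'day': 1},
]

filters = [('year', '==', 2020), ('month', '==', 6)]
items = sorted({row['item'] for row in rows if row_matches(row, filters)})
print(items)  # ['b', 'c']
```

In practice such filters tend to pay off most when the dataset is partitioned on the filtered columns, so that whole files can be skipped. Whether that applies here depends on how the data is written.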

How do I make a grouping that includes only values from the list?

I have a table presented below.
+------+-------+--------------+------------------------------+------+
| good | store | date_id | map_dates | sale |
+------+-------+--------------+------------------------------+------+
| 1 | 2 | '2019-01-01' | ['2018-07-08'] | 10 |
+------+-------+--------------+------------------------------+------+
| 1 | 2 | '2019-05-06' | ['2019-01-01', '2018-07-08'] | 5 |
+------+-------+--------------+------------------------------+------+
| 5 | 4 | '2019-10-12' | ['2018-12-01'] | 24 |
+------+-------+--------------+------------------------------+------+
| 1 | 2 | '2018-07-08' | [] | 3 |
+------+-------+--------------+------------------------------+------+
| 5 | 4 | '2018-12-01' | [] | 15 |
+------+-------+--------------+------------------------------+------+
I want to group by columns good, store, and include only the dates specified in the map_dates column in the result. For example:
+------+-------+--------------+----------+
| good | store | date_id | sum_sale |
+------+-------+--------------+----------+
| 1 | 2 | '2019-01-01' | 3 |
+------+-------+--------------+----------+
| 1 | 2 | '2019-05-06' | 13 |
+------+-------+--------------+----------+
| 5 | 4 | '2019-10-12' | 15 |
+------+-------+--------------+----------+
How can I do this without using a loop?
First we explode, then we match our values by an inner merge on good, store, map_dates and date_id. Finally we GroupBy.sum:
dfn = df.explode('map_dates')
dfn = dfn.merge(dfn,
                left_on=['good', 'store', 'map_dates'],
                right_on=['good', 'store', 'date_id'],
                suffixes=['', '_sum'])
dfn = dfn.groupby(['good', 'store', 'date_id'])['sale_sum'].sum().reset_index()
good store date_id sale_sum
0 1 2 2019-01-01 3
1 1 2 2019-05-06 13
2 5 4 2019-10-12 15
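For reference, here is a self-contained sketch of the explode, merge, and sum steps using the sample data from the question. It differs from the snippet above in one respect, which is my own variation: the right-hand side of the merge is the original, un-exploded frame, which I believe avoids counting a sale more than once when the looked-up row itself lists several map_dates.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'good': [1, 1, 5, 1, 5],
    'store': [2, 2, 4, 2, 4],
    'date_id': ['2019-01-01', '2019-05-06', '2019-10-12',
                '2018-07-08', '2018-12-01'],
    'map_dates': [['2018-07-08'], ['2019-01-01', '2018-07-08'],
                  ['2018-12-01'], [], []],
    'sale': [10, 5, 24, 3, 15],
})

# One row per (date_id, mapped date); empty lists become NaN
# and find no partner in the inner merge.
exploded = df.explode('map_dates')

# Look up the sale of each mapped date in the original frame.
lookup = df[['good', 'store', 'date_id', 'sale']]
merged = exploded.merge(
    lookup,
    left_on=['good', 'store', 'map_dates'],
    right_on=['good', 'store', 'date_id'],
    suffixes=('', '_sum'),
)

result = (merged.groupby(['good', 'store', 'date_id'])['sale_sum']
          .sum()
          .reset_index())
print(result)
```

On this sample both variants give the same sums (3, 13 and 15), since none of the looked-up rows has more than one entry in map_dates.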

PySpark groupBy with grouped elements

I have about 2 billion records, and I want to group the data with PySpark and save each group to CSV.
Here is my sample DataFrame:
+----+------+---------------------+
| id | name | date |
+----+------+---------------------+
| 1 | a | 2019-12-01 00:00:00 |
+----+------+---------------------+
| 2 | b | 2019-12-01 00:00:00 |
+----+------+---------------------+
| 3 | c | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 4 | a | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 5 | b | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 6 | a | 2020-01-05 00:00:00 |
+----+------+---------------------+
| 7 | b | 2020-01-05 00:00:00 |
+----+------+---------------------+
Then I use groupBy to group them with this code:
df.groupBy([
    'name',
    year('date').alias('year'),
    month('date').alias('month')
]).count()
output:
+------+------+-------+-------+
| name | year | month | count |
+------+------+-------+-------+
| a | 2019 | 12 | 1 |
+------+------+-------+-------+
| b | 2019 | 12 | 1 |
+------+------+-------+-------+
| c | 2020 | 01 | 1 |
+------+------+-------+-------+
| a | 2020 | 01 | 2 |
+------+------+-------+-------+
| b | 2020 | 01 | 2 |
+------+------+-------+-------+
But I want the elements of each group as a DataFrame, like this:
+------+------+-------+-----------+
| name | year | month | element |
+------+------+-------+-----------+
| a | 2019 | 12 | Dataframe |
+------+------+-------+-----------+
| b | 2019 | 12 | Dataframe |
+------+------+-------+-----------+
| c | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
| a | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
| b | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
Here the "element" column contains the grouped DataFrame for each group; I then want to map over the groups and save each one to a separate CSV file.
Note: I have tried to use distinct and collect for grouping and then select the data for each group, but performance is too slow for my huge data. I think groupBy is faster, so I want to use groupBy instead.
How can I do this in PySpark?
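You can achieve your goal using withColumn and lit. Note that groupBy returns a GroupedData object, so an aggregation such as count() is needed before withColumn can be called:
from pyspark.sql.functions import lit, month, year
df.groupBy(['name', year('date').alias('year'), month('date').alias('month')]) \
    .count() \
    .withColumn('element', lit('Dataframe'))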

How to shift selected rows to next adjacent column in pandas?

df3=pd.read_excel(r'may_2019.xlsx',sheet_name='Sheet2')
Here is a sample of my pandas DataFrame:
+--------------------------+
| Col1 |
+--------------------------+
| G | 20 mins | 2015 |
| NR | 2 |
| G | 11 mins | 302 |
| TV-MA | 44 mins | Apr 30 |
| G | 198 |
| TV-MA | Apr 30 |
| NR | 2012 |
| NR | 57 mins |
+--------------------------+
There are some exceptions in the data (e.g. 2, 198, 302).
Desired output for the given sample:
+--------+----------+------+-------+-----+
| Rating | Duration | Year | Month | Day |
+--------+----------+------+-------+-----+
| G | 20 | 2015 | | |
| NR | | 2 | | |
| G | 11 | 302 | | |
| TV-MA | 44 | | Apr | 30 |
| G | | 198 | | |
| TV-MA | | | Apr | 30 |
| NR | | 2012 | | |
| NR | 57 | | | |
+--------+----------+------+-------+-----+
Things I've tried:
df5 = pd.DataFrame(df3.Col1.str.split("|").tolist(), columns=['r', 'd', 'y'])
indx = df5.loc[df5.d.str.contains(r'\d{4}')].index
df6.loc[indx, ['d', 'y']] = df5.loc[indx, ['d', 'y']].shift(1, axis=1)
After that I failed to shift the date values to match my required table,
so I tried to write a function, but that did not work either.
def split_data(input):
    newd = input.split("|")
    if len(newd) == 3:
        df['date'] = newd[2]
        df['du'] = newd[1]
        df['rating'] = newd[0]
    if len(newd) == 2:
        df['rating'] = newd[0]
        if re.findall(r'\d{4}', newd[1]):
            df['date'] = newd[1]
        else:
            df['du'] = newd[1]
    return df
The things I've tried don't provide a complete solution for all cases.
So does anyone know how to do this with pandas?
Looking at your inputs, I would first try reading in the data properly. It seems the separators etc. of the Excel file were not defined correctly.
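If the column really does arrive as a single pipe-delimited string, one option is to classify each token by its shape rather than by its position. The following is a minimal plain-Python sketch under that assumption; the function name parse_row and the classification rules (a token ending in "mins" is a duration, a "Mon DD" token is a month and day, any bare number is a year) are my own guesses based on the sample and the desired output.

```python
import re


def parse_row(text):
    """Split 'Rating | Duration | Date' text and file each token by its shape."""
    tokens = [t.strip() for t in text.split('|')]
    record = {'Rating': tokens[0], 'Duration': '', 'Year': '',
              'Month': '', 'Day': ''}
    for token in tokens[1:]:
        duration = re.fullmatch(r'(\d+)\s*mins?', token)
        month_day = re.fullmatch(r'([A-Za-z]{3})\s+(\d{1,2})', token)
        if duration:
            record['Duration'] = duration.group(1)
        elif month_day:
            record['Month'], record['Day'] = month_day.groups()
        elif token.isdigit():
            # Bare numbers (2015, but also the exceptions 2, 198, 302) go to Year.
            record['Year'] = token
    return record


samples = ['G | 20 mins | 2015', 'NR | 2', 'TV-MA | 44 mins | Apr 30',
           'NR | 57 mins']
for sample in samples:
    print(parse_row(sample))
```

With pandas, the same function could be applied to the column and expanded into a new frame, for example with pd.DataFrame(df3['Col1'].map(parse_row).tolist()), assuming Col1 holds the raw strings.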
