I have about 2 billion records, and I want to group the data with PySpark and save each group to its own CSV file.
Here is my sample DataFrame:
+----+------+---------------------+
| id | name | date |
+----+------+---------------------+
| 1 | a | 2019-12-01 00:00:00 |
+----+------+---------------------+
| 2 | b | 2019-12-01 00:00:00 |
+----+------+---------------------+
| 3 | c | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 4 | a | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 5 | b | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 6 | a | 2020-01-05 00:00:00 |
+----+------+---------------------+
| 7 | b | 2020-01-05 00:00:00 |
+----+------+---------------------+
Then I use groupBy to group them with this code:
from pyspark.sql.functions import year, month
df.groupBy([
    'name',
    year('date').alias('year'),
    month('date').alias('month')
]).count()
output:
+------+------+-------+-------+
| name | year | month | count |
+------+------+-------+-------+
| a | 2019 | 12 | 1 |
+------+------+-------+-------+
| b | 2019 | 12 | 1 |
+------+------+-------+-------+
| c | 2020 | 01 | 1 |
+------+------+-------+-------+
| a | 2020 | 01 | 2 |
+------+------+-------+-------+
| b | 2020 | 01 | 2 |
+------+------+-------+-------+
But I want the rows of each group collected into a DataFrame, like this:
+------+------+-------+-----------+
| name | year | month | element |
+------+------+-------+-----------+
| a | 2019 | 12 | Dataframe |
+------+------+-------+-----------+
| b | 2019 | 12 | Dataframe |
+------+------+-------+-----------+
| c | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
| a | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
| b | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
Here the "element" column contains the DataFrame of rows belonging to each group. I then want to map over the groups and save each one to a separate CSV file.
Note: I have tried using distinct and collect to get the groups and then selecting the data for each group, but that is too slow for data of this size. I think groupBy is faster, so I would like to use groupBy instead.
How can I do this in PySpark?
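You can achieve this with withColumn and lit. groupBy returns a GroupedData object, which has no withColumn method, so apply an aggregation such as count() first:
from pyspark.sql.functions import year, month, lit
df.groupBy(['name', year('date').alias('year'), month('date').alias('month')]) \
    .count().withColumn('element', lit('Dataframe'))
Note that this puts the literal string 'Dataframe' in every row; it does not nest each group's rows in the column.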
I have a DataFrame as shown below.
Example 1
In some DataFrames, the prefix of the tagname column up to and including the last full stop '.' is always 9 characters long.
| id | tagname | datetime |
|----|---------------------------------|-------------------------|
| 1 | A2.WAM_A.ACCESS_CLOSE_FAULT_LH | 2022-01-20 15:07:36.310 |
| 2 | A2.WAM_A.ACCESS_SENSOR_FAULT_RH | 2022-01-20 15:07:36.310 |
| 3 | A2.WAM_A.OUTPUT_POWER_CP_FAULT | 2022-01-20 15:07:36.310 |
If I use
df['tagname'] = df['tagname'].str[9:]
I get this output:
| id | tagname | datetime |
|----|------------------------|-------------------------|
| 1 | ACCESS_CLOSE_FAULT_LH | 2022-01-20 15:07:36.310 |
| 2 | ACCESS_SENSOR_FAULT_RH | 2022-01-20 15:07:36.310 |
| 3 | OUTPUT_POWER_CP_FAULT | 2022-01-20 15:07:36.310 |
Example 2
But in some tables the prefix has a different length and contains multiple dots before the last full stop '.', as below:
| id | tagname | datetime |
|----|----------------------------------------------|-------------------------|
| 1 | A2.AC.CH.CONDITION.1ST_VACUUM_CH | 2021-09-28 17:31:48.191 |
| 2 | A2.AC.CH.CONDITION.SMALL_LEAK_TEST_VACUUM_CH | 2021-09-28 17:31:48.193 |
| 3 | A2.AC.CH.CONDITION.VACC_VALUE_CH_FLOAT_R270 | 2021-09-28 17:31:48.196 |
| 4 | A2.CP01.PRL2_TRIM1.ONCHANGE.PRL2 | 2021-09-28 17:31:48.199 |
| 5 | AY2.CP01.DL5_TRIM1.ONCHANGE.DL5 | 2021-09-28 17:31:48.199 |
Requirement (desired output)
Remove all characters in the tagname column up to and including the last dot '.':
| id | tagname | datetime |
|----|---------------------------|-------------------------|
| 1 | 1ST_VACUUM_CH | 2021-09-28 17:31:48.191 |
| 2 | SMALL_LEAK_TEST_VACUUM_CH | 2021-09-28 17:31:48.193 |
| 3 | VACC_VALUE_CH_FLOAT_R270 | 2021-09-28 17:31:48.196 |
| 4 | PRL2 | 2021-09-28 17:31:48.199 |
| 5 | DL5 | 2021-09-28 17:31:48.199 |
Use str.rsplit:
Example 1:
df1['tagname'] = df1['tagname'].str.rsplit('.', n=1).str[1]
print(df1)
# Output
id tagname datetime
0 1 ACCESS_CLOSE_FAULT_LH 2022-01-20 15:07:36.310
1 2 ACCESS_SENSOR_FAULT_RH 2022-01-20 15:07:36.310
2 3 OUTPUT_POWER_CP_FAULT 2022-01-20 15:07:36.310
Example 2:
df2['tagname'] = df2['tagname'].str.rsplit('.', n=1).str[1]
print(df2)
# Output
id tagname datetime
0 1 1ST_VACUUM_CH 2021-09-28 17:31:48.191
1 2 SMALL_LEAK_TEST_VACUUM_CH 2021-09-28 17:31:48.193
2 3 VACC_VALUE_CH_FLOAT_R270 2021-09-28 17:31:48.196
3 4 PRL2 2021-09-28 17:31:48.199
4 5 DL5 2021-09-28 17:31:48.199
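One edge case worth knowing, shown as a small sketch: when a value contains no dot at all, rsplit returns a one-element list, so .str[1] yields NaN while .str[-1] returns the whole string. The value NO_DOT_TAG is a hypothetical addition; the other tags come from the question.

```python
import pandas as pd

tags = pd.Series([
    "A2.WAM_A.ACCESS_CLOSE_FAULT_LH",
    "A2.AC.CH.CONDITION.1ST_VACUUM_CH",
    "AY2.CP01.DL5_TRIM1.ONCHANGE.DL5",
    "NO_DOT_TAG",  # hypothetical value without any dot
])

# Split once from the right, then pick a piece of the resulting list.
pieces = tags.str.rsplit(".", n=1)
by_index_1 = pieces.str[1]    # NaN when there was nothing to split
by_index_m1 = pieces.str[-1]  # falls back to the full string

print(pd.DataFrame({"str[1]": by_index_1, "str[-1]": by_index_m1}))
```

Which behavior is preferable depends on whether an undotted tag should be kept as is or flagged as missing.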
I'm trying to find the best practice for dumping Prometheus data to Parquet.
My data consists of a known set of metrics, but an unknown number of monitored clients: VMs, hosts, pods, containers, etc.
What I want first is to monitor all pods, which have a fixed number of tracked labels.
So I'm planning to structure my data like this:
| ds | item | label_x | label_n | year | month | day |
|:--------------------|:-------|----------:|----------:|-------:|--------:|------:|
| 2020-06-01 00:00:00 | b | 24 | 37 | 2020 | 6 | 1 |
| 2020-06-02 00:00:00 | b | 17 | 37 | 2020 | 6 | 2 |
| 2020-06-03 00:00:00 | b | 40 | 24 | 2020 | 6 | 3 |
| 2020-06-04 00:00:00 | b | 1 | 22 | 2020 | 6 | 4 |
| 2020-06-05 00:00:00 | b | 34 | 4 | 2020 | 6 | 5 |
with an item column identifying each client I have.
Then I'd save the next client, labeled c here:
| ds | item | label_x | label_n | year | month | day |
|:--------------------|:-------|----------:|----------:|-------:|--------:|------:|
| 2020-06-01 00:00:00 | c | 24 | 22 | 2020 | 6 | 1 |
| 2020-06-02 00:00:00 | c | 16 | 12 | 2020 | 6 | 2 |
| 2020-06-03 00:00:00 | c | 1 | 18 | 2020 | 6 | 3 |
| 2020-06-04 00:00:00 | c | 1 | 28 | 2020 | 6 | 4 |
| 2020-06-05 00:00:00 | c | 45 | 4 | 2020 | 6 | 5 |
and just append them.
Then, to retrieve all items in a month, for instance, I could just do:
filters = [('year', '==', 2020), ('month', '==', 6)]
ddf = dd.read_parquet(db, columns=['item'], filters=filters)
items = ddf[['item']].drop_duplicates().compute()
and grab a single item with:
filters = [('item', '==', 'b'), ('year', '==', 2020), ('month', '==', 6)]
Is there anything I'm missing?
Regards,
Carlos.
I have a table presented below.
+------+-------+--------------+------------------------------+------+
| good | store | date_id | map_dates | sale |
+------+-------+--------------+------------------------------+------+
| 1 | 2 | '2019-01-01' | ['2018-07-08'] | 10 |
+------+-------+--------------+------------------------------+------+
| 1 | 2 | '2019-05-06' | ['2019-01-01', '2018-07-08'] | 5 |
+------+-------+--------------+------------------------------+------+
| 5 | 4 | '2019-10-12' | ['2018-12-01'] | 24 |
+------+-------+--------------+------------------------------+------+
| 1 | 2 | '2018-07-08' | [] | 3 |
+------+-------+--------------+------------------------------+------+
| 5 | 4 | '2018-12-01' | [] | 15 |
+------+-------+--------------+------------------------------+------+
I want to group by the columns good and store and, for each row, sum sale over only the dates listed in its map_dates column. For example:
+------+-------+--------------+----------+
| good | store | date_id | sum_sale |
+------+-------+--------------+----------+
| 1 | 2 | '2019-01-01' | 3 |
+------+-------+--------------+----------+
| 1 | 2 | '2019-05-06' | 13 |
+------+-------+--------------+----------+
| 5 | 4 | '2019-10-12' | 15 |
+------+-------+--------------+----------+
How can I do this without using a loop?
First we explode, then we match our values by an inner merge on good, store, map_dates and date_id. Finally we GroupBy.sum:
dfn = df.explode('map_dates')
dfn = dfn.merge(dfn,
                left_on=['good', 'store', 'map_dates'],
                right_on=['good', 'store', 'date_id'],
                suffixes=['', '_sum'])
dfn = dfn.groupby(['good', 'store', 'date_id'])['sale_sum'].sum().reset_index()
good store date_id sale_sum
0 1 2 2019-01-01 3
1 1 2 2019-05-06 13
2 5 4 2019-10-12 15
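One caveat with the self-merge above: the right-hand side is the exploded frame, so a date_id whose own map_dates list has several entries appears more than once and would be counted repeatedly if another row referenced it. The sample data does not hit this case. The sketch below is one possible variant that merges the exploded frame against the original, un-exploded rows instead; it assumes each (good, store, date_id) combination occurs only once in df, and the names lookup, exploded, merged, and result are made up for the illustration.

```python
import pandas as pd

# Sample table from the question.
df = pd.DataFrame({
    "good": [1, 1, 5, 1, 5],
    "store": [2, 2, 4, 2, 4],
    "date_id": ["2019-01-01", "2019-05-06", "2019-10-12", "2018-07-08", "2018-12-01"],
    "map_dates": [["2018-07-08"], ["2019-01-01", "2018-07-08"], ["2018-12-01"], [], []],
    "sale": [10, 5, 24, 3, 15],
})

# One row per (good, store, date) with its sale, renamed so it can be
# joined directly on the exploded map_dates values.
lookup = df[["good", "store", "date_id", "sale"]].rename(
    columns={"date_id": "map_dates", "sale": "sum_sale"}
)

# One row per referenced date; rows with an empty list become NaN and are dropped.
exploded = df.explode("map_dates")[["good", "store", "date_id", "map_dates"]]
exploded = exploded.dropna(subset=["map_dates"])

merged = exploded.merge(lookup, on=["good", "store", "map_dates"], how="inner")
result = merged.groupby(["good", "store", "date_id"], as_index=False)["sum_sale"].sum()
print(result)
```

On the sample data this gives the same three rows as the desired output in the question.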
df3=pd.read_excel(r'may_2019.xlsx',sheet_name='Sheet2')
Here is a sample of my pandas DataFrame:
+--------------------------+
| Col1 |
+--------------------------+
| G | 20 mins | 2015 |
| NR | 2 |
| G | 11 mins | 302 |
| TV-MA | 44 mins | Apr 30 |
| G | 198 |
| TV-MA | Apr 30 |
| NR | 2012 |
| NR | 57 mins |
+--------------------------+
There are some exceptions in the data (e.g. 2, 198, 302).
Desired output for the given sample:
+--------+----------+------+-------+-----+
| Rating | Duration | Year | Month | Day |
+--------+----------+------+-------+-----+
| G | 20 | 2015 | | |
| NR | | 2 | | |
| G | 11 | 302 | | |
| TV-MA | 44 | | Apr | 30 |
| G | | 198 | | |
| TV-MA | | | Apr | 30 |
| NR | | 2012 | | |
| NR | 57 | | | |
+--------+----------+------+-------+-----+
Things I've tried
df5=pd.DataFrame(df3.Col1.str.split("|").tolist(),columns=['r','d','y'])
indx=df5.loc[df5.d.str.contains(r'\d{4}')].index
df5.loc[indx,['d','y']]=df5.loc[indx,['d','y']].shift(1,axis=1)
But then I could not get the date fields shifted into the layout of my required table,
so I tried writing a function instead, but that did not work either.
import re
def split_data(input):
    newd=input.split("|")
    if len(newd)==3:
        df['date']=newd[2]
        df['du']=newd[1]
        df['rating']=newd[0]
    if len(newd)==2:
        df['rating']=newd[0]
        if re.findall(r'\d{4}',newd[1]):
            df['date']=newd[1]
        else:
            df['du']=newd[1]
    return df
The things I've tried don't provide a complete solution for all cases.
So does anyone know how to do this with pandas?
Looking at your inputs, I would first try reading the data in properly. It seems the separators and similar options of the Excel file were not defined correctly when it was read.
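If re-reading the Excel file with different separators is not an option, the column can also be taken apart after loading. The following is a minimal sketch under assumed rules inferred from the sample only: the rating is the first field, a duration is a number followed by "mins", a field made of digits only is a year, and a three-letter word followed by a number is a month and day. The regular expressions and the frame name out are illustrative, not part of the original posts.

```python
import pandas as pd

# Sample rows copied from the question.
df3 = pd.DataFrame({"Col1": [
    "G | 20 mins | 2015",
    "NR | 2",
    "G | 11 mins | 302",
    "TV-MA | 44 mins | Apr 30",
    "G | 198",
    "TV-MA | Apr 30",
    "NR | 2012",
    "NR | 57 mins",
]})

col = df3["Col1"]
out = pd.DataFrame({
    # Rating: everything before the first pipe (or the whole string).
    "Rating": col.str.extract(r"^\s*([^|]+?)\s*(?:\||$)", expand=False),
    # Duration: a pipe-delimited number followed by "mins".
    "Duration": col.str.extract(r"\|\s*(\d+)\s*mins\b", expand=False),
    # Year: a pipe-delimited field consisting of digits only.
    "Year": col.str.extract(r"\|\s*(\d+)\s*(?:\||$)", expand=False),
})

# Month and day: a three-letter word followed by a one- or two-digit number.
month_day = col.str.extract(r"\|\s*([A-Za-z]{3})\s+(\d{1,2})\s*(?:\||$)")
out["Month"] = month_day[0]
out["Day"] = month_day[1]

# Show missing fields as empty cells, as in the desired table.
out = out.fillna("")
print(out)
```

Because each field is recognized by its shape rather than its position, rows with two or three fields are handled by the same expressions. Values that fit none of the assumed shapes would simply stay empty.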
Given a graphlab SFrame where there's a column with dates, e.g.:
+-------+------------+---------+-----------+
| Store | Date | Sales | Customers |
+-------+------------+---------+-----------+
| 1 | 2015-07-31 | 5263.0 | 555.0 |
| 2 | 2015-07-31 | 6064.0 | 625.0 |
| 3 | 2015-07-31 | 8314.0 | 821.0 |
| 4 | 2015-07-31 | 13995.0 | 1498.0 |
| 3 | 2015-07-20 | 4822.0 | 559.0 |
| 2 | 2015-07-10 | 5651.0 | 589.0 |
| 4 | 2015-07-11 | 15344.0 | 1414.0 |
| 5 | 2015-07-23 | 8492.0 | 833.0 |
| 2 | 2015-07-19 | 8565.0 | 687.0 |
| 10 | 2015-07-09 | 7185.0 | 681.0 |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]
Is there an easy way, in graphlab or with some other Python function, to split the Date column into year, month, and day columns?
+-------+------+----+----+---------+-----------+
| Store | YYYY | MM | DD | Sales | Customers |
+-------+------+----+----+---------+-----------+
| 1 | 2015 | 07 | 31 | 5263.0 | 555.0 |
| 2 | 2015 | 07 | 31 | 6064.0 | 625.0 |
| 3 | 2015 | 07 | 31 | 8314.0 | 821.0 |
+-------+------+----+----+---------+-----------+
[986159 rows x 6 columns]
In pandas, I can do this (see "Which is the fastest way to extract day, month and year from a given date?").
But converting an SFrame to pandas just to split the date, and then converting back to an SFrame, is quite a chore.
You could also do it with the split_datetime method. It gives you a bit more flexibility.
sf.add_columns(sf['Date'].split_datetime(column_name_prefix = ''))
The split_datetime method itself is on the SArray (a single column of the SFrame), and it returns an SFrame which you can then add back to the original data at essentially no cost.
A quick and dirty way to do this is:
sf['date2'] = sf['Date'].apply(lambda x: x.split('-'))
sf = sf.unpack('date2')
Another option would be to convert the Date column to a datetime type, then use the graphlab.SArray.split_datetime function.
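The difference between the two approaches can be seen on plain Python values, without graphlab. This sketch only mirrors the per-value logic (the lambda's split versus parsing into a datetime first); it does not call any SFrame API, and the variable names are made up for the illustration.

```python
from datetime import datetime

# Sample date strings from the question's Date column.
dates = ["2015-07-31", "2015-07-20", "2015-07-10"]

# String split: keeps zero-padded text fields and does no validation.
as_strings = [d.split("-") for d in dates]

# Parse first: yields integers and raises ValueError on malformed dates.
parsed = [datetime.strptime(d, "%Y-%m-%d") for d in dates]
as_ints = [(p.year, p.month, p.day) for p in parsed]

print(as_strings[0])
print(as_ints[0])
```

The string split matches the zero-padded MM and DD columns shown in the question, while the datetime route gives numeric columns that sort and compare as numbers.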