Dump prometheus data to parquet - python

I'm trying to find the best practice for dumping Prometheus data to Parquet.
My data consists of a known set of metrics, but an unknown number of monitored clients: VMs, hosts, pods, containers, etc.
What I want first is to monitor all pods that have a fixed number of labels being tracked.
So I'm planning on structuring my data like this:
| ds | item | label_x | label_n | year | month | day |
|:--------------------|:-------|----------:|----------:|-------:|--------:|------:|
| 2020-06-01 00:00:00 | b | 24 | 37 | 2020 | 6 | 1 |
| 2020-06-02 00:00:00 | b | 17 | 37 | 2020 | 6 | 2 |
| 2020-06-03 00:00:00 | b | 40 | 24 | 2020 | 6 | 3 |
| 2020-06-04 00:00:00 | b | 1 | 22 | 2020 | 6 | 4 |
| 2020-06-05 00:00:00 | b | 34 | 4 | 2020 | 6 | 5 |
Creating an item column for each client I have.
Then I'd save the next client, which would come labeled as c:
| ds | item | label_x | label_n | year | month | day |
|:--------------------|:-------|----------:|----------:|-------:|--------:|------:|
| 2020-06-01 00:00:00 | c | 24 | 22 | 2020 | 6 | 1 |
| 2020-06-02 00:00:00 | c | 16 | 12 | 2020 | 6 | 2 |
| 2020-06-03 00:00:00 | c | 1 | 18 | 2020 | 6 | 3 |
| 2020-06-04 00:00:00 | c | 1 | 28 | 2020 | 6 | 4 |
| 2020-06-05 00:00:00 | c | 45 | 4 | 2020 | 6 | 5 |
and just append them.
Then, to retrieve all items in a given month, for instance, I could just do:
import dask.dataframe as dd

filters = [('year', '==', 2020), ('month', '==', 6)]
ddf = dd.read_parquet(db, columns=['item'], filters=filters)
items = ddf[['item']].drop_duplicates().compute()
and grab a unique item by:
filters = [('item', '==', 'b'), ('year', '==', 2020), ('month', '==', 6)]
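For completeness, this is roughly how I plan to write the partitioned dataset in the first place so those filters work (a sketch; df is a pandas DataFrame shaped like the tables above and db is the output path):
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=1)
ddf.to_parquet(db, engine='pyarrow', partition_on=['year', 'month', 'day'])
# appending the next client (df_c) later; ignore_divisions avoids overlapping-index errors
dd.from_pandas(df_c, npartitions=1).to_parquet(
    db, engine='pyarrow', partition_on=['year', 'month', 'day'],
    append=True, ignore_divisions=True)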
Is there anything I'm missing?
Regards,
Carlos.

Related

How can I generate a financial summary using pandas dataframes?

I'd like to create a table from a data frame with subtotals per business, totals per business type, and columns summing multiple value columns. The long-term goal is to build a selection tool based on the ingested Excel sheet, so I can compare summaries across months (e.g., did Minerals item 26 from BA3 disappear the next month?), but I believe that is best saved for another question.
For now, I am having trouble figuring out how to summarize the data.
I have a dataframe in Pandas that contains the following:
Business | Business Type | ID | Value-Q1 | Value-Q2 | Value-Q3 | Value-Q4 | Value-FY |
---------+---------------+----+----------+----------+----------+----------+----------+
BA1 | Widgets | 1 | 7 | 0 | 0 | 8 | 15 |
BA1 | Widgets | 2 | 7 | 0 | 0 | 8 | 15 |
BA1 | Cups | 3 | 9 | 10 | 0 | 0 | 19 |
BA1 | Cups | 4 | 9 | 10 | 0 | 0 | 19 |
BA1 | Cups | 5 | 9 | 10 | 0 | 0 | 19 |
BA1 | Snorkels | 6 | 0 | 0 | 8 | 8 | 16 |
BA1 | Snorkels | 7 | 0 | 0 | 8 | 8 | 16 |
BA1 | Snorkels | 8 | 0 | 0 | 8 | 8 | 16 |
BA2 | Widgets | 9 | 100 | 0 | 7 | 0 | 107 |
BA2 | Widgets | 10 | 100 | 0 | 7 | 0 | 107 |
BA2 | Widgets | 11 | 100 | 0 | 7 | 0 | 107 |
BA2 | Widgets | 12 | 100 | 0 | 7 | 0 | 107 |
BA2 | Bread | 13 | 0 | 0 | 0 | 1 | 1 |
BA2 | Bread | 14 | 0 | 0 | 0 | 1 | 1 |
BA2 | Bread | 15 | 0 | 0 | 0 | 1 | 1 |
BA2 | Bread | 16 | 0 | 0 | 0 | 1 | 1 |
BA2 | Cat Food | 17 | 504 | 0 | 0 | 500 | 1004 |
BA2 | Cat Food | 18 | 504 | 0 | 0 | 500 | 1004 |
BA2 | Cat Food | 19 | 504 | 0 | 0 | 500 | 1004 |
BA2 | Cat Food | 20 | 504 | 0 | 0 | 500 | 1004 |
BA2 | Cat Food | 21 | 504 | 0 | 0 | 500 | 1004 |
BA3 | Gravel | 22 | 7 | 7 | 7 | 7 | 28 |
BA3 | Gravel | 23 | 7 | 7 | 7 | 7 | 28 |
BA3 | Gravel | 24 | 7 | 7 | 7 | 7 | 28 |
BA3 | Rocks | 25 | 3 | 2 | 0 | 0 | 5 |
BA3 | Minerals | 26 | 1 | 1 | 0 | 1 | 3 |
BA3 | Minerals | 27 | 1 | 1 | 0 | 1 | 3 |
BA4 | Widgets | 28 | 6 | 4 | 0 | 0 | 10 |
BA4 | Widgets | 29 | 6 | 4 | 0 | 0 | 10 |
BA4 | Widgets | 30 | 6 | 4 | 0 | 0 | 10 |
BA4 | Widgets | 31 | 6 | 4 | 0 | 0 | 10 |
BA4 | Widgets | 32 | 6 | 4 | 0 | 0 | 10 |
BA4 | Something | 33 | 1000 | 0 | 0 | 2 | 1002 |
BA5 | Bonbons | 34 | 60 | 40 | 10 | 0 | 110 |
BA5 | Bonbons | 35 | 60 | 40 | 10 | 0 | 110 |
BA5 | Gummy Bears | 36 | 7 | 0 | 0 | 9 | 16 |
(Imagine each ID has different values as well)
My goal is to slice the data to get the total occurrences of a given business type (e.g. BA1 has 2 Widgets, 3 Cups, and 3 Snorkels, each of which has a unique ID) as well as the total values:
Occurrence | Q1 Sum | Q2 Sum | Q3 Sum | Q4 Sum | FY Sum |
BA 1 8 | 41 | 30 | 24 | 40 | 135 |
Widgets 2 | 14 | 0 | 0 | 16 | 30 |
Cups 3 | 27 | 30 | 0 | 0 | 57 |
Snorkels 3 | 0 | 0 | 24 | 24 | 48 |
BA 2 Subtotal of BA2 items below
Widgets Repeat Above
Bread Repeat Above
Cat Food Repeat Above
I have more columns per line that mirror the Q1-FY columns for other fields (e.g. Value 2 Q1-FY) that I would like to include in the summary, but I imagine I could just repeat whatever process is used for the current Value columns.
I have a list of unique Businesses
businesses = ['BA1', 'BA2', 'BA3', 'BA4', 'BA5']
and a list of unique Business Types
business_types = ['Widgets', 'Cups', 'Snorkels', 'Bread', 'Cat Food', 'Gravel', 'Rocks', 'Minerals', 'Something', 'Bonbons', 'Gummy Bears']
and finally a list of the Values
values = ['Value-Q1', 'Value-Q2', 'Value-Q3', 'Value-Q4', 'Value-FY']
and I tried doing a for loop over those lists.
Maybe I need to put the dataframe values on their own individual lines? I tried the following, at least for the sum of FY:
for b in businesses:
    for bt in business_types:
        df_sums = df.loc[(df['Business'] == b) & (df['Business Type'] == bt), 'Value-FY'].sum()
but it didn't quite give me what I was hoping for
I'm sure there's a better way to grab the values; I managed to get FY totals per business into a dictionary, but not totals per business per business type (which is also unique per business).
If anyone has any advice or can point me in the right direction I'd really appreciate it!
You should try the groupby method for this. groupby allows for several grouping options; here is a link to the documentation on the method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
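For example, a rough sketch with groupby (assuming the dataframe shown above is called df and has the column names from your table):
value_cols = ['Value-Q1', 'Value-Q2', 'Value-Q3', 'Value-Q4', 'Value-FY']
agg_map = {'ID': 'count', **{c: 'sum' for c in value_cols}}
# occurrence counts and per-quarter sums per business and business type
by_type = (df.groupby(['Business', 'Business Type'])
             .agg(agg_map)
             .rename(columns={'ID': 'Occurrence'}))
# subtotals per business
by_business = (df.groupby('Business')
                 .agg(agg_map)
                 .rename(columns={'ID': 'Occurrence'}))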

Seaborn Grouped Boxplot One-liner?

I've read the several threads about plotting grouped Seaborn boxplots, but I was wondering if there is a simpler solution, as a one-liner?
The Pandas Dataframe contains something along the lines of:
Index xaxis yaxis1 yaxis2
0 A 30 1985
1 A 29 2002
2 B 21 3034
3 A 31 2087
4 B 19 2931
5 B 21 2832
6 A 28 1950
sns.boxplot(x='xaxis', y=['yaxis1','yaxis2'], data=df);
doesn't work (for probably obvious reasons), while
sns.boxplot(x='xaxis', y='yaxis1', data=df);
or
sns.boxplot(x='xaxis', y='yaxis2', data=df);
work just fine for the separate plots. I also tried using
sns.boxplot(df['xaxis'], df[['yaxis1','yaxis2']])
but no luck therewith either...
I want both yaxis columns combined into a single boxplot, similar to this one https://seaborn.pydata.org/examples/grouped_boxplot.html, but I can't use hue=, as the data for both y axes is continuous.
Any way I can do that with the one line sprint, or is it inevitable to run the whole marathon?
If you want to create grouped boxplots with seaborn, you have to use hue=. The trick is to create a long-form dataframe where all your yaxis1/yaxis2 values are in one column, and another column indicates which of the two original columns each row came from.
This is accomplished using DataFrame.melt():
df
| Index | xaxis | yaxis1 | yaxis2 |
|--------:|:--------|---------:|---------:|
| 0 | A | 30 | 1985 |
| 1 | A | 29 | 2002 |
| 2 | B | 21 | 3034 |
| 3 | A | 31 | 2087 |
| 4 | B | 19 | 2931 |
| 5 | B | 21 | 2832 |
| 6 | A | 28 | 1950 |
df2 = df.melt(id_vars=['xaxis'], var_name='yaxis')
| | xaxis | yaxis | value |
|---:|:--------|:--------|--------:|
| 0 | A | yaxis1 | 30 |
| 1 | A | yaxis1 | 29 |
| 2 | B | yaxis1 | 21 |
| 3 | A | yaxis1 | 31 |
| 4 | B | yaxis1 | 19 |
| 5 | B | yaxis1 | 21 |
| 6 | A | yaxis1 | 28 |
| 7 | A | yaxis2 | 1985 |
| 8 | A | yaxis2 | 2002 |
| 9 | B | yaxis2 | 3034 |
| 10 | A | yaxis2 | 2087 |
| 11 | B | yaxis2 | 2931 |
| 12 | B | yaxis2 | 2832 |
| 13 | A | yaxis2 | 1950 |
sns.boxplot(x='xaxis', y='value', hue='yaxis', data=df2)
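If you really want one line, the melt and the plot can also be combined into a single expression (equivalent to the two steps above):
sns.boxplot(x='xaxis', y='value', hue='yaxis',
            data=df.melt(id_vars=['xaxis'], var_name='yaxis'))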

PySpark groupBy with grouped elements

I have about 2 billion records, and I want to group the data with PySpark and save each group to a separate CSV file.
Here is my sample Dataframe:
+----+------+---------------------+
| id | name | date                |
+----+------+---------------------+
| 1  | a    | 2019-12-01 00:00:00 |
| 2  | b    | 2019-12-01 00:00:00 |
| 3  | c    | 2020-01-01 00:00:00 |
| 4  | a    | 2020-01-01 00:00:00 |
| 5  | b    | 2020-01-01 00:00:00 |
| 6  | a    | 2020-01-05 00:00:00 |
| 7  | b    | 2020-01-05 00:00:00 |
+----+------+---------------------+
Then I use groupBy to group them with this code:
from pyspark.sql.functions import month, year

df.groupBy([
    'name',
    year('date').alias('year'),
    month('date').alias('month')
]).count()
output:
+------+------+-------+-------+
| name | year | month | count |
+------+------+-------+-------+
| a    | 2019 | 12    | 1     |
| b    | 2019 | 12    | 1     |
| c    | 2020 | 01    | 1     |
| a    | 2020 | 01    | 2     |
| b    | 2020 | 01    | 2     |
+------+------+-------+-------+
But I want each group elements in Dataframe like this:
+------+------+-------+-----------+
| name | year | month | element   |
+------+------+-------+-----------+
| a    | 2019 | 12    | Dataframe |
| b    | 2019 | 12    | Dataframe |
| c    | 2020 | 01    | Dataframe |
| a    | 2020 | 01    | Dataframe |
| b    | 2020 | 01    | Dataframe |
+------+------+-------+-----------+
Where "element column" contains grouped Dataframe in each group then I want to map each group and save them to separate csv.
Note: I have tried to use distinct and collect for grouping then select data for each group, but performance is too slow for my huge data. I think groupBy is faster, so I want to use groupBy instead.
How to do it in PySpark ?
You can achieve your goal using withColumn and lit; note that withColumn is a DataFrame method, so it has to come after an aggregation such as count():
from pyspark.sql.functions import lit, month, year

(df.groupBy(['name', year('date').alias('year'), month('date').alias('month')])
   .count()
   .withColumn('element', lit('Dataframe')))
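If the end goal is simply one CSV per name/year/month group, another option (a sketch, not part of the snippet above; output_path is a placeholder) is to let Spark write the data partitioned by the grouping keys instead of materializing each group:
from pyspark.sql.functions import month, year

(df.withColumn('year', year('date'))
   .withColumn('month', month('date'))
   .write
   .partitionBy('name', 'year', 'month')
   .csv(output_path, header=True))
Each combination of name/year/month then ends up in its own output directory.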

Pandas merge two dataframes and drop extra rows

How can I merge/join these two dataframes ONLY on "sample_id" and drop the extra rows from the second dataframe when merging/joining?
Using pandas in Python.
First dataframe (fdf)
| sample_id | name |
|-----------|-------|
| 1 | Mark |
| 1 | Dart |
| 2 | Julia |
| 2 | Oolia |
| 2 | Talia |
Second dataframe (sdf)
| sample_id | salary | time |
|-----------|--------|------|
| 1 | 20 | 0 |
| 1 | 30 | 5 |
| 1 | 40 | 10 |
| 1 | 50 | 15 |
| 2 | 33 | 0 |
| 2 | 23 | 5 |
| 2 | 24 | 10 |
| 2 | 28 | 15 |
| 2 | 29 | 20 |
So the resulting df will be like -
| sample_id | name | salary | time |
|-----------|-------|--------|------|
| 1 | Mark | 20 | 0 |
| 1 | Dart | 30 | 5 |
| 2 | Julia | 33 | 0 |
| 2 | Oolia | 23 | 5 |
| 2 | Talia | 24 | 10 |
There are duplicates, so a helper column is needed for a correct DataFrame.merge; GroupBy.cumcount provides the counter:
df = (fdf.assign(g=fdf.groupby('sample_id').cumcount())
         .merge(sdf.assign(g=sdf.groupby('sample_id').cumcount()), on=['sample_id', 'g'])
         .drop('g', axis=1))
print(df)
sample_id name salary time
0 1 Mark 20 0
1 1 Dart 30 5
2 2 Julia 33 0
3 2 Oolia 23 5
4 2 Talia 24 10
An alternative:
final_res = pd.merge(fdf, sdf, on=['sample_id'], how='left')
final_res.sort_values(['sample_id', 'name', 'time'], ascending=[True, True, True], inplace=True)
final_res.drop_duplicates(subset=['sample_id', 'name'], keep='first', inplace=True)

Splitting a Graphlab SFrame Date column into three columns (Year Month Day)

Given a graphlab SFrame where there's a column with dates, e.g.:
+-------+------------+---------+-----------+
| Store | Date | Sales | Customers |
+-------+------------+---------+-----------+
| 1 | 2015-07-31 | 5263.0 | 555.0 |
| 2 | 2015-07-31 | 6064.0 | 625.0 |
| 3 | 2015-07-31 | 8314.0 | 821.0 |
| 4 | 2015-07-31 | 13995.0 | 1498.0 |
| 3 | 2015-07-20 | 4822.0 | 559.0 |
| 2 | 2015-07-10 | 5651.0 | 589.0 |
| 4 | 2015-07-11 | 15344.0 | 1414.0 |
| 5 | 2015-07-23 | 8492.0 | 833.0 |
| 2 | 2015-07-19 | 8565.0 | 687.0 |
| 10 | 2015-07-09 | 7185.0 | 681.0 |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]
Is there an easy way in graphlab / other python function to convert the Date column to Year|Month|Day?
+-------+------+----+----+---------+-----------+
| Store | YYYY | MM | DD | Sales   | Customers |
+-------+------+----+----+---------+-----------+
| 1     | 2015 | 07 | 31 | 5263.0  | 555.0     |
| 2     | 2015 | 07 | 31 | 6064.0  | 625.0     |
| 3     | 2015 | 07 | 31 | 8314.0  | 821.0     |
+-------+------+----+----+---------+-----------+
[986159 rows x 6 columns]
In pandas, I can do this: Which is the fastest way to extract day, month and year from a given date?
But converting an SFrame to pandas just to split the date, and then converting back to an SFrame, is quite a chore.
You could also do it with the split_datetime method. It gives you a bit more flexibility.
sf.add_columns(sf['Date'].split_datetime(column_name_prefix = ''))
The split_datetime method itself is on the SArray (a single column of the SFrame), and it returns an SFrame which you can then add back to the original data (at basically zero cost).
A quick and dirty way to do this is
sf['date2'] = sf['Date'].apply(lambda x: x.split('-'))
sf = sf.unpack('date2')
Another option would be to convert the Date column to a datetime type, then use the graphlab.SArray.split_datetime function.
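A rough sketch of that last approach (assuming the Date column is stored as strings in the format shown above):
# parse the strings into datetimes, then split into year/month/day columns
sf['Date'] = sf['Date'].str_to_datetime('%Y-%m-%d')
sf = sf.add_columns(sf['Date'].split_datetime(column_name_prefix='',
                                              limit=['year', 'month', 'day']))
# optionally rename to match the desired headers
sf = sf.rename({'year': 'YYYY', 'month': 'MM', 'day': 'DD'})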
