Plot the graph excluding missing values in pandas or matplotlib - python

I am new to time-series programming with Pandas.
Here is the sample data:
date 2 3 4 ShiftedPrice
0 2017-11-05 09:20:01.134 2123.0 12.23 34.12 300.0
1 2017-11-05 09:20:01.789 2133.0 32.43 45.62 330.0
2 2017-11-05 09:20:02.238 2423.0 35.43 55.62 NaN
3 2017-11-05 09:20:02.567 3423.0 65.43 56.62 NaN
4 2017-11-05 09:20:02.948 2463.0 45.43 58.62 NaN
I would like to plot date vs ShiftedPrice for all rows where ShiftedPrice is not missing. You can assume there are no missing values in the date column. Please help me with this.

You can first remove the rows with NaN using dropna:
df = df.dropna(subset=['ShiftedPrice'])
And then plot:
df.plot(x='date', y='ShiftedPrice')
All together:
df.dropna(subset=['ShiftedPrice']).plot(x='date', y='ShiftedPrice')
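Since the title also mentions matplotlib, the same thing can be done with a boolean mask and plt.plot directly; a minimal sketch, assuming df is the frame from the question:
import matplotlib.pyplot as plt

# keep only the rows where ShiftedPrice is present
mask = df['ShiftedPrice'].notna()
plt.plot(df.loc[mask, 'date'], df.loc[mask, 'ShiftedPrice'], 'r.')
plt.show()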

Related

Differences of rows and adding the results in a new row

I want the result of this line of code, df.diff(periods=len(df)-1), appended as a new row of my data frame.
That line calculates the difference between the first and the last row. What I want is to append a new row with the result, so that I can later compare the trend of my data in percentages. I have explained the final goal in case there is a more straightforward approach.
My data
#MetaHash 0x 1inch 88mph AC \
Time
14:00 16/03/2021 196876.0 162052086.0 279895846.0 850387.0 713496.0
14:02 16/03/2021 271687.0 150463819.0 281510814.0 850387.0 714325.0
14:49 16/03/2021 362927.0 164764136.0 278248431.0 862865.0 688467.0
And the results obtained after applying the line of code.
#MetaHash 0x 1inch 88mph AC \
Time
14:00 16/03/2021 NaN NaN NaN NaN NaN
14:02 16/03/2021 NaN NaN NaN NaN NaN
17:15 17/03/2021 NaN NaN NaN NaN NaN
11:46 18/03/2021 362810.0 270883348.0 115643691.0 1833585.0 -312283.0 # I want this row in my previous data frame I have shown above.
Try using the concat() method:
import pandas as pd
df = pd.concat((df, df.diff(periods=len(df) - 1).dropna()))
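As a quick check, here is that line run on a small frame shaped like the one above; a sketch, with values abridged to the question's first two columns:
import pandas as pd

df = pd.DataFrame({'#MetaHash': [196876.0, 271687.0, 362927.0],
                   '0x': [162052086.0, 150463819.0, 164764136.0]},
                  index=['14:00 16/03/2021', '14:02 16/03/2021', '14:49 16/03/2021'])

# diff(periods=len(df)-1) is NaN everywhere except the last row, which
# holds last row minus first row; dropna() keeps just that row
df = pd.concat((df, df.diff(periods=len(df) - 1).dropna()))
print(df)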

How to correctly match pandas multiindex dataframe multiplication for sparse data

I searched before posting this and found, among others, this previous Stack Overflow post, but I don't think it answers my question.
I have sparse data that I want to multiply together, correctly matched by index, where the data has a MultiIndex.
I have observations of different attributes for a number of element_ids on different dates, but the data is sparse (see df_observations in the code at the bottom of the post).
My second frame, df_weight_at_date, is a list of weights for each element_id (Python to create both frames is at the bottom of the post).
For each date, I want to multiply values together; for example, in my observed data A/1/2021-01-15 (0.87) should be multiplied by the weight at date 1/2021-01-15 (0.3) for a value of 0.261.
If either value is NaN then the result is NaN, and the output frame will have the same shape as the df_observations dataframe.
I've tried using .multiply but get ValueError: cannot join with no overlapping index names:
df_observations.multiply(df_weight_at_date.unstack())
Expected output for this data
Bit of a newbie - would appreciate any pointers, thanks
Code to create the data frames:
import pandas as pd

df_observations = pd.DataFrame({
    'observed_date': ['2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15',
                      '2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15',
                      '2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15',
                      '2021-01-15','2021-01-15','2021-01-16','2021-01-16','2021-01-16',
                      '2021-01-16','2021-01-16','2021-01-16','2021-01-16','2021-01-16',
                      '2021-01-16','2021-01-16','2021-01-16','2021-01-16','2021-01-16'],
    'element_id': [1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,2,3,4,5,6,7,3,2,3,4,5,6,7],
    'factor_id': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B','C','C','C',
                  'A','A','A','A','A','A','F','F','B','B','B','B','B'],
    'observation': [0.87,0.84,0.15,0.6,0.17,0.76,0.03,0.91,0.05,0.38,0.06,0.27,0.92,
                    0.27,0.16,0.71,0.32,0.92,0.88,0.53,0.79,0.15,0.3,0.16,0.36,0.05,
                    0.22,0.73,0.7,0.9]
}).pivot(index=['observed_date','element_id'], columns='factor_id', values='observation')

df_weight_at_date = pd.DataFrame({
    'observed_date': ['2021-01-15','2021-01-15','2021-01-15',
                      '2021-01-16','2021-01-17','2021-01-18',
                      '2021-01-19','2021-01-20','2021-01-18'],
    'element_id': [1,3,5,1,3,5,1,3,9],
    'weight': [0.3,0.35,0.35,1,1,0.4,1,1,0.6]
}).pivot(index=['element_id'], columns='observed_date', values='weight')
You can try to unstack df_weight_at_date:
df_observations.mul(
    df_weight_at_date.unstack().fillna(1).reindex(df_observations.index, fill_value=1),
    axis=0
)
Output:
factor_id A B C F
observed_date element_id
2021-01-15 1 0.2610 0.2730 0.048 NaN
2 0.8400 0.0500 0.710 NaN
3 0.0525 0.1330 0.112 NaN
4 0.6000 0.0600 NaN NaN
5 0.0595 0.0945 NaN NaN
6 0.7600 0.9200 NaN NaN
7 0.0300 0.2700 NaN NaN
2021-01-16 2 0.9200 NaN NaN 0.36
3 0.8800 0.0500 NaN 0.16
4 0.5300 0.2200 NaN NaN
5 0.7900 0.7300 NaN NaN
6 0.1500 0.7000 NaN NaN
7 0.3000 0.9000 NaN NaN
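Note the role of fillna(1) and fill_value=1 in the answer above: element_ids with no weight on a given date are multiplied by 1 and pass through unchanged rather than becoming NaN (see element_id 2 on 2021-01-15), while NaN observations still propagate as NaN.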
After correcting the input frames so that the index names match (observation_date -> observed_date), this now works and is concise enough, I think:
df_observations.multiply(df_weight_at_date.unstack(), axis=0)
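For reference, a hypothetical sketch of that correction; the rename_axis call and the original mismatched level name are assumptions, since the broken frame isn't shown in the post:
# assumed fix: align the column-axis name so unstack() produces index
# level names matching df_observations ('observed_date', 'element_id')
df_weight_at_date = df_weight_at_date.rename_axis(columns='observed_date')
result = df_observations.multiply(df_weight_at_date.unstack(), axis=0)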
This should work as well:
(df_weight_at_date.stack()         # Series indexed by (element_id, observed_date)
    .swaplevel()                   # -> (observed_date, element_id), matching df_observations
    .to_frame('A')                 # weights in a column named 'A'
    .reindex(df_observations.columns, axis=1)  # add columns B, C, F as NaN
    .ffill(axis=1)                 # broadcast each weight across all factor columns
    .mul(df_observations))

How to group a df based on one column and apply a function to another column in pandas

I am quite new to pandas; I have been stuck on this issue for weeks, so as a last resort I have come to this forum.
Below is my dataframe
S2Rate S2BillDate Sale Average Total Sale
0 20.00 2019-05-18 20.000000 20.00
1 15.00 2019-05-18 26.250000 420.00
2 15.00 2019-05-19 36.000000 180.00
3 7.50 2019-05-19 34.500000 172.50
4 7.50 2019-05-21 32.894737 625.00
I am trying to plot a graph where my primary y axis will have the S2Rate and the secondary y axis will have the Sale Average. But I would like my x axis to have the date, for which I will need my df to look like this (below):
S2Rate S2BillDate Sale Average Total Sale
0 20.00 2019-05-18 20.000000 20.00
1 15.00 2019-05-18 to 2019-05-19 31.1250000 600.00
2 7.50 2019-05-19 to 2019-05-21 33.690000 797.50
That is, for S2Rate 15 the min date is 2019-05-18 and the max date is 2019-05-19, so it needs to pick the min and max date for each S2Rate group, because there can be situations where the same S2Rate spans many days.
Can anyone guide me towards this? Please don't mistake this for directly asking for help/code; even pointing me to the right concepts will do. I kinda have no clue how to proceed further.
Any help is much appreciated. TIA !
First, since S2Rate values can recur, consecutive runs of the same S2Rate must be identified. This can be done with a diff-cumsum trick. Skip this step if you'd like to group by all S2Rates instead.
# identify consecutive groups of S2Rate
df["S2RateGroup"] = (df["S2Rate"].diff() != 0).cumsum()
df
Out[268]:
S2Rate S2BillDate Sale Average Total Sale S2RateGroup
0 20.0 2019-05-18 20.000000 20.0 1
1 15.0 2019-05-18 26.250000 420.0 2
2 15.0 2019-05-19 36.000000 180.0 2
3 7.5 2019-05-19 34.500000 172.5 3
4 7.5 2019-05-21 32.894737 625.0 3
Next, just write your custom title-producing function and put it into .agg() using Named Aggregation:
import numpy as np
import pandas as pd

def date_agg(col):
    dmin = col.min()
    dmax = col.max()
    return f"{dmin} to {dmax}" if dmax > dmin else f"{dmin}"
df.groupby("S2RateGroup").agg( # or .groupby("S2Rate")
s2rate=pd.NamedAgg("S2Rate", np.min),
date=pd.NamedAgg("S2BillDate", date_agg),
sale_avg=pd.NamedAgg("Sale Average", np.mean),
total_sale=pd.NamedAgg("Total Sale", np.sum)
)
# result
Out[270]:
s2rate date sale_avg total_sale
S2RateGroup
1 20.0 2019-05-18 20.000000 20.0
2 15.0 2019-05-18 to 2019-05-19 31.125000 600.0
3 7.5 2019-05-19 to 2019-05-21 33.697368 797.5
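If your pandas version warns about passing NumPy functions such as np.min to .agg, the built-in string aliases are an equivalent drop-in (a minor variant, not part of the original answer):
df.groupby("S2RateGroup").agg(
    s2rate=pd.NamedAgg("S2Rate", "min"),
    date=pd.NamedAgg("S2BillDate", date_agg),
    sale_avg=pd.NamedAgg("Sale Average", "mean"),
    total_sale=pd.NamedAgg("Total Sale", "sum")
)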
Since you are new to pandas, it would also be helpful to go through the official how-to.

Python Pandas totals and dates

I'm sorry for not posting the data, but it wouldn't really help. The thing is, I need to make a graph, and I have a CSV file full of information organised by date. It has 'Cases', 'Deaths', 'Recoveries', 'Critical', 'Hospitalized' and 'States' as categories. It goes in order by date and has the number of cases, deaths and recoveries per day for each state. How do I sum these categories to make a graph that shows how the total is increasing? I really have no idea how to start. Below are some numbers that try to explain what I have.
0 2020-02-20 1 Andalucía NaN NaN NaN
1 2020-02-20 2 Aragón NaN NaN NaN
2 2020-02-20 3 Asturias NaN NaN NaN
3 2020-02-20 4 Baleares 1.0 NaN NaN
4 2020-02-20 5 Canarias 1.0 NaN NaN
.. ... ... ... ... ... ...
888 2020-04-06 19 Melilla 92.0 40.0 3.0
889 2020-04-06 14 Murcia 1283.0 500.0 84.0
890 2020-04-06 15 Navarra 3355.0 1488.0 124.0
891 2020-04-06 16 País Vasco 9021.0 4856.0 417.0
892 2020-04-06 17 La Rioja 2846.0 918.0 66.0
It's unclear exactly what you mean by "sum this categories". I'm assuming that, for each date, you want to sum the values across all the different regions to come up with the total values for Spain?
In that case, you will want to groupby date, then .sum() the columns (you can drop the States category).
grouped_df = df.groupby("date")["Cases", "Deaths", ...].sum()
grouped_df.set_index("date").plot()
This snippet will probably not work directly, you may need to reformat the dates etc. But should be enough to get you started.
I think you are looking for a groupby followed by a cumsum, not including the dates.
columns_to_group = ['Cases', 'Deaths', 'Recoveries',
                    'Critical', 'Hospitalized', 'date']
new_columns = ['Cases_sum', 'Deaths_sum', 'Recoveries_sum',
               'Critical_sum', 'Hospitalized_sum']
df_grouped = df[columns_to_group].groupby('date').sum().reset_index()
For plotting, seaborn provides an easy function:
import seaborn as sns
df_melted = df_grouped.melt(id_vars=["date"])
sns.lineplot(data=df_melted, x='date', y = 'value', hue='variable')
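Since the question asks how the totals increase over time, a cumulative sum of the per-date totals gives running totals; a minimal sketch under the same assumptions as above:
# turn per-day totals into running totals before melting
df_cum = df_grouped.set_index('date').cumsum().reset_index()
df_cum_melted = df_cum.melt(id_vars=['date'])
sns.lineplot(data=df_cum_melted, x='date', y='value', hue='variable')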

How to plot some months in years of dataframe

Using this code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.options.display.float_format = '{:.2f}'.format
a = pd.read_csv(r'C:\Users\Leonardo\Desktop\TRABALHO\dadosboias\MARINHA_TRATADO\Cabo Frio\boia_1\cabofrio.csv', na_values=['-9999.0'])
a.index = pd.to_datetime(a[['Year', 'Month', 'Day', 'Hour', 'Minute']])
pd.options.mode.chained_assignment = None
The output is something like this:
index wspd wdir gust hs
2009-06-24 15:21:00 1.4669884357700003 9.0 2.03121475722 nan
2009-06-24 16:21:00 1.4669884357700003 34.0 2.03121475722 nan
2009-06-24 17:21:00 0.677071585741 127.0 1.35414317148 nan
2009-06-24 18:21:00 0.22569052858000002 146.0 0.902762114322 nan
... ... ... ...
2013-02-10 17:21:00 nan nan nan nan
And doing a simple plot with plt.plot(a.hs, 'r.'), the output is this:
As you can see, the dataframe has a lot of missing data in the "hs" column. The main objective is to plot just the periods with data. In the image you can see that 2012-03 to 2013-03 has a lot of good "hs" data, so the objective is to plot this period and get something like this:
I would be thankful if someone could help.
You can just select the relevant range, e.g.
a.loc['2012-03-01':'2013-03-01', 'hs'].plot()
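If you'd rather not hard-code the date range, dropping the NaNs first (as in the first question above) draws only the observed points; a minimal sketch assuming the frame a from the question:
# plot only the rows where hs is present, as red points
a['hs'].dropna().plot(style='r.')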
