I've a dataframe like this:
DATE        VALUE  TYPE
2021-01-11  57     A
2021-02-11  34     B
2021-03-11  43     A
2021-04-11  15     B
...
My question is how to plot a bar graph of the monthly mean, ordered by date and grouped by 'TYPE'.
I'm using pandas with this code:
df = df.set_index('DATE')
df.index = pd.to_datetime(df.index)
df = df.resample('M').mean()
df.plot(kind='bar',stacked=True)
I want to draw a stacked bar plot but I don't know how...
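I'm not sure I understand correctly, but if you want to stack values by type with the date on the x-axis, I would use pivot (do not set the index first):
import matplotlib.pyplot as plt
# Keyword arguments are required by newer pandas versions
df = df.pivot(index='DATE', columns='TYPE', values='VALUE')
df.plot(kind='bar', stacked=True, rot=0)
plt.show()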
With the slightly edited table to show the stacking:
DATE        VALUE  TYPE
2021-01-11  57     A
2021-02-11  34     A
2021-02-11  12     B
2021-03-11  43     A
2021-04-11  15     B
You get the following:
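The question also asks for a monthly mean, which the pivot above does not compute. A minimal sketch of one way to get it, assuming the rows of the edited table; the `MONTH` helper column and the `monthly` name are introduced here for illustration and are not part of the original answer:

```python
import pandas as pd

# Sample rows taken from the edited table above.
df = pd.DataFrame({
    "DATE": ["2021-01-11", "2021-02-11", "2021-02-11", "2021-03-11", "2021-04-11"],
    "VALUE": [57, 34, 12, 43, 15],
    "TYPE": ["A", "A", "B", "A", "B"],
})
df["DATE"] = pd.to_datetime(df["DATE"])

# Hypothetical helper column: labels each row with its calendar month.
df["MONTH"] = df["DATE"].dt.to_period("M")

# One row per month, one column per TYPE, each cell the mean VALUE.
# Months with no rows for a given TYPE are left as NaN.
monthly = df.pivot_table(index="MONTH", columns="TYPE", values="VALUE", aggfunc="mean")
print(monthly)
```

Calling `monthly.plot(kind='bar', stacked=True, rot=0)` on the result should then draw the stacked bars with one bar per month.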
I have created a visualization using the plotly library in Python. Everything looks fine, except that the axis starts with 2020 and then shows 2019. It should be the other way round.
Here is the data (df):
date percent type
3/1/2020 10 a
3/1/2020 0 b
4/1/2020 15 a
4/1/2020 60 b
1/1/2019 25 a
1/1/2019 1 b
2/1/2019 50 c
2/1/2019 20 d
This is what I am doing
import plotly.express as px
px.scatter(df, x = "date", y = "percent", color = "type", facet_col = "type")
How would I make it so that the dates are sorted correctly, earliest to latest? The dates are sorted within the raw data so why is it not reflecting this on the graph?
Any suggestion will be appreciated.
Here is the result:
Plotly is plotting the rows in the order they appear in your df. If you want date order, sort the dataframe by date first.
df.sort_values('date', inplace=True)
A lot of other graphing utilities (Seaborn, etc.) sort by default when plotting. Plotly Express does not.
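One caveat when sorting: if the date column still holds strings, the sort is alphabetical rather than chronological. A small sketch with made-up rows (not the asker's data) that shows the difference:

```python
import pandas as pd

# Illustrative rows only; "10/1/2019" is chosen to expose the text-sort problem.
df = pd.DataFrame({
    "date": ["3/1/2020", "10/1/2019", "2/1/2019"],
    "percent": [10, 25, 50],
})

# Sorting the raw strings compares them character by character.
as_text = df.sort_values("date")["date"].tolist()

# Sorting after conversion to datetime is chronological.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
as_dates = df.sort_values("date")["date"].dt.strftime("%Y-%m-%d").tolist()

print(as_text)   # ['10/1/2019', '2/1/2019', '3/1/2020']
print(as_dates)  # ['2019-02-01', '2019-10-01', '2020-03-01']
```

In the text sort, October lands before February because "1" sorts before "2".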
Your date column seems to be a string. If you convert it to a datetime, you don't have to sort your dataframe: Plotly Express will then treat the x-axis as a date axis.
Working code example:
import pandas as pd
import plotly.express as px
from io import StringIO
text = """
date percent type
3/1/2020 10 a
3/1/2020 0 b
4/1/2020 15 a
4/1/2020 60 b
1/1/2019 25 a
1/1/2019 1 b
2/1/2019 50 c
2/1/2019 20 d
"""
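df = pd.read_csv(StringIO(text), sep=r'\s+', header=0)
# Convert the string column to datetime so the x-axis is ordered chronologically
df['date'] = pd.to_datetime(df['date'])
px.scatter(df, x="date", y="percent", color="type", facet_col="type")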
I want to create a line chart using Plotly. I have 3 variables (date, shift, runt). I want to plot runt against date, and I also want to display the shift.
Dataframe:
What I want is to plot a line chart with both date and shift on the x-axis.
This is what I got from Excel. I want to plot the same graph in Python.
But I can't use two columns for the x-axis. I tried concatenating the date and shift into one column, but it shows the day values first and then the night values.
import plotly.express as px
fig = px.line(df, x="Day-Shift", y="RUNT", title='Yo',template="plotly_dark")
fig.show()
Is there any way to turn off this ordering? What I want is shown in the Excel graph above.
I've created a column that combines the date and the shift and specified it on the x-axis. Does this meet the intent of your question?
import pandas as pd
import numpy as np
import io
data = '''
Date Shift RUNT
0 June-16 Day 350
1 June-16 Night 20
2 June-17 Day 350
3 June-17 Night 20
4 June-18 Day 350
5 June-18 Night 20
6 June-19 Day 350
7 June-19 Night 20
8 June-20 Day 350
9 June-20 Night 20
10 June-21 Day 350
11 June-21 Night 20
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df['Day-Shift'] = df['Date'].str.cat(df['Shift'], sep='-')
import plotly.express as px
fig = px.line(df, x="Day-Shift", y="RUNT", title='Yo',template="plotly_dark")
fig.show()
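If the rows arrive with all Day values before the Night values, as the question describes, the combined labels come out in that order too. A minimal sketch of sorting the rows first, assuming the "Month-day" date strings shown above; the `_order` helper column and the sample RUNT values are made up for illustration:

```python
import pandas as pd

# Illustrative rows in the problematic order: all Day rows, then all Night rows.
df = pd.DataFrame({
    "Date": ["June-16", "June-17", "June-16", "June-17"],
    "Shift": ["Day", "Day", "Night", "Night"],
    "RUNT": [350, 340, 20, 25],
})

# An ordered categorical makes "Day" sort before "Night" explicitly.
df["Shift"] = pd.Categorical(df["Shift"], categories=["Day", "Night"], ordered=True)

# Hypothetical helper column used only for ordering; with no year in the
# string, the parsed year defaults to 1900.
df["_order"] = pd.to_datetime(df["Date"], format="%B-%d")

df = df.sort_values(["_order", "Shift"]).reset_index(drop=True)
df["Day-Shift"] = df["Date"] + "-" + df["Shift"].astype(str)
print(df["Day-Shift"].tolist())
```

The sorted frame can then be passed to `px.line` with `x="Day-Shift"` exactly as in the answer above.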
I have a dataframe like as shown below
df1 = pd.DataFrame({'person_id': [11,11,11,21,21],
'admit_dates': ['03/21/2015', '01/21/2016', '7/20/2018','01/11/2017','12/31/2011'],
'discharge_dates': ['05/09/2015', '01/29/2016', '7/27/2018','01/12/2017','01/31/2016'],
'drug_start_dates': ['05/29/1967', '01/21/1957', '7/27/1959','01/01/1961','12/31/1961'],
'offset':[223,223,223,310,310]})
What I would like to do is add the offset, which is in years, to the date columns.
So I tried converting the offset to a timedelta object with unit='y' or unit='Y' and then shifting admit_dates:
df1['offset'] = pd.to_timedelta(df1['offset'],unit='Y') #also tried with `y` (small y)
df1['shifted_date'] = df1['admit_dates'] + df1['offset']
However, I get the below error
ValueError: Units 'M' and 'Y' are no longer supported, as they do not
represent unambiguous timedelta values durations.
Is there any other elegant way to shift dates by years?
The maximum Timestamp supported in pandas is Timestamp('2262-04-11 23:47:16.854775807'), so you cannot add 310 years to the date 12/31/2011. One possible way around this is to use Python's datetime objects, which support years up to 9999, so you can add 310 years to them:
from dateutil.relativedelta import relativedelta
df['admit_dates'] = pd.to_datetime(df['admit_dates'])
df['admit_dates'] = df['admit_dates'].dt.date.add(
df['offset'].apply(lambda y: relativedelta(years=y)))
Result:
df
person_id admit_dates discharge_dates drug_start_dates offset
0 11 2238-03-21 05/09/2015 05/29/1967 223
1 11 2239-01-21 01/29/2016 01/21/1957 223
2 11 2241-07-20 7/27/2018 7/27/1959 223
3 21 2327-01-11 01/12/2017 01/01/1961 310
4 21 2321-12-31 01/31/2016 12/31/1961 310
One thing you can do is extract the year out of the date, and add it to the offset:
df1 = pd.DataFrame({'person_id': [11,11,11,21,21],
'admit_dates': ['03/21/2015', '01/21/2016', '7/20/2018','01/11/2017','12/31/2011'],
'discharge_dates': ['05/09/2015', '01/29/2016', '7/27/2018','01/12/2017','01/31/2016'],
'drug_start_dates': ['05/29/1967', '01/21/1957', '7/27/1959','01/01/1961','12/31/1961'],
'offset':[10,20,2,31,12]})
df1.admit_dates = pd.to_datetime(df1.admit_dates)
df1["new_year"] = df1.admit_dates.dt.year + df1.offset
df1["date_with_offset"] = pd.to_datetime(pd.DataFrame({"year": df1.new_year,
"month": df1.admit_dates.dt.month,
"day":df1.admit_dates.dt.day}))
The catch - with your original offsets, some of the dates cause the following error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2328-01-11 00:00:00
According to the documentation, the maximum date in pandas is Apr. 11th, 2262 (at about quarter to midnight, to be specific). It's probably because they keep time in nanoseconds, and that's when the out of bounds error occurs for this representation.
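The units 'Y' and 'M' have been deprecated since pandas 0.25.0.
However, NumPy's timedelta64 still accepts these units, and the result can be passed to a pandas Timedelta. Note that NumPy treats 'Y' as an average year of 365.2425 days, so this is not an exact calendar shift, and newer pandas versions may reject the conversion: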
import pandas as pd
import numpy as np
# Builds your dataframe
df1 = pd.DataFrame({'person_id': [11,11,11,21,21],
'admit_dates': ['03/21/2015', '01/21/2016', '7/20/2018','01/11/2017','12/31/2011'],
'discharge_dates': ['05/09/2015', '01/29/2016', '7/27/2018','01/12/2017','01/31/2016'],
'drug_start_dates': ['05/29/1967', '01/21/1957', '7/27/1959','01/01/1961','12/31/1961'],
'offset':[223,223,223,310,310]})
>>> df1
person_id admit_dates discharge_dates drug_start_dates offset
0 11 03/21/2015 05/09/2015 05/29/1967 223
1 11 01/21/2016 01/29/2016 01/21/1957 223
2 11 7/20/2018 7/27/2018 7/27/1959 223
3 21 01/11/2017 01/12/2017 01/01/1961 310
4 21 12/31/2011 01/31/2016 12/31/1961 310
>>> df1['shifted_date'] = df1.apply(lambda r: pd.Timedelta(np.timedelta64(r['offset'], 'Y'))+ pd.to_datetime(r['admit_dates']), axis=1)
>>> df1['shifted_date'] = df1['shifted_date'].dt.date
>>> df1
person_id admit_dates discharge_dates drug_start_dates offset shifted_date
0 11 03/21/2015 05/09/2015 05/29/1967 223 2238-03-21
1 11 01/21/2016 01/29/2016 01/21/1957 223 2239-01-21
2 11 7/20/2018 7/27/2018 7/27/1959 223 2241-07-20
....
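For offsets that keep the result inside the pandas Timestamp range, `pd.DateOffset(years=...)` is a calendar-aware alternative to a fixed-length timedelta. A minimal sketch with small, made-up offsets (the 02/29/2016 row is added here to show leap-day handling and is not in the question's data):

```python
import pandas as pd

df1 = pd.DataFrame({
    "admit_dates": ["03/21/2015", "02/29/2016", "7/20/2018"],  # second row is illustrative
    "offset": [10, 1, 2],                                      # years, small enough to stay in range
})
df1["admit_dates"] = pd.to_datetime(df1["admit_dates"], format="%m/%d/%Y")

# Each row has its own offset, so build one DateOffset per row.
df1["shifted_date"] = [
    d + pd.DateOffset(years=int(y))
    for d, y in zip(df1["admit_dates"], df1["offset"])
]
print(df1)
```

With the original offsets of 223 and 310 years, some results would fall past 2262-04-11, so this sketch only applies when the shifted dates stay within that limit.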
I'm learning Python and pandas and practicing with different stock calculations. I've searched for help with this but haven't found an answer that was similar enough, or I didn't understand how to work out the correct approach from the previous answers.
I have read stock data for a given time frame with datareader into the dataframe df. In df I have Date, Volume and Adj Close columns, which I want to use to create a new column "OBV" based on the criteria below. OBV is a cumulative value that adds or subtracts today's volume to or from the previous day's OBV, depending on the adjusted close price.
The calculation of OBV is simple:
If today's Adj Close is higher than yesterday's Adj Close, add today's Volume to yesterday's (cumulative) volume.
If today's Adj Close is lower than yesterday's Adj Close, subtract today's Volume from yesterday's (cumulative) volume.
On day 1, OBV = 0.
This is then repeated along the time frame and OBV accumulates.
Here are the basic imports and setup:
import numpy as np
import pandas as pd
import pandas_datareader
import datetime
from pandas_datareader import data, wb
start = datetime.date(2012, 4, 16)
end = datetime.date(2017, 4, 13)
# Reading in Yahoo Finance data with DataReader
df = data.DataReader('GOOG', 'yahoo', start, end)
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
#This is what I cannot get to work, and I've tried two different ways.
#ATTEMPT1
def obv1(column):
    if column["Adj Close"] > column["Adj close"].shift(-1):
        val = column["Volume"].shift(-1) + column["Volume"]
    else:
        val = column["Volume"].shift(-1) - column["Volume"]
    return val
df["OBV"] = df.apply(obv1, axis=1)
#ATTEMPT 2
def obv1(df):
    if df["Adj Close"] > df["Adj close"].shift(-1):
        val = df["Volume"].shift(-1) + df["Volume"]
    else:
        val = df["Volume"].shift(-1) - df["Volume"]
    return val
df["OBV"] = df.apply(obv1, axis=1)
Both give me an error.
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(dict(
Volume=np.random.randint(100, 200, 10),
AdjClose=np.random.rand(10)
))
print(df)
AdjClose Volume
0 0.951710 111
1 0.346711 198
2 0.289758 174
3 0.662151 190
4 0.171633 115
5 0.018571 155
6 0.182415 113
7 0.332961 111
8 0.150202 113
9 0.810506 126
Multiply Volume by -1 wherever the change in AdjClose is zero or negative, then take the cumulative sum. The first row has no previous value, so it counts as positive:
(df.Volume * (~df.AdjClose.diff().le(0) * 2 - 1)).cumsum()
0 111
1 -87
2 -261
3 -71
4 -186
5 -341
6 -228
7 -117
8 -230
9 -104
dtype: int64
Include this alongside the rest of the df:
df.assign(new=(df.Volume * (~df.AdjClose.diff().le(0) * 2 - 1)).cumsum())
AdjClose Volume new
0 0.951710 111 111
1 0.346711 198 -87
2 0.289758 174 -261
3 0.662151 190 -71
4 0.171633 115 -186
5 0.018571 155 -341
6 0.182415 113 -228
7 0.332961 111 -117
8 0.150202 113 -230
9 0.810506 126 -104
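The one-liner above starts from the first day's volume and treats an unchanged close as a decrease. A variant that follows the question's rules more literally (OBV is 0 on day 1, and an unchanged close leaves OBV unchanged) can be sketched with `np.sign`. The sample prices and volumes below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data only, not real stock prices.
df = pd.DataFrame({
    "AdjClose": [10.0, 11.0, 10.5, 10.5, 12.0],
    "Volume": [100, 200, 150, 120, 300],
})

# +1 when the close rose, -1 when it fell, 0 when unchanged.
# The first row has no previous close, so its NaN is replaced with 0.
direction = np.sign(df["AdjClose"].diff()).fillna(0)

# Signed volume, accumulated over time.
df["OBV"] = (direction * df["Volume"]).cumsum()
print(df)
```

For this sample the OBV column comes out as 0, 200, 50, 50, 350.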