Plot specific months from whole data set? - python

From a whole data set, I need to plot the maximum & minimum temperatures for just the months of January and July. Column 2 is the date, and columns 8 and 9 are the 'TMAX' and 'TMIN.' This is what I have so far:
napa3=pd.read_csv('MET 51 Lab #10 data (Pandas, NAPA).csv',usecols=[2,8,9])
time2=pd.to_datetime(napa3['DATE'],format='%Y-%m-%d')
imon=time2.dt.month
jj=(imon==1)&(imon==7)
data_jj=napa3.loc[jj]
data_jj.plot.hist(title='TMAX & TMIN for January and July')
plt.show()
I keep getting the error: "TypeError: no numeric data to plot"
Why is this?

The problem can arise because the dates and/or the temperature values are saved with an "object" dtype, i.e. as strings.
Note that read_csv already returns a DataFrame, but you can wrap it explicitly to be sure you are working with one:
dnapa3 = pd.DataFrame(napa3)
Then repeat the conversion of your time data and check the dtypes:
print(dnapa3.dtypes)
Once you have confirmed that the columns you need are stored as strings/objects, you can convert the temperature column to floats:
dnapa3['your_temp_column_label'] = dnapa3['your_temp_column_label'].astype(float)
This should hopefully work. Or, similarly:
dnapa3['your_temp_column_label'] = pd.to_numeric(dnapa3['your_temp_column_label'], errors='coerce')
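Note also that the month mask in the question uses & (and); a month can never equal both 1 and 7 at once, so that mask selects no rows. A minimal sketch of the whole selection, assuming the file name and the column names DATE, TMAX and TMIN from the question:
import pandas as pd
import matplotlib.pyplot as plt

# File and column names are taken from the question and assumed to be correct
napa3 = pd.read_csv('MET 51 Lab #10 data (Pandas, NAPA).csv', usecols=[2, 8, 9])

napa3['DATE'] = pd.to_datetime(napa3['DATE'], format='%Y-%m-%d')
# Force the temperature columns to numeric in case they were read as strings
napa3[['TMAX', 'TMIN']] = napa3[['TMAX', 'TMIN']].apply(pd.to_numeric, errors='coerce')

imon = napa3['DATE'].dt.month
jj = (imon == 1) | (imon == 7)   # | (or), not & (and)

napa3.loc[jj, ['TMAX', 'TMIN']].plot.hist(alpha=0.5, title='TMAX & TMIN for January and July')
plt.show()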

Related

Pycharm problem set (Stuck from step 3 onwards)

Using the ff_monthly.csv data set https://github.com/alexpetralia/fama_french,
use the first column as an index (this contains the year and month of the data as a string).
Create a new column ‘Mkt’ as ‘Mkt-RF’ + ‘RF’.
Create two new columns in the loaded DataFrame, ‘Month’ and ‘Year’, to contain the year and month of the dataset extracted from the index column.
Create a new DataFrame with columns ‘Mean’ and ‘Standard Deviation’ and the full set of years from (b) above.
Write a function which accepts (r_m, s_m), the monthly mean and standard deviation of a return series, and returns a tuple (r_a, s_a), the annualised mean and standard deviation. Use the formulae: r_a = (1+r_m)^12 - 1, and s_a = s_m * 12^0.5.
Loop through each year in the data, and calculate the annualised mean and standard deviation of the new ‘Mkt’ column, storing each in the newly created DataFrame. Note that the values in the input file are % returns, and need to be divided by 100 to give decimals (i.e. the value for August 2022 represents a return of -3.78%).
Print the DataFrame and output it to a csv file.
Workings so far:
import pandas as pd
ff_monthly=pd.read_csv(r"file path")
ff_monthly=pd.read_csv(r"file path",index_col=0)
Mkt=ff_monthly['Mkt-RF']+ff_monthly['RF']
ff_monthly= ff_monthly.assign(Mkt=Mkt)
df=pd.DataFrame(ff_monthly)
There are a few things to pay attention to.
The Date is the index of your DataFrame. It is treated in a special way compared to the normal columns, which is why df.Date gives an AttributeError: Date is not an attribute, but the index. Instead, try df.index.
df.Date.str.split("_", expand=True) would work if your Date looked like 22_10. However, according to your picture it doesn't contain an underscore and also contains the day, so this cannot work.
In fact, the format you have does not follow any standard. The best way to deal with it is to parse it into a proper datetime64[ns] type that pandas understands, with df.index = pd.to_datetime(df.index, format='%y%m%d'). See the Python documentation for supported format strings.
If all this works, it should be rather straightforward to create the columns (note that a DatetimeIndex exposes .year directly; the .dt accessor is only needed on a Series):
df['year'] = df.index.year
In fact, this part has been asked before
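A minimal sketch of the remaining steps, assuming the column names 'Mkt-RF' and 'RF' from the question and an index string that pd.to_datetime can parse (the file path and the '%Y%m' format string below are assumptions about the file):
import pandas as pd

# Hypothetical path; the index column is assumed to hold year and month as e.g. '202208'
df = pd.read_csv('ff_monthly.csv', index_col=0)
df.index = pd.to_datetime(df.index.astype(str), format='%Y%m')

df['Mkt'] = df['Mkt-RF'] + df['RF']
df['Year'] = df.index.year
df['Month'] = df.index.month

def annualise(r_m, s_m):
    # r_a = (1 + r_m)^12 - 1, s_a = s_m * 12^0.5
    return (1 + r_m) ** 12 - 1, s_m * 12 ** 0.5

rows = []
for year, grp in df.groupby('Year'):
    r_m = (grp['Mkt'] / 100).mean()   # % returns -> decimals
    s_m = (grp['Mkt'] / 100).std()
    r_a, s_a = annualise(r_m, s_m)
    rows.append({'Year': year, 'Mean': r_a, 'Standard Deviation': s_a})

result = pd.DataFrame(rows).set_index('Year')
print(result)
result.to_csv('annualised_mkt.csv')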

The matplotlib chart changes when I change the index in python pandas dataframe

I have a dataset of S&P500 historical prices with the date, the price and other data that I don't need to solve my problem.
Date Price
0 1981.01 6.19
1 1981.02 6.17
2 1981.03 6.24
3 1981.04 6.25
. . .
and so on till 2020
The date is a float with the year, a dot and the month.
I tried to plot all historical prices with matplotlib.pyplot as plt.
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result. I used df["Price"].tail(100) so you can better see the difference between the first and the second graph (shown in a moment).
But then I tried to set the index from the one before(0, 1, 2 etc..) to the df["Date"] column in the DataFrame in order to see the date in the x axis.
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result, and it's quite disappointing.
The Date is where it should be, on the x axis, but the problem is that the graph is different from the one before, which is the correct one.
If you need the dataset to try out the problem here you can find it.
It is called U.S. Stock Markets 1871-Present and CAPE Ratio.
Hope you've understood everything.
Thanks in advance
UPDATE
I found something that could cause the problem. If you look closely at the dates, you can see that month #10 of each year is written (in the original dataset) as a float like this: for example, year 1884 becomes 1884.1. The problem occurs when you use pd.to_datetime() to transform the Date float series into a datetime: the date for month #10, when converted, becomes (using the example from before) 1884-01-01, which is the first month of the year, and this distorts the final plot.
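A tiny demonstration of that ambiguity (a sketch only; it simply shows how a '%Y.%m' parse reads a single-digit fractional part, which affects every October value unless the trailing zero is restored):
import pandas as pd
print(pd.to_datetime("1884.1", format="%Y.%m"))   # 1884-01-01, i.e. January, not October
print(pd.to_datetime("1884.10", format="%Y.%m"))  # 1884-10-01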
SOLUTION
Finally, I solved my problem!
Yes, the error was the one I explained in the UPDATE paragraph, so I decided to append a "0" (as a string) wherever the length of the Date (as a string) is 6, in order to change, for example: 1884.1 ==> 1884.10
df["len"] = df["Date"].apply(len)
df["Date"] = df["Date"].where(df["len"] == 7, df["Date"] + "0")
Then I drop the len column I've just created.
df.drop(columns="len", inplace=True)
At the end I converted "Date" to a datetime with pd.to_datetime:
df["Date"] = pd.to_datetime(df["Date"], format='%Y.%m')
df = df.set_index("Date")
And then I plot
df["Price"].tail(100).plot()
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
The easiest way would be to transform the date into an actual datetime index. This way matplotlib will automatically pick it up and plot it accordingly. For example, given your date format, you could do:
df["Date"] = pd.to_datetime(df["Date"].astype(str), format='%Y.%m')
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
Currently, the first plot you showed is actually plotting the Price column against the index, which seems to be a regular range index from 0 to 1800-and-something. Your data starts in 1981, and each observation is evenly spaced on the x axis (the jump from one index value to the next is always 1). That's why the shape of the chart looks reasonable, even though the x-axis values don't mean anything.
Now when you set the Date (as a float) to be the index, note that you're not evenly covering the interval between, for example, 1981 and 1982: you have evenly spaced values from 1981.01 to 1981.12, but nothing between 1981.12 and 1982. The second chart is plotted exactly as those float values dictate, which is why it looks distorted. Setting the index to a DatetimeIndex as described above should remove this issue, as matplotlib will know how to space the dates evenly along the x-axis.
I think your problem is that your Date is of float type, so using it as the x-axis does exactly what you would expect for an array of the kind [2012.01, 2012.02, ..., 2012.12, 2013.01, ...]. You might convert the Date column to a DatetimeIndex first and then use the built-in pandas plot method:
df["Price"].tail(100).plot()
It is not a good idea to treat df['Date'] as a float. It should be converted into pandas datetime64[ns], which can be achieved with the pandas pd.to_datetime method.
Try this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('ie_data.csv')
df=df[['Date','Price']]
df.dropna(inplace=True)
#converting to pandas datetime format
#pad the month to two digits so that e.g. '1884.1' (October) becomes '188410' rather than '18841'
df['Date'] = df['Date'].astype(str).map(lambda x: x.split('.')[0] + x.split('.')[1].ljust(2, '0'))
df['Date'] = pd.to_datetime(df['Date'], format='%Y%m')
df.set_index(['Date'],inplace=True)
#plotting
df.plot() #full data plot
df.tail(100).plot() #plotting just the tail
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
Output:

Is there a function to get the difference between two values on a pandas dataframe timeseries?

I am messing around in the NYT covid dataset which has total covid cases for each county, per day.
I would like to find out the difference of cases between each day, so theoretically I could get the number of new cases per day instead of total cases. Taking a rolling mean, or resampling every 2 days using a mean/sum/etc all work just fine. It's just subtracting that is giving me such a headache.
Tried methods:
df.resample('2d').diff()
'DatetimeIndexResampler' object has no attribute 'diff'
df.resample('1d').agg(np.subtract)
ufunc() missing 1 of 2 required positional argument(s)
df.rolling(2).diff()
'Rolling' object has no attribute 'diff'
df.rolling('2').agg(np.subtract)
ufunc() missing 1 of 2 required positional argument(s)
Sample data:
pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
'covid_cases':[1.2,2.0,2.9,3.6,3.9]
})
Desired sample output:
pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
'new_covid_cases':[np.nan,0.8,0.9,0.7,0.3]
})
Recreate sample data from original NYT dataset:
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv',parse_dates=['date'])
df = df.groupby(['state','date'])[['cases']].mean().reset_index()
Any help would be greatly appreciated! Would like to learn how to do this manually/via function rather than finding a "new cases" dataset as I will be working with timeseries a lot in the very near future.
Let's try this bit of complete code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
df['date'] = pd.to_datetime(df['date'])
df_daily_state = df.groupby(['date','state'])['cases'].sum().unstack()
daily_new_cases_AL = df_daily_state.diff()['Alabama']
ax = daily_new_cases_AL.iloc[-30:].plot.bar(title='Last 30 days Alabama New Cases')
Output:
Details:
Download the historical case records from the NYTimes GitHub using the raw URL.
Convert the dtype of the 'date' column to datetime dtype.
Group by the 'date' and 'state' columns, sum 'cases', and unstack the state level of the index to get dates as rows and states as columns.
Take the difference by columns and select only the Alabama column.
Plot the last 30 days.
The diff function is correct, but if you look at your error message:
'DatetimeIndexResampler' object has no attribute 'diff'
from your first tried method, it's because diff is a method available on DataFrames, not on Resamplers, so turn it back into a DataFrame by specifying how you want to aggregate the resample.
If you have the total number of COVID cases for each day and want to resample it to 2 days, you probably only want to keep the latest update out of the two days, in which case something like df.resample('2d').last().diff() should work.
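For the sample data in the question, a minimal sketch (assuming the day-over-day difference per state is what is wanted, rather than a 2-day resample):
import datetime as dt
import pandas as pd

df = pd.DataFrame(data={'state': ['Alabama', 'Alabama', 'Alabama', 'Alabama', 'Alabama'],
                        'date': [dt.date(2020, 3, 13), dt.date(2020, 3, 14), dt.date(2020, 3, 15),
                                 dt.date(2020, 3, 16), dt.date(2020, 3, 17)],
                        'covid_cases': [1.2, 2.0, 2.9, 3.6, 3.9]})

# diff within each state, so the first row of every state is NaN instead of a cross-state difference
df['new_covid_cases'] = df.groupby('state')['covid_cases'].diff()
print(df)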

Moving x rows up in a dataframe indexed on dates

I have a dataframe that has Date as its index. The dataframe has stock market related data, so the dates are not continuous. If I want to move, let's say, 120 rows up in the dataframe, how do I do that? For example:
If I want to get the data starting from 120 trading days before the start of year 2018, how do I do that with the selection below:
df['2018-01-01':'2019-12-31']
Thanks
Try this:
df[df.columns[df.columns.get_loc('2018-01-01'):df.columns.get_loc('2019-12-31')]]
Get the location of both dates in the column array and index between them to get the desired slice.
UPDATE:
Based on your requirement, make some small modifications to the above.
Yearly Indexing
>>> df[df.columns[(df.columns.get_loc('2018')).start:(df.columns.get_loc('2019')).stop]]
Above, df.columns.get_loc('2018') gives out a slice object, from which we take the index of the first element of 2018 (using the .start attribute of the slice); similarly, the .stop attribute gives us the end of 2019.
Monthly Indexing
Now consider that you want data for the first 6 months of 2018 (without knowing what the first day is); the same can be done using:
>>> df[df.columns[(df.columns.get_loc('2018-01')).start:(df.columns.get_loc('2018-06')).stop]]
As you can see above, we have indexed the first 6 months of 2018 using the same logic.
Assuming you are using pandas and the dataframe is sorted by dates, a very simple way would be:
initial_date = '2018-01-01'
initial_date_index = df.loc[df['dates']==initial_date].index[0]
offset=120
start_index = initial_date_index-offset
new_df = df.loc[start_index:]
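If the DataFrame is indexed by date (as in the question) rather than having a 'dates' column, a similar sketch using positional indexing on a sorted DatetimeIndex (searchsorted is used because '2018-01-01' itself may not be a trading day):
import pandas as pd

# df is assumed to be sorted by its DatetimeIndex
pos = df.index.searchsorted(pd.Timestamp('2018-01-01'))   # first row on or after that date
start = max(pos - 120, 0)                                  # step 120 trading days back
new_df = df.iloc[start:]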

Converting Pandas Column of dates to Days

I have a pandas table with a column called "date_of_work". This column contains date objects in the following format MM/DD/YYYY, for example 9/19/2016 or 12/5/2016.
I'm trying to create a new column that assigns each date a value between 1 and 365 so I can create a scatter plot with dates on the x axis. I created this function:
def converttoday(datex):
    datex = str(datex)
    newdate = datex.split('/')
    day1 = int(newdate[0]) * 30   # month * 30 as a rough approximation
    day2 = int(newdate[1])
    finaldate = day1 + day2
    return finaldate
It ignores the year because I don't care about that (more focused on seasonality). Any idea how to convert this? I'm getting an error when attempting this.
Any help appreciated!
Try this:
df['dayofyear'] = pd.to_datetime(df['date_of_work']).dt.dayofyear
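A short sketch of the full step (only date_of_work comes from the question; some_value is a placeholder for whatever belongs on the y axis):
import pandas as pd
import matplotlib.pyplot as plt

# .dt is needed because pd.to_datetime(...) returns a Series here
df['dayofyear'] = pd.to_datetime(df['date_of_work']).dt.dayofyear
plt.scatter(df['dayofyear'], df['some_value'])
plt.xlabel('Day of year')
plt.show()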
