I know this has been asked about 100 times, but I still don't get it and the given solutions don't get me anywhere.
I'm trying to convert timestamps into a comparable format with Pandas/Python. My data comes from database entries, and I currently have trouble using timestamps like these:
52 2017-08-04 12:26:56.348698
53 2017-08-04 12:28:22.961560
54 2017-08-04 12:34:20.299041
The goal is to use them as year1 and year2 to make a graph like this:
def sns_compare(year1, year2):
    f, ax1 = plt.subplots(1, figsize=LARGE_FIGSIZE)
    for yr in range(int(year1), int(year2)):
        sns.distplot(tag2[str(yr)].dropna(), hist=False, kde=True, rug=False, bins=25)
sns_compare(year1, year2)
When I try to do it like this I get ValueError: invalid literal for int() with base 10: '2017-08-04 12:34:20.299041'.
So currently I'm thinking about using regex to manipulate the time fields, but I can't imagine this is the way to go. I tried all kinds of suggestions from SO/GitHub but nothing really worked. I also don't know what the "optimal" time structure should look like. Is it 20170804123420299041 or something like 2017-08-04-12-34-20-299041? I hope somebody can make this clear to me.
This is your data:
from matplotlib import pyplot as plt
from datetime import datetime
import pandas as pd
df = pd.DataFrame([("2017-08-04 12:26",56.348698),("2017-08-04 12:28",22.961560),("2017-08-04 12:34",20.299041)])
df.columns = ["date", "val"]
First, we convert the column to datetime, then we subtract the start of year1 and express the difference in days.
df['date'] = pd.to_datetime(df["date"])
df["days"]=(df['date'] -datetime(year1,1,1)).dt.total_seconds()/86400.0
Then plot the data, displaying only the days between year1 and year2:
plt.scatter(df["days"],df["val"])
plt.xlim((0,(year2-year1)*365))
plt.show()
Have you looked at pd.to_datetime? Pandas and Seaborn should be able to handle dates fine, and you don't have to convert them to integers.
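As a minimal sketch of that suggestion: once the strings are parsed with pd.to_datetime, the year is available through the .dt accessor, so no int() cast or regex is needed. The sample rows and the names raw, ts and per_year below are made up for illustration; only the timestamp format follows the question.

```python
import pandas as pd

# Hypothetical sample in the same format as the question's rows
raw = pd.Series([
    "2016-03-01 09:15:00.000000",
    "2017-08-04 12:26:56.348698",
    "2017-08-04 12:28:22.961560",
    "2018-01-10 08:00:00.000000",
])

# Parse the strings into real datetime64 values
ts = pd.to_datetime(raw)

# Keep only rows whose year lies in [year1, year2), like range(year1, year2)
year1, year2 = 2016, 2018
mask = (ts.dt.year >= year1) & (ts.dt.year < year2)
selected = ts[mask]

# Count rows per year; each group could be passed to a plotting call instead
per_year = selected.groupby(selected.dt.year).size()
print(per_year)
```

The same mask could be applied to a whole DataFrame, so each year's rows can be handed to seaborn or matplotlib without converting the timestamps to integers.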
I have a pandas dataframe with a column "Datetime" which has values in pd.Timestamp / np.datetime64 format. How should I extract the hours and minutes while keeping the status of this "HH:MM" as "continuous plottable values?"
I want to plot a histogram of the dataframe column (pd.Series) based on the frequency in "HH:MM sense" in which case the x-axis would range from 00:00 to 23:59 etc.
import pandas as pd
# ...
new_df["Datetime"][0]
> Timestamp('2022-08-08 16:58:00')
I saw examples of extracting the time as a string. Not good enough. I could also use groupby hour and then e.g. plot a bar chart by count but that's not exactly what I was looking for, either...
...or I could convert each row to a string and then immediately back to pd.Timestamp with the same date. It's not ideal, but works. Any better ideas?
I battled with this a bit longer and got it working decently. Is this really the most straightforward way of doing it? The lambda stuff feels always a bit far-fetched, and this one still keeps the full date which isn't a problem per se but not necessary, either (and requires extra formatting on the xaxis).
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
fig, ax = plt.subplots()
plt.xticks(rotation=45)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
# pd.Timestamp automatically fills in today's date if no date (YYYY-MM-DD) is specified
new_df["Datetime"].apply(lambda t:pd.Timestamp(f'{t.hour:02d}:{t.minute:02d}')).hist(ax=ax)
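One alternative that avoids the lambda and the dummy date is to turn the time of day into a plain number, for example fractional hours. This is only a sketch under the assumption that a numeric 0 to 24 axis is acceptable; the sample values and the names stamps and hours are invented here.

```python
import pandas as pd

# Hypothetical timestamps standing in for new_df["Datetime"]
stamps = pd.Series(pd.to_datetime([
    "2022-08-08 16:58:00",
    "2022-08-09 00:30:00",
    "2022-08-10 23:59:00",
]))

# Time of day as fractional hours: continuous, and independent of the date
hours = stamps.dt.hour + stamps.dt.minute / 60
print(hours)
```

Calling hours.hist(bins=24) would then give one bin per hour of the day; the tick labels would show decimal hours rather than HH:MM, so this trades label formatting for simplicity.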
I want to plot histograms for timedelta64 values (example: Timedelta('0 days 00:00:44.749500')).
But in both cases, plotly's histogram() does not recognize the correct time and instead displays values such as 50T and 60T (see image).
How do I have to convert the datetime/timedelta so that plotly's histogram() recognizes the correct time axis? Thanks
fig = px.histogram(x=df_TT_redux["T_delta"],color=df_TT_redux["event_source"],log_y=True)
EDIT:
Thanks to LittlePanic404.
Converting to ISO format gives some interesting results. I guess I still have to tweak that a bit.
Using:
import isodate
df_TT_redux["T_delta3"]=[isodate.duration_isoformat(x) for x in df_TT_redux["T_delta"]]
fig = px.histogram(x=df_TT_redux["T_delta3"],color=df_TT_redux["event_source"],color_discrete_map=color_discrete_map,log_y=False,log_x=False,nbins=100)
However, another way of solving this could be to divide by a fixed unit, which yields plain floats:
df_TT_redux["T_delta2"]=df_TT_redux["T_delta"]/pd.Timedelta("1 hour")
or
.../pd.Timedelta("1 minute"), depending on your case.
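A small self-contained sketch of the division approach, with made-up durations in place of df_TT_redux["T_delta"]. Dividing a timedelta Series by a pd.Timedelta gives floats in that unit, and .dt.total_seconds() is an equivalent route for seconds.

```python
import pandas as pd

# Hypothetical durations standing in for df_TT_redux["T_delta"]
deltas = pd.Series(pd.to_timedelta([
    "0 days 00:00:44.749500",
    "0 days 00:30:00",
    "0 days 02:00:00",
]))

# Dividing by a unit Timedelta yields plain floats in that unit
minutes = deltas / pd.Timedelta("1 minute")
hours = deltas / pd.Timedelta("1 hour")

# Equivalent accessor for seconds
seconds = deltas.dt.total_seconds()
print(minutes, hours, seconds, sep="\n")
```

Because the result is an ordinary float column, a histogram over it gets a numeric axis, and the unit can be stated in the axis label.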
I have a dataset of S&P500 historical prices with the date, the price and other data that I don't need right now to solve my problem.
Date Price
0 1981.01 6.19
1 1981.02 6.17
2 1981.03 6.24
3 1981.04 6.25
. . .
and so on till 2020
The date is a float with the year, a dot and the month.
I tried to plot all historical prices with matplotlib.pyplot as plt.
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result. I used df["Price"].tail(100) so you can better see the difference between the first and the second graph (you are going to see it in a sec).
But then I tried to change the index from the default one (0, 1, 2, etc.) to the df["Date"] column of the DataFrame in order to see the date on the x-axis.
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result, and it's quite disappointing.
The Date is now where it should be, on the x-axis, but the problem is that the graph differs from the previous one, which is the correct one.
If you need the dataset to try out the problem here you can find it.
It is called U.S. Stock Markets 1871-Present and CAPE Ratio.
Hope you've understood everything.
Thanks in advance
UPDATE
I found something that could cause the problem. If you look closely at the dates, you can see that month 10 of each year is written as a float in the original dataset without its trailing zero, for example 1884.1 for October 1884. The problem occurs when you use pd.to_datetime() to transform the Date float series into a Datetime: the October value becomes 1884-01-01, which is the first month of the year, and this distorts the final plot.
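The pitfall can be reproduced without the dataset. The sketch below uses the standard library's datetime.strptime rather than pd.to_datetime, purely to illustrate the effect of the lost trailing zero; the variable names are hypothetical.

```python
from datetime import datetime

# A float cannot remember its trailing zero: 1884.10 and 1884.1 are the same number
as_text = str(1884.10)
print(as_text)  # 1884.1

# "%m" accepts a single digit, so the "1" is read as January, not October
parsed = datetime.strptime(as_text, "%Y.%m")
print(parsed.month)  # 1

# Formatting with two decimals restores the zero before parsing
padded = f"{1884.10:.2f}"
fixed = datetime.strptime(padded, "%Y.%m")
print(fixed.month)  # 10
```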
SOLUTION
Finally, I solved my problem!
Yes, the error was the one I explained in the UPDATE paragraph, so I decided to append a 0 wherever the length of the Date (as a string) is 6, in order to change, for example, 1884.1 ==> 1884.10
df["Date"] = df["Date"].astype(str)  # the length check below needs strings
df["len"] = df["Date"].apply(len)
df["Date"] = df["Date"].where(df["len"] == 7, df["Date"] + "0")
Then I drop the len column I've just created.
df.drop(columns="len", inplace=True)
At the end I changed the "Date" column to a Datetime with pd.to_datetime
df["Date"] = pd.to_datetime(df["Date"], format='%Y.%m')
df = df.set_index("Date")
And then I plot
df["Price"].tail(100).plot()
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
The easiest way would be to transform the date into an actual datetime index. This way matplotlib will automatically pick it up and plot it accordingly. For example, given your date format, you could do:
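# format with two decimals so that October (1884.1 as a float) becomes "1884.10" and is not read as January
df["Date"] = pd.to_datetime(df["Date"].map("{:.2f}".format), format='%Y.%m')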
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
Currently, the first plot you showed is actually plotting the Price column against the index, which seems to be a regular range index from 0 to 1800 and something. You suggested your data starts in 1981, but the index does not encode that: each observation is simply evenly spaced on the x-axis (at an interval of 1, which is the jump from one index value to the next). That's why the chart looks reasonable, yet the x-axis values don't.
Now when you set the Date (as float) to be the index, note that you're not evenly covering the interval between, for example, 1981 and 1982. You have evenly spaced values from 1981.01 to 1981.12, but nothing from 1981.12 to 1982.01. So the second chart is also plotted exactly as the data dictates, just not as you intended. Setting the index to a DatetimeIndex as described above should remove this issue, as Matplotlib will know how to evenly space the dates along the x-axis.
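To make the spacing argument concrete, here is a sketch that converts year.month floats numerically instead of through strings, which also sidesteps the 1981.1 (October) ambiguity. The sample values and the names year, month and index are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the Date column; 1981.1 means October
dates = pd.Series([1981.01, 1981.1, 1981.12, 1982.01])

# As floats, December to January is a jump of 0.89 while other steps are about 0.01
print(dates.diff())

# Split each float into integer year and month parts
year = np.floor(dates).astype(int)
month = np.rint((dates - year) * 100).astype(int)

# Assemble real timestamps from the components
index = pd.to_datetime(pd.DataFrame({"year": year, "month": month, "day": 1}))
print(index)
```

With real timestamps, the step from December 1981 to January 1982 is one month like any other, so the line is no longer stretched at each year boundary.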
I think your problem is that your Date is of float type, and taking it as the x-axis does exactly what is expected for an array of the kind ([2012.01, 2012.02, ..., 2012.12, 2013.01, ...]). You might convert the Date column to a DatetimeIndex first and then use the built-in pandas plot method:
df["Price"].tail(100).plot()
It is not a good idea to treat df['Date'] as float. It should be converted into pandas datetime64[ns]. This can be achieved using pandas pd.to_datetime method.
Try this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('ie_data.csv')
df=df[['Date','Price']]
df.dropna(inplace=True)
# converting to pandas datetime format; pad the month so that 1884.1 (October) becomes "188410"
df['Date'] = df['Date'].astype(str).map(lambda x: x.split('.')[0] + x.split('.')[1].ljust(2, '0'))
df['Date'] = pd.to_datetime(df['Date'], format='%Y%m')
df.set_index(['Date'],inplace=True)
#plotting
df.plot() #full data plot
df.tail(100).plot() #plotting just the tail
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
Output:
I am building a plot. I have two types of data: the timestamp column stores dates and the favorite_count column stores the number of likes. I want to visualise the favorite count over the time since the tweet was posted.
I believe your timestamps are strings. Convert them to datetime type and matplotlib/pandas will give you a nicer x-axis:
df['timestamp'] = pd.to_datetime(df['timestamp'])
# plot; pass figsize to df.plot, which creates its own figure
df.plot(x='timestamp', y='favorite_count', figsize=(20, 5))
I have a scenario where the index consists of datetime objects and the data I want to plot are sales counts. Most of the time, there are multiple sales throughout the day, and each day can have a different number of sales. I would like to create a plot that shows a date range and nicely formats the xticklabels, depending on how many days I'd like to show in the plot. Kind of like this. I've tried different variants of code but have thus far been unsuccessful. Could someone look at my script below and please help me?
import pandas as pd
import matplotlib.pyplot as plt
index1 = ['2017-07-01','2017-07-01','2017-07-02','2017-07-02','2017-07-03','2017-07-03','2017-07-03']
index2 = pd.to_datetime(index1,format='%Y-%m-%d')
df = pd.DataFrame([[123456],[123789],[123654],[654321],[654987],[789456],[789123]],columns=['Count'],index=index1)
df.plot(kind='box')
plt.show()
Use T, transpose and reshape your dataframe.
df.T.plot(kind='box', figsize=(10,7))
Output:
Okay, to keep those dates as separate records and draw one box per date, let's do this:
df.reset_index().set_index('index',append=True).unstack()['Count'].plot(kind='box',figsize=(10,7))
This is better:
import numpy as np
df.set_index(np.arange(len(df)),append=True).unstack(0)['Count']\
.plot(kind='box',figsize=(10,7))
Output:
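The reshaping behind those one-liners can also be spelled out step by step. This is a sketch, not the answerer's method: it numbers the sales within each day and pivots, using the question's data; the names long, n and wide are introduced here for illustration.

```python
import pandas as pd

# The question's data, with a real DatetimeIndex
dates = pd.to_datetime(["2017-07-01", "2017-07-01", "2017-07-02", "2017-07-02",
                        "2017-07-03", "2017-07-03", "2017-07-03"])
df = pd.DataFrame(
    {"Count": [123456, 123789, 123654, 654321, 654987, 789456, 789123]},
    index=dates,
)

# Move the dates into a column and number the sales within each day
long = df.rename_axis("date").reset_index()
long["n"] = long.groupby("date").cumcount()

# One column per day; days with fewer sales are padded with NaN
wide = long.pivot(index="n", columns="date", values="Count")

# Short date strings make tidier tick labels than full timestamps
wide.columns = wide.columns.strftime("%Y-%m-%d")
print(wide)
```

Calling wide.plot(kind='box', figsize=(10, 7)) on the result should draw one box per day, with the formatted dates as x tick labels.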