I have a relatively large data set of accidents, which contains a column called 'Time'. Each row has a time value, stored as a datetime object. I would like to plot a histogram showing the frequency distribution of these times across time periods.
On the x-axis I would have the time periods (or their starting points), and on the y-axis the number of rows/data points that fall into each period. Don't think of this as bivariate data with time serving as an index; think of it as a single series, Time. I only need the frequency distribution. All the existing questions and answers deal with data in a time-series context, but the data itself is really not relevant here.
This worked. It was pretty straightforward.
df['Time'].hist(bins=24)
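In case the column resists direct binning (older pandas versions may raise a TypeError when hist() meets raw datetimes), extracting the hour first gives the same hour-of-day distribution. A minimal sketch with made-up times, assuming 'Time' is already datetime64:

import pandas as pd
import matplotlib.pyplot as plt

# toy stand-in for the accident data; column name taken from the question
df = pd.DataFrame({'Time': pd.to_datetime(
    ['2020-01-01 08:15', '2020-01-01 08:40', '2020-01-02 17:30'])})

df['Time'].dt.hour.hist(bins=24)  # 24 bins, one per hour of day
plt.show()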
I am trying to plot the availability of my network per hour. I have a massive dataframe containing multiple variables, including the availability and the hour. I can clearly visualise everything I want to plot when I do the following:
mond_data = mond_data.groupby('Hour')['Availability'].mean()
The only problem is, if I wrap the code above in brackets and plot it (i.e. (the code above).plot()), I do not get an 'Hour' label on my x-axis. How can I plot this so that the x-axis shows the Hour values? I should have 24 of them, since the code above computes an average for each hour of the day, from midnight to 11pm.
Here is how I solved it.
hourly_mean = mond_data.groupby('Hour')['Availability'].mean()
plt.plot(hourly_mean.index, hourly_mean.values)
For some reason the index was not being plotted on the x-axis unless it was passed explicitly. I have not tested many cases, so additional explanation of this behaviour is still welcome.
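For what it's worth, the grouped result is a Series whose index is named 'Hour', so plotting it directly should put the hours on the x-axis as well. A sketch of the same aggregation using the Series' own plot method (recent pandas versions also pick up the index name as the axis label, but plt.xlabel makes it explicit):

import matplotlib.pyplot as plt

hourly_mean = mond_data.groupby('Hour')['Availability'].mean()
hourly_mean.plot()   # index (Hour) on the x-axis, mean availability on the y-axis
plt.xlabel('Hour')
plt.show()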
I am trying to decompose a time series; however, my data does not have dates. It is composed of entries taken at regular (but unknown) time intervals.
This solution is great and exactly what I want, however it assumes that my series has a datetime index, which it does not.
I can estimate the frequency parameter in this specific case, but this will need to be automated for different data sets, so I cannot simply hard-code the freq parameter of the seasonal_decompose function (unless there is some way to calculate it automatically) to make up for the fact that my series lacks a datetime index.
I have managed to estimate the season length by using the seasonal Python package: calling its fit_seasons function and then taking the length of the returned seasons.
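A sketch of how the two pieces can fit together, assuming the observations live in a 1-D array called values (the name is mine) and a recent statsmodels is installed (older versions call the period parameter freq):

import numpy as np
from seasonal import fit_seasons
from statsmodels.tsa.seasonal import seasonal_decompose

seasons, trend = fit_seasons(np.asarray(values))  # seasons is None if nothing seasonal is found
if seasons is not None:
    period = len(seasons)  # estimated season length
    result = seasonal_decompose(np.asarray(values), model='additive', period=period)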
I am trying to plot this time series in a chart, but the canvas is empty.
My time series is quite simple (the screenshot of the data is not reproduced here). I want to plot DATE on the x-axis and PAYEMS on the y-axis.
At first, I was getting an error because my dates were strings, so I converted them in cell 11.
You do not want to use tsplot to plot a time series. The name is a bit confusing, but as the documentation puts it, tsplot is "intended to be used with data where observations are nested within sampling units that were measured at multiple timepoints". As a rule of thumb: if you understand this sentence, you will know when to use it; if you don't understand this sentence, don't use it. Apart from that, tsplot will be removed or significantly altered in a future release, so its use is deprecated.
But that doesn't matter, because you can directly use pandas to plot the time series.
df.plot(x="DATE", y="PAYEMS")
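For completeness, a minimal end-to-end sketch (the file name is hypothetical; the column names mirror the question). Note the explicit plt.show() at the end: an empty canvas is often just a figure that was never rendered:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('payems.csv')           # hypothetical file name
df['DATE'] = pd.to_datetime(df['DATE'])  # parse the string dates first
df.plot(x='DATE', y='PAYEMS')
plt.show()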
The data come to me in Excel. MS-Access is also available, of course. I also have SAS and Python.
I have data in two columns which I named DateTime and Observation. Observations are numeric and correspond to hourly readings. When sorted, there are runs of consecutive hourly observations over one or more days, separated logically (and irregularly) by gaps in time greater than one hour.
I need to automate the identification of these time blocks (24,000 records) and calculate the average, minimum, and maximum Observation for each discrete time block.
Here is a PivotTable example (screenshot not reproduced here).
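Since Python is among the available tools, here is a pandas sketch of the same gap-based blocking (column names taken from the question; df is assumed to hold the two columns):

import pandas as pd

df = df.sort_values('DateTime')

# a new block starts wherever the gap to the previous reading exceeds one hour
new_block = df['DateTime'].diff() > pd.Timedelta(hours=1)
df['Block'] = new_block.cumsum()

# average, minimum, and maximum Observation for each discrete time block
summary = df.groupby('Block')['Observation'].agg(['mean', 'min', 'max'])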
Here's the scenario. Let's say I have data from a visual psychophysics experiment, in which a subject indicates whether the net direction of motion in a noisy visual stimulus is to the left or to the right. The atomic unit here is a single trial and a typical daily session might have between 1000 and 2000 trials. With each trial are associated various parameters: the difficulty of that trial, where stimuli were positioned on the computer monitor, the speed of motion, the distance of the subject from the display, whether the subject answered correctly, etc. For now, let's assume that each trial has only one value for each parameter (e.g., each trial has only one speed of motion, etc.). So far, so easy: trial ids are the Index and the different parameters correspond to columns.
Here's the wrinkle. With each trial are also associated variable length time series. For instance, each trial will have eye movement data that's sampled at 1 kHz (so we get time of acquisition, the x data at that time point, and y data at that time point). Because each trial has a different total duration, the length of these time series will differ across trials.
So... what's the best means for representing this type of data in a pandas DataFrame? Is this something that pandas can even be expected to deal with? Should I go to multiple DataFrames, one for the single valued parameters and one for the time series like parameters?
I've considered adopting a MultiIndex approach where level 0 corresponds to trial number and level 1 corresponds to the time of continuous data acquisition. Then all I'd need to do is repeat the single-valued columns to match the length of the time series on that trial. But I immediately foresee two problems. First, the number of single-valued columns is large enough that extending each one of them to match the length of the time series seems very wasteful, if not impractical. Second, and more importantly, if I want to do basic groupby analyses (e.g. getting the proportion of correct responses at a given difficulty level), this will give biased (incorrect) results, because whether each trial was correct or wrong will be repeated as many times as needed to match the length of the time series on that trial (which is irrelevant to the computation of the mean across trials).
I hope my question makes sense and thanks for suggestions.
I've also just been dealing with this type of issue. I have a bunch of motion-capture data that I've recorded, containing the x-, y-, and z-locations of several motion-capture markers at time intervals of 10 ms, but there are also a couple of single-valued fields per trial (e.g., which task the subject is doing).
I've been using this project as a motivation for learning about pandas so I'm certainly not "fluent" yet with it. But I have found it incredibly convenient to be able to concatenate data frames for each trial into a single larger frame for, e.g., one subject:
import pandas as pd

# keys adds the trial number as the outer level of a MultiIndex
subject_df = pd.concat(
    [pd.read_csv(t) for t in subject_trials],
    keys=range(len(subject_trials)))
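With the trial number as the outer index level, one trial's full time series can then be pulled out with, e.g., subject_df.loc[3] (given the integer keys used above).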
Anyway, my suggestion for how to combine single-valued trial data with continuous time recordings is to duplicate the single-valued columns down the entire index of your time recordings, like you mention toward the end of your question.
The only thing you lose by denormalizing your data in this way is that your data will consume more memory; however, provided you have sufficient memory, I think the benefits are worth it, because then you can do things like group individual time frames of data by the per-trial values. This can be especially useful with a stacked data frame!
As for removing the duplicates for, e.g., trial outcome analysis, it's really straightforward to do this:
df.outcome.unique()
assuming your data frame has an "outcome" column.
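If the goal is the unbiased per-trial statistic the question worries about, one option is to collapse back to one row per trial before aggregating. A sketch, assuming the (trial, time) MultiIndex from above and hypothetical column names 'difficulty' and 'correct':

# first sample of each trial; the single-valued columns repeat within a trial,
# so any one row of the trial will do
per_trial = subject_df.groupby(level=0).first()

# e.g. proportion correct per difficulty level, one vote per trial
per_trial.groupby('difficulty')['correct'].mean()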