Cleaning up x-axis because there are too many datapoints - python

I have a data set that is like this
Date Time Cash
1/1/20 12:00pm 2
1/1/20 12:02pm 15
1/1/20 12:03pm 20
1/1/20 15:06pm 30
2/1/20 11:28am 5
. .
. .
. .
3/1/20 15:00pm 3
I basically grouped all the data by date along the y-axis and time along the x-axis, and plotted a facetgrid as shown below:
df_new= df[:300]
g = sns.FacetGrid(df_new.groupby(['Date','Time']).Cash.sum().reset_index(), col="Date", col_wrap=3)
g = g.map(plt.plot, "Time", "Cash", marker=".")
g.set_xticklabels(rotation=45)
What I got back was hideous (as shown below). Is there any way to tidy up the x-axis? For example, showing only 5-10 time labels so that the times are legible, or enlarging the image?
Edit: I am plotting using seaborn. I would like it to look something like the plot below, where the x-axis has only a couple of labels:
Thanks for your inputs.

Have you tried using a moving average instead of the raw data? You can compute the moving average of any data with the following function:
import numpy as np

def moving_average(a, n=10):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n
Set n to the window size you need; you can play around with that value. In your case, a is the Cash column as a NumPy array.
After that, plot the moving average in place of the raw Cash values; the curve will be smoother.
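As a quick check of what the function returns, here is a minimal sketch on made-up numbers (the `cash` values are hypothetical). The result is n - 1 elements shorter than the input, so it cannot be assigned straight back to the Cash column. The pandas rolling mean used for comparison is a swapped-in alternative, not part of the answer above; it keeps the original length and fills the first n - 1 entries with NaN.

```python
import numpy as np
import pandas as pd

def moving_average(a, n=10):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

# Hypothetical Cash values, for illustration only
cash = pd.Series([2, 15, 20, 30, 5, 3])

# Cumulative-sum version: len(cash) - (n - 1) = 4 values come back
smoothed = moving_average(cash.to_numpy(), n=3)

# Same window via pandas, aligned with the original rows
# (the first n - 1 entries are NaN)
rolled = cash.rolling(window=3).mean()

print(len(cash), len(smoothed), len(rolled))
```

The two results agree wherever the rolling mean is defined, so either can be plotted; the pandas version is simply easier to store as a new column.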
P.S. The suicides plot you added in your edit is hard to read, because the y-axis range is far larger than the data needs. In practice, try to avoid such plots.
Edit
I did not notice at first how you aggregate the data; you might want to work with date and time merged into one column. I do not know where you load the data from. If you load it from a CSV file, you can pass this to read_csv: parse_dates=[['Date', 'Time']]. If not, you can build the column on the dataframe:
df['datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
This creates a new datetime column that you can work with directly.
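Once the x-axis holds real datetimes rather than strings, matplotlib can pick a small number of ticks on its own. A minimal sketch, assuming made-up rows in the question's layout, a day/month/year date order, and a plain matplotlib axis instead of a FacetGrid (the same locator and formatter calls should apply to each axis in `g.axes.flat`, but that is not shown here):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd

# Made-up rows in the question's layout (the day/month order is an assumption)
df = pd.DataFrame({
    "Date": ["1/1/20"] * 4,
    "Time": ["12:00pm", "12:02pm", "12:03pm", "3:06pm"],
    "Cash": [2, 15, 20, 30],
})
df["datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                format="%d/%m/%y %I:%M%p")

fig, ax = plt.subplots()
ax.plot(df["datetime"], df["Cash"], marker=".")

# Let matplotlib choose the tick positions, capped at roughly 6 ticks,
# and label them with hours and minutes only
ax.xaxis.set_major_locator(mdates.AutoDateLocator(maxticks=6))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))
fig.autofmt_xdate(rotation=45)
```

Because the ticks are now placed by time rather than one per data point, adding more rows does not add more labels.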

Related

Changing y axis limit in a loop

I am working with a dataframe that holds a week of data. Since it is huge, I split it into one dataframe per day and plot temperature against time. On most days the temperature stays within 20 to 30, but on some days it exceeds 30. I need the y-axis limits to be (20, 30) on days when the temperature stays within 20 to 30, and (0, 50) when it falls outside that range. My current code looks like this:
listofDF = [df_i for (_, df_i) in df.groupby(pd.Grouper(key="filename", freq="1D"))]  # split the dataframe by day
for df in listofDF:
    if len(df) != 0:
        y = df[' Temp']
        x = df['time']
        plt.plot(x, y)
        plt.ylim(20, 30)
Thanks for your help in advance. I know this may seem like an odd requirement. The reason is that I am going through a lot of data and want a standard scale, so I can keep scanning plots without having to read the y-axis each time.
You could use:
plt.ylim(y.min() - 5, y.max() + 5)
This scales every plot between its minimum and maximum temperature values (plus or minus 5, to leave some empty space above and below).
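The question asks for one of two fixed ranges rather than a data-driven one. A minimal sketch of that rule, assuming hypothetical temperature readings and a helper name (`choose_ylim`) that does not appear in the original post:

```python
def choose_ylim(values, narrow=(20, 30), wide=(0, 50)):
    """Return the narrow limits if every value fits inside them, else the wide ones."""
    lo, hi = narrow
    if min(values) >= lo and max(values) <= hi:
        return narrow
    return wide

# Hypothetical daily temperature readings
calm_day = [21.5, 24.0, 29.9]
hot_day = [22.0, 31.2, 27.4]

print(choose_ylim(calm_day))  # (20, 30)
print(choose_ylim(hot_day))   # (0, 50)
```

Inside the loop this would be called as plt.ylim(*choose_ylim(y)) after plotting each day's data.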

How to loop to create plots for each hour of day of geospatial data?

I am trying to create an animated map (by generating multiple plots) of road traffic throughout a week, where the thickness of roads is represented by the volume of traffic at a specific time of day.
This is sort of what I'm looking for (but for each hour of each day):
The data has a structure that looks like this:
HMGNS_LNK_ID geometry DOW Hour Normalised Value
2 MULTILINESTRING ((251... 1 0 0.233623
2 MULTILINESTRING ((251... 1 1 0.136391
2 MULTILINESTRING ((251... 1 2 0.108916
DOW stands for 'day of the week' (1 = Monday) and so for every Hour of each of the 7 days I want to plot the map with roads' thickness by the value Normalised Value.
I run into a problem when looping with this code:
for dow in df['DOW']:
    fig, ax = plt.subplots(1)
    day_df = df[df['DOW']==dow]
    for hour in day_df['Hour']:
        day_hour_df = day_df[day_df['Hour']==hour]
        day_hour_df.plot(ax=ax, linewidth=day_hour_df['Normalised Value'])
        plt.savefig("day{}_hour{}.png".format(dow, hour), dpi = 200, facecolor='#333333')
The problem is that the figures are saved only for day 1, so until day1_hour_23 and after that, it comes back to day1_hour0 and overwrites the plot with something new. I can't figure out why it stops at DOW 2.
I'm not even sure if the data structure is correct. I would greatly appreciate any help with that. Please find the full code in my repo.
Cheers!
The problem is with the way you loop and subset df. Let's go through the loop in detail. First time in the outer loop, dow will be 1 and day_df = df[df['DOW']==dow] will select all rows with 1 in the column DOW. Now the inner loop goes through the selected rows and creates day1_hour0 to day1_hour23. Inner loop done, great.
Now we come second time into the outer loop and dow is again 1. day_df = df[df['DOW']==dow] will select all rows with 1 in the column DOW, i.e., the same set of rows that it used the previous time through the outer loop. So, it (re)writes day1_hour0 to day1_hour23 again.
I would suggest using (geo)pandas.groupby:
for dow, day_gdf in df.groupby("DOW"):
    for hour, day_hour_gdf in day_gdf.groupby("Hour"):
        fig, ax = plt.subplots(1)
        print(f"Doing dow={dow}, hour={hour}")
        day_hour_gdf.plot(ax=ax, linewidth=day_hour_gdf['Normalised Value'])
        plt.savefig("day{}_hour{}.png".format(dow, hour), dpi = 200, facecolor='#333333')
        plt.close()
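To see why the original loop repeats day 1, compare the two ways of iterating on a toy table. The values are made up; only the column names come from the question.

```python
import pandas as pd

# Toy stand-in for the traffic table: 2 days x 2 hours, 3 road links each
df = pd.DataFrame({
    "DOW": [1] * 6 + [2] * 6,
    "Hour": [0, 0, 0, 1, 1, 1] * 2,
    "Normalised Value": [0.2, 0.1, 0.3, 0.4, 0.5, 0.6] * 2,
})

# Iterating over the column visits every row, so day 1 comes up six times in a row
visited_by_rows = list(df["DOW"])

# groupby yields each (day, hour) combination exactly once
visited_by_groups = [(int(dow), int(hour))
                     for dow, day_df in df.groupby("DOW")
                     for hour, _ in day_df.groupby("Hour")]

print(visited_by_rows)    # [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
print(visited_by_groups)  # [(1, 0), (1, 1), (2, 0), (2, 1)]
```

With the real data, day 1 has many more rows than six, which is why the original loop appeared to be stuck on day 1.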
Bonus Tip: Check out pandas-bokeh if you want to generate interactive graphs with background tiles that can be saved as HTML files or embedded in Jupyter notebooks. The learning curve can be a bit steep with bokeh, but you can produce really nice interactive plots.
Cheers!

Visualize NaN-Values in Features of a Class via Pandas GroupBy

Thanks to this kind and helpful community, I solved the first problem I had in my work, which you can see here: Basic Problem - necessary for understanding the upcoming
After I used this, I wanted to visualize the distribution of the classes and the NaN values in the features, so I plotted them in a bar chart. With a few classes this works well.
The problem is that I have about 120 different classes and 50000 data objects in total, so the plots are not readable with this amount of data.
I therefore want to split the visualization: for each class there should be a subplot showing the sum of the NaN values of each feature.
Data:
CLASS FEATURE1 FEATURE2 FEATURE3
X 1 1 2
B 0 0 0
C 2 3 1
Actual Plot:
Expected Plots:
None of my approaches has worked so far.
I tried df.groupby('Class').plot(kind="barh", subplots=True), but it completely destroyed the layout and plotted per feature, not per class.
I also tried this approach. If I store my groupby result in the variable 'grouped', I can print it in a perfect format with all the information, but I cannot access it the way the solution does; I always get the error: 'string indices must be integers'
My approach:
grouped = df.groupby('Class')
for name, group in grouped:
    group.plot.bar()
EDIT - Further Information
The data I use is completely categorical, with no numerical values. I want to display the number of NaN values in the different features of the classes (labels) of my dataset.
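A solution using seaborn:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# reshape to long format: one row per (CLASS, feature) pair
ndf = pd.melt(df, id_vars="CLASS", var_name="feature", value_name="val")
# x and y are passed by keyword; recent seaborn releases no longer accept them positionally
sns.catplot(x="feature", y="val", col="CLASS", data=ndf, kind="bar", col_wrap=1)
plt.show()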
Grouping is the way to go; just set the labels:
for name, grp in df3.groupby('CLASS'):
    ax = grp.plot.bar()
    ax.set_xticks([])
    ax.set_xlabel(name)
With the solution provided by @meW I was able to get a result close to my goal.
I had to take two steps to actually use his solution.
First, cast the GroupBy object to a DataFrame via df = pd.DataFrame(df.groupby('Class').count().rsub(df.groupby('Class').size(), axis=0))
Second, because of the groupby query the Class column became the index, so I had to turn it back into a column via grouped['class'] = grouped.index
Two questions arise from this solution. First, is it possible to fit the ticks to the different amounts of NaN values? Some classes have only 5-10 NaN values in their features, while others have over 1000 (see pictures).
Second, the feature names are only shown in the last plot. Is there a way to add them to the x-axis of every plot?
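For reference, the per-class NaN counts described above can also be computed directly from the raw frame by flagging missing cells and summing the flags within each class. A minimal sketch with made-up categorical data; the column names follow the question's sample table and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up categorical data with missing entries
df = pd.DataFrame({
    "CLASS":    ["X", "X", "B", "C", "C", "C"],
    "FEATURE1": ["a", np.nan, "b", np.nan, np.nan, "c"],
    "FEATURE2": [np.nan, "a", "b", np.nan, np.nan, np.nan],
})

# Flag missing cells (True/False), then sum the flags within each class
nan_counts = df.drop(columns="CLASS").isna().groupby(df["CLASS"]).sum()
print(nan_counts)
```

As a possible follow-up to the two open questions, nan_counts.T.plot(kind="bar", subplots=True, sharex=False, sharey=False) should draw one subplot per class, each with its own y scale and its own feature labels on the x-axis; this has not been tried against the asker's data.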

how to plot two time series that have different sample rates on the same graph with matplotlib

I have two sets of data that I would like to plot on the same graph. Both sets of data have 200 seconds worth of data. DatasetA (BLUE) is sampled at 25 Hz and DatasetB (Red) is sampled at 40Hz. Hence DatasetA has 25*200 = 5000 (time,value) samples... and DatasetB has 40*200 = 8000 (time,value) samples.
datasets with different sample rates
As you can see above, I have managed to plot these in matplotlib using the 'plot_date' function. As far as I can tell, the 'plot' function will not work because the number of (x,y) pairs are different in each sample. The issue I have is the format of the xaxis. I would like the time to be a duration in seconds, rather than an exact time of the format hh:mm:ss. Currently, the seconds value resets back to zero when it hits each minute (as seen in the zoomed out image below).
zoomed out full time scale
How can I make the plot show the time increasing from 0-200 seconds rather than showing hours:min:sec ?
Is there a matplotlib.dates.DateFormatter that can do this (I have tried, but can't figure it out...)? Or do I somehow need to manipulate the datetime x-axis values to be a duration, rather than an exact time? (how to do this)?
FYI:
The code below is how I am converting the original csv list of float values (in seconds) into datetime objects, and again into matplotlib date-time objects -- to be used with the axes.plot_date() function.
from matplotlib import dates
import datetime
## arbitrary start date... we're dealing with milliseconds here.. so only showing time on the graph.
base_datetime = datetime.datetime(2018,1,1)
csvDateTime = map(lambda x: base_datetime + datetime.timedelta(seconds=x), csvTime)
csvMatTime = map(lambda x: dates.date2num(x), csvDateTime)
Thanks for your help/suggestions!
Well, thanks to ImportanceOfBeingErnst for pointing out that I was vastly over-complicating things...
It turns out that I really only need the ax.plot(x,y) function rather than the ax.plot_date(mdatetime, y) function. Plot can actually plot varied lengths of data as long as each individual trace has the same number of x and y values. Since the data is all given in seconds I can easily plot using 0 as my "reference time".
For anyone else struggling with plotting a duration rather than exact times, you can simply manipulate the "time" (x) data with Python's map() function, or better yet a list comprehension, to "time shift" the data or to convert it to a single unit of time (e.g. turn minutes into seconds by multiplying by 60).
"Time Shifting" might look like:
# build some sample 25 Hz time data
time = range(0,1000,1)
time = [x*.04 for x in time]
# "time shift" it by 5 seconds, since this data is recorded 5 seconds after the other signal
time = [x+5 for x in time]
Here is my plotting code for any other matplotlib beginners like me :) (this will not run, since I have not converted my variables to generic data... but nevertheless it is a simple example of using matplotlib.)
fig,ax = plt.subplots()
ax.grid()
ax.set_title(plotTitle)
ax.set_xlabel("time (s)")
ax.set_ylabel("value")
# begin looping over the different sets of data.
tup = 0
while (tup < len(alldata)):
    outTime = alldata[tup][1].get("time")
    # each signal is time shifted 5 seconds later.
    # in addition each signal has different sampling frequency,
    # so len(outTime) is different for almost every signal.
    outTime = [x +(5*tup) for x in outTime]
    for key in alldata[tup][1]:
        if(key not in channelSelection):
            ## if we dont want to plot that data then skip it.
            continue
        else:
            data = alldata[tup][1].get(key)
            ## using list comprehension to scale y values.
            data = [100*x for x in data]
            ax.plot(outTime,data,linestyle='solid', linewidth='1', marker='')
    tup+=1
plt.show()
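Since the code above depends on variables that are not shown, here is a self-contained sketch of the same idea using the sample rates from the question. The signals are made up (a sine and a cosine); only the 25 Hz and 40 Hz rates and the 200 second duration come from the post.

```python
import math
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

duration = 200  # seconds

# Time axes in plain seconds: 25 Hz -> 5000 samples, 40 Hz -> 8000 samples
time_a = [i / 25 for i in range(25 * duration)]
time_b = [i / 40 for i in range(40 * duration)]

# Made-up signals, for illustration only
value_a = [math.sin(0.1 * t) for t in time_a]
value_b = [math.cos(0.1 * t) for t in time_b]

fig, ax = plt.subplots()
# Each trace only needs matching x and y lengths; the two traces may differ
ax.plot(time_a, value_a, color="blue", label="DatasetA (25 Hz)")
ax.plot(time_b, value_b, color="red", label="DatasetB (40 Hz)")
ax.set_xlabel("time (s)")
ax.set_ylabel("value")
ax.legend()
```

Because the x values are plain numbers of seconds, the axis runs from 0 to 200 without any date formatter.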

Plotting in ggplot with non-discrete x and y

I want to create a ggplot where the x-axis is a distance (currently the distances are continuous values that range between 0 and 45 feet) that can be binned and the y-axis is whether or not the basket was made (0 is missed, 1 is made). Here is a subset of the dataframe, which is a pandas dataframe. EDIT: Not sure this is helpful, but I have also added a column that represents the bucket/bin for each attempt's shot distance.
distance(ft) outcome category
----------- --------- --------
9.5 1 (9,18]
23.3 1 (18,27]
18.7 0 (18,27]
10.8 0 (9,18]
43.6 1 (36,45]
I could just make a scatterplot where x-axis is distance and the y-axis is miss/made. However, I don't want to visualize every shot attempt as a point. Let's say I want the x axis to be bins (where each bin is every 9 ft: 0-9 ft, 9-18 ft, 18-27 ft, 27-36 ft, 36-45 ft), and the y to be the proportion of shots that was made in that bin.
What is the best way to achieve this in ggplot? How much preprocessing do I have to do before leveraging ggplot capabilities? I can imagine doing all the necessary computation myself to find the proportion of shots made per bin and then plotting those values easily, but I feel there should be some built-in capabilities to help me with this (although I am new to ggplot and unsure at this point). Any guidance would be greatly appreciated. Thanks!
You are likely using a Pandas DataFrame (or Series) to hold your data, right?
If so, you can bin your values using pandas' built-in functionality, specifically the pandas.cut function.
e.g.
import pandas as pd
bins = 9  # an int gives that many equal-width bins; a sequence of scalars gives the bin edges
out, bins = pd.cut(data, bins, retbins=True, include_lowest=True)  # data: the values to bin
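To get from the bins to the proportion of made shots, the binned column can be grouped and averaged. A minimal sketch using the five rows shown in the question, with explicit 9 ft bin edges; the column names `distance` and `made_rate` are chosen here for illustration.

```python
import pandas as pd

# The five rows shown in the question
df = pd.DataFrame({
    "distance": [9.5, 23.3, 18.7, 10.8, 43.6],
    "outcome":  [1, 1, 0, 0, 1],
})

# Explicit 9 ft bin edges: (0, 9], (9, 18], (18, 27], (27, 36], (36, 45]
edges = list(range(0, 46, 9))
df["category"] = pd.cut(df["distance"], bins=edges)

# The mean of a 0/1 column is the proportion of made shots in each bin;
# observed=True leaves out bins that contain no shots
made_rate = df.groupby("category", observed=True)["outcome"].mean()
print(made_rate)
```

The resulting table has one row per non-empty bin, which is the shape a bar chart of proportion against distance bin needs.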
