I have computed the data coverage of a time series, stored in a pandas DataFrame, and would like to plot the coverage in Matplotlib or PyQtGraph.
DataFrame:
DateTime WD98 WS120 WS125B WD123 WS125A
31-07-2013 100 99.9 99.9 NaN NaN
31-08-2013 100 100 100 NaN NaN
30-09-2013 100 100 100 NaN NaN
31-10-2013 100 100 100 NaN NaN
30-11-2013 100 100 100 100 100
31-12-2013 100 100 100 100 100
31-01-2014 100 100 100 100 100
28-02-2014 100 100 100 100 100
31-03-2014 100 100 100 100 100
30-04-2014 100 100 100 100 100
31-05-2014 67.1 100 100 67.1 7.7
30-06-2014 NaN NaN 100 0 69.2
31-07-2014 NaN NaN 100 0 100
31-08-2014 NaN NaN 100 0 96.2
I would like to plot it in the fashion below (a broken bar chart). The reference plot was made using Excel conditional formatting. Please help me.
DataCoverage >= 90 (Green)
DataCoverage >= 75 and DataCoverage < 90 (Yellow)
DataCoverage < 75 (Red)
You can use seaborn.heatmap:
import seaborn as sns

# assumes df['DateTime'] is already a datetime column
# (convert with pd.to_datetime first if it is not)
df = df.set_index(df.pop('DateTime').dt.strftime('%d-%m-%Y'))
g = sns.heatmap(df, cmap=['r', 'y', 'g'], annot=True, fmt='.0f')
g.set_yticklabels(g.get_yticklabels(), rotation=0, fontsize=8)
Result:
UPDATE: corrected version:
import numpy as np
import pandas as pd
import seaborn as sns

x = df.set_index(df['DateTime'].dt.strftime('%d-%m-%Y')).drop(columns='DateTime')
# bin into 1 (red, < 75), 2 (yellow, 75-90), 3 (green, >= 90);
# right=False puts a value of exactly 75 or 90 into the higher band
z = (pd.cut(x.stack(), bins=[-np.inf, 75, 90, np.inf],
            labels=[1., 2., 3.], right=False)
       .unstack().apply(pd.to_numeric))
g = sns.heatmap(z, cmap=['r', 'y', 'g'], fmt='.0f', cbar=False)
g.set_yticklabels(g.get_yticklabels(), rotation=0, fontsize=8)
Result:
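The mapping from coverage values to the three color bands can be checked on its own, without the plot. A minimal sketch with a few hypothetical coverage values (note right=False, so that a value of exactly 75 or 90 lands in the higher band, per the thresholds in the question):

```python
import numpy as np
import pandas as pd

# hypothetical coverage values spanning all three bands, plus a missing cell
coverage = pd.Series([100.0, 96.2, 80.0, 75.0, 67.1, 0.0, np.nan])

# right=False gives the intervals [-inf, 75), [75, 90), [90, inf),
# matching the stated thresholds (>= 90 green, 75-90 yellow, < 75 red)
band = pd.cut(coverage, bins=[-np.inf, 75, 90, np.inf],
              labels=[1.0, 2.0, 3.0], right=False).astype(float)
print(band.tolist())  # [3.0, 3.0, 2.0, 2.0, 1.0, 1.0, nan]
```

NaN values fall outside every bin and stay NaN, which seaborn.heatmap renders as blank cells.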
Related
I am currently building a vaccination appointment program for college, and I am trying to write code that randomly assigns a date anywhere from 1/1/2022 to 31/12/2022, along with a time slot from 8am to 5pm. Each hour has 100 slots, and every time a user is assigned a slot, 1 is deducted from that slot's count. I tried doing this with a table I built using pandas, but I didn't get very far. Any help would be greatly appreciated, thank you.
Here's my code for the table using pandas (in case it will be helpful):
import pandas
start_date = '1/1/2022'
end_date = '31/12/2022'
list_of_date = pandas.date_range(start=start_date, end=end_date)
df = pandas.DataFrame(list_of_date)
df.columns = ['Date/Time']
df['8:00'] = 100
df['9:00'] = 100
df['10:00'] = 100
df['11:00'] = 100
df['12:00'] = 100
df['13:00'] = 100
df['14:00'] = 100
df['15:00'] = 100
df['16:00'] = 100
df['17:00'] = 100
print(df)
What I would do is start by including the leading zero at the beginning of the hour for each column name. It's easier to extract '08:00' from a pandas Timestamp than '8:00'.
df['08:00'] = 100
df['09:00'] = 100
Then you can set the index to your 'Date/Time' column and use .loc to locate an appointment slot by the date in the row and the hour (rounded down) in the columns, and subtract 1 from the number of appointments at that slot. For example:
df.set_index('Date/Time', inplace=True)
user1_datetime = pd.to_datetime("2022-01-02 08:30")
user1_day = user1_datetime.strftime('%Y-%m-%d')
user1_time = user1_datetime.floor("h").strftime('%H:%M')
df.loc[user1_day, user1_time] -= 1
Result:
>>> df
08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00
Date/Time
2022-01-01 100 100 100 100 100 100 100 100 100 100
2022-01-02 99 100 100 100 100 100 100 100 100 100
2022-01-03 100 100 100 100 100 100 100 100 100 100
2022-01-04 100 100 100 100 100 100 100 100 100 100
2022-01-05 100 100 100 100 100 100 100 100 100 100
... ... ... ... ... ... ... ... ... ... ...
2022-12-27 100 100 100 100 100 100 100 100 100 100
2022-12-28 100 100 100 100 100 100 100 100 100 100
2022-12-29 100 100 100 100 100 100 100 100 100 100
2022-12-30 100 100 100 100 100 100 100 100 100 100
2022-12-31 100 100 100 100 100 100 100 100 100 100
To scale up, you can easily wrap this in a function that takes a list of datetimes for multiple people, and checks that the person isn't making an appointment in an hour slot with 0 remaining appointments.
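Such a function might look like the following sketch (hypothetical names, and a tiny two-day, two-hour table instead of the full year, just for illustration):

```python
import pandas as pd

def book_slot(df, when):
    """Decrement the matching slot if capacity remains; return whether it succeeded."""
    when = pd.to_datetime(when)
    day = when.strftime('%Y-%m-%d')
    hour = when.floor('h').strftime('%H:%M')
    if df.loc[day, hour] <= 0:
        return False  # slot is fully booked
    df.loc[day, hour] -= 1
    return True

# toy table with one appointment per slot
df = pd.DataFrame({'08:00': [1, 1], '09:00': [1, 1]},
                  index=pd.to_datetime(['2022-01-01', '2022-01-02']))
df.index.name = 'Date/Time'
```

Calling book_slot(df, '2022-01-01 08:30') succeeds once; a second request that floors to the same hour is refused because the count has reached 0.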
Thank you Derek, I finally managed to think of a way to do it, and I couldn't have done it without your help. Here's my code:
This builds the table and saves it into a CSV file:
import pandas
start_date = '1/1/2022'
end_date = '31/12/2022'
list_of_date = pandas.date_range(start=start_date, end=end_date)
df = pandas.DataFrame(list_of_date)
df.columns = ['Date/Time']
df['8:00'] = 100
df['9:00'] = 100
df['10:00'] = 100
df['11:00'] = 100
df['12:00'] = 100
df['13:00'] = 100
df['14:00'] = 100
df['15:00'] = 100
df['16:00'] = 100
df['17:00'] = 100
df.to_csv(r'C:\Users\Ric\PycharmProjects\pythonProject\new.csv', index=False)  # index=False so the date is the first column when read back
And this code randomly picks a date and an hour for that date:
import pandas
import random
from random import randrange

# randrange picks a random row index for the date; random.choice picks the hour
random_date = randrange(365)
hours = ["8:00", "9:00", "10:00", "11:00", "12:00",
         "13:00", "14:00", "15:00", "16:00", "17:00"]
hour = random.choice(hours)
df = pandas.read_csv('new.csv')
date = df.iloc[random_date, 0]
df.loc[random_date, hour] -= 1
df.to_csv(r'C:\Users\Rich\PycharmProjects\pythonProject\new.csv', index=False)
print(date)
print(hour)
I haven't found a way for the program to check whether the number of slots is > 0 before choosing one though.
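One way to add that check (a sketch with hypothetical names: instead of drawing a slot blindly, collect every slot that still has capacity and choose among those):

```python
import pandas as pd
import random

hours = ["8:00", "9:00", "10:00", "11:00", "12:00",
         "13:00", "14:00", "15:00", "16:00", "17:00"]

def assign_random_slot(df, rng=random):
    """Pick a random (row, hour) slot that still has capacity, decrement it,
    and return it; return None if the whole table is fully booked."""
    open_cells = [(i, h) for i in df.index for h in hours if df.loc[i, h] > 0]
    if not open_cells:
        return None
    i, h = rng.choice(open_cells)
    df.loc[i, h] -= 1
    return i, h

# tiny demo table: one day, every hour at 0 except a single slot left at 10:00
demo = pd.DataFrame({h: [0] for h in hours})
print(assign_random_slot(demo))  # -> (0, '10:00') after setting that cell to 1
demo.loc[0, '10:00'] = 1
print(assign_random_slot(demo))  # -> (0, '10:00')
```

Scanning all cells is fine at this scale (365 rows x 10 columns); for a larger table you could sample cells and retry instead.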
I am trying to visualize data in a scatter plot that is time-ordered on the x-axis. I want the size of dots to be dependent on the size of the overall value for a particular time (across columns in the data below). I want the color of the dot to be filled in based on the value for that variable across time (down the rows in the data below), say white for low values and purple for high values.
Time Value_1 Value_2 Value_3 Value_4 Value_5
10:30 100 200 1000 400 300
10:31 200 100 500 200 1000
10:32 300 500 900 300 200
In the above, there would be five dots for time 10:30. The third dot would be the largest in size because its value of 1000 is the largest of the total values for 10:30 (total of 2000). Ideally, its size should be half the total area of the remaining dots. The fourth dot would be next largest (with an area of 1/5 the total dots), followed by the fifth dot, second dot and finally the first dot.
The third dot would be colored purple because 1000 is the highest value for Value_3 for each of the times, 10:30-10:32. The third dot at 10:31 would be colored white because it is the lowest of the values for Value_3. The third dot for 10:32 would be very close to deep purple because 900 is much closer to 1000 than it is 500.
Does anyone know how to do this in matplotlib and python? As suggested in the headline question, this is a problem of coloring by histogram position and sizing by value during a specific time. The position of the dot is fixed and ordinal.
I split this problem into two matrices and then used these to graph a scatter plot.
# import needed packages
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# reproduce sample chart
index = ['10:30', '10:31', '10:32']
df = {'Value_1': [100, 200, 300],
      'Value_2': [200, 100, 500],
      'Value_3': [1000, 500, 900],
      'Value_4': [400, 200, 300],
      'Value_5': [300, 1000, 200]}
df = pd.DataFrame(data=df, index=pd.to_datetime(index))
print(df)
> Value_1 Value_2 Value_3 Value_4 Value_5
> 2020-12-11 10:30:00 100 200 1000 400 300
> 2020-12-11 10:31:00 200 100 500 200 1000
> 2020-12-11 10:32:00 300 500 900 300 200
#find relative size within a time period -- for size of circle
relative_size = df.div(df.sum(axis=1),axis=0)
print(relative_size)
> Value_1 Value_2 Value_3 Value_4 Value_5
> 2020-12-11 10:30:00 0.050000 0.100000 0.500000 0.200000 0.150000
> 2020-12-11 10:31:00 0.100000 0.050000 0.250000 0.100000 0.500000
> 2020-12-11 10:32:00 0.136364 0.227273 0.409091 0.136364 0.090909
#find relative size versus all time (for color intensity)
relative_color = df - df.min()
relative_color = relative_color / df.max()
print(relative_color)
> Value_1 Value_2 Value_3 Value_4 Value_5
> 2020-12-11 10:30:00 0.000000 0.2 0.5 0.50 0.1
> 2020-12-11 10:31:00 0.333333 0.0 0.0 0.00 0.8
> 2020-12-11 10:32:00 0.666667 0.8 0.4 0.25 0.0
#for colors
cmaps = ['Purples'] * int(len(df.keys()))
#plot
fig, ax = plt.subplots()
ax.xaxis_date()
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
for i, col in enumerate(df.keys()):
    ax.scatter(df.index, [i] * len(df.index), c=relative_color[col] / 4 + .75,
               s=relative_size[col] * 10**3, alpha=1, cmap=cmaps[i])
# relative color is given a minimum of .75, with the 0-1 variation scaled by 1/4th
# I chose 10**3 for the relative size just based on preference
plt.show()
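A quick sanity check on the two matrices, reusing the sample frame: each row of relative_size should sum to 1 (the dots at one time share the row's total area), and the Value_3 dot at 10:30, holding half the row total, should come out at exactly 0.5:

```python
import pandas as pd

index = pd.to_datetime(['10:30', '10:31', '10:32'])
df = pd.DataFrame({'Value_1': [100, 200, 300],
                   'Value_2': [200, 100, 500],
                   'Value_3': [1000, 500, 900],
                   'Value_4': [400, 200, 300],
                   'Value_5': [300, 1000, 200]}, index=index)

relative_size = df.div(df.sum(axis=1), axis=0)   # fraction of each row's total
relative_color = (df - df.min()) / df.max()      # position within each column

print(relative_size.loc[index[0], 'Value_3'])  # 0.5
```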
Here is the data, showing the two columns between which I need to plot a histogram:
Cont Bin_Data
21 1
21 1
22.8 1
21.4 0
18.7 0
18.1 0
14.3 0
24.4 0
22.8 1
19.2 1
17.8 0
16.4 1
17.3 0
15.2 1
I have to plot a histogram of the Cont column grouped by the Bin_Data column, to compare the two groups. I have tried 3 approaches and am not getting a satisfactory plot.
Approach #1
plt.hist('mpg', bins=5, data=am)
Approach #2
plt.hist(mpg, bins=np.arange(mpg.min(), mpg.max()+1))
Approach #3
am = data['am']
legend = ['am', 'mpg']
mpg = data['mpg']
plt.hist([mpg, am], color=['orange', 'green'])
plt.xlabel("am")
plt.ylabel("mpg")
plt.legend(legend)
#plt.xticks(range(0, 7))
#plt.yticks(range(1, 20))
plt.title('Analysis of "am" upon "mpg"')
plt.show()
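For the Cont/Bin_Data sample itself, one way to get the grouped comparison is to pass one array per group to hist, which draws the two sets of bars side by side (a sketch, assuming the columns are named as in the sample):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Cont': [21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4,
             22.8, 19.2, 17.8, 16.4, 17.3, 15.2],
    'Bin_Data': [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1],
})

# one array of Cont values per Bin_Data group (groupby sorts keys: 0 first)
groups = [g['Cont'].to_numpy() for _, g in df.groupby('Bin_Data')]

fig, ax = plt.subplots()
ax.hist(groups, bins=5, color=['green', 'orange'],
        label=['Bin_Data = 0', 'Bin_Data = 1'])
ax.set_xlabel('Cont')
ax.set_ylabel('count')
ax.legend()
```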
I have a dataframe with 3 columns. Something like this:
Data Initial_Amount Current
31-01-2018
28-02-2018
31-03-2018
30-04-2018 100 100
31-05-2018 100 90
30-06-2018 100 80
I would like to populate the prior rows with the Initial Amount as such:
Data Initial_Amount Current
31-01-2018 100 100
28-02-2018 100 100
31-03-2018 100 100
30-04-2018 100 100
31-05-2018 100 90
30-06-2018 100 80
So find the:
First non-empty row with Initial_Amount populated
Use that to backfill the Initial_Amount values to the starting date
If Current is also empty in those rows, copy Initial_Amount; otherwise keep the prior balance
Pandas fillna with fill method 'bfill' (which uses the next valid observation to fill each gap) should do what you're looking for:
In [13]: df.fillna(method='bfill')
Out[13]:
Data Initial_Amount Current
0 31-01-2018 100.0 100.0
1 28-02-2018 100.0 100.0
2 31-03-2018 100.0 100.0
3 30-04-2018 100.0 100.0
4 31-05-2018 100.0 90.0
5 30-06-2018 100.0 80.0
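Note that in recent pandas versions the method= argument of fillna is deprecated; df.bfill() is the equivalent spelling. A self-contained sketch reproducing the frame above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Data': ['31-01-2018', '28-02-2018', '31-03-2018',
             '30-04-2018', '31-05-2018', '30-06-2018'],
    'Initial_Amount': [np.nan, np.nan, np.nan, 100, 100, 100],
    'Current': [np.nan, np.nan, np.nan, 100, 90, 80],
})

filled = df.bfill()  # each gap takes the next valid observation below it
print(filled['Current'].tolist())  # [100.0, 100.0, 100.0, 100.0, 90.0, 80.0]
```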
I am often working with data that has a very 'long tail'. I want to plot histograms to summarize the distribution, but when I try to using pandas I wind up with a bar graph that has one giant visible bar and everything else invisible.
Here is an example of the series I am working with. Since it's very long, I used value_counts() so it will fit on this page.
In [10]: data.value_counts().sort_index()
Out[10]:
0 8012
25 3710
100 10794
200 11718
300 2489
500 7631
600 34
700 115
1000 3099
1200 1766
1600 63
2000 1538
2200 41
2500 208
2700 2138
5000 515
5500 201
8800 10
10000 10
10900 465
13000 9
16200 74
20000 518
21500 65
27000 64
53000 82
56000 1
106000 35
530000 3
I'm guessing that the answer involves binning the less common results into larger groups somehow (53000, 56000, 106000, and 53000 into one group of >50000, etc.), and also changing the y index to represent percentages of the occurrence rather than the absolute number. However, I don't understand how I would go about doing that automatically.
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

mydict = {0: 8012, 25: 3710, 100: 10794, 200: 11718, 300: 2489, 500: 7631,
          600: 34, 700: 115, 1000: 3099, 1200: 1766, 1600: 63, 2000: 1538,
          2200: 41, 2500: 208, 2700: 2138, 5000: 515, 5500: 201, 8800: 10,
          10000: 10, 10900: 465, 13000: 9, 16200: 74, 20000: 518, 21500: 65,
          27000: 64, 53000: 82, 56000: 1, 106000: 35, 530000: 3}

# expand the value counts back into one observation per row
mylist = []
for key in mydict:
    mylist.extend([key] * mydict[key])
df = pd.DataFrame(mylist, columns=['value'])
Plot as a bar:
fig = df.value.value_counts().sort_index().plot(kind="bar")
plt.savefig("figure.png")
As a histogram (limited to values 5000 & under which is >97% of your data):
I like using linspace to control buckets.
df2 = df[df.value <= 5000]
df2.hist(bins=np.linspace(0,5000,101))
plt.savefig('hist1')
EDIT: Changed np.linspace(0,5000,100) to np.linspace(0,5000,101) & updated histogram.
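If you want the whole tail in one plot and the y-axis as a fraction of occurrences rather than a count, logarithmically spaced bins plus 1/N weights are one option (a sketch; zeros are clipped to 1 so they can appear on the log axis):

```python
import numpy as np
from matplotlib import pyplot as plt

mydict = {0: 8012, 25: 3710, 100: 10794, 200: 11718, 300: 2489, 500: 7631,
          600: 34, 700: 115, 1000: 3099, 1200: 1766, 1600: 63, 2000: 1538,
          2200: 41, 2500: 208, 2700: 2138, 5000: 515, 5500: 201, 8800: 10,
          10000: 10, 10900: 465, 13000: 9, 16200: 74, 20000: 518, 21500: 65,
          27000: 64, 53000: 82, 56000: 1, 106000: 35, 530000: 3}
values = np.repeat(list(mydict.keys()), list(mydict.values())).astype(float)

# clip zeros to 1 so every observation fits on a log axis
clipped = np.clip(values, 1, None)
bins = np.logspace(0, np.log10(clipped.max() + 1), 30)
weights = np.ones_like(clipped) / len(clipped)  # bar heights sum to 1

fig, ax = plt.subplots()
ax.hist(clipped, bins=bins, weights=weights)
ax.set_xscale('log')
ax.set_xlabel('value')
ax.set_ylabel('fraction of observations')
plt.savefig('hist_log')
```

With the weights summing to 1, each bar reads directly as the share of all observations in that bin, so the rare large values stay visible without dwarfing the bulk of the data.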