So I'm comparing NBA betting lines between different sportsbooks over time
Procedure:
Open pickle file of scraped data
Plot the scraped data
The pickle file is a dictionary of NBA betting lines over time. Each of the two teams has its own nested dictionary. Each key in these team-specific dictionaries represents a different sportsbook. The value for each sportsbook key is a list of tuples representing time-series data. It looks roughly like this:
dicto = {
    'Time': <time that the game starts>,
    'Team1': {
        Market1: [(time1, value1), (time2, value2), etc...],
        Market2: [(time1, value1), (time2, value2), etc...],
        etc...
    },
    'Team2': {
        <SAME FORM AS TEAM1>
    }
}
There are no issues with scraping or manipulating this data. The issue comes when I plot it. Here is the code for the script that unpickles and plots these dictionaries:
import matplotlib.pyplot as plt
import pickle, datetime, os, time, re
IMAGEPATH = 'Images'
reg = re.compile(r'[A-Z]+#[A-Z]+[0-9|-]+')
noDate = re.compile(r'[A-Z]+#[A-Z]+')
# Turn 1 into '01'
def zeroPad(num):
    if num < 10:
        return '0' + str(num)
    else:
        return num
# Turn list of time-series tuples into an x list and y list
def unzip(lst):
    x = []
    y = []
    for i in lst:
        x.append(f'{i[0].hour}:{zeroPad(i[0].minute)}')
        y.append(i[1])
    return x, y
# Make exactly 5, evenly spaced xticks
def prune(xticks):
    last = len(xticks)
    first = 0
    mid = int(len(xticks) / 2) - 1
    upMid = int( mid + (last - mid) / 2)
    downMid = int( (mid - first) / 2)
    out = []
    count = 0
    for i in xticks:
        if count in [last, first, mid, upMid, downMid]:
            out.append(i)
        else:
            out.append('')
        count += 1
    return out
def plot(filename, choice):
    IMAGEPATH = 'Images'
    IMAGEPATH = os.path.join(IMAGEPATH, choice)
    with open(filename, 'rb') as pik:
        dicto = pickle.load(pik)
    fig, axs = plt.subplots(2)
    gameID = noDate.search(filename).group(0)
    tm = dicto['Time']
    fig.suptitle(gameID + '\n' + str(tm))
    i = 0
    for team in dicto.keys():
        axs[i].set_title(team)
        if team == 'Time':
            continue
        for market in dicto[team].keys():
            lst = dicto[team][market]
            x, y = unzip(lst)
            axs[i].plot(x, y, label= market)
        axs[i].set_xticks(prune(x))
        axs[i].set_xticklabels(rotation=45, labels = x)
        i += 1
    plt.tight_layout()
    #Finish
    outputFile = reg.search(filename).group(0)
    date = (datetime.datetime.today() - datetime.timedelta(hours = 6)).date()
    fig.savefig(os.path.join(IMAGEPATH, str(date), f'{outputFile}.png'))
    plt.close()
Here is the image that results from calling the plot function on one of the dictionaries that I described above. It is pretty much exactly as I intended it, except for one very strange and bothersome problem.
You will notice that the bottom right tick looks haunted, demonic, jpeggy, whatever you want to call it. I am highly suspicious that this problem occurs in the prune function, which I use to set the xtick values of the plot.
The reason that I prune the values with a function like this is because these dictionaries are continuously updated, so setting a static number of xticks would not work. And if I don't prune the xticks, they end up becoming unreadable due to overlapping one another.
I am quite confused as to what could cause an xtick to look like this. It happens consistently, for every dictionary, every time. Before I added the prune function (when the xticks were unbounded and overlapped one another), this issue did not occur. So when I say I'm suspicious that the prune function is the cause, I am really quite certain.
I will be happy to share an instance of one of these dictionaries, but they are saved as .pickle files, and I'm pretty sure it's bad practice to share pickle files over the internet. I have been warned about potential malware, so I'll just stay away from that. But if you need to see the dictionary, I can take the time to prettily print one and share a screenshot. Any help is greatly appreciated!
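Matplotlib draws a tick like this when many tick labels are placed at the same position, so the labels are rendered on top of one another. That is expected behavior. In the posted code this most likely comes from prune: because x holds strings, the axis is categorical, and every '' that prune returns is treated as the same category, so set_xticks puts all of those ticks at one position and set_xticklabels then stacks a different label on each of them. If you pass set_xticks only the positions you actually want, and set_xticklabels only the matching labels, the stacked tick should go away.
Plot a simple example to test this and you will see for yourself.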
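One way to apply this is to compute the tick positions yourself and pass only those, rather than a full-length list padded with ''. The sketch below is one possible approach, not the asker's code: evenly_spaced_indices is a hypothetical helper, and labels is made-up sample data standing in for the x list returned by unzip.

```python
def evenly_spaced_indices(n, count=5):
    """Return up to `count` distinct indices spread evenly over range(n)."""
    if n <= 0:
        return []
    if n <= count:
        return list(range(n))
    step = (n - 1) / (count - 1)
    # A set removes any duplicates produced by rounding; sorted() restores order.
    return sorted({round(i * step) for i in range(count)})


# Made-up sample labels: one every 5 minutes from 10:00 to 11:55 (24 labels).
labels = [f"{h}:{m:02d}" for h in range(10, 12) for m in range(0, 60, 5)]

idx = evenly_spaced_indices(len(labels))
tick_labels = [labels[i] for i in idx]
```

On a categorical axis the tick positions are the integer indices of the categories, so something like axs[i].set_xticks(idx) followed by axs[i].set_xticklabels(tick_labels, rotation=45) should give five distinct ticks, as long as the time labels in x are unique.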
I am relatively new to python, so still learning effective loop coding.
I have worked on creating a for loop, meant to iterate through numeric variables in a data set. The loop (as written) graphs a plot and provides bivariate statistics on all the numeric variables in my dataset. But there are over 100 variables in the data set so iterating through all combinations is both extremely taxing and unnecessary.
What I am trying to figure out now is what code I can insert into this loop (and where) which will allow me to set a threshold on r (being the r in stats.linregress). The purpose is to alleviate the code iterating through every combination of numeric variables in the data set and instead limit the iterations to variable combinations with a set r value (setting it at a value that has significance), thus skipping unnecessary iterations of combinations.
I thought a basic if/else would work, but it keeps breaking on me or plotting an empty graph.
code I am using:
list_of_checked_variables= []
for column_a in numeric_variable_list:
    for column_b in numeric_variable_list:
        list_of_checked_variables.append(column_a)
        if column_b not in list_of_checked_variables:
            correlation_plot = sns.regplot(x=df[column_a], y=df[column_b])
            sns.despine(top=True, right=True)
            # regression line
            m, b, r, p, err = stats.linregress(df[column_a], df[column_b])
            # Add formula, r^2, and p-value to the graph
            textstr= 'y = ' + str(round(m, 2)) + 'x + ' + str(round(b, 2)) + '\n'
            textstr += 'r2 = ' + str(round(r**2, 4)) + '\n'
            textstr += 'p = ' + str(round(p, 4))
            plt.text(0.15, 0.70, textstr, fontsize=12, transform=plt.gcf().transFigure)
            plt.show(correlation_plot)
            plt.close()
        else:
            continue
Explicit loops are comparatively slow in Python because every iteration runs interpreted bytecode and every value is a full object, unlike in C++. List/dictionary comprehensions and generator expressions carry less per-iteration overhead than an equivalent for loop that appends to a list, so use them where they fit. (f-strings are also fast in Python 3.6+, and I think they're more readable, so you should try to use those where you can as well.)
For combinations- and permutations-type problems, I would use the built-in itertools module.
import itertools
from matplotlib import pyplot as plt
import pandas as pd
from scipy import stats
import seaborn as sns
if __name__ == "__main__":
    # ...load df and get your numeric_variable_list
    # Set r-threshold
    rthresh = 0.95 # Example
    # Each unordered pair of columns, exactly once
    combos = itertools.combinations(numeric_variable_list, 2)
    # Dictionary comprehension: regress the column data, not the column names
    regstats = {c: stats.linregress(df[c[0]], df[c[1]]) for c in combos}
    # Keep pairs whose |r| meets the threshold (v[2] is r);
    # drop abs() if only positive correlations are of interest
    regstats = {k: v for (k, v) in regstats.items() if abs(v[2]) >= rthresh}
    for c, stat in regstats.items():
        m, b, r, p, err = stat
        # Create plots
        correlation_plot = sns.regplot(x=df[c[0]], y=df[c[1]])
        sns.despine(top=True, right=True)
        # Make your text
        textstr = f"y = {m:.2f}x + {b:.2f}\n"
        textstr += f"r2 = {r**2:.4f}\n"
        textstr += f"p = {p:.4f}"
        plt.text(
            0.15,
            0.70,
            textstr,
            fontsize=12,
            transform=plt.gcf().transFigure
        )
        plt.show()
        plt.close()
These are my first suggestions. There may be even more you can do to improve things, but this should work well. One other thing you may want to research is saving your graphs automatically with plt.savefig, especially if there are hundreds, rather than going through each and manually saving them individually.
If specific errors pop up, let me know. An if/else should work in theory. There may be something hiding in your numeric_variable_list, but I can't debug it without knowing the exact error message you're getting.
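To make the "filter first, plot second" idea concrete without any plotting, here is a minimal standard-library sketch. It is not the code from the question: pearson_r is a hypothetical helper, and columns is made-up data standing in for the numeric columns of df.

```python
import itertools
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data standing in for the numeric columns of df.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.1, 5.9, 8.0],
    "c": [5.0, 1.0, 4.0, 2.0],
}

R_THRESHOLD = 0.95  # example value

# combinations() yields each unordered pair once, so (a, b) is never
# repeated as (b, a) and no column is paired with itself.
strong_pairs = [
    (name_a, name_b)
    for name_a, name_b in itertools.combinations(columns, 2)
    if abs(pearson_r(columns[name_a], columns[name_b])) >= R_THRESHOLD
]
```

Only the pairs left in strong_pairs would then be passed to the plotting code. Note that r still has to be computed for every pair; the threshold only saves the plotting work.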
I'm trying to plot a demand profile for heating energy for a specific building with Python and matplotlib.
But instead of being a single line it looks like this:
Has anyone ever had plotting results like this?
Or does anyone have an idea what's going on here?
The corresponding code fragment is:
for b in list_of_buildings:
    print(b.label, b.Q_Heiz_a, b.Q_Heiz_TT, len(b.lp.heating_list))
    heating_datalist=[]
    for d in range(timesteps):
        b.lp.heating_list[d] = b.lp.heating_list[d]*b.Q_Heiz_TT
        heating_datalist.append((d, b.lp.heating_list[d]))
        xs_heat = [x[0] for x in heating_datalist]
        ys_heat = [x[1] for x in heating_datalist]
        pyplot.plot(xs_heat, ys_heat, lw=0.5)
pyplot.title(TT)
#get legend entries from list_of_buildings
list_of_entries = []
for b in list_of_buildings:
    list_of_entries.append(b.label)
pyplot.legend(list_of_entries)
pyplot.xlabel("[min]")
pyplot.ylabel("[kWh]")
Additional info:
timesteps is a list like [0.00, 0.01, 0.02, ... , 23.59] - the minutes of the day (24*60 values)
b.lp.heating_list is a list containing some float values
b.Q_Heiz_TT is a constant
Based on your information, I have created a minimal example that should reproduce your problem (if not, you may have not explained the problem/parameters in sufficient detail). I'd urge you to create such an example yourself next time, as your question is likely to get ignored without it. The example looks like this:
import numpy as np
import matplotlib.pyplot as plt
N = 24*60
Q_Heiz_TT = 0.5
lp_heating_list = np.random.rand(N)
lp_heating_list = lp_heating_list*Q_Heiz_TT
heating_datalist = []
for d in range(N):
    heating_datalist.append((d, lp_heating_list[d]))
    xs_heat = [x[0] for x in heating_datalist]
    ys_heat = [x[1] for x in heating_datalist]
    plt.plot(xs_heat, ys_heat)
plt.show()
What is going on here? For each d in range(N) (with N = 24*60, i.e. each minute of the day), you plot all values up to and including lp_heating_list[d] versus d. This is because, on every iteration, heating_datalist is extended with the current value of d and the corresponding value in lp_heating_list, and the plot call sits inside that same loop. What you get is 24*60 = 1440 lines that partially overlap one another. Depending on how your backend handles this, it may be very slow and start to look messy.
A much better approach would be to simply use
plt.plot(range(N), lp_heating_list)
plt.show()
This plots only one line instead of 1440 of them.
I suspect there is an indentation problem in your code.
Try this:
heating_datalist=[]
for d in range(timesteps):
    b.lp.heating_list[d] = b.lp.heating_list[d]*b.Q_Heiz_TT
    heating_datalist.append((d, b.lp.heating_list[d]))
xs_heat = [x[0] for x in heating_datalist] # <<<<<<<<
ys_heat = [x[1] for x in heating_datalist] # <<<<<<<<
pyplot.plot(xs_heat, ys_heat, lw=0.5) # <<<<<<<<
That way you'll plot only one line per building, which is probably what you want.
Besides, you can use zip to generate x values and y values like this:
xs_heat, ys_heat = zip(*heating_datalist)
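This works because zip is its own inverse: unpacking a list of pairs into zip with * turns it back into one sequence of x values and one sequence of y values.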
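A small round trip shows the idea. The three data points are made up for illustration; only the variable names follow the snippet above.

```python
# Made-up (x, y) pairs in the same shape as heating_datalist.
heating_datalist = [(0, 1.5), (1, 2.0), (2, 0.5)]

# zip(*pairs) transposes the pairs into two tuples.
xs_heat, ys_heat = zip(*heating_datalist)

# Zipping the two tuples again rebuilds the original pairs.
pairs_again = list(zip(xs_heat, ys_heat))
```

Note that zip returns tuples rather than lists here, which is fine for plotting.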
I have several vehicle paths and I would like to automatically draw all of them on separate files. I am trying to do it with a for loop, but the points end up overlapping on each following file. So, basically, on the last file, I have all paths.
This is my function. Can someone help me with this?
def drawUnique(uniqueVeh):
    for i in uniqueVeh:
        latitudes = list(map(float,list(gps_data[gps_data["id"] == i]["lat"])))
        longitudes = list(map(float,list(gps_data[gps_data["id"] == i]["long"])))
        gmap.scatter(latitudes, longitudes, size=10, marker=False)
        gmap.draw("map" + i + ".html")
The issue is that the gmap object is created once, before the loop, so the same object is reused on every iteration and accumulates all the marks.
You just need to create a new gmap object at the beginning of each iteration to start from a fresh map:
def drawUnique(uniqueVeh):
    for i in uniqueVeh:
        gmap = gmplot.GoogleMapPlotter(center_lat, center_lng, zoom) # replace the values !!
        latitudes = list(map(float,list(gps_data[gps_data["id"] == i]["lat"])))
        longitudes = list(map(float,list(gps_data[gps_data["id"] == i]["long"])))
        gmap.scatter(latitudes, longitudes, size=10, marker=False)
        gmap.draw("map" + i + ".html")
Basically, the code in the question behaves like this:
import gmplot
gmap = gmplot.GoogleMapPlotter(40.640, -73.926, 16)
# turn 1
gmap.scatter([40.642810, 40.638240],
[-73.915, -73.922901],
'cornflowerblue', edge_width=8)
gmap.draw("map1.html")
# turn 2
# same gmap : all marks are added and overlap the first
gmap.scatter([40.644494, 40.637083],
[-73.925044, -73.926464],
'red', edge_width=8)
gmap.draw("map2.html")
You need to insert this line between the two drawings to avoid the overlap in map2.html:
gmap = gmplot.GoogleMapPlotter(40.640, -73.926, 16)
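The same effect can be reproduced without gmplot. In the sketch below, FakeMap is a hypothetical stand-in that only mimics the "accumulate every point added so far" behavior described above; it is not the gmplot API, and the coordinates are made up.

```python
class FakeMap:
    """Hypothetical stand-in for a map object that accumulates points."""

    def __init__(self):
        self.points = []

    def scatter(self, lats, lngs):
        self.points.extend(zip(lats, lngs))


# Made-up paths: vehicle id -> (latitudes, longitudes)
paths = {
    "veh1": ([40.64], [-73.91]),
    "veh2": ([40.63], [-73.92]),
}

# One shared object: each "file" contains all earlier paths too.
shared = FakeMap()
shared_counts = []
for lats, lngs in paths.values():
    shared.scatter(lats, lngs)
    shared_counts.append(len(shared.points))

# A fresh object per iteration: each "file" contains only its own path.
fresh_counts = []
for lats, lngs in paths.values():
    fresh = FakeMap()
    fresh.scatter(lats, lngs)
    fresh_counts.append(len(fresh.points))
```

With the shared object the point count grows from one file to the next, while the fresh objects each hold a single path.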
I am using Python. I computed these numpy.float64 values, which show the Chicago Cubs' average wins by decade.
yr1874to1880 = np.mean(wonArray[137:143])
yr1881to1890 = np.mean(wonArray[127:136])
yr1891to1900 = np.mean(wonArray[117:126])
yr1901to1910 = np.mean(wonArray[107:116])
yr1911to1920 = np.mean(wonArray[97:106])
yr1921to1930 = np.mean(wonArray[87:96])
yr1931to1940 = np.mean(wonArray[77:86])
yr1941to1950 = np.mean(wonArray[67:76])
yr1951to1960 = np.mean(wonArray[57:66])
yr1961to1970 = np.mean(wonArray[47:56])
yr1971to1980 = np.mean(wonArray[37:46])
yr1981to1990 = np.mean(wonArray[27:36])
yr1991to2000 = np.mean(wonArray[17:26])
yr2001to2010 = np.mean(wonArray[7:16])
yr2011to2016 = np.mean(wonArray[0:6])
I want to put them together, but I don't know how. I tried putting them in a list, but it did not work. Does anyone know how to combine them so that I can plot them? I want to make a scatter graph with matplotlib. Thank you.
So with what you've shown, each variable you're setting becomes a float value. You can make them into a list by declaring:
list_of_values = [yr1874to1880, yr1881to1890, ...]
Adding all of the declared values to this results in a list of floats. For example, with just the first two values added:
>>> print(list_of_values)
[139.5, 131.0]
So that should explain how to obtain a list with the data from np.mean(). However, I'm guessing another question being asked is "how do I scatter plot this?" Using what is provided here, we have one axis of data, but to plot we need another (can't have a graph without x and y). Decide what the average wins is going to be compared against, and then that can be iterated over. For example, I'll use a simple integer in "decade" to act as the x axis:
import matplotlib.pyplot as plt
decade = 1
for i in list_of_values:
    y = i
    x = decade
    decade += 1
    plt.scatter(x, y)
plt.show()
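If you would rather keep each average next to a readable label, one option is a dictionary. This is only a sketch: decade_means is a hypothetical name, and it holds just the two values from the example output above.

```python
# Label -> average wins; only the two values shown in the example output.
decade_means = {
    "1874-1880": 139.5,
    "1881-1890": 131.0,
}

labels = list(decade_means)                     # x-axis tick labels
x = list(range(len(labels)))                    # numeric x positions 0, 1, ...
y = [decade_means[label] for label in labels]   # averages in the same order
```

With matplotlib, plt.scatter(x, y) followed by plt.xticks(x, labels) would then plot all points in one call and label each position with its decade.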
I am extracting 150 different cell values from 350,000 (20kb) ascii raster files. My current code is fine for processing the 150 cell values from 100's of the ascii files, however it is very slow when running on the full data set.
I am still learning Python, so are there any obvious inefficiencies, or suggestions to improve the code below?
I have tried closing the 'dat' file in the 2nd function; no improvement.
dat = None
First: I have a function which returns the row and column locations from a cartesian grid.
def world2Pixel(gt, x, y):
    ulX = gt[0]
    ulY = gt[3]
    xDist = gt[1]
    yDist = gt[5]
    rtnX = gt[2]
    rtnY = gt[4]
    pixel = int((x - ulX) / xDist)
    line = int((ulY - y) / xDist)
    return (pixel, line)
Second: A function to which I pass lists of 150 'id','x' and 'y' values in a for loop. The first function is called within and used to extract the cell value which is appended to a new list. I also have a list of files 'asc_list' and corresponding times in 'date_list'. Please ignore count / enumerate as I use this later; unless it is impeding efficiency.
def asc2series(id, x, y):
    #count = 1
    ls_id = []
    ls_p = []
    ls_d = []
    for n, (asc,date) in enumerate(zip(asc, date_list)):
        dat = gdal.Open(asc_list)
        gt = dat.GetGeoTransform()
        pixel, line = world2Pixel(gt, east, nort)
        band = dat.GetRasterBand(1)
        #dat = None
        value = band.ReadAsArray(pixel, line, 1, 1)[0, 0]
        ls_id.append(id)
        ls_p.append(value)
        ls_d.append(date)
Many thanks
In world2Pixel you are setting rtnX and rtnY, which you don't use.
You probably meant gdal.Open(asc) -- not asc_list.
You could move gt = dat.GetGeoTransform() out of the loop. (Rereading made me realize you can't really.)
You could cache calls to world2Pixel.
You're opening each dat file again for every point you look up. You should probably turn the logic around so that each file is opened only once and all the pixels you need from it are read while it is open.
Benchmark, check the links in this podcast to see how: http://talkpython.fm/episodes/show/28/making-python-fast-profiling-python-code
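The "open each file once" suggestion can be sketched as follows. This is not GDAL code: extract_series, FakeRaster and fake_open are hypothetical names used so the example runs on its own. In the real script, open_raster would wrap gdal.Open and the read would go through band.ReadAsArray, and the pixel offsets could be computed once up front only if all files share the same geotransform, which is an assumption.

```python
def extract_series(paths, points, open_raster):
    """Open each file once and read every requested point from it.

    points is a list of (point_id, col, row) tuples; open_raster is any
    callable that maps a path to an object with a read(col, row) method.
    """
    series = {point_id: [] for point_id, _, _ in points}
    for path in paths:
        raster = open_raster(path)              # one open per file
        for point_id, col, row in points:       # all points from that file
            series[point_id].append(raster.read(col, row))
    return series


class FakeRaster:
    """Hypothetical stand-in for an opened raster file."""

    def __init__(self, path):
        self.path = path

    def read(self, col, row):
        return (self.path, col, row)


open_calls = []


def fake_open(path):
    """Record each open so the number of file opens can be checked."""
    open_calls.append(path)
    return FakeRaster(path)


paths = ["a.asc", "b.asc"]
points = [("p1", 0, 0), ("p2", 3, 4)]
result = extract_series(paths, points, fake_open)
```

Here two files and two points lead to two opens rather than four; with 350,000 files and 150 points the same inversion would mean 350,000 opens rather than one open per file per point.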