I'm writing a script to render pie charts from multidimensional array data, and am struggling to keep each step synchronized.
Here's my code, with some dummy data (animal types) for example:
def parse_data(chart):
    print("\n=====Chart Data=====")
    for a, b in chart.items():  # the dictionary "chart" has items - key:a value:b
        output = []  # list of what to print to the console
        output.append(str(a))  # key:a - the name of the superset
        if type(b) is float:  # if there are no subsets and the value is a float...
            output.append(': ' + str(b) + '%')  # add that number (%) to the console output list
        elif type(b) is list:  # else if there are subsets they'll be in a list
            for c in b:
                for d, e in c.items():
                    output.append('\n    ' + str(d) + ': ' + str(e) + '%')
        else:  # if 'b' is neither a float nor a list
            print('Error: Data could not be parsed correctly.')
        print('\n' + ''.join(map(str, output)))  # put the output list together and print it
chart_data = {
    "Mammal Group": [{"Hamster": 23.1}, {"Yeti": 16.4}],
    "Platypus": 14.2,
    "Reptile Group": [{"Snake": 4.0}, {"Komodo Dragon": 0.7}]
}
parse_data(chart_data)
The console output is:
=====Chart Data=====

Mammal Group
    Hamster: 23.1%
    Yeti: 16.4%

Platypus: 14.2%

Reptile Group
    Snake: 4.0%
    Komodo Dragon: 0.7%
This all looks fine so far. Groups/supersets are represented by inner slices, and subsets are represented by outer slices. Notice that some animals (Platypus) do not belong to a group, and a percentage is listed directly next to their name. Other animals are subsets of some larger group/superset. The next step is to take that data and send it, group by group, to a function to be rendered.
Here's a mock-up animating the order in which I imagine it would be logical to render: superset, then the subsets of that superset, then on to the next superset if there is one. If there's no parent superset (Platypus), skip rendering the superset and let the subset take up both the inner and outer areas. The final chart will not be animated; this is only to demonstrate the order.
When each slice is created, its starting and ending angles need to be tracked, and those angles will differ between subset slices and superset slices. They need to be tracked in such a way that after a set is rendered, the angle marker is placed at the end of it, ready to make the next slice.
I've got the rendering function working, but trying to feed it correct data slice by slice is driving me nuts. How can the chart_data be extracted in an organized way, to feed to the rendering function? Thanks for any help!
Although it's unlikely that others will need to code something quite like this, I have reached a solution and would prefer not to leave this question unanswered, so I will attempt to explain it. Anyone randomly curious can read on....
I solved this by creating two empty lists - one for the "superset" inner slices, and one for the "subset" outer slices, and then populating them as the chart data gets iterated through. Although the subset slices are always children of a superset slice in the structure of the original chart data, that structure does not need to be preserved in order to render the chart.
The supersets that have no subset children get added to both the super and sub lists, but this is invisible in the final pie chart. To understand what is meant by "invisible", imagine that the blue slice in the mock-up image is actually composed of an inner segment and an outer segment, one beside the other. If the pie chart were uncurled and laid out as two straight parallel tracks, they would still line up with one another, as they have the same percentage.
But only one should be labeled, and there is more room for text in the outer area. For this reason, superset slices containing no subset slices get their name labels removed and set to data type None. Those name labels get moved to each one's neighboring subset slice instead. Slices labeled None will not be rendered, but the place marker will rotate by the percentage of that slice. To use the mock-up image as an example, this is what allows moving from the end of the green superset slice to the beginning of the orange superset slice without drawing an unwanted second slice "over" the blue slice. (Imagine the inner superset slices as being laid on top of and covering the outer subset slices.)
As most people reading this probably know, None can only be used as a dictionary key once. For example, the entry "Platypus": 14.2 will result in None: 14.2, which will then be overwritten by "Jabberwock": 41.6 becoming None: 41.6. Because of this, I decided to convert the dictionary chart_data into a list before formatting it for the graph.
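A quick demonstration of that collision: a dictionary can hold only one None key, so a second None entry silently overwrites the first.

labels = {None: 14.2}   # "Platypus" with its label removed
labels[None] = 41.6     # "Jabberwock" with its label removed - replaces the entry above
print(labels)           # {None: 41.6} - the 14.2 slice is gone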
As the iteration happens, the percentages of the subset slices get tallied up so that their parent superset group matches them in size. (Size meaning width if laid out in two parallel tracks, and rotation angle if viewed in the pie chart.)
Finally, I threw in some code to check whether the chart percentages add up to 100.
chart_data = {
    "Mammal Group": [{"Hamster": 23.1}, {"Yeti": 16.4}],
    "Platypus": 14.2,
    "Reptile Group": [{"Snake": 4.0}, {"Komodo Dragon": 0.7}],
    "Jabberwock": 41.6
}
track_sup = []
track_sub = []

def dict_to_list(d):  # converts a dictionary to a list (nested dictionaries are untouched)
    new_list = []
    for key, value in d.items():
        super_pair = [key, value]
        new_list.append(super_pair)
    return new_list

def format_data(dict_data):
    global track_sup
    global track_sub
    track_sup = []
    track_sub = []
    print('\n\n\n====== Formatting data ======')
    super_slices = dict_to_list(dict_data)  # convert to list to allow more than one single super slice (their labels are type: None)
    chart_perc = 0.0  # for checking that the chart slices all add up to 100%
    i = 0
    while i < len(super_slices):
        tally = 0.0  # for adding up subset percentages to get each superset percentage
        is_single = True  # supersets single by default
        super_slice = super_slices[i]
        slice_label = ''
        super_pair = []
        sub_pair = []
        if type(super_slice[1]) == list:  # if [1] is a list, there are sub slices
            is_single = False  # mark superset as containing multiple subsets
            slice_label = super_slice[0]
            sub_slices = super_slice[1]
            j = 0
            while j < len(sub_slices):  # iterate sub slices to gather label names and percentages
                sub_slice = sub_slices[j]
                for k, v in sub_slice.items():  # in each dict, k is a label and v is a percentage
                    v = float(v)
                    tally = tally + v  # count toward super slice (group) percentage
                    chart_perc = chart_perc + v  # count toward chart total percentage
                    sub_pair = [k, v]  # convert each key-value pair into a list
                    print(str(sub_pair[0]) + ' ' + str(sub_pair[1]) + ' %')
                    track_sub.append(sub_pair)  # append this pair to final sub output list
                j = j + 1
            print('Group [' + slice_label + '] combined total is ' + str(tally) + ' % \n')
        elif type(super_slice[1]) == float:  # this super slice (group) contains no sub slices
            slice_label = super_slice[0]
            tally = super_slice[1]  # no sub slice percentages to add up
            chart_perc = chart_perc + super_slice[1]  # count toward chart total percentage
            sub_pair = [slice_label, tally]  # label drops to the sub slot as it only labels itself
            track_sub.append(sub_pair)  # append this pair to final sub output list
            print(slice_label + ' ' + str(tally) + ' % (Does not belong to a group)\n')
        else:
            print('Error: Could not format data.')
        if is_single == True:
            slice_label = None  # label removed for each single slice - only the percentage is used (for spacing)
        super_pair = [slice_label, tally]  # pair up each label name and super slice percentage in a list
        track_sup.append(super_pair)  # append this pair to final super output list
        i = i + 1
    chart_perc = round(chart_perc, 6)  # round to 6 decimal places
    short = 0.0
    if chart_perc == 100.0:
        print('______ Sum of all chart slices is 100 % ______\n')
    else:
        print('****** WARNING: Chart slices do not add up to 100 % ! ******')
        short = round(100.0 - chart_perc, 6)
        print('Sum of all chart slices is only ' + str(chart_perc) + ' % (Falling short by ' + str(short) + ' %)\n')

format_data(chart_data)
print(track_sup)
print(track_sub)
And the resulting console output:
====== Formatting data ======
Hamster 23.1 %
Yeti 16.4 %
Group [Mammal Group] combined total is 39.5 %
Platypus 14.2 % (Does not belong to a group)
Snake 4.0 %
Komodo Dragon 0.7 %
Group [Reptile Group] combined total is 4.7 %
Jabberwock 41.6 % (Does not belong to a group)
______ Sum of all chart slices is 100 % ______
[['Mammal Group', 39.5], [None, 14.2], ['Reptile Group', 4.7], [None, 41.6]]
[['Hamster', 23.1], ['Yeti', 16.4], ['Platypus', 14.2], ['Snake', 4.0], ['Komodo Dragon', 0.7], ['Jabberwock', 41.6]]
track_sup and track_sub, shown in the last two lines of output, are the two lists that will actually be used to render the chart.
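To illustrate how these two lists drive the renderer, here is a minimal sketch of the angle bookkeeping described above; render_slice is a hypothetical stand-in for the actual rendering function, which is not shown here.

def render_track(track, ring):  # ring is 'inner' (supersets) or 'outer' (subsets)
    marker = 0.0  # running angle marker, in degrees
    for label, perc in track:
        start = marker
        end = marker + perc * 3.6  # 1% of a circle = 3.6 degrees
        if label is not None:  # None-labeled slices rotate the marker without drawing
            render_slice(label, start, end, ring)  # hypothetical renderer
        marker = end  # marker rests at the end of this slice, ready for the next one

render_track(track_sup, 'inner')
render_track(track_sub, 'outer')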
I have several time points taken from a video with some max time length (T). These points are stored in a list of lists as follows:
time_pt_nested_list = [
    [0.0, 6.131, 32.892, 43.424, 46.969, 108.493, 142.69, 197.025, 205.793, 244.582, 248.913, 251.518, 258.798, 264.021, 330.02, 428.965],
    [11.066, 35.73, 64.784, 151.31, 289.03, 306.285, 328.7, 408.274, 413.64],
    [48.447, 229.74, 293.19, 333.343, 404.194, 418.575],
    [66.37, 242.16, 356.96, 424.967],
    [78.711, 358.789, 403.346],
    [84.454, 373.593, 422.384],
    [102.734, 394.58],
    [158.534],
    [210.112],
    [247.61],
    [340.02],
    [365.146],
    [372.153]]
Each list above is associated with some probability; I'd like to randomly select points from each list according to its probability to form n tuples of contiguous time spans, such as the following:
[(0,t1),(t1,t2),(t2,t3),...,(tn,T)]
where n is specified by the user. All the returned tuples should only contain the floating point numbers inside the nested list above. I want the points in the first list to have the highest probability of being sampled and appearing in the returned tuples, the second list a slightly lower probability, and so on. The exact details of these probabilities are not important, but it would be nice if the user could input a parameter that controls how fast the probability decays as the list index increases.
The returned tuples are timeframes that should exactly cover the entire video and should not overlap. 0 and T may not necessarily appear in time_pt_nested_list (but they may). Are there nice ways to implement this? I would be grateful for any insightful suggestions.
For example, if the user inputs 6 as the number of subclips, this would be an example output:
[(0.0, 32.892), (32.892, 64.784), (64.784, 229.74), (229.74, 306.285), (306.285, 418.575), (418.575, 437.47)]
All numbers appearing in the tuples appear in time_pt_nested_list, except 0.0 and 437.47. (Well, 0.0 does appear here, but it may not in other cases.) Here 437.47 is the length of the video, which is also given and may not appear in the list.
This is simpler than it may look. You really just need to sample n-1 interior points from your sublists, each with a row-dependent sample probability. The samples, time-ordered and bracketed by the two endpoints, then define your n tuples.
import numpy as np

# user params
n = 6
prob_falloff_param = 0.2

# flatten to (row index, time) pairs, sorted by time
lin_list = sorted([(idx, el) for idx, row in enumerate(time_pt_nested_list)
                   for el in row], key=lambda x: x[1])

# endpoints required, excluded from random selection process
t0 = lin_list.pop(0)[1]
T = lin_list.pop(-1)[1]

arr = np.array(lin_list)

# define row weights; prob_falloff_param controls how fast the sampling
# probability decays as the row index grows
weights = np.exp(-prob_falloff_param * arr[:, 0]**2)
norm_weights = weights / np.sum(weights)

# choose (weighted) random interior points, then create the tuple list
random_points = sorted(np.random.choice(arr[:, 1], size=n - 1,
                                        replace=False, p=norm_weights))
time_arr = [t0, *random_points, T]
output = list(zip(time_arr, time_arr[1:]))
Example outputs:
# n = 6
[(0.0, 78.711),
(78.711, 84.454),
(84.454, 158.534),
(158.534, 210.112),
(210.112, 372.153),
(372.153, 428.965)]
# n = 12
[(0.0, 6.131),
(6.131, 43.424),
(43.424, 64.784),
(64.784, 84.454),
(84.454, 102.734),
(102.734, 210.112),
(210.112, 229.74),
(229.74, 244.582),
(244.582, 264.021),
(264.021, 372.153),
(372.153, 424.967),
(424.967, 428.965)]
So I'm comparing NBA betting lines between different sportsbooks over time.
Procedure:
Open pickle file of scraped data
Plot the scraped data
The pickle file is a dictionary of NBA betting lines over time. Each of the two teams is its own nested dictionary. Each key in these team-specific dictionaries represents a different sportsbook. The values for these sportsbook keys are lists of tuples representing timeseries data. It looks roughly like this:
dicto = {
    'Time': <time that the game starts>,
    'Team1': {
        Market1: [(time1, value1), (time2, value2), etc...],
        Market2: [(time1, value1), (time2, value2), etc...],
        etc...
    },
    'Team2': {
        <SAME FORM AS TEAM1>
    }
}
There are no issues with scraping or manipulating this data. The issue comes when I plot it. Here is the code for the script that unpickles and plots these dictionaries:
import matplotlib.pyplot as plt
import pickle, datetime, os, time, re

IMAGEPATH = 'Images'
reg = re.compile(r'[A-Z]+#[A-Z]+[0-9|-]+')
noDate = re.compile(r'[A-Z]+#[A-Z]+')

# Turn 1 into '01'
def zeroPad(num):
    if num < 10:
        return '0' + str(num)
    else:
        return str(num)

# Turn list of time-series tuples into an x list and y list
def unzip(lst):
    x = []
    y = []
    for i in lst:
        x.append(f'{i[0].hour}:{zeroPad(i[0].minute)}')
        y.append(i[1])
    return x, y

# Make exactly 5, evenly spaced xticks
def prune(xticks):
    last = len(xticks)
    first = 0
    mid = int(len(xticks) / 2) - 1
    upMid = int(mid + (last - mid) / 2)
    downMid = int((mid - first) / 2)
    out = []
    count = 0
    for i in xticks:
        if count in [last, first, mid, upMid, downMid]:
            out.append(i)
        else:
            out.append('')
        count += 1
    return out

def plot(filename, choice):
    IMAGEPATH = 'Images'
    IMAGEPATH = os.path.join(IMAGEPATH, choice)
    with open(filename, 'rb') as pik:
        dicto = pickle.load(pik)
    fig, axs = plt.subplots(2)
    gameID = noDate.search(filename).group(0)
    tm = dicto['Time']
    fig.suptitle(gameID + '\n' + str(tm))
    i = 0
    for team in dicto.keys():
        axs[i].set_title(team)
        if team == 'Time':
            continue
        for market in dicto[team].keys():
            lst = dicto[team][market]
            x, y = unzip(lst)
            axs[i].plot(x, y, label=market)
            axs[i].set_xticks(prune(x))
            axs[i].set_xticklabels(rotation=45, labels=x)
        i += 1
    plt.tight_layout()
    # Finish
    outputFile = reg.search(filename).group(0)
    date = (datetime.datetime.today() - datetime.timedelta(hours=6)).date()
    fig.savefig(os.path.join(IMAGEPATH, str(date), f'{outputFile}.png'))
    plt.close()
Here is the image that results from calling the plot function on one of the dictionaries that I described above. It is pretty much exactly as I intended it, except for one very strange and bothersome problem.
You will notice that the bottom right tick looks haunted, demonic, jpeggy, whatever you want to call it. I am highly suspicious that this problem occurs in the prune function, which I use to set the xtick values of the plot.
The reason I prune the values with a function like this is that these dictionaries are continuously updated, so setting a static number of xticks would not work. And if I don't prune the xticks, they end up unreadable due to overlapping one another.
I am quite confused as to what could cause an xtick to look like this. It happens consistently, for every dictionary, every time. Before I added the prune function (when the xticks were unpruned and overlapping one another), this issue did not occur. So when I say I'm suspicious that the prune function is the cause, I am really quite certain.
I will be happy to share an instance of one of these dictionaries, but they are saved as .pickle files, and I'm pretty sure it's bad practice to share pickle files over the internet. I have been warned about potential malware, so I'll just stay away from that. But if you need to see the dictionary, I can take the time to pretty-print one and share a screenshot. Any help is greatly appreciated!
Matplotlib does this when many xticks or yticks are plotted at the same value: the labels are drawn on top of one another and render as that smudged glyph. It is normal. If you limit the number of times the specific value is plotted, you can make it appear indistinguishable from the rest of the xticks.
Plot a simple example to test this out and you will see for yourself.
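For instance, here is a minimal sketch (not your data) that stacks many identical ticks on one position; the label is drawn over itself repeatedly and comes out looking exactly as "haunted" as described.

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.arange(10), np.arange(10))
# 30 ticks at the same x position: the label '5' is rendered 30 times
# on top of itself, producing the smudged, jpeggy look
ax.set_xticks([5] * 30)
plt.show()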
I have the following data:
2
[-0.09891464 -0.09715325 -0.09410605 -0.09019411 -0.0860636 -0.08205132
-0.07875871 -0.07614547 -0.07443062 -0.07346302 -0.07298417 -0.07290273
-0.07287797 -0.07287593 -0.07287593] code_length
[-0.98949882 -0.97240346 -0.94268702 -0.90432065 -0.86363176 -0.82404481
-0.79160852 -0.76596087 -0.74920381 -0.73978155 -0.73512854 -0.73433788
-0.73409758 -0.7340778 -0.7340778 ] code_length
[-0.08209141 0.24530752 0.57179519 0.89738478 1.22269259 1.54813354
1.87437147 2.20121635 2.52864319 2.85637075 3.18420369 3.51207073
3.83993948 4.16780833 4.49567718] code_length
[0.09891464 0.09715325 0.09410605 0.09019411 0.0860636 0.08205132
0.07875871 0.07614547 0.07443062 0.07346302 0.07298417 0.07290273
0.07287797 0.07287593 0.07287593] code_length
[-0.98949882 -0.97240346 -0.94268702 -0.90432065 -0.86363176 -0.82404481
-0.79160852 -0.76596087 -0.74920381 -0.73978155 -0.73512854 -0.73433788
-0.73409758 -0.7340778 -0.7340778 ] code_length
[-0.08209141 0.24530752 0.57179519 0.89738478 1.22269259 1.54813354
1.87437147 2.20121635 2.52864319 2.85637075 3.18420369 3.51207073
3.83993948 4.16780833 4.49567718] code_length
print(len(pos_list))
print(streamline_x[0])
print(streamline_y[0])
print(streamline_z[0])
print(streamline_x[1])
print(streamline_y[1])
print(streamline_z[1])
I would like to plot them, and also plot them with the negated z-component. These two loops do that:
for i in range(len(pos_list)):
    ax.plot3D(streamline_x[i], streamline_y[i], streamline_z[i],
              color=cfg._sections['colors'].get('mag_field'),
              linewidth=cfg._sections['styles'].get('lines'))

for i in range(len(pos_list)):
    ax.plot3D(streamline_x[i], streamline_y[i], -streamline_z[i],
              color=cfg._sections['colors'].get('mag_field'),
              linewidth=cfg._sections['styles'].get('lines'))
However, I would like to simplify it and create 3 lists for plotting instead of 6. I tried the following:
minus_streamline_z1 = []
minus_streamline_z2 = []

# multiply each element of the z-lists by (-1)
for x in streamline_z[0].tolist():
    minus_streamline_z1.append(x * (-1))
for x in streamline_z[1].tolist():
    minus_streamline_z2.append(x * (-1))

minus_streamline_z = [minus_streamline_z1, minus_streamline_z2]

# add the symmetric parts of each streamline so both halves plot together
for i in range(len(pos_list)):
    xs = streamline_x[i].tolist() + streamline_x[i].tolist()
    ys = streamline_y[i].tolist() + streamline_y[i].tolist()
    zs = streamline_z[i].tolist() + minus_streamline_z[i]
    ax.plot3D(xs, ys, zs)
What is wrong, please?
The first two loops give this figure:
and the second way gives some strange line:
The problem is you currently have the first list start at (-0.09891464, -0.98949882, -0.08209141) and move up in the z-direction to (-0.07287593, -0.7340778, 4.49567718). When you join the two lists together, it then jumps to (-0.09891464, -0.98949882, 0.08209141), before moving down in the z-direction to (-0.07287593, -0.7340778, -4.49567718).
This jump where the lists join leads to the straight line segment you see in the second figure.
To solve this, you could reverse the order of the first of the two lists (using [::-1] to index the list) as you join them together.
For example:
for i in range(2):
xs = streamline_x[i].tolist()[::-1] + streamline_x[i].tolist()
ys = streamline_y[i].tolist()[::-1] + streamline_y[i].tolist()
zs = streamline_z[i].tolist()[::-1] + minus_streamline_z[i]
ax.plot3D(xs, ys, zs)
Produces:
Note there is still a small linear segment, as you are jumping back up from z=-0.08209141 to z=0.08209141 where the two lists join, but for this particular plot that doesn't seem to be noticeable.
I have three arrays: r_vals, Tgas_vals, and n_vals. They are all numpy arrays of shape (9998,). The arrays have repeated values, and I want to iterate over the unique values of r_vals, find the corresponding values of Tgas_vals and n_vals, and use the last two arrays to calculate the weighted average. This is what I have right now:
def calc_weighted_average(r_vals, Tgas_vals, n_vals):
    for r in r_vals:
        mask = r == r_vals
        count = 0
        count += 1
        for t in Tgas_vals[mask]:
            print(count, np.average(Tgas_vals[mask]*n_vals[mask]))

weighted_average = calc_weighted_average(r_vals, Tgas_vals, n_vals)
The problem I am running into is that the function is only looping through once. Did I implement mask incorrectly, or is the problem somewhere else in the for loop?
I'm not sure exactly what you plan to do with all the averages, so I'll toss this out there and see if it's helpful. The following code calculates one weighted average per unique value of r_vals and stores them in a dictionary (which is then printed out).
import numpy as np

def calc_weighted_average(r_vals, z_vals, Tgas_vals, n_vals):
    weighted_vals = {}  # new variable to store rval => weighted ave.
    for r in np.unique(r_vals):
        mask = r_vals == r  # I think yours was backwards
        weighted_vals[r] = np.average(Tgas_vals[mask]*n_vals[mask])
    return weighted_vals

weighted_averages = calc_weighted_average(r_vals, z_vals, Tgas_vals, n_vals)
for rval in weighted_averages:
    print('%i : %0.4f' % (rval, weighted_averages[rval]))  # assuming rval is integer
Alternatively, you may want to factor "z_vals" in somehow; your question was not clear on this.
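As an aside, if by "weighted average" you mean Tgas weighted by n (an assumption about the intent here), np.average supports weights directly, e.g.:

weighted_vals[r] = np.average(Tgas_vals[mask], weights=n_vals[mask])  # Tgas weighted by n (assumed intent)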
Hoping someone can help me here. I have two BigQuery tables that I read into two different PCollections, p1 and p2. I essentially want to update product based on a Type II transformation that keeps track of history (previous values in the nested column in product) and appends new values from dwsku.
The idea is to check every row in each collection. If there is a match based on some table values (between p1 and p2), then check product's nested data to see if it contains all values in p1 (based on its sku number and brand). If it does not contain the most recent data from p2, take a copy of the format of the current nested data in product and fit the new data into it. Take this nested format and add it to the existing nested products in product.
def process_changes(element, productdata):
    for data in productdata:
        if element['sku_number'] == data['sku_number'] and element['brand'] == data['brand']:
            logging.info('Processing Product: ' + str(element['sku_number']) + ' brand:' + str(element['brand']))
            datatoappend = []
            for nestline in data['product']:
                logging.info('Nested Data: ' + nestline['product'])
                if nestline['in_use'] == 'Y' and (nestline['sku_description'] != element['sku_description']
                        or nestline['department_id'] != element['department_id']
                        or nestline['department_description'] != element['department_description']
                        or nestline['class_id'] != element['class_id']
                        or nestline['class_description'] != element['class_description']
                        or nestline['sub_class_id'] != element['sub_class_id']
                        or nestline['sub_class_description'] != element['sub_class_description']):
                    logging.info('we found a sku we need to update')
                    logging.info('sku is ' + data['sku_number'])
                    newline = nestline.copy()
                    logging.info('most recent nested product element turned off...')
                    nestline['in_use'] = 'N'
                    nestline['expiration_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day)  # CURRENT DATE
                    logging.info(nestline)
                    logging.info('inserting most recent change in dwsku inside nest')
                    newline['sku_description'] = element['sku_description']
                    newline['department_id'] = element['department_id']
                    newline['department_description'] = element['department_description']
                    newline['class_id'] = element['class_id']
                    newline['class_description'] = element['class_description']
                    newline['sub_class_id'] = element['sub_class_id']
                    newline['sub_class_description'] = element['sub_class_description']
                    newline['in_use'] = 'Y'
                    newline['effective_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day)  # CURRENT DATE
                    newline['modified_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day)  # CURRENT DATE
                    newline['modified_time'] = "%s:%s:%s" % (curdate.hour, curdate.minute, curdate.second)
                    nestline['expiration_date'] = "9999-01-01"
                    datatoappend.append(newline)
                else:
                    logging.info('Nothing changed for sku ' + str(data['sku_number']))
            for dt in datatoappend:
                logging.info('processed sku ' + str(element['sku_number']))
                logging.info('adding the changes (if any)')
                data['product'].append(dt)
    return data

changed_product = p1 | beam.FlatMap(process_changes, AsIter(p2))
Afterwards I want to add all values in p1 not in p2 in a nested format as seen in nestline.
Any help would be appreciated, as I'm wondering why my job is taking hours to run with nothing to show. Even the output logs in the Dataflow UI don't show anything.
Thanks in advance!
This can be quite expensive if the side input PCollection p2 is large. From your code snippet it's not clear how PCollection p2 is constructed, but if it is, for example, a text file that is 62.7MB in size, processing it per element can be pretty expensive. Can you consider using CoGroupByKey: https://beam.apache.org/documentation/programming-guide/#cogroupbykey
Also note that from a FlatMap you are supposed to return an iterator of elements from the processing method. It seems like you are returning a dictionary ('data'), which is probably incorrect.
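For illustration, here is a minimal sketch of the CoGroupByKey approach, keying both collections on the (sku_number, brand) fields your code matches on. The names to_keyed and merge_changes are hypothetical, and the merge body is only indicated:

import apache_beam as beam

def to_keyed(row):
    # key each row by the fields process_changes matches on
    return ((row['sku_number'], row['brand']), row)

def merge_changes(joined):
    (sku_number, brand), grouped = joined
    for product_row in grouped['product']:
        for dwsku_row in grouped['dwsku']:
            ...  # apply your Type II update logic from process_changes here
        yield product_row  # yield elements rather than returning a single dict

changed_product = (
    {'dwsku': p1 | 'KeyDwsku' >> beam.Map(to_keyed),
     'product': p2 | 'KeyProduct' >> beam.Map(to_keyed)}
    | beam.CoGroupByKey()
    | beam.FlatMap(merge_changes)
)

This way each (sku_number, brand) group is processed once, instead of scanning the whole side input for every element of p1.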