Custom Histogram Input for a very large dataset - python

I have a use case for creating a histogram that is more meaningful than the default ones.
I use Elasticsearch as a datastore, where all my numerical data is stored.
It has a number field, price, with highly varying values: most prices in the range 100 to 999 are centered around 399 to 500, from 501 to 999 there are few, then again from 999 to 1299 there is a huge concentration, and so on.
example:
399-500: 1542
501-999: 7501
1000-1299: 10214
1299-2000: 154
...
When I create a histogram with a bucket size of 200, only 2 of the 8 buckets hold 75% of the bar height; the others are very small.
If I choose a small bucket size, the chart becomes heavy to render, with 1000+ buckets.
If I choose a big bucket size, the insights from the plotted chart are not useful.
I want intelligent bucketing where I can split the big buckets into smaller ranges (say, widths of 50-70) and at the same time merge the small buckets into a single big one (say, of width 1000), so that the charts are more meaningful.
Is there Python code for such a use case?
Edited:
Due to the two spikes, I can neither visualize the flat regions to show the actual variation, nor can I show the distribution within the spikes, to say for instance that the price range 449 to 499 contributes the most to the spike of 399 to 500.
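For what it's worth, one common way to get such "intelligent" buckets is quantile binning: every bucket holds roughly the same number of values, so narrow buckets appear automatically inside the spikes and wide ones in the sparse regions. A minimal sketch with NumPy; the prices array here is a made-up stand-in for values exported from Elasticsearch:
import numpy as np

# made-up stand-in for prices fetched from Elasticsearch
rng = np.random.default_rng(0)
prices = np.concatenate([
    rng.normal(450, 30, 1542),     # spike around 399-500
    rng.uniform(501, 999, 750),    # sparse flat region
    rng.normal(1150, 80, 10214),   # spike around 1000-1299
    rng.uniform(1299, 2000, 154),  # sparse tail
])

n_buckets = 20
# quantile edges: each bucket holds ~the same number of prices, so edges
# land close together inside the spikes and far apart in the tails
edges = np.quantile(prices, np.linspace(0, 1, n_buckets + 1))
counts, _ = np.histogram(prices, bins=edges)
print(np.round(edges).astype(int))
print(counts)
Since each bucket then holds roughly the same count by construction, the information moves into the bucket widths: dense price regions show up as many narrow buckets.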

Correct me if I am wrong, but if you change the bin width depending on the amount of data located in a bin, there is no longer much reason to use a histogram, since it will not convey the same information anymore.
Why don't you instead use a normal plot to show the result? Simple code to do so would be the following, for example:
import numpy as np
import matplotlib.pyplot as plt

# fake data mimicking the distribution described in the question
data_1 = np.random.normal(450, 50, 1542)
data_2 = np.random.normal(700, 200, 7501)
data_3 = np.random.normal(1150, 150, 10214)
data_4 = np.random.normal(1650, 350, 154)
data = np.concatenate((data_1, data_2, data_3, data_4))

nr_of_bins = 1000
offset = min(data)
value_range = max(data) - min(data)  # renamed so it doesn't shadow the builtin range

# count how many values fall into each of the equally wide bins;
# clamping the index keeps the maximum value inside the last bin
bins = np.zeros(nr_of_bins)
for d in data:
    bin_index = min(int((d - offset) / value_range * nr_of_bins), nr_of_bins - 1)
    bins[bin_index] += 1

plt.plot(np.linspace(min(data), max(data), nr_of_bins), bins)
plt.xlabel("Value")
plt.ylabel(f"Nr of values, binwidth = {round(value_range / nr_of_bins, 2)}")
plt.show()
Giving the following final result:

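As an aside, the manual binning loop above reimplements what np.histogram already provides; an equivalent, vectorized version of the binning and plotting step (reusing data and nr_of_bins from above) would be:
counts, edges = np.histogram(data, bins=nr_of_bins)
plt.plot((edges[:-1] + edges[1:]) / 2, counts)
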
Related

Matplotlib Scatterplot with 4 visual encodings

The initial step is a pandas DataFrame with several columns.
The second step I did is to convert some columns of this DataFrame into a NumPy array using the to_numpy() function.
I retrieve something like:
[[100 200 3.5 1] [100 200 3.5 1] [100 300 6.2 1] [200 125 4.2 1] [100 300 6.2 1] [100 200 3.5 1]]
where the first element is an origin id,
the second element is a destiny id,
the 3rd is the distance between origin and destiny,
and the 4th is just a counter (1 element) (I have included it just because I think it could be required to count elements; just ignore it if your proposed solution doesn't use it).
I would like to have a scatterplot with the following specifications:
origin_id on the x axis
destiny_id on the y axis
color of the scatter dot on a warm scale, indicating the distance between both points (3rd element)
size of the scatter dot depending on the number of pairs of origin_id/destiny_id we have; for example, we have three 100 200 combinations, so its size should be bigger than the one for the combination 200 125, which only has one entry
I have tried, but I'm not able to include all the prerequisites in this plot.
How could this be achieved in matplotlib? Or is there any other easier approach using pandas directly?
If I understood your requirements correctly, this should do the trick:
import matplotlib.pyplot as plt
import numpy as np

data = np.array([[100, 200, 3.5, 1],
                 [100, 200, 3.5, 1],
                 [100, 300, 6.2, 1],
                 [200, 125, 4.2, 1],
                 [100, 300, 6.2, 1],
                 [100, 200, 3.5, 1]])

unique, counts = np.unique(data, axis=0, return_counts=True)

x = unique[:, 0]
y = unique[:, 1]
c = unique[:, 2]

## figure out a nice looking scaling factor here
# and remember that the scatter point size is supposed to be an area,
# hence squaring a base factor is ideal
s = (counts * 10)**2

fig, ax = plt.subplots()
sca = ax.scatter(x, y, c=c, s=s)
plt.colorbar(sca)
plt.show()
which yields:

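As an aside, since the question asked for a warm color scale: the code above leaves the colormap at its default, but it can be set on the same call, e.g. with matplotlib's built-in 'autumn' map:
sca = ax.scatter(x, y, c=c, s=s, cmap="autumn")
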
I want to detect ranges with the same numerical boundaries of a dataset using matplotlib or pandas in python 3.7

I have a ton of ranges. They all consist of numbers. Each range has a maximum and a minimum which cannot be exceeded, but if you have two ranges and the max point of one reaches above the min point of the other, there is a small area that covers both of them, and you can write one range that includes the other.
I want to see if some ranges overlap, or if I can find some ranges that cover most of the others. The goal would be to see if I can simplify them by using one range that the others fit inside. For example, 7.8-9.6 and 7.9-9.6 can be covered with one range.
You can see my attempt to visualize them below. But when I use my entire dataset, consisting of hundreds of ranges, my graph is no longer useful.
I know that I can detect recurring ranges using Python, but I don't want to know how often a range occurs; I want to know how many ranges lie within the same numerical boundaries. I want to see if I can have a couple of ranges covering all of them. Finally, my goal is to have the master ranges sorted into categories, meaning that I have range 1 covering 50 other ranges, then range 2 covering 25 ranges, and so on.
My current program shows the penetration of ranges, but I also want that in a printed output with the exact digits.
It would be nice if you share some ideas to solve that problem or if you have any suggestions on tools within Python 3.7.
import matplotlib.pyplot as plt
intervals = [[3.6,4.5],
[3.6,4.5],
[7.8,9.6],
[7.9,9.6],
[7.8,9.6],
[3.4,4.1],
[2.8,3.4],
[8.25,9.83],
[3.62,3.96],
[8.25,9.83],
[0.62,0.68],
[2.15,2.49],
[0.8,1.0],
[0.8,1.0],
[3.1,3.9],
[6.7,8.3],
[1,1.5],
[1,1.2],
[1.5,1.8],
[1.8,2.5],
[3,4.0],
[6.5,8.0],
[1.129,1.35],
[2.82,3.38],
[1.69,3.38],
[3.38,6.21],
[2.25,2.82],
[5.649,6.214],
[1.920,6.214]
]
for interval in intervals:  # don't shadow the builtin int
    plt.plot(interval, [0, 0], 'b', alpha=0.2, linewidth=100)
plt.show()
Here is an idea: you make a pandas DataFrame from the array and subtract the values, column2 - column1 (column 1 is x, and column 2 is y). After that you create a histogram of the resulting range widths and their frequency.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
intervals = [[3.6,4.5],
[3.6,4.5],
[7.8,9.6],
[7.9,9.6],
[7.8,9.6],
[3.4,4.1],
[2.8,3.4],
[8.25,9.83],
[3.62,3.96],
[8.25,9.83],
[0.62,0.68],
[2.15,2.49],
[0.8,1.0],
[0.8,1.0],
[3.1,3.9],
[6.7,8.3],
[1,1.5],
[1,1.2],
[1.5,1.8],
[1.8,2.5],
[3,4.0],
[6.5,8.0],
[1.129,1.35],
[2.82,3.38],
[1.69,3.38],
[3.38,6.21],
[2.25,2.82],
[5.649,6.214],
[1.920,6.214]]
intervals_ar = np.array(intervals)
df = pd.DataFrame({'Column1': intervals_ar[:, 0], 'Column2': intervals_ar[:, 1]})
df['Ranges'] = df['Column2'] - df['Column1']
print(df)

frequency_range = df['Ranges'].value_counts().sort_index()
print(frequency_range)

# histogram of the range widths themselves (plotting the widths directly,
# rather than their value counts)
df['Ranges'].plot(kind='hist', bins=5)
plt.title("Histogram: frequency vs range (column2 - column1)")
plt.show()
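Note that the width histogram above summarizes the sizes of the ranges, not their overlaps. For the original goal (master ranges plus a count of how many intervals each one covers), here is a minimal sketch of the classic sort-and-sweep merge, reusing the intervals list from above:
# sort by start, then sweep: extend the current master range while the
# next interval overlaps it, otherwise open a new master range
merged = []  # entries are [start, end, covered_count]
for start, end in sorted(intervals):
    if merged and start <= merged[-1][1]:
        merged[-1][1] = max(merged[-1][1], end)
        merged[-1][2] += 1
    else:
        merged.append([start, end, 1])

# print the master ranges, most-covering first
for start, end, count in sorted(merged, key=lambda m: -m[2]):
    print(f"{start:.2f} - {end:.2f} covers {count} interval(s)")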

How to plot a value_counts in pandas that has a huge number of different counts not distributed evenly

Let's say I have the following data:
s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
s2.value_counts(normalize=True).plot()
What I want to show in the plot is that there are a few numbers that make up the majority of cases. The problem is that this will be seen on the far left side of the graph, and then there will be a straight line for all the other categories.
In the real data the x axis will be categorical, with about 18000 categories; 4% of the counts will be around 10000 high, and then the rest will drop off to around 50.
I want to show this to an audience of "ordinary" business people, so it can't be some fancy, hard-to-read solution.
Update: see unutbu's answer.
I updated the code and I'm getting an error for qcut when trying to use tuples:
TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'
df = pd.DataFrame({'s1':[1,0,1,0], 's2':[1,0,1,1], 's3':[1,0,1,1], 's4':[0,0,0,1]})
perms = df.apply(tuple, axis=1)
prob = perms.value_counts(normalize=True).reset_index(drop=True)
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
                           labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
You could keep the normalized value counts above a certain threshold, then sum the values below the threshold together and clump them into one category, which could be called, say, "other".
By choosing the threshold high enough, you will be able to display the most important contributors to the overall probability distribution, while still showing the size of the tail in the bar labeled "other":
import matplotlib.pyplot as plt
import pandas as pd
s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
prob = s2.value_counts(normalize=True)
threshold = 0.02
mask = prob > threshold
tail_prob = prob.loc[~mask].sum()
prob = prob.loc[mask]
prob['other'] = tail_prob
prob.plot(kind='bar')
plt.xticks(rotation=25)
plt.show()
There is a limit to the number of category labels you can sensibly display on a
bar graph. For a normal-sized graph 3000 is way too many. Moreover, it is
probably not reasonable to expect an audience to glean any meaning out of
reading 3000 labels.
The graph should summarize the data. And the main point seems to be that 4 or 5% of the categories constitute the vast majority of the cases. So to drive home that point, perhaps use pd.qcut to categorize the cases into simple categories such as bottom 25%, mid 70%, and top 5%:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
N = 18000
categories = np.arange(N)
np.random.shuffle(categories)
M = int(N*0.04)
prob = pd.Series(np.concatenate([np.random.randint(9000, 11000, size=M),
                                 np.random.randint(0, 100, size=N-M)]),
                 index=categories)
prob /= prob.sum()
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
                           labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
Just log the axis (I have no pandas, but it should be similar):
import numpy as np
import matplotlib.pyplot as plt

# note: the zeros in the data become -inf under log and are not drawn
s2 = np.log([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
plt.plot(s2)
plt.show()
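An alternative that avoids transforming the data at all (and sidesteps the -inf values that np.log produces for the zeros) is to keep the raw counts and log-scale the axis instead; with the Series s2 from the question, pandas can do this directly:
s2.value_counts(normalize=True).plot(logy=True)
plt.show()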

Is there a way for IPython to generate these kinds of charts given a dataframe?

This picture
Please ignore the background image. The foreground chart is what I am interested in showing using pandas or numpy or scipy (or anything in IPython).
I have a dataframe where each row represents temperatures for a single day.
This is an example of some rows:
            100   200   300   400   500   600  ...  2300
10/3/2013  53*C  57*C  48*C  49*C  54*C  54*C  ...  55*C
10/4/2013  45*C  47*C  48*C  49*C  50*C  52*C  ...  57*C
Is there a way to get a chart that represents the changes from hour to hour, using the first column as a 'zero'?
Something quick and dirty that might get you most of the way there, assuming your data frame is named df:
import matplotlib.pyplot as plt
plt.imshow(df.T.diff().fillna(0.0).T.drop(0, axis=1).values)
Since I can't easily construct a sample version with your exact column labels, there might be slight additional tinkering with getting rid of any index columns that are included in the diff and moved with the transposition. But this worked to make a simple heat-map-ish plot for me on a random data example.
Then you can create a matplotlib figure or axis object and specify whatever you want for the x- and y-axis labels.
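A self-contained toy version of the same idea, with made-up dates, hour labels, and temperatures, to show the diff-and-imshow step end to end (diff(axis=1) is equivalent to the transpose dance above):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# fake data: 5 days x 8 hourly readings
rng = np.random.default_rng(1)
hours = [100, 200, 300, 400, 500, 600, 700, 800]
df = pd.DataFrame(rng.integers(45, 58, size=(5, 8)),
                  index=pd.date_range("2013-10-03", periods=5),
                  columns=hours)

# hour-to-hour change within each day; the first hour has no
# predecessor, so its all-NaN diff column is dropped
changes = df.diff(axis=1).iloc[:, 1:]

plt.imshow(changes.values, aspect="auto", cmap="coolwarm")
plt.colorbar(label="change vs previous hour")
plt.yticks(range(len(df)), df.index.strftime("%m/%d"))
plt.xticks(range(len(changes.columns)), changes.columns)
plt.show()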
You could just plot lines one at a time for each row with an offset:
nrows, ncols = 12, 30
# make up some fake data:
d = np.random.rand(nrows, ncols)
d *= np.sin(2*np.pi*np.arange(ncols)*4/ncols)
d *= np.exp(-0.5*(np.arange(nrows)-nrows/2)**2/(nrows/4)**2)[:,None]
#this is all you need, if you already have the data:
for i, r in enumerate(d):
plt.fill_between(np.arange(ncols), r+(nrows-i)/2., lw=2, facecolor='white')
You could do it all at once if you don't need the fill color to block the previous line:
d += np.arange(nrows)[:, None]
plt.plot(d.T)

plotting high precision data

I have an array which contains error values as a function of two different quantities (alpha and eigRange).
I fill my array like this:
for j in range(n):
    for i in range(alphaLen):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m-j, m, alpha, "cpu")
        costListTrain[j, i] = cost.err(xt_, xt_, yt_, c)

normedValues = costListTrain / np.max(costListTrain.ravel())
where
n = 20
alpha_list = [0.0001,0.0003,0.0008,0.001,0.003,0.006,0.01,0.03,0.05]
My costListTrain array contains some values that have very small differences, e.g.:
2.809458902485728 2.809458905776425 2.809458913576337 2.809459011062461
2.030326752376704 2.030329906064879 2.030337351188699 2.030428976282031
1.919840839066182 1.919846470077076 1.919859731440199 1.920021453630778
1.858436351617677 1.858444223016128 1.858462730482461 1.858687054377165
1.475871326997542 1.475901926855846 1.475973476249240 1.476822830933632
1.475775410801635 1.475806023102173 1.475877601316863 1.476727286424228
1.475774284270633 1.475804896751524 1.475876475382906 1.476726165223209
1.463578292548192 1.463611627166494 1.463689466240788 1.464609083309240
1.462859608038034 1.462893157900139 1.462971489632478 1.463896516033939
1.461912706143012 1.461954067956570 1.462047793798572 1.463079574605320
1.450581041157659 1.452770209885761 1.454835202839513 1.459676311335618
1.450581041157643 1.452770209885764 1.454835202839484 1.459676311335624
1.450581041157651 1.452770209885735 1.454835202839484 1.459676311335610
1.450581041157597 1.452770209885784 1.454835202839503 1.459676311335620
1.450581041157575 1.452770209885757 1.454835202839496 1.459676311335619
1.450581041157716 1.452770209885711 1.454835202839499 1.459676311335613
1.450581041157667 1.452770209885744 1.454835202839509 1.459676311335625
1.450581041157649 1.452770209885750 1.454835202839476 1.459676311335617
1.450581041157655 1.452770209885708 1.454835202839442 1.459676311335622
1.450581041157571 1.452770209885700 1.454835202839498 1.459676311335622
As you can see here, the values are very, very close together!
I am trying to plot this data in a way where I have the two quantities on the x and y axes and the error value is represented by the dot color.
This is how I'm plotting my data:
alpha_list = np.log(alpha_list)
eigenvalues, alphaa = np.meshgrid(eigRange, alpha_list)
vMin = np.min(costListTrain)
vMax = np.max(costListTrain)
plt.scatter(eigenvalues, alphaa, s=70, c=normedValues, vmin=vMin, vmax=vMax, alpha=0.50)
but the result is not correct.
I tried to normalize my error values by dividing all of them by the max, but it didn't work!
The only way I could make it work (which is incorrect) is to normalize my data in two different ways: one based on each column (which means factor 1 is constant and factor 2 changes), and the other based on each row (factor 2 constant and factor 1 changing). But that doesn't really make sense, because I need a single plot that shows the tradeoff between the two quantities and their effect on the error values.
UPDATE
This is what I mean by the last paragraph.
Normalizing values based on the max of each row, which corresponds to the eigenvalues:
maxsEigBasedTrain = np.amax(costListTrain.T, 1)[:, np.newaxis]
maxsEigBasedTest = np.amax(costListTest.T, 1)[:, np.newaxis]
normEigCostTrain = costListTrain.T / maxsEigBasedTrain
normEigCostTest = costListTest.T / maxsEigBasedTest
Normalizing values based on the max of each column, which corresponds to the alphas:
maxsAlphaBasedTrain = np.amax(costListTrain, 1)[:, np.newaxis]
maxsAlphaBasedTest = np.amax(costListTest, 1)[:, np.newaxis]
normAlphaCostTrain = costListTrain / maxsAlphaBasedTrain
normAlphaCostTest = costListTest / maxsAlphaBasedTest
Plot 1:
Plot 2: where no. of eigenvalues = 10 and alpha changes (should correspond to column 10 of plot 1):
Plot 3: where alpha = 0.0001 and the eigenvalues change (should correspond to the first row of plot 1):
But as you can see, the results are different from plot 1!
UPDATE:
Just to clarify things some more, this is how I read my data:
import numpy as np
from sklearn import datasets  # the unused make_regression import is dropped

rng = np.random.RandomState(0)
diabetes = datasets.load_diabetes()
X_diabetes, y_diabetes = diabetes.data, diabetes.target
X_diabetes=np.c_[np.ones(len(X_diabetes)),X_diabetes]
ind = np.arange(X_diabetes.shape[0])
rng.shuffle(ind)
#===============================================================================
# Split Data
#===============================================================================
import math
cross= math.ceil(0.7*len(X_diabetes))
ind_train = ind[:cross]
X_train, y_train = X_diabetes[ind_train], y_diabetes[ind_train]
ind_val=ind[cross:]
X_val,y_val= X_diabetes[ind_val], y_diabetes[ind_val]
I also uploaded .csv files HERE
log.csv contains the original values before normalization for plot 1
normalizedLog.csv for plot 1
eigenConst.csv for plot 2
alphaConst.csv for plot 3
I think I found the answer. First of all, there was one problem in my code: I was expecting the "No. of eigenvalues" to correspond to rows, but in my for loop they fill the columns. The correct version is this:
for i in range(alphaLen):
    for j in range(n):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m-j, m, alpha, "cpu")
        costListTrain[i, j] = cost.err(xt_, xt_, yt_, c)
        costListTest[i, j] = cost.err(xt_, xv_, yv_, c)
After asking friends and colleagues, I got this answer:

I would assume that, by default, imshow and other plotting commands you might want to use do equally sized intervals on the values you are plotting. If you can set that to logarithmic, you should be fine. Ideally, equally "populated bins" would prove most effective, I guess.

For plotting, I just subtract the min value from the error, then add a small number, and at the end take the log:
temp = costListTrain - costListTrain.min()
temp += 0.00000001  # avoid log(0)
extent = [0, 20, alpha_list[0], alpha_list[-1]]
# note: the old 'spectral' colormap is called 'nipy_spectral' in newer matplotlib
plt.imshow(np.log(temp), interpolation="nearest", cmap=plt.get_cmap('nipy_spectral'),
           extent=extent, origin="lower")
plt.colorbar()
and the result is:
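As an aside, the "equally populated bins" idea from the quoted advice can also be applied to the color mapping directly, without log-transforming, by building a BoundaryNorm from quantiles of the error values; a sketch with a made-up stand-in for costListTrain:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm

# made-up stand-in for costListTrain
rng = np.random.default_rng(2)
errors = 1.45 + rng.exponential(0.05, size=(20, 9))

# put the color boundaries at the quantiles: each color band then covers
# roughly the same number of cells, spreading out near-identical values
levels = np.quantile(errors, np.linspace(0, 1, 11))
norm = BoundaryNorm(levels, ncolors=256)

plt.imshow(errors, interpolation="nearest", cmap="viridis", norm=norm,
           origin="lower", aspect="auto")
plt.colorbar()
plt.show()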
