I have a 100.000.000 sample dataset and I want to make a histogram with pyplot. But reading this large file drains my memory critically (cursor not moving anymore, ...), so I'm looking for ways to 'help' pyplot.hist. I was thinking breaking up the file into several smaller files might help. But I wouldn't know how to combine them afterwards.
You can combine the output of pyplot.hist, or, as @titusjan suggested, numpy.histogram, as long as you keep the bins fixed each time you call it. For example:
import matplotlib.pyplot as plt
import numpy as np
# Generate some fake data
data=np.random.rand(1000)
# The fixed bins (change depending on your data)
bins=np.arange(0,1.1,0.1)
sub_hist = []
# Split into 100 sub-histograms of 10 samples each
for i in np.arange(0,1000,10):
    sub_hist_temp, bins_out = np.histogram(data[i:i+10],bins=bins)
    sub_hist.append(sub_hist_temp)
# Sum the histograms
hist_sum = np.array(sub_hist).sum(axis=0)
# Plot the new summed data, using plt.bar
fig=plt.figure()
ax1=fig.add_subplot(211)
ax1.bar(bins[:-1],hist_sum,width=0.1,align='edge') # Change width depending on your bins; align='edge' starts each bar at its left bin edge
# Plot the histogram of all data to check
ax2=fig.add_subplot(212)
hist_all, bins_out, patches = ax2.hist(data,bins=bins)
fig.savefig('histsplit.png')
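The example above still holds all the data in memory at once. Below is a minimal sketch of the same idea applied piece by piece, assuming the samples can be delivered as a sequence of NumPy arrays (for instance by reading the file in parts). The helper name `chunked_histogram` and the chunk size are illustrative only, not part of NumPy or matplotlib.

```python
import numpy as np

def chunked_histogram(chunks, bins):
    """Accumulate one histogram over an iterable of 1-D arrays.

    `chunks` is any iterable yielding arrays (hypothetical input, e.g. pieces
    of a large file); `bins` must be the same fixed edges for every chunk.
    """
    bins = np.asarray(bins)
    total = np.zeros(len(bins) - 1, dtype=np.int64)
    for chunk in chunks:
        counts, _ = np.histogram(chunk, bins=bins)
        total += counts
    return total

# Fake data standing in for the large file
rng = np.random.default_rng(0)
data = rng.random(10_000)
bins = np.linspace(0.0, 1.0, 11)

# Feed the data in pieces of 1000 samples; only one piece is histogrammed at a time
pieces = (data[i:i + 1000] for i in range(0, data.size, 1000))
hist_sum = chunked_histogram(pieces, bins)

# For comparison: histogram of everything at once
hist_all, _ = np.histogram(data, bins=bins)
```

Because only the running counts are kept between chunks, the memory needed is set by the chunk size and the number of bins, not by the total number of samples.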
I want to plot graph with a certain condition without manipulating my data frame.
For example, I created a countplot from a data frame in which many x-values have counts below 100. In the countplot those show up as empty bars that take up space, so I want to get rid of the categories with count < 100.
I tried to create another data frame with only count values higher than 100, but I wanted to know if there is a simpler/cleaner way to plot a graph, rather than creating a whole data frame.
plt.figure(figsize=(10,50))
plt.ylim(100,500)
sns.countplot(data=df, x='brand')
From this code, I see many empty bars caused by counts below 100, since ylim is set to 100-500.
import matplotlib.pyplot as plt
import seaborn as sns
plot_data = df.groupby('brand').size().reset_index(name='count').query('count>=100')
plt.figure(figsize=(10,50))
plt.ylim(100,500)
sns.barplot(data=plot_data, x='brand', y='count')
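If the goal is to avoid building a second data frame, another option may be to compute only the list of labels worth showing and hand it to the plotting call. The sketch below uses made-up data, and the helper name `categories_with_min_count` is hypothetical. Passing the resulting list to `sns.countplot` through its `order` argument is an assumption about how you would wire it up, so check it against your seaborn version.

```python
import pandas as pd

def categories_with_min_count(series, min_count):
    """Return the labels that occur at least `min_count` times."""
    counts = series.value_counts()  # sorted from most to least frequent
    return counts[counts >= min_count].index.tolist()

# Small made-up frame; 'brand' mirrors the column name used above
df = pd.DataFrame({"brand": ["a"] * 5 + ["b"] * 2 + ["c"] * 3})
keep = categories_with_min_count(df["brand"], min_count=3)

# `keep` could then be passed on, e.g.
# sns.countplot(data=df, x="brand", order=keep)
```

The original data frame stays untouched; only a short list of labels is created.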
I have a bubble plot like this, and I would like to put labels (their names) next to each bubble. Does anybody know how to do that?
@Falko referred to another post that indicates you should be looking for the text method of the axes. However, your problem is quite a bit more involved than that, because you'll need to implement an offset that scales dynamically with the size of the "bubble" (the marker). That means you'll be looking into the transformation methods of matplotlib.
As you didn't provide a simple example dataset to experiment with, I've used one that is freely available: earthquakes of 1974. In this example, I'm plotting the depth of the quake vs the date on which it occurred, using the magnitude of the earthquake as the size of the bubbles/markers. I'm appending the locations of where these earthquakes happened next to the markers, not inside (which is far easier: ignore the offset and set ha='center' in the call to ax.text).
Note that the bulk of this code example is merely about getting some dataset to toy with. What you really needed was just the ax.text method with the offset.
from __future__ import division # use real division in Python2.x
from matplotlib.dates import date2num
import matplotlib.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Get a dataset
data_url = 'http://datasets.flowingdata.com/earthquakes1974.csv'
df = pd.read_csv(data_url, parse_dates=['time'])
# Select a random subset of that dataframe to generate some variance in dates, magnitudes, ...
data = np.random.choice(df.shape[0], 10)
records = df.loc[data]
# Taint the dataset to add some bigger variance in the magnitude of the
# quake to show that the offset varies with the size of the marker
records.mag.values[:] = np.arange(10)
records.mag.values[0] = 50
records.mag.values[-1] = 100
dates = [date2num(dt) for dt in records.time]
f, ax = plt.subplots(1,1)
ax.scatter(dates, records.depth, s=records.mag*100, alpha=.4) # markersize is given in points**2 in recent versions of mpl
for _, record in records.iterrows():
    # Specify an offset for the text annotation:
    # it is approx the radius of the disc + 10 points to the right.
    # dpi_scale_trans works in inches, and 1 point = 1/72 inch
    dx, dy = (np.sqrt(record.mag*100)/2 + 10)/72., 0.
    offset = transforms.ScaledTranslation(dx, dy, f.dpi_scale_trans)
    ax.text(date2num(record.time), record.depth, s=record.place,
            va='center', ha='left',
            transform=ax.transData + offset)
ax.set_xticks(dates)
ax.set_xticklabels([el.strftime("%Y-%m") for el in records.time], rotation=-60)
ax.set_ylabel('depth of earthquake')
plt.show()
For one such run, I got:
Definitely not pretty because of the overlapping labels, but it was just an example to show how to use the transforms in matplotlib to dynamically add an offset to the labels.
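A possibly simpler route to the same kind of offset is `ax.annotate` with `textcoords="offset points"`, which takes the shift directly in points and so avoids building a transform by hand. The sketch below uses made-up positions, sizes and names; the helper `label_offset_points` is hypothetical. It assumes, as above, that the scatter size `s` is an area in points**2.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def label_offset_points(marker_area, pad=10.0):
    """Horizontal offset in points: marker radius plus some padding.

    scatter's `s` is an area in points**2, so the marker is roughly
    sqrt(s) points across and its radius is sqrt(s) / 2.
    """
    return np.sqrt(marker_area) / 2.0 + pad

# Made-up data (positions, sizes and names are illustrative)
x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 15.0])
sizes = np.array([100.0, 400.0, 900.0])
names = ["A", "B", "C"]

fig, ax = plt.subplots()
ax.scatter(x, y, s=sizes, alpha=0.4)

labels = []
for xi, yi, si, name in zip(x, y, sizes, names):
    # Place the text to the right of the marker, shifted by radius + padding
    labels.append(ax.annotate(name, xy=(xi, yi),
                              xytext=(label_offset_points(si), 0),
                              textcoords="offset points",
                              va="center", ha="left"))
```

Since the offset is expressed in points rather than data units, it should stay next to the marker when the axes limits or the figure size change.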
I am looping through a bunch of CSV files containing various measurements.
Each file might be from one of 4 different data sources.
In each file, I merge the data into monthly datasets, that I then plot in a 3x4 grid. After this plot has been saved, the loop moves on and does the same to the next file.
This part I have figured out; however, I would like to add a visual cue to the plots to show which data source they come from. As far as I understand it (and have tried it)
plt.subplot(4,3,1)
plt.hist(Jan_Data,facecolor='Red')
plt.ylabel('value count')
plt.title('January')
does work; however, this way I would have to add facecolor='Red' by hand to each of the 12 subplots. Looping through the plots won't work in this situation, since I want the ylabel only on the leftmost plots and the xlabel only on the bottom row.
Setting facecolor at the beginning in
fig = plt.figure(figsize=(20,15),facecolor='Red')
does not work, since it only changes the background color of the 20 by 15 figure now, which subsequently gets ignored when I save it to a PNG, since it only gets set for screen output.
So is there just a simple setthecolorofallbars='Red' command for plt.hist(… or plt.savefig(… I am missing, or should I just copy n' paste it to all twelve months?
You can use mpl.rc("axes", prop_cycle=cycler(color=["red"])) to set the default color cycle for all your axes. (Older matplotlib versions used the axes.color_cycle parameter for this; it was replaced by axes.prop_cycle.)
In this little toy example, I use the with mpl.rc_context block to limit the effects of mpl.rc to just the block. This way you don't spoil the default parameters for your whole session.
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from cycler import cycler
np.random.seed(42)
# create some toy data
n, m = 2, 2
data = []
for i in range(n*m):
    data.append(np.random.rand(30))
# and do the plotting
with mpl.rc_context():
    mpl.rc("axes", prop_cycle=cycler(color=["red"]))
    fig, axes = plt.subplots(n, m, figsize=(8,8))
    for ax, d in zip(axes.flat, data):
        ax.hist(d)
The problem with the x- and y-labels (when you use loops) can be solved by using plt.subplots, as you can access every axis separately.
import matplotlib.pyplot as plt
import numpy.random
# creating figure with 4 plots
fig,ax = plt.subplots(2,2)
# some data
data = numpy.random.randn(4,1000)
# some titles
title = ['Jan','Feb','Mar','April']
xlabel = ['xlabel1','xlabel2']
ylabel = ['ylabel1','ylabel2']
for i in range(ax.size):
    a = ax[i//2,i%2] # integer division to get the row index
    a.hist(data[i],facecolor='r',bins=50)
    a.set_title(title[i])
# write the ylabels on all axes on the left hand side
for j in range(ax.shape[0]):
    ax[j,0].set_ylabel(ylabel[j])
# write the xlabels on all axes on the bottom
for j in range(ax.shape[1]):
    ax[-1,j].set_xlabel(xlabel[j])
fig.tight_layout()
All features (like titles) which are not constant can be put into arrays and placed at the appropriate axis.
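Applied to the 4x3 monthly grid from the question, the same idea could look roughly like the sketch below. The data are random placeholders, and `source_color` stands in for whatever colour you assign to each data source; both are assumptions for illustration. The colour is set once in a variable, and the labels are only written on the left column and the bottom row.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
rng = np.random.default_rng(0)
monthly_data = [rng.normal(size=200) for _ in months]  # made-up data

source_color = "red"  # one colour per data source (assumed mapping)
nrows, ncols = 4, 3
fig, axes = plt.subplots(nrows, ncols, figsize=(12, 9))

for idx, (ax, values, month) in enumerate(zip(axes.flat, monthly_data, months)):
    row, col = divmod(idx, ncols)
    ax.hist(values, facecolor=source_color)
    ax.set_title(month)
    if col == 0:             # leftmost column only
        ax.set_ylabel("value count")
    if row == nrows - 1:     # bottom row only
        ax.set_xlabel("value")
fig.tight_layout()
```

Changing `source_color` at the top of the loop over files is then enough to recolour all twelve histograms of a figure.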
I have a histogram like this:
My data are stored by appending to a list, like this:
(while parsing the file)
{
[...]
a.append(int(number))
#a = [1,1,2,1,1, ]...
}
plt.hist(a, 180)
But as you can see from the image, there are a lot of blank areas, so I would like to build a bar chart from this data. How can I reorganize the data like this:
#a = [ 1: 4023, 2: 3043, 3:...]
where 1 is the "number" and 4023 is an example of how many "hits" the number 1 has? From what I have seen, this way I can call:
plt.bar(...)
to create it, so that I show only the relevant numbers and the plot is more readable.
A simple way to cut the white areas in the histogram would also be welcome.
I would also like to show the count on top of each column, but I have no idea how to do it.
Assuming you have some numpy array a full of integers, the code below will produce the bar chart you want.
It uses np.bincount to count the number of occurrences of each value; note that it only works for non-negative integers.
Also note that ax.bar centers each bar on its x position by default in current matplotlib versions, so the indices can be passed directly, and that only the values that actually occur (b[ind]) are plotted.
import matplotlib.pyplot as plt
import numpy as np
# Generate some random data.
N=300
a = np.random.randint(low=0, high=21, size=N) # integers from 0 to 20 inclusive
# Use bincount and nonzero to generate your data in the correct format.
b = np.bincount(a)
ind = np.nonzero(b)[0]
width=0.8
fig, ax = plt.subplots()
ax.bar(ind, b[ind], width)
plt.show()
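To remove the blank areas entirely and write the count on top of each bar, one possible approach is to place the bars at consecutive positions and use the actual values only as tick labels. The sketch below uses a small made-up array; `np.unique(..., return_counts=True)` replaces `np.bincount` here so that only occurring values are returned.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

a = np.array([1, 1, 2, 1, 1, 7, 7, 40])  # made-up values

# Distinct values and how often each one occurs
values, counts = np.unique(a, return_counts=True)
positions = np.arange(len(values))  # consecutive slots, so no gaps

fig, ax = plt.subplots()
ax.bar(positions, counts, width=0.8)
ax.set_xticks(positions)
ax.set_xticklabels([str(v) for v in values])

# Write each count just above its bar
for pos, count in zip(positions, counts):
    ax.text(pos, count, str(count), ha="center", va="bottom")
```

Note that the x axis is then categorical rather than numeric: the distance between two bars no longer reflects the distance between the values.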
This is a very general question.
I have a series of data with a quantity (y) versus time (x). It is a very long series and the data are sometimes pretty noisy, sometimes better.
I would like to write Python code that lets me look at these data one x-range at a time (just a snapshot, so to say), and then lets me decide whether I want to "store" the sequence or not. Then it moves on to the next sequence and does the same, and so on. At the end I will have a stack of sequences that I can analyze separately.
I need some suggestions about the graphical part: I don't have a clue of which modules I need.
Matplotlib is probably one of the best options for the graphical part. For example:
import numpy as np
import matplotlib.pyplot as plt
plt.ion()
# make some data of some size
nblocks = 10
block_size = 1000
size = block_size*nblocks
data = np.random.normal(0.,1.,size=size)
# create a matplotlib figure with some plotting axes
fig = plt.figure()
ax = fig.add_subplot(111)
# display the figure
plt.show()
# storage for blocks to keep
kept_blocks = []
for block in data.reshape(nblocks,block_size):
    #plot the block
    ax.plot(block)
    #force matplotlib to rerender
    plt.draw()
    plt.pause(0.01) # give the GUI event loop time to update the window
    # ask user for some input (use raw_input in Python 2)
    answer = input("Keep block? [Y/n]")
    if answer.lower() != "n":
        kept_blocks.append(block)
    #clear the plotting axes
    ax.cla()
# turn the kept blocks into a 2D array
kept_blocks = np.vstack(kept_blocks)
#or a 1D array
#kept_blocks = np.hstack(kept_blocks)
Matplotlib is well supported and is the de facto plotting standard in Python.