I would like to make a bar chart in Python using matplotlib.pyplot. The data consists of an index, which is a list of datetimes, and a number corresponding to each datetime. Several samples belong to the same day. However, the bar chart only shows the first sample corresponding to a given datetime, instead of all of them. How can I make a bar chart showing every entry?
The index has the following structure:
ind = [datetime.datetime(2017, 3, 1, 0, 0), datetime.datetime(2017, 3, 1, 0, 0),
datetime.datetime(2017, 3, 15, 0, 0), datetime.datetime(2017, 3, 15, 0, 0)]
and the values are just integers:
values = [10, 20, 30, 40]
So when plotting, it only shows the bars 2017-3-1 with value 10 and 2017-3-15 with value 30. How can I show all of them?
You can groupby the dates, sum the values, and then plot the bar chart from the resulting dataframe:
import pandas as pd
df = pd.DataFrame(data=values, index=ind)
df = df.groupby(df.index).sum()
df.plot(kind='bar')
If instead you want all entries to appear in the plot, regardless of duplicate dates, skip the groupby and plot the original dataframe directly:
df = pd.DataFrame(data=values, index=ind)
df.plot(kind='bar')
Entries with a duplicate date will then be plotted as separate bars.
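Putting the two options together, here is a minimal runnable sketch (the Agg backend is an assumption so the script runs without a display; the data is the example from the question):

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

ind = [datetime.datetime(2017, 3, 1), datetime.datetime(2017, 3, 1),
       datetime.datetime(2017, 3, 15), datetime.datetime(2017, 3, 15)]
values = [10, 20, 30, 40]
df = pd.DataFrame(data=values, index=ind)

# Option 1: one bar per unique date, duplicates summed
summed = df.groupby(df.index).sum()
print(summed[0].tolist())  # [30, 70]
summed.plot(kind='bar')

# Option 2: one bar per entry, duplicate dates kept
df.plot(kind='bar')
plt.close('all')
```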
Related
I know how to plot a histogram when individual datapoints are given like:
(33, 45, 54, 33, 21, 29, 15, ...)
by simply using something like matplotlib.pyplot.hist(x, bins=10)
but what if I only have grouped data like:
Marks    Number of students
0-10     8
10-20    12
20-30    24
30-40    26
and so on.
I know that I can use bar plots to mimic a histogram by changing xticks but what if I want to do this by using only hist function of matplotlib.pyplot?
Is it possible to do this?
You can build the hist() params manually and use the existing value counts as weights.
Say you have this df:
>>> df = pd.DataFrame({'Marks': ['0-10', '10-20', '20-30', '30-40'], 'Number of students': [8, 12, 24, 26]})
Marks Number of students
0 0-10 8
1 10-20 12
2 20-30 24
3 30-40 26
The bins are all the unique boundary values in Marks:
>>> bins = pd.unique(df.Marks.str.split('-', expand=True).astype(int).values.ravel())
array([ 0, 10, 20, 30, 40])
Choose one x value per bin, e.g. the left edge to make it easy:
>>> x = bins[:-1]
array([ 0, 10, 20, 30])
Use the existing value counts (Number of students) as weights:
>>> weights = df['Number of students'].values
array([ 8, 12, 24, 26])
Then plug these into hist():
>>> plt.hist(x=x, bins=bins, weights=weights)
One possibility is to “ungroup” data yourself.
For example, for the 8 students with a mark between 0 and 10, you can generate 8 data points with value 5 (the bin midpoint). For the 12 with a mark between 10 and 20, you can generate 12 data points with value 15.
However, the “ungrouped” data will only be an approximation of the real data. Thus, it is probably better to just use a matplotlib.pyplot.bar plot.
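A short sketch of this ungrouping idea, using numpy.repeat on the bin midpoints and counts from the table above (choosing the midpoint is the approximation mentioned):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Midpoints and counts taken from the grouped table above
midpoints = np.array([5, 15, 25, 35])
counts = np.array([8, 12, 24, 26])

# "Ungroup": repeat each midpoint by its count
ungrouped = np.repeat(midpoints, counts)

# The histogram recovers the original counts
hist_counts, _, _ = plt.hist(ungrouped, bins=[0, 10, 20, 30, 40])
print(hist_counts.tolist())  # [8.0, 12.0, 24.0, 26.0]
plt.close("all")
```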
I have a dictionary Dict1 with keys as Dates and Sims.
Dates is an array with shape 100x1 and Sims has shape 100x5
I am trying:
import pandas as pd
df = pd.DataFrame.from_dict(Dict1)
But it errors out because of the shape of Sims. Is there a way I can create the DataFrame so that each row of the Sims column has size 5? i.e. each row can be stored as a list or array of size 5.
Edit:
Dict1['Dates']
array([datetime.datetime(2016, 11, 1, 0, 0),
datetime.datetime(2016, 11, 1, 1, 0),
datetime.datetime(2016, 11, 1, 2, 0), ...,
datetime.datetime(2025, 12, 31, 21, 0),
datetime.datetime(2025, 12, 31, 22, 0),
datetime.datetime(2025, 12, 31, 23, 0)], dtype=object)
Dict1['Sims']
array([[ 63.89694316, 35.8551162 , 40.36134283, 57.23648392,
35.96607425, 61.166471 ],
[ 47.94894386, 53.95396849, 48.94336457, 51.04541849,
28.69973176, 49.78683505],
[ 63.90314179, 43.29467789, 36.97811714, 52.33639618,
45.24190878, 69.9059308 ]...]])
Edit2:
I am looking to create the dataframe such that I can perform the following operation:
print(df[datetime.datetime(2016, 11, 1, 0, 0)])
[ 63.89694316, 35.8551162 , 40.36134283, 57.23648392,
35.96607425, 61.166471 ]
You can use your Dict1['Dates'] as the index.
df = pd.DataFrame(Dict1['Sims'], index=Dict1['Dates'])
df.loc[datetime.datetime(2016, 11, 1, 0, 0)]
Note that you should use the df.loc[key] indexer, since df[key] defaults to looking up a column, not a row. (Older pandas examples use df.ix, which has since been removed.)
Alternatively, if you really want a single column containing lists, make sure that Dict1['Sims'] is a Python list, not a Numpy array before creating your data frame.
df = pd.DataFrame({'Sims': Dict1['Sims'].tolist()}, index=Dict1['Dates'])
The {'Sims': ...} construct tricks Pandas into interpreting the data as a single series of lists, rather than a multi-dimensional array.
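A runnable sketch of both variants, with the shapes cut down to 2x5 for brevity (the values are stand-ins for the question's data):

```python
import datetime
import numpy as np
import pandas as pd

dates = np.array([datetime.datetime(2016, 11, 1, 0, 0),
                  datetime.datetime(2016, 11, 1, 1, 0)], dtype=object)
sims = np.array([[63.9, 35.9, 40.4, 57.2, 36.0],
                 [47.9, 54.0, 48.9, 51.0, 28.7]])

# Variant 1: one column per simulation, dates as the index
df = pd.DataFrame(sims, index=dates)
row = df.loc[datetime.datetime(2016, 11, 1, 0, 0)]
print(row.tolist())  # [63.9, 35.9, 40.4, 57.2, 36.0]

# Variant 2: a single 'Sims' column whose cells are lists
df2 = pd.DataFrame({'Sims': sims.tolist()}, index=dates)
print(df2['Sims'].iloc[0])  # [63.9, 35.9, 40.4, 57.2, 36.0]
```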
I have a numpy array with datetime stored in array A of size 100 as:
>>>A[0]
datetime.datetime(2011, 1, 1, 0, 0)
The other 99 elements are datetime.datetime objects also but few of them repeat e.g.
A[55]
datetime.datetime(2011, 11, 2, 0, 0)
A[56]
datetime.datetime(2011, 11, 2, 0, 0)
I have another array of Temperatures of same size as A with values corresponding to rows of A as:
Temperature[0] = 55
Temperature[55] = 40
Temperature[56] = 50
I am trying to obtain a new array A2 which only has the unique datetimes from A, with the temperatures of repeated datetimes averaged.
So in this case A2 will contain only one datetime.datetime(2011, 11, 2, 0, 0), and its temperature will be 0.5*(40+50) = 45
I am trying to use pandas pivot table as:
DayLightSavCure = pd.pivot_table(pd.DataFrame({'DateByHour': A, 'Temp': Temperature}), index=['DateByHour'], values=['Temp'], aggfunc=[np.mean])
But the error is:
ValueError: If using all scalar values, you must pass an index
I do agree this could be achieved without digging into pandas; itertools is really nice for this. Written for Python 3.5+ (because of the statistics module):
from itertools import groupby
from operator import itemgetter
from random import randint
import datetime
from statistics import mean
# Generate test data
dates = [datetime.datetime(2005, i % 12 + 1, 5, 5, 5, 5) for i in range(100)]
temperatures = [randint(0, 100) for _ in range(100)]
# Calculate averages
## Group data points by unique dates using `groupby`, `sorted` and `zip`
grouped = groupby(sorted(zip(dates, temperatures)), key=itemgetter(0))
## Calculate mean per unique date
averaged = [(key, mean(temperature[1] for temperature in values)) for key, values in grouped]
print(averaged) # List of tuples
#[(datetime.datetime(2005, 1, 5, 5, 5, 5), 65.22222222222223), (datetime.datetime(2005, 2, 5, 5, 5, 5), 60.0),.......
print(dict(averaged)) # Nicer as a dict
{datetime.datetime(2005, 3, 5, 5, 5, 5): 48.111111111111114, datetime.datetime(2005, 12, 5, 5, 5, 5): 43.75, ..........
If you have to have two separate lists/iterators at the end of the calculation just apply zip to averaged.
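For comparison, the same averaging in pandas needs only a groupby, no pivot_table (a sketch assuming A and Temperature are equal-length sequences, as in the question; the three sample values reproduce the expected average of 45):

```python
import datetime
import pandas as pd

# Sample data mirroring the question: one duplicate datetime
A = [datetime.datetime(2011, 1, 1, 0, 0),
     datetime.datetime(2011, 11, 2, 0, 0),
     datetime.datetime(2011, 11, 2, 0, 0)]
Temperature = [55, 40, 50]

df = pd.DataFrame({'DateByHour': A, 'Temp': Temperature})

# Group by datetime and average the temperatures of repeats
averaged = df.groupby('DateByHour')['Temp'].mean()
print(averaged[datetime.datetime(2011, 11, 2, 0, 0)])  # 45.0
```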
I have a vector of dates of size 10 and type numpy.ndarray. I also have an array of temperatures at each hour of size 10x24.
I want to print the dates in column A and the corresponding temperatures in columns B through Y for rows 1 through 10 in a csv file.
My arrays look as following:
print(AllDays)
[datetime.date(2008, 12, 31) datetime.date(2009, 1, 1)
datetime.date(2009, 1, 2) ..., datetime.date(2015, 11, 28)
datetime.date(2015, 11, 29) datetime.date(2015, 11, 30)]
So far I have been trying to implement this using dataframes as below:
TempDay = pd.DataFrame()
TempDay['Dates'] = AllDays #of size 10
TempDay['Temperature'] = TemperatureArray #of size 10x24
If the previous step had worked I aimed at:
TempDay.to_csv('C:\MyFile.csv')
But the above method has not been working.
It's not working because you are trying to assign a whole 10x24 array to a single column. You could instead construct the pandas dataframe from your TemperatureArray and then add the Dates column:
TempDay = pd.DataFrame(TemperatureArray)
TempDay['Dates'] = AllDays
TempDay.to_csv(r'C:\MyFile.csv')
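A self-contained sketch of this fix (it writes to an in-memory buffer instead of the question's Windows path, and uses made-up 2x24 data so it runs anywhere):

```python
import datetime
import io
import numpy as np
import pandas as pd

AllDays = np.array([datetime.date(2008, 12, 31), datetime.date(2009, 1, 1)], dtype=object)
TemperatureArray = np.arange(2 * 24).reshape(2, 24)  # stand-in for the real data

TempDay = pd.DataFrame(TemperatureArray)  # 24 temperature columns
TempDay.insert(0, 'Dates', AllDays)       # dates as the first column

buf = io.StringIO()
TempDay.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # header row: Dates,0,1,...,23
```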
My data (in a pandas DataFrame) looks like this:
a = pd.DataFrame({"age" : [25, 25, 25, 26, 26, 25, 25, 27, 27, 25, 26, 26, 25, 25],
"category" : [1, 2, 2, 3, 1, 3, 1, 1, 2, 3, 2, 2, 1, 2]})
I want a stacked barplot - each bar should reflect how many are in an age group and what percentage within the age group belongs to which category. The code
a.pivot_table('category', 'age').plot(kind='bar')
returns only the mean of the category instead of stacking it the way I want it.
Perhaps you mean to do something like the following. You need to change the aggregating function in pivot_table (which defaults to 'mean') and then use stacked=True when you plot. Putting age on the index makes each bar an age group, with the category counts stacked within it:
table = a.pivot_table(index='age', columns='category', aggfunc='size', fill_value=0)
table.plot(kind='bar', stacked=True)
This yields a stacked bar chart with one bar per age group and the category counts stacked within each bar.
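A runnable sketch of this approach (the Agg backend is an assumption so it runs headless; the table's counts can be checked directly):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd

a = pd.DataFrame({"age": [25, 25, 25, 26, 26, 25, 25, 27, 27, 25, 26, 26, 25, 25],
                  "category": [1, 2, 2, 3, 1, 3, 1, 1, 2, 3, 2, 2, 1, 2]})

# Rows = age groups, columns = categories, cells = counts
table = a.pivot_table(index='age', columns='category', aggfunc='size', fill_value=0)
print(table.loc[25].tolist())  # counts for categories 1, 2, 3 within age 25

table.plot(kind='bar', stacked=True)
plt.close('all')
```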