I have a very awkward dataframe that looks like this:
+----+------+-------+-------+--------+----+--------+
| | | hour1 | hour2 | hour3 | … | hour24 |
+----+------+-------+-------+--------+----+--------+
| id | date | | | | | |
| 1 | 3 | 4 | 0 | 96 | 88 | 35 |
| | 4 | 10 | 2 | 54 | 42 | 37 |
| | 5 | 9 | 32 | 8 | 70 | 34 |
| | 6 | 36 | 89 | 69 | 46 | 78 |
| 2 | 5 | 17 | 41 | 48 | 45 | 71 |
| | 6 | 50 | 66 | 82 | 72 | 59 |
| | 7 | 14 | 24 | 55 | 20 | 89 |
| | 8 | 76 | 36 | 13 | 14 | 21 |
| 3 | 5 | 97 | 19 | 41 | 61 | 72 |
| | 6 | 22 | 4 | 56 | 82 | 15 |
| | 7 | 17 | 57 | 30 | 63 | 88 |
| | 8 | 83 | 43 | 35 | 8 | 4 |
+----+------+-------+-------+--------+----+--------+
For each id there is a list of dates and for each date the hour columns represent that full day's worth of data broken out by hour for the full 24hrs.
What I would like to do is plot (using matplotlib) the full hourly data for each of the ids, but I can't think of a way to do this. I was looking into the possibility of creating numpy matrices, but I'm not sure if that is the right path to go down.
Clarification: Essentially, for each id I want to concatenate all the hourly data together in order and plot that. I already have the days in the proper order, so I imagine it's just a matter of finding a way to put all of the hourly data for each id into one object.
Any thoughts on how to best accomplish this?
Here is some sample data in csv format: http://www.sharecsv.com/s/e56364930ddb3d04dec6994904b05cc6/test1.csv
Here is one approach:
from matplotlib import pyplot
import numpy as np

for groupID, data in d.groupby(level='id'):
    fig = pyplot.figure()
    ax = fig.gca()
    ax.plot(data.values.ravel())
    ax.set_xticks(np.arange(len(data)) * 24)
    ax.set_xticklabels(data.index.get_level_values('date'))
ravel is a numpy method that will string out multiple rows into one long 1D array.
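For example, a small 2×3 array ravels row by row into one flat array:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
flat = a.ravel()
print(flat)  # [1 2 3 4 5 6]
```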
Beware running this interactively on a large dataset, as it creates a separate plot for each line. If you want to save the plots or the like, set a noninteractive matplotlib backend and use savefig to save each figure, then close it before creating the next one.
It might also be of interest to stack the data frame so that you have the dates and times together in the same index. For example, doing
df = df.stack().unstack(0)
will put the dates and times in the index and the id as the column names. Calling df.plot() will then give you a line plot for each id's time series on the same axes. So you could do it as
ax = df.stack().unstack(0).plot()
and format the axes either by passing arguments to the plot method or by calling methods on ax.
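On a toy frame with the same index layout (made-up values and column names), the reshape looks like this:

```python
import pandas as pd
import numpy as np

idx = pd.MultiIndex.from_product([[1, 2], [3, 4]], names=['id', 'date'])
df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  index=idx, columns=['hour1', 'hour2'])

long = df.stack().unstack(0)  # index: (date, hour); columns: id
print(long.shape)  # (4, 2) -- one row per (date, hour), one column per id
```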
I am not totally happy with this solution, but maybe it can serve as a starting point. Since your data is cyclic, I chose a polar chart. Unfortunately, the resolution in the y direction is poor, so I zoomed into the plot manually:
import pandas as pd
import numpy as np
from itertools import cycle
from matplotlib import pyplot as plt

df = pd.read_csv('test1.csv')
df_new = df.set_index(['id', 'date'])
n = len(df_new.columns)
# convert from hours to rad
angle = np.linspace(0, 2*np.pi, n)
# color palette to cycle through
n_data = len(df_new.T.columns)
color = plt.cm.Paired(np.linspace(0, 1, n_data // 2))  # divided by two since you have 'red' and 'blue'
c_iter = cycle(color)
fig = plt.figure()
ax = fig.add_subplot(111, polar=True)
# loop through the columns and manually select one category
for ind, i in enumerate(df_new.T.columns):
    if i[0] == 'red':
        ax.plot(angle, df_new.T[i].values, color=next(c_iter), label=i, linewidth=2)
# set the labels
ax.set_xticks(np.linspace(0, 2*np.pi, 24, endpoint=False))
ax.set_xticklabels(range(24))
# make the legend
ax.legend(loc='upper left', bbox_to_anchor=(1.2, 1.1))
plt.show()
(Plot screenshots at zoom levels 0, 1, and 2 omitted.)
There are two dataframes
df1
+-----+-----+-------+
| | id | price |
+-----+-----+-------+
| 1 | 1 | 5 |
+-----+-----+-------+
| 2 | 2 | 12 |
+-----+-----+-------+
| 3 | 3 | 34 |
+-----+-----+-------+
| 4 | 4 | 62 |
+-----+-----+-------+
| ... | ... | ... |
+-----+-----+-------+
| 125 | 125 | 90 |
+-----+-----+-------+
and
df2
+-----+-----+-------+
| | id | price |
+-----+-----+-------+
| 1 | 1 | 14 |
+-----+-----+-------+
| 2 | 2 | 15 |
+-----+-----+-------+
| 3 | 3 | 45 |
+-----+-----+-------+
| 4 | 4 | 62 |
+-----+-----+-------+
| ... | ... | ... |
+-----+-----+-------+
| 125 | 125 | 31 |
+-----+-----+-------+
I would like to have a plot that shows both price columns on the X axis and the sum on the Y axis, to see the difference between these two dataframes.
I tried the code below, but it does nothing.
line1 = df1.plot.line()
line2 = df2.plot.line()
lines = df.plot.line(x=df1['price'], y=df2['price'])
What is the best way to show the differences between the two patterns of the price in these two dataframes?
I thought of something like this, but if there is a better way to show the differences please mention it.
If you take the first column from df1 and the second column from df2, you get a single line rather than a pair of lines. For a qualitative comparison you can use matplotlib in a simple way, because it automatically creates a figure.
import pandas as pd
import matplotlib.pyplot as plt
import random
import seaborn as sns
df1 = pd.DataFrame({
'col1': range(0,5), 'col2': sorted([round(random.uniform(100, 2000)) for i in range(0,5)])
})
df2 = pd.DataFrame({
'col1': range(0,5), 'col2': sorted([round(random.uniform(100, 2000)) for i in range(0,5)])
})
plt.plot(df1['col2'], label='first')
plt.plot(df2['col2'], label='second')
plt.legend()
Here is the result:
For every x value it shows the corresponding points from df1['col2'] and df2['col2']. But with this plot you cannot compare them quantitatively.
PS: Here is the logic you tried to implement, but with seaborn.
df3 = pd.merge(df1,df2, on='col1')
sns.lineplot(x='col2_x', y='col2_y', data=df3)
Result:
Optional
Quantitative comparison.
df3['dif'] = abs(df3['col2_x'] - df3['col2_y'])
sns.lineplot(x='col1', y='dif', data=df3)
plt.xticks(df3['col1'])
Result:
It is really not clear what you mean by "both price columns on X axis and sum on the Y axis".
Since your ids are the same, you can plot them like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(221)
df1 = pd.DataFrame({'id':np.arange(125),
'price':np.random.randint(1,10,125)})
df2 = pd.DataFrame({'id':np.arange(125),
'price':np.random.randint(10,20,125)})
fig,ax = plt.subplots()
ax.plot(df1.price,label="df1")
ax.plot(df2.price,label="df2")
ax.legend()
I have a dataframe consisting of mean and std-dev of distributions
df.head()
+---+---------+----------------+-------------+---------------+------------+
| | user_id | session_id | sample_mean | sample_median | sample_std |
+---+---------+----------------+-------------+---------------+------------+
| 0 | 1 | 20081023025304 | 4.972789 | 5 | 0.308456 |
| 1 | 1 | 20081023025305 | 5.000000 | 5 | 1.468418 |
| 2 | 1 | 20081023025306 | 5.274419 | 5 | 4.518189 |
| 3 | 1 | 20081024020959 | 4.634855 | 5 | 1.387244 |
| 4 | 1 | 20081026134407 | 5.088195 | 5 | 2.452059 |
+---+---------+----------------+-------------+---------------+------------+
From this, I plot a histogram of the distribution
plt.hist(df['sample_mean'],bins=50)
plt.xlabel('sampling rate (sec)')
plt.ylabel('Frequency')
plt.title('Histogram of trips mean sampling rate')
plt.show()
I then write a function to compute pdf and cdf, passing dataframe and column name:
def compute_distrib(df, col):
stats_df = df.groupby(col)[col].agg('count').pipe(pd.DataFrame).rename(columns = {col: 'frequency'})
# PDF
stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
# CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
return stats_df
So for example:
stats_df = compute_distrib(df, 'sample_mean')
stats_df.head(2)
+---+---------------+-----------+----------+----------+
|   | sample_mean   | frequency | pdf      | cdf      |
+---+---------------+-----------+----------+----------+
| 0 | 1 | 4317 | 0.143575 | 0.143575 |
| 1 | 2 | 10169 | 0.338200 | 0.481775 |
+---+---------------+-----------+----------+----------+
Then I plot the cdf distribution this way:
ax1 = stats_df.plot(x = 'sample_mean', y = ['cdf'], grid = True)
ax1.legend(loc='best')
Goal:
I would like to plot these figures in one figure side-by-side instead of plotting separately and somehow putting them together in my slides.
You can use matplotlib.pyplot.subplots to draw multiple plots next to each other:
import matplotlib.pyplot as plt
fig, axs = plt.subplots(nrows=1, ncols=2)
# Pass the data you wish to plot; with a single row, axs is a flat array.
axs[0].hist(...)
axs[1].plot(...)
plt.show()
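A filled-in sketch with dummy data, just to show the indexing: with nrows=1 the axs array is flat, so each panel is indexed with a single subscript (the data and filenames here are made up):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # noninteractive backend so the script runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
means = rng.normal(5, 1, 1000)                 # stand-in for df['sample_mean']
cdf_x = np.sort(means)                         # empirical CDF from the same data
cdf_y = np.arange(1, len(cdf_x) + 1) / len(cdf_x)

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
axs[0].hist(means, bins=50)                    # left panel: histogram
axs[0].set_title('Histogram')
axs[1].plot(cdf_x, cdf_y)                      # right panel: CDF
axs[1].set_title('CDF')
fig.savefig('side_by_side.png')
```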
I have a MultiIndex Pandas DataFrame like so:
+---+------------------+----------+----------------------------+------------------------------+
|   | VECTOR           | SEGMENTS |          OVERALL           |          INDIVIDUAL          |
|   |                  |          | TIP X | TIP Y  | CURVATURE | TIP X  | TIP Y   | CURVATURE |
+---+------------------+----------+-------+--------+-----------+--------+---------+-----------+
| 0 | (TOP, TOP)       | 2        | 3.24  | 1.309  | 44        | 1.62   | 0.6545  | 22        |
| 1 | (TOP, BOTTOM)    | 2        | 3.495 | 0.679  | 22        | 1.7475 | 0.3395  | 11        |
| 2 | (BOTTOM, TOP)    | 2        | 3.495 | -0.679 | -22       | 1.7475 | -0.3395 | -11       |
| 3 | (BOTTOM, BOTTOM) | 2        | 3.24  | -1.309 | -44       | 1.62   | -0.6545 | -22       |
+---+------------------+----------+-------+--------+-----------+--------+---------+-----------+
How can I drop duplicates based on all of the columns contained under 'OVERALL' or 'INDIVIDUAL'? So if I choose 'INDIVIDUAL', the values of TIP X, TIP Y, and CURVATURE under INDIVIDUAL must all match for a row to count as a duplicate?
Furthermore, as you can see from the table, rows 1 and 2 are duplicates that are simply mirrored about the x-axis. These must also be dropped.
Also, can I center the OVERALL and INDIVIDUAL headings?
EDIT: frame.drop_duplicates(subset=['INDIVIDUAL'], inplace=True) produces KeyError: Index(['INDIVIDUAL'], dtype='object')
You can pass pandas .drop_duplicates a subset of tuples for multi-indexed columns:
df.drop_duplicates(subset=[
('INDIVIDUAL', 'TIP X'),
('INDIVIDUAL', 'TIP Y'),
('INDIVIDUAL', 'CURVATURE')
])
Or, if your row indices are unique, you could use the following approach that saves some typing:
df.loc[df['INDIVIDUAL'].drop_duplicates().index]
Update:
As you suggested in the comments, if you want to do operations on the dataframe you can do that in-line:
df.loc[df['INDIVIDUAL'].abs().drop_duplicates().index]
Or for non-pandas functions, you can use .transform:
df.loc[df['INDIVIDUAL'].transform(np.abs).drop_duplicates().index]
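A self-contained toy example of the subset-of-tuples call (the values are made up, shaped like the question's frame):

```python
import pandas as pd

cols = pd.MultiIndex.from_product(
    [['OVERALL', 'INDIVIDUAL'], ['TIP X', 'TIP Y', 'CURVATURE']])
df = pd.DataFrame(
    [[3.24, 1.309, 44, 1.62, 0.6545, 22],
     [3.495, 0.679, 22, 1.7475, 0.3395, 11],
     [3.24, 1.309, 44, 1.62, 0.6545, 22]],   # row 2 duplicates row 0
    columns=cols)

deduped = df.drop_duplicates(subset=[
    ('INDIVIDUAL', 'TIP X'),
    ('INDIVIDUAL', 'TIP Y'),
    ('INDIVIDUAL', 'CURVATURE'),
])
print(len(deduped))  # 2 -- the repeated row is dropped, first occurrence kept
```

This only catches exact repeats; for the mirrored rows you still need the `.abs()` trick from the update above.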
I would like to draw a bar plot comparing the evolution of 2 revenue variables on a monthly time axis (12 months of invoices).
I wanted to use sns.barplot, but can't use "hue" (because the 2 variables aren't subcategories?). Is there another way as simple as with hue? Can I "create" a hue?
Here is a small sample of my data:
(I did transform my table into a pivot table)
[In]
data_pivot['Revenue-Small-Seller-in'] = data_pivot["Small-Seller"] + data_pivot["Best-Seller"] + data_pivot["Medium-Seller"]
data_pivot['Revenue-Not-Small-Seller-in'] = data_pivot["Best-Seller"] + data_pivot["Medium-Seller"]
data_pivot
[Out]
InvoiceNo Month Year Revenue-Small-Seller-in Revenue-Not-Small-Seller-in
536365 12 2010 139.12 139.12
536366 12 2010 22.20 11.10
536367 12 2010 278.73 246.93
(sorry for the ugly presentation of my data, see the picture to see the complete table (as there are multiple columns))
You can do:
render_df = data_pivot[data_pivot.columns[-2:]]
fig, ax = plt.subplots(1,1)
render_df.plot(kind='bar', ax=ax)
ax.legend()
plt.show()
Output:
Or sns style like you requested
render_df = data_pivot[data_pivot.columns[-2:]].stack().reset_index()
sns.barplot(x='level_0', y=0, hue='level_1', data=render_df)
Here, render_df after stack() is:
+---+---------+-----------------------------+--------+
| | level_0 | level_1 | 0 |
+---+---------+-----------------------------+--------+
| 0 | 0 | Revenue-Small-Seller-in | 139.12 |
| 1 | 0 | Revenue-Not-Small-Seller-in | 139.12 |
| 2 | 1 | Revenue-Small-Seller-in | 22.20 |
| 3 | 1 | Revenue-Not-Small-Seller-in | 11.10 |
| 4 | 2 | Revenue-Small-Seller-in | 278.73 |
| 5 | 2 | Revenue-Not-Small-Seller-in | 246.93 |
+---+---------+-----------------------------+--------+
and output:
I have a massive CSV (1.4 GB, over 1 million rows) of stock market data that I will process using R.
The table looks roughly like this. For each ticker, there are thousands of rows of data.
+--------+------+-------+------+------+
| Ticker | Open | Close | High | Low |
+--------+------+-------+------+------+
| A | 121 | 121 | 212 | 2434 |
| A | 32 | 23 | 43 | 344 |
| A | 121 | 121 | 212 | 2434 |
| A | 32 | 23 | 43 | 344 |
| A | 121 | 121 | 212 | 2434 |
| B | 32 | 23 | 43 | 344 |
+--------+------+-------+------+------+
To make processing and testing easier, I'm breaking this colossus into smaller files using the script mentioned in this question: How do I slice a single CSV file into several smaller ones grouped by a field?
The script would output files such as data_a.csv, data_b.csv, etc.
But, I would also like to create index.csv which simply lists all the unique stock ticker names.
E.g.
+---------+
| Ticker |
+---------+
| A |
| B |
| C |
| D |
| ... |
+---------+
Can anybody recommend an efficient way of doing this in R or Python, when handling a huge filesize?
You could loop through each file, grabbing the index of each and creating a set union of all indices.
import glob
import pandas as pd

tickers = set()
for csvfile in glob.glob('*.csv'):
    data = pd.read_csv(csvfile, index_col=0, header=None)  # or header=0, depending on how your data is set up
    tickers.update(data.index.tolist())
pd.Series(sorted(tickers)).to_csv('index.csv', index=False)
You can retrieve the index from the file names:
(index <- data.frame(Ticker = toupper(gsub("^.*_(.*)\\.csv",
"\\1",
list.files()))))
## Ticker
## 1 A
## 2 B
write.csv(index, "index.csv")