How to barplot Pandas dataframe columns aligning by sub-index?

I have a pandas dataframe df that contains financial ratio data for two stocks:
>>> df
                  ROIC    ROE
STK_ID RPT_Date
600141 20110331  0.012  0.022
       20110630  0.031  0.063
       20110930  0.048  0.103
       20111231  0.063  0.122
       20120331  0.017  0.033
       20120630  0.032  0.077
       20120930  0.050  0.120
600809 20110331  0.536  0.218
       20110630  0.734  0.278
       20110930  0.806  0.293
       20111231  1.679  0.313
       20120331  0.666  0.165
       20120630  1.039  0.257
       20120930  1.287  0.359
I am trying to plot the ratios 'ROIC' and 'ROE' of stocks '600141' and '600809' together, aligned by 'RPT_Date', to benchmark their performance.
df.plot(kind='bar') gives the chart below.
The chart draws '600141' on the left side and '600809' on the right side, which makes it inconvenient to compare the 'ROIC' and 'ROE' of the two stocks on the same report date 'RPT_Date'.
What I want is to put the 'ROIC' and 'ROE' bars indexed by the same 'RPT_Date' side by side in the same group (4 bars per group), with the x-axis labelled only by 'RPT_Date'; that would clearly show the difference between the two stocks.
How can I do that?
Also, if I call df.plot(kind='line'), it shows only two lines, but there should be four (2 stocks * 2 ratios).
Is this a bug, or what can I do to correct it? Thanks.
I am using pandas 0.8.1.

If you unstack STK_ID, the two stocks become separate columns, so you can create side-by-side plots per RPT_Date.
In [55]: dfu = df.unstack("STK_ID")
In [56]: fig, axes = subplots(2, 1)  # plt.subplots(2, 1) outside IPython's pylab mode
In [57]: dfu.plot(ax=axes[0], kind="bar")
Out[57]: <matplotlib.axes.AxesSubplot at 0xb53070c>
In [58]: dfu.plot(ax=axes[1])
Out[58]: <matplotlib.axes.AxesSubplot at 0xb60e8cc>
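
For a self-contained script outside an IPython pylab session, a minimal sketch of the same approach might look like the following (the DataFrame construction is my own, abbreviated to two report dates per stock; the values are taken from the question):

import matplotlib.pyplot as plt
import pandas as pd

# Rebuild a small version of the example frame (values from the question).
idx = pd.MultiIndex.from_product(
    [["600141", "600809"], ["20110331", "20110630"]],
    names=["STK_ID", "RPT_Date"],
)
df = pd.DataFrame(
    {"ROIC": [0.012, 0.031, 0.536, 0.734],
     "ROE": [0.022, 0.063, 0.218, 0.278]},
    index=idx,
)

# Move STK_ID into the columns: one column per (ratio, stock) pair.
dfu = df.unstack("STK_ID")

fig, axes = plt.subplots(2, 1)
dfu.plot(ax=axes[0], kind="bar")  # 4 bars per RPT_Date group
dfu.plot(ax=axes[1])              # 4 lines: 2 stocks * 2 ratios
plt.show()

After the unstack, the columns are a MultiIndex of (ratio, STK_ID), so both the bar chart and the line chart show four series grouped by RPT_Date.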

Related

Convert day numbers into dates in python

How do you convert day numbers (1, 2, 3, ..., 728, 729, 730) to dates in Python? I can assign an arbitrary year to start the date count, as the year doesn't matter to me.
I am working on learning time series analysis (ARIMA, SARIMA, etc.) using Python. I have a CSV dataset with two columns, 'Day' and 'Revenue'. The Day column contains the numbers 1-731; Revenue contains numbers 0-18.154... I have had success building the model, running statistical tests, building visualizations, etc. But when it comes to forecasting using Prophet, I am hitting a wall.
Here are what I feel are the relevant parts of the code related to the question:
# Load the CSV with pandas; index_col=0 makes the "Day" column the index.
from pandas import read_csv  # import not shown in the original snippet
df = read_csv("telco_time_series.csv", index_col=0, parse_dates=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 731 entries, 1 to 731
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Revenue  731 non-null    float64
dtypes: float64(1)
memory usage: 11.4 KB
df.head()
     Revenue
Day
1   0.000000
2   0.000793
3   0.825542
4   0.320332
5   1.082554
# Instantiate the model (import not shown in the original snippet;
# the new-style ARIMA class matches the SARIMAX summary output below)
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(df, order=(4, 1, 0))

# Fit the model
results = model.fit()

# Print summary
print(results.summary())

# Line plot of residuals
import matplotlib.pyplot as plt
residuals = results.resid
residuals.plot()
plt.show()

# Density plot of residuals
residuals.plot(kind='kde')
plt.show()

# Summary stats of residuals
print(residuals.describe())
                               SARIMAX Results
==============================================================================
Dep. Variable:                Revenue   No. Observations:                  731
Model:                 ARIMA(4, 1, 0)   Log Likelihood                -489.105
Date:                Tue, 03 Aug 2021   AIC                            988.210
Time:                        07:29:55   BIC                           1011.175
Sample:                             0   HQIC                           997.070
                                - 731
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.4642      0.037    -12.460      0.000      -0.537      -0.391
ar.L2          0.0295      0.040      0.746      0.456      -0.048       0.107
ar.L3          0.0618      0.041      1.509      0.131      -0.018       0.142
ar.L4          0.0366      0.039      0.946      0.344      -0.039       0.112
sigma2         0.2235      0.013     17.629      0.000       0.199       0.248
===================================================================================
Ljung-Box (L1) (Q):                   0.01   Jarque-Bera (JB):                 2.52
Prob(Q):                              0.90   Prob(JB):                         0.28
Heteroskedasticity (H):               1.01   Skew:                            -0.05
Prob(H) (two-sided):                  0.91   Kurtosis:                         2.73
===================================================================================
df.columns = ['ds','y']
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements

m = Prophet()
m.fit(df)
ValueError: Dataframe must have columns "ds" and "y" with the dates and values respectively.
I've had success with the forecast using Prophet if I fill the values in the CSV with dates, but I would like to convert the Day numbers within the code using pandas.
Any ideas?
"I can assign an arbitrary year to start the date count as the year doesn't matter to me (...) Any ideas?"
You might harness datetime.timedelta for this task. Select any date you wish as day 0, then add datetime.timedelta(days=x), where x is your day number. For example:
import datetime
day0 = datetime.date(2000,1,1)
day120 = day0 + datetime.timedelta(days=120)
print(day120)
Output:
2000-04-30
Wrap it in a function and use .apply if you have a pandas.DataFrame, like so:
import datetime
import pandas as pd

def convert_to_date(x):
    return datetime.date(2000, 1, 1) + datetime.timedelta(days=x)

df = pd.DataFrame({'day_n': [1, 2, 3, 4, 5]})
df['day_date'] = df['day_n'].apply(convert_to_date)
print(df)
Output:
   day_n    day_date
0      1  2000-01-02
1      2  2000-01-03
2      3  2000-01-04
3      4  2000-01-05
4      5  2000-01-06
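
A pandas-native alternative (my addition, not part of the original answer) is pd.to_datetime with its unit and origin arguments, which does the conversion in one vectorized call:

import pandas as pd

days = pd.Series([1, 2, 3, 4, 5])
# Interpret the numbers as whole days elapsed since an arbitrary origin date.
dates = pd.to_datetime(days, unit='D', origin='2000-01-01')
print(dates)  # 2000-01-02 through 2000-01-06 as datetime64[ns]

Applied to the Prophet question above, the same call turns the Day index into the required ds column (df here stands for the question's Revenue frame):

prophet_df = pd.DataFrame({
    'ds': pd.to_datetime(df.index, unit='D', origin='2000-01-01'),
    'y': df['Revenue'].to_numpy(),
})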

New pandas version: how to groupby all columns with different aggregation statistics

I have a df that looks like this:
    time    volts1    volts2
0   0.000 -0.299072  0.427551
2   0.001 -0.299377  0.427551
4   0.002 -0.298767  0.427551
6   0.003 -0.298767  0.422974
8   0.004 -0.298767  0.422058
10  0.005 -0.298462  0.422363
12  0.006 -0.298767  0.422668
14  0.007 -0.298462  0.422363
16  0.008 -0.301208  0.420227
18  0.009 -0.303345  0.418091
In actuality, the df has >50 columns, but for simplicity I'm just showing 3.
I want to group this df every n rows, let's say 5. I want to aggregate time with max, and the rest of the columns with mean. Because there are so many columns, I'd love to be able to loop this rather than doing it manually.
I know I can do something like this, where I go through and create all the new columns manually:
df.groupby(df.index // 5).agg(time=('time', 'max'),
                              volts1=('volts1', 'mean'),
                              volts2=('volts2', 'mean'),
                              ...
                              )
but because there are so many columns, I want to do this in a loop, something like:
df.groupby(df.index // 5).agg(time=('time', 'max'),
                              # df.time is always the first column
                              [i for i in df.columns[1:]]=(i, 'mean'),
                              )
If useful:
print(pd.__version__)
1.0.5
You can use a dictionary:
d = {col: "max" if col == "time" else "mean" for col in df.columns}
# {'time': 'max', 'volts1': 'mean', 'volts2': 'mean'}
df.groupby(df.index // 5).agg(d)
    time    volts1    volts2
0  0.002 -0.299072  0.427551
1  0.004 -0.298767  0.422516
2  0.007 -0.298564  0.422465
3  0.009 -0.302276  0.419159
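
If you prefer the named-aggregation syntax from your attempt, a variant of the same idea (my sketch, assuming pandas >= 0.25) builds {column: (column, func)} pairs and unpacks them as keyword arguments:

# Same mapping as above, expressed as named aggregations.
spec = {col: (col, "max" if col == "time" else "mean") for col in df.columns}
df.groupby(df.index // 5).agg(**spec)

Both forms produce the same result; the plain dictionary version also works on older pandas.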

How to introduce missing values in time series data

I'm new to Python and also new to this site. My colleague and I are working on a time series dataset. We wish to introduce some missing values into the dataset and then use some techniques to fill them in, to see how well those techniques perform at the data imputation task. The challenge we have at the moment is how to introduce missing values in a consecutive manner, not just randomly. For example, we want to replace data for a period of time, e.g. 3 consecutive days, with NaNs. I would really appreciate it if anyone can point us in the right direction on how to get this done. We are working with Python.
Here is my sample data
There is a method for filling NaNs:
dataframe['name_of_column'].fillna('value')
See the set_missing_data function below:
import numpy as np

np.set_printoptions(precision=3, linewidth=1000)

def set_missing_data(data, missing_locations, missing_length):
    # Blank out missing_length consecutive points at each chosen location.
    for i in missing_locations:
        data[i:i + missing_length] = np.nan

np.random.seed(0)
n_data_points = np.random.randint(40, 50)
data = np.random.normal(size=[n_data_points])
n_missing = np.random.randint(3, 6)
missing_length = 3
missing_locations = np.random.choice(
    n_data_points - missing_length,
    size=n_missing,
    replace=False
)

print(data)
set_missing_data(data, missing_locations, missing_length)
print(data)
Console output:
[ 0.118 0.114 0.37 1.041 -1.517 -0.866 -0.055 -0.107 1.365 -0.098 -2.426 -0.453 -0.471 0.973 -1.278 1.437 -0.078 1.09 0.097 1.419 1.168 0.947 1.085 2.382 -0.406 0.266 -1.356 -0.114 -0.844 0.706 -0.399 -0.827 -0.416 -0.525 0.813 -0.229 2.162 -0.957 0.067 0.206 -0.457 -1.06 0.615 1.43 -0.212]
[ 0.118 nan nan nan -1.517 -0.866 -0.055 -0.107 nan nan nan -0.453 -0.471 0.973 -1.278 1.437 -0.078 1.09 0.097 nan nan nan 1.085 2.382 -0.406 0.266 -1.356 -0.114 -0.844 0.706 -0.399 -0.827 -0.416 -0.525 0.813 -0.229 2.162 -0.957 0.067 0.206 -0.457 -1.06 0.615 1.43 -0.212]
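
Since your data is a time series, the same idea in pandas terms (my own illustration, with made-up dates) is to pick a start date and blank out a consecutive window with label-based slicing:

import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=30, freq="D")
series = pd.Series(np.random.normal(size=30), index=idx)

# Replace 3 consecutive days with NaN, starting at an arbitrary date.
start = pd.Timestamp("2021-01-10")
series.loc[start : start + pd.Timedelta(days=2)] = np.nan

.loc slicing on a DatetimeIndex includes both endpoints, so this blanks exactly the 3 days 2021-01-10 through 2021-01-12.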

Python (pandas or other) - I need to "flatten" a data file from many rows, few columns to 1 row many columns

I need to "flatten" a data file from many rows, few columns to 1 row many columns.
I currently have a dataframe in pandas (loaded from Excel) and ultimately need to change the way the data is displayed so I can accumulate large amounts of data in a logical manner. The below tables are an attempt at illustrating my requirements.
From:
             1      2
Ryan     0.706  0.071
Chad     0.151  0.831
Stephen  0.750  0.653
To:
1_Ryan  1_Chad  1_Stephen  2_Ryan  2_Chad  2_Stephen
 0.706   0.151       0.75   0.071   0.831      0.653
Thank you for any assistance!
One line, for fun:
df.unstack().pipe(
lambda s: pd.DataFrame([s.values], columns=s.index.map('{0[0]}_{0[1]}'.format))
)
1_Ryan 1_Chad 1_Stephen 2_Ryan 2_Chad 2_Stephen
0 0.706 0.151 0.75 0.071 0.831 0.653
Let's use stack, swaplevel, to_frame, and T:
df_out = df.stack().swaplevel(1,0).to_frame().T.sort_index(axis=1)
Or better yet (using @piRSquared's unstack solution):
df_out = df.unstack().to_frame().T
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out
Output:
1_Chad 1_Ryan 1_Stephen 2_Chad 2_Ryan 2_Stephen
0 0.151 0.706 0.75 0.831 0.071 0.653
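Note that the two approaches order the columns differently: the stack/swaplevel version sorts the new labels lexicographically because of sort_index(axis=1) (1_Chad before 1_Ryan), while the unstack version preserves the original column-then-row order (1_Ryan, 1_Chad, 1_Stephen). Pick whichever ordering you need.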

Calculating subtractions of pairs of columns in pandas DataFrame

I work with significantly sized DataFrames (48K rows, up to tens of columns). At a certain point in their manipulation, I need to do pair-wise subtractions of column values, and I was wondering if there is a more efficient way to do so than the one I'm using (see below).
My current code:
# Matrix is the pandas DataFrame containing all the data
comparison_df = pandas.DataFrame(index=matrix.index)
combinations = itertools.product(group1, group2)
for observed, reference in combinations:
    observed_data = matrix[observed]
    reference_data = matrix[reference]
    comparison = observed_data - reference_data
    name = observed + "_" + reference
    comparison_df[name] = comparison
Since the data can be large (I'm using this piece of code also during a permutation test), I'm interested in knowing if it can be optimized a bit.
EDIT: As requested, here's a sample of a typical data set
ID                     A1     A2     A3     B1     B2     B3
Ku8QhfS0n_hIOABXuE  6.343  6.304  6.410  6.287  6.403  6.279
fqPEquJRRlSVSfL.8A  6.752  6.681  6.680  6.677  6.525  6.739
ckiehnugOno9d7vf1Q  6.297  6.248  6.524  6.382  6.316  6.453
x57Vw5B5Fbt5JUnQkI  6.268  6.451  6.379  6.371  6.458  6.333
A typical result would be, if the "A" columns are group1 and the "B" columns are group2, one output column per pairing generated above (e.g., A1_B1, A2_B1, A3_B1, ...), each containing the subtraction for every ID row.
Using itertools.combinations() on DataFrame columns
You can create combinations of columns with itertools.combinations() and evaluate subtractions along with new names based on these pairs:
import pandas as pd
from cStringIO import StringIO
import itertools as iter
matrix = pd.read_csv(StringIO('''ID,A1,A2,A3,B1,B2,B3
Ku8QhfS0n_hIOABXuE,6.343,6.304,6.410,6.287,6.403,6.279
fqPEquJRRlSVSfL.8A,6.752,6.681,6.680,6.677,6.525,6.739
ckiehnugOno9d7vf1Q,6.297,6.248,6.524,6.382,6.316,6.453
x57Vw5B5Fbt5JUnQkI,6.268,6.451,6.379,6.371,6.458,6.333''')).set_index('ID')
print 'Original DataFrame:'
print matrix
print
# Create DataFrame to fill with combinations
comparison_df = pd.DataFrame(index=matrix.index)
# Create combinations of columns
for a, b in iter.combinations(matrix.columns, 2):
    # Subtract column combinations
    comparison_df['{}_{}'.format(a, b)] = matrix[a] - matrix[b]
print 'Combination DataFrame:'
print comparison_df
Original DataFrame:
A1 A2 A3 B1 B2 B3
ID
Ku8QhfS0n_hIOABXuE 6.343 6.304 6.410 6.287 6.403 6.279
fqPEquJRRlSVSfL.8A 6.752 6.681 6.680 6.677 6.525 6.739
ckiehnugOno9d7vf1Q 6.297 6.248 6.524 6.382 6.316 6.453
x57Vw5B5Fbt5JUnQkI 6.268 6.451 6.379 6.371 6.458 6.333
Combination DataFrame:
A1_A2 A1_A3 A1_B1 A1_B2 A1_B3 A2_A3 A2_B1 A2_B2 A2_B3 A3_B1 A3_B2 A3_B3 B1_B2 B1_B3 B2_B3
ID
Ku8QhfS0n_hIOABXuE 0.039 -0.067 0.056 -0.060 0.064 -0.106 0.017 -0.099 0.025 0.123 0.007 0.131 -0.116 0.008 0.124
fqPEquJRRlSVSfL.8A 0.071 0.072 0.075 0.227 0.013 0.001 0.004 0.156 -0.058 0.003 0.155 -0.059 0.152 -0.062 -0.214
ckiehnugOno9d7vf1Q 0.049 -0.227 -0.085 -0.019 -0.156 -0.276 -0.134 -0.068 -0.205 0.142 0.208 0.071 0.066 -0.071 -0.137
x57Vw5B5Fbt5JUnQkI -0.183 -0.111 -0.103 -0.190 -0.065 0.072 0.080 -0.007 0.118 0.008 -0.079 0.046 -0.087 0.038 0.125
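
Since the question is about efficiency, a vectorized alternative (my sketch, not part of the original answer) computes every group1 x group2 difference in a single NumPy broadcast instead of a Python loop; group1 and group2 are assumed to be the "A" and "B" column lists from the question:

import itertools
import numpy as np
import pandas as pd

group1 = ['A1', 'A2', 'A3']
group2 = ['B1', 'B2', 'B3']

# (n_rows, len(group1), 1) minus (n_rows, 1, len(group2))
# broadcasts to (n_rows, len(group1), len(group2)).
diff = matrix[group1].values[:, :, None] - matrix[group2].values[:, None, :]

# Flatten the last two axes; row-major order matches itertools.product.
comparison_df = pd.DataFrame(
    diff.reshape(len(matrix), -1),
    index=matrix.index,
    columns=['{}_{}'.format(a, b) for a, b in itertools.product(group1, group2)],
)

This avoids building an intermediate Series per pair, which should help at 48K rows, especially inside a permutation test.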
