I'm new to Python and also new to this site. My colleague and I are working on a time series dataset. We wish to introduce some missing values into the dataset and then use some techniques to fill them in, to see how well those techniques perform for the data imputation task. The challenge we have at the moment is how to introduce missing values in a consecutive manner, not just randomly. For example, we want to replace the data for a period of time with NaNs, e.g. 3 consecutive days. I would really appreciate it if anyone could point us in the right direction on how to get this done. We are working with Python.
Here is my sample data
There is a method for filling NaNs
dataframe['name_of_column'].fillna('value')
See set_missing_data function below:
import numpy as np

np.set_printoptions(precision=3, linewidth=1000)

def set_missing_data(data, missing_locations, missing_length):
    # overwrite `missing_length` consecutive points starting at each location
    for i in missing_locations:
        data[i:i + missing_length] = np.nan

np.random.seed(0)

# random number of data points, random number of gaps, fixed gap length of 3
n_data_points = np.random.randint(40, 50)
data = np.random.normal(size=[n_data_points])
n_missing = np.random.randint(3, 6)
missing_length = 3
missing_locations = np.random.choice(
    n_data_points - missing_length,
    size=n_missing,
    replace=False
)

print(data)
set_missing_data(data, missing_locations, missing_length)
print(data)
Console output:
[ 0.118 0.114 0.37 1.041 -1.517 -0.866 -0.055 -0.107 1.365 -0.098 -2.426 -0.453 -0.471 0.973 -1.278 1.437 -0.078 1.09 0.097 1.419 1.168 0.947 1.085 2.382 -0.406 0.266 -1.356 -0.114 -0.844 0.706 -0.399 -0.827 -0.416 -0.525 0.813 -0.229 2.162 -0.957 0.067 0.206 -0.457 -1.06 0.615 1.43 -0.212]
[ 0.118 nan nan nan -1.517 -0.866 -0.055 -0.107 nan nan nan -0.453 -0.471 0.973 -1.278 1.437 -0.078 1.09 0.097 nan nan nan 1.085 2.382 -0.406 0.266 -1.356 -0.114 -0.844 0.706 -0.399 -0.827 -0.416 -0.525 0.813 -0.229 2.162 -0.957 0.067 0.206 -0.457 -1.06 0.615 1.43 -0.212]
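For a date-indexed pandas DataFrame (the original question mentions consecutive days), the same idea can be written with a label slice. This is only a minimal sketch; the column name 'value', the dates, and the random data are made up for illustration:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=30, freq="D")
df = pd.DataFrame({"value": rng.normal(size=len(idx))}, index=idx)

# blank out 3 consecutive days starting at a chosen date (label slices are inclusive)
gap_start = pd.Timestamp("2020-01-10")
df.loc[gap_start:gap_start + pd.Timedelta(days=2), "value"] = np.nan

print(df.loc["2020-01-08":"2020-01-14"])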
I have the following code. I am running a linear model on the dataframe 'x', with gender and highest education level achieved as categorical variables.
The aim is to assess how well age, gender and highest level of education achieved can predict 'weighteddistance'.
resultmodeldistancevariation2sleep = smf.ols(formula='weighteddistance ~ age + C(gender) + C(highest_education_level_acheived)',data=x).fit()
summarymodel = resultmodeldistancevariation2sleep.summary()
print(summarymodel)
This gives me the output:
                                             coef   std err       t   P>|t|   [0.025   0.975]
Intercept                                  6.3693     1.391   4.580   0.000    3.638    9.100
C(gender)[T.2.0]                           0.2301     0.155   1.489   0.137   -0.073    0.534
C(gender)[T.3.0]                           0.0302     0.429   0.070   0.944   -0.812    0.872
C(highest_education_level_acheived)[T.3]   1.1292     0.501   2.252   0.025    0.145    2.114
C(highest_education_level_acheived)[T.4]   1.0876     0.513   2.118   0.035    0.079    2.096
C(highest_education_level_acheived)[T.5]   1.0692     0.498   2.145   0.032    0.090    2.048
C(highest_education_level_acheived)[T.6]   1.2995     0.525   2.476   0.014    0.269    2.330
C(highest_education_level_acheived)[T.7]   1.7391     0.605   2.873   0.004    0.550    2.928
However, I want to calculate the main effects of the categorical variables on distance, which are not shown in the model above, so I passed the fitted model into an ANOVA using 'anova_lm'.
anovaoutput = sm.stats.anova_lm(resultmodeldistancevariation2sleep)
anovaoutput['PR(>F)'] = anovaoutput['PR(>F)'].round(4)
This gives me the output below and, as I wanted, it shows the main effect of each categorical variable (gender and highest education level achieved) rather than the individual groups within each variable (meaning there is no gender[2.0] or gender[3.0] in the output below).
df sum_sq mean_sq F PR(>F)
C(gender) 2.0 4.227966 2.113983 5.681874 0.0036
C(highest_education_level_acheived) 5.0 11.425706 2.285141 6.141906 0.0000
age 1.0 8.274317 8.274317 22.239357 0.0000
Residual 647.0 240.721120 0.372057 NaN NaN
However, this output no longer shows me the confidence intervals or the coefficients for each variable.
In other words, I would like the bottom ANOVA table to have a 'coef' column and a '[0.025 0.975]' column like the first table.
How can I achieve this?
I would be so grateful for a helping hand!
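For reference, the coefficients and their 95% confidence intervals can be pulled directly from the fitted result object and combined into a single table. This is only a minimal sketch using the fitted result from above, and it shows the per-level coefficients from the first table rather than per-variable main effects:
import pandas as pd

# coefficients and 95% confidence intervals from the fitted OLS result
coef_table = pd.concat(
    [resultmodeldistancevariation2sleep.params,
     resultmodeldistancevariation2sleep.conf_int()],
    axis=1,
)
coef_table.columns = ['coef', '[0.025', '0.975]']
print(coef_table)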
I am trying to run the Jarque-Bera test (a normality test) on my data, which looks like this (after a chain of operations):
ranking Q1 Q2 Q3 Q4
Date
2009-12-29 nan nan nan nan
2009-12-30 0.12 -0.21 -0.36 -0.39
2009-12-31 0.05 0.09 0.06 -0.02
2010-01-01 nan nan nan nan
2010-01-04 1.45 1.90 1.81 1.77
... ... ... ... ...
2020-10-13 -0.67 -0.59 -0.63 -0.61
2020-10-14 -0.05 -0.12 -0.05 -0.13
2020-10-15 -1.91 -1.62 -1.78 -1.91
2020-10-16 1.21 1.13 1.09 1.37
2020-10-19 -0.03 0.01 0.06 -0.02
I use a function like this:
import numpy as np
import pandas as pd
from scipy import stats

def stat(x):
    return pd.Series([x.mean(),
                      np.sqrt(x.var()),
                      stats.jarque_bera(x),
                      ],
                     index=['Return',
                            'Volatility',
                            'JB P-Value'
                            ])

data.apply(stat)
While the mean and volatility calculations work fine, I get an error from the stats.jarque_bera function, which is:
ValueError: Length of passed values is 10, index implies 9.
Any idea?
I tried to reproduce this and the function works fine for me when copying the 10 rows of data you provide above. This looks like a data input issue, where some column seems to have more values than the index of that pd.Series (effectively somehow len(data[col].values) > len(data[col].index)). You can try to figure out which column it is by running a naive "debugging" function such as:
for col in data.columns:
    if len(data[col].values) != len(data[col].index):
        print(f"Column {col} has more/less values than the index")
However, the Jarque-Bera test documentation on Scipy says that x can be any "array-like" structure, so you don't need to pass a pd.Series, which might run you into issues with missing values, etc. Essentially you can just pass a list of values and calculate their JB test statistic and p-value.
So with that, I would modify your function to:
def stat(x):
    return pd.Series([x.mean(),
                      np.sqrt(x.var()),
                      stats.jarque_bera(x.dropna().values)[1],  # drop NaNs, pass a plain array, and keep only the p-value to match the label
                      ],
                     index=['Return',
                            'Volatility',
                            'JB P-Value'
                            ])
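As a quick standalone check of what stats.jarque_bera returns (a test statistic and a p-value pair), here is a small sketch on made-up data:
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=100)  # made-up sample data
statistic, pvalue = stats.jarque_bera(sample)
print(statistic, pvalue)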
I have a big dataframe of shape (3125000, 16). When I wrote the code the dataset was quite small, so I had no problem running the analysis using pandas. But now, with the dataframe mentioned above, it throws a 'Memory error' while running a simple groupby command. I'm at a loss, as the code is quite big and I can't rewrite it all; time is short and I have to give the output to my client today. Below are the two simple statements I'm running that are showing me a memory error.
####### Average 'A001' and 'A002' of each value of Symbol across the files
big_frame_grouped_A002 = big_frame.groupby('Symbol')['A002'].mean()
big_frame_grouped_A001 = big_frame.groupby('Symbol')['A001'].mean()
and
cnt = big_frame.groupby('Symbol').apply(lambda g:((g.A001 > g.A002) & g.A001.notnull() & g.A002.notnull()).sum())
I was able to read all the files into a data frame, but it no longer lets me run these operations.
I tried searching online but have no idea how to take care of memory issues without rewriting the whole code.
I tried gc.collect() but it was of no use. Any help?
Edit: Dataset
Symbol Bid BidQty Ask AskQty TradeQty iBid iBidQty \
0 O.U20 99.740 16011.0 99.745 71102.0 77361 99.740 1669.0
1 O.Z20 99.695 30622.0 99.700 70102.0 72888 99.695 6803.0
2 O.H21 99.795 4168.0 99.800 71275.0 66692 NaN NaN
3 O.M21 99.820 12254.0 99.825 45183.0 93346 99.820 4035.0
4 O.U21 99.825 18379.0 99.830 33293.0 52012 99.825 4168.0
iBidLegs iAsk iAskQty iAskLegs RiskMid filename A001 \
0 2.0 99.745 6803.0 2.0 99.74092 63730342900.csv 0.005
1 2.0 99.700 93.0 2.0 99.69717 63730342900.csv 0.005
2 NaN 99.800 11902.0 2.0 99.79546 63730342900.csv 0.005
3 2.0 99.825 3742.0 2.0 99.82069 63730342900.csv 0.005
4 2.0 99.830 2361.0 2.0 99.82580 63730342900.csv 0.005
A002
0 0.005
1 0.005
2 NaN
3 0.005
4 0.005
Given the test data at the bottom in dataframe df, which has shape (3125000, 16), I didn't experience any memory issues.
Relative memory usage when running the groupby functions did not come close to 16GB.
For the original cnt: 851 ms ± 30.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For the implementation below: 576 ms ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Creating a groupby object of only the columns being used should reduce memory usage.
Currently you're using .apply, which is essentially a for-loop that tests every row.
Instead, use a for-loop over each group of the .groupby object, then .dropna, and use Boolean indexing, which gets rid of iterating through each row.
If you want cnt in a dataframe, use cnt_df = pd.DataFrame.from_dict(cnt, orient='index', columns=['counts']) (a short usage example follows the output below).
# groupby agg mean
dfg = df.groupby('Symbol').agg({'A001': 'mean', 'A002': 'mean'})
# display(dfg)
A001 A002
Symbol
O.H21 0.005 NaN
O.M21 0.005 0.005
O.U20 0.006 0.005
O.U21 0.007 0.005
O.Z20 0.004 0.005
# create a groupby object of only the columns being used
dfg_A = df[['Symbol', 'A001', 'A002']].groupby('Symbol')
# iterate through the groupby object
cnt = dict()
for g, f in dfg_A:
    f = f.dropna()  # drop any rows with NaN
    cnt_f = f[(f.A001 > f.A002)].count()
    cnt[g] = cnt_f.Symbol
print(cnt)
[out]: {'O.H21': 0, 'O.M21': 0, 'O.U20': 625000, 'O.U21': 625000, 'O.Z20': 0}
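Following the note above, the printed cnt dictionary converts directly into a dataframe:
import pandas as pd

# one row per Symbol with its count
cnt_df = pd.DataFrame.from_dict(cnt, orient='index', columns=['counts'])
print(cnt_df)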
Test Data
# copy the data to the clipboard and read in with
df = pd.read_clipboard(sep='\\s+')
# make it large
df = pd.concat([df] * 625000).reset_index(drop=True)
# data
Symbol Bid BidQty Ask AskQty TradeQty iBid iBidQty iBidLegs iAsk iAskQty iAskLegs RiskMid filename A001 A002
0 O.U20 99.740 16011.0 99.745 71102.0 77361 99.740 1669.0 2.0 99.745 6803.0 2.0 99.74092 63730342900.csv 0.006 0.005
1 O.Z20 99.695 30622.0 99.700 70102.0 72888 99.695 6803.0 2.0 99.700 93.0 2.0 99.69717 63730342900.csv 0.004 0.005
2 O.H21 99.795 4168.0 99.800 71275.0 66692 NaN NaN NaN 99.800 11902.0 2.0 99.79546 63730342900.csv 0.005 NaN
3 O.M21 99.820 12254.0 99.825 45183.0 93346 99.820 4035.0 2.0 99.825 3742.0 2.0 99.82069 63730342900.csv 0.005 0.005
4 O.U21 99.825 18379.0 99.830 33293.0 52012 99.825 4168.0 2.0 99.830 2361.0 2.0 99.82580 63730342900.csv 0.007 0.005
I need to create some new columns based on the value of a dataframe field and a look-up dataframe with some rates.
Having df1 as
zone hh hhind
0 14 112.0 3.4
1 15 5.0 4.4
2 16 0.0 1.0
and a look_up df as
ind per1 per2 per3 per4
0 1.0 1.000 0.000 0.000 0.000
24 3.4 0.145 0.233 0.165 0.457
34 4.4 0.060 0.114 0.075 0.751
How can I add a column df1.hh1 by multiplying df1.hh by look_up.per1, matching df1.hhind against look_up.ind?
zone hh hhind hh1
0 14 112.0 3.4 16.240
1 15 5.0 4.4 0.300
2 16 0.0 1.0 0.000
At the moment I'm getting the result by merging the tables and then doing the arithmetic:
r = pd.merge(df1, look_up, left_on="hhind", right_on="ind")
r["hh1"] = r.hh * r.per1
I'd like to know if there is a more straightforward way to accomplish this without merging the tables.
You could first set hhind and ind as the index axis of df1 and look_up dataframes respectively. Then, multiply corresponding elements in hh and per1 element-wise.
Map these results to the column hhind and assign these to a new column later as shown:
mapper = df1.set_index('hhind')['hh'].mul(look_up.set_index('ind')['per1'])
df1.assign(hh1=df1['hhind'].map(mapper))
Another solution:
df1['hh1'] = df1['hhind'].map(lambda x: look_up.loc[look_up["ind"] == x, "per1"].iloc[0]) * df1['hh']  # .iloc[0] pulls the scalar rate for that ind
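If the other rate columns are needed too, the same mapping idea extends without a merge. This is a sketch that assumes the new columns should be named hh2 through hh4 in the same pattern as hh1:
# index the rates by ind once, then map each per-column and scale by hh
rates = look_up.set_index('ind')[['per1', 'per2', 'per3', 'per4']]
for i, col in enumerate(rates.columns, start=1):
    df1[f'hh{i}'] = df1['hhind'].map(rates[col]) * df1['hh']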
I have a pandas dataframe df that contains two stocks' financial ratio data:
>>> df
ROIC ROE
STK_ID RPT_Date
600141 20110331 0.012 0.022
20110630 0.031 0.063
20110930 0.048 0.103
20111231 0.063 0.122
20120331 0.017 0.033
20120630 0.032 0.077
20120930 0.050 0.120
600809 20110331 0.536 0.218
20110630 0.734 0.278
20110930 0.806 0.293
20111231 1.679 0.313
20120331 0.666 0.165
20120630 1.039 0.257
20120930 1.287 0.359
I am trying to plot the ratios 'ROIC' & 'ROE' of stocks '600141' & '600809' together against the same 'RPT_Date' to benchmark their performance.
df.plot(kind='bar') gives the chart below.
The chart draws '600141' on the left side and '600809' on the right side, which makes it somewhat inconvenient to compare the 'ROIC' & 'ROE' of the two stocks on the same report date 'RPT_Date'.
What I want is to put the 'ROIC' & 'ROE' bars indexed by the same 'RPT_Date' side by side in the same group (4 bars per group), with the x-axis labelling only the 'RPT_Date'; that would clearly show the difference between the two stocks.
How can I do that?
And if I use df.plot(kind='line'), it only shows two lines, but it should be four lines (2 stocks * 2 ratios):
Is it a bug, or what can I do to correct it? Thanks.
I am using Pandas 0.8.1.
If you unstack STK_ID, you can create side by side plots per RPT_Date.
In [55]: dfu = df.unstack("STK_ID")
In [56]: fig, axes = subplots(2,1)
In [57]: dfu.plot(ax=axes[0], kind="bar")
Out[57]: <matplotlib.axes.AxesSubplot at 0xb53070c>
In [58]: dfu.plot(ax=axes[1])
Out[58]: <matplotlib.axes.AxesSubplot at 0xb60e8cc>
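For a self-contained version of the same approach with explicit imports (a sketch written against current pandas/matplotlib rather than the Pandas 0.8.1 session above):
import matplotlib.pyplot as plt

# unstack STK_ID so each (ratio, stock) pair becomes its own column,
# then draw the grouped bar chart and the line chart on separate axes
dfu = df.unstack("STK_ID")
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
dfu.plot(ax=axes[0], kind="bar")
dfu.plot(ax=axes[1])
plt.tight_layout()
plt.show()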