R's pdIdent function in RPy - python

I am working on translating the code for the lmeSplines tutorial to RPy.
I am now stuck at the following line:
fit1s <- lme(y ~ time, data=smSplineEx1,random=list(all=pdIdent(~Zt - 1)))
I have worked with nlme.lme before, and the following works just fine:
from rpy2.robjects.packages import importr
nlme = importr('nlme')
nlme.lme(r.formula('y ~ time'), data=some_data, random=r.formula('~1|ID'))
But this call has a different random argument. I am wondering how I can translate the list(all=pdIdent(~Zt - 1)) part and put it into my RPy code as well.
The structure of the (preprocessed) example data smSplineEx1 looks like this (with Zt.* up to 98):
time y y.true all Zt.1 Zt.2 Zt.3
1 1 5.797149 4.235263 1 1.168560e+00 2.071261e+00 2.944953e+00
2 2 5.469222 4.461302 1 1.487859e-01 1.072013e+00 1.948857e+00
3 3 4.567237 4.678477 1 -5.449190e-02 7.276623e-02 9.527613e-01
4 4 3.645763 4.887137 1 -5.364552e-02 -1.359115e-01 -4.333438e-02
5 5 5.094126 5.087615 1 -5.279913e-02 -1.337708e-01 -2.506194e-01
6 6 4.636121 5.280233 1 -5.195275e-02 -1.316300e-01 -2.466158e-01
7 7 5.501538 5.465298 1 -5.110637e-02 -1.294892e-01 -2.426123e-01
8 8 5.011509 5.643106 1 -5.025998e-02 -1.273485e-01 -2.386087e-01
9 9 6.114037 5.813942 1 -4.941360e-02 -1.252077e-01 -2.346052e-01
10 10 5.696472 5.978080 1 -4.856722e-02 -1.230670e-01 -2.306016e-01
11 11 6.615363 6.135781 1 -4.772083e-02 -1.209262e-01 -2.265980e-01
12 12 8.002526 6.287300 1 -4.687445e-02 -1.187854e-01 -2.225945e-01
13 13 6.887444 6.432877 1 -4.602807e-02 -1.166447e-01 -2.185909e-01
14 14 6.319205 6.572746 1 -4.518168e-02 -1.145039e-01 -2.145874e-01
15 15 6.482771 6.707130 1 -4.433530e-02 -1.123632e-01 -2.105838e-01
16 16 7.938015 6.836245 1 -4.348892e-02 -1.102224e-01 -2.065802e-01
17 17 7.585533 6.960298 1 -4.264253e-02 -1.080816e-01 -2.025767e-01
18 18 7.560287 7.079486 1 -4.179615e-02 -1.059409e-01 -1.985731e-01
19 19 7.571020 7.194001 1 -4.094977e-02 -1.038001e-01 -1.945696e-01
20 20 8.922418 7.304026 1 -4.010338e-02 -1.016594e-01 -1.905660e-01
21 21 8.241394 7.409737 1 -3.925700e-02 -9.951861e-02 -1.865625e-01
22 22 7.447076 7.511303 1 -3.841062e-02 -9.737785e-02 -1.825589e-01
23 23 7.317292 7.608886 1 -3.756423e-02 -9.523709e-02 -1.785553e-01
24 24 7.077333 7.702643 1 -3.671785e-02 -9.309633e-02 -1.745518e-01
25 25 8.268601 7.792723 1 -3.587147e-02 -9.095557e-02 -1.705482e-01
26 26 8.216013 7.879272 1 -3.502508e-02 -8.881481e-02 -1.665447e-01
27 27 8.968495 7.962427 1 -3.417870e-02 -8.667405e-02 -1.625411e-01
28 28 9.085605 8.042321 1 -3.333232e-02 -8.453329e-02 -1.585375e-01
29 29 9.002575 8.119083 1 -3.248593e-02 -8.239253e-02 -1.545340e-01
30 30 8.763187 8.192835 1 -3.163955e-02 -8.025177e-02 -1.505304e-01
31 31 8.936370 8.263695 1 -3.079317e-02 -7.811101e-02 -1.465269e-01
32 32 9.033403 8.331776 1 -2.994678e-02 -7.597025e-02 -1.425233e-01
33 33 8.248328 8.397188 1 -2.910040e-02 -7.382949e-02 -1.385198e-01
34 34 5.961721 8.460035 1 -2.825402e-02 -7.168873e-02 -1.345162e-01
35 35 8.400489 8.520418 1 -2.740763e-02 -6.954797e-02 -1.305126e-01
36 36 6.855125 8.578433 1 -2.656125e-02 -6.740721e-02 -1.265091e-01
37 37 9.798931 8.634174 1 -2.571487e-02 -6.526645e-02 -1.225055e-01
38 38 8.862758 8.687729 1 -2.486848e-02 -6.312569e-02 -1.185020e-01
39 39 7.282970 8.739184 1 -2.402210e-02 -6.098493e-02 -1.144984e-01
40 40 7.484208 8.788621 1 -2.317572e-02 -5.884417e-02 -1.104949e-01
41 41 8.404670 8.836120 1 -2.232933e-02 -5.670341e-02 -1.064913e-01
42 42 8.880734 8.881756 1 -2.148295e-02 -5.456265e-02 -1.024877e-01
43 43 8.826189 8.925603 1 -2.063657e-02 -5.242189e-02 -9.848418e-02
44 44 9.827906 8.967731 1 -1.979018e-02 -5.028113e-02 -9.448062e-02
45 45 8.528795 9.008207 1 -1.894380e-02 -4.814037e-02 -9.047706e-02
46 46 9.484073 9.047095 1 -1.809742e-02 -4.599961e-02 -8.647351e-02
47 47 8.911947 9.084459 1 -1.725103e-02 -4.385885e-02 -8.246995e-02
48 48 10.201343 9.120358 1 -1.640465e-02 -4.171809e-02 -7.846639e-02
49 49 8.908016 9.154849 1 -1.555827e-02 -3.957733e-02 -7.446283e-02
50 50 8.202368 9.187988 1 -1.471188e-02 -3.743657e-02 -7.045927e-02
51 51 7.432851 9.219828 1 -1.386550e-02 -3.529581e-02 -6.645572e-02
52 52 8.063268 9.250419 1 -1.301912e-02 -3.315505e-02 -6.245216e-02
53 53 10.155756 9.279810 1 -1.217273e-02 -3.101429e-02 -5.844860e-02
54 54 7.905281 9.308049 1 -1.132635e-02 -2.887353e-02 -5.444504e-02
55 55 9.688337 9.335181 1 -1.047997e-02 -2.673277e-02 -5.044148e-02
56 56 9.437176 9.361249 1 -9.633582e-03 -2.459201e-02 -4.643793e-02
57 57 9.165873 9.386295 1 -8.787198e-03 -2.245125e-02 -4.243437e-02
58 58 9.120195 9.410358 1 -7.940815e-03 -2.031049e-02 -3.843081e-02
59 59 9.955840 9.433479 1 -7.094432e-03 -1.816973e-02 -3.442725e-02
60 60 9.314230 9.455692 1 -6.248048e-03 -1.602897e-02 -3.042369e-02
61 61 9.706852 9.477035 1 -5.401665e-03 -1.388821e-02 -2.642014e-02
62 62 9.615765 9.497541 1 -4.555282e-03 -1.174746e-02 -2.241658e-02
63 63 7.918843 9.517242 1 -3.708898e-03 -9.606695e-03 -1.841302e-02
64 64 9.352892 9.536172 1 -2.862515e-03 -7.465935e-03 -1.440946e-02
65 65 9.722685 9.554359 1 -2.016132e-03 -5.325176e-03 -1.040590e-02
66 66 9.186888 9.571832 1 -1.169748e-03 -3.184416e-03 -6.402346e-03
67 67 8.652299 9.588621 1 -3.233650e-04 -1.043656e-03 -2.398788e-03
68 68 8.681421 9.604751 1 5.230184e-04 1.097104e-03 1.604770e-03
69 69 10.279181 9.620249 1 1.369402e-03 3.237864e-03 5.608328e-03
70 70 9.314963 9.635140 1 2.215785e-03 5.378623e-03 9.611886e-03
71 71 6.897151 9.649446 1 3.062168e-03 7.519383e-03 1.361544e-02
72 72 9.343135 9.663191 1 3.908552e-03 9.660143e-03 1.761900e-02
73 73 9.273135 9.676398 1 4.754935e-03 1.180090e-02 2.162256e-02
74 74 10.041796 9.689086 1 5.601318e-03 1.394166e-02 2.562612e-02
75 75 9.724713 9.701278 1 6.447702e-03 1.608242e-02 2.962968e-02
76 76 8.593517 9.712991 1 7.294085e-03 1.822318e-02 3.363323e-02
77 77 7.401988 9.724244 1 8.140468e-03 2.036394e-02 3.763679e-02
78 78 10.258688 9.735057 1 8.986852e-03 2.250470e-02 4.164035e-02
79 79 10.037192 9.745446 1 9.833235e-03 2.464546e-02 4.564391e-02
80 80 9.637510 9.755427 1 1.067962e-02 2.678622e-02 4.964747e-02
81 81 8.887625 9.765017 1 1.152600e-02 2.892698e-02 5.365102e-02
82 82 9.922013 9.774230 1 1.237239e-02 3.106774e-02 5.765458e-02
83 83 10.466709 9.783083 1 1.321877e-02 3.320850e-02 6.165814e-02
84 84 11.132830 9.791588 1 1.406515e-02 3.534926e-02 6.566170e-02
85 85 10.154038 9.799760 1 1.491154e-02 3.749002e-02 6.966526e-02
86 86 10.433068 9.807612 1 1.575792e-02 3.963078e-02 7.366881e-02
87 87 9.666781 9.815156 1 1.660430e-02 4.177154e-02 7.767237e-02
88 88 9.478004 9.822403 1 1.745069e-02 4.391230e-02 8.167593e-02
89 89 10.002749 9.829367 1 1.829707e-02 4.605306e-02 8.567949e-02
90 90 7.593259 9.836058 1 1.914345e-02 4.819382e-02 8.968305e-02
91 91 10.915754 9.842486 1 1.998984e-02 5.033458e-02 9.368660e-02
92 92 8.855580 9.848662 1 2.083622e-02 5.247534e-02 9.769016e-02
93 93 8.884683 9.854596 1 2.168260e-02 5.461610e-02 1.016937e-01
94 94 9.757451 9.860298 1 2.252899e-02 5.675686e-02 1.056973e-01
95 95 10.222361 9.865775 1 2.337537e-02 5.889762e-02 1.097008e-01
96 96 9.090410 9.871038 1 2.422175e-02 6.103838e-02 1.137044e-01
97 97 8.837872 9.876095 1 2.506814e-02 6.317914e-02 1.177080e-01
98 98 9.413135 9.880953 1 2.591452e-02 6.531990e-02 1.217115e-01
99 99 9.295531 9.885621 1 2.676090e-02 6.746066e-02 1.257151e-01
100 100 9.698118 9.890106 1 2.760729e-02 6.960142e-02 1.297186e-01

You can create list(all=pdIdent(~Zt - 1)) in R's global environment using the reval() method:
In [55]:
import rpy2.robjects as ro
import pandas.rpy.common as com
mydata = ro.r['data.frame']
read = ro.r['read.csv']
head = ro.r['head']
summary = ro.r['summary']
library = ro.r['library']
In [56]:
formula = '~ time'
library('lmeSplines')
ro.reval('data(smSplineEx1)')
ro.reval('smSplineEx1$all <- rep(1,nrow(smSplineEx1))')
ro.reval('smSplineEx1$Zt <- smspline(~ time, data=smSplineEx1)')
ro.reval('rnd <- list(all=pdIdent(~Zt - 1))')
#result = ro.r.smspline(formula=ro.r(formula), data=ro.r.smSplineEx1) #notice: data=ro.r.smSplineEx1
result = ro.r.lme(ro.r('y~time'), data=ro.r.smSplineEx1, random=ro.r.rnd)
In [57]:
print com.convert_robj(result.rx('coefficients'))
{'coefficients': {'random': {'all': Zt1 Zt2 Zt3 Zt4 Zt5 Zt6 Zt7 \
1 0.000509 0.001057 0.001352 0.001184 0.000869 0.000283 -0.000424
Zt8 Zt9 Zt10 ... Zt89 Zt90 Zt91 \
1 -0.001367 -0.002325 -0.003405 ... -0.001506 -0.001347 -0.000864
Zt92 Zt93 Zt94 Zt95 Zt96 Zt97 Zt98
1 -0.000631 -0.000569 -0.000392 -0.000049 0.000127 0.000114 0.000071
[1 rows x 98 columns]}, 'fixed': (Intercept) 6.498800
time 0.038723
dtype: float64}}
Be careful: the result has an awkward shape. It is a nested dictionary, which cannot be converted into a single pandas.DataFrame.
You can access y in smSplineEx1 with ro.r.smSplineEx1.rx('y'), similar to smSplineEx1$y in R.
Now say you have the result variable in Python, generated by
result = ro.r.lme(ro.r('y~time'), data=ro.r.smSplineEx1, random=ro.r.rnd)
and you want to plot it using R (instead of plotting it with, say, matplotlib). You need to assign it to a variable in the R workspace:
ro.R().assign('result', result)
Now there is a variable named result in the R workspace, which you can access using ro.r.result.
Plotting it using R:
In [17]:
ro.reval('plot(smSplineEx1$time,smSplineEx1$y,pch="o",type="n", \
main="Spline fits: lme(y ~ time, random=list(all=pdIdent(~Zt-1)))", \
xlab="time",ylab="y")')
Out[17]:
rpy2.rinterface.NULL
In [21]:
ro.reval('lines(smSplineEx1$time, fitted(result),col=2)')
Out[21]:
rpy2.rinterface.NULL
Or you can do everything in R:
ro.reval('result <- lme(y ~ time, data=smSplineEx1,random=list(all=pdIdent(~Zt - 1)))')
ro.reval('plot(smSplineEx1$time,smSplineEx1$y,pch="o",type="n", \
main="Spline fits: lme(y ~ time, random=list(all=pdIdent(~Zt-1)))", \
xlab="time",ylab="y")')
ro.reval('lines(smSplineEx1$time, fitted(result),col=2)')
and access the R variables using ro.r.smSplineEx1.rx2('time') or ro.r.result.
Edit
Notice that some R objects cannot be converted to a pandas.DataFrame as-is because they mix data structures:
In [62]:
ro.r["smSplineEx1"]
Out[62]:
<DataFrame - Python:0x108525518 / R:0x109e5da38>
[FloatVe..., FloatVe..., FloatVe..., FloatVe..., Matrix]
time: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x10807e518 / R:0x1022599e0>
[1.000000, 2.000000, 3.000000, ..., 98.000000, 99.000000, 100.000000]
y: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x108525a70 / R:0x102259d30>
[5.797149, 5.469222, 4.567237, ..., 9.413135, 9.295531, 9.698118]
y.true: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x1085257a0 / R:0x10225dfb0>
[4.235263, 4.461302, 4.678477, ..., 9.880953, 9.885621, 9.890106]
all: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x1085258c0 / R:0x10225e300>
[1.000000, 1.000000, 1.000000, ..., 1.000000, 1.000000, 1.000000]
Zt: <class 'rpy2.robjects.vectors.Matrix'>
<Matrix - Python:0x108525908 / R:0x103e8ba00>
[1.168560, 0.148786, -0.054492, ..., -0.030141, -0.030610, 0.757597]
Notice that we have a few vectors, but the last column is a Matrix. We have to convert smSplineEx1 to Python in two parts.
In [63]:
ro.r["smSplineEx1"].names
Out[63]:
<StrVector - Python:0x108525dd0 / R:0x1042ca7c0>
['time', 'y', 'y.true', 'all', 'Zt']
In [64]:
print com.convert_robj(ro.r["smSplineEx1"].rx(ro.IntVector(range(1, 5)))).head()
time y y.true all
1 1 5.797149 4.235263 1
2 2 5.469222 4.461302 1
3 3 4.567237 4.678477 1
4 4 3.645763 4.887137 1
5 5 5.094126 5.087615 1
In [65]:
print com.convert_robj(ro.r["smSplineEx1"].rx2('Zt')).head(2)
0 1 2 3 4 5 6 \
1 1.168560 2.071261 2.944953 3.782848 4.584037 5.348937 6.078121
2 0.148786 1.072013 1.948857 2.789264 3.593423 4.361817 5.095016
7 8 9 ... 88 89 90 \
1 6.772184 7.431719 8.057321 ... 0.933947 0.769591 0.619420
2 5.793601 6.458153 7.089255 ... 0.904395 0.745337 0.599976
91 92 93 94 95 96 97
1 0.484029 0.36401 0.259959 0.172468 0.102133 0.049547 0.015305
2 0.468893 0.35267 0.251890 0.167135 0.098986 0.048026 0.014836
[2 rows x 98 columns]
com.convert_robj(ro.r["smSplineEx1"]) will not work due to the mixed data structure issue.

Related

Converting time format to seconds in a pandas dataframe

I have a df with time data and I would like to convert these values to seconds (see example below).
Compression_level Size (M) Real time (s) User time (s) Sys time (s)
0 0 265 0:19.938 0:24.649 0:3.062
1 1 76 0:17.910 0:25.929 0:3.098
2 2 74 1:02.619 0:27.724 0:3.014
3 3 73 0:20.607 0:27.937 0:3.193
4 4 67 0:19.598 0:28.853 0:2.925
5 5 67 0:21.032 0:30.119 0:3.206
6 6 66 0:27.013 0:31.462 0:3.106
7 7 65 0:27.337 0:36.226 0:3.060
8 8 64 0:37.651 0:47.246 0:2.933
9 9 64 0:59.241 1:8.333 0:3.027
This is the output I would like to obtain.
df["Real time (s)"]
0 19.938
1 17.910
2 62.619
...
I have some useful code, but I do not know how to apply it to a data frame:
x = time.strptime("00:01:00","%H:%M:%S")
datetime.timedelta(hours=x.tm_hour,minutes=x.tm_min, seconds=x.tm_sec).total_seconds()
Prepend 00: for zero hours with radd (which adds the string on the left), pass the result to to_timedelta, and then call Series.dt.total_seconds:
df["Real time (s)"] = pd.to_timedelta(df["Real time (s)"].radd('00:')).dt.total_seconds()
print (df)
Compression_level Size (M) Real time (s) User time (s) Sys time (s)
0 0 265 19.938 0:24.649 0:3.062
1 1 76 17.910 0:25.929 0:3.098
2 2 74 62.619 0:27.724 0:3.014
3 3 73 20.607 0:27.937 0:3.193
4 4 67 19.598 0:28.853 0:2.925
5 5 67 21.032 0:30.119 0:3.206
6 6 66 27.013 0:31.462 0:3.106
7 7 65 27.337 0:36.226 0:3.060
8 8 64 37.651 0:47.246 0:2.933
9 9 64 59.241 1:8.333 0:3.027
Solution for processing multiple columns:
def to_td(x):
    return pd.to_timedelta(x.radd('00:')).dt.total_seconds()
cols = ["Real time (s)", "User time (s)", "Sys time (s)"]
df[cols] = df[cols].apply(to_td)
print (df)
Compression_level Size (M) Real time (s) User time (s) Sys time (s)
0 0 265 19.938 24.649 3.062
1 1 76 17.910 25.929 3.098
2 2 74 62.619 27.724 3.014
3 3 73 20.607 27.937 3.193
4 4 67 19.598 28.853 2.925
5 5 67 21.032 30.119 3.206
6 6 66 27.013 31.462 3.106
7 7 65 27.337 36.226 3.060
8 8 64 37.651 47.246 2.933
9 9 64 59.241 68.333 3.027
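For comparison, the same conversion can be sketched without pandas, which may help when checking values such as 1:8.333 by hand. The helper name mmss_to_seconds is hypothetical, and the sketch assumes every value has the form minutes:seconds with an optional fractional part.

```python
def mmss_to_seconds(text):
    """Convert a 'M:SS.fff' string (minutes:seconds) to seconds as a float.

    Hypothetical helper; assumes exactly one ':' between minutes and seconds.
    """
    minutes, seconds = text.strip().split(":")
    return int(minutes) * 60 + float(seconds)


# Sample values copied from the question's table.
samples = ["0:19.938", "1:02.619", "1:8.333", "0:3.062"]
converted = [round(mmss_to_seconds(s), 3) for s in samples]
print(converted)
```

On a data frame, a function like this could be applied per column with Series.map, although the vectorized to_timedelta approach above is likely the better fit for large frames.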

Python, trying to calculate RSI but I am getting unusually high numbers

I am trying to calculate the RSI formula in python. I am getting the closing price data from the AlphaVantage TimeSeries API.
def rsi(data,period):
    length = len(data) - 1
    current_price = 0
    previous_price = 0
    avg_up = 0
    avg_down = 0
    for i in range(length-period,length):
        current_price = data[i]
        if current_price > previous_price:
            avg_up += current_price - previous_price
        else:
            avg_down += previous_price - current_price
        previous_price = data[i]
    # Calculate average gain and loss
    avg_up = avg_up/period
    avg_down = avg_down/period
    # Calculate relative strength
    rs = avg_up/avg_down
    # Calculate rsi
    rsi = 100 - (100/(1+rs))
    return rsi
print(rsi(data=closing_price,period=14))
In this case, this will output a really high number along the lines of RSI: 99.824. But according to TradingView, the current RSI is actually 62.68.
Any feedback on what I am doing wrong would be very much appreciated!
Here is some data: 100 minutes of AAPL prices.
0
0 118.3900
1 118.4200
2 118.3500
3 118.3000
4 118.2800
5 118.4000
6 118.3400
7 118.4500
8 118.3900
9 118.4100
10 118.4700
11 118.4000
12 118.4000
13 118.3400
14 118.4100
15 118.2850
16 118.2900
17 118.1700
18 118.2600
19 118.2800
20 118.2600
21 118.2400
22 118.2950
23 118.2800
24 118.2900
25 118.2850
26 118.3000
27 118.2150
28 118.2300
29 118.1450
30 118.1200
31 118.0800
32 118.1300
33 118.1100
34 118.1300
35 118.2300
36 118.1000
37 118.1900
38 118.2800
39 118.2400
40 118.2300
41 118.3300
42 118.3200
43 118.3500
44 118.3600
45 118.3650
46 118.3800
47 118.4500
48 118.5000
49 118.5100
50 118.5400
51 118.5100
52 118.5063
53 118.5200
54 118.5400
55 118.4700
56 118.4700
57 118.4300
58 118.4400
59 118.4300
60 118.3800
61 118.4000
62 118.3600
63 118.3700
64 118.3400
65 118.3200
66 118.3000
67 118.3210
68 118.3714
69 118.4000
70 118.4100
71 118.3500
72 118.3300
73 118.3200
74 118.3250
75 118.3200
76 118.3900
77 118.5000
78 118.4800
79 118.5300
80 118.5300
81 118.4800
82 118.5000
83 118.4400
84 118.5400
85 118.5550
86 118.5200
87 118.4600
88 118.4500
89 118.4400
90 118.4300
91 118.4019
92 118.4400
93 118.4400
94 118.4100
95 118.4000
96 118.4400
97 118.4400
98 118.4600
99 118.5050
I've managed to compute 59.4 with the code below, which is close to what you are looking for. Here is what I've changed:
- averages are divided by the n_up and n_down counters, not by period.
- the previous and current price variables were removed; the code reads the current price data[i] and the previous price data[i-1] directly.
Note that the code has to be checked with other data.
close_AAPL = [118.4200, 118.3500, 118.3000, 118.2800, 118.4000,
118.3400, 118.4500, 118.3900, 118.4100, 118.4700,
118.4000, 118.4000, 118.3400, 118.4100, 118.2850,
118.2900, 118.1700, 118.2600, 118.2800, 118.2600,
118.2400, 118.2950, 118.2800, 118.2900, 118.2850,
118.3000, 118.2150, 118.2300, 118.1450, 118.1200,
118.0800, 118.1300, 118.1100, 118.1300, 118.2300,
118.1000, 118.1900, 118.2800, 118.2400, 118.2300,
118.3300, 118.3200, 118.3500, 118.3600, 118.3650,
118.3800, 118.4500, 118.5000, 118.5100, 118.5400,
118.5100, 118.5063, 118.5200, 118.5400, 118.4700,
118.4700, 118.4300, 118.4400, 118.4300, 118.3800,
118.4000, 118.3600, 118.3700, 118.3400, 118.3200,
118.3000, 118.3210, 118.3714, 118.4000, 118.4100,
118.3500, 118.3300, 118.3200, 118.3250, 118.3200,
118.3900, 118.5000, 118.4800, 118.5300, 118.5300,
118.4800, 118.5000, 118.4400, 118.5400, 118.5550,
118.5200, 118.4600, 118.4500, 118.4400, 118.4300,
118.4019, 118.4400, 118.4400, 118.4100, 118.4000,
118.4400, 118.4400, 118.4600, 118.5050]
def rsi(data,period):
    length = len(data) - 1
    avg_up = 0
    n_up = 0
    avg_down = 0
    n_down = 0
    for i in range(length-period,length):
        if data[i] > data[i-1]:
            avg_up += data[i] - data[i-1]
            n_up += 1
        else:
            avg_down += data[i-1] - data[i]
            n_down += 1
    # Calculate average gain and loss
    avg_up = avg_up/n_up
    avg_down = avg_down/n_down
    # Calculate relative strength
    rs = avg_up/avg_down
    # Calculate rsi
    return 100. - 100./(1+rs)
print(rsi(data=close_AAPL, period=14))
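One likely cause of the very high value in the question is that previous_price starts at 0, so the first loop iteration adds the whole price (about 118) to avg_up. The sketch below is an alternative to the code above, not a restatement of it: both averages are divided by period, as in the usual RSI definition. The names rsi_simple and rsi_wilder are hypothetical. I am assuming that charting platforms smooth the averages with Wilder's method, so the second function may be the closer comparison, but the value still depends on how much price history is fed in.

```python
def rsi_simple(prices, period=14):
    """RSI from the last `period` price changes, using simple averages."""
    if len(prices) < period + 1:
        raise ValueError("need at least period + 1 prices")
    window = prices[-(period + 1):]
    gains = 0.0
    losses = 0.0
    # Compare each price with the one before it; never with a dummy 0.
    for prev, cur in zip(window, window[1:]):
        change = cur - prev
        if change > 0:
            gains += change
        else:
            losses -= change
    avg_gain = gains / period
    avg_loss = losses / period
    if avg_loss == 0:
        return 100.0  # no losses in the window
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


def rsi_wilder(prices, period=14):
    """RSI with Wilder's smoothing applied over the whole price history."""
    if len(prices) < period + 1:
        raise ValueError("need at least period + 1 prices")
    changes = [cur - prev for prev, cur in zip(prices, prices[1:])]
    # Seed the averages with a simple mean over the first `period` changes.
    seed = changes[:period]
    avg_gain = sum(c for c in seed if c > 0) / period
    avg_loss = sum(-c for c in seed if c < 0) / period
    # Then update them recursively with each later change.
    for c in changes[period:]:
        gain = max(c, 0.0)
        loss = max(-c, 0.0)
        avg_gain = (avg_gain * (period - 1) + gain) / period
        avg_loss = (avg_loss * (period - 1) + loss) / period
    if avg_loss == 0:
        return 100.0
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


# Tiny sanity check: equal gains and losses should give an RSI of 50.
print(rsi_simple([10, 11, 10, 11, 10], period=4))
print(rsi_wilder([10, 11, 10, 11, 10], period=4))
```

Both functions could be called on close_AAPL from the answer above, for example rsi_wilder(close_AAPL, 14).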

Multiplying an entire df or matrix by 1000?

I am new to R and Python, so forgive me if this is an elementary question. I have a large data set of genes (columns) by patients (rows), with each value being an RNA expression value (most values falling between 0 and 1). I want to multiply the entire data set by 1000 so that all non-zero values will be >1.
Currently:
Pt GeneA GeneB GeneC
1 0.001 2 0
2 0 0.5 0.002
Would like:
Pt GeneA GeneB GeneC
1 1 2000 0
2 0 500 2
I have tried to do this in both R and Python and am running into issues with both. I have also tried converting my data between data frame and matrix, and it won't work with either. I have searched extensively on this website and find information about how to multiply an entire df/matrix by a vector, or individual columns by a scalar, but not the entire thing. Could someone kindly point me in the right direction? I feel like it can't possibly be this hard :)
Using R:
df <- read.csv("/Users/m/Desktop/data.csv")
df * 100
In Ops.factor(left, right) : ‘*’ not meaningful for factors
mtx <- as.matrix(df)
mtx * 100
Error in mtx * 100 : non-numeric argument to binary operator
Using Python 3.7.6:
df = df * 1000
^ This runs without an error message but the values in the cells are exactly the same, so it didn't actually multiply anything...
df = df.div(.001)
TypeError: unsupported operand type(s) for /: 'str' and 'float'
Any creative ideas or resources to point me in the right direction? Thank you!
What does str(df) give you? At least some of your columns have been converted to factors because they contain character strings. Open the CSV file in a text editor and check that the numbers are not surrounded by "" and that missing values have not been labeled with a character. Once the data is read properly, it is simple:
set.seed(42)
dat <- data.frame(matrix(sample.int(100, 100, replace=TRUE), 10, 10))
str(dat)
# 'data.frame': 10 obs. of 10 variables:
# $ X1 : int 49 65 25 74 100 18 49 47 24 71
# $ X2 : int 100 89 37 20 26 3 41 89 27 36
# $ X3 : int 95 5 84 34 92 3 58 97 42 24
# $ X4 : int 30 43 15 22 58 8 36 68 86 18
# $ X5 : int 92 69 4 98 50 99 88 87 49 26
# $ X6 : int 6 6 2 3 21 2 58 10 40 5
# $ X7 : int 33 49 100 73 29 76 84 9 35 93
# $ X8 : int 16 92 69 92 2 82 24 18 69 55
# $ X9 : int 40 21 100 57 100 42 18 91 13 53
# $ X10: int 54 83 32 80 60 29 81 73 85 43
dat1000 <- dat * 1000
Try this option:
df[, 2:ncol(df)] <- 1000 * df[, 2:ncol(df)]
If you instead wanted a perhaps more generic solution targeting only columns whose name starts with Gene, then use:
df[grep("^Gene", names(df))] <- 1000*df[grep("^Gene", names(df))]
Looking at your target result, you need to multiply all columns except Pt. In Python:
target_cols = [i for i in df.columns if i!='Pt']
for i in target_cols:
    df[i] = df[i].astype(float)
    df[i] = df[i]*1000
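The TypeError in the question ('str' and 'float') suggests the gene columns hold strings. A variant of the loop above, sketched here under that assumption, converts all gene columns at once with pd.to_numeric and errors="coerce", which turns cells that cannot be parsed into NaN instead of raising. The small frame is made-up data shaped like the question's example.

```python
import pandas as pd

# Made-up frame mimicking a CSV whose gene columns were read as strings.
df = pd.DataFrame(
    {
        "Pt": [1, 2],
        "GeneA": ["0.001", "0"],
        "GeneB": ["2", "0.5"],
        "GeneC": ["0", "0.002"],
    }
)

# Every column except the patient identifier.
gene_cols = [c for c in df.columns if c != "Pt"]

# Convert text to numbers; cells that cannot be parsed become NaN.
numeric = df[gene_cols].apply(pd.to_numeric, errors="coerce")

# Multiply every gene column by the scalar in one step.
df[gene_cols] = numeric * 1000
print(df)
```

Checking df.dtypes before and after is a quick way to confirm that the columns really were strings to begin with.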

How can I extract only numbers from this column?

Suppose you have a column in Excel with values like this. There are only 5500 numbers present, but the length shown is 5602, which means that 102 strings are present:
4 SELECTIO
6 N NO
14 37001
26 37002
38 37003
47 37004
60 37005
73 37006
82 37007
92 37008
105 37009
119 37010
132 37011
143 37012
157 37013
168 37014
184 37015
196 37016
207 37017
220 37018
236 37019
253 37020
267 37021
280 37022
287 Krishan
290 37023
300 37024
316 37025
337 37026
365 37027
...
74141 42471
74154 42472
74169 42473
74184 42474
74200 42475
74216 42476
74233 42477
74242 42478
74256 42479
74271 42480
74290 42481
74309 42482
74323 42483
74336 42484
74350 42485
74365 42486
74378 42487
74389 42488
74398 42489
74413 42490
74430 42491
74446 42492
74459 42493
74474 42494
74491 42495
74504 42496
74516 42497
74530 42498
74544 42499
74558 42500
Name: Selection No., Length: 5602, dtype: object
and I want to get only numeric values like this in python using pandas
37001
37002
37003
37004
37005
How can I do this? Here is my code in Python using pandas:
def selection(sle):
    if sle in re.match('[3-4][0-9]{4}',sle):
        return 1
    else:
        return 0
select['status'] = select['Selection No.'].apply(selection)
and now I am getting an "argument of type 'NoneType' is not iterable" error.
Try using NumPy with np.isreal and select only the numbers:
import pandas as pd
import numpy as np
df = pd.DataFrame({'SELECTIO':['N NO',37002,37003,'Krishan',37004,'singh',37005], 'some_col':[4,6,14,26,38,47,60]})
df
SELECTIO some_col
0 N NO 4
1 37002 6
2 37003 14
3 Krishan 26
4 37004 38
5 singh 47
6 37005 60
>>> df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Specific to the column SELECTIO:
df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Or, as another approach, use the numbers module with a lambda:
import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Note: there is a problem when you extract the column with ['Selection No.']. The name actually contains a trailing space, so it should be ['Selection No. ']. That is the reason you get a KeyError when executing it. Try it and see!
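The np.isreal and isinstance checks above only keep cells that are already stored as Python numbers. If the column was read as text, so that 37001 arrives as the string "37001", one alternative is to coerce with pd.to_numeric and drop what fails to parse. This is a sketch on made-up values, not the data from the question.

```python
import pandas as pd

# Made-up column mixing text labels, numeric strings and a real integer.
s = pd.Series(["N NO", "37001", 37002, "Krishan", "37003"], name="Selection No.")

# Cells that cannot be parsed as numbers become NaN.
numeric = pd.to_numeric(s, errors="coerce")

# Drop the NaN rows and restore an integer dtype.
only_numbers = numeric.dropna().astype(int)
print(only_numbers)
```

The original index labels are kept, so only_numbers.index can be used to select the matching rows of the full frame.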
Your function contains a wrong expression: if sle in re.match('[3-4][0-9]{4}',sle): tests whether the column value sle is IN the match object. Match objects "always have a boolean value of True", but re.match returns None when there is no match, and None does not support the in operator, hence the error.
I would suggest using the pd.Series.str.isnumeric function instead:
In [544]: df
Out[544]:
Selection No.
0 37001
1 37002
2 37003
3 asnsh
4 37004
5 singh
6 37005
In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)
In [546]: df
Out[546]:
Selection No. Status
0 37001 1
1 37002 1
2 37003 1
3 asnsh 0
4 37004 1
5 singh 0
6 37005 1
If a strict regex pattern is required, use the pd.Series.str.contains function:
df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
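The re.match problem described above can also be fixed inside the original function by testing the match result directly instead of using in. A minimal sketch with the standard library only; the sample cells are made up, and converting each value with str() is my assumption for handling cells that are already integers.

```python
import re

# Five digits starting with 3 or 4, as in the pattern used in the question.
PATTERN = re.compile(r"[3-4][0-9]{4}")


def selection(value):
    """Return 1 if the whole cell is a five-digit selection number, else 0."""
    text = str(value).strip()
    # fullmatch returns a match object or None, so compare against None.
    return 1 if PATTERN.fullmatch(text) is not None else 0


cells = ["N NO", 37001, "37002", "Krishan", " 42500 "]
status = [selection(c) for c in cells]
numbers = [int(str(c).strip()) for c in cells if selection(c)]
print(status, numbers)
```

With pandas this function can be passed to apply in the same way as in the question, for example select['Selection No.'].apply(selection).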

Pandas sort() ignoring negative sign

I want to sort a pandas df but I'm having problems with the negative values.
import pandas as pd
df = pd.read_csv('File.txt', sep='\t', header=None)
#Suppress scientific notation (finally)
pd.set_option('display.float_format', lambda x: '%.8f' % x)
print(df)
print(df.dtypes)
print(df.shape)
b = df.sort(axis=0, ascending=True)
print(b)
This gives me the ascending order but completely disregards the sign.
SPATA1 -0.00000005
HMBOX1 0.00000005
SLC38A11 -0.00000005
RP11-571M6.17 0.00000004
GNRH1 -0.00000004
PCDHB8 -0.00000004
CXCL1 0.00000004
RP11-48B3.3 -0.00000004
RNFT2 -0.00000004
GRIK3 -0.00000004
ZNF483 0.00000004
RP11-627G18.1 0.00000003
Any ideas what I'm doing wrong?
Thanks
Loading your file with:
df = pd.read_csv('File.txt', sep='\t', header=None)
Since sort(...) is deprecated, you can use sort_values:
b = df.sort_values(by=[1], axis=0, ascending=True)
where [1] is your column of values. For me this returns:
0 1
0 ACTA1 -0.582570
1 MT-CO1 -0.543877
2 CKM -0.338265
3 MT-ND1 -0.306239
5 MT-CYB -0.128241
6 PDK4 -0.119309
8 GAPDH -0.090912
9 MYH1 -0.087777
12 RP5-940J5.9 -0.074280
13 MYH2 -0.072261
16 MT-ND2 -0.052551
18 MYL1 -0.049142
19 DES -0.048289
20 ALDOA -0.047661
22 ENO3 -0.046251
23 MT-CO2 -0.043684
26 RP11-799N11.1 -0.034972
28 TNNT3 -0.032226
29 MYBPC2 -0.030861
32 TNNI2 -0.026707
33 KLHL41 -0.026669
34 SOD2 -0.026166
35 GLUL -0.026122
42 TRIM63 -0.022971
47 FLNC -0.018180
48 ATP2A1 -0.017752
49 PYGM -0.016934
55 hsa-mir-6723 -0.015859
56 MT1A -0.015110
57 LDHA -0.014955
.. ... ...
60 RP1-178F15.4 0.013383
58 HSPB1 0.014894
54 UBB 0.015874
53 MIR1282 0.016318
52 ALDH2 0.016441
51 FTL 0.016543
50 RP11-317J10.2 0.016799
46 RP11-290D2.6 0.018803
45 RRAD 0.019449
44 MYF6 0.019954
43 STAC3 0.021931
41 RP11-138I1.4 0.023031
40 MYBPC1 0.024407
39 PDLIM3 0.025442
38 ANKRD1 0.025458
37 FTH1 0.025526
36 MT-RNR2 0.025887
31 HSPB6 0.027680
30 RP11-451G4.2 0.029969
27 AC002398.12 0.033219
25 MT-RNR1 0.040741
24 TNNC1 0.042251
21 TNNT1 0.047177
17 MT-ND3 0.051963
15 MTND1P23 0.059405
14 MB 0.063896
11 MYL2 0.076358
10 MT-ND5 0.076479
7 CA3 0.100221
4 MT-ND6 0.140729
[18152 rows x 2 columns]
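A possible explanation for the output in the question, offered as an assumption rather than a confirmed diagnosis: I believe the old df.sort() called without a columns argument sorted by the index, so the rows simply stayed in file order. The sketch below sorts by the value column explicitly, using a few rows copied from the question's output. The pd.to_numeric call is a guard in case the column was read as text, since strings would sort lexicographically.

```python
import pandas as pd

# A few rows from the question, with the values held as strings on purpose.
df = pd.DataFrame(
    {
        0: ["SPATA1", "HMBOX1", "SLC38A11", "RP11-571M6.17", "GNRH1"],
        1: ["-0.00000005", "0.00000005", "-0.00000005", "0.00000004", "-0.00000004"],
    }
)

# Make sure the value column is numeric before sorting.
df[1] = pd.to_numeric(df[1])

# Sort by the value column, so negative numbers come first.
by_value = df.sort_values(by=1, ascending=True)
print(by_value)
```

Printing df.dtypes before sorting shows whether the value column is float64 or object, which is usually the first thing to check when a sort looks wrong.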
