R's pdIdent function in RPy - python
I am working on translating the code for the lmeSplines tutorial to RPy.
I am now stuck at the following line:
fit1s <- lme(y ~ time, data=smSplineEx1,random=list(all=pdIdent(~Zt - 1)))
I have worked with nlme.lme before, and the following works just fine:
from rpy2.robjects.packages import importr
nlme = importr('nlme')
nlme.lme(r.formula('y ~ time'), data=some_data, random=r.formula('~1|ID'))
But this call uses a different random-effects specification. How can I translate the list(all=pdIdent(~Zt - 1)) part and use it in my RPy code as well?
The structure of the (preprocessed) example data smSplineEx1 looks like this (with Zt.* up to 98):
time y y.true all Zt.1 Zt.2 Zt.3
1 1 5.797149 4.235263 1 1.168560e+00 2.071261e+00 2.944953e+00
2 2 5.469222 4.461302 1 1.487859e-01 1.072013e+00 1.948857e+00
3 3 4.567237 4.678477 1 -5.449190e-02 7.276623e-02 9.527613e-01
4 4 3.645763 4.887137 1 -5.364552e-02 -1.359115e-01 -4.333438e-02
5 5 5.094126 5.087615 1 -5.279913e-02 -1.337708e-01 -2.506194e-01
6 6 4.636121 5.280233 1 -5.195275e-02 -1.316300e-01 -2.466158e-01
7 7 5.501538 5.465298 1 -5.110637e-02 -1.294892e-01 -2.426123e-01
8 8 5.011509 5.643106 1 -5.025998e-02 -1.273485e-01 -2.386087e-01
9 9 6.114037 5.813942 1 -4.941360e-02 -1.252077e-01 -2.346052e-01
10 10 5.696472 5.978080 1 -4.856722e-02 -1.230670e-01 -2.306016e-01
11 11 6.615363 6.135781 1 -4.772083e-02 -1.209262e-01 -2.265980e-01
12 12 8.002526 6.287300 1 -4.687445e-02 -1.187854e-01 -2.225945e-01
13 13 6.887444 6.432877 1 -4.602807e-02 -1.166447e-01 -2.185909e-01
14 14 6.319205 6.572746 1 -4.518168e-02 -1.145039e-01 -2.145874e-01
15 15 6.482771 6.707130 1 -4.433530e-02 -1.123632e-01 -2.105838e-01
16 16 7.938015 6.836245 1 -4.348892e-02 -1.102224e-01 -2.065802e-01
17 17 7.585533 6.960298 1 -4.264253e-02 -1.080816e-01 -2.025767e-01
18 18 7.560287 7.079486 1 -4.179615e-02 -1.059409e-01 -1.985731e-01
19 19 7.571020 7.194001 1 -4.094977e-02 -1.038001e-01 -1.945696e-01
20 20 8.922418 7.304026 1 -4.010338e-02 -1.016594e-01 -1.905660e-01
21 21 8.241394 7.409737 1 -3.925700e-02 -9.951861e-02 -1.865625e-01
22 22 7.447076 7.511303 1 -3.841062e-02 -9.737785e-02 -1.825589e-01
23 23 7.317292 7.608886 1 -3.756423e-02 -9.523709e-02 -1.785553e-01
24 24 7.077333 7.702643 1 -3.671785e-02 -9.309633e-02 -1.745518e-01
25 25 8.268601 7.792723 1 -3.587147e-02 -9.095557e-02 -1.705482e-01
26 26 8.216013 7.879272 1 -3.502508e-02 -8.881481e-02 -1.665447e-01
27 27 8.968495 7.962427 1 -3.417870e-02 -8.667405e-02 -1.625411e-01
28 28 9.085605 8.042321 1 -3.333232e-02 -8.453329e-02 -1.585375e-01
29 29 9.002575 8.119083 1 -3.248593e-02 -8.239253e-02 -1.545340e-01
30 30 8.763187 8.192835 1 -3.163955e-02 -8.025177e-02 -1.505304e-01
31 31 8.936370 8.263695 1 -3.079317e-02 -7.811101e-02 -1.465269e-01
32 32 9.033403 8.331776 1 -2.994678e-02 -7.597025e-02 -1.425233e-01
33 33 8.248328 8.397188 1 -2.910040e-02 -7.382949e-02 -1.385198e-01
34 34 5.961721 8.460035 1 -2.825402e-02 -7.168873e-02 -1.345162e-01
35 35 8.400489 8.520418 1 -2.740763e-02 -6.954797e-02 -1.305126e-01
36 36 6.855125 8.578433 1 -2.656125e-02 -6.740721e-02 -1.265091e-01
37 37 9.798931 8.634174 1 -2.571487e-02 -6.526645e-02 -1.225055e-01
38 38 8.862758 8.687729 1 -2.486848e-02 -6.312569e-02 -1.185020e-01
39 39 7.282970 8.739184 1 -2.402210e-02 -6.098493e-02 -1.144984e-01
40 40 7.484208 8.788621 1 -2.317572e-02 -5.884417e-02 -1.104949e-01
41 41 8.404670 8.836120 1 -2.232933e-02 -5.670341e-02 -1.064913e-01
42 42 8.880734 8.881756 1 -2.148295e-02 -5.456265e-02 -1.024877e-01
43 43 8.826189 8.925603 1 -2.063657e-02 -5.242189e-02 -9.848418e-02
44 44 9.827906 8.967731 1 -1.979018e-02 -5.028113e-02 -9.448062e-02
45 45 8.528795 9.008207 1 -1.894380e-02 -4.814037e-02 -9.047706e-02
46 46 9.484073 9.047095 1 -1.809742e-02 -4.599961e-02 -8.647351e-02
47 47 8.911947 9.084459 1 -1.725103e-02 -4.385885e-02 -8.246995e-02
48 48 10.201343 9.120358 1 -1.640465e-02 -4.171809e-02 -7.846639e-02
49 49 8.908016 9.154849 1 -1.555827e-02 -3.957733e-02 -7.446283e-02
50 50 8.202368 9.187988 1 -1.471188e-02 -3.743657e-02 -7.045927e-02
51 51 7.432851 9.219828 1 -1.386550e-02 -3.529581e-02 -6.645572e-02
52 52 8.063268 9.250419 1 -1.301912e-02 -3.315505e-02 -6.245216e-02
53 53 10.155756 9.279810 1 -1.217273e-02 -3.101429e-02 -5.844860e-02
54 54 7.905281 9.308049 1 -1.132635e-02 -2.887353e-02 -5.444504e-02
55 55 9.688337 9.335181 1 -1.047997e-02 -2.673277e-02 -5.044148e-02
56 56 9.437176 9.361249 1 -9.633582e-03 -2.459201e-02 -4.643793e-02
57 57 9.165873 9.386295 1 -8.787198e-03 -2.245125e-02 -4.243437e-02
58 58 9.120195 9.410358 1 -7.940815e-03 -2.031049e-02 -3.843081e-02
59 59 9.955840 9.433479 1 -7.094432e-03 -1.816973e-02 -3.442725e-02
60 60 9.314230 9.455692 1 -6.248048e-03 -1.602897e-02 -3.042369e-02
61 61 9.706852 9.477035 1 -5.401665e-03 -1.388821e-02 -2.642014e-02
62 62 9.615765 9.497541 1 -4.555282e-03 -1.174746e-02 -2.241658e-02
63 63 7.918843 9.517242 1 -3.708898e-03 -9.606695e-03 -1.841302e-02
64 64 9.352892 9.536172 1 -2.862515e-03 -7.465935e-03 -1.440946e-02
65 65 9.722685 9.554359 1 -2.016132e-03 -5.325176e-03 -1.040590e-02
66 66 9.186888 9.571832 1 -1.169748e-03 -3.184416e-03 -6.402346e-03
67 67 8.652299 9.588621 1 -3.233650e-04 -1.043656e-03 -2.398788e-03
68 68 8.681421 9.604751 1 5.230184e-04 1.097104e-03 1.604770e-03
69 69 10.279181 9.620249 1 1.369402e-03 3.237864e-03 5.608328e-03
70 70 9.314963 9.635140 1 2.215785e-03 5.378623e-03 9.611886e-03
71 71 6.897151 9.649446 1 3.062168e-03 7.519383e-03 1.361544e-02
72 72 9.343135 9.663191 1 3.908552e-03 9.660143e-03 1.761900e-02
73 73 9.273135 9.676398 1 4.754935e-03 1.180090e-02 2.162256e-02
74 74 10.041796 9.689086 1 5.601318e-03 1.394166e-02 2.562612e-02
75 75 9.724713 9.701278 1 6.447702e-03 1.608242e-02 2.962968e-02
76 76 8.593517 9.712991 1 7.294085e-03 1.822318e-02 3.363323e-02
77 77 7.401988 9.724244 1 8.140468e-03 2.036394e-02 3.763679e-02
78 78 10.258688 9.735057 1 8.986852e-03 2.250470e-02 4.164035e-02
79 79 10.037192 9.745446 1 9.833235e-03 2.464546e-02 4.564391e-02
80 80 9.637510 9.755427 1 1.067962e-02 2.678622e-02 4.964747e-02
81 81 8.887625 9.765017 1 1.152600e-02 2.892698e-02 5.365102e-02
82 82 9.922013 9.774230 1 1.237239e-02 3.106774e-02 5.765458e-02
83 83 10.466709 9.783083 1 1.321877e-02 3.320850e-02 6.165814e-02
84 84 11.132830 9.791588 1 1.406515e-02 3.534926e-02 6.566170e-02
85 85 10.154038 9.799760 1 1.491154e-02 3.749002e-02 6.966526e-02
86 86 10.433068 9.807612 1 1.575792e-02 3.963078e-02 7.366881e-02
87 87 9.666781 9.815156 1 1.660430e-02 4.177154e-02 7.767237e-02
88 88 9.478004 9.822403 1 1.745069e-02 4.391230e-02 8.167593e-02
89 89 10.002749 9.829367 1 1.829707e-02 4.605306e-02 8.567949e-02
90 90 7.593259 9.836058 1 1.914345e-02 4.819382e-02 8.968305e-02
91 91 10.915754 9.842486 1 1.998984e-02 5.033458e-02 9.368660e-02
92 92 8.855580 9.848662 1 2.083622e-02 5.247534e-02 9.769016e-02
93 93 8.884683 9.854596 1 2.168260e-02 5.461610e-02 1.016937e-01
94 94 9.757451 9.860298 1 2.252899e-02 5.675686e-02 1.056973e-01
95 95 10.222361 9.865775 1 2.337537e-02 5.889762e-02 1.097008e-01
96 96 9.090410 9.871038 1 2.422175e-02 6.103838e-02 1.137044e-01
97 97 8.837872 9.876095 1 2.506814e-02 6.317914e-02 1.177080e-01
98 98 9.413135 9.880953 1 2.591452e-02 6.531990e-02 1.217115e-01
99 99 9.295531 9.885621 1 2.676090e-02 6.746066e-02 1.257151e-01
100 100 9.698118 9.890106 1 2.760729e-02 6.960142e-02 1.297186e-01
You can create list(all=pdIdent(~Zt - 1)) in R's global environment using the reval() method:
In [55]:
import rpy2.robjects as ro
import pandas.rpy.common as com
mydata = ro.r['data.frame']
read = ro.r['read.csv']
head = ro.r['head']
summary = ro.r['summary']
library = ro.r['library']
In [56]:
formula = '~ time'
library('lmeSplines')
ro.reval('data(smSplineEx1)')
ro.reval('smSplineEx1$all <- rep(1,nrow(smSplineEx1))')
ro.reval('smSplineEx1$Zt <- smspline(~ time, data=smSplineEx1)')
ro.reval('rnd <- list(all=pdIdent(~Zt - 1))')
#result = ro.r.smspline(formula=ro.r(formula), data=ro.r.smSplineEx1) #notice: data=ro.r.smSplineEx1
result = ro.r.lme(ro.r('y~time'), data=ro.r.smSplineEx1, random=ro.r.rnd)
In [57]:
print com.convert_robj(result.rx('coefficients'))
{'coefficients': {'random': {'all': Zt1 Zt2 Zt3 Zt4 Zt5 Zt6 Zt7 \
1 0.000509 0.001057 0.001352 0.001184 0.000869 0.000283 -0.000424
Zt8 Zt9 Zt10 ... Zt89 Zt90 Zt91 \
1 -0.001367 -0.002325 -0.003405 ... -0.001506 -0.001347 -0.000864
Zt92 Zt93 Zt94 Zt95 Zt96 Zt97 Zt98
1 -0.000631 -0.000569 -0.000392 -0.000049 0.000127 0.000114 0.000071
[1 rows x 98 columns]}, 'fixed': (Intercept) 6.498800
time 0.038723
dtype: float64}}
Be careful: the result has an awkward shape. It is a nested dictionary, which cannot be converted into a pandas.DataFrame directly.
You can access y in smSplineEx1 with ro.r.smSplineEx1.rx('y'), similar to smSplineEx1$y in R.
Now say you have the result variable in Python, generated by
result = ro.r.lme(ro.r('y~time'), data=ro.r.smSplineEx1, random=ro.r.rnd)
and you want to plot it using R (instead of plotting it with, say, matplotlib). In that case you need to assign it to a variable in the R workspace:
ro.R().assign('result', result)
Now there is a variable named result in the R workspace; you can access it using ro.r.result.
Plotting it using R:
In [17]:
ro.reval('plot(smSplineEx1$time,smSplineEx1$y,pch="o",type="n", \
main="Spline fits: lme(y ~ time, random=list(all=pdIdent(~Zt-1)))", \
xlab="time",ylab="y")')
Out[17]:
rpy2.rinterface.NULL
In [21]:
ro.reval('lines(smSplineEx1$time, fitted(result),col=2)')
Out[21]:
rpy2.rinterface.NULL
Or you can do everything in R:
ro.reval('result <- lme(y ~ time, data=smSplineEx1,random=list(all=pdIdent(~Zt - 1)))')
ro.reval('plot(smSplineEx1$time,smSplineEx1$y,pch="o",type="n", \
main="Spline fits: lme(y ~ time, random=list(all=pdIdent(~Zt-1)))", \
xlab="time",ylab="y")')
ro.reval('lines(smSplineEx1$time, fitted(result),col=2)')
and access the R variables using ro.r.smSplineEx1.rx2('time') or ro.r.result.
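As a variation on the steps above, the random-effects specification can also be evaluated as an R expression and passed to lme() as an ordinary Python object, without first storing it under a name such as rnd. The sketch below is untested here and rests on several assumptions: rpy2 and the R packages nlme and lmeSplines are installed, and the helper names pd_ident_spec and fit_spline_lme are hypothetical, introduced only for illustration.

```python
def pd_ident_spec(group, matrix_col):
    """Build the R source for a pdIdent random-effects list.

    Hypothetical helper: pd_ident_spec("all", "Zt") gives
    "list(all=pdIdent(~Zt - 1))".
    """
    return "list({g}=pdIdent(~{m} - 1))".format(g=group, m=matrix_col)


def fit_spline_lme():
    """Fit the lmeSplines example model through rpy2 (assumes R is available)."""
    # rpy2 is imported lazily so the string helper above works without R.
    import rpy2.robjects as ro

    # Attach the packages so lme(), pdIdent() and smspline() are visible in R.
    ro.r("library(nlme)")
    ro.r("library(lmeSplines)")

    # Prepare the data on the R side, as in the lmeSplines example.
    ro.r("data(smSplineEx1)")
    ro.r("smSplineEx1$all <- rep(1, nrow(smSplineEx1))")
    ro.r("smSplineEx1$Zt <- smspline(~ time, data=smSplineEx1)")

    # Evaluate the specification in R; the returned object is passed to lme().
    random = ro.r(pd_ident_spec("all", "Zt"))
    return ro.r["lme"](ro.Formula("y ~ time"),
                       data=ro.r["smSplineEx1"],
                       random=random)


print(pd_ident_spec("all", "Zt"))
```

Calling fit_spline_lme() should return the fitted lme object, which can then be inspected with .rx('coefficients') as shown earlier.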
Edit
Notice that some R objects cannot be converted to a pandas.DataFrame as-is because they mix data structures:
In [62]:
ro.r["smSplineEx1"]
Out[62]:
<DataFrame - Python:0x108525518 / R:0x109e5da38>
[FloatVe..., FloatVe..., FloatVe..., FloatVe..., Matrix]
time: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x10807e518 / R:0x1022599e0>
[1.000000, 2.000000, 3.000000, ..., 98.000000, 99.000000, 100.000000]
y: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x108525a70 / R:0x102259d30>
[5.797149, 5.469222, 4.567237, ..., 9.413135, 9.295531, 9.698118]
y.true: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x1085257a0 / R:0x10225dfb0>
[4.235263, 4.461302, 4.678477, ..., 9.880953, 9.885621, 9.890106]
all: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x1085258c0 / R:0x10225e300>
[1.000000, 1.000000, 1.000000, ..., 1.000000, 1.000000, 1.000000]
Zt: <class 'rpy2.robjects.vectors.Matrix'>
<Matrix - Python:0x108525908 / R:0x103e8ba00>
[1.168560, 0.148786, -0.054492, ..., -0.030141, -0.030610, 0.757597]
Notice that we have a few vectors, but the last element is a Matrix. We have to convert smSplineEx1 to Python in two parts.
In [63]:
ro.r["smSplineEx1"].names
Out[63]:
<StrVector - Python:0x108525dd0 / R:0x1042ca7c0>
['time', 'y', 'y.true', 'all', 'Zt']
In [64]:
print com.convert_robj(ro.r["smSplineEx1"].rx(ro.IntVector(range(1, 5)))).head()
time y y.true all
1 1 5.797149 4.235263 1
2 2 5.469222 4.461302 1
3 3 4.567237 4.678477 1
4 4 3.645763 4.887137 1
5 5 5.094126 5.087615 1
In [65]:
print com.convert_robj(ro.r["smSplineEx1"].rx2('Zt')).head(2)
0 1 2 3 4 5 6 \
1 1.168560 2.071261 2.944953 3.782848 4.584037 5.348937 6.078121
2 0.148786 1.072013 1.948857 2.789264 3.593423 4.361817 5.095016
7 8 9 ... 88 89 90 \
1 6.772184 7.431719 8.057321 ... 0.933947 0.769591 0.619420
2 5.793601 6.458153 7.089255 ... 0.904395 0.745337 0.599976
91 92 93 94 95 96 97
1 0.484029 0.36401 0.259959 0.172468 0.102133 0.049547 0.015305
2 0.468893 0.35267 0.251890 0.167135 0.098986 0.048026 0.014836
[2 rows x 98 columns]
com.convert_robj(ro.r["smSplineEx1"]) will not work due to the mixed data structure issue.
Related
Converting time format to second in a panda dataframe
I have a df with time data and I would like to transform these data to seconds (see example below).
   Compression_level  Size (M) Real time (s) User time (s) Sys time (s)
0                  0       265      0:19.938      0:24.649      0:3.062
1                  1        76      0:17.910      0:25.929      0:3.098
2                  2        74      1:02.619      0:27.724      0:3.014
3                  3        73      0:20.607      0:27.937      0:3.193
4                  4        67      0:19.598      0:28.853      0:2.925
5                  5        67      0:21.032      0:30.119      0:3.206
6                  6        66      0:27.013      0:31.462      0:3.106
7                  7        65      0:27.337      0:36.226      0:3.060
8                  8        64      0:37.651      0:47.246      0:2.933
9                  9        64      0:59.241       1:8.333      0:3.027
This is the output I would like to obtain:
df["Real time (s)"]
0    19.938
1    17.910
2    62.619
...
I have some useful code, but I do not know how to apply it across a data frame:
x = time.strptime("00:01:00","%H:%M:%S")
datetime.timedelta(hours=x.tm_hour,minutes=x.tm_min, seconds=x.tm_sec).total_seconds()
Prepend 00: (for 0 hours) with radd, pass the result to to_timedelta, and then call Series.dt.total_seconds:
df["Real time (s)"] = pd.to_timedelta(df["Real time (s)"].radd('00:')).dt.total_seconds()
print (df)
   Compression_level  Size (M)  Real time (s) User time (s) Sys time (s)
0                  0       265         19.938      0:24.649      0:3.062
1                  1        76         17.910      0:25.929      0:3.098
2                  2        74         62.619      0:27.724      0:3.014
3                  3        73         20.607      0:27.937      0:3.193
4                  4        67         19.598      0:28.853      0:2.925
5                  5        67         21.032      0:30.119      0:3.206
6                  6        66         27.013      0:31.462      0:3.106
7                  7        65         27.337      0:36.226      0:3.060
8                  8        64         37.651      0:47.246      0:2.933
9                  9        64         59.241       1:8.333      0:3.027
Solution for processing multiple columns:
def to_td(x):
    return pd.to_timedelta(x.radd('00:')).dt.total_seconds()

cols = ["Real time (s)", "User time (s)", "Sys time (s)"]
df[cols] = df[cols].apply(to_td)
print (df)
   Compression_level  Size (M)  Real time (s)  User time (s)  Sys time (s)
0                  0       265         19.938         24.649         3.062
1                  1        76         17.910         25.929         3.098
2                  2        74         62.619         27.724         3.014
3                  3        73         20.607         27.937         3.193
4                  4        67         19.598         28.853         2.925
5                  5        67         21.032         30.119         3.206
6                  6        66         27.013         31.462         3.106
7                  7        65         27.337         36.226         3.060
8                  8        64         37.651         47.246         2.933
9                  9        64         59.241         68.333         3.027
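For comparison, the same conversion can be sketched in plain Python without pandas. The function name clock_to_seconds is hypothetical, and the sketch assumes every value is a colon-separated clock string such as "M:SS.mmm" or "H:M:SS.mmm".

```python
def clock_to_seconds(text):
    """Convert a clock string such as '1:02.619' to seconds as a float."""
    seconds = 0.0
    # Each colon-separated field is worth 60 times the field to its right.
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


for raw in ["0:19.938", "1:02.619", "1:8.333"]:
    print(raw, "->", round(clock_to_seconds(raw), 3))
```

With pandas available, a function like this could be applied to one column through Series.map, although the to_timedelta approach stays vectorized.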
Python, trying to calculate RSI but I am getting unusually high numbers
I am trying to calculate the RSI formula in python. I am getting the closing price data from the AlphaVantage TimeSeries API.
def rsi(data,period):
    length = len(data) - 1
    current_price = 0
    previous_price = 0
    avg_up = 0
    avg_down = 0
    for i in range(length-period,length):
        current_price = data[i]
        if current_price > previous_price:
            avg_up += current_price - previous_price
        else:
            avg_down += previous_price - current_price
        previous_price = data[i]
    # Calculate average gain and loss
    avg_up = avg_up/period
    avg_down = avg_down/period
    # Calculate relative strength
    rs = avg_up/avg_down
    # Calculate rsi
    rsi = 100 - (100/(1+rs))
    return rsi

print(rsi(data=closing_price,period=14))
In this case, this will output a really high number along the lines of RSI: 99.824. But according to TradingView, the current RSI is actually 62.68. Any feedback on what I am doing wrong would be very much appreciated!
Here is some data; it is 100 minutes of AAPL data:
           0
0   118.3900
1   118.4200
2   118.3500
3   118.3000
4   118.2800
5   118.4000
6   118.3400
7   118.4500
8   118.3900
9   118.4100
10  118.4700
11  118.4000
12  118.4000
13  118.3400
14  118.4100
15  118.2850
16  118.2900
17  118.1700
18  118.2600
19  118.2800
20  118.2600
21  118.2400
22  118.2950
23  118.2800
24  118.2900
25  118.2850
26  118.3000
27  118.2150
28  118.2300
29  118.1450
30  118.1200
31  118.0800
32  118.1300
33  118.1100
34  118.1300
35  118.2300
36  118.1000
37  118.1900
38  118.2800
39  118.2400
40  118.2300
41  118.3300
42  118.3200
43  118.3500
44  118.3600
45  118.3650
46  118.3800
47  118.4500
48  118.5000
49  118.5100
50  118.5400
51  118.5100
52  118.5063
53  118.5200
54  118.5400
55  118.4700
56  118.4700
57  118.4300
58  118.4400
59  118.4300
60  118.3800
61  118.4000
62  118.3600
63  118.3700
64  118.3400
65  118.3200
66  118.3000
67  118.3210
68  118.3714
69  118.4000
70  118.4100
71  118.3500
72  118.3300
73  118.3200
74  118.3250
75  118.3200
76  118.3900
77  118.5000
78  118.4800
79  118.5300
80  118.5300
81  118.4800
82  118.5000
83  118.4400
84  118.5400
85  118.5550
86  118.5200
87  118.4600
88  118.4500
89  118.4400
90  118.4300
91  118.4019
92  118.4400
93  118.4400
94  118.4100
95  118.4000
96  118.4400
97  118.4400
98  118.4600
99  118.5050
I've managed to compute 59.4 with the code below, which is close to what you are looking for. Here is what I've changed:
- averages are divided by the n_up and n_down counters, not by period.
- the previous and current price variables are no longer used; the loop accesses the current data[i] and previous data[i-1] prices directly.
Note that the code has to be checked with other data.
close_AAPL = [118.4200, 118.3500, 118.3000, 118.2800, 118.4000, 118.3400, 118.4500, 118.3900, 118.4100, 118.4700,
              118.4000, 118.4000, 118.3400, 118.4100, 118.2850, 118.2900, 118.1700, 118.2600, 118.2800, 118.2600,
              118.2400, 118.2950, 118.2800, 118.2900, 118.2850, 118.3000, 118.2150, 118.2300, 118.1450, 118.1200,
              118.0800, 118.1300, 118.1100, 118.1300, 118.2300, 118.1000, 118.1900, 118.2800, 118.2400, 118.2300,
              118.3300, 118.3200, 118.3500, 118.3600, 118.3650, 118.3800, 118.4500, 118.5000, 118.5100, 118.5400,
              118.5100, 118.5063, 118.5200, 118.5400, 118.4700, 118.4700, 118.4300, 118.4400, 118.4300, 118.3800,
              118.4000, 118.3600, 118.3700, 118.3400, 118.3200, 118.3000, 118.3210, 118.3714, 118.4000, 118.4100,
              118.3500, 118.3300, 118.3200, 118.3250, 118.3200, 118.3900, 118.5000, 118.4800, 118.5300, 118.5300,
              118.4800, 118.5000, 118.4400, 118.5400, 118.5550, 118.5200, 118.4600, 118.4500, 118.4400, 118.4300,
              118.4019, 118.4400, 118.4400, 118.4100, 118.4000, 118.4400, 118.4400, 118.4600, 118.5050]

def rsi(data,period):
    length = len(data) - 1
    current_price = 0
    previous_price = 0
    avg_up = 0
    n_up = 0
    avg_down = 0
    n_down = 0
    for i in range(length-period,length):
        if data[i] > data[i-1]:
            avg_up += data[i] - data[i-1]
            n_up += 1
        else:
            avg_down += data[i-1] - data[i]
            n_down += 1
    # Calculate average gain and loss
    avg_up = avg_up/n_up
    avg_down = avg_down/n_down
    # Calculate relative strength
    rs = avg_up/avg_down
    # Calculate rsi
    return 100. - 100./(1+rs)

print(rsi(data=close_AAPL, period=14))
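The very high value in the question comes from previous_price starting at 0: the first comparison in the loop counts the whole price (about 118) as a gain. Below is a minimal sketch of an alternative that works only on differences between consecutive prices and keeps the question's division by period. The name simple_rsi is hypothetical. It uses plain averages, so it is not guaranteed to match a charting platform that applies a smoothed average.

```python
def simple_rsi(prices, period=14):
    """RSI from the plain average gain and loss over the last `period` changes."""
    if len(prices) < period + 1:
        raise ValueError("need at least period + 1 prices")
    # period + 1 prices give exactly `period` consecutive differences.
    window = prices[-(period + 1):]
    changes = [later - earlier for earlier, later in zip(window, window[1:])]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        return 100.0  # no losing step in the window
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


print(simple_rsi([1.0, 2.0, 3.0, 2.0], period=3))
```

In the small example the three changes are +1, +1 and -1, so the average gain is twice the average loss and the result is about 66.67.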
Multiplying an entire df or matrix by 1000?
I am new to R and Python, so forgive me if this is an elementary question. I have a large data set of genes (columns) by patients (rows), with each value being an RNA expression value (most values falling between 0 and 1). I want to multiply the entire data set by 1000 so that all non-zero values will be >1.
Currently:
Pt  GeneA  GeneB  GeneC
1   0.001  2      0
2   0      0.5    0.002
Would like:
Pt  GeneA  GeneB  GeneC
1   1      2000   0
2   0      500    2
I have tried to do this in both R and Python and am running into issues with both. I have also tried converting my data between data frame and matrix, and it won't work with either. I have searched extensively on this website and find information about how to multiply an entire df/matrix by a vector, or individual columns by a scalar, but not the entire thing. Could someone kindly point me in the right direction? I feel like it can't possibly be this hard :)
Using R:
df <- read.csv("/Users/m/Desktop/data.csv")
df * 100
In Ops.factor(left, right) : ‘*’ not meaningful for factors
mtx <- as.matrix(df)
mtx * 100
Error in mtx * 100 : non-numeric argument to binary operator
Using Python 3.7.6:
df = df * 1000
^ This runs without an error message but the values in the cells are exactly the same, so it didn't actually multiply anything...
df = df.div(.001)
TypeError: unsupported operand type(s) for /: 'str' and 'float'
Any creative ideas or resources to point me in the right direction? Thank you!
What does str(df) give you? At least some of your columns have been converted to factors because they are character strings. Open the csv file in a text editor and make sure the numbers are not surrounded by "" and that missing values have not been labeled with a character string. Once you have the data read properly it will be simple:
set.seed(42)
dat <- data.frame(matrix(sample.int(100, 100, replace=TRUE), 10, 10))
str(dat)
# 'data.frame': 10 obs. of 10 variables:
# $ X1 : int 49 65 25 74 100 18 49 47 24 71
# $ X2 : int 100 89 37 20 26 3 41 89 27 36
# $ X3 : int 95 5 84 34 92 3 58 97 42 24
# $ X4 : int 30 43 15 22 58 8 36 68 86 18
# $ X5 : int 92 69 4 98 50 99 88 87 49 26
# $ X6 : int 6 6 2 3 21 2 58 10 40 5
# $ X7 : int 33 49 100 73 29 76 84 9 35 93
# $ X8 : int 16 92 69 92 2 82 24 18 69 55
# $ X9 : int 40 21 100 57 100 42 18 91 13 53
# $ X10: int 54 83 32 80 60 29 81 73 85 43
dat1000 <- dat * 1000
Try this option:
df[,c(2:ncol(df))] <- 1000*df[,c(2:ncol(df))]
If you instead want a more generic solution that targets only columns whose names start with Gene, use:
df[grep("^Gene", names(df))] <- 1000*df[grep("^Gene", names(df))]
Looking at your target result, you need to multiply all columns except Pt. In Python:
target_cols = [i for i in df.columns if i!='Pt']
for i in target_cols:
    df[i] = df[i].astype(float)
    df[i] = df[i]*1000
How can I Extract only numbers from this columns?
Suppose you have a column in Excel with values like the ones below. There are only 5500 numbers present, but it shows a length of 5602, which means that 102 strings are present.
4        SELECTIO
6            N NO
14          37001
26          37002
38          37003
47          37004
60          37005
73          37006
82          37007
92          37008
105         37009
119         37010
132         37011
143         37012
157         37013
168         37014
184         37015
196         37016
207         37017
220         37018
236         37019
253         37020
267         37021
280         37022
287       Krishan
290         37023
300         37024
316         37025
337         37026
365         37027
...
74141       42471
74154       42472
74169       42473
74184       42474
74200       42475
74216       42476
74233       42477
74242       42478
74256       42479
74271       42480
74290       42481
74309       42482
74323       42483
74336       42484
74350       42485
74365       42486
74378       42487
74389       42488
74398       42489
74413       42490
74430       42491
74446       42492
74459       42493
74474       42494
74491       42495
74504       42496
74516       42497
74530       42498
74544       42499
74558       42500
Name: Selection No., Length: 5602, dtype: object
I want to get only the numeric values, like this, in Python using pandas:
37001
37002
37003
37004
37005
How can I do this? Here is my code in Python using pandas:
def selection(sle):
    if sle in re.match('[3-4][0-9]{4}',sle):
        return 1
    else:
        return 0

select['status'] = select['Selection No.'].apply(selection)
Now I am getting an "argument of type 'NoneType' is not iterable" error.
Try using NumPy with np.isreal and select only the numbers:
import pandas as pd
import numpy as np
df = pd.DataFrame({'SELECTIO':['N NO',37002,37003,'Krishan',37004,'singh',37005], 'some_col':[4,6,14,26,38,47,60]})
df
  SELECTIO  some_col
0     N NO         4
1    37002         6
2    37003        14
3  Krishan        26
4    37004        38
5    singh        47
6    37005        60
df[df[['SELECTIO']].applymap(np.isreal).all(1)]
Result, specific to column SELECTIO:
  SELECTIO  some_col
1    37002         6
2    37003        14
4    37004        38
6    37005        60
Or, as another approach, import numbers and use a lambda:
import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]
  SELECTIO  some_col
1    37002         6
2    37003        14
4    37004        38
6    37005        60
Note: there is a problem in how you extract the column. You are using ['Selection No.'], but the name actually contains a trailing space, like ['Selection No. ']. That is the reason you are getting a KeyError while executing it. Try it and see!
Your function contains a wrong expression: if sle in re.match('[3-4][0-9]{4}',sle): tries to find the column value sle IN a match object, which "always has a boolean value of True" (re.match returns None when there's no match).
I would suggest using the pd.Series.str.isnumeric function:
In [544]: df
Out[544]:
  Selection No.
0         37001
1         37002
2         37003
3         asnsh
4         37004
5         singh
6         37005
In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)
In [546]: df
Out[546]:
  Selection No.  Status
0         37001       1
1         37002       1
2         37003       1
3         asnsh       0
4         37004       1
5         singh       0
6         37005       1
If a strict regex pattern is required, use the pd.Series.str.contains function:
df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
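The "NoneType is not iterable" error can also be avoided by testing the result of the match against None instead of using in. A minimal sketch follows. The name is_selection_number is hypothetical, and it assumes a valid value is exactly five digits starting with 3 or 4, which is why re.fullmatch is used rather than re.match.

```python
import re

# Exactly five digits, the first one being 3 or 4.
PATTERN = re.compile(r"[3-4][0-9]{4}")


def is_selection_number(value):
    """Return 1 if the value is a five-digit selection number, else 0."""
    # fullmatch returns a match object or None, so compare against None
    # rather than using `in`, which fails when the result is None.
    return int(PATTERN.fullmatch(str(value)) is not None)


values = ["SELECTIO", "N NO", "37001", 37002, "Krishan", "42500"]
print([v for v in values if is_selection_number(v)])
```

With pandas, a function of this kind could be passed to Series.apply in place of the original selection function.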
Pandas sort() ignoring negative sign
I want to sort a pandas df but I'm having problems with the negative values.
import pandas as pd
df = pd.read_csv('File.txt', sep='\t', header=None)
#Suppress scientific notation (finally)
pd.set_option('display.float_format', lambda x: '%.8f' % x)
print(df)
print(df.dtypes)
print(df.shape)
b = df.sort(axis=0, ascending=True)
print(b)
This gives me the ascending order but completely disregards the sign.
SPATA1         -0.00000005
HMBOX1          0.00000005
SLC38A11       -0.00000005
RP11-571M6.17   0.00000004
GNRH1          -0.00000004
PCDHB8         -0.00000004
CXCL1           0.00000004
RP11-48B3.3    -0.00000004
RNFT2          -0.00000004
GRIK3          -0.00000004
ZNF483          0.00000004
RP11-627G18.1   0.00000003
Any ideas what I'm doing wrong? Thanks
Loading your file with:
df = pd.read_csv('File.txt', sep='\t', header=None)
Since sort(....) is deprecated, you can use sort_values:
b = df.sort_values(by=[1], axis=0, ascending=True)
where [1] is your column of values. For me this returns:
                0         1
0           ACTA1 -0.582570
1          MT-CO1 -0.543877
2             CKM -0.338265
3          MT-ND1 -0.306239
5          MT-CYB -0.128241
6            PDK4 -0.119309
8           GAPDH -0.090912
9            MYH1 -0.087777
12    RP5-940J5.9 -0.074280
13           MYH2 -0.072261
16         MT-ND2 -0.052551
18           MYL1 -0.049142
19            DES -0.048289
20          ALDOA -0.047661
22           ENO3 -0.046251
23         MT-CO2 -0.043684
26  RP11-799N11.1 -0.034972
28          TNNT3 -0.032226
29         MYBPC2 -0.030861
32          TNNI2 -0.026707
33         KLHL41 -0.026669
34           SOD2 -0.026166
35           GLUL -0.026122
42         TRIM63 -0.022971
47           FLNC -0.018180
48         ATP2A1 -0.017752
49           PYGM -0.016934
55   hsa-mir-6723 -0.015859
56           MT1A -0.015110
57           LDHA -0.014955
..            ...       ...
60   RP1-178F15.4  0.013383
58          HSPB1  0.014894
54            UBB  0.015874
53        MIR1282  0.016318
52          ALDH2  0.016441
51            FTL  0.016543
50  RP11-317J10.2  0.016799
46   RP11-290D2.6  0.018803
45           RRAD  0.019449
44           MYF6  0.019954
43          STAC3  0.021931
41   RP11-138I1.4  0.023031
40         MYBPC1  0.024407
39         PDLIM3  0.025442
38         ANKRD1  0.025458
37           FTH1  0.025526
36        MT-RNR2  0.025887
31          HSPB6  0.027680
30   RP11-451G4.2  0.029969
27    AC002398.12  0.033219
25        MT-RNR1  0.040741
24          TNNC1  0.042251
21          TNNT1  0.047177
17         MT-ND3  0.051963
15       MTND1P23  0.059405
14             MB  0.063896
11           MYL2  0.076358
10         MT-ND5  0.076479
7             CA3  0.100221
4          MT-ND6  0.140729
[18152 rows x 2 columns]