I am new to R and Python, so forgive me if this is an elementary question. I have a large data set of genes (columns) by patients (rows), with each value being an RNA expression value (most values falling between 0 and 1). I want to multiply the entire data set by 1000 so that all non-zero values will be >1.
Currently:
Pt GeneA GeneB GeneC
1 0.001 2 0
2 0 0.5 0.002
Would like:
Pt GeneA GeneB GeneC
1 1 2000 0
2 0 500 2
I have tried to do this in both R and Python and am running into issues with both. I have also tried converting my data between data frame and matrix, and it won't work with either. I have searched extensively on this website and find information about how to multiply an entire df/matrix by a vector, or individual columns by a scalar, but not the entire thing. Could someone kindly point me in the right direction? I feel like it can't possibly be this hard :)
Using R:
df <- read.csv("/Users/m/Desktop/data.csv")
df * 100
In Ops.factor(left, right) : ‘*’ not meaningful for factors
mtx <- as.matrix(df)
mtx * 100
Error in mtx * 100 : non-numeric argument to binary operator
Using Python 3.7.6:
df = df * 1000
^ This runs without an error message but the values in the cells are exactly the same, so it didn't actually multiply anything...
df = df.div(.001)
TypeError: unsupported operand type(s) for /: 'str' and 'float'
Any creative ideas or resources to point me in the right direction? Thank you!
What does str(df) give you? At least some of your columns have been converted to factors because they contain character strings. Open the csv file in a text editor and make sure that the numbers are not surrounded by "" and that missing values have not been labeled with a character. Once the data is read in properly, it will be simple:
set.seed(42)
dat <- data.frame(matrix(sample.int(100, 100, replace=TRUE), 10, 10))
str(dat)
# 'data.frame': 10 obs. of 10 variables:
# $ X1 : int 49 65 25 74 100 18 49 47 24 71
# $ X2 : int 100 89 37 20 26 3 41 89 27 36
# $ X3 : int 95 5 84 34 92 3 58 97 42 24
# $ X4 : int 30 43 15 22 58 8 36 68 86 18
# $ X5 : int 92 69 4 98 50 99 88 87 49 26
# $ X6 : int 6 6 2 3 21 2 58 10 40 5
# $ X7 : int 33 49 100 73 29 76 84 9 35 93
# $ X8 : int 16 92 69 92 2 82 24 18 69 55
# $ X9 : int 40 21 100 57 100 42 18 91 13 53
# $ X10: int 54 83 32 80 60 29 81 73 85 43
dat1000 <- dat * 1000
Try this option:
df[, 2:ncol(df)] <- 1000 * df[, 2:ncol(df)]
If you instead wanted a perhaps more generic solution targeting only columns whose name starts with Gene, then use:
df[grep("^Gene", names(df))] <- 1000*df[grep("^Gene", names(df))]
Looking at your target result, you need to multiply all columns except pt. In python:
target_cols = [i for i in df.columns if i!='Pt']
for i in target_cols:
    df[i] = df[i].astype(float)
    df[i] = df[i]*1000
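If astype(float) fails because some cells hold non-numeric text, pd.to_numeric with errors='coerce' is one way to convert them first. This is only a sketch: the inline CSV text stands in for your file, and dtype=str is used here purely to mimic columns that were read in as strings. Anything that cannot be parsed becomes NaN.

```python
import io

import pandas as pd

# Stand-in for the real file (assumption): the sample table from the question.
csv_text = "Pt,GeneA,GeneB,GeneC\n1,0.001,2,0\n2,0,0.5,0.002\n"

# dtype=str deliberately mimics columns that were read in as strings.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Convert every column except Pt to numbers, then scale by 1000.
genes = df.drop(columns="Pt").apply(pd.to_numeric, errors="coerce") * 1000

# Put the untouched Pt column back in front of the scaled gene columns.
result = pd.concat([df[["Pt"]], genes], axis=1)
print(result)
```

Checking df.dtypes before and after the conversion shows whether the columns were really strings to begin with.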
I have the following dataframe named state:
SSSLifestress SSSHealthstress SSSFinancialstress SSSSocialstress
0 61 80 78 46
1 62 85 19 75
2 63 57 62 21
3 64 11 90 26
4 65 31 77 48
and I want to prune out a high scale and low scale where lifestress >= 63 AND either one of the three is true where (healthStress >= 63 OR ssFinance >= 63 OR socialstress >= 63)
So lifestress must be >= 63 and one of the three others must be >= 63 as well as <= 33 for the low scale, same as above.
I have the following code here
high_scale1 = ( state[state['SSSLifestress']>=63].reset_index(drop=True) & (state[state['SSSHealthstress']>=63] | state[state['SSSFinancialstress']>=63] | state[state['SSSSocialstress']>=63])).reset_index(drop=True)
low_scale1 = (state[state['SSSLifestress']<=33].reset_index(drop=True) & (state[state['SSSHealthstress']<=33] | state[state['SSSFinancialstress']<=33] | state[state['SSSSocialstress']<=33])).reset_index(drop=True)
however I get the error of:
TypeError: unsupported operand type(s) for |: 'float' and 'bool'
I'm looking for the following output for the high scale:
SSSLifestress SSSHealthstress SSSFinancialstress SSSSocialstress
0 64 11 90 26
1 65 31 77 48
You don't need to create multiple dataframes and reset their indexes. Just put the conditions inside .loc. For the high scale it would be:
high_scale1 = state.loc[(state['SSSLifestress']>=63) &
                        ((state['SSSHealthstress']>=63) |
                         (state['SSSFinancialstress']>=63) |
                         (state['SSSSocialstress']>=63)),
                        :].reset_index(drop=True)
Output:
SSSLifestress SSSHealthstress SSSFinancialstress SSSSocialstress
0 64 11 90 26
1 65 31 77 48
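The low scale follows the same pattern with <= 33. Here is a minimal sketch that covers both cases, assuming the column names from the question; the helper name scale and the list others are made up for illustration.

```python
import operator

import pandas as pd

# Sample data copied from the question.
state = pd.DataFrame({
    "SSSLifestress": [61, 62, 63, 64, 65],
    "SSSHealthstress": [80, 85, 57, 11, 31],
    "SSSFinancialstress": [78, 19, 62, 90, 77],
    "SSSSocialstress": [46, 75, 21, 26, 48],
})

# The three columns of which at least one must also pass the threshold.
others = ["SSSHealthstress", "SSSFinancialstress", "SSSSocialstress"]


def scale(frame, compare, threshold):
    """Keep rows where Lifestress and at least one other column pass the test."""
    life_ok = compare(frame["SSSLifestress"], threshold)
    any_other = compare(frame[others], threshold).any(axis=1)
    return frame[life_ok & any_other].reset_index(drop=True)


high_scale1 = scale(state, operator.ge, 63)  # every comparison uses >= 63
low_scale1 = scale(state, operator.le, 33)   # every comparison uses <= 33
print(high_scale1)
print(low_scale1)
```

With the sample rows no Lifestress value is 33 or below, so low_scale1 comes back empty.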
I have a dataframe with a sample of the employee survey results as shown below. The values in the delta columns are just the difference between the FY21 and FY20 columns.
Employee leadership_fy21 leadership_fy20 leadership_delta comms_fy21 comms_fy20 comms_delta
patrick.t#abc.com 88 50 38 90 80 10
johnson.g#abc.com 22 82 -60 80 90 -10
pamela.u#abc.com 41 94 -53 44 60 -16
yasmine.a#abc.com 90 66 24 30 10 20
I'd like to create multiple columns that
i. contain the % in the fy21 values
ii. merge it with the columns with the delta suffix such that the delta values are in a ().
example output would be:
Employee leadership_fy21 leadership_delta leadership_final comms_fy21 comms_delta comms_final
patrick.t#abc.com 88 38 88% (38) 90 10 90% (10)
johnson.g#abc.com 22 -60 22% (-60) 80 -10 80% (-10)
pamela.u#abc.com 41 -53 41% (-53) 44 -16 44% (-16)
yasmine.a#abc.com 90 24 90% (24) 30 20 30% (20)
I have tried the following code but it doesn't seem to work. It might have to do with numpy not being able to combine strings. Appreciate any form of help I can get, thank you.
#create a list of all the rating columns
ratingcollist = ['leadership','comms','wellbeing','teamwork']
#create a for loop to get all the columns that match the column list
for rat in ratingcollist:
    cols = df.filter(like=rat).columns
    fy21cols = df[cols].filter(like='_fy21').columns
    deltacols = df[cols].filter(like='_delta').columns
    if len(cols) > 0:
        df[f'{rat.lower()}final'] = (df[fy21cols].values.astype(str) + '%' + '(' + df[deltacols].values.astype(str) + ')')
You can do this:
def yourfunction(ratingcol):
    x=df.filter(regex=f'{ratingcol}(_delta|_fy21)')
    fy=x.filter(regex='21').iloc[:,0].astype(str)
    delta=x.filter(regex='_delta').iloc[:,0].astype(str)
    return(fy+"%("+delta+")")
yourfunction('leadership')
0 88%(38)
1 22%(-60)
2 41%(-53)
3 90%(24)
Then, using a for loop you can create your columns
for i in ratingcollist:
    df[f"{i}_final"]=yourfunction(i)
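Note that the loop will fail for any name in ratingcollist that has no matching columns, such as wellbeing or teamwork in the sample. Here is a sketch of a variant that skips those and also produces the "88% (38)" spacing from the desired output. It assumes the columns are named exactly <rating>_fy21 and <rating>_delta, and the sample frame is rebuilt from the question with the fy20 columns left out.

```python
import pandas as pd

# Sample rows from the question (fy20 columns omitted for brevity).
df = pd.DataFrame({
    "Employee": ["patrick.t#abc.com", "johnson.g#abc.com",
                 "pamela.u#abc.com", "yasmine.a#abc.com"],
    "leadership_fy21": [88, 22, 41, 90],
    "leadership_delta": [38, -60, -53, 24],
    "comms_fy21": [90, 80, 44, 30],
    "comms_delta": [10, -10, -16, 20],
})

ratingcollist = ["leadership", "comms", "wellbeing", "teamwork"]

for name in ratingcollist:
    fy21, delta = f"{name}_fy21", f"{name}_delta"
    # Skip ratings that are not present in this frame.
    if fy21 in df.columns and delta in df.columns:
        # Element-wise string concatenation on Series, e.g. "88% (38)".
        df[f"{name}_final"] = (
            df[fy21].astype(str) + "% (" + df[delta].astype(str) + ")"
        )

print(df[["Employee", "leadership_final", "comms_final"]])
```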
I am just trying to print the Unicode number ranging from 1 to 100 in python. I have searched a lot in StackOverflow but no question answers my queries.
So basically I want to print Bengali numbers from ১ to ১০০. The corresponding English number is 1 to 100.
What I have tried is to get the Unicode number of ১ which is '\u09E7'. Then I have tried to increase this number by 1 as depicted in the following code:
x = '\u09E7'
print(x+1)
But the above code gives me the following error:
TypeError: can only concatenate str (not "int") to str
So what I want is to get a number series as following:
১, ২, ৩, ৪, ৫, ৬, ৭, ৮, ৯, ১০, ১১, ১২, ১৩, ............, ১০০
I wish if there is any solution to this. Thank you.
Make a translation table. The function str.maketrans() takes a string of characters and a string of replacements and builds a translation dictionary of Unicode ordinals to Unicode ordinals. Then, convert a counter variable to a string and use the translate() function on the result to convert the string:
#coding:utf8
xlat = str.maketrans('0123456789','০১২৩৪৫৬৭৮৯')
for i in range(1,101):
    print(f'{i:3d} {str(i).translate(xlat)}',end=' ')
Output:
1 ১ 2 ২ 3 ৩ 4 ৪ 5 ৫ 6 ৬ 7 ৭ 8 ৮ 9 ৯ 10 ১০ 11 ১১ 12 ১২ 13 ১৩ 14 ১৪ 15 ১৫ 16 ১৬ 17 ১৭ 18 ১৮ 19 ১৯ 20 ২০ 21 ২১ 22 ২২ 23 ২৩ 24 ২৪ 25 ২৫ 26 ২৬ 27 ২৭ 28 ২৮ 29 ২৯ 30 ৩০ 31 ৩১ 32 ৩২ 33 ৩৩ 34 ৩৪ 35 ৩৫ 36 ৩৬ 37 ৩৭ 38 ৩৮ 39 ৩৯ 40 ৪০ 41 ৪১ 42 ৪২ 43 ৪৩ 44 ৪৪ 45 ৪৫ 46 ৪৬ 47 ৪৭ 48 ৪৮ 49 ৪৯ 50 ৫০ 51 ৫১ 52 ৫২ 53 ৫৩ 54 ৫৪ 55 ৫৫ 56 ৫৬ 57 ৫৭ 58 ৫৮ 59 ৫৯ 60 ৬০ 61 ৬১ 62 ৬২ 63 ৬৩ 64 ৬৪ 65 ৬৫ 66 ৬৬ 67 ৬৭ 68 ৬৮ 69 ৬৯ 70 ৭০ 71 ৭১ 72 ৭২ 73 ৭৩ 74 ৭৪ 75 ৭৫ 76 ৭৬ 77 ৭৭ 78 ৭৮ 79 ৭৯ 80 ৮০ 81 ৮১ 82 ৮২ 83 ৮৩ 84 ৮৪ 85 ৮৫ 86 ৮৬ 87 ৮৭ 88 ৮৮ 89 ৮৯ 90 ৯০ 91 ৯১ 92 ৯২ 93 ৯৩ 94 ৯৪ 95 ৯৫ 96 ৯৬ 97 ৯৭ 98 ৯৮ 99 ৯৯ 100 ১০০
You can try this. Convert the character to an integer, do the addition, and then convert it back to a character. If the number is 10 or greater you have to convert each digit separately, which is why we use modulo (%).
num = 42  # example input; any integer from 0 to 100
x = ord('\u09E6')  # code point of the Bengali digit zero
if num < 10:
    print(chr(x + num))
elif num < 100:
    mod = num % 10    # ones digit
    tens = num // 10  # tens digit
    print(''.join([chr(x + tens), chr(x + mod)]))
else:
    # only 100 is handled in this branch
    print(''.join([chr(x + 1), '\u09E6', '\u09E6']))
You can try running it here
https://repl.it/repls/GloomyBewitchedMultitasking
EDIT:
Providing also javascript code as asked in comments.
function getAsciiNum(num){
    const zero = "০".charCodeAt(0)
    if (num < 10){
        return String.fromCharCode(zero+num)
    }
    else if (num < 100) {
        const mod = num % 10
        const tens = Math.floor((num - mod) / 10)
        return String.fromCharCode(zero+tens) + String.fromCharCode(zero+mod)
    }
    else {
        return String.fromCharCode(zero+1) + "০০"
    }
}
console.log(getAsciiNum(88))
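The offset idea also works digit by digit, which removes the special cases for two- and three-digit numbers. This is a minimal Python sketch. The function name to_bengali is made up for illustration, and it relies on the Bengali digits ০ to ৯ occupying the consecutive code points U+09E6 to U+09EF.

```python
BENGALI_ZERO = ord('\u09E6')  # code point of the Bengali digit zero


def to_bengali(number):
    """Return a non-negative integer written with Bengali digits."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    # Shift every decimal digit by the offset of the Bengali zero.
    return ''.join(chr(BENGALI_ZERO + int(digit)) for digit in str(number))


# Print the series from 1 to 100, separated by commas.
print(', '.join(to_bengali(n) for n in range(1, 101)))
```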
I might be doing something wrong but I can't figure out what it is. I'm trying to reproduce some results from a real estate dataset from Baton Rouge, LA. The original code is written in WinBUGS here. There are some minor differences between the dataset used in the link above and the one I'm using right now, but I don't think they are significant. This is the code:
import pymc as pm, pandas as pd, numpy as np
from scipy.spatial.distance import pdist, squareform
from numpy.linalg import inv
# Loading dataset
df = pd.read_table('http://pastebin.com/raw.php?i=41us4HVj', sep=' ')
# Setting priors
beta = pm.Normal('beta', 0.0, 0.1, size=3)
mu = pm.Lambda('mu', lambda b=beta:
               b[0]+b[1]*df['LivingArea']/1000.0+b[2]*df['Age'])
tau = pm.Gamma('tau', 0.1, 0.1)
phi = pm.Uniform('phi', 0.1, 10)
# Trying to build a covariate matrix
A = squareform(pdist(np.array(zip(df['Latitude'], df['Longitude']))))
# Using the powered exponential to obtain a precision matrix
precision = pm.Lambda('exp', lambda u=A, tau=tau, phi=phi, kappa=1:
                      inv((1/tau)*np.exp(-(phi*u)**kappa)))
If I inspect the value of mu, I get this:
mu.value
Out[2]:
0 24.568272
1 2.909063
2 -2.778916
3 28.206696
4 -0.270571
5 -2.865153
6 14.158162
7 31.466438
8 44.681351
9 22.191397
10 -6.412350
11 11.709424
12 25.453254
13 24.366674
14 34.711048
...
55 24.625763
56 21.763089
57 65.108136
58 15.428714
59 20.992329
60 36.384037
61 16.730507
62 23.021763
63 54.887747
64 30.612696
65 52.685840
66 59.612372
67 18.822422
68 18.940658
69 72.678188
Length: 70, dtype: float64
However, after running MvNormal, the value of mu is changed:
w = pm.MvNormal('w', mu, precision)
mu.value
Out[4]:
0 -107.913779
1 -1.243466
2 8.283926
3 26.412651
4 1.806728
5 -1.300734
6 -80.657396
7 71.614343
8 -3.817774
9 -10.283683
10 -3.804962
11 8.639403
12 18.927553
13 -10.004095
14 -37.431770
...
55 88.612179
56 18.011459
57 -7.421157
58 7.974531
59 -3.697444
60 -17.520367
61 36.453531
62 -39.235745
63 -6.701737
64 68.672902
65 -44.040923
66 11.826075
67 -21.995198
68 -15.886362
69 4.653335
Length: 70, dtype: float64
By the way, this only happens to mu. The precision variable remains the same.
Did I make a mistake?
UPDATE:
Already filed an issue on GitHub. After further inspection, the culprit seems to be the pd.Series object that is used in the mu variable. If I convert or remove the Series, mu won't change after calling MvNormal.
Thanks!
Below is a data matrix collected from sensors, just integer values, nothing special.
A B C D E F G H I J K
1 25 0 25 66 41 47 40 12 69 76 1
2 17 23 73 97 99 39 84 26 0 44 45
3 34 15 55 4 77 2 96 92 22 18 71
4 85 4 71 99 66 42 28 41 27 39 75
5 65 27 28 95 82 56 23 44 97 42 38
…
10 95 13 4 10 50 78 4 52 51 86 20
11 71 12 32 9 2 41 41 23 31 70
12 54 31 68 78 55 19 56 99 67 34 94
13 47 68 79 66 10 23 67 42 16 11 96
14 25 12 88 45 71 87 53 21 96 34 41
The columns A to K are the sensor names, and each row holds the readings from all sensors at one timer tick.
Now I want to analyze this data with trial-and-error methods. I have defined some concepts to explain what I want:
o source
source is all the raw data I get
o entry
An entry is the set of values from all sensors A to K at one time step. Taking the first row as an example, the entry is
25 0 25 66 41 47 40 12 69 76 1
o rules
A rule is a "suppose" function that returns an assertion value, so far just "true" or "false".
For example, I suppose that the values of sensors A, E and F will never all be the same in one entry. If an entry has A=E=F, it triggers a violation and this rule function returns false.
o range:
A range is a function for selecting entries (rows), for example the first 5 entries.
Then, the basic idea is:
o source + range = subsource(s)
o subsource + rules = validation(s)
Finally, I want to get a list that might look like this:
rangeID ruleID violation
1 1 Y
2 1 N
3 1 Y
1 2 N
2 2 N
3 2 Y
1 3 N
2 3 Y
3 3 Y
But the problem is that the rules and ranges I defined here get very complicated once you look deeper, because there are too many possible combinations. Take "A=E=F" for example: one can also define "B=E=F", "C=E=F", "C>F" ......
So I will soon need a rule/range generator that accepts "core parameters" such as "A=E=F" as input, perhaps even as a regex string later. That is too complicated and has defeated me, not to mention that I may need to persist unique rule IDs and deal with data storage, nested combinations of rules ......
So my questions are:
Does anyone know of a module or software that fits this kind of trial-and-error calculation or the rule definitions I want?
Can anyone suggest a better design for the rules/ranges I described?
Thanks for any hints.
Rgs,
KC
If I understand what you're asking correctly, I probably wouldn't even venture down the NumPy path, as I don't think, given your description, that it's really required. Here's a sample implementation of how I might go about solving the specific issue that you presented:
l = [
    {'a':25, 'b':0, 'c':25, 'd':66, 'e':41, 'f':47, 'g':40, 'h':12, 'i':69, 'j':76, 'k':1},
    {'a':25, 'b':0, 'c':25, 'd':66, 'e':41, 'f':47, 'g':40, 'h':12, 'i':69, 'j':76, 'k':1}
]
r = ['a=g=i', 'a=b', 'a=c']
res = []
# test all given rules
for n in range(0, len(r)):
    # i'm assuming equality here - you'd have to change this to accept other operators if needed
    c = r[n].split('=')
    vals = []
    # build up a list of values given our current rule
    for e in c:
        vals.append(l[0][e])
    # using len(set(vals)) gives us the number of distinct values
    res.append({'rangeID': 0, 'ruleID':n, 'violation':'Y' if len(set(vals)) == 1 else 'N'})
print(res)
Output:
[{'rangeID': 0, 'ruleID': 0, 'violation': 'N'}, {'rangeID': 0, 'ruleID': 1, 'violation': 'N'}, {'rangeID': 0, 'ruleID': 2, 'violation': 'Y'}]
http://ideone.com/zbTZr
There are a few assumptions made here (such as equality being the only operator in use in your rules) and some functionality left out (such as parsing your input to the list of dicts I used), but I'm hopeful that you can figure that out on your own.
Of course, there could be a Numpy-based solution that's simpler than this that I'm just not thinking of at the moment (it's late and I'm going to bed now ;)), but hopefully this helps you out anyway.
Edit:
Whoops, missed something else (forgot to add it in prior to posting): I only test the first element in l (the given range). You'd just want to stick that in another for loop rather than using that hard-coded 0 index.
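Here is a rough sketch of what that extra loop could look like, in Python 3. The names entries, ranges, violates and evaluate are made up for illustration, the two sample entries are shortened to six sensors, and I'm assuming a range counts as violated if any entry in it breaks the rule. Adjust that to whatever your definition is.

```python
# Hypothetical sample: two entries keyed by lower-case sensor name.
entries = [
    {'a': 25, 'b': 0, 'c': 25, 'd': 66, 'e': 41, 'f': 47},
    {'a': 17, 'b': 23, 'c': 73, 'd': 97, 'e': 99, 'f': 39},
]
rules = ['a=e=f', 'a=b', 'a=c']

# Hypothetical ranges: each one is a slice that selects some entries.
ranges = [slice(0, 1), slice(0, 2)]


def violates(entry, rule):
    """True if all sensors named in an equality rule hold the same value."""
    values = [entry[name] for name in rule.split('=')]
    return len(set(values)) == 1


def evaluate(entries, ranges, rules):
    """Check every rule against every range of entries."""
    results = []
    for rule_id, rule in enumerate(rules):
        for range_id, rng in enumerate(ranges):
            # Assumption: one offending entry is enough to flag the range.
            hit = any(violates(entry, rule) for entry in entries[rng])
            results.append({'rangeID': range_id, 'ruleID': rule_id,
                            'violation': 'Y' if hit else 'N'})
    return results


results = evaluate(entries, ranges, rules)
for row in results:
    print(row['rangeID'], row['ruleID'], row['violation'])
```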
You may want to look at NumPy for matrix-like data structures. It provides a set of functions for matrix manipulation.
As for the rule/range generator, I am afraid you will have to build your own domain-specific language to achieve that.
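A very small starting point for such a language might look like the sketch below. It is only an illustration under some assumptions: rules are plain strings that use a single operator out of =, > and <, and the names OPS, generate_rules and check are made up. A position in the generated list could serve as a rule ID.

```python
import itertools
import operator

# Supported comparison symbols mapped to their functions (an assumption).
OPS = {'=': operator.eq, '>': operator.gt, '<': operator.lt}


def generate_rules(sensors, ops=('=',), size=2):
    """Yield rule strings such as 'a=b' for every combination of sensors."""
    for names in itertools.combinations(sensors, size):
        for op in ops:
            yield op.join(names)


def check(entry, rule):
    """Return True when the chained comparison in the rule holds."""
    for symbol, func in OPS.items():
        if symbol in rule:
            values = [entry[name] for name in rule.split(symbol)]
            # Compare neighbouring values: a=e=f means a == e and e == f.
            return all(func(x, y) for x, y in zip(values, values[1:]))
    raise ValueError(f"no known operator in rule {rule!r}")


entry = {'a': 25, 'b': 0, 'c': 25}
for rule_id, rule in enumerate(generate_rules(['a', 'b', 'c'], ops=('=', '>'))):
    print(rule_id, rule, check(entry, rule))
```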