I'm trying to normalize a Pandas DataFrame by row, and a column of string values is causing me a lot of trouble. Does anyone have a neat way to make this work?
For example:
system Fluency Terminology No-error Accuracy Locale convention Other
19 hyp.metricsystem2 111 28 219 98 0 133
18 hyp.metricsystem1 97 22 242 84 0 137
22 hyp.metricsystem5 107 11 246 85 0 127
17 hyp.eTranslation 49 30 262 80 0 143
20 hyp.metricsystem3 86 23 263 89 0 118
21 hyp.metricsystem4 74 17 274 70 0 111
I am trying to normalize each row across the columns Fluency, Terminology, ..., Other over the row total. In other words, divide each integer entry by the total of its row (Fluency[0]/total_row[0], Terminology[0]/total_row[0], ...).
I tried this command, but it gives me an error because I have a column of strings:
bad_models.div(bad_models.sum(axis=1), axis = 0)
Any help would be greatly appreciated...
Use select_dtypes to select only the numeric columns:
subset = bad_models.select_dtypes('number')
bad_models[subset.columns] = subset.div(subset.sum(axis=1), axis=0)
print(bad_models)
# Output
system Fluency Terminology No-error Accuracy Locale convention Other
19 hyp.metricsystem2 0.188455 0.047538 0.371817 0.166384 0.0 0.225806
18 hyp.metricsystem1 0.166667 0.037801 0.415808 0.144330 0.0 0.235395
22 hyp.metricsystem5 0.185764 0.019097 0.427083 0.147569 0.0 0.220486
17 hyp.eTranslation 0.086879 0.053191 0.464539 0.141844 0.0 0.253546
20 hyp.metricsystem3 0.148532 0.039724 0.454231 0.153713 0.0 0.203800
21 hyp.metricsystem4 0.135531 0.031136 0.501832 0.128205 0.0 0.203297
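A hedged alternative, assuming the string column is named system as in the printout above: park it in the index so the remaining frame is entirely numeric, then divide by the row sums.
# Sketch only: same row normalization without select_dtypes
normalized = bad_models.set_index('system')
normalized = normalized.div(normalized.sum(axis=1), axis=0)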
I am new to R and Python, so forgive me if this is an elementary question. I have a large data set of genes (columns) by patients (rows), with each value being an RNA expression value (most values falling between 0 and 1). I want to multiply the entire data set by 1000 so that all non-zero values will be >1.
Currently:
Pt GeneA GeneB GeneC
1 0.001 2 0
2 0 0.5 0.002
Would like:
Pt GeneA GeneB GeneC
1 1 2000 0
2 0 500 2
I have tried to do this in both R and Python and am running into issues with both. I have also tried converting my data between data frame and matrix, and it won't work with either. I have searched extensively on this website and find information about how to multiply an entire df/matrix by a vector, or individual columns by a scalar, but not the entire thing. Could someone kindly point me in the right direction? I feel like it can't possibly be this hard :)
Using R:
df <- read.csv("/Users/m/Desktop/data.csv")
df * 100
In Ops.factor(left, right) : ‘*’ not meaningful for factors
mtx <- as.matrix(df)
mtx * 100
Error in mtx * 100 : non-numeric argument to binary operator
Using Python 3.7.6:
df = df * 1000
^ This runs without an error message but the values in the cells are exactly the same, so it didn't actually multiply anything...
df = df.div(.001)
TypeError: unsupported operand type(s) for /: 'str' and 'float'
Any creative ideas or resources to point me in the right direction? Thank you!
What does str(df) give you? At least some of your columns have been converted to factors because they are character strings. Open the csv file in a text editor and make sure the numbers are not surrounded by quotes and that missing values have not been labeled with a character. Once you have read the data in properly, it will be simple:
set.seed(42)
dat <- data.frame(matrix(sample.int(100, 100, replace=TRUE), 10, 10))
str(dat)
# 'data.frame': 10 obs. of 10 variables:
# $ X1 : int 49 65 25 74 100 18 49 47 24 71
# $ X2 : int 100 89 37 20 26 3 41 89 27 36
# $ X3 : int 95 5 84 34 92 3 58 97 42 24
# $ X4 : int 30 43 15 22 58 8 36 68 86 18
# $ X5 : int 92 69 4 98 50 99 88 87 49 26
# $ X6 : int 6 6 2 3 21 2 58 10 40 5
# $ X7 : int 33 49 100 73 29 76 84 9 35 93
# $ X8 : int 16 92 69 92 2 82 24 18 69 55
# $ X9 : int 40 21 100 57 100 42 18 91 13 53
# $ X10: int 54 83 32 80 60 29 81 73 85 43
dat1000 <- dat * 1000
Try this option:
df[, c(2:ncol(df))] <- 1000 * df[, c(2:ncol(df))]
If you instead want a more generic solution that targets only the columns whose names start with Gene, then use:
df[grep("^Gene", names(df))] <- 1000*df[grep("^Gene", names(df))]
Looking at your target result, you need to multiply all columns except Pt. In Python:
target_cols = [i for i in df.columns if i != 'Pt']
for i in target_cols:
    df[i] = df[i].astype(float)
    df[i] = df[i] * 1000
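If the columns were read in as strings, a more vectorized sketch (assuming the frame is named df and the id column is Pt, as above) coerces everything else to numeric and scales it in one step:
import pandas as pd
# Hedged sketch: coerce all non-id columns to numbers, then multiply once.
num_cols = df.columns.difference(['Pt'])
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce') * 1000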
I am new to Python and I have a question. I have an exported .csv with values, and I want to sum each row's values and put the result in a new Total column.
I've tried this, but it doesn't work:
import pandas as pd
wine = pd.read_csv('testelek.csv', 'rb', delimiter=';')
wine['Total'] = [wine[row].sum(axis=1) for row in wine]
I want to make my DataFrame like this.
101 102 103 104 .... Total
__________________________________________________________________________
0 80 84 86 78 .... 328
1 78 76 77 79 .... 310
2 79 81 88 83 .... 331
3 70 85 89 84 .... 328
4 78 84 88 85 .... 335
You can bypass the need for the list comprehension and just use the axis=1 parameter to get what you want.
wine['Total'] = wine.sum(axis=1)
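If the CSV also brings in non-numeric columns, a hedged variant of the same idea (assuming the frame is named wine) restricts the sum to the numeric columns only:
# Sketch: sum only the numeric columns per row
wine['Total'] = wine.select_dtypes('number').sum(axis=1)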
A nice way to do this is by using .apply().
Suppose that you want to create a new column named Total by adding the values per row for the columns named 101, 102, and 103; you can try the following:
wine['Total'] = wine.apply(lambda row: sum([row['101'], row['102'], row['103']]), axis=1)
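Equivalently, a short sketch assuming the column labels really are the strings '101', '102', and '103': let pandas do the per-row sum over just those columns.
wine['Total'] = wine[['101', '102', '103']].sum(axis=1)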
Suppose you have a column in Excel with values like this. Only 5500 numbers are present, but it shows Length: 5602, which means 102 string entries are mixed in:
4 SELECTIO
6 N NO
14 37001
26 37002
38 37003
47 37004
60 37005
73 37006
82 37007
92 37008
105 37009
119 37010
132 37011
143 37012
157 37013
168 37014
184 37015
196 37016
207 37017
220 37018
236 37019
253 37020
267 37021
280 37022
287 Krishan
290 37023
300 37024
316 37025
337 37026
365 37027
...
74141 42471
74154 42472
74169 42473
74184 42474
74200 42475
74216 42476
74233 42477
74242 42478
74256 42479
74271 42480
74290 42481
74309 42482
74323 42483
74336 42484
74350 42485
74365 42486
74378 42487
74389 42488
74398 42489
74413 42490
74430 42491
74446 42492
74459 42493
74474 42494
74491 42495
74504 42496
74516 42497
74530 42498
74544 42499
74558 42500
Name: Selection No., Length: 5602, dtype: object
and I want to get only the numeric values, like this, in Python using pandas:
37001
37002
37003
37004
37005
How can I do this? I have attached my code in Python using pandas:
import re
def selection(sle):
    if sle in re.match('[3-4][0-9]{4}', sle):
        return 1
    else:
        return 0
select['status'] = select['Selection No.'].apply(selection)
and now I am getting an "argument of type 'NoneType' is not iterable" error.
Try using NumPy with np.isreal to select only the numbers:
import pandas as pd
import numpy as np
df = pd.DataFrame({'SELECTIO':['N NO',37002,37003,'Krishan',37004,'singh',37005], 'some_col':[4,6,14,26,38,47,60]})
df
SELECTIO some_col
0 N NO 4
1 37002 6
2 37003 14
3 Krishan 26
4 37004 38
5 singh 47
6 37005 60
df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Or, as another approach, use the numbers module with a lambda:
import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Note: there may be a problem when you extract the column. You are using ['Selection No.'], but if the header actually contains a trailing space it would be ['Selection No. '], which is why you would get a KeyError when executing it. Try it and see!
Your function contains a wrong expression: if sle in re.match('[3-4][0-9]{4}',sle): tries to find the column value sle IN a match object, which "always [has] a boolean value of True" (and re.match returns None when there is no match).
I would suggest proceeding with the pd.Series.str.isnumeric function:
In [544]: df
Out[544]:
Selection No.
0 37001
1 37002
2 37003
3 asnsh
4 37004
5 singh
6 37005
In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)
In [546]: df
Out[546]:
Selection No. Status
0 37001 1
1 37002 1
2 37003 1
3 asnsh 0
4 37004 1
5 singh 0
6 37005 1
If a strict regex pattern is required - use pd.Series.str.contains function:
df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
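If the end goal is the numeric values themselves rather than a 0/1 status column, a hedged sketch (assuming the same 'Selection No.' column) is to coerce and drop whatever fails to parse:
import pandas as pd
# Non-numeric entries become NaN under errors='coerce' and are then dropped.
numeric_only = pd.to_numeric(df['Selection No.'], errors='coerce').dropna().astype(int)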
I have a dataframe (df_input), and I'm trying to convert it to another dataframe (df_output) by applying a formula to each element in each row. The formula requires information about the whole row (min, max, median).
df_input:
A B C D E F G H I J
2011-01-01 60 48 26 29 41 91 93 87 39 65
2011-01-02 88 52 24 99 1 27 12 26 64 87
2011-01-03 13 1 38 60 8 50 59 1 3 76
df_output:
F(A)F(B)F(C)F(D)F(E)F(F)F(G)F(H)F(I)F(J)
2011-01-01 93 54 45 52 8 94 65 37 2 53
2011-01-02 60 44 94 62 78 77 37 97 98 76
2011-01-03 53 58 16 63 60 9 31 44 79 35
I'm trying to go from df_input to df_output, as above, by applying f(x) to each cell per row. The function foo tries to map element x to f(x) by doing an OLS regression of the min, median and max of the row against some coordinates. This is done each period.
I'm aware that I need to iterate over the rows and then, for each row, apply the function to each element. Where I am struggling is getting the output of foo into df_output.
for index, row in df_input.iterrows():
    min = row.min()
    max = row.max()
    mean = row.mean()
    # apply function to row
    new_row = row.apply(lambda x: foo(x, min, max, mean))
    # add this to df_output
help!
My current thinking is to build up the new df row by row. I'm trying to do that, but I'm getting a lot of MultiIndex columns and so on. Any pointers would be great.
thanks so much... merry xmas to you all.
Consider calculating the row aggregates with DataFrame methods and then passing those series into a DataFrame.apply() across columns:
cols = list('ABCDEFGHIJ')
# ROW-WISE AGGREGATES (computed on the original columns only,
# so the helper columns do not distort the mean)
df['row_min'] = df[cols].min(axis=1)
df['row_max'] = df[cols].max(axis=1)
df['row_mean'] = df[cols].mean(axis=1)
# COLUMN-WISE CALCULATION (DEFAULT axis=0)
new_df = df[cols].apply(lambda col: foo(col,
                                        df['row_min'],
                                        df['row_max'],
                                        df['row_mean']))
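Alternatively, a row-wise sketch that assembles df_output directly (assuming foo is the element-wise function from the question): when the function passed to apply(axis=1) returns a Series, pandas stacks those Series into a DataFrame with the same columns.
def transform_row(row):
    # compute the row statistics once, then map every element through foo
    return row.apply(lambda x: foo(x, row.min(), row.max(), row.mean()))
df_output = df_input.apply(transform_row, axis=1)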
I have a dataset like below:
In this dataset, the first column represents the id of a person, the last column is that person's label, and the rest of the columns are the person's features.
101 166 633.0999756 557.5 71.80000305 60.40000153 2.799999952 1 1 -1
101 133 636.2000122 504.3999939 71 56.5 2.799999952 1 2 -1
105 465 663.5 493.7000122 82.80000305 66.40000153 3.299999952 10 3 -1
105 133 635.5999756 495.6000061 89 72 3.599999905 9 6 -1
105 266 633.9000244 582.2000122 93.59999847 81 3.700000048 2 2 -1
105 299 618.4000244 552.4000244 80.19999695 66.59999847 3.200000048 3 64 -1
105 99 615.5999756 575.7000122 80 67 3.200000048 0 0 -1
120 399 617.7000122 583.5 95.80000305 82.40000153 3.799999952 8 10 1
120 266 633.9000244 582.2000122 93.59999847 81 3.700000048 2 2 1
120 299 618.4000244 552.4000244 80.19999695 66.59999847 3.200000048 3 64 1
120 99 615.5999756 575.7000122 80 67 3.200000048 0 0 1
My aim is to classify these people, and I want to use a leave-one-person-out method as the split. So I need to choose one person and all of that person's data as the test data, with the rest of the data for training. But when I try to select the test data with a list assignment operation, it gives an error. This is my code:
import numpy as np
datasets = ["raw_fixationData.txt"]
file_name_array = [101, 105, 120]
for data in datasets:
    data = np.genfromtxt(data, delimiter="\t")
    data = data[1:, :]
    num_line = len(data[:, 1]) - 1
    num_feat = len(data[1, :]) - 2
    label = num_feat + 1
    X = data[0:num_line+1, 1:label]
    y = data[0:num_line+1, label]
    test_prtcpnt = []; test_prtcpnt_label = []; train_prtcpnt = []; train_prtcpnt_label = [];
    for i in range(len(file_name_array)):
        m = 0  # test index
        n = 0  # train index
        for j in range(num_line):
            if X[j, 0] == file_name_array[i]:
                test_prtcpnt[m, 0:10] = X[j, 0:10];
                test_prtcpnt_label[m] = y[j];
                m = m + 1;
            else:
                train_prtcpnt[n, 0:10] = X[j, 0:10];
                train_prtcpnt_label[n] = y[j];
                n = n + 1;
This code gives me this error: test_prtcpnt[m,0:10]=X[j,0:10]; TypeError: list indices must be integers or slices, not tuple
How could I solve this problem?
I think that you are misusing Python's slice notation. Please refer to the following Stack Overflow post on slicing:
Explain Python's slice notation
In this case, the Python interpreter interprets the index in test_prtcpnt[m,0:10] as the tuple (m, slice(0, 10)), and plain Python lists do not accept tuple indices. Is it possible that you meant to say the following:
test_prtcpnt[0:10]=X[0:10]
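Since X and y are already NumPy arrays in the question's code, a hedged sketch of the leave-one-person-out split itself (assuming the person id sits in the first column of data) is to use boolean masks instead of growing plain lists:
# Hypothetical split for the person held out in fold i
ids = data[:, 0]
test_mask = ids == file_name_array[i]
X_test, y_test = X[test_mask], y[test_mask]
X_train, y_train = X[~test_mask], y[~test_mask]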