Manipulating data in Pandas

This is my data:
Number Name Points Math Points BG Wish
0 1 Огнян 50 65 MT
1 2 Момчил 61 27 MT
2 3 Радослав 68 68 MT
3 4 Павел 28 16 MT
4 10 Виктор 67 76 MT
5 11 Петър 26 68 BT
6 12 Антон 64 58 BT
7 13 Васил 29 42 BT
8 20 Виктория 62 67 BT
That's my code:
df = pd.read_csv('Input_data.csv', encoding='utf-8-sig')
df['Total'] = df.iloc[:, 2:].sum(axis=1)
df = df.sort_values(['Total', 'Name'], ascending=[0, 1])
df_5.to_excel("BT RANKING_5.xlsx", encoding='utf-8-sig', index=False)
For each person with Wish == MT, I want to double the score in the Points Math column.
I tried:
df.loc[df['Wish'] == 'MT', 'Points Math'] = df.loc[df['Points Math'] * 2]
but this didn't work. I also tried an if statement and a for loop, but they didn't work either.
What's the appropriate syntax for this logic?

Use np.where:
import numpy as np
df['Points_Math'] = np.where(df['Wish'] == 'MT', df['Points Math'] * 2, df['Points Math'])
This creates a new column, 'Points_Math', with the desired result. To overwrite the original column instead, replace 'Points_Math' with 'Points Math'.
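
The same update can also be written with a boolean mask and .loc, which is closer to the attempt in the question. That attempt failed because df.loc[df['Points Math'] * 2] treats the doubled scores as row labels to look up. The sketch below assumes a small stand-in frame with two of the rows from the question; the variable name mask is only illustrative.

```python
import pandas as pd

# Stand-in data: two rows taken from the question's table.
df = pd.DataFrame({
    "Name": ["Огнян", "Петър"],
    "Points Math": [50, 26],
    "Wish": ["MT", "BT"],
})

# Boolean mask selecting the rows whose Wish is MT.
mask = df["Wish"] == "MT"

# Use the same mask on both sides so only the selected rows are doubled.
df.loc[mask, "Points Math"] = df.loc[mask, "Points Math"] * 2

print(df)
```

Rows that do not match the mask are left untouched, so no "else" value is needed as it is with np.where.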

Related

Converting time format to second in a panda dataframe

I have a df with time data and I would like to convert these values to seconds (see example below).
Compression_level Size (M) Real time (s) User time (s) Sys time (s)
0 0 265 0:19.938 0:24.649 0:3.062
1 1 76 0:17.910 0:25.929 0:3.098
2 2 74 1:02.619 0:27.724 0:3.014
3 3 73 0:20.607 0:27.937 0:3.193
4 4 67 0:19.598 0:28.853 0:2.925
5 5 67 0:21.032 0:30.119 0:3.206
6 6 66 0:27.013 0:31.462 0:3.106
7 7 65 0:27.337 0:36.226 0:3.060
8 8 64 0:37.651 0:47.246 0:2.933
9 9 64 0:59.241 1:8.333 0:3.027
This is the output I would like to obtain.
df["Real time (s)"]
0 19.938
1 17.910
2 62.619
...
I have some code that does the conversion for a single value, but I do not know how to apply it across a data frame:
x = time.strptime("00:01:00","%H:%M:%S")
datetime.timedelta(hours=x.tm_hour,minutes=x.tm_min, seconds=x.tm_sec).total_seconds()

Prepend 00: (zero hours) to each value with Series.radd, pass the result to to_timedelta, and then call Series.dt.total_seconds:
df["Real time (s)"] = pd.to_timedelta(df["Real time (s)"].radd('00:')).dt.total_seconds()
print (df)
Compression_level Size (M) Real time (s) User time (s) Sys time (s)
0 0 265 19.938 0:24.649 0:3.062
1 1 76 17.910 0:25.929 0:3.098
2 2 74 62.619 0:27.724 0:3.014
3 3 73 20.607 0:27.937 0:3.193
4 4 67 19.598 0:28.853 0:2.925
5 5 67 21.032 0:30.119 0:3.206
6 6 66 27.013 0:31.462 0:3.106
7 7 65 27.337 0:36.226 0:3.060
8 8 64 37.651 0:47.246 0:2.933
9 9 64 59.241 1:8.333 0:3.027
Solution for processing multiple columns:
def to_td(x):
    return pd.to_timedelta(x.radd('00:')).dt.total_seconds()
cols = ["Real time (s)", "User time (s)", "Sys time (s)"]
df[cols] = df[cols].apply(to_td)
print (df)
Compression_level Size (M) Real time (s) User time (s) Sys time (s)
0 0 265 19.938 24.649 3.062
1 1 76 17.910 25.929 3.098
2 2 74 62.619 27.724 3.014
3 3 73 20.607 27.937 3.193
4 4 67 19.598 28.853 2.925
5 5 67 21.032 30.119 3.206
6 6 66 27.013 31.462 3.106
7 7 65 27.337 36.226 3.060
8 8 64 37.651 47.246 2.933
9 9 64 59.241 68.333 3.027
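
If you would rather not depend on how to_timedelta parses the strings, a rough alternative is to split each value on the colon and do the arithmetic directly. This is only a sketch: it assumes every value has exactly the form minutes:seconds, and the helper name mmss_to_seconds is made up for illustration.

```python
import pandas as pd

def mmss_to_seconds(col: pd.Series) -> pd.Series:
    """Convert strings like '1:02.619' (minutes:seconds) to float seconds."""
    # Split into two columns (minutes, seconds) and convert both to float.
    parts = col.str.split(":", expand=True).astype(float)
    return parts[0] * 60 + parts[1]

# Stand-in data: three values taken from the question's table.
df = pd.DataFrame({
    "Real time (s)": ["0:19.938", "1:02.619", "0:59.241"],
    "User time (s)": ["0:24.649", "0:27.724", "1:8.333"],
})

cols = ["Real time (s)", "User time (s)"]
df[cols] = df[cols].apply(mmss_to_seconds)

print(df)
```

Values that include an hours part would need a third component, which this sketch does not handle.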

Multiplying an entire df or matrix by 1000?

I am new to R and Python, so forgive me if this is an elementary question. I have a large data set of genes (columns) by patients (rows), with each value being an RNA expression value (most values falling between 0 and 1). I want to multiply the entire data set by 1000 so that all non-zero values will be >1.
Currently:
Pt GeneA GeneB GeneC
1 0.001 2 0
2 0 0.5 0.002
Would like:
Pt GeneA GeneB GeneC
1 1 2000 0
2 0 500 2
I have tried to do this in both R and Python and am running into issues with both. I have also tried converting my data between data frame and matrix, and it won't work with either. I have searched extensively on this website and find information about how to multiply an entire df/matrix by a vector, or individual columns by a scalar, but not the entire thing. Could someone kindly point me in the right direction? I feel like it can't possibly be this hard :)
Using R:
df <- read.csv("/Users/m/Desktop/data.csv")
df * 100
In Ops.factor(left, right) : ‘*’ not meaningful for factors
mtx <- as.matrix(df)
mtx * 100
Error in mtx * 100 : non-numeric argument to binary operator
Using Python 3.7.6:
df = df * 1000
^ This runs without an error message but the values in the cells are exactly the same, so it didn't actually multiply anything...
df = df.div(.001)
TypeError: unsupported operand type(s) for /: 'str' and 'float'
Any creative ideas or resources to point me in the right direction? Thank you!

What does str(df) give you? At least some of your columns have been read as factors because they contain character strings. Open the CSV file in a text editor and make sure that the numbers are not surrounded by quotes ("") and that missing values have not been labeled with a character string. Once the data is read in properly, the multiplication is simple:
set.seed(42)
dat <- data.frame(matrix(sample.int(100, 100, replace=TRUE), 10, 10))
str(dat)
# 'data.frame': 10 obs. of 10 variables:
# $ X1 : int 49 65 25 74 100 18 49 47 24 71
# $ X2 : int 100 89 37 20 26 3 41 89 27 36
# $ X3 : int 95 5 84 34 92 3 58 97 42 24
# $ X4 : int 30 43 15 22 58 8 36 68 86 18
# $ X5 : int 92 69 4 98 50 99 88 87 49 26
# $ X6 : int 6 6 2 3 21 2 58 10 40 5
# $ X7 : int 33 49 100 73 29 76 84 9 35 93
# $ X8 : int 16 92 69 92 2 82 24 18 69 55
# $ X9 : int 40 21 100 57 100 42 18 91 13 53
# $ X10: int 54 83 32 80 60 29 81 73 85 43
dat1000 <- dat * 1000

Try this option, which multiplies every column except the first:
df[, 2:ncol(df)] <- 1000 * df[, 2:ncol(df)]
If you instead wanted a perhaps more generic solution targeting only columns whose name starts with Gene, then use:
df[grep("^Gene", names(df))] <- 1000*df[grep("^Gene", names(df))]

Looking at your target result, you need to multiply every column except Pt. In Python:
target_cols = [i for i in df.columns if i != 'Pt']
for i in target_cols:
    df[i] = df[i].astype(float)
    df[i] = df[i] * 1000
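
The loop can also be written without iterating over columns. The TypeError in the question suggests that at least some columns were read as strings, so this sketch converts them with pd.to_numeric first. It assumes a small stand-in frame shaped like the example in the question; errors="coerce" is a choice made here that turns unparseable cells into NaN rather than raising.

```python
import pandas as pd

# Stand-in data: gene columns stored as strings, as the TypeError suggests.
df = pd.DataFrame({
    "Pt": [1, 2],
    "GeneA": ["0.001", "0"],
    "GeneB": ["2", "0.5"],
    "GeneC": ["0", "0.002"],
})

# Every column except the patient identifier.
gene_cols = df.columns.drop("Pt")

# Convert to numbers, then scale all gene columns at once.
df[gene_cols] = df[gene_cols].apply(pd.to_numeric, errors="coerce") * 1000

print(df)
```

Checking df.dtypes before and after the conversion is a quick way to confirm which columns were not numeric to begin with.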

Create a pandas dataframe from dictionary whilst maintaining order of columns

When creating a dataframe as below (instructions from here), the order of the columns changes from "Day, Visitors, Bounce Rate" to "Bounce Rate, Day, Visitors"
import pandas as pd
web_stats = {'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce Rate':[65,67,78,65,45,52]}
df = pd.DataFrame(web_stats)
Gives:
Bounce Rate Day Visitors
0 65 1 43
1 67 2 34
2 78 3 65
3 65 4 56
4 45 5 29
5 52 6 76
How can the order be kept intact (i.e. Day, Visitors, Bounce Rate)?

One approach is to pass the column order explicitly through the columns argument.
Example:
import pandas as pd
web_stats = {'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce Rate':[65,67,78,65,45,52]}
df = pd.DataFrame(web_stats, columns = ['Day', 'Visitors', 'Bounce Rate'])
print(df)
Output:
Day Visitors Bounce Rate
0 1 43 65
1 2 34 67
2 3 65 78
3 4 56 65
4 5 29 45
5 6 76 52

Dictionaries are not considered to be ordered in Python <3.7.
You can use collections.OrderedDict instead:
from collections import OrderedDict
web_stats = OrderedDict([('Day', [1,2,3,4,5,6]),
('Visitors', [43,34,65,56,29,76]),
('Bounce Rate', [65,67,78,65,45,52])])
df = pd.DataFrame(web_stats)

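If you don't want to type out the column names, which becomes inconvenient when there are many keys, you can pass the dictionary's keys instead. This only preserves the order if web_stats is itself ordered (an OrderedDict, or a plain dict on Python 3.7+):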
df = pd.DataFrame(web_stats, columns = web_stats.keys())
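
On Python 3.7+ with a recent pandas, a plain dict keeps its insertion order, so the columns should come out as written without any extra argument. The sketch below assumes such an environment and also shows reordering an existing frame by selecting the columns in the wanted order; the name reordered is only illustrative.

```python
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}

# With an insertion-ordered dict, the column order follows the dict.
df = pd.DataFrame(web_stats)
print(list(df.columns))

# An existing frame can be reordered by selecting columns explicitly.
reordered = df[['Bounce Rate', 'Day', 'Visitors']]
print(list(reordered.columns))
```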

Pandas sort() ignoring negative sign

I want to sort a pandas df but I'm having problems with the negative values.
import pandas as pd
df = pd.read_csv('File.txt', sep='\t', header=None)
#Suppress scientific notation (finally)
pd.set_option('display.float_format', lambda x: '%.8f' % x)
print(df)
print(df.dtypes)
print(df.shape)
b = df.sort(axis=0, ascending=True)
print(b)
This gives me the ascending order but completely disregards the sign.
SPATA1 -0.00000005
HMBOX1 0.00000005
SLC38A11 -0.00000005
RP11-571M6.17 0.00000004
GNRH1 -0.00000004
PCDHB8 -0.00000004
CXCL1 0.00000004
RP11-48B3.3 -0.00000004
RNFT2 -0.00000004
GRIK3 -0.00000004
ZNF483 0.00000004
RP11-627G18.1 0.00000003
Any ideas what I'm doing wrong?
Thanks

Loading your file with:
df = pd.read_csv('File.txt', sep='\t', header=None)
Since sort(...) is deprecated (and was removed in pandas 0.20), use sort_values:
b = df.sort_values(by=[1], axis=0, ascending=True)
where [1] is your column of values. For me this returns:
0 1
0 ACTA1 -0.582570
1 MT-CO1 -0.543877
2 CKM -0.338265
3 MT-ND1 -0.306239
5 MT-CYB -0.128241
6 PDK4 -0.119309
8 GAPDH -0.090912
9 MYH1 -0.087777
12 RP5-940J5.9 -0.074280
13 MYH2 -0.072261
16 MT-ND2 -0.052551
18 MYL1 -0.049142
19 DES -0.048289
20 ALDOA -0.047661
22 ENO3 -0.046251
23 MT-CO2 -0.043684
26 RP11-799N11.1 -0.034972
28 TNNT3 -0.032226
29 MYBPC2 -0.030861
32 TNNI2 -0.026707
33 KLHL41 -0.026669
34 SOD2 -0.026166
35 GLUL -0.026122
42 TRIM63 -0.022971
47 FLNC -0.018180
48 ATP2A1 -0.017752
49 PYGM -0.016934
55 hsa-mir-6723 -0.015859
56 MT1A -0.015110
57 LDHA -0.014955
.. ... ...
60 RP1-178F15.4 0.013383
58 HSPB1 0.014894
54 UBB 0.015874
53 MIR1282 0.016318
52 ALDH2 0.016441
51 FTL 0.016543
50 RP11-317J10.2 0.016799
46 RP11-290D2.6 0.018803
45 RRAD 0.019449
44 MYF6 0.019954
43 STAC3 0.021931
41 RP11-138I1.4 0.023031
40 MYBPC1 0.024407
39 PDLIM3 0.025442
38 ANKRD1 0.025458
37 FTH1 0.025526
36 MT-RNR2 0.025887
31 HSPB6 0.027680
30 RP11-451G4.2 0.029969
27 AC002398.12 0.033219
25 MT-RNR1 0.040741
24 TNNC1 0.042251
21 TNNT1 0.047177
17 MT-ND3 0.051963
15 MTND1P23 0.059405
14 MB 0.063896
11 MYL2 0.076358
10 MT-ND5 0.076479
7 CA3 0.100221
4 MT-ND6 0.140729
[18152 rows x 2 columns]
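
To see the difference between sorting by value and sorting by row label on a small scale, here is a minimal sketch. It assumes a four-row stand-in frame built from gene names and values in the question, with integer column labels 0 and 1 as produced by header=None.

```python
import pandas as pd

# Stand-in data: column 0 holds gene names, column 1 holds signed values.
df = pd.DataFrame({
    0: ["SPATA1", "HMBOX1", "GNRH1", "CXCL1"],
    1: [-0.00000005, 0.00000005, -0.00000004, 0.00000004],
})

# Sorting by the value column respects the sign: negatives come first.
by_value = df.sort_values(by=1, ascending=True)
print(by_value)

# Sorting by the index only orders the row labels and ignores the values.
by_index = df.sort_index()
print(by_index)
```

If the value column has dtype object rather than float, converting it with pd.to_numeric before sorting is worth trying, since strings sort lexicographically.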

How to make table with multi-tier row header (index) using Pandas

I have the following data:
# colh1 rh1 rh2 rh3/up rh4/down
AddaVax ID LV 29 18
AddaVax ID SP 16 13
AddaVax ID LN 61 73
ADX ID LV 11 14
ADX IP LV 160 88
ADX ID SP 14 13
ADX IP SP 346 129
ADX ID LN 25 25
What I'd like to do is to make a table that looks like this
(later to be written in text or Excel file):
The actual data contain more than 2 columns but the number of rows
is always fixed (i.e. 10 rows).
I'm stuck with the following code:
import csv
import pandas as pd
from collections import defaultdict
dod = defaultdict(dict)
with open("mediate.txt", 'r') as tsvfile:
    tabreader = csv.reader(tsvfile, delimiter=' ')
    for row in tabreader:
        if "#" in row[0]: continue
        colh1, rh1, rh2, rhup, rhdown = row
        dod["colh1"] = colh1
        dod["rh1"] = rh1
        dod["rh2"] = rh2
        dod["rhup"] = rhup
        dod["rhdown"] = rhdown
What's the way to do it?

Just using Pandas:
import pandas as pd
df = pd.read_csv('mediate.txt', sep='\t') # or sep=',' if comma delimited.
df.rename(columns={'rh3/up': 'Up', 'rh4/down': 'Down'}, inplace=True)
result = df.pivot_table(values=['Up', 'Down'],
columns='colh1',
index=['rh1', 'rh2']).stack(0) # Stack Up/Down
>>> result
colh1            ADX  AddaVax
rh1 rh2
ID  LN  Up        25       61
        Down      25       73
    LV  Up        11       29
        Down      14       18
    SP  Up        14       16
        Down      13       13
IP  LV  Up       160      NaN
        Down      88      NaN
    SP  Up       346      NaN
        Down     129      NaN
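
For a version that can be run without the input file, the sketch below inlines the sample rows as a space-separated string and applies the same pivot_table and stack steps. It assumes the header has already been simplified to Up and Down, and that each (colh1, rh1, rh2) combination appears once, so the default mean aggregation leaves the values unchanged apart from making them floats.

```python
import io
import pandas as pd

# Sample rows from the question, with the header simplified to Up/Down.
raw = """colh1 rh1 rh2 Up Down
AddaVax ID LV 29 18
AddaVax ID SP 16 13
AddaVax ID LN 61 73
ADX ID LV 11 14
ADX IP LV 160 88
ADX ID SP 14 13
ADX IP SP 346 129
ADX ID LN 25 25
"""

df = pd.read_csv(io.StringIO(raw), sep=" ")

# Pivot so that colh1 values become columns, then move the Up/Down
# level from the columns into the row index.
result = df.pivot_table(values=["Up", "Down"],
                        columns="colh1",
                        index=["rh1", "rh2"]).stack(0)

print(result)
```

The result can then be written out with result.to_csv or result.to_excel, both of which keep the multi-level row index.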
