Converting all object columns to float except for one column - python

Let's say in the dataframe df there is:
a b c d
ana 31% 26% 29%
bob 52% 45% 9%
cal 11% 6% 23%
dan 29% 12% 8%
where all data types under a, b, c and d are objects. I want to convert b, c and d to their decimal forms with:
df.columns = df.columns.str.rstrip('%').astype('float') / 100.0
but I don't know how to exclude column a.

Use DataFrame.update with pd.to_numeric. Column a is coerced to NaN, and update skips NaN values, so a is left unchanged:
df.update(df.apply(lambda x : pd.to_numeric(x.str.rstrip('%'),errors='coerce'))/100)
df
Out[128]:
a b c d
0 ana 0.31 0.26 0.29
1 bob 0.52 0.45 0.09
2 cal 0.11 0.06 0.23
3 dan 0.29 0.12 0.08

Select every column except a with Index.drop, strip the % sign with DataFrame.replace, convert to floats and divide by 100:
cols = df.columns.drop('a')
df[cols] = df[cols].replace('%', '', regex=True).astype('float') / 100.0
print (df)
a b c d
0 ana 0.31 0.26 0.29
1 bob 0.52 0.45 0.09
2 cal 0.11 0.06 0.23
3 dan 0.29 0.12 0.08
Or convert the first column to the index with DataFrame.set_index, so that every remaining column is processed:
df = df.set_index('a').replace('%', '', regex=True).astype('float') / 100.0
print (df)
b c d
a
ana 0.31 0.26 0.29
bob 0.52 0.45 0.09
cal 0.11 0.06 0.23
dan 0.29 0.12 0.08
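A minimal, self-contained sketch of the Index.drop idea, assuming the frame from the question is rebuilt by hand (the construction below is illustrative and not part of the original post). It uses Series.str.rstrip per column instead of DataFrame.replace, which is a small variation on the answer above:

```python
import pandas as pd

# Rebuild the example frame; every column holds strings at this point.
df = pd.DataFrame({
    'a': ['ana', 'bob', 'cal', 'dan'],
    'b': ['31%', '52%', '11%', '29%'],
    'c': ['26%', '45%', '6%', '12%'],
    'd': ['29%', '9%', '23%', '8%'],
})

# Every column label except 'a'.
cols = df.columns.drop('a')

# Strip the trailing '%', cast to float, and scale to a fraction.
df[cols] = df[cols].apply(lambda s: s.str.rstrip('%').astype(float)) / 100.0

print(df.dtypes)
print(df)
```

Column a is never touched, so it keeps its original string values.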

Related

Convert pandas dataframe from row to column [duplicate]

I have the following data frame:
ID value freq
A a 0.1
A b 0.12
A c 0.19
B a 0.15
B b 0.2
B c 0.09
C a 0.39
C b 0.15
C c 0.01
and I would like to get the following
ID freq_a freq_b freq_c
A 0.1 0.12 0.19
B 0.15 0.2 0.09
C 0.39 0.15 0.01
Any ideas how to easily do this?
Using pivot:
df.pivot(index='ID', columns='value', values='freq').add_prefix('freq_').reset_index()
Output:
value ID freq_a freq_b freq_c
0 A 0.10 0.12 0.19
1 B 0.15 0.20 0.09
2 C 0.39 0.15 0.01
Use pivot_table:
out = df.pivot_table('freq', 'ID', 'value').add_prefix('freq_') \
.rename_axis(columns=None).reset_index()
print(out)
# Output
ID freq_a freq_b freq_c
0 A 0.10 0.12 0.19
1 B 0.15 0.20 0.09
2 C 0.39 0.15 0.01
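For anyone who wants to run this end to end, here is a small sketch that rebuilds the question's frame (the construction is my own, not from the post) and applies pivot together with rename_axis so the leftover "value" label on the columns disappears:

```python
import pandas as pd

# Long-format input: one row per (ID, value) pair.
df = pd.DataFrame({
    'ID': list('AAABBBCCC'),
    'value': list('abc') * 3,
    'freq': [0.1, 0.12, 0.19, 0.15, 0.2, 0.09, 0.39, 0.15, 0.01],
})

out = (
    df.pivot(index='ID', columns='value', values='freq')
      .add_prefix('freq_')          # a -> freq_a, b -> freq_b, ...
      .rename_axis(columns=None)    # drop the 'value' name on the column axis
      .reset_index()                # turn ID back into a regular column
)
print(out)
```

One design note: pivot raises an error if an (ID, value) pair occurs more than once, whereas pivot_table aggregates duplicates (mean by default), so the choice depends on whether duplicates are expected.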

Find matching column interval in pandas

I have a pandas dataframe with multiple columns whose values increase from some value between 0 and 1 in column A up to column E, which is always 1 (they represent cumulative probabilities).
ID A B C D E SIM
1: 0.49 0.64 0.86 0.97 1.00 0.98
2: 0.76 0.84 0.98 0.99 1.00 0.87
3: 0.32 0.56 0.72 0.92 1.00 0.12
The column SIM contains uniform random numbers.
I wish to add a new column SIM_CAT whose value is the name of the column that forms the right boundary of the interval in which the SIM value falls:
ID A B C D E SIM SIM_CAT
1: 0.49 0.64 0.86 0.97 1.00 0.98 E
2: 0.76 0.84 0.98 0.99 1.00 0.87 C
3: 0.32 0.56 0.72 0.92 1.00 0.12 A
Is there a concise way to do that?
You can compare the columns with SIM and use idxmax to find the first column whose value is greater than or equal to SIM:
cols = list('ABCDE')
df['SIM_CAT'] = df[cols].ge(df.SIM, axis=0).idxmax(axis=1)
df
ID A B C D E SIM SIM_CAT
0 1: 0.49 0.64 0.86 0.97 1.0 0.98 E
1 2: 0.76 0.84 0.98 0.99 1.0 0.87 C
2 3: 0.32 0.56 0.72 0.92 1.0 0.12 A
If SIM can contain values greater than 1, no column matches and idxmax falls back to the first column, so assign only where SIM <= 1:
cols = list('ABCDE')
df['SIM_CAT'] = None
df.loc[df.SIM <= 1, 'SIM_CAT'] = df[cols].ge(df.SIM, axis=0).idxmax(axis=1)
df
ID A B C D E SIM SIM_CAT
0 1: 0.49 0.64 0.86 0.97 1.0 0.98 E
1 2: 0.76 0.84 0.98 0.99 1.0 0.87 C
2 3: 0.32 0.56 0.72 0.92 1.0 0.12 A
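A self-contained sketch of the same comparison, with one extra row whose SIM is above 1. The fourth row and the use of Series.where to blank out rows with no match are my own additions for illustration, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0.49, 0.76, 0.32, 0.10],
    'B': [0.64, 0.84, 0.56, 0.40],
    'C': [0.86, 0.98, 0.72, 0.70],
    'D': [0.97, 0.99, 0.92, 0.90],
    'E': [1.00, 1.00, 1.00, 1.00],
    'SIM': [0.98, 0.87, 0.12, 1.20],  # last value is a hypothetical out-of-range draw
})
cols = list('ABCDE')

# True where the cumulative probability is >= the simulated value.
hit = df[cols].ge(df['SIM'], axis=0)

# idxmax returns the first True per row; rows with no True at all are masked to NaN.
df['SIM_CAT'] = hit.idxmax(axis=1).where(hit.any(axis=1))
print(df)
```

Masking on hit.any(axis=1) avoids hard-coding the upper bound of 1, at the cost of one extra reduction.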

python pandas change dataframe to pivoted columns

I have a dataframe that looks as following:
Type Month Value
A 1 0.29
A 2 0.90
A 3 0.44
A 4 0.43
B 1 0.29
B 2 0.50
B 3 0.14
B 4 0.07
I want to change the dataframe to following format:
Type A B
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
Is this possible?
Use set_index + unstack:
df.set_index(['Month', 'Type']).Value.unstack()
Type A B
Month
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
To match your exact output:
df.set_index(['Month', 'Type']).Value.unstack().rename_axis(None)
Type A B
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
Pivot solution:
In [70]: df.pivot(index='Month', columns='Type', values='Value')
Out[70]:
Type A B
Month
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
In [71]: df.pivot(index='Month', columns='Type', values='Value').rename_axis(None)
Out[71]:
Type A B
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
You have a long-format table that you want to transform into a wide format.
pandas handles this natively:
df.pivot(index='Month', columns='Type', values='Value')
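A runnable sketch of the long-to-wide conversion, with the frame rebuilt by hand from the question (the construction is an assumption about how the data is stored). It also computes the set_index + unstack variant so the two results can be compared:

```python
import pandas as pd

df = pd.DataFrame({
    'Type': ['A'] * 4 + ['B'] * 4,
    'Month': [1, 2, 3, 4] * 2,
    'Value': [0.29, 0.90, 0.44, 0.43, 0.29, 0.50, 0.14, 0.07],
})

# Long -> wide: one row per Month, one column per Type.
wide = df.pivot(index='Month', columns='Type', values='Value')

# Same reshaping expressed through a two-level index.
same = df.set_index(['Month', 'Type'])['Value'].unstack()

print(wide)
print(wide.equals(same))
```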

Pandas round is not working for DataFrame

Rounding works on a single element but not on the DataFrame. I tried DataFrame.round() but it didn't work. Any idea? Thanks.
I have the code below:
print("Panda Version: ", pd.__version__)
print("['5am'][0]: ", x3['5am'][0])
print("Round element: ", np.round(x3['5am'][0]*4) / 4)
print("Round Dataframe: \r\n", np.round(x3 * 4, decimals=2) / 4)
df = np.round(x3 * 4, decimals=2) / 4
print("Round Dataframe Again: \r\n", df.round(2))
Got result:
Panda Version: 0.18.0
['5am'][0]: 0.279914529915
Round element: 0.25
Round Dataframe:
5am 6am 7am 8am 9am 10am 11am
Date
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
Round Dataframe Again:
5am 6am 7am 8am 9am 10am 11am
Date
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
Try to cast to float type:
x3.astype(float).round(2)
It can be as simple as this:
df['col_name'] = df['col_name'].astype(float).round(2)
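One plausible reason the cast helps is that the columns were stored with object dtype rather than a float dtype. The question does not show the dtypes, so this is an assumption. The sketch below uses a small hypothetical frame to show how to check the dtype and then round after casting:

```python
import pandas as pd

# Assumption: the values arrived as object dtype, e.g. after a messy import.
x3 = pd.DataFrame(
    {'5am': [0.279915, 0.339744], '6am': [0.279915, 0.369658]},
    dtype=object,
)

# Inspect the dtypes first; 'object' here would explain why rounding seemed to do nothing.
print(x3.dtypes)

# Cast to float, then round; the result is a new frame.
rounded = x3.astype(float).round(2)
print(rounded)
```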
Explanation of your code:
In [166]: np.round(df * 4, decimals=2)
Out[166]:
a b c d
0 0.11 0.45 1.65 3.38
1 3.97 2.90 1.89 3.42
2 1.46 0.79 3.00 1.44
3 3.48 2.33 0.81 1.02
4 1.03 0.65 1.94 2.92
5 1.88 2.21 0.59 0.39
6 0.08 2.09 4.00 1.02
7 2.86 0.71 3.56 0.57
8 1.23 1.38 3.47 0.03
9 3.09 1.10 1.12 3.31
In [167]: np.round(df * 4, decimals=2) / 4
Out[167]:
a b c d
0 0.0275 0.1125 0.4125 0.8450
1 0.9925 0.7250 0.4725 0.8550
2 0.3650 0.1975 0.7500 0.3600
3 0.8700 0.5825 0.2025 0.2550
4 0.2575 0.1625 0.4850 0.7300
5 0.4700 0.5525 0.1475 0.0975
6 0.0200 0.5225 1.0000 0.2550
7 0.7150 0.1775 0.8900 0.1425
8 0.3075 0.3450 0.8675 0.0075
9 0.7725 0.2750 0.2800 0.8275
In [168]: np.round(np.round(df * 4, decimals=2) / 4, 2)
Out[168]:
a b c d
0 0.03 0.11 0.41 0.84
1 0.99 0.72 0.47 0.86
2 0.36 0.20 0.75 0.36
3 0.87 0.58 0.20 0.26
4 0.26 0.16 0.48 0.73
5 0.47 0.55 0.15 0.10
6 0.02 0.52 1.00 0.26
7 0.72 0.18 0.89 0.14
8 0.31 0.34 0.87 0.01
9 0.77 0.28 0.28 0.83
This is working properly for me (pandas 0.18.1)
In [162]: df = pd.DataFrame(np.random.rand(10,4), columns=list('abcd'))
In [163]: df
Out[163]:
a b c d
0 0.028700 0.112959 0.412192 0.845663
1 0.991907 0.725550 0.472020 0.856240
2 0.365117 0.197468 0.750554 0.360272
3 0.870041 0.582081 0.203692 0.255915
4 0.257433 0.161543 0.483978 0.730548
5 0.470767 0.553341 0.146612 0.096358
6 0.020052 0.522482 0.999089 0.254312
7 0.714934 0.178061 0.889703 0.143701
8 0.308284 0.344552 0.868151 0.007825
9 0.771984 0.274245 0.280431 0.827999
In [164]: df.round(2)
Out[164]:
a b c d
0 0.03 0.11 0.41 0.85
1 0.99 0.73 0.47 0.86
2 0.37 0.20 0.75 0.36
3 0.87 0.58 0.20 0.26
4 0.26 0.16 0.48 0.73
5 0.47 0.55 0.15 0.10
6 0.02 0.52 1.00 0.25
7 0.71 0.18 0.89 0.14
8 0.31 0.34 0.87 0.01
9 0.77 0.27 0.28 0.83
Similar issue. df.round(1) didn't round as expected (e.g. .400000000123) but df.astype('float64').round(1) worked. Significantly, the dtype of df is float32. Apparently round() doesn't work properly on float32. How is this behavior not a bug?
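A possible explanation, offered as a sketch rather than a diagnosis: 0.4 has no exact float32 representation, so the rounded float32 value shows extra digits as soon as it is widened or printed with more precision. The input value below is made up to mimic the one mentioned above:

```python
import numpy as np
import pandas as pd

s32 = pd.Series([0.4000000123], dtype='float32')

r32 = s32.round(1)                      # rounded, but still stored as float32
r64 = s32.astype('float64').round(1)    # rounded in double precision

# Widening the float32 result to a Python float exposes its representation error.
print(repr(float(r32.iloc[0])))
print(repr(float(r64.iloc[0])))
```

Under this reading the rounding itself is applied in both cases; only the precision of the storage type differs.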
As I just found here, "round does not modify in-place. Rather, it returns the dataframe rounded."
It might be helpful to think of this as follows:
df.round(2) performs the correct rounding operation, but the result is neither displayed nor saved anywhere.
Thus, df_final = df.round(2) will likely give you the expected behaviour, instead of just df.round(2), because the result of the rounding operation is now saved to the df_final dataframe.
Additionally, you could write df_final = df.round(2).copy() instead of simply df_final = df.round(2). This should not be strictly necessary, since round already returns a new dataframe, but I have seen unexpected results in other situations when I did not assign a copy of the old dataframe to the new one.
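A short sketch of the point about in-place behaviour, using a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [0.123456, 2.718281]})

df.round(2)              # the rounded frame is returned and immediately discarded
df_final = df.round(2)   # assigning the return value keeps the rounded frame

print(df)        # original values, unchanged
print(df_final)  # rounded values
```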
I tried to reproduce your situation, and it seems to work fine:
import pandas as pd
import numpy as np
from io import StringIO
s = """Date 5am 6am 7am 8am 9am 10am 11am
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
"""
df = pd.read_csv(StringIO(s), sep=r'\s+')
df.set_index('Date').round(2)

Creating Multi-hierarchy pivot table in Pandas

1. Background
The .xls files I have now contain parameters of multiple pollutants for different sites.
I created a simplified dataframe below as an illustration:
Some notes:
Column Site contains the monitoring site. In this case, sites S1 and S2 are the only two locations.
Column Time contains the monitoring period for the different sites.
Species A and B represent the two chemical pollutants that were detected.
Conc is a key parameter for each species (A and B) and represents the concentration. Note that the concentration of species A is measured twice, in parallel.
P and Q are two different analysis experiments. Since species A has two samples, it has P1, P2, P3 and Q1, Q2 as analysis results. Species B has only been analyzed by P, so P1, P2, P3 are its only parameters.
After reading some posts on manipulating pivot_table in pandas, I want to give it a try.
2. My target
I built my target file layout manually in Excel; it looks like this:
3. My work
df = pd.ExcelFile("./test_file.xls")
df = df.parse("Sheet1")
pd.pivot_table(df,index = ["Site","Time","Species"])
This is the result:
Update
What I'm trying to figure out is how to create two columns P and Q with sub-columns below them.
I have re-uploaded my test file here. Anyone interested can download it.
The P and Q tests are for each sample of species A.
The Conc test is for both of them.
Any advice would be appreciated!
IIUC
You want the same dataframe, but with a better column index.
To create the first level:
level0 = df.columns.str.extract(r'([^\d]*)', expand=False)
Then assign a MultiIndex to the columns attribute:
df.columns = pd.MultiIndex.from_arrays([level0, df.columns])
It looks like this:
print(df)
Conc P Q
Conc P1 P2 P3 Q1 Q2
Site Time Species
S1 20141222 A 0.79 0.02 0.62 1.05 0.01 1.73
20141228 A 0.13 0.01 0.79 0.44 0.01 1.72
20150103 B 0.48 0.03 1.39 0.84 NaN NaN
20150104 A 0.36 0.02 1.13 0.31 0.01 0.94
20150109 A 0.14 0.01 0.64 0.35 0.00 1.00
20150114 B 0.47 0.08 1.16 1.40 NaN NaN
20150115 A 0.62 0.02 0.90 0.95 0.01 2.63
20150116 A 0.71 0.03 1.72 1.71 0.01 2.53
20150121 B 0.61 0.03 0.67 0.87 NaN NaN
S2 20141222 A 0.23 0.01 0.66 0.44 0.01 1.49
20141228 A 0.42 0.06 0.99 1.56 0.00 2.18
20150103 B 0.09 0.01 0.56 0.12 NaN NaN
20150104 A 0.18 0.01 0.56 0.36 0.00 0.67
20150109 A 0.50 0.03 0.74 0.71 0.00 1.11
20150114 B 0.64 0.06 1.76 0.92 NaN NaN
20150115 A 0.58 0.05 0.77 0.95 0.01 1.54
20150116 A 0.93 0.04 1.33 0.69 0.00 0.82
20150121 B 0.33 0.09 1.33 0.76 NaN NaN
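The two steps above can be tried on a tiny stand-in frame. The two rows below are copied from the printed output, but the frame construction itself is my own and omits the Site/Time/Species index for brevity:

```python
import pandas as pd

df = pd.DataFrame(
    [[0.79, 0.02, 0.62, 1.05, 0.01, 1.73],
     [0.48, 0.03, 1.39, 0.84, None, None]],
    columns=['Conc', 'P1', 'P2', 'P3', 'Q1', 'Q2'],
)

# Top level: the column name with everything from the first digit onward removed.
level0 = df.columns.str.extract(r'([^\d]*)', expand=False)

# Stack the new top level above the original names.
df.columns = pd.MultiIndex.from_arrays([level0, df.columns])

# Selecting a top-level key returns only its sub-columns.
print(df['P'])
```

With the two-level columns in place, df['P'] and df['Q'] give the per-experiment blocks directly.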
