How can I remove extra digits of a float64 value? - python

I have a data frame column.
P08107 3.658940e-11
P62979 4.817399e-05
P16401 7.784275e-05
Q96B49 7.784275e-05
Q15637 2.099078e-04
P31689 1.274387e-03
P62258 1.662718e-03
P07437 3.029516e-03
O00410 3.029516e-03
P23381 3.029516e-03
P27348 5.733834e-03
P29590 9.559550e-03
P25685 9.957186e-03
P09429 1.181282e-02
P62937 1.260040e-02
P11021 1.396807e-02
P31946 1.409311e-02
P19338 1.503901e-02
Q14974 2.213431e-02
P11142 2.402201e-02
I want to keep one decimal place and drop the extra digits, so that it looks like
3.7e-11
instead of
3.658940e-11
and so on for all the others.
I know how to slice a string, but that doesn't seem to work here.

If you have a pandas dataframe you could set the display options.
import pandas as pd
import numpy as np
pd.options.display.float_format = '{:.2f}'.format
pd.DataFrame(dict(randomvalues=np.random.random_sample((5,))))
Returns:
   randomvalues
0          0.02
1          0.66
2          0.24
3          0.87
4          0.63
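For the scientific-notation values in the question, the same display-option trick could use an e format instead; a minimal sketch, with the small Series below standing in for your column (this only changes how the values are displayed, not the stored float64 values):
import pandas as pd

# two of the question's values, as a stand-in for the real column
s = pd.Series([3.658940e-11, 4.817399e-05], index=['P08107', 'P62979'])

# one digit after the decimal point, keeping scientific notation
pd.options.display.float_format = '{:.1e}'.format
print(s)
# P08107    3.7e-11
# P62979    4.8e-05
# dtype: float64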

You could use str.format:
>>> '{:.2g}'.format(3.658940e-11)
'3.7e-11'
String slicing will not work here, because it does not round the values:
>>> s = '3.658940e-11'
>>> s[:3] + 'e' + s.split('e')[1]
'3.6e-11'
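To transform the whole column rather than a single value, the same format could be mapped over the Series; a minimal sketch, where pvals is a hypothetical stand-in for the question's column:
import pandas as pd

pvals = pd.Series([3.658940e-11, 4.817399e-05, 7.784275e-05])
# two significant digits, producing strings such as '3.7e-11'
formatted = pvals.map('{:.2g}'.format)
print(formatted.tolist())  # ['3.7e-11', '4.8e-05', '7.8e-05']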

Related

Subsetting columns of a data frame that are stored into variables

I have a data frame (a .txt from R) that looks like this:
my_sample my_coord1 my_coord2 my_cl
A 0.34 0.12 1
B 0.2 1.11 1
C 0.23 0.10 1
D 0.9 0.34 2
E 0.21 0.6 2
... ... ... ...
Using Python I would like to extract columns 2 and 3 and put them into one variable, and put column 4 into another variable. In R this is: my_var1 = mydf[,c(2:3)] and my_var2 = mydf[,4]. I don't know how to do this in Python, but I tried:
mydf = open("mydf.txt", "r")
print(mydf.read())
lines = mydf.readlines()
for line in lines:
    sline = line.split(' ')
    print(sline)
mydf.close()
But I don't know how to save into a variable each subsetting.
I know it seems like quite a simple question, but I'm a newbie in the field.
Thank you in advance.
You can use read_table from pandas to deal with a tabular data file. The code:
import pandas as pd
mydf = pd.read_table('mydf.txt', delim_whitespace=True)
my_var1 = mydf[['my_coord1','my_coord2']]
my_var2 = mydf['my_cl']
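If you prefer to select by position, mirroring the R code my_var1 = mydf[,c(2:3)] and my_var2 = mydf[,4], iloc gives the same result; a minimal sketch under the same assumptions about mydf.txt (note that Python indexing is 0-based):
import pandas as pd

mydf = pd.read_table('mydf.txt', delim_whitespace=True)
my_var1 = mydf.iloc[:, 1:3]  # columns 2 and 3 (my_coord1, my_coord2)
my_var2 = mydf.iloc[:, 3]    # column 4 (my_cl)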

Pandas Dataframe How to cut off float decimal points without rounding?

I have longitude and latitude in two dataframes that are close together. If I run an exact similarity check such as
test_similar = test1_latlon.loc[~test1_latlon['cr'].isin(test2_latlon['cr'])]
I get a lot of failures, because a lot of the numbers are off at the 5th decimal place. I want to truncate after the 3rd decimal. I've seen people use formatting so the values show up truncated, but I want to change the actual values. Using round() rounds the data and I get even more errors, so is there a way to just drop everything after the 3rd decimal place?
You may want to use numpy.trunc:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1.2366, 1.2310], [1, 1]])
df1 = np.trunc(1000 * df) / 1000
print(df1, type(df1))
#        0      1
# 0  1.236  1.231
# 1  1.000  1.000 <class 'pandas.core.frame.DataFrame'>
Note that df1 is still a DataFrame not a numpy.array
As suggested here you can do:
x = 1.123456
float('%.3f' % x)
If you want more decimal places, just change the 3 to the number you need. Note, however, that %-formatting rounds the last digit rather than truncating it.
import math

value1 = 1.1236
value2 = 1.1266
value1 = math.trunc(1000 * value1) / 1000  # 1.123
value2 = math.trunc(1000 * value2) / 1000  # 1.126
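The same trick can be mapped over a dataframe column to change the stored values, which is what the question is after; a minimal sketch with a hypothetical column name lat:
import math

import pandas as pd

df = pd.DataFrame({'lat': [1.12366, 1.12661]})
# truncate (not round) to 3 decimal places
df['lat'] = df['lat'].map(lambda x: math.trunc(1000 * x) / 1000)
print(df['lat'].tolist())  # [1.123, 1.126]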

Python Tabulate format only one float column

I'm using the tabulate module to print a fixed width file and I have one column that I need formatted in such a way that there are 19 places to the left of the decimal and 2 places to the right of the decimal.
import pandas as pd
from tabulate import tabulate
df = pd.DataFrame.from_dict({'A':['x','y','z'],
                             'B':[1,1.1,11.21],'C':[34.2334,81.1,11]})
df
Out[4]:
   A      B        C
0  x   1.00  34.2334
1  y   1.10  81.1000
2  z  11.21  11.0000
df['C'] = df['C'].apply(lambda x: format(x,'0>22.2f'))
df
Out[6]:
   A      B                       C
0  x   1.00  0000000000000000034.23
1  y   1.10  0000000000000000081.10
2  z  11.21  0000000000000000011.00
print(tabulate(df))
- - ----- -----
0 x 1 34.23
1 y 1.1 81.1
2 z 11.21 11
- - ----- -----
Is there any way I can preserve the formatting in column C without affecting the formatting in column B? I know I could use floatfmt='0>22.2f', but I don't need column B to look that way, just column C.
According to the tabulate documentation, strings that look like decimals are automatically converted to numeric. If I could suppress this and then format my table before printing (as in the example above), that would solve it for me as well.
The documentation at GitHub is more up-to-date and it states that with floatfmt "every column may have different number formatting". Here is an example using your data:
import pandas as pd
from tabulate import tabulate
df = pd.DataFrame.from_dict({'A':['x','yy','zzz'],
                             'B':[1,1.1,11.21],'C':[34.2334,81.1,11]})
print(tabulate(df, floatfmt=(None, None, '.2f', '0>22.2f',)))
The result is:
- --- ----- ----------------------
0 x 1.00 0000000000000000034.23
1 yy 1.10 0000000000000000081.10
2 zzz 11.21 0000000000000000011.00
- --- ----- ----------------------
Additionally, as you suggested, you also have the option disable_numparse, which disables the automatic conversion from string to numeric. You can then format each field manually, but this requires more coding. The option colalign may come in handy in such a case, so that you can specify a different column alignment for strings and numbers (which you would have converted to formatted strings, too).
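A minimal sketch of that route, assuming a tabulate version that supports both disable_numparse and colalign: column C is pre-formatted to strings, and number parsing is switched off so the leading zeros survive:
df['C'] = df['C'].map(lambda x: format(x, '0>22.2f'))  # strings like '0000000000000000034.23'
print(tabulate(df, disable_numparse=True,
               colalign=('right', 'left', 'right', 'right')))  # index, A, B, C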
Do you absolutely need tabulate for this? You can achieve a similar effect (bar the dashes) with:
In [18]: print(df.__repr__().split('\n',1)[1])
0 x 1.00 0000000000000000034.23
1 y 1.10 0000000000000000081.10
2 z 11.21 0000000000000000011.00
df.__repr__() is the representation of df, i.e. what you see when you just type df in a cell. Then I remove the header line by splitting on the first newline character and taking the second half of the split.
Also, if you write it to a machine-readable form, you might want to use tabs (this uses sys.stdout, so import sys first):
In [8]: df.to_csv(sys.stdout, sep='\t', header=False)
0 x 1.0 0000000000000000034.23
1 y 1.1 0000000000000000081.10
2 z 11.21 0000000000000000011.00
It will render nicely depending on your tab rendering settings, but if you write to a file you get actual tab characters.

Querying a Pandas dataframe

I am still very new to Pandas, and hence this might be very silly. I have a Pandas data frame as follows:
>>> data_frame
   median  quarter  status  change
0     240   2015-1      BV     NaN
1     300   2015-2      BV    0.25
2     300   2015-1    CORR    0.00
3     240   2015-2    CORR   -0.20
Now I need only the quarter 2015-2, so I perform the query:
>>> data_frame.query('quarter == "2015-2"')
   median  quarter  status  change
1     300   2015-2      BV    0.25
2     240   2015-2    CORR   -0.20
That works fine. However, if I need to search via a variable name, it does not work.
>>> completed_quarter = '2015-2'
>>> data_frame.query('quarter == "completed_quarter"')
Empty DataFrame
Columns: [median, quarter, status, change]
Index: []
I tried a few other combinations with single quotes, no quotes, etc., but nothing works. What am I doing wrong? Is there any other way in Pandas through which I can accomplish the same thing?
Try using this:
>>> completed_quarter = '2015-2'
>>> data_frame.query('quarter == "{}"'.format(completed_quarter))
At the moment you are searching for a quarter that equals the literal string "completed_quarter" rather than the value of the completed_quarter variable. Using the string format method replaces the braces with the variable's value.
You can access the value of the variable with the @ prefix, which lets query refer to local variables:
completed_quarter = '2015-2'
data_frame.query('quarter == @completed_quarter')
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html

Output different precision by column with pandas.DataFrame.to_csv()?

Question
Is it possible to specify a float precision specifically for each column to be printed by the Python pandas package method pandas.DataFrame.to_csv?
Background
If I have a pandas dataframe that is arranged like this:
In [53]: df_data[:5]
Out[53]:
   year  month  day       lats        lons  vals
0  2012      6   16  81.862745  -29.834254   0.0
1  2012      6   16  81.862745  -29.502762   0.1
2  2012      6   16  81.862745  -29.171271   0.0
3  2012      6   16  81.862745  -28.839779   0.2
4  2012      6   16  81.862745  -28.508287   0.0
There is the float_format option that can be used to specify a precision, but it applies that precision to all columns of the dataframe when printed.
When I use that like so:
df_data.to_csv(outfile, index=False,
               header=False, float_format='%11.6f')
I get the following, where vals is written with more precision than it needs:
2012,6,16, 81.862745, -29.834254, 0.000000
2012,6,16, 81.862745, -29.502762, 0.100000
2012,6,16, 81.862745, -29.171270, 0.000000
2012,6,16, 81.862745, -28.839779, 0.200000
2012,6,16, 81.862745, -28.508287, 0.000000
Change the type of column "vals" prior to exporting the data frame to a CSV file
df_data['vals'] = df_data['vals'].map(lambda x: '%2.1f' % x)
df_data.to_csv(outfile, index=False, header=False, float_format='%11.6f')
The more current version of hknust's first line would be:
df_data['vals'] = df_data['vals'].map(lambda x: '{0:.1}'.format(x))
To print without scientific notation:
df_data['vals'] = df_data['vals'].map(lambda x: '{0:.1f}'.format(x))
This question is a bit old, but I'd like to contribute what I think is a better answer:
formats = {'lats': '{:10.5f}', 'lons': '{:.3E}', 'vals': '{:2.1f}'}
for col, f in formats.items():
    df_data[col] = df_data[col].map(lambda x: f.format(x))
I tried the solution from here, but it didn't work for me, so I decided to experiment with the previous solutions given here combined with that one.
You can use the round method on the dataframe before saving it to the file.
df_data = df_data.round(6)
df_data.to_csv('myfile.dat')
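round also accepts a dict of column names mapped to precisions, which gives a different precision per column directly; a minimal sketch, with a one-row stand-in for the question's df_data:
import pandas as pd

df_data = pd.DataFrame({'lats': [81.862745], 'lons': [-29.834254], 'vals': [0.1]})
# per-column precision: 6 decimals for the coordinates, 1 for vals
df_data = df_data.round({'lats': 6, 'lons': 6, 'vals': 1})
df_data.to_csv('myfile.dat', index=False, header=False)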
You can do this with to_string. There is a formatters argument where you can provide a dict of column names to formatters. Then you can use a regexp to replace the default column separators with your delimiter of choice.
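A minimal sketch of that idea; the one-row stand-in for df_data, the formatters, and the regular expression here are illustrative choices, not a fixed recipe:
import re

import pandas as pd

df_data = pd.DataFrame({'lats': [81.862745], 'lons': [-29.834254], 'vals': [0.1]})
# per-column formatters, then squash the whitespace separators into commas
text = df_data.to_string(index=False, header=False,
                         formatters={'lats': '{:.6f}'.format,
                                     'lons': '{:.6f}'.format,
                                     'vals': '{:.1f}'.format})
csv_lines = [re.sub(r'\s+', ',', line.strip()) for line in text.splitlines()]
print(csv_lines)  # ['81.862745,-29.834254,0.1']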
The to_string approach suggested by @mattexx looks better to me, since it doesn't modify the dataframe.
It also generalizes well when using Jupyter notebooks to get pretty HTML output, via the to_html method. Here we set a new default precision of 4 and override it to get 5 digits for the particular column named wider:
from IPython.display import HTML
from IPython.display import display
pd.set_option('precision', 4)
display(HTML(df.to_html(formatters={'wider': '{:,.5f}'.format})))
