Number of digits after decimal point in pandas - python

I have CSV file with data:
Number
1.1
2.2
4.1
5.4
9.176
14.54345774
16.25664
If I load it with pandas and print it, I get:
df = pd.read_csv('data.csv')
print(df)
Number
0 1.100000
1 2.200000
2 4.100000
3 5.400000
4 9.176000
5 14.543458
6 16.256640
But if I truncate 14.54345774 to 14.543, the output changes:
Number
0 1.10000
1 2.20000
2 4.10000
3 5.40000
4 9.17600
5 14.54300
6 16.25664
In the first case pandas shows 6 digits after the decimal point; in the second case, 5.
Why does the display format change?
Which pandas options should I change so that both cases are displayed the same way? I want the number of digits after the decimal point to stay constant, with values rounded to that maximum number of digits where necessary.
UPDATE:
IMO this happens at data initialization, so round doesn't give the desired result if I want to use 6 digits. The displayed precision can only be decreased (6 -> 5 digits), it cannot be increased (5 -> 6).

You can use pd.set_option to set the float display precision, e.g. to 5 in this case:
pd.set_option("display.precision", 5)
or use:
pd.options.display.float_format = '{:.5f}'.format
Result:
print(df) # with original value of 14.54345774
Number
0 1.10000
1 2.20000
2 4.10000
3 5.40000
4 9.17600
5 14.54346
6 16.25664

You can use df.round(decimals=val) to fix the number of digits after the decimal point to val.
Also, when you changed the value to 14.543, pandas no longer needed to show 6 digits, because 16.25664 then has the most digits after the decimal point (i.e. 5), so the display switched to 5 digits. You can fix the precision to a constant value so that it doesn't change as the data changes.
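If the goal from the UPDATE is to always show 6 digits regardless of the data, a fixed-width float_format does that; a minimal sketch, assuming the CSV from the question is saved as data.csv:
import pandas as pd

# Display every float with exactly 6 digits after the decimal point,
# independent of how many digits the underlying values have.
pd.options.display.float_format = '{:.6f}'.format

df = pd.read_csv('data.csv')
print(df)   # 14.543 is shown as 14.543000, 14.54345774 as 14.543458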

Related

Cluster similar - but not identical - digits in pandas dataframe

I have a pandas dataframe with 2M+ rows. One of the columns, pin, contains 14-digit numbers.
I'm trying to cluster similar — but not identical — digits. Specifically, I want to match the first 10 digits without regard to the final four. The pin column was imported as an int then converted to a string.
Put another way, the first 10 digits should match but the final four shouldn't. Duplicates of exact-matching pins should be dropped.
For example these should all be grouped together:
17101110141403
17101110141892
17101110141763
17101110141199
17101110141788
17101110141851
17101110141831
17101110141487
17101110141914
17101110141843
Desired output:
Biggest cluster | other columns
Second biggest cluster | other columns
...and so on | other columns
I've tried using a combination of groupby and regex without success.
pat2 = '1710111014\d\d\d\d'
pat = '\d\d\d\d\d\d\d\d\d\d\d\d\d\d'
grouped = df2.groupby(df2['pin'].str.extract(pat, expand=False), axis= 1)
and
df.groupby(['pin']).filter(lambda group: re.match > 1)
Here's a link to the original data set: https://datacatalog.cookcountyil.gov/Property-Taxation/Assessor-Parcel-Sales/wvhk-k5uv
It's not clear why you need regex for this; what about the following (assuming pin is stored as a string)? (Note that you haven't included your expected output.)
pin
0 17101110141403
1 17101110141892
2 17101110141763
3 17101110141199
4 17101110141788
5 17101110141851
6 17101110141831
7 17101110141487
8 17101110141914
9 17101110141843
df.groupby(df['pin'].str[:10]).size()
pin
1710111014 10
dtype: int64
If you want this information appended back to your original dataframe, you can use
df['size']=df.groupby(df['pin'].astype(str).str[:10])['pin'].transform(len)
pin size
0 17101110141403 10
1 17101110141892 10
2 17101110141763 10
3 17101110141199 10
4 17101110141788 10
5 17101110141851 10
6 17101110141831 10
7 17101110141487 10
8 17101110141914 10
9 17101110141843 10
Then, assuming you have more columns, you can sort your dataframe by cluster size (largest first) with
df.sort_values('size', ascending=False)
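Put together, a minimal self-contained sketch of this approach (the pin values are a subset of those in the question; any other columns are left out):
import pandas as pd

# pin is kept as a string so the first 10 characters can be sliced directly.
df = pd.DataFrame({'pin': ['17101110141403', '17101110141892', '17101110141763',
                           '17101110141199', '17101110141788']})

# Size of each cluster of pins sharing the first 10 digits.
df['size'] = df.groupby(df['pin'].str[:10])['pin'].transform('size')

# Largest clusters first, with exact duplicate pins dropped.
print(df.drop_duplicates('pin').sort_values('size', ascending=False))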

Why do I lose numerical precision when extracting element from list in python?

I have a pandas dataframe that looks like this:
data
0 [26.113017616106, 106.948066803935, 215.488217...
1 [26.369709448639, 106.961107298101, 215.558911...
2 [26.261267444521, 106.991763898421, 215.384122...
3 [26.285746968657, 106.912377030428, 215.287348...
4 [26.155342026996, 106.825440402654, 215.114619...
5 [26.159917638984, 106.819720887669, 215.117593...
6 [26.023564401739, 106.843056508808, 215.129947...
7 [26.1155342027, 106.828185769847, 215.15991763...
8 [26.028826355525, 106.841912605811, 215.146190...
9 [26.015099519561, 106.824296499657, 215.130404...
I am trying to extract the element at index 1 from each list in the Series using this code:
[x[1] for x in df.data]
and I get this result:
0 106.948067
1 106.961107
2 106.991764
3 106.912377
4 106.825440
5 106.819721
6 106.843057
7 106.828186
8 106.841913
9 106.824296
Why do I lose precision and what can I do to keep it?
By default, pandas displays floating-point values with 6 digits of precision; the underlying values keep their full precision, only the display is truncated.
You can control the display precision with pandas' set_option, e.g.
pd.set_option('display.precision', 12)
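A minimal sketch showing that the list elements themselves are not truncated (the column name data and one of the lists are taken from the question):
import pandas as pd

df = pd.DataFrame({'data': [[26.113017616106, 106.948066803935, 215.488217]]})

value = [x[1] for x in df.data][0]
print(value)                         # full precision is still there: 106.948066803935

pd.set_option('display.precision', 12)
print(pd.Series([value]))            # the Series display now also shows 12 decimal places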

Count instances of a random number of n length

I have a pandas dataframe column containing numbers of varying length. I want to count how many instances of a six-digit number there are in the column, regardless of which digits they contain or their order.
Example:
import pandas as pd
df = pd.DataFrame({"number": [1234, 12345, 777777, 949494, 22, 987654]})
This should return that there are three instances of a six-digit number in the column.
I would convert it to string, check the length of the string and sum those whose length is 6:
(df['number'].astype(str).apply(len) == 6).sum()
Use np.log10 and floor division, which gives you the order of magnitude of each number, then check how many satisfy the condition.
N = 6
(np.log10(df['number'])//1).eq(N-1).sum()
#3
You can use np.ceil and np.log10:
df['length'] = np.ceil(np.log10(df['number']))
Result:
number length
0 1234 4.0
1 12345 5.0
2 777777 6.0
3 949494 6.0
4 22 2.0
5 987654 6.0
To count instances use:
np.ceil(np.log10(df['number'])).eq(6).sum()
Valid only for values > 0. Note that exact powers of 10 are off by one with this approach: 100000 has 6 digits but np.ceil(np.log10(100000)) is 5.0.
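A quick sketch running both approaches on the example data from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({"number": [1234, 12345, 777777, 949494, 22, 987654]})

# String-length approach: works for any non-negative integer.
print((df['number'].astype(str).str.len() == 6).sum())   # 3

# Order-of-magnitude approach: floor(log10(n)) == 5 for six-digit numbers.
print((np.log10(df['number']) // 1).eq(5).sum())          # 3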

Why is Python Returning 0.0 for division? [duplicate]

This question already has answers here:
Python: Pandas Dataframe how to multiply entire column with a scalar
(12 answers)
Is floating point math broken?
(31 answers)
Closed 2 years ago.
I am trying to divide a column in my data set by 100 so as to turn it into a percentage (i.e. 99% = .99.) However, when I divide by 100 it returns zero for all values in the column. I understand " / " returns the floor division when dividing two integers. However, I turned the column into a float, and also divided the column by 100.0 (float.) Also, a lot of the values in the column are '100.0' to begin with, so when I divide by 100, it should return '1' if it was doing floor division. The ' + .04 ' part of the if statement is to account for an error in the report. Is it a problem with my for loop? I attached the code below.
import pandas as pd
import numpy as np
data = pd.read_csv('Lakeshore Variance.csv')
Percent = data['Percent at Cutoff'].astype(float)
for i in Percent:
    if i < 96:
        Percent = (Percent/100.0) + .04
    else:
        Percent = (Percent/100.0)
You can use the apply method to update a column in pandas instead of using loops.
I'm voting to close this question, as it was already answered in Python: Pandas Dataframe how to multiply entire column with a scalar
data['Percent at Cutoff'] = data['Percent at Cutoff'].apply(lambda x: x / 100)
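The zeros most likely come from the loop itself: each pass reassigns the whole Percent series divided by 100, so with many rows the original values are divided by 100 over and over and are wiped out. A loop-free sketch, using the column name and the +.04 correction from the question:
import numpy as np
import pandas as pd

data = pd.read_csv('Lakeshore Variance.csv')
pct = data['Percent at Cutoff'].astype(float)

# Divide the whole column once, adding the 0.04 correction only where the
# original value is below 96.
data['Percent at Cutoff'] = pct / 100.0 + np.where(pct < 96, 0.04, 0.0)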
Setup:
import pandas as pd
import numpy as np
from numpy.random import default_rng
rng = default_rng()
df = pd.DataFrame(rng.integers(100,200, (5,)))
>>> df
0
0 150
1 171
2 188
3 140
4 180
I understand " / " returns the floor division when dividing two integers.
No - // is the floor division operator
>>> df / 100
0
0 1.50
1 1.71
2 1.88
3 1.40
4 1.80
>>> df // 100
0
0 1
1 1
2 1
3 1
4 1
>>>
I turned the column into a float, and also divided the column by 100.0 (float.) ... values in the column are '100.0' to begin with, ... divide by 100, it should return '1' if it was doing floor division
Two things in that statement
You are not doing floor division
>>> df.astype(float) / 100
0
0 1.50
1 1.71
2 1.88
3 1.40
4 1.80
>>> df.astype(float) // 100
0
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
>>>
You expect an integer result (should return '1') from floor division between a Pandas Series of dtype float and a float. Arithmetic operations where both the operands are floats will not return an integer.
From Python documentation 6.1. Arithmetic conversions
When a description of an arithmetic operator below uses the phrase “the numeric arguments are converted to a common type”, this means that the operator implementation for built-in types works as follows:
..., if either argument is a floating point number, the other is converted to floating point;
otherwise, both must be integers and no conversion is necessary.
6.7. Binary arithmetic operations
The / (division) and // (floor division) operators yield the quotient of their arguments. The numeric arguments are first converted to a common type. Division of integers yields a float, while floor division of integers results in an integer; the result is that of mathematical division with the ‘floor’ function applied to the result.
I'm having a hard time finding specific documentation for Numpy or Pandas. There are SO Q&As regarding this concept, but they are also eluding me.
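To make the quoted rules concrete with plain Python scalars:
print(5 / 2)     # 2.5  (true division of two ints yields a float)
print(5 // 2)    # 2    (floor division of two ints yields an int)
print(5.0 // 2)  # 2.0  (floor division with a float operand yields a float)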

Pandas Qcut, rounding values, Python

I want to create a key for the qcut bins I have from my data set.
So below I have the data from the 'total' column into ten bins, I have dropped the duplicates and sorted the values so I can see what the bin values are and in order. The below has the bins without using 'precision'.
bin_key=pd.qcut(bin_key['Total'], 10).drop_duplicates().sort_values()
bin_key.reset_index(drop=True, inplace=True)
bin_key
Output:
0 (11.199, 7932.26]
1 (7932.26, 15044.289]
2 (15044.289, 22709.757]
3 (22709.757, 32762.481]
4 (32762.481, 43491.146]
5 (43491.146, 55728.56]
6 (55728.56, 72823.314]
7 (72823.314, 100161.814]
8 (100161.814, 156406.846]
9 (156406.846, 1310448.18]
I want to round the values to the nearest thousand. Using precision it looks like this:
bin_key=pd.qcut(bin_key['Total_Costs'], 10, precision=-3).drop_duplicates().sort_values()
bin_key.reset_index(drop=True, inplace=True)
bin_key
Output
0 (-1000.0, 8000.0]
1 (8000.0, 15000.0]
2 (15000.0, 23000.0]
3 (23000.0, 33000.0]
4 (33000.0, 43000.0]
5 (43000.0, 56000.0]
6 (56000.0, 73000.0]
7 (73000.0, 100000.0]
8 (100000.0, 156000.0]
9 (156000.0, 1310000.0]
How can I round to 0 rather than -1000?
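One possible workaround (a hedged sketch, not from the original thread; df['Total'] and 10 bins are assumed from the first snippet, and the rounded edges are assumed to stay distinct): get the raw quantile edges with retbins=True, round them yourself, clamp the lowest edge to 0, and re-cut with pd.cut.
import numpy as np
import pandas as pd

# Quantile edges without any label rounding.
_, edges = pd.qcut(df['Total'], 10, retbins=True)

edges = np.round(edges, -3)                       # round every edge to the nearest thousand
edges[0] = 0                                      # lower bound becomes 0 instead of a negative value
edges[-1] = max(edges[-1], df['Total'].max())     # keep the top bin wide enough to cover the maximum

# Re-bin with the cleaned edges; include_lowest keeps the minimum value in the first bin.
bin_key = pd.cut(df['Total'], bins=edges, include_lowest=True)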
