How to edit display precision for only one dataframe pandas - python

I would like to edit the display precision for a specific dataframe.
Now I saw people stating that you could use something like this:
pd.set_option('precision', 5)
However, how do you make sure that only one specific dataframe uses this precision, and the others remain as they were?
Also, is it possible to alter this precision for specific columns only?

One way is to use the string representation of the float column:
np.random.seed(2019)
df = pd.DataFrame(np.random.rand(5,3), columns=list('abc'))
print (df)
          a         b         c
0  0.903482  0.393081  0.623970
1  0.637877  0.880499  0.299172
2  0.702198  0.903206  0.881382
3  0.405750  0.452447  0.267070
4  0.162865  0.889215  0.148476
df['a'] = df['a'].map("{:,.15f}".format)
print (df)
                   a         b         c
0  0.903482214419274  0.393081  0.623970
1  0.637877401022227  0.880499  0.299172
2  0.702198270186552  0.903206  0.881382
3  0.405749797979913  0.452447  0.267070
4  0.162864870291925  0.889215  0.148476
print (df.dtypes)
a     object
b    float64
c    float64
dtype: object

Alternatively, use the Styler, which changes only the display and does not affect the original dataframe:
df.style.set_precision(5)
For one column you can use:
df.style.format({'Column_Name': "{:.5f}"})
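Putting that together, a self-contained sketch of the display-only approach (assuming pandas 1.3+, where Styler.format(precision=...) replaces the deprecated set_precision):
import numpy as np
import pandas as pd

np.random.seed(2019)
df = pd.DataFrame(np.random.rand(5, 3), columns=list('abc'))

# Display-only precision: the underlying float64 data is untouched
styled = df.style.format(precision=5)      # newer pandas (1.3+)
# styled = df.style.set_precision(5)       # older pandas, now deprecated

# Per-column precision: 5 decimals for 'a' only, default elsewhere
styled_col = df.style.format({'a': "{:.5f}"})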

Related

Compute a row in Pandas operating other columns

I have a dataframe like this:
df = pd.DataFrame({"a": [2.22, 3.444, 4.3726], "b": [3.44, 5.96, 7.218]})
I need to compute another column c by the following operation on column a:
c = len(str(a))-len(str(int(a)))-1
I have tried different methods but have not been able to achieve this.
If there is a varying number of digits after the decimal point, it is possible to use Series.str.len with Series.astype:
df = pd.DataFrame({"a":[2.22, 3.444, 4.3726],"b":[3.44, 5.96, 7.218] })
print (df.a.astype(str).str.len())
0    4
1    5
2    6
Name: a, dtype: int64
df['c'] = df.a.astype(str).str.len() - df.a.astype(int).astype(str).str.len() - 1
But because float precision is problematic, counting digits this way can fail with general data (simulating the problem):
df = pd.DataFrame({"a":[2.220000000236, 3.444, 4.3726],"b":[3.44, 5.96, 7.218] })
print (df.a.astype(str).str.len())
0 14
1 5
2 6
Name: a, dtype: int64
This creates column c with the desired result:
df['c'] = df['a'].astype(str).str.len() - df['a'].astype(int).astype(str).str.len() - 1
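A slightly more direct sketch of the same idea is to split the string form on the decimal point (same sample data as above; it is still subject to the same float-representation caveat):
import pandas as pd

df = pd.DataFrame({"a": [2.22, 3.444, 4.3726], "b": [3.44, 5.96, 7.218]})

# Count the characters after the '.' in each float's string form
df['c'] = df['a'].astype(str).str.split('.').str[1].str.len()
print(df)
#         a      b  c
# 0  2.2200  3.440  2
# 1  3.4440  5.960  3
# 2  4.3726  7.218  4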

data type conversion in dataFrame

I have a CSV file which has a column named population. In this CSV file the values of this column are shown as decimals (floats), e.g. 12345.00. I have converted the whole file to ttl RDF format, and the population literal is shown the same way, i.e. 12345.0, in the ttl file. I want it to show as an integer (whole number), i.e. 12345. Do I need to convert the data type of this column, or what should I do? Also, how can I check the data type of a column of a dataFrame in python?
(A beginner in python) - Thanks
You can try changing the column's data type first.
For example
df = pd.DataFrame([1.0,2.0,3.0,4.0], columns=['A'])
0    1.0
1    2.0
2    3.0
3    4.0
Name: A, dtype: float64
Now:
df['A'] = df['A'].astype(int)
0    1
1    2
2    3
3    4
Name: A, dtype: int32
If you have some np.NaN values in the column (say a fifth row holding np.NaN), you can try
df = df.astype('Int64')
This will get you
0       1
1       2
2       3
3       4
4    <NA>
Name: A, dtype: Int64
Here <NA> is the Int64 equivalent of np.NaN. It is important to know that np.NaN is a float, and that <NA> is not widely used yet and is not memory and performance optimized; you can read more about it here:
https://pandas.pydata.org/docs/user_guide/missing_data.html#integer-dtypes-and-missing-data
csv_data['theColName'] = csv_data['theColName'].fillna(0)
csv_data['theColName'] = csv_data['theColName'].astype('int64')
worked, and the column was successfully converted to int64. Thanks everybody.
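As for the other part of the question, checking a column's data type, a minimal sketch:
import pandas as pd

df = pd.DataFrame({'population': [12345.0, 67890.0]})

print(df['population'].dtype)   # dtype of a single column -> float64
print(df.dtypes)                # dtypes of every column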

How to bring pandas.Series.str.get_dummies() to report NaN?

I have data in a file. It is CSV-like, but multiple values per field are possible. I use get_dummies() to generate an overview of my column: what is in there and how often, just like a histogram with nominal data. I want to see the missing (NaN) values, but my code hides them.
I am using: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.get_dummies.html
I can't use https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html, whose dummy_na parameter would solve the problem.
Reason: I need the sep parameter.
To illustrate the difference.
import pandas
data = pandas.read_csv("testdata.csv", sep=";")
data["a"].str.get_dummies(",").sum() # no NaN values
pandas.get_dummies(data["a"], dummy_na=True).sum() # not separated
Data:
a;b
Test,Tes;
;a
Tes;a
T;b
I would expect:
T       1
Tes     2
Test    1
NaN     1
But the output is:
T       1
Tes     2
Test    1
dtype: int64
or
T           1
Tes         1
Test,Tes    1
NaN         1
dtype: int64
Happy to also use another function! Maybe the .str part is the problem; I have not quite figured out what it does.
First replace the missing values with Series.fillna, and then rename the placeholder back to NaN in the index:
import numpy as np
print (data["a"].fillna('Missing').str.get_dummies(",").sum().rename({'Missing':np.nan}))
NaN     1
T       1
Tes     2
Test    1
dtype: int64
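End to end with the sample data above, a runnable sketch ('Missing' is an arbitrary placeholder, chosen so it cannot collide with a real category):
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': ['Test,Tes', np.nan, 'Tes', 'T'],
                     'b': [np.nan, 'a', 'a', 'b']})

counts = (data['a'].fillna('Missing')      # placeholder for NaN
                   .str.get_dummies(',')   # split multi-value fields
                   .sum()
                   .rename({'Missing': np.nan}))
print(counts)
# NaN     1
# T       1
# Tes     2
# Test    1
# dtype: int64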

Change values on some criteria in Pandas DataFrame and save it to the new df without affecting the original one

Here's the example. I will write the Pandas DataFrame as a list so it's easier to read:
df_0 = [1,2,3]
I want to change values bigger than 2 to np.nan and save the new DataFrame to a new variable, df_1. Final result:
df_0 = [1,2,3]
df_1 = [1,2,np.nan]
You just need mask here:
df_1 = df_0.mask(df_0 > 2)
df_1
Out[291]:
0    1.0
1    2.0
2    NaN
dtype: float64
df_0
Out[292]:
0    1
1    2
2    3
dtype: int64
What is your column name in the dataframe? Let's say it is COL. Then you can do:
df_1 = df_0.copy()
df_1.loc[df_1['COL'] > 2, 'COL'] = np.nan
Also, there is the where method that you can make use of:
df_0 = pd.Series([1,2,3])
df_1 = df_0.where(df_0 <= 2)
0    1.0
1    2.0
2    NaN
dtype: float64
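For completeness, the copy-and-loc variant as a runnable sketch (COL is the assumed column name from the answer above):
import numpy as np
import pandas as pd

df_0 = pd.DataFrame({'COL': [1, 2, 3]})

df_1 = df_0.copy()                          # df_0 stays untouched
df_1.loc[df_1['COL'] > 2, 'COL'] = np.nan   # strictly bigger than 2
print(df_1)
#    COL
# 0  1.0
# 1  2.0
# 2  NaN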

Pandas get_dummies to output dtype integer/bool instead of float

I would like to know if I could ask the get_dummies function in pandas to output the dummies dataframe with a dtype lighter than the default float64.
So, for a sample dataframe with categorical columns:
In []: df = pd.DataFrame([('blue','wood'),('blue','metal'),('red','wood')],
                         columns=['C1','C2'])
In []: df
Out[]:
     C1     C2
0  blue   wood
1  blue  metal
2   red   wood
after getting the dummies, it looks like:
In []: df = pd.get_dummies(df)
In []: df
Out[]:
   C1_blue  C1_red  C2_metal  C2_wood
0        1       0         0        1
1        1       0         1        0
2        0       1         0        1
which is perfectly fine. However, by default the 1's and 0's are float64:
In []: df.dtypes
Out[]:
C1_blue     float64
C1_red      float64
C2_metal    float64
C2_wood     float64
dtype: object
I know I can change the dtype afterwards with astype:
In []: df = pd.get_dummies(df).astype(np.int8)
But I don't want to have the dataframe with floats in memory, because I am dealing with a big dataframe (from a CSV of about 5 GB). I would like to have the dummies directly as integers.
There is an open issue w.r.t. this, see here: https://github.com/pydata/pandas/issues/8725
The float issue is now solved. As of pandas version 0.19, the pd.get_dummies function returns dummy-encoded columns as small integers.
See: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#get-dummies-now-returns-integer-dtypes
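Since pandas 0.23, get_dummies also accepts a dtype argument, so the target type can be requested directly; a minimal sketch (note that from pandas 2.0 the default output dtype is bool rather than uint8):
import numpy as np
import pandas as pd

df = pd.DataFrame([('blue', 'wood'), ('blue', 'metal'), ('red', 'wood')],
                  columns=['C1', 'C2'])

# Request int8 dummies up front instead of converting afterwards
dummies = pd.get_dummies(df, dtype=np.int8)
print(dummies.dtypes)
# C1_blue     int8
# C1_red      int8
# C2_metal    int8
# C2_wood     int8
# dtype: object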
