Data type conversion in a DataFrame - Python

I have a CSV file which has a column named population. In this CSV file the values of this column appear as decimals (floats), e.g. 12345.00. I have converted the whole file to ttl RDF format, and the population literal appears the same way, i.e. 12345.0, in the ttl file. I want it to appear as an integer (whole number), i.e. 12345. Do I need to convert the data type of this column, or what should I do? Also, how can I check the data type of a column of a DataFrame in Python?
(A beginner in Python) - Thanks

You can try changing the column's data type first. For example:
import pandas as pd

df = pd.DataFrame([1.0,2.0,3.0,4.0], columns=['A'])
df['A']
0 1.0
1 2.0
2 3.0
3 4.0
Name: A, dtype: float64
Now convert it:
df['A'] = df['A'].astype(int)
df['A']
0 1
1 2
2 3
3 4
Name: A, dtype: int32
If you have some np.nan values in the column, you can use the nullable Int64 dtype instead:
df = pd.DataFrame([1.0,2.0,3.0,4.0,np.nan], columns=['A'])
df = df.astype('Int64')
This will get you
A
0 1
1 2
2 3
3 4
4 <NA>
Here <NA> is the Int64 equivalent of np.nan. It is important to know that np.nan is a float, while <NA> is not yet widely used and is not memory- or performance-optimized; you can read more about it here:
https://pandas.pydata.org/docs/user_guide/missing_data.html#integer-dtypes-and-missing-data
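Putting the two casts above together, a minimal runnable sketch (column name 'A' as in the example):

```python
import numpy as np
import pandas as pd

# A float column with one missing value: a plain int cast would fail,
# so use the nullable Int64 extension dtype instead.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, np.nan]})
df["A"] = df["A"].astype("Int64")

print(df["A"].dtype)         # Int64
print(df["A"].isna().sum())  # 1
```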

csv_data['theColName'] = csv_data['theColName'].fillna(0)
csv_data['theColName'] = csv_data['theColName'].astype('int64')
This worked and the column was successfully converted to int64. Thanks everybody!

Related

How can I remove the .0 after floats in a DataFrame without changing the object type? (I have NaNs and require numerical sorting)

I need to remove the .0s after floats in a column of a DataFrame, made from a dictionary.
For example, the dictionary might be:
mydict = { "part1" : [1,2,None,4,5], "part2" : [6,7,None,9,10] }
and then when I run mydf = pd.DataFrame(mydict), the DataFrame generated is as follows:
part1 part2
0 1.0 6.0
1 2.0 7.0
2 NaN NaN
3 4.0 9.0
4 5.0 10.0
This happens because every column in a DataFrame must hold objects of the same type, so the integers are upcast to floats to accommodate NaN. But I want no .0s at the end of my data, for the sake of looks. Obviously I can't make them integers, since the integer dtype has no NaN. I also can't make them strings, because I need numerical sorting, and I wouldn't want "01","02","03"…"10" either, again for looks.
Because this project is really serious, the looks matter, so please don't accuse me of overthinking the looks of the data.
The comments point out a couple of solutions.
I prefer to cast with .astype('Int64'), which retains the NaN as a <NA>.
Here is my solution (with your data):
import pandas as pd
mydict = {"part1":[1,2,None,4,5], "part2":[6,7,None,9,10]}
df = pd.DataFrame(mydict)
df['part2'] = df['part2'].astype('Int64')
print(df)
returns this:
part1 part2
0 1.0 6
1 2.0 7
2 NaN <NA>
3 4.0 9
4 5.0 10
You can apply the above to one (or many) columns of your choice.
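For the "many columns" case, the whole frame can be cast in one call; a sketch with the question's data:

```python
import pandas as pd

mydict = {"part1": [1, 2, None, 4, 5], "part2": [6, 7, None, 9, 10]}
df = pd.DataFrame(mydict)

# astype('Int64') on the whole frame converts every column at once,
# keeping the missing values as <NA>
df = df.astype("Int64")
print(df.dtypes)
```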

How to change the format of float values in a column that also contains NaN values in a Pandas DataFrame in Python?

I have Pandas DataFrame in Python like below:
col
-------
7.0
2.0
NaN
...
"col" is of float data type, but I would like to change how float values in this column are displayed, from e.g. 7.0 to 7. I cannot simply change the data type to int because I also have NaN values in "col".
So as a result I need something like below:
col
-------
7
2
NaN
...
How can I do that in Python pandas?
You can use convert_dtypes to perform an automatic conversion. For a single column:
df['col'] = df['col'].convert_dtypes()
For all columns:
df = df.convert_dtypes()
output:
col
0 7
1 2
2 <NA>
After conversion:
df.dtypes
col Int64
dtype: object
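A self-contained sketch of the convert_dtypes approach, assuming the column holds whole-number floats as in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [7.0, 2.0, np.nan]})

# convert_dtypes infers that the floats are whole numbers and
# switches the column to the nullable Int64 dtype
df = df.convert_dtypes()
print(df["col"].dtype)  # Int64
```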

comparing each value in two columns

How can I efficiently compare two columns in a DataFrame and create a new column based on the difference between them?
I have a feature in my table with a lot of missing values, and I need to backfill that information using other tables in the database that contain the same feature. I have used np.select to compare the feature in my original table with the same feature in the other table, but I feel like there should be an easier method.
Eg: pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
I expect the new column to contain values [1,2,"different",4,np.nan]. Any help will be appreciated!
pandas.Series.combine_first or pandas.DataFrame.combine_first could be useful here. These operate like a SQL COALESCE and combine the two columns by choosing the first non-null value if one exists.
df = pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
C = df.A.combine_first(df.B)
C looks like:
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
Then, to capture your requirement that two different non-null values should give "different" when combined, just find those indices and update the values.
mask = ~df.A.isna() & ~df.B.isna() & (df.A != df.B)
C[mask] = 'different'
C now looks like:
0 1
1 2
2 different
3 4
4 NaN
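Putting the combine_first and mask steps above together, a runnable sketch (the cast to object avoids the dtype upcast when the string is written in):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, np.nan], 'B': [1, np.nan, 30, 4, np.nan]})

# Coalesce: take A where present, otherwise fall back to B
C = df['A'].combine_first(df['B']).astype(object)

# Rows where both sides are present but disagree
mask = df['A'].notna() & df['B'].notna() & (df['A'] != df['B'])
C[mask] = 'different'
print(C.tolist())  # [1.0, 2.0, 'different', 4.0, nan]
```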
Another way is to use pd.DataFrame.iterrows with nunique:
import pandas as pd

# df is the frame from the question, with columns A and B
df['C'] = [s['A'] if s.nunique() <= 1 else 'different' for _, s in df.iterrows()]
Output:
A B C
0 1.0 1.0 1
1 2.0 NaN 2
2 3.0 30.0 different
3 4.0 4.0 4
4 NaN NaN NaN
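The iterrows version as a complete script, reusing the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, np.nan], 'B': [1, np.nan, 30, 4, np.nan]})

# nunique() ignores NaN, so a row whose non-null values agree keeps A,
# while a row with two distinct non-null values becomes 'different'
df['C'] = [s['A'] if s.nunique() <= 1 else 'different' for _, s in df.iterrows()]
print(df['C'].tolist())  # [1.0, 2.0, 'different', 4.0, nan]
```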

Are there any workarounds for pandas casting INT to Float when NaN is present? [duplicate]

This question already has answers here:
NumPy or Pandas: Keeping array type as integer while having a NaN value
(10 answers)
Closed 4 years ago.
Trying to get my column to be formatted as INT as the 1.0 2.0 3.0 is causing issues with how I am using the data. The first thing I tried was df['Severity'] = pd.to_numeric(df['Severity'], errors='coerce'). While this looked like it worked initially, it reverted back to appearing as float when I wrote to csv. Next I tried using df['Severity'] = df['Severity'].astype(int) followed by another failed attempt using df['Severity'] = df['Severity'].astype(int, errors='coerce') because it seemed a logical solution to me.
I did some digging into pandas' docs and found this regarding how pandas handles NAs:
Typeclass | Promotion dtype for storing NAs
floating  | no change
object    | no change
integer   | cast to float64
boolean   | cast to object
What I find strange though, is that when I run df.info(), I get Severity 452646 non-null object
Sample Data:
Age,Severity
1,1
2,2
3,3
4,NaN
5,4
6,4
7,5
8,7
9,6
10,5
Any help would be greatly appreciated :)
It's up to you how to handle missing values; there is no single correct way. You can either drop them using dropna or replace/fill them using replace/fillna. Note that there is no way to represent NaN using integers: https://en.wikipedia.org/wiki/NaN#Integer_NaN.
The reason you get object as the dtype is that you now have a mixture of integers and floats. Depending on the operation, the entire Series may be upcast to float, but in your case you have mixed dtypes.
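A sketch of the two drop-or-fill options on a hypothetical Severity column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Severity": [1, 2, 3, np.nan, 4]})

# Option 1: drop the rows with missing Severity, then cast to int
dropped = df.dropna(subset=["Severity"]).astype({"Severity": int})

# Option 2: fill missing values with a sentinel (here 0), then cast
filled = df.fillna({"Severity": 0}).astype({"Severity": int})

print(dropped["Severity"].tolist())  # [1, 2, 3, 4]
print(filled["Severity"].tolist())   # [1, 2, 3, 0, 4]
```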
As of pandas 0.24 (January 2019), it is possible to do what you want by using the nullable integer data type, using an arrays.IntegerArray to represent the data:
In [83]: df.Severity
Out[83]:
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 4.0
6 5.0
7 7.0
8 6.0
9 5.0
Name: Severity, dtype: float64
In [84]: df.Severity.astype('Int64')
Out[84]:
0 1
1 2
2 3
3 NaN
4 4
5 4
6 5
7 7
8 6
9 5
Name: Severity, dtype: Int64

Pandas: convert column with empty strings to float

In my application, I receive a pandas DataFrame (say, block), that has a column called est. This column can contain a mix of strings or floats. I need to convert all values in the column to floats and have the column type be float64. I do so using the following code:
block[est].convert_objects(convert_numeric=True)
block[est].astype('float')
This works for most cases. However, in one case, est contains all empty strings. In this case, the first statement executes without error, but the empty strings in the column remain empty strings. The second statement then causes an error: ValueError: could not convert string to float:.
How can I modify my code to handle a column with all empty strings?
Edit: I know I can just do block[est].replace("", np.NaN), but I was wondering if there's some way to do it with just convert_objects or astype that I'm missing.
Clarification: For project-specific reasons, I need to use pandas 0.16.2.
Here's an interaction with some sample data that demonstrates the failure:
>>> block = pd.DataFrame({"eps":["", ""]})
>>> block = block.convert_objects(convert_numeric=True)
>>> block["eps"]
0
1
Name: eps, dtype: object
>>> block["eps"].astype('float')
...
ValueError: could not convert string to float:
It's easier to do it using pandas.to_numeric:
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.to_numeric.html
import pandas as pd
df = pd.DataFrame({'eps': ['1', 1.6, '1.6', 'a', '', 'a1']})
df['eps'] = pd.to_numeric(df['eps'], errors='coerce')
errors='coerce' will convert any value that fails to parse into NaN.
df['eps'].astype('float')
0 1.0
1 1.6
2 1.6
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
Then you can apply other functions without getting errors:
df['eps'].round()
0 1.0
1 2.0
2 2.0
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
Alternatively, you can define a small converter function and apply it:
import numpy as np

def convert_float(val):
    try:
        return float(val)
    except ValueError:
        return np.nan

df = pd.DataFrame({'eps': ['1', 1.6, '1.6', 'a', '', 'a1']})
df.eps.apply(convert_float)
0 1.0
1 1.6
2 1.6
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
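On current pandas (where convert_objects is gone), pd.to_numeric handles the all-empty-strings case directly; a sketch:

```python
import pandas as pd

s = pd.Series(["", "", "1.5"])

# errors='coerce' turns unparseable values (including empty strings)
# into NaN, and the result is float64 even if everything was coerced
out = pd.to_numeric(s, errors="coerce")
print(out.dtype)  # float64
```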
