Pandas convert mixed types to string - python

Given the following dataframe:
DF = pd.DataFrame({'COL1': ['A', 'B', 'C', 'D', 'D', 'D'],
                   'mixed': [2016.0, 2017.0, 'sweatervest', 20, 209, 21]})
DF
COL1 mixed
0 A 2016.0
1 B 2017.0
2 C sweatervest
3 D 20
4 D 209
5 D 21
I want to convert 'mixed' to an object column in which all numbers are integers rendered as strings, and all strings remain, of course, strings.
The desired output is as follows:
COL1 mixed
0 A 2016
1 B 2017
2 C sweatervest
3 D 20
4 D 209
5 D 21
Background info:
Originally, 'mixed' was part of a data frame taken from a CSV that mainly consisted of numbers, with some strings here and there. When I tried converting it to string, some numbers ended up with '.0' at the end.

Try:
DF['mixed'] = DF.mixed.astype(object)
This results in:
DF['mixed']
0 2016
1 2017
2 sweatervest
3 20
4 209
5 21
Name: mixed, dtype: object

df.mixed = df.mixed.apply(lambda elt: str(int(elt)) if isinstance(elt, float) else str(elt))
This applies the lambda elt: str(int(elt)) if isinstance(elt, float) else str(elt) to each element of the 'mixed' column: floats are truncated to integers and stringified, while everything else is passed through str().
Note: This assumes that all of your floats are convertible to integers, as you implied in your comments on your question.
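For instance, on a cut-down version of the question's dataframe (a quick sketch):

import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'B', 'C'],
                   'mixed': [2016.0, 'sweatervest', 20]})
df.mixed = df.mixed.apply(lambda elt: str(int(elt)) if isinstance(elt, float) else str(elt))
print(df.mixed.tolist())  # ['2016', 'sweatervest', '20']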

This approach builds upon the answer by gbrener. It iterates over a dataframe looking for mixed dtype columns. For each such mixed column, it first replaces all nan values with pd.NA. It then safely converts its values to strings. It is usable in-place as unmix_dtypes(df). It was tested with Pandas 1 under Python 3.8.
Note that this answer uses assignment expressions, which work only with Python 3.8 or newer. It can, however, trivially be modified to not use them.
from typing import Union

import pandas as pd

def _to_str(val: Union[type(pd.NA), float, int, str]) -> Union[type(pd.NA), str]:
    """Return a string representation of the given integer or rounded float, or otherwise the given string.

    `pd.NA` values are returned as is.

    It can be useful to call `df[col].fillna(value=pd.NA, inplace=True)` before calling this function.
    """
    if val is pd.NA:
        return val
    if isinstance(val, float) and (val % 1 == 0.0):
        return str(int(val))
    if isinstance(val, int):
        return str(val)
    assert isinstance(val, str)
    return val

def unmix_dtypes(df: pd.DataFrame) -> None:
    """Convert mixed dtype columns in the given dataframe to strings.

    Ref: https://stackoverflow.com/a/61826020/
    """
    for col in df.columns:
        if not (orig_dtype := pd.api.types.infer_dtype(df[col])).startswith("mixed"):
            continue
        df[col].fillna(value=pd.NA, inplace=True)
        df[col] = df[col].apply(_to_str)
        if (new_dtype := pd.api.types.infer_dtype(df[col])).startswith("mixed"):
            raise TypeError(f"Unable to convert {col} to a non-mixed dtype. Its previous dtype was {orig_dtype} and new dtype is {new_dtype}.")
Caution: One of the dangers of not specifying an explicit dtype, however, is that a column such as ["012", "0034", "4"] can be read by pd.read_csv as an integer column, thereby irrecoverably losing the leading zeros. What's worse is that if dataframes are concatenated, such a loss of the leading zeros can happen inconsistently, leading to column values such as ["012", "12", "34", "0034"].
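To illustrate with an in-memory CSV (a small sketch):

import io
import pandas as pd

csv_text = "code\n012\n0034\n4\n"
print(pd.read_csv(io.StringIO(csv_text))['code'].tolist())
# [12, 34, 4]  <- the leading zeros are irrecoverably lost
print(pd.read_csv(io.StringIO(csv_text), dtype=str)['code'].tolist())
# ['012', '0034', '4']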

Related

pandas: convert column with multiple datatypes to int, ignore errors

I have a column with data that needs some massaging. The column may contain strings or floats; some strings are in exponential form. I'd like to format all data in this column as a whole number where possible, expanding any exponential notation to an integer. So here is an example:
df = pd.DataFrame({'code': ['1170E1', '1.17E+04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].astype(int, errors='ignore')
The above code does not seem to do a thing. I know I can convert the exponential notation and decimals simply by using the int function, and I would think the above astype would do the same, but it does not. For example, the following code works in Python:
int(1170E1), int(1.17E+04), int(11700.0)
> (11700, 11700, 11700)
Any help in solving this would be appreciated. What I'm expecting the output to look like is:
0 '11700'
1 '11700'
2 '11700'
3 '24477G'
4 '124601'
5 '247602'
You may check with pd.to_numeric
df.code = pd.to_numeric(df.code, errors='coerce').fillna(df.code)
Out[800]:
0 11700.0
1 11700.0
2 11700.0
3 24477G
4 124601.0
5 247602.0
Name: code, dtype: object
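The numeric entries come back with a '.0' suffix because to_numeric produces a float series (NaN forces a float dtype) before fillna restores the strings; the update below assigns the parsed values as ints instead.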
Update
df['code'] = df['code'].astype(object)
s = pd.to_numeric(df['code'], errors='coerce')
df.loc[s.notna(),'code'] = s.dropna().astype(int)
df
Out[829]:
code
0 11700
1 11700
2 11700
3 24477G
4 124601
5 247602
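Assigning through .loc with the notna mask replaces only the parseable entries with Python ints, so the remaining strings are untouched and no '.0' suffix appears.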
BENY's answer should work, although with coerce-and-fill you potentially leave yourself open to catching exceptions and filling values that you don't want to. This will also do the integer conversion you are looking for.
def convert(x):
    try:
        return str(int(float(x)))
    except ValueError:
        return x
df = pd.DataFrame({'code': ['1170E1', '1.17E+04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].apply(convert)
outputs
0 11700
1 11700
2 11700
3 24477G
4 124601
5 247602
where each element is a string.
I will be the first to say, I'm not proud of that triple cast.
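The intermediate float() is what handles the exponential strings: int('1170E1') raises a ValueError, while int(float('1170E1')) yields 11700.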

Why does pandas remove leading zero when writing to a csv?

I have a dataframe that has a column called 'CBG' with numbers as a string value.
CBG acs_total_persons acs_total_housing_units
0 010010211001 1925 1013
1 010030114011 2668 1303
2 010070100043 930 532
When I write it to a CSV file, the leading '0's are removed:
combine_acs_merge.to_csv(new_out_csv, sep=',')
>>> CBG: [0: 10010211001, ...]
It's already a string; how can I keep the leading zero from being removed in the .csv file?
Let's take an example:
Below is your example DataFrame:
>>> df
col1 num
0 One 011
1 two 0123
2 three 0122
3 four 0333
Considering num as an int, you can convert it to str():
>>> df["num"] = df["num"].astype(str)
>>> df.to_csv("datasheet.csv")
Output: you will find the leading zeros are intact.
$ cat datasheet.csv
,col1,num
0,One,011
1,two,0123
2,three,0122
3,four,0333
Or, if you are reading the data from a CSV first, then use the below:
pd.read_csv('test.csv', dtype=str)
However, if your column CBG is already str, then it should be straightforward:
>>> df = pd.DataFrame({'CBG': ["010010211001", "010030114011", "010070100043"],
... 'acs_total_persons': [1925, 2668, 930],
... 'acs_total_housing_units': [1013, 1303, 532]})
>>>
>>> df
CBG acs_total_housing_units acs_total_persons
0 010010211001 1013 1925
1 010030114011 1303 2668
2 010070100043 532 930
>>> df.to_csv("CBG.csv")
result:
$ cat CBG.csv
,CBG,acs_total_housing_units,acs_total_persons
0,010010211001,1013,1925
1,010030114011,1303,2668
2,010070100043,532,930
Pandas doesn't strip padded zeros. You're likely seeing this when opening in Excel. Open the csv in a text editor like notepad++ and you'll see they're still zero padded.
When reading a CSV file pandas tries to convert values in every column to some data type as it sees fit. If it sees a column which contains only digits it will set the dtype of this column to int64. This converts "010010211001" to 10010211001.
If you don't want any data type conversions to happen specify dtype=str when reading in the CSV file.
As per pandas documentation for read_csv https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’} Use str or object
together with suitable na_values settings to preserve and not interpret dtype. If
converters are specified, they will be applied INSTEAD of dtype conversion.
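Applied to the question's data, a minimal sketch (the file names here are assumptions):

import pandas as pd

# Read CBG as a string so pandas never parses it as an integer.
df = pd.read_csv('acs.csv', dtype={'CBG': str})
df.to_csv('acs_out.csv', index=False)  # leading zeros survive the round trip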

How to cast column in Pandas with multiple datatypes?

I've got an ID column with mixed datatypes, which is causing me issues when I pivot. Some IDs are of float type, so I try to cast them to ints, then to strings. If I cast the column as a whole, the string subset throws an error, since it is illogical to cast a string to an int.
I also know that mutating a datatype whilst iterating over a column is a bad idea. Has anyone got any ideas?
Here's a visual representation:
ID
Str
Int
Float
I'm trying to cast them all to strings, and I also want the '.0' ending of the floats to be gone. Any ideas?
Assuming you have a column that consists of integers, floats, and strings, which are all read in as strings from a file, you'll have something like this:
s = pd.Series(['10', '20', '30.4', '40.7', 'text', 'more text', '50.0'])
in which case, you can apply a function to convert the floats to integers, then a second function to convert the integers (back) to strings:
import pandas as pd

def print_type(x):
    print(type(x))
    return x

def to_int(x):
    try:
        # x is a float or an integer, and will be returned as an integer
        return int(pd.to_numeric(x))
    except ValueError:
        # x is a string
        return x

def to_str(x):
    return str(x)
s = pd.Series(['10', '20', '30.4', '40.7', 'text', 'more text', '50.0'])
s2 = s.apply(to_int).apply(to_str)
print("Series s:")
print(s)
print("\nSeries s2:")
print(s2)
print("\nData types of series s2:")
print(s2.apply(print_type))
Here is the output, showing that, in the end, each number has been converted to a string version of an integer:
Series s:
0 10
1 20
2 30.4
3 40.7
4 text
5 more text
6 50.0
dtype: object
Series s2:
0 10
1 20
2 30
3 40
4 text
5 more text
6 50
dtype: object
Data types of series s2:
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
0 10
1 20
2 30
3 40
4 text
5 more text
6 50
dtype: object
Not sure if that's what you're after, but if not, hopefully it will give you an idea of how to get started. This is using Pandas 0.19.2:
In [1]: import pandas as pd
In [2]: print(pd.__version__)
0.19.2

print index and value if value is str in a numeric data type column pandas dataframe

I am new to data science and currently I'm exploring a bit further. I have over 600,000 columns of a data set and I'm currently cleaning and checking it for inconsistency or outliers. I came across a problem which I am not sure how to solve it. I have some solutions in mind but I am not sure how to do it with pandas.
I have converted the data types of some columns from object to int. I got no errors and checked whether they were int, and they were. Then I checked the values of one column for factual correctness. This involves age, and I got an error saying my column has a string. So I checked it using this method:
print('if there is string in numeric column', np.any([isinstance(val, str) for val in homicide_df['Perpetrator Age']]))
Now, I wanted to print all indices and with their values and type only on this column which has the string data type.
currently I came up with this solution that works fine:
def check_type(homicide_df):
    for age in homicide_df['Perpetrator Age']:
        if type(age) is str:
            print(age, type(age))

check_type(homicide_df)
Here are some of the questions I have:
is there a pandas way to do the same thing?
how should I convert these elements to int?
why did some elements in the column not convert to int?
I would appreciate any help. Thank you very much
You can use iteritems:
def check_type(homicide_df):
    for i, age in homicide_df['Perpetrator Age'].iteritems():
        if type(age) is str:
            print(i, age, type(age))
homicide_df = pd.DataFrame({'Perpetrator Age':[10, '15', 'aa']})
print (homicide_df)
Perpetrator Age
0 10
1 15
2 aa
def check_type(homicide_df):
    for i, age in homicide_df['Perpetrator Age'].iteritems():
        if type(age) is str:
            print(i, age, type(age))

check_type(homicide_df)
1 15 <class 'str'>
2 aa <class 'str'>
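Note: Series.iteritems was deprecated in later pandas versions (and removed in pandas 2.0) in favor of Series.items, which behaves the same way.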
If the values are mixed (numeric with non-numeric), it is better to check:
def check_type(homicide_df):
    return homicide_df.loc[homicide_df['Perpetrator Age'].apply(type) == str, 'Perpetrator Age']
print (check_type(homicide_df))
1 15
2 aa
Name: Perpetrator Age, dtype: object
If all values are numeric, but all types are str:
print ((homicide_df['Perpetrator Age'].apply(type)==str).all())
True
homicide_df = pd.DataFrame({'Perpetrator Age':['10', '15']})
homicide_df['Perpetrator Age'] = homicide_df['Perpetrator Age'].astype(int)
print (homicide_df)
Perpetrator Age
0 10
1 15
print (homicide_df['Perpetrator Age'].dtypes)
int32
But if some values are numeric and some are strings:
Convert to int with to_numeric, which replaces non-numeric values with NaN. It is then necessary to replace NaN with some numeric value, such as 0, and finally cast to int:
homicide_df = pd.DataFrame({'Perpetrator Age':[10, '15', 'aa']})
homicide_df['Perpetrator Age'] = pd.to_numeric(homicide_df['Perpetrator Age'], errors='coerce')
print (homicide_df)
Perpetrator Age
0 10.0
1 15.0
2 NaN
homicide_df['Perpetrator Age'] = homicide_df['Perpetrator Age'].fillna(0).astype(int)
print (homicide_df)
Perpetrator Age
0 10
1 15
2 0
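As an aside for newer pandas (0.24+), the nullable Int64 dtype keeps the missing value instead of filling it with 0 (a sketch, not part of the original answer):

homicide_df = pd.DataFrame({'Perpetrator Age': [10, '15', 'aa']})
# astype('Int64') accepts the NaN produced by to_numeric and stores it as <NA>
homicide_df['Perpetrator Age'] = pd.to_numeric(
    homicide_df['Perpetrator Age'], errors='coerce').astype('Int64')
print(homicide_df)
#   Perpetrator Age
# 0              10
# 1              15
# 2            <NA>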

Will na_values in the read_fwf()/read_csv()/read_table() be converted after converter functions are performed?

I want to read a dataframe from a fixed width flat file. This is a somewhat performance sensitive operation.
I would like all blank whitespace to be stripped from the column values. After that whitespace is stripped, I want blank strings to be converted to NaN or None values. Here are the two ideas I had:
pd.read_fwf(path, colspecs=markers, names=columns,
            converters=create_convert_dict(columns))

def create_convert_dict(columns):
    convert_dict = {}
    for col in columns:
        convert_dict[col] = null_convert
    return convert_dict

def null_convert(value):
    value = value.strip()
    if value == "":
        return None
    else:
        return value
or:
pd.read_fwf(path, colspecs=markers, names=columns, na_values='',
            converters=create_convert_dict(columns))

def create_convert_dict(columns):
    convert_dict = {}
    for col in columns:
        convert_dict[col] = col_strip
    return convert_dict

def col_strip(value):
    return value.strip()
The second option depends on the converter (which strips whitespace) being evaluated before na_values.
I was wondering if the second one would work. The reason I am curious is that it seems better to retain NaN as the null value, as opposed to None.
I am also open to any other suggestions for how I might perform this operation (stripping whitespace and then converting blank strings to NaN).
I do not have access to a computer with pandas installed at the moment, which is why I cannot test this myself.
In the case of a fixed-width file, there is no need to do anything special to strip whitespace or handle missing fields. Below is a small example of a fixed-width file with three columns, each of width 5. There is trailing and leading whitespace plus missing data.
In [57]: data = """\
A    B    C
0    foo
3    bar  2.0
1         3.0
"""
In [58]: df = pandas.read_fwf(StringIO(data), widths=[5, 5, 5])
In [59]: df
Out[59]:
A B C
0 0 foo NaN
1 3 bar 2
2 1 NaN 3
In [60]: df.dtypes
Out[60]:
A int64
B object
C float64
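For completeness, the transcript above assumes these imports (on Python 3):

from io import StringIO

import pandas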
