Pandas: Find and print all floats in column - python

I have tried to use
if df.loc[df['col_1']] == float:
    print(df.loc[df['col_1']])
But that doesn't work. I basically just want to find everything of type float in a column, and see what it is and where. How do I go about doing that?
I need to do this because the column is an object according to df.dtypes, but when I try string operations on it I get a TypeError saying there are floats.

Assuming your column's dtype is object (pandas usually keeps a single data type per column), you can map each element to its Python type and compare:
df.col_1.map(type) == float  # returns a boolean Series
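For example, a minimal sketch (the column name col_1 and the sample values are assumptions for illustration):
import pandas as pd

df = pd.DataFrame({'col_1': ['a', 1.5, 'b', 2.0]})
mask = df['col_1'].map(type) == float
print(df.loc[mask, 'col_1'])     # the float values
print(df.index[mask].tolist())   # and where they sit: [1, 3]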

Use a Boolean mask to perform operations only on the strings. This assumes that your series consists only of numeric and string values.
df = pd.DataFrame({'A': [1, 2, 'hello', 'test', 5, 'another']})
num_mask = pd.to_numeric(df['A'], errors='coerce').isnull()
df.loc[num_mask, 'A'] += ' checked!'
print(df)
                  A
0                 1
1                 2
2    hello checked!
3     test checked!
4                 5
5  another checked!
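The inverse mask then lets you treat the numeric rows numerically; continuing the example above (a sketch):
df.loc[~num_mask, 'A'] = pd.to_numeric(df.loc[~num_mask, 'A']) * 10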

Related

How do I apply the string split method on a pandas dataframe based on a condition?

I would like to replace some values in my dataframe that were entered in the wrong format. For example, 850/07-498745 should be 07-498745. I used string split successfully to do so, but it turns all previously correctly formatted strings into NaNs. I tried to base it on a condition, but I still have the same problem. How can I fix it?
Example Input:
mylist = ['850/07-498745', '850/07-148465', '07-499015']
df = pd.DataFrame(mylist)
df.rename(columns={ df.columns[0]: "mycolumn" }, inplace = True)
My Attempt:
df['mycolumn'] = df[df.mycolumn.str.contains('/') == True].mycolumn.str.split('/', 1).str[1]
df
Output and desired output were shown as images in the original post; the attempt leaves NaN in the row that was already correctly formatted.
You can use str.split on '/' and grab the last string from the resulting list:
df['mycolumn'].str.split('/').str[-1]
0 07-498745
1 07-148465
2 07-499015
Name: mycolumn, dtype: object
This would also work, and may help you understand why your original attempt did not:
mask = df.mycolumn.str.contains('/')
df.loc[mask, 'mycolumn'] = df.mycolumn[mask].str.split('/', n=1).str[1]
You were doing df['mycolumn'] = ..., which replaces the entire column with the new Series you formed; the rows your mask filtered out have no counterpart in that Series, so they come back as NaN.
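Putting that together with the sample data (a sketch; the expected result is shown in the comment):
import pandas as pd

df = pd.DataFrame({'mycolumn': ['850/07-498745', '850/07-148465', '07-499015']})
mask = df['mycolumn'].str.contains('/')
df.loc[mask, 'mycolumn'] = df.loc[mask, 'mycolumn'].str.split('/', n=1).str[1]
print(df['mycolumn'].tolist())  # ['07-498745', '07-148465', '07-499015']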
For a regex solution (the optional non-capturing group greedily consumes everything up to the last '/', and the remainder is captured):
df.mycolumn.str.extract('(?:.*/)?(.*)$')[0]
Output:
0 07-498745
1 07-148465
2 07-499015
Name: 0, dtype: object

Pandas: how to identify columns with dtype object but mixed-type items?

In a pandas dataframe, a column with dtype = object can, in fact, contain items of mixed types, e.g. integers and strings.
In this example, column a is dtype object, but the first item is string while all the others are int:
import numpy as np, pandas as pd
df=pd.DataFrame()
df['a']=np.arange(0,9)
df.iloc[0,0]='test'
print(df.dtypes)
print(type(df.iloc[0,0]))
print(type(df.iloc[1,0]))
My question is: is there a quick way to identify which columns with dtype=object contain, in fact, mixed types like above? Since pandas does not have a dtype = str, this is not immediately apparent.
However, I have had situations where, importing a large csv file into pandas, I would get a warning like:
sys:1: DtypeWarning: Columns (15,16) have mixed types. Specify dtype option on import or set low_memory=False
Is there an easy way to replicate that and explicitly list the columns with mixed types? Or do I manually have to go through them one by one, see if I can convert them to string, etc?
The background is that I am trying to export a dataframe to a Microsoft SQL Server using DataFrame.to_sql and SQLAlchemy. I get an
OverflowError: int too big to convert
but my dataframe does not contain columns with dtype int - only object and float64. I'm guessing this is because one of the object columns must have both strings and integers.
Setup
import numpy as np, pandas as pd

df = pd.DataFrame(np.ones((3, 3)), columns=list('WXY')).assign(Z='c')
df.iloc[0, 0] = 'a'
df.iloc[1, 2] = 'b'
df
W X Y Z
0 a 1.0 1 c
1 1 1.0 b c
2 1 1.0 1 c
Solution
Find all types and count how many unique ones per column.
df.loc[:, df.applymap(type).nunique().gt(1)]
W Y
0 a 1
1 1 b
2 1 1
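To list only the offending columns together with the types they mix, something like this works (a sketch; note that applymap was renamed to DataFrame.map in pandas 2.1):
mixed = df.columns[df.applymap(type).nunique() > 1]
for col in mixed:
    print(col, df[col].map(type).value_counts().to_dict())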

How to round/remove trailing ".0" zeros in pandas column?

I'm trying to see if I can remove the trailing zeros from this phone number column.
Example:
0
1 8.00735e+09
2 4.35789e+09
3 6.10644e+09
The type of this column is object. I tried to round it but I am getting an error. I checked a couple of the values and know they are in the format "8007354384.0"; I want to get rid of the trailing ".0" (the decimal point and zero).
Sometimes I receive them in this format and sometimes I don't; then they are integer numbers. I would like to check whether a phone value has a trailing ".0" and, if so, remove it.
I have this code but I'm stuck on how to check each row for the trailing characters:
data.ix[data.phone.str.contains('.0'), 'phone']
I get an error: *** ValueError: cannot index with vector containing NA / NaN values. I believe the issue is that some rows hold empty data, which I do sometimes receive. The code above should be able to skip an empty row.
Does anybody have any suggestions? I'm new to pandas but so far it's a useful library. Your help will be appreciated.
Note
In the example above the first row holds empty data, which I sometimes get. Just to be clear, this is not represented as 0 for the phone number.
Also, empty data is considered a string, so the column is a mix of floats and strings when rows are empty.
Use astype(np.int64) on the rows that successfully parse as numbers:
import numpy as np
import pandas as pd

s = pd.Series(['', 8.00735e+09, 4.35789e+09, 6.10644e+09])
mask = pd.to_numeric(s, errors='coerce').notnull()  # errors='coerce' keeps the empty string from raising
s.loc[mask] = s.loc[mask].astype(np.int64)
s
0
1 8007350000
2 4357890000
3 6106440000
dtype: object
In Pandas/NumPy, integers are not allowed to take NaN values, and arrays/series (including dataframe columns) are homogeneous in their datatype --- so having a column of integers where some entries are None/np.nan is downright impossible.
EDIT:
data.phone.astype('object')
should do the trick; in this case, Pandas treats your column as a series of generic Python objects, rather than a specific datatype (e.g. str/float/int), at the cost of performance if you intend to run any heavy computations with this data (probably not in your case).
Assuming you want to keep those NaN entries, your approach of converting to strings is a valid possibility:
data.phone.astype(str).str.split('.', expand = True)[0]
should give you what you're looking for (there are alternative string methods you can use, such as .replace or .extract, but .split seems the most straightforward in this case).
Alternatively, if you are only interested in the display of floats (unlikely, I'd suppose), you can do pd.set_option('display.float_format', '{:.0f}'.format), which doesn't actually affect your data.
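For completeness, a quick sketch showing that only the rendering changes (pd.reset_option restores the default afterwards):
pd.set_option('display.float_format', '{:.0f}'.format)
print(pd.Series([8.00735e+09]))  # displays 8007350000; the stored value is unchanged
pd.reset_option('display.float_format')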
This answer by cs95 removes the trailing ".0" in one line:
df = df.round(decimals=0).astype(object)
import numpy as np
import pandas as pd

s = pd.Series([None, np.nan, '', 8.00735e+09, 4.35789e+09, 6.10644e+09])
s_new = s.fillna('').astype(str).str.replace('.0', '', regex=False)
s_new
Here I filled the null values with an empty string, converted the series to string type, and replaced '.0' with an empty string.
This outputs:
0
1
2
3 8007350000
4 4357890000
5 6106440000
dtype: object
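One caveat, which is my addition rather than part of the answer above: with regex=False the literal '.0' is replaced wherever it occurs, so a value like '1.05' would be mangled into '15'. Anchoring the pattern to the end of the string is safer:
s_new = s.fillna('').astype(str).str.replace(r'\.0$', '', regex=True)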
Just do
data['phone'] = data['phone'].astype(str)
data['phone'] = data['phone'].str.replace(r'\.0$', '', regex=True)
which uses a regex lookup on every entry in the column and strips a trailing '.0'. (An unanchored pattern '.0' would be a regex matching any character followed by '0', which mangles numbers such as '8007350000.0', and replacing with a space leaves trailing whitespace.) For example
data = pd.DataFrame(
data = [['bob','39384954.0'],['Lina','23827484.0']],
columns = ['user','phone'], index = [1,2]
)
data['phone'] = data['phone'].astype(str)
data['phone'] = data['phone'].str.replace(r'\.0$', '', regex=True)
print(data)
user phone
1 bob 39384954
2 Lina 23827484
Pandas assigns data types by looking at the data. When you have mixed data, e.g. some rows are NaN and some hold int values, there is a strong possibility it will assign dtype object or float64.
EX 1:
import pandas as pd
data = [['tom', 10934000000], ['nick', 1534000000], ['juli', 1412000000]]
df = pd.DataFrame(data, columns = ['Name', 'Phone'])
>>> df
Name Phone
0 tom 10934000000
1 nick 1534000000
2 juli 1412000000
>>> df.dtypes
Name object
Phone int64
dtype: object
In the example above pandas infers int64 because no row has NaN and every row in the Phone column holds an integer value.
EX 2:
>>> data = [['tom'], ['nick', 1534000000], ['juli', 1412000000]]
>>> df = pd.DataFrame(data, columns = ['Name', 'Phone'])
>>> df
Name Phone
0 tom NaN
1 nick 1.534000e+09
2 juli 1.412000e+09
>>> df.dtypes
Name object
Phone float64
dtype: object
To answer your actual question: to get rid of the .0 at the end, you can do something like this.
Solution 1:
>>> data = [['tom', 9785000000.0], ['nick', 1534000000.0], ['juli', 1412000000]]
>>> df = pd.DataFrame(data, columns = ['Name', 'Phone'])
>>> df
Name Phone
0 tom 9.785000e+09
1 nick 1.534000e+09
2 juli 1.412000e+09
>>> df['Phone'] = df['Phone'].astype(int).astype(str)
>>> df
Name Phone
0 tom 9785000000
1 nick 1534000000
2 juli 1412000000
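A caveat worth adding here (my note, not part of the original answer): astype(int) raises an error if the column contains NaN. Recent pandas versions offer the nullable integer dtype as a way around that:
df['Phone'] = df['Phone'].astype('Int64').astype(str)  # missing values render as '<NA>'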
Solution 2:
>>> df['Phone'] = df['Phone'].astype(str).str.replace('.0', '', regex=False)
>>> df
Name Phone
0 tom 9785000000
1 nick 1534000000
2 juli 1412000000
Try str.isnumeric with astype and loc:
import numpy as np
import pandas as pd

s = pd.Series(['', 8.00735e+09, 4.35789e+09, 6.10644e+09])
c = s.str.isnumeric().astype(bool)  # non-strings yield NaN here, and astype(bool) turns NaN into True
s.loc[c] = s.loc[c].astype(np.int64)
print(s)
Outputs:
0
1 8007350000
2 4357890000
3 6106440000
dtype: object
Here is a solution using pandas nullable integers (the solution assumes that input Series values are either empty strings or floating point numbers):
import pandas as pd, numpy as np
s = pd.Series(['', 8.00735e+09, 4.35789e+09, 6.10644e+09])
s.replace('', np.nan).astype('Int64')
Output (pandas-0.25.1):
0 NaN
1 8007350000
2 4357890000
3 6106440000
dtype: Int64
Advantages of the solution:
The output values are either integers or missing values (not 'object' data type)
Efficient
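If you then want the clean string form the question asked for, one possible follow-up (a sketch; the '<NA>' text and its replacement assume the string rendering of recent pandas):
s.replace('', np.nan).astype('Int64').astype(str).replace('<NA>', '')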
It depends on the format in which the telephone number is stored.
If it is in a numeric format, converting to an integer might solve the problem:
df = pd.DataFrame({'TelephoneNumber': [123.0, 234]})
df['TelephoneNumber'] = df['TelephoneNumber'].astype('int32')
If it is really a string, you can replace and re-assign the column:
df2 = pd.DataFrame({'TelephoneNumber': ['123.0', '234']})
df2['TelephoneNumber'] = df2['TelephoneNumber'].str.replace(r'\.0$', '', regex=True)
import numpy as np

tt = 8.00735e+09
time = int(np.format_float_positional(tt)[:-1])  # format_float_positional gives '8007350000.'; [:-1] strips the trailing '.'
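A variant for a whole Series uses the trim parameter of the same NumPy function, which drops trailing zeros and the decimal point directly (a sketch, assuming every value is a float):
import numpy as np
import pandas as pd

s = pd.Series([8.00735e+09, 4.35789e+09])
print(s.map(lambda x: np.format_float_positional(x, trim='-')))  # '8007350000', '4357890000'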
If anybody is still interested:
I had the problem that rounding the df left trailing zeros, so I did the following.
new_df = np.round(old_df, 3).astype(str)
Then all trailing zeros were gone in new_df.
I was also facing the same problem with empty strings in some rows.
The most helpful answer on the linked question Python - Remove decimal and zero from string helped me.

Select rows from a DataFrame based on the type of the object (i.e. str)

So there's a DataFrame say:
>>> df = pd.DataFrame({
... 'A':[1,2,'Three',4],
... 'B':[1,'Two',3,4]})
>>> df
A B
0 1 1
1 2 Two
2 Three 3
3 4 4
I want to select the rows where the value in a particular column is of type str.
For example, I want to select the rows where the data in column A is a str.
so it should print something like:
A B
2 Three 3
Intuitive code would be something like:
df[type(df.A) == str]
which obviously doesn't work!
Thanks, please help!
This works:
df[df['A'].apply(lambda x: isinstance(x, str))]
You can do something similar to what you're asking with
In [14]: df[pd.to_numeric(df.A, errors='coerce').isnull()]
Out[14]:
A B
2 Three 3
Why only similar? Because Pandas stores things in homogeneous columns (all entries in a column are of the same type). Even though you constructed the DataFrame from heterogeneous types, they are all made into columns each of the lowest common denominator:
In [16]: df.A.dtype
Out[16]: dtype('O')
Consequently, you can't ask which rows are of what type - they will all be of the same type. What you can do is to try to convert the entries to numbers, and check where the conversion failed (this is what the code above does).
It's generally a bad idea to use a series to hold mixed numeric and non-numeric types. This gives your series dtype object, which is nothing more than a sequence of pointers, much like a list; indeed, many operations on such series can be processed more efficiently with a list.
With this disclaimer, you can use Boolean indexing via a list comprehension:
res = df[[isinstance(value, str) for value in df['A']]]
print(res)
A B
2 Three 3
The equivalent is possible with pd.Series.apply, but this is no more than a thinly veiled loop and may be slower than the list comprehension:
res = df[df['A'].apply(lambda x: isinstance(x, str))]
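If you want to check that claim on your own data, a small timing sketch (the repeated sample frame is an assumption for illustration):
import timeit
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 'Three', 4] * 25000, 'B': [1, 'Two', 3, 4] * 25000})
t_comp = timeit.timeit(lambda: df[[isinstance(v, str) for v in df['A']]], number=10)
t_apply = timeit.timeit(lambda: df[df['A'].apply(lambda x: isinstance(x, str))], number=10)
print(f'list comprehension: {t_comp:.3f}s, apply: {t_apply:.3f}s')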
If you are certain all non-numeric values must be strings, then you can convert to numeric and look for nulls, i.e. values that cannot be converted:
res = df[pd.to_numeric(df['A'], errors='coerce').isnull()]

Convert all elements in float Series to integer

I have a column with float values in a dataframe (so I am calling this column a float Series). I want to convert all the values to integers, or just round them so that there are no decimals.
Let us say the dataframe is df and the column is a. I tried this:
df['a'] = round(df['a'])
I got an error saying this method can't be applied to a Series, only to individual values.
Next I tried this :
for obj in df['a']:
    obj = int(round(obj))
After this I printed df but there was no change.
Where am I going wrong?
round won't work here because it's being called on a pandas Series, which is array-like rather than a scalar value. There is the built-in method pd.Series.round to operate on the whole array, after which you can change the dtype using astype:
In [43]:
df = pd.DataFrame({'a':np.random.randn(5)})
df['a'] = df['a'] * 100
df
Out[43]:
a
0 -4.489462
1 -133.556951
2 -136.397189
3 -106.993288
4 -89.820355
In [45]:
df['a'] = df['a'].round(0).astype(int)
df
Out[45]:
a
0 -4
1 -134
2 -136
3 -107
4 -90
It's also unnecessary to iterate over the rows when vectorised methods are available.
Moreover, this:
for obj in df['a']:
    obj = int(round(obj))
does not mutate the individual cells in the Series; it operates on a copy of each value, which is why the df is not mutated.
The code in your loop:
obj = int(round(obj))
Only changes which object the name obj refers to. It does not modify the data stored in the series. If you want to do this you need to know where in the series the data is stored and update it there.
E.g.
for i, num in enumerate(df['a']):
    df.loc[df.index[i], 'a'] = int(round(num))
(Chained indexing such as df['a'].iloc[i] = ... may write to a copy, so df.loc is used here to write back reliably.)
When converting a float to an integer, I found out using df.dtypes that the column I was trying to round was an object, not a float. The round command won't work on objects, so to do the conversion I did:
df['a'] = pd.to_numeric(df['a'])
df['a'] = df['a'].round(0).astype(int)
or as one line:
df['a'] = pd.to_numeric(df['a']).round(0).astype(int)
If you specifically want to round up as your question states, you can use np.ceil:
import numpy as np
df['a'] = np.ceil(df['a'])
See also Floor or ceiling of a pandas series in python?
Not sure there's much advantage to type converting to int; pandas and numpy love floats.
