I'm trying to see if I can remove the trailing zeros from this phone number column.
Example:
0
1 8.00735e+09
2 4.35789e+09
3 6.10644e+09
The type in this column is object. I tried to round it, but I am getting an error. I checked a couple of the values and I know they are in the format "8007354384.0", and I want to get rid of the trailing zero together with the decimal point.
Sometimes I receive them in this format and sometimes I don't; then they are plain integers. I would like to check whether the phone column has a trailing ".0" and, if so, remove it.
I have this code but I'm stuck on how to check for trailing zeros for each row.
data.ix[data.phone.str.contains('.0'), 'phone']
I get the error *** ValueError: cannot index with vector containing NA / NaN values. I believe the issue is that some rows have empty data, which I sometimes receive. The code above should be able to skip an empty row.
Does anybody have any suggestions? I'm new to pandas, but so far it's a useful library. Your help will be appreciated.
Note
In the example provided above, the first row has empty data, which I sometimes get. Just to make sure: this is not represented as 0 for the phone number.
Also, empty data is considered a string, so the column is a mix of floats and strings when rows are empty.
Use astype(np.int64):
import numpy as np
import pandas as pd

s = pd.Series(['', 8.00735e+09, 4.35789e+09, 6.10644e+09])
mask = pd.to_numeric(s, errors='coerce').notnull()  # coerce the empty string to NaN instead of raising
s.loc[mask] = s.loc[mask].astype(np.int64)
s
0
1 8007350000
2 4357890000
3 6106440000
dtype: object
In pandas/NumPy, plain integer dtypes cannot hold NaN values, and arrays/Series (including DataFrame columns) are homogeneous in their datatype, so a column of native int64 where some entries are None/np.nan is downright impossible (the nullable Int64 extension type, covered in a later answer, is the exception).
EDIT: data.phone.astype('object')
should do the trick; in this case, Pandas treats your column as a series of generic Python objects, rather than a specific datatype (e.g. str/float/int), at the cost of performance if you intend to run any heavy computations with this data (probably not in your case).
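To make both points concrete, here is a minimal sketch (the values are illustrative, not from the question's data):
import numpy as np
import pandas as pd

# Mixing integers with a missing value forces the whole Series to float64.
print(pd.Series([8007354384, np.nan]).dtype)                  # float64

# dtype=object keeps each element's own Python type instead.
print(pd.Series([8007354384, np.nan], dtype=object).iloc[0])  # 8007354384, still an int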
Assuming you want to keep those NaN entries, your approach of converting to strings is a valid possibility:
data.phone.astype(str).str.split('.', expand = True)[0]
should give you what you're looking for (there are alternative string methods you can use, such as .replace or .extract, but .split seems the most straightforward in this case).
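For comparison, a minimal sketch of the .extract alternative mentioned above; it captures only the leading digits, so the ".0" tail is dropped (the frame here is illustrative):
import numpy as np
import pandas as pd

data = pd.DataFrame({'phone': [np.nan, 8.00735e+09, 4.35789e+09]})
# NaN becomes the string 'nan' after astype(str), which matches zero digits
# and therefore yields an empty string.
print(data['phone'].astype(str).str.extract(r'^(\d*)')[0])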
Alternatively, if you are only interested in the display of floats (unlikely I'd suppose), you can do pd.set_option('display.float_format','{:.0f}'.format), which doesn't actually affect your data.
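A quick sketch showing that this option only changes rendering, not the stored values:
import pandas as pd

s = pd.Series([8.00735e+09])
pd.set_option('display.float_format', '{:.0f}'.format)
print(s)          # rendered as 8007350000
print(s.iloc[0])  # still the float 8007350000.0 underneath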
This answer by cs95 removes the trailing ".0" in one line:
df = df.round(decimals=0).astype(object)
import numpy as np
import pandas as pd
s = pd.Series([None, np.nan, '', 8.00735e+09, 4.35789e+09, 6.10644e+09])
s_new = s.fillna('').astype(str).str.replace('.0', '', regex=False)
s_new
Here I filled the null values with an empty string, converted the series to string type, and replaced '.0' with an empty string.
This outputs:
0
1
2
3 8007350000
4 4357890000
5 6106440000
dtype: object
Just do
data['phone'] = data['phone'].astype(str)
data['phone'] = data['phone'].str.replace('.0', '', regex=False)
which does a literal (non-regex) lookup on all entries in the column and replaces any '.0' match with an empty string; without regex=False, the '.' would match any character. For example
data = pd.DataFrame(
data = [['bob','39384954.0'],['Lina','23827484.0']],
columns = ['user','phone'], index = [1,2]
)
data['phone'] = data['phone'].astype(str)
data['phone'] = data['phone'].str.replace('.0', '', regex=False)
print(data)
user phone
1 bob 39384954
2 Lina 23827484
Pandas automatically assigns a data type by looking at the type of the data. When you have mixed types, for example some rows are NaN while others hold integer values, there is a strong possibility it will assign dtype object or float64.
EX 1:
import pandas as pd
data = [['tom', 10934000000], ['nick', 1534000000], ['juli', 1412000000]]
df = pd.DataFrame(data, columns = ['Name', 'Phone'])
>>> df
Name Phone
0 tom 10934000000
1 nick 1534000000
2 juli 1412000000
>>> df.dtypes
Name object
Phone int64
dtype: object
In the example above, pandas assumes int64 because no row has NaN and every row in the Phone column holds an integer value.
EX 2:
>>> data = [['tom'], ['nick', 1534000000], ['juli', 1412000000]]
>>> df = pd.DataFrame(data, columns = ['Name', 'Phone'])
>>> df
Name Phone
0 tom NaN
1 nick 1.534000e+09
2 juli 1.412000e+09
>>> df.dtypes
Name object
Phone float64
dtype: object
To answer your actual question: to get rid of the .0 at the end, you can do something like this.
Solution 1:
>>> data = [['tom', 9785000000.0], ['nick', 1534000000.0], ['juli', 1412000000]]
>>> df = pd.DataFrame(data, columns = ['Name', 'Phone'])
>>> df
Name Phone
0 tom 9.785000e+09
1 nick 1.534000e+09
2 juli 1.412000e+09
>>> df['Phone'] = df['Phone'].astype(int).astype(str)
>>> df
Name Phone
0 tom 9785000000
1 nick 1534000000
2 juli 1412000000
Solution 2:
>>> df['Phone'] = df['Phone'].astype(str).str.replace('.0', '', regex=False)
>>> df
Name Phone
0 tom 9785000000
1 nick 1534000000
2 juli 1412000000
Try str.isnumeric with astype and loc:
import numpy as np
import pandas as pd

s = pd.Series(['', 8.00735e+09, 4.35789e+09, 6.10644e+09])
c = s.str.isnumeric().astype(bool)  # float entries yield NaN here, and astype(bool) maps NaN to True
s.loc[c] = s.loc[c].astype(np.int64)
And now print(s) outputs:
0
1 8007350000
2 4357890000
3 6106440000
dtype: object
Here is a solution using pandas nullable integers (the solution assumes that input Series values are either empty strings or floating point numbers):
import pandas as pd, numpy as np
s = pd.Series(['', 8.00735e+09, 4.35789e+09, 6.10644e+09])
s.replace('', np.nan).astype('Int64')
Output (pandas-0.25.1):
0 NaN
1 8007350000
2 4357890000
3 6106440000
dtype: Int64
Advantages of the solution:
The output values are either integers or missing values (not 'object' data type)
Efficient
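As a follow-up sketch, if you eventually need strings without the ".0" (assuming pandas >= 1.0 for the 'string' dtype), the nullable column converts cleanly:
import numpy as np
import pandas as pd

s = pd.Series(['', 8.00735e+09, 4.35789e+09, 6.10644e+09])
nullable = s.replace('', np.nan).astype('Int64')
print(nullable.astype('string'))  # '8007350000' etc.; missing values stay <NA>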
It depends on the format the telephone number is stored in.
If it is in a numeric format, converting to an integer might solve the problem:
df = pd.DataFrame({'TelephoneNumber': [123.0, 234]})
df['TelephoneNumber'] = df['TelephoneNumber'].astype('int32')
If it is really a string, you can replace and re-assign the column:
df2 = pd.DataFrame({'TelephoneNumber': ['123.0', '234']})
df2['TelephoneNumber'] = df2['TelephoneNumber'].str.replace('.0', '', regex=False)
import numpy as np

tt = 8.00735e+09
# format_float_positional(tt) gives '8007350000.'; slicing off the trailing '.' leaves a clean integer string
time = int(np.format_float_positional(tt)[:-1])
If somebody is still interested:
I had the problem that I rounded the df and got trailing zeros.
Here is what I did.
new_df = np.round(old_df,3).astype(str)
Then all trailing zeros were gone in the new_df.
I was also facing the same problem, with empty strings in some rows.
The most helpful answer on this Python - Remove decimal and zero from string link helped me.
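A minimal sketch of that idea, using an anchored regex so that only a trailing '.0' is removed and empty strings pass through untouched:
import pandas as pd

s = pd.Series(['', '8007354384.0', '4357894384'])
print(s.str.replace(r'\.0$', '', regex=True))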
Related
I have an example df:
df = pd.DataFrame({'A': ['100,100', '200,200'],
'B': ['200,100,100', '100']})
A B
0 100,100 200,100,100
1 200,200 100
and I want to replace the commas ',' with nothing (basically, remove them). You can probably guess a real-world application, as much data is written with thousands separators; feel free to introduce me to a better method.
Now, I read the documentation for df.replace() here and I tried several versions of code. It raises no error, but does not modify my data frame.
df = df.replace(',','')
df = df.replace({',': ''})
df = df.replace([','],'')
df = df.replace([','],[''])
I can get it working when specifying the column names and using the .str.replace() method for Series, but imagine having 20 columns. I can also get it working by specifying columns in the df.replace() method, but there must be a more convenient way for such an easy task. I could write a custom function, but pandas is such an amazing library that it must be something I am missing.
This works:
df['A'] = df['A'].str.replace(',','')
Thank you!
df.replace has a regex parameter; set it to True for partial matches.
By default the regex param is False; when it is False, only exact/full matches are replaced.
From Pandas docs:
str: string exactly matching to_replace will be replaced with the value.
df.replace(',', '', regex=True)
A B
0 100100 200100100
1 200200 100
In pd.Series.str.replace, by contrast, the regex param is True by default.
From docs:
Equivalent to str.replace() or re.sub(), depending on the regex value.
Determines if the passed-in pattern is a regular expression:
If True, assumes the passed-in pattern is a regular expression.
If False, treats the pattern as a literal string
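To see the exact-vs-partial difference side by side, on the question's frame:
import pandas as pd

df = pd.DataFrame({'A': ['100,100', '200,200'],
                   'B': ['200,100,100', '100']})

print(df.replace(',', ''))              # no cell equals ',' exactly, so nothing changes
print(df.replace(',', '', regex=True))  # commas removed in every column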
Though your immediate question has probably been answered, I wanted to mention that if you are reading this data in from a csv file, you can pass the thousands argument with a comma ',' so that these values are parsed as integers with the commas removed:
import io
import pandas as pd
csv_file = io.StringIO("""
A,B,C
"1,000","2,000","3,000"
1,2,3
"50,000",50,5
""")
df = pd.read_csv(csv_file, thousands=",")
print(df)
A B C
0 1000 2000 3000
1 1 2 3
2 50000 50 5
print(df.dtypes)
A int64
B int64
C int64
dtype: object
I have a short script to pivot data. The first column is a 9 digit ID number, often beginning with zeros such as 000123456
Here is the script:
df = pd.read_csv('source')
new_df = df.pivot_table(index = 'id', columns = df.groupby('id').cumcount().add(1), values = ['prog_id', 'prog_type'], aggfunc='first').sort_index(axis=1,level=1)
new_df.columns = [f'{x}_{y}' for x,y in new_df.columns]
new_df.to_csv('destination')
print(new_df)
Although the CSV is being read with an id such as 000123456, the output only contains 123456
Even when setting an explicit dtype, Pandas removes the leading zeros. Is there a work around for telling Pandas to leave the leading zeros?
Per comment on original post, set dtype as string:
df = pd.read_csv('source', dtype={'id': str})
You could use the string method zfill() right after reading your csv file "source". Basically, you pad the values of the 'id' column with as many zeros as needed, in this particular case making the number 9 digits long (3 zeros + 6 original digits). So we would have:
df = pd.read_csv('source')
df['id'] = df['id'].astype(str).str.zfill(9)
# (...)
I would like to replace some values in my dataframe that were entered in the wrong format. For example, 850/07-498745 should be 07-498745. I used a string split successfully to do so; however, it turns all previously correctly formatted strings into NaNs. I tried to base it on a condition, but I still have the same problem. How can I fix it?
Example Input:
mylist = ['850/07-498745', '850/07-148465', '07-499015']
df = pd.DataFrame(mylist)
df.rename(columns={ df.columns[0]: "mycolumn" }, inplace = True)
My Attempt:
df['mycolumn'] = df[df.mycolumn.str.contains('/') == True].mycolumn.str.split('/', 1).str[1]
df
Output:
    mycolumn
0  07-498745
1  07-148465
2        NaN
What I wanted:
    mycolumn
0  07-498745
1  07-148465
2  07-499015
You can use split on '/' and grab the last string from the resulting list:
df['mycolumn'].str.split('/').str[-1]
0 07-498745
1 07-148465
2 07-499015
Name: mycolumn, dtype: object
This would also work, and may help you understand why your original attempt did not:
mask = df.mycolumn.str.contains('/')
df.mycolumn.loc[mask] = df.mycolumn[mask].str.split('/', n=1).str[1]
You were doing df['mycolumn'] = ..., which I believe is just replacing the entire Series for that column with the new one you formed.
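A small sketch of that alignment effect, using the question's values:
import pandas as pd

s = pd.Series(['850/07-498745', '850/07-148465', '07-499015'])
partial = s[s.str.contains('/')].str.split('/', n=1).str[1]
# partial only has indexes 0 and 1; re-aligning it to the full index
# leaves index 2 as NaN, which is exactly the symptom in the question.
print(partial.reindex(s.index))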
For a regex solution:
df.mycolumn.str.extract('(?:.*/)?(.*)$')[0]
Output:
0 07-498745
1 07-148465
2 07-499015
Name: 0, dtype: object
I have tried to use
if df.loc[df['col_1']] == float:
print(df.loc[df['col_1']])
But that doesn't work. I basically just want to find everything of the datatype float in a column and see what it is and where. How do I go about doing that?
I need to do this because the column is object according to df.dtypes, but when I try to do string operations on it, I get a TypeError that there are floats.
I assume your column type is object; usually pandas has only one data type per column.
df.col_1.map(type) == float  # returns a Boolean mask
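A usage sketch on a small illustrative frame (col_1 is the question's column name):
import pandas as pd

df = pd.DataFrame({'col_1': [1.5, 'hello', 2.0, 'world']})
is_float_row = df['col_1'].map(type) == float
print(df[is_float_row])  # only the rows holding Python floats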
Use a Boolean mask to perform operations only on strings. This assumes that your series only consist of numeric and string types.
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 'hello', 'test', 5, 'another']})
num_mask = pd.to_numeric(df['A'], errors='coerce').isnull()  # True where the value is not numeric
df.loc[num_mask, 'A'] += ' checked!'
print(df)
A
0 1
1 2
2 hello checked!
3 test checked!
4 5
5 another checked!
Let's say df is a pandas DataFrame.
I would like to find all columns of numeric type.
Something like:
isNumeric = is_numeric(df)
You could use the select_dtypes method of DataFrame. It takes two parameters, include and exclude. So isNumeric would look like:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
Simple one-line answer to create a new dataframe with only numeric columns:
df.select_dtypes(include=np.number)
If you want the names of numeric columns:
df.select_dtypes(include=np.number).columns.tolist()
Complete code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': range(7, 10),
'B': np.random.rand(3),
'C': ['foo','bar','baz'],
'D': ['who','what','when']})
df
# A B C D
# 0 7 0.704021 foo who
# 1 8 0.264025 bar what
# 2 9 0.230671 baz when
df_numerics_only = df.select_dtypes(include=np.number)
df_numerics_only
# A B
# 0 7 0.704021
# 1 8 0.264025
# 2 9 0.230671
colnames_numerics_only = df.select_dtypes(include=np.number).columns.tolist()
colnames_numerics_only
# ['A', 'B']
You can use the undocumented function _get_numeric_data() to filter only numeric columns:
df._get_numeric_data()
Example:
In [32]: data
Out[32]:
A B
0 1 s
1 2 s
2 3 s
3 4 s
In [33]: data._get_numeric_data()
Out[33]:
A
0 1
1 2
2 3
3 4
Note that this is a "private method" (i.e., an implementation detail) and is subject to change or total removal in the future. Use with caution.
df.select_dtypes(exclude=['object'])
Update:
df.select_dtypes(include=np.number)
or, with a newer version of pandas:
df.select_dtypes('number')
Simple one-liner:
df.select_dtypes('number').columns
The following code will return a list of the names of the numeric columns of a data set:
cnames = list(marketing_train.select_dtypes(exclude=['object']).columns)
Here marketing_train is my data set, select_dtypes() selects columns by data type using the exclude and include arguments, and columns fetches the column names of the data set.
The output of the above code will be the following:
['custAge',
'campaign',
'pdays',
'previous',
'emp.var.rate',
'cons.price.idx',
'cons.conf.idx',
'euribor3m',
'nr.employed',
'pmonths',
'pastEmail']
This is another simple way to find the numeric columns in a pandas data frame:
numeric_clmns = df.dtypes[df.dtypes != "object"].index
We can include and exclude data types as per the requirement, as below:
train.select_dtypes(include=None, exclude=None)
train.select_dtypes(include='number')  # will include all the numeric types
Referred from the select_dtypes docstring (as shown in a Jupyter Notebook):
- To select all numeric types, use np.number or 'number'
- To select strings you must use the object dtype, but note that this will return all object dtype columns
- See the NumPy dtype hierarchy: http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html
- To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
- To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
- To select Pandas categorical dtypes, use 'category'
- To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
Although this is an old subject, I think the following formula is easier than all the other suggestions:
df[df.describe().columns]
Since describe() only works on numeric columns by default, the columns of its output are exactly the numeric ones. (Edge case: if the frame has no numeric columns at all, describe() falls back to describing the object columns, so this trick fails there.)
Please see the below code:
if dataset.select_dtypes(include=[np.number]).shape[1] > 0:
    display(dataset.select_dtypes(include=[np.number]).describe())

if dataset.select_dtypes(include=['object']).shape[1] > 0:
    display(dataset.select_dtypes(include=['object']).describe())
This way you can check whether the values are numeric, such as float and int, or strings. The second if statement checks for string values, which fall under the object dtype.
Adapting this answer, you could do
df.loc[:, df.applymap(np.isreal).all(axis=0)]
Here, df.applymap(np.isreal) shows whether every cell in the data frame is numeric, and .all(axis=0) checks whether all values in each column are True, returning a series of Booleans that can be used to index the desired columns.
A lot of the posted answers are inefficient. These answers either return/select a subset of the original dataframe (a needless copy) or perform needless computational statistics in the case of describe().
To just get the column names that are numeric, one can use a conditional list comprehension with the pd.api.types.is_numeric_dtype function:
numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
I'm not sure when this function was introduced.
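A quick usage sketch on an illustrative frame:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 0.25], 'C': ['x', 'y']})
numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
print(numeric_cols)  # ['A', 'B']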
def is_type(df, baseType):
    import numpy as np
    import pandas as pd
    test = [issubclass(np.dtype(d).type, baseType) for d in df.dtypes]
    return pd.DataFrame(data=test, index=df.columns, columns=['test'])

def is_float(df):
    import numpy as np
    return is_type(df, np.floating)  # np.floating matches all NumPy float dtypes

def is_number(df):
    import numpy as np
    return is_type(df, np.number)

def is_integer(df):
    import numpy as np
    return is_type(df, np.integer)
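For what it's worth, a usage sketch for these helpers on a small illustrative frame:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 0.25], 'C': ['x', 'y']})
print(is_number(df))
#     test
# A   True
# B   True
# C  False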