Pandas - Remove leading Zeros from String but not from Integers


I currently have a column in my dataset that looks like the following:
Identifier
09325445
02242456
00MatBrown
0AntonioK
065824245
The column data type is object.
What I'd like to do is remove the leading zeros only from rows whose value is a text string. I want to keep the leading zeros in rows whose value consists only of digits.
Result I'm looking to achieve:
Identifier
09325445
02242456
MatBrown
AntonioK
065824245
Code I am currently using (that isn't working):
def removeLeadingZeroFromString(row):
    if df['Identifier'] == str:
        return df['Identifier'].str.strip('0')
    else:
        return df['Identifier']
df['Identifier'] = df.apply(lambda row: removeLeadingZeroFromString(row), axis=1)

One approach is to try converting Identifier with pd.to_numeric. Values that cannot be converted become NaN, so isna gives a mask of the non-numeric rows; apply str.lstrip (which strips leading characters only) to just those rows:
m = pd.to_numeric(df['Identifier'], errors='coerce').isna()
df.loc[m, 'Identifier'] = df.loc[m, 'Identifier'].str.lstrip('0')
df:
Identifier
0 09325445
1 02242456
2 MatBrown
3 AntonioK
4 065824245
Alternatively, a less robust approach, but one that will work with number only strings, would be to test where not str.isnumeric:
m = ~df['Identifier'].str.isnumeric()
df.loc[m, 'Identifier'] = df.loc[m, 'Identifier'].str.lstrip('0')
*NOTE: This fails easily; to_numeric is the much better approach if you need to recognise all number types.
Sample Frame:
df = pd.DataFrame({
'Identifier': ['0932544.5', '02242456']
})
Sample Results with isnumeric:
Identifier
0 932544.5 # 0 Stripped
1 02242456
DataFrame and imports:
import pandas as pd
df = pd.DataFrame({
'Identifier': ['09325445', '02242456', '00MatBrown', '0AntonioK',
'065824245']
})

Use replace with regex and a positive lookahead:
>>> df['Identifier'].str.replace(r'^0+(?=[a-zA-Z])', '', regex=True)
0 09325445
1 02242456
2 MatBrown
3 AntonioK
4 065824245
Name: Identifier, dtype: object
Regex: replace one or more 0s (0+) at the start of the string (^) only if a letter ([a-zA-Z]) follows them (the positive lookahead (?=...)).
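To see the lookahead at work outside pandas, here is a minimal sketch using Python's re module with the same pattern. The helper name strip_zeros_before_letter and the sample values are made up for illustration.

```python
import re

# Same pattern as above: leading zeros, only when a letter follows them.
LEADING_ZEROS = re.compile(r'^0+(?=[a-zA-Z])')

def strip_zeros_before_letter(value):
    # Hypothetical helper: the lookahead checks for a letter without consuming it.
    return LEADING_ZEROS.sub('', value)

ids = ['09325445', '00MatBrown', '0AntonioK', '0123abc']
cleaned = [strip_zeros_before_letter(v) for v in ids]
print(cleaned)
```

Note that a value such as '0123abc' is left unchanged, because the character right after the zeros is a digit, not a letter.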

Related

Converting Set(float) into float/int

I have this df and trying to clean it. How to convert irs_pop,latitude,longitude and fips in real floats and ints?
The code below returns float() argument must be a string or a real number, not 'set'
mask['latitude'] = mask['latitude'].astype('float64')
mask['longitude'] = mask['irs_pop'].astype('float64')
mask['irs_pop'] = mask['irs_pop'].astype('int64')
mask['fips'] = mask['fips'].astype('int64')
Code below returns sequence item 0: expected str instance, float found
mask['fips'] = mask['fips'].apply(lambda x: ','.join(x))
mask = mask.astype({'fips' : 'int64'}) returns int() argument must be a string, a bytes-like object or a real number, not 'set'

So, you could do the following. Notice, you need to convert every element in the set to a str, so just use map and str:
mask['fips'] = mask['fips'].apply(lambda x: ','.join(map(str, x)))
This will store your floats as a comma delimited string. This would have to be parsed back into whatever format you want when reading it back.
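As a rough sketch of that parsing step, assuming the values were joined with a plain comma as above (parse_floats is a hypothetical helper name, and the sample value is invented):

```python
def parse_floats(text):
    # Hypothetical helper: turn '1.5,2.0' back into [1.5, 2.0]; empty text gives [].
    return [float(part) for part in text.split(',') if part]

# Join a one-element set the same way as in the answer, then parse it back.
joined = ','.join(map(str, {36103.0}))
restored = parse_floats(joined)
print(joined, restored)
```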

Try this:
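for col in ['irs_pop', 'latitude', 'longitude']:
    # strip the surrounding braces, then parse; float handles values such as 40.81
    mask[col] = mask[col].astype(str).str[1:-1].astype(float)
mask['irs_pop'] = mask['irs_pop'].astype(int)
It looks like you have multiple FIPS codes in your fips column, so you won't be able to convert them to a single FIPS code. Most importantly, FIPS codes can have leading zeros, so they should be stored as strings.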
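A minimal sketch of keeping the leading zeros, assuming one code per row and 5-digit county FIPS codes (both are assumptions; the sample values are invented):

```python
import pandas as pd

# Hypothetical sample: county FIPS codes that were read in as floats.
fips = pd.Series([1001.0, 36103.0])

# Convert to int to drop the '.0', then to str, then left-pad with zeros to 5 digits.
fips_str = fips.astype('int64').astype(str).str.zfill(5)
print(fips_str.tolist())
```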

You would need to convert to tuple/list and to slice with str:
df['col'] = df['col'].agg(tuple).str[0]
Example:
df = pd.DataFrame({'col': [{1},{2,3},{}]})
df['col2'] = df['col'].agg(tuple).str[0]
Output:
col col2
0 {1} 1.0
1 {2, 3} 2.0 # this doesn't seem to be the case in your data
2 {} NaN
If you want a string as output, with all values if multiple:
df['col'] = df['col'].astype(str).str[1:-1]
Output (as new column for clarity):
col col2
0 {1} 1
1 {2, 3} 2, 3
2 {}

It looks like you have sets with a single value in these columns. The problem may be upstream where these values were filled in the first place. But you could clean it up by applying a function that pops a value from the set and converts it to a float.
import pandas as pd
mask = pd.DataFrame({"latitude":[{40.81}, {40.81}],
                     "longitude":[{-73.04}, {-73.04}]})
print(mask)
columns = ["latitude", "longitude"]
for col in columns:
    mask[col] = mask[col].apply(lambda s: float(s.pop()))
print(mask)
Alternatively, instead of the explicit loop, you could let pandas do the iteration with a double apply:
mask[columns] = mask[columns].apply(
    lambda series: series.apply(lambda s: float(s.pop())))
print(mask)

How to standardize column in pandas

I have dataframe which contains id column with the following sample values
16620625 5686
16310427-5502
16501010 4957
16110430 8679
16990624/4174
16230404.1177
16820221/3388
I want to standardise the values to XXXXXXXX-XXXX (i.e. 8 and 4 digits separated by a dash). How can I achieve that using Python?
Here's my code:
df['id']
df.replace(" ", "-")

You can use the DataFrame.replace() function with a regular expression like this:
df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')
Here's example code with sample data.
import pandas as pd
df = pd.DataFrame({'id': [
'16620625 5686',
'16310427-5502',
'16501010 4957',
'16110430 8679',
'16990624/4174',
'16230404.1177',
'16820221/3388']})
# normalize matching strings with 8-digits + delimiter + 4-digits
df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')
print(df)
Output:
id
0 16620625-5686
1 16310427-5502
2 16501010-4957
3 16110430-8679
4 16990624-4174
5 16230404-1177
6 16820221-3388
If a value does not match the regexp for the expected format, then its value will not be changed.
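Because non-matching values pass through silently, it can help to check what is left afterwards. A small sketch, using an invented malformed entry as the third value:

```python
import pandas as pd

ids = pd.Series(['16620625 5686', '16310427-5502', '1662062-5686'])

# Same replacement as above, applied to a Series.
normalized = ids.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')

# Flag anything that still is not 8 digits, a dash, and 4 digits.
is_valid = normalized.str.fullmatch(r'\d{8}-\d{4}')
print(normalized[~is_valid].tolist())
```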

Inside a for loop:
convert the data frame entry to a string;
take the characters up to and including index 7 (the first 8 digits);
append '-' to them;
append the remainder of the string after the old separator;
move on to the next data frame entry.
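The steps above can be sketched as follows. The helper name standardize is made up, and the sketch assumes every entry has exactly one separator character at index 8; it does no validation.

```python
def standardize(raw):
    # Hypothetical helper: keep the first 8 characters (indices 0-7), add a dash,
    # skip the old separator at index 8, and keep the rest.
    text = str(raw)
    return text[:8] + '-' + text[9:]

ids = ['16620625 5686', '16990624/4174', '16230404.1177']
result = [standardize(v) for v in ids]
print(result)
```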

Converting dtype: object to integer for a column that has numbers with spaces between them

I have the following dataframe, in which I am trying to remove the spaces between the numbers in the value column and then use pd.to_numeric to change the dtype. The current dtype of value is object.
periodFrom value
1 17.11.2020 28 621 240
2 18.11.2020 30 211 234
3 19.11.2020 33 065 243
4 20.11.2020 34 811 330
I have tried multiple variations of this but can't work it out:
df['value'] = df['value'].str.strip()
df['value'] = df['value'].str.replace(',', '').astype(int)
df['value'] = df['value'].astype(str).astype(int)

One option is to apply .str.split() first to split on whitespace (runs of any length), then join the pieces back together with .str.join('') and convert the result to integers:
df['value'] = df['value'].str.split().str.join('').astype(int)

You can do:
df['value'].replace({' ':''}, regex=True)
Or (this needs import re):
df['value'].apply(lambda x: re.sub(' ', '', str(x)))
Then append .astype(int) to either one and assign the result back to df['value'].
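Put together, a minimal end-to-end sketch might look like this. The two rows are taken from the question; the use of str.replace with a whitespace pattern is a substitution for the calls above, not the answerer's exact code.

```python
import pandas as pd

df = pd.DataFrame({'value': ['28 621 240', '30 211 234']})

# Remove every run of whitespace, then convert the digit strings to integers.
df['value'] = df['value'].str.replace(r'\s+', '', regex=True).astype('int64')
print(df['value'].tolist(), df['value'].dtype)
```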

finding and replacing strings with numbers only within a pandas dataframe

I am trying to replace the strings that contain numbers with another string (an empty one in this case) within a pandas DataFrame.
I tried with the .replace method and a regex expression:
# creating dummy dataframe
data = pd.DataFrame({'A': ['test' for _ in range(5)]})
# the value that should get replaced with ''
data.iloc[0] = 'test5'
data.replace(regex=r'\d', value='', inplace=True)
print(data)
A
0 test
1 test
2 test
3 test
4 test
As you can see, it only replaces the '5' within the string and not the whole string.
I also tried using the .where method, but it doesn't seem to fit my need, as I don't want to replace any of the strings that do not contain numbers.
This is what it should look like:
A
0
1 test
2 test
3 test
4 test

You can use Boolean indexing via pd.Series.str.contains with loc:
data.loc[data['A'].str.contains(r'\d'), 'A'] = ''
Similarly, with mask or np.where:
data['A'] = data['A'].mask(data['A'].str.contains(r'\d'), '')
data['A'] = np.where(data['A'].str.contains(r'\d'), '', data['A'])
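If the column can contain missing values, str.contains returns a missing value for those rows, which can break Boolean indexing. A small sketch that passes na=False so those rows are simply left alone (the sample data is invented):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': ['test5', 'test', np.nan, 'a1b']})

# na=False makes missing entries count as "no digit", so the mask is purely boolean.
has_digit = data['A'].str.contains(r'\d', na=False)
data.loc[has_digit, 'A'] = ''
print(data)
```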

How to round/remove trailing ".0" zeros in pandas column?

I'm trying to see if I can remove the trailing zeros from this phone number column.
Example:
0
1 8.00735e+09
2 4.35789e+09
3 6.10644e+09
The type of this column is object, and I tried to round it but I am getting an error. I checked a couple of the values and I know they are in this format: "8007354384.0". I want to get rid of the trailing zero along with the decimal point.
Sometimes I receive the data in this format and sometimes I don't; in that case the values are integer numbers. I would like to check whether a phone value has a trailing zero and, if so, remove it.
I have this code but I'm stuck on how to check for trailing zeros for each row.
data.ix[data.phone.str.contains('.0'), 'phone']
I get an error => *** ValueError: cannot index with vector containing NA / NaN values. I believe the issue is that some rows have empty data, which I do sometimes receive. The code above should be able to skip an empty row.
Does anybody have any suggestions? I'm new to pandas but so far it's a useful library. Your help will be appreciated.
Note
In the example above, the first row is empty, which is something I sometimes get. I just want to make sure this is not represented as 0 for the phone number.
Also, empty data is considered a string, so the column is a mix of floats and strings when some rows are empty.

Use astype(np.int64):
s = pd.Series(['', 8.00735e+09, 4.35789e+09, 6.10644e+09])
mask = pd.to_numeric(s).notnull()
s.loc[mask] = s.loc[mask].astype(np.int64)
s
0
1 8007350000
2 4357890000
3 6106440000
dtype: object

In Pandas/NumPy, integers are not allowed to take NaN values, and arrays/series (including dataframe columns) are homogeneous in their datatype --- so having a column of integers where some entries are None/np.nan is downright impossible.
EDIT: data.phone.astype('object')
should do the trick; in this case, Pandas treats your column as a series of generic Python objects, rather than a specific datatype (e.g. str/float/int), at the cost of performance if you intend to run any heavy computations with this data (probably not in your case).
Assuming you want to keep those NaN entries, your approach of converting to strings is a valid possibility:
data.phone.astype(str).str.split('.', expand = True)[0]
should give you what you're looking for (there are alternative string methods you can use, such as .replace or .extract, but .split seems the most straightforward in this case).
Alternatively, if you are only interested in the display of floats (unlikely I'd suppose), you can do pd.set_option('display.float_format','{:.0f}'.format), which doesn't actually affect your data.

This answer by cs95 removes the trailing “.0” in one line.
df = df.round(decimals=0).astype(object)

import numpy as np
import pandas as pd
s = pd.Series([ None, np.nan, '',8.00735e+09, 4.35789e+09, 6.10644e+09])
s_new = s.fillna('').astype(str).str.replace(".0","",regex=False)
s_new
Here I fill null values with an empty string, convert the series to string type, and replace '.0' with an empty string. Note that with regex=False every literal '.0' is removed, not just a trailing one.
This outputs:
0
1
2
3 8007350000
4 4357890000
5 6106440000
dtype: object

Just do
data['phone'] = data['phone'].astype(str)
data['phone'] = data['phone'].str.replace(r'\.0$', '', regex=True)
which uses a regex to find a trailing '.0' in each entry of the column and replaces it with an empty string. For example
data = pd.DataFrame(
    data = [['bob','39384954.0'],['Lina','23827484.0']],
    columns = ['user','phone'], index = [1,2]
)
data['phone'] = data['phone'].astype(str)
data['phone'] = data['phone'].str.replace(r'\.0$', '', regex=True)
print(data)
user phone
1 bob 39384954
2 Lina 23827484
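The exact pattern matters here: in a regex, an unescaped '.' matches any character, so the pattern '.0' also removes pairs such as '50' in the middle of a number. A minimal sketch with Python's re module, using an invented phone value, shows the difference:

```python
import re

phone = '8007350000.0'

# Unescaped: '.' matches any character, so every "<any char>0" pair is removed.
greedy = re.sub('.0', '', phone)

# Escaped and anchored: only a literal '.0' at the very end is removed.
safe = re.sub(r'\.0$', '', phone)
print(repr(greedy), repr(safe))
```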

Pandas assigns a dtype automatically by inspecting the data. When a column mixes types, for example some rows are NaN and others hold integers, it will most likely end up as dtype object or float64.
EX 1:
import pandas as pd
data = [['tom', 10934000000], ['nick', 1534000000], ['juli', 1412000000]]
df = pd.DataFrame(data, columns = ['Name', 'Phone'])
>>> df
Name Phone
0 tom 10934000000
1 nick 1534000000
2 juli 1412000000
>>> df.dtypes
Name object
Phone int64
dtype: object
In the above example pandas assumes the data type int64, because no row is NaN and every row in the Phone column holds an integer value.
EX 2:
>>> data = [['tom'], ['nick', 1534000000], ['juli', 1412000000]]
>>> df = pd.DataFrame(data, columns = ['Name', 'Phone'])
>>> df
Name Phone
0 tom NaN
1 nick 1.534000e+09
2 juli 1.412000e+09
>>> df.dtypes
Name object
Phone float64
dtype: object
To answer your actual question: to get rid of the .0 at the end you can do something like this
Solution 1:
>>> data = [['tom', 9785000000.0], ['nick', 1534000000.0], ['juli', 1412000000]]
>>> df = pd.DataFrame(data, columns = ['Name', 'Phone'])
>>> df
Name Phone
0 tom 9.785000e+09
1 nick 1.534000e+09
2 juli 1.412000e+09
>>> df['Phone'] = df['Phone'].astype(int).astype(str)
>>> df
Name Phone
0 tom 9785000000
1 nick 1534000000
2 juli 1412000000
Solution 2:
>>> df['Phone'] = df['Phone'].astype(str).str.replace('.0', '', regex=False)
>>> df
Name Phone
0 tom 9785000000
1 nick 1534000000
2 juli 1412000000

Try str.isnumeric with astype and loc:
import numpy as np
import pandas as pd
s = pd.Series(['', 8.00735e+09, 4.35789e+09, 6.10644e+09])
c = s.str.isnumeric().astype(bool)
s.loc[c] = s.loc[c].astype(np.int64)
print(s)
Output:
0
1 8007350000
2 4357890000
3 6106440000
dtype: object

Here is a solution using pandas nullable integers (the solution assumes that input Series values are either empty strings or floating point numbers):
import pandas as pd, numpy as np
s = pd.Series(['', 8.00735e+09, 4.35789e+09, 6.10644e+09])
s.replace('', np.nan).astype('Int64')
Output (pandas-0.25.1):
0 NaN
1 8007350000
2 4357890000
3 6106440000
dtype: Int64
Advantages of the solution:
The output values are either integers or missing values (not 'object' data type)
Efficient
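If the input may also contain numeric strings or None, one possible variation is to run the values through pd.to_numeric first. This is a sketch under that assumption, with invented sample values, not part of the original answer:

```python
import pandas as pd

s = pd.Series(['', 8007354384.0, '4357890000.0', None])

# errors='coerce' turns anything unparseable into NaN instead of raising.
numbers = pd.to_numeric(s, errors='coerce')

# 'Int64' (capital I) is the nullable integer dtype; missing values stay missing.
phones = numbers.astype('Int64')
print(phones)
```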

It depends on the format in which the telephone number is stored.
If it is in a numeric format, changing it to an integer might solve the problem (int64, because full-length phone numbers do not fit in int32):
df = pd.DataFrame({'TelephoneNumber': [123.0, 234]})
df['TelephoneNumber'] = df['TelephoneNumber'].astype('int64')
If it is really a string you can replace and re-assign the column.
df2 = pd.DataFrame({'TelephoneNumber': ['123.0', '234']})
df2['TelephoneNumber'] = df2['TelephoneNumber'].str.replace(r'\.0$', '', regex=True)

import numpy as np
tt = 8.00735e+09
time = int(np.format_float_positional(tt)[:-1])

If somebody is still interested:
I had the problem that I rounded the df and got trailing zeros.
Here is what I did.
new_df = np.round(old_df,3).astype(str)
Then all trailing zeros were gone in the new_df.

I was also facing the same problem with empty strings in some rows.
The most helpful answer on the question "Python - Remove decimal and zero from string" helped me.
