I have this df and am trying to clean it. How can I convert irs_pop, latitude, longitude and fips into real floats and ints?
The code below returns float() argument must be a string or a real number, not 'set':
mask['latitude'] = mask['latitude'].astype('float64')
mask['longitude'] = mask['longitude'].astype('float64')
mask['irs_pop'] = mask['irs_pop'].astype('int64')
mask['fips'] = mask['fips'].astype('int64')
The code below returns sequence item 0: expected str instance, float found:
mask['fips'] = mask['fips'].apply(lambda x: ','.join(x))
mask = mask.astype({'fips' : 'int64'}) returns int() argument must be a string, a bytes-like object or a real number, not 'set'
You could do the following. Note that you need to convert every element in the set to a str first, so use map with str:
mask['fips'] = mask['fips'].apply(lambda x: ','.join(map(str, x)))
This will store your floats as a comma-delimited string, which would have to be parsed back into whatever format you want when reading it back.
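For example, a minimal sketch of the round trip, assuming the comma-delimited format above and float elements:
mask['fips'] = mask['fips'].apply(lambda x: set(map(float, x.split(','))))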
Try this:
for col in ['irs_pop', 'latitude', 'longitude']:
    # strip the surrounding {} from each set's string form, then convert
    mask[col] = mask[col].astype(str).str[1:-1].astype(float)
mask['irs_pop'] = mask['irs_pop'].astype(int)
It looks like you have multiple FIPS values in your fips column, so you won't be able to convert it to a single FIPS code. Most importantly, FIPS codes can have leading zeros, so they should be kept as strings.
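For example, assuming 5-digit county FIPS codes (the zero-padding width here is an assumption), you could keep each set as a comma-joined string of zero-padded codes:
# hypothetical: pad each code to 5 digits; adjust the width to your data
mask['fips'] = mask['fips'].apply(lambda s: ','.join(str(int(f)).zfill(5) for f in s))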
You would need to convert each set to a tuple/list and then slice with the str accessor:
df['col'] = df['col'].agg(tuple).str[0]
Example:
df = pd.DataFrame({'col': [{1},{2,3},{}]})
df['col2'] = df['col'].agg(tuple).str[0]
Output:
col col2
0 {1} 1.0
1 {2, 3} 2.0 # this doesn't seem to be the case in your data
2 {} NaN
If you want a string as output, with all values if multiple:
df['col'] = df['col'].astype(str).str[1:-1]
Output (as new column for clarity):
col col2
0 {1} 1
1 {2, 3} 2, 3
2 {}
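Note that in the first output above, col2 comes out as float because the empty set yields NaN. If you then need real integers despite the NaN, pandas' nullable Int64 dtype can hold them (a sketch):
df['col2'] = df['col'].agg(tuple).str[0].astype('Int64')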
It looks like you have sets with a single value in these columns. The problem is probably upstream, where these values were filled in the first place, but you can clean it up by applying a function that pops a value from each set and converts it to a float.
import pandas as pd

mask = pd.DataFrame({"latitude": [{40.81}, {40.81}],
                     "longitude": [{-73.04}, {-73.04}]})
print(mask)

columns = ["latitude", "longitude"]
for col in columns:
    mask[col] = mask[col].apply(lambda s: float(s.pop()))
print(mask)
You could have pandas handle the for loop by doing a double apply:
mask[columns] = mask[columns].apply(
    lambda series: series.apply(lambda s: float(s.pop())))
print(mask)
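Note that set.pop() removes an element, mutating the original sets in place. If you'd rather leave the input untouched, next(iter(s)) reads an element without removing it; a sketch:
mask[columns] = mask[columns].apply(
    lambda series: series.apply(lambda s: float(next(iter(s)))))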
In my DataFrame, the "Value_String" column consists of strings that are either:
number-like strings starting with a dollar sign, with thousands separated by a comma (e.g. $1,000)
"None"
Therefore, I tried to create a new column and convert the string to float with the following lambda function:
to_replace = '$,'
df['Value_Float'] = df[df['Value_String'].apply(lambda x: 0 if x == 'None'
else float(x.replace(y, '')) for y in to_replace)]
This actually generates a "TypeError: 'generator' object is not callable".
How can I solve this?
The numpy where method is very helpful for conditionally updating values. In this case, where the value is not 'None', we use str.replace. Since str.replace can take a regular expression (pass regex=True), the pattern matches a literal, escaped dollar sign OR a comma:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Value_String':["$1,000","None"]})
df['Value_String'] = np.where(df['Value_String'] != 'None',
                              df['Value_String'].str.replace(r'\$|,', '', regex=True),
                              df['Value_String'])
print(df)
Output
Value_String
0 1000
1 None
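If you also want real numbers rather than digit strings, pd.to_numeric can finish the job; with errors='coerce' the literal string 'None' becomes NaN (a sketch, assuming NaN is acceptable in place of 'None'):
df['Value_Float'] = pd.to_numeric(
    df['Value_String'].str.replace(r'[$,]', '', regex=True), errors='coerce')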
I currently have a column in my dataset that looks like the following:
Identifier
09325445
02242456
00MatBrown
0AntonioK
065824245
The column data type is object.
What I'd like to do is remove the leading zeros only from rows where the value contains letters. I want to keep the leading zeros where the values are purely numeric.
Result I'm looking to achieve:
Identifier
09325445
02242456
MatBrown
AntonioK
065824245
Code I am currently using (that isn't working):
def removeLeadingZeroFromString(row):
    if df['Identifier'] == str:
        return df['Identifier'].str.strip('0')
    else:
        return df['Identifier']

df['Identifier'] = df.apply(lambda row: removeLeadingZeroFromString(row), axis=1)
One approach would be to try to convert Identifier with to_numeric, then test where the converted values are NaN with isna, using that mask to str.lstrip (strip leading zeros only) where the values could not be converted:
m = pd.to_numeric(df['Identifier'], errors='coerce').isna()
df.loc[m, 'Identifier'] = df.loc[m, 'Identifier'].str.lstrip('0')
df:
Identifier
0 09325445
1 02242456
2 MatBrown
3 AntonioK
4 065824245
Alternatively, a less robust approach, but one that will work with number-only strings, would be to test where not str.isnumeric:
m = ~df['Identifier'].str.isnumeric()
df.loc[m, 'Identifier'] = df.loc[m, 'Identifier'].str.lstrip('0')
Note: this fails easily; to_numeric is the much better approach if you need to handle all number types.
Sample Frame:
df = pd.DataFrame({
    'Identifier': ['0932544.5', '02242456']
})
Sample Results with isnumeric:
Identifier
0 932544.5 # 0 Stripped
1 02242456
DataFrame and imports:
import pandas as pd
df = pd.DataFrame({
    'Identifier': ['09325445', '02242456', '00MatBrown', '0AntonioK',
                   '065824245']
})
Use replace with regex and a positive lookahead:
>>> df['Identifier'].str.replace(r'^0+(?=[a-zA-Z])', '', regex=True)
0 09325445
1 02242456
2 MatBrown
3 AntonioK
4 065824245
Name: Identifier, dtype: object
Regex: replace one or more 0 (0+) at the start of the string (^) if there is a letter ([a-zA-Z]) after the 0s ((?=...)).
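To apply this in place, assign the result back to the column:
df['Identifier'] = df['Identifier'].str.replace(r'^0+(?=[a-zA-Z])', '', regex=True)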
So I have a dataframe of NBA stats from last season, which I am using to learn pandas and matplotlib, but all the numbers (points per game, salaries, PER etc.) are strings. I noticed it when I tried to sum them and they just concatenated. So I used this:
df['Salary'] = df['Salary'].astype(float)
to change the values, but there are many more columns that I would have to do the same thing for, and I don't think I should do it manually. The first thing that comes to mind is some kind of regex, but I am not familiar with it, so I am seeking help. Thanks in advance!
In Pandas, DataFrame objects make a list of all columns contained in the frame available via the columns attribute. This attribute is iterable, which means you can use this as the iterable object of a for-in loop. This allows you to easily run through and apply an operation to all columns:
for col in df.columns:
    df[col] = df[col].astype('float', errors='ignore')
Documentation page for Pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
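Note that errors='ignore' for astype is deprecated in recent pandas releases. A sketch of the same idea using pd.to_numeric instead, keeping a column only when every value converts cleanly (this assumes the original data has no missing values):
for col in df.columns:
    converted = pd.to_numeric(df[col], errors='coerce')
    if not converted.isna().any():  # every value converted cleanly
        df[col] = converted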
Another way to do this, if you know the columns in advance, is to specify the dtype when you import the dataframe:
df = pd.read_csv("file.tsv", sep='\t', dtype={'a': np.float64, 'b': str, 'c': np.float64})
A second method could be to use a conversion dictionary:
conversion_dict = {'a': np.float64, 'c': np.float64}
df = df.astype(conversion_dict)
A third method, if your column is of object dtype, is to use the infer_objects() method from pandas. Using this method you don't have to specify all the columns yourself:
df = df.infer_objects()
good luck
I think you can use select_dtypes.
The strategy is to find the columns with type object, which are usually strings. You can check this using df.info().
So:
df.select_dtypes(include=['object']).astype(float)
would do the trick. Note it returns a new frame, so assign the result back if you want to keep it.
If you want to keep a trace of this:
str_cols = df.select_dtypes(include=['object']).columns
mapping = {col_name: col_type for col_name, col_type in zip(str_cols, [float] * len(str_cols))}
df[str_cols] = df[str_cols].astype(mapping)
I like this approach because you can create a dictionary of the types you want your columns to be in.
If you know the names of the columns, you can use a for loop to apply the same transformation to each column. This is useful if you don't want to convert the entire data frame but only certain columns. Hope that helps 👍
cols = ['points', 'salary', 'wins']
for i in cols:
    df[i] = df[i].astype(float)
I think what OP is asking is how to convert each column to its appropriate type (int, float, or str) without having to manually inspect each column and then explicitly convert it.
I think something like the below should work for you. Keep in mind that this is pretty exhaustive and checks every value in each column. You can always change the second for loop to only look at, say, the first 100 rows to decide on a type for that column.
import pandas as pd
import numpy as np

# Example dataframe full of strings
df = pd.DataFrame.from_dict({'name': ['Lebron James', 'Kevin Durant'],
                             'points': ['38', ' '],
                             'steals': ['2.5', ''],
                             'position': ['Every Position', 'SG'],
                             'turnovers': ['0', '7']})

def convertTypes(df):
    for col in df:
        if df[col].dtype == np.float64 or df[col].dtype == np.int64:
            # The column's type is already a float or int, skip it
            continue
        is_an_int = True
        is_a_float = True
        # Iterate through each value in the column
        for value in df[col]:
            # Skip blanks and non-strings; they don't decide the type
            if not isinstance(value, str) or value.isspace() or value == '':
                continue
            # If the string's isnumeric method returns False, it's not an int
            if not value.isnumeric():
                is_an_int = False
            # A float is either numeric or two numeric parts split by a '.'
            parts = value.split('.')
            if not (value.isnumeric() or
                    (len(parts) == 2 and parts[0].isnumeric() and parts[1].isnumeric())):
                is_a_float = False
        if is_an_int:
            # Every value's an int: replace blanks and whitespace with 0
            df[col] = df[col].replace(r'^\s*$', 0, regex=True).astype(int)
        elif is_a_float:
            # Every value's a float: replace blanks and whitespace with NaN
            df[col] = df[col].replace(r'^\s*$', np.nan, regex=True).astype(float)

convertTypes(df)
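With the sample frame above, checking the result should show name and position left as object, points and turnovers converted to int64 (blanks become 0), and steals to float64 (blanks become NaN):
print(df.dtypes)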
I have a csv file with two formatted columns that currently read in as objects:
one contains percentage values, which read in as strings like '0.01%'; the % is always at the end.
the other contains currency values, which read in as strings like '$1234.5'.
I have tried using the split function to remove the % or $ inside the dataframe, then calling float on the result of the split. This prints the correct result but will not assign the value. It also gives a TypeError that float does not have a split function, even though I do the split before the float.
Try this:
import pandas as pd
df = pd.read_csv('data.csv')
"""
The example df looks like this:
col1 col2
0 3.04% $100.25
1 0.15% $1250
2 0.22% $322
3 1.30% $956
4 0.49% $621
"""
df['col1'] = df['col1'].str.split('%', expand=True)[[0]]
df['col2'] = df['col2'].str.split('$', n=1, expand=True)[[1]]
df[['col1', 'col2']] = df[['col1', 'col2']].apply(pd.to_numeric)
You are probably looking for the apply method. With:
df['first_col'] = df['first_col'].apply(lambda x: float(x.strip('%')))
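The currency column can be handled the same way; second_col here is a hypothetical column name (a sketch):
df['second_col'] = df['second_col'].apply(lambda x: float(x.replace('$', '').replace(',', '')))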
I searched online posts, but what I found were all about how to round only the float columns in a mixed dataframe; my problem is how to round float values inside a string-type (object) column.
Say my dataframe is like this:
pd.DataFrame({'a':[1.1111,2.2222, 'aaaa'], 'b':['bbbb', 2.2222,3.3333], 'c':[3.3333,'cccc', 4.4444]})
Looking for an output like
pd.DataFrame({'a':[1.1,2.2, 'aaaa'], 'b':['bbbb', 2.2,3.3], 'c':[3.3,'cccc', 4.4]})
---- Above is the question itself ----
---- The reason why I need this is below ----
I have 3 csv files, each with a string header and float values, and with different numbers of rows and columns.
I need to append the 3 into one dataframe, one under another, then export it as a new csv, with each part separated by an empty row.
My 3 dataframes look like this: (images One, Two and Three omitted, along with the combined output). Please note that the output dataframe contains the headers from all 3 sub-dataframes.
So, when I import them, the first csv is of course loaded by pd.read_csv with no issue.
Then I used .append(pd.Series([np.NaN])) to create an empty row as a separator.
Then the second csv is loaded and appended with pd.append(), but if I don't include header=None in read_csv(), the second one will not be mapped under the first one, because the csv files have uneven rows and columns.
So, two options:
1. Include header=None in read_csv(); then I can't simply use round(), as
df = df.round()
does not work, and I need a way to round only the numeric values in each column. Also note that with header=None, all column types are object, per df.dtypes.
2. Don't include header=None in read_csv(); then I could round each dataframe, but I am having trouble combining them, with their headers, one under another.
Any suggestion?
CSV example:
import pandas as pd
import io

exp = io.StringIO("""
month;abc;cba;fef;sefe;yjy;gtht
100;0.45384534;0.43455;0.56385;0.5353;0.523453;0.53553
200;0.453453;0.453453;0.645396;0.76786;0.36327;0.453659
""")

df = pd.read_csv(exp, sep=";", header=None)
print(df.dtypes)

df = df.applymap(lambda x: round(x, 1)
                 if isinstance(x, (int, float)) else x)
print(df)
There is a simple way to loop over every single element in a dataframe using applymap. Combined with isinstance, which tests for a specific type, you can do the following:
df = pd.DataFrame({'a':[1.1111,2.2222, 'aaaa'], 'b':['bbbb', 2.2222,3.3333], 'c':[3.3333,'cccc', 4.4444]})
df.dtypes
a object
b object
c object
dtype: object
df2 = df.applymap(lambda x: round(x, 1) if isinstance(x, (int, float)) else x)
Obtaining the following dataframe:
a b c
0 1.1 bbbb 3.3
1 2.2 2.2 cccc
2 aaaa 3.3 4.4
With the following dtypes unchanged
df2.dtypes
a object
b object
c object
dtype: object
As for the other example in your question, I noticed that even the numbers are saved as strings. There is a method for converting strings to numbers, pd.to_numeric, which works on a Series.
From your exp, I get the following:
df = pd.read_csv(exp, sep=";", header=None)
df2 = df.apply(lambda x: pd.to_numeric(x, errors='ignore'), axis=1)
df3 = df2.applymap(lambda x: round(x, 1) if isinstance(x, (int, float)) else x)
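As a side note, applymap is deprecated in pandas 2.1+ in favour of DataFrame.map; under that assumption the rounding step would read (a sketch):
df3 = df2.map(lambda x: round(x, 1) if isinstance(x, (int, float)) else x)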