I have a dataframe where one of the columns holds strings and floats.
The column, named 'Value', has values like "AAA", "Korea, Republic of", "123,456.78" and "5000.00".
The first two values are obviously strings, and the last is obviously a float. The third value should be a float as well, but due to the commas, the next step of my code sees it as a string.
Is there an easy way for me to remove the commas for those values that are really floats but keep them for values that are really strings? So "Korea, Republic of" stays, but "123,456.78" converts to "123456.78".
Thanks.
To begin with, your Pandas column does not contain strings and floats, since columns contain homogeneous types. If one entry is a string, then all of them are. You can verify this by doing something like (assuming the DataFrame is df and the column is c):
>>> df.dtypes
and noticing that the dtype will be something like object.
Having said that, you can convert the string column to a different string column, where the strings representing numbers, have the commas removed. This might be useful for further operations, e.g., when you wish to see which entries can be converted to floats. This can be done as follows.
First, write a function like:
import re

def remove_commas_from_numbers(n):
    # Only strings that look like comma-grouped numbers are changed
    r = re.compile(r'^\d{1,3}(?:,\d{3})+(?:\.\d+)?$')
    if not r.match(n):
        return n
    return n.replace(',', '')

remove_commas_from_numbers('123,456.78')  # returns '123456.78'
Then, you can do something like:
>>> df.c = df.c.apply(remove_commas_from_numbers)
Again, it's important to note that df.c's dtype will still be object (i.e., the values remain strings).
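Putting the pieces together on the sample values from the question (the function is repeated here so the snippet runs on its own; note the regex only targets comma-grouped numbers, so everything else passes through unchanged):

```python
import re

import pandas as pd

def remove_commas_from_numbers(n):
    # Only strings that look like comma-grouped numbers are changed
    r = re.compile(r'^\d{1,3}(?:,\d{3})+(?:\.\d+)?$')
    if not r.match(n):
        return n
    return n.replace(',', '')

df = pd.DataFrame({'Value': ['AAA', 'Korea, Republic of', '123,456.78', '5000.00']})
df['Value'] = df['Value'].apply(remove_commas_from_numbers)
print(df['Value'].tolist())  # ['AAA', 'Korea, Republic of', '123456.78', '5000.00']
```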
I have a pandas data frame containing a column with a list that I am reading from a CSV. For example, the column in the CSV appears like so:
ColName2007
=============
['org1', 'org2']
['org2', 'org3']
...
So, when I read this column into Pandas, each entry of the columns is treated as a string, rather than a list of strings.
df['ColName2007'][0] returns "['org1', 'org2']". Notice this is being stored as a string, not a list of strings.
I want to be able to perform list operations on this data. What is a good way to quickly and efficiently convert this column of strings into a column of lists that contain strings?
I would use a strip/split:
df['ColName2007'] = df['ColName2007'].str.strip("[]").str.split(",")
Note that this leaves the quote characters and any surrounding whitespace inside each item. For a more robust conversion, apply ast.literal_eval, as suggested by @Bjay Regmi in the comments.
import ast
df["ColName2007"] = df["ColName2007"].apply(ast.literal_eval)
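A self-contained sketch of the literal_eval approach, using made-up rows shaped like the ones in the question:

```python
import ast

import pandas as pd

# Rows shaped like the CSV column from the question: each cell is a
# string that *looks* like a Python list
df = pd.DataFrame({'ColName2007': ["['org1', 'org2']", "['org2', 'org3']"]})

# literal_eval safely parses each string into a real Python list
df['ColName2007'] = df['ColName2007'].apply(ast.literal_eval)

print(type(df['ColName2007'][0]))  # <class 'list'>
print(df['ColName2007'][0])        # ['org1', 'org2']
```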
I have a dataframe of strings representing numbers (integers and floats).
I want to implement a validation to make sure the strings in certain columns only represent integers.
Here is a dataframe containing two columns, with header str as ints and str as double, representing integers and floats in string format.
# Import pandas library
import pandas as pd
# initialize list elements
data = ['10','20','30','40','50','60']
# Create the pandas DataFrame with column name is provided explicitly
df = pd.DataFrame(data, columns=['str as ints'])
df['str as double'] = ['10.0', '20.0', '30.0', '40.0', '50.0', '60.0']
Here is a function I wrote that checks for the radix in the string to determine whether it is an integer or float.
def includes_dot(s):
    return '.' in s
I want to see if I can use the apply function on this dataframe, or do I need to write another function where I pass in the name of the dataframe and the list of column headers and then call includes_dot like this:
def check_df(df, lst):
    for val in lst:
        apply(df[val]...?)
        # then print out the results if certain columns fail the check
Or if there are better ways to do this problem altogether.
The expected output is a list of column headers that fails the criteria: if I have a list ['str as ints', 'str as double'], then str as double should be printed because that column does not contain all integers.
for col in df:
    if df[col].str.contains(r'\.').any():
        print(col, "contains a '.'")
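If you want the failing columns collected into a list rather than printed one by one, the same check fits a comprehension. A runnable sketch using the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'str as ints': ['10', '20', '30'],
                   'str as double': ['10.0', '20.0', '30.0']})

# Columns whose strings contain a '.' anywhere fail the integer check
failing = [col for col in df.columns if df[col].str.contains(r'\.').any()]
print(failing)  # ['str as double']
```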
How do I convert a DataFrame column containing strings and "-" values to floats?
I have tried pd.to_numeric and pd.Series.astype(int), but without success.
What do you recommend?
If I correctly understand, you want pandas to convert the string 7.629.352 to the float value 7629352.0 and the string (21.808.956) to the value -21808956.0.
For the first part, it is directly possible with the thousands parameter, and it is even possible to process - as a NaN:
m = pd.read_csv(..., thousands='.', na_values='-')
The real problem is the parens for negative values.
You could use a Python function to convert the values. A possible alternative would be to post-process the dataframe column-wise:
m = pd.read_csv(..., thousands='.', na_values='-')
for col in m.columns:
    if m[col].dtype == np.dtype('object'):
        m[col] = (m[col].str.replace(r'\.', '', regex=True)
                        .str.replace(r'\((.*)\)', r'-\1', regex=True)
                        .astype('float64'))
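To illustrate the column-wise post-processing on its own, here is a sketch with a Series standing in for one object column, using the sample values from the question:

```python
import pandas as pd

s = pd.Series(['7.629.352', '(21.808.956)'])
s = (s.str.replace(r'\.', '', regex=True)             # drop the thousands dots
      .str.replace(r'\((.*)\)', r'-\1', regex=True)   # parentheses -> leading minus
      .astype('float64'))
print(s.tolist())  # [7629352.0, -21808956.0]
```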
I want to strip and separate values from a column into another column in a Pandas dataframe. The current values look like
df['column']
14.535.00
14.535.00
14.535.00
I want to remove the 00 after second dot(.) and store them in another column
df['new_column'] as int values so that I could perform arithmetic operations
Edit 1: It seems apply is generally discouraged here; the more accepted solution is a list comprehension.
df['new_column'] = [str(x).split('.')[-1] for x in df.iloc[:,0]]
DON'T DO WHAT'S BELOW
I think this is a good instance for using apply. You might not need the str call.
What this is doing is taking the values in your column (aka a Series) and applying a function to them. The function takes each item, makes it a string, splits on the period, and grabs the last value. We then store the results of all this into a new column.
df['new_column'] = df['column'].apply(lambda x: str(x).split('.')[-1])
should result in something like what you want
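A runnable sketch on values shaped like the ones above, with a final cast so the new column holds ints for arithmetic (note that the part after the second dot here is "00", which becomes 0):

```python
import pandas as pd

df = pd.DataFrame({'column': ['14.535.00', '14.535.00', '14.535.00']})

# Take the piece after the last '.', then cast to int for arithmetic
df['new_column'] = [str(x).split('.')[-1] for x in df.iloc[:, 0]]
df['new_column'] = df['new_column'].astype(int)
print(df['new_column'].tolist())  # [0, 0, 0]
```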
I am trying to do a simple summation with column name Tangible Book Value and Earnings Per Share:
df['price_asset_EPS'] = (df["Tangible Book Value"]) + (df["Earnings Per Share"])
However, the result doesn't add the numbers; they are glued together and the plus sign seems to do nothing, as below
0.66-0.36
1.440.0
What I have missed in between?
Looks like both columns are strings (not float):
0.66-0.36
1.440.0
see how '+' on those columns did string concatenation instead of addition? It concatenated "0.66" and "-0.36", then "1.44" and "0.0".
As to why those columns are strings not float, look at the dtype that pandas.read_csv gave them. There are many duplicate questions here telling you how to specify the right dtypes to read_csv.
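A minimal reproduction with pd.to_numeric as the fix; the column names come from the question, but the values are made up to match the output shown:

```python
import pandas as pd

df = pd.DataFrame({"Tangible Book Value": ["0.66", "1.44"],
                   "Earnings Per Share": ["-0.36", "0.0"]})

# With string columns, '+' concatenates:
print((df["Tangible Book Value"] + df["Earnings Per Share"]).tolist())
# ['0.66-0.36', '1.440.0']

# Convert to numbers first, then '+' adds:
df = df.apply(pd.to_numeric)
df['price_asset_EPS'] = df["Tangible Book Value"] + df["Earnings Per Share"]
print(df['price_asset_EPS'].round(2).tolist())  # [0.3, 1.44]
```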
Your columns are not being treated as numbers but strings. Try running df.dtypes. Against each column, you'll have its type. If you don't see a float or int, that means these columns have probably been read in as strings.
import pandas as pd
dff = pd.DataFrame([[1,'a'], [2, 'b']])
dff.dtypes
0 int64
1 object
Below I have created a dataframe with numbers inside quotes. Take a look at the dtypes.
dff = pd.DataFrame([['1','a'], ['2', 'b']])
dff.dtypes
0 object
1 object
Here you can see that the numbers column is not marked int/float because of the quotes. Now, if I take the sum of the first column
dff.iloc[:,0].sum()
'12'
I get '12', which is the same case as yours. To convert these columns to numeric, look into pd.to_numeric
dff.iloc[:,0] = pd.to_numeric(dff.iloc[:,0])
dff.iloc[:,0].sum()
3
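A side note on pd.to_numeric error handling: with errors='coerce', unparseable entries become NaN instead of raising, which is often handy for messy columns. A small sketch:

```python
import pandas as pd

s = pd.Series(['1', '2', 'not a number'])
converted = pd.to_numeric(s, errors='coerce')
print(converted.tolist())  # [1.0, 2.0, nan]
print(converted.sum())     # 3.0 (NaN is skipped by default)
```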