I have a dataframe of strings representing numbers (integers and floats).
I want to implement a validation to make sure the strings in certain columns only represent integers.
Here is a dataframe containing two columns, with headers str as ints and str as double, representing integers and floats in string format.
# Import pandas library
import pandas as pd

# Initialize the list elements
data = ['10', '20', '30', '40', '50', '60']

# Create the pandas DataFrame with the column name provided explicitly
df = pd.DataFrame(data, columns=['str as ints'])
df['str as double'] = ['10.0', '20.0', '30.0', '40.0', '50.0', '60.0']
Here is a function I wrote that checks for the radix in the string to determine whether it is an integer or float.
def includes_dot(s):
    return '.' in s
I want to see if I can use the apply function on this dataframe. Or do I need to write another function where I pass in the dataframe and the list of column headers and then call includes_dot, like this:
def check_df(df, lst):
    for val in lst:
        apply(df[val]...?)
        # then print out the results if certain columns fail the check
Or if there are better ways to do this problem altogether.
The expected output is a list of column headers that fail the criteria: if I have a list ['str as ints', 'str as double'], then str as double should be printed because that column does not contain all integers.
for col in df:
    if df[col].str.contains(r'\.').any():
        print(col, "contains a '.'")
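If you want the failing headers collected into a list rather than printed one by one, here is a minimal sketch along the same lines that reuses the includes_dot helper with Series.apply (the column names are the ones from the example dataframe above):

cols_to_check = ['str as ints', 'str as double']
failed = [col for col in cols_to_check if df[col].apply(includes_dot).any()]
print(failed)   # ['str as double']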
I am using df.columns to fetch the header of the dataframe and storing it in a list. A is the list of the dataframe's header values.
A = list(df.columns)
But each element of the list is a string, even though some of my headers have integer values. Here is an example of the header:
A = ['ABC', '1345', 'Acv-1234']
I want '1345' to come into the list as an int, not as a string, like this:
A = ['ABC', 1345, 'Acv-1234']
Can anyone suggest an approach for this?
A simple way to do it is to iterate through the columns and check whether the column name (a string) contains only digits (str.isdecimal()); if so, convert it to int, otherwise keep it as a string.
In one line:
A = [int(x) if x.isdecimal() else x for x in df.columns]
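For example, with the headers from the question, this produces the desired mix of types:

cols = ['ABC', '1345', 'Acv-1234']
A = [int(x) if x.isdecimal() else x for x in cols]
print(A)   # ['ABC', 1345, 'Acv-1234']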
I suspect that '1345' is already a string in your df.columns before you assign the columns to list A. You should look at the source of your df, and at how its columns are assigned, in order to control the column types.
However, you can always change df.columns as you want at any time with:
df.columns = ['ABC', 1345, 'Acv-1234']
How do I convert a DataFrame column containing strings and "-" values to floats?
I have tried pd.to_numeric and pd.Series.astype(int) but without success.
What do you recommend?
If I correctly understand, you want pandas to convert the string 7.629.352 to the float value 7629352.0 and the string (21.808.956) to the value -21808956.0.
For the first part, it is directly possible with the thousands parameter, and it is even possible to process - as a NaN:
m = pd.read_csv(..., thousands='.', na_values='-')
The real problem is the parentheses used for negative values.
You could use a Python function to convert the values (a sketch follows the code below). A possible alternative is to post-process the dataframe column-wise:
m = pd.read_csv(..., thousands='.', na_values='-')
for col in m.columns:
    if m[col].dtype == np.dtype('object'):  # assumes "import numpy as np"
        m[col] = (m[col].str.replace(r'\.', '', regex=True)
                        .str.replace(r'\((.*)\)', r'-\1', regex=True)
                        .astype('float64'))
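As a sketch of the converter-function route mentioned above (the file name and the column name 'Value' are illustrative placeholders, not from the original question):

import pandas as pd

def parse_accounting(s):
    # '7.629.352' -> 7629352.0, '(21.808.956)' -> -21808956.0, '-' -> NaN
    s = s.strip()
    if s == '-':
        return float('nan')
    negative = s.startswith('(') and s.endswith(')')
    value = float(s.strip('()').replace('.', ''))
    return -value if negative else value

# m = pd.read_csv('data.csv', converters={'Value': parse_accounting})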
I am trying to do a simple summation with the columns Tangible Book Value and Earnings Per Share:
df['price_asset_EPS'] = (df["Tangible Book Value"]) + (df["Earnings Per Share"])
However, the result does not evaluate the numbers; the plus is missing and the values are simply stuck together, as below:
0.66-0.36
1.440.0
What have I missed?
Looks like both columns are strings (not float):
0.66-0.36
1.440.0
See how '+' on those columns did string concatenation instead of addition? It concatenated "0.66" with "-0.36", and "1.44" with "0.0".
As to why those columns are strings rather than floats, look at the dtype that pandas.read_csv gave them. There are many duplicate questions here telling you how to specify the right dtypes to read_csv.
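A minimal sketch of the fix, assuming the column names are exactly as shown in the question: coerce both columns to numeric before adding them.

df["Tangible Book Value"] = pd.to_numeric(df["Tangible Book Value"], errors="coerce")
df["Earnings Per Share"] = pd.to_numeric(df["Earnings Per Share"], errors="coerce")
df["price_asset_EPS"] = df["Tangible Book Value"] + df["Earnings Per Share"]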
Your columns are being treated not as numbers but as strings. Try running df.dtypes; against each column you'll see its type. If you don't see float or int, these columns have probably been read in as strings.
import pandas as pd
dff = pd.DataFrame([[1,'a'], [2, 'b']])
dff.dtypes
0 int64
1 object
Below I have created a dataframe with numbers inside quotes. Take a look at the dtypes.
dff = pd.DataFrame([['1','a'], ['2', 'b']])
dff.dtypes
0 object
1 object
Here you can see that the numbers column is not marked int/float because of the quotes. Now, if I take the sum of the first column:
dff.iloc[:,0].sum()
'12'
I get '12', which is the same case as yours. To convert these columns to numeric, look into pd.to_numeric:
dff.iloc[:,0] = pd.to_numeric(dff.iloc[:,0], errors='ignore')
dff.iloc[:,0].sum()
3
I have 10,000 rows in my dataset, with the last column containing 10 unique strings (i.e. each string is repeated 1,000 times). How can I convert these 10 strings to integers so that I can use them in a neural network?
I guess that you are using pandas. For example, say you have a list of the useful columns, as follows:
modelFeatures = [contains all the columns' names you use]
df = df[modelFeatures].astype(str)
print(df)
By using astype, you can convert any dataframe column to int, float, or str.
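If the goal is specifically to map the 10 unique strings in the last column to integer codes for the neural network (a different technique from the astype example above), one common option is pd.factorize; a minimal sketch, assuming the label column is the last one (the new column name 'label_code' is just illustrative):

codes, uniques = pd.factorize(df.iloc[:, -1])
df['label_code'] = codes   # integer codes 0..9, one per unique string
# 'uniques' holds the reverse mapping from code back to the original string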
I have a dataframe where one of the columns holds strings and floats.
The column, named 'Value', has values like "AAA", "Korea, Republic of", "123,456.78" and "5000.00".
The first two values are obviously strings, and the last is obviously a float. The third value should be a float as well, but due to the commas, the next step of my code sees it as a string.
Is there an easy way for me to remove the commas for those values that are really floats but keep them for values that are really strings? So "Korea, Republic of" stays, but "123,456.78" converts to "123456.78".
Thanks.
To begin with, your Pandas column does not contain strings and floats, since columns contain homogeneous types. If one entry is a string, then all of them are. You can verify this by doing something like (assuming the DataFrame is df and the column is c):
>>> df.dtypes
and noticing that the type should be something like object.
Having said that, you can convert the string column to a different string column, where the strings representing numbers, have the commas removed. This might be useful for further operations, e.g., when you wish to see which entries can be converted to floats. This can be done as follows.
First, write a function like:
import re

def remove_commas_from_numbers(n):
    # Strings that match the numeric pattern get their commas removed;
    # anything else (e.g. 'Korea, Republic of') is returned unchanged.
    r = re.compile(r'^(\d+(?:,\d+)?.+)*$')
    m = r.match(n)
    if not m:
        return n
    return n.replace(',', '')

remove_commas_from_numbers('1,1.')
Then, you can do something like:
>>> df.c = df.c.apply(remove_commas_from_numbers)
Again, it's important to note that df.c's dtype will still be object (strings).
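As a follow-up (not part of the original answer), if you then want to see which of the cleaned entries actually represent numbers, pd.to_numeric with errors='coerce' turns the non-numeric strings into NaN:

numeric = pd.to_numeric(df.c, errors='coerce')
mask = numeric.notna()   # True where the cleaned string was really a number, e.g. '123456.78'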