I am using df.columns to fetch the header of the dataframe and storing it in a list. A is the list of the dataframe's header values.
A=list(df.columns)
But every element of the list has string dtype, and my header also contains int values. Below is an example of the header:
A=['ABC','1345','Acv-1234']
But I want '1345' to appear in the list as an int, not as a string, like this:
A=['ABC',1345,'Acv-1234']
Can anyone suggest an approach for this?
A simple way to do it is to iterate through the columns and check whether each column name (a string) contains only digits (str.isdecimal()); if it does, convert it to int, otherwise keep it as a string.
In one line:
A = [int(x) if x.isdecimal() else x for x in df.columns ]
I suspect that '1345' is already a string in df.columns before you assign the names to list A. To control the column types, look at the source of your df and at how the columns are assigned.
However, you can always change df.columns to whatever you want at any time with:
df.columns=['ABC', 1345 ,'Acv-1234']
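Putting the two steps together, here is a minimal sketch. It assumes a small made-up DataFrame; only the column names come from the question, the data values are hypothetical.

```python
import pandas as pd

# Hypothetical frame; only the column names come from the question.
df = pd.DataFrame([[1, 2, 3]], columns=['ABC', '1345', 'Acv-1234'])

# Convert purely numeric names to int, leave the rest as strings.
A = [int(x) if x.isdecimal() else x for x in df.columns]
print(A)  # ['ABC', 1345, 'Acv-1234']

# Optionally write the converted names back to the DataFrame.
df.columns = A
print(df[1345].tolist())  # [2]
```

Note that str.isdecimal() is False for names such as '-5' or '12.5', so those would stay strings with this approach.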
Related
I have a dataframe of strings representing numbers (integers and floats).
I want to implement a validation to make sure the strings in certain columns only represent integers.
Here is a dataframe containing two columns, with header str as ints and str as double, representing integers and floats in string format.
# Import pandas library
import pandas as pd
# initialize list elements
data = ['10','20','30','40','50','60']
# Create the pandas DataFrame with column name is provided explicitly
df = pd.DataFrame(data, columns=['str as ints'])
df['str as double'] = ['10.0', '20.0', '30.0', '40.0', '50.0', '60.0']
Here is a function I wrote that checks for the radix in the string to determine whether it is an integer or float.
def includes_dot(s):
    return '.' in s
I want to see if I can use the apply function on this dataframe, or do I need to write another function where I pass in the name of the dataframe and the list of column headers and then call includes_dot like this:
def check_df(df, lst):
    for val in lst:
        apply(df[val]...?)
        # then print out the results if certain columns fail the check
Or if there are better ways to do this problem altogether.
The expected output is a list of the column headers that fail the criterion: if I have a list ['str as ints', 'str as double'], then str as double should be printed because that column does not contain only integers.
for col in df:
    # raw string so that '\.' is a literal dot in the regex
    if df[col].str.contains(r'\.').any():
        print(col, "contains a '.'")
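If you want the failing headers back as a list rather than printed, the same check can be wrapped in a function. This is a rough sketch; the function name columns_with_dot is made up for illustration, and the sample data is a shortened version of the frame in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'str as ints': ['10', '20', '30'],
    'str as double': ['10.0', '20.0', '30.0'],
})

def columns_with_dot(df, cols):
    """Return the names in cols whose values contain at least one '.'."""
    # regex=False treats '.' as a literal character, not a wildcard.
    return [c for c in cols if df[c].str.contains('.', regex=False).any()]

failed = columns_with_dot(df, ['str as ints', 'str as double'])
print(failed)  # ['str as double']
```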
I want to strip and separate values from one column into another column in a Pandas dataframe. The current values look like this:
df['column']
14.535.00
14.535.00
14.535.00
I want to remove the 00 after the second dot (.) and store the result in another column,
df['new_column'], as int values so that I can perform arithmetic operations.
Edit 1: It seems apply is generally discouraged here; a more accepted solution is to use a list comprehension.
df['new_column'] = [str(x).split('.')[-1] for x in df.iloc[:,0]]
DON'T DO WHAT'S BELOW
I think this is a good instance for using apply. You might not need the str call.
What this is doing is taking the values in your column (aka a Series) and applying a function to them. The function takes each item, makes it a string, splits on the period, and grabs the last value. We then store the results of all this into a new column.
df['new_column'] = df['column'].apply(lambda x: str(x).split('.')[-1])
should result in something like what you want
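If the goal is instead to drop the trailing 00 and keep the rest as an integer, something like the following might work. It rests on an assumption the question does not confirm: that the last dot-separated group is a decimal part to discard and the remaining dots are thousands separators.

```python
import pandas as pd

df = pd.DataFrame({'column': ['14.535.00', '14.535.00', '14.535.00']})

# Assumption: the last '.'-separated group is a decimal part to discard,
# and the remaining dots are thousands separators.
whole = df['column'].str.rsplit('.', n=1).str[0]  # e.g. '14.535'
df['new_column'] = whole.str.replace('.', '', regex=False).astype(int)
print(df['new_column'].tolist())  # [14535, 14535, 14535]
```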
I am trying to do a simple summation with column name Tangible Book Value and Earnings Per Share:
df['price_asset_EPS'] = (df["Tangible Book Value"]) + (df["Earnings Per Share"])
However, the result doesn't add the numbers, and the plus sign is missing, as shown below:
0.66-0.36
1.440.0
What have I missed?
Looks like both columns are strings (not float):
0.66-0.36
1.440.0
see how '+' on those columns did string concatenation instead of addition? It concatenated "0.66" and "-0.36", then "1.44" and "0.0".
As to why those columns are strings not float, look at the dtype that pandas.read_csv gave them. There are many duplicate questions here telling you how to specify the right dtypes to read_csv.
Your columns are not being treated as numbers but strings. Try running df.dtypes. Against each column, you'll have its type. If you don't see a float or int, that means these columns have probably been read in as strings.
import pandas as pd
dff = pd.DataFrame([[1,'a'], [2, 'b']])
dff.dtypes
0 int64
1 object
Below I have created a dataframe with numbers inside quotes. Take a look at the dtypes.
dff = pd.DataFrame([['1','a'], ['2', 'b']])
dff.dtypes
0 object
1 object
Here you can see that numbers column is not marked int/float because of the quotes. Now, if I take the sum of the first column
dff.iloc[:,0].sum()
'12'
I get '12', which is the same behaviour you are seeing. To convert these columns to numeric, look into pd.to_numeric (errors='coerce' turns values that cannot be parsed into NaN):
dff[0] = pd.to_numeric(dff[0], errors='coerce')
dff.iloc[:,0].sum()
3
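Applied to the two columns from the question, the conversion could look roughly like this. The values are reconstructed from the concatenated output shown in the question; the rest is a sketch, not the asker's actual data loading code.

```python
import pandas as pd

# Values reconstructed from the concatenated output in the question.
df = pd.DataFrame({
    'Tangible Book Value': ['0.66', '1.44'],
    'Earnings Per Share': ['-0.36', '0.0'],
})

cols = ['Tangible Book Value', 'Earnings Per Share']
# errors='coerce' turns anything unparseable into NaN instead of raising.
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

df['price_asset_EPS'] = df['Tangible Book Value'] + df['Earnings Per Share']
print(df['price_asset_EPS'].round(2).tolist())  # [0.3, 1.44]
```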
I have a Pandas DataFrame with a string column called title and I want to convert each row's entry to that string's length. So "abcd" would be converted to 4, etc.
I'm doing this:
result_df['title'] = result_df['title'].str.len()
But unfortunately, I get this error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
which seems to imply that I don't actually have strings in my column...
How should I go about this?
Thanks!
You're either trying to convert the whole column to str and not the values or have mixed types in the column. Try:
result_df['title'] = result_df['title'].apply(lambda x: len(str(x)))
If your column has strings and numeric data, you can first convert everything to strings and then get the length.
result_df['title'] = result_df['title'].astype(str).str.len()
To find the data that is not a string/unicode, try this:
result_df.loc[result_df['title'].apply(
    lambda x: not isinstance(x, str)), 'title']
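Here is a small self-contained sketch of that check, assuming a hypothetical title column that mixes strings with a float and a missing value. It also shows one way to take lengths only for the real strings, leaving the other rows as NaN.

```python
import pandas as pd

# Hypothetical column mixing strings with a float and a missing value.
result_df = pd.DataFrame({'title': ['abcd', 12.5, None, 'xy']})

# True where the entry is a Python str.
is_str = result_df['title'].map(lambda x: isinstance(x, str))

# Rows that are not strings.
print(result_df.loc[~is_str, 'title'].tolist())  # [12.5, None]

# Length of the real strings only; non-string rows become NaN.
lengths = result_df['title'].where(is_str).str.len()
print(lengths.fillna(-1).tolist())
```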
I have a small problem: I have a column in my DataFrame, which has multiple rows, and in each row it holds either 1 or more values starting with 'M' letter followed by 3 digits. If there is more than 1 value, they are separated by a comma.
I would like to print out a view of the DataFrame featuring only the rows where that column holds values I specify (e.g. any item from the list ['M111', 'M222']).
I have started to build my boolean mask in the following way:
df[df['Column'].apply(lambda x: x.split(', ').isin(['M111', 'M222']))]
In my mind, .apply() with .split() methods in there first convert 'Column' values to lists in each row with 1 or more values in it, and then .isin() method confirms whether or not any of items in list of items in each row are in the list of specified values ['M111', 'M222'].
In practice however, instead of getting a desired view of DataFrame, I get error
TypeError: unhashable type: 'list'
What am I doing wrong?
Kind regards,
Greem
I think you need:
df2 = df[df['Column'].str.contains('|'.join(['M111', 'M222']))]
You can only access the isin() method with a Pandas object. But split() returns a list. Wrapping split() in a Series will work:
# sample data
data = {'Column':['M111, M000','M333, M444']}
df = pd.DataFrame(data)
print(df)
       Column
0  M111, M000
1  M333, M444
Now wrap split() in a Series.
Note that isin() will return a Series of boolean values, one for each element coming out of split(). You want to know "whether or not any of item in list...are in the list of specified values", so add any() to your apply function.
df[df['Column'].apply(lambda x: pd.Series(x.split(', ')).isin(['M111', 'M222']).any())]
Output:
       Column
0  M111, M000
As others have pointed out, there are simpler ways to go about achieving your end goal. But this is how to resolve the specific issue you're encountering with isin().
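One such simpler route, sketched here under the assumption that values are always separated by a comma and a space, is to split each cell and test the resulting list against a set. Unlike str.contains, this matches whole codes only. The third sample row is made up for illustration.

```python
import pandas as pd

# The third row is hypothetical, added to show a single-value cell.
df = pd.DataFrame({'Column': ['M111, M000', 'M333, M444', 'M222']})
wanted = {'M111', 'M222'}

# Split each cell into its codes, then test for any overlap with `wanted`.
mask = df['Column'].str.split(', ').map(
    lambda codes: not wanted.isdisjoint(codes))
print(df[mask]['Column'].tolist())  # ['M111, M000', 'M222']
```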