AttributeError: 'float' object has no attribute 'split' - python

I am calling this line:
lang_modifiers = [keyw.strip() for keyw in row["language_modifiers"].split("|") if not isinstance(row["language_modifiers"], float)]
This seems to work where row["language_modifiers"] is a word (atlas method, central), but not when it comes up as nan.
I thought my if not isinstance(row["language_modifiers"], float) could catch the time when things come up as nan but not the case.
Background: row["language_modifiers"] is a cell in a tsv file, and comes up as nan when that cell was empty in the tsv being parsed.

You are right, such errors mostly caused by NaN representing empty cells.
It is common to filter out such data, before applying your further operations, using this idiom on your dataframe df:
df_new = df[df['ColumnName'].notnull()]
Alternatively, it may be more handy to use fillna() method to impute (to replace) null values with something default.
E.g. all null or NaN's can be replaced with the average value for its column
housing['LotArea'] = housing['LotArea'].fillna(housing.mean()['LotArea'])
or can be replaced with a value like empty string "" or another default value
housing['GarageCond']=housing['GarageCond'].fillna("")

You might also use df = df.dropna(thresh=n) where n is the tolerance. Meaning, it requires n Non-NA values to not drop the row
Mind you, this approach will remove the row
For example: If you have a dataframe with 5 columns, df.dropna(thresh=5) would drop any row that does not have 5 valid, or non-Na values.
In your case you might only want to keep valid rows; if so, you can set the threshold to the number of columns you have.
pandas documentation on dropna

Related

Remove "?" from pandas column

I've a pandas dataset which has columns and it's Dtype is object. The columns however has numerical float values inside it along with '?' and I'm trying to convert it to float. I want to remove these '?' from the entire column and making those values Nan but not 0 and then convert the column to float64.
The output of value_count() of Voltage column look like this :
? 3771
240.67 363
240.48 356
240.74 356
240.62 356
...
227.61 1
227.01 1
226.36 1
227.28 1
227.02 1
Name: Voltage, Length: 2276, dtype: int64
What is the best way to do that in case I've entire dataset which has "?" inside them along with numbers and i want to convert them all at once.
I tried something like this but it's not working. I want to do this operation for all the columns. Thanks
df['Voltage'] = df['Voltage'].apply(lambda x: float(x.split()[0].replace('?', '')))
1 More question. How can I get "?" from all the columns. I tried something like. Thanks
list = []
for i in df.columns:
if '?' in df[i]
continue
series = df[i].value_counts()['?']
list.append(series)
So, from your value_count, it is clear, that you just have some values that are floats, in a string, and some values that contain ? (apparently that ARE ?).
So, the one thing NOT to do, is use apply or applymap.
Those are just one step below for loops and iterrows in the hierarchy of what not to do.
The only cases where you should use apply is when, otherwise, you would have to iterate rows with for. And those cases almost never happen (in my real life, I've used apply only once. And that was when I was a beginner, and I am pretty sure that if I were to review that code now, I would find another way).
In your case
df.Voltage = df.Voltage.where(~df.Voltage.str.contains('\?')).astype(float)
should do what you want
df.Voltage.str.contains('\?') is a True/False series saying if a row contains a '?'. So ~df.Voltage.str.contains('\?') is the opposite (True if the row does not contain a '\?'. So df.Voltage.where(~df.Voltage.str.contains('\?')) is a serie where values that match ~df.Voltage.str.contains('\?') are left as is, and the other are replaced by the 2nd argument, or, if there is no 2nd argument (which is our case) by NaN. So exactly what you want. Adding .astype(float) convert everyhting to float, since it should now be possible (all rows contains either strings representing a float such as 230.18, or a NaN. So, all convertible to float).
An alternative, closer to what you where trying, that is replacing first, in place, the ?, would be
df.loc[df.Voltage=='?', 'Voltage']=None
# And then, df.Voltage.astype(float) converts to float, with NaN where you put None

Not getting stats analysis of binary column pandas

I have a dataframe, 11 columns 18k rows. The last column is either a 1 or 0, but when I use .describe() all I get is
count 19020
unique 2
top 1
freq 12332
Name: Class, dtype: int64
as opposed to an actual statistical analysis with mean, std, etc.
Is there a way to do this?
If your numeric (0, 1) column is not being picked up automatically by .describe(), it might be because it's not actually encoded as an int dtype. You can see this in the documentation of the .describe() method, which tells you that the default include parameter is only for numeric types:
None (default) : The result will include all numeric columns.
My suggestion would be the following:
df.dtypes # check datatypes
df['num'] = df['num'].astype(int) # if it's not integer, cast it as such
df.describe(include=['object', 'int64']) # explicitly state the data types you'd like to describe
That is, first check the datatypes (I'm assuming the column is called num and the dataframe df, but feel free to substitute with the right ones). If this indicator/(0,1) column is indeed not encoded as int/integer type, then cast it as such by using .astype(int). Then, you can freely use df.describe() and perhaps even specify columns of which data types you want to include in the description output, for more fine-grained control.
You could use
# percentile list
perc =[.20, .40, .60, .80]
# list of dtypes to include
include =['object', 'float', 'int']
data.describe(percentiles = perc, include = include)
where data is your dataframe (important point).
Since you are new to stack, I might suggest that you include some actual code (i.e. something showing how and on what you are using your methods). You'll get better answers

Drop row in Pandas dataframe if zero bordered by numbers (Python)

Due to some foibles in the API I'm using, sometimes a 'Zero' is returned when it should return a number; which works its way through to a Pandas dataframe that my script outputs (Python).
What would be a Pythonic way to drop a row if a zero is bordered both above and below by non-zero numbers? I can think of extensive loops to solve this, but that'd be quite an intensive way of going about this.
Note that elsewhere in the dataframe there'll be continuous rows of zeros, which are valid, so it's not simply a case of dropping all rows with zeros in them; I only want to drop rows with zero if they're bordered by rows with valid non-zero numbers.
Assuming col is the column you want to filter on, and it's type is str (drop " if it's float):
df = df.loc[~ (df["col"].shift(-1).ne("0.0") & df["col"].eq("0.0") & df["col"].shift(1).ne("0.0"))]

df.set_index returns key error python pandas dataframe

I have this Pandas DataFrame and I have to convert some of the items into coordinates, (meaning they have to be floats) and it includes the indexes while trying to convert them into floats. So I tried to set the indexes to the first thing in the DataFrame but it doesn't work. I wonder if it has anything to do with the fact that it is a part of the whole DataFrame, only the section that is "Latitude" and "Longitude".
df = df_volc.iloc(axis = 0)[0:, 3:5]
df.set_index("hello", inplace = True, drop = True)
df
and I get the a really long error, but this is the last part of it:
KeyError: '34.50'
if I don't do the set_index part I get:
Latitude Longitude
0 34.50 131.60
1 -23.30 -67.62
2 14.50 -90.88
I just wanna know if its possible to get rid of the indexes or set them.
The parameter you need to pass to set_index() function is keys : column label or list of column labels / arrays. In your scenario, it seems like "hello" is not a column name.
I just wanna know if its possible to get rid of the indexes or set them.
It is possible to replace the 0, 1, 2 index with something else, though it doesn't sound like it's necessary for your end goal:
to convert some of the items into [...] floats
To achieve this, you could overwrite the existing values by using astype():
df['Latitude'] = df['Latitude'].astype('float')

In Python, how do I select the columns of a dataframe satisfying a condition on the number of NaN?

I hope someone could help me. I'm new to Python, and I have a dataframe with 111 columns and over 40 000 rows. All the columns contain NaN values (some columns contain more NaN's than others), so I want to drop those columns having at least 80% of NaN values. How can I do this?
To solve my problem, I tried the following code
df1=df.apply(lambda x : x.isnull().sum()/len(x) < 0.8, axis=0)
The function x.isnull().sum()/len(x) is to divide the number of NaN in the column x by the length of x, and the part < 0.8 is to choose those columns containing less than 80% of NaN.
The problem is that when I run this code I only get the names of the columns together with the boolean "True" but I want the entire columns, not just the names. What should I do?
You could do this:
filt = df.isnull().sum()/len(df) < 0.8
df1 = df.loc[:, filt]
You want to achieve two things. First, you have to find the indices of all columns which contain at most 80% NaNs. Second, you want to discard them from your DataFrame.
To get a pandas Series indicating whether a row should be discarded by doing, you can do:
df1 = df.isnull().sum(axis=0) < 0.8*df.shape[1]
(Btw. you have a typo in your question. You should drop the ==True as it always tests whether 0.5==True)
This will give True for all column indices to keep, as .isnull() gives True (or 1) if it is NaN and False (or 0) for a valid number for every element. Then the .sum(axis=0) sums along the columns giving the number of NaNs in each column. The comparison is then, if that number is bigger than 80% of the number of columns.
For the second task, you can use this to index your columns by using:
df = df[df.columns[df1]]
or as suggested in the comments by doing:
df.drop(df.columns[df1==False], axis=1, inplace=True)

Categories

Resources