Remove "?" from pandas column - python

I've a pandas dataset which has columns and it's Dtype is object. The columns however has numerical float values inside it along with '?' and I'm trying to convert it to float. I want to remove these '?' from the entire column and making those values Nan but not 0 and then convert the column to float64.
The output of value_count() of Voltage column look like this :
? 3771
240.67 363
240.48 356
240.74 356
240.62 356
...
227.61 1
227.01 1
226.36 1
227.28 1
227.02 1
Name: Voltage, Length: 2276, dtype: int64
What is the best way to do that in case I've entire dataset which has "?" inside them along with numbers and i want to convert them all at once.
I tried something like this but it's not working. I want to do this operation for all the columns. Thanks
df['Voltage'] = df['Voltage'].apply(lambda x: float(x.split()[0].replace('?', '')))
1 More question. How can I get "?" from all the columns. I tried something like. Thanks
list = []
for i in df.columns:
if '?' in df[i]
continue
series = df[i].value_counts()['?']
list.append(series)

So, from your value_count, it is clear, that you just have some values that are floats, in a string, and some values that contain ? (apparently that ARE ?).
So, the one thing NOT to do, is use apply or applymap.
Those are just one step below for loops and iterrows in the hierarchy of what not to do.
The only cases where you should use apply is when, otherwise, you would have to iterate rows with for. And those cases almost never happen (in my real life, I've used apply only once. And that was when I was a beginner, and I am pretty sure that if I were to review that code now, I would find another way).
In your case
df.Voltage = df.Voltage.where(~df.Voltage.str.contains('\?')).astype(float)
should do what you want
df.Voltage.str.contains('\?') is a True/False series saying if a row contains a '?'. So ~df.Voltage.str.contains('\?') is the opposite (True if the row does not contain a '\?'. So df.Voltage.where(~df.Voltage.str.contains('\?')) is a serie where values that match ~df.Voltage.str.contains('\?') are left as is, and the other are replaced by the 2nd argument, or, if there is no 2nd argument (which is our case) by NaN. So exactly what you want. Adding .astype(float) convert everyhting to float, since it should now be possible (all rows contains either strings representing a float such as 230.18, or a NaN. So, all convertible to float).
An alternative, closer to what you where trying, that is replacing first, in place, the ?, would be
df.loc[df.Voltage=='?', 'Voltage']=None
# And then, df.Voltage.astype(float) converts to float, with NaN where you put None

Related

Not getting stats analysis of binary column pandas

I have a dataframe, 11 columns 18k rows. The last column is either a 1 or 0, but when I use .describe() all I get is
count 19020
unique 2
top 1
freq 12332
Name: Class, dtype: int64
as opposed to an actual statistical analysis with mean, std, etc.
Is there a way to do this?
If your numeric (0, 1) column is not being picked up automatically by .describe(), it might be because it's not actually encoded as an int dtype. You can see this in the documentation of the .describe() method, which tells you that the default include parameter is only for numeric types:
None (default) : The result will include all numeric columns.
My suggestion would be the following:
df.dtypes # check datatypes
df['num'] = df['num'].astype(int) # if it's not integer, cast it as such
df.describe(include=['object', 'int64']) # explicitly state the data types you'd like to describe
That is, first check the datatypes (I'm assuming the column is called num and the dataframe df, but feel free to substitute with the right ones). If this indicator/(0,1) column is indeed not encoded as int/integer type, then cast it as such by using .astype(int). Then, you can freely use df.describe() and perhaps even specify columns of which data types you want to include in the description output, for more fine-grained control.
You could use
# percentile list
perc =[.20, .40, .60, .80]
# list of dtypes to include
include =['object', 'float', 'int']
data.describe(percentiles = perc, include = include)
where data is your dataframe (important point).
Since you are new to stack, I might suggest that you include some actual code (i.e. something showing how and on what you are using your methods). You'll get better answers

Pandas str() method returns nan values - how to keep numeric values?

probably a trivial question: I have a pandas dataframe and a column with mixed dtypes. I would like to run various string methods on the column items, e.g. str.upper(), str.lower(), str.capitalize() etc. It works well for just string values in the column, however with numeric values (int/float) I get nan.
Example with str.upper():
output_table.iloc[:,0] = input_table.iloc[:,0].str.upper()
Justtext -> JUSTTEXT
Textwith500number -> TEXTWITH500NUMBER
500 -> nan
-11.6 -> nan
As the dataframe can become quite large (> 1m rows) I would like to have a fast routine to convert the input column by means of the respective string methods. How can I keep the numeric values untouched as they were (not returning nan) and only convert the string values? Something along the lines of pandas errors='ignore'.
Any help is much appreciated. Thank you!
You can use list comprehension:
df = pd.DataFrame({'desc': ['apple', "Textwidh500number", 500, -11.6]})
df["desc"] = [i.upper() if isinstance(i, str) else i for i in df["desc"]]
print (df)
desc
0 APPLE
1 TEXTWIDH500NUMBER
2 500
3 -11.6
I just did something similar with pd.to_numeric and passing errors='coerce' and .notnull(). Try this:
input_table.loc[(pd.to_numeric(input_table['Col_Name'], errors='coerce').notnull()),'Col_Name'].str.upper()

Can't Do Math on Column If Some Rows Are of Type String

Here is a sample of my df:
units price
0 143280.0 0.8567
1 4654.0 464.912
2 512210.0 607
3 Unknown 0
4 Unknown 0
I have the following code:
myDf.loc[(myDf["units"].str.isnumeric())&(myDf["price"].str.isnumeric()),'newValue']=(
myDf["price"].astype(float).fillna(0.0)*
myDf["units"].astype(float).fillna(0.0)/
1000)
As you can see, I'm trying to only do math to create the 'newValue' column for rows where the two source columns are both numeric. However, I get the following error:
ValueError: could not convert string to float: 'Unknown'
So it seems that even though I'm attempting to perform math only on the rows that don't have text, Pandas does not like that any of the rows have text.
Note that I need to maintain the instances of "Unknown" exactly as they are and so filling those with zero is not a good option.
This has be pretty stumped. Could not find any solutions by searching Google.
Would appreciate any help/solutions.
You can use the same condition you use on the left side of the = on the right side as follows (I set the condition in a variable is_num for readability):
is_num = (myDf["units"].astype(str).str.replace('.', '').str.isnumeric()) & (myDf["price"].astype(str).str.replace('.', '').str.isnumeric())
myDf.loc[is_num,'newValue']=(
myDf.loc[is_num, "price"].astype(float).fillna(0.0)*
myDf.loc[is_num, "units"].astype(float).fillna(0.0)/1000)
Also, you need to check with your read dataframe, but from this example, you can:
Remove the fillna(0.0), since there are no NaNs
Remove the checks on 'price' (as of your example, price is always numeric, so the check is not necessary)
Remove the astype(float) cast for price, since it's already numeric.
That would lead to the following somewhat more concise code:
is_num = myDf["units"].astype(str).str.replace('.', '').str.isnumeric()
myDf.loc[is_num,'newValue']=(
myDf.loc[is_num, "price"].astype(float)*
myDf.loc[is_num, "units"]/1000)

df.set_index returns key error python pandas dataframe

I have this Pandas DataFrame and I have to convert some of the items into coordinates, (meaning they have to be floats) and it includes the indexes while trying to convert them into floats. So I tried to set the indexes to the first thing in the DataFrame but it doesn't work. I wonder if it has anything to do with the fact that it is a part of the whole DataFrame, only the section that is "Latitude" and "Longitude".
df = df_volc.iloc(axis = 0)[0:, 3:5]
df.set_index("hello", inplace = True, drop = True)
df
and I get the a really long error, but this is the last part of it:
KeyError: '34.50'
if I don't do the set_index part I get:
Latitude Longitude
0 34.50 131.60
1 -23.30 -67.62
2 14.50 -90.88
I just wanna know if its possible to get rid of the indexes or set them.
The parameter you need to pass to set_index() function is keys : column label or list of column labels / arrays. In your scenario, it seems like "hello" is not a column name.
I just wanna know if its possible to get rid of the indexes or set them.
It is possible to replace the 0, 1, 2 index with something else, though it doesn't sound like it's necessary for your end goal:
to convert some of the items into [...] floats
To achieve this, you could overwrite the existing values by using astype():
df['Latitude'] = df['Latitude'].astype('float')

AttributeError: 'float' object has no attribute 'split'

I am calling this line:
lang_modifiers = [keyw.strip() for keyw in row["language_modifiers"].split("|") if not isinstance(row["language_modifiers"], float)]
This seems to work where row["language_modifiers"] is a word (atlas method, central), but not when it comes up as nan.
I thought my if not isinstance(row["language_modifiers"], float) could catch the time when things come up as nan but not the case.
Background: row["language_modifiers"] is a cell in a tsv file, and comes up as nan when that cell was empty in the tsv being parsed.
You are right, such errors mostly caused by NaN representing empty cells.
It is common to filter out such data, before applying your further operations, using this idiom on your dataframe df:
df_new = df[df['ColumnName'].notnull()]
Alternatively, it may be more handy to use fillna() method to impute (to replace) null values with something default.
E.g. all null or NaN's can be replaced with the average value for its column
housing['LotArea'] = housing['LotArea'].fillna(housing.mean()['LotArea'])
or can be replaced with a value like empty string "" or another default value
housing['GarageCond']=housing['GarageCond'].fillna("")
You might also use df = df.dropna(thresh=n) where n is the tolerance. Meaning, it requires n Non-NA values to not drop the row
Mind you, this approach will remove the row
For example: If you have a dataframe with 5 columns, df.dropna(thresh=5) would drop any row that does not have 5 valid, or non-Na values.
In your case you might only want to keep valid rows; if so, you can set the threshold to the number of columns you have.
pandas documentation on dropna

Categories

Resources