pandas string noise canceling - python

Say I have a list of drinks:
drinks=['coke','water','milk','yoghourt']
And I have a pandas series containing some of the items mixed with other noisy strings
s = pd.Series(['cokeabc', np.nan, np.nan, 'water coke', np.nan, 'milk and yoghourt', 'only water'])
My goal is to filter out the noise first, fill in the missing values based on another column, and then apply get_dummies to the s column.
My try was:
buff = []
for i in material:
    if df['drink'].str.contains(i):
        buff.append(i)
kvkl['drink'] = ' '.join(buff)
but df['drink'].str.contains(i) returns the whole column of bools
should I try apply()?

You can easily make your code work by just adding .any() at the end of the condition:
buff = []
for i in material:
    if df['drink'].str.contains(i).any():
        buff.append(i)
kvkl['drink'] = ' '.join(buff)
This checks whether any cell in the column is True and delivers the expected result.

OK, I figured it out:
def drink_format(mtr):
    drinks = ['coke', 'water', 'milk', 'yoghourt']
    buff = []
    for i in drinks:
        if i in mtr:  # substring match against the noisy string
            buff.append(i)
    return ' '.join(buff)

s = s.map(drink_format, na_action='ignore')  # skip NaN cells so 'in' doesn't fail on floats
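From here, the get_dummies step mentioned at the top can use the built-in string accessor. A minimal sketch (any NaN rows still left unfilled simply come out as all-zero rows):

dummies = s.str.get_dummies(sep=' ')  # one indicator column per drink
print(dummies)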

Related

Remove "?" from pandas column

I have a pandas dataset whose columns have dtype object. The columns, however, contain numerical float values along with '?', and I'm trying to convert them to float. I want to remove these '?' from the entire column, turning those values into NaN (not 0), and then convert the column to float64.
The output of value_counts() on the Voltage column looks like this:
? 3771
240.67 363
240.48 356
240.74 356
240.62 356
...
227.61 1
227.01 1
226.36 1
227.28 1
227.02 1
Name: Voltage, Length: 2276, dtype: int64
What is the best way to do that when the entire dataset has "?" mixed in with the numbers and I want to convert them all at once?
I tried something like this but it's not working. I want to do this operation for all the columns. Thanks.
df['Voltage'] = df['Voltage'].apply(lambda x: float(x.split()[0].replace('?', '')))
One more question: how can I count the "?" values across all the columns? I tried something like the following. Thanks.
counts = []
for i in df.columns:
    if '?' not in df[i].values:  # skip columns without any '?'
        continue
    counts.append(df[i].value_counts()['?'])
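A vectorized alternative is to compare the whole frame at once; a sketch, assuming the placeholder is always exactly the string '?':

question_marks = (df == '?').sum()  # per-column counts of '?' cells
print(question_marks)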
So, from your value_counts output it is clear that you just have some values that are floats stored as strings, and some values that contain ? (apparently that ARE ?).
So the one thing NOT to do is use apply or applymap.
Those are just one step below for loops and iterrows in the hierarchy of what not to do.
The only case where you should use apply is when you would otherwise have to iterate rows with for, and those cases almost never happen (in my real life, I've used apply only once, as a beginner, and I'm pretty sure that if I reviewed that code now I would find another way).
In your case
df.Voltage = df.Voltage.where(~df.Voltage.str.contains(r'\?')).astype(float)
should do what you want
df.Voltage.str.contains(r'\?') is a True/False series saying whether a row contains a '?'. So ~df.Voltage.str.contains(r'\?') is the opposite (True if the row does not contain a '?'). df.Voltage.where(~df.Voltage.str.contains(r'\?')) is therefore a series where values matching the condition are left as-is, and the others are replaced by the second argument, or, when there is no second argument (our case), by NaN. Exactly what you want. Adding .astype(float) then converts everything to float, since that is now possible (every row contains either a string representing a float, such as 230.18, or a NaN, all of which are convertible to float).
An alternative, closer to what you were trying, that first replaces the ? in place, would be
df.loc[df.Voltage=='?', 'Voltage']=None
# And then, df.Voltage.astype(float) converts to float, with NaN where you put None
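For completeness, a standard pandas idiom for this kind of cleanup (not used in the answer above, but worth knowing) is pd.to_numeric with errors='coerce', which turns anything unparseable, including '?', into NaN in one pass:

# coerce every non-numeric string (e.g. '?') to NaN; the result dtype is float64
df['Voltage'] = pd.to_numeric(df['Voltage'], errors='coerce')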

How to delete an outlier from a np.where condition

I have this dataframe that has an outlier, which I recognized through a boxplot. Then I caught its value through np.where, but the thing is, I don't know how to delete this value and its whole row from my dataframe so that I can get rid of the outlier.
This is the code I used for it so far:
sns.boxplot(x=df_cor_inc['rt'].astype(float))
outlier = np.where(df_cor_inc['rt'].astype(float)>50000)
Any help would be great. Thanks.
No need for np.where, a simple boolean mask will do the trick:
df_cor_inc = df_cor_inc[df_cor_inc['rt'] <= 50000]
Also, why are you casting df_cor_inc['rt'] as float? Is it not already numeric?
If you want to reset the indices of your dataframe, tack on a .reset_index(drop=True).
Try this:
df_cor_inc[np.where(df_cor_inc['rt'].astype(float)>50000,False,True)]
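As a quick end-to-end illustration of the boolean-mask approach (toy data; the 'rt' values here are made up):

import pandas as pd

df_cor_inc = pd.DataFrame({'rt': [1200.0, 950.0, 73000.0, 1100.0]})

mask = df_cor_inc['rt'] <= 50000                      # True for rows to keep
df_cor_inc = df_cor_inc[mask].reset_index(drop=True)  # drops the outlier row, renumbers
print(df_cor_inc)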

Drop row in Pandas dataframe if zero bordered by numbers (Python)

Due to some foibles in the API I'm using, sometimes a zero is returned when it should return a number, which works its way through to the pandas dataframe that my script outputs (Python).
What would be a Pythonic way to drop a row if a zero is bordered both above and below by non-zero numbers? I can think of extensive loops to solve this, but that'd be quite an intensive way of going about this.
Note that elsewhere in the dataframe there'll be continuous rows of zeros, which are valid, so it's not simply a case of dropping all rows with zeros in them; I only want to drop rows with zero if they're bordered by rows with valid non-zero numbers.
Assuming col is the column you want to filter on, and its type is str (drop the quotes if it's float):
df = df.loc[~ (df["col"].shift(-1).ne("0.0") & df["col"].eq("0.0") & df["col"].shift(1).ne("0.0"))]
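For the numeric case, the same idea as a runnable sketch (the sample data here is made up):

import pandas as pd

df = pd.DataFrame({'col': [5.0, 0.0, 7.0, 0.0, 0.0, 3.0]})

# a zero is spurious when the rows above and below are both non-zero
isolated_zero = (
    df['col'].eq(0)
    & df['col'].shift(1).ne(0)
    & df['col'].shift(-1).ne(0)
)
df = df.loc[~isolated_zero]  # keeps the consecutive zeros, drops the lone one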

Filtering a Pandas DataFrame Using a Function

This question is related to the question I posted yesterday, which can be found here.
So, I went ahead and implemented the solution provided by Jan to the entire data set. The solution is as follows:
import re
def is_probably_english(row, threshold=0.90):
    regular_expression = re.compile(r'[-a-zA-Z0-9_ ]')
    ascii = [character for character in row['App'] if regular_expression.search(character)]
    quotient = len(ascii) / len(row['App'])
    passed = True if quotient >= threshold else False
    return passed
google_play_store_is_probably_english = google_play_store_no_duplicates.apply(is_probably_english, axis=1)
google_play_store_english = google_play_store_no_duplicates[google_play_store_is_probably_english]
So, from what I understand, we are filtering the google_play_store_no_duplicates DataFrame using the is_probably_english function and storing the result, which is a boolean, into another DataFrame (google_play_store_is_probably_english). The google_play_store_is_probably_english is then used to filter out the non-English apps in the google_play_store_no_duplicates DataFrame, with the end result being stored in a new DataFrame.
Does this make sense and does it seem like a sound way to approach the problem? Is there a better way to do this?
This makes sense, and I think this is a sound way to do it. The result of the function is a boolean, as you said, and when you apply it row by row you end up with a pd.Series of booleans, which is usually called a boolean mask. This concept is very useful in pandas whenever you want to filter rows by some condition.
Here is an article about boolean masks in pandas.
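To make the mask idea concrete, here is a minimal sketch using the function from the question (the two app names are made up):

import pandas as pd

df = pd.DataFrame({'App': ['Subway Surfers', '日本語のアプリ']})

mask = df.apply(is_probably_english, axis=1)  # Series of booleans, one per row
english_only = df[mask]                       # keeps only rows where the mask is True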

Can't Do Math on Column If Some Rows Are of Type String

Here is a sample of my df:
      units     price
0  143280.0    0.8567
1    4654.0  464.912
2  512210.0      607
3   Unknown        0
4   Unknown        0
I have the following code:
myDf.loc[(myDf["units"].str.isnumeric()) & (myDf["price"].str.isnumeric()), 'newValue'] = (
    myDf["price"].astype(float).fillna(0.0) *
    myDf["units"].astype(float).fillna(0.0) /
    1000)
As you can see, I'm trying to only do math to create the 'newValue' column for rows where the two source columns are both numeric. However, I get the following error:
ValueError: could not convert string to float: 'Unknown'
So it seems that even though I'm attempting to perform math only on the rows that don't have text, Pandas does not like that any of the rows have text.
Note that I need to maintain the instances of "Unknown" exactly as they are and so filling those with zero is not a good option.
This has me pretty stumped. I could not find any solutions by searching Google.
Would appreciate any help/solutions.
You can use the same condition you use on the left side of the = on the right side as follows (I set the condition in a variable is_num for readability):
# regex=False treats '.' as a literal dot rather than a regex wildcard
is_num = (myDf["units"].astype(str).str.replace('.', '', regex=False).str.isnumeric()
          & myDf["price"].astype(str).str.replace('.', '', regex=False).str.isnumeric())
myDf.loc[is_num, 'newValue'] = (
    myDf.loc[is_num, "price"].astype(float).fillna(0.0) *
    myDf.loc[is_num, "units"].astype(float).fillna(0.0) / 1000)
Also, you should check against the dataframe you actually read in, but from this example, you can:
Remove the fillna(0.0), since there are no NaNs
Remove the checks on 'price' (as of your example, price is always numeric, so the check is not necessary)
Remove the astype(float) cast for price, since it's already numeric.
That would lead to the following somewhat more concise code:
is_num = myDf["units"].astype(str).str.replace('.', '', regex=False).str.isnumeric()
myDf.loc[is_num, 'newValue'] = (
    myDf.loc[is_num, "price"] *
    myDf.loc[is_num, "units"].astype(float) / 1000)
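As a quick check, running the concise version against the sample frame from the question (reconstructed below) fills newValue only where units is numeric and leaves the 'Unknown' rows untouched:

import pandas as pd

myDf = pd.DataFrame({
    'units': [143280.0, 4654.0, 512210.0, 'Unknown', 'Unknown'],
    'price': [0.8567, 464.912, 607, 0, 0],
})

is_num = myDf['units'].astype(str).str.replace('.', '', regex=False).str.isnumeric()
myDf.loc[is_num, 'newValue'] = (
    myDf.loc[is_num, 'price'] *
    myDf.loc[is_num, 'units'].astype(float) / 1000)
print(myDf)  # 'Unknown' rows show NaN in newValue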
