I want to drop rows in my dataset using:
totes = df3.loc[(df3['Reporting Date'] != '18/08/2017') & (df3['Business Line'] != 'Bondy')]
However, it does not do what I expect; I know that the number of rows I want to drop is 496, from using:
totes = df3.loc[(df3['Reporting Date'] == '18/08/2017') & (df3['Business Line'] == 'Bondy')]
When I run my drop function, it gives back far fewer rows than my dataset minus 496.
Does anyone know how to fix this?
You are correct to use &, but it is being misused. This is a logic problem. Note:
(NOT X) AND (NOT Y) != NOT(X AND Y)
Instead, you can negate a Boolean condition with the ~ operator:
totes = df3.loc[~((df3['Reporting Date'] == '18/08/2017') & (df3['Business Line'] == 'Bondy'))]
Those parentheses and masks can get confusing, so you can write this more clearly:
m1 = df3['Reporting Date'].eq('18/08/2017')
m2 = df3['Business Line'].eq('Bondy')
totes = df3.loc[~(m1 & m2)]
Alternatively, note that:
NOT(X AND Y) == NOT(X) OR NOT(Y)
So you can use:
m1 = df3['Reporting Date'].ne('18/08/2017')
m2 = df3['Business Line'].ne('Bondy')
totes = df3.loc[m1 | m2]
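A quick sketch showing that the two filters are equivalent, on a toy frame with the question's column names (the data here is invented):

```python
import pandas as pd

# Toy stand-in for df3; only the first row matches both conditions.
df3 = pd.DataFrame({
    'Reporting Date': ['18/08/2017', '18/08/2017', '19/08/2017'],
    'Business Line': ['Bondy', 'Paris', 'Bondy'],
})

# De Morgan's law: NOT(X AND Y) == NOT(X) OR NOT(Y)
keep_a = ~((df3['Reporting Date'] == '18/08/2017') & (df3['Business Line'] == 'Bondy'))
keep_b = df3['Reporting Date'].ne('18/08/2017') | df3['Business Line'].ne('Bondy')

assert keep_a.equals(keep_b)   # both masks are identical
print(df3.loc[keep_a])         # only the Bondy/18-08 row is dropped
```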
df["load_weight"] = df.loc[(df["dropoff_site"] == "HORNSBY BEND") & (df['load_type'] == "BRUSH")].fillna(1000, inplace=True)
I want to change the NaN values in the "load_weight" column, but only for the rows that contain "HORNSBY BEND" and "BRUSH". The code above set the whole "load_weight" column to None instead. What did I do wrong?
I would use a mask for boolean indexing:
m = (df["dropoff_site"] == "HORNSBY BEND") & (df['load_type'] == "BRUSH")
df.loc[m, "load_weight"] = df.loc[m, 'load_weight'].fillna(1000)
NB: you can't keep inplace=True when you assign the output. That is what was replacing your data with None; methods called with inplace=True return None.
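A minimal demonstration of that pitfall, on a plain Series:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])

# With inplace=True the method mutates s and returns None --
# assigning that return value back is what wiped the column out.
result = s.fillna(1000, inplace=True)
assert result is None
assert s.tolist() == [1.0, 1000.0, 3.0]
```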
Alternative with only boolean indexing:
m1 = (df["dropoff_site"] == "HORNSBY BEND") & (df['load_type'] == "BRUSH")
m2 = df['load_weight'].isna()
df.loc[m1 & m2, "load_weight"] = 1000
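The masked-assignment pattern in action, on a tiny frame mimicking the question's columns (the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    'dropoff_site': ['HORNSBY BEND', 'HORNSBY BEND', 'OTHER'],
    'load_type': ['BRUSH', 'BRUSH', 'BRUSH'],
    'load_weight': [None, 250.0, None],
})

m1 = (df['dropoff_site'] == 'HORNSBY BEND') & (df['load_type'] == 'BRUSH')
m2 = df['load_weight'].isna()
df.loc[m1 & m2, 'load_weight'] = 1000

# Only the first row is filled; the NaN outside the site/type match is untouched.
print(df['load_weight'].tolist())
```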
Instead of fillna, you can directly use df.loc to do the required imputation:
df.loc[(df['dropoff_site'] == 'HORNSBY BEND') & (df['load_type'] == 'BRUSH')
       & (df['load_weight'].isnull()), 'load_weight'] = 1000
I would like to print the rows from an Excel sheet where the data under a specific column either exists or does not. Whenever I run the code, I get this:
Series([], dtype: int64)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:15: FutureWarning:
Automatic reindexing on DataFrame vs Series comparisons is deprecated and will
raise ValueError in a future version. Do `left, right = left.align(right,
axis=1, copy=False)` before e.g. `left == right`
My snippet is:
at5 = input("Erkély igen?: ")
if at5 == 'igen':
    erkely = tables2[~tables2['balcony'].isnull()]
else:
    erkely = tables2[~tables2['balcony'].notnull()]
#bt = tables2[(tables2['lakas_tipus'] == at1) & (tables2['nm2'] >= at2) &
#             (tables2['nm2'] < at3) & (tables2['room'] == at4) & (tables2['balcony'] == erkely)]
Any idea how to approach this problem? I'm not getting the output I want.
I am trying to drop rows at specific minutes (05, 10, 20). I have a datetime as the index:
df5['Year'] = df5.index.year
df5['Month'] = df5.index.month
df5['Day']= df5.index.day
df5['Day_of_Week']= df5.index.day_name()
df5['hour']= df5.index.strftime('%H')
df5['Min']= df5.index.strftime('%M')
df5
Then I run the below:
def clean(df5):
    for i in range(len(df5)):
        hour = pd.Timestamp(df5.index[i]).hour
        minute = pd.Timestamp(df5.index[i]).minute
        if df5 = df5[(df5.index.minute == 5) | (df5.index.minute == 10) | (df5.index.minute == 20)]
            df.drop(axis=1, index=i, inplace=True)
It returns an invalid syntax error.
Looping is not necessary here, and is not recommended either.
Use DatetimeIndex.minute with Index.isin, then invert the mask with ~ for boolean indexing:
df5 = df5[~df5.index.minute.isin([5, 10, 20])]
To reuse your df5['Min'] column instead, compare against string values:
df5 = df5[~df5['Min'].isin(['05', '10', '20'])]
All together:
def clean(df5):
return df5[~df5.index.minute.isin([5, 10, 20])]
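A sketch of the filter on a made-up 5-minute DatetimeIndex standing in for df5:

```python
import pandas as pd

# 12 rows at 5-minute intervals: minutes 00, 05, 10, ..., 55.
idx = pd.date_range('2023-01-01 00:00', periods=12, freq='5min')
df5 = pd.DataFrame({'value': range(12)}, index=idx)

cleaned = df5[~df5.index.minute.isin([5, 10, 20])]

# Minutes 05, 10 and 20 are gone; the other nine rows remain.
assert not set(cleaned.index.minute) & {5, 10, 20}
print(len(cleaned))
```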
You can just do it using boolean indexing, assuming that the index is already parsed as datetime.
df5 = df5[~((df5.index.minute == 5) | (df5.index.minute == 10) | (df5.index.minute == 20))]
Or, equivalently, with each condition negated (note that negated conditions must be combined with &, not |, per De Morgan's laws):
df5 = df5[(df5.index.minute != 5) & (df5.index.minute != 10) & (df5.index.minute != 20)]
Generally speaking, the right syntax to combine a logical OR inside an if statement is the following (Python's keyword is lowercase or):
today = 'Saturday'
if today == 'Sunday' or today == 'Saturday':
    print('Today is off. Rest at home')
In your case, however, element-wise conditions on a pandas index must be combined with the | operator instead of the or keyword:
df5 = df5[(df5.index.minute == 5) | (df5.index.minute == 10) | (df5.index.minute == 20)]
FINAL NOTE:
You made some mistakes using == and =.
In Python (and many other programming languages), a single equals sign = assigns a value to a variable, whereas a double equals sign == checks whether two expressions have the same value.
= is an assignment operator
== is an equality operator
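The difference in a couple of lines:

```python
x = 5            # '=' assigns: bind the name x to the value 5
print(x == 5)    # '==' compares the two values: prints True
print(x == 6)    # prints False
```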
I know there are several hundred solutions for this, but I was wondering if there is a smarter way to fill the pandas DataFrame's missing Age column values based on a lengthy set of conditions, as follows.
mean_value = df[(df["Survived"] == 1) & (df["Pclass"] == 1) & (df["Sex"] == "male")
                & (df["Embarked"] == "C") & (df["SibSp"] == 0) & (df["Parch"] == 0)].Age.mean().round(2)
df = df.assign(
    Age=np.where(df.Survived.eq(1) & df.Pclass.eq(1) & df.Sex.eq("male") & df.Embarked.eq("C") &
                 df.SibSp.eq(0) & df.Parch.eq(0) & df.Age.isnull(), mean_value, df.Age)
)
Repeating the above for all 6 columns, with every categorical combination, is too long and bulky. Is there a smarter way to do this?
Follow-up on @Ben.T's answer: if I understood your method correctly, this is the "verbose version" of it?
for a in np.unique(df.Survived):
    for b in np.unique(df.Pclass):
        for c in np.unique(df.Sex):
            for d in np.unique(df.SibSp):
                for e in np.unique(df.Parch):
                    for f in np.unique(df.Embarked):
                        mean_value = df[(df["Survived"] == a) & (df["Pclass"] == b) & (df["Sex"] == c)
                                        & (df["SibSp"] == d) & (df["Parch"] == e) & (df["Embarked"] == f)].Age.mean()
                        df = df.assign(Age=np.where(df.Survived.eq(a) & df.Pclass.eq(b) & df.Sex.eq(c) & df.SibSp.eq(d) &
                                                    df.Parch.eq(e) & df.Embarked.eq(f) & df.Age.isnull(), mean_value, df.Age))
which is equivalent to this?
l_col = ['Survived','Pclass','Sex','Embarked','SibSp','Parch']
df['Age'] = df['Age'].fillna(df.groupby(l_col)['Age'].transform('mean'))
You can create a variable that combines all of your criteria, and then you can use the ampersand to add more criteria later.
Note, in the seaborn titanic dataset, where I got the data from, the column names are lowercase.
criteria = ((df["survived"]== 1) &
(df["pclass"] == 1) &
(df["sex"] == "male") &
(df["embarked"] == "C") &
(df["sibsp"] == 0) &
(df["parch"] == 0))
fillin = df.loc[criteria, 'age'].mean()
df.loc[criteria & (df['age'].isnull()), 'age'] = fillin
I guess groupby.transform can do it. For each row it computes the mean over the group defined by all the columns in the groupby, and it does this for all possible combinations at once. Then using fillna with the resulting series fills each missing value with the mean of the group sharing the same characteristics.
l_col = ['Survived','Pclass','Sex','Embarked','SibSp','Parch']
df['Age'] = df['Age'].fillna(df.groupby(l_col)['Age'].transform('mean'))
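A sketch of the groupby/transform fill on a tiny invented frame (two grouping columns instead of six, for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2],
    'Sex':    ['male', 'male', 'male', 'female', 'female'],
    'Age':    [20.0, 40.0, None, 30.0, None],
})

l_col = ['Pclass', 'Sex']
df['Age'] = df['Age'].fillna(df.groupby(l_col)['Age'].transform('mean'))

# The missing (1, male) age becomes 30.0 (mean of 20 and 40);
# the missing (2, female) age becomes 30.0 (mean of the single 30).
print(df['Age'].tolist())   # [20.0, 40.0, 30.0, 30.0, 30.0]
```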
I have seen many ways to locate a data frame for one date range,
i.e.
mask = (df["TimeStamp"] > date_range[0]) & (df["TimeStamp"] < date_range[1])
df = df.loc[mask]
but I can't find out how to do this if I had multiple date ranges
i.e.
date_ranges = [date_range_1, date_range_2, date_range_3, ... , date_range_n]
I would need something like
mask = ()
for date_range in date_ranges:
    sub_mask = (df["TimeStamp"] > date_range[0]) & (df["TimeStamp"] < date_range[1])
    mask.append(sub_mask)
df = df.loc[mask]
but of course this doesn't work, for a variety of reasons (you need an OR between the ANDed conditions, and you can't append masks this way).
could anybody give me a nudge in the right direction?
You could change your code as follows:
mask = 0
for date_range in date_ranges:
    sub_mask = (df["TimeStamp"] > date_range[0]) & (df["TimeStamp"] < date_range[1])
    mask = (mask | sub_mask)
df = df.loc[mask]
Starting from mask = 0 works because 0 | sub_mask evaluates to sub_mask, so the first iteration simply seeds the mask; each later iteration ORs in the next date range.
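The same idea written without the sentinel 0, using functools.reduce to OR together one mask per range (variable names mirror the question; the dates are invented):

```python
from functools import reduce
from operator import or_

import pandas as pd

df = pd.DataFrame({'TimeStamp': pd.to_datetime(
    ['2023-01-01', '2023-02-15', '2023-05-10', '2023-08-01'])})
date_ranges = [
    (pd.Timestamp('2023-01-01'), pd.Timestamp('2023-03-01')),
    (pd.Timestamp('2023-07-01'), pd.Timestamp('2023-09-01')),
]

# One boolean mask per range, then OR them all together.
masks = [(df['TimeStamp'] > lo) & (df['TimeStamp'] < hi) for lo, hi in date_ranges]
combined = reduce(or_, masks)

print(df.loc[combined])   # rows strictly inside either range
```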