Drop rows with multiple conditions based on multiple columns in Python

I have a dataset (df) as below:
I want to drop rows where SKU is "abc" and Packing is "1KG" or "5KG".
I have tried using following code:
df.drop( df[ (df['SKU'] == "abc") & (df['Packing'] == "10KG") & (df['Packing'] == "5KG") ].index, inplace=True)
Getting following error while trying above code:
NameError Traceback (most recent call last)
<ipython-input-1-fb4743b43158> in <module>
----> 1 df.drop( df[ (df['SKU'] == "abc") & (df['Packing'] == "10KG") & (df['Packing'] == "5KG") ].index, inplace=True)
NameError: name 'df' is not defined
Any help on this will be greatly appreciated. Thanks.

I suggest trying this:
df = df.loc[~((df['SKU'] == 'abc') & (df['Packing'].isin(['1KG', '5KG'])))]
The .loc selects rows matching the conditions, and the ~ negates them, so you keep every row that does not match.
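A minimal sketch of the same idea, assuming the question's column names 'SKU' and 'Packing' and some made-up rows:

import pandas as pd

df = pd.DataFrame({'SKU': ['abc', 'abc', 'abc', 'xyz'],
                   'Packing': ['1KG', '5KG', '10KG', '1KG']})

# Keep every row that does NOT match both conditions
mask = (df['SKU'] == 'abc') & (df['Packing'].isin(['1KG', '5KG']))
df = df[~mask]
print(df)  # the abc/1KG and abc/5KG rows are gone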

Related

Do `left, right = left.align(right, axis=1, copy=False)` before e.g. `left == right`

I would like to print out the rows from an Excel file where the data in a specific column either exists or is missing. Whenever I run the code, I get this:
Series([], dtype: int64)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:15: FutureWarning:
Automatic reindexing on DataFrame vs Series comparisons is deprecated and will
raise ValueError in a future version. Do `left, right = left.align(right,
axis=1, copy=False)` before e.g. `left == right`
My snippet is:
at5 = input("Erkély igen?: ")  # "Erkély igen?" = "Balcony, yes?"; 'igen' = 'yes'
if at5 == 'igen':
    erkely = tables2[~tables2['balcony'].isnull()]
else:
    erkely = tables2[~tables2['balcony'].notnull()]
#bt = tables2[(tables2['lakas_tipus'] == at1) & (tables2['nm2'] >= at2) &
#             (tables2['nm2'] < at3) & (tables2['room'] == at4) & (tables2['balcony'] == erkely)]
Any idea how to approach this problem? I'm not getting the output I want.
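A likely cause (assuming the commented-out bt filter is what produces the warning) is the term tables2['balcony'] == erkely: erkely is a whole filtered DataFrame, and comparing a column (a Series) against a DataFrame is exactly the DataFrame-vs-Series comparison the FutureWarning complains about. A minimal sketch, reusing the names from the snippet, that filters with a boolean mask instead:

at5 = input("Erkély igen?: ")  # 'igen' means 'yes'
# Build a boolean mask for the balcony condition instead of a filtered DataFrame
has_balcony = tables2['balcony'].notnull() if at5 == 'igen' else tables2['balcony'].isnull()
bt = tables2[(tables2['lakas_tipus'] == at1) & (tables2['nm2'] >= at2) &
             (tables2['nm2'] < at3) & (tables2['room'] == at4) & has_balcony]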

How to deal with error creating pivot on empty dataframe?

I have a dataframe which I need to filter and then do something with the results (a pivot, for example).
Sometimes the result is an empty dataframe and the call to pivot fails.
How can I deal with this?
The filtering is done like this:
df_sparen = df[(df['INCOME_EXPENSES'] == "Transaktion abbuchen") & (df['CATEGORY'] == "Trade Republic")]
then the pivot table call:
table_sparen = df_sparen.pivot_table(values='AMOUNT', index=['INCOME_EXPENSES'],
columns=['MONTHYEAR'], aggfunc=np.sum, margins=True)
This breaks as df_sparen is empty with the error:
ValueError: No objects to concatenate
Any advice on how to deal with this is very much appreciated.
You can use df.empty:
table_sparen = (
    df_sparen.pivot_table('AMOUNT', 'INCOME_EXPENSES', 'MONTHYEAR',
                          aggfunc=np.sum, margins=True)
    if not df_sparen.empty else pd.DataFrame({'All': {'All': 0}})
)
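As a side note (a minor variation, not required for the fix): recent pandas versions may warn when a raw numpy callable such as np.sum is passed as aggfunc, and the string 'sum' is the equivalent, numpy-free spelling:

table_sparen = (
    df_sparen.pivot_table('AMOUNT', 'INCOME_EXPENSES', 'MONTHYEAR',
                          aggfunc='sum', margins=True)
    if not df_sparen.empty else pd.DataFrame({'All': {'All': 0}})
)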
Just check if the dataframe isn't empty?
df_sparen = df[(df['INCOME_EXPENSES'] == "Transaktion abbuchen") & (df['CATEGORY'] == "Trade Republic")]
if len(df_sparen) > 0:
    table_sparen = df_sparen.pivot_table(values='AMOUNT', index=['INCOME_EXPENSES'], columns=['MONTHYEAR'], aggfunc=np.sum, margins=True)
or use a try/except clause:
try:
    df_sparen = df[(df['INCOME_EXPENSES'] == "Transaktion abbuchen") & (df['CATEGORY'] == "Trade Republic")]
    table_sparen = df_sparen.pivot_table(values='AMOUNT', index=['INCOME_EXPENSES'], columns=['MONTHYEAR'], aggfunc=np.sum, margins=True)
except ValueError:
    print(f'Empty DataFrame for {"Transaktion abbuchen"} and {"Trade Republic"}')

How to remove the NameError: name 'Dataset' is not defined

I just can't find what I am doing wrong in defining df1.
import pandas as pd
df = pd.read_csv(r"D:\Programming\Datasets\avocado.csv")
df1 = df[ df['region'] == 'Albany' ]
df1
NameError Traceback (most recent call last)
NameError: name 'df1' is not defined
Please use a double equals sign (==) while filtering:
import pandas as pd
df = pd.read_csv(r"D:\Programming\Datasets\avocado.csv")
df1 = df[df['region'] == 'Albany']
df1
I hope this helps,
Kind regards.
Please try the code below and check whether you get any filtered data:
filtered_region = df['region'] == 'Albany'
Please check whether the filtered_region object is filled.
Then try it like this:
df1 = df[filtered_region]
df1
Is that your exact code? And you're running in Jupyter/Jupyterlabs, correct?
The code you pasted, with what I'm assuming is the Kaggle avocado.csv dataset, works for me. But I'm wondering if you are trying to call df1 before assignment. If I do either of these I get NameError: name 'df1' is not defined:
df = pd.read_csv('/Users/my_username/Downloads/avocado.csv')
df1 = df1[ df['region'] == 'Albany' ]
df1
or
df = pd.read_csv('/Users/my_username/Downloads/avocado.csv')
df1 = df[ df1['region'] == 'Albany' ]
df1
In both examples you can see how df1 is referenced before it is assigned a value.
I used this import and it solved my problem:
from netCDF4 import Dataset
For those using PyTorch:
I solved my issue by importing the Dataset class:
from torch.utils.data import Dataset
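For context, a minimal sketch (not from the original post) of why that import is needed: in PyTorch, Dataset is the abstract base class you subclass for custom datasets, so it must be imported before it can be referenced:

from torch.utils.data import Dataset

class MyDataset(Dataset):  # hypothetical example class
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]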

How can I run multiple filters in pandas?

I want to run multiple filters on different columns of an Excel sheet imported into pandas: 'Frequency', 'Decile' and 'Audience' set to 'all', 'Dimension' = 'campaign', and KPI name = 'honda_2018...'. I am running the following code:
def filter_df(df, *args):
    for 'Frequency', 'All' in args:
        df = df[df['Frequency'] == 'All']
    return df
It is giving me an error: SyntaxError: can't assign to literal. Please help.
You can try .loc
Sample Data:
my_frame = pd.DataFrame(data={'name': ['alex5', 'martha1', 'collin4', 'cynthia9'],
                              'simulation1': [71, 4.8, 65, 4.7],
                              'simulation2': [71, 4.8, 69, 4.7],
                              'simulation3': [70, 3.8, 68, 4.9],
                              'experiment': [70.3, 3.5, 65, 4.4]})
my_frame
Running the code below returns the row at index 1, the only row where simulation1 equals 4.8:
my_frame.loc[(my_frame["simulation1"] == 4.8)]
Then if you want to filter on more columns, combine the conditions with &. This returns the row at index 2, where simulation1 is 65 and simulation2 is 69:
my_frame.loc[(my_frame["simulation1"] == 65) &
             (my_frame["simulation2"] == 69)]
Rinse and repeat until you're satisfied.
As far as I know, it's also possible to do this, as long as you use & (not and) and wrap each condition in parentheses:
df = df[(df['Frequency'] == 'All') & (df['Something'] == 'Something else')]
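If the goal of the original filter_df helper is to apply several column/value filters in one call, a minimal sketch could look like this (the pair-based signature is an assumption, not the poster's intended API):

def filter_df(df, *conditions):
    # each condition is a (column, value) pair, e.g. ('Frequency', 'All')
    for column, value in conditions:
        df = df[df[column] == value]
    return df

filtered = filter_df(df, ('Frequency', 'All'), ('Dimension', 'campaign'))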

python+pyspark: error on inner join with multiple column comparison in pyspark

Hi, I have 2 dataframes to join:
#df1
name genre count
satya drama 1
satya action 3
abc drame 2
abc comedy 2
def romance 1
#df2
name max_count
satya 3
abc 2
def 1
Now I want to join the above 2 dfs on name and count == max_count, but I am getting an error:
import pyspark.sql.functions as F
from pyspark.sql.functions import count, col
from pyspark.sql.functions import struct
df = spark.read.csv('file',sep = '###', header=True)
df1 = df.groupBy("name", "genre").count()
df2 = df1.groupby('name').agg(F.max("count").alias("max_count"))
#Now trying to join both dataframes
final_df = df1.join(df2, (df1.name == df2.name) & (df1.count == df2.max_count))
final_df.show() ###Error
#py4j.protocol.Py4JJavaError: An error occurred while calling o207.showString.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
#Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: count(1)
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:224)
But it succeeds with a "left" join:
final_df = df1.join(df2, (df1.name == df2.name) & (df1.count == df2.max_count), "left")
final_df.show() ### Success, but I don't want a left join, I want an inner join
My question is why the above one fails. Am I doing something wrong there?
I referred to this link: "Find maximum row per group in Spark DataFrame". I used the first answer (the 2-groupby method), but got the same error.
I am on spark-2.0.0-bin-hadoop2.7 and python 2.7.
Please suggest. Thanks.
Edit:
The above scenario works with spark 1.6 (which is quite surprising; what is wrong with spark 2.0, or with my installation? I will reinstall, check, and update here).
Has anybody tried this on spark 2.0 and succeeded by following Yaron's answer below?
I ran into the same problem when I tried to join two DataFrames where one of them was GroupedData. It worked for me when I cached the GroupedData DataFrame before the inner join. For your code, try:
df1 = df.groupBy("name", "genre").count().cache() # added cache()
df2 = df1.groupby('name').agg(F.max("count").alias("max_count")).cache() # added cache()
final_df = df1.join(df2, (df1.name == df2.name) & (df1.count == df2.max_count)) # no change
Update: It seems like your code was also failing due to the use of "count" as a column name.
count seems to be a protected keyword in the DataFrame API.
Renaming count to "mycount" solved the problem. The working code below was modified to support spark version 1.5.2, which I used to test your issue.
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/tmp/fac_cal.csv")
df1 = df.groupBy("name", "genre").count()
df1 = df1.select(col("name"),col("genre"),col("count").alias("mycount"))
df2 = df1.groupby('name').agg(F.max("mycount").alias("max_count"))
df2 = df2.select(col('name').alias('name2'),col("max_count"))
#Now trying to join both dataframes
final_df = df1.join(df2,[df1.name == df2.name2 , df1.mycount == df2.max_count])
final_df.show()
+-----+---------+-------+-----+---------+
| name| genre|mycount|name2|max_count|
+-----+---------+-------+-----+---------+
|brata| comedy| 2|brata| 2|
|brata| drama| 2|brata| 2|
|panda|adventure| 1|panda| 1|
|panda| romance| 1|panda| 1|
|satya| action| 3|satya| 3|
+-----+---------+-------+-----+---------+
The example for a complex condition in https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html:
cond = [df.name == df3.name, df.age == df3.age]
df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
can you try:
final_df = df1.join(df2, [df1.name == df2.name , df1.mycount == df2.max_count])
Note also that, according to the spec, "left" is not one of the valid join types:
how – str, default ‘inner’. One of inner, outer, left_outer, right_outer, leftsemi.
My work-around in spark 2.0:
I created a single column ('combined') from the columns in the join comparison ('name', 'mycount') in the respective dfs, so now I have one column to compare, and this does not create any issue as I am comparing only one column.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def combine_func(*args):
    data = '_'.join([str(x) for x in args])  ### convert non-strings to str, then concatenate
    return data

combine_new = udf(combine_func, StringType())  ## register the func as a udf
df1 = df1.withColumn('combined_new_1', combine_new(df1['name'], df1['mycount']))  ### a col with the concatenated value of the name and mycount columns, e.g. 'satya_3'
df2 = df2.withColumn('combined_new_2', combine_new(df2['name2'], df2['max_count']))
#df1.columns == 'name','genre', 'mycount', 'combined_new_1'
#df2.columns == 'name2', 'max_count', 'combined_new_2'
#Now join
final_df = df1.join(df2,df1.combined_new_1 == df2.combined_new_2, 'inner')
#final_df = df1.join(df2,df1.combined_new_1 == df2.combined_new_2, 'inner').select('the columns you want')
final_df.show()  #### It shows the result, trust me.
Please don't follow this unless you are in a hurry; better to search for a reliable solution.
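For what it's worth, the same combined key can be built without a Python UDF using the built-in concat_ws, which is usually faster (a sketch, assuming the same column names as above):

from pyspark.sql.functions import concat_ws, col

df1 = df1.withColumn('combined_new_1', concat_ws('_', col('name'), col('mycount').cast('string')))
df2 = df2.withColumn('combined_new_2', concat_ws('_', col('name2'), col('max_count').cast('string')))
final_df = df1.join(df2, df1.combined_new_1 == df2.combined_new_2, 'inner')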
