comparing two columns of a row in python dataframe - python

I know that one can compare a whole column of a dataframe and make a list of all rows that contain a certain value with:
values = parsedData[parsedData['column'] == valueToCompare]
But is it possible to make a list of all rows by comparing two columns against values, like:
values = parsedData[parsedData['column01'] == valueToCompare01 and parsedData['column02'] == valueToCompare02]
Thank you!

It is completely possible, but and does not work for masking a dataframe; use the element-wise & operator instead. Note that each condition must be wrapped in parentheses ( ), because & binds more tightly than comparison operators like ==:
values = parsedData[(parsedData['column01'] == valueToCompare01) & (parsedData['column02'] == valueToCompare02)]
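For reference, here is a minimal self-contained sketch of the same pattern, using made-up column names and values:

import pandas as pd

# Hypothetical data; 'column01'/'column02' stand in for your real columns
parsedData = pd.DataFrame({'column01': [1, 1, 2, 2],
                           'column02': ['a', 'b', 'a', 'b']})

# Each condition in parentheses, combined with the element-wise & operator
values = parsedData[(parsedData['column01'] == 1) & (parsedData['column02'] == 'a')]
print(values)  # only the rows where both conditions hold (here, the first row)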

Related

Getting NaN Values after Splitting with Boolean Masking

I am trying to split a huge dataframe into smaller dataframes based on values on a specific column.
What I basically did was create a for loop and then assign each dataframe to a dictionary.
However, when I call the items from the dictionary, all values are NaN except for the cell_id values that I used for splitting.
Why would this happen?
I would also appreciate it if there are more practical ways to do this.
df_sliced_dict = {}
for cell in ex_df['cell_id'].unique():
    df_sliced_dict[cell] = ex_df[ex_df.loc[:, ['cell_id']] == cell]
Replace
df_sliced_dict[cell] = ex_df[ex_df.loc[:, ['cell_id']] == cell]
with
df_sliced_dict[cell] = ex_df[ex_df['cell_id'] == cell]
inside the for-loop and it will work as expected.
The problem is that ex_df.loc[:, ['cell_id']] (or ex_df[['cell_id']]) is a DataFrame, not a Series, and you want a Series to construct your boolean mask.
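To illustrate the difference, here is a small sketch with made-up data (the groupby idiom at the end is a common way to split a frame by a column, not part of the original answer):

import pandas as pd

# Toy data; your real frame has more columns
ex_df = pd.DataFrame({'cell_id': [1, 1, 2], 'value': [10, 20, 30]})

# DataFrame mask: only 'cell_id' is compared, every other column becomes NaN
print(ex_df[ex_df[['cell_id']] == 1])

# Series mask: whole rows are selected, nothing becomes NaN
print(ex_df[ex_df['cell_id'] == 1])

# A more practical way to split by a column is groupby:
df_sliced_dict = {cell: group for cell, group in ex_df.groupby('cell_id')}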

Python: Define subgroup of data with multiple conditions

I have a table with several dummy variables.
I would now like to create a subgroup where I list the winpercent values of those rows where fruity=1 and hard=0. My first attempt was this one, but it was unsuccessful:
df6=full_data[full_data['fruity'&'hard']==['1'&'0'].iloc[:,-1]
Can anyone help, please?
Write the conditions one by one, separated by the & operator, with each condition wrapped in parentheses:
full_data.loc[(full_data['fruity'] == 1) &
              (full_data['hard'] == 0), 'winpercent']
You can also query it:
full_data.query("fruity == 1 and hard == 0", inplace=False)['winpercent']
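As a quick check that both forms are equivalent, here is a sketch with made-up numbers standing in for full_data:

import pandas as pd

# Hypothetical stand-in for full_data; column names follow the question
full_data = pd.DataFrame({'fruity': [1, 1, 0],
                          'hard': [0, 1, 0],
                          'winpercent': [66.9, 45.2, 52.3]})

by_loc = full_data.loc[(full_data['fruity'] == 1) & (full_data['hard'] == 0), 'winpercent']
by_query = full_data.query("fruity == 1 and hard == 0")['winpercent']

print(by_loc.equals(by_query))  # True: both return the same winpercent Series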

Correct way of testing Pandas dataframe values and modifying them

I need to modify some values of a Pandas dataframe based on a test, and leave the other values intact. I also need to leave the order of the rows intact.
I have working code based on iterating over the dataframe's rows, but it's horrendously slow. Is there a quicker way to get it done?
Here are two examples of this very slow code:
for index, row in df.iterrows():
    if df.number[index].is_integer():
        df.number[index] = int(df.number[index])

for index, row in df.iterrows():
    if df.string[index] == "XXX":
        df.string[index] = df.other_colum[index].split("\")[0] + df.other_colum[index].split("\")[1]
    else:
        df.string[index] = df.other_colum[index].split("\")[1] + df.other_colum[index].split("\")[0]
Thanks
Generally you want to avoid iterating through rows in a pandas dataframe as it is slower than other methods pandas has created for accomplishing the same thing. One way of getting around this is using apply. You would redefine the number column:
df["number"] = df["number"].apply(lambda x: int(x) if x.is_integer() else x)
And (re)define the string column:
df["string"] = df["other column"].apply(lambda x: x.split("\\")[0] + x.split("\\")[1] if x == r"XX\X" else x.split("\\")[1] + x.split("\\")[0])
I made some assumptions based on the data you removed from the problem setup: .split("\") is invalid syntax (the backslash escapes the closing quote), and "other column" above necessarily has to contain a backslash for your code (and mine) to work; otherwise .split("\\")[1] will raise an IndexError.
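If you want to avoid apply altogether, a vectorized variant is also possible. The sketch below relies on the same assumptions (the test value "XXX" comes from the question, and other_colum is assumed to contain exactly one backslash per value):

import numpy as np
import pandas as pd

# Toy frame mirroring the columns from the question; the values are made up
df = pd.DataFrame({'string': ['XXX', 'YYY'],
                   'other_colum': ['left\\right', 'foo\\bar']})

# Split each value on its single backslash into two parts
parts = df['other_colum'].str.split('\\', n=1, expand=True)

# Where the test holds keep part0 + part1, otherwise part1 + part0
df['string'] = np.where(df['string'] == 'XXX',
                        parts[0] + parts[1],
                        parts[1] + parts[0])
print(df)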

Pandas groupby shift and count at the same time

Basically I am trying to take the previous row for the combination of ['dealer','State','city']. If I have multiple values for this combination, I will get the shifted value of the combination.
df['ShiftBY_D_S_C']= df.groupby(['dealer','State','city'])['dealer'].shift(1)
I am taking this ShiftBY_D_S_C column again and trying to take the count for the ['ShiftBY_D_S_C','State','city'] combination.
df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C','State','city'])['ShiftBY_D_S_C'].transform("count"))+1
The table below shows what I am trying to do, and it works well. But when all the rows in the ShiftBY_D_S_C column are null, it does not work, since the column only contains null values. Any suggestions?
I am trying to get the NewColumn values shown below when all the values in ShiftBY_D_S_C are NaN.
You could simply handle the special case that you describe with an if/else case:
if df['ShiftBY_D_S_C'].isna().all():
    df['NewColumn'] = 1
else:
    df['NewColumn'] = df.groupby(...)
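Here is a self-contained sketch with made-up rows showing the branch that handles the all-NaN case (column names follow the question):

import pandas as pd

# Toy data: every ['dealer','State','city'] combination appears only once,
# so the shifted column is entirely NaN
df = pd.DataFrame({'dealer': ['A', 'B'],
                   'State': ['TX', 'TX'],
                   'city': ['Austin', 'Dallas']})
df['ShiftBY_D_S_C'] = df.groupby(['dealer', 'State', 'city'])['dealer'].shift(1)

if df['ShiftBY_D_S_C'].isna().all():
    # no previous rows exist for any combination, so every count defaults to 1
    df['NewColumn'] = 1
else:
    df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C', 'State', 'city'])['ShiftBY_D_S_C']
                         .transform('count')) + 1
print(df)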

Python DataFrames: finding "almost" identical rows

I have a DF loaded with orders. Some of them contain negative quantities, and the reason for that is that they are actually cancellations of prior orders.
Problem, there is no unique key that can help me find back which order corresponds to which cancellation.
So I've built the following code ('cancelations' is a subset of the original data containing only the rows that correspond to... well... cancelations):
for i, item in cancelations.iterrows():
    # find a row similar to the cancelation we are currently studying
    # (item is the row itself, i.e. the second value of the tuple given back by iterrows())
    mask1 = (copy['CustomerID'] == item['CustomerID'])
    mask2 = (copy['Quantity'] == item['Quantity'])
    mask3 = (copy['Description'] == item['Description'])
    subset = copy[mask1 & mask2 & mask3]
    if subset.shape[0] > 0:  # if we find one or several corresponding orders
        print('possible corresponding orders:', subset.index.tolist())
        copy = copy.drop(subset.index.tolist()[0])  # remove only the first of them from the copy of the data
So, this works, but:
first, it takes forever to run; and second, I read somewhere that whenever you find yourself writing complex code to manipulate dataframes, there's already a method for it.
So perhaps one of you knows something that could help me?
Thank you for your time!
Edit: note that sometimes we can have several orders that could correspond to the cancelation at hand. This is why I didn't use drop_duplicates with only some columns specified: it eliminates all duplicates (or all but one), whereas I need to drop only one of them.
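For what it's worth, one possible vectorized direction (not from the original thread) is to number repeated key combinations on both sides with cumcount and then merge, so each cancelation consumes exactly one matching order. A sketch with toy data, matching on the same three columns as the masks above:

import pandas as pd

# Toy data: two identical orders and one cancelation for that combination
copy = pd.DataFrame({'CustomerID': [1, 1, 2],
                     'Quantity': [5, 5, 3],
                     'Description': ['mug', 'mug', 'bowl']})
cancelations = pd.DataFrame({'CustomerID': [1],
                             'Quantity': [5],
                             'Description': ['mug']})

keys = ['CustomerID', 'Quantity', 'Description']
orders = copy.reset_index().rename(columns={'index': 'order_idx'})
cancels = cancelations.copy()

# Number repeated key combinations so the n-th cancelation is paired
# with the n-th identical order and no others
orders['occurrence'] = orders.groupby(keys).cumcount()
cancels['occurrence'] = cancels.groupby(keys).cumcount()

matched = cancels.merge(orders, on=keys + ['occurrence'], how='inner')
copy = copy.drop(matched['order_idx'])  # exactly one 'mug' order is removed
print(copy)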
