Why does Pandas .loc achieve more hits than MultiIndex.intersection?

I have two dataframes, each with a 49-level MultiIndex (made up of floats, strings, np.nan, etc.), and I'm trying to find the intersection of those MultiIndexes. My initial approach was:
df3 = df1.loc[df2.index]
This gave me a nearly 100% match rate, which was about what I was expecting. With this method, though, pandas was throwing a warning:
FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
So, following the suggestion in the documentation that best suited my purpose, I re-implemented my solution to:
df3 = df1.loc[df1.index.intersection(df2.index)]
However, this achieved a match rate of less than 10%.
I know that the intersection method is missing expected index matches. I validated this with the following:
df1.index[0] in df2.index[0:1] # returns True
while
df1.index[0:1].intersection(df2.index[0:1]) # returns empty
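For illustration, a hypothetical minimal repro (an assumption on my part, since the original data isn't shown: the misses come from NaN values inside the index tuples, because the hash-based lookup treats NaN as matching while the set-based intersection path on affected pandas versions may not):
import numpy as np
import pandas as pd

# Hypothetical repro: identical MultiIndex tuples containing NaN.
idx1 = pd.MultiIndex.from_tuples([(1.0, 'a', np.nan)])
idx2 = pd.MultiIndex.from_tuples([(1.0, 'a', np.nan)])
print(idx1[0] in idx2)          # engine-backed lookup: True, as in the question
print(idx1.intersection(idx2))  # set-based path: can come back empty, as observed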
How does .loc achieve the appropriate number of matches in roughly the same time, while intersection cannot? How can I replicate .loc's behavior while remaining future-proof?
For context, I started with two datetime indexed dataframes with 49 common columns. The data in one of the dataframes is almost a subset of the data in the other (it may have some additional data). Also, the ordering of their indexes cannot be guaranteed to match. I am trying to use the datetime index of the subset dataframe as a reference time for the equivalent data row in the larger dataframe. The solution for this needs to be efficient too. I would appreciate any input on alternative approaches to this problem as well.
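An alternative sketch worth noting (not from the original question): Boolean membership via Index.isin, which preserves duplicate labels and sidesteps both the deprecated .loc-with-missing-labels path and intersection:
# Keep the rows of df1 whose index label also appears in df2's index.
# isin matches by hashing, so duplicated labels in df1 are all retained;
# NaN-containing labels may still need care on older pandas versions.
df3 = df1[df1.index.isin(df2.index)]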
EDIT: I avoided using reindex because my index contains duplicates; however, I realised I could use it to find the index intersection as follows:
# Deduplicate both indexes so reindex is legal, then use a column that is
# never null in df1 to flag which labels were actually found.
temp_df = df1[~df1.index.duplicated()].reindex(df2.index.drop_duplicates())
index_intersection = temp_df[temp_df.SomeColumn.notnull()].index
df3 = df1.loc[index_intersection]
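A variant of the same idea that does not rely on SomeColumn being non-null (a sketch; marker is a hypothetical helper name):
import pandas as pd

# Reindex a constant marker Series instead of the whole frame; labels that
# were not found come back as NaN after the reindex.
marker = pd.Series(True, index=df1.index[~df1.index.duplicated()])
found = marker.reindex(df2.index.drop_duplicates())
index_intersection = found[found.notnull()].index
df3 = df1.loc[index_intersection]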

Related

Pandas dataframe upsert: FutureWarning when doing dataframe.update on dataframe containing datetime series/columns

I am trying to update one dataframe with data from another dataframe. In most cases this works fine. However, when the dataframes contain at least one column of type datetime64, I get a FutureWarning.
This is simplified code that replicates the issue:
import pandas as pd
index = pd.to_datetime(['01/03/2022', '01/04/2022'])
data1 ={'value': pd.to_datetime(['01/05/2022', '01/06/2022'])}
data2 ={'value': pd.to_datetime(['01/05/2022', '01/07/2022'])}
df1 = pd.DataFrame(data=data1, index=index)
df2 = pd.DataFrame(data=data2, index=index)
df1.update(df2)
And the warning:
FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
df1.update(df2)
This only happens for dataframes where one of the columns/series is of type datetime64.
The recommendation in the warning on how to avoid it doesn't seem like a very good solution in my case, as I may have a variable number of columns (often I won't know about them in advance). To that end, some questions:
Is there a recommended way to solve this problem efficiently?
Or can I ignore this warning? (A sketch for silencing it locally follows below.)
Do these warnings cause performance issues if there are many of them?
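If ignoring turns out to be acceptable, one conventional option (a sketch using only the standard warnings module, not a pandas-specific fix) is to suppress the FutureWarning just around the call:
import warnings

# Suppress only FutureWarning, and only for the duration of the update.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    df1.update(df2)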

Understanding a complex one line code - Big Mart Sales Data Set Analysis

I have been trying to learn to analyze the Big Mart Sales Data Set from this website. I am unable to decode a line of code that is a little bit complex. I tried to demystify it but wasn't able to. Kindly help me understand this line at
In [26]
df['Item_Visibility_MeanRatio'] = df.apply(lambda x: x['Item_Visibility']/visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0],axis=1).astype(float)
Thank you very much in advance. Happy coding.
df['Item_Visibility_MeanRatio']
This is the new column name
= df.apply(lambda x:
applying a function to the dataframe
x['Item_Visibility']
take the Item_Visibility column from the original dataframe
/visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0]
divide by the Item_Visibility value in the pivot table at the row where its index equals the Item_Identifier from the original dataframe
,axis=1)
apply the function to each row (axis=1 passes one row at a time to the lambda)
.astype(float)
convert to float type
Also, it looks like .apply is used a lot on the link you attached. It should be noted that apply is generally the slow way to do things, and there are usually alternatives to avoid using apply.
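As an example, a sketch of a vectorized rewrite (assuming, as the code above implies, that visibility_item_avg is indexed by Item_Identifier):
# Map each row's Item_Identifier to its mean visibility, then divide;
# this replaces the per-row apply with two vectorized operations.
item_avg = visibility_item_avg['Item_Visibility']
df['Item_Visibility_MeanRatio'] = (
    df['Item_Visibility'] / df['Item_Identifier'].map(item_avg)
).astype(float)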
Let's go through it step by step:
df['Item_Visibility_MeanRatio']
This part is creating a column in the data frame and its name is Item_Visibility_MeanRatio.
df.apply(lambda...)
Apply a function along an axis of the Data frame.
x['Item_Visibility']
It gets the data from the Item_Visibility column in the data frame.
visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0]
This part finds the positions where the visibility_item_avg index equals x['Item_Identifier'], which yields a Boolean mask. The mask then selects the matching elements of visibility_item_avg['Item_Visibility'], and the [0] at the end takes the first element of the resulting array.
axis=1
1 : apply function to each row.
astype(float)
This is for changing the value types to float.
To make the code easier to grasp, you can always split it into separate parts and digest it little by little.
To make the code faster, you can use vectorization instead of apply with a lambda.
Refer to the link here.

Why does pandas need to reshape my boolean index, and how can I fix it to avoid the warning?

Background
I've got two DataFrames of timestamped-ids (the index is the id). I want to get all of the ids where the timestamps differ by, say, 5 minutes.
Code
time_delta = abs(df2.time - df1.time).dt.total_seconds()
ids_out_of_range = df1[time_delta > 300].index
This gives me the ids I want, so it is working code.
Problem
Like many, I face this warning:
file.py:33: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
ids_out_of_range = df1[time_delta > 300].index
Most explanations center on the "length" of the index differing from the "length" of the dataframe. But:
(Pdb) time_delta.shape
(176,)
(Pdb) df1.shape
(176, 1)
(Pdb) sorted(time_delta.index.values.tolist()) == sorted(df1.index.values.tolist())
True
The shapes are the same, except that one is a Series and the other is a DataFrame. The indices (appear) to be the same; perhaps the ordering is the issue? They did not compare equal without sorted.
(I've tried wrapping time_delta in a DataFrame, to no avail.)
Long-term, I would like this warning to go away (and not with 2>/dev/null, thank you). It's visual clutter in the output of my script, and, well, it is a warning—so theoretically I should pay attention to it.
Question
What am I doing "wrong" that I get this warning, since the sizes seem to be right?
How do I fix (1) so I can avoid this warning?
The warning is saying that your time_delta index is different from the df1 index.
When I tried to reproduce the warning on pandas 0.25.1, it didn't show up, so the behavior may depend on which version you are using.
Please refer to this page for suppressing warnings
The following fixed my issue:
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
time_delta.sort_index(inplace=True)
This allowed the indices to align perfectly, so they must not have been in the same order with respect to each other.
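An equivalent fix that makes the alignment explicit instead of relying on sort order (a sketch, assuming time_delta covers exactly the ids in df1):
# Reindex the Boolean mask to df1's index so the row order matches exactly;
# identical indices mean no reindexing warning is triggered.
mask = (time_delta > 300).reindex(df1.index)
ids_out_of_range = df1[mask].index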

Python dataframe; trouble changing value of column with multiple filters

I have a large dataframe I pulled from an ODBC database. The dataframe has multiple columns; I'm trying to change the values of one column by filtering on two others.
First, I filter my dataframe data_prem with both conditions which gives me the correct rows:
data_prem[(data_prem['PRODUCT_NAME']=='ŽZ08') & (data_prem['BENEFIT'].str.contains('19.08.16'))]
Then I use the replace function on the selection to change the 'M' values to 'H':
data_prem[(data_prem['PRODUCT_NAME']=='ŽZ08') & (data_prem['BENEFIT'].str.contains('19.08.16'))]['Reinsurer'].replace(to_replace='M',value='H',inplace=True,regex=True)
Pandas warns me that I'm trying to modify a copy of the dataframe, even though I'm clearly referring to the original dataframe (screenshot: dataframe filtering).
I also tried using the .loc function in the following manner:
data_prem.loc[((data_prem['PRODUCT_NAME']=='ŽZ08') & (data_prem['BENEFIT'].str.contains('19.08.16'))),'Reinsurer'] = 'H'
which changed all rows that fit the second condition (str.contains...), but it didn't apply the first condition. I got replacements in the 'Reinsurer' column for other 'PRODUCT_NAME' values as well.
I've been scouring the web for an answer to this for some time. I've seen some mentions of a bug in the pandas library, but I'm not sure if this is what they were talking about.
I would value any opinions you might have, and I'd also be interested in alternative ways of solving this problem. I filled the 'Reinsurer' column using the map function with 'PRODUCT_NAME' as the input (I had a dictionary that connected all 'PRODUCT_NAME' values with 'Reinsurer' values).
Given your Boolean mask, you've demonstrated two ways of applying chained indexing. This is the cause of the warning and the reason why you aren't seeing your logic applied as you anticipate. With df standing in for data_prem:
mask = (df['PRODUCT_NAME'] == 'ŽZ08') & df['BENEFIT'].str.contains('19.08.16')
Chained indexing: Example #1
df[mask]['Reinsurer'].replace(to_replace='M', value='H', inplace=True, regex=True)
Chained indexing: Example #2
df[mask].loc[mask, 'Reinsurer'] = 'H'
Avoid chained indexing
You can keep things simple by applying your mask once and using a single loc call:
df.loc[mask, 'Reinsurer'] = 'H'
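One caveat (my addition, not part of the original answer): the assignment above writes 'H' into every masked row, whereas the original replace only altered values containing 'M'. A sketch that keeps the substring-replace semantics while still avoiding chained indexing:
# Replace 'M' with 'H' only inside the masked rows, mirroring the regex replace.
df.loc[mask, 'Reinsurer'] = df.loc[mask, 'Reinsurer'].str.replace('M', 'H', regex=True)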

Filtering a dataset on values not in another dataset

I am looking to filter a dataset based off of whether a certain ID does not appear in a different dataframe.
While I'm not super attached to the way I've decided to do this, and I'm open to a better approach I'm not familiar with, my plan is to apply a Boolean function to my dataset, put the results in a new column, and then filter the entire dataset on that True/False result.
My main dataframe is df, and my other dataframe with the IDs in it is called ID:
def groups():
    if df['owner_id'] not in ID['owner_id']:
        return True
    return False
This ends up being accepted (no syntax problems), so I then go to apply it to my dataframe, which fails:
df['ID Groups?'] = df.apply(lambda row: groups(), axis=1)
Result:
TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 0')
It seems that somewhere the data I'm trying to use (the IDs are both letters and numbers, so strings) is incorrectly formatted.
I have two questions:
Is my proposed method the best way of going about this?
How can I fix the error that I'm seeing?
My apologies if it's something super obvious, I have very limited exposure to Python and coding as a whole, but I wasn't able to find anywhere where this type of question had already been addressed.
An expression to keep only those rows in df whose owner_id appears in ID:
df = df[df['owner_id'].isin(ID['owner_id'])]
A lambda expression is going to be way slower than this.
isin is the pandas way; not in is the Python collections way.
The reason you are getting this error is that df['owner_id'] not in ID['owner_id'] hashes the left-hand side to figure out whether it is present in the right-hand side. df['owner_id'] is of type Series and is not hashable, as reported. Luckily, hashing is not needed here.
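Note that the question asks for ids that do not appear in ID, so in practice the mask would be inverted with ~ (a small addition to the answer above):
# Keep only the rows of df whose owner_id is absent from ID.
df = df[~df['owner_id'].isin(ID['owner_id'])]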
