Pandas: Preferred order of methods on a part of a table - python

I have started learning Pandas module ny "Data School" Q&A series and in his "How do I handle missing values in pandas?" video he wrote the following line of code:
ufo.isna().tail()
If I am not wrong, the following line would be more efficient:
ufo.tail().isna()
My question is not only in this case, but in general, is the order of methods on a part of a table matters? And if so then when exactly?

In my opinion, here should be used logic:
first filter for reduce number of rows and then apply some method only for filtered data
not like:
first apply method for all data and then filter
So for better performance is use first one - filter and apply method - here are tested missing values onle for first 5 rows:
ufo.tail().isna()
But here are tested all values and then filtered first 5 tows, so if 10M rows performance is much worse:
ufo.isna().tail()

Related

What is the most pythonic way to relationate 2 pandas dataframe? Based on a key value

So, I work on a place and here I use A LOT of Python (Pandas) and the data keeps getting bigger and bigger, last month I was working with a few hundred thousand rows, weeks after that I was working with a few million rows and now I am working with 42 million rows. Most of my work is just take a dataframe and for each row, I need to consult in another dataframe its "equivalent" and process the data, sometimes just merge but more often i need to do a function with the equivalent data. Back in the days with a few hundred thousand rows, it was ok to just use apply and a simple filter but now it is EXTREMELY SLOW. Recently I've switched to vaex which is way faster than pandas on every aspect but apply, and after some time searching I found that apply is the last resource and should be used only if u haven't another option. So, is there another option? I really don't know
Some code to explain how I was doing this entire time:
def get_secondary(row: pd.DataFrame):
cnae = row["cnae_fiscal"]
cnpj = row["cnpj"]
# cnaes is another dataframe
secondary = cnaes[cnaes.cnpj == cnpj]
return [cnae] + list(secondary["cnae"].values)
empresas["cnae_secundarios"] = empresas.apply(get_secondary, axis=1)
This isn't the only use case, as I said.

Why does Pandas .loc achieve more hits than MultiIndex.intersection?

I have two dataframes, each with 49 layered multiindexes (made up of floats, strings, np.nan, etc) and I'm trying to find the intersection of those multindexes. My initial approach was:
df3 = df1.loc[df2.index]
Which gave me nearly 100% match rate, which was about what I was expecting. Using this method though, pandas was throwing a warning
FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
So, following the suggestion in the documentation that best suited my purpose, I re-implemented my solution to:
df3 = df1.loc[df1.index.intersection(df2.index)]
However, this achieved less than 10% match rate.
I know that the intersection method is missing expected index matches. I validated this with the following
df1.index[0] in df2.index[0:1] # returns true
while
df1.index[0:1].intersection(df2.index[0:1]) # returns empty
How does .loc achieve the appropriate number of matches in nearly the equivalent time while intersection cannot? How can I replicate .loc's performance while still being future proof?
For context, I started with two datetime indexed dataframes with 49 common columns. The data in one of the dataframes is almost a subset of the data in the other (it may have some additional data). Also, the ordering of their indexes cannot be guaranteed to match. I am trying to use the datetime index of the subset dataframe as a reference time for the equivalent data row in the larger dataframe. The solution for this needs to be efficient too. I would appreciate any input on alternative approaches to this problem as well.
EDIT: I avoided using reindex because my index is duplicated, however I realised I could use it to find the index intersection as follows:
temp_df = df1[~df1.index.duplicated()].reindex(df2.index.drop_duplicates())
index_intersection = temp_df[temp_df.SomeColumn.notnull()].index
df3 = df1.loc[index_intersection]

Understanding a complex one line code - Big Mart Sales Data Set Analysis

I have been trying to learn to analyze Big Mart Sales Data Set from this website. I am unable to decode a line of code which is little bit complex. I tried to understand demystify it but I wasn't able to. Kindly help me understand this line at
In [26]
df['Item_Visibility_MeanRatio'] = df.apply(lambda x: x['Item_Visibility']/visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0],axis=1).astype(float)
Thankyou very much in advance. Happy coding
df['Item_Visibility_MeanRatio']
This is the new column name
= df.apply(lambda x:
applying a function to the dataframe
x['Item_Visibility']
take the Item_Visibility column from the original dataframe
/visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0]
divide where the Item_Visibility column in the new pivot table where the Item_Identifier is equal to the Item_Identifier in the original dataframe
,axis=1)
apply along the columns (horizontally)
.astype(float)
convert to float type
Also, it looks like .apply is used a lot on the link you attached. It should be noted that apply is generally the slow way to do things, and there are usually alternatives to avoid using apply.
Lets go thorough it step by step:
df['Item_Visibility_MeanRatio']
This part is creating a column in the data frame and its name is Item_Visibility_MeanRatio.
df.apply(lambda...)
Apply a function along an axis of the Data frame.
x['Item_Visibility']
It is getting the data from Item_Visibility column in the data frame.
visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0]
This part finds the indexes that visibility_item_avg index is equal to df['Item_Identifier'].This will lead to a list. Then it will get the elements in visibility_item_avg['Item_Visibility'] that its index is equal to what was found in the previous part. [0] at the end is to find the first element of the outcome array.
axis=1
1 : apply function to each row.
astype(float)
This is for changing the value types to float.
To make the code easy to grab, you can always split it to separate parts and digest it little by little.
To make the code faster you can do Vectorization instead of applying lambda.
Refer to the link here.

What is the the best way to modify (e.g., perform math functions) a column in a Dask DataFrame?

I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header, otherwise it adds another column using the index number as a string-type column name for a new column. Is there something akin to the Pandas method of, say, df.iloc[-1,:] = df.iloc[-1,:]*-1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement: df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't apply a single Python function over a dask dataframe that is stored in many pieces directly, however methods like .map_partitions or .reduction may help you to achieve the same result with some cleverness.
in the future we recommend asking separate questions separately on stack overflow

Streamlining appending of boolean column in pandas dataframe

Disclaimer: My code is very amateurish as I am still undergoing course work activities. Please bear with me if my code is inefficient or of poor quality.
I have been learning the power of pandas in a recent Python tutorial and have been applying this to some of my course work. We have learnt how to use boolean filtering on Pandas so I decided to go one step further and try to append boolean values to a column in my data (efficiency).
The tutor has said we should focus on minimising code as much as we can -
I have attempted to do so for the below efficiency column.
The baseline efficiency value is 0.4805 (48.05%). If the values are above this, it is acceptable. If it is below this, it is a 'fail'.
I have appended this to my dataframe using the below code:
df['Classification'] = (df[['Efficiency_%']].sum(axis=1) > 0.4805)
df['Classification'] = (df['Classification'] == True).astype(int)
While this is only 2 lines of code - is there a way I can streamline this further into just one line?
I had considered using a 'lambda' function which I am currently reading into. I am interested if there are any other alternatives I could consider.
My approaches I have tried have been:
For Loops - Advised against using this due to it being inefficient.
If statements - I couldn't get this to work as I can't append a '1' or '0' to the df['Classification'] column as it is a dataframe and not a series.
if i > 0.4805:
df['Classification'].append('0') else:
df['Classification'].append('1')if test
Thank you.
This should do the same; It's unnecessary to sum a one column data frame by row, df[['Efficiency_%']].sum(axis=1) is the same as df['Efficiency_%'], and also Boolean Series == True is not necessary as it yields the same result as Boolean Series itself.
df['Classification'] = (df['Efficiency_%'] > 0.4805).astype(int)

Categories

Resources