Understanding a complex one line code - Big Mart Sales Data Set Analysis - python

I have been trying to learn how to analyze the Big Mart Sales Data Set from this website. I am unable to decode one line of code that is a little complex. I tried to demystify it but wasn't able to. Kindly help me understand this line, at In [26]:
df['Item_Visibility_MeanRatio'] = df.apply(lambda x: x['Item_Visibility']/visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0],axis=1).astype(float)
Thank you very much in advance. Happy coding.

df['Item_Visibility_MeanRatio']
This is the new column name
= df.apply(lambda x:
applying a function to the dataframe
x['Item_Visibility']
take the Item_Visibility value from the current row of the original dataframe
/visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0]
divide by the Item_Visibility value in the new pivot table where the Item_Identifier is equal to the Item_Identifier in the original dataframe
,axis=1)
apply along the columns (horizontally), so the lambda receives one row at a time
.astype(float)
convert to float type
Also, it looks like .apply is used a lot on the link you attached. It should be noted that apply is generally the slow way to do things, and there are usually alternatives to avoid using apply.
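The pieces described above can be tried out on a single row. This is only a sketch on made-up data: the toy values and the way visibility_item_avg is built (a pivot table of mean visibility per item) are assumptions, not taken from the tutorial. It uses .iloc[0] instead of [0] for the final lookup because positional [0] on a label-indexed Series is deprecated in recent pandas versions.

```python
import pandas as pd

# Made-up stand-in for the Big Mart data.
df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDA15"],
    "Item_Visibility": [0.02, 0.06, 0.04],
})

# Assumption: one mean visibility per item, indexed by Item_Identifier.
visibility_item_avg = df.pivot_table(values="Item_Visibility", index="Item_Identifier")

# With axis=1 the lambda receives one row at a time; take the first row as x.
x = df.iloc[0]

# Boolean array that is True where the pivot table's index matches this row's item.
mask = visibility_item_avg.index == x["Item_Identifier"]

# One-element Series holding the mean visibility of that item.
matches = visibility_item_avg["Item_Visibility"][mask]

# Pull out the scalar and divide the row's own visibility by it.
item_mean = matches.iloc[0]
ratio = x["Item_Visibility"] / item_mean
print(ratio)
```

Printing mask, matches and item_mean in turn shows what each bracket in the original one-liner contributes.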

Let's go through it step by step:
df['Item_Visibility_MeanRatio']
This part creates a column in the data frame named Item_Visibility_MeanRatio.
df.apply(lambda...)
Apply a function along an axis of the Data frame.
x['Item_Visibility']
It gets the value of the Item_Visibility column for the current row of the data frame.
visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0]
This part first compares the index of visibility_item_avg with the current row's Item_Identifier, which produces a boolean mask. It then selects the elements of visibility_item_avg['Item_Visibility'] at the positions where that mask is True. The [0] at the end takes the first element of the result.
axis=1
1 : apply function to each row.
astype(float)
This converts the values to float.
To make the code easier to grasp, you can always split it into separate parts and digest it little by little.
To make the code faster, you can use vectorization instead of applying a lambda.
Refer to the link here.
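As a rough illustration of the vectorization idea, the sketch below looks up each row's item mean with Series.map and then divides whole columns at once. The data is made up, and visibility_item_avg is assumed to be a pivot table of mean visibility indexed by Item_Identifier. The row-by-row version is kept only to check that both give the same numbers.

```python
import pandas as pd

# Made-up stand-in for the Big Mart data.
df = pd.DataFrame({
    "Item_Identifier": ["A", "A", "B", "B"],
    "Item_Visibility": [0.10, 0.30, 0.20, 0.20],
})

# Assumption: mean visibility per item, indexed by Item_Identifier.
visibility_item_avg = df.pivot_table(values="Item_Visibility", index="Item_Identifier")
avg = visibility_item_avg["Item_Visibility"]

# Row-by-row version, equivalent to the one-liner in the question.
slow = df.apply(
    lambda x: x["Item_Visibility"] / avg[avg.index == x["Item_Identifier"]].iloc[0],
    axis=1,
).astype(float)

# Vectorized version: map every identifier to its mean, then divide column by column.
item_mean = df["Item_Identifier"].map(avg)
df["Item_Visibility_MeanRatio"] = df["Item_Visibility"] / item_mean

print(df)
```

On a large frame the map-based version avoids calling a Python function once per row, which is usually where apply with axis=1 loses time.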

Related

Python: create new columns from rows based on multiple conditions

I've been poking around a bit and can't seem to find a close solution to this one:
I'm trying to transform a dataframe from this:
To this:
Such that remark_code_names with similar denial_amounts are provided new columns based on their corresponding har_id and reason_code_name.
I've tried a few things, including a groupby function, which gets me halfway there.
denials.groupby(['har_id','reason_code_name','denial_amount']).count().reset_index()
But this obviously leaves out the reason_code_names that I need.
Here's a minimal example:
import pandas as pd
denials = pd.DataFrame({'har_id':['A','A','A','A','A','A','A','A','A'],'reason_code_name':[16,16,16,16,16,16,16,22,22],
'remark_code_name':['MA04','N130','N341','N362','N517','N657','N95','MA04','N341'],
'denial_amount':[5402,8507,5402,8507,8507,8507,8507,5402,5402]})
Using groupby() is a good way to go. Use it along with transform() and overwrite the column named 'remark_code_name'. This solution puts all the remark_code_names of a group together in the same column.
denials['remark_code_name'] = denials.groupby(['har_id','reason_code_name','denial_amount'])['remark_code_name'].transform(lambda x : ' '.join(x))
denials.drop_duplicates(inplace=True)
If you really need to put each code in its own column, you could apply another function and use .split(). However, you will first need to set the number of columns based on the maximum number of codes you find in a single row.
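One possible way to get one code per column is sketched below, using the sample data from the question. It swaps in Series.str.split with expand=True, which sizes the result to the longest list of codes, so the maximum does not have to be worked out by hand. The column names remark_code_name_1, remark_code_name_2, ... are hypothetical and chosen only for illustration.

```python
import pandas as pd

denials = pd.DataFrame({
    'har_id': ['A'] * 9,
    'reason_code_name': [16, 16, 16, 16, 16, 16, 16, 22, 22],
    'remark_code_name': ['MA04', 'N130', 'N341', 'N362', 'N517', 'N657', 'N95', 'MA04', 'N341'],
    'denial_amount': [5402, 8507, 5402, 8507, 8507, 8507, 8507, 5402, 5402],
})

keys = ['har_id', 'reason_code_name', 'denial_amount']

# One row per group, with that group's codes joined into a single string.
joined = denials.groupby(keys)['remark_code_name'].agg(' '.join).reset_index()

# expand=True returns a DataFrame with one column per code; shorter rows are padded with missing values.
codes = joined['remark_code_name'].str.split(' ', expand=True)
codes.columns = [f'remark_code_name_{i + 1}' for i in range(codes.shape[1])]  # hypothetical names

# Put the grouping keys and the spread-out codes side by side.
wide = pd.concat([joined[keys], codes], axis=1)
print(wide)
```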

How to combine two columns in a df as a condition in np.where to check if nan in calculating new column

I want to check whether either column contains a NaN; if so, the value in my new column should be "na", otherwise 1.
I was able to get my code to work when checking a single column using .isnull(), but I am unsure how to combine two. I tried using or, as seen below, but it did not work. I know I could test one condition, then the next, and build the outcome from that, but it would make the code a bit messy. I was hoping to keep it cleaner with some sort of any or or function instead, but I do not know how to do that.
temp_df["xl"] = np.where(temp_df["x"].isnull() or temp_df["y"].isnull(), "na",1)
Let us use any along the row axis. The or keyword cannot be applied to whole Series, which is why your attempt failed.
temp_df["xl"] = np.where(temp_df[['x','y']].isnull().any(axis=1), "na",1)
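A small sketch on made-up data shows the row-wise any next to the element-wise | operator, which is an alternative when there are only two columns. Note that np.where returns an array with a single dtype, so mixing "na" with 1 generally stores the 1 as a string as well.

```python
import numpy as np
import pandas as pd

# Made-up data covering all four combinations of missing and present values.
temp_df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [1.0, 2.0, np.nan, np.nan],
})

# True for every row where at least one of the two columns is NaN.
either_missing = temp_df[["x", "y"]].isnull().any(axis=1)

# Same mask built with the element-wise OR operator instead of the `or` keyword.
same_mask = temp_df["x"].isnull() | temp_df["y"].isnull()

temp_df["xl"] = np.where(either_missing, "na", 1)
print(temp_df)
```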

How to use parse from phonenumbers Python library on a pandas data frame?

How can I parse phone numbers from a pandas data frame, ideally using phonenumbers library?
I am trying to use a port of Google's libphonenumber library on Python,
https://pypi.org/project/phonenumbers/.
I have a data frame with 3 million phone numbers from many countries. I have a column with the phone number, and a column with the country/region code. I'm trying to use the parse function in the package. My goal is to parse each row using the corresponding country code, but I can't find a way of doing it efficiently.
I tried using apply but it didn't work. I get a "(0) Missing or invalid default region." error, meaning it won't pass the country code string.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
phonenumbers.parse(str(df.phone_number),str(df.region_code)))
The line below works, but doesn't get me what I want, as the numbers I have come from about 120+ different countries.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
phonenumbers.parse(str(df.phone_number),"US"))
I tried doing this in a loop, but it is terribly slow. Took me more than an hour to parse 10,000 numbers, and I have about 300x that:
for i in range(n):
    df3['phone_number_std'][i] = phonenumbers.parse(str(df.phone_number[i]),str(df.region_code[i]))
Is there a method I'm missing that could run this faster? The apply function works acceptably well but I'm unable to pass the data frame element into it.
I'm still a beginner in Python, so perhaps this has an easy solution. But I would greatly appreciate your help.
Your initial solution using apply is actually pretty close - you don't say what doesn't work about it, but the syntax for a lambda function over multiple columns of a dataframe, rather than on the rows within a single column, is a bit different. Try this:
df['phone_number_clean'] = df.apply(lambda x:
phonenumbers.parse(str(x.phone_number),
str(x.region_code)),
axis='columns')
The differences:
You want to include multiple columns in your lambda function, so you want to apply your lambda function to the entire dataframe (i.e, df.apply) rather than to the Series (the single column) that is returned by doing df.phone_number.apply. (print the output of df.phone_number to the console - what is returned is all the information that your lambda function will be given).
The argument axis='columns' (or axis=1, which is equivalent, see the docs) actually slices the data frame by rows, so apply 'sees' one record at a time (ie, [index0, phonenumber0, countrycode0], [index1, phonenumber1, countrycode1]...) as opposed to slicing the other direction, which would give it ([phonenumber0, phonenumber1, phonenumber2...])
Your lambda function only knows about the placeholder x, which, in this case, is the Series [index0, phonenumber0, countrycode0], so you need to specify all the values relative to the x that it knows - i.e., x.phone_number, x.region_code.
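The difference between the two kinds of apply can be seen without the phonenumbers package installed. In the sketch below, fake_parse is a hypothetical stand-in for phonenumbers.parse that only tags a number with its region, and the two phone numbers are made up. The last lines show a plain zip over the two columns, which is often quicker than apply with axis='columns' on large frames, though that is worth measuring on your own data.

```python
import pandas as pd

df = pd.DataFrame({
    "phone_number": ["2025550123", "2079460958"],
    "region_code": ["US", "GB"],
})

def fake_parse(number, region):
    # Hypothetical stand-in for phonenumbers.parse; it does no real parsing.
    return f"{region}:{number}"

# Series.apply: the lambda receives one phone number (a str) and cannot see region_code.
per_value = df.phone_number.apply(lambda x: type(x).__name__)

# DataFrame.apply with axis='columns': the lambda receives one whole row as a Series.
df["phone_number_clean"] = df.apply(
    lambda x: fake_parse(str(x.phone_number), str(x.region_code)),
    axis="columns",
)

# Alternative without apply: walk both columns in step.
zipped = [fake_parse(str(n), str(r)) for n, r in zip(df["phone_number"], df["region_code"])]
print(df)
```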
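Love the solution of @katelie, but here's my code. I added a try/except block so that a number that cannot be parsed is skipped instead of raising an error. It cannot handle strings that are too long.
import phonenumbers as phon
def formatE164(value):
    # Parse the value as a Dutch ("NL") number and return it in E.164 format.
    try:
        return phon.format_number(phon.parse(str(value), "NL"), phon.PhoneNumberFormat.E164)
    except phon.NumberParseException:
        # Unparseable input (for example, a string that is too long) becomes None.
        return None
df['column'] = df['column'].apply(formatE164)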

What is the best way to modify (e.g., perform math functions) a column in a Dask DataFrame?

I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header, otherwise it adds another column using the index number as a string-type column name for a new column. Is there something akin to the Pandas method of, say, df.iloc[-1,:] = df.iloc[-1,:]*-1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement: df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a dask dataframe that is stored in many pieces. However, methods like .map_partitions or .reduction may help you achieve the same result with some cleverness.
In the future, we recommend asking separate questions separately on Stack Overflow.

Streamlining appending of boolean column in pandas dataframe

Disclaimer: My code is very amateurish as I am still undergoing course work activities. Please bear with me if my code is inefficient or of poor quality.
I have been learning the power of pandas in a recent Python tutorial and have been applying this to some of my course work. We have learnt how to use boolean filtering on Pandas so I decided to go one step further and try to append boolean values to a column in my data (efficiency).
The tutor has said we should focus on minimising code as much as we can, and I have attempted to do so for the efficiency column below.
The baseline efficiency value is 0.4805 (48.05%). If the values are above this, it is acceptable. If it is below this, it is a 'fail'.
I have appended this to my dataframe using the below code:
df['Classification'] = (df[['Efficiency_%']].sum(axis=1) > 0.4805)
df['Classification'] = (df['Classification'] == True).astype(int)
While this is only 2 lines of code - is there a way I can streamline this further into just one line?
I had considered using a 'lambda' function which I am currently reading into. I am interested if there are any other alternatives I could consider.
The approaches I have tried are:
For Loops - Advised against using this due to it being inefficient.
If statements - I couldn't get this to work as I can't append a '1' or '0' to the df['Classification'] column as it is a dataframe and not a series.
if i > 0.4805:
    df['Classification'].append('0')
else:
    df['Classification'].append('1')
Thank you.
This should do the same. It's unnecessary to sum a one-column data frame by row: df[['Efficiency_%']].sum(axis=1) is the same as df['Efficiency_%']. Also, Boolean Series == True is unnecessary, as it yields the same result as the Boolean Series itself.
df['Classification'] = (df['Efficiency_%'] > 0.4805).astype(int)
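A quick check on made-up efficiency values suggests that the one-liner and the original two-line version agree. The sample numbers are assumptions chosen to sit below, exactly at, and above the 0.4805 baseline.

```python
import pandas as pd

# Made-up efficiencies: below, exactly at, and above the baseline.
df = pd.DataFrame({"Efficiency_%": [0.30, 0.4805, 0.52, 0.90]})

# Original two-step version from the question.
two_step = (df[["Efficiency_%"]].sum(axis=1) > 0.4805)
two_step = (two_step == True).astype(int)

# One-line version: compare the column directly and cast the booleans to 0/1.
df["Classification"] = (df["Efficiency_%"] > 0.4805).astype(int)
print(df)
```

A value exactly equal to the baseline is classified as 0 here, because the comparison is strict (>).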
