I've been poking around a bit and can't seem to find a close solution to this one:
I'm trying to transform a dataframe from this:
To this:
Such that remark_code_names sharing the same har_id, reason_code_name and denial_amount are spread across new columns.
I've tried a few things, including a groupby function, which gets me halfway there.
denials.groupby(['har_id','reason_code_name','denial_amount']).count().reset_index()
But this obviously leaves out the remark_code_names that I need.
Here's a minimal example:
import pandas as pd

denials = pd.DataFrame({'har_id': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
                        'reason_code_name': [16, 16, 16, 16, 16, 16, 16, 22, 22],
                        'remark_code_name': ['MA04', 'N130', 'N341', 'N362', 'N517', 'N657', 'N95', 'MA04', 'N341'],
                        'denial_amount': [5402, 8507, 5402, 8507, 8507, 8507, 8507, 5402, 5402]})
Using groupby() is a good way to go. Use it along with transform() and overwrite the 'remark_code_name' column. This solution puts all remark_code_names together in the same column.
denials['remark_code_name'] = denials.groupby(['har_id','reason_code_name','denial_amount'])['remark_code_name'].transform(lambda x : ' '.join(x))
denials.drop_duplicates(inplace=True)
If you really need each code in its own column, you could apply another function and use .split(). However, you will first need to set the number of columns based on the maximum number of codes you find in a single row.
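If it helps, here is a minimal sketch of that follow-up, assuming the space-joined column produced above; str.split with expand=True sizes the result to the longest row automatically (the split_codes name is just illustrative):

split_codes = denials['remark_code_name'].str.split(' ', expand=True)
split_codes.columns = ['remark_code_name_' + str(i) for i in range(split_codes.shape[1])]
denials = pd.concat([denials.drop(columns='remark_code_name'), split_codes], axis=1)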
I'm using the iloc method to look at each row of a dataframe, but it's getting exhausting.
The problem is that I can't keep creating variables to collect the largest value for each row, as I did before (the other DF I'm working with now has more than 20 rows, and I'd like a method to find the largest value without using so many variables):
alex=df2.iloc[0,5:16]
mv=df2.iloc[1,5:16]
mv2=df2.iloc[2,5:16]
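For what it's worth, a hedged sketch of the usual vectorized way to do this, assuming the values of interest sit in columns 5 through 15 of df2 as in the snippet above:

# largest value in columns 5..15 for every row at once, no per-row variables
row_max = df2.iloc[:, 5:16].max(axis=1)
# and, if needed, the column where each row's maximum occurs
row_max_col = df2.iloc[:, 5:16].idxmax(axis=1)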
I have been trying to learn to analyze the Big Mart Sales Data Set from this website. I am unable to decode a line of code that is a little complex. I tried to demystify it but wasn't able to. Kindly help me understand this line at In [26]:
df['Item_Visibility_MeanRatio'] = df.apply(lambda x: x['Item_Visibility']/visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0],axis=1).astype(float)
Thank you very much in advance. Happy coding!
df['Item_Visibility_MeanRatio']
This is the new column name
= df.apply(lambda x:
applying a function to the dataframe
x['Item_Visibility']
take the Item_Visibility column from the original dataframe
/visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0]
divide by the Item_Visibility value in the pivot table (visibility_item_avg) at the row whose index matches this row's Item_Identifier
,axis=1)
apply the function to each row (axis=1 passes one row at a time to the lambda)
.astype(float)
convert to float type
Also, it looks like .apply is used a lot on the link you attached. It should be noted that apply is generally the slow way to do things, and there are usually alternatives to avoid using apply.
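For instance (a sketch, assuming visibility_item_avg is a pivot table of mean Item_Visibility indexed by Item_Identifier, as it is typically built in that tutorial; avg_per_item is just an illustrative name), the same ratio can be computed without apply:

# look up each row's average visibility via the pivot table's index, then divide element-wise
avg_per_item = df['Item_Identifier'].map(visibility_item_avg['Item_Visibility'])
df['Item_Visibility_MeanRatio'] = (df['Item_Visibility'] / avg_per_item).astype(float)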
Let's go through it step by step:
df['Item_Visibility_MeanRatio']
This part is creating a column in the data frame and its name is Item_Visibility_MeanRatio.
df.apply(lambda...)
Apply a function along an axis of the DataFrame.
x['Item_Visibility']
It is getting the data from Item_Visibility column in the data frame.
visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0]
This part compares the visibility_item_avg index against the row's Item_Identifier, which yields a boolean mask. It then selects the elements of visibility_item_avg['Item_Visibility'] whose index matched, and the [0] at the end takes the first element of the resulting array.
axis=1
1 : apply function to each row.
astype(float)
This is for changing the value types to float.
To make the code easier to grasp, you can always split it into separate parts and digest it little by little.
To make the code faster, you can use vectorization instead of applying a lambda.
Refer to the link here.
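As a rough illustration (assuming visibility_item_avg is simply the per-Item_Identifier mean of the same dataframe; item_avg is just an illustrative name), one vectorized form skips the lookup entirely with a groupby transform:

# per-row mean visibility of the matching Item_Identifier, computed in one vectorized pass
item_avg = df.groupby('Item_Identifier')['Item_Visibility'].transform('mean')
df['Item_Visibility_MeanRatio'] = (df['Item_Visibility'] / item_avg).astype(float)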
I think I might be overcomplicating this, but essentially what I am trying to do is take the dataframe below, group by the unique values in the "MATNR_BATCH" column, and create another dataframe with the columns "STORAGE_BIN", "FULL_IND", "PRCNT_UTIL", "MAX_NO_SU_IN_SB" and "NO_SU_IN_SB":
From something like this:
To something like this:
From here, what I would like to do is keep only the "groups" (MATNR_BATCH) that have a mix of "FULL" and "NF" values in the "FULL_IND" column. So basically, I would like to create a dataframe that only has the unique MATNR_BATCH groups that contain both "FULL" and "NF".
Can anyone please help me out with this? I have been struggling to come up with a way to do this in Python. Is groupby the right function to use, or should I take a different approach?
As a first pass do
df1 = df[(df.FULL_IND == 'FULL') | (df.FULL_IND == 'NF')]
And then carry on. I can't quite figure out what you want to do with the other columns.
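For the "mix of FULL and NF" part, a hedged sketch using groupby().filter(), with the column names as given in the question (mixed is just an illustrative name):

# keep only the MATNR_BATCH groups whose FULL_IND column contains both values
mixed = df1.groupby('MATNR_BATCH').filter(
    lambda g: {'FULL', 'NF'}.issubset(set(g['FULL_IND'])))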
I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds another column, using the index number as a string-type name for the new column. Is there something akin to the Pandas approach of, say, df.iloc[:, -1] = df.iloc[:, -1]*-1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement: df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a dask dataframe that is stored in many pieces; however, methods like .map_partitions or .reduction may help you achieve the same result with some cleverness.
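For example, a small sketch with placeholder column names (not your actual data):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, -1.5, 2.5]})
ddf = dd.from_pandas(pdf, npartitions=2)

# plain element-wise arithmetic on one column works lazily, as in pandas:
# ddf['b'] = ddf['b'] * -1

# map_partitions on a single column: each partition arrives as a pandas Series,
# so pandas logic (here: make only the positive values negative) can run on it
ddf['b'] = ddf['b'].map_partitions(lambda s: s.mask(s > 0, -s))

print(ddf.compute())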
In the future, we recommend asking separate questions separately on Stack Overflow.
I am working with a CSV file and I need to find the greatest several items in a column. I was able to find the top value just by doing the standard looping through and comparing values.
My idea for getting the top few values would be to store all of the values from that column in an array, sort it, and then pull the last three indices. However, I'm not sure if that would be a good idea in terms of efficiency. I also need to pull other attributes associated with the top values, and it seems like separating out these column values would make everything messy.
Another thing I thought about doing is keeping three variables and doing a running "top three": every time I find something bigger, I compare the top three amongst each other and reorder them. That also seems a bit complex, and I'm not sure how I would implement it.
I would appreciate some ideas, or someone telling me if I'm missing something obvious. Let me know if you need to see my sample code (I felt it was probably unnecessary here).
Edit: To clarify, if the column values are something like [2,5,6,3,1,7] I would want to have the values first = 7, second = 6, third = 5
Pandas looks perfect for your task:
import pandas as pd
df = pd.read_csv('data.csv')
df.nlargest(3, 'column name')
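Note that nlargest returns the full rows ordered by that column, so any other attributes associated with the top values come along in the same result.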