Create an RFM matrix in pandas - Python

I'm calculating an RFV table in pandas, but I'm not able to find a tutorial or post that helps me build the matrix necessary for the graph.
The graph I want is based on this matplotlib example, with each square being the count of clients in that position.
The dataframe I'm using now has these columns (dtypes in the last row):

COD_CLIENT  RECENCY  FRE_VAL  R      FV     RFV    RFV_Score  RFV_Level
59          87       45.45    3      3      33     6          Potential
1846        75       6.00     3      2      32     5          Seleepers
4380        92       37.95    2      3      23     5          Seleepers
object      int64    float64  int32  int32  int32  int64      object
What do you suggest I do?
I already tried using R and FV as the columns and rows and applying a function, but that went badly.

You want to count the number of rows for each (R, FV) pair, so you could try using pandas.DataFrame.groupby and then counting the number of elements in each group by using .size() on the previous result. The resulting code should look like this:
>>> import pandas as pd
>>> df = ...  # assuming your DataFrame has the format you showed
>>> df.groupby(["R", "FV"]).size()
R  FV
1  1     2
   2     1
   3     1
2  1     1
   2     1
dtype: int64
>>> df.groupby(["R", "FV"]).size().unstack()  # add unstack to format the result
FV    1    2    3
R
1   2.0  1.0  1.0
2   1.0  1.0  NaN
This should match your expected output.
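If you also want the plot itself (each square coloured and annotated with the client count), here is a minimal matplotlib sketch; it is only one way to do it, and the toy data below just stands in for your dataframe:

import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for your dataframe, with R and FV columns as in the question
df = pd.DataFrame({"R": [3, 3, 2, 1, 1, 1, 2],
                   "FV": [3, 2, 3, 1, 1, 2, 1]})

# Count clients per (R, FV) cell; fill_value=0 avoids NaN in empty cells
counts = df.groupby(["R", "FV"]).size().unstack(fill_value=0)

fig, ax = plt.subplots()
ax.imshow(counts.values, cmap="Blues", origin="lower")
ax.set_xticks(range(len(counts.columns)))
ax.set_xticklabels(counts.columns)
ax.set_yticks(range(len(counts.index)))
ax.set_yticklabels(counts.index)
ax.set_xlabel("FV")
ax.set_ylabel("R")

# Write the count into each square
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        ax.text(j, i, counts.iat[i, j], ha="center", va="center")

plt.show()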

Related

Create averages of column in pandas dataframe based on values in another column [duplicate]

I have a dataframe like this:
cluster org time
1 a 8
1 a 6
2 h 34
1 c 23
2 d 74
3 w 6
I would like to calculate the average of time per org per cluster.
Expected result:
cluster mean(time)
1 15 #=((8 + 6) / 2 + 23) / 2
2 54 #=(74 + 34) / 2
3 6
I do not know how to do it in Pandas; can anybody help?
If you want to first take the mean on the combination of ['cluster', 'org'] and then take the mean over the cluster groups, you can use:
In [59]: (df.groupby(['cluster', 'org'], as_index=False).mean()
.groupby('cluster')['time'].mean())
Out[59]:
cluster
1 15
2 54
3 6
Name: time, dtype: int64
If you want the mean of cluster groups only, then you can use:
In [58]: df.groupby(['cluster']).mean()
Out[58]:
time
cluster
1 12.333333
2 54.000000
3 6.000000
You can also use groupby on ['cluster', 'org'] and then use mean():
In [57]: df.groupby(['cluster', 'org']).mean()
Out[57]:
             time
cluster org
1       a     7.0
        c    23.0
2       d    74.0
        h    34.0
3       w     6.0
I would simply do this, which literally follows your desired logic:
df.groupby(['org']).mean().groupby(['cluster']).mean()
Another possible solution is to reshape the dataframe using pivot_table() and then take mean(). Passing aggfunc='mean' (which is in fact the default) averages time by cluster and org.
df.pivot_table(index='org', columns='cluster', values='time', aggfunc='mean').mean()
Another possibility is to use the level parameter of mean() after the first groupby() to aggregate:
df.groupby(['cluster', 'org']).mean().mean(level='cluster')
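Note that the level parameter of mean() was deprecated in pandas 1.3 and later removed, so on recent versions the same aggregation is spelled with a second groupby on the index level:

df.groupby(['cluster', 'org']).mean().groupby(level='cluster').mean()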

Convert continuous numerical data to discrete numerical data in Pandas

I have a pandas dataframe df with a column having continuous numerical data.
A
0 1.5
1 15.0
2 12.8
3 23.2
4 9.6
I want to replace the continuous variables with numerical value based on the following rules:
0-10=10
10-20=50
20-100=80
The final dataframe obtained should be like this:
A
0 10
1 50
2 50
3 80
4 10
I tried using pandas.cut(df['A'], bins=[0,10,20,100], labels=[10,50,80]), but it returns a Categorical column. I need the output column to be numerical.
Add to_numeric to your code:
pd.to_numeric(pd.cut(df['A'], bins=[0,10,20,100], labels=[10,50,80]))
Out[54]:
0 10
1 50
2 50
3 80
4 10
Name: A, dtype: int64
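Alternatively, since the labels are integers, you can cast the Categorical directly. A small self-contained sketch (this assumes every value falls inside a bin, since out-of-range values become NaN and NaN cannot be cast to int):

import pandas as pd

df = pd.DataFrame({'A': [1.5, 15.0, 12.8, 23.2, 9.6]})

# Cast the Categorical straight to int64; the bins must cover all values
df['A'] = pd.cut(df['A'], bins=[0, 10, 20, 100], labels=[10, 50, 80]).astype('int64')
print(df)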

Is there a way in Pandas to subtract two values that are in the same column that have the same name?

Here is a snippet of a dataframe I'm trying to analyze. What I want to do is simply subtract the FP_FLOW FORMATTED_ENTRY values from the D8_FLOW FORMATTED_ENTRY values, but only when the X_LOT_NAME is the same. For example, in the X_LOT_NAME column you can see MPACZX2: the D8_FLOW FORMATTED_ENTRY is 12.3% and the FP_FLOW FORMATTED_ENTRY is 7.8%, so the difference between the two would be 4.5%. I want to apply this logic across the whole dataset.
It is advisable to first convert your data into a format where the values to be added or subtracted are in the same row, and then subtract or add the corresponding columns. You can do this using pd.pivot_table. The example below demonstrates this with a sample dataframe similar to what you've shared:
wanted_data
X_LOT_NAME SPEC_TYPE FORMATTED_ENTRY
0 a FP_FLOW 1
1 a D8_FLOW 2
2 c FP_FLOW 3
3 c D8_FLOW 4
pivot_data = pd.pivot_table(wanted_data,values='FORMATTED_ENTRY',index='X_LOT_NAME',columns='SPEC_TYPE')
pivot_data
SPEC_TYPE D8_FLOW FP_FLOW
X_LOT_NAME
a 2 1
c 4 3
After this step, the resultant pivot_data contains the same data, but the columns are D8_FLOW and FP_FLOW, with X_LOT_NAME as the index. Now you can get the intended value in a new column using:
pivot_data['DIFF'] = pivot_data['D8_FLOW'] - pivot_data['FP_FLOW']
Is this what you are looking for?
df.groupby(['x_lot'])['value'].diff()
0 NaN
1 NaN
2 -5.0
3 8.0
4 -3.0
5 NaN
6 -10.0
Name: value, dtype: float64
This is the data I used to get the above results:
x_lot type value
0 mpaczw1 fp 21
1 mpaczw2 d8 12
2 mpaczw2 fp 7
3 mpaczw2 d8 15
4 mpaczw2 fp 12
5 mpaczw3 d8 21
6 mpaczw3 fp 11
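Put together, a minimal runnable sketch of that diff-based approach (using the sample column names above, not your original ones); note the sign convention, since diff subtracts the previous row within each group:

import pandas as pd

df = pd.DataFrame({
    'x_lot': ['mpaczw1', 'mpaczw2', 'mpaczw2', 'mpaczw2', 'mpaczw2', 'mpaczw3', 'mpaczw3'],
    'type':  ['fp', 'd8', 'fp', 'd8', 'fp', 'd8', 'fp'],
    'value': [21, 12, 7, 15, 12, 21, 11],
})

# Within each x_lot, diff() subtracts the previous row's value, so an fp row
# that follows a d8 row yields fp - d8 (flip the sign if you want d8 - fp)
df['diff'] = df.groupby('x_lot')['value'].diff()
print(df)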

Normalize within groups in Pandas

I have read several similar questions and cannot for the life of me find an answer that works for what I'm specifically trying to do, even though the question is very simple. I have a set of data that has a grouping variable, a position, and a value at that position:
Sample Position Depth
A 1 2
A 2 3
A 3 4
B 1 1
B 2 3
B 3 2
I want to generate a new column that is an internally normalized depth as follows:
Sample Position Depth NormalizedDepth
A 1 2 0
A 2 3 0.5
A 3 4 1
B 1 1 0
B 2 3 1
B 3 2 0.5
This is essentially represented by the formula NormalizedDepth = (x - min(x))/(max(x)-min(x)) such that the minimum and maximum are of the group.
I know how to do this with dplyr in R with the following:
depths %>%
group_by(Sample) %>%
mutate(NormalizedDepth = 100 * (Depth - min(Depth))/(max(Depth) - min(Depth)))
I cannot figure out how to do this with pandas. I've tried grouping and applying, but none of it seems to replicate what I am looking for.
We have transform (which does the same as mutate in R's dplyr) together with np.ptp (which gives the difference between the max and the min):
import numpy as np
g=df.groupby('Sample').Depth
df['new']=(df.Depth-g.transform('min'))/g.transform(np.ptp)
0 0.0
1 0.5
2 1.0
3 0.0
4 1.0
5 0.5
Name: Depth, dtype: float64
Group the DataFrame by the Sample Series' values, apply an anonymous function to each value of the (split) Depth Series that performs min-max normalisation, and assign the result to the NormalizedDepth Series of the df DataFrame (note this is unlikely to be as efficient as the answer above):
import pandas as pd
df['NormalizedDepth'] = df.groupby('Sample').Depth.apply(lambda x: (x - min(x))/(max(x)-min(x)))
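For reference, here is a self-contained sketch of the transform-based version, which keeps the result aligned with the original index regardless of pandas version (the behaviour of groupby().apply() returning a Series has changed across versions):

import pandas as pd

df = pd.DataFrame({
    'Sample': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Position': [1, 2, 3, 1, 2, 3],
    'Depth': [2, 3, 4, 1, 3, 2],
})

g = df.groupby('Sample')['Depth']
# (x - min) / (max - min), computed per group via transform
df['NormalizedDepth'] = (df['Depth'] - g.transform('min')) / (
    g.transform('max') - g.transform('min'))
print(df)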

Use df.merge to populate a new column in df gives strange matchs

I just found the two issues causing this; see the solution below.
I want to create a new column in my dataframe (df) based on another dataframe.
Basically df1 contains updated information that I want to plug into df.
In order to replicate my real case (>1m lines), I will just populate two random df with simple columns.
I use pandas.merge() to do this, but this is giving me strange results.
Here is a typical example. Let's create df randomly, and create df1 with a simple relationship: "New Type" = "Type" + 1. I create this simple relationship so that we can easily check the output. In my real application I don't have such an easy relationship, of course.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)),columns = ["Type"])
df.head()
Type
0 45
1 3
2 89
3 6
4 39
df1 = pd.DataFrame({"Type":range(1,100)})
df1["New Type"] = df1["Type"] + 1
print(df1.head())
Type New Type
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
Now let's say I want to update the df "Type" based on the "New Type" in df1:
df["Type2"] = df.merge(df1,on="Type")["New Type"]
print(df.head())
I get this strange output, where we can clearly see that it does not work:
Type Type2
0 45 46.0
1 3 4.0
2 89 4.0
3 6 4.0
4 39 90.0
I would think output should be like
Type Type2
0 45 46.0
1 3 4.0
2 89 90.0
3 6 7.0
4 39 40.0
Only the first line is properly matched. Do you know what I've missed?
Solution
1. I need to merge with how="left"; otherwise the default is "inner", which produces a table with different dimensions than df.
2. I also need to pass sort=False to the merge function; otherwise the merge result is sorted before being applied to df.
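A sketch of the corrected call with the two fixes applied (with how="left" and unique keys in df1, the result keeps df's row order and length, so the positional assignment lines up):

df["Type2"] = df.merge(df1, on="Type", how="left", sort=False)["New Type"]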
One way you could do this is with map, set_index, and squeeze:
df['Type2'] = df['Type'].map(df1.set_index('Type').squeeze())
Output:
Type Type2
0 22 23.0
1 56 57.0
2 63 64.0
3 33 34.0
4 25 26.0
First, I'd construct a Series of New Type indexed by the old Type from df1:
new_vals = df1.set_index('Type')['New Type']
Then it's simply:
df.replace(new_vals)
That will leave values which aren't mapped intact. If you instead want the output to be NaN (null) where not mapped, do this:
new_vals[df.Type]
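Be aware that on recent pandas versions, indexing a Series with labels that are not all present raises a KeyError rather than returning NaN, so reindex is the safer spelling of that last step:

# reindex yields NaN for Type values missing from new_vals' index;
# .to_numpy() drops the Type-based index so the assignment is positional
df['Type2'] = new_vals.reindex(df['Type']).to_numpy()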
