Convert continuous numerical data to discrete numerical data in Pandas

Convert continuous numerical data to discrete numerical data in Pandas - python

I have a pandas dataframe df with a column having continuous numerical data.
A
0 1.5
1 15.0
2 12.8
3 23.2
4 9.6
I want to replace the continuous variables with numerical value based on the following rules:
0-10=10
10-20=50
20-100=80
The final dataframe obtained should be like this:
A
0 10
1 50
2 50
3 80
4 10
I had tried to use pandas.cut(df, bins=[0,10,20,100], labels=[10,50,80]) but it returns a Categorical column. I need the output column to be numerical.

Adding to_numeric to your code
pd.to_numeric(pd.cut(df['A'], bins=[0,10,20,100], labels=[10,50,80]))
Out[54]:
0 10
1 50
2 50
3 80
4 10
Name: A, dtype: int64

Related

How to aggregate all values in a pandas dataframe columns in 2 values

I have a Pandas dataframe contains some columns. Each columns have some differents values. See the image.
In col1 I have that the value 1 is more frequent than others, so, I need to transform this column to have values 1 and more then 1.
How can I do that?
My goals here is transforme this column in a categorical column but I have no idea how can I do that.
The output expected is something like the next image:

Try clip function on column:
df["col1"].clip(upper=2)
0 1
1 2
2 2
3 2
4 1
5 2
6 2
7 1
8 1
9 1
10 1
11 2
12 1

create in pandas matrix rfm

I'm calculating a rfv table in pandas, but im not able to find a tutorial or post that helps me building the matrix necessary for the graph.
The graph i want using this base from matplot
each square being the count of clients in that position.
The dataframe im using now has this columns:
COD_CLIENT
RECENCY
FRE_VAL
R
FV
RFV
RFV_Score
RFV_Level
59
87
45.45
3
3
33
6
Potential
1846
75
6.00
3
2
32
5
Seleepers
4380
92
37.95
2
3
23
5
Seleepers
object
int64
float64
int32
int32
int32
int64
object
What do you guys sugest i do?
I already tried using R and FV as columns and rows and using a function but this went badly.

You want to count the number of rows for each (R, FV) pair, so you could try using pandas.DataFrame.groupby and then counting the number of elements in each group by using .size() on the previous result. The resulting code should look like this:
import pandas as pd
>> df = ... # Assuming your DataFrame has the format you showed
>> df.groupby(["R", "FV"]).size()
R FV
1 1 2
2 1
3 1
2 1 1
2 1
dtype: int64
>> df.groupby(["R", "FV"]).size().unstack() # Add unstack to format the result
FV 1 2 3
R
1 2.0 1.0 1.0
2 1.0 1.0 NaN
This should match your expected output.

Group seperated counting values in a pandas dataframe

I have following df
A B
0 1 10
1 2 20
2 NaN 5
3 3 1
4 NaN 2
5 NaN 3
6 1 10
7 2 50
8 Nan 80
9 3 5
Consisting of repeating sequences from 1-3 seperated by a variable number of NaN's.I want to groupby each this sequences from 1-3 and get the minimum value of column B within these sequences.
Desired Output something like:
B_min
0 1
6 5
Many thanks beforehand
draj

Idea is first remove rows by missing values by DataFrame.dropna, then use GroupBy.cummin by helper Series created by compare A for equal by Series.eq and Series.cumsum, last data cleaning to one column DataFrame:
df = (df.dropna(subset=['A'])
.groupby(df['A'].eq(1).cumsum())['B']
.min()
.reset_index(drop=True)
.to_frame(name='B_min'))
print (df)
B_min
0 1
1 5

All you need to df.groupby() and apply min(). Is this what you are expecting?
df.groupby('A')['B'].min()
Output:
A
1 10
2 20
3 1
Nan 80
If you don't want the NaNs in your group you can drop them using df.dropna()
df.dropna().groupby('A')['B'].min()

Python - Pivot and create histograms from Pandas column, with missing values

Having the following Data Frame:
name value count total_count
0 A 0 1 20
1 A 1 2 20
2 A 2 2 20
3 A 3 2 20
4 A 4 3 20
5 A 5 3 20
6 A 6 2 20
7 A 7 2 20
8 A 8 2 20
9 A 9 1 20
----------------------------------
10 B 0 10 75
11 B 5 30 75
12 B 6 20 75
13 B 8 10 75
14 B 9 5 75
I would like to pivot the data, grouping each row by the name value, then create columns based on the value & count columns aggregated into bins.
Explanation: I have 10 possible values, range 0-9, not all the values are present in each group. In the above example group B is missing values 1,2,3,4,7. I would like to create an histogram with 5 bins, ignore missing values and calculate the percentage of count for each bin. So the result will look like so:
name 0-1 2-3 4-5 6-7 8-9
0 A 0.150000 0.2 0.3 0.2 0.150000
1 B 0.133333 0.0 0.4 0.4 0.066667
For example for bin 0-1 of group A the calculation is the sum of count for the values 0,1 (1+2) divided by the total_count of group A
name 0-1
0 A (1+2)/20 = 0.15
I was looking into hist method and this StackOverflow question, but still struggling with figuring out what is the right approach.

Use pd.cut to bin your feature, then use a df.groupby().count() and the .unstack() method to get the dataframe you are looking for. During the group by you can use any aggregation function (.sum(), .count(), etc) to get the results you are looking for. The code below works if you are looking for an example.
import pandas as pd
import numpy as np
df = pd.DataFrame(
data ={'name': ['Group A','Group B']*5,
'number': np.arange(0,10),
'value': np.arange(30,40)})
df['number_bin'] = pd.cut(df['number'], bins=np.arange(0,10))
# Option 1: Sums
df.groupby(['number_bin','name'])['value'].sum().unstack(0)
# Options 2: Counts
df.groupby(['number_bin','name'])['value'].count().unstack(0)
The null values in the original data will not affect the result.

To get the exact result you could try this.
bins=range(10)
res = df.groupby('name')['count'].sum()
intervals = pd.cut(df.value, bins=bins, include_lowest=True)
df1 = (df.groupby([intervals,"name"])['count'].sum()/res).unstack(0)
df1.columns = df1.columns.astype(str) # convert the cols to string
df1.columns = ['a','b','c','d','e','f','g','h','i'] # rename the cols
cols = ['a',"b","d","f","h"]
df1 = df1.add(df1.iloc[:,1:].shift(-1, axis=1), fill_value=0)[cols]
print(df1)
You can manually rename the cols later.
# Output:
a b d f h
name
A 0.150000 0.2 0.3 0.200000 0.15
B 0.133333 NaN 0.4 0.266667 0.20
You can replace the NaN values using df1.fillna("0.0")

Comparing multiple columns of data using pandas data frame

I have a pandas data frame df1
Time sat1 sat2 sat3 sat4 val1 val2 val3 val4
10 2 4 2 4 0.1 -1.0 1 2.0
20 3 1 1 3 1.6 0 2.1 -0.7
30 12 8 8 16 0.5 1.1 0.6 2.0
40 2 1 2 12 1.0 1.2 0.4 3.7
I want to compare sat1,sat2 with sat3 and sat4 at all time instant.
If there is match between these two columns ,I want to get number of matched
elements and subtract matched elements values columns.
Expected Output:
match_count Reslt_1 Reslt_2
2 val1-val3 val2-val4
2 val1-val4 val2-val3
1 Nan val2-val3
1 val1-val3 Nan ( w.r.t match found in sat1 or sat2)
These data are sample data and number of columns may increase . Data in sat1,sat2 are toggling in sat3 & sat4 and that is why subtraction will happen accordingly.
How can I obtain above expected output using pandas. I obtained above dataframe
using pandas concat function.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Convert continuous numerical data to discrete numerical data in Pandas - python

Adding to_numeric to your code pd.to_numeric(pd.cut(df['A'], bins=[0,10,20,100], labels=[10,50,80])) Out[54]: 0 10 1 50 2 50 3 80 4 10 Name: A, dtype: int64

Related

How to aggregate all values in a pandas dataframe columns in 2 values

create in pandas matrix rfm

Group seperated counting values in a pandas dataframe

Python - Pivot and create histograms from Pandas column, with missing values

Comparing multiple columns of data using pandas data frame

Categories

Resources