Bin DataFrame in 2D array for seaborn heatmap - python

I have a dataframe that looks like this:
+-----------+-------+
| A | B |
+-----------+-------+
| 1 | 1 |
| 2 | 2 |
| 5 | 3 |
| 20 | 4 |
| 25 | 3 |
| 123 | 5 |
| 125 | 6 |
+-----------+-------+
I want to bin column A based on predefined ranges, summing the values in column B for each bin. This will then be fed to seaborn to generate a heatmap.
+---------+------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
| | 0-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | 61-70 | 71-80 | 81-90 | 91-100 |
+---------+------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
| 0-100 | 6 | 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 101-200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 201-300 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 301-400 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 401-500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---------+------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
One way to solve it is by looping through the data and generating the array. I am looking for a pandas way, if there is one.
I tried generating the data for seaborn.heatmap like so:
(df.groupby([pd.cut(df.A, bins=[x for x in range(0, 1001, 100)], include_lowest=True, right=False),
             pd.cut(df.A, bins=[x for x in range(0, 101, 10)], include_lowest=True, right=False)])
   .B.sum().unstack())
But this only groups the B values in the first 0-100 range and ignores the remaining ones.

In your solution the maximum bin edge of range(0,101,10) is 100, so values in column A greater than 100 are not matched - the second cut returns NaN for them and they are dropped from the aggregation, which is why only the 0-100 values appear.
EDIT:
# create a helper column using modulo division (integer division is used for the bin edges)
df['A1'] = df.A % 100
bins1 = range(0, df.A.max() // 100 * 100 + 101, 100)
bins2 = range(0, df.A1.max() // 10 * 10 + 11, 10)
labels1 = [f'{i}-{j}' if i == 0 else f'{i + 1}-{j}' for i, j in zip(bins1[:-1], bins1[1:])]
labels2 = [f'{i}-{j}' if i == 0 else f'{i + 1}-{j}' for i, j in zip(bins2[:-1], bins2[1:])]
df['a'] = pd.cut(df.A, bins=bins1, labels=labels1, include_lowest=True, right=True)
df['b'] = pd.cut(df.A1, bins=bins2, labels=labels2, include_lowest=True, right=True)
print (df)
A B A1 a b
0 1 1 1 0-100 0-10
1 2 2 2 0-100 0-10
2 5 3 5 0-100 0-10
3 20 4 20 0-100 11-20
4 25 3 25 0-100 21-30
5 123 5 23 101-200 21-30
6 125 6 25 101-200 21-30
df1 = df.pivot_table(index='a', columns='b', values='B', aggfunc='sum', fill_value=0)
print (df1)
b 0-10 11-20 21-30
a
0-100 6 4 3
101-200 0 0 11
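To actually draw the heatmap the question asks for, a minimal sketch (assuming seaborn and matplotlib are installed; the colormap choice is arbitrary):
import seaborn as sns
import matplotlib.pyplot as plt

# df1 is the pivoted table of summed B values per (a, b) bin
sns.heatmap(df1, annot=True, fmt='g', cmap='viridis')
plt.show()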


Assign a total value of 1 if any number is present in a column, else 0

I have a dataset similar to this sample below:
| id | old_a | old_b | new_a | new_b |
|----|-------|-------|-------|-------|
| 6 | 3 | 0 | 0 | 0 |
| 6 | 9 | 0 | 2 | 0 |
| 13 | 3 | 0 | 0 | 0 |
| 13 | 37 | 0 | 0 | 1 |
| 13 | 30 | 0 | 0 | 6 |
| 13 | 12 | 2 | 0 | 0 |
| 6 | 7 | 0 | 2 | 0 |
| 6 | 8 | 0 | 0 | 0 |
| 6 | 19 | 0 | 3 | 0 |
| 6 | 54 | 0 | 0 | 0 |
| 87 | 6 | 0 | 2 | 0 |
| 87 | 11 | 1 | 1 | 0 |
| 87 | 25 | 0 | 1 | 0 |
| 87 | 10 | 0 | 0 | 0 |
| 9 | 8 | 1 | 0 | 0 |
| 9 | 19 | 0 | 2 | 0 |
| 9 | 1 | 0 | 0 | 0 |
| 9 | 34 | 0 | 7 | 0 |
I'm providing this sample dataset for the above table:
data=[[6,3,0,0,0],[6,9,0,2,0],[13,3,0,0,0],[13,37,0,0,1],[13,30,0,0,6],[13,12,2,0,0],[6,7,0,2,0],
[6,8,0,0,0],[6,19,0,3,0],[6,54,0,0,0],[87,6,0,2,0],[87,11,1,1,0],[87,25,0,1,0],[87,10,0,0,0],
[9,8,1,0,0],[9,19,0,2,0],[9,1,0,0,0],[9,34,0,7,0]]
data= pd.DataFrame(data,columns=['id','old_a','old_b','new_a','new_b'])
I want to look at columns 'new_a' and 'new_b' for each id: if even a single value exists in either of these two columns for an id, I want to count it as 1, irrespective of how many times a value occurs, and assign 0 if no value is present. For example, if I look at id '9', there are two distinct values in new_a, but I want to count it as 1. Similarly, for id '13', there are no values in new_a, so I would want to assign it 0.
My final output should look like:
| id | new_a | new_b |
|----|-------|-------|
| 6 | 1 | 0 |
| 9 | 1 | 0 |
| 13 | 0 | 1 |
| 87 | 1 | 0 |
I would eventually want to calculate the % of clients using new_a and new_b. So from the above table, 75% of clients use new_a and 25% use new_b. I'm a beginner in Python and not sure how to proceed with this.
Use GroupBy.any, because 0 is treated like False, and convert the boolean output to integers:
df = data.groupby('id')[['new_a','new_b']].any().astype(int).reset_index()
print (df)
id new_a new_b
0 6 1 0
1 9 1 0
2 13 0 1
3 87 1 0
For the percentage, use the mean of the output above:
s = df[['new_a','new_b']].mean().mul(100)
print (s)
new_a 75.0
new_b 25.0
dtype: float64
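As a cross-check, an alternative sketch (not part of the answer above, and assuming the columns only hold non-negative counts) gets the same table from the per-id maximum:
alt = data.groupby('id')[['new_a', 'new_b']].max().gt(0).astype(int).reset_index()
# a nonzero maximum means at least one value was present for that id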

Replacing values with ffill in pandas?

I have various columns in a pandas dataframe that have dummy values and I want to fill them as follows:
Input columns:
+----+----+
| c1 | c2 |
+----+----+
| 0 | 1 |
| 0 | 0 |
| 1 | 0 |
| 0 | 0 |
| 0 | 1 |
| 0 | 1 |
| 1 | 0 |
| 0 | 1 |
Output columns:
+----+----+
| c1 | c2 |
+----+----+
| 0 | 1 |
| 0 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 3 |
| 2 | 4 |
How can I get this output in pandas?
This works if there are only 0 and 1 values: cumulative sum - DataFrame.cumsum:
df1 = df.cumsum()
print (df1)
c1 c2
0 0 1
1 0 1
2 1 1
3 1 1
4 1 2
5 1 3
6 2 3
7 2 4
If there are 0 and other values, it is possible to use the cumulative sum of a boolean mask that tests for values not equal to 0:
df2 = df.ne(0).cumsum()
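A minimal, self-contained sketch (reconstructing the sample input from the table above) that runs both variants:
import pandas as pd

df = pd.DataFrame({'c1': [0, 0, 1, 0, 0, 0, 1, 0],
                   'c2': [1, 0, 0, 0, 1, 1, 0, 1]})

print(df.cumsum())        # fine for 0/1 data
print(df.ne(0).cumsum())  # counts nonzero values seen so far, works for any values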

Joining DataFrames Horizontally

I have a dataframe which consists of data that is indexed by the date. So the index has dates ranging from 6-1 to 6-18.
What I need to do is perform a "pivot" or a horizontal merge, based on the date.
So for example, let's say today is 6-18. I need to go through this dataframe, find the rows whose date is 6-18, and pivot/join them horizontally onto the same dataframe.
Expected output (1 signifies there is data there, 0 signifies null/NaN):
Before the join, df:
date | x | y | z
6-15 | 1 | 1 | 1
6-15 | 2 | 2 | 2
6-18 | 3 | 3 | 3
6-18 | 3 | 3 | 3
Joining the df on 6-18:
date | x | y | z | x (6-18) | y (6-18) | z (6-18)
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 1 | 1 | 1
6-18 | 1 | 1 | 1 | 1 | 1 | 1
When I use append, join, or merge, what I get is this:
date | x | y | z | x (6-18) | y (6-18) | z (6-18)
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 1 | 1 | 1
6-18 | 1 | 1 | 1 | 1 | 1 | 1
What I've done is extract the date that I want into a new dataframe using loc:
df_daily = df_metrics.loc[str(_date_map['daily']['start'].date())]
df_daily.columns = [str(cols) + " (Daily)" if cols in metric_names else cols for cols in df_daily.columns]
And then join it to the master df:
df = df.join(df_daily, lsuffix=' (Daily)', rsuffix=' (Monthly)').reset_index()
When I try joining or merging, the dataset gets huge because, I assume, it's comparing every row: when the date of one row doesn't match, it creates a new row with NaN.
My dataset goes from about 30k rows to 2.8 million.
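One possible sketch of a row-aligned approach (this is not the asker's code; the columns x, y, z and the target date '6-18' are assumed from the example, and the matching rows keep their original values rather than the 1/0 indicators of the schematic output): zero out the non-matching rows, then concatenate column-wise on the same index, which avoids the row explosion.
import pandas as pd

df = pd.DataFrame({'date': ['6-15', '6-15', '6-18', '6-18'],
                   'x': [1, 2, 3, 3], 'y': [1, 2, 3, 3], 'z': [1, 2, 3, 3]})

target = '6-18'
mask = df['date'].eq(target)

# keep values only on rows whose date matches the target, 0 elsewhere,
# then attach the suffixed copies column-wise (index-aligned, no extra rows)
suffixed = df[['x', 'y', 'z']].mul(mask, axis=0).add_suffix(f' ({target})')
out = pd.concat([df, suffixed], axis=1)
print(out)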

python/pandas - Transformed value_counts by Category

I have a table that looks something like this:
+------------+------------+------------+------------+
| Category_1 | Category_2 | Category_3 | Category_4 |
+------------+------------+------------+------------+
| a | b | b | y |
| a | a | c | y |
| c | c | c | n |
| b | b | c | n |
| a | a | a | y |
+------------+------------+------------+------------+
I'm hoping for a pivot_table like result, with the counts of the frequency for each category. Something like this:
+---+------------+----+----+----+
| | | a | b | c |
+---+------------+----+----+----+
| | Category_1 | 12 | 10 | 40 |
| y | Category_2 | 15 | 48 | 26 |
| | Category_3 | 10 | 2 | 4 |
| | Category_1 | 5 | 6 | 4 |
| n | Category_2 | 9 | 5 | 2 |
| | Category_3 | 8 | 4 | 3 |
+---+------------+----+----+----+
I know I could pull it off by splitting the table, assigning value_counts to column values, then rejoining. Is there any simpler, more 'pythonic' way of pulling this off? I figure it may be along the lines of a pivot paired with a transform, but my tests so far have been ugly at best.
We need to melt (or stack) your original dataframe, then do pd.crosstab; you can use pd.pivot_table as well.
s=df.set_index('Category_4').stack().reset_index().rename(columns={0:'value'})
pd.crosstab([s.Category_4,s.level_1],s['value'])
Out[532]:
value a b c
Category_4 level_1
n Category_1 0 1 1
Category_2 0 1 1
Category_3 0 0 2
y Category_1 3 0 0
Category_2 2 1 0
Category_3 1 1 1
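Since pd.pivot_table is mentioned as an alternative but not shown, a sketch of that route (using melt instead of stack) could be:
m = df.melt(id_vars='Category_4')
# count rows per (Category_4, variable) pair and value
print(m.pivot_table(index=['Category_4', 'variable'], columns='value',
                    aggfunc='size', fill_value=0))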
Using get_dummies first, then summing across index levels
d = pd.get_dummies(df.set_index('Category_4'))
d.columns = d.columns.str.rsplit('_', n=1, expand=True)
d = d.stack(0)
# This shouldn't be necessary but is because the
# index gets bugged and I'm "resetting" it
d.index = pd.MultiIndex.from_tuples(d.index.values)
d.groupby(level=[0, 1], sort=False).sum()
a b c
y Category_1 3 0 0
Category_2 2 1 0
Category_3 1 1 1
n Category_1 0 1 1
Category_2 0 1 1
Category_3 0 0 2

Create a cumulated sum between a range of rows

First column cond contains either 1 or 0
Second column event contains either 1 or 0
I want to create a third column where each row is the cumulative sum of the cond column, modulo 4, between two rows where event == 1 (the first row where event == 1 must be included in the cumulative sum, but not the last one)
+------+-------+--------+
| cond | event | Result |
+------+-------+--------+
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 2 |
| 0 | 0 | 2 |
| 1 | 0 | 3 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 2 |
| 1 | 1 | 1 |
+------+-------+--------+
This can be easily tackled with pandas groupby.transform and cumsum:
event_cum = df['event'].cumsum()
result = df['cond'].groupby(event_cum).transform('cumsum').mod(4)
result[event_cum == 0] = 0 # rows before the first event
0 0
1 0
2 0
3 1
4 2
5 2
6 3
7 0
8 1
9 2
10 1
Name: cond, dtype: int64
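For reference, a self-contained sketch that reconstructs the sample data from the table above and attaches the result as the Result column:
import pandas as pd

df = pd.DataFrame({'cond':  [0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1],
                   'event': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]})

event_cum = df['event'].cumsum()
result = df['cond'].groupby(event_cum).transform('cumsum').mod(4)
result[event_cum == 0] = 0  # rows before the first event
df['Result'] = result       # matches the Result column shown in the question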
