First column cond contains either 1 or 0
Second column event contains either 1 or 0
I want to create a third column where each row holds the cumulative sum of the cond column, taken modulo 4, between two rows where event==1 (the first row where event==1 is included in the cumulative sum, but the last one is not)
+------+-------+--------+
| cond | event | Result |
+------+-------+--------+
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 2 |
| 0 | 0 | 2 |
| 1 | 0 | 3 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 2 |
| 1 | 1 | 1 |
+------+-------+--------+
This can be easily tackled with pandas groupby.transform and cumsum:
event_cum = df['event'].cumsum()
result = df['cond'].groupby(event_cum).transform('cumsum').mod(4)
result[event_cum == 0] = 0 # rows before the first event
0 0
1 0
2 0
3 1
4 2
5 2
6 3
7 0
8 1
9 2
10 1
Name: cond, dtype: int64
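For reference, a self-contained sketch of the same approach, rebuilding the sample frame from the table above:

import pandas as pd

df = pd.DataFrame({'cond':  [0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1],
                   'event': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]})

# each event == 1 starts a new group; the event row itself belongs to the new group
event_cum = df['event'].cumsum()

# running sum of cond within each group, taken modulo 4
df['Result'] = df['cond'].groupby(event_cum).transform('cumsum').mod(4)

# rows before the first event stay at 0
df.loc[event_cum == 0, 'Result'] = 0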
I have a dataframe that looks like this:
+-----------+-------+
| A | B |
+-----------+-------+
| 1 | 1 |
| 2 | 2 |
| 5 | 3 |
| 20 | 4 |
| 25 | 3 |
| 123 | 5 |
| 125 | 6 |
+-----------+-------+
I want to bin column A based on defined ranges, with the sum of the values in column B in each bin. This will then be fed to seaborn to generate a heatmap.
+---------+------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
| | 0-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | 61-70 | 71-80 | 81-90 | 91-100 |
+---------+------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
| 0-100 | 6 | 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 101-200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 201-300 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 301-400 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 401-500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---------+------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
One way to solve it is by looping through the data and generating the array. I am looking for a pandas way, if there is any.
I tried solving it (the result is then passed to seaborn.heatmap) like so:
(df.groupby([pd.cut(df.A, bins=[x for x in range(0, 1001, 100)], include_lowest=True, right=False),
             pd.cut(df.A, bins=[x for x in range(0, 101, 10)], include_lowest=True, right=False)])
   .B.sum().unstack())
But this only groups the B values for the first 0-100 range and ignores the remaining ones.
In your solution the maximal bin edge for range(0,101,10) is 100, so values in column A greater than 100 are not matched - their output is NaN, so after aggregating with sum you get 0.
EDIT:
# create a helper column with the remainder of A modulo 100
df['A1'] = df.A % 100
bins1 = range(0, df.A.max() // 100 * 100 + 101, 100)
bins2 = range(0, df.A1.max() // 10 * 10 + 11, 10)
labels1 = [f'{i}-{j}' if i == 0 else f'{i + 1}-{j}' for i, j in zip(bins1[:-1], bins1[1:])]
labels2 = [f'{i}-{j}' if i == 0 else f'{i + 1}-{j}' for i, j in zip(bins2[:-1], bins2[1:])]
df['a'] = pd.cut(df.A, bins=bins1, labels=labels1, include_lowest=True, right=True)
df['b'] = pd.cut(df.A1, bins=bins2, labels=labels2, include_lowest=True, right=True)
print (df)
A B A1 a b
0 1 1 1 0-100 0-10
1 2 2 2 0-100 0-10
2 5 3 5 0-100 0-10
3 20 4 20 0-100 11-20
4 25 3 25 0-100 21-30
5 123 5 23 101-200 21-30
6 125 6 25 101-200 21-30
df1 = df.pivot_table(index='a', columns='b', values='B', aggfunc='sum', fill_value=0)
print (df1)
b 0-10 11-20 21-30
a
0-100 6 4 3
101-200 0 0 11
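If the full grid from the desired output is wanted (row bins up to 401-500, column bins up to 91-100), one option is to fix the bin edges up front rather than deriving them from the data. A sketch along those lines, assuming a pandas version where groupby accepts observed; the 0-500 / 0-100 ranges are taken from the desired output table:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 5, 20, 25, 123, 125],
                   'B': [1, 2, 3, 4, 3, 5, 6]})

df['A1'] = df.A % 100

# fixed edges covering the whole desired grid (rows 0-500, columns 0-100)
bins1 = range(0, 501, 100)
bins2 = range(0, 101, 10)
labels1 = [f'{i}-{j}' if i == 0 else f'{i + 1}-{j}' for i, j in zip(bins1[:-1], bins1[1:])]
labels2 = [f'{i}-{j}' if i == 0 else f'{i + 1}-{j}' for i, j in zip(bins2[:-1], bins2[1:])]

df['a'] = pd.cut(df.A, bins=bins1, labels=labels1, include_lowest=True)
df['b'] = pd.cut(df.A1, bins=bins2, labels=labels2, include_lowest=True)

# observed=False keeps the empty categories, so the full grid appears with zeros
out = (df.groupby(['a', 'b'], observed=False)['B'].sum()
         .unstack(fill_value=0))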
The original table looks like this:
| s/n | i.d | T1 |
|------|-----|------|
| 0 | A | 2 |
| 1 | B | 2 |
| 2 | C | 0 |
| 3 | A | 2 |
| 4 | B | 0 |
| 5 | C | 2 |
| 6 | A | 1 |
| 7 | B | 0 |
| 8 | C | 1 |
and the final table like this:
| s/n | i.d | T1 |prev_total_T1 | prev_no_of_T1_2 |
|------|-----|------|--------------|------------------|
| 0 | A | 2 | 0 | 0 |
| 1 | B | 2 | 0 | 0 |
| 2 | C | 0 | 0 | 0 |
| 3 | A | 2 | 2 | 1 |
| 4 | B | 0 | 2 | 1 |
| 5 | C | 2 | 0 | 0 |
| 6 | A | 1 | 4 | 2 |
| 7 | B | 0 | 2 | 1 |
| 8 | C | 1 | 2 | 1 |
prev_total_T1 == (shift and total the previous record and update)
simply the sum of all previous T1 values for the individual i.d.
i.e., for the first instance, i.d A, B, C have no previous T1 data, so they are 0, 0, 0 respectively;
for the second instance, i.d A, B, C had 2, 2, 0 respectively;
for the third instance, i.d A, B, C had 2, 0, 2 and 2, 2, 0 respectively, so we add them to give 4, 2, 2.
prev_no_of_T1_2 == (shift and count the previous record and update)
i.e., an increment of 1 for every previous time the number '2' appeared in the T1 column.
For the first instance, there was no previous record for A, B, C, so we write 0, 0, 0 respectively.
For the second instance, the number '2' appeared previously for i.d A and B but not for i.d C, so we write 1, 1, 0 respectively.
For the third instance, the number '2' appeared previously for i.d A and C but not for B, which gives 1, 0, 1 (for i.d A, B, C respectively); we add this to the previous result, 1,1,0 + 1,0,1, to get 2, 1, 1 for i.d A, B, C respectively, and so on.
You need to group the data by the i.d column, shift it using the shift function, then group the shifted data by i.d again and use cumsum to get prev_total_T1. For prev_no_of_T1_2, just divide prev_total_T1 by 2.
import pandas as pd
df = pd.read_csv('test2.csv')
#shift the data groupwise
df['shifted'] = df.groupby('i.d')['T1'].shift(1).fillna(0)
# take groupwise cumulative sum
df['prev_total_T1'] = df.groupby('i.d')['shifted'].cumsum().fillna(0)
# divide prev_total_T1 by 2
df['prev_no_of_T1_2'] = df['prev_total_T1']/2
| s/n | i.d | T1 | shifted | prev_total_T1 | prev_no_of_T1_2 |
|-----|-----|----|---------|---------------|-----------------|
| 0   | A   | 2  | 0       | 0             | 0               |
| 1   | B   | 2  | 0       | 0             | 0               |
| 2   | C   | 0  | 0       | 0             | 0               |
| 3   | A   | 2  | 2       | 2             | 1               |
| 4   | B   | 0  | 2       | 2             | 1               |
| 5   | C   | 2  | 0       | 0             | 0               |
| 6   | A   | 1  | 2       | 4             | 2               |
| 7   | B   | 0  | 0       | 2             | 1               |
| 8   | C   | 1  | 2       | 2             | 1               |
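Note that dividing by 2 matches prev_no_of_T1_2 here only because every previous non-zero T1 value in this sample is a 2. If T1 may contain other values, counting the previous 2s directly is safer; a sketch, rebuilding the sample frame from the question:

import pandas as pd

df = pd.DataFrame({'s/n': range(9),
                   'i.d': ['A', 'B', 'C'] * 3,
                   'T1':  [2, 2, 0, 2, 0, 2, 1, 0, 1]})

# flag rows where T1 == 2, then shift within each i.d so the current row is excluded
shifted_2 = df['T1'].eq(2).astype(int).groupby(df['i.d']).shift(1).fillna(0)

# groupwise running count of the previous 2s
df['prev_no_of_T1_2'] = shifted_2.groupby(df['i.d']).cumsum()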
I have various columns in a pandas dataframe that have dummy values and I want to fill them as follows:
Input Columns
+----+----+
| c1 | c2 |
+----+----+
| 0 | 1 |
| 0 | 0 |
| 1 | 0 |
| 0 | 0 |
| 0 | 1 |
| 0 | 1 |
| 1 | 0 |
| 0 | 1 |
Output columns:
+----+----+
| c1 | c2 |
+----+----+
| 0 | 1 |
| 0 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 3 |
| 2 | 4 |
How can I get this output in pandas?
Cumulative sum with DataFrame.cumsum works here if there are only 0 and 1 values:
df1 = df.cumsum()
print (df1)
c1 c2
0 0 1
1 0 1
2 1 1
3 1 1
4 1 2
5 1 3
6 2 3
7 2 4
If there are 0 and other values, it is possible to take the cumulative sum of a boolean mask that tests for values not equal to 0:
df2 = df.ne(0).cumsum()
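A small sketch of the second variant on made-up data where a column contains values other than 0 and 1 (the c1/c2 values below are illustrative, not from the question):

import pandas as pd

df = pd.DataFrame({'c1': [0, 0, 1, 0], 'c2': [3, 0, 5, 2]})

# count how many non-zero values have been seen so far in each column
df2 = df.ne(0).cumsum()
print(df2)
#    c1  c2
# 0   0   1
# 1   0   1
# 2   1   2
# 3   1   3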
I have one dataframe with 2 columns and I want to add a new column;
This new column should be updated based on a list that I have:
list = [0,1,2,3,6,7,9,10]
The new column is only updated with the list value if the flag (in col2) is 1.
If flag is 0, do not populate row in new column.
Current DF
+-------------+---------+
| context | flag |
+-------------+---------+
| 0 | 1 |
| 0 | 1 |
| 0 | 0 |
| 2 | 1 |
| 2 | 1 |
| 2 | 1 |
| 2 | 1 |
| 2 | 0 |
| 4 | 1 |
| 4 | 1 |
| 4 | 0 |
+-------------+---------+
Desired DF
+-------------+---------+-------------+
| context | flag | new_context |
+-------------+---------+-------------+
| 0 | 1 | 0 |
| 0 | 1 | 1 |
| 0 | 0 | |
| 2 | 1 | 2 |
| 2 | 1 | 3 |
| 2 | 1 | 6 |
| 2 | 1 | 7 |
| 2 | 0 | |
| 4 | 1 | 9 |
| 4 | 1 | 10 |
| 4 | 0 | |
+-------------+---------+-------------+
Right now, I loop through the indices of the list and assign the list value to the new_context column. Then I increment to go through the list.
The values are populated in the correct spots but they all say 0. I don't believe it's iterating through the list properly.
list_length = len(list)
i = 0
for i in range(list_length):
    df["new_context"] = [list[i] if ele == 0 else "" for ele in df["flag"]]
    if df["flag"] == 0: i += 1
I have also tried to iterate through the entire dataframe, however I think it's just applying the same list value (first list value of 0)
i = 0
for index, row in df.iterrows():
    df["new_context"] = [list[i] if ele == 0 else "" for ele in df["flag"]]
    if row['flag'] == 0: i += 1
How can I use the next list value to populate the new column where the flag=1?
It seems i+=1 is not working.
Let us try
l = [0,1,2,3,6,7,9,10]
df['New']=''
df.loc[df.flag==1,'New']=l
df
Out[80]:
context flag New
0 0 1 0
1 0 1 1
2 0 0
3 2 1 2
4 2 1 3
5 2 1 6
6 2 1 7
7 2 0
8 4 1 9
9 4 1 10
10 4 0
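This works because the boolean mask df.flag==1 selects exactly as many rows as there are values in the list, so the list is assigned to those rows in order. A self-contained sketch rebuilding the frame from the question and writing to the new_context column from the desired output (note the assignment requires the list length to equal the number of flag == 1 rows):

import pandas as pd

df = pd.DataFrame({'context': [0, 0, 0, 2, 2, 2, 2, 2, 4, 4, 4],
                   'flag':    [1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0]})

l = [0, 1, 2, 3, 6, 7, 9, 10]

df['new_context'] = ''                    # default for rows where flag is 0
df.loc[df.flag == 1, 'new_context'] = l   # list values fill the flag == 1 rows in order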
I have a dataframe which consists of data that is indexed by the date. So the index has dates ranging from 6-1 to 6-18.
What I need to do is perform a "pivot" or a horizontal merge, based on the date.
So for example, let's say today is 6-18. I need to go through this dataframe, find the rows whose date is 6-18, and pivot/join them horizontally back onto the same dataframe.
Expected output (1 signifies there is data there, 0 signifies null/NaN):
Before the join, df:
date | x | y | z
6-15 | 1 | 1 | 1
6-15 | 2 | 2 | 2
6-18 | 3 | 3 | 3
6-18 | 3 | 3 | 3
Joining the df on 6-18:
date | x | y | z | x (6-18) | y (6-18) | z (6-18)
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 1 | 1 | 1
6-18 | 1 | 1 | 1 | 1 | 1 | 1
When I use append, or join or merge, what I get is this:
date | x | y | z | x (6-18) | y (6-18) | z (6-18)
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 1 | 1 | 1
6-18 | 1 | 1 | 1 | 1 | 1 | 1
What I've done is extract the date that I want into a new dataframe using loc:
df_daily = df_metrics.loc[str(_date_map['daily']['start'].date())]
df_daily.columns = [str(cols) + " (Daily)" if cols in metric_names else cols for cols in df_daily.columns]
And then joining it to the master df:
df = df.join(df_daily, lsuffix=' (Daily)', rsuffix=' (Monthly)').reset_index()
When I try joining or merging, the dataset gets so big because, I assume, it is comparing each row: when the date of one row doesn't match, it creates a new row with NaN.
My dataset grows from about 30k rows to 2.8 million.
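For what it's worth, because the date index is not unique, join matches every left row against every right row with the same date, which is what multiplies the row count. One way to sidestep the merge entirely is to mask the columns in place; a sketch under the assumption that the metric columns are x, y, z as in the example and that the target date is known:

import pandas as pd

df = pd.DataFrame({'date': ['6-15', '6-15', '6-18', '6-18'],
                   'x': [1, 2, 3, 3],
                   'y': [1, 2, 3, 3],
                   'z': [1, 2, 3, 3]})

target = '6-18'
mask = df['date'] == target

# copy each metric column, keeping values only on the target date, 0 elsewhere
for col in ['x', 'y', 'z']:
    df[f'{col} ({target})'] = df[col].where(mask, 0)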