I have a dataset similar to the sample below:
| id | old_a | old_b | new_a | new_b |
|----|-------|-------|-------|-------|
| 6 | 3 | 0 | 0 | 0 |
| 6 | 9 | 0 | 2 | 0 |
| 13 | 3 | 0 | 0 | 0 |
| 13 | 37 | 0 | 0 | 1 |
| 13 | 30 | 0 | 0 | 6 |
| 13 | 12 | 2 | 0 | 0 |
| 6 | 7 | 0 | 2 | 0 |
| 6 | 8 | 0 | 0 | 0 |
| 6 | 19 | 0 | 3 | 0 |
| 6 | 54 | 0 | 0 | 0 |
| 87 | 6 | 0 | 2 | 0 |
| 87 | 11 | 1 | 1 | 0 |
| 87 | 25 | 0 | 1 | 0 |
| 87 | 10 | 0 | 0 | 0 |
| 9 | 8 | 1 | 0 | 0 |
| 9 | 19 | 0 | 2 | 0 |
| 9 | 1 | 0 | 0 | 0 |
| 9 | 34 | 0 | 7 | 0 |
I'm providing this sample dataset for the table above:
import pandas as pd

data = [[6,3,0,0,0],[6,9,0,2,0],[13,3,0,0,0],[13,37,0,0,1],[13,30,0,0,6],[13,12,2,0,0],[6,7,0,2,0],
        [6,8,0,0,0],[6,19,0,3,0],[6,54,0,0,0],[87,6,0,2,0],[87,11,1,1,0],[87,25,0,1,0],[87,10,0,0,0],
        [9,8,1,0,0],[9,19,0,2,0],[9,1,0,0,0],[9,34,0,7,0]]
data = pd.DataFrame(data, columns=['id','old_a','old_b','new_a','new_b'])
I want to look at the columns 'new_a' and 'new_b' for each id: if even a single nonzero value exists in one of these columns for an id, I want to count it as 1, irrespective of how many times a value occurs, and assign 0 if no value is present. For example, for id '9' there are two distinct values in new_a, but I want to count it as 1. Similarly, for id '13' there are no values in new_a, so I would want to assign it 0.
My final output should look like:
| id | new_a | new_b |
|----|-------|-------|
| 6 | 1 | 0 |
| 9 | 1 | 0 |
| 13 | 0 | 1 |
| 87 | 1 | 0 |
I would eventually want to calculate the percentage of clients using new_a and new_b. So from the above table, 75% of clients use new_a and 25% use new_b. I'm a beginner in Python and not sure how to proceed with this.
Use GroupBy.any, because 0 is treated as False, and then convert the boolean output to integers:
df = data.groupby('id')[['new_a','new_b']].any().astype(int).reset_index()
print (df)
id new_a new_b
0 6 1 0
1 9 1 0
2 13 0 1
3 87 1 0
For the percentage, use the mean of the output above:
s = df[['new_a','new_b']].mean().mul(100)
print (s)
new_a 75.0
new_b 25.0
dtype: float64
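If the intermediate table isn't needed, both steps can be chained into a single expression (a sketch of the same any/mean idea above, not a different method):
pct = data.groupby('id')[['new_a','new_b']].any().mean().mul(100)
print (pct)
new_a    75.0
new_b    25.0
dtype: float64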
I have various columns in a pandas dataframe that have dummy values and I want to fill them as follows:
Input Columns
+----+----+
| c1 | c2 |
+----+----+
| 0 | 1 |
| 0 | 0 |
| 1 | 0 |
| 0 | 0 |
| 0 | 1 |
| 0 | 1 |
| 1 | 0 |
| 0 | 1 |
Output columns:
+----+----+
| c1 | c2 |
+----+----+
| 0 | 1 |
| 0 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 3 |
| 2 | 4 |
How can I get this output in pandas?
If there are only 0 and 1 values, this works with a cumulative sum - DataFrame.cumsum:
df1 = df.cumsum()
print (df1)
c1 c2
0 0 1
1 0 1
2 1 1
3 1 1
4 1 2
5 1 3
6 2 3
7 2 4
If there are 0 and other values, you can instead take the cumulative sum of a boolean mask that tests for values not equal to 0:
df2 = df.ne(0).cumsum()
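For instance, on a hypothetical frame containing values other than 0 and 1 (not the poster's data), the mask-based version still increments by exactly one per nonzero row:
import pandas as pd

# hypothetical data with counts larger than 1
df = pd.DataFrame({'c1': [0, 0, 2, 0, 0, 0, 5, 0],
                   'c2': [1, 0, 0, 0, 3, 1, 0, 1]})

# ne(0) marks every nonzero cell as True; cumsum then counts
# how many nonzero rows have been seen so far in each column
df2 = df.ne(0).cumsum()
print (df2)
   c1  c2
0   0   1
1   0   1
2   1   1
3   1   1
4   1   2
5   1   3
6   2   3
7   2   4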
I have a dataframe which consists of data that is indexed by the date. So the index has dates ranging from 6-1 to 6-18.
What I need to do is perform a "pivot" or a horizontal merge, based on the date.
So for example, let's say today is 6-18. I need to go through this dataframe, find the rows whose date is 6-18, and pivot/join them horizontally onto the same dataframe.
Expected output (1 signifies there is data there, 0 signifies null/NaN):
Before the join, df:
date | x | y | z
6-15 | 1 | 1 | 1
6-15 | 2 | 2 | 2
6-18 | 3 | 3 | 3
6-18 | 3 | 3 | 3
Joining the df on 6-18:
date | x | y | z | x (6-18) | y (6-18) | z (6-18)
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 1 | 1 | 1
6-18 | 1 | 1 | 1 | 1 | 1 | 1
When I use append, join, or merge, what I get is this:
date | x | y | z | x (6-18) | y (6-18) | z (6-18)
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 1 | 1 | 1
6-18 | 1 | 1 | 1 | 1 | 1 | 1
What I've done is extract the date I want into a new dataframe using loc:
df_daily = df_metrics.loc[str(_date_map['daily']['start'].date())]
df_daily.columns = [str(cols) + " (Daily)" if cols in metric_names else cols for cols in df_daily.columns]
And then join it to the master df:
df = df.join(df_daily, lsuffix=' (Daily)', rsuffix=' (Monthly)').reset_index()
When I try joining or merging, the dataset gets huge because, I assume, it is comparing every row: whenever a date on one row doesn't match, a new row with NaN is created.
My dataset goes from about 30k rows to 2.8 million.
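For reference, the blow-up happens because join/merge on a non-unique date key matches every left row against every right row with the same date, i.e. a Cartesian product per date. A minimal sketch of one way to avoid it, assuming duplicate dates should be paired up positionally via a cumcount helper column (this is an assumption, not the poster's actual setup):
import pandas as pd

df = pd.DataFrame({'date': ['6-15', '6-15', '6-18', '6-18'],
                   'x': [1, 2, 3, 3], 'y': [1, 2, 3, 3], 'z': [1, 2, 3, 3]})

# number the duplicate dates so each left row matches at most one right row
df['n'] = df.groupby('date').cumcount()
daily = df[df['date'] == '6-18']

out = df.merge(daily, on=['date', 'n'], how='left', suffixes=('', ' (6-18)'))
print (out.drop(columns='n'))
With this, the result keeps the original row count; the 6-15 rows simply get NaN in the new (6-18) columns, which can be filled with 0 if needed.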
I have a table that looks something like this:
+------------+------------+------------+------------+
| Category_1 | Category_2 | Category_3 | Category_4 |
+------------+------------+------------+------------+
| a | b | b | y |
| a | a | c | y |
| c | c | c | n |
| b | b | c | n |
| a | a | a | y |
+------------+------------+------------+------------+
I'm hoping for a pivot_table like result, with the counts of the frequency for each category. Something like this:
+---+------------+----+----+----+
| | | a | b | c |
+---+------------+----+----+----+
| | Category_1 | 12 | 10 | 40 |
| y | Category_2 | 15 | 48 | 26 |
| | Category_3 | 10 | 2 | 4 |
| | Category_1 | 5 | 6 | 4 |
| n | Category_2 | 9 | 5 | 2 |
| | Category_3 | 8 | 4 | 3 |
+---+------------+----+----+----+
I know I could pull it off by splitting the table, assigning value_counts to column values, then rejoining. Is there any more simple, more 'pythonic' way of pulling this off? I figure it may along the lines of a pivot paired with a Transform, but tests so far have been ugly at best.
We need to melt (or stack) your original dataframe, then do pd.crosstab; you could use pd.pivot_table as well.
s=df.set_index('Category_4').stack().reset_index().rename(columns={0:'value'})
pd.crosstab([s.Category_4,s.level_1],s['value'])
Out[532]:
value a b c
Category_4 level_1
n Category_1 0 1 1
Category_2 0 1 1
Category_3 0 0 2
y Category_1 3 0 0
Category_2 2 1 0
Category_3 1 1 1
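The pd.pivot_table route mentioned above would look roughly like this (a sketch reusing the reshaped s, counting rows per group with aggfunc='size'):
s.pivot_table(index=['Category_4','level_1'], columns='value',
              aggfunc='size', fill_value=0)
This should give the same table as the crosstab output above.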
Using get_dummies first, then summing across index levels
d = pd.get_dummies(df.set_index('Category_4'))
d.columns = d.columns.str.rsplit('_', 1, True)
d = d.stack(0)
# This shouldn't be necessary but is because the
# index gets bugged and I'm "resetting" it
d.index = pd.MultiIndex.from_tuples(d.index.values)
d.sum(level=[0, 1])
a b c
y Category_1 3 0 0
Category_2 2 1 0
Category_3 1 1 1
n Category_1 0 1 1
Category_2 0 1 1
Category_3 0 0 2
The first column, cond, contains either 1 or 0.
The second column, event, contains either 1 or 0.
I want to create a third column where each row is the cumulative sum of cond, modulo 4, taken between two rows where event == 1 (the first row where event == 1 must be included in the cumulative sum, but not the last one):
+------+-------+--------+
| cond | event | Result |
+------+-------+--------+
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 2 |
| 0 | 0 | 2 |
| 1 | 0 | 3 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 2 |
| 1 | 1 | 1 |
+------+-------+--------+
This can be easily tackled with pandas GroupBy.transform and cumsum:
event_cum = df['event'].cumsum()
result = df['cond'].groupby(event_cum).transform('cumsum').mod(4)
result[event_cum == 0] = 0 # rows before the first event
0 0
1 0
2 0
3 1
4 2
5 2
6 3
7 0
8 1
9 2
10 1
Name: cond, dtype: int64
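For completeness, a self-contained reconstruction of the example (the DataFrame literal below is rebuilt from the question's table, so it is an assumption about how df was created):
import pandas as pd

# sample data copied from the question's table
df = pd.DataFrame({
    'cond':  [0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1],
    'event': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})

event_cum = df['event'].cumsum()          # label the stretch between events
result = df['cond'].groupby(event_cum).transform('cumsum').mod(4)
result[event_cum == 0] = 0                # rows before the first event
df['Result'] = result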