python/pandas - Transformed value_counts by Category

I have a table that looks something like this:
+------------+------------+------------+------------+
| Category_1 | Category_2 | Category_3 | Category_4 |
+------------+------------+------------+------------+
| a | b | b | y |
| a | a | c | y |
| c | c | c | n |
| b | b | c | n |
| a | a | a | y |
+------------+------------+------------+------------+
I'm hoping for a pivot_table like result, with the counts of the frequency for each category. Something like this:
+---+------------+----+----+----+
| | | a | b | c |
+---+------------+----+----+----+
| | Category_1 | 12 | 10 | 40 |
| y | Category_2 | 15 | 48 | 26 |
| | Category_3 | 10 | 2 | 4 |
| | Category_1 | 5 | 6 | 4 |
| n | Category_2 | 9 | 5 | 2 |
| | Category_3 | 8 | 4 | 3 |
+---+------------+----+----+----+
I know I could pull it off by splitting the table, assigning value_counts to column values, then rejoining. Is there a simpler, more 'pythonic' way of pulling this off? I figure it may be along the lines of a pivot paired with a transform, but my tests so far have been ugly at best.
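For reference, a minimal DataFrame matching the sample table above (the answers below refer to it as df):
import pandas as pd
df = pd.DataFrame({'Category_1': ['a', 'a', 'c', 'b', 'a'],
                   'Category_2': ['b', 'a', 'c', 'b', 'a'],
                   'Category_3': ['b', 'c', 'c', 'c', 'a'],
                   'Category_4': ['y', 'y', 'n', 'n', 'y']})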

We need to melt (or stack) your original dataframe, then do pd.crosstab; you can use pd.pivot_table as well.
s=df.set_index('Category_4').stack().reset_index().rename(columns={0:'value'})
pd.crosstab([s.Category_4,s.level_1],s['value'])
Out[532]:
value a b c
Category_4 level_1
n Category_1 0 1 1
Category_2 0 1 1
Category_3 0 0 2
y Category_1 3 0 0
Category_2 2 1 0
Category_3 1 1 1
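For completeness, the pd.pivot_table variant mentioned above would look something like this, a sketch using the same reshaped s as in the snippet, where aggfunc='size' counts the rows per combination:
s.pivot_table(index=['Category_4', 'level_1'], columns='value', aggfunc='size', fill_value=0)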

Using get_dummies first, then summing across index levels
d = pd.get_dummies(df.set_index('Category_4'))
# split 'Category_1_a' style columns into a ('Category_1', 'a') MultiIndex
d.columns = d.columns.str.rsplit('_', n=1, expand=True)
d = d.stack(0)
# This shouldn't be necessary but is because the
# index gets bugged and I'm "resetting" it
d.index = pd.MultiIndex.from_tuples(d.index.values)
# on older pandas this was d.sum(level=[0, 1])
d.groupby(level=[0, 1], sort=False).sum()
a b c
y Category_1 3 0 0
Category_2 2 1 0
Category_3 1 1 1
n Category_1 0 1 1
Category_2 0 1 1
Category_3 0 0 2

Related

Assign a total value of 1 if any number is present in a column, else 0

I have a dataset similar to this sample below:
| id | old_a | old_b | new_a | new_b |
|----|-------|-------|-------|-------|
| 6 | 3 | 0 | 0 | 0 |
| 6 | 9 | 0 | 2 | 0 |
| 13 | 3 | 0 | 0 | 0 |
| 13 | 37 | 0 | 0 | 1 |
| 13 | 30 | 0 | 0 | 6 |
| 13 | 12 | 2 | 0 | 0 |
| 6 | 7 | 0 | 2 | 0 |
| 6 | 8 | 0 | 0 | 0 |
| 6 | 19 | 0 | 3 | 0 |
| 6 | 54 | 0 | 0 | 0 |
| 87 | 6 | 0 | 2 | 0 |
| 87 | 11 | 1 | 1 | 0 |
| 87 | 25 | 0 | 1 | 0 |
| 87 | 10 | 0 | 0 | 0 |
| 9 | 8 | 1 | 0 | 0 |
| 9 | 19 | 0 | 2 | 0 |
| 9 | 1 | 0 | 0 | 0 |
| 9 | 34 | 0 | 7 | 0 |
I'm providing this sample dataset for the above table:
import pandas as pd

data = [[6,3,0,0,0],[6,9,0,2,0],[13,3,0,0,0],[13,37,0,0,1],[13,30,0,0,6],[13,12,2,0,0],[6,7,0,2,0],
        [6,8,0,0,0],[6,19,0,3,0],[6,54,0,0,0],[87,6,0,2,0],[87,11,1,1,0],[87,25,0,1,0],[87,10,0,0,0],
        [9,8,1,0,0],[9,19,0,2,0],[9,1,0,0,0],[9,34,0,7,0]]
data = pd.DataFrame(data, columns=['id','old_a','old_b','new_a','new_b'])
I want to look at columns 'new_a' and 'new_b' for each id: if even a single non-zero value exists in a column for that id, I want to count it as 1, irrespective of how many times a value occurs, and assign 0 if no value is present. For example, for id '9' there are two distinct values in new_a, but I want to count it as 1. Similarly, for id '13' there are no values in new_a, so I would assign it 0.
My final output should look like:
| id | new_a | new_b |
|----|-------|-------|
| 6 | 1 | 0 |
| 9 | 1 | 0 |
| 13 | 0 | 1 |
| 87 | 1 | 0 |
I would eventually want to calculate the % of clients using new_a and new_b. So from the above table, 75% of clients use new_a and 25% use new_b. I'm a beginner in Python and not sure how to proceed.
Use GroupBy.any, because 0 is treated as False, then convert the boolean output to integers:
df = data.groupby('id')[['new_a','new_b']].any().astype(int).reset_index()
print (df)
id new_a new_b
0 6 1 0
1 9 1 0
2 13 0 1
3 87 1 0
For the percentages, take the mean of the output above:
s = df[['new_a','new_b']].mean().mul(100)
print (s)
new_a 75.0
new_b 25.0
dtype: float64
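If only the percentages are needed, the same idea can be chained in one expression, skipping the intermediate df:
data.groupby('id')[['new_a','new_b']].any().mean().mul(100)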

Count the number of duplicates grouped by ID in pandas

I'm not sure if this is a duplicate question, but here it goes.
Assuming I have the following table:
import pandas as pd

lst = [1,1,1,2,2,3,3,4,5]
lst2 = ['A','A','B','D','E','A','A','A','E']
df = pd.DataFrame(list(zip(lst, lst2)), columns=['ID', 'val'])
which will output the following table:
+----+-----+
| ID | Val |
+----+-----+
| 1 | A |
+----+-----+
| 1 | A |
+----+-----+
| 1 | B |
+----+-----+
| 2 | D |
+----+-----+
| 2 | E |
+----+-----+
| 3 | A |
+----+-----+
| 3 | A |
+----+-----+
| 4 | A |
+----+-----+
| 5 | E |
+----+-----+
The goal is to count the duplicates on val grouped by ID:
+----+-----+--------------+
| ID | Val | is_duplicate |
+----+-----+--------------+
| 1 | A | 1 |
+----+-----+--------------+
| 1 | A | 1 |
+----+-----+--------------+
| 1 | B | 0 |
+----+-----+--------------+
| 2 | D | 0 |
+----+-----+--------------+
| 2 | E | 0 |
+----+-----+--------------+
| 3 | A | 1 |
+----+-----+--------------+
| 3 | A | 1 |
+----+-----+--------------+
| 4 | A | 0 |
+----+-----+--------------+
| 5 | E | 0 |
+----+-----+--------------+
I tried the following code, but it counts duplicates over the whole column:
df_grouped = df.groupby(['val']).size().reset_index(name='count')
while the following only flags duplicates overall:
df.duplicated(subset=['val'])
What would be the best approach for this?
Let us try duplicated
df['is_dup']=df.duplicated(subset=['ID','val'],keep=False).astype(int)
df
Out[21]:
ID val is_dup
0 1 A 1
1 1 A 1
2 1 B 0
3 2 D 0
4 2 E 0
5 3 A 1
6 3 A 1
7 4 A 0
8 5 E 0
You can use .groupby on the relevant columns and get the count per group. Adding > 1 then tells you whether the value for that group occurs more than once, i.e. is duplicated; the comparison produces a boolean True/False. Finally, use .astype(int) to convert the boolean to an int, which changes True to 1 and False to 0:
df['is_duplicate'] = (df.groupby(['ID','val'])['val'].transform('count') > 1).astype(int)
Out[7]:
ID val is_duplicate
0 1 A 1
1 1 A 1
2 1 B 0
3 2 D 0
4 2 E 0
5 3 A 1
6 3 A 1
7 4 A 0

Joining DataFrames Horizontally

I have a dataframe indexed by date; the index has dates ranging from 6-1 to 6-18.
What I need to do is perform a "pivot" or a horizontal merge, based on the date.
So for example, let's say today is 6-18. I need to go through this dataframe, find the rows whose date is 6-18, and pivot/join them horizontally onto the same dataframe.
Expected output (1 signifies there is data there, 0 signifies null/NaN):
Before the join, df:
date | x | y | z
6-15 | 1 | 1 | 1
6-15 | 2 | 2 | 2
6-18 | 3 | 3 | 3
6-18 | 3 | 3 | 3
Joining the df on 6-18:
date | x | y | z | x (6-18) | y (6-18) | z (6-18)
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 1 | 1 | 1
6-18 | 1 | 1 | 1 | 1 | 1 | 1
When I use append, or join or merge, what I get is this:
date | x | y | z | x (6-18) | y (6-18) | z (6-18)
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-15 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 0 | 0 | 0
6-18 | 1 | 1 | 1 | 1 | 1 | 1
6-18 | 1 | 1 | 1 | 1 | 1 | 1
What I've done is extract the rows for the date I want into a new dataframe using loc:
df_daily = df_metrics.loc[str(_date_map['daily']['start'].date())]
df_daily.columns = [str(cols) + " (Daily)" if cols in metric_names else cols for cols in df_daily.columns]
And then joining it to the master df:
df = df.join(df_daily, lsuffix=' (Daily)', rsuffix=' (Monthly)').reset_index()
When I try joining or merging, the dataset gets huge; I assume it's comparing every row, so whenever the date of one row doesn't match, it creates a new row filled with NaN.
My dataset grows from about 30k rows to 2.8 million.
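One way to avoid that blow-up, sketched under the assumption that rows sharing a date should be paired up in the order they appear, is to give each row a within-date counter with cumcount and merge on the date plus that counter, so the join key is unique:
import pandas as pd

# assumes the index is named 'date' and the value columns are x, y, z as in the example
df = df.reset_index()
df['seq'] = df.groupby('date').cumcount()   # 0, 1, ... within each date

daily = df[df['date'] == '6-18'].copy()
daily = daily.rename(columns={c: c + ' (6-18)' for c in ['x', 'y', 'z']})

out = df.merge(daily, on=['date', 'seq'], how='left').drop(columns='seq').fillna(0)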

Pandas: Matrix from pd.crosstab()

I'm progressively learning pandas. I've figured out that pd.crosstab() can do marvels, but I'm having a hard time making it work in this case.
I have a list of objects obj, each tagged with an int, and I want the matrix of the objects sharing the same tag (1 if the tags match, 0 otherwise):
| obj | tag |
|-----|-----|
| a | 0 |
| b | 2 |
| c | 1 |
| ... | ... |
| z | 2 |
->
| | a | b | c | ... | z |
|-----|---|---|---|-----|---|
| a | 1 | 0 | 0 | . | 0 |
| b | 0 | 1 | 0 | . | 1 |
| c | 0 | 0 | 1 | . | 0 |
| ... | . | . | . | . | 0 |
| z | 0 | 1 | 0 | 0 | 1 |
There are some clunky ways to do it; is there a more pandas-friendly one?
PS: I tried pd.crosstab(df.obj, df.obj, values=df.tag, aggfunc=[np.sum]), but the result was filled with NaN.
You can use merge with crosstab and DataFrame.rename_axis:
df = df.merge(df, on='tag')
df = pd.crosstab(df.obj_x, df.obj_y).rename_axis(None).rename_axis(None, axis=1)
print (df)
a b c z
a 1 0 0 0
b 0 1 0 1
c 0 0 1 0
z 0 1 0 1
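As a side note, if only the 0/1 matrix is needed, a plain equality comparison of the tag column against itself gives the same result without the intermediate self-merge. A sketch using broadcasting, starting from the original obj/tag frame (before the merge above) and assuming the obj values are unique:
import pandas as pd

# df here is the original two-column obj/tag DataFrame
tags = df['tag'].to_numpy()
same_tag = pd.DataFrame((tags[:, None] == tags).astype(int),
                        index=df['obj'].to_numpy(), columns=df['obj'].to_numpy())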

Find all matching groups in a list of lists with pandas

I would like to find all cases for all ids in a Pandas DataFrame.
What would be an efficient solution? I have around 10k records, and this is processed server-side. Would it be a good idea to create a new DataFrame, or is there a more efficient data structure I can use? A case is satisfied when an id contains all the names in that case.
Input (Pandas DataFrame)
id | name |
-----------
1 | bla1 |
2 | bla2 |
2 | bla3 |
2 | bla4 |
3 | bla5 |
4 | bla9 |
5 | bla6 |
5 | bla7 |
6 | bla8 |
Cases
names [
[bla2, bla3, bla4], #case 1
[bla1, bla3, bla7], #case 2
[bla3, bla1, bla6], #case 3
[bla6, bla7] #case 4
]
Needed output (unless there is a more efficient way)
id | case1 | case2 | case3 | case4 |
------------------------------------
1 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 1 |
6 | 0 | 0 | 0 | 0 |
names = [
['bla2', 'bla3', 'bla4'], # case 1
['bla1', 'bla3', 'bla7'], # case 2
['bla3', 'bla1', 'bla6'], # case 3
['bla6', 'bla7'] # case 4
]
df = (df.groupby('id')
        .apply(lambda x: pd.Series([int(pd.Series(y).isin(x['name']).all()) for y in names]))
        .rename(columns=lambda x: 'case{}'.format(x + 1)))
df
+------+---------+---------+---------+---------+
| id | case1 | case2 | case3 | case4 |
|------+---------+---------+---------+---------|
| 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 |
+------+---------+---------+---------+---------+
First, group by id, then apply a check for each case against each group. The objective is to check whether all names of a given case appear in the group; this is handled by isin in conjunction with the list comprehension. The outer pd.Series expands the result into separate columns, and rename is used to name the columns.
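An alternative sketch that avoids building a Series per group: aggregate each id's names into a set once, then test every case with a subset check. This assumes the original id/name frame (before the reshaping above) and the names list defined earlier:
# df is the original id/name DataFrame here, not the reshaped result above
name_sets = df.groupby('id')['name'].agg(set)
out = pd.DataFrame({'case{}'.format(i + 1): name_sets.apply(lambda s: int(set(case) <= s))
                    for i, case in enumerate(names)})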
