Creating a Contingency table in Pandas - python

I want to create a contingency table in Pandas. I can do it with the following code, but I wondered whether there is a pandas function that would do it for me.
For a reproducible example:
toy_data (as a JSON string):
'{"Light":{"321":"no_light","476":"night_light","342":"lamp","454":"lamp","25":"night_light","53":"night_light","120":"night_light","346":"night_light","360":"lamp","55":"no_light","391":"night_light","243":"no_light","101":"night_light","377":"night_light","124":"no_light","368":"lamp","400":"no_light","247":"night_light","270":"lamp","208":"night_light"},"Nearsightedness":{"321":"No","476":"Yes","342":"Yes","454":"Yes","25":"No","53":"Yes","120":"Yes","346":"No","360":"No","55":"Yes","391":"Yes","243":"No","101":"No","377":"Yes","124":"No","368":"No","400":"No","247":"No","270":"Yes","208":"No"}}'
toy_data.head()
Light Nearsightedness
321 no_light No
476 night_light Yes
342 lamp Yes
454 lamp Yes
25 night_light No
df = pd.DataFrame(toy_data.groupby(['Light', 'Nearsightedness']).size())
df = df.unstack('Nearsightedness')
df.columns = df.columns.droplevel()
df
Nearsightedness No Yes
Light
lamp 2 3
night_light 5 5
no_light 4 1

pd.crosstab will do the trick:
pd.crosstab(toy_data.Light, toy_data.Nearsightedness)
Output:
Nearsightedness No Yes
Light
lamp 2 3
night_light 5 5
no_light 4 1
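As a side note, crosstab can also append totals via margins=True:
pd.crosstab(toy_data.Light, toy_data.Nearsightedness, margins=True)
Nearsightedness  No  Yes  All
Light
lamp              2    3    5
night_light       5    5   10
no_light          4    1    5
All              11    9   20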

You can use pd.crosstab:
res = pd.crosstab(toy_data['Light'], toy_data['Nearsightedness'].eq('Yes'))
print(res)
Nearsightedness False True
Light
lamp 2 3
night_light 5 5
no_light 4 1
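If you want proportions rather than counts, crosstab also takes a normalize parameter (normalize='index' gives row-wise proportions):
pd.crosstab(toy_data['Light'], toy_data['Nearsightedness'], normalize='index')
Nearsightedness   No  Yes
Light
lamp             0.4  0.6
night_light      0.5  0.5
no_light         0.8  0.2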

Create A New DataFrame Based on Conditions of Multiple DataFrames

I have two datasets: one with cancer-positive patients (df_pos), and the other with cancer-negative patients (df_neg).
df_pos
id
0 123
1 124
2 125
df_neg
id
0 234
1 235
2 236
I want to combine these datasets into one, with an extra column indicating whether the patient has cancer (yes or no).
Here is my desired outcome:
id outcome
0 123 yes
1 124 yes
2 125 yes
3 234 no
4 235 no
5 236 no
What would be a smarter approach to compile these?
Any suggestions would be appreciated. Thanks!
Use pandas.concat with pandas.DataFrame.assign (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0):
>>> pd.concat([df_pos.assign(outcome='Yes'), df_neg.assign(outcome='No')], ignore_index=True)
id outcome
0 123 Yes
1 124 Yes
2 125 Yes
3 234 No
4 235 No
5 236 No
Alternatively, flag the outcome with a boolean and concatenate:
df_pos['outcome'] = True
df_neg['outcome'] = False
df = pd.concat([df_pos, df_neg]).reset_index(drop=True)
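If you need the literal 'yes'/'no' strings from the desired output, the boolean flag can be mapped afterwards:
df['outcome'] = df['outcome'].map({True: 'yes', False: 'no'})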

Python convert long to wide data frame (Strings) [duplicate]

I have data in long format and am trying to reshape to wide, but there doesn't seem to be a straightforward way to do this using melt/stack/unstack:
Salesman Height product price
Knut 6 bat 5
Knut 6 ball 1
Knut 6 wand 3
Steve 5 pen 2
Becomes:
Salesman Height product_1 price_1 product_2 price_2 product_3 price_3
Knut 6 bat 5 ball 1 wand 3
Steve 5 pen 2 NA NA NA NA
I think Stata can do something like this with the reshape command.
Here's another, more fleshed-out solution, taken from Chris Albon's site.
Create "long" dataframe
raw_data = {'patient': [1, 1, 1, 2, 2],
'obs': [1, 2, 3, 1, 2],
'treatment': [0, 1, 0, 1, 0],
'score': [6252, 24243, 2345, 2342, 23525]}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])
Make a "wide" data
df.pivot(index='patient', columns='obs', values='score')
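which yields roughly (the ints are upcast to float because a NaN is introduced for patient 2, obs 3):
obs           1        2       3
patient
1        6252.0  24243.0  2345.0
2        2342.0  23525.0     NaN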
A simple pivot might be sufficient for your needs but this is what I did to reproduce your desired output:
df['idx'] = df.groupby('Salesman').cumcount()
Just adding a within-group counter/index will get you most of the way there, but the column labels will not be as you desired:
print(df.pivot(index='Salesman', columns='idx')[['product', 'price']])
product price
idx 0 1 2 0 1 2
Salesman
Knut bat ball wand 5 1 3
Steve pen NaN NaN 2 NaN NaN
To get closer to your desired output I added the following:
df['prod_idx'] = 'product_' + df.idx.astype(str)
df['prc_idx'] = 'price_' + df.idx.astype(str)
product = df.pivot(index='Salesman',columns='prod_idx',values='product')
prc = df.pivot(index='Salesman',columns='prc_idx',values='price')
reshape = pd.concat([product,prc],axis=1)
reshape['Height'] = df.set_index('Salesman')['Height'].drop_duplicates()
print(reshape)
product_0 product_1 product_2 price_0 price_1 price_2 Height
Salesman
Knut bat ball wand 5 1 3 6
Steve pen NaN NaN 2 NaN NaN 5
Edit: if you want to generalize the procedure to more variables I think you could do something like the following (although it might not be efficient enough):
df['idx'] = df.groupby('Salesman').cumcount()
tmp = []
for var in ['product', 'price']:
    df['tmp_idx'] = var + '_' + df.idx.astype(str)
    tmp.append(df.pivot(index='Salesman', columns='tmp_idx', values=var))
reshape = pd.concat(tmp, axis=1)
@Luke said:
I think Stata can do something like this with the reshape command.
You can, but I think you also need a within-group counter for Stata's reshape to produce your desired output:
+-------------------------------------------+
| salesman idx height product price |
|-------------------------------------------|
1. | Knut 0 6 bat 5 |
2. | Knut 1 6 ball 1 |
3. | Knut 2 6 wand 3 |
4. | Steve 0 5 pen 2 |
+-------------------------------------------+
If you add idx, then you can use reshape in Stata:
reshape wide product price, i(salesman) j(idx)
Karl D's solution gets at the heart of the problem. But I find it far easier to pivot everything (with .pivot_table, because of the two index columns), then sort the columns and collapse the MultiIndex by assigning flattened names:
df['idx'] = df.groupby('Salesman').cumcount()+1
df = df.pivot_table(index=['Salesman', 'Height'], columns='idx',
                    values=['product', 'price'], aggfunc='first')
df = df.sort_index(axis=1, level=1)
df.columns = [f'{x}_{y}' for x,y in df.columns]
df = df.reset_index()
Output:
Salesman Height price_1 product_1 price_2 product_2 price_3 product_3
0 Knut 6 5.0 bat 1.0 ball 3.0 wand
1 Steve 5 2.0 pen NaN NaN NaN NaN
A bit old but I will post this for other people.
What you want can be achieved, but you probably shouldn't want it ;)
Pandas supports hierarchical indexes for both rows and columns.
In Python 2.7.x ...
from StringIO import StringIO
import pandas as pd

raw = '''Salesman Height product price
Knut 6 bat 5
Knut 6 ball 1
Knut 6 wand 3
Steve 5 pen 2'''
dff = pd.read_csv(StringIO(raw), sep='\s+')
print dff.set_index(['Salesman', 'Height', 'product']).unstack('product')
This produces a representation that is probably more convenient than the one you were looking for:
price
product ball bat pen wand
Salesman Height
Knut 6 1 5 NaN 3
Steve 5 NaN NaN 2 NaN
The advantage of set_index plus unstack over a single function like pivot is that you can break the operation into clear, small steps, which simplifies debugging.
pivoted = df.pivot('salesman', 'product', 'price')
pg. 192, Python for Data Analysis. (Note: in pandas 2.0+ these arguments are keyword-only, so write df.pivot(index='salesman', columns='product', values='price').)
An old question; this is an addition to the already excellent answers. pivot_wider from pyjanitor may be helpful as an abstraction for reshaping from long to wide (it is a wrapper around pd.pivot):
# pip install pyjanitor
import pandas as pd
import janitor

idx = df.groupby(['Salesman', 'Height']).cumcount().add(1)
(df.assign(idx=idx)
   .pivot_wider(index=['Salesman', 'Height'], names_from='idx')
)
Salesman Height product_1 product_2 product_3 price_1 price_2 price_3
0 Knut 6 bat ball wand 5.0 1.0 3.0
1 Steve 5 pen NaN NaN 2.0 NaN NaN

I am not able to make an accurate pivot table

I have one dataframe which contains many columns, and I am trying to make a pivot table like this.
Data sample:
program | InWappTable | InLeadExportTrack
VIC     | True        | 1
VIC     | True        | 1
VIC     | True        | 1
VIC     | True        | 1
Here is my code
rec.groupby(['InWappTable', 'InLeadExportTrack','program']).size()
And the expected output is:
IIUC, you can try this (pd.concat instead of the removed DataFrame.append):
df_new = df.groupby('program')[['InWappTable', 'InLeadExportTrack']].count().reset_index()
total = df_new.sum(numeric_only=True)
total['program'] = 'Total'
df_new = pd.concat([df_new, total.to_frame().T], ignore_index=True)
print(df_new)
I do not believe that you require a pivot_table here, though a pivot_table approach with aggfunc can also be used effectively.
Here is how I approached it.
Generate some data:
import pandas as pd

a = [['program', 'InWappTable', 'InLeadExportTrack'],
     ['VIC', True, 1],
     ['Mall', False, 15],
     ['VIC', True, 101],
     ['VIC', True, 1],
     ['Mall', True, 74],
     ['Mall', True, 11],
     ['VIC', False, 44]]
df = pd.DataFrame(a[1:], columns=a[0])
print(df)
program InWappTable InLeadExportTrack
0 VIC True 1
1 Mall False 15
2 VIC True 101
3 VIC True 1
4 Mall True 74
5 Mall True 11
6 VIC False 44
First, do a GROUP BY with count aggregation:
df_grouped = df.groupby(['program']).count()
print(df_grouped)
InWappTable InLeadExportTrack
program
Mall 3 3
VIC 4 4
Then, to get the sum of all columns:
num_cols = ['InWappTable','InLeadExportTrack']
df_grouped[num_cols] = df_grouped[num_cols].astype(int)
df_grouped.loc['Total']= df_grouped.sum(axis=0)
df_grouped.reset_index(drop=False, inplace=True)
print(df_grouped)
program InWappTable InLeadExportTrack
0 Mall 3 3
1 VIC 4 4
2 Total 7 7
EDIT
Based on the comments in the OP, df_grouped = df.groupby(['program']).count() could be replaced by df_grouped = df.groupby(['program']).sum(). In this case, the output is shown below
program InWappTable InLeadExportTrack
0 Mall 2 100
1 VIC 3 147
2 Total 5 247
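For reference, a pivot_table sketch of the same aggregation, since pivot_table can build the total row itself via margins (using the df constructed above):
df.pivot_table(index='program',
               values=['InWappTable', 'InLeadExportTrack'],
               aggfunc='sum', margins=True, margins_name='Total')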

Cannot insert subtotals into pandas dataframe

I'm rather new to Python and to Pandas. With the help of Google and StackOverflow, I've been able to get most of what I'm after. However, this one has me stumped. I have a dataframe that looks like this:
                   | SalesPerson 1              | SalesPerson 2              | SalesPerson 3
                   | Revenue | Number of Orders | Revenue | Number of Orders | Revenue | Number of Orders
In Process Stage 1 |    8347 |                8 |    9941 |                5 |    5105 |                7
In Process Stage 2 |    3879 |                2 |    3712 |                3 |    1350 |               10
In Process Stage 3 |    7885 |                4 |    6513 |                8 |    2218 |                2
Won Not Invoiced   |    4369 |                1 |    1736 |                5 |    4950 |                9
Won Invoiced       |    7169 |                5 |    5308 |                3 |    9832 |                2
Lost to Competitor |    8780 |                1 |    3836 |                7 |    2851 |                3
Lost to No Action  |    2835 |                5 |    4653 |                1 |    1270 |                2
I would like to add subtotal rows for In Process, Won, and Lost, so that my data looks like:
                    | SalesPerson 1              | SalesPerson 2              | SalesPerson 3
                    | Revenue | Number of Orders | Revenue | Number of Orders | Revenue | Number of Orders
In Process Stage 1  |    8347 |                8 |    9941 |                5 |    5105 |                7
In Process Stage 2  |    3879 |                2 |    3712 |                3 |    1350 |               10
In Process Stage 3  |    7885 |                4 |    6513 |                8 |    2218 |                2
In Process Subtotal |   20111 |               14 |   20166 |               16 |    8673 |               19
Won Not Invoiced    |    4369 |                1 |    1736 |                5 |    4950 |                9
Won Invoiced        |    7169 |                5 |    5308 |                3 |    9832 |                2
Won Subtotal        |   11538 |                6 |    7044 |                8 |   14782 |               11
Won Percent         |     27% |              23% |     20% |              25% |     54% |              31%
Lost to Competitor  |    8780 |                1 |    3836 |                7 |    2851 |                3
Lost to No Action   |    2835 |                5 |    4653 |                1 |    1270 |                2
Lost Subtotal       |   11615 |                6 |    8489 |                8 |    4121 |                5
Lost Percent        |     27% |              23% |     24% |              25% |     15% |              14%
Total               |   43264 |               26 |   35699 |               32 |   27576 |               35
So far, my code looks like:
def create_win_lose_table(dataframe):
    in_process_stagename_list = {'In Process Stage 1', 'In Process Stage 2', 'In Process Stage 3'}
    won_stagename_list = {'Won Invoiced', 'Won Not Invoiced'}
    lost_stagename_list = {'Lost to Competitor', 'Lost to No Action'}
    temp_Pipeline_df = dataframe.copy()
    for index, row in temp_Pipeline_df.iterrows():
        if index not in in_process_stagename_list:
            temp_Pipeline_df.drop([index], inplace=True)
    Pipeline_sum = temp_Pipeline_df.sum()
    # at the end I was going to concat the sum to the original dataframe, but that's where I'm stuck
I have only started to work on the in process dataframe. My thought was that once I figured that out I could then just duplicate that process for the Won and Lost categories. Any thoughts or approaches are welcome.
Thank you!
Jon
A simple example for you:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(5, 5))
df_total = df.sum().to_frame().T             # one-row frame of the column sums
df_with_totals = pd.concat([df, df_total])   # DataFrame.append was removed in pandas 2.0
df_with_totals
0 1 2 3 4
0 0.743746 0.668769 0.894739 0.947641 0.753029
1 0.236587 0.862352 0.329624 0.637625 0.288876
2 0.817637 0.250593 0.363517 0.572789 0.785234
3 0.140941 0.221746 0.673470 0.792831 0.170667
4 0.965435 0.836577 0.790037 0.996000 0.229857
0 2.904346 2.840037 3.051388 3.946885 2.227662
You can use DataFrame.rename (or set the new row's index label directly) to call the summary row whatever you want.
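Closer to the original question, per-category subtotals can be built by slicing on the row labels and concatenating. A sketch, assuming the question's table is a DataFrame df whose index holds the stage names:
import pandas as pd

pieces = []
for cat in ['In Process', 'Won', 'Lost']:
    block = df[df.index.str.startswith(cat)]  # rows belonging to this category
    sub = block.sum().to_frame().T            # one-row subtotal
    sub.index = [cat + ' Subtotal']
    pieces.append(pd.concat([block, sub]))
result = pd.concat(pieces)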

How to count the number of times a value is in a column based on particular row values?

I have this data frame:
Outlook Temperature PlayTennis Value
0 Sunny 60 Yes 1
1 Sunny 70 Yes 1
2 Sunny 40 No 1
3 Overcast 40 No 1
4 Overcast 60 Yes 1
5 Overcast 50 Yes 1
6 Overcast 70 Yes 1
7 Overcast 80 Yes 1
8 Rain 65 No 1
9 Rain 70 Yes 1
and I want to get this
Outlook Yes No
Sunny 2 1
Overcast 4 1
Rain 1 1
I am not sure which commands to use to count the yeses and nos for each of Sunny/Overcast/Rain.
How's this?
df.groupby('Outlook').apply(lambda g: g['PlayTennis'].value_counts())
or, for your exact spec:
df.groupby('Outlook').apply(lambda g: g['PlayTennis'].value_counts()).unstack(1)
or even shorter:
df.groupby('Outlook')['PlayTennis'].value_counts().unstack(1)
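All three produce the requested table:
PlayTennis  No  Yes
Outlook
Overcast     1    4
Rain         1    1
Sunny        1    2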
Here's something to start out with:
forecasts = [
    ["sunny", "yes"],
    ["sunny", "yes"],
    ["sunny", "no"],
    ["overcast", "no"],
    # more forecasts ...
]

myForecasts = {}
for forecast in forecasts:
    if forecast[0] not in myForecasts:
        myForecasts[forecast[0]] = [0, 0]
    if forecast[1] == "yes":
        myForecasts[forecast[0]][0] += 1
    else:
        myForecasts[forecast[0]][1] += 1

print("Outlook | Yes | No")
for myForecast in myForecasts:
    print("{} | {} | {}".format(myForecast, myForecasts[myForecast][0], myForecasts[myForecast][1]))
I hope this helps some. And next time, please show us that you've done your homework.
You could use pd.pivot_table to solve this:
In [88]: pd.pivot_table(df, index='Outlook', columns='PlayTennis',
    ...:                values='Value', aggfunc='sum')
Out[88]:
PlayTennis No Yes
Outlook
Overcast 1 4
Rain 1 1
Sunny 1 2
Also, you can group your data on 'Outlook' and 'PlayTennis', get the count, and use unstack('PlayTennis'):
In [87]: df.groupby(['Outlook', 'PlayTennis']).size().unstack('PlayTennis')
Out[87]:
PlayTennis No Yes
Outlook
Overcast 1 4
Rain 1 1
Sunny 1 2
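For completeness, pd.crosstab (used in the first question on this page) produces the same table without needing the helper Value column:
pd.crosstab(df['Outlook'], df['PlayTennis'])
PlayTennis  No  Yes
Outlook
Overcast     1    4
Rain         1    1
Sunny        1    2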
