How to calculate counts on pandas pivot_table - python

I have data something like this
import random
import pandas as pd
jobs = ['Agriculture', 'Crafts', 'Labor', 'Professional']
df = pd.DataFrame({
'JobCategory':[random.choice(jobs) for i in range(300)],
'Region':[random.randint(1,5) for i in range(300)],
'MaritalStatus':[random.choice(['Not Married', 'Married']) for i in range(300)]
})
And I want a simple table showing the count of jobs in each region.
print(pd.pivot_table(df,
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len))
Output is
MaritalStatus
Region 1 2 3 4 5 All
JobCategory
Agriculture 13.0 23.0 17.0 18.0 8.0 79.0
Crafts 16.0 13.0 18.0 19.0 14.0 80.0
Labor 15.0 11.0 19.0 11.0 14.0 70.0
Professional 22.0 17.0 16.0 7.0 9.0 71.0
All 66.0 64.0 70.0 55.0 45.0 300.0
I assume "MaritalStatus" is showing up in the output because that is the column that the count is being calculated on. How do I get Pandas to calculate based on the Region-JobCategory count and ignore extraneous columns in the dataframe?
Added in edit ---
I am looking for a table with margin values to be output. The values in the table I show are what I want but I don't want MaritalStatus to be what is counted. If there is a Nan in that column, e.g. change the column definition to
'MaritalStatus':[random.choice(['Not Married', 'Married'])
for i in range(299)].append(np.NaN)
This is the output (both with and without values = 'MaritalStatus',)
MaritalStatus
Region 1 2 3 4 5 All
JobCategory
Agriculture 16.0 14.0 16.0 14.0 16.0 NaN
Crafts 25.0 17.0 15.0 14.0 16.0 NaN
Labor 14.0 16.0 8.0 17.0 15.0 NaN
Professional 13.0 14.0 14.0 13.0 13.0 NaN
All NaN NaN NaN NaN NaN 0.0

You can fill the nan values with 0 and then find the len i.e
df = pd.DataFrame({
'JobCategory':[random.choice(jobs) for i in range(300)],
'Region':[random.randint(1,5) for i in range(300)],
'MaritalStatus':[random.choice(['Not Married', 'Married']) for i in range(299)].append(np.NaN)})
df = df.fillna(0)
print(pd.pivot_table(df,
index='JobCategory',
columns='Region',
margins=True,
values='MaritalStatus',
aggfunc=len))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 19.0 17.0 13.0 20.0 9.0 78.0
Crafts 17.0 14.0 9.0 11.0 16.0 67.0
Labor 10.0 17.0 15.0 19.0 11.0 72.0
Professional 11.0 14.0 19.0 19.0 20.0 83.0
All 57.0 62.0 56.0 69.0 56.0 300.0

If you cut the dataframe down to just the columns that are to be part of the final index counting rows works without having to refer to another column.
pd.pivot_table(testdata[['JobCategory', 'Region']],
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len)
Output is the same as in the question except the line with "MaritialStatus" is not present.

The len aggregation function counts the number of times a value of MaritalStatus appears along a particular combination of JobCategory - Region. Thus you're counting the number of JobCategory - Region instances, which is what you're expecting I guess.

EDIT
We can assign key value to each records and count or size that value.
df = pd.DataFrame({
'JobCategory':[random.choice(jobs) for i in range(300)],
'Region':[random.randint(1,5) for i in range(300)],
'MaritalStatus':[random.choice(['Not Married', 'Married']) for i in range(299)].append(np.NaN)})
print(pd.pivot_table(df.assign(key=1),
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len,
values='key'))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 16.0 14.0 13.0 16.0 16.0 75.0
Crafts 14.0 9.0 17.0 22.0 13.0 75.0
Labor 11.0 18.0 20.0 10.0 16.0 75.0
Professional 16.0 14.0 15.0 14.0 16.0 75.0
All 57.0 55.0 65.0 62.0 61.0 300.0
You could add MaritalStatus as the values parameter, and this would eliminate that extra level in the column index. With aggfunc=len, it really doesn't matter what you select as the values parameter it is going to return a count of 1 for every row in that aggregation.
So, try:
print(pd.pivot_table(df,
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len,
values='MaritalStatus'))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 10.0 18.0 10.0 15.0 19.0 72.0
Crafts 11.0 13.0 17.0 11.0 22.0 74.0
Labor 12.0 10.0 18.0 16.0 12.0 68.0
Professional 21.0 16.0 20.0 13.0 16.0 86.0
All 54.0 57.0 65.0 55.0 69.0 300.0
Option 2
Use groupby and size:
df.groupby(['JobCategory','Region']).size()
Output:
JobCategory Region
Agriculture 1 10
2 18
3 10
4 15
5 19
Crafts 1 11
2 13
3 17
4 11
5 22
Labor 1 12
2 10
3 18
4 16
5 12
Professional 1 21
2 16
3 20
4 13
5 16
dtype: int64

Related

how to ffill a multi-index dataframe based on a first level mask

I have a multi-index dataframes.
import pandas as pd
from itertools import product
arrays = [['bar', 'baz','foo'], range(4)]
tuples = list(product(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
multi_ind=pd.DataFrame(np.random.randn(6, len(tuples)), index=range(6), columns=index)
Some values are nans:
multi_ind.loc[3,('bar',2)]=np.nan
multi_ind.loc[3,('bar',3)]=np.nan
multi_ind.loc[4,('bar',1)]=np.nan
For 'bar' I would like to fill all nans expect last, as described in:
Forward fill all except last value in python pandas dataframe
mask=multi_ind['bar']
last_valid_column_per_row = mask.apply(pd.Series.last_valid_index,axis=1)
mask=mask.apply(lambda series:series[:int(last_valid_column_per_row.loc[series.name])].ffill(),axis=1)
Then I would like to ffill() also the other first levels (.e.g baz,foo), using the same logic as for bar (up to last valid index from df['bar']), and I would like to set a nan also any value which was still nan in bar
How to achieve that in an efficient way?
Now I am doing the following, but it is very slow...
df_as_dict={}
df=df.ffill(axis=1) # start by ffilling
for first_level,gr in df.groupby(level=0,axis=1):
gr[first_level][(mask.isnull())]=np.nan # then remove the nans (they should be only at the end)
df_as_dict[first_level]=gr[first_level]
The code based on last_valid_index (in the indicated post) actually
fills NaN along the given axis:
without initial NaN cells (ffill has no previous value to
take as source),
without trailing NaN cells (whatever their number), just
because of last_valid_index, terminating the action just
before the trailing continuous sequence of NaNs,
but if you are happy with this arrangement, let it be.
I created the test DataFrame the following, more concise way:
arrays = [['bar', 'baz','foo'], range(4)]
cols = pd.MultiIndex.from_product(arrays, names=['first', 'second'])
np.random.seed(2)
arr = np.arange(1, 6 * 12 + 1, dtype=float).reshape(6, -1)
# Where to put NaN (x / y)
ind = (np.array([0, 0, 1, 2, 2, 2, 3, 4, 4, 5, 5]),
np.array([1, 2, 6, 1, 3, 5,10, 2, 3,10,11]))
arr[ind] = np.nan
multi_ind = pd.DataFrame(arr, columns=cols)
so that it contains:
first bar baz foo
second 0 1 2 3 0 1 2 3 0 1 2 3
0 1.0 NaN NaN 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0
1 13.0 14.0 15.0 16.0 17.0 18.0 NaN 20.0 21.0 22.0 23.0 24.0
2 25.0 NaN 27.0 NaN 29.0 NaN 31.0 32.0 33.0 34.0 35.0 36.0
3 37.0 38.0 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 NaN 48.0
4 49.0 50.0 NaN NaN 53.0 54.0 55.0 56.0 57.0 58.0 59.0 60.0
5 61.0 62.0 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 NaN NaN
To get your result, run:
result = multi_ind.stack(level=0).apply(
lambda row: row[: row.last_valid_index() + 1].ffill(), axis=1)\
.unstack(level=1).swaplevel(axis=1).reindex(columns=multi_ind.columns)
Note that your last_valid_column_per_row is not needed.
It is enough to pass axis=1 to operate on rows, instead of
columns (like in the indicated post).
The result is:
first bar baz foo
second 0 1 2 3 0 1 2 3 0 1 2 3
0 1.0 1.0 1.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0
1 13.0 14.0 15.0 16.0 17.0 18.0 18.0 20.0 21.0 22.0 23.0 24.0
2 25.0 25.0 27.0 NaN 29.0 29.0 31.0 32.0 33.0 34.0 35.0 36.0
3 37.0 38.0 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 46.0 48.0
4 49.0 50.0 NaN NaN 53.0 54.0 55.0 56.0 57.0 58.0 59.0 60.0
5 61.0 62.0 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 NaN NaN
Details:
stack(level=0) - put bar, baz and foo "fragments"
in consecutive rows.
apply(….ffill(), axis=1) - fill each row, without the trailing
sequence of NaN (if any). Note that I added + 1 in order to
include the last non-NaN value in the result. Otherwise the last
column would have been dropped.
unstack(level=1) - restore the previous ("wide") arangement,
but unfortunately the order of column MultiIndex levels is reversed.
swaplevel(axis=1) - restore the original order of column levels,
but unfortunately the order of column names is wrong.
reindex(…) - restore the original column order.

Adding column to dataframe based on another dataframe using pandas

I need to create a new column in dataframe based on intervals from another dataframe.
For example, I have a dataframe where in the time column I have values ​​and I want to create column in another dataframe based on the intervals in that time column.
I think a practical example is simpler to understand:
Dataframe with intervals
df1
time value var2
0 1.0 34.0 35.0
1 4.0 754.0 755.0
2 9.0 768.0 769.0
3 12.0 65.0 66.0
Dataframe that I need to filter
df2
time value var2
0 1.0 23.0 23.0
1 2.0 43.0 43.0
2 3.0 76.0 12.0
3 4.0 88.0 22.0
4 5.0 64.0 45.0
5 6.0 98.0 33.0
6 7.0 76.0 11.0
7 8.0 56.0 44.0
8 9.0 23.0 22.0
9 10.0 54.0 44.0
10 11.0 65.0 22.0
11 12.0 25.0 25.0
should result
df3
time value var2 interval
0 1.0 23.0 23.0 1
1 2.0 43.0 43.0 1
2 3.0 76.0 12.0 1
3 4.0 88.0 22.0 1
4 5.0 64.0 45.0 2
5 6.0 98.0 33.0 2
6 7.0 76.0 11.0 2
7 8.0 56.0 44.0 2
8 9.0 23.0 22.0 2
9 10.0 54.0 44.0 3
10 11.0 65.0 22.0 3
11 12.0 25.0 25.0 3
EDIT: As Shubham Sharma said, it's not a filter, I want to add a new column based on intervals in other dataframe.
You can use pd.cut to categorize the time in df2 into discrete intervals based on the time in df1 then use Series.factorize to obtain a numeric array identifying distinct ordered values.
df2['interval'] = pd.cut(df2['time'], df1['time'], include_lowest=True)\
.factorize(sort=True)[0] + 1
Result:
time value var2 interval
0 1.0 23.0 23.0 1
1 2.0 43.0 43.0 1
2 3.0 76.0 12.0 1
3 4.0 88.0 22.0 1
4 5.0 64.0 45.0 2
5 6.0 98.0 33.0 2
6 7.0 76.0 11.0 2
7 8.0 56.0 44.0 2
8 9.0 23.0 22.0 2
9 10.0 54.0 44.0 3
10 11.0 65.0 22.0 3
11 12.0 25.0 25.0 3

Add a support column with sum of qty for each quantile

i have a table with month and qty:
and also i have calculated percentile in a separate table:
q = data2.groupby('Month').quantile([0.05, 0.25, 0.5, 0.75, 0.95, 1])
Now I need to add another column in percentile table which should show the number (count) of record fall under the percentile , what I have tried:
q['Count'] = q['Qty2'].count()
and the result is showing the same count for every row:
the result table should like;
I think you need cut for binning column Qty, tehn aggregate sum and size, reshape by DataFrame.stack and Series.unstack and last count new rows and columns. Because working with MultiIndex ccolumns are selected by tuples:
df = pd.read_excel('sample data.xlsx')
lab = ['<10km','10-25km','25-50km','50-75km','75-100km','>100km']
df['bins'] = (pd.cut(df['Qty'],
bins=[-np.inf, 10,25,50,75,100,np.inf],
labels=lab).astype(str))
# df = df.sort_values('Date')
df = (df.groupby([pd.Grouper(freq='MS', key='Date'), 'bins'], sort=False)
.agg(Qty=('Qty','sum'), Count=('Qty', 'size'))
.stack()
.unstack([1,2])
)
df = df.set_index(df.index.strftime('%b-%y'))
df[('','Total Qty')] = df.xs('Qty', axis=1, level=1).sum(axis=1)
df[('','Total Count')] = df.xs('Count', axis=1, level=1).sum(axis=1)
df.loc['Grand Total'] = df.sum()
df.loc['% share'] = (df.loc['Grand Total'].div(df.loc['Grand Total',('','Total Count')])
.mul(100).round())
df[('','%')] = (df[('','Total Count')].drop(['Grand Total','% share'])
.div(df.loc['Grand Total',('','Total Count')])
.mul(100).round())
print (df)
bins 50-75km 75-100km >100km 25-50km \
Qty Count Qty Count Qty Count Qty
Date
Jan-20 2252.515 36.0 2931.099 34.0 1963.314 16.0 2365.221
Feb-20 3201.651 51.0 1640.793 19.0 4085.809 30.0 1370.316
Mar-20 2098.092 34.0 1401.169 16.0 1539.441 13.0 1266.176
Apr-20 996.734 16.0 703.785 8.0 450.147 4.0 1054.756
May-20 1665.223 27.0 1074.167 12.0 1615.029 12.0 2645.278
Jun-20 3924.892 65.0 2132.259 25.0 2461.037 20.0 5364.342
Jul-20 3867.246 64.0 3588.282 41.0 3768.105 29.0 4004.760
Aug-20 3926.835 65.0 2620.992 31.0 3431.889 26.0 3269.309
Sep-20 2302.843 37.0 2012.938 24.0 4651.756 35.0 773.813
Grand Total 24236.031 395.0 18105.484 210.0 23966.527 185.0 22113.971
% share 1327.000 22.0 991.000 11.0 1312.000 10.0 1210.000
bins 10-25km <10km \
Count Qty Count Qty Count Total Qty Total Count
Date
Jan-20 64.0 490.34800 29.0 21.014 4.0 10023.51100 183.0
Feb-20 39.0 693.42200 38.0 11.019 2.0 11003.01000 179.0
Mar-20 35.0 516.79800 30.0 27.866 8.0 6849.54200 136.0
Apr-20 30.0 283.63600 16.0 17.933 3.0 3506.99100 77.0
May-20 75.0 497.96000 27.0 29.593 4.0 7527.25000 157.0
Jun-20 148.0 1477.66547 81.0 17.297 2.0 15377.49247 341.0
Jul-20 110.0 1642.40900 94.0 42.065 6.0 16912.86700 344.0
Aug-20 89.0 776.63000 43.0 77.330 13.0 14102.98500 267.0
Sep-20 21.0 351.31300 23.0 20.144 3.0 10112.80700 143.0
Grand Total 611.0 6730.18147 381.0 264.261 45.0 95416.45547 1827.0
% share 33.0 368.00000 21.0 14.000 2.0 5223.00000 100.0
bins
%
Date
Jan-20 10.0
Feb-20 10.0
Mar-20 7.0
Apr-20 4.0
May-20 9.0
Jun-20 19.0
Jul-20 19.0
Aug-20 15.0
Sep-20 8.0
Grand Total NaN
% share NaN

How to prioritise columns based on overall significance level of values?

How should I select columns based on significance level and which factors should I choose to decide the priority of the column of the given DataFrame
A B C D E F G H
5 61.0 77.0 40.0 46.0 60.0 73.0 66.0 1.0
4 60.0 51.0 49.0 59.0 59.0 67.0 69.0 3.0
3 35.0 32.0 48.0 57.0 43.0 34.0 34.0 4.0
2 17.0 16.0 22.0 12.0 15.0 5.0 5.0 1.0
1 10.0 7.0 11.0 3.0 4.0 5.0 8.0 4.0
At first, each index is a rating by the users for different functionalities (columns). I concatenated the count_values of each column and now trying to understand which functionalities(columns) are more relevant to the users. From the above df, I have to select the top three columns.
If anyone needs further clarification, please let me know.

Subtract a batch of columns in pandas

I am transitioning to using pandas for handling my csv datasets. I am currently trying to do in pandas what I was already doing very easily in numpy: subtract a group of columns from another group several times. This is effectively a element-wise matrix subtraction.
Just for reference, this used to be my numpy solution for this
def subtract_baseline(data, baseline_columns, features_columns):
"""Takes in a list of baseline columns and feature columns, and subtracts the baseline values from all features"""
assert len(features_columns)%len(baseline_columns)==0, "The number of feature columns is not divisible by baseline columns"
num_blocks = len(features_columns)/len(baseline_columns)
block_size = len(baseline_columns)
for i in range(num_blocks):
#Grab each feature block and subract the baseline
init_col = block_size*i+features_columns[0]
final_col = init_col+block_size
data[:, init_col:final_col] = numpy.subtract(data[:, init_col:final_col], data[:,baseline_columns])
return data
To ilustrate better, we can create the following toy dataset:
data = [[10,11,12,13,1,10],[20,21,22,23,1,10],[30,31,32,33,1,10],[40,41,42,43,1,10],[50,51,52,53,1,10],[60,61,62,63,1,10]]
df = pd.DataFrame(data,columns=['L1P1','L1P2','L2P1','L2P2','BP1','BP2'],dtype=float)
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 10.0 11.0 12.0 13.0 1.0 10.0
1 20.0 21.0 22.0 23.0 1.0 10.0
2 30.0 31.0 32.0 33.0 1.0 10.0
3 40.0 41.0 42.0 43.0 1.0 10.0
4 50.0 51.0 52.0 53.0 1.0 10.0
5 60.0 61.0 62.0 63.0 1.0 10.0
The correct output would be the result of grabbing the values in L1P1 & L1P2 and subtracting G1P1 & G1P2 (AKA the baseline), then doing it again for L2P1, L2P2 and any other columns there might be (this is what my for loop does in the original function).
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 9.0 1.0 11.0 3.0 1.0 10.0
1 19.0 11.0 21.0 13.0 1.0 10.0
2 29.0 21.0 31.0 23.0 1.0 10.0
3 39.0 31.0 41.0 33.0 1.0 10.0
4 49.0 41.0 51.0 43.0 1.0 10.0
5 59.0 51.0 61.0 53.0 1.0 10.0
Note that labels for the dataframe should not change, and ideally I'd want a method that relies on the columns indexes, not labels, because the actual data block is 30 columns, not 2 like in this example. This is how my original function in numpy worked, the parameters baseline_columns and features_columns were just lists of the columns indexes.
After this the baseline columns would be deleted all together from the dataframe, as their function has already been fulfilled.
I tried doing this for just 1 batch using iloc but I get Nan values
df.iloc[:,[0,1]] = df.iloc[:,[0,1]] - df.iloc[:,[4,5]]
L1P1 L1P2 L2P1 L2P2 G1P1 G1P2
0 NaN NaN 12.0 13.0 1.0 10.0
1 NaN NaN 22.0 23.0 1.0 10.0
2 NaN NaN 32.0 33.0 1.0 10.0
3 NaN NaN 42.0 43.0 1.0 10.0
4 NaN NaN 52.0 53.0 1.0 10.0
5 NaN NaN 62.0 63.0 1.0 10.0
Adding .values at the end , pandas dataframe will search the column and index match to do the subtract , since the column is not match for 0,1 and 4,5 it will return NaN
df.iloc[:,[0,1]]=df.iloc[:,[0,1]].values - df.iloc[:,[4,5]].values
df
Out[176]:
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 9.0 1.0 12.0 13.0 1.0 10.0
1 19.0 11.0 22.0 23.0 1.0 10.0
2 29.0 21.0 32.0 33.0 1.0 10.0
3 39.0 31.0 42.0 43.0 1.0 10.0
4 49.0 41.0 52.0 53.0 1.0 10.0
5 59.0 51.0 62.0 63.0 1.0 10.0
Is there a reason you want to do it in one line? I.e. would it be okay for your purposes to do it with two lines:
df.iloc[:,0] = df.iloc[:,0] - df.iloc[:,4]
df.iloc[:,1] = df.iloc[:,1] - df.iloc[:,5]
These two lines achieve what I think is your intent.

Categories

Resources