Pandas groupby aggregation with percentages - python

I have the following dataframe:
import pandas as pd
import numpy as np
np.random.seed(123)
n = 10
df = pd.DataFrame({"val": np.random.randint(1, 10, n),
"cat": np.random.choice(["X", "Y", "Z"], n)})
val cat
0 3 Z
1 3 X
2 7 Y
3 2 Z
4 4 Y
5 7 X
6 2 X
7 1 X
8 2 X
9 1 Y
I want to know the percentage each category X, Y, and Z has of the entire val column sum. I can aggregate df like this:
total_sum = df.val.sum()
#32
s = df.groupby("cat").val.sum().div(total_sum)*100
#this is the desired result in % of total val
cat
X 46.875 #15/32
Y 37.500 #12/32
Z 15.625 #5/32
Name: val, dtype: float64
However, I find it rather surprising that pandas seemingly does not have a percentage/frequency function, something like df.groupby("cat").val.freq(), alongside df.groupby("cat").val.sum() or df.groupby("cat").val.mean(). I assumed this is a common operation, and Series.value_counts has implemented it with normalize=True - but for groupby aggregation, I cannot find anything similar. Am I missing something here, or is there indeed no out-of-the-box function?
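For what it's worth, there does not appear to be a built-in groupby "percent of total" aggregator, but the pattern above is easy to wrap in a small helper. This is just a sketch; pct_of_total is a made-up name, not a pandas API:
import numpy as np
import pandas as pd

def pct_of_total(grouped):
    # Hypothetical helper (not a pandas API): per-group sums normalized by the
    # overall sum, expressed in percent -- the groupby analogue of
    # Series.value_counts(normalize=True).
    s = grouped.sum()
    return s / s.sum() * 100

np.random.seed(123)
df = pd.DataFrame({"val": np.random.randint(1, 10, 10),
                   "cat": np.random.choice(["X", "Y", "Z"], 10)})

print(pct_of_total(df.groupby("cat").val))
# cat
# X    46.875
# Y    37.500
# Z    15.625
# Name: val, dtype: float64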

Related

How to quickly sum across columns for every permutation of rows in Python

Suppose I have an n x k matrix X, and I want to get the sum across the columns, but for every permutation of the rows. So if my matrix is [[1,2],[3,4]] my desired output would be [1+2, 1+4, 3+2, 3+4]. I've produced an MWE with my first attempt at a solution, and I'm hoping I can get some help to reduce the computation time.
My actual problem has n=160 and k=4, and it takes quite a while to run (as of writing this, it's still running).
import pandas as pd
import numpy as np
import itertools
n = 4
k = 3
X = np.random.randint(0, 10, (n, k))
df = pd.DataFrame(X)
df
0 1 2
0 2 9 2
1 7 6 4
2 3 7 0
3 5 0 0
ixi = df.index.tolist()
ixc = df.columns.tolist()
psum = np.array([df.lookup(i, ixc).sum() for i in
                 itertools.product(ixi, repeat=len(ixc))])
You can try functools.reduce:
from functools import reduce
reduce(np.add.outer, df.values.T).ravel()
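As a rough sanity check, the reduce result should agree element by element with the question's loop, since ravel walks the last axis fastest, exactly like itertools.product. Since df.lookup has been deprecated/removed in recent pandas versions, the sketch below compares against a plain-Python reference instead:
import itertools
from functools import reduce
import numpy as np

np.random.seed(0)
X = np.random.randint(0, 10, (4, 3))  # small n, k so the check is cheap

# Reference: for each tuple of row indices (one per column), sum the picked entries.
ref = np.array([sum(X[i, c] for c, i in enumerate(idx))
                for idx in itertools.product(range(X.shape[0]), repeat=X.shape[1])])

# reduce(np.add.outer, ...) builds an (n, n, ..., n) array of all such sums;
# ravel flattens it in the same order that itertools.product iterates.
fast = reduce(np.add.outer, X.T).ravel()

assert np.array_equal(ref, fast)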

Slice multiple columns that are not next to each other in a dataframe

I want to slice multiple columns that are located several columns away from each other. I'm trying to write this concisely, without repeating the same code over and over:
df (see example below) has columns A to H, with many rows containing some data (x).
How do I slice multiple arbitrarily spaced columns, say A, D, E, G, in a minimal amount of code? I don't want to repeat loc calls (df.loc['A'], df.loc['C:E'], df.loc['G']).
Can I generate a list and loop through it or is there a shorter/quicker way?
Ultimately my goal would be to drop the selected columns from the main DataFrame.
A B C D E F G H
0 x x x x x x x x
1 x x x x x x x x
2 x x x x x x x x
3 x x x x x x x x
4 x x x x x x x x
You might harness the .iloc method to get columns by their position rather than their name, for example:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9],'D':[10,11,12],'E':[13,14,15]})
df2 = df.iloc[:, [0,2,4]]
print(df2)
output:
A C E
0 1 7 13
1 2 8 14
2 3 9 15
If you need just x random columns from a df which has y columns, you might use random.sample. For example, if you want 3 columns out of 5:
import random
cols = sorted(random.sample(range(0,5),k=3))
gives cols, a sorted list of three column positions (thanks to sorted, the original column order is preserved).
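Since the original goal was to drop the selected columns, the same positional list can feed both the selection and the drop. A minimal sketch (the data and variable names here are made up for illustration):
import random
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
                   'D': [10, 11, 12], 'E': [13, 14, 15]})

cols = sorted(random.sample(range(df.shape[1]), k=3))  # e.g. [0, 2, 4]

selected = df.iloc[:, cols]                     # just the sampled columns
remaining = df.drop(columns=df.columns[cols])   # the original goal: drop them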

Why does pandas DataFrame assign raise a TypeError in this case?

Environment:
Python 3.6.4
pandas 0.23.4
My code is below.
from math import sqrt
import pandas as pd
df = pd.DataFrame({'x':[1,2,3], 'y':[4,5,6]})
df = df.assign(d = lambda z: sqrt(z.x**2 + z.y**2))
The last line raises a TypeError like below.
...
TypeError: cannot convert the series to <class 'float'>
Without sqrt, it works.
df = df.assign(d2 = lambda z: z.x**2 + z.y**2)
df
Out[6]:
x y d2
0 1 4 17
1 2 5 29
2 3 6 45
And apply also works.
df['d3'] = df.apply(lambda z: sqrt(z.x**2 + z.y**2), axis=1)
df
Out[8]:
x y d2 d3
0 1 4 17 4.123106
1 2 5 29 5.385165
2 3 6 45 6.708204
What's the matter with the first?
Use numpy.sqrt - it also works with 1d arrays and Series, while sqrt from math works only with scalars:
import numpy as np
df = df.assign(d = lambda z: np.sqrt(z.x**2 + z.y**2))
Another solution is to use **(1/2):
df = df.assign(d = lambda z: (z.x**2 + z.y**2)**(1/2))
print (df)
x y d
0 1 4 4.123106
1 2 5 5.385165
2 3 6 6.708204
Your apply solution works because with axis=1 each row is passed in and its fields are accessed as scalars, but as #jpp mentioned, apply should not be preferred, since it involves a Python-level row-wise loop.
df.apply(lambda z: print(z.x), axis=1)
1
2
3
A pandas Series is like a NumPy array: you cannot apply a function from the math module to it, because such a function expects a single scalar, not a Series.
The default arithmetic operations are valid, but not functions that do not work on arrays/Series.
What you can do is:
df = df.assign(d = lambda z: (z.x**2 + z.y**2)**0.5)
or
df['d'] = (df.x**2 + df.y**2)**0.5
which uses only pandas' standard element-wise operations.
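As a side note, NumPy also ships a dedicated ufunc for this particular formula; assuming NumPy is available, np.hypot computes sqrt(x**2 + y**2) element-wise, so it works directly on Series:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
# np.hypot(x, y) == sqrt(x**2 + y**2), evaluated element-wise on the Series
df = df.assign(d=lambda z: np.hypot(z.x, z.y))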

add a different random number to every cell in a pandas dataframe

I need to add some 'noise' to my data, so I would like to add a different random number to every cell in my pandas dataframe. This code works, but seems unpythonic. Is there a better way?
import pandas as pd
import numpy as np
df = pd.DataFrame(0.0, index=[1,2,3,4,5], columns=list('ABC') )
print(df)
for x, line in df.iterrows():
    for col in df:
        line[col] = line[col] + (np.random.rand() - 0.5) / 1000.0
print(df)
df + np.random.rand(*df.shape) / 10000.0
OR
Let's use applymap:
df = pd.DataFrame(1.0, index=[1,2,3,4,5], columns=list('ABC') )
df.applymap(lambda x: x + np.random.rand()/10000.0)
output: a DataFrame of the same shape, with a different small random offset added to each cell (every value slightly above 1.0).
A more succinct and equivalent method:
In [147]:
df = pd.DataFrame((np.random.rand(5,3) - 0.5)/1000.0, columns=list('ABC'))
df
Out[147]:
A B C
0 0.000381 -0.000167 0.000020
1 0.000482 0.000007 -0.000281
2 -0.000032 -0.000402 -0.000251
3 -0.000037 -0.000319 0.000260
4 -0.000035 0.000178 0.000166
If you're doing this to an existing df with non-zero values then add:
In [149]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df
Out[149]:
A B C
0 -1.705644 0.149067 0.835378
1 -0.956335 -0.586120 0.212981
2 0.550727 -0.401768 1.421064
3 0.348885 0.879210 0.136858
4 0.271063 0.132579 1.233789
In [154]:
df.add((np.random.rand(df.shape[0], df.shape[1]) - 0.5)/1000.0)
Out[154]:
A B C
0 -1.705459 0.148671 0.835761
1 -0.956745 -0.586382 0.213339
2 0.550368 -0.401651 1.421515
3 0.348938 0.878923 0.136914
4 0.270864 0.132864 1.233622
For nonzero data:
df + (np.random.rand(*df.shape) - 0.5) * 0.001
OR
df + np.random.uniform(-0.01, 0.01, df.shape)
For cases where your data frame contains zeros that you wish to keep as zero:
df * (1 + (np.random.rand(*df.shape) - 0.5) * 0.001)
OR
df * (1 + np.random.uniform(-0.01, 0.01, df.shape))
I think either of these should work; it's a case of generating a random array of the same shape as your existing df and adding it to it (or multiplying by 1 + noise where you want zeros to remain zero). With the uniform function you can control the scale of the noise by altering the 0.01 value.
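Putting it together, a minimal runnable sketch of both variants (the scale factor is arbitrary and the data is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))

scale = 0.001                                  # controls the noise magnitude
noise = np.random.uniform(-scale, scale, df.shape)

noisy_additive = df + noise              # shifts every cell by a small amount
noisy_multiplicative = df * (1 + noise)  # keeps existing zeros exactly zero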

How to add a new column, formed from conditional statements, to a table?

I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
if x >= data["Y"]:
Data["Z"] = 1
else:
Data["Z"] = 0
You can do this without a loop by using ge, which means greater than or equal to, and casting the boolean array to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
Regarding your attempt:
for x in data["X"]:
if x >= data["Y"]:
Data["Z"] = 1
else:
Data["Z"] = 0
it wouldn't work: firstly, you're using Data, not data; secondly, even with that fixed, you'd be comparing a scalar against an entire column, which raises an error because the truth value of such a comparison is ambiguous; and thirdly, you're assigning to the entire column on every iteration, overwriting it each time.
You need to access the index label, which your loop didn't; you can use iteritems to do this:
In [125]:
for idx, x in df["X"].iteritems():
if x >= df['Y'].loc[idx]:
df.loc[idx, 'Z'] = 1
else:
df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary, as there is the vectorised method shown above.
Firstly, note a simple bug: your loop body references Data (capitalized) while your dataframe is named data, which alone will raise a NameError.
For efficient code, though, EdChum has a great answer above. Another method, similar in efficiency to that vectorised answer but with easier code to remember:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)
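A quick end-to-end sketch of the np.where approach, building the sample data inline instead of reading the CSV (so it runs as-is):
import numpy as np
import pandas as pd

# stand-in for data = pd.read_csv("XYZ.csv")
data = pd.DataFrame({'ID': [1, 2, 3], 'X': [10, 20, 21], 'Y': [3, 23, 34]})

data['Z'] = np.where(data.X >= data.Y, 1, 0)
print(data)
#    ID   X   Y  Z
# 0   1  10   3  1
# 1   2  20  23  0
# 2   3  21  34  0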
