How to implement a (more complex) R plyr chain in Python?

I've been trying to implement the following plyr chain in Python:
# Data
data_L1
X  Y   r2  contact_id  acknowledge_issues
a  c  100  xyzx                         0
b  d  100  fsdjkfl                      0
a  c   80  ejrkl                       20
b  d   60  fdsdl                       40
b  d   80  gsdkf                       20
# Transformation
test <- ddply(data_L1,
              .(X, Y),
              summarize,
              avg_r2 = mean(r2),
              tickets = length(unique(contact_id)),
              er_ai = length(acknowledge_issues[which(acknowledge_issues > 0)]) / length(acknowledge_issues)
)
# Output
test
X Y avg_r2 tickets er_ai
a c 90 2 0.5
b d 80 3 0.6667
However, I only got this far in Python:
test = data_L1.groupby(['X','Y']).agg({'r2': 'mean', 'contact_id' : 'count'})
I can't figure out how to create the er_ai variable in Python. Do you have suggestions for a solution in pandas or another library?

Use nunique instead of count, and for er_ai take the mean of a boolean condition:
cols = {'r2': 'avg_r2', 'contact_id': 'tickets', 'acknowledge_issues': 'er_ai'}
test = (data_L1.groupby(['X','Y'], as_index=False)
               .agg({'r2': 'mean',
                     'contact_id': 'nunique',
                     'acknowledge_issues': lambda x: (x > 0).mean()})
               .rename(columns=cols))
print(test)
X Y tickets er_ai avg_r2
0 a c 2 0.500000 90
1 b d 3 0.666667 80
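For newer pandas (0.25+) the same aggregation can also be written with named aggregation, which avoids the separate rename step. A minimal, self-contained sketch using the sample data from the question:
import pandas as pd

# Sample data from the question
data_L1 = pd.DataFrame({
    'X': ['a', 'b', 'a', 'b', 'b'],
    'Y': ['c', 'd', 'c', 'd', 'd'],
    'r2': [100, 100, 80, 60, 80],
    'contact_id': ['xyzx', 'fsdjkfl', 'ejrkl', 'fdsdl', 'gsdkf'],
    'acknowledge_issues': [0, 0, 20, 40, 20],
})

# Named aggregation: output column = (input column, aggregation)
test = (data_L1.groupby(['X', 'Y'], as_index=False)
               .agg(avg_r2=('r2', 'mean'),
                    tickets=('contact_id', 'nunique'),
                    er_ai=('acknowledge_issues', lambda x: (x > 0).mean())))
print(test)
#    X  Y  avg_r2  tickets     er_ai
# 0  a  c    90.0        2  0.500000
# 1  b  d    80.0        3  0.666667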

Related

Aggregating and plotting multiple columns using matplotlib

I've got data in a pandas dataframe that looks like this:
ID A B C D
100 0 1 0 1
101 1 1 0 1
102 0 0 0 1
...
The idea is to create a bar chart that shows the total of each column (the sum of all the A's, B's, etc.). Something like:
X
X X
x X X
A B C D
This should be so simple...
Set 'ID' aside, sum, and plot.bar:
df.set_index('ID').sum().plot.bar()
# or
df.drop(columns=['ID']).sum().plot.bar()
output: a bar chart with one bar per column (A through D), sized by that column's total
Just for fun, a plain-text version of the same totals:
print(df.drop(columns='ID')
        .replace({0: ' ', 1: 'X'})
        .apply(sorted, reverse=True)
        .to_string(index=False)
)
Output:
A B C D
X X   X
  X   X
      X
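For completeness, a minimal self-contained sketch of the set_index + sum approach, built from the sample rows shown in the question (the plot call assumes matplotlib is installed):
import pandas as pd
import matplotlib.pyplot as plt

# Sample rows from the question
df = pd.DataFrame({'ID': [100, 101, 102],
                   'A':  [0, 1, 0],
                   'B':  [1, 1, 0],
                   'C':  [0, 0, 0],
                   'D':  [1, 1, 1]})

# Sum every indicator column and draw one bar per column
df.set_index('ID').sum().plot.bar()
plt.show()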

Calculating % for value in column based on condition or value

I have the table below and I want to get, for each type, the % of rows that are >= 10 seconds. What is an efficient, modular way to do that? I would normally just filter for each type and then divide, but I wanted to know if there is a better way to calculate the percentage of values in each Type that are >= 10 seconds.
Thanks
Type  Seconds
A          23
V          10
V          10
A           7
B           1
V          10
B          72
A          11
V          19
V           3
expected output:
type %
A .67
V .80
B .50
A slightly more efficient option is to create a boolean mask of Seconds.ge(10) and use groupby.mean() on the mask:
df.Seconds.ge(10).groupby(df.Type).mean().reset_index(name='%')
# Type %
# 0 A 0.666667
# 1 B 0.500000
# 2 V 0.800000
For comparison, here are the three approaches in this thread written as functions:
mask_groupby_mean = lambda df: df.Seconds.ge(10).groupby(df.Type).mean().reset_index(name='%')
groupby_apply = lambda df: df.groupby('Type').Seconds.apply(lambda x: (x.ge(10).sum() / len(x)) * 100).reset_index(name='%')
set_index_mean = lambda df: df.set_index('Type').ge(10).groupby(level=0).mean().rename(columns={'Seconds': '%'}).reset_index()
You can use .groupby:
x = (
    df.groupby("Type")["Seconds"]
    .apply(lambda x: (x.ge(10).sum() / len(x)) * 100)
    .reset_index(name="%")
)
print(x)
Prints:
Type %
0 A 66.666667
1 B 50.000000
2 V 80.000000
Another option: set_index + ge, then a group mean over the index level:
new_df = (
    df.set_index('Type')['Seconds'].ge(10)
      .groupby(level=0, sort=False).mean()   # older pandas: .mean(level=0)
      .round(2)
      .reset_index(name='%')
)
new_df:
Type %
0 A 0.67
1 V 0.80
2 B 0.50
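For reference, a self-contained sketch of the boolean-mask approach on the sample data from the question:
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'Type': ['A', 'V', 'V', 'A', 'B', 'V', 'B', 'A', 'V', 'V'],
                   'Seconds': [23, 10, 10, 7, 1, 10, 72, 11, 19, 3]})

# Fraction of rows per Type with Seconds >= 10: mean of a boolean mask per group
out = df['Seconds'].ge(10).groupby(df['Type']).mean().reset_index(name='%')
print(out)
#   Type         %
# 0    A  0.666667
# 1    B  0.500000
# 2    V  0.800000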

How to compute a selective norm for a Multi-Index DF

I have measurement data in MultiIndex spreadsheet format and need to compute a norm by dividing each value of a column by its corresponding reference value.
How can this be done efficiently and 'readable' using Python Pandas, i.e. how do I filter the correct reference value in order to compute the normed values?
Here's the input data:
            result
var run ID
10  1   A       10
        B       50
    2   A       30
        B       70
20  1   A      100
        B      500
    2   A      300
        B      700
30  1   A     1000
        B     5000
    2   A     3000
        B     7000
and this is the desired result:
            normed
var run ID
10  1   A      0.1
        B      0.1
    2   A      0.1
        B      0.1
20  1   A      1.0
        B      1.0
    2   A      1.0
        B      1.0
30  1   A     10.0
        B     10.0
    2   A     10.0
        B     10.0
As can be seen, var = 20 is the reference, but it gets even more complicated since there are two runs (1 and 2) and two devices under test.
I can create a mask df[df['var'] == 20] when the DF is flattened using df.reset_index(), but I don't know how to proceed from there.
Any help is deeply appreciated!
Update
I have found a solution using query() in a for loop:
df_norm = pd.DataFrame()
df_flat = df.reset_index()
var_ref = 20
for ident in 'A', 'B':
    for run in 1, 2:
        q = f'var == {var_ref} & run == {run} & ID == "{ident}"'
        ref = df_flat.query(q)
        #ref
        #ref.result
        #ref.result.iloc[0]
        q = f'run == {run} & ID == "{ident}"'
        df_m = df_flat.query(q)
        norm = df_m.result / ref.result.iloc[0]
        #norm
        df__ = pd.DataFrame(norm.rename('norm'))
        df__ = df_flat.merge(df__, left_index=True, right_index=True)
        df_norm = pd.concat([df_norm, df__])
df_norm.sort_index()
Maybe there's a more elegant way to do it?
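A more compact alternative (a sketch, not from the thread above), assuming var == 20 is always the reference and each (run, ID) pair has exactly one reference row: group on the remaining index levels and divide each group by its reference value with transform.
import pandas as pd

# Rebuild the sample MultiIndex data from the question
idx = pd.MultiIndex.from_product([[10, 20, 30], [1, 2], ['A', 'B']],
                                 names=['var', 'run', 'ID'])
df = pd.DataFrame({'result': [10, 50, 30, 70,
                              100, 500, 300, 700,
                              1000, 5000, 3000, 7000]}, index=idx)

var_ref = 20  # reference level of 'var'

# Within each (run, ID) group, divide every result by the var == var_ref result
normed = (df.groupby(['run', 'ID'])['result']
            .transform(lambda s: s / s.xs(var_ref, level='var').iloc[0])
            .rename('normed'))
print(normed.sort_index())  # matches the desired 'normed' output in the question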

pandas apply and applymap functions are taking long time to run on large dataset

I have two functions applied to a dataframe:
res = df.apply(lambda x:pd.Series(list(x)))
res = res.applymap(lambda x: x.strip('"') if isinstance(x, str) else x)
Update: The dataframe has almost 700,000 rows, so this is taking a long time to run.
How to reduce the running time?
Sample data:
A
----------
0 [1,4,3,c]
1 [t,g,h,j]
2 [d,g,e,w]
3 [f,i,j,h]
4 [m,z,s,e]
5 [q,f,d,s]
output:
A B C D E
-------------------------
0 [1,4,3,c] 1 4 3 c
1 [t,g,h,j] t g h j
2 [d,g,e,w] d g e w
3 [f,i,j,h] f i j h
4 [m,z,s,e] m z s e
5 [q,f,d,s] q f d s
This line of code, res = df.apply(lambda x:pd.Series(list(x))), takes the items from each list and fills them into separate columns, one per item, as shown above. There will be almost 38 columns.
I think:
res = df.apply(lambda x:pd.Series(list(x)))
should be changed to:
df1 = pd.DataFrame(df['A'].values.tolist())
print (df1)
0 1 2 3
0 1 4 3 c
1 t g h j
2 d g e w
3 f i j h
4 m z s e
5 q f d s
And for the second line, if the columns do not mix numeric and string values:
cols = df1.select_dtypes(object).columns
df1[cols] = df1[cols].apply(lambda x: x.str.strip('"'))
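Putting both steps together, a minimal sketch on three of the sample rows, assuming the list elements are strings (which the quote-stripping in the question implies):
import pandas as pd

# Sample column of lists, as in the question
df = pd.DataFrame({'A': [['1', '4', '3', 'c'],
                         ['t', 'g', 'h', 'j'],
                         ['d', 'g', 'e', 'w']]})

# Expand each list into its own columns without a row-wise apply
res = pd.DataFrame(df['A'].values.tolist(), index=df.index)

# Strip surrounding double quotes only from string (object) columns
cols = res.select_dtypes(object).columns
res[cols] = res[cols].apply(lambda x: x.str.strip('"'))

# Note: the expanded columns are named 0..3 here rather than B..E
print(df.join(res))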

Calculate within categories: Equivalent of R's ddply in Python?

I have some R code I need to port to python. However, R's magic data.frame and ddply are keeping me from finding a good way to do this in python.
Sample data (R):
x <- data.frame(d=c(1,1,1,2,2,2),c=c(rep(c('a','b','c'),2)),v=1:6)
Sample computation:
y <- ddply(x, 'd', transform, v2=(v-min(v))/(max(v)-min(v)))
Sample output:
d c v v2
1 1 a 1 0.0
2 1 b 2 0.5
3 1 c 3 1.0
4 2 a 4 0.0
5 2 b 5 0.5
6 2 c 6 1.0
So here's my question for the pythonistas out there: how would you do the same? You have a data structure with a couple of important dimensions.
For each (c) and each (d), compute (v - min(v)) / (max(v) - min(v)) and associate it with the corresponding (d, c) pair.
Feel free to use whatever data structures you want, so long as they're quick on reasonably large datasets (those that fit in memory).
Indeed pandas is the right (and only, I believe) tool for this in Python. It's a bit less magical than plyr but here's how to do this using the groupby functionality:
import numpy as np
from pandas import DataFrame

df = DataFrame({'d' : [1.,1.,1.,2.,2.,2.],
                'c' : np.tile(['a','b','c'], 2),
                'v' : np.arange(1., 7.)})
# in IPython
In [34]: df
Out[34]:
c d v
0 a 1 1
1 b 1 2
2 c 1 3
3 a 2 4
4 b 2 5
5 c 2 6
Now write a small transform function:
def f(group):
    v = group['v']
    group['v2'] = (v - v.min()) / (v.max() - v.min())
    return group
Note that this also handles NAs since the v variable is a pandas Series object.
Now group by the d column and apply f:
In [36]: df.groupby('d').apply(f)
Out[36]:
c d v v2
0 a 1 1 0
1 b 1 2 0.5
2 c 1 3 1
3 a 2 4 0
4 b 2 5 0.5
5 c 2 6 1
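On recent pandas versions the same result can also be computed in one line with groupby().transform, without writing a custom function (a sketch, not from the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'d': [1., 1., 1., 2., 2., 2.],
                   'c': np.tile(['a', 'b', 'c'], 2),
                   'v': np.arange(1., 7.)})

# Min-max scale v within each d group, mirroring the ddply transform call
df['v2'] = df.groupby('d')['v'].transform(lambda v: (v - v.min()) / (v.max() - v.min()))
print(df)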
Sounds like you want pandas and group by or aggregate.
You can also get better performance using numpy and scipy.
Despite somewhat uglier code, it will be faster: the pandas approach will be slow if the number of groups is very large, and may even be worse than R. This will always be faster than R:
import numpy as np
import numpy.lib.recfunctions
from scipy import ndimage
# record array with fields d, c, v
x = np.rec.fromarrays(([1,1,1,2,2,2], ['a','b','c']*2, range(1, 7)), names='d,c,v')
# 'groups' maps each row to the position of its unique d value
unique, groups = np.unique(x['d'], False, True)
uniques = range(unique.size)
# per-group min/max of v, broadcast back to row order via [groups]
mins = ndimage.minimum(x['v'], groups, uniques)[groups]
maxs = ndimage.maximum(x['v'], groups, uniques)[groups]
# append the normalized column and save as csv
x2 = np.lib.recfunctions.append_fields(x, 'v2', (x['v'] - mins)/(maxs - mins + 0.0))
np.savetxt('file.csv', x2, delimiter=';')
