I'm trying to do some data crunching in Python and I have a nested loop that does some arithmetic calculations. The inner loop is executed 20,000 times, so the following piece of code takes a long time:
for foo in foo_list:
    # get bar_list for foo
    for bar in bar_list:
        # do calculations w/ foo & bar
Could this loop be made faster using NumPy or SciPy?
Use NumPy:
import numpy as np
foo = np.array(foo_list)[:,None]
bar = np.array(bar_list)[None,:]
Then
foo + bar
or any other arithmetic operation creates an array of shape (len(foo), len(bar)) with the respective results, thanks to broadcasting.
Example:
>>> foo_list = [10, 20, 30]
>>> bar_list = [4, 5]
>>> foo = np.array(foo_list)[:,None]
>>> bar = np.array(bar_list)[None,:]
>>> 2 * foo + bar
array([[24, 25],
       [44, 45],
       [64, 65]])
I've used NumPy for image processing. Before, I was looping with for x in row: for y in column (or vice versa, you get the idea).
That was fine for small images but would happily consume RAM. Instead I switched to numpy.array operations. Much faster.
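As a rough illustration (a minimal sketch with a made-up array size, not my original code), the difference between the two styles looks like this:

import numpy as np

img = np.random.randint(0, 256, size=(480, 640), dtype=np.uint16)

# Nested-loop version: one Python-level operation per pixel.
out_loop = np.empty_like(img)
for x in range(img.shape[0]):
    for y in range(img.shape[1]):
        out_loop[x, y] = img[x, y] * 2 + 10

# Vectorised version: one NumPy expression over the whole array.
out_vec = img * 2 + 10

assert (out_loop == out_vec).all()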
Depending on what is actually going on in your loop, yes.
NumPy provides arrays and matrices that support fast whole-array indexing, which speeds up execution and, in some cases, can eliminate looping altogether.
Indexing example:
import magic_square as ms  # helper module that builds an N x N magic square

a = ms.magic(5)
print(a)  # a is an array
[[17 24  1  8 15]
 [23  5  7 14 16]
 [ 4  6 13 20 22]
 [10 12 19 21  3]
 [11 18 25  2  9]]

# Indexing example: keep the rows whose second column is > 10,
# then multiply those rows by 10.
b = a[a[:, 1] > 10] * 10
print(b)
[[170 240  10  80 150]
 [100 120 190 210  30]
 [110 180 250  20  90]]
It should be clear how indexing can substantially improve your speed when analyzing one or more arrays. It's a powerful tool...
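To make the loop-elimination point concrete, here is a minimal sketch (a toy array, no dependence on the magic_square helper) comparing an explicit Python loop with the equivalent boolean-indexing one-liner:

import numpy as np

a = np.arange(25).reshape(5, 5)

# Explicit loop: build the filtered, scaled rows one at a time.
rows = []
for row in a:
    if row[1] > 10:
        rows.append(row * 10)
loop_result = np.array(rows)

# Boolean indexing: the same result in one vectorised expression.
indexed_result = a[a[:, 1] > 10] * 10

assert (loop_result == indexed_result).all()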
If these are aggregation statistics, consider using Python Pandas. For example, if you want to do something with all the different (foo, bar) pairs, you can simply group by those items and then apply vectorized NumPy operations:
import pandas, numpy as np

df = pandas.DataFrame(
    {'foo': [1, 2, 3, 3, 5, 5],
     'bar': ['a', 'b', 'b', 'b', 'c', 'c'],
     'colA': [1, 2, 3, 4, 5, 6],
     'colB': [7, 8, 9, 10, 11, 12]})
print(df.to_string())

# Compute the average of 'colA' weighted by the values in 'colB', for each
# unique group of (foo, bar).
weighted_avgs = df.groupby(['foo', 'bar']).apply(
    lambda x: (1.0 * x['colA'] * x['colB']).sum() / x['colB'].sum())
print(weighted_avgs.to_string())
This prints the following for just the data object:
bar colA colB foo
0 a 1 7 1
1 b 2 8 2
2 b 3 9 3
3 b 4 10 3
4 c 5 11 5
5 c 6 12 5
And this is the grouped, aggregated output:
foo bar
1 a 1.000000
2 b 2.000000
3 b 3.526316
5 c 5.521739
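If the groupby/apply call itself becomes a bottleneck, one hedged alternative (a sketch assuming the same df as above) is to precompute the product column so the aggregation only uses vectorized grouped sums rather than a Python-level lambda:

# Same weighted average without apply(): column arithmetic plus two grouped sums.
tmp = df.assign(prod=df['colA'] * df['colB'])
grouped = tmp.groupby(['foo', 'bar'])
weighted_avgs_fast = grouped['prod'].sum() / grouped['colB'].sum()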
Related
I have a pandas dataframe. From several of its columns, I need to select the value of just one into a single new column, according to the ID (bar in this example) of that row.
I need the fastest way to do this.
The dataframe it applies to looks like this:
foo bar ID_A ID_B ID_C ID_D ID_E ...
1 B 1.5 2.3 4.1 0.5 6.6 ...
2 E 3 4 5 6 7 ...
3 A 9 6 3 8 1 ...
4 C 13 5 88 9 0 ...
5 B 6 4 6 9 4 ...
...
An example of a way to do it (my fastest at present) is shown below; however, it is still too slow for my purposes.
df.loc[df.bar=='A', 'baz'] = df.ID_A
df.loc[df.bar=='B', 'baz'] = df.ID_B
df.loc[df.bar=='C', 'baz'] = df.ID_C
df.loc[df.bar=='D', 'baz'] = df.ID_D
df.loc[df.bar=='E', 'baz'] = df.ID_E
df.loc[df.bar=='F', 'baz'] = df.ID_F
df.loc[df.bar=='G', 'baz'] = df.ID_G
Result will be like this (after dropping used columns):
foo baz
1 2.3
2 7
3 9
4 88
5 4
...
I have tried .apply() and it was very slow.
I also tried np.where(), which was still much slower; the example shown above was about 1000% faster than the np.where() version.
Would appreciate recommendations!
Many thanks
EDIT: after the first few answers, I think I need to add this:
"whilst I would appreciate runtime estimate relative to the example, I know it's a small example so may be tricky.
My actual data has 280000 rows and an extra 50 columns (which I need to keep along with foo and baz). I have to reduce 13 columns to the single column per the example.
The speed is the only reason for asking, & no mention of speed thus far in first few responses. Thanks again!"
You can use a variant of the indexing lookup:
import numpy as np
import pandas as pd

idx, cols = pd.factorize('ID_' + df['bar'])
out = pd.DataFrame({'foo': df['foo'],
                    'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
output:
foo baz
0 1 2.3
1 2 7.0
2 3 9.0
3 4 88.0
4 5 4.0
testing speed
Setting up a test dataset (280k rows, 52 ID columns):
from string import ascii_uppercase, ascii_lowercase

letters = list(ascii_lowercase + ascii_uppercase)
N = 280_000
np.random.seed(0)
df = (pd.DataFrame({'foo': np.arange(1, N + 1),
                    'bar': np.random.choice(letters, size=N)})
        .join(pd.DataFrame(np.random.random(size=(N, len(letters))),
                           columns=[f'ID_{l}' for l in letters]))
      )
speed testing:
%%timeit
idx, cols = pd.factorize('ID_'+df['bar'])
out = pd.DataFrame({'foo': df['foo'],
                    'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
output:
54.4 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
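For comparison, a pure-NumPy variant of the same lookup idea is sketched below (the col_pos/pos/vals names are only illustrative, and timings would need to be measured on your own data):

# Map each bar letter to the position of its ID_ column, then gather one
# value per row with take_along_axis.
id_cols = [c for c in df.columns if c.startswith('ID_')]
col_pos = {c: i for i, c in enumerate(id_cols)}

pos = ('ID_' + df['bar']).map(col_pos).to_numpy()
vals = df[id_cols].to_numpy()
baz = np.take_along_axis(vals, pos[:, None], axis=1).ravel()

out2 = pd.DataFrame({'foo': df['foo'], 'baz': baz})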
You can try this. It should generalize to an arbitrary number of columns.
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 'B', 1.5, 2.3, 4.1, 0.5, 6.6],
                   [2, 'E', 3, 4, 5, 6, 7],
                   [3, 'A', 9, 6, 3, 8, 1],
                   [4, 'C', 13, 5, 88, 9, 0],
                   [5, 'B', 6, 4, 6, 9, 4]])
df.columns = ['foo', 'bar', 'ID_A', 'ID_B', 'ID_C', 'ID_D', 'ID_E']

for val in np.unique(df['bar'].values):
    df.loc[df.bar == val, 'baz'] = df[f'ID_{val}']
To show an alternative approach, you can perform a combination of melting your data and reindexing. In this case I used wide_to_long (instead of melt/stack) because of the patterned nature of your column names:
out = (
    pd.wide_to_long(
        df, stubnames=['ID'], i=['foo', 'bar'], j='', sep='_', suffix=r'\w+'
    )
    .loc[lambda d:
         d.index.get_level_values('bar') == d.index.get_level_values(level=-1),
         'ID']
    .droplevel(-1)
    .rename('baz')
    .reset_index()
)
print(out)
foo bar baz
0 1 B 2.3
1 2 E 7.0
2 3 A 9.0
3 4 C 88.0
4 5 B 4.0
An alternative to the above leverages .melt and .query to shorten the code:
out = (
    df.melt(id_vars=['foo', 'bar'], var_name='id', value_name='baz')
      .assign(id=lambda d: d['id'].str.get(-1))
      .query('bar == id')
)
print(out)
foo bar id baz
2 3 A A 9.0
5 1 B B 2.3
9 5 B B 4.0
13 4 C C 88.0
21 2 E E 7.0
I would like to know if there exists a similar way in Python of doing this (shown in Mathematica):

[Mathematica code shown as an image in the original post]

I have tried it in Python and it does not work. I have also tried it with numpy.put() and with two simple for loops. These two ways work properly, but I find them very time consuming with larger matrices (3000×3000 elements, for example).
The problem, described in Python:
import numpy as np
a = np.arange(0, 25, 1).reshape(5, 5)
b = np.arange(100, 500, 100).reshape(2, 2)
p = np.array([0, 3])
a[p][:, p] = b
which leaves the matrix a unchanged.
Perhaps you are looking for this:
a[p[...,None], p] = b
Array a after the above assignment looks like this:
[[100 1 2 200 4]
[ 5 6 7 8 9]
[ 10 11 12 13 14]
[300 16 17 400 19]
[ 20 21 22 23 24]]
As documented in Integer Array Indexing, the two integer index arrays are broadcast together and iterated together, which effectively indexes the locations a[0,0], a[0,3], a[3,0], and a[3,3]. The assignment statement then performs an element-wise assignment at these locations of a, using the respective element values from the right-hand side.
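An equivalent way to build the same cross-product index, if you find it more readable, is np.ix_ (a minimal sketch using the arrays from the question):

import numpy as np

a = np.arange(0, 25, 1).reshape(5, 5)
b = np.arange(100, 500, 100).reshape(2, 2)
p = np.array([0, 3])

# np.ix_ builds an open mesh from the index vectors, so this assigns b to the
# 2x2 block at rows p and columns p, just like a[p[..., None], p] = b.
a[np.ix_(p, p)] = b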
I need to apply a function to a subset of columns in a dataframe. Consider the following toy example:
pdf = pd.DataFrame({'a' : [1, 2, 3], 'b' : [2, 3, 4], 'c' : [5, 6, 7]})
arb_cols = ['a', 'b']
What I want to do is this:
[pdf[c] = pdf[c].apply(lambda x: 99 if x == 2 else x) for c in arb_cols]
But this is invalid syntax. Is it possible to accomplish such a task without a for loop?
With mask:
pdf.mask(pdf.loc[:, arb_cols] == 2, 99).assign(c=pdf.c)
Out[1190]:
a b c
0 1 99 5
1 99 3 6
2 3 4 7
Or with assign:
pdf.assign(**pdf.loc[:, arb_cols].mask(pdf.loc[:, arb_cols] == 2, 99))
Out[1193]:
a b c
0 1 99 5
1 99 3 6
2 3 4 7
Do not use pd.Series.apply when you can use vectorised functions.
For example, the below should be efficient for larger dataframes even though there is an outer loop:
for col in arb_cols:
    pdf.loc[pdf[col] == 2, col] = 99
Another option it to use pd.DataFrame.replace:
pdf[arb_cols] = pdf[arb_cols].replace(2, 99)
Yet another option is to use numpy.where:
import numpy as np
pdf[arb_cols] = np.where(pdf[arb_cols] == 2, 99, pdf[arb_cols])
For this case it would probably be better to use applymap if you need to apply a custom function:
pdf[arb_cols] = pdf[arb_cols].applymap(lambda x : 99 if x == 2 else x)
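As a side note (not part of the original answer): in recent pandas releases applymap has been deprecated in favour of DataFrame.map, so on those versions the same line would be written as:

pdf[arb_cols] = pdf[arb_cols].map(lambda x: 99 if x == 2 else x)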
df=pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9])
I'd like to split df into a specified number of groups and sum all elements in each group. For example, dividing df into 4 groups
[1, 4, 1, 3], [2, 8, 3, 6], [3, 7, 3, 1], [2, 9]
would result in
9
19
14
11
I could do df.groupby(np.arange(len(df))//4).sum(), but that groups by chunks of four rows rather than splitting into four groups, so it won't work for larger dataframes.
For example
df1=pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9,1,5,3,4])
df1.groupby(np.arange(len(df1))//4).sum()
creates 5 groups instead of 4
You can use numpy.array_split:
df=pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9,1,5,3,4])
a = pd.Series([x.values.sum() for x in np.array_split(df, 4)])
print (a)
0 11
1 27
2 15
3 13
dtype: int64
Solution with concat and sum:
a = pd.concat(np.array_split(df, 4), keys=np.arange(4)).sum(level=0)
print (a)
0
0 11
1 27
2 15
3 13
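A small aside (not from the original answer): sum(level=0) has since been removed from pandas, so on newer versions the equivalent grouped sum is:

a = pd.concat(np.array_split(df, 4), keys=np.arange(4)).groupby(level=0).sum()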
Say you have this data frame:
df = pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9])
You can achieve it using a list comprehension and loc:
group_size = 4
[df.loc[i:i+group_size-1].values.sum() for i in range(0, len(df), group_size)]
Output:
[9, 19, 14, 11]
I looked in the comments, and I thought you could use some explicit Python code when the "usual" pandas functions can't fulfil your needs.
So:
import pandas as pd

def get_sum(a, chunks):
    # Walk the frame in steps of `chunks` rows and yield each chunk's sum.
    for k in range(0, len(a), chunks):
        yield a[k:k + chunks].values.sum()

df = pd.DataFrame([1, 4, 1, 3, 2, 8, 3, 6, 3, 7, 3, 1, 2, 9])
group_size = list(get_sum(df, 4))
print(group_size)
Output:
[9, 19, 14, 11]
I want to get a 2D numpy array from a column of a pandas dataframe df that has a numpy vector in each row. But if I do
df.values.shape
I get: (3,) instead of getting: (3,5)
(assuming that each numpy vector in the dataframe has 5 dimensions, and that the dataframe has 3 rows)
What is the correct method?
Ideally, avoid getting into this situation by finding a different way to define the DataFrame in the first place. However, if your DataFrame looks like this:
s = pd.Series([np.random.randint(20, size=(5,)) for i in range(3)])
df = pd.DataFrame(s, columns=['foo'])
# foo
# 0 [4, 14, 9, 16, 5]
# 1 [16, 16, 5, 4, 19]
# 2 [7, 10, 15, 13, 2]
then you could convert it to a DataFrame of shape (3,5) by calling pd.DataFrame on a list of arrays:
pd.DataFrame(df['foo'].tolist())
# 0 1 2 3 4
# 0 4 14 9 16 5
# 1 16 16 5 4 19
# 2 7 10 15 13 2
pd.DataFrame(df['foo'].tolist()).values.shape
# (3, 5)
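As a hedged alternative (assuming every cell of the column really holds an equal-length 1D array), np.stack on the column yields the 2D array directly:

import numpy as np

arr = np.stack(df['foo'].to_numpy())
arr.shape
# (3, 5)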
I am not sure what you want. But df.values.shape seems to be giving the correct result.
import pandas as pd
import numpy as np
from pandas import DataFrame

df3 = DataFrame(np.random.randn(3, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(df3)
#           a         b         c         d         e
# 0 -0.221059  1.206064 -1.359214  0.674061  0.547711
# 1  0.246188  0.628944  0.528552  0.179939 -0.019213
# 2  0.080049  0.579549  1.790376 -1.301700  1.372702

df3.values.shape
# (3, 5)

df3["a"]
# 0   -0.221059
# 1    0.246188
# 2    0.080049

df3[:1]
#           a         b         c         d         e
# 0 -0.221059  1.206064 -1.359214  0.674061  0.547711