Python: fast subsetting and looping dataframe - python

I have the following minimal code, which is too slow. For the 1000 rows I need, it takes about 2 minutes. I need it to run faster.
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 1000, size=(1000, 4)), columns=list('ABCD'))

start_algorithm = time.time()
myunique = df['D'].unique()
for i in myunique:
    itemp = df[df['D'] == i]
    for j in myunique:
        jtemp = df[df['D'] == j]
I know that NumPy can make this run much faster, but keep in mind that I want to keep a part of the original dataframe (or a NumPy array) for specific values of column 'D'. How can I improve its performance?

Avoid computing the sub-DataFrame df[df['D'] == i] more than once. The original code computes this len(myunique)**2 times. Instead you can compute this once for each i (that is, len(myunique) times in total), store the results, and then pair them together later. For example,
groups = [grp for di, grp in df.groupby('D')]
for itemp, jtemp in IT.product(groups, repeat=2):
    pass
import numpy as np
import pandas as pd
import itertools as IT

df = pd.DataFrame(np.random.randint(0, 1000, size=(1000, 4)), columns=list('ABCD'))

def using_orig():
    myunique = df['D'].unique()
    for i in myunique:
        itemp = df[df['D'] == i]
        for j in myunique:
            jtemp = df[df['D'] == j]

def using_groupby():
    groups = [grp for di, grp in df.groupby('D')]
    for itemp, jtemp in IT.product(groups, repeat=2):
        pass
In [28]: %timeit using_groupby()
10 loops, best of 3: 63.8 ms per loop
In [31]: %timeit using_orig()
1 loop, best of 3: 2min 22s per loop
Regarding the comment:
I can easily replace itemp and jtemp with a=1 or print "Hello" so ignore that
The answer above addresses how to compute itemp and jtemp more efficiently. If itemp and jtemp are not central to your real calculation, then we would need to better understand what you really want to compute in order to suggest (if possible) a way to compute it faster.
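If you also need to know which value of 'D' each sub-DataFrame belongs to, a small variation on the same idea is to key the groups by their label; a sketch, not part of the original answer:
# Keep each sub-DataFrame keyed by its 'D' value so that specific groups
# can be looked up later without re-filtering the original frame.
groups = dict(tuple(df.groupby('D')))
for (di, itemp), (dj, jtemp) in IT.product(groups.items(), repeat=2):
    pass  # itemp is the sub-frame where D == di, jtemp where D == dj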

Here's a vectorized approach to form the groups based on unique elements from "D" column -
# Sort the dataframe based on the sorted indices of column 'D'
df_sorted = df.iloc[df['D'].argsort()]
# In the sorted dataframe's 'D' column, find the shift/cut indices
# (places where elements change values, indicating change of groups).
# Cut the dataframe at those indices for the final groups with NumPy Split.
cut_idx = np.where(np.diff(df_sorted['D'])>0)[0]+1
df_split = np.split(df_sorted,cut_idx)
Sample testing
1] Form a sample dataframe with random elements :
>>> df = pd.DataFrame(np.random.randint(0,100,size=(5, 4)), columns=list('ABCD'))
>>> df
A B C D
0 68 68 90 39
1 53 99 20 85
2 64 76 21 19
3 90 91 32 36
4 24 9 89 19
2] Run the original code and print the results :
>>> myunique = df['D'].unique()
>>> for i in myunique:
...     itemp = df[df['D'] == i]
...     print(itemp)
...
A B C D
0 68 68 90 39
A B C D
1 53 99 20 85
A B C D
2 64 76 21 19
4 24 9 89 19
A B C D
3 90 91 32 36
3] Run the proposed code and print the results :
>>> df_sorted = df.iloc[df['D'].argsort()]
>>> cut_idx = np.where(np.diff(df_sorted['D'])>0)[0]+1
>>> df_split = np.split(df_sorted,cut_idx)
>>> for split in df_split:
...     print(split)
...
A B C D
2 64 76 21 19
4 24 9 89 19
A B C D
3 90 91 32 36
A B C D
0 68 68 90 39
A B C D
1 53 99 20 85
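Since the question mentions keeping parts of the original dataframe for specific values of 'D', a hedged follow-up: the splits come out in ascending order of 'D', so zipping them with the sorted unique values gives a lookup table (a sketch):
# Map each unique value of 'D' to its sub-DataFrame.
groups_by_value = dict(zip(np.sort(df['D'].unique()), df_split))
# e.g. groups_by_value[19] would be the two rows where D == 19 in the sample above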

Related

Select columns based on a range and single integer using iloc()

Given the data frame below I want to select columns A and D to F.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(5, 6)), columns=list('ABCDEF'))
In R I could say
df[ , c(1, 4:6)]
I want something similar for pandas. With
df.iloc[:, slice(3, 6, 1)]
I get columns D to F. But how can I add column A?
You can use np.r_ to pass combinations of slices and indices. Since you seem to know the labels, you can use get_loc to obtain the iloc positions.
import numpy as np
idx = np.r_[df.columns.get_loc('A'),
            df.columns.get_loc('D'):df.columns.get_loc('F') + 1]
# idx = np.r_[0, 3:6]
df.iloc[:, idx]
A D E F
0 38 71 62 63
1 60 93 72 94
2 57 33 30 51
3 88 54 21 39
4 0 53 41 20
Another option is np.split
df_split = pd.concat([np.split(df, [1], axis=1)[0],
                      np.split(df, [3], axis=1)[1]], axis=1)
There you don't need to know the column names, just their positions.
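As a hedged aside (not from the original answers), plain label-based slicing can also combine the single column with the range:
# Select column 'A' plus the label range 'D' through 'F' (.loc slices are inclusive)
picked = pd.concat([df[['A']], df.loc[:, 'D':'F']], axis=1)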

How to reshape/"topple" a pandas dataframe

Topple is most likely the wrong name for the operation I want, but I cannot think of a better one.
I have N dataframes of shape (100, 3). Each row of an original dataframe contains the name of a test and the two results it produces. I want to reshape a single dataframe to a (1, 200) shape, with all of the values of the tests as a single row. After that I'll append all of the N dataframes into a single one, ending with a (N, 200) dataframe.
Here's an example with dummy data:
import pandas as pd
import numpy as np
import random
import string
np.random.seed(42)
tests = np.random.choice(list(string.ascii_letters),size=(100,1))
results = np.random.randint(0,100,size=(100, 2))
df = pd.DataFrame(np.concatenate([tests, results], axis=1), columns=["Test Name", "ValueA", "ValueB"])
toppled_df = pd.DataFrame(np.random.randint(0,100,size=(1,5)),columns=["Z Value A", "Z ValueB", "t ValueA", "t ValueB", "..."])
toppled_df = pd.DataFrame([[44,64,88,70,"..."]],columns=["M Value A", "M ValueB", "Z ValueA", "Z ValueB", "..."])
toppled_df.head()
A more pythonic way
df_out = df.set_index('Test Name').stack().to_frame().T
df_out.columns = df_out.columns.map(' '.join).str.strip()
You can melt df to a long format, join the test name and value type columns, then transpose it.
tests = np.random.choice(list(string.ascii_letters), size=(100, 1))
results = np.random.randint(0, 100, size=(100, 2))
df = pd.DataFrame(np.concatenate([tests, results], axis=1),
                  columns=["Test Name", "ValueA", "ValueB"])
df2 = df.melt(id_vars='Test Name')
df2['key'] = df2['Test Name'] + ' ' + df2['variable']
df2[['key', 'value']].set_index('key').T
Loop through each dataframe to create its reshaped single-row version, then concatenate the results. For one dataframe:
df2 = df.set_index('Test Name').unstack()
result = pd.DataFrame(data=df2.values.reshape(1,-1), columns=df2.index)
Output:
>>> result
ValueA ... ValueB
Test Name M Z C o Q h u M s w k k x J ... w N u p S r U x z y S O C o
0 44 88 8 0 87 10 7 34 4 27 72 11 32 22 ... 49 30 41 6 89 1 47 68 31 98 47 2 23 32
You can access individual results like this:
result['ValueA', 'M']
# or
result['ValueA']['M']
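To get the final (N, 200) frame described in the question, here is a minimal sketch, assuming dfs is your list of N per-run DataFrames, each with columns ["Test Name", "ValueA", "ValueB"] and the same set of test names:
import pandas as pd

rows = []
for d in dfs:
    # reshape one (100, 3) frame into a single row with "Test Name ValueX" columns
    row = d.set_index('Test Name').stack().to_frame().T
    row.columns = row.columns.map(' '.join).str.strip()
    rows.append(row)
final = pd.concat(rows, ignore_index=True)  # shape (N, 200)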

Improper to reference MultiIndex values of DataFrame?

Pandas seems to resist efforts to use DataFrame index values as if they are column values. As a result I am often copying them into a column so that I can reference them for calculations. Is this a good practice? Or am I missing a "correct" way to reference index values?
Consider the following example:
import random
import numpy as np
import pandas as pd

j = [(a, b) for a in ['A','B','C'] for b in random.sample(range(1, 100), 5)]
i = pd.MultiIndex.from_tuples(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
Now suppose I want to add a column 'SmallestNum' to the DataFrame that lists the smallest index Num for each associated index Name.
Presently the only way I can find to get this to work (assuming that the MultiIndex is large and I don't have it handy as tuples) is to:
First: Copy both index levels into columns of the DataFrame:
df['NameCol'] = df.index.get_level_values(0)
df['NumCol'] = df.index.get_level_values(1)
Otherwise, I can't figure out how I would get the smallest Num value for each Name. At least now I can via:
smallest = pd.DataFrame(df.groupby(['Name'])['NumCol'].min())
Finally, I can merge these data back into the DataFrame as a new column, but only because I can reference the NameCol:
df.merge(smallest.rename(columns={'NumCol' : 'SmallestNum'}), how='left', right_index=True, left_on=['NameCol'])
So is there a way to do this without creating the NameCol and NumCol column copies of the MultiIndex values?
This works:
## get smallest values per Name
vals = df.reset_index(level=1).groupby('Name')['Num'].min()
## map the values to df
df['SmallestNum'] = pd.Series(df.index.get_level_values(0)).map(vals).values
You can use transform:
np.random.seed(456)
j = [(a, b) for a in ['A','B','C'] for b in np.random.randint(1, 100, size=5)]
i = pd.MultiIndex.from_tuples(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
print (df)
Vals
Name Num
A 28 1.180140
44 0.984257
90 1.835646
43 -1.886823
29 0.424763
B 80 -0.433105
61 -0.166838
46 0.754634
38 1.966975
93 0.200671
C 40 0.742752
82 -1.264271
12 -0.112787
78 0.667358
70 0.357900
df['SmallestNum'] = df.reset_index(level=1).groupby('Name')['Num'].transform('min').values
Or:
df['SmallestNum'] = df.groupby('Name').transform(lambda x: x.index.get_level_values(1).min())
print (df)
Vals SmallestNum
Name Num
A 28 1.180140 28
44 0.984257 28
90 1.835646 28
43 -1.886823 28
29 0.424763 28
B 80 -0.433105 38
61 -0.166838 38
46 0.754634 38
38 1.966975 38
93 0.200671 38
C 40 0.742752 12
82 -1.264271 12
12 -0.112787 12
78 0.667358 12
70 0.357900 12
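As a hedged aside (not from the original answers), you can also group on an index level directly, without resetting the index or copying it into columns:
# Take the 'Num' level as a Series aligned to df's MultiIndex, then compute the
# per-'Name' minimum with a level-based groupby; index alignment does the rest.
df['SmallestNum'] = (df.index.get_level_values('Num')
                       .to_series(index=df.index)
                       .groupby(level='Name')
                       .transform('min'))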

Sample rows of pandas dataframe in proportion to counts in a column

I have a large pandas dataframe with about 10,000,000 rows. Each one represents a feature vector. The feature vectors come in natural groups and the group label is in a column called group_id. I would like to randomly sample 10% say of the rows but in proportion to the numbers of each group_id.
For example, if the group_id's are A, B, A, C, A, B then I would like half of my sampled rows to have group_id A, two sixths to have group_id B and one sixth to have group_id C.
I can see the pandas function sample but I am not sure how to use it to achieve this goal.
You can use groupby and sample
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))
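A usage note (not part of the original answer): if you don't want the group label added as an extra index level on the result, group_keys=False keeps the original row index:
sample_df = (df.groupby('group_id', group_keys=False)
               .apply(lambda x: x.sample(frac=0.1)))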
The following samples a total of N rows, where each group appears in its original proportion rounded to the nearest integer, then shuffles and resets the index. Using (with N the desired total sample size):
df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))
Short and sweet:
df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)
Long version
(df.groupby('A', group_keys=False)
   .apply(lambda x: x.sample(int(np.rint(N * len(x) / len(df)))))
   .sample(frac=1)
   .reset_index(drop=True))
I was looking for a similar solution. The code provided by @Vaishali works absolutely fine. What @Abdou is trying to do also makes sense when we want to extract samples from each group based on their proportions of the full data.
# original : 10% from each group
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))
# modified : sample size based on each group's share of the full data
n = df.shape[0]
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=len(x)/n))
This is not as simple as just grouping and using .sample. You need to get the fractions first. Since you said that you are looking to grab 10% of the total number of rows in different proportions, you will need to calculate how much each group has to take out of the main dataframe. For instance, using the split you mentioned in the question, group A ends up with 1/20 of the total number of rows, group B with 1/30 and group C with 1/60. You can put these fractions in a dictionary and then use .groupby and pd.concat to concatenate the sampled rows from each group into a dataframe. You will be using the n parameter of the .sample method instead of the frac parameter.
fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
N = len(df)
pd.concat(dff.sample(n=int(fracs.get(i)*N)) for i,dff in df.groupby('group_id'))
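As a quick sanity check (a hedged usage note reusing fracs, N and df from above), you can confirm the sampled shares afterwards:
sampled = pd.concat(dff.sample(n=int(fracs.get(i) * N))
                    for i, dff in df.groupby('group_id'))
# Shares of each group_id in the sample; these should come out near 1/2, 1/3 and 1/6.
print(sampled['group_id'].value_counts(normalize=True))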
Edit:
This is to highlight the importance of fulfilling the requirement that group_id A should have half of the sampled rows, group_id B two sixths of the sampled rows and group_id C one sixth, regardless of the original group sizes.
Starting with equal portions: each group starts with 40 rows
df1 = pd.DataFrame({'group_id': ['A', 'B', 'C'] * 40,
                    'vals': np.random.randn(120)})
N = len(df1)
fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df1.groupby('group_id')))
# group_id vals
# 12 A -0.175109
# 51 A -1.936231
# 81 A 2.057427
# 111 A 0.851301
# 114 A 0.669910
# 60 A 1.226954
# 73 B -0.166516
# 82 B 0.662789
# 94 B -0.863640
# 31 B 0.188097
# 101 C 1.802802
# 53 C 0.696984
print(df1.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))
# group_id vals
# group_id
# A 24 A 0.161328
# 21 A -1.399320
# 30 A -0.115725
# 114 A 0.669910
# B 34 B -0.348558
# 7 B -0.855432
# 106 B -1.163899
# 79 B 0.532049
# C 65 C -2.836438
# 95 C 1.701192
# 80 C -0.421549
# 74 C -1.089400
First solution: 6 rows for group A (1/2 of the sampled rows), 4 rows for group B (one third of the sampled rows) and 2 rows for group C (one sixth of the sampled rows).
Second solution: 4 rows for each group (each one third of the sampled rows)
Working with differently sized groups: 40 for A, 60 for B and 20 for C
df2 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (40, 60, 20)),
                    'vals': np.random.randn(120)})
N = len(df2)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df2.groupby('group_id')))
# group_id vals
# 29 A 0.306738
# 35 A 1.785479
# 21 A -0.119405
# 4 A 2.579824
# 5 A 1.138887
# 11 A 0.566093
# 80 B 1.207676
# 41 B -0.577513
# 44 B 0.286967
# 77 B 0.402427
# 103 C -1.760442
# 114 C 0.717776
print(df2.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))
# group_id vals
# group_id
# A 4 A 2.579824
# 32 A 0.451882
# 5 A 1.138887
# 17 A -0.614331
# B 47 B -0.308123
# 52 B -1.504321
# 42 B -0.547335
# 84 B -1.398953
# 61 B 1.679014
# 66 B 0.546688
# C 105 C 0.988320
# 107 C 0.698790
First solution: consistent
Second solution: Now group B has taken 6 of the sampled rows when it's supposed to only take 4.
Working with another set of differently sized groups: 60 for A, 40 for B and 20 for C
df3 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (60, 40, 20)),
                    'vals': np.random.randn(120)})
N = len(df3)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df3.groupby('group_id')))
# group_id vals
# 48 A 1.214525
# 19 A -0.237562
# 0 A 3.385037
# 11 A 1.948405
# 8 A 0.696629
# 39 A -0.422851
# 62 B 1.669020
# 94 B 0.037814
# 67 B 0.627173
# 93 B 0.696366
# 104 C 0.616140
# 113 C 0.577033
print(df3.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))
# group_id vals
# group_id
# A 4 A 0.284448
# 11 A 1.948405
# 8 A 0.696629
# 0 A 3.385037
# 31 A 0.579405
# 24 A -0.309709
# B 70 B -0.480442
# 69 B -0.317613
# 96 B -0.930522
# 80 B -1.184937
# C 101 C 0.420421
# 106 C 0.058900
This is the only time the second solution offered some consistency (out of sheer luck, I might add).
I hope this proves useful.

How to check correlation between matching columns of two data sets?

If we have the data set:
import pandas as pd
a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})
How does one create a correlation matrix, in which the y-axis represents "a" and the x-axis represents "b"?
The aim is to see the correlations between the matching columns of the two datasets.
If you don't mind a NumPy-based vectorized solution, here's one based on this solution post on computing the correlation coefficient between two multi-dimensional arrays -
corr2_coeff(a.values.T,b.values.T).T # func from linked solution post.
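For reference, a minimal sketch of such a vectorized helper (my paraphrase; the linked post's corr2_coeff may differ in detail):
import numpy as np

def corr2_coeff(A, B):
    # Row-wise mean-center both 2D arrays
    A_m = A - A.mean(axis=1, keepdims=True)
    B_m = B - B.mean(axis=1, keepdims=True)
    # Sum of squares of each centered row
    ssA = (A_m ** 2).sum(axis=1)
    ssB = (B_m ** 2).sum(axis=1)
    # Pairwise Pearson correlations between rows of A and rows of B
    return A_m.dot(B_m.T) / np.sqrt(np.outer(ssA, ssB))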
Sample run -
In [621]: a
Out[621]:
A B C D E
0 34 54 56 0 78
1 12 87 78 23 12
2 78 35 0 72 31
3 84 25 14 56 0
4 26 82 13 14 34
In [622]: b
Out[622]:
A B C D E
0 45 45 98 0 24
1 24 87 52 23 12
2 65 65 32 1 65
3 65 52 32 365 3
4 65 12 12 53 65
In [623]: corr2_coeff(a.values.T,b.values.T).T
Out[623]:
array([[ 0.71318502, -0.5923714 , -0.9704441 , 0.48775228, -0.07401011],
[ 0.0306753 , -0.0705457 , 0.48801177, 0.34685977, -0.33942737],
[-0.26626431, -0.01983468, 0.66110713, -0.50872017, 0.68350413],
[ 0.58095645, -0.55231196, -0.32053858, 0.38416478, -0.62403866],
[ 0.01652716, 0.14000468, -0.58238879, 0.12936016, 0.28602349]])
This achieves exactly what you want:
from scipy.stats import pearsonr

# create a new DataFrame where the values for the indices and columns
# align on the diagonals
c = pd.DataFrame(columns=a.columns, index=a.columns)

# since we know set(a.columns) == set(b.columns), we can just iterate through
# the columns in a (a more robust way would be to iterate through the
# intersection of the two sets of columns, in case your actual dataframes'
# columns don't match up)
for col in a.columns:
    correl_signif = pearsonr(a[col], b[col])  # correlation of those two Series
    correl = correl_signif[0]  # grab the actual Pearson R value from the tuple above
    c.loc[col, col] = correl  # assign the coefficient to the diagonal for that column
Edit: Well, it achieved exactly what you wanted, until the question was modified. Although this can easily be changed:
c = pd.DataFrame(columns=a.columns, index=a.columns)
for col in c.columns:
    for idx in c.index:
        correl_signif = pearsonr(a[col], b[idx])
        correl = correl_signif[0]
        c.loc[idx, col] = correl
c is now this:
Out[16]:
A B C D E
A 0.713185 -0.592371 -0.970444 0.487752 -0.0740101
B 0.0306753 -0.0705457 0.488012 0.34686 -0.339427
C -0.266264 -0.0198347 0.661107 -0.50872 0.683504
D 0.580956 -0.552312 -0.320539 0.384165 -0.624039
E 0.0165272 0.140005 -0.582389 0.12936 0.286023
I use this function, which breaks the computation down with NumPy:
def corr_ab(a, b):
    a_ = a.values
    b_ = b.values
    ab = a_.T.dot(b_)
    n = len(a)
    sums_squared = np.outer(a_.sum(0), b_.sum(0))
    stds_squared = np.outer(a_.std(0), b_.std(0))
    return pd.DataFrame((ab - sums_squared / n) / stds_squared / n,
                        a.columns, b.columns)
demo
corr_ab(a, b)
Do you have to use pandas? This can be done via NumPy rather easily. Did I understand the task incorrectly?
import numpy
X = {"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]}
Y = {"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]}
for key, value in X.items():
    print("correlation stats for %s is %s" % (key, numpy.corrcoef(value, Y[key])))
