Given the data frame below, I want to select columns A and D to F.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(5, 6)), columns=list('ABCDEF'))
In R I could say
df[ , c(1, 4:6)]
I want something similar for pandas. With
df.iloc[:, slice(3, 6, 1)]
I get columns D to F. But how can I add column A?
You can use np.r_ to pass combinations of slices and indices. Since you seem to know the labels, you can use get_loc to obtain the iloc indices:
import numpy as np
idx = np.r_[df.columns.get_loc('A'),
            df.columns.get_loc('D'):df.columns.get_loc('F')+1]
# idx = np.r_[0, 3:6]
df.iloc[:, idx]
A D E F
0 38 71 62 63
1 60 93 72 94
2 57 33 30 51
3 88 54 21 39
4 0 53 41 20
Another option is np.split:
df_split = pd.concat([np.split(df, [1], axis=1)[0],
                      np.split(df, [3], axis=1)[1]], axis=1)
With this approach you don't need to know the column names, just their positions.
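For reference, a plain-iloc sketch with the same positions (column 0 plus columns 3 to 5) that likewise avoids column names:
df_sel = pd.concat([df.iloc[:, [0]], df.iloc[:, 3:6]], axis=1)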
How can I convert a pandas dataframe to a namedtuple? This is a step towards multiprocessing work.
def df2namedtuple(df):
    return tuple(df.row)
itertuples has the options name and index. You may use them to return exactly the same output as your posted function:
Sample df:
A B C D
0 32 70 39 66
1 89 30 31 80
2 21 5 74 63
list(df.itertuples(name='Row', index=False))
Out[1130]:
[Row(A=32, B=70, C=39, D=66),
Row(A=89, B=30, C=31, D=80),
Row(A=21, B=5, C=74, D=63)]
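If you also want the row index inside each tuple, leave index at its default of True; itertuples then adds an Index field:
list(df.itertuples(name='Row'))  # tuples become Row(Index=0, A=32, B=70, C=39, D=66), ...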
Answer from Dan in https://groups.google.com/forum/#!topic/pydata/UaF6Y1LE5TI
from collections import namedtuple
def iternamedtuples(df):
    Row = namedtuple('Row', df.columns)
    for row in df.itertuples():
        yield Row(*row[1:])
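A quick usage sketch (assuming the column names are valid Python identifiers, as in the sample df above):
rows = list(iternamedtuples(df))
rows[0].A  # field access by column name on the first row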
I am trying to define a function that receives a list of lists as input and converts them to separate lists using a for-loop.
def convert_to_sep_lists(listoflists):
    for i in range(len(listoflists)):
        newlst = listoflists[i]
This obviously keeps only the very last list in the list of lists. How can I save every iteration and return all the lists (within that list) separately?
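One minimal sketch of an answer (assuming the goal is simply to get each inner list back as its own object): return them together as a tuple and unpack at the call site:
def convert_to_sep_lists(listoflists):
    # return every inner list instead of keeping only the last one
    return tuple(listoflists)

first, second, third = convert_to_sep_lists([[1, 2], [3, 4], [5, 6]])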
You can try pandas' to_json function:
import pandas as pd
import numpy as np
columns_name = list('abcd')
df = pd.DataFrame(data=np.random.randint(1, 100, size=(5, 4)),
                  columns=columns_name)
print(df)
df.sum().to_json("result.json")
The content of the dataframe df will be:
a b c d
0 56 91 65 82
1 63 65 50 78
2 46 43 75 3
3 37 96 84 13
4 40 59 61 66
The output file content will be:
{"a":165,"b":230,"c":234,"d":336}
If we have the data set:
import pandas as pd
a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})
How does one create a correlation matrix, in which the y-axis represents "a" and the x-axis represents "b"?
The aim is to see the correlations between the matching columns of the two datasets.
If you don't mind a NumPy-based vectorized solution, you can build on this solution post to Computing the correlation coefficient between two multi-dimensional arrays:
corr2_coeff(a.values.T, b.values.T).T  # func from linked solution post
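For reference, the linked post defines corr2_coeff roughly like this (a sketch; it computes the row-wise Pearson correlation between two 2D arrays):
def corr2_coeff(A, B):
    # Center each row of both inputs
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    # Row-wise sums of squares
    ssA = (A_mA ** 2).sum(1)
    ssB = (B_mB ** 2).sum(1)
    # Correlation of every row of A with every row of B
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.outer(ssA, ssB))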
Sample run:
In [621]: a
Out[621]:
A B C D E
0 34 54 56 0 78
1 12 87 78 23 12
2 78 35 0 72 31
3 84 25 14 56 0
4 26 82 13 14 34
In [622]: b
Out[622]:
A B C D E
0 45 45 98 0 24
1 24 87 52 23 12
2 65 65 32 1 65
3 65 52 32 365 3
4 65 12 12 53 65
In [623]: corr2_coeff(a.values.T,b.values.T).T
Out[623]:
array([[ 0.71318502, -0.5923714 , -0.9704441 , 0.48775228, -0.07401011],
[ 0.0306753 , -0.0705457 , 0.48801177, 0.34685977, -0.33942737],
[-0.26626431, -0.01983468, 0.66110713, -0.50872017, 0.68350413],
[ 0.58095645, -0.55231196, -0.32053858, 0.38416478, -0.62403866],
[ 0.01652716, 0.14000468, -0.58238879, 0.12936016, 0.28602349]])
This achieves exactly what you want:
from scipy.stats import pearsonr
# create a new DataFrame where the values for the indices and columns
# align on the diagonals
c = pd.DataFrame(columns=a.columns, index=a.columns)
# since we know set(a.columns) == set(b.columns), we can just iterate
# through the columns in a (although a more robust way would be to iterate
# through the intersection of the two sets of columns, in case your actual
# dataframes' columns don't match up)
for col in a.columns:
    correl_signif = pearsonr(a[col], b[col])  # correlation of those two Series
    correl = correl_signif[0]  # grab the actual Pearson R value from the tuple above
    c.loc[col, col] = correl  # put it on the diagonal for that column
Edit: Well, it achieved exactly what you wanted until the question was modified. This can easily be changed, though:
c = pd.DataFrame(columns=a.columns, index=a.columns)
for col in c.columns:
    for idx in c.index:
        correl_signif = pearsonr(a[col], b[idx])
        correl = correl_signif[0]
        c.loc[idx, col] = correl
c is now this:
Out[16]:
A B C D E
A 0.713185 -0.592371 -0.970444 0.487752 -0.0740101
B 0.0306753 -0.0705457 0.488012 0.34686 -0.339427
C -0.266264 -0.0198347 0.661107 -0.50872 0.683504
D 0.580956 -0.552312 -0.320539 0.384165 -0.624039
E 0.0165272 0.140005 -0.582389 0.12936 0.286023
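Note that because c is filled cell by cell, its columns end up with object dtype; if you need to do further numeric work with it, a cast should help:
c = c.astype(float)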
I use this function, which breaks the computation down with NumPy:
def corr_ab(a, b):
    a_ = a.values
    b_ = b.values
    ab = a_.T.dot(b_)                              # column-wise dot products
    n = len(a)
    sums_squared = np.outer(a_.sum(0), b_.sum(0))  # outer product of column sums
    stds_squared = np.outer(a_.std(0), b_.std(0))  # outer product of column stds
    # (E[xy] - E[x]E[y]) / (std_x * std_y), i.e. the Pearson correlation
    return pd.DataFrame((ab - sums_squared / n) / stds_squared / n,
                        a.columns, b.columns)
Demo:
corr_ab(a, b)
Do you have to use pandas? This can be done with NumPy rather easily. Or did I understand the task incorrectly?
import numpy
X = {"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]}
Y = {"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]}
for key, value in X.items():
    print("correlation stats for %s is %s" % (key, numpy.corrcoef(value, Y[key])))
I have the following minimal code, which is too slow. For the 1000 rows I need, it takes about 2 min. I need it to run faster.
import time
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,1000,size=(1000, 4)), columns=list('ABCD'))
start_algorithm = time.time()
myunique = df['D'].unique()
for i in myunique:
    itemp = df[df['D'] == i]
    for j in myunique:
        jtemp = df[df['D'] == j]
I know that NumPy can make this run much faster, but keep in mind that I want to keep a part of the original dataframe (or a NumPy array) for specific values of column 'D'. How can I improve its performance?
Avoid computing the sub-DataFrame df[df['D'] == i] more than once. The original code computes this len(myunique)**2 times. Instead you can compute this once for each i (that is, len(myunique) times in total), store the results, and then pair them together later. For example,
groups = [grp for di, grp in df.groupby('D')]
for itemp, jtemp in IT.product(groups, repeat=2):
    pass
import numpy as np
import pandas as pd
import itertools as IT

df = pd.DataFrame(np.random.randint(0,1000,size=(1000, 4)), columns=list('ABCD'))

def using_orig():
    myunique = df['D'].unique()
    for i in myunique:
        itemp = df[df['D'] == i]
        for j in myunique:
            jtemp = df[df['D'] == j]

def using_groupby():
    groups = [grp for di, grp in df.groupby('D')]
    for itemp, jtemp in IT.product(groups, repeat=2):
        pass
In [28]: %timeit using_groupby()
10 loops, best of 3: 63.8 ms per loop
In [31]: %timeit using_orig()
1 loop, best of 3: 2min 22s per loop
Regarding the comment:
I can easily replace itemp and jtemp with a=1 or print "Hello" so ignore that
The answer above addresses how to compute itemp and jtemp more efficiently. If itemp and jtemp are not central to your real calculation, then we would need to better understand what you really want to compute in order to suggest (if possible) a way to compute it faster.
Here's a vectorized approach to form the groups based on unique elements from the 'D' column:
# Sort the dataframe based on the sorted indices of column 'D'
df_sorted = df.iloc[df['D'].argsort()]
# In the sorted dataframe's 'D' column, find the shift/cut indices
# (places where elements change values, indicating a change of groups).
# Cut the dataframe at those indices for the final groups with np.split.
cut_idx = np.where(np.diff(df_sorted['D']) > 0)[0] + 1
df_split = np.split(df_sorted, cut_idx)
Sample testing
1] Form a sample dataframe with random elements:
>>> df = pd.DataFrame(np.random.randint(0,100,size=(5, 4)), columns=list('ABCD'))
>>> df
A B C D
0 68 68 90 39
1 53 99 20 85
2 64 76 21 19
3 90 91 32 36
4 24 9 89 19
2] Run the original code and print the results:
>>> myunique = df['D'].unique()
>>> for i in myunique:
...     itemp = df[df['D'] == i]
...     print(itemp)
...
A B C D
0 68 68 90 39
A B C D
1 53 99 20 85
A B C D
2 64 76 21 19
4 24 9 89 19
A B C D
3 90 91 32 36
3] Run the proposed code and print the results:
>>> df_sorted = df.iloc[df['D'].argsort()]
>>> cut_idx = np.where(np.diff(df_sorted['D']) > 0)[0] + 1
>>> df_split = np.split(df_sorted, cut_idx)
>>> for split in df_split:
...     print(split)
...
A B C D
2 64 76 21 19
4 24 9 89 19
A B C D
3 90 91 32 36
A B C D
0 68 68 90 39
A B C D
1 53 99 20 85
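As a quick sanity check (a sketch, not part of the original answer), the same groups can be reproduced with groupby and compared piece by piece:
# groupby preserves the original row order inside each group; argsort is not
# guaranteed to be stable, so sort each piece by index before comparing
groups = [grp for _, grp in df.groupby('D')]
for piece, grp in zip(df_split, groups):
    assert piece.sort_index().equals(grp.sort_index())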