Multiplying multiple columns in a DataFrame - python

I'm trying to multiply N columns in a DataFrame by N other columns in the same DataFrame, and then divide the results by a single column. I'm having trouble with the first part; see the example below.
import pandas as pd
from numpy import random
foo = pd.DataFrame({'A':random.rand(10),
'B':random.rand(10),
'C':random.rand(10),
'N':random.randint(1,100,10),
'X':random.rand(10),
'Y':random.rand(10),
'Z':random.rand(10), })
foo[['A','B','C']].multiply(foo[['X','Y','Z']], axis=0).divide(foo['N'], axis=0)
What I'm trying to get is column-wise multiplication (i.e. A*X, B*Y, C*Z).
The result is not an N-column matrix but a 2N-column one: the columns I'm trying to multiply by are added to the DataFrame, and every entry is NaN, like so:
A B C X Y Z
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
What's going on here, and how do I do column-wise multiplication?

This works by using the raw values of columns X, Y, Z and N, and it may also help you see what the issue is:
>>> (foo[['A','B','C']]
.multiply(foo[['X','Y','Z']].values)
.divide(foo['N'].values, axis=0))
A B C
0 0.000452 0.004049 0.010364
1 0.004716 0.001566 0.012881
2 0.001488 0.000296 0.004415
3 0.000269 0.001168 0.000327
4 0.001386 0.008267 0.012048
5 0.000084 0.009588 0.003189
6 0.000099 0.001063 0.006493
7 0.009958 0.035766 0.012618
8 0.001252 0.000860 0.000420
9 0.006422 0.005013 0.004108
The result keeps the column labels A, B, C. In your original call, pandas aligns the two frames on their column labels; since A, B, C and X, Y, Z share no labels, it cannot tell which columns to pair, so you get the union of the columns filled with NaNs.
Appending .values to the expression above gives you the bare NumPy array, but it is then up to you to restore the index and columns.
>>> (foo[['A','B','C']]
.multiply(foo[['X','Y','Z']].values)
.divide(foo['N'].values, axis=0)).values
array([[ 4.51754797e-04, 4.04911292e-03, 1.03638836e-02],
[ 4.71588457e-03, 1.56556402e-03, 1.28805803e-02],
[ 1.48820116e-03, 2.95700572e-04, 4.41516179e-03],
[ 2.68791866e-04, 1.16836123e-03, 3.27217820e-04],
[ 1.38648301e-03, 8.26692582e-03, 1.20482313e-02],
[ 8.38762247e-05, 9.58768066e-03, 3.18903965e-03],
[ 9.94132918e-05, 1.06267623e-03, 6.49315435e-03],
[ 9.95764539e-03, 3.57657737e-02, 1.26179014e-02],
[ 1.25210929e-03, 8.59735215e-04, 4.20124326e-04],
[ 6.42175897e-03, 5.01250179e-03, 4.10783492e-03]])
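A minimal sketch of the alignment behaviour described above, assuming a recent pandas. The names `left`, `right`, `aligned` and `positional` are illustrative and not from the original post. It contrasts the label-aligned product (all NaN, union of columns) with the position-based product obtained by stripping the labels from the right-hand side:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
foo = pd.DataFrame({
    'A': rng.random(10), 'B': rng.random(10), 'C': rng.random(10),
    'N': rng.integers(1, 100, 10),
    'X': rng.random(10), 'Y': rng.random(10), 'Z': rng.random(10),
})

left = foo[['A', 'B', 'C']]
right = foo[['X', 'Y', 'Z']]

# Label-aligned: the frames share no column labels, so every cell is NaN.
aligned = left.multiply(right)

# Position-based: pass a bare array so A pairs with X, B with Y, C with Z.
# The division by N is aligned on the row index (axis=0).
positional = left.multiply(right.to_numpy()).divide(foo['N'], axis=0)
```

Because `positional` is still a DataFrame, the index and the A, B, C labels survive; only the right-hand operand loses its labels.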

Related

How to manipulate the value of a pandas multiindex on a specific level?

Given a dataframe with row and column multiindex, how would you copy a row index "object" and manipulate a specific index value on a chosen level? Ultimately I would like to add a new row to the dataframe with this manipulated index.
Taking this dataframe df as an example:
col_index = pd.MultiIndex.from_product([['A','B'], [1,2,3,4]], names=['cInd1', 'cInd2'])
row_index = pd.MultiIndex.from_arrays([['2010','2011','2009'],['a','r','t'],[45,34,35]], names=["rInd1", "rInd2", 'rInd3'])
df = pd.DataFrame(data=None, index=row_index, columns=col_index)
df
cInd1 A B
cInd2 1 2 3 4 1 2 3 4
rInd1 rInd2 rInd3
2010 a 45 NaN NaN NaN NaN NaN NaN NaN NaN
2011 r 34 NaN NaN NaN NaN NaN NaN NaN NaN
2009 t 35 NaN NaN NaN NaN NaN NaN NaN NaN
I would like to take the index of the first row, manipulate the "rInd2" value and use this index to insert another row.
Pseudo code would be something like this:
#Get Index
idx = df.index[0]
#Manipulate Value
idx[1] = "L" #or idx["rInd2"]
#Make new row with new index
df.loc[idx, slice(None)] = None
The desired output would look like this:
cInd1 A B
cInd2 1 2 3 4 1 2 3 4
rInd1 rInd2 rInd3
2010 a 45 NaN NaN NaN NaN NaN NaN NaN NaN
2011 r 34 NaN NaN NaN NaN NaN NaN NaN NaN
2009 t 35 NaN NaN NaN NaN NaN NaN NaN NaN
2010 L 45 NaN NaN NaN NaN NaN NaN NaN NaN
What would be the most efficient way to achieve this?
Is there a way to do the same procedure with column index?
Thanks
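One possible approach, sketched under the assumption that building a one-row frame and concatenating it is acceptable. Entries of a MultiIndex are plain tuples, which are immutable, so the sketch copies the tuple into a list, changes the chosen level, and converts it back. The names `level_pos`, `new_idx`, `new_row` and `result` are illustrative:

```python
import pandas as pd

col_index = pd.MultiIndex.from_product([['A', 'B'], [1, 2, 3, 4]], names=['cInd1', 'cInd2'])
row_index = pd.MultiIndex.from_arrays(
    [['2010', '2011', '2009'], ['a', 'r', 't'], [45, 34, 35]],
    names=['rInd1', 'rInd2', 'rInd3'])
df = pd.DataFrame(data=None, index=row_index, columns=col_index)

# Each entry of a MultiIndex is a tuple, so copy it into a list to edit it.
idx = list(df.index[0])
level_pos = list(df.index.names).index('rInd2')  # position of the level to change
idx[level_pos] = 'L'
new_idx = tuple(idx)

# Build an empty one-row frame carrying the new index, then append it.
new_row = pd.DataFrame(
    index=pd.MultiIndex.from_tuples([new_idx], names=df.index.names),
    columns=df.columns)
result = pd.concat([df, new_row])
```

The same idea should carry over to the column index by editing a tuple from df.columns and concatenating along axis=1.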

How to delete nan/null values in lists in a list in Python?

So I have a dataframe with NaN values, and I transform each row of that dataframe into a list, which is then added to another list.
Index 1 2 3 4 5 6 7 8 9 10 ... 71 72 73 74 75 76 77 78 79 80
orderid
20000765 624380 nan nan nan nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
20000766 624380 nan nan nan nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
20000768 1305984 1305985 1305983 1306021 nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
records = []
for i in range(0, 60550):
    records.append([str(dfpivot.values[i,j]) for j in range(0, 10)])
However, many rows contain NaN values, which I want to remove from each inner list before I append it to the list of lists. Where do I need to insert that code, and how do I do this?
I thought that this code would do the trick, but I guess it only looks at the top-level elements of the list of lists:
records = [x for x in records if str(x) != 'nan']
I'm new to Python, so I'm still figuring out the basics.
One way is to take advantage of the fact that stack removes NaNs to generate the nested list:
df.stack().groupby(level=0).apply(list).values.tolist()
# [[624380.0], [624380.0], [1305984.0, 1305985.0, 1305983.0, 1306021.0]]
If you want to drop only the columns that are entirely NaN (keeping the NaNs that remain), you can do it like this:
In [5457]: df.T.dropna(how='all').T
Out[5457]:
Index 1 2 3 4
0 20000765.000 624380.000 nan nan nan
1 20000766.000 624380.000 nan nan nan
2 20000768.000 1305984.000 1305985.000 1305983.000 1306021.000
If you don't want any columns with NaNs, you can drop them like this:
In [5458]: df.T.dropna().T
Out[5458]:
Index 1
0 20000765.000 624380.000
1 20000766.000 624380.000
2 20000768.000 1305984.000
To create the array:
In [5464]: df.T.apply(lambda x: x.dropna().tolist()).tolist()
Out[5464]:
[[20000765.0, 624380.0],
[20000766.0, 624380.0],
[20000768.0, 1305984.0, 1305985.0, 1305983.0, 1306021.0]]
or
df.T[1:].apply(lambda x: x.dropna().tolist()).tolist()
Out[5471]: [[624380.0], [624380.0], [1305984.0, 1305985.0, 1305983.0, 1306021.0]]
depending on whether you want the first (Index) column included in each list.
One way to do this would be with a nested list comprehension:
[[j for j in i if not pd.isna(j)] for i in dfpivot.values]
EDIT: It looks like you want strings, in which case:
[[str(j) for j in i if not pd.isna(j)] for i in dfpivot.values]
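A small self-contained check of the nested comprehension, assuming a tiny made-up stand-in for dfpivot. The filter goes inside the inner list, which is where the NaNs live; int() is used here only to avoid the trailing '.0' that a float column would otherwise produce:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfpivot: rows of ids padded with NaN.
dfpivot = pd.DataFrame(
    [[624380, np.nan, np.nan],
     [1305984, 1305985, np.nan]],
    index=pd.Index([20000765, 20000768], name='orderid'))

# Drop NaNs while each inner list is built, then stringify what is left.
records = [[str(int(v)) for v in row if not pd.isna(v)]
           for row in dfpivot.to_numpy()]
```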

assigning multiple values to different cells in a dataframe

This is probably an easy question, but I couldn't find any simple way to do that. Imagine the following dataframe:
df = pd.DataFrame(index=range(10), columns=range(5))
and three lists that contain indices, columns, and values of the defined dataframe that I intend to change:
idx_list = [1,5,3,7] # the indices of the cells that I want to change
col_list = [1,4,3,1] # the columns of the cells that I want to change
value_list = [9,8,7,6] # the final values of those cells
I was wondering if there exists a function in pandas that does the following efficiently:
for i in range(len(idx_list)):
    df.loc[idx_list[i], col_list[i]] = value_list[i]
Thanks.
Using .values
df.values[idx_list,col_list]=value_list
df
Out[205]:
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN 9 NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN 7 NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN 8
6 NaN NaN NaN NaN NaN
7 NaN 6 NaN NaN NaN
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
Or another, less efficient, way:
updatedf=pd.Series(value_list,index=pd.MultiIndex.from_arrays([idx_list,col_list])).unstack()
df.update(updatedf)
Try df.applymap() (renamed DataFrame.map in pandas 2.1); you can pass a lambda to perform the required operations.
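A hedged alternative sketch: writing through df.values depends on it returning a writable view of the data, which is not guaranteed (for example with mixed dtypes or with copy-on-write enabled). Assuming, as in the question, that the row and column labels coincide with their positions, one can modify an explicit copy and rebuild the frame. The name `arr` is illustrative:

```python
import pandas as pd

df = pd.DataFrame(index=range(10), columns=range(5))
idx_list = [1, 5, 3, 7]
col_list = [1, 4, 3, 1]
value_list = [9, 8, 7, 6]

# Work on an explicit, writable copy of the data.
arr = df.to_numpy(copy=True)
# NumPy pairs the two index lists element by element: (1,1), (5,4), (3,3), (7,1).
arr[idx_list, col_list] = value_list
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
```

If the labels are not positional, df.index.get_indexer(idx_list) and df.columns.get_indexer(col_list) can translate them into positions first.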

Broadcasting Error Pandas

I have a dataframe with 4 columns. I want to do an element-wise division of the first 3 columns by the value in the 4th column.
I tried:
df2 = pd.DataFrame(df.loc[:,['col1', 'col2', 'col3']].values / df.col4.values)
And I got this error:
ValueError: operands could not be broadcast together with shapes (19,3) (19,)
My solution was:
df2 = pd.DataFrame(df.loc[:,['col1', 'col2', 'col3']].values / df.col4.values.reshape(19,1))
This worked as I wanted, but to be robust for different numbers of rows I would need to do:
.reshape(len(df),1)
It just seems an ugly way to do it. Is there a better way around the array shape being (19,)? It seems odd that it has no second dimension.
Best Regards,
Ben
You can just use div and pass axis=0, so that the Series is matched against the row index and each column is divided by it:
df2 = pd.DataFrame(df.loc[:,['col1', 'col2', 'col3']].div(df.col4, axis=0))
Your ValueError comes from NumPy, which tries to broadcast the (19,) array along the last axis of the (19,3) array. The pandas / operator has a similar default: it aligns the Series index with the DataFrame's column labels rather than its rows, and here no labels match; see this example:
In [220]:
df = pd.DataFrame(columns=list('abcd'), data = np.random.randn(8,4))
df
Out[220]:
a b c d
0 1.074803 0.173520 0.211027 1.357138
1 1.418757 -1.879024 0.536826 1.006160
2 -0.029716 -1.146178 0.100900 -1.035018
3 0.314665 -0.773723 -1.170653 0.648740
4 -0.179666 1.291836 -0.009614 0.392149
5 0.264599 -0.057409 -1.425638 1.024098
6 -0.106062 1.824375 0.595974 1.167115
7 0.601544 -1.237881 0.106854 -1.276829
In [221]:
df.loc[:,['a', 'b', 'c']]/df['d']
Out[221]:
a b c 0 1 2 3 4 5 6 7
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
This isn't obvious until you understand how broadcasting works.
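To make the shapes concrete, here is a small sketch (the frame contents and the names `cols`, `via_numpy` and `via_pandas` are illustrative) that puts the NumPy route with an added axis next to the pandas div route:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1.0, 13.0).reshape(3, 4),
                  columns=['col1', 'col2', 'col3', 'col4'])
cols = ['col1', 'col2', 'col3']

# NumPy: turn the (n,) divisor into (n, 1) so it broadcasts across the columns.
via_numpy = df[cols].to_numpy() / df['col4'].to_numpy()[:, None]

# pandas: axis=0 matches the Series index against the row labels.
via_pandas = df[cols].div(df['col4'], axis=0)
```

Indexing with [:, None] does the same job as .reshape(len(df), 1) without hard-coding the number of rows, and the pandas version additionally keeps the index and column labels.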

Pandas.DataFrame select by interval of indexes

I would like to know a Pythonic way to select the rows of a pandas DataFrame whose index values fall inside a given interval. Basically, I want to know whether there is something like pandas.Series.between for DataFrame.index.
example:
df1 = pd.DataFrame(x, index=(1,2,...,100000000), columns=['A','B','C'])
df2 = df1.between(start=10, stop=100000)
I find it curious that nothing related to this is easy to find.
You can just use slice notation with loc, which is label-based indexing (both endpoints are included):
In [3]:
df2 = df1.loc[10:100000]
df2
Out[3]:
A B C
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 NaN NaN NaN
14 NaN NaN NaN
15 NaN NaN NaN
.....
99994 NaN NaN NaN
99995 NaN NaN NaN
99996 NaN NaN NaN
99997 NaN NaN NaN
99998 NaN NaN NaN
99999 NaN NaN NaN
100000 NaN NaN NaN
[99991 rows x 3 columns]
You also mention not being able to find documentation about this but it's pretty easy to find and clear: http://pandas.pydata.org/pandas-docs/stable/indexing.html
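A short sketch of the label slice next to an equivalent boolean mask, assuming a small frame with a sorted integer index (the sizes and the names `by_slice` and `by_mask` are illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((200, 3)), index=range(1, 201), columns=['A', 'B', 'C'])

# Label slice: unlike positional slicing, both 10 and 100 are included.
by_slice = df1.loc[10:100]

# Equivalent boolean mask built from the index itself.
by_mask = df1[(df1.index >= 10) & (df1.index <= 100)]
```

The mask form does not rely on the index being sorted, which may make it the safer choice when the order of the labels is unknown.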
