Multiplying pandas dataframe and series, element wise - python

Let's say I have a pandas DataFrame and a Series:
import pandas as pd
x = pd.DataFrame({0: [1,2,3], 1: [4,5,6], 2: [7,8,9] })
y = pd.Series([-1, 1, -1])
I want to multiply x and y in such a way that I get z:
z = pd.DataFrame({0: [-1,2,-3], 1: [-4,5,-6], 2: [-7,8,-9] })
In other words, if element j of the series is -1, then all elements of the j-th row of x get multiplied by -1. If element j of the series is 1, then all elements of the j-th row of x get multiplied by 1.
How do I do this?

You can do that with DataFrame.mul and axis=0:
>>> new_x = x.mul(y, axis=0)
>>> new_x
   0  1  2
0 -1 -4 -7
1  2  5  8
2 -3 -6 -9

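Adding to the best answer: if the result is full of NaNs, the index of the Series probably does not match the index of the DataFrame, because mul aligns on labels. In that case, multiply by the underlying values of the Series, which are matched by position:
new_x = df.mul(s.values, axis=0)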
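To make the NaN case concrete, here is a minimal sketch. The Series s and its labels 10, 11, 12 are hypothetical, chosen only so that they do not overlap with the row labels of x:

```python
import pandas as pd

x = pd.DataFrame({0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]})
# Hypothetical Series whose labels (10, 11, 12) do not match x's row labels (0, 1, 2).
s = pd.Series([-1, 1, -1], index=[10, 11, 12])

by_label = x.mul(s, axis=0)            # no shared labels, so every cell is NaN
by_position = x.mul(s.values, axis=0)  # plain NumPy array, matched row by row

print(by_label.isna().all().all())
print(by_position)
```

Passing s.values drops the labels entirely, so it is only safe when the Series and the DataFrame rows are already in the same order.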

As Abdou points out, the answer is
z = x.apply(lambda col: col*y)
Moreover, if y is instead a DataFrame, e.g.
y = pd.DataFrame({"colname": [1,-1,-1]})
then you can do
z = x.apply(lambda col: col*y["colname"])

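You can multiply them directly, but be aware that with a DataFrame and a Series the * operator aligns the Series index with the DataFrame's columns, so it scales columns rather than rows:
x * y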
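As a quick check with the x and y from the question, the sketch below compares the plain operator with mul and axis=0. The names col_wise and row_wise are only illustrative:

```python
import pandas as pd

x = pd.DataFrame({0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]})
y = pd.Series([-1, 1, -1])

col_wise = x * y             # y[j] scales column j
row_wise = x.mul(y, axis=0)  # y[j] scales row j, as the question asks

print(col_wise)
print(row_wise)
```

Only row_wise matches the z from the question, so the plain operator is not a drop-in replacement here.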

Related

How to get all values in an array whose added index size is greater than value?

I have a 5x5 ndarray and want to sum up all values whose row and column indices add up to more than a given value.
For example, I have the following array
x = np.random.random([5, 5])
and want to sum all values whose row and column indices, combined, are larger than 6.
If I do this manually, I would calculate
idx_gr_6 = x[4, 3] + x[4, 4] + x[3, 4]
because 4 + 3, 4 + 4 and 3 + 4 are the only index sums larger than 6.
However, this is cumbersome for larger arrays.
Is there a numpy command or a more efficient way to do this?
You can use meshgrid to get the row and column indices:
a = np.random.rand(5, 5)
min_ind = 6
row_i, col_i = np.meshgrid(range(a.shape[0]), range(a.shape[1]), indexing='ij')
valid_mask = (row_i + col_i) > min_ind
res = a[valid_mask].sum()
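The same mask can also be built with plain broadcasting instead of meshgrid. This is a sketch of an alternative, not part of the original answer; rows and cols are illustrative names:

```python
import numpy as np

a = np.random.rand(5, 5)
min_ind = 6

rows = np.arange(a.shape[0])[:, None]  # column vector of row indices
cols = np.arange(a.shape[1])[None, :]  # row vector of column indices
valid_mask = (rows + cols) > min_ind   # broadcasts to a 5x5 boolean mask
res = a[valid_mask].sum()

print(res)
```

For a 5x5 array and a threshold of 6, the mask selects exactly the three positions listed in the question.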

Slice multiple columns that are not next to each other in dataframe

I want to slice multiple columns that are located several columns away from each other. I'm trying to write code that does this easily, without having to repeat myself:
df (see below for an example) has columns from A to H, with many rows containing some data (x).
How do I slice multiple randomly spaced columns, say A, D, E, G, with a minimum amount of code? I don't want to rewrite the loc code (df.loc['A'], df.loc['C:E'], df.loc['G']).
Can I generate a list and loop through it, or is there a shorter/quicker way?
Ultimately my goal is to drop the selected columns from the main DataFrame.
   A  B  C  D  E  F  G  H
0  x  x  x  x  x  x  x  x
1  x  x  x  x  x  x  x  x
2  x  x  x  x  x  x  x  x
3  x  x  x  x  x  x  x  x
4  x  x  x  x  x  x  x  x
You might use the .iloc method to select columns by their position rather than by name, for example:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9],'D':[10,11,12],'E':[13,14,15]})
df2 = df.iloc[:, [0,2,4]]
print(df2)
output:
   A  C   E
0  1  7  13
1  2  8  14
2  3  9  15
If you need just x random columns from a df that has y columns, you might use random.sample. For example, if you want 3 columns out of 5:
import random
cols = sorted(random.sample(range(0,5),k=3))
This gives cols, a sorted list of three column positions (thanks to sorted, the original column order is preserved), which can then be passed to df.iloc[:, cols].
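If the columns are known by name, as in the question (A, D, E, G), a list of labels is enough both for selecting and for dropping. A minimal sketch, using a small stand-in frame with made-up values:

```python
import pandas as pd

# Small stand-in for the A..H frame from the question; the values are placeholders.
df = pd.DataFrame({c: [0, 1, 2] for c in "ABCDEFGH"})

wanted = ["A", "D", "E", "G"]
selected = df[wanted]                # pick the columns by name
remaining = df.drop(columns=wanted)  # or drop them from the main frame

print(list(selected.columns))
print(list(remaining.columns))
```

Neither call modifies df itself; assign the result back if the change should be kept.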

Python/ Pandas If statement inside a function explained

I have the following example and I cannot understand why it doesn't work.
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
def balh(a, b):
    z = a + b
    if z.any() > 1:
        return z + 1
    else:
        return z
df['col3'] = balh(df.col1, df.col2)
Output:
My expected output would be 5 and 7, not 4 and 6, in col3, since 4 and 6 are greater than 1 and my intention is to add 1 if a + b is greater than 1.
The any method evaluates whether any element of the pandas.Series or pandas.DataFrame is truthy. A non-zero integer is evaluated as True. So with if z.any() > 1 you are comparing the True returned by the method with the integer 1, and True > 1 is False.
You need to apply the condition directly to the pandas.Series, which returns a boolean pandas.Series on which you can safely call the any method.
The same applies to the all method.
def balh(a, b):
    z = a + b
    if (z > 1).any():
        return z + 1
    else:
        return z
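The difference between the two conditions can be checked in isolation. The Series below holds the sums from the question (1 + 3 and 2 + 4); wrong and right are illustrative names:

```python
import pandas as pd

z = pd.Series([4, 6])

wrong = z.any() > 1    # z.any() is True, and True > 1 is False
right = (z > 1).any()  # compare first, then ask whether any element passed

print(wrong, right)
```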
As @arhr clearly explained, the issue was the incorrect call to z.any(), which returns True when there is at least one non-zero element in z. It resulted in True > 1, which is a False expression.
A one line alternative to avoid the if statement and the custom function call would be the following:
df['col3'] = df.iloc[:, :2].sum(1).transform(lambda x: x + int(x > 1))
This gets the first two columns in the dataframe then sums the elements along each row and transforms the new column according to the lambda function.
The iloc can also be omitted because the dataframe is instantiated with only two columns col1 and col2, thus the line can be refactored to:
df['col3'] = df.sum(1).transform(lambda x: x + int(x > 1))
Example output:
   col1  col2  col3
0     1     3     5
1     2     4     7
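The same element-wise rule can also be written without a lambda, by adding the boolean mask cast to integers. This is a sketch of an alternative, not the answerer's code; total is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
total = df['col1'] + df['col2']
# (total > 1) is a boolean Series; casting to int gives 1 where the sum exceeds 1.
df['col3'] = total + (total > 1).astype(int)

print(df)
```

Note that this, like the lambda version, decides row by row, whereas the if (z > 1).any() version adds 1 to every row as soon as a single sum exceeds 1.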

How to compare each dataframe row to each point of a tuple and assign the closest point's index to a new column?

Imagine the following dataset:
   X  Y
0  2  4
1  5  6
2  3  4
Now, imagine the following tuple of points: ((2,4), (6,5), (1,14))
How can I find the closest point to each row and assign the index of the point to a new column?
For example, since the closest point to the first row is the point with index 0, the first row would become:
   X  Y  Closest_Point
0  2  4              0
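Try scipy's cdist, which computes the distance from every row to every point in one call (here l is the tuple of points):
import numpy as np
from scipy.spatial import distance
l = ((2,4), (6,5), (1,14))
ary = distance.cdist(df.values, np.array(l), metric='euclidean')
ary.argmin(1)
Out[326]: array([0, 1, 0], dtype=int32)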
I would definitely convert both the tuple and the dataset into NumPy arrays.
For the examples you gave:
import numpy as np
dataset = np.array([[2,4],[5,6],[3,4]])
points = np.array([[2,4],[6,5],[1,14]])
dataset_indexed = []
for i in range(dataset.shape[0]):
    # start with the distance to the first point as the best candidate
    temp = ((dataset[i,0]-points[0,0])**2 + (dataset[i,1]-points[0,1])**2)**(1/2)
    index = 0
    for n in range(points.shape[0]):
        dist = ((dataset[i,0]-points[n,0])**2 + (dataset[i,1]-points[n,1])**2)**(1/2)
        if dist <= temp:
            temp = dist
            index = n
    dataset_indexed.append([dataset[i,0], dataset[i,1], index])
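The nested loop can be collapsed into a few array operations and written straight into the new column. The sketch below uses NumPy broadcasting only (no scipy) and assumes the frame has the columns X and Y from the question; diff and dist are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [2, 5, 3], 'Y': [4, 6, 4]})
points = np.array([(2, 4), (6, 5), (1, 14)])

# diff[i, n] is the vector from point n to row i, shape (3 rows, 3 points, 2).
diff = df[['X', 'Y']].to_numpy()[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))    # Euclidean distance for every pair
df['Closest_Point'] = dist.argmin(axis=1)  # index of the nearest point per row

print(df)
```

On ties, argmin keeps the first nearest point, while the loop above (with <=) keeps the last one; for the data in the question there are no ties.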

Equivalent of adding a value in a new row/column to numpy that works like R's data.frame

In R I can do:
> y = c(2,3)
> x = c(4,5)
> z = data.frame(x,y)
> z[3,3]<-6
> z
   x  y V3
1  4  2 NA
2  5  3 NA
3 NA NA  6
R automatically fills the empty cells with NA.
If I use numpy.insert, it raises an error:
import numpy
y = [2,3]
x = [4,5]
z = numpy.array([y, x])
z = numpy.insert(z, 3, 6, 3)
IndexError: axis 3 is out of bounds for an array of dimension 2
Is there a way to insert values in a way that works similar to R in numpy?
numpy is more of a replacement for R's matrices than for its data frames. You should consider using Python's pandas library for this. For example:
In [1]: import pandas
In [2]: y = pandas.Series([2,3])
In [3]: x = pandas.Series([4,5])
In [4]: z = pandas.DataFrame([x,y])
In [5]: z
Out[5]:
   0  1
0  4  5
1  2  3
In [19]: z.loc[3,3] = 6
In [20]: z
Out[20]:
     0    1    3
0    4    5  NaN
1    2    3  NaN
3  NaN  NaN    6
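If you prefer to create the empty cells explicitly before assigning, reindex can add the new row and column first. This is a sketch of an alternative to the direct z.loc assignment; the column names x, y and V3 are taken from the R output in the question:

```python
import pandas as pd

z = pd.DataFrame({'x': [4, 5], 'y': [2, 3]})
# reindex adds the new row label 2 and the new column 'V3', filled with NaN.
z = z.reindex(index=[0, 1, 2], columns=['x', 'y', 'V3'])
z.loc[2, 'V3'] = 6

print(z)
```

Because NaN is a floating-point value, the integer columns become float columns once the empty row is added.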
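In numpy you need to initialize an array with the appropriate size:
z = numpy.empty((3, 3))  # the shape must be passed as a tuple
z.fill(numpy.nan)
z[:2, 0] = x
z[:2, 1] = y
z[2, 2] = 6  # indices are 0-based, so the third row and column are at index 2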
Looking at the raised error, it is possible to understand why it occurred:
you are trying to insert values along an axis that does not exist in z.
You can fix it by doing
import numpy as np
y = [2,3]
x = [4,5]
array = np.array([y, x])
z = np.insert(array, 1, [3,6], axis=1)
The interface is quite different from R's. If you are using IPython,
you can easily access the documentation for any numpy function, in this case
np.insert, by doing:
help(np.insert)
which gives you the function signature, explains each parameter used to call it, and provides
some examples.
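To see what the position argument of np.insert does, here is a small sketch using the array from this answer. The names middle and end are illustrative:

```python
import numpy as np

array = np.array([[2, 3], [4, 5]])

middle = np.insert(array, 1, [3, 6], axis=1)  # new column placed before index 1
end = np.insert(array, 2, [3, 6], axis=1)     # index 2 equals the column count, so it is appended

print(middle)
print(end)
```

Unlike R, np.insert never grows the array past the requested position and never pads with NA; it returns a new array with exactly one extra column.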
You could, alternatively, do
import numpy as np
x = [4,5]
y = [2,3]
array = np.array([y,x])
z = np.array([3,6])  # must be an array, not a list, for the indexing below
new_array = np.vstack([array.T, z]).T # or, as below
# new_array = np.hstack([array, z[:, np.newaxis]])
Also, take a look at the pandas module. It provides
an interface similar to what you asked for, built on top of numpy.
With pandas you could do something like:
import pandas as pd
data = {'x':[4,5], 'y':[2,3]}
dataframe = pd.DataFrame(data)
dataframe['z'] = [3,6]
which gives the nice output:
   x  y  z
0  4  2  3
1  5  3  6
If you want a more R-like experience within python, I can highly recommend pandas, which is a higher-level numpy based library, which performs operations of this kind.
