How to create a new dataframe and add new variables in Python? - python

I created two random variables (x and y) with certain properties. Now, I want to create a dataframe from scratch out of these two variables. Unfortunately, what I type seems to be wrong. How can I do this correctly?
# creating variable x with Bernoulli distribution
from scipy.stats import bernoulli, binom
x = bernoulli.rvs(size=100,p=0.6)
# form a column vector (n, 1)
x = x.reshape(-100, 1)
print(x)
# creating variable y with normal distribution
y = norm.rvs(size=100,loc=0,scale=1)
# form a column vector (n, 1)
y = y.reshape(-100, 1)
print(y)
# creating a dataframe from scratch and assigning x and y to it
df = pd.DataFrame()
df.assign(y = y, x = x)
df

There are a lot of ways to go about this.
According to the documentation, pd.DataFrame accepts an ndarray (structured or homogeneous), an Iterable, a dict, or another DataFrame. Your issue is that x and y are each 2D NumPy arrays:
>>> x.shape
(100, 1)
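whereas pandas expects either one 1D array per column or a single 2D array holding all columns. Note also that df.assign returns a new DataFrame rather than modifying df in place, so its result has to be assigned to a name.
One way is to stack the arrays into a single 2D array before calling the DataFrame constructor: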
>>> pd.DataFrame(np.hstack([x,y]))
0 1
0 0.0 0.764109
1 1.0 0.204747
2 1.0 -0.706516
3 1.0 -1.359307
4 1.0 0.789217
.. ... ...
95 1.0 0.227911
96 0.0 -0.238646
97 0.0 -1.468681
98 0.0 1.202132
99 0.0 0.348248
The alternatives mostly revolve around calling np.ndarray.flatten(), e.g. to construct a dict of 1D columns:
>>> pd.DataFrame({'x': x.flatten(), 'y': y.flatten()})
x y
0 0 0.764109
1 1 0.204747
2 1 -0.706516
3 1 -1.359307
4 1 0.789217
.. .. ...
95 1 0.227911
96 0 -0.238646
97 0 -1.468681
98 0 1.202132
99 0 0.348248
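If the column vectors are not needed elsewhere, the reshape can be skipped entirely. The following is a minimal sketch, assuming NumPy's random generator is an acceptable stand-in for scipy.stats (the seed and the extra column name xy are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# 1D arrays of length 100; no reshape to (n, 1) is needed
x = rng.binomial(n=1, p=0.6, size=100)        # Bernoulli(0.6) draws
y = rng.normal(loc=0.0, scale=1.0, size=100)  # standard normal draws

# One 1D array per column
df = pd.DataFrame({"x": x, "y": y})

# assign returns a new DataFrame, so the result must be kept
df = df.assign(xy=df["x"] * df["y"])  # "xy" is a hypothetical extra column
print(df.head())
```

Keeping the arrays one-dimensional avoids the shape mismatch, and rebinding df after assign is what makes the new column stick.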

Related

Create matrix 100x100 each row with next ordinal number

I am trying to create a 100x100 matrix in which each row holds the next ordinal number, like below:
I created a vector from 1 to 100 and then copied it 100 times in a for loop. That gave me an array with the correct data, so I tried to sort it using np.argsort, but it didn't work as I wanted (I don't even know why there are zeros after sorting).
Is there a way to get this matrix using other functions? I tried many approaches, but the final layout was never what I expected.
max_x = 101
z = np.arange(1,101)
print(z)
x = []
for i in range(1,max_x):
    x.append(z.copy())
print(x)
y = np.argsort(x)
y
argsort returns the indices that would sort the array, not the sorted values, which is why you get zeros. You don't need that; what you want is to transpose the array.
Make x a NumPy array and use T:
y = np.array(x).T
Output
[[ 1 1 1 ... 1 1 1]
[ 2 2 2 ... 2 2 2]
[ 3 3 3 ... 3 3 3]
...
[ 98 98 98 ... 98 98 98]
[ 99 99 99 ... 99 99 99]
[100 100 100 ... 100 100 100]]
You also don't need to loop to copy the array, use np.tile instead
z = np.arange(1, 101)
x = np.tile(z, (100, 1))
y = x.T
# or one liner
y = np.tile(np.arange(1, 101), (100, 1)).T
import numpy as np
np.asarray([ (k+1)*np.ones(100) for k in range(100) ])
Or simply
np.tile(np.arange(1,101),(100,1)).T
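Another way to get the same layout, sketched here as an alternative to tile plus transpose, is to start from a column vector and broadcast it across the columns. The names n and col are only illustrative:

```python
import numpy as np

n = 100
col = np.arange(1, n + 1)[:, None]      # column vector, shape (n, 1)

# broadcast_to gives a read-only view; copy() makes it a normal writable array
y = np.broadcast_to(col, (n, n)).copy()

print(y.shape)
print(y[:3, :3])
```

Every row of y then holds a single repeated value, from 1 in the first row to 100 in the last.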

Multi-column interpolation in Python

I want to use scipy or pandas to interpolate on a table like this one:
df = pd.DataFrame({'x':[1,1,1,2,2,2],'y':[1,2,3,1,2,3],'z':[10,20,30,40,50,60] })
df =
x y z
0 1 1 10
1 1 2 20
2 1 3 30
3 2 1 40
4 2 2 50
5 2 3 60
I want to be able to interpolate for a x value of 1.5 and a y value of 2.5 and obtain a 40.
The process would be:
Starting from the first interpolation parameter (x), find the values that surround the target value. In this case the target is 1.5 and the surrounding values are 1 and 2.
Interpolate in y for a target of 2.5 considering x=1. In this case between rows 1 and 2, obtaining a 25
Interpolate in y for a target of 2.5 considering x=2. In this case between rows 4 and 5, obtaining a 55
Interpolate the values from the previous steps to the target x value. In this case I have 25 for x=1 and 55 for x=2. The interpolated value for 1.5 is 40
The order in which interpolation is to be performed is fixed and the data will be correctly sorted.
I've found this question but I'm wondering if there is a standard solution already available in those libraries.
You can use scipy.interpolate.interp2d:
import scipy.interpolate
f = scipy.interpolate.interp2d(df.x, df.y, df.z)
f([1.5], [2.5])
[40.]
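The line f = scipy.interpolate.interp2d(...) creates an interpolation function z = f(x, y) from three arrays for x, y, and z. The call f([1.5], [2.5]) then evaluates that function at the given x and y. The default is linear interpolation. Note that interp2d was deprecated in SciPy 1.10 and removed in SciPy 1.14, so this snippet only runs on older SciPy versions.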
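On SciPy releases where interp2d is not available, one possible substitute for data on a regular grid is scipy.interpolate.RegularGridInterpolator. The sketch below assumes the table covers every (x, y) combination, so that it can be pivoted into a full grid; the variable names grid, interp and result are illustrative:

```python
import pandas as pd
from scipy.interpolate import RegularGridInterpolator

df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2],
                   'y': [1, 2, 3, 1, 2, 3],
                   'z': [10, 20, 30, 40, 50, 60]})

# Rows are indexed by x, columns by y, cells hold z
grid = df.pivot(index='x', columns='y', values='z')

interp = RegularGridInterpolator(
    (grid.index.to_numpy(dtype=float), grid.columns.to_numpy(dtype=float)),
    grid.to_numpy(dtype=float),
)  # the default method is linear

# Query points are given as rows of (x, y)
result = interp([[1.5, 2.5]])
print(result)
```

For the query (1.5, 2.5) this should reproduce the value of 40 from the question.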
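Alternatively, define your own function. Note that it averages the z values at the grid points surrounding (x, y), so it matches linear interpolation only at the midpoint of a cell, as in this example:
def interpolate(x, y, df):
    cond = df.x.between(int(x), int(x) + 1) & df.y.between(int(y), int(y) + 1)
    return df.loc[cond].z.mean()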
interpolate(1.5,2.5,df)
40.0
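The step-by-step procedure described in the question (interpolate in y for each x, then interpolate those results in x) can also be sketched with np.interp. This is only an illustration: the function name interpolate_xy is made up, and np.interp clamps to the edge values rather than extrapolating when a target lies outside the data range.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2],
                   'y': [1, 2, 3, 1, 2, 3],
                   'z': [10, 20, 30, 40, 50, 60]})

def interpolate_xy(df, x_target, y_target):
    """Interpolate z along y within each x group, then along x."""
    xs, zs = [], []
    for x_value, group in df.groupby('x'):   # groups are returned sorted by x
        group = group.sort_values('y')       # np.interp needs increasing y
        xs.append(x_value)
        zs.append(np.interp(y_target,
                            group['y'].to_numpy(),
                            group['z'].to_numpy()))
    # Second pass: interpolate the per-x results at the target x
    return float(np.interp(x_target, xs, zs))

value = interpolate_xy(df, 1.5, 2.5)
print(value)
```

Unlike the averaging function, this also gives linear results away from the cell midpoint, for example at x=1.25, y=1.5.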

Convert 3 Dim to 2 Dim and adjust Class?

Data:
I have an Array with Size (2200, 1000, 12). The first value (2200) is the index, in each index there are 1000 records.
I have another Array for the Class with Size (2200). Each variable here represents a label for the 1000 records in each index.
I want:
How can I in the first array put everything together to transform from 3 dimensions to 2 dimensions?
And how can I put each class variable in the 1000 records?
Desired result:
Dataframe Size (2200000,13)
The 2200000 would be the combined amount of the 1000 records in the 2200 index. And column 13 would be the junction with the Class, where each variable of the class would be repeated a thousand times to keep the same number of lines.
Let us first import the necessary modules and generate mock data:
import numpy as np
import pandas as pd
M = 2200
N = 1000
P = 12
data = np.random.rand(M, N, P)
classes = np.arange(M)
How can I transform from 3 dimensions to 2 dimensions?
data.reshape(M*N, P)
How can I put each class variable in the 1000 records?
np.repeat(classes, N)
Desired result: Dataframe Size (2200000,13)
arr = np.hstack([data.reshape(M*N, P), np.repeat(classes, N)[:, None]])
df = pd.DataFrame(arr)
print(df)
The code above outputs:
0 0.371495 0.598211 0.038224 ... 0.777405 0.193472 0.0
1 0.356371 0.636690 0.841467 ... 0.403570 0.330145 0.0
2 0.793879 0.008617 0.701122 ... 0.021139 0.514559 0.0
3 0.318618 0.798823 0.844345 ... 0.931606 0.467469 0.0
4 0.307109 0.076505 0.865164 ... 0.809495 0.914563 0.0
... ... ... ... ... ... ... ...
2199995 0.215133 0.239560 0.477092 ... 0.050997 0.727986 2199.0
2199996 0.249206 0.881694 0.985973 ... 0.897410 0.564516 2199.0
2199997 0.378455 0.697581 0.016306 ... 0.985966 0.638413 2199.0
2199998 0.233829 0.158274 0.478611 ... 0.825343 0.215944 2199.0
2199999 0.351320 0.980258 0.677298 ... 0.791046 0.736788 2199.0
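As the output shows, np.hstack upcasts the class labels to float (0.0 ... 2199.0) because all columns share one array. If integer labels matter, a possible variation is to build the DataFrame from the reshaped data and add the labels as a separate column. The sketch uses small stand-in sizes and a hypothetical column name class:

```python
import numpy as np
import pandas as pd

M, N, P = 4, 3, 2                 # small stand-ins for 2200, 1000, 12
data = np.random.rand(M, N, P)
classes = np.arange(M)

# Flatten the first two axes: (M, N, P) -> (M*N, P)
df = pd.DataFrame(data.reshape(M * N, P))

# Each label is repeated N times and keeps its integer dtype
df['class'] = np.repeat(classes, N)

print(df.shape)
print(df.dtypes)
```

With the real sizes this yields the requested (2200000, 13) shape, with twelve float columns and one integer class column.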
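Does this help? A plain reshape flattens the first two dimensions, but it must keep the 12 columns; the class column has to be appended separately:
array = array.reshape(2200000, 12)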

How to add a new column to a table formed from conditional statements?

I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
You can do this without a loop by using ge (greater than or equal to) and casting the resulting boolean Series to int with astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
Regarding your attempt:
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
it wouldn't work. Firstly, you're using Data rather than data. Secondly, even with that fixed, you'd be comparing a scalar against a whole Series, and using that result in an if statement raises a ValueError because the truth value of a Series is ambiguous. Thirdly, you're assigning to the entire column on each iteration, so it is overwritten every time.
You need to access the index label, which your loop didn't. You can use items to do this (iteritems was removed in pandas 2.0):
In [125]:
for idx, x in df["X"].items():
    if x >= df['Y'].loc[idx]:
        df.loc[idx, 'Z'] = 1
    else:
        df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary, as there is a vectorised method available.
Firstly, one problem with your code is simply that you capitalized your dataframe name as 'Data' instead of 'data'.
However, for efficient code, EdChum has a great answer above. Another vectorized method, with code that is easy to remember, is np.where:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)
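To try both vectorized approaches without the CSV file, the table from the question can be built inline. This is a small sketch; z_where is just an illustrative name for the np.where result:

```python
import numpy as np
import pandas as pd

# Inline copy of the table from the question
data = pd.DataFrame({'ID': [1, 2, 3],
                     'X': [10, 20, 21],
                     'Y': [3, 23, 34]})

# Option 1: boolean comparison cast to int
data['Z'] = data['X'].ge(data['Y']).astype(int)

# Option 2: np.where picks 1 where the condition holds, else 0
z_where = np.where(data['X'] >= data['Y'], 1, 0)

print(data)
print(z_where)
```

Both options compare the columns element by element, so no explicit loop or index lookup is needed.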

Equivalent of adding a value in a new row/column to numpy that works like R's data.frame

In R I can do:
> y = c(2,3)
> x = c(4,5)
> z = data.frame(x,y)
> z[3,3]<-6
> z
x y V3
1 4 2 NA
2 5 3 NA
3 NA NA 6
R automatically fills the empty cells with NA.
If I use numpy.insert, numpy raises an error:
import numpy
y = [2,3]
x = [4,5]
z = numpy.array([y, x])
z = numpy.insert(z, 3, 6, 3)
IndexError: axis 3 is out of bounds for an array of dimension 2
Is there a way to insert values in a way that works similar to R in numpy?
numpy is more of a replacement for R's matrices, and not so much for its data frames. You should consider using python's pandas library for this. For example:
In [1]: import pandas
In [2]: y = pandas.Series([2,3])
In [3]: x = pandas.Series([4,5])
In [4]: z = pandas.DataFrame([x,y])
In [5]: z
Out[5]:
0 1
0 4 5
1 2 3
In [19]: z.loc[3,3] = 6
In [20]: z
Out[20]:
0 1 3
0 4 5 NaN
1 2 3 NaN
3 NaN NaN 6
In numpy you need to initialize an array with the appropriate size first (indices are 0-based, so R's z[3,3] becomes z[2, 2]):
z = numpy.empty((3, 3))
z.fill(numpy.nan)
z[:2, 0] = x
z[:2, 1] = y
z[2, 2] = 6
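The pre-allocation idea can be wrapped in a small helper that grows an array on demand and pads new cells with NaN, loosely imitating the R behaviour. The function name set_with_growth is hypothetical and not part of numpy; this is only a sketch for 2D float data:

```python
import numpy as np

def set_with_growth(arr, row, col, value):
    """Return a float copy of arr, enlarged if needed, with arr[row, col] = value."""
    rows = max(arr.shape[0], row + 1)
    cols = max(arr.shape[1], col + 1)
    out = np.full((rows, cols), np.nan)       # new cells start as NaN
    out[:arr.shape[0], :arr.shape[1]] = arr   # copy the existing values
    out[row, col] = value
    return out

x = [4, 5]
y = [2, 3]
z = np.column_stack([x, y])        # columns x and y, shape (2, 2)
z = set_with_growth(z, 2, 2, 6)    # 0-based equivalent of R's z[3,3] <- 6
print(z)
```

Because NaN only exists for floating-point types, the result is always a float array, even if the input held integers.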
Looking at the raised error, it is possible to understand why it occurred:
you are trying to insert values along an axis that does not exist in z.
You can fix it by doing:
import numpy as np
y = [2,3]
x = [4,5]
array = np.array([y, x])
z = np.insert(array, 1, [3,6], axis=1)
The interface is quite different from R's. If you are using IPython,
you can easily access the documentation for any numpy function, in this case
np.insert, by running:
help(np.insert)
which gives you the function signature, explains each parameter and provides
some examples.
Alternatively, you could do:
import numpy as np
x = [4,5]
y = [2,3]
array = np.array([y,x])
z = [3,6]
new_array = np.vstack([array.T, z]).T # or, as below
# new_array = np.hstack([array, np.array(z)[:, np.newaxis]])
Also, take a look at the pandas module. It provides
an interface similar to what you asked for, implemented on top of numpy.
With pandas you could do something like:
import pandas as pd
data = {'x':[4,5], 'y':[2,3]}
dataframe = pd.DataFrame(data)
dataframe['z'] = [3,6]
which gives the nice output:
x y z
0 4 2 3
1 5 3 6
If you want a more R-like experience within Python, I can highly recommend pandas, a higher-level library built on numpy that handles operations of this kind.
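To get closer to the R output from the question, where the frame grows and the empty cells become NA, one option is to reindex the DataFrame to the larger shape before assigning. This is a sketch rather than the only way to do it; the column name V3 is taken from the R output and the 0-based row label 2 stands in for R's row 3:

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [4, 5], 'y': [2, 3]})

# Grow to 3 rows and 3 columns; cells that did not exist are filled with NaN
z = z.reindex(index=[0, 1, 2], columns=['x', 'y', 'V3'])

# Set the single new value
z.loc[2, 'V3'] = 6
print(z)
```

The existing integer columns become float after reindexing, since NaN needs a floating-point dtype.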
