I want to one-hot encode the sex column in my dataframe df. I reshaped it using reshape(-1, 1) but still get the error "Expected 2D array, got 1D array instead".
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.read_csv('C:/Users/User/Downloads/suicide_rates.csv')
ohe = OneHotEncoder()
df['sex'] = ohe.fit_transform(df['sex']).toarray().reshape(-1, 1)
Traceback
ValueError: Expected 2D array, got 1D array instead: array=['male'
'male' 'female' ... 'male' 'female' 'female']. Reshape your data
either using array.reshape(-1, 1) if your data has a single feature or
array.reshape(1, -1) if it contains a single sample.
You are closing the parenthesis before actually reshaping the array: reshape(-1, 1) is applied to the encoder's output, not to the column passed to fit_transform.
The other problem is that the one-hot encoder creates several columns, which you cannot assign to a single column:
In [1]: from sklearn.preprocessing import OneHotEncoder
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"sex": list("mmmfff")})
In [4]: ohe = OneHotEncoder()
In [5]: ohe.fit_transform(df["sex"].to_numpy().reshape(-1, 1))
Out[5]:
<6x2 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [6]: _.toarray()
Out[6]:
array([[0., 1.],
[0., 1.],
[0., 1.],
[1., 0.],
[1., 0.],
[1., 0.]])
You see that we have 2 columns. If you are sure that you have only 2 values, you can use the drop parameter of OneHotEncoder; it will drop the first category, and you can assign the result to the dataframe:
In [11]: ohe_with_drop = OneHotEncoder(drop="first")
In [12]: ohe_with_drop.fit_transform(df["sex"].to_numpy().reshape(-1, 1)).toarray()
Out[12]:
array([[1.],
[1.],
[1.],
[0.],
[0.],
[0.]])
In [13]: df["sex_ohe"] = ohe_with_drop.fit_transform(df["sex"].to_numpy().reshape(-1, 1)).toarray()
In [14]: df
Out[14]:
sex sex_ohe
0 m 1.0
1 m 1.0
2 m 1.0
3 f 0.0
4 f 0.0
5 f 0.0
See the scikit-learn documentation for more about one hot encoding.
As an alternative, you can use pandas.get_dummies:
In [18]: pd.get_dummies(df["sex"])
Out[18]:
f m
0 0 1
1 0 1
2 0 1
3 1 0
4 1 0
5 1 0
See this answer for more.
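If the goal is to keep the dummy columns next to the original data, one possible sketch is to join the get_dummies output back onto the frame. The prefix "sex" and the name encoded are choices made for this illustration, and dtype=int is passed only so the result shows 0/1 rather than booleans on newer pandas versions.

```python
import pandas as pd

# Same toy frame as in the answer above
df = pd.DataFrame({"sex": list("mmmfff")})

# One indicator column per category, named sex_f and sex_m
dummies = pd.get_dummies(df["sex"], prefix="sex", dtype=int)

# Join on the index so each row keeps its own indicators
encoded = df.join(dummies)
print(encoded)
```

Because the join is on the index, this should also work when df has a non-default index.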
I have a ton of data files in the same format as described below, and I'm trying to make a colormesh plot from them:
0 0 1 2
-3 1 7 7
-2 1 2 3
-1 1 7 3
[0 1 2] in the first row are the y-axis values of the plot, and [-3 -2 -1] in the first column are the x-axis values of the same plot. The first 0 is only for spacing.
These are the numbers that I really want inside the pcolormesh:
1 7 7
1 2 3
1 7 3
I'm trying to read these values and store into a 2D matrix as:
Matrix = [[1. 7. 7.]
[1. 2. 3.]
[1. 7. 3.]]
Here is a figure illustrating it further: [figure not shown]
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
# ------------- Input Data Files ------------- #
data = np.loadtxt('my_colormesh_data.dat') # Load Data File
# ------ Transform Data into 2D Matrix ------- #
Matrix = []
n_row = 4 # Number of rows counting 0 from file #
n_column = 4 # Number of columns counting 0 from file #
x = data[range(1,n_row),0] # Read x axis values from data file and store in a variable #
y = data[0, range(1,n_column)] # Read y axis values from data file and store in a variable #
print(data)
print('\n', x) # print values of x (for checking)
print('\n', y) # print values of y (for checking)
for i in range (2, n_row):
    for j in range(2, n_column):
        print(i, j, data[i,j]) # print values of i, j and data (for checking)
        Matrix[i,j] = data[i,j]
print(Matrix)
and results in this error:
Matrix[i,j] = data[i,j]
TypeError: list indices must be integers or slices, not tuple
Could you clarify what I'm doing wrong?
Thanks in advance!
You are getting the error because Matrix is a list and you are trying to index it with a tuple, i,j. That is not a valid operation: you can index a list only with integers or slices.
Secondly, your data variable is already a 2D array. You don't have to do any further conversions.
To skip the first row and first column, you can simply use index slicing.
>>> from io import StringIO
>>> import numpy as np
>>> input_data = """0 0 1 2
... -3 1 7 7
... -2 1 2 3
... -1 1 7 3 """
>>>
>>> data = np.loadtxt(StringIO(input_data))
>>> data
array([[ 0., 0., 1., 2.],
[-3., 1., 7., 7.],
[-2., 1., 2., 3.],
[-1., 1., 7., 3.]])
>>> data[1:,1:]
array([[1., 7., 7.],
[1., 2., 3.],
[1., 7., 3.]])
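The same slicing idea also recovers the axis values from the first column and first row. A minimal sketch, using the sample data from the question inline instead of the real file; the names x, y and values are chosen here for illustration.

```python
from io import StringIO

import numpy as np

# Sample data from the question, inlined so the sketch is self-contained
raw = """0 0 1 2
-3 1 7 7
-2 1 2 3
-1 1 7 3"""
data = np.loadtxt(StringIO(raw))

x = data[1:, 0]        # first column without the spacer -> x axis values
y = data[0, 1:]        # first row without the spacer -> y axis values
values = data[1:, 1:]  # everything else -> the 3x3 block to plot

print(x)
print(y)
print(values)
```

Since the rows of values follow x and the columns follow y, a call along the lines of plt.pcolormesh(x, y, values.T, shading="nearest") would be one way to plot it, assuming matplotlib is imported as plt.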
This is what model.predict returns. How can I convert this tuple into columns of a dataframe?
(array([1., 1., 1., ..., 1., 1., 1.]), array([[0.46502338, 0.53497662],
[0.47072865, 0.52927135],
[0.4696557 , 0.5303443 ],
...,
[0.47139825, 0.52860175],
[0.46367829, 0.53632171],
[0.46586898, 0.53413102]]))
<class 'tuple'>
Neither of these is working for me:
pd.DataFrame(dict(class_pred=tuple[0], prob_0=tuple[1], prob_1=tuple[2]))
pd.DataFrame(np.column_stack(tuple),columns=['class_pred','prob_0','prob_1'])
I would like to obtain something like this:
class_pred prob_0 prob_1
1 0.470728 0.5292713
AniSkywalker's solution works perfectly.
type(data)
print(data)
tuple
(array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
array([[0.46502338, 0.53497662],
[0.47072865, 0.52927135],
[0.4696557 , 0.5303443 ],
[0.46511921, 0.53488079],
[0.46739934, 0.53260066],
[0.47387646, 0.52612354],
[0.4737461 , 0.5262539 ],
[0.47052631, 0.52947369],
[0.47658316, 0.52341684],
[0.47222654, 0.52777346]]))
df_pred = pd.DataFrame(data=dict(pred=data[0], prob_0=data[1][:,0], prob_1=data[1][:,1]))
print(df_pred)
pred prob_0 prob_1
0 1.0 0.465023 0.534977
1 1.0 0.470729 0.529271
2 1.0 0.469656 0.530344
3 1.0 0.465119 0.534881
4 1.0 0.467399 0.532601
5 1.0 0.473876 0.526124
6 1.0 0.473746 0.526254
7 1.0 0.470526 0.529474
8 1.0 0.476583 0.523417
9 1.0 0.472227 0.527773
I'm assuming your data is of the form ((n), (n, 2)) so that:
import numpy as np
n = 5
data = (np.random.rand(n), np.random.rand(n, 2))
provides a reasonable estimate of what your output looks like.
Let's say that data is:
(array([0.27856312, 0.66255123, 0.47976175, 0.59381106, 0.82096555]), array([[0.53719357, 0.55803381],
[0.5749893 , 0.09712089],
[0.91607789, 0.21579499],
[0.50163898, 0.39188127],
[0.60427654, 0.07801227]]))
Your dict method actually works with one modification:
import pandas as pd
df = pd.DataFrame(data=dict(class_pred=data[0], prob_0=data[1][:,0], prob_1=data[1][:,1]))
Notice that prob_0 and prob_1 are both derived from the second tuple element, but using Numpy's column indexing we can split the individual arrays as you described.
Let's take data[1][:,0], for example: first, we select the second element of the data tuple, which is the (n, 2) matrix. Then, we select the first column (0) from all rows (:). The result is a vector of the first element of every row in that matrix.
Using my made-up numbers, df.head() should give you:
class_pred prob_0 prob_1
0 0.278563 0.537194 0.558034
1 0.662551 0.574989 0.097121
2 0.479762 0.916078 0.215795
3 0.593811 0.501639 0.391881
4 0.820966 0.604277 0.078012
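The column_stack attempt from the question can also be made to work, assuming the prediction is bound to a name such as data rather than to the builtin tuple. A small sketch with made-up numbers (not real model output):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the (labels, probabilities) tuple
data = (
    np.array([1.0, 1.0, 1.0]),
    np.array([[0.465, 0.535],
              [0.471, 0.529],
              [0.470, 0.530]]),
)

# column_stack turns the 1-D labels into a column and appends the
# two probability columns, giving an (n, 3) array
stacked = np.column_stack(data)
df = pd.DataFrame(stacked, columns=["class_pred", "prob_0", "prob_1"])
print(df)
```

The dict approach keeps a separate dtype per column, while column_stack forces everything into one float array; which is preferable depends on what the columns are used for afterwards.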
Apologies if this has been asked before; somehow I am not able to find the answer.
Let's say I have two lists of values:
rows = [0,1,2]
cols = [0,2,3]
which represent row and column indexes, respectively. Combined, the two lists specify coordinates in the matrix, i.e. (0,0), (1,2), (2,3).
I would like to use those coordinates to change specific cells of the dataframe without using a loop.
In numpy, this is trivial:
data = np.ones((4,4))
data[rows, cols] = np.nan
array([[nan, 1., 1., 1.],
[ 1., 1., nan, 1.],
[ 1., 1., 1., nan],
[ 1., 1., 1., 1.]])
But in pandas, it seems I am stuck with a loop:
df = pd.DataFrame(np.ones((4,4)))
for _r, _c in zip(rows, cols):
df.iat[_r, _c] = np.nan
Is there a way to use two vectors that list coordinate-like indexes to directly modify cells in pandas?
Please note that the answer is not to use iloc instead, since that selects the intersection of entire rows and columns.
Very simple! Exploit the fact that pandas is built on top of numpy and use DataFrame.values:
df.values[rows, cols] = np.nan
Output:
0 1 2 3
0 NaN 1.0 1.0 1.0
1 1.0 1.0 NaN 1.0
2 1.0 1.0 1.0 NaN
3 1.0 1.0 1.0 1.0
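Writing through df.values only sticks when .values hands back a writable view of the frame's data. That is not guaranteed: with mixed dtypes it is a copy, and with copy-on-write enabled the array is read-only. A more defensive sketch, assuming a purely numeric frame, edits an explicit copy and rebuilds the DataFrame:

```python
import numpy as np
import pandas as pd

rows = [0, 1, 2]
cols = [0, 2, 3]
df = pd.DataFrame(np.ones((4, 4)))

# Work on an explicit, writable copy of the data
arr = df.to_numpy(copy=True)

# NumPy pairs the two lists up as coordinates: (0,0), (1,2), (2,3)
arr[rows, cols] = np.nan

# Rebuild the frame, keeping the original labels
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df)
```

This costs one extra copy of the data, which is usually acceptable unless the frame is very large.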
New to python here (but have experience in R, SQL).
I tried googling this, however was unable to generate new ideas.
My main purpose is to generate a matrix using my csv data, but I'd like to transpose my 2nd column into a row for the matrix. I'd then like to populate that matrix with the data in my 3rd column, but wasn't able to get anywhere.
After a couple of days, I have come up with this code :
import csv
def readcsv(csvfile_name):
    with open(csvfile_name) as csvfile:
        file=csv.reader(csvfile, delimiter=",")
        #remove rubbish data in first few rows
        skiprows = int(input('Number of rows to skip? '))
        for i in range(skiprows):
            _ = next(file)
        #change strings into integers/floats
        for z in file:
            z[:2]=map(int, z[:2])
            z[2:]=map(float, z[2:])
            print(z[:2])
    return
This just cleans up my data, however what I'd like to do is to transpose the data into a matrix. The data I have is like this (imagine x,y,z,d and other letters are floats):
1 1 x
1 2 y
1 3 z
1 4 d
. . .
. . .
However, I'd like to turn this data into a matrix like this: i.e. I'd like to populate that matrix with data in the 3rd column (letters here just to make it easier to read for you guys) and convert the 2nd column into a row for the matrix. So in essence, the first and second columns of that CSV file are co-ordinates for the matrix.
1 2 3 4 . .
1 x y z d
1 a b c u
1 e f e y
.
.
I tried learning numpy, however it appears like it requires my data to already be in a matrix form.
If you want to use numpy you've got two options depending on how your data is stored.
If it is GUARANTEED that your keys increase consistently, e.g:
THIS          NOT THIS
------        --------
1 1 a         1 1 a
1 2 b         1 3 b
1 3 c         2 1 c
1 4 d         3 1 d
2 1 e         1 2 e
2 2 f         1 4 f
2 3 g         8 8 g
2 4 h         2 2 h
Then simply take all the values in the far right column and chuck them into a flat numpy array and reshape according to the maximum values in the left and middle column.
import numpy as np
m = np.array(right_column)
# For the sake of example:
#: array([1., 2., 3., 4., 5., 6., 7., 8.])
m = m.reshape(max(left_column), max(middle_column))
#: array([[1., 2., 3., 4.],
#: [5., 6., 7., 8.]])
If it is not guaranteed, you could either sort it so that it is (probably easiest), OR create a zero array of the correct shape and cycle through each element.
# Example data
left_column = [1, 2, 1, 2, 1, 2, 1, 2]
middle_column = [1, 1, 3, 3, 2, 2, 4, 4]
right_column = [1., 5., 3., 7., 2., 6., 4., 8.]
import numpy as np
m = np.zeros((max(left_column), max(middle_column)), dtype=float)
for x, y, z in zip(left_column, middle_column, right_column):
    x -= 1 # Because the indices are 1-based
    y -= 1 # Need to be 0-based
    m[x, y] = z
print(m)
#: array([[ 1., 2., 3., 4.],
#: [ 5., 6., 7., 8.]])
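The loop over the three columns can also be written with NumPy integer-array indexing, which assigns all cells in one step. This is a sketch using the same example data; rows and cols are names introduced here.

```python
import numpy as np

# Same example data as above
left_column = [1, 2, 1, 2, 1, 2, 1, 2]
middle_column = [1, 1, 3, 3, 2, 2, 4, 4]
right_column = [1., 5., 3., 7., 2., 6., 4., 8.]

# Convert the 1-based keys to 0-based index arrays
rows = np.asarray(left_column) - 1
cols = np.asarray(middle_column) - 1

# Allocate the target and fill every (row, col) cell at once
m = np.zeros((rows.max() + 1, cols.max() + 1), dtype=float)
m[rows, cols] = right_column
print(m)
```

If a (row, col) pair occurs more than once, which of the duplicate values ends up in the cell should not be relied on.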
I am unable to find the entry on the method dot() in the official documentation. However the method is there and I can use it. Why is this?
On this topic, is there a way to compute an element-wise multiplication of every row in a data frame with another vector (and obtain a dataframe back)? I.e. similar to dot(), but rather than computing the dot product, one computes the element-wise product.
mul is essentially an element-wise product with broadcasting, while dot is an inner product. Let me expand on the accepted answer:
In [13]: df = pd.DataFrame({'A': [1., 1., 1., 2., 2., 2.], 'B': np.arange(1., 7.)})
In [14]: v1 = np.array([2,2,2,3,3,3])
In [15]: v2 = np.array([2,3])
In [16]: df.shape
Out[16]: (6, 2)
In [17]: v1.shape
Out[17]: (6,)
In [18]: v2.shape
Out[18]: (2,)
In [24]: df.mul(v2)
Out[24]:
A B
0 2 3
1 2 6
2 2 9
3 4 12
4 4 15
5 4 18
In [26]: df.dot(v2)
Out[26]:
0 5
1 8
2 11
3 16
4 19
5 22
dtype: float64
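So:
df.mul(v2) takes a matrix of shape (6,2) and a vector of shape (2,), multiplies each column by the matching vector element, and returns a matrix of shape (6,2).
While:
df.dot(v2) takes a matrix of shape (6,2) and a vector of shape (2,) and returns a vector of shape (6,).
These are not the same operation: the first is an element-wise product with broadcasting, the second an inner product.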
Here is an example of how to multiply a DataFrame by a vector:
In [60]: df = pd.DataFrame({'A': [1., 1., 1., 2., 2., 2.], 'B': np.arange(1., 7.)})
In [61]: vector = np.array([2,2,2,3,3,3])
In [62]: df.mul(vector, axis=0)
Out[62]:
A B
0 2 2
1 2 4
2 2 6
3 6 12
4 6 15
5 6 18
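To make the difference between the three calls concrete, here is a sketch that compares each of them with the equivalent plain NumPy expression. The names per_column, per_row, by_columns, by_rows and inner are introduced for this illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1., 1., 1., 2., 2., 2.], "B": np.arange(1., 7.)})
per_column = np.array([2, 3])           # one factor per column
per_row = np.array([2, 2, 2, 3, 3, 3])  # one factor per row

# Element-wise, vector aligned with the columns (the default axis)
by_columns = df.mul(per_column, axis=1)

# Element-wise, vector aligned with the rows
by_rows = df.mul(per_row, axis=0)

# Inner product of every row with the vector -> one number per row
inner = df.dot(per_column)

a = df.to_numpy()
print(np.allclose(by_columns.to_numpy(), a * per_column))
print(np.allclose(by_rows.to_numpy(), a * per_row[:, None]))
print(np.allclose(inner.to_numpy(), a @ per_column))
```

Both mul calls return a DataFrame with the original shape and labels, while dot collapses the columns and returns a Series.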
It's quite hard to say with a degree of accuracy.
Often, a method exists and is undocumented because it's considered internal by the vendor, and may be subject to change.
It could, of course, be a simple oversight by the folks who put together the documentation.
Regarding your second question: I don't really know about that, but it might be better to make a new Stack Overflow question for it.
Just scanning the API, could you do something with the DataFrame's .applymap(function) feature?