Say I have a 1D array:
import numpy as np
my_array = np.arange(0,10)
my_array.shape
(10, )
In Pandas I would like to create a DataFrame with only one row and 10 columns using this array. FOr example:
import pandas as pd
import random, string
# Random list of characters to be used as columns
cols = [random.choice(string.ascii_uppercase) for x in range(10)]
But when I try:
pd.DataFrame(my_array, columns = cols)
I get:
ValueError: Shape of passed values is (1,10), indices imply (10,10)
I presume this is because Pandas expects a 2D array, and I have a (flat) 1D array. Is there a way to inflate my 1D array into a 2D array or have Panda use a 1D array in the creation of the dataframe?
Note: I am using the latest stable version of Pandas (0.11.0)
Your value array has length 9, (values from 1 till 9), and your cols list has length 10.
I dont understand your error message, based on your code, i get:
ValueError: Shape of passed values is (1, 9), indices imply (10, 9)
Which makes sense.
Try:
my_array = np.arange(10).reshape(1,10)
cols = [random.choice(string.ascii_uppercase) for x in range(10)]
pd.DataFrame(my_array, columns=cols)
Which results in:
F H L N M X B R S N
0 0 1 2 3 4 5 6 7 8 9
Either these should do it:
my_array2 = my_array[None] # same as myarray2 = my_array[numpy.newaxis]
or
my_array2 = my_array.reshape((1,10))
A single-row, many-columned DataFrame is unusual. A more natural, idiomatic choice would be a Series indexed by what you call cols:
pd.Series(my_array, index=cols)
But, to answer your question, the DataFrame constructor is assuming that my_array is a column of 10 data points. Try DataFrame(my_array.reshape((1, 10)), columns=cols). That works for me.
By using one of the alternate DataFrame constructors it is possible to create a DataFrame without needing to reshape my_array.
import numpy as np
import pandas as pd
import random, string
my_array = np.arange(0,10)
cols = [random.choice(string.ascii_uppercase) for x in range(10)]
pd.DataFrame.from_records([my_array], columns=cols)
Out[22]:
H H P Q C A G N T W
0 0 1 2 3 4 5 6 7 8 9
Related
I have a dataframe with 20 rows and 500000 columns. Each row is a unique model consisting of 500000 numbers (columns). Therefore, we have 20 unique models. I want to convert this dataframe to a dataframe with only one column as "values", and the rows should consists of 20 * 500000 rows stacked on top of each other, such that the first 500000 rows should belong to the 500000 numbers of the first model, followed by the 500000 numbers of the second model, and so on. I used pd.melt() but that is not what I am looking for, as it does not put them in order of the models.
import pandas as pd
import numpy as np
my_df = pd.DataFrame(np.random.randint(0,100,size=(20, 500000)))
#reshaped_my_df = pd.melt(my_df)
#Update: I used the df.stack() and it worked.
df_stacked = my_df.stack().reset_index()
All you need to do is reshape the underlying numpy array and recreate the dataframe:
import pandas as pd
import numpy as np
my_df = pd.DataFrame(np.random.randint(0,100,size=(20, 500000)))
reshaped_my_df = pd.DataFrame(my_df.values.T.reshape((-1, 1)), columns=['values'])
The -1 in the reshape arguments stands for "make this axis whatever size needed to make the reshaping work". Since your code produces a dataframe that can't be easily visualized, here's a more readable example. This should clear any confusion over what reshape is doing
>>> df = pd.DataFrame({'A':['a']*5, 'B':['b']*5,'C':['c']*5,})
>>> df.shape
(5, 3)
>>> df
A B C
0 a b c
1 a b c
2 a b c
3 a b c
4 a b c
>>> reshaped = pd.DataFrame(df.values.T.reshape((-1, 1)), columns=['values'])
>>> reshaped.shape
(15, 1)
>>> reshaped
values
0 a
1 a
2 a
3 a
4 a
5 b
6 b
7 b
8 b
9 b
10 c
11 c
12 c
13 c
14 c
So I want to multiply each row of a dataframe with a multiplier vector, and I am managing, but it looks ugly. Can this be improved?
import pandas as pd
import numpy as np
# original data
df_a = pd.DataFrame([[1,2,3],[4,5,6]])
print(df_a, '\n')
# multiplier vector
df_b = pd.DataFrame([2,2,1])
print(df_b, '\n')
# multiply by a list - it works
df_c = df_a*[2,2,1]
print(df_c, '\n')
# multiply by the dataframe - it works
df_c = df_a*df_b.T.to_numpy()
print(df_c, '\n')
"It looks ugly" is subjective, that said, if you want to multiply all rows of a dataframe with something else you either need:
a dataframe of a compatible shape (and compatible indices, as those are aligned before operations in pandas, which is why df_a*df_b.T would only work for the common index: 0)
a 1D vector, which in pandas is a Series
Using a Series:
df_a*df_b[0]
output:
0 1 2
0 2 4 3
1 8 10 6
Of course, better define a Series directly if you don't really need a 2D container:
s = pd.Series([2,2,1])
df_a*s
Just for the beauty, you can use Einstein summation:
>>> np.einsum('ij,ji->ij', df_a, df_b)
array([[ 2, 4, 3],
[ 8, 10, 6]])
Question
Is there a good way to transform a DataFrame with an n-level index into an n-D Numpy array (a.k.a n-tensor)?
Example
Suppose I set up a DataFrame like
from pandas import DataFrame, MultiIndex
index = range(2), range(3)
value = range(2 * 3)
frame = DataFrame(value, columns=['value'],
index=MultiIndex.from_product(index)).drop((1, 0))
print frame
which outputs
value
0 0 0
1 1
2 3
1 1 5
2 6
The index is a 2-level hierarchical index. I can extract a 2-D Numpy array from the data using
print frame.unstack().values
which outputs
[[ 0. 1. 2.]
[ nan 4. 5.]]
How does this generalize to an n-level index?
Playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, but not to add an axis.
I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.
Any suggestions are highly appreciated.
Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.
# create an empty array of NaN of the right dimensions
shape = map(len, frame.index.levels)
arr = np.full(shape, np.nan)
# fill it using Numpy's advanced indexing
arr[frame.index.codes] = frame.values.flat
# ...or in Pandas < 0.24.0, use
# arr[frame.index.labels] = frame.values.flat
Original solution. Given a setup similar to above, but in 3-D,
from pandas import DataFrame, MultiIndex
from itertools import product
index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)
we have
value
0 0 0 0
1 1
1 0 2
1 3
1 0 0 4
1 0 6
1 7
Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.
First, reindex the data frame with the full cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be both slow and consume a lot of memory, depending on the number of dimensions and on the size of the data frame.
levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)
which outputs
value
0 0 0 0
1 1
1 0 2
1 3
1 0 0 4
1 NaN
1 0 6
1 7
Now, reshape() will work as intended.
shape = map(len, frame.index.levels)
print(frame.values.reshape(shape))
which outputs
[[[ 0. 1.]
[ 2. 3.]]
[[ 4. nan]
[ 6. 7.]]]
The (rather ugly) one-liner is
frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
.reshape(map(len, frame.index.levels))
This can be done quite nicely using the Python xarray package which can be found here: http://xarray.pydata.org/en/stable/. It has great integration with Pandas and is quite intuitive once you get to grips with it.
If you have a multiindex series you can call the built-in method multiindex_series.to_xarray() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_xarray.html). This will generate a DataArray object, which is essentially a name-indexed numpy array, using the index values and names as coordinates. Following this you can call .values on the DataArray object to get the underlying numpy array.
If you need your tensor to conform to a set of keys in a specific order, you can also call .reindex(index_name = index_values_in_order) (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.reindex.html) on the DataArray. This can be extremely useful and makes working with the newly generated tensor much easier!
In R I can do:
> y = c(2,3)
> x = c(4,5)
> z = data.frame(x,y)
> z[3,3]<-6
> z
x y V3
1 4 2 NA
2 5 3 NA
3 NA NA 6
R automatically fills the empty cells with NA.
If I use numpy.insert from numpy, numpy throws by default an error:
import numpy
y = [2,3]
x = [4,5]
z = numpy.array([y, x])
z = numpy.insert(z, 3, 6, 3)
IndexError: axis 3 is out of bounds for an array of dimension 2
Is there a way to insert values in a way that works similar to R in numpy?
numpy is more of a replacement for R's matrices, and not so much for its data frames. You should consider using python's pandas library for this. For example:
In [1]: import pandas
In [2]: y = pandas.Series([2,3])
In [3]: x = pandas.Series([4,5])
In [4]: z = pandas.DataFrame([x,y])
In [5]: z
Out[5]:
0 1
0 4 5
1 2 3
In [19]: z.loc[3,3] = 6
In [20]: z
Out[20]:
0 1 3
0 4 5 NaN
1 2 3 NaN
3 NaN NaN 6
In numpy you need to initialize an array with the appropriate size:
z = numpy.empty(3, 3)
z.fill(numpy.nan)
z[:2, 0] = x
z[:2, 1] = z
z[3,3] = 6
Looking at the raised error is possible to understand why it occurred:
you are trying to insert values in an axes non existent in z.
you can fix it doing
import numpy as np
y = [2,3]
x = [4,5]
array = np.array([y, x])
z = np.insert(array, 1, [3,6], axis=1))
The interface is quite different from the R's one. If you are using IPython,
you can easily access the documentation for some numpy function, in this case
np.insert, doing:
help(np.insert)
which gives you the function signature, explain each parameter used to call it and provide
some examples.
you could, alternatively do
import numpy as np
x = [4,5]
y = [2,3]
array = np.array([y,x])
z = [3,6]
new_array = np.vstack([array.T, z]).T # or, as below
# new_array = np.hstack([array, z[:, np.newaxis])
Also, give a look at the Pandas module. It provides
an interface similar to what you asked, implemented with numpy.
With pandas you could do something like:
import pandas as pd
data = {'y':[2,3], 'x':[4,5]}
dataframe = pd.DataFrame(data)
dataframe['z'] = [3,6]
which gives the nice output:
x y z
0 4 2 3
1 5 3 5
If you want a more R-like experience within python, I can highly recommend pandas, which is a higher-level numpy based library, which performs operations of this kind.
I have a large csv file ~90k rows and 355 columns. The first 354 columns correspond to the presence of different words, showing a 1 or 0 and the last column to a numerical value.
Eg:
table, box, cups, glasses, total
1,0,0,1,30
0,1,1,1,28
1,1,0,1,55
When I use:
d = np.recfromcsv('clean.csv', dtype=None, delimiter=',', names=True)
d.shape
# I get: (89460,)
So my question is:
How do I get a 2d array/matrix? Does it matter?
How can I separate the 'total' column so I can create train,
cross_validation and test sets and train a model?
np.recfromcsv returns a 1-dimensional record array.
When you have a structured array, you can access the columns by field title:
d['total']
returns the totals column.
You can access rows using integer indexing:
d[0]
returns the first row, for example.
If you wish to select all the columns except the last row, then you'd be better off using a 2D plain NumPy array. With a plain NumPy array (as opposed to a structured array) you can select all the rows except the last on using integer indexing:
You could use np.genfromtxt to load the data into a 2D array:
import numpy as np
d = np.genfromtxt('data', dtype=None, delimiter=',', skiprows=1)
print(d.shape)
# (3, 5)
print(d)
# [[ 1 0 0 1 30]
# [ 0 1 1 1 28]
# [ 1 1 0 1 55]]
This select the last column:
print(d[:,-1])
# [30 28 55]
This select everything but the last column:
print(d[:,:-1])
# [[1 0 0 1]
# [0 1 1 1]
# [1 1 0 1]]
Ok after much googling and time wasting this is what anyone trying to get numpy out of the way so they can read a CSV and get on with Scikit Learn needs to do:
# Say your csv file has 10 columns, 1-9 are features and 10
# is the Y you're trying to predict.
cols = range(0,10)
X = np.loadtxt('data.csv', delimiter=',', dtype=float, usecols=cols, ndmin=2, skiprows=1)
Y = np.loadtxt('data.csv', delimiter=',', dtype=float, usecols=(9,), ndmin=2, skiprows=1)
# note how for Y the usecols argument only takes a sequence,
# even though I only want 1 column I have to give it a sequence.