I have a numpy array of dtype=object whose elements are actually lists of various data types, so it effectively acts as a 2D array (an array of lists). I want to copy every row but only certain columns of this array to another array. The data comes from a CSV file with several fields (columns) and a large number of rows. Here's the code chunk I used to store the data in the array.
data = np.zeros((401125,), dtype=object)
for i, row in enumerate(csv_file_object):
    data[i] = row
data can basically be depicted as follows:
column1  column2  column3  column4  column5  ...
1        none     2        'gona'   5.3
2        34       2        'gina'   5.5
3        none     2        'gana'   5.1
4        43       2        'gena'   5.0
5        none     2        'guna'   5.7
...      ...      ...      ...      ...
There are unwanted fields (columns) in the middle that I want to remove. Suppose I don't want column3.
How do I remove only that column from my array, or copy only the relevant columns to another array?
Use pandas. For mixed-type data like yours, a pandas.DataFrame seems a better fit than a numpy object array.
from StringIO import StringIO
from pandas import *
import numpy as np

data = """column1 column2 column3 column4 column5
1 none 2 'gona' 5.3
2 34 2 'gina' 5.5
3 none 2 'gana' 5.1
4 43 2 'gena' 5.0
5 none 2 'guna' 5.7"""

data = StringIO(data)
df = read_csv(data, delim_whitespace=True).drop('column3', axis=1)
print df
out:
column1 column2 column4 column5
0 1 none 'gona' 5.3
1 2 34 'gina' 5.5
2 3 none 'gana' 5.1
3 4 43 'gena' 5.0
4 5 none 'guna' 5.7
If you need an array instead of a DataFrame, use the to_records() method:
df.to_records(index=False)
#output:
rec.array([(1L, 'none', "'gona'", 5.3),
           (2L, '34', "'gina'", 5.5),
           (3L, 'none', "'gana'", 5.1),
           (4L, '43', "'gena'", 5.0),
           (5L, 'none', "'guna'", 5.7)],
          dtype=[('column1', '<i8'), ('column2', '|O4'),
                 ('column4', '|O4'), ('column5', '<f8')])
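A side note, assuming you know the wanted columns up front: read_csv can skip column3 while parsing via usecols, so the drop step is not needed (a small sketch).
# usecols keeps only the named columns while parsing; column3 is never loaded.
# Note: `data` must be a fresh StringIO (or a file path), since the one above
# was already consumed by the first read_csv call.
df = read_csv(data, delim_whitespace=True,
              usecols=['column1', 'column2', 'column4', 'column5'])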
Assuming you're reading the CSV rows and sticking them into a numpy array, the easiest and best solution is almost definitely preprocessing the data before it gets to the array, as Maciek D.'s answer shows. (If you want to do something more complicated than "remove column 3" you might want something like [value for i, value in enumerate(row) if i not in (1, 3, 5)], but the idea is still the same.)
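For instance, a minimal sketch of that preprocessing approach, dropping columns 1, 3 and 5 (indices chosen only for illustration) while the array is being filled:
drop_cols = (1, 3, 5)
data = np.zeros((401125,), dtype=object)
for i, row in enumerate(csv_file_object):
    # keep every field whose position is not in drop_cols
    data[i] = [value for j, value in enumerate(row) if j not in drop_cols]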
However, if you've already imported the array and you want to filter it after the fact, you probably want take or delete:
>>> d = np.array([[1,None,2,'gona',5.3],[2,34,2,'gina',5.5],[3,None,2,'gana',5.1],[4,43,2,'gena',5.0],[5,None,2,'guna',5.7]])
>>> np.delete(d, 2, 1)
array([[1, None, gona, 5.3],
       [2, 34, gina, 5.5],
       [3, None, gana, 5.1],
       [4, 43, gena, 5.0],
       [5, None, guna, 5.7]], dtype=object)
>>> np.take(d, [0, 1, 3, 4], 1)
array([[1, None, gona, 5.3],
       [2, 34, gina, 5.5],
       [3, None, gana, 5.1],
       [4, 43, gena, 5.0],
       [5, None, guna, 5.7]], dtype=object)
For the simple case of "remove column 3", delete makes more sense; for a more complicated case, take probably makes more sense.
If you haven't yet worked out how to import the data in the first place, you could either use the built-in csv module and something like Maciek D.'s code and process as you go, or use something like pandas.read_csv and post-process the result, as root's answer shows.
But it might be better to use a native numpy data format in the first place instead of CSV.
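For example, a small sketch of the native-format route (the filename is illustrative): np.save writes a binary .npy file that round-trips the object dtype, and loading an object array requires allow_pickle.
np.save('data.npy', d)                       # binary .npy, keeps dtype=object
d2 = np.load('data.npy', allow_pickle=True)  # object arrays need allow_pickle on load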
You can use slicing. E.g., to remove column3, you can use:
data = np.zeros((401125,), dtype=object)
for i, row in enumerate(csv_file_object):
    data[i] = row[:2] + row[3:]
This will work, assuming that csv_file_object yields lists. If it is, e.g., a plain file object created with csv_file_object = open("file.csv"), add a split in your loop:
data = np.zeros((401125,), dtype=object)
for i, row in enumerate(csv_file_object):
    row = row.split()
    data[i] = row[:2] + row[3:]
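If the file is a real comma-separated CSV, the standard csv module yields lists directly, so the first variant applies unchanged (a sketch; the filename is illustrative):
import csv
import numpy as np

data = np.zeros((401125,), dtype=object)
with open("file.csv") as f:
    for i, row in enumerate(csv.reader(f)):   # csv.reader yields each row as a list
        data[i] = row[:2] + row[3:]           # drop the third column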
Related
I want to multiply each row of a dataframe by a multiplier vector, and I am managing, but it looks ugly. Can this be improved?
import pandas as pd
import numpy as np
# original data
df_a = pd.DataFrame([[1,2,3],[4,5,6]])
print(df_a, '\n')
# multiplier vector
df_b = pd.DataFrame([2,2,1])
print(df_b, '\n')
# multiply by a list - it works
df_c = df_a*[2,2,1]
print(df_c, '\n')
# multiply by the dataframe - it works
df_c = df_a*df_b.T.to_numpy()
print(df_c, '\n')
"It looks ugly" is subjective, that said, if you want to multiply all rows of a dataframe with something else you either need:
a dataframe of a compatible shape (and compatible indices, as those are aligned before operations in pandas, which is why df_a*df_b.T would only work for the common index: 0)
a 1D vector, which in pandas is a Series
Using a Series:
df_a*df_b[0]
output:
   0   1  2
0  2   4  3
1  8  10  6
Of course, it is better to define a Series directly if you don't really need a 2D container:
s = pd.Series([2,2,1])
df_a*s
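The same row-wise broadcast can also be written with the explicit method form, which makes the axis visible (just another way to spell df_a*s):
df_a.mul(s, axis=1)   # multiply each row by s, matching s against the columns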
Just for the elegance of it, you can use Einstein summation:
>>> np.einsum('ij,ji->ij', df_a, df_b)
array([[ 2,  4,  3],
       [ 8, 10,  6]])
I have a text file something like this:
0 0 0 1 2
0 0 1 3 1
0 1 0 4 1
0 1 1 2 3
1 0 0 5 3
1 0 1 1 3
1 1 0 4 5
1 1 1 6 1
Let label these columns as:
s1 a s2 r t
I also have another array with dummy values (for simplicity)
>>> V = np.array([10.,20.])
I want to do certain calculations on these numbers with good performance. The calculation I want to perform is: for each s1, the maximum over a of sum(t * (r + V[s1])).
For example,
for s1=0, a=0, we will have sum = 2*(1+10)+1*(3+10) = 35
for s1=0, a=1, we will have sum = 1*(4+10)+3*(2+10) = 50
So max of this is 50, which is what I want to obtain as an output for s1=0.
Also, note that in the above calculation, 10 is V[s1].
If I didn't have the last three lines in the file, then for s1=1 I would simply return 3*(5+20) = 75, where 20 is V[s1]. So the final desired result would be [50, 75].
So I thought it would be good to have numpy load it as follows (considering values only for s1=0, for simplicity):
>>> c1=[[ [ [0,1,2],[1,3,1] ],[ [0,4,1],[1,2,3] ] ]]
>>> import numpy as np
>>> c1arr = np.array(c1)
>>> c1arr #when I actually load from file, its not loading as this (check Q2 below)
array([[[[0, 1, 2],
         [1, 3, 1]],
        [[0, 4, 1],
         [1, 2, 3]]]])
>>> np.sum(c1arr[0,0][:,2]*(c1arr[0,0][:,1]+V)) #sum over t*(r+V)
45.0
Q1. I cannot work out how to modify the above to get the numpy array [45.0, 80.0], so that I can then take numpy.max over it.
Q2. When I actually load the file, I am not able to load it as c1arr, as stated in the comment above. Instead, I get it as follows:
>>> type(a) #a is populated by parsing file
<class 'list'>
>>> print(a)
[[[[0, -0.9, 0.3], [1, 0.9, 0.6]], [[0, -0.2, 0.6], [1, 0.7, 0.3]]], [[[1, 0.2, 1.0]], [[0, -0.8, 1.0]]]]
>>> np.array(a) #note that this is not same as c1arr above
<string>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
array([[list([[0, -0.9, 0.3], [1, 0.9, 0.6]]),
        list([[0, -0.2, 0.6], [1, 0.7, 0.3]])],
       [list([[1, 0.2, 1.0]]),
        list([[0, -0.8, 1.0]])]], dtype=object)
How can I fix this?
Q3. Is there an overall better approach, say by laying out the numpy array differently? (Given that I am not allowed to use pandas, only numpy.)
In my opinion, the most intuitive and maintainable approach is to use Pandas, where you can assign names to columns.
Another important factor is that grouping is much easier in Pandas.
As your input sample contains only integers, I defined V
also as an array of integers:
V = np.array([10, 20])
I read your input file as follows:
df = pd.read_csv('Input.txt', sep=' ', names=['s1', 'a', 's2', 'r', 't'])
(print it to see what has been read).
Then, to get results for each combination of s1 and a,
you can run:
result = df.groupby(['s1', 'a']).apply(lambda grp:
    (grp.t * (grp.r + V[grp.s1])).sum())
Note that as you refer to named columns, this code is easy to read.
The result is:
s1  a
0   0     35
    1     50
1   0    138
    1    146
dtype: int64
Each result is an integer because V is an array of int type. But if you define V just as in your post (an array of float), the result will also be of float type (your choice).
If you want the max result for each s1, run:
result.max(level=0)
This time the result is:
s1
0 50
1 146
dtype: int64
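(A side note, an assumption about newer pandas versions: the level argument of Series.max has since been removed, so on recent versions the equivalent is a groupby.)
result.groupby(level='s1').max()   # same per-s1 maxima on newer pandas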
The Numpy version
If you really are restricted to Numpy, there is also a solution, although it is more difficult to read and update.
Read your input file:
data = np.genfromtxt('Input.txt')
Initially I tried int type, just like in the pandasonic solution,
but one of your comments states that 2 rightmost columns are float.
So, because Numpy arrays must be of a single type, the whole
array must be of float type.
Run the following code:
res = []
# First level grouping - by "s1" (column 0)
for s1 in np.unique(data[:,0]).astype(int):
    dat1 = data[np.where(data[:,0] == s1)]
    res2 = []
    # Second level grouping - by "a" (column 1)
    for a in np.unique(dat1[:,1]):
        dat2 = dat1[np.where(dat1[:,1] == a)]
        # t - column 4, r - column 3
        res2.append((dat2[:,4] * (dat2[:,3] + V[s1])).sum())
    res.append([s1, max(res2)])
result = np.array(res)
The result (a Numpy array) is:
array([[  0.,  50.],
       [  1., 146.]])
The left column contains the s1 values and the right column the maximum group values from the second-level grouping.
The Numpy version with a structured array
Actually, you can also use a Numpy structured array.
Then the code is at least more readable, because you refer to column names,
not to column numbers.
Read the array passing dtype with column names and types:
data = np.genfromtxt(io.StringIO(txt), dtype=[('s1', '<i4'),
('a', '<i4'), ('s2', '<i4'), ('r', '<f8'), ('t', '<f8')])
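(A side note, assuming the same Input.txt as before: genfromtxt can also read the file directly instead of going through io.StringIO.)
data = np.genfromtxt('Input.txt',
                     dtype=[('s1', '<i4'), ('a', '<i4'), ('s2', '<i4'),
                            ('r', '<f8'), ('t', '<f8')])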
Then run:
res = []
# First level grouping - by "s1"
for s1 in np.unique(data['s1']):
    dat1 = data[np.where(data['s1'] == s1)]
    res2 = []
    # Second level grouping - by "a"
    for a in np.unique(dat1['a']):
        dat2 = dat1[np.where(dat1['a'] == a)]
        res2.append((dat2['t'] * (dat2['r'] + V[s1])).sum())
    res.append([s1, max(res2)])
result = np.array(res)
I'm using Python pandas to organize some measurement values in a DataFrame.
One of the columns contains a value which I want to convert into a 2D vector, so let's say the column contains these values:
col1
25
12
14
21
I want to change the values of this column one by one (in a for loop):
for value in values:
    df['col1'][value] = convert2Vector(df['col1'][value])
So that the column col1 becomes:
col1
[-1. 21.]
[-1. -2.]
[-15. 54.]
[11. 2.]
The values are only examples, and the function convert2Vector() converts the angle to a 2D vector.
The for loop that I wrote doesn't work; I get the error:
ValueError: setting an array element with a sequence.
Which I can understand.
So the question is: how do I do it?
That exception comes from the fact that you want to insert a list or array in a column (array) that stores ints. And arrays in Pandas and NumPy can't have a "ragged shape" so you can't have 2 elements in one row and 1 element in all the others (except maybe with masking).
To make it work you need to store "general" objects. For example:
import pandas as pd
df = pd.DataFrame({'col1' : [25, 12, 14, 21]})
df.col1[0] = [1, 2]
# ValueError: setting an array element with a sequence.
But this works:
>>> df.col1 = df.col1.astype(object)
>>> df.col1[0] = [1, 2]
>>> df
col1
0 [1, 2]
1 12
2 14
3 21
Note: I wouldn't recommend doing that, because object columns are much slower than specifically typed columns. But since you're iterating over the column with a for loop, it seems you don't need the performance, so you can also use an object array.
What you should do if you want it fast is vectorize the convert2Vector function and assign the result to two columns:
import pandas as pd
import numpy as np

def convert2Vector(angle):
    """I don't know what your function does, so this is just something that
    calculates the sin and cos of the input..."""
    ret = np.zeros((angle.size, 2), dtype=float)
    ret[:, 0] = np.sin(angle)
    ret[:, 1] = np.cos(angle)
    return ret
>>> df = pd.DataFrame({'col1' : [25, 12, 14, 21]})
>>> df['col2'] = [0]*len(df)
>>> df[['col1', 'col2']] = convert2Vector(df.col1)
>>> df
col1 col2
0 -0.132352 0.991203
1 -0.536573 0.843854
2 0.990607 0.136737
3 0.836656 -0.547729
You should call a higher-order method like df.apply or df.transform, which creates a new column that you then assign back:
In [1022]: df.col1.apply(lambda x: [x, x // 2])
Out[1022]:
0 [25, 12]
1 [12, 6]
2 [14, 7]
3 [21, 10]
Name: col1, dtype: object
In your case, you would do:
df['col1'] = df.col1.apply(convert2Vector)
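If you later need the two vector components as separate numeric columns, the list column can be expanded into a small DataFrame (a sketch; the names 'x' and 'y' are just placeholders):
vec = pd.DataFrame(df['col1'].tolist(), index=df.index, columns=['x', 'y'])
df = df.join(vec)   # adds the two component columns alongside col1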
I have a large dataframe where I store various metadata in a multiindex (see also here).
Essentially my dataframe looks like this:
location    zero    A      B      C      and so on
type        zero    MUR    RHE    DUJ    RHE    RHE
name        zero    foo    bar    baz    boo    far
1930-03-01     0    2.1    3.4    9.4    5.4    5.5
1930-04-01     0    3.1    3.6    7.3    6.7    9.5
1930-05-01     0    2.5    9.1    8.0    1.1    8.1
and so on
So I can easily select, for example, all DUJ data types with mydf.xs('DUJ', level='type', axis=1).
But how can I access the strings in the type index, eliminate duplicates, and maybe get some statistics?
I am looking for an output like
types('MUR', 'RHE', 'DUJ')
and/or
types:
DUJ 1
MUR 1
RHE 3
giving me a list of the datatypes and how often they occur.
I can access the index with
[In]mytypes = mydf.columns.get_level_values(1)
[In]mytypes
[Out]Index([u'zero', u'MUR', u'RHE', u'DUJ', u'RHE', u'RHE'], dtype='object')
but I can't think of any easy way to do something with this information, especially considering that my real dataset will return 1500 entries. My first idea was a simple mytypes.sort(), but apparently I cannot sort an 'Index' object.
Being able to describe your dataset seems like a rather important thing to me, so I would expect that there is something built into pandas, but I can't seem to find it. The MultiIndex documentation seems only to be concerned with constructing and setting indexes, not with analyzing them.
Index objects have a method for this, value_counts, so you can just call:
mytypes.value_counts()
This returns a Series with the distinct index values as its index and their counts as the values.
Example from your linked question:
In [3]:
header = [np.array(['location','location','location','location2','location2','location2']),
          np.array(['S1','S2','S3','S1','S2','S3'])]
df = pd.DataFrame(np.random.randn(5, 6), index=['a','b','c','d','e'], columns=header)
df.columns
df.columns
Out[3]:
MultiIndex(levels=[['location', 'location2'], ['S1', 'S2', 'S3']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
In [4]:
df.columns.get_level_values(1).value_counts()
Out[4]:
S1 2
S2 2
S3 2
dtype: int64
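For the other output the question asks about (just the distinct type labels), unique on the same level gives that directly:
df.columns.get_level_values(1).unique()   # e.g. Index(['S1', 'S2', 'S3'], dtype='object')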
I am trying to convert categorical data into binary so that I can classify it with an algorithm like logistic regression. I thought of using OneHotEncoder from the sklearn.preprocessing module, but the problem is that the dataframe entries are A, B pairs of arrays with different lengths: each row holds a pair of same-length arrays, but that length differs from row to row.
OneHotEncoder does not accept a dataframe like mine:
In [34]: data.index
Out[34]: Index([train1, train2, train3, ..., train7829, train7830,
train7831], dtype=object)
In [35]: data.columns
Out[35]: Index([A, B], dtype=object)
SampleID    A                                        B
train1      [2092.0, 1143.0, 390.0, ...]             [5651.0, 4449.0, 4012.0, ...]
train2      [3158.0, 3158.0, 3684.0, 3684.0, ...]    [2.0, 4.0, 2.0, 1.0, ...]
train3      [1699.0, 1808.0, ...]                    [0.0, 1.0, ...]
So, I want to highlight again that each A and B pair has the same length, but the length varies across different pairs. The dataframe contains numerical, categorical and binary values.
I have another csv file with information about every entry type. I read the file and filter out the categorical entries in both columns like this:
info=data_io.read_train_info()
col1=info.columns[0]
col2=info.columns[1]
info=info[(info[col1]=='Categorical')&(info[col2]=='Categorical')]
Then I use info.index to filter my training dataframe
filtered = data.loc[info.index]
Then I wrote a utility function to change the dimensions of each array so that I can encode them later:
def setDim(df):
    for item in df[df.columns[0]].index:
        df[df.columns[0]][item].shape = (1, df[df.columns[0]][item].shape[0])
        df[df.columns[1]][item].shape = (1, df[df.columns[1]][item].shape[0])

setDim(filtered)
Then I thought of combining each pair of arrays into a 2-row matrix so that I can pass it to the encoder, and then separating them again after encoding, like this:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def makeSparse(df):
    enc = OneHotEncoder()
    for i in df.index:
        cd = np.append(df['A'][i], df['B'][i], axis=0)
        a = enc.fit_transform(cd)
        df['A'][i] = a[0, :]
        df['B'][i] = a[1, :]

makeSparse(filtered)
After all these steps I get a sparse dataframe. My questions are:
- Is this the right way to encode this dataframe? (I highly doubt it.)
- If not, what alternatives do you suggest?
Thanks a lot for taking the time to help me.
Here is a nice way to transform your data into a representation that is easier to deal with; it uses some neat apply tricks:
In [72]: df
Out[72]:
                              A                  B
train1        [2092, 1143, 390]  [5651, 449, 4012]
train2 [3158, 3158, 3684, 3684]       [2, 4, 2, 1]
train3             [1699, 1808]             [0, 1]
In [73]: concat(dict([ (x[0],x[1].apply(lambda y: Series(y))) for x in df.iterrows() ]))
Out[73]:
             0     1     2     3
train1 A  2092  1143   390   NaN
       B  5651   449  4012   NaN
train2 A  3158  3158  3684  3684
       B     2     4     2     1
train3 A  1699  1808   NaN   NaN
       B     0     1   NaN   NaN
Some 9 years later, having been redirected to this thread from the official Pandas docs (namely the cookbook), I came up with a probably even neater implementation of the transformation from the most upvoted answer.
To go from this:
                              A                  B
train1        [2092, 1143, 390]  [5651, 449, 4012]
train2 [3158, 3158, 3684, 3684]       [2, 4, 2, 1]
train3             [1699, 1808]             [0, 1]
To this:
               0       1       2       3
train1 A  2092.0  1143.0   390.0     NaN
       B  5651.0   449.0  4012.0     NaN
train2 A  3158.0  3158.0  3684.0  3684.0
       B     2.0     4.0     2.0     1.0
train3 A  1699.0  1808.0     NaN     NaN
       B     0.0     1.0     NaN     NaN
...one can simply use:
df.transpose().unstack().apply(pd.Series)
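For readability, the same chain broken into steps (a small sketch using the df shown above):
stacked = df.transpose().unstack()    # MultiIndex Series: (trainN, 'A'/'B') -> list
expanded = stacked.apply(pd.Series)   # each list expands into its own numbered columns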