Getting and analyzing the index columns of a dataframe - python

I have a large dataframe where I store various metadata in a multiindex (see also here).
Essentially my dataframe looks like this:
location zero A B C and so on
type zero MUR RHE DUJ RHE RHE
name zero foo bar baz boo far
1930-03-01 0 2.1 3.4 9.4 5.4 5.5
1930-04-01 0 3.1 3.6 7.3 6.7 9.5
1930-05-01 0 2.5 9.1 8.0 1.1 8.1
and so on
This lets me easily select, for example, all DUJ datatypes with mydf.xs('DUJ', level = 'type', axis = 1).
But how can I access the strings in the type index, eliminate duplicates, and maybe get some statistics?
I am looking for an output like
types('MUR', 'RHE', 'DUJ')
and/or
types:
DUJ 1
MUR 1
RHE 3
giving me a list of the datatypes and how often they occur.
I can access the index with
[In]mytypes = mydf.columns.get_level_values(1)
[In]mytypes
[Out]Index([u'zero', u'MUR', u'RHE', u'DUJ', u'RHE', u'RHE'], dtype='object')
but I can't think of any easy way to do something with this information, especially considering that my real dataset will return 1500 entries. My first idea was a simple mytypes.sort(), but apparently I cannot sort an Index object.
Being able to describe your dataset seems like a rather important thing to me, so I would expect that there is something built into pandas, but I can't seem to find it. And the MultiIndex documentation seems only to be concerned with constructing and setting indexes, not with analyzing them.

Index objects have a value_counts method for exactly this, so you can just call:
mytypes.value_counts()
This will return a Series with the distinct index values as its index and their counts as its values.
Example from your linked question:
In [3]:
header = [np.array(['location','location','location','location2','location2','location2']),
np.array(['S1','S2','S3','S1','S2','S3'])]
df = pd.DataFrame(np.random.randn(5, 6), index=['a','b','c','d','e'], columns = header )
df.columns
Out[3]:
MultiIndex(levels=[['location', 'location2'], ['S1', 'S2', 'S3']],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
In [4]:
df.columns.get_level_values(1).value_counts()
Out[4]:
S1 2
S2 2
S3 2
dtype: int64
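Applied directly to the question's setup, a minimal sketch could look like the following (the location and name labels here are invented to mirror the layout shown above):
import numpy as np
import pandas as pd

# Invented labels mirroring the question's layout: (location, type, name)
columns = pd.MultiIndex.from_tuples(
    [('zero', 'zero', 'zero'), ('A', 'MUR', 'foo'), ('B', 'RHE', 'bar'),
     ('C', 'DUJ', 'baz'), ('D', 'RHE', 'boo'), ('E', 'RHE', 'far')],
    names=['location', 'type', 'name'])
mydf = pd.DataFrame(np.random.randn(3, 6), columns=columns)

mytypes = mydf.columns.get_level_values('type')
print(mytypes.value_counts())  # counts per type label, e.g. RHE 3 in this sketch
print(mytypes.unique())        # the distinct labels ("eliminate doubles")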

Calculate mean for selected rows for selected columns in pandas data frame, but end up with some weird numbers

I am trying to find the mean of certain columns of a data frame in Python, but I ended up with some really weird numbers. Can someone explain this to me?
I want the mean of columns a, b and c
k = pd.DataFrame(np.array([[1, 0, 3,'kk'], [4, 5, 6,'kk'], [7, 20, 9,'k'],[3, 2, 9,'k']]),
columns=['a', 'b', 'c','type'])
k
which returns
a b c type
0 1 0 3 kk
1 4 5 6 kk
2 7 20 9 k
3 3 2 9 k
I want the mean for each column except the column 'type'
k[['a','b','c']].mean()
and this give me
a 368.25
b 1300.50
c 924.75
dtype: float64
I am so confused! Can someone explain this to me?
This is a problem with creating the numpy array with mixed data types: each sub-list now gets dtype object, and that array is then converted into the data frame.
So the DataFrame will hold the same object dtype as the array.
See the below snippet:
k = pd.DataFrame(np.array([[1, 0, 3,'kk'], [4, 5, 6,'kk'], [7, 20, 9,'k'],[3, 2, 9,'k']]),
columns=['a', 'b', 'c','type'])
print(k.dtypes)
a object
b object
c object
type object
dtype: object
You might wonder how a mean can be computed over string objects at all. This is again down to how numpy handles object data.
For example, take column a:
when you apply mean, it effectively performs the operation below,
np.sum(array) / len(array)
print(np.sum(k["a"]))
'1473'
print(len(k["a"]))
4
print(np.mean(k["a"]))
368.25
Now, 368.25 is nothing but 1473 / 4.
For column b, the concatenated string is '05202', and 5202 / 4 = 1300.5.
So, when you create a DataFrame, build it from a list of lists or from a dictionary, so that pandas can assign each column's data type according to its elements.
k = pd.DataFrame(([[1, 0, 3,'kk'], [4, 5, 6,'kk'], [7, 20, 9,'k'],[3, 2, 9,'k']]),
columns=['a', 'b', 'c','type'])
print(k.dtypes)
a int64
b int64
c int64
type object
dtype: object
print(k.mean())
a 3.75
b 6.75
c 6.75
dtype: float64
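If the frame has already been built with object dtype, one alternative (a sketch, not part of the answer above) is to convert the numeric columns after the fact with pd.to_numeric or astype:
import numpy as np
import pandas as pd

k = pd.DataFrame(np.array([[1, 0, 3, 'kk'], [4, 5, 6, 'kk'],
                           [7, 20, 9, 'k'], [3, 2, 9, 'k']]),
                 columns=['a', 'b', 'c', 'type'])

numeric = k[['a', 'b', 'c']].apply(pd.to_numeric)  # or k[['a','b','c']].astype(int)
print(numeric.mean())  # a 3.75, b 6.75, c 6.75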
If we look at the type of variables stored in the dataframe, we find that they're stored as objects.
print(k.dtypes)
a object
b object
c object
type object
dtype: object
This means that since you've stored a string in one of the columns, the entire dataframe is being stored as objects. There is a number associated with each character, and I believe you're getting a mean of some of those numbers (although I haven't been able to figure out how you got those numbers).
For example, if we look at the numerical value assigned to the string '0' :
ord('0')
48
We see it has the numerical value of 48.
In order to get the mean you're looking for, you'll need to change the type.
Try :
b = k[['a', 'b', 'c']].astype(int)
print(b.mean())
a 3.75
b 6.75
c 6.75
dtype: float64
edit : changed "strings" to "objects"
The problem with your data is that you are mixing numbers with non-numbers (the 'k' values in the type column).
Therefore your dataframe has dtype object, not integers.
I can't really explain at a low level how those numbers are produced; however, the solution is:
TLDR;
k[['a','b','c']].astype(int).mean()
Output:
a 3.75
b 6.75
c 6.75
dtype: float64
And Welcome!

Pandas Data Frame get the values only not the definition [duplicate]

This seems like a ridiculously easy question... but I'm not seeing the easy answer I was expecting.
So, how do I get the value at an nth row of a given column in Pandas? (I am particularly interested in the first row, but would be interested in a more general practice as well).
For example, let's say I want to pull the 1.2 value in Btime as a variable.
What's the right way to do this?
>>> df_test
ATime X Y Z Btime C D E
0 1.2 2 15 2 1.2 12 25 12
1 1.4 3 12 1 1.3 13 22 11
2 1.5 1 10 6 1.4 11 20 16
3 1.6 2 9 10 1.7 12 29 12
4 1.9 1 1 9 1.9 11 21 19
5 2.0 0 0 0 2.0 8 10 11
6 2.4 0 0 0 2.4 10 12 15
To select the ith row, use iloc:
In [31]: df_test.iloc[0]
Out[31]:
ATime 1.2
X 2.0
Y 15.0
Z 2.0
Btime 1.2
C 12.0
D 25.0
E 12.0
Name: 0, dtype: float64
To select the ith value in the Btime column you could use:
In [30]: df_test['Btime'].iloc[0]
Out[30]: 1.2
There is a difference between df_test['Btime'].iloc[0] (recommended) and df_test.iloc[0]['Btime']:
DataFrames store data in column-based blocks (where each block has a single
dtype). If you select by column first, a view can be returned (which is
quicker than returning a copy) and the original dtype is preserved. In contrast,
if you select by row first, and if the DataFrame has columns of different
dtypes, then Pandas copies the data into a new Series of object dtype. So
selecting columns is a bit faster than selecting rows. Thus, although
df_test.iloc[0]['Btime'] works, df_test['Btime'].iloc[0] is a little bit
more efficient.
There is a big difference between the two when it comes to assignment.
df_test['Btime'].iloc[0] = x affects df_test, but df_test.iloc[0]['Btime']
may not. See below for an explanation of why. Because a subtle difference in
the order of indexing makes a big difference in behavior, it is better to use single indexing assignment:
df.iloc[0, df.columns.get_loc('Btime')] = x
df.iloc[0, df.columns.get_loc('Btime')] = x (recommended):
The recommended way to assign new values to a
DataFrame is to avoid chained indexing, and instead use the method shown by
andrew,
df.loc[df.index[n], 'Btime'] = x
or
df.iloc[n, df.columns.get_loc('Btime')] = x
The latter method is a bit faster, because df.loc has to convert the row and column labels to
positional indices, so there is a little less conversion necessary if you use
df.iloc instead.
df['Btime'].iloc[0] = x works, but is not recommended:
Although this works, it is taking advantage of the way DataFrames are currently implemented. There is no guarantee that Pandas has to work this way in the future. In particular, it is taking advantage of the fact that (currently) df['Btime'] always returns a
view (not a copy) so df['Btime'].iloc[n] = x can be used to assign a new value
at the nth location of the Btime column of df.
Since Pandas makes no explicit guarantees about when indexers return a view versus a copy, assignments that use chained indexing generally raise a SettingWithCopyWarning, even though in this case the assignment happens to succeed in modifying df:
In [22]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [24]: df['bar'] = 100
In [25]: df['bar'].iloc[0] = 99
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
In [26]: df
Out[26]:
foo bar
0 A 99 <-- assignment succeeded
2 B 100
1 C 100
df.iloc[0]['Btime'] = x does not work:
In contrast, assignment with df.iloc[0]['bar'] = 123 does not work because df.iloc[0] is returning a copy:
In [66]: df.iloc[0]['bar'] = 123
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [67]: df
Out[67]:
foo bar
0 A 99 <-- assignment failed
2 B 100
1 C 100
Warning: I had previously suggested df_test.ix[i, 'Btime']. But this is not guaranteed to give you the ith value since ix tries to index by label before trying to index by position. So if the DataFrame has an integer index which is not in sorted order starting at 0, then using ix[i] will return the row labeled i rather than the ith row. For example,
In [1]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [2]: df
Out[2]:
foo
0 A
2 B
1 C
In [4]: df.ix[1, 'foo']
Out[4]: 'C'
Note that the answer from unutbu will be correct until you want to set the value to something new; then it will not work if your dataframe is a view.
In [4]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [5]: df['bar'] = 100
In [6]: df['bar'].iloc[0] = 99
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.16.0_19_g8d2818e-py2.7-macosx-10.9-x86_64.egg/pandas/core/indexing.py:118: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Another approach that will consistently work with both setting and getting is:
In [7]: df.loc[df.index[0], 'foo']
Out[7]: 'A'
In [8]: df.loc[df.index[0], 'bar'] = 99
In [9]: df
Out[9]:
foo bar
0 A 99
2 B 100
1 C 100
Another way to do this:
first_value = df['Btime'].values[0]
This way seems to be faster than using .iloc:
In [1]: %timeit -n 1000 df['Btime'].values[20]
5.82 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [2]: %timeit -n 1000 df['Btime'].iloc[20]
29.2 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df.iloc[0].head(1) - only the first field of the first row.
df.iloc[0] - the entire first row, returned as a Series.
More generally, if you want to pick the first N rows of the column at position J of a pandas DataFrame, you can use positional slicing with iloc:
data = dataframe.iloc[0:N, J]
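A small sketch of that positional slicing on an invented frame (column positions and values are made up):
import pandas as pd

df_test = pd.DataFrame({'ATime': [1.2, 1.4, 1.5],
                        'X': [2, 3, 1],
                        'Btime': [1.2, 1.3, 1.4]})
N, J = 2, 2                   # first two rows of the column at position 2 ('Btime')
print(df_test.iloc[0:N, J])   # positions 0 and 1 of 'Btime': 1.2 and 1.3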
To access a single value you can use the method iat that is much faster than iloc:
df['Btime'].iat[0]
You can also use the method take:
df['Btime'].take(0)
.iat and .at are the methods for getting and setting single values and are much faster than .iloc and .loc. Mykola Zotko pointed this out in their answer, but they did not use .iat to its full extent.
When we can use .iat or .at, we should only have to index into the dataframe once.
This is not great:
df['Btime'].iat[0]
It is not ideal because the 'Btime' column was first selected as a series, then .iat was used to index into that series.
These two options are the best:
Using zero-indexed positions:
df.iat[0, 4] # get the value in the zeroth row, and 4th column
Using Labels:
df.at[0, 'Btime'] # get the value where the index label is 0 and the column name is "Btime".
Both methods return the value of 1.2.
To get, e.g., the value from column 'test' and row 1, you can use
df[['test']].values[0][0]
since df[['test']].values[0] gives back an array.
Another way of getting the first row and preserving the index:
x = df.first('d') # Returns the first day; '3d' gives the first three days. Note this requires a date-like (DatetimeIndex) index.
According to pandas docs, at is the fastest way to access a scalar value such as the use case in the OP (already suggested by Alex on this page).
Building upon Alex's answer: because dataframes don't necessarily have a range index, it can be more robust to index into df.index (dataframe indexes are built on numpy arrays, so you can index them like an array) or to call get_loc() on the columns to get the integer location of a column.
df.at[df.index[0], 'Btime']
df.iat[0, df.columns.get_loc('Btime')]
One common problem is that a boolean mask meant to select a single value actually returns a one-element Series that still carries an index; e.g.:
0 1.2
Name: Btime, dtype: float64
you can use squeeze() to get the scalar value, i.e.
df.loc[df['Btime']<1.3, 'Btime'].squeeze()
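A minimal sketch of that situation, using an invented one-column frame:
import pandas as pd

df = pd.DataFrame({'Btime': [1.2, 1.3, 1.4]})
s = df.loc[df['Btime'] < 1.3, 'Btime']  # one-element Series: index 0, value 1.2
print(s)            # 0    1.2  /  Name: Btime, dtype: float64
print(s.squeeze())  # 1.2, the bare scalar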

How to compare if any value is similar to any other using numpy

I have many pairs of coordinate arrays like so
a=[(1.001,3),(1.334, 4.2),...,(17.83, 3.4)]
b=[(1.002,3.0001),(1.67, 5.4),...,(17.8299, 3.4)]
c=[(1.00101,3.002),(1.3345, 4.202),...,(18.6, 12.511)]
Any coordinate in any of the pairs can be a duplicate of another coordinate in another array of pairs. The arrays are also not the same size.
The duplicates will vary slightly in their values; for example, I would consider the first coordinate in a, b and c to be duplicates.
I could iterate through each array and compare the values one by one using numpy.isclose, however that will be slow.
Is there an efficient way to tackle this problem, hopefully using numpy to keep computing times low?
You might want to try the round() function, which will round off the numbers in your lists to the nearest integers.
The next thing I'd suggest might be too extreme:
concatenate the arrays, put them into a pandas DataFrame and call drop_duplicates().
This might not be the solution you want.
You might want to take a look at numpy.testing if you allow for AssertionError handling.
from numpy import testing as ts
a = np.array((1.001,3))
b = np.array((1.000101, 3.002))
ts.assert_array_almost_equal(a, b, decimal=1) # output None
but
ts.assert_array_almost_equal(a, b, decimal=3)
results in
AssertionError:
Arrays are not almost equal to 3 decimals
Mismatch: 50%
Max absolute difference: 0.002
Max relative difference: 0.00089891
x: array([1.001, 3. ])
y: array([1. , 3.002])
There are some more interesting functions from numpy.testing. Make sure to take a look at the docs.
I'm using pandas to give you an intuitive result rather than just numbers. Of course you can adapt the solution to your needs.
Say you create a pd.DataFrame from each array and tag each one with the array it came from. I am rounding the results to 2 decimal places; you may use whatever tolerance you want.
dfa = pd.DataFrame(a).round(2)
dfa['arr'] = 'a'
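The other two frames, dfb and dfc, are tagged the same way; here is a self-contained sketch using shortened versions of the example arrays from the question:
import pandas as pd

a = [(1.001, 3), (1.334, 4.2), (17.83, 3.4)]
b = [(1.002, 3.0001), (1.67, 5.4), (17.8299, 3.4)]
c = [(1.00101, 3.002), (1.3345, 4.202), (18.6, 12.511)]

dfa = pd.DataFrame(a).round(2)
dfa['arr'] = 'a'
dfb = pd.DataFrame(b).round(2)
dfb['arr'] = 'b'
dfc = pd.DataFrame(c).round(2)
dfc['arr'] = 'c'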
Then, by concatenating, using duplicated and sorting, you get an intuitive DataFrame that might fulfill your needs:
df = pd.concat([dfa, dfb, dfc])
df[df.duplicated(subset=[0,1], keep=False)].sort_values(by=[0,1])
yields
x y arr
0 1.00 3.0 a
0 1.00 3.0 b
0 1.00 3.0 c
1 1.33 4.2 a
1 1.33 4.2 c
2 17.83 3.4 a
2 17.83 3.4 b
The indexes are duplicated, so you can simply use reset_index() at the end; the newly generated column then indicates the corresponding index in each array. I.e.:
index x y arr
0 0 1.00 3.0 a
1 0 1.00 3.0 b
2 0 1.00 3.0 c
3 1 1.33 4.2 a
4 1 1.33 4.2 c
5 2 17.83 3.4 a
6 2 17.83 3.4 b
So, for example, line 0 indicates a duplicate coordinate found at index 0 of arr a; line 1 also indicates a duplicate coordinate, found at index 0 of arr b, and so on.
Now, if you just want to delete the duplicates and get one final array with only non-duplicate values, you may use drop_duplicates:
df.drop_duplicates(subset=[0,1])[[0,1]].to_numpy()
which yields
array([[ 1. , 3. ],
[ 1.33, 4.2 ],
[17.83, 3.4 ],
[ 1.67, 5.4 ],
[18.6 , 12.51]])

pandas groupby and rolling_apply ignoring NaNs

I have a pandas dataframe and I want to calculate the rolling mean of a column (after a groupby clause). However, I want to exclude NaNs.
For instance, if the groupby returns [2, NaN, 1], the result should be 1.5 while currently it returns NaN.
I've tried the following but it doesn't seem to work:
df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3, lambda x: np.mean([i for i in x if i is not np.nan and i!='NaN']))
If I even try this:
df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3, lambda x: 1)
I'm getting NaN in the output so it must be something to do with how pandas works in the background.
Any ideas?
EDIT:
Here is a code sample with what I'm trying to do:
import pandas as pd
import numpy as np
df = pd.DataFrame({'var1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'], 'value' : [1, 2, 3, np.nan, 2, 3, 4, 1] })
print df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 2, lambda x: np.mean([i for i in x if i is not np.nan and i!='NaN']))
The result is:
0 NaN
1 NaN
2 2.0
3 NaN
4 2.5
5 NaN
6 3.0
7 2.0
while I wanted to have the following:
0 NaN
1 NaN
2 2.0
3 2.0
4 2.5
5 3.0
6 3.0
7 2.0
As always in pandas, sticking to vectorized methods (i.e. avoiding apply) is essential for performance and scalability.
The operation you want to do is a little fiddly as rolling operations on groupby objects are not NaN-aware at present (version 0.18.1). As such, we'll need a few short lines of code:
g1 = df.groupby(['var1'])['value'] # group values
g2 = df.fillna(0).groupby(['var1'])['value'] # fillna, then group values
s = g2.rolling(2).sum() / g1.rolling(2).count() # the actual computation
s.reset_index(level=0, drop=True).sort_index() # drop/sort index
The idea is to sum the values in the window (using sum), count the non-NaN values (using count) and then divide to find the mean. This code gives the following output, which matches your desired output:
0 NaN
1 NaN
2 2.0
3 2.0
4 2.5
5 3.0
6 3.0
7 2.0
Name: value, dtype: float64
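For reference, a self-contained version of the same computation (using the sample df from the question's edit) would look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'var1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
                   'value': [1, 2, 3, np.nan, 2, 3, 4, 1]})

g1 = df.groupby(['var1'])['value']               # group values
g2 = df.fillna(0).groupby(['var1'])['value']     # fillna, then group values
s = g2.rolling(2).sum() / g1.rolling(2).count()  # NaN-aware rolling mean
print(s.reset_index(level=0, drop=True).sort_index())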
Testing this on a larger DataFrame (around 100,000 rows), the run-time was under 100ms, significantly faster than any apply-based methods I tried.
It may be worth testing the different approaches on your actual data as timings may be influenced by other factors such as the number of groups. It's fairly certain that vectorized computations will win out, though.
The approach shown above works well for simple calculations, such as the rolling mean. It will also work for more complicated calculations (such as the rolling standard deviation), although the implementation is more involved.
The general idea is to look at each simple routine that is fast in pandas (e.g. sum) and then fill any null values with an identity element (e.g. 0). You can then use groupby and perform the rolling operation (e.g. .rolling(2).sum()). The output is then combined with the output(s) of other operations.
For example, to implement groupby NaN-aware rolling variance (of which standard deviation is the square-root) we must find "the mean of the squares minus the square of the mean". Here's a sketch of what this could look like:
def rolling_nanvar(df, window):
"""
Group df by 'var1' values and then calculate rolling variance,
adjusting for the number of NaN values in the window.
Note: user may wish to edit this function to control degrees of
freedom (n), depending on their overall aim.
"""
g1 = df.groupby(['var1'])['value']
g2 = df.fillna(0).groupby(['var1'])['value']
# fill missing values with 0, square values and groupby
g3 = df['value'].fillna(0).pow(2).groupby(df['var1'])
n = g1.rolling(window).count()
mean_of_squares = g3.rolling(window).sum() / n
square_of_mean = (g2.rolling(window).sum() / n)**2
variance = mean_of_squares - square_of_mean
return variance.reset_index(level=0, drop=True).sort_index()
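As a quick usage sketch (again using the sample df from the question's edit, and assuming rolling_nanvar is defined as above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'var1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
                   'value': [1, 2, 3, np.nan, 2, 3, 4, 1]})
print(rolling_nanvar(df, 2))  # NaN-aware rolling variance per group, window of 2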
Note that this function may not be numerically stable (squaring could lead to overflow). pandas uses Welford's algorithm internally to mitigate this issue.
Anyway, this function, although it uses several operations, is still very fast. Here's a comparison with the more concise apply-based method suggested by Yakym Pirozhenko:
>>> df2 = pd.concat([df]*10000, ignore_index=True) # 80000 rows
>>> %timeit df2.groupby('var1')['value'].apply(\
lambda gp: gp.rolling(7, min_periods=1).apply(np.nanvar))
1 loops, best of 3: 11 s per loop
>>> %timeit rolling_nanvar(df2, 7)
10 loops, best of 3: 110 ms per loop
Vectorization is 100 times faster in this case. Of course, depending on how much data you have, you may wish to stick to using apply, since it gives you generality/brevity at the expense of performance.
Does this result match your expectations?
I slightly changed your solution by adding the min_periods parameter and using the right filter for NaN.
In [164]: df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 2, lambda x: np.mean([i for i in x if not np.isnan(i)]), min_periods=1)
Out[164]:
0 1.0
1 2.0
2 2.0
3 2.0
4 2.5
5 3.0
6 3.0
7 2.0
dtype: float64
Here is an alternative implementation without a list comprehension; note that it also fails to populate the first entry of the output with np.nan:
means = df.groupby('var1')['value'].apply(
lambda gp: gp.rolling(2, min_periods=1).apply(np.nanmean))

how to read from an array without a particular column in python

I have a numpy array of dtype = object (each element is actually a list holding various data types), so effectively it is a 2D array because I have an array of lists. I want to copy every row, but only certain columns, of this array to another array. I stored data in this array from a csv file. This csv file contains several fields (columns) and a large number of rows. Here's the code chunk I used to store data into the array.
data = np.zeros((401125,), dtype = object)
for i, row in enumerate(csv_file_object):
data[i] = row
data can be basically depicted as follows
column1 column2 column3 column4 column5 ....
1 none 2 'gona' 5.3
2 34 2 'gina' 5.5
3 none 2 'gana' 5.1
4 43 2 'gena' 5.0
5 none 2 'guna' 5.7
..... .... ..... ..... ....
..... .... ..... ..... ....
..... .... ..... ..... ....
There are unwanted fields in the middle that I want to remove. Suppose I don't want column3.
How do I remove only that column from my array? Or copy only relevant columns to another array?
Use pandas. For mixed data types like yours, a pandas.DataFrame seems a better fit.
from StringIO import StringIO
from pandas import *
import numpy as np
data = """column1 column2 column3 column4 column5
1 none 2 'gona' 5.3
2 34 2 'gina' 5.5
3 none 2 'gana' 5.1
4 43 2 'gena' 5.0
5 none 2 'guna' 5.7"""
data = StringIO(data)
print read_csv(data, delim_whitespace=True).drop('column3',axis =1)
out:
column1 column2 column4 column5
0 1 none 'gona' 5.3
1 2 34 'gina' 5.5
2 3 none 'gana' 5.1
3 4 43 'gena' 5.0
4 5 none 'guna' 5.7
If you need an array instead of DataFrame, use the to_records() method:
df.to_records(index = False)
#output:
rec.array([(1L, 'none', "'gona'", 5.3),
(2L, '34', "'gina'", 5.5),
(3L, 'none', "'gana'", 5.1),
(4L, '43', "'gena'", 5.0),
(5L, 'none', "'guna'", 5.7)],
dtype=[('column1', '<i8'), ('column2', '|O4'),
('column4', '|O4'), ('column5', '<f8')])
Assuming you're reading the CSV rows and sticking them into a numpy array, the easiest and best solution is almost definitely preprocessing the data before it gets to the array, as Maciek D.'s answer shows. (If you want to do something more complicated than "remove column 3" you might want something like [value for i, value in enumerate(row) if i not in (1, 3, 5)], but the idea is still the same.)
However, if you've already imported the array and you want to filter it after the fact, you probably want take or delete:
>>> d=np.array([[1,None,2,'gona',5.3],[2,34,2,'gina',5.5],[3,None,2,'gana',5.1],[4,43,2,'gena',5.0],[5,None,2,'guna',5.7]])
>>> np.delete(d, 2, 1)
array([[1, None, gona, 5.3],
[2, 34, gina, 5.5],
[3, None, gana, 5.1],
[4, 43, gena, 5.0],
[5, None, guna, 5.7]], dtype=object)
>>> np.take(d, [0, 1, 3, 4], 1)
array([[1, None, gona, 5.3],
[2, 34, gina, 5.5],
[3, None, gana, 5.1],
[4, 43, gena, 5.0],
[5, None, guna, 5.7]], dtype=object)
For the simple case of "remove column 3", delete makes more sense; for a more complicated case, take probably makes more sense.
If you haven't yet worked out how to import the data in the first place, you could either use the built-in csv module and something like Maciek D.'s code and process as you go, or use something like pandas.read_csv and post-process the result, as root's answer shows.
But it might be better to use a native numpy data format in the first place instead of CSV.
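A hedged sketch of that idea: save the object array once in numpy's native .npy format and reload it later instead of re-parsing the CSV every run (the file name and sample rows are invented):
import numpy as np

data = np.array([[1, None, 2, 'gona', 5.3],
                 [2, 34, 2, 'gina', 5.5]], dtype=object)
np.save('data.npy', data)                        # binary format, keeps dtype=object
loaded = np.load('data.npy', allow_pickle=True)  # object arrays need allow_pickle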
You can use range selection. E.g., to remove column3, you can use:
data = np.zeros((401125,), dtype = object)
for i, row in enumerate(csv_file_object):
data[i] = row[:2] + row[3:]
This will work, assuming that csv_file_object yields lists. If it is e.g. a plain file object created with csv_file_object = open("file.csv"), add a split in your loop:
data = np.zeros((401125,), dtype = object)
for i, row in enumerate(csv_file_object):
row = row.split()
data[i] = row[:2] + row[3:]
