converting pandas dataframe with categorical values into binary values - python

I am trying to convert categorical data into binary so that I can classify it with an algorithm like logistic regression. I thought of using OneHotEncoder from the 'sklearn.preprocessing' module, but the problem is that each row of the dataframe holds a pair of arrays A and B: within a row A and B have the same length, but that length varies from row to row.
OneHotEncoder does not accept a dataframe like mine:
In [34]: data.index
Out[34]: Index([train1, train2, train3, ..., train7829, train7830,
train7831], dtype=object)
In [35]: data.columns
Out[35]: Index([A, B], dtype=object)
SampleID A B
train1 [2092.0, 1143.0, 390.0, ...] [5651.0, 4449.0, 4012.0...]
train2 [3158.0, 3158.0, 3684.0, 3684.0....] [2.0, 4.0, 2.0, 1.0...]
train3 [1699.0, 1808.0 ,...] [0.0, 1.0...]
So, I want to highlight again that each A and B pair has the same length, but the length varies across pairs. The dataframe contains numerical, categorical and binary values.
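For reference, a minimal frame with the same structure can be built from the truncated sample values above (entries are numpy arrays, as in the real data), in case anyone wants to reproduce the steps below:
import numpy as np
import pandas as pd

# toy stand-in for the real training data, using the truncated sample values
df = pd.DataFrame(
    {'A': [np.array([2092.0, 1143.0, 390.0]),
           np.array([3158.0, 3158.0, 3684.0, 3684.0]),
           np.array([1699.0, 1808.0])],
     'B': [np.array([5651.0, 4449.0, 4012.0]),
           np.array([2.0, 4.0, 2.0, 1.0]),
           np.array([0.0, 1.0])]},
    index=['train1', 'train2', 'train3'])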
I have another csv file with information about the type of every entry. I read that file and filter out the categorical entries in both columns like this:
info=data_io.read_train_info()
col1=info.columns[0]
col2=info.columns[1]
info=info[(info[col1]=='Categorical')&(info[col2]=='Categorical')]
Then I use info.index to filter my training dataframe
filtered = data.loc[info.index]
Then I wrote a utility function to change the dimensions of each array (reshape every 1-D array into a single-row 2-D array) so that I can encode them later:
def setDim(df):
    for item in df[df.columns[0]].index:
        df[df.columns[0]][item].shape = (1, df[df.columns[0]][item].shape[0])
        df[df.columns[1]][item].shape = (1, df[df.columns[1]][item].shape[0])
setDim(filtered)
Then I thought of combining each pair of arrays into a 2-row matrix, passing it to the encoder, and then separating the rows again after encoding, like this:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def makeSparse(df):
    enc = OneHotEncoder()
    for i in df.index:
        # stack A and B into a 2-row matrix, encode, then split the rows again
        cd = np.append(df['A'][i], df['B'][i], axis=0)
        a = enc.fit_transform(cd)
        df['A'][i] = a[0, :]
        df['B'][i] = a[1, :]

makeSparse(filtered)
After all these steps I get a sparse dataframe. My questions are:
Is this the right way to encode this dataframe? (I highly doubt it.)
If not, what alternatives would you suggest?
Thanks a lot for your time helping me.

This is a nice way to transform your data into a representation that is easier to work with; it uses some neat apply tricks:
In [72]: df
Out[72]:
A B
train1 [2092, 1143, 390] [5651, 449, 4012]
train2 [3158, 3158, 3684, 3684] [2, 4, 2, 1]
train3 [1699, 1808] [0, 1]
In [73]: concat(dict([ (x[0],x[1].apply(lambda y: Series(y))) for x in df.iterrows() ]))
Out[73]:
0 1 2 3
train1 A 2092 1143 390 NaN
B 5651 449 4012 NaN
train2 A 3158 3158 3684 3684
B 2 4 2 1
train3 A 1699 1808 NaN NaN
B 0 1 NaN NaN

Some 9 years later, having been redirected to this thread from the official pandas docs (namely the cookbook), I came up with a probably even neater implementation of the transformation from the most upvoted answer.
To go from this:
A B
train1 [2092, 1143, 390] [5651, 449, 4012]
train2 [3158, 3158, 3684, 3684] [2, 4, 2, 1]
train3 [1699, 1808] [0, 1]
To this:
0 1 2 3
train1 A 2092.0 1143.0 390.0 NaN
B 5651.0 449.0 4012.0 NaN
train2 A 3158.0 3158.0 3684.0 3684.0
B 2.0 4.0 2.0 1.0
train3 A 1699.0 1808.0 NaN NaN
B 0.0 1.0 NaN NaN
...one can simply use:
df.transpose().unstack().apply(pd.Series)
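To get back to the original goal (binary/one-hot encoding): once the data is in this long form, every column holds plain scalars, so pandas' get_dummies can be applied directly. A minimal sketch, not from the original answers, and only showing the mechanics (whether one-hot encoding these values is statistically sensible is a separate question); rows that are NaN in a given column simply get all zeros in that column's dummies:
import pandas as pd

long_df = df.transpose().unstack().apply(pd.Series)

# one-hot encode every positional value column (0, 1, 2, ...)
encoded = pd.get_dummies(long_df, columns=long_df.columns)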

Related

Optimizing a function to replace a row with a previous row given a condition in Pandas

I have a relatively large dataframe (~24000 rows and 15 columns) which has 2D coordinate data of rat movements, outputted by a neural network (DeepLabCut).
As part of this output data, there is a p-value score that measures how certain the neural network was when applying that label. I'm trying to filter out low-quality predictions by copying the previous row into place each time a low p-value is encountered, which assumes that the rat remained still for that frame.
Here's my function thus far:
def checkPVals(DataFrame, CutOff):
    for Cols in DataFrame.columns.values:
        if Cols % 3 == 0:
            for Vals in DataFrame.index.values:
                if float(DataFrame[Cols][Vals]) < CutOff:
                    if (Vals != 0):
                        PreviousRow = DataFrame.loc[Vals - 1, Cols - 3:Cols]
                        DataFrame.loc[Vals, Cols - 3:Cols] = PreviousRow
    return(DataFrame)
Here is a sample of the input data frame:
pd.DataFrame(data={
    "x": [1, 2, 3, 4],
    "y": [5, 4, 3, 2],
    "likelihood": [1, 1, 0.3, 1]
})
Here is a sample of the desired output:
x y Pval
0 1 5 1.0
1 2 4 1.0
2 2 4 1.0
3 4 2 1.0
With the idea being that row index 2 is replaced with values from row index 1, such that when the inter-frame Euclidean distance between these coordinates is calculated, the distance is 0, implying the label (rat) has not moved.
Clearly, my current implementation is very inefficient. I was looking at iterrows(), but that converts my data into a Series and messes with it. My other thought was to convert the p-value columns into np.arrays, iterate through those, take the indices of the p-values below the threshold, and then swap those rows for the previous ones iteratively. However, I feel like that would take just as long.
Any help is very much appreciated. Thank you!
I'm pretty sure I understood what you are attempting to do. If you could update your question to include a sample output paired with your sample input, that would be greatly beneficial.
If I understood correctly, you should be using a vectorized approach instead of explicit looping (this will massively speed up your data wrangling). Essentially you can mask the rows of the dataframe depending on whether or not the "likelihood" column is above a certain value. Once you mask the low likelihoods away (i.e. replace those values with NaN), you can simply forward fill the entire dataframe to fill in the "bad" rows with the previous row's values.
df = pd.DataFrame(data={
    "x": [1, 2, 3, 4],
    "y": [5, 4, 3, 2],
    "likelihood": [1, 1, 0.3, 1]
})
cutoff = 0.5
new_df = df.mask(df["likelihood"] < cutoff).ffill()
print(new_df)
x y likelihood
0 1.0 5.0 1.0
1 2.0 4.0 1.0
2 2.0 4.0 1.0
3 4.0 2.0 1.0
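If the real frame has several x/y/likelihood triplets (one per tracked bodypart, as the Cols % 3 == 0 check in the question suggests), the same mask-and-ffill idea can be applied per triplet. A hedged sketch, assuming the columns come in repeating (x, y, likelihood) groups with likelihood last in each group:
import pandas as pd

def ffill_low_likelihood(df, cutoff=0.5, group_size=3):
    # mask each triplet where its likelihood is below the cutoff,
    # then forward fill so the previous good row is carried forward
    blocks = []
    for start in range(0, df.shape[1], group_size):
        block = df.iloc[:, start:start + group_size]
        likelihood = block.iloc[:, -1]   # assumes likelihood is the last column of the triplet
        blocks.append(block.mask(likelihood < cutoff).ffill())
    return pd.concat(blocks, axis=1)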

Pandas groupby mean() not ignoring NaNs

If I calculate the mean of a groupby object and one of the groups contains a NaN (or several), the NaNs are ignored. Even when applying np.mean it still returns just the mean of all valid numbers. I would expect it to return NaN as soon as one NaN is within the group. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
a
b
1 1.5
2 3.0
c.groupby('b').agg(np.mean)
a
b
1 1.5
2 3.0
I want to receive following result:
a
b
1 1.5
2 NaN
I am aware that I can replace NaNs beforehand and that I could probably write my own aggregation function to return NaN as soon as a NaN is within the group. This function wouldn't be optimized, though.
Do you know of an argument to achieve the desired behaviour with the optimized functions?
Btw, I think the desired behaviour was implemented in a previous version of pandas.
By default, pandas skips NaN values. You can make it include NaN by specifying skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
a
b
1 1.5
2 NaN
There is mean(skipna=False), but it's not working
GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for this exact task, but it seems that currently (may-2020) there is a bug (issue opened on mar-2020), which prevents it from working correctly.
Quick workaround
Complete working example based on these comments: @Serge Ballesta, @RoelAdriaans
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
a
b
1 1.5
2 NaN
For additional information and updates follow the link above.
Use the skipna option -
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
a
b
1 1.500000
2 inf
There are three different methods for it:
slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
faster than apply but slower than default sum:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Fastest, but needs more code:
method3 = c.groupby('b').sum()
nan_groups = c.loc[c['a'].isna(), 'b'].unique()   # group keys whose group contains a NaN
method3.loc[method3.index.isin(nan_groups)] = np.nan
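Another fully vectorized option (my own sketch, not from the answers above): compute the fast NaN-skipping mean and then blank out the groups that contain a NaN:
import numpy as np
import pandas as pd

c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})

means = c.groupby('b')['a'].mean()               # fast, NaN-skipping mean per group
has_nan = c['a'].isna().groupby(c['b']).any()    # True for groups containing a NaN
means[has_nan] = np.nan                          # -> b=1: 1.5, b=2: NaN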
I landed here in search of a fast (vectorized) way of doing this, but did not find it. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it will convert groups where all values are NaN into 0+0j.
So, here is what I came up with:
Setup:
df = pd.DataFrame({
    'a': [1, 2, 1, 2],
    'b': [1, np.nan, 2, 3],
    'c': [1, np.nan, 2, np.nan],
    'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')
Default behavior:
gb.sum()
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 0.0 0.000000+0.000000j
A single NaN kills the group:
cnt = gb.count()
siz = gb.size()
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)
Out[]:
b c d
a
1 3.0 3.0 NaN
2 NaN NaN NaN
Only NaN if all values in group are NaN:
cnt = gb.count()
out = gb.sum() * (cnt / cnt)
out
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 NaN NaN
Corollary: mean of complex:
cnt = gb.count()
gb.sum() / cnt
Out[]:
b c d
a
1 1.5 1.5 0.000000+2.000000j
2 3.0 NaN NaN

Numpy nanmean and dataframe (possible bug?)

I'm wondering if this is a bug, or possibly I don't understand how nanmean should work with a dataframe. Seems to work if I convert the dataframe to an array, but not directly on the dataframe, nor is any exception raised. Originally noticed here: Fill data gaps with average of data from adjacent days
df1 = DataFrame({ 'x': [1,3,np.nan] })
df2 = DataFrame({ 'x': [2,np.nan,5] })
x
0 1
1 3
2 NaN
x
0 2
1 NaN
2 5
In [1503]: np.nanmean( [df1,df2], axis=0 )
Out[1503]:
x
0 1.5
1 NaN
2 NaN
In [1504]: np.nanmean( [df1.values, df2.values ], axis=0 )
Out[1504]:
array([[ 1.5],
[ 3. ],
[ 5. ]])
It's definitely strange behavior. I don't have the answers, but it mostly seems that entire pandas DataFrames can be elements of numpy arrays, which results in strange behavior. I'm guessing this should be avoided as much as possible, and I'm not sure why DataFrames are valid numpy elements at all.
np.nanmean probably converts the arguments into an np.array before applying operations. So let's look at
a = np.array([df1, df2])
First note that this is not a 3-d array like you might think, it's actually a 1-d array, where each element is a DataFrame.
print(a.shape)
# (2,)
print(type(a[0]))
# <class 'pandas.core.frame.DataFrame'>
So nanmean is taking the mean of both of the DataFrames, not of the values inside the dataframes. This also means that the axis argument isn't actually doing anything, and if you try using axis=1 you'll get an error because it's a 1-d array.
np.nanmean(a, axis=1)
# IndexError: tuple index out of range
print(np.nanmean(a))
# x
# 0 1.5
# 1 NaN
# 2 NaN
That's why you're getting a different answer than when you create the array with values. When you use values, it properly creates the 3-d array of numbers, rather than the weird 1-d array of dataframes.
b = np.array([df1.values, df2.values ])
print(b.shape)
# (2, 3, 1)
print(type(b[1]))
# <class 'numpy.ndarray'>
print(type(b[0,0,0]))
# <class 'numpy.float64'>
These arrays of dataframes have some especially weird behavior though. Say that we make a 3-length array where the third element is np.nan. You might expect to get the same answer from nanmean as we did with a before, as it should exclude the nan value, right?
print(np.nanmean(np.array([df1, df2, np.nan])))
# x
# 0 NaN
# 1 NaN
# 2 NaN
Yea, so I'm not sure. Best to avoid making these.
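If the goal is simply an element-wise nanmean across several aligned DataFrames, a sketch that avoids the object-array pitfall is to stack the underlying .values into a real 3-D array and wrap the result back into a DataFrame:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1, 3, np.nan]})
df2 = pd.DataFrame({'x': [2, np.nan, 5]})

stacked = np.stack([df1.values, df2.values])     # shape (2, 3, 1), plain floats
result = pd.DataFrame(np.nanmean(stacked, axis=0),
                      index=df1.index, columns=df1.columns)
# result['x'] -> [1.5, 3.0, 5.0]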

How to create a lagged data structure using pandas dataframe

Example
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
print s
1 5
2 4
3 3
4 2
5 1
Is there an efficient way to create a series, e.g. one containing in each row the lagged values (in this example up to lag 2):
3 [3, 4, 5]
4 [2, 3, 4]
5 [1, 2, 3]
This corresponds to s=pd.Series([[3,4,5],[2,3,4],[1,2,3]], index=[3,4,5])
How can this be done in an efficient way for dataframes with a lot of timeseries which are very long?
Thanks
Edit (after seeing the answers): in the end I implemented this function:
def buildLaggedFeatures(s, lag=2, dropna=True):
    '''
    Builds a new DataFrame to facilitate regressing over all possible lagged features
    '''
    if type(s) is pd.DataFrame:
        new_dict = {}
        for col_name in s:
            new_dict[col_name] = s[col_name]
            # create lagged Series
            for l in range(1, lag + 1):
                new_dict['%s_lag%d' % (col_name, l)] = s[col_name].shift(l)
        res = pd.DataFrame(new_dict, index=s.index)
    elif type(s) is pd.Series:
        the_range = range(lag + 1)
        res = pd.concat([s.shift(i) for i in the_range], axis=1)
        res.columns = ['lag_%d' % i for i in the_range]
    else:
        print 'Only works for DataFrame or Series'
        return None
    if dropna:
        return res.dropna()
    else:
        return res
It produces the desired outputs and manages the naming of columns in the resulting DataFrame.
For a Series as input:
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
res=buildLaggedFeatures(s,lag=2,dropna=False)
lag_0 lag_1 lag_2
1 5 NaN NaN
2 4 5 NaN
3 3 4 5
4 2 3 4
5 1 2 3
and for a DataFrame as input:
s2 = pd.DataFrame({'a':[5,4,3,2,1], 'b':[50,40,30,20,10]}, index=[1,2,3,4,5])
res2=buildLaggedFeatures(s2,lag=2,dropna=True)
a a_lag1 a_lag2 b b_lag1 b_lag2
3 3 4 5 30 40 50
4 2 3 4 20 30 40
5 1 2 3 10 20 30
As mentioned, it could be worth looking into the rolling_ functions, which will mean you won't have as many copies around.
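As a side note (my own sketch, not part of the original post): on numpy >= 1.20 a strided view builds the same lagged matrix for very long series without intermediate copies of the data:
import numpy as np
import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
lag = 2

# each window holds [x_{t-lag}, ..., x_t]; reversing puts the current value first
windows = np.lib.stride_tricks.sliding_window_view(s.to_numpy(), lag + 1)[:, ::-1]
lagged = pd.DataFrame(windows, index=s.index[lag:],
                      columns=['lag_%d' % i for i in range(lag + 1)])
# matches buildLaggedFeatures(s, lag=2, dropna=True) in values and labels
# (dtypes may differ, since shift() upcasts to float)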
One solution is to concat shifted Series together to make a DataFrame:
In [11]: pd.concat([s, s.shift(), s.shift(2)], axis=1)
Out[11]:
0 1 2
1 5 NaN NaN
2 4 5 NaN
3 3 4 5
4 2 3 4
5 1 2 3
In [12]: pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
Out[12]:
0 1 2
3 3 4 5
4 2 3 4
5 1 2 3
Doing work on this will be more efficient than on lists...
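If you really need the Series of lists shown in the question, the rows of this frame can be collapsed afterwards (a sketch, not from the original answer):
import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
lagged = pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()

# collapse each row back into a list: 3 -> [3.0, 4.0, 5.0], 4 -> [2.0, 3.0, 4.0], ...
as_lists = lagged.apply(list, axis=1)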
Very simple solution using a pandas DataFrame:
number_lags = 3
df = pd.DataFrame(data={'vals': [5, 4, 3, 2, 1]})
for lag in xrange(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

# if you want numpy arrays with no null values:
df.dropna().values
For Python 3.x (change xrange to range):
number_lags = 3
df = pd.DataFrame(data={'vals': [5, 4, 3, 2, 1]})
for lag in range(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)
print(df)
vals lag_1 lag_2 lag_3
0 5 NaN NaN NaN
1 4 5.0 NaN NaN
2 3 4.0 5.0 NaN
3 2 3.0 4.0 5.0
4 1 2.0 3.0 4.0
For a dataframe df with the lag to be applied on 'col name', you can use the shift function.
df['lag1']=df['col name'].shift(1)
df['lag2']=df['col name'].shift(2)
I like to put the lag numbers in the columns by making the columns a MultiIndex. This way, the names of the columns are retained.
Here's an example of the result:
# Setup
indx = pd.Index([1, 2, 3, 4, 5], name='time')
s = pd.Series(
    [5, 4, 3, 2, 1],
    index=indx,
    name='population')
shift_timeseries_by_lags(pd.DataFrame(s), [0, 1, 2])
Result: a MultiIndex DataFrame with two column levels: the original one ("population") and a new one ("lag").
Solution: Like in the accepted solution, we use DataFrame.shift and then pandas.concat.
def shift_timeseries_by_lags(df, lags, lag_label='lag'):
    return pd.concat([
        shift_timeseries_and_create_multiindex_column(df, lag, lag_label=lag_label)
        for lag in lags], axis=1)

def shift_timeseries_and_create_multiindex_column(
        dataframe, lag, lag_label='lag'):
    return (dataframe.shift(lag)
            .pipe(append_level_to_columns_of_dataframe, lag, lag_label))
I wish there were an easy way to append a list of labels to the existing columns. Here's my solution.
def append_level_to_columns_of_dataframe(
        dataframe, new_level, name_of_new_level, inplace=False):
    """Given a (possibly MultiIndex) DataFrame, append labels to the column
    labels and assign this new level a name.

    Parameters
    ----------
    dataframe : a pandas DataFrame with an Index or MultiIndex columns
    new_level : scalar, or arraylike of length equal to the number of columns
        in `dataframe`
        The labels to put on the columns. If scalar, it is broadcast into a
        list of length equal to the number of columns in `dataframe`.
    name_of_new_level : str
        The label to give the new level.
    inplace : bool, optional, default: False
        Whether to modify `dataframe` in place or to return a copy
        that is modified.

    Returns
    -------
    dataframe_with_new_columns : pandas DataFrame with MultiIndex columns
        The original `dataframe` with new columns that have the given `level`
        appended to each column label.
    """
    old_columns = dataframe.columns
    if not hasattr(new_level, '__len__') or isinstance(new_level, str):
        new_level = [new_level] * dataframe.shape[1]
    if isinstance(dataframe.columns, pd.MultiIndex):
        new_columns = pd.MultiIndex.from_arrays(
            old_columns.levels + [new_level],
            names=(old_columns.names + [name_of_new_level]))
    elif isinstance(dataframe.columns, pd.Index):
        new_columns = pd.MultiIndex.from_arrays(
            [old_columns] + [new_level],
            names=([old_columns.name] + [name_of_new_level]))
    if inplace:
        dataframe.columns = new_columns
        return dataframe
    else:
        copy_dataframe = dataframe.copy()
        copy_dataframe.columns = new_columns
        return copy_dataframe
Update: I learned from this solution another way to put a new level in a column, which makes it unnecessary to use append_level_to_columns_of_dataframe:
def shift_timeseries_by_lags_v2(df, lags, lag_label='lag'):
    return pd.concat({
        '{lag_label}_{lag_number}'.format(lag_label=lag_label, lag_number=lag):
            df.shift(lag)
        for lag in lags},
        axis=1)
Here's the result of shift_timeseries_by_lags_v2(pd.DataFrame(s), [0, 1, 2]): the lag labels ('lag_0', 'lag_1', 'lag_2') form the top level of the column MultiIndex, with the original column name ("population") underneath.
Here is a cool one-liner for lagged features with _lagN suffixes in column names, using pd.concat:
lagged = pd.concat([s.shift(lag).rename('{}_lag{}'.format(s.name, lag+1)) for lag in range(3)], axis=1).dropna()
You can do the following:
s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
res = pd.DataFrame(index=s.index)
for l in range(3):
    res[l] = s.shift(l)
print res.ix[3:, :].as_matrix()
It produces:
array([[ 3., 4., 5.],
[ 2., 3., 4.],
[ 1., 2., 3.]])
which I hope is very close to what you actually want.
For multiple (many of them) lags, this could be more compact:
df=pd.DataFrame({'year': range(2000, 2010), 'gdp': [234, 253, 256, 267, 272, 273, 271, 275, 280, 282]})
df.join(pd.DataFrame({'gdp_' + str(lag): df['gdp'].shift(lag) for lag in range(1,4)}))
Assuming you are focusing on a single column of your data frame, saved into s, this short snippet will generate lagged copies of that column (three here, including lag 0; change the range for more):
s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5], name='test')
shiftdf = pd.DataFrame()
for i in range(3):
    shiftdf = pd.concat([shiftdf, s.shift(i).rename(s.name + '_' + str(i))], axis=1)
shiftdf
>>
test_0 test_1 test_2
1 5 NaN NaN
2 4 5.0 NaN
3 3 4.0 5.0
4 2 3.0 4.0
5 1 2.0 3.0
Based on the proposal by @charlie-brummitt, here is a revision that keeps a set of columns fixed (unshifted):
def shift_timeseries_by_lags(df, fix_columns, lag_numbers, lag_label='lag'):
    df_fix = df[fix_columns]
    df_lag = df.drop(columns=fix_columns)
    df_lagged = pd.concat({f'{lag_label}_{lag}': df_lag.shift(lag)
                           for lag in lag_numbers},
                          axis=1)
    df_lagged.columns = ['__'.join(reversed(x)) for x in df_lagged.columns.to_flat_index()]
    return pd.concat([df_fix, df_lagged], axis=1)
Here is an example of usage:
df = shift_timeseries_by_lags(df_province_cases, fix_columns=['country', 'state'], lag_numbers=[1,2,3])
I personally prefer the lag name as a suffix, but that can be changed by removing reversed().

how to read from an array without a particular column in python

I have a numpy array of dtype=object whose elements are actually lists of various data types, so it effectively acts as a 2D array (an array of lists). I want to copy every row, but only certain columns, of this array to another array. I stored data in this array from a csv file. The csv file contains several fields (columns) and a large number of rows. Here's the code chunk I used to store data into the array:
data = np.zeros((401125,), dtype=object)
for i, row in enumerate(csv_file_object):
    data[i] = row
data can basically be depicted as follows:
column1 column2 column3 column4 column5 ....
1 none 2 'gona' 5.3
2 34 2 'gina' 5.5
3 none 2 'gana' 5.1
4 43 2 'gena' 5.0
5 none 2 'guna' 5.7
..... .... ..... ..... ....
..... .... ..... ..... ....
..... .... ..... ..... ....
There are unwanted fields in the middle that I want to remove. Suppose I don't want column3.
How do I remove only that column from my array, or copy only the relevant columns to another array?
Use pandas. It also seems to me that for mixed data types like yours, a pandas.DataFrame may be a better fit.
from StringIO import StringIO
from pandas import *
import numpy as np
data = """column1 column2 column3 column4 column5
1 none 2 'gona' 5.3
2 34 2 'gina' 5.5
3 none 2 'gana' 5.1
4 43 2 'gena' 5.0
5 none 2 'guna' 5.7"""
data = StringIO(data)
print read_csv(data, delim_whitespace=True).drop('column3',axis =1)
out:
column1 column2 column4 column5
0 1 none 'gona' 5.3
1 2 34 'gina' 5.5
2 3 none 'gana' 5.1
3 4 43 'gena' 5.0
4 5 none 'guna' 5.7
If you need an array instead of DataFrame, use the to_records() method:
df.to_records(index = False)
#output:
rec.array([(1L, 'none', "'gona'", 5.3),
(2L, '34', "'gina'", 5.5),
(3L, 'none', "'gana'", 5.1),
(4L, '43', "'gena'", 5.0),
(5L, 'none', "'guna'", 5.7)],
dtype=[('column1', '<i8'), ('column2', '|O4'),
('column4', '|O4'), ('column5', '<f8')])
Assuming you're reading the CSV rows and sticking them into a numpy array, the easiest and best solution is almost definitely preprocessing the data before it gets to the array, as Maciek D.'s answer shows. (If you want to do something more complicated than "remove column 3" you might want something like [value for i, value in enumerate(row) if i not in (1, 3, 5)], but the idea is still the same.)
However, if you've already imported the array and you want to filter it after the fact, you probably want take or delete:
>>> d=np.array([[1,None,2,'gona',5.3],[2,34,2,'gina',5.5],[3,None,2,'gana',5.1],[4,43,2,'gena',5.0],[5,None,2,'guna',5.7]])
>>> np.delete(d, 2, 1)
array([[1, None, gona, 5.3],
[2, 34, gina, 5.5],
[3, None, gana, 5.1],
[4, 43, gena, 5.0],
[5, None, guna, 5.7]], dtype=object)
>>> np.take(d, [0, 1, 3, 4], 1)
array([[1, None, gona, 5.3],
[2, 34, gina, 5.5],
[3, None, gana, 5.1],
[4, 43, gena, 5.0],
[5, None, guna, 5.7]], dtype=object)
For the simple case of "remove column 3", delete makes more sense; for a more complicated case, take probably makes more sense.
If you haven't yet worked out how to import the data in the first place, you could either use the built-in csv module and something like Maciek D.'s code and process as you go, or use something like pandas.read_csv and post-process the result, as root's answer shows.
But it might be better to use a native numpy data format in the first place instead of CSV.
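For example (a sketch, assuming the file is whitespace-delimited like the sample above; 'file.csv' is a placeholder name), np.genfromtxt can skip the unwanted column at read time via usecols, and the result can then be saved in numpy's own .npy format for faster reloading:
import numpy as np

# read only the wanted columns (skipping column3, which is index 2);
# dtype=None infers a per-column type, names=True takes names from the header
data = np.genfromtxt('file.csv', dtype=None, encoding=None,
                     usecols=(0, 1, 3, 4), names=True)

np.save('data.npy', data)    # native binary format
data = np.load('data.npy')   # much faster than re-parsing the CSV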
You can use range selection. E.g. to remove column3, you can use:
data = np.zeros((401125,), dtype=object)
for i, row in enumerate(csv_file_object):
    data[i] = row[:2] + row[3:]
This will work, assuming that csv_file_object yields lists. If it is e.g. a simple file object created with csv_file_object = open("file.csv"), add a split in your loop:
data = np.zeros((401125,), dtype=object)
for i, row in enumerate(csv_file_object):
    row = row.split()
    data[i] = row[:2] + row[3:]
