Convert a Numpy array into a Pandas DataFrame - python

I have a Pandas DataFrame (dataset, 889x4) and a NumPy ndarray (targets_one_hot, 889x29), which I want to concatenate. To do that, I want to convert targets_one_hot into a Pandas DataFrame.
I looked at several suggestions, but they all deal with smaller arrays, where it is fine to write out each column by hand.
For 29 columns, that seems inefficient. Is there an efficient way to turn this NumPy array into a Pandas DataFrame?

We can wrap a NumPy array in a pandas DataFrame by passing it as the first parameter. Then we can use pd.concat(..) to concatenate the original dataset and the DataFrame of targets_one_hot into a new DataFrame. Since we concatenate "horizontally" here (adding columns side by side rather than stacking rows), we need to set the axis parameter to axis=1:
pd.concat((dataset, pd.DataFrame(targets_one_hot)), axis=1)
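
A minimal sketch of the answer above, using small made-up data in place of the 889-row dataset. The `target_<i>` column names and the stand-in values are assumptions for illustration only. Because pd.concat aligns rows on the index, the sketch also reuses the index of `dataset`; without that, a dataset with a non-default index would produce NaN-filled rows.

```python
import numpy as np
import pandas as pd

# Stand-in data (hypothetical): 5 rows instead of 889, 3 one-hot columns instead of 29.
dataset = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]},
    index=[10, 11, 12, 13, 14],
)
targets_one_hot = np.eye(3, dtype=int)[[0, 2, 1, 0, 2]]

# Generate the column names instead of writing them out by hand, and reuse
# the index of `dataset` so that pd.concat lines the rows up correctly.
targets_df = pd.DataFrame(
    targets_one_hot,
    columns=[f"target_{i}" for i in range(targets_one_hot.shape[1])],
    index=dataset.index,
)

# axis=1 places the two frames side by side (more columns, same rows).
combined = pd.concat([dataset, targets_df], axis=1)
print(combined.shape)
```

If the default integer column names (0, 1, 2, ...) are acceptable, the `columns` argument can simply be omitted.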

Related

Merge multiple int columns/rows into one numpy array (pandas dataframe)

I have a pandas dataframe with few columns and rows. I want to merge the columns into one and then merge the rows based on id and date into one.
Currently I am doing so by:
df['matrix'] = df[[col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19,col20,col21,col22,col23,col24,col25,col26,col27,col28,col29,col30,col31,col32,col33,col34,col35,col36,col37,col38,col39,col40,col41,col42,col43,col44,col45,col46,col47,col48]].values.tolist()
df = df.groupby(['id','date'])['matrix'].apply(list).reset_index(name='matrix')
This gives me the matrix in form of a list.
Later I convert it into numpy.ndarray using:
df['matrix'] = df['matrix'].apply(np.array)
This is a small segment of my dataset for reference:
id,date,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19,col20,col21,col22,col23,col24,col25,col26,col27,col28,col29,col30,col31,col32,col33,col34,col35,col36,col37,col38,col39,col40,col41,col42,col43,col44,col45,col46,col47,col48
16,2014-06-22,0,0,0,10,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,2,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,3,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0
16,2014-06-22,4,0,0,0,0,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,22,0,0,0,0
The code above works fine for small datasets, but it sometimes crashes on larger ones, specifically at the df['matrix'].apply(np.array) statement.
Is there a way to do the merging so that I get a numpy.array directly? This would save a lot of time.
There is no need to merge the columns first. Split the DataFrame using groupby and then flatten each group's values:
matrix=df.set_index(['id','date']).groupby(['id','date']).apply(lambda x: x.values.flatten())
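
Note that flatten() returns one 1-D array per group. If a 2-D matrix per (id, date) pair is wanted instead, as the question's list-of-rows approach suggests, something like the following sketch may work. The tiny frame and the `value_cols` and `matrices` names are assumptions for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset: two rows share (16, "2014-06-22").
df = pd.DataFrame({
    "id": [16, 16, 17],
    "date": ["2014-06-22", "2014-06-22", "2014-06-23"],
    "col0": [0, 1, 2],
    "col1": [10, 0, 5],
    "col2": [0, 6, 0],
})

# Every column except the grouping keys belongs to the matrix.
value_cols = [c for c in df.columns if c not in ("id", "date")]

# One 2-D NumPy array per (id, date) group, with no intermediate Python lists.
matrices = {
    key: group[value_cols].to_numpy()
    for key, group in df.groupby(["id", "date"])
}

print(matrices[(16, "2014-06-22")].shape)
```

Calling .ravel() on any of these arrays gives the flattened form produced by the answer above.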

Slicing a pandas df using a numpy array

This should be easy, but I can't figure out the right syntax. Let's say I get a NumPy array of all NA locations for a particular column like so:
index = np.where(df['Gene'].isnull())[0]
I want to now examine those rows in the df. I've tried things like:
df.iloc[[index]]
df[[index]]
To no avail.
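
A possible fix, sketched under the assumption that `index` holds integer row positions as returned by np.where: the extra pair of brackets wraps the array in a list, so pass the array to iloc directly. The sample frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries in the "Gene" column.
df = pd.DataFrame({
    "Gene": ["BRCA1", None, "TP53", None],
    "score": [1, 2, 3, 4],
})

# Integer positions of the rows where "Gene" is missing.
index = np.where(df["Gene"].isnull())[0]

# Pass the position array straight to iloc, without wrapping it in a list.
rows_by_position = df.iloc[index]

# Equivalent selection that skips np.where and uses the boolean mask itself.
rows_by_mask = df[df["Gene"].isnull()]

print(rows_by_position)
```

The boolean-mask form is usually the simpler choice when the positions are not needed for anything else.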

Common way to select columns in numpy array and pandas dataframe

I have to write an object that takes either a pandas data frame or a numpy array as the input (similar to sklearn behavior). In one of the methods for this object, I need to select the columns (not a particular fixed one, I get a few column indices based on other calculations).
So, to make my code compatible with both input types, I tried to find a common way to select columns. I tried things like X[:,0] (which doesn't work on pandas data frames), X[0] and others, but they select differently. Is there a way to select columns in the same fashion across pandas and numpy?
If not, how does sklearn work across these data structures?
You can use an if condition within your method and handle pandas DataFrames and NumPy arrays separately. Sample code:
def method_1(self, var, col_indices):
    # DataFrames are indexed by label, so map the positions to column labels first.
    if isinstance(var, pd.DataFrame):
        selected_columns = var[var.columns[col_indices]]
    else:
        # NumPy arrays support positional column selection directly.
        selected_columns = var[:, col_indices]
    return selected_columns
Here, var is your input, which can be a NumPy array or a pandas DataFrame, and col_indices are the indices of the columns you want to select.
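
As a self-contained illustration of the same idea, here is a sketch written as a plain function so it can be run on its own. The name `select_columns` and the sample data are hypothetical. It uses DataFrame.iloc, which selects by integer position, so both branches take the same kind of indices.

```python
import numpy as np
import pandas as pd


def select_columns(X, col_indices):
    """Return the columns at the given integer positions.

    Works for a pandas DataFrame (result is a DataFrame) and for anything
    NumPy can treat as a 2-D array (result is an ndarray).
    """
    if isinstance(X, pd.DataFrame):
        return X.iloc[:, col_indices]
    return np.asarray(X)[:, col_indices]


arr = np.arange(12).reshape(3, 4)
frame = pd.DataFrame(arr, columns=list("abcd"))

from_array = select_columns(arr, [0, 2])
from_frame = select_columns(frame, [0, 2])

print(from_array)
print(from_frame)
```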

pandas: Select one-row data frame instead of series [duplicate]

I have a huge dataframe, and I index it like so:
df.ix[<integer>]
Depending on the index, sometimes this will have only one row of values. Pandas automatically converts this to a Series, which, quite frankly, is annoying because I can't operate on it the same way I can a df.
How do I either:
1) Stop pandas from converting and keep it as a dataframe ?
OR
2) easily convert the resulting series back to a dataframe ?
pd.DataFrame(df.ix[<integer>]) does not work because it doesn't keep the original columns. It treats the <integer> as the column, and the columns as indices. Much appreciated.
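You can do df.ix[[n]] to get a one-row dataframe of row n. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0; in current versions, use df.loc[[n]] (selection by label) or df.iloc[[n]] (selection by position) instead.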
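
A short sketch of both options from the question, assuming a current pandas version where .ix is no longer available. The frame and its index labels are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with non-default index labels.
df = pd.DataFrame(
    {"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]},
    index=[100, 200, 300],
)

# A scalar label returns a Series ...
row_as_series = df.loc[200]

# ... while a one-element list keeps the result a one-row DataFrame.
row_as_frame = df.loc[[200]]      # select by index label
row_by_position = df.iloc[[1]]    # select by integer position

# Option 2: turn an existing Series back into a one-row DataFrame.
rebuilt = row_as_series.to_frame().T

print(row_as_frame)
```

The list-based selection keeps the original column dtypes. The to_frame().T route goes through a single Series, so columns of mixed types may end up with a common dtype.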

