Common way to select columns in numpy array and pandas dataframe - python

I have to write an object that accepts either a pandas DataFrame or a NumPy array as input (similar to sklearn's behavior). In one of this object's methods I need to select columns. The columns are not fixed: I get a few column indices from other calculations.
To make my code compatible with both input types, I tried to find a common way to select columns. I tried X[:, 0] (which doesn't work on pandas DataFrames), X[0], and others, but they select differently. Is there a way to select columns in the same fashion across pandas and NumPy?
If not, how does sklearn work across these data structures?

You can use an if condition within your method and have separate selection logic for pandas DataFrames and NumPy arrays. Sample code is given below.
def method_1(self, var, col_indices):
    if isinstance(var, pd.DataFrame):
        # look up the column labels at the given positions, then select by label
        selected_columns = var[var.columns[col_indices]]
    else:
        # positional column selection on a 2-D NumPy array
        selected_columns = var[:, col_indices]
    return selected_columns
Here, var is your input which can be a numpy array or pandas dataframe, col_indices are the indices of the columns you want to select.
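The same branching can also be written as a small standalone helper that uses positional indexing (.iloc) on the DataFrame side, which avoids surprises when column labels are duplicated. This is only a sketch: the name select_columns is made up for illustration, and it assumes col_indices is a list of integer positions. As for sklearn, its estimators generally validate the input and convert it to a NumPy array before doing any work (sklearn.utils.check_array is the public helper for that), which is why most of its internal code can use plain array indexing.

```python
import numpy as np
import pandas as pd


def select_columns(X, col_indices):
    """Return the columns of X at the given integer positions (hypothetical helper)."""
    if isinstance(X, pd.DataFrame):
        # .iloc selects by position, regardless of the column labels
        return X.iloc[:, col_indices]
    # Anything else is treated as array-like and indexed as a 2-D ndarray
    return np.asarray(X)[:, col_indices]


arr = np.arange(12).reshape(3, 4)
df = pd.DataFrame(arr, columns=["a", "b", "c", "d"])

from_array = select_columns(arr, [0, 2])   # ndarray with 2 columns
from_frame = select_columns(df, [0, 2])    # DataFrame with columns "a" and "c"
```

The return type follows the input type here; if downstream code needs a single type, calling .to_numpy() on the DataFrame result is one option.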

Related

adding a column of numpy arrays to an existing Pandas DataFrame

I have a Pandas DataFrame to which I would like to add a new column that I will then populate with numpy arrays, such that each row in that column contains a numpy array. I'm using the following approach, and am wondering whether this is the correct approach.
df['embeddings'] = pd.Series(dtype='object')
Then I would iterate over rows and add computed arrays like so (using np.zeros(1024) for illustration only, in reality these are the output of a neural network):
for i in range(df.shape[0]):
    df['embeddings'].loc[i] = np.zeros(1024)
I tested whether it helps to pre-allocate the cells like so, but didn't notice a difference in execution time when I then iterate over rows, at least not with a DataFrame that only has 200 rows:
df['embeddings'] = [np.zeros(1024)] * df.shape[0]
As an alternative to adding a column and then updating its rows, one could first create the list of numpy arrays and then add that list as a new column, but that would require more memory.
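For comparison, here is a minimal sketch of the "build the list first" alternative. The function compute_embedding is a hypothetical stand-in for the neural network, and the arrays are shortened to 4 elements to keep the example small. The last lines also show that multiplying a list, as in the pre-allocation attempt, repeats one and the same array object rather than creating a separate array per row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"text": ["a", "b", "c"]})


def compute_embedding(row_number):
    # Hypothetical stand-in for the network output: a fresh array per row
    return np.full(4, float(row_number))


# Build all arrays first, then assign them as one object-dtype column
embeddings = [compute_embedding(i) for i in range(len(df))]
df["embeddings"] = embeddings

# List multiplication repeats a reference to the same array object
shared = [np.zeros(4)] * len(df)
all_same_object = all(a is shared[0] for a in shared)
```

Assigning the whole column in one step sidesteps the row-by-row chained assignment, which pandas may reject or warn about depending on the version.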

Convert a Numpy array into a Pandas DataFrame

I have a Pandas DataFrame (dataset, 889x4) and a Numpy ndarray (targets_one_hot, 889x29), which I want to concatenate. Therefore, I want to convert targets_one_hot into a Pandas DataFrame.
To do so, I looked at several suggestions. However, these suggestions are about smaller arrays, for which it is okay to write out the different columns.
For 29 columns, this seems inefficient. Who can tell me efficient ways to turn this Numpy array into a Pandas DataFrame?
We can wrap a numpy array in a pandas dataframe by passing it as the first argument to the constructor. Then we can use pd.concat(..) [pandas-doc] to concatenate the original dataset and the dataframe built from targets_one_hot into a new dataframe. Since we are concatenating horizontally here (placing the columns side by side), we need to set axis=1:
pd.concat((dataset, pd.DataFrame(targets_one_hot)), axis=1)
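One thing to watch with this approach: pd.concat aligns rows on the index, and pd.DataFrame(targets_one_hot) gets a fresh 0..n-1 index. If dataset has a different index (for example after filtering rows), the result contains NaN-padded rows. The sketch below, on small made-up data, reuses the index of dataset to avoid that. The class_k column names are an assumption added for readability, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Small stand-ins for the 889x4 dataset and the 889x29 one-hot array
dataset = pd.DataFrame(
    {"f1": [1.0, 2.0, 3.0], "f2": [4.0, 5.0, 6.0]},
    index=[10, 11, 12],  # deliberately not 0..n-1
)
targets_one_hot = np.eye(3, dtype=int)

targets_df = pd.DataFrame(
    targets_one_hot,
    index=dataset.index,  # reuse the index so the rows line up
    columns=[f"class_{k}" for k in range(targets_one_hot.shape[1])],
)
combined = pd.concat([dataset, targets_df], axis=1)
```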

How to make an object to be a dataframe in python

I have implemented the following piece of code:
array = [table.iloc[:, [0]], table.iloc[:, [i]]]
It is supposed to be a dataframe consisting of two vectors extracted from a previously imported dataset. I use the parameter i because this code is part of a loop that uses a predefined function to analyze correlations between one fixed variable [0] and each of the others: each iteration checks the correlation with a different variable [i].
Python treats this object as a list, or as a tuple when I change the brackets to round ones. I need this object to be a dataframe (the next step is to remove NaN values using .dropna, which is a dataframe method).
How can I fix that issue?
If I have correctly understood your question, you want to build an extract from a larger dataframe containing only 2 columns known by their index number. You can simply do:
sub = table.iloc[:, [0,i]]
It will keep all attributes (including index, column names and dtype) from the original table dataframe.
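Put into the loop described in the question, this could look roughly like the sketch below. The table contents are made up, and Series.corr stands in for the asker's predefined correlation function, which is not shown in the question.

```python
import numpy as np
import pandas as pd

# Made-up data: column 0 is the fixed variable
table = pd.DataFrame({
    "fixed": [1.0, 2.0, np.nan, 4.0],
    "a": [2.0, np.nan, 6.0, 8.0],
    "b": [1.0, 0.0, 1.0, 0.0],
})

results = {}
for i in range(1, table.shape[1]):
    # A list of positions keeps the result a DataFrame, so .dropna() is available
    sub = table.iloc[:, [0, i]].dropna()
    # Series.corr is used here in place of the asker's own function
    results[table.columns[i]] = sub.iloc[:, 0].corr(sub.iloc[:, 1])
```

Dropping NaN rows per pair, rather than once for the whole table, keeps every row that is complete for the two columns being compared.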
What is your goal with the dataframe?
Dataframe is a common term in data analysis with pandas.
Pandas was developed to facilitate exactly this kind of analysis. Reading the data from a .csv file into a dataframe is as simple as:
import pandas as pd
df = pd.read_csv('my-data.csv')
df.info()
Or build one from a dict or array:
df = pd.DataFrame(my_dict_or_array)
Then you can select the columns you want by label:
df.loc[:, ['COLUMN_1', 'COLUMN_2']]
Let us know if it's what you are looking for

python pandas difference between df_train["x"] and df_train[["x"]]

I have the following dataset, which I read from a csv file.
x =[1,2,3,4,5]
With pandas I can access the array with
df_train = pd.read_csv("train.csv")
x = df_train["x"]
and
x = df_train[["x"]]
Both seem to produce the same result. The former makes sense to me, but the latter does not. Could you please explain the difference and when to use each?
In pandas, you can slice your data frame in different ways. On a high level, you can choose to select a single column out of a data frame, or many columns.
When you select many columns, you have to slice using a list, and the return is a pandas DataFrame. For example
df[['col1', 'col2', 'col3']] # returns a data frame
When you select only one column, you can pass only the column name, and the return is just a pandas Series
df['col1'] # returns a series
When you do df[['col1']], you get back a DataFrame with only one column. In other words, it's like you're telling pandas "give me all the columns from the following list:" and then giving it a list with a single column in it. It will filter your df, returning all the columns in your list (in this case, a data frame with only 1 column).
If you want more details on the difference between a Series and a one-column DataFrame, check this thread with very good answers
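The difference is easy to check on a small frame. The sketch below builds the data in memory instead of reading train.csv, and the extra column y is added only for illustration.

```python
import pandas as pd

df_train = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})

as_series = df_train["x"]    # single label: 1-D pandas Series
as_frame = df_train[["x"]]   # list of labels: 2-D DataFrame with one column

series_shape = as_series.shape  # one dimension
frame_shape = as_frame.shape    # two dimensions
```

The distinction matters in practice when an API expects 2-D input: a one-column DataFrame has a column axis, while a Series does not.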

Merge multiple int columns/rows into one numpy array (pandas dataframe)

I have a pandas dataframe with a few columns and rows. I want to merge the columns into one and then merge the rows into one, grouped by id and date.
Currently I am doing so by:
df['matrix'] = df[[col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19,col20,col21,col22,col23,col24,col25,col26,col27,col28,col29,col30,col31,col32,col33,col34,col35,col36,col37,col38,col39,col40,col41,col42,col43,col44,col45,col46,col47,col48]].values.tolist()
df = df.groupby(['id','date'])['matrix'].apply(list).reset_index(name='matrix')
This gives me the matrix in form of a list.
Later I convert it into numpy.ndarray using:
df['matrix'] = df['matrix'].apply(np.array)
This is a small segment of my dataset for reference:
id,date,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19,col20,col21,col22,col23,col24,col25,col26,col27,col28,col29,col30,col31,col32,col33,col34,col35,col36,col37,col38,col39,col40,col41,col42,col43,col44,col45,col46,col47,col48
16,2014-06-22,0,0,0,10,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,2,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,3,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0
16,2014-06-22,4,0,0,0,0,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,22,0,0,0,0
The above code works fine for small datasets but sometimes crashes for larger ones, specifically at the df['matrix'].apply(np.array) statement.
Is there a way to perform the merging so that it yields a numpy.array directly? This would save a lot of time.
There is no need to merge the columns first. Split the DataFrame with groupby and then flatten each group's values:
matrix=df.set_index(['id','date']).groupby(['id','date']).apply(lambda x: x.values.flatten())
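Note that flatten() returns a one-dimensional array per group. If a two-dimensional matrix per (id, date) pair is wanted, as in the question's list-of-lists version, one alternative is to iterate over the groups and collect the values in a dict. This is a different technique from the one-liner above, sketched on made-up data with only two value columns.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [16, 16, 17],
    "date": ["2014-06-22", "2014-06-22", "2014-06-23"],
    "col1": [0, 1, 2],
    "col2": [10, 0, 5],
})

# Every column except the grouping keys goes into the matrix
value_cols = [c for c in df.columns if c not in ("id", "date")]

# One 2-D array per (id, date) pair: rows of the group x value columns
matrices = {
    key: group[value_cols].to_numpy()
    for key, group in df.groupby(["id", "date"])
}
```

Each dict value is a plain ndarray, so no intermediate Python lists are created.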
