Forming a sparse feature matrix data frame in Pandas - python

I want to expand the 'features' column of this data frame so that I create a new data frame where these features become the column names.
For example, I want to go from the first DataFrame below (an id plus a list of features per row) to a matrix where each feature becomes its own 0/1 column.
My solution works, but I don't think it is very good because there are lots of for-loops. Maybe there is a better approach that takes advantage of the pandas.DataFrame class?
The code to generate the feature matrix is below:
def feature_data_frame_by_exploding_column(input_df, col_name):
    # Create data frame with same columns minus the column you want to explode
    df = input_df.copy()
    del df[col_name]
    # The items that you want to become new features
    all_new_features = []
    new_feature_list = input_df[col_name].values
    for ingred_list in new_feature_list:
        all_new_features.extend(ingred_list)  # Extend vs append!
    # Add new features as columns of zeros
    for feature in all_new_features:
        df[feature] = 0
    # For each row in data frame set values that need to be 1
    for index in df.index:
        ingreds_arr = new_feature_list[index]
        df.loc[index, ingreds_arr] = 1
    return df
df = pd.DataFrame(columns = ["id", "features"])
df['id'] = [0,1]
df['features'] = [["A", "B"], ["C", "D"]]
df
feature_data_frame_by_exploding_column(df,"features")

Scikit-learn's MultiLabelBinarizer creates a binary matrix from labels. You can extract the feature column from the pandas DataFrame and apply it:
from sklearn.preprocessing import MultiLabelBinarizer

feature = df["features"]  # the column holding the lists of labels
mlb = MultiLabelBinarizer()
new_array = mlb.fit_transform(feature)
Additionally by specifying MultiLabelBinarizer(sparse_output=True) you will get a truly sparse output (useful if the number of different features is large).
Sample output:
>>> MultiLabelBinarizer().fit_transform([["A", "B"], ["C", "D"]])
array([[1, 1, 0, 0],
       [0, 0, 1, 1]])
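A small sketch (my own, not part of the original answer) of turning the binarized array back into a labelled DataFrame: mlb.classes_ holds the distinct feature names, which become the column names, and the id column is re-attached alongside them.

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"id": [0, 1], "features": [["A", "B"], ["C", "D"]]})

mlb = MultiLabelBinarizer()
binarized = mlb.fit_transform(df["features"])

# mlb.classes_ lists the distinct labels in sorted order
expanded = pd.DataFrame(binarized, columns=mlb.classes_, index=df.index)
result = pd.concat([df[["id"]], expanded], axis=1)
print(result)
#    id  A  B  C  D
# 0   0  1  1  0  0
# 1   1  0  0  1  1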

Related

Sort numpy arrays from .dat files in the right row of a pandas dataframe

I have a question about storing data from .dat files in the right row of a dataframe; here is a minimal example.
I already have a dataframe like this:
data = {'col1': [1, 2, 3, 4],'col2': ["a", "b", "c", "d"]}
df = pd.DataFrame(data, index=['row_exp1','row_exp2','row_exp3','row_exp4'])
Now I want to add a new column called col3 containing a numpy array in each cell, so there will be 4 numpy arrays, one per row.
I get the numpy arrays from .dat files.
The important part is that I have to find the right row. I have 4 .dat files, and each file name matches a row name. For example, the first .dat file is called 230109_exp3_foo.dat, so it matches the third row of my dataframe.
The algorithm then has to put the data from the .dat file into the right cell:
          col1  col2  col3
row_exp1  1     a
row_exp2  2     b
row_exp3  3     c     [1,2,3,4,5,6]
row_exp4  4     d
The other entries should be NaN, and I would fill them with the right numpy arrays in the following loop iterations.
I think the difficult part is selecting the right row and matching it with the file name of the .dat file.
If you're working with time series data, this isn't how you want to structure your dataframe. Read up on "tidy" data. (https://r4ds.had.co.nz/tidy-data.html)
Every column is a variable. Every row is an observation.
So let's assume you're loading your data with a function called load_data that accepts a file name:
def load_data(filename):
    # load the data, fill in your own details
    pass
Then you would build up your dataframe like this:
meta_data = {
    'col1': [1, 2, 3, 4],
    'col2': ["a", "b", "c", "d"],
}

list_of_dataframes = []
for n, fname in enumerate(filenames):
    this_array = load_data(fname)
    list_of_dataframes.append(
        pd.DataFrame({
            'row_num': list(range(len(this_array))),
            'col1': meta_data['col1'][n],
            'col2': meta_data['col2'][n],
            'values': this_array,
        })
    )

df = pd.concat(list_of_dataframes, ignore_index=True)
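With that tidy layout, per-experiment work becomes a plain groupby; a usage sketch continuing from the df built above (column names are the ones assumed there):

# mean of the measured values for each experiment, labelled here by col2
per_experiment_mean = df.groupby('col2')['values'].mean()

# all samples belonging to the experiment labelled "c"
exp_c = df[df['col2'] == 'c']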
Maybe this helps:
# assuming each .dat file name follows a similar pattern
list_of_files = ['230109_exp3_foo.dat', '230109_exp2_foo.dat', '230109_exp1_foo.dat', '230109_exp4_foo.dat']

# for each index label, look for the part after "row_" in the file list
files_match = df.reset_index()['index'].map(lambda x: [y for y in list_of_files if x.replace('row_', '') in y])

# if I understand correctly, you know how to read a .dat file,
# so you can insert your own reader instead of function_for_reading_dat_file
df['col3'] = files_match.map(lambda x: function_for_reading_dat_file(x[0]) if len(x) != 0 else 'None')
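A minimal sketch of the row-matching step on its own, assuming every file name contains an expN token that corresponds to a row_expN index label; load_dat is a hypothetical reader you would replace with your own .dat parsing:

import re
import numpy as np
import pandas as pd

def load_dat(path):
    # hypothetical: replace with however you actually read a .dat file
    return np.loadtxt(path)

data = {'col1': [1, 2, 3, 4], 'col2': ["a", "b", "c", "d"]}
df = pd.DataFrame(data, index=['row_exp1', 'row_exp2', 'row_exp3', 'row_exp4'])
df['col3'] = None  # object column so each cell can hold a numpy array

for fname in ['230109_exp3_foo.dat']:
    m = re.search(r'(exp\d+)', fname)
    if m:
        row_label = 'row_' + m.group(1)   # e.g. 'row_exp3'
        df.at[row_label, 'col3'] = load_dat(fname)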

How to use slice to exclude rows and columns from dataframe

I have a DataFrame
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_product([["A", "B"], ["AA", "BB"]])
columns = pd.MultiIndex.from_product([["X", "Y"], ["XX", "YY"]])
df = pd.DataFrame([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12],
                   [13, 14, 15, 16]], index=index, columns=columns)
and slice
toSkip = ((slice(None), slice(None)), (["X"], slice(None)))
I know that I can write df.loc[slice] to get the subset of the DataFrame which corresponds to this slice. But how can I do the opposite, i.e. get the difference between the original df and the one obtained with that slice?
How to invert slicing
To get the idea let's make it more complicated.
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_product([["A", "B", "C"], ["AA", "BB", "CC"]])
columns = pd.MultiIndex.from_product([["X", "Y", "Z"], ["XX", "YY", "ZZ"]])
data = (
    np
    .arange(len(index) * len(columns))
    .reshape(len(index), len(columns))
)
df = pd.DataFrame(data, index, columns)
Let's say I want to process all the data except the inner square (B,Y).
I can get the square by slicing. To get the rest, I'm gonna use a boolean mask:
mask = pd.DataFrame(True, index, columns)
toSkip = ((['B'], slice(None)), (['Y'], slice(None)))
mask.loc[toSkip] = False
Now I can transform the remaining cells through the mask:
# just for illustration purposes
# let's invert the sign of numbers
df[mask] *= -1
Here's the output: every cell outside the (B, Y) block has its sign flipped, while the (B, Y) block keeps its original values.
If slice is a Series of boolean values, then the logical negation operator ~ gives the opposite of the condition. So,
df[~slice]
will return the rows that don't satisfy the condition slice.
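A quick sketch of that idea with a made-up boolean condition:

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})
cond = df["x"] > 2   # boolean Series

print(df[cond])      # rows where the condition holds
print(df[~cond])     # the complement: rows where it does not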
Not sure if this is what you want; you can drop the index and columns of the toSkip dataframe:
toSkip = ((slice(None), slice(None)), (["X"], slice(None)))
tmp = df.loc[toSkip]
out = df.drop(index=tmp.index, columns=tmp.columns)
print(out)
Empty DataFrame
Columns: [(Y, XX), (Y, YY)]
Index: []

Adding a new row to a pandas data frame when columns have different data type?

I have a 2-column pandas data frame, initialized with df = pd.DataFrame([], columns = ["A", "B"]). Column A needs to be of type float, and column B is of type datetime.datetime. I need to add my first values to it (i.e. new rows), but I can't seem to figure out how to do it. I can't do new_row = [x, y] then append it since x and y are not of the same type. How should I go about adding these rows? Thank you.
import pandas as pd
from numpy.random import rand
Option 1 - build the new row as a one-row DataFrame and concatenate it onto the existing one:
df = pd.DataFrame([], columns=["A", "B"])
T = pd.Timestamp(2000, 1, 1)
df2 = pd.DataFrame(columns=["A", "B"], data=[[rand(), T]])
df = pd.concat([df, df2], ignore_index=True)
Or, Option 2 - create an empty DataFrame of the desired size and then assign by position:
df = pd.DataFrame(index=range(5), columns=["A", "B"])
T = pd.Timestamp(2000, 1, 1)
df.iloc[0, :] = [rand(), T]
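Another option (a sketch of my own, not from the original answer) is to assign to a new label with .loc, which lets each column keep its own type:

import pandas as pd
from numpy.random import rand

df = pd.DataFrame(columns=["A", "B"])
df.loc[len(df)] = [rand(), pd.Timestamp(2000, 1, 1)]
df["A"] = df["A"].astype(float)  # ensure column A is float rather than object
print(df.dtypes)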

ValueError When Performing scipy.stats test on Pandas Column Selection by Row

The goal is to create a new column in a pandas dataframe that stores the value of a KS D-statistic, df['ks']. The KS statistic is generated between two groups of columns in that dataframe, grp1 and grp2:
# sample dataframe
import pandas as pd
import numpy as np
from scipy import stats

dic = {'gene': ['x', 'y', 'z', 'n'],
       'cell_a': [1, 5, 8, 9],
       'cell_b': [8, 5, 4, 9],
       'cell_c': [8, 6, 1, 1],
       'cell_d': [1, 2, 7, 1],
       'cell_e': [5, 7, 9, 1],
       }
df = pd.DataFrame(dic)
df.set_index('gene', inplace=True)
df['ks'] = np.nan
# sample groups
grp1 = ['cell_a','cell_b']
grp2 = ['cell_d','cell_e']
So the D-statistic for gene x would be stats.ks_2samp([1, 8], [1, 5])[0], for gene y it would be stats.ks_2samp([5, 5], [2, 7])[0], etc. The attempt is below:
# attempt 1 to fill in KS stat
for idx, row in df.iterrows():
    df.ix[idx, 'ks'] = stats.ks_2samp(df[grp1], df[grp2])[0]
However, when I attempt to fill the ks series, I get the following error:
ValueError: object too deep for desired array
My question has two parts: 1) What does it mean for an object to be "too deep for an array", and 2) how can I accomplish the same thing without iteration?
The KS calculation in the loop was getting a "too deep" error because I needed to pass it a 1-D array for each distribution to test:
for idx, row in df.iterrows():
    df.loc[idx, 'ks'] = stats.ks_2samp(df.loc[idx, grp1], df.loc[idx, grp2])[0]
My previous attempt passed a 2-D array instead; that is what was causing it to be "too deep".
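To address the second part of the question (doing it without an explicit loop), the same row-wise computation can be written with apply; a sketch of my own that assumes the df, grp1, and grp2 defined in the question:

from scipy import stats

# one ks_2samp call per row, comparing that row's grp1 cells to its grp2 cells
df['ks'] = df.apply(lambda row: stats.ks_2samp(row[grp1], row[grp2])[0], axis=1)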

pandas dataframe, how to add new row efficiently

I would like to know how to add a new row efficiently to the dataframe.
Assuming I have an empty dataframe with columns "A" and "B":
columns = ['A','B']
user_list = pd.DataFrame(columns=columns)
I want to add one row like {A=3, B=4} to the dataframe; what is the most efficient way to do that?
import numpy as np
import pandas as pd

columns = ['A', 'B']
user_list = pd.DataFrame(np.zeros((1000, 2)) + np.nan, columns=columns)
user_list.iloc[0] = [3, 4]
user_list.iloc[1] = [4, 5]
Pandas doesn't have built-in resizing, but it will ignore nan's pretty well. You'll have to manage your own resizing, though :/
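A common alternative (a sketch of my own, not from the answer above) is to collect the rows in a plain Python list and build the DataFrame once at the end, which avoids growing the frame repeatedly:

import pandas as pd

rows = []
rows.append({'A': 3, 'B': 4})
rows.append({'A': 4, 'B': 5})

user_list = pd.DataFrame(rows, columns=['A', 'B'])
print(user_list)
#    A  B
# 0  3  4
# 1  4  5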
