I have a DataFrame
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_product([["A", "B"], ["AA", "BB"]])
columns = pd.MultiIndex.from_product([["X", "Y"], ["XX", "YY"]])
df = pd.DataFrame([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12],
                   [13, 14, 15, 16]], index=index, columns=columns)
and slice
toSkip = ((slice(None), slice(None)), (["X"], slice(None)))
I know that I can write df.loc[toSkip] to get the subset of the DataFrame that corresponds to this slice. But how can I do the opposite, i.e. get the difference between the original df and the one obtained with that slice?
How to invert slicing
To illustrate the idea, let's make the example a little bigger.
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_product([["A", "B", "C"], ["AA", "BB", "CC"]])
columns = pd.MultiIndex.from_product([["X", "Y", "Z"], ["XX", "YY", "ZZ"]])
data = (
    np
    .arange(len(index) * len(columns))
    .reshape(len(index), len(columns))
)
df = pd.DataFrame(data, index, columns)
Let's say I want to process all the data except the inner square (B,Y).
I can get that square by slicing. To get everything else, I'm going to use a boolean mask:
mask = pd.DataFrame(True, index, columns)
toSkip = ((['B'], slice(None)), (['Y'], slice(None)))
mask.loc[toSkip] = False
Now I can transform everything else by selecting with the mask:
# just for illustration purposes
# let's invert the sign of numbers
df[mask] *= -1
Here's the result: every value outside the (B, Y) block has its sign flipped, while the (B, Y) block is unchanged.
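A quick sanity check (my own addition, not part of the original answer) confirms which cells were touched:
values = df.to_numpy()
flipped = mask.to_numpy()
assert (values[flipped] <= 0).all()   # everything selected by the mask was negated
assert (values[~flipped] >= 0).all()  # the skipped (B, Y) block kept its original values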
If slice is a Series with boolean values, then the logical negation operator ~ will give the opposite of the condition. So,
df[~slice]
will return the rows that don't satisfy the condition slice.
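For the frame from the question, a boolean row condition and its negation might look like this (the particular condition is just my own illustration):
row_mask = df.index.get_level_values(0) == "A"  # True for rows whose outer label is "A"
df[~row_mask]                                   # only the rows whose outer label is not "A" remain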
Not sure if this is what you want, but you can drop the index and columns of the toSkip selection from the original DataFrame:
toSkip = ((slice(None), slice(None)), (["X"], slice(None)))
tmp = df.loc[toSkip]
out = df.drop(index=tmp.index, columns=tmp.columns)
print(out)
Empty DataFrame
Columns: [(Y, XX), (Y, YY)]
Index: []
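The result is empty here because toSkip selects every row, so all of them get dropped. With a slice that only covers part of the frame, the same pattern leaves the complement; for example (my own variation, reusing the 4x4 frame from the question):
toSkip = ((["A"], slice(None)), (["X"], slice(None)))
tmp = df.loc[toSkip]
out = df.drop(index=tmp.index, columns=tmp.columns)  # keeps the B rows and the Y columns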
Related
I have a DataFrame with a MultiIndex where I would like to, as efficiently as possible:
Filter by one index (flag & flag_filter != 0)
Group and sum by the other two (df.groupby(['time', 'sensor']).sum(['col1','col2','col3']))
So as a setup:
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_product(
    [
        range(0, 0xff),
        range(0, 5000),
        range(1, 3),
    ], names=["flags", "time", "sensor"]
)
data = pd.DataFrame({
    "col1": np.random.uniform(size=len(index), low=0.0, high=0.5),
    "col2": np.random.uniform(size=len(index), low=0.0, high=0.5),
    "col3": np.random.uniform(size=len(index), low=0.0, high=0.5),
}, index=index)
I'm hoping to get, from this, a DataFrame with the same columns, but an index of just time, sensor. The idea is we threw out rows that didn't match the filter, and summed the rows that did, while still maintaining the time, sensor grouping.
Combine .loc with droplevel:
# Let's say we want to filter for even flags
flag_filter = data.index.get_level_values("flags") % 2 == 0
# Select matching rows and drop the first level
data.loc[flag_filter, :].droplevel(0)
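Putting it together with the grouping from the question, one possible sketch (the even-flag condition is still just an example) is:
flag_filter = data.index.get_level_values("flags") % 2 == 0
result = (
    data.loc[flag_filter]
        .groupby(level=["time", "sensor"])
        .sum()
)
The result keeps col1, col2 and col3 but is indexed by time and sensor only.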
I'm trying to create a new column on my DataFrame that combines two existing columns
import pandas as pd
import numpy as np
DATA=pd.DataFrame(np.random.randn(5,2), columns=['A', 'B'])
DATA['index']=np.arange(5)
DATA.set_index('index', inplace=True)
The output is something like this
              A         B
index
0     -0.003635 -0.644897
1     -0.617104 -0.343998
2      1.270503 -0.514588
3     -0.053097 -0.404073
4     -0.056717  1.870671
I would like to have an extra column 'C' that has an np.array with the elements of 'A' and 'B' for the corresponding row. In the real case, 'A' and 'B' are already 1D np.arrays, but of different lengths. I would like to make a longer array with all the elements stacked or concatenated.
Thanks
If columns a and b contain numpy arrays, you can apply hstack across rows:
import pandas as pd
import numpy as np
num_rows = 10
max_arr_size = 3
df = pd.DataFrame({
    "a": [np.random.rand(max_arr_size) for _ in range(num_rows)],
    "b": [np.random.rand(max_arr_size) for _ in range(num_rows)],
})
df["c"] = df.apply(np.hstack, 1)
assert all(row.a.size + row.b.size == row.c.size for _, row in df.iterrows())
DATA['C'] = DATA.apply(lambda x: np.array([x.A, x.B]), axis=1)
pandas requires every row to have the same number of columns, so the problem of unequal-length Series shouldn't arise here
I have a 2-column pandas data frame, initialized with df = pd.DataFrame([], columns = ["A", "B"]). Column A needs to be of type float, and column B is of type datetime.datetime. I need to add my first values to it (i.e. new rows), but I can't seem to figure out how to do it. I can't do new_row = [x, y] then append it since x and y are not of the same type. How should I go about adding these rows? Thank you.
import pandas as pd
from numpy.random import rand
Option 1 - make the new row as a DataFrame and concatenate it to the previous one (DataFrame.append has been removed in recent pandas, so use pd.concat):
df = pd.DataFrame([], columns=["A", "B"])
T = pd.Timestamp(2000, 1, 1)
df2 = pd.DataFrame(columns=["A", "B"], data=[[rand(), T]])
df = pd.concat([df, df2])
Or, Option 2 - create an empty DF and then fill rows by position:
df = pd.DataFrame(index=range(5), columns=["A", "B"])
T = pd.Timestamp(2000, 1, 1)
df.iloc[0, :] = [rand(), T]
This is a common pattern I've been using:
rows = ['Joe','Amy','Tom']
columns = ['account_no', 'balance']
def f(row, column):
    '''Fetches value from database'''
    return np.random.random()
pd.DataFrame([[f(row, column) for column in columns] for row in rows], index=rows, columns=columns)
If the rows and columns are numerical, I can also use np.meshgrid:
rows = [1,2,3]
columns = [4,5]
xs, ys = np.meshgrid(rows, columns, indexing="ij")
pd.DataFrame(np.vectorize(f)(xs, ys), index=rows, columns=columns)
My question is, what is the most elegant/Pythonic/"pandasic"/fastest/most readable way to doing this in the general case?
Thanks!
A way of doing this could be to turn your function into a ufunc and then use outer:
import numpy as np
uf = np.frompyfunc(f, 2, 1)  # f has 2 inputs, 1 output
pd.DataFrame(uf.outer(rows, columns), index=rows, columns=columns)
One criterion you mention, though, is 'most readable', and for that I'd say your existing for-loop solution is best.
I want to expand the 'features' column of this data frame so that I create a new data frame where these features become the column names.
For example, I want to go from a frame with an id column and a list-valued features column to a frame where each feature has its own 0/1 indicator column (the example data is constructed below).
My solution works, but I don't think it is very good because there are lots of for-loops. Maybe there is a better approach that takes advantage of features of the pandas.DataFrame class?
The code to generate the feature matrix is below,
def feature_data_frame_by_exploding_column(input_df, col_name):
    # Create a data frame with the same columns minus the column you want to explode
    df = input_df.copy()
    del df[col_name]
    # The items that you want to become new features
    all_new_features = []
    new_feature_list = input_df[col_name].values
    for ingred_list in new_feature_list:
        all_new_features.extend(ingred_list)  # Extend vs append!
    # Add new features as columns of zeros
    for feature in all_new_features:
        df[feature] = 0
    # For each row in the data frame, set the values that need to be 1
    for index in df.index:
        ingreds_arr = new_feature_list[index]
        df.loc[index, ingreds_arr] = 1
    return df
df = pd.DataFrame(columns = ["id", "features"])
df['id'] = [0,1]
df['features'] = [["A", "B"], ["C", "D"]]
df
feature_data_frame_by_exploding_column(df,"features")
Scikit-learn's MultiLabelBinarizer creates a binary matrix from labels. You can extract the feature column from the pandas DataFrame and apply it:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
new_array = mlb.fit_transform(df["features"])
Additionally by specifying MultiLabelBinarizer(sparse_output=True) you will get a truly sparse output (useful if the number of different features is large).
Sample output:
>>> MultiLabelBinarizer().fit_transform([["A", "B"], ["C", "D"]])
array([[1, 1, 0, 0],
[0, 0, 1, 1]])
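To get back a labeled frame like the desired output, you can wrap the binarized array with the learned classes as column names and join it to the id column (a small sketch of my own, building on the snippet above):
encoded = pd.DataFrame(new_array, columns=mlb.classes_, index=df.index)
out = df[["id"]].join(encoded)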