Remove a substring from a pandas dataframe column - python

I have a large (45K rows) dataset and I need to remove specific values from specific columns in a handful of cases. The dataset is large enough I'd like to avoid using apply if at all possible.
Here's a sample dataset:
df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
drops = pd.DataFrame({"ID": [40, 50],"Column": ["S", "S"],"Rule": ["Remove", "Remove"],"Override": ["p", "p"]})
My current solution is to use:
(
    df.merge(
        drops.pivot(index="ID", columns="Column", values="Override").reset_index()[["ID", "S"]],
        how="left",
        on=["ID", "S"],
        indicator="_dropS",
    ).assign(
        S=lambda d_: d_.S.mask(d_._dropS == "both", np.nan)))
But this only successfully removes one of the entries. My general Python knowledge is telling me to split the column S by the delimiter "/", remove the matching entry, and join the list back together again (there may be more than two entries in the S column), but I can't seem to make that work within the DataFrame without using apply.
Edited to add goal state: Column S should have the entries: 'n', 'o', ''. The final could be NaN as well.
Is there a reasonable way to do this without a separate function call?

IIUC, here is one solution that gives the expected output. I have no idea about the performance and would be interested in your feedback on that.
#from your sample data
df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
drops = pd.DataFrame({"ID": [40, 50],"Column": ["S", "S"],"Rule": ["Remove", "Remove"],"Override": ["p", "p"]})
pivoted_rules = drops.pivot(index="ID", columns="Column", values="Override").rename(columns={'S': 'compare_S'})
res = pd.concat([df.set_index('ID'),pivoted_rules],axis=1).fillna('fill_value')
# join with '/' so rows with more than two entries keep their delimiter
res['S'] = ['/'.join([x for x in a if x != b]) for a, b in zip(res['S'].str.split('/'), res['compare_S'])]
res = res.drop('compare_S', axis=1).reset_index()
print(res)
ID T S
0 30 C n
1 40 D o
2 50 E
Didn't use apply :)
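The split / remove / join idea from the question can also be expressed with explode and a merge, which avoids both apply and a Python-level loop over rows. This is only a sketch: it assumes ID is unique in df, and the names tokens, rules, flagged, kept and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
drops = pd.DataFrame({"ID": [40, 50], "Column": ["S", "S"],
                      "Rule": ["Remove", "Remove"], "Override": ["p", "p"]})

# One row per (ID, entry) after splitting S on "/"
tokens = df[["ID"]].assign(S=df["S"].str.split("/")).explode("S")

# Removal rules that target column S, renamed so they can be merged on ["ID", "S"]
rules = (drops.loc[drops["Column"] == "S", ["ID", "Override"]]
         .rename(columns={"Override": "S"})
         .drop_duplicates())

# Entries that match a rule are marked "both"; everything else is "left_only"
flagged = tokens.merge(rules, how="left", on=["ID", "S"], indicator=True)

# Keep the unmatched entries and join them back together per ID
kept = flagged.loc[flagged["_merge"] == "left_only"].groupby("ID")["S"].agg("/".join)

# IDs whose entries were all removed are missing from `kept`, so fill them with ""
result = df.assign(S=df["ID"].map(kept).fillna(""))
print(result)
```

Whether this is faster than the list comprehension on 45K rows would need to be measured; explode multiplies the row count by the number of entries per cell.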

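To remove specific values from a specific column, you can use .str.replace. Note that this pattern strips every "/" and every "p" from all rows of S, not only from the IDs listed in drops; recent pandas versions also need regex=True for the pattern to be treated as a regular expression.
df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
df.loc[:,'S'] = df['S'].str.replace(r'[/p]', '', regex=True)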
the result :
ID T S
0 30 C n
1 40 D o
2 50 E
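The character class [/p] removes those characters anywhere in the column. A more conservative sketch restricts the replacement to the IDs named in drops and only removes "p" when it is a whole "/"-separated entry. It assumes every rule removes the same value "p", as in the sample data; the names mask and pattern are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
drops = pd.DataFrame({"ID": [40, 50], "Column": ["S", "S"],
                      "Rule": ["Remove", "Remove"], "Override": ["p", "p"]})

# Rows whose ID has a removal rule for column S
mask = df["ID"].isin(drops.loc[drops["Column"] == "S", "ID"])

# Match "p" only as a complete entry: preceded by start-of-string or "/",
# and followed by "/" or end-of-string
pattern = r"(?:^|/)p(?=/|$)"

df.loc[mask, "S"] = (
    df.loc[mask, "S"]
    .str.replace(pattern, "", regex=True)
    .str.lstrip("/")  # a removed leading entry leaves a stray "/"
)
print(df)
```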

Related

Return a boolean mask for a Pandas index wrt a subset of the index

I have a df like this:
df = pd.DataFrame(index=["a", "b", "c"], data={"col": [1, 2, 3]})
And a subset of the indexes: s = ["a", "b"]
I would like to form the boolean mask so I can change the values of row "a" and "b" only, like so:
df.loc[m, "col"] = [10, 11]
Is there a neat way to do this?
If the list of assigned values is in the same order as the list of index labels, use:
df.loc[s, "col"] = [10, 11]
print (df)
col
a 10
b 11
c 3
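If you use a boolean mask from Index.isin and the order of the list differs from the order of the index, you get a different output, because the mask assigns values in index order rather than in list order: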
df = pd.DataFrame(index=["a", "b", "c"], data={"col": [1, 2, 3]})
#swapped order
s = ["b", "a"]
m = df.index.isin(s)
df.loc[m, "col"] = [10, 11]
print (df)
col
a 10
b 11
c 3
df.loc[s, "col"] = [10, 11]
print (df)
col
a 11
b 10
c 3
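If a mask is needed anyway, one way to make the assignment independent of list order is to put the new values in a Series labelled by s, since .loc aligns a Series on its index when setting. A small sketch (the name new_values is illustrative):

```python
import pandas as pd

df = pd.DataFrame(index=["a", "b", "c"], data={"col": [1, 2, 3]})
s = ["b", "a"]

# Label each new value with the row it belongs to
new_values = pd.Series([10, 11], index=s)

m = df.index.isin(s)
# The Series is aligned on its labels, so "b" gets 10 and "a" gets 11
df.loc[m, "col"] = new_values
print(df)
```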

How to mark first entry per group satisfying some criterion?

Let's say I have a dataframe where one column has values that occur multiple times, forming groups (column A in the snippet). Now I'd like to create a new column with, e.g., a 1 for the first entry per group that has an 'x' in column C, and 0 for all other entries.
I managed to do the first part, but I did not find a good way to include the condition on the 'x' entries. Is there a good way of doing that?
import pandas as pd
df = pd.DataFrame(
{
"A": ["0", "0", "1", "2", "2", "2"], # data to group by
"B": ["a", "b", "c", "d", "e", "f"], # some other irrelevant data to be preserved
"C": ["y", "x", "y", "x", "y", "x"], # only consider the 'x'
}
)
target = pd.DataFrame(
{
"A": ["0", "0", "1", "2", "2", "2"],
"B": ["a", "b", "c", "d", "e", "f"],
"C": ["y", "x", "y", "x", "y", "x"],
"D": [ 0, 1, 0, 1, 0, 0] # first entry per group of 'A' that has an 'C' == 'x'
}
)
# following partial solution doesn't account for filtering by 'x' in 'C'
df['D'] = df.groupby('A')['C'].transform(lambda x: [1 if i == 0 else 0 for i in range(len(x))])
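In your case, slice the rows where C is 'x', keep the first row per group with drop_duplicates, and assign the result back (it aligns on the index, leaving NaN elsewhere):
df['D'] = df.loc[df.C=='x'].drop_duplicates('A').assign(D=1)['D']
df['D'] = df['D'].fillna(0)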
df
Out[149]:
A B C D
0 0 a y 0.0
1 0 b x 1.0
2 1 c y 0.0
3 2 d x 1.0
4 2 e y 0.0
5 2 f x 0.0
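An alternative sketch that yields integers directly uses a running count of 'x' rows per group: the first 'x' in a group is the row where C is 'x' and the running count equals 1. The names is_x and first_x are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["0", "0", "1", "2", "2", "2"],
        "B": ["a", "b", "c", "d", "e", "f"],
        "C": ["y", "x", "y", "x", "y", "x"],
    }
)

is_x = df["C"].eq("x")

# Running number of 'x' rows seen so far within each group of A
x_count = is_x.astype(int).groupby(df["A"]).cumsum()

# First 'x' per group: the row itself is 'x' and it is the first one counted
first_x = is_x & x_count.eq(1)
df["D"] = first_x.astype(int)
print(df)
```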

Create For Loop To Predict Next Value Over Group (Python)

I am working on a project where I need to take groups of data and predict the next value for that group using a time series model. In my data, I have a grouping variable and a numeric variable.
Here is an example of my data:
import pandas as pd
data = [
["A", 10],
["B", 10],
["C", 15],
["D", 12],
["A", 18],
["B", 19],
["C", 14],
["D", 22],
["A", 20],
["B", 25],
["C", 12],
["D", 30],
["A", 36],
["B", 27],
["C", 10],
["D", 45]
]
data = pd.DataFrame(
data,
columns=[
"group",
"value"
],
)
What I want to do is to create a for loop that iterates over the groups and predicts the next value for A, B, C, and D. Essentially, my end result would be a new data frame with 4 rows, one for each new predicted value. It would look something like this:
group pred_value
A 40
B 36
C 8
D 42
Here is my attempt at that so far:
from statsmodels.tsa.ar_model import AutoReg
final=pd.DataFrame()
for i in data['group']:
    group = data[data['group']==i]
    model = AutoReg(group['value'], lags=1)
    model_fit = model.fit()
    yhat = model_fit.predict(len(group), len(group))
    final = final.append(yhat,ignore_index=True)
Unfortunately, this produces a data frame with 15 rows and I'm not sure how to get the end result that I described above.
Can anyone help point me in the right direction? Any help would be appreciated! Thank you!
You can groupby first and then iterate. We can store the results in a dict and after the loop convert it to a DataFrame:
# will hold the predictions
forecasts = {}
# in each turn e.g., group == "A", values are [10, 18, 20, 36]
for group, values in data.groupby("group").value:
    # form the model and fit
    model = AutoReg(values, lags=1)
    result = model.fit()
    # predict
    prediction = result.forecast(steps=1)
    # store
    forecasts[group] = prediction
# after `for` ends, convert to DataFrame
all_predictions = pd.DataFrame(forecasts)
to get
>>> all_predictions
A B C D
4 51.809524 28.561404 7.285714 62.110656
We can also do this all with apply:
>>> data.groupby("group").value.apply(lambda x: AutoReg(x, lags=1).fit().forecast(1))
group
A 4 51.809524
B 4 28.561404
C 4 7.285714
D 4 62.110656
Name: value, dtype: float64
However, with apply we lose the ability to hold references to the fitted models, whereas in the explicit for loop we could keep them aside. If the fitted models are not needed anyway, the apply version can be used.
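If the goal is the long two-column layout from the question (one row per group), the same groupby loop can collect rows instead of columns. In the sketch below, predict_next is a hypothetical stand-in (last value plus the mean step), not AutoReg, so its numbers differ from the forecasts above; the model fit and forecast would go where it is called.

```python
import pandas as pd

data = pd.DataFrame({
    "group": list("ABCD") * 4,
    "value": [10, 10, 15, 12, 18, 19, 14, 22, 20, 25, 12, 30, 36, 27, 10, 45],
})

def predict_next(values):
    # Hypothetical placeholder for a fitted time series model:
    # extrapolate the last value by the average step between observations.
    return float(values.iloc[-1] + values.diff().mean())

# Iterating over the groupby visits each group exactly once
rows = [
    {"group": name, "pred_value": predict_next(values)}
    for name, values in data.groupby("group")["value"]
]
final = pd.DataFrame(rows)
print(final)
```

Iterating over data.groupby("group") rather than over data['group'] is what brings the result down to one row per group.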

Building new pandas DataFrame using dict of row selections (cannot reindex from a duplicate axis)

I have a pandas DataFrame and several lists of row indices. My goal is to use these row indices to create columns in a new dataset based on the corresponding values in the original DataFrame, and make boxplots from this. My lists of row indices are associated with names. I represent this as a dictionary of lists of row indices.
The following small example works as expected:
import pandas as pd
df = pd.DataFrame(
{
"col1" : [1, 2, 3, 4, 5, 6],
"col2" : [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
index=["a", "b", "c", "d", "e", "f"])
lists_of_indices = {
"A" : ["a", "c", "d"],
"B" : ["b", "c", "f"],
"D" : ["a", "d"]}
new_df = pd.DataFrame(
{list_name : df.loc[id_list]["col1"] for (list_name, id_list) in lists_of_indices.items()})
new_df.plot.box()
However, with my real data, I end up with a ValueError: cannot reindex from a duplicate axis.
What can be the problem, and how do I fix it ?
As the error message suggests, some of the lists of indices may contain duplicates. Removing the duplicates from each list solves the issue.
Here is an example that reproduces the error:
import pandas as pd
df = pd.DataFrame(
{
"col1" : [1, 2, 3, 4, 5, 6],
"col2" : [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
index=["a", "b", "c", "d", "e", "f"])
lists_of_indices = {
"A" : ["a", "c", "d"],
"B" : ["b", "c", "f", "c"], # Note the extra "c"
"D" : ["a", "d"]}
new_df = pd.DataFrame(
{list_name : df.loc[id_list]["col1"] for (list_name, id_list) in lists_of_indices.items()})
new_df.plot.box()
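And here is how to fix it (recent pandas versions reject a set as an indexer, so deduplicate into a list instead; dict.fromkeys also keeps the original order):
new_df = pd.DataFrame(
{list_name : df.loc[list(dict.fromkeys(id_list))]["col1"] for (list_name, id_list) in lists_of_indices.items()})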
It might however be worthwhile to check why some of these lists of indices contain duplicates in the first place.
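To find out where the duplicates come from, it can help to check both the DataFrame index and each list before building the new frame. A minimal sketch using the example data; the name duplicated_labels is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5, 6],
        "col2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
    index=["a", "b", "c", "d", "e", "f"])
lists_of_indices = {
    "A": ["a", "c", "d"],
    "B": ["b", "c", "f", "c"],  # note the extra "c"
    "D": ["a", "d"]}

# A non-unique DataFrame index would cause the same error
print("df index unique:", df.index.is_unique)

# Report which lists repeat a label, and which labels
duplicated_labels = {}
for name, ids in lists_of_indices.items():
    idx = pd.Index(ids)
    repeated = idx[idx.duplicated()].unique().tolist()
    if repeated:
        duplicated_labels[name] = repeated
print(duplicated_labels)

# Build the frame from order-preserving, deduplicated lists
new_df = pd.DataFrame(
    {name: df.loc[list(dict.fromkeys(ids)), "col1"]
     for name, ids in lists_of_indices.items()})
print(new_df)
```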

Filling in nulls in dataframe with different elements by sampling pandas

I've created a dataframe 'Pclass'
class deck weight
0 3 C 0.367568
1 3 B 0.259459
2 3 D 0.156757
3 3 E 0.140541
4 3 A 0.070270
5 3 T 0.005405
my initial dataframe 'df' looks like
class deck
0 3 NaN
1 1 C
2 3 NaN
3 1 C
4 3 NaN
5 3 NaN
6 1 E
7 3 NaN
8 3 NaN
9 2 NaN
10 3 G
11 1 C
I want to fill in the null deck values in df by choosing a sample from the
decks given in Pclass based on the weights.
I've only managed to code the sampling procedure.
np.random.choice(a=Pclass.deck,p=Pclass.weight)
I'm having trouble implementing a method to fill in the nulls by finding null rows that belong to class 3 and picking a random deck value for each (not the same value every time), so not fillna('with just one').
Note: I have another question similar to this, but broader, with a groupby object as well to maximize efficiency, but I've gotten no responses. Any help would be greatly appreciated!
edit: added rows to dataframe Pclass
1 F 0.470588
1 E 0.294118
1 D 0.235294
2 F 0.461538
2 G 0.307692
2 E 0.230769
This generates a random selection from the deck column from the Pclass dataframe and assigns these to the df dataframe in the deck column (generating the required number). These commands could be put in a list comprehension if you wanted to do this across different values of the class variable. I'd recommend avoiding using class as a variable name since it's used to define new classes within Python.
import numpy as np
import pandas as pd
# Generate data and normalised weights
normweights = np.random.rand(6)
normweights /= normweights.sum()
Pclass = pd.DataFrame({
"cla": [3, 3, 3, 3, 3, 3],
"deck": ["C", "B", "D", "E", "A", "T"],
"weight": normweights
})
df = pd.DataFrame({
"cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
"deck": [np.nan, "C", np.nan, "C",
np.nan, np.nan, "E", np.nan,
np.nan, np.nan, "G", "C"]
})
# Find missing locations
missing_locs = np.where(df.deck.isnull() & (df.cla == 3))[0]
# Generate new values
new_vals = np.random.choice(a = Pclass.deck.values,
p = Pclass.weight.values, size = len(missing_locs))
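# Assign the new values to the dataframe
# (set_value was removed from pandas; missing_locs are positions, so map them to labels for .loc)
df.loc[df.index[missing_locs], 'deck'] = new_vals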
Running for multiple levels of the categorical variable
If you wanted to run this on all levels of the class variable you'd need to make sure you're selecting a subset of the data in Pclass (just the class of interest). One could use a list comprehension to find the missing data for each level of 'class' like so (I've updated the mock data below) ...
# Find missing locations
missing_locs = [np.where(df.deck.isnull() & (df.cla == i))[0] for i in [1,2,3]]
However, I think the code would be easier to read if it was in a loop:
# Generate data and normalised weights
normweights3 = np.random.rand(6)
normweights3 /= normweights3.sum()
normweights2 = np.random.rand(3)
normweights2 /= normweights2.sum()
Pclass = pd.DataFrame({
"cla": [3, 3, 3, 3, 3, 3, 2, 2, 2],
"deck": ["C", "B", "D", "E", "A", "T", "X", "Y", "Z"],
"weight": np.concatenate((normweights3, normweights2))
})
df = pd.DataFrame({
"cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
"deck": [np.nan, "C", np.nan, "C",
np.nan, np.nan, "E", np.nan,
np.nan, np.nan, "G", "C"]
})
class_levels = [1, 2, 3]
for i in class_levels:
    missing_locs = np.where(df.deck.isnull() & (df.cla == i))[0]
    if len(missing_locs) > 0:
        subset = Pclass[Pclass.cla == i]
        # Generate new values
        new_vals = np.random.choice(a = subset.deck.values,
                                    p = subset.weight.values, size = len(missing_locs))
        # Assign the new values to the dataframe
        # (set_value was removed from pandas; map positions to labels for .loc)
        df.loc[df.index[missing_locs], 'deck'] = new_vals
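The same loop can be written with a boolean mask and groupby over Pclass, using the class 3 and class 2 weights from the question. This is a sketch under a few assumptions: the column is named cla as in the answer, the seed is arbitrary, and the weights are renormalised because the rounded values in the question do not sum to exactly 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

Pclass = pd.DataFrame({
    "cla": [3, 3, 3, 3, 3, 3, 2, 2, 2],
    "deck": ["C", "B", "D", "E", "A", "T", "F", "G", "E"],
    "weight": [0.367568, 0.259459, 0.156757, 0.140541, 0.070270, 0.005405,
               0.461538, 0.307692, 0.230769],
})
df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C",
             np.nan, np.nan, "E", np.nan,
             np.nan, np.nan, "G", "C"],
})

for cla, subset in Pclass.groupby("cla"):
    # Rows of this class that still need a deck
    mask = df["deck"].isna() & df["cla"].eq(cla)
    if not mask.any():
        continue
    # Renormalise so the probabilities sum to 1 despite rounding
    p = subset["weight"].to_numpy() / subset["weight"].sum()
    # Draw one deck per missing row, independently
    df.loc[mask, "deck"] = rng.choice(subset["deck"].to_numpy(), size=mask.sum(), p=p)

print(df)
```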
