I am working on a project where I need to take groups of data and predict the next value for each group using a time series model. In my data, I have a grouping variable and a numeric variable.
Here is an example of my data:
import pandas as pd
data = [
["A", 10],
["B", 10],
["C", 15],
["D", 12],
["A", 18],
["B", 19],
["C", 14],
["D", 22],
["A", 20],
["B", 25],
["C", 12],
["D", 30],
["A", 36],
["B", 27],
["C", 10],
["D", 45]
]
data = pd.DataFrame(
data,
columns=[
"group",
"value"
],
)
What I want to do is to create a for loop that iterates over the groups and predicts the next value for A, B, C, and D. Essentially, my end result would be a new data frame with 4 rows, one for each new predicted value. It would look something like this:
group pred_value
A 40
B 36
C 8
D 42
Here is my attempt at that so far:
from statsmodels.tsa.ar_model import AutoReg

final = pd.DataFrame()
for i in data['group']:
    group = data[data['group'] == i]
    model = AutoReg(group['value'], lags=1)
    model_fit = model.fit()
    yhat = model_fit.predict(len(group), len(group))
    final = final.append(yhat, ignore_index=True)
Unfortunately, this produces a data frame with 15 rows and I'm not sure how to get the end result that I described above.
Can anyone help point me in the right direction? Any help would be appreciated! Thank you!
You can groupby first and then iterate. We can store the results in a dict and after the loop convert it to a DataFrame:
# will hold the predictions
forecasts = {}

# in each turn e.g., group == "A", values are [10, 18, 20, 36]
for group, values in data.groupby("group").value:
    # form the model and fit
    model = AutoReg(values, lags=1)
    result = model.fit()

    # predict
    prediction = result.forecast(steps=1)

    # store
    forecasts[group] = prediction

# after `for` ends, convert to DataFrame
all_predictions = pd.DataFrame(forecasts)
to get
>>> all_predictions
A B C D
4 51.809524 28.561404 7.285714 62.110656
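To get the four-row group/pred_value frame described in the question instead of this wide layout, one option (the column names are chosen to match the question) is to transpose and relabel:
# reshape the wide frame into one row per group
result = (
    all_predictions
    .T                    # groups move from columns to the index
    .reset_index()
    .set_axis(["group", "pred_value"], axis=1)
)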
We can also do this all with apply:
>>> data.groupby("group").value.apply(lambda x: AutoReg(x, lags=1).fit().forecast(1))
group
A 4 51.809524
B 4 28.561404
C 4 7.285714
D 4 62.110656
Name: value, dtype: float64
However, we then lose the ability to hold references to the fitted models, whereas with the explicit for loop we could keep them aside. If that is not needed anyway, this version works fine.
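If keeping the fitted models around does matter, a small variation of the explicit loop above can stash them next to the forecasts (the fits dict is just an illustrative name):
forecasts, fits = {}, {}
for group, values in data.groupby("group").value:
    result = AutoReg(values, lags=1).fit()
    fits[group] = result                         # keep the fitted model for later inspection
    forecasts[group] = result.forecast(steps=1)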
I have a dataframe like the following:
df = pd.DataFrame({"Col1": ["AA", "AB", "AA", "CC", "FF"],
"Col2": [18, 23, 13, 33, 48],
"Col3": [17, 27, 22, 37, 52]})
My goal is: if there are duplicated values in Col1, sort only those duplicate rows by their values in Col2 from smallest to largest, and rename the original "Value" in Col1 to "Value.A" (for the duplicate with the smallest value in Col2), "Value.B" (for the 2nd smallest), and so on. The values in Col3 stay unchanged.
Using the example above, this is what I should end up with:
pd.DataFrame({"Col1": ["AA.B", "AB", "AA.A", "CC", "FF"],
"Col2": [18, 23, 13, 33, 48],
"Col3": [17, 27, 22, 37, 52]})
Since 13 < 18, the 2nd "AA" becomes "AA.A" and the first "AA" becomes "AA.B" (values in Col3 stay unchanged). Also, "AB", "CC", and "FF" all need to remain unchanged. I could potentially have more than one set of duplicates in Col1.
I do not need to preserve the row order, as long as the values in each row stay the same except for the renamed value in Col1 (i.e., I should still have "AA.B", 18, 17 in the three columns no matter where the first row of the output ends up).
I tried using row['Col1'] == df['Col1'].shift() inside a lambda function, but this gives me the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I suspect this is due to the NaN value produced by shift(), but using fillna() doesn't help since that always creates a duplicate at the beginning.
Any suggestions on how I can make it work?
You can use pandas GroupBy.apply with a custom function to get what you want.
You group the dataframe by the first column and then apply your custom function to each "sub" dataframe. In this case I check whether there is a duplicate and, if so, use some sorting and ASCII arithmetic to generate the new labels that you need.
# your example
df = pd.DataFrame({"Col1": ["AA", "AB", "AA", "CC", "FF"],
"Col2": [18, 23, 13, 33, 48],
"Col3": [17, 27, 22, 37, 52]})
def de_dup_labels(x):
    """De-duplicate labels within a group based on the ordering of Col2."""
    # only relabel groups that contain duplicates
    if len(x) > 1:
        # rank of each row within the group by Col2 (0 = smallest)
        ranks = x["Col2"].argsort().argsort().to_numpy()
        # append ".A" to the smallest, ".B" to the 2nd smallest, and so on
        x["Col1"] = [f"{x['Col1'].iloc[i]}.{chr(ord('A') + ranks[i])}"
                     for i in range(len(x))]
    return x

updated_df = df.groupby("Col1").apply(de_dup_labels)
The logic is:
1. Group by "Col1".
2. Collect Col2 as a map {index: Col2}, sorted by Col2.
3. Replace "Col1" with "Col1.1", "Col1.2", ... if it is duplicated.
4. Join the new "Col1" back using the original index.
PS - I have changed the suffix logic to use ".1", ".2", ... since it is not clear which labels to fall back on once ".A", ".B", ... are exhausted.
df = pd.DataFrame({"Col1": ["AA", "AB", "AA", "CC", "FF"],
"Col2": [18, 23, 13, 33, 48],
"Col3": [17, 27, 22, 37, 52]})
df_grp = df.groupby("Col1") \
.agg(Col2=("Col2", \
# Collect Col2 as map {index: Col2} and sorted by Col2.
lambda s: sorted(
{x[0]:x[1] for x in zip(s.index, s)}.items(),
key=lambda y: y[1]
)
)) \
.reset_index()
# Mark the duplicate values.
df_grp["is_dup"] = df_grp["Col2"].apply(lambda x: len(x) > 1)
# Replace "Col1" as "Col1.1", "Col1.2", ... if it is duplicate.
df_grp = df_grp.explode("Col2").reset_index().rename(columns={"index": "Col2_index"})
df_grp["Col2_index"] = df_grp.groupby("Col2_index").cumcount()
df_grp["Col1"] = df_grp.apply(lambda x: f'{x["Col1"]}.{x["Col2_index"]+1}' if x["is_dup"] else x["Col1"], axis=1)
# Restore original index.
df_grp["orig_index"] = df_grp["Col2"].apply(lambda x: x[0])
df_grp = df_grp.set_index("orig_index")
# Join new "Col1" using original index.
df = df.drop("Col1", axis=1).join(df_grp[["Col1"]])
Output:
Col2 Col3 Col1
0 18 17 AA.2
1 23 27 AB
2 13 22 AA.1
3 33 37 CC
4 48 52 FF
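If the Col1-first column order shown in the question's expected output matters, a final column reindex restores it:
df = df[["Col1", "Col2", "Col3"]]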
I have a large (45K rows) dataset and I need to remove specific values from specific columns in a handful of cases. The dataset is large enough that I'd like to avoid using apply if at all possible.
Here's a sample dataset:
df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
drops = pd.DataFrame({"ID": [40, 50],"Column": ["S", "S"],"Rule": ["Remove", "Remove"],"Override": ["p", "p"]})
My current solution is to use:
(
    df.merge(
        drops.pivot(index="ID", columns="Column", values="Override").reset_index()[["ID", "S"]],
        how="left",
        on=["ID", "S"],
        indicator="_dropS",
    )
    .assign(S=lambda d_: d_.S.mask(d_._dropS == "both", np.nan))
)
But this only successfully removes one of the entries. My general Python knowledge is telling me to split the column S by the delimiter "/", remove the matching entry, and join the list back together again (there may be more than two entries in the S column), but I can't seem to make that work within the DataFrame without using apply.
Edited to add the goal state: column S should end up with the entries 'n', 'o', ''. The final entry could be NaN as well.
Is there a reasonable way to do this without a separate function call?
IIUC, here is one solution that gives the expected output; I have no idea about the performance. I would be interested in your feedback on that.
# from your sample data
df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
drops = pd.DataFrame({"ID": [40, 50], "Column": ["S", "S"], "Rule": ["Remove", "Remove"], "Override": ["p", "p"]})

pivoted_rules = drops.pivot(index="ID", columns="Column", values="Override").rename(columns={'S': 'compare_S'})

res = pd.concat([df.set_index('ID'), pivoted_rules], axis=1).fillna('fill_value')
res['S'] = [''.join([x for x in a if x != b]) for a, b in zip(res['S'].str.split('/'), res['compare_S'])]
res = res.drop('compare_S', axis=1).reset_index()
print(res)
ID T S
0 30 C n
1 40 D o
2 50 E
Didn't use apply :)
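As another sketch of the split/remove/join idea described in the question, here is a version built on explode plus an anti-join style merge, again without apply (frame and column names are taken from the question; it assumes all rules target column S as in the sample, and leaving a fully emptied S as NaN rather than '' is an assumption):
# explode S into one value per row, drop the rows that match a Remove rule,
# then glue the survivors back together with "/"
tmp = df.assign(S=df["S"].str.split("/")).explode("S")
matched = tmp.merge(drops[["ID", "Override"]], how="left",
                    left_on=["ID", "S"], right_on=["ID", "Override"],
                    indicator=True)["_merge"].eq("both").to_numpy()
kept = tmp[~matched].groupby("ID")["S"].agg("/".join).reset_index()
result = df.drop(columns="S").merge(kept, how="left", on="ID")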
To remove specific values from specific columns, you can use .str.replace:
df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
df.loc[:, 'S'] = df['S'].str.replace(r'[/p]', '', regex=True)
The result:
ID T S
0 30 C n
1 40 D o
2 50 E
import numpy as np
import pandas as pd
a = np.array([["M", 86],
["M", 76],
["M", 56],
["M", 66],
["B", 16],
["B", 13],
["B", 16],
["B", 18],
["B", 14], ])
df = pd.DataFrame(data=a, columns=["Case", "radius"])
print(df)
print(df.columns)
a = df[(df["radius"] >= 57) & (df["Case"] == "M")]["radius"].tolist()
print(a)
I get an error -
TypeError: '>=' not supported between instances of 'str' and 'int'
But here I am putting a condition on a column that contains integers. What is the problem here?
I want to get a list of the values in the radius column where radius is greater than or equal to 57 and "Case" == "M".
Typecast the radius column after creating the df and it should work:
df.radius = df.radius.astype(int)
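For context, the error happens because the DataFrame is built from a single np.array: mixing strings and numbers makes NumPy store everything as strings, so df["radius"] holds strings and the >= comparison fails. A minimal sketch (same data) that avoids the cast by building the frame column by column:
import pandas as pd

# per-column lists keep "radius" as integers, so no astype is needed
df = pd.DataFrame({
    "Case": ["M", "M", "M", "M", "B", "B", "B", "B", "B"],
    "radius": [86, 76, 56, 66, 16, 13, 16, 18, 14],
})
a = df[(df["radius"] >= 57) & (df["Case"] == "M")]["radius"].tolist()
print(a)  # [86, 76, 66]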
I've created a dataframe 'Pclass':
class deck weight
0 3 C 0.367568
1 3 B 0.259459
2 3 D 0.156757
3 3 E 0.140541
4 3 A 0.070270
5 3 T 0.005405
My initial dataframe 'df' looks like:
class deck
0 3 NaN
1 1 C
2 3 NaN
3 1 C
4 3 NaN
5 3 NaN
6 1 E
7 3 NaN
8 3 NaN
9 2 NaN
10 3 G
11 1 C
I want to fill in the null deck values in df by choosing a sample from the
decks given in Pclass based on the weights.
I've only managed to code the sampling procedure.
np.random.choice(a=Pclass.deck,p=Pclass.weight)
I'm having trouble implementing a method to fill in the nulls by finding the null rows that belong to class 3 and picking a random deck value for each (not the same value every time), i.e. not fillna with just one value.
Note: I have another question similar to this, but broader, with a groupby object as well to maximize efficiency, but I've gotten no responses. Any help would be greatly appreciated!
edit: added rows to dataframe Pclass
class deck    weight
1     F     0.470588
1     E     0.294118
1     D     0.235294
2     F     0.461538
2     G     0.307692
2     E     0.230769
This generates a random selection from the deck column of the Pclass dataframe and assigns the values to the deck column of df, generating the required number of values. These commands could be put in a list comprehension if you wanted to do this across different values of the class variable. I'd also recommend avoiding class as a variable name, since it's a reserved keyword in Python.
import numpy as np
import pandas as pd
# Generate data and normalised weights
normweights = np.random.rand(6)
normweights /= normweights.sum()
Pclass = pd.DataFrame({
"cla": [3, 3, 3, 3, 3, 3],
"deck": ["C", "B", "D", "E", "A", "T"],
"weight": normweights
})
df = pd.DataFrame({
"cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
"deck": [np.nan, "C", np.nan, "C",
np.nan, np.nan, "E", np.nan,
np.nan, np.nan, "G", "C"]
})
# Find missing locations
missing_locs = np.where(df.deck.isnull() & (df.cla == 3))[0]
# Generate new values
new_vals = np.random.choice(a = Pclass.deck.values,
p = Pclass.weight.values, size = len(missing_locs))
# Assign the new values to the dataframe
df.loc[missing_locs, 'deck'] = new_vals
Running for multiple levels of the categorical variable
If you wanted to run this on all levels of the class variable you'd need to make sure you're selecting a subset of the data in Pclass (just the class of interest). One could use a list comprehension to find the missing data for each level of 'class' like so (I've updated the mock data below) ...
# Find missing locations
missing_locs = [np.where(df.deck.isnull() & (df.cla == i))[0] for i in [1,2,3]]
However, I think the code would be easier to read if it was in a loop:
# Generate data and normalised weights
normweights3 = np.random.rand(6)
normweights3 /= normweights3.sum()
normweights2 = np.random.rand(3)
normweights2 /= normweights2.sum()
Pclass = pd.DataFrame({
"cla": [3, 3, 3, 3, 3, 3, 2, 2, 2],
"deck": ["C", "B", "D", "E", "A", "T", "X", "Y", "Z"],
"weight": np.concatenate((normweights3, normweights2))
})
df = pd.DataFrame({
"cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
"deck": [np.nan, "C", np.nan, "C",
np.nan, np.nan, "E", np.nan,
np.nan, np.nan, "G", "C"]
})
class_levels = [1, 2, 3]

for i in class_levels:
    missing_locs = np.where(df.deck.isnull() & (df.cla == i))[0]
    if len(missing_locs) > 0:
        subset = Pclass[Pclass.cla == i]
        # Generate new values
        new_vals = np.random.choice(a=subset.deck.values,
                                    p=subset.weight.values,
                                    size=len(missing_locs))
        # Assign the new values to the dataframe
        df.loc[missing_locs, 'deck'] = new_vals
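Since the question also mentions wanting a groupby-based version for efficiency, here is a sketch of the same fill driven by the groups of Pclass, so the class levels don't have to be hard-coded (it assumes df and Pclass as defined above):
# iterate over the classes present in Pclass and fill their missing decks
for cla, subset in Pclass.groupby("cla"):
    locs = df.index[df["deck"].isnull() & (df["cla"] == cla)]
    if len(locs) > 0:
        df.loc[locs, "deck"] = np.random.choice(a=subset["deck"].values,
                                                p=subset["weight"].values,
                                                size=len(locs))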
I'm developing an ML algorithm to get feature importances with an ExtraTrees model.
The problem I'm trying to solve is that the variables are not scalars but lists with different dimensions, or matrices, but for now I will focus only on lists.
For the moment, the only thing I have been able to do is a feature importance (FI) on the flat lists concatenated with each other.
GOAL:
What I would like to do is get a score for each different list instead of a score for each list element.
I present here a toy example of the dataset and the current code:
df = pd.DataFrame({"list1": [[10,15,12,14],[20,30,10,43]], "R":[2,2] ,"C":[2,2] , "CLASS":[1,0] , "scalar1":[1,2] , "scalar2":[3,4]})
After PCA (below):
df['new'] = pd.Series([np.asarray(a).reshape((c, r)) for (a, c, r) in zip(df.list1, df.C, df.R)])
df['pca'] = pd.Series([pca_volatilities(matrix) for matrix in df.new])
Becomes:
list1 # C # C1 # C2 # CLASS # R # new # pca # flat_pca
0 [10, 15, 12, 14] 2 1 3 1 2 [[10, 15], [12, 14]] [[-1.11803398875], [1.11803398875]] [-1.11803398875, 1.11803398875]
1 [20, 30, 10, 43] 2 2 4 0 2 [[20, 30], [10, 43]] [[-8.20060973343], [8.20060973343]] [-8.20060973343, 8.20060973343]
Here I present the fit:
from sklearn.ensemble import ExtraTreesRegressor

X = np.concatenate([np.stack(df.flat_pca, axis=0), [df.C1, df.C2]], axis=0).transpose()
Y = np.array(df.CLASS)

model = ExtraTreesRegressor()
model.fit(X, Y)
model.feature_importances_
This returns:
array([ 0.2, 0.3, 0.2, 0.3]).
What I need is a score for list1, C1, C2 and flat_pca, but I don't know how to do this.
Hoping that someone is able to help me, thanks in advance!