Imagine you have this 'dplyr' code in R:
test <- data %>%
  group_by(PrimaryAccountReference) %>%
  mutate(Counter_PrimaryAccountReference = n()) %>%
  ungroup()
How exactly can I convert this to equivalent pandas code?
In short, I need to group by to add another column and then ungroup the initial query. My concern is how to do the 'ungroup' step using the pandas package.
Now you are able to do it with datar:
from datar import f
from datar.dplyr import group_by, mutate, ungroup, n
test = data >> \
  group_by(f.PrimaryAccountReference) >> \
  mutate(Counter_PrimaryAccountReference = n()) >> \
  ungroup()
I am the author of the package. Feel free to submit issues if you have any questions.
Here's the pandas way, using the transform function:
data['Counter_PrimaryAccountReference'] = data.groupby('PrimaryAccountReference')['PrimaryAccountReference'].transform('size')
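Note that no explicit ungroup step is needed: transform returns a result aligned to the original, ungrouped index. If you also want the whole thing as a single chain, closer in spirit to the dplyr pipeline, here is a minimal sketch using assign (the tiny example frame is made up for illustration):

import pandas as pd

data = pd.DataFrame({"PrimaryAccountReference": ["A", "A", "B"]})

# transform('size') broadcasts each group's row count onto every row,
# so the result needs no ungrouping
test = data.assign(
    Counter_PrimaryAccountReference=lambda d: (
        d.groupby("PrimaryAccountReference")["PrimaryAccountReference"]
        .transform("size")
    )
)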
I want to migrate my application from R using tidyverse to Python Polars. What is the equivalent of this code in Python Polars?
new_table <- table1 %>%
  mutate(no = row_number()) %>%
  mutate_at(vars(c, d), ~ifelse(no %in% c(2,5,7), replace_na(., 0), .)) %>%
  mutate(e = table2$value[match(a, table2$id)],
         f = ifelse(no %in% c(3,4), table3$value[match(b, table3$id)], f))
I tried looking at the Polars documentation on combining data and selecting data but still do not understand.
I expressed the assignment from the other tables as a join (actually I would have done this in tidyverse as well). Otherwise the translation is straightforward. You need:
with_row_count for the row numbers
with_columns to mutate columns
pl.col to reference columns
pl.when.then.otherwise for conditional expressions
fill_nan to replace NaN values
import polars as pl

(table1
  .with_row_count("no", 1)
  .with_columns(
    pl.when(pl.col("no").is_in([2, 5, 7]))
      .then(pl.col(["c", "d"]).fill_nan(0))
      .otherwise(pl.col(["c", "d"]))
  )
  .join(table2, how="left", left_on="a", right_on="id")
  .rename({"value": "e"})
  .join(table3, how="left", left_on="b", right_on="id")
  .with_columns(
    pl.when(pl.col("no").is_in([3, 4]))
      .then(pl.col("value"))
      .otherwise(pl.col("f"))
      .alias("f")
  )
  .select(pl.exclude("value"))  # drop the joined column table3["value"]
)
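One caveat: Polars distinguishes floating-point NaN from missing (null) values, and R's NA usually arrives as null after an import. If that is the case for your tables, fill_null rather than fill_nan is what mirrors replace_na. A minimal sketch with a made-up frame:

import polars as pl

table1 = pl.DataFrame({"c": [1.0, None], "d": [None, 2.0]})

# R's NA maps to Polars null (not float NaN), so replace_na(., 0)
# then corresponds to fill_null(0)
print(table1.with_columns(pl.col(["c", "d"]).fill_null(0)))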
I'm trying to estimate past data by using time series analysis.
Usually, time series analysis forecasts the future, but can it work in the opposite direction and "forecast" the past?
The reason I want to do this is that there is a missing part in the past data.
I'm trying to write the code in R or Python.
I tried forecast(arima, h=-92) in R, but this didn't work.
This is the code I tried in R:
library('ggfortify')
library('data.table')
library('ggplot2')
library('forecast')
library('tseries')
library('urca')
library('dplyr')
library('TSstudio')
library("xts")
df<- read.csv('https://drive.google.com/file/d/1Dt2ZLOCASYIbvviWQkwwgdo2BdmKfl9H/view?usp=sharing')
colnames(df)<-c("date", "production")
df$date<-as.Date(df$date, format="%Y-%m-%d")
CandyXTS<- xts(df[-1], df[[1]])
CandyTS<- ts(df$production, start=c(1972,1),end=c(2017,8), frequency=12 )
ggAcf(CandyTS)
forecast(CandyTS, h=-92)
It is possible. It is called backcasting. You can find some information in this chapter of Forecasting: Principles and Practice.
Basically you need to forecast in reverse. I have added an example based on the code in the chapter and your data; adjust as needed. You create a reverse index and use that to be able to backcast in time. You can use models other than ETS; the same principle applies.
# I downloaded data.
df1 <- readr::read_csv("datasets/candy_production.csv")
colnames(df1) <- c("date", "production")

library(fpp3)

back_cast <- df1 %>%
  as_tsibble() %>%
  mutate(reverse_time = rev(row_number())) %>%
  update_tsibble(index = reverse_time) %>%
  model(ets = ETS(production ~ season(period = 12))) %>%
  # backcast
  forecast(h = 12) %>%
  # add dates in reverse order to the forecast with the same name as in original dataset.
  mutate(date = df1$date[1] %m-% months(1:12)) %>%
  as_fable(index = date, response = "production",
           distribution = "production")

back_cast %>%
  autoplot(df1) +
  labs(title = "Backcast of candy production",
       y = "production")
I need to run a robust ANOVA from Python. The function I want to use is t2way from the R package WRS2. I tried with rpy2, but I'm stuck with an error:
>>> import pandas as pd
>>> import rpy2.robjects.packages as rpackages
>>> from rpy2.robjects import pandas2ri
>>> pandas2ri.activate()
>>> df = pd.read_csv("https://github.com/lawrence009/dsur/raw/master/data/goggles.csv")
>>> rdf = pandas2ri.py2rpy(df)
>>> WRS2 = rpackages.importr('WRS2')
>>> WRS2.t2way("attractiveness ~ gender*alcohol", data = rdf)
RRuntimeError: Error in x[[grp[i]]] :
  attempt to select less than one element in get1index
I'm looking for either a way to make this work with rpy2, or (even better) a port of WRS2 to the python environment. Any help would be much appreciated.
If the issue is with columns in the dataframe that are not factors (as suggested in the other answer), casting them into factors is quite easy:
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr

rdf = pandas2ri.py2rpy(df)
base = importr('base')

for cn in ('alcohol', 'gender'):
    i = list(rdf.colnames).index(cn)
    rdf[i] = base.as_factor(rdf[i])
    # We could also do it with
    # rdf[i] = ro.FactorVector(rdf[i])
To be on the safe side, it is recommended to create an R formula object. Some R functions will accept strings and assume that they are formulas, but this is up to the package author and not always the case.
WRS2.t2way(ro.Formula('attractiveness ~ gender*alcohol'), data = rdf)
Here is my particular solution to this problem. The very first issue in R is that when you import the data frame you have to convert the alcohol and gender columns to factors with as.factor.
In R the script would be:
library(WRS2)
df <- read.csv2("https://github.com/lawrence009/dsur/raw/master/data/goggles.csv",header = TRUE, sep=',')
df[ , c('attractiveness')] <- as.numeric(df[ , c('attractiveness')])
df[ , c('alcohol')] <- as.factor(df[ , c('alcohol')])
df[ , c('gender')] <- as.factor(df[ , c('gender')])
t2way(attractiveness ~ gender*alcohol, data = df)
In Python, however, I didn't find a way to change the data type of the columns, so I came up with this solution:
First you have to create an .R file named my_t2way.R that contains:
my_t2way <- function(df1){
  library(WRS2)
  df <- read.csv2(df1, header = TRUE, sep=',')
  df[ , c('attractiveness')] <- as.numeric(df[ , c('attractiveness')])
  df[ , c('alcohol')] <- as.factor(df[ , c('alcohol')])
  df[ , c('gender')] <- as.factor(df[ , c('gender')])
  f <- t2way(attractiveness ~ gender*alcohol, data = df)
  df1 <- data.frame(factor = c('gender','alcohol','gender:alcohol'),
                    value = c(f$Qa, f$Qb, f$Qab),
                    p.value = c(f$A.p.value, f$B.p.value, f$AB.p.value))
  return(df1)
}
And then you can run the following commands from python
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri

# Defining the R script and loading the instance in Python
pandas2ri.activate()
r = robjects.r
r['source']('my_t2way.R')

# Loading the function we have defined in R
my_t2way_r = robjects.globalenv['my_t2way']

# Reading and processing data
df1 = "https://github.com/lawrence009/dsur/raw/master/data/goggles.csv"
df_result_r = my_t2way_r(df1)
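If you need the result back as a pandas DataFrame on the Python side: with pandas2ri.activate() the conversion may already happen automatically on the call, but making it explicit with rpy2py (rpy2's R-to-Python converter) is harmless. A small sketch:

# convert the returned R data.frame into a pandas DataFrame explicitly
df_result = pandas2ri.rpy2py(df_result_r)
print(df_result)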
Certainly this solution only works for this particular case, but I think it could easily be extended to other data frames.
I am trying to write a dplyr/magrittr-style chained operation in pandas where one step includes a replace-if command.
In R this would be:
mtcars %>%
  mutate(mpg = replace(mpg, cyl == 4, -99)) %>%
  as.data.frame()
In Python, I can do the following:
data = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')\
    .rename(columns={'Unnamed: 0':'brand'}, inplace=True)
data.loc[data.cyl == 4, 'mpg'] = -99
but would much prefer if this could be part of a chain. I could not find any replace alternative for pandas, which puzzles me. I am looking for something like:
data = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')\
.rename(columns={'Unnamed: 0':'brand'}, inplace=True) \
.replace_if(...)
Pretty simple to do in a chain. Make sure you don't use inplace= in a chain, as it does not return a data frame to the next thing in the chain.
import numpy as np
import pandas as pd

(pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
 .rename(columns={'Unnamed: 0': 'brand'})
 .assign(mpg=lambda dfa: np.where(dfa["cyl"] == 4, -99, dfa["mpg"]))
)
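A pure-pandas variation on the same chain, in case you'd rather avoid numpy: Series.mask replaces values where the condition holds and keeps the rest, which matches R's replace. A sketch with the same URL and columns:

import pandas as pd

(pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
 .rename(columns={'Unnamed: 0': 'brand'})
 # mask() replaces values where the condition is True, leaving others as-is
 .assign(mpg=lambda d: d['mpg'].mask(d['cyl'] == 4, -99))
)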
Pandas has proven very successful as a tool for working with time series data. For example, to compute a 5-minute mean you can use the resample function like this:
import pandas as pd

dframe = pd.read_table("test.csv",
                       delimiter=",", index_col=0, parse_dates=True)

## 5-minute mean
dframe.resample('5min').mean()
## daily mean
dframe.resample('D').mean()

How can I perform this in R?
In R you can use the xts package, which specialises in time series manipulation. For example, you can use the period.apply function like this:
library(xts)
zoo.data <- zoo(rnorm(231) + 10, as.Date(13514:13744, origin = "1970-01-01"))
ep <- endpoints(zoo.data, 'days')
## daily mean
period.apply(zoo.data, INDEX = ep, FUN = function(x) mean(x))
There are some handy wrappers around this function:
apply.daily(x, FUN, ...)
apply.weekly(x, FUN, ...)
apply.monthly(x, FUN, ...)
apply.quarterly(x, FUN, ...)
apply.yearly(x, FUN, ...)
R has data frames (data.frame) and can also read CSV files, e.g.:
dframe <- read.csv2("test.csv")
For dates, you may need to specify the columns using the colClasses parameter. See ?read.csv2. For example:
dframe <- read.csv2("test.csv", colClasses=c("POSIXct",NA,NA))
You should then be able to round the date field using round or trunc, which will allow you to break up the data into the desired frequencies.
For example,
library(plyr)
dframe$trunc.times <- as.POSIXct(trunc(dframe$date.field, units = 'mins'))
means <- daply(dframe, 'trunc.times', function(df) mean(df$value))
Where value is the name of the field that you want to average.
Personally, I really like a combination of lubridate and zoo aggregate() for these operations:
ts.month.sum <- aggregate(ts.data, month, sum)
ts.daily.mean <- aggregate(ts.data, day, mean)
ts.mins.mean <- aggregate(ts.data, minute, mean)
You can also use the standard time functions yearmon() or yearqtr(), or custom functions for both split and apply. This method is as syntactically sweet as that of pandas.