pandas replace_if using chains - python

I am trying to write a dplyr/magrittr-like chain operation in pandas where one step includes a replace-if command.
In R this would be:
mtcars %>%
mutate(mpg=replace(mpg, cyl==4, -99)) %>%
as.data.frame()
In Python, I can do the following:
data = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')\
.rename(columns={'Unnamed: 0':'brand'}, inplace=True)
data.loc[data.cyl == 4, 'mpg'] = -99
but I would much prefer this to be part of a chain. I could not find a replace-if alternative for pandas, which puzzles me. I am looking for something like:
data = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')\
.rename(columns={'Unnamed: 0':'brand'}, inplace=True) \
.replace_if(...)

Pretty simple to do in a chain. Make sure you don't use inplace= inside a chain: it returns None instead of a DataFrame, so nothing is passed on to the next step of the chain.
import numpy as np
import pandas as pd

(pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
 .rename(columns={'Unnamed: 0': 'brand'})
 .assign(mpg=lambda dfa: np.where(dfa["cyl"] == 4, -99, dfa["mpg"]))
)
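If you'd rather keep the conditional replacement entirely in pandas, the same step can be written with Series.mask inside assign; a minimal sketch, equivalent to the np.where version above:

import pandas as pd

url = ('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c'
       '/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')

data = (
    pd.read_csv(url)
    .rename(columns={'Unnamed: 0': 'brand'})
    # mask() keeps mpg where the condition is False and writes -99 where it is True
    .assign(mpg=lambda d: d['mpg'].mask(d['cyl'] == 4, -99))
)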

Related

how to code in polars python with the following tidyverse R code?

I want to migrate my application from R (using the tidyverse) to Python Polars. What is the equivalent of this code in Python Polars?
new_table <- table1 %>%
mutate(no = row_number()) %>%
mutate_at(vars(c, d), ~ifelse(no %in% c(2,5,7), replace_na(., 0), .)) %>%
mutate(e = table2$value[match(a, table2$id)],
f = ifelse(no %in% c(3,4), table3$value[match(b, table3$id)], f))
I tried reading the Polars documentation on combining data and selecting data, but I still do not understand how.
I expressed the assignment from the other tables as a join (I would actually have done this in the tidyverse as well). Otherwise the translation is straightforward. You need:
with_row_count for the row numbers
with_columns to mutate columns
pl.col to reference columns
pl.when.then.otherwise for conditional expressions
fill_nan to replace NaN values
(table1
    .with_row_count("no", 1)
    .with_columns(
        pl.when(pl.col("no").is_in([2, 5, 7]))
        .then(pl.col(["c", "d"]).fill_nan(0))
        .otherwise(pl.col(["c", "d"]))
    )
    .join(table2, how="left", left_on="a", right_on="id")
    .rename({"value": "e"})
    .join(table3, how="left", left_on="b", right_on="id")
    .with_columns(
        pl.when(pl.col("no").is_in([3, 4]))
        .then(pl.col("value"))
        .otherwise(pl.col("f"))
        .alias("f")
    )
    .select(pl.exclude("value"))  # drop the joined column table3["value"]
)
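If you want to try the pipeline above, hypothetical frames shaped to match the column names it assumes (a, b, c, d, f in table1; id and value in table2 and table3) could be built like this:

import polars as pl

# hypothetical data, only meant to match the columns the pipeline refers to
table1 = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [10, 20, 30],
    "c": [1.0, float("nan"), 3.0],
    "d": [float("nan"), 2.0, 3.0],
    "f": [0.1, 0.2, 0.3],
})
table2 = pl.DataFrame({"id": [1, 2, 3], "value": [100.0, 200.0, 300.0]})
table3 = pl.DataFrame({"id": [10, 20, 30], "value": [11.0, 22.0, 33.0]})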

efficient way of computing a dataframe using concat and split

I am new to Python/pandas/NumPy and I need to create the following DataFrame:
DF = pd.concat([pd.Series(x[2]).apply(lambda r: pd.Series(re.split('\#|/',r))).assign(id=x[0]) for x in hDF])
where hDF is a dataframe that has been created by:
hDF=pd.DataFrame(h.DF)
and h.DF is a list whose elements looks like this:
['5203906',
['highway=primary',
'maxspeed=30',
'oneway=yes',
'ref=N 22',
'surface=asphalt'],
['3655224911#1.735928/42.543651',
'3655224917#1.735766/42.543561',
'3655224916#1.735694/42.543523',
'3655224915#1.735597/42.543474',
'4817024439#1.735581/42.543469']]
However, in some cases the list is very long (O(10^7)) and also the list in h.DF[*][2] is very long, so I run out of memory.
I can obtain the same result, avoiding the use of the lambda function, like so:
DF = pd.concat([pd.Series(x[2]).str.split('\#|/', expand=True).assign(id=x[0]) for x in hDF])
But I am still running out of memory in the cases where the lists are very long.
Can you think of a possible solution to obtain the same results without starving resources?
I managed to make it work using the following code:
import itertools
import numpy as np
import pandas as pd

bl = []
for x in h.DF:
    # split "node#lon/lat" on "#", keep the "lon/lat" part, then split that on "/"
    data = np.loadtxt(
        np.loadtxt(x[2], dtype=str, delimiter="#")[:, 1], dtype=float, delimiter="/"
    ).tolist()
    # tag every [lon, lat] pair with the way id
    for row in data:
        row.append(x[0])
    bl.append(data)
bbl = list(itertools.chain.from_iterable(bl))
DF = pd.DataFrame(bbl).rename(columns={0: "lon", 1: "lat", 2: "wayid"})
Now it's super fast :)
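For reference, a pandas-only variant that avoids building one small DataFrame per row might look like this; a sketch, assuming h.DF has the structure shown in the question (it needs pandas >= 1.4 for the regex flag of str.split):

import pandas as pd

# one row per way, then explode the list of "node#lon/lat" strings
df = pd.DataFrame(
    {"wayid": [x[0] for x in h.DF], "raw": [x[2] for x in h.DF]}
).explode("raw", ignore_index=True)

# split "node#lon/lat" into three columns in one vectorised pass
df[["node", "lon", "lat"]] = df["raw"].str.split(r"#|/", expand=True, regex=True)
df = df.drop(columns="raw").astype({"lon": float, "lat": float})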

r - Dplyr 'ungroup' function in pandas

Imagine you have in R this 'dplyr' code:
test <- data %>%
group_by(PrimaryAccountReference) %>%
mutate(Counter_PrimaryAccountReference = n()) %>%
ungroup()
how can I exactly convert this to pandas equivalent code ?
In short, I need to group by to add another column and then ungroup the initial query. My concern is how to do the 'ungroup' step using the pandas package.
Now you are able to do it with datar:
from datar import f
from datar.dplyr import group_by, mutate, ungroup, n
test = data >> \
group_by(f.PrimaryAccountReference) >> \
mutate(Counter_PrimaryAccountReference = n()) >> \
ungroup()
I am the author of the package. Feel free to submit issues if you have any questions.
Here's the pandas way, using the transform function:
data['Counter_PrimaryAccountReference'] = data.groupby('PrimaryAccountReference')['PrimaryAccountReference'].transform('size')
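If you want to keep it as one chain, closer to the dplyr pipe, the same transform can sit inside assign; a small sketch:

test = data.assign(
    Counter_PrimaryAccountReference=lambda d: (
        d.groupby('PrimaryAccountReference')['PrimaryAccountReference'].transform('size')
    )
)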

Reshaping Pandas Data Frame

I'm parsing some HTML data using Pandas like this:
rankings = pd.read_html('https://en.wikipedia.org/wiki/Rankings_of_universities_in_the_United_Kingdom')
university_guide = rankings[0]
This gives me a nice data frame:
What I want is to reshape this data frame so that there are only two columns (rank and university name). My current solution is to do something like this:
ug_copy = rankings[0][1:]
npa1 = ug_copy.as_matrix( columns=[0,1] )
npa2 = ug_copy.as_matrix( columns=[2,3] )
npa3 = ug_copy.as_matrix( columns=[4,5] )
npam = np.append(npa1,npa2)
npam = np.append(npam,npa3)
reshaped = npam.reshape((npam.size/2,2))
pd.DataFrame(data=reshaped)
This gives me what I want, but it doesn't seem like it could possibly be the best solution. I can't seem to find a good way to complete this all using a data frame. I've tried using stack/unstack and pivoting the data frame (as some of the other solutions here have suggested), but I haven't had any luck. I've tried doing something like this:
ug_copy.columns=['Rank','University','Rank','University','Rank','University']
ug_copy = ug_copy[1:]
ug_copy.groupby(['Rank', 'University'])
There has to be something small I'm missing!
This is probably a bit shorter (also note that you can use the header option in read_html to save a bit of work):
import pandas as pd
rankings = pd.read_html('https://en.wikipedia.org/wiki/Rankings_of_universities_in_the_United_Kingdom', header=0)
university_guide = rankings[0]
df = pd.DataFrame(university_guide.values.reshape((30, 2)), columns=['Rank', 'University'])
df = df.sort_values('Rank').reset_index(drop=True)
print(df)
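A variant that stays in pandas and does not hard-code the row count could stack the adjacent (Rank, University) column pairs with concat; a sketch, assuming the table really is made up of such pairs side by side:

import pandas as pd

rankings = pd.read_html(
    'https://en.wikipedia.org/wiki/Rankings_of_universities_in_the_United_Kingdom',
    header=0)
ug = rankings[0]

# slice out each adjacent pair of columns and stack the pairs vertically
pairs = [ug.iloc[:, i:i + 2].set_axis(['Rank', 'University'], axis=1)
         for i in range(0, ug.shape[1], 2)]
df = (pd.concat(pairs, ignore_index=True)
      .sort_values('Rank')
      .reset_index(drop=True))
print(df)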

Data frames in R

Pandas has proven very successful as a tool for working with time series data. For example, to compute a 5-minute mean you can use the resample function like this:
import pandas as pd
dframe = pd.read_table("test.csv",
                       delimiter=",", index_col=0, parse_dates=True, date_parser=parse)
## 5-minute mean
dframe.resample('5min', how='mean')
## daily mean
dframe.resample('D', how='mean')
How can I perform this in R ?
In R you can use the xts package, which specialises in time series manipulation. For example, you can use the period.apply function like this:
library(xts)
zoo.data <- zoo(rnorm(31)+10,as.Date(13514:13744,origin="1970-01-01"))
ep <- endpoints(zoo.data,'days')
## daily mean
period.apply(zoo.data, INDEX=ep, FUN=function(x) mean(x))
There are some handy wrappers of this function:
apply.daily(x, FUN, ...)
apply.weekly(x, FUN, ...)
apply.monthly(x, FUN, ...)
apply.quarterly(x, FUN, ...)
apply.yearly(x, FUN, ...)
R has data frames (data.frame) and it can also read csv files. Eg.
dframe <- read.csv2("test.csv")
For dates, you may need to specify the columns using the colClasses parameter. See ?read.csv2. For example:
dframe <- read.csv2("test.csv", colClasses=c("POSIXct",NA,NA))
You should then be able to round the date field using round or trunc, which will allow you to break up the data into the desired frequencies.
For example,
library(plyr)  # provides daply
dframe$trunc.times <- trunc(dframe$date.field, units='mins')
means <- daply(dframe, 'trunc.times', function(df) mean(df$value))
Where value is the name of the field that you want to average.
Personally, I really like a combination of lubridate and zoo aggregate() for these operations:
ts.month.sum <- aggregate(ts.data, month, sum)
ts.daily.mean <- aggregate(ts.data, day, mean)
ts.mins.mean <- aggregate(ts.data, minutes, mean)
You can also use the standard time functions yearmon() or yearqtr(), or custom functions for both split and apply. This method is as syntactically sweet as that of pandas.
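For comparison, the pandas side of the question in current syntax would look roughly like this (a sketch; the old how= argument has since been replaced by method calls on the resampler):

import pandas as pd

dframe = pd.read_csv('test.csv', index_col=0, parse_dates=True)

five_min_mean = dframe.resample('5min').mean()  # 5-minute mean
daily_mean = dframe.resample('D').mean()        # daily mean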
