I want to migrate my application from R (using the tidyverse) to Python Polars. What is the equivalent of this code in Python Polars?
new_table <- table1 %>%
  mutate(no = row_number()) %>%
  mutate_at(vars(c, d), ~ifelse(no %in% c(2,5,7), replace_na(., 0), .)) %>%
  mutate(e = table2$value[match(a, table2$id)],
         f = ifelse(no %in% c(3,4), table3$value[match(b, table3$id)], f))
I looked at the Polars documentation on combining and selecting data, but I still do not understand how to do this.
I expressed the assignment from the other tables as a join (I would actually have done this in the tidyverse as well). Otherwise the translation is straightforward. You need:
with_row_count for the row numbers
with_columns to mutate columns
pl.col to reference columns
pl.when.then.otherwise for conditional expressions
fill_nan to replace NaN values (if your missing values are Polars nulls, which is what R's NA usually maps to, use fill_null instead)
import polars as pl

(table1
    .with_row_count("no", 1)
    .with_columns(
        pl.when(pl.col("no").is_in([2, 5, 7]))
        .then(pl.col(["c", "d"]).fill_nan(0))
        .otherwise(pl.col(["c", "d"]))
    )
    .join(table2, how="left", left_on="a", right_on="id")
    .rename({"value": "e"})
    .join(table3, how="left", left_on="b", right_on="id")
    .with_columns(
        pl.when(pl.col("no").is_in([3, 4]))
        .then(pl.col("value"))
        .otherwise(pl.col("f"))
        .alias("f")
    )
    .select(pl.exclude("value"))  # drop the joined column table3["value"]
)
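If you want to try the chain above in isolation, here is a minimal, hypothetical set of input frames with the shape it assumes (table1 with columns a, b, c, d and an existing f; table2 and table3 mapping an id to a value); define these before running the chain:

import polars as pl

# Hypothetical example data, only meant to match the shapes used above
table1 = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [10, 20, 30],
    "c": [1.0, float("nan"), 3.0],
    "d": [float("nan"), 5.0, 6.0],
    "f": [0.0, 0.0, 0.0],
})
table2 = pl.DataFrame({"id": [1, 2, 3], "value": [100, 200, 300]})
table3 = pl.DataFrame({"id": [10, 20, 30], "value": [7, 8, 9]})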
The .apply() method does not seem to work on my example. I want to format specific numeric columns with a thousands separator.
My function:
def thousand_separator(df, column):
    df.loc[:, column] = df[column].map("{:,.4f}".format)
    return df
**The following is not working:**
portfolios_moc_tab_export[["Equity Active Risk", "NAV (USD)"]] = portfolios_moc_tab_export[["Equity Active Risk", "NAV (USD)"]].apply(thousand_separator, axis=1)
The below works, but I do not want to add so many lines to my code:
thousand_separator(portfolios_moc_tab_export, "NAV (USD)")
I expect my function to take many columns, and it also needs to use my dataframe.
You can simply use applymap() once on all the columns of your choice, instead of map() on each of them.
For example:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 1e9, (5, 5)), columns=list('abcde'))
colsel = list('acde')
>>> df[colsel].applymap('{:,.2f}'.format)
a c d e
0 209,652,396.00 924,231,285.00 404,868,288.00 441,365,315.00
1 463,622,907.00 417,693,031.00 745,841,673.00 530,702,035.00
2 626,610,453.00 805,680,932.00 204,159,575.00 608,910,406.00
3 243,580,376.00 97,308,044.00 573,126,970.00 977,814,209.00
4 179,207,654.00 124,102,743.00 987,744,430.00 292,249,176.00
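To update the original dataframe, assign the formatted result back to the same columns (note that applymap was renamed to DataFrame.map in pandas 2.1, so on newer versions call .map instead). With the column names from your example this would be roughly:

cols = ["Equity Active Risk", "NAV (USD)"]
portfolios_moc_tab_export[cols] = portfolios_moc_tab_export[cols].applymap("{:,.4f}".format)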
I am trying to write a dplyr/magrittr-like chain operation in pandas where one step includes a replace-if command.
In R this would be:
mtcars %>%
  mutate(mpg = replace(mpg, cyl == 4, -99)) %>%
  as.data.frame()
In Python, I can do the following:
data = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')\
.rename(columns={'Unnamed: 0':'brand'}, inplace=True)
data.loc[data.cyl == 4, 'mpg'] = -99
but would much prefer if this could be part of a chain. I could not find any replace alternative for pandas, which puzzles me. I am looking for something like:
data = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')\
.rename(columns={'Unnamed: 0':'brand'}, inplace=True) \
.replace_if(...)
Pretty simple to do in a chain. Make sure you don't use inplace= in a chain, as it returns None rather than a dataframe for the next step in the chain.
import numpy as np
import pandas as pd

(pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
    .rename(columns={'Unnamed: 0': 'brand'})
    .assign(mpg=lambda dfa: np.where(dfa["cyl"] == 4, -99, dfa["mpg"]))
)
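If you prefer to stay within pandas instead of reaching for numpy, the same step can be written with Series.mask inside assign; a sketch, with url standing in for the CSV link above:

import pandas as pd

result = (
    pd.read_csv(url)  # url: the mtcars CSV link used above
    .rename(columns={'Unnamed: 0': 'brand'})
    # mask replaces mpg with -99 wherever cyl == 4, leaving other rows unchanged
    .assign(mpg=lambda d: d["mpg"].mask(d["cyl"] == 4, -99))
)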
I want to append a pandas dataframe (8 columns) to an existing table in Databricks (12 columns), and fill the other 4 columns that can't be matched with None values. Here is what I've tried:
spark_df = spark.createDataFrame(df)
spark_df.write.mode("append").insertInto("my_table")
It throws the error:
ParseException: "\nmismatched input ':' expecting (line 1, pos 4)\n\n== SQL ==\n my_table
It looks like Spark can't handle this operation with mismatched columns. Is there any way to achieve what I want?
I think that the most natural course of action would be a select() transformation to add the missing columns to the 8-column dataframe, followed by a unionAll() transformation to merge the two.
from pyspark.sql import Row
from pyspark.sql.functions import lit

# A two-column frame and a one-column frame to stand in for the mismatched schemas
bigrow = Row(a='foo', b='bar')
bigdf = spark.createDataFrame([bigrow])
smallrow = Row(a='foobar')
smalldf = spark.createDataFrame([smallrow])

# Add the missing column as a null literal, then union the two frames
fitdf = smalldf.select(smalldf.a, lit(None).alias('b'))
uniondf = bigdf.unionAll(fitdf)
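As a side note, if you are on Spark 3.1 or later, unionByName can fill the missing columns for you with nulls, so the explicit select is not needed:

# Spark 3.1+: columns missing from either frame are filled with nulls
uniondf = bigdf.unionByName(smalldf, allowMissingColumns=True)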
Can you try this:
from pyspark.sql import functions as F

df = spark.createDataFrame(pandas_df)
df_table_struct = sqlContext.sql('select * from my_table limit 0')  # empty frame with the target schema
for col in set(df_table_struct.columns) - set(df.columns):
    df = df.withColumn(col, F.lit(None))
df_table_struct = df_table_struct.unionByName(df)
df_table_struct.write.saveAsTable('my_table', mode='append')
Imagine you have in R this 'dplyr' code:
test <- data %>%
  group_by(PrimaryAccountReference) %>%
  mutate(Counter_PrimaryAccountReference = n()) %>%
  ungroup()
How can I convert this exactly to the equivalent pandas code?
In short, I need to group by in order to add another column, and then ungroup the initial query. My concern is how to do the 'ungroup' step with the pandas package.
Now you are able to do it with datar:
from datar import f
from datar.dplyr import group_by, mutate, ungroup, n
test = data >> \
    group_by(f.PrimaryAccountReference) >> \
    mutate(Counter_PrimaryAccountReference=n()) >> \
    ungroup()
I am the author of the package. Feel free to submit issues if you have any questions.
Here's the pandas way, using the transform function. Since transform returns a result aligned with the original index, there is no explicit ungroup step needed:
data['Counter_PrimaryAccountReference'] = data.groupby('PrimaryAccountReference')['PrimaryAccountReference'].transform('size')
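For example, with a small hypothetical frame:

import pandas as pd

data = pd.DataFrame({"PrimaryAccountReference": ["A", "B", "A", "A", "B"]})
data['Counter_PrimaryAccountReference'] = (
    data.groupby('PrimaryAccountReference')['PrimaryAccountReference'].transform('size')
)
# Counter_PrimaryAccountReference is 3 for the "A" rows and 2 for the "B" rows,
# and the result is already "ungrouped": it has the same index as data.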
I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a.
The problem is I can't manually write the values in a as it's extracted from another job.
How would I filter in this case?
The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable, you'll have to do it explicitly using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
Obviously this is not something you would use in a "real" SQL environment due to security considerations but it shouldn't matter here.
In practice DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col
df.where(col("v").isin({"foo", "bar"})).count()
## 2
It is easy to build and compose and handles all details of HiveQL / Spark SQL for you.
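Applied to the original example, where a is the tuple coming from the other job, this looks roughly like:

from pyspark.sql.functions import col

a = (1, 2, 3)  # in practice, produced by the other job
df_filtered = my_df.where(col("field1").isin(list(a)))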
Reiterating what @zero323 mentioned above: we can do the same thing using a list as well (not only a set), like below:
from pyspark.sql.functions import col
df.where(col("v").isin(["foo", "bar"])).count()
Just a little addition/update:
choice_list = ["foo", "bar", "jack", "joan"]
If you want to filter your dataframe "df" so that you keep only the rows whose column "v" takes values from choice_list, then:
from pyspark.sql.functions import col
df_filtered = df.where(col("v").isin(choice_list))
You can also do this for integer columns:
df_filtered = df.filter("field1 in (1,2,3)")
or this for string columns:
df_filtered = df.filter("field1 in ('a','b','c')")
A slightly different approach that worked for me is to filter with a custom filter function.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

def filter_func(a):
    """Wrapper function to pass the broadcast variable a into a udf."""
    def filter_func_(value):
        """Return True if the value is in the broadcast tuple."""
        return value in a.value
    return udf(filter_func_, BooleanType())

# Broadcasting allows large variables to be passed efficiently
a = sc.broadcast((1, 2, 3))
df = my_df.filter(filter_func(a)(col('field1')))
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Practise').getOrCreate()
df_pyspark = spark.read.csv('datasets/myData.csv', header=True, inferSchema=True)
df_pyspark.createOrReplaceTempView("df")  # we need to create a temp view first
spark.sql("SELECT * FROM df where Departments in ('IOT','Big Data') order by Departments").show()