I am trying to create a list of new columns, each based on the latest column. I can achieve this with with_columns() and simple multiplication. Since I want a long list of new columns, I am thinking of using a loop with an f-string to do it. However, I am not sure how to use f-strings for Polars column names.
df = pl.DataFrame(
{
"id": ["NY", "TK", "FD"],
"eat2003": [-9, 3, 8],
"eat2004": [10, 11, 8]
}
); df
┌─────┬─────────┬─────────┐
│ id ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╡
│ NY ┆ -9 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ 3 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┘
(
df
.with_columns((pl.col('eat2004') * 2).alias('eat2005'))
.with_columns((pl.col('eat2005') * 2).alias('eat2006'))
.with_columns((pl.col('eat2006') * 2).alias('eat2007'))
)
Expected output:
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2003 ┆ eat2004 ┆ eat2005 ┆ eat2006 ┆ eat2007 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ -9 ┆ 10 ┆ 20 ┆ 40 ┆ 80 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ 3 ┆ 11 ┆ 22 ┆ 44 ┆ 88 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 8 ┆ 8 ┆ 16 ┆ 32 ┆ 64 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
If you can base each of the new columns directly on eat2004, I would suggest the following approach:
expr_list = [
(pl.col('eat2004') * (2**i)).alias(f"eat{2004 + i}")
for i in range(1, 8)
]
(
df
.with_columns(expr_list)
)
shape: (3, 10)
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2003 ┆ eat2004 ┆ eat2005 ┆ eat2006 ┆ eat2007 ┆ eat2008 ┆ eat2009 ┆ eat2010 ┆ eat2011 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ -9 ┆ 10 ┆ 20 ┆ 40 ┆ 80 ┆ 160 ┆ 320 ┆ 640 ┆ 1280 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ 3 ┆ 11 ┆ 22 ┆ 44 ┆ 88 ┆ 176 ┆ 352 ┆ 704 ┆ 1408 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 8 ┆ 8 ┆ 16 ┆ 32 ┆ 64 ┆ 128 ┆ 256 ┆ 512 ┆ 1024 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
As long as all the Expressions are independent of each other, we can run them in parallel in a single with_columns context (for a nice performance gain). However, if the Expressions are not independent, then they must each be run in successive with_columns contexts.
I've purposely created the list of Expressions outside of any query context to demonstrate that Expressions can be generated independently of any query. Later, the list can be supplied to with_columns. This approach helps with debugging and keeping code clean as you build and test your Expressions.
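If each new column truly has to be derived from the previous one, the f-string idea from the question still works; you just need one with_columns call per new column. A minimal sketch (producing eat2005 through eat2007, as in the expected output):
out = df
for year in range(2005, 2008):
    # each pass reads the column created by the previous pass
    out = out.with_columns((pl.col(f"eat{year - 1}") * 2).alias(f"eat{year}"))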
Related
String columns can be selected with:
df.select(pl.col(pl.Utf8))
and a DataFrame's rows can be filtered with a regex pattern for a single column, with something like:
df.filter(pl.col("feature").str.contains("dangerous"))
How can a DataFrame be filtered with a list of regex patterns that could appear in any string column? That is, if any string in a row matches any of the regex patterns, keep that entire row and discard the rest.
EDIT 1
Here's a generated df and patterns to test functionality and performance.
import random
from faker import Faker
import polars as pl
random.seed(42)
Faker.seed(42)
faker = Faker()
df_len = 10000
df = pl.DataFrame(
[
pl.Series("a", [random.randint(0, 511) for _ in range(df_len)]).cast(pl.Binary),
pl.Series("b", [random.randint(0, 1) for _ in range(df_len)]).cast(pl.Boolean),
pl.Series("c", faker.sentences(df_len), pl.Utf8),
pl.Series("d", [random.randint(0, 255) for _ in range(df_len)], pl.UInt8),
pl.Series("e", faker.words(df_len), pl.Utf8),
pl.Series(
"f",
[random.randint(0, 255) * random.TWOPI for _ in range(df_len)],
pl.Float32,
),
pl.Series("g", faker.words(df_len), pl.Utf8),
]
)
patterns = [r"(?i)dangerous", r"always", r"(?i)prevent"]
print(df) yields:
shape: (10000, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬────────────┬──────────┐
│ a ┆ b ┆ c ┆ d ┆ e ┆ f ┆ g │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ binary ┆ bool ┆ str ┆ u8 ┆ str ┆ f32 ┆ str │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪════════════╪══════════╡
│ [binary data] ┆ false ┆ Agent every development say. ┆ 164 ┆ let ┆ 980.17688 ┆ yard │
│ [binary data] ┆ true ┆ Beautiful instead ahead despite ... ┆ 210 ┆ reach ┆ 458.672516 ┆ son │
│ [binary data] ┆ false ┆ Information last everything than... ┆ 230 ┆ arm ┆ 50.265484 ┆ standard │
│ [binary data] ┆ false ┆ Choice whatever from behavior be... ┆ 29 ┆ operation ┆ 929.911438 ┆ final │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ [binary data] ┆ true ┆ Building sign recently avoid upo... ┆ 132 ┆ practice ┆ 282.743347 ┆ big │
│ [binary data] ┆ false ┆ Paper will board. ┆ 72 ┆ similar ┆ 376.991119 ┆ just │
│ [binary data] ┆ true ┆ Technology money worker spring m... ┆ 140 ┆ sign ┆ 94.24778 ┆ audience │
│ [binary data] ┆ false ┆ A third traditional ago. ┆ 40 ┆ available ┆ 615.752136 ┆ always │
└───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴────────────┴──────────┘
EDIT 2
Using #jqurious's answer (the fastest so far), the correct output of df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex))) is:
shape: (146, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
│ a ┆ b ┆ c ┆ d ┆ e ┆ f ┆ g │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ binary ┆ bool ┆ str ┆ u8 ┆ str ┆ f32 ┆ str │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
│ [binary data] ┆ true ┆ During prevent accept seem show ... ┆ 137 ┆ various ┆ 471.238892 ┆ customer │
│ [binary data] ┆ true ┆ Ball always it focus economy bef... ┆ 179 ┆ key ┆ 471.238892 ┆ guy │
│ [binary data] ┆ false ┆ Admit attack energy always. ┆ 175 ┆ purpose ┆ 1281.769775 ┆ wonder │
│ [binary data] ┆ false ┆ Beyond prevent entire staff. ┆ 242 ┆ hair ┆ 904.778687 ┆ around │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ [binary data] ┆ true ┆ Your sure piece simple always so... ┆ 247 ┆ recently ┆ 1055.575073 ┆ laugh │
│ [binary data] ┆ false ┆ Difference all machine let charg... ┆ 178 ┆ former ┆ 1061.858276 ┆ always │
│ [binary data] ┆ true ┆ Morning carry event tell prevent... ┆ 3 ┆ entire ┆ 1432.566284 ┆ hit │
│ [binary data] ┆ false ┆ A third traditional ago. ┆ 40 ┆ available ┆ 615.752136 ┆ always │
└───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴─────────────┴──────────┘
You can turn the list into a single regex.
regex = "|".join(
f"(?:{pattern})" for pattern in
sorted(patterns, key=len, reverse=True)
)
df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex)))
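For the example patterns above, the joined pattern is a single alternation; sorting longer patterns first means they are tried before any shorter pattern that could be a prefix of them:
patterns = [r"(?i)dangerous", r"always", r"(?i)prevent"]
regex = "|".join(
    f"(?:{pattern})" for pattern in
    sorted(patterns, key=len, reverse=True)
)
print(regex)  # (?:(?i)dangerous)|(?:(?i)prevent)|(?:always)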
To check regex patterns against each string column, you can use the .fold() method:
df = pl.DataFrame({
"a": ["foo", "fo", "foa"],
"b": ["foa", "fo", "foo"]
})
df.filter(
pl.fold(acc=pl.lit(False),
f=lambda acc, col: acc | col.str.contains("foo"),
exprs=pl.col(pl.Utf8))
)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ foo ┆ foa │
│ foa ┆ foo │
└─────┴─────┘
Another way to do it: concatenate all string columns into a single one and then apply the regex:
df.filter(
pl.concat_str(pl.col(pl.Utf8)).str.contains("foo")
)
When I load my parquet file into a Polars DataFrame, it takes about 5.5 GB of RAM. Polars is great compared to other options I have tried. However, Polars does not support creating indices like Pandas. This is troublesome for me because one column in my DataFrame is unique and the pattern of accessing the data in the df in my application is row lookups based on the unique column (dict-like).
Since the dataframe is massive, filtering is too expensive. However, I also seem to be short on RAM (32 GB). I am currently converting the df in "chunks" like this:
h = df.height # number of rows
chunk_size = 1_000_000 # rows per chunk
b = (np.linspace(1, math.ceil(h/chunk_size), num=math.ceil(h/chunk_size)))
new_col = (np.repeat(b, chunk_size))[:-( chunk_size - (h%chunk_size))]
df = df.with_column(polars.lit(new_col).alias('new_index'))
m = df.partition_by(groups="new_index", as_dict=True)
del df
gc.collect()
my_dict = {}
for key, value in list(m.items()):
    my_dict.update(
        {
            uas: frame.select(polars.exclude("unique_col")).to_dicts()[0]
            for uas, frame in (
                value
                .drop("new_index")
                .unique(subset=["unique_col"], keep='last')
                .partition_by(groups=["unique_col"],
                              as_dict=True,
                              maintain_order=True)
            ).items()
        }
    )
    m.pop(key)
RAM consumption does not seem to have changed much. Plus, I get an error saying that the dict size has changed during iteration (true). But what can I do? Is getting more RAM the only option?
Let's see if I can help. There may be ways to get excellent performance without partitioning your DataFrame.
First, some data
First, let's create a VERY large dataset and do some timings. The code below is something I've used in other situations to create a dataset of an arbitrary size. In this case, I'm going to create a dataset that is 400 GB in RAM. (I have a 32-core system with 512 GB of RAM.)
After creating the dataframe, I'm going to use set_sorted on col_0. (Because of the way the data is created, all columns are already in sorted order.) In addition, I'll shuffle col_1 so we can compare timings for sorted and non-sorted lookup columns.
import polars as pl
import math
import time
mem_in_GB = 400
def mem_squash(mem_size_GB: int) -> pl.DataFrame:
    nbr_uint64 = mem_size_GB * (2**30) / 8
    nbr_cols = math.ceil(nbr_uint64 ** (0.15))
    nbr_rows = math.ceil(nbr_uint64 / nbr_cols)
    return pl.DataFrame(
        data={
            "col_" + str(col_nbr): pl.arange(0, nbr_rows, eager=True)
            for col_nbr in range(nbr_cols)
        }
    )
df = mem_squash(mem_in_GB)
df = df.with_columns([
pl.col('col_0').set_sorted(),
pl.col('col_1').shuffle(),
])
df.estimated_size('gb')
df
>>> df.estimated_size('gb')
400.0000000670552
>>> df
shape: (1309441249, 41)
┌────────────┬───────────┬────────────┬────────────┬─────┬────────────┬────────────┬────────────┬────────────┐
│ col_0 ┆ col_1 ┆ col_2 ┆ col_3 ┆ ... ┆ col_37 ┆ col_38 ┆ col_39 ┆ col_40 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═══════════╪════════════╪════════════╪═════╪════════════╪════════════╪════════════╪════════════╡
│ 0 ┆ 438030034 ┆ 0 ┆ 0 ┆ ... ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 694387471 ┆ 1 ┆ 1 ┆ ... ┆ 1 ┆ 1 ┆ 1 ┆ 1 │
│ 2 ┆ 669976383 ┆ 2 ┆ 2 ┆ ... ┆ 2 ┆ 2 ┆ 2 ┆ 2 │
│ 3 ┆ 771482771 ┆ 3 ┆ 3 ┆ ... ┆ 3 ┆ 3 ┆ 3 ┆ 3 │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ 1309441245 ┆ 214104601 ┆ 1309441245 ┆ 1309441245 ┆ ... ┆ 1309441245 ┆ 1309441245 ┆ 1309441245 ┆ 1309441245 │
│ 1309441246 ┆ 894526083 ┆ 1309441246 ┆ 1309441246 ┆ ... ┆ 1309441246 ┆ 1309441246 ┆ 1309441246 ┆ 1309441246 │
│ 1309441247 ┆ 378223586 ┆ 1309441247 ┆ 1309441247 ┆ ... ┆ 1309441247 ┆ 1309441247 ┆ 1309441247 ┆ 1309441247 │
│ 1309441248 ┆ 520540081 ┆ 1309441248 ┆ 1309441248 ┆ ... ┆ 1309441248 ┆ 1309441248 ┆ 1309441248 ┆ 1309441248 │
└────────────┴───────────┴────────────┴────────────┴─────┴────────────┴────────────┴────────────┴────────────┘
So I now have a DataFrame of 1.3 billion rows that takes up 400 GB of my 512 GB of RAM. (Clearly, I cannot afford to make copies of this DataFrame in memory.)
Simple Filtering
Now, let's run a simple filter, on both the col_0 (the column on which I used set_sorted) and col_1 (the shuffled column).
start = time.perf_counter()
(
df
.filter(pl.col('col_0') == 106338253)
)
print(time.perf_counter() - start)
shape: (1, 41)
┌───────────┬───────────┬───────────┬───────────┬─────┬───────────┬───────────┬───────────┬───────────┐
│ col_0 ┆ col_1 ┆ col_2 ┆ col_3 ┆ ... ┆ col_37 ┆ col_38 ┆ col_39 ┆ col_40 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪═══════════╪═══════════╪═══════════╪═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 106338253 ┆ 885386691 ┆ 106338253 ┆ 106338253 ┆ ... ┆ 106338253 ┆ 106338253 ┆ 106338253 ┆ 106338253 │
└───────────┴───────────┴───────────┴───────────┴─────┴───────────┴───────────┴───────────┴───────────┘
>>> print(time.perf_counter() - start)
0.6669054719995984
start = time.perf_counter()
(
df
.filter(pl.col('col_1') == 106338253)
)
print(time.perf_counter() - start)
shape: (1, 41)
┌───────────┬───────────┬───────────┬───────────┬─────┬───────────┬───────────┬───────────┬───────────┐
│ col_0 ┆ col_1 ┆ col_2 ┆ col_3 ┆ ... ┆ col_37 ┆ col_38 ┆ col_39 ┆ col_40 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪═══════════╪═══════════╪═══════════╪═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 889423291 ┆ 106338253 ┆ 889423291 ┆ 889423291 ┆ ... ┆ 889423291 ┆ 889423291 ┆ 889423291 ┆ 889423291 │
└───────────┴───────────┴───────────┴───────────┴─────┴───────────┴───────────┴───────────┴───────────┘
>>> print(time.perf_counter() - start)
0.6410857040000337
In both cases, filtering to find a unique column value took less than one second. (And that's for a DataFrame of 1.3 billion records.)
With such performance, you may be able to get by with simply filtering in your application, without the need to partition your data.
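For a dict-like access pattern (one row per key), that filter can be used directly as a lookup. A minimal sketch against the example frame, assuming the key column holds unique values and the key is present:
# fetch the single matching row as a plain Python dict
row = df.filter(pl.col('col_1') == 106338253).to_dicts()[0]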
Wicked Speed: search_sorted
If you have sufficient RAM to sort your DataFrame by the search column, you may be able to get incredible speed.
Let's use search_sorted on col_0 (which is sorted), and slice, which is merely a window into a DataFrame.
start = time.perf_counter()
(
df.slice(df.get_column('col_0').search_sorted(106338253), 1)
)
print(time.perf_counter() - start)
shape: (1, 41)
┌───────────┬───────────┬───────────┬───────────┬─────┬───────────┬───────────┬───────────┬───────────┐
│ col_0 ┆ col_1 ┆ col_2 ┆ col_3 ┆ ... ┆ col_37 ┆ col_38 ┆ col_39 ┆ col_40 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪═══════════╪═══════════╪═══════════╪═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 106338253 ┆ 885386691 ┆ 106338253 ┆ 106338253 ┆ ... ┆ 106338253 ┆ 106338253 ┆ 106338253 ┆ 106338253 │
└───────────┴───────────┴───────────┴───────────┴─────┴───────────┴───────────┴───────────┴───────────┘
>>> print(time.perf_counter() - start)
0.00603273300021101
If you can sort your dataframe by the lookup column and use search_sorted, you can get some incredible performance: a speed-up by a factor of ~100x in our example above.
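To tie this back to the dict-like use case in the question, the search_sorted lookup can be wrapped in a small helper. A sketch, assuming the frame is sorted by the key column and the key is present:
def fast_lookup(frame: pl.DataFrame, key_col: str, key) -> dict:
    # binary-search the sorted key column, then slice out that single row
    idx = frame.get_column(key_col).search_sorted(key)
    return frame.slice(idx, 1).to_dicts()[0]

row = fast_lookup(df, 'col_0', 106338253)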
Does this help? Perhaps you can get the performance you need without partitioning your data.
.partition_by() returns a list of dataframes which does not sound helpful in this case.
You can use .groupby_dynamic() to process a dataframe in "chunks" - .with_row_count() can be used as an "index".
import polars as pl
chunk_size = 3
df = pl.DataFrame({"a": range(1, 11), "b": range(11, 21)})
(df.with_row_count()
.with_columns(pl.col("row_nr").cast(pl.Int64))
.groupby_dynamic(index_column="row_nr", every=f"{chunk_size}i")
.agg([
pl.col("a"),
pl.col("b"),
]))
shape: (4, 3)
┌────────┬───────────┬──────────────┐
│ row_nr ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] │
╞════════╪═══════════╪══════════════╡
│ 0 ┆ [1, 2, 3] ┆ [11, 12, 13] │
│ 3 ┆ [4, 5, 6] ┆ [14, 15, 16] │
│ 6 ┆ [7, 8, 9] ┆ [17, 18, 19] │
│ 9 ┆ [10] ┆ [20] │
└────────┴───────────┴──────────────┘
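As a simpler alternative sketch, if fixed-size positional chunks are all you need, df.slice can also walk the frame window by window:
chunk_size = 3
for offset in range(0, df.height, chunk_size):
    # a cheap window of up to chunk_size rows; process it here
    chunk = df.slice(offset, chunk_size)
    print(chunk.shape)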
I'm assuming you're using .scan_parquet() to load your data and you have a LazyFrame?
It may be helpful if you can share an actual example of what you need to do to your data.
A few things.
It's not accurate to say that Polars doesn't support creating indices. Polars does not use an index, and each row is indexed by its integer position in the table. Note the second clause of that sentence: Polars doesn't maintain a fixed index like pandas does, but you can create and use an index column if it's helpful. You can use with_row_count("name_of_index") to create one.
You say that filtering is too expensive, but if you use expressions it should be close to free. Are you filtering as though it were pandas, with square brackets, or are you using .filter? Are you using apply in your filter, or is it pure Polars expressions? If you're using filter with pure Polars expressions, then it shouldn't take up extra memory to give you the result of a filter. If you're using square brackets and/or apply, then you're creating copies to carry out those operations.
See here https://pola-rs.github.io/polars-book/user-guide/howcani/selecting_data/selecting_data_expressions.html
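A minimal sketch of both points, using an illustrative frame and column names (not your actual data): with_row_count adds an explicit index column, and a pure-expression filter only materializes the matching rows:
import polars as pl

df = pl.DataFrame({"unique_col": ["a", "b", "c"], "val": [1, 2, 3]})

# 1) an explicit index column, if one is useful
df = df.with_row_count("my_index")

# 2) pure-expression filtering; no square brackets, no apply
row = df.filter(pl.col("unique_col") == "b")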
Lastly, if you are going to repartition your parquet file then you can use pyarrow dataset scanner so you don't run out of memory. https://arrow.apache.org/docs/python/dataset.html
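A rough sketch of that last point (the path and what you do with each batch are placeholders): pyarrow.dataset can stream the parquet file in record batches, so the whole table never has to be in memory at once:
import pyarrow.dataset as ds

dataset = ds.dataset("your_big_file.parquet", format="parquet")  # placeholder path

for batch in dataset.to_batches():
    # each batch is a pyarrow.RecordBatch; process or re-write it here
    print(batch.num_rows)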
I have a small use case, and here is a Polars DataFrame.
df_names = pl.DataFrame({'LN': ['Mallesham','Bhavik','Mallesham','Bhavik','Mahesh','Naresh','Sharath','Rakesh','Mallesham'],
                         'FN': ['Yamulla','Yamulla','Yamulla','Yamulla','Dayala','Burre','Velmala','Uppu','Yamulla'],
                         'SSN': ['123','456','123','456','893','111','222','333','123'],
                         'Address': ['A','B','C','D','E','F','G','H','S']})
I would like to group on LN, FN, and SSN and create a new column holding the number of observations for each group combination; below is the expected output.
'Mallesham', 'Yamulla', '123' appears 3 times, hence the LN_FN_SSN_count field is filled with 3.
You can use an expression with over (which is like grouping, aggregating, and self-joining in other libraries, but without the need for the join):
df_names.with_column(pl.count().over(['LN', 'FN', 'SSN']).alias('LN_FN_SSN_count'))
┌───────────┬─────────┬─────┬─────────┬─────────────────┐
│ LN ┆ FN ┆ SSN ┆ Address ┆ LN_FN_SSN_count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ u32 │
╞═══════════╪═════════╪═════╪═════════╪═════════════════╡
│ Mallesham ┆ Yamulla ┆ 123 ┆ A ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bhavik ┆ Yamulla ┆ 456 ┆ B ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mallesham ┆ Yamulla ┆ 123 ┆ C ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bhavik ┆ Yamulla ┆ 456 ┆ D ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Naresh ┆ Burre ┆ 111 ┆ F ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Sharath ┆ Velmala ┆ 222 ┆ G ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Rakesh ┆ Uppu ┆ 333 ┆ H ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mallesham ┆ Yamulla ┆ 123 ┆ S ┆ 3 │
└───────────┴─────────┴─────┴─────────┴─────────────────┘
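For comparison, here is roughly the group-by-and-join version that over lets you skip (a sketch using the same older API as above); over avoids materializing the intermediate counts frame and the join:
counts = (
    df_names
    .groupby(['LN', 'FN', 'SSN'])
    .agg([pl.count().alias('LN_FN_SSN_count')])
)
df_names.join(counts, on=['LN', 'FN', 'SSN'], how='left')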
I am trying to remove null values across a list of selected columns. But it seems that I haven't got the with_columns operation right. What's the right approach if you want to apply the removal only to selected columns?
df = pl.DataFrame(
{
"id": ["NY", "TK", "FD"],
"eat2000": [1, None, 3],
"eat2001": [-2, None, 4],
"eat2002": [None, None, None],
"eat2003": [-9, None, 8],
"eat2004": [None, None, 8]
}
); df
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ null ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
col_list = [word for word in df.columns if word.startswith("eat")]
(
df
.with_columns([
pl.col(col_list).filter(~pl.fold(True, lambda acc, s: acc & s.is_null(), pl.all()))
])
)
Expected output:
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
The polars.all and polars.any expressions will perform their operations horizontally (i.e., row-wise) if we supply them with a list of Expressions or a polars.col with a regex wildcard expression. Let's use the latter to simplify our work:
(
df
.filter(
~pl.all(pl.col('^eat.*$').is_null())
)
)
shape: (2, 6)
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
The ~ in front of the pl.all stands for negation. Notice that we didn't need the col_list.
One caution: the regex expression in the pl.col must start with ^ and end with $. These cannot be omitted, even if the resulting regex expression is otherwise valid.
Alternately, if you don't like the ~ operator:
(
df
.filter(
pl.any(pl.col('^eat.*$').is_not_null())
)
)
Other Notes
As an aside, polars.sum, polars.min, and polars.max will also operate row-wise when supplied with a list of Expressions or a wildcard expression in pl.col.
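For example, a small sketch that sums the eat columns row-wise into a (hypothetical) eat_total column, relying on the row-wise behavior described above:
df.with_columns(
    pl.sum(pl.col('^eat.*$')).alias('eat_total')
)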
Edit - using fold
FYI, here's how to use the fold method, if that is what you'd prefer. Note the use of pl.col with a regex expression.
(
df
.filter(
~pl.fold(True, lambda acc, s: acc & s.is_null(), exprs=pl.col('^eat.*$'))
)
)
shape: (2, 6)
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
(
df
.filter(pl.col("eat2000").is_not_null())
)
Coming from pandas, I want to interchange/assign values between columns in Polars, row by row, something like this:
for i in range(len(k2)):
    k2['column1'][i] = k2['column2'][i]
    k2['column3'][i] = k2['column4'][i]
You can use alias to copy & rename columns:
import polars as pl
k2 = pl.DataFrame({"column1": [1,2,3],
"column2": [4,5,6],
"column3": [7,8,9],
"column4": [10,11,12]})
┌─────────┬─────────┬─────────┬─────────┐
│ column1 ┆ column2 ┆ column3 ┆ column4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╡
│ 1 ┆ 4 ┆ 7 ┆ 10 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 5 ┆ 8 ┆ 11 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 6 ┆ 9 ┆ 12 │
└─────────┴─────────┴─────────┴─────────┘
k2.with_columns([pl.col("column2").alias("column1"), pl.col("column4").alias("column3")])
which prints out
┌─────────┬─────────┬─────────┬─────────┐
│ column1 ┆ column2 ┆ column3 ┆ column4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╡
│ 4 ┆ 4 ┆ 10 ┆ 10 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 5 ┆ 11 ┆ 11 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 6 ┆ 12 ┆ 12 │
└─────────┴─────────┴─────────┴─────────┘
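If the goal is to actually interchange (swap) the values rather than copy them, note that all expressions in a single with_columns call read from the original frame, so a sketch like this swaps both pairs:
k2.with_columns([
    pl.col("column2").alias("column1"),
    pl.col("column1").alias("column2"),
    pl.col("column4").alias("column3"),
    pl.col("column3").alias("column4"),
])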