I am trying to remove null values across a list of selected columns, but it seems that I haven't got the with_columns operation quite right. What's the right approach if you want to apply the removal only to the selected columns?
df = pl.DataFrame(
{
"id": ["NY", "TK", "FD"],
"eat2000": [1, None, 3],
"eat2001": [-2, None, 4],
"eat2002": [None, None, None],
"eat2003": [-9, None, 8],
"eat2004": [None, None, 8]
}
); df
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ null ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
col_list = [word for word in df.columns if word.startswith("eat")]
(
df
.with_columns([
pl.col(col_list).filter(~pl.fold(True, lambda acc, s: acc & s.is_null(), pl.all()))
])
)
Expected output:
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
The polars.all and polars.any expressions will perform their operations horizontally (i.e., row-wise) if we supply them with a list of Expressions or a polars.col with a regex wildcard expression. Let's use the latter to simplify our work:
(
df
.filter(
~pl.all(pl.col('^eat.*$').is_null())
)
)
shape: (2, 6)
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
The ~ in front of the pl.all stands for negation. Notice that we didn't need the col_list.
One caution: the regex expression in the pl.col must start with ^ and end with $. These cannot be omitted, even if the resulting regex expression is otherwise valid.
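For instance (my illustration, not from the original answer):
pl.col('^eat.*$')   # anchored: treated as a regex and expanded to eat2000, eat2001, ...
pl.col('eat.*')     # unanchored: treated as a literal column named "eat.*"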
Alternatively, if you don't like the ~ operator:
(
df
.filter(
pl.any(pl.col('^eat.*$').is_not_null())
)
)
Other Notes
As an aside, polars.sum, polars.min, and polars.max will also operate row-wise when supplied with a list of Expressions or a wildcard expression in pl.col.
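For example, here's a small sketch (mine, based on the note above) of a horizontal sum across the eat columns:
(
    df
    .with_columns([
        pl.sum(pl.col('^eat.*$')).alias('eat_total')
    ])
)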
Edit - using fold
FYI, here's how to use the fold method, if that is what you'd prefer. Note the use of pl.col with a regex expression.
(
df
.filter(
~pl.fold(True, lambda acc, s: acc & s.is_null(), exprs=pl.col('^eat.*$'))
)
)
shape: (2, 6)
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
(
df
.filter(pl.col("eat2000").is_not_null())
)
Related
When I load my parquet file into a Polars DataFrame, it takes about 5.5 GB of RAM. Polars is great compared to other options I have tried. However, Polars does not support creating indices like Pandas. This is troublesome for me because one column in my DataFrame is unique and the pattern of accessing the data in the df in my application is row lookups based on the unique column (dict-like).
Since the dataframe is massive, filtering is too expensive. However, I also seem to be short on RAM (32 GB). I am currently converting the df in "chunks" like this:
import gc
import math
import numpy as np
import polars

h = df.height  # number of rows
chunk_size = 1_000_000  # rows per chunk
b = np.linspace(1, math.ceil(h / chunk_size), num=math.ceil(h / chunk_size))
new_col = np.repeat(b, chunk_size)[:-(chunk_size - (h % chunk_size))]
df = df.with_column(polars.lit(new_col).alias('new_index'))
m = df.partition_by(groups="new_index", as_dict=True)
del df
gc.collect()
my_dict = {}
for key, value in list(m.items()):
my_dict.update(
{
uas: frame.select(polars.exclude("unique_col")).to_dicts()[0]
for uas, frame in
(
value
.drop("new_index")
.unique(subset=["unique_col"], keep='last')
.partition_by(groups=["unique_col"],
as_dict=True,
maintain_order=True)
).items()
}
)
m.pop(key)
RAM consumption does not seem to have changed much. Plus, I get an error saying that the dict size has changed during iteration (true). But what can I do? Is getting more RAM the only option?
Let's see if I can help. There may be ways to get excellent performance without partitioning your DataFrame.
First, some data
First, let's create a VERY large dataset and do some timings. The code below is something I've used in other situations to create a dataset of an arbitrary size. In this case, I'm going to create a dataset that is 400 GB in RAM. (I have a 32-core system with 512 GB of RAM.)
After creating the dataframe, I'm going to use set_sorted on col_0. (Because of the way the data is created, all columns are in sorted order.) In addition, I'll shuffle col_1 so we can compare timings for sorted and non-sorted lookup columns.
import polars as pl
import math
import time
mem_in_GB = 400
def mem_squash(mem_size_GB: int) -> pl.DataFrame:
nbr_uint64 = mem_size_GB * (2**30) / 8
nbr_cols = math.ceil(nbr_uint64 ** (0.15))
nbr_rows = math.ceil(nbr_uint64 / nbr_cols)
return pl.DataFrame(
data={
"col_" + str(col_nbr): pl.arange(0, nbr_rows, eager=True)
for col_nbr in range(nbr_cols)
}
)
df = mem_squash(mem_in_GB)
df = df.with_columns([
pl.col('col_0').set_sorted(),
pl.col('col_1').shuffle(),
])
df.estimated_size('gb')
df
>>> df.estimated_size('gb')
400.0000000670552
>>> df
shape: (1309441249, 41)
┌────────────┬───────────┬────────────┬────────────┬─────┬────────────┬────────────┬────────────┬────────────┐
│ col_0 ┆ col_1 ┆ col_2 ┆ col_3 ┆ ... ┆ col_37 ┆ col_38 ┆ col_39 ┆ col_40 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═══════════╪════════════╪════════════╪═════╪════════════╪════════════╪════════════╪════════════╡
│ 0 ┆ 438030034 ┆ 0 ┆ 0 ┆ ... ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 694387471 ┆ 1 ┆ 1 ┆ ... ┆ 1 ┆ 1 ┆ 1 ┆ 1 │
│ 2 ┆ 669976383 ┆ 2 ┆ 2 ┆ ... ┆ 2 ┆ 2 ┆ 2 ┆ 2 │
│ 3 ┆ 771482771 ┆ 3 ┆ 3 ┆ ... ┆ 3 ┆ 3 ┆ 3 ┆ 3 │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ 1309441245 ┆ 214104601 ┆ 1309441245 ┆ 1309441245 ┆ ... ┆ 1309441245 ┆ 1309441245 ┆ 1309441245 ┆ 1309441245 │
│ 1309441246 ┆ 894526083 ┆ 1309441246 ┆ 1309441246 ┆ ... ┆ 1309441246 ┆ 1309441246 ┆ 1309441246 ┆ 1309441246 │
│ 1309441247 ┆ 378223586 ┆ 1309441247 ┆ 1309441247 ┆ ... ┆ 1309441247 ┆ 1309441247 ┆ 1309441247 ┆ 1309441247 │
│ 1309441248 ┆ 520540081 ┆ 1309441248 ┆ 1309441248 ┆ ... ┆ 1309441248 ┆ 1309441248 ┆ 1309441248 ┆ 1309441248 │
└────────────┴───────────┴────────────┴────────────┴─────┴────────────┴────────────┴────────────┴────────────┘
So I now have a DataFrame of 1.3 billion rows that takes up 400 GB of my 512 GB of RAM. (Clearly, I cannot afford to make copies of this DataFrame in memory.)
Simple Filtering
Now, let's run a simple filter, on both the col_0 (the column on which I used set_sorted) and col_1 (the shuffled column).
start = time.perf_counter()
(
df
.filter(pl.col('col_0') == 106338253)
)
print(time.perf_counter() - start)
shape: (1, 41)
┌───────────┬───────────┬───────────┬───────────┬─────┬───────────┬───────────┬───────────┬───────────┐
│ col_0 ┆ col_1 ┆ col_2 ┆ col_3 ┆ ... ┆ col_37 ┆ col_38 ┆ col_39 ┆ col_40 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪═══════════╪═══════════╪═══════════╪═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 106338253 ┆ 885386691 ┆ 106338253 ┆ 106338253 ┆ ... ┆ 106338253 ┆ 106338253 ┆ 106338253 ┆ 106338253 │
└───────────┴───────────┴───────────┴───────────┴─────┴───────────┴───────────┴───────────┴───────────┘
>>> print(time.perf_counter() - start)
0.6669054719995984
start = time.perf_counter()
(
df
.filter(pl.col('col_1') == 106338253)
)
print(time.perf_counter() - start)
shape: (1, 41)
┌───────────┬───────────┬───────────┬───────────┬─────┬───────────┬───────────┬───────────┬───────────┐
│ col_0 ┆ col_1 ┆ col_2 ┆ col_3 ┆ ... ┆ col_37 ┆ col_38 ┆ col_39 ┆ col_40 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪═══════════╪═══════════╪═══════════╪═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 889423291 ┆ 106338253 ┆ 889423291 ┆ 889423291 ┆ ... ┆ 889423291 ┆ 889423291 ┆ 889423291 ┆ 889423291 │
└───────────┴───────────┴───────────┴───────────┴─────┴───────────┴───────────┴───────────┴───────────┘
>>> print(time.perf_counter() - start)
0.6410857040000337
In both cases, filtering to find a unique column value took less than one second. (And that's for a DataFrame of 1.3 billion records.)
With such performance, you may be able to get by with simply filtering in your application, without the need to partition your data.
Wicked Speed: search_sorted
If you have sufficient RAM to sort your DataFrame by the search column, you may be able to get incredible speed.
Let's use search_sorted on col_0 (which is sorted), and slice which is merely a window into a DataFrame.
start = time.perf_counter()
(
df.slice(df.get_column('col_0').search_sorted(106338253), 1)
)
print(time.perf_counter() - start)
shape: (1, 41)
┌───────────┬───────────┬───────────┬───────────┬─────┬───────────┬───────────┬───────────┬───────────┐
│ col_0 ┆ col_1 ┆ col_2 ┆ col_3 ┆ ... ┆ col_37 ┆ col_38 ┆ col_39 ┆ col_40 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪═══════════╪═══════════╪═══════════╪═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 106338253 ┆ 885386691 ┆ 106338253 ┆ 106338253 ┆ ... ┆ 106338253 ┆ 106338253 ┆ 106338253 ┆ 106338253 │
└───────────┴───────────┴───────────┴───────────┴─────┴───────────┴───────────┴───────────┴───────────┘
>>> print(time.perf_counter() - start)
0.00603273300021101
If you can sort your dataframe by the lookup column and use search_sorted, you can get some incredible performance: a speed-up by a factor of ~100x in our example above.
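If your lookup column isn't already sorted, a one-time sort (my sketch, assuming col_1 were the lookup column) would enable the same pattern:
df = df.sort('col_1')  # one-time cost up front
df.slice(df.get_column('col_1').search_sorted(106338253), 1)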
Does this help? Perhaps you can get the performance you need without partitioning your data.
.partition_by() returns a list of dataframes which does not sound helpful in this case.
You can use .groupby_dynamic() to process a dataframe in "chunks" - .with_row_count() can be used as an "index".
import polars as pl
chunk_size = 3
df = pl.DataFrame({"a": range(1, 11), "b": range(11, 21)})
(df.with_row_count()
.with_columns(pl.col("row_nr").cast(pl.Int64))
.groupby_dynamic(index_column="row_nr", every=f"{chunk_size}i")
.agg([
pl.col("a"),
pl.col("b"),
]))
shape: (4, 3)
┌────────┬───────────┬──────────────┐
│ row_nr ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] │
╞════════╪═══════════╪══════════════╡
│ 0 ┆ [1, 2, 3] ┆ [11, 12, 13] │
│ 3 ┆ [4, 5, 6] ┆ [14, 15, 16] │
│ 6 ┆ [7, 8, 9] ┆ [17, 18, 19] │
│ 9 ┆ [10] ┆ [20] │
└────────┴───────────┴──────────────┘
I'm assuming you're using .scan_parquet() to load your data and you have a LazyFrame?
It may be helpful if you can share an actual example of what you need to do to your data.
A few things.
It's not accurate to say that polars doesn't support creating indices. Polars does not use an index, and each row is indexed by its integer position in the table. Note the second clause of that sentence: polars doesn't maintain a fixed index like pandas does, but you can create and use an index column if it's helpful. You can use with_row_count("name_of_index") to create one.
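For instance, a minimal sketch (with hypothetical column names) of creating and using such an index column:
import polars as pl

df = pl.DataFrame({"unique_col": ["a", "b", "c"], "payload": [10, 20, 30]})
df = df.with_row_count("name_of_index")       # adds a u32 positional index column
row = df.filter(pl.col("unique_col") == "b")  # dict-like lookup on the unique column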
You say that filtering is too expensive, but if you use expressions it should be close to free. Are you filtering with square brackets as though it were pandas, or are you using .filter? Are you using apply in your filter, or pure polars expressions? If you're using .filter with pure polars expressions, it shouldn't take extra memory to give you the result of a filter. If you're using square brackets and/or apply, you're creating copies to carry out those operations.
See here https://pola-rs.github.io/polars-book/user-guide/howcani/selecting_data/selecting_data_expressions.html
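To illustrate the difference with the toy df from the sketch above (hypothetical, not the OP's data):
fast = df.filter(pl.col("unique_col") == "b")                     # pure expression: evaluated natively
slow = df.filter(pl.col("unique_col").apply(lambda v: v == "b"))  # apply: calls Python for every row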
Lastly, if you are going to repartition your parquet file then you can use pyarrow dataset scanner so you don't run out of memory. https://arrow.apache.org/docs/python/dataset.html
I am trying to create a list of new columns based on the latest column. I can achieve this by using with_columns() and simple multiplication. Since I want a long list of new columns, I am thinking of using a loop with an f-string. However, I am not sure how to apply an f-string to polars column names.
df = pl.DataFrame(
{
"id": ["NY", "TK", "FD"],
"eat2003": [-9, 3, 8],
"eat2004": [10, 11, 8]
}
); df
┌─────┬─────────┬─────────┐
│ id ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╡
│ NY ┆ -9 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ 3 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┘
(
df
.with_columns((pl.col('eat2004') * 2).alias('eat2005'))
.with_columns((pl.col('eat2005') * 2).alias('eat2006'))
.with_columns((pl.col('eat2006') * 2).alias('eat2007'))
)
Expected output:
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2003 ┆ eat2004 ┆ eat2005 ┆ eat2006 ┆ eat2007 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ -9 ┆ 10 ┆ 20 ┆ 40 ┆ 80 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ 3 ┆ 11 ┆ 22 ┆ 44 ┆ 88 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 8 ┆ 8 ┆ 16 ┆ 32 ┆ 64 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
If you can base each of the new columns on eat2004, I would suggest the following approach:
expr_list = [
(pl.col('eat2004') * (2**i)).alias(f"eat{2004 + i}")
for i in range(1, 8)
]
(
df
.with_columns(expr_list)
)
shape: (3, 10)
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2003 ┆ eat2004 ┆ eat2005 ┆ eat2006 ┆ eat2007 ┆ eat2008 ┆ eat2009 ┆ eat2010 ┆ eat2011 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ -9 ┆ 10 ┆ 20 ┆ 40 ┆ 80 ┆ 160 ┆ 320 ┆ 640 ┆ 1280 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ 3 ┆ 11 ┆ 22 ┆ 44 ┆ 88 ┆ 176 ┆ 352 ┆ 704 ┆ 1408 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 8 ┆ 8 ┆ 16 ┆ 32 ┆ 64 ┆ 128 ┆ 256 ┆ 512 ┆ 1024 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
As long as all the Expressions are independent of each other, we can run them in parallel in a single with_columns context (for a nice performance gain). However, if the Expressions are not independent, then they must each be run in successive with_column contexts.
I've purposely created the list of Expressions outside of any query context to demonstrate that Expressions can be generated independent of any query. Later, the list can be supplied to with_columns. This approach helps with debugging and keeping code clean, as you build and test your Expressions.
I have a dataframe as follows:
import numpy as np
import polars as pl

df = pl.DataFrame({'last_name': ['Unknown', 'Mallesham', np.nan, 'Bhavik', 'Unknown'],
                   'first_name_or_initial': ['U', np.nan, 'TRUE', 'yamulla', np.nan],
                   'number': ['003123490', '012457847', '100030303', '', '0023004648'],
                   'date_of_birth': [np.nan, '12/09/1900', '12/09/1900', '12/09/1900', np.nan]})
Here I would like to add a new column which contains the names of the fields that hold any information (i.e., that are not NULL/empty/NaN).
For example:
First row: it has last name, first name, and number information, and dob is NULL, so a new column conso_field is filled in with the field names last_name, first_name_or_initial, and number. Likewise, I need this done for all the rows.
Here is an expected output:
First, let's expand the example to show a row with all null/empty fields (to show how the algorithm handles this case).
import polars as pl
import numpy as np
df = pl.DataFrame(
{
"last_name": ["Unknown", "Mallesham", np.nan, "Bhavik", "Unknown", None],
"first_name_or_initial": ["U", np.nan, "TRUE", "yamulla", np.nan, None],
"number": ["003123490", "012457847", "100030303", "", "0023004648", None],
"date_of_birth": [np.nan, "12/09/1900", "12/09/1900", "12/09/1900", np.nan, None],
}
)
df
shape: (6, 4)
┌───────────┬───────────────────────┬────────────┬───────────────┐
│ last_name ┆ first_name_or_initial ┆ number ┆ date_of_birth │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞═══════════╪═══════════════════════╪════════════╪═══════════════╡
│ Unknown ┆ U ┆ 003123490 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mallesham ┆ null ┆ 012457847 ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ TRUE ┆ 100030303 ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bhavik ┆ yamulla ┆ ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Unknown ┆ null ┆ 0023004648 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ null ┆ null ┆ null │
└───────────┴───────────────────────┴────────────┴───────────────┘
The Algorithm
df = df.with_row_count()
(
df
.join(
df
.melt(id_vars="row_nr")
.filter(pl.col('value').is_not_null() & (pl.col('value') != ""))
.groupby('row_nr')
.agg(pl.col('variable').alias('conso_field'))
.with_column(pl.col('conso_field').arr.join(','))
,
on='row_nr',
how='left'
)
)
shape: (6, 6)
┌────────┬───────────┬───────────────────────┬────────────┬───────────────┬─────────────────────────────────────┐
│ row_nr ┆ last_name ┆ first_name_or_initial ┆ number ┆ date_of_birth ┆ conso_field │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str ┆ str │
╞════════╪═══════════╪═══════════════════════╪════════════╪═══════════════╪═════════════════════════════════════╡
│ 0 ┆ Unknown ┆ U ┆ 003123490 ┆ null ┆ last_name,first_name_or_initial,... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ Mallesham ┆ null ┆ 012457847 ┆ 12/09/1900 ┆ last_name,number,date_of_birth │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ TRUE ┆ 100030303 ┆ 12/09/1900 ┆ first_name_or_initial,number,dat... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ Bhavik ┆ yamulla ┆ ┆ 12/09/1900 ┆ last_name,first_name_or_initial,... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ Unknown ┆ null ┆ 0023004648 ┆ null ┆ last_name,number │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ null ┆ null ┆ null ┆ null ┆ null │
└────────┴───────────┴───────────────────────┴────────────┴───────────────┴─────────────────────────────────────┘
Note that the algorithm keeps the last row with all null/empty values.
How it works
To see how it works, let's take it in steps.
First, we'll need to attach a row number to each row. (This is needed in case any row has all null/empty values.)
Then we'll use melt to place each value in each column on a separate row, next to its column name.
df = df.with_row_count()
(
df
.melt(id_vars="row_nr")
)
shape: (24, 3)
┌────────┬───────────────┬────────────┐
│ row_nr ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str │
╞════════╪═══════════════╪════════════╡
│ 0 ┆ last_name ┆ Unknown │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ last_name ┆ Mallesham │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ last_name ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ last_name ┆ Bhavik │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ date_of_birth ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ date_of_birth ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ date_of_birth ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ date_of_birth ┆ null │
└────────┴───────────────┴────────────┘
Note that column values will be converted to string values in this step.
Next, we'll filter out any rows with null or "" values.
df = df.with_row_count()
(
df
.melt(id_vars="row_nr")
.filter(pl.col('value').is_not_null() & (pl.col('value') != ""))
)
shape: (14, 3)
┌────────┬───────────────┬────────────┐
│ row_nr ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str │
╞════════╪═══════════════╪════════════╡
│ 0 ┆ last_name ┆ Unknown │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ last_name ┆ Mallesham │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ last_name ┆ Bhavik │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ last_name ┆ Unknown │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ number ┆ 0023004648 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ date_of_birth ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ date_of_birth ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ date_of_birth ┆ 12/09/1900 │
└────────┴───────────────┴────────────┘
In the next step, we'll aggregate up all the remaining rows by row number, keeping only the column names. These represent columns with non-null, non-empty values.
df = df.with_row_count()
(
df
.melt(id_vars="row_nr")
.filter(pl.col('value').is_not_null() & (pl.col('value') != ""))
.groupby('row_nr')
.agg(pl.col('variable').alias('conso_field'))
)
shape: (5, 2)
┌────────┬─────────────────────────────────────┐
│ row_nr ┆ conso_field │
│ --- ┆ --- │
│ u32 ┆ list[str] │
╞════════╪═════════════════════════════════════╡
│ 2 ┆ ["first_name_or_initial", "numbe... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ["last_name", "first_name_or_ini... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ ["last_name", "number"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ ["last_name", "first_name_or_ini... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["last_name", "number", "date_of... │
└────────┴─────────────────────────────────────┘
Note that we get a list of column names for each row. (Note: we don't need to worry about the order of the rows at this point. We'll use the row number and a left join in the last step to recombine the values with the original DataFrame.)
Then, it's simply a matter of joining the columns names into one string:
df = df.with_row_count()
(
df
.melt(id_vars="row_nr")
.filter(pl.col('value').is_not_null() & (pl.col('value') != ""))
.groupby('row_nr')
.agg(pl.col('variable').alias('conso_field'))
.with_column(pl.col('conso_field').arr.join(','))
)
shape: (5, 2)
┌────────┬─────────────────────────────────────┐
│ row_nr ┆ conso_field │
│ --- ┆ --- │
│ u32 ┆ str │
╞════════╪═════════════════════════════════════╡
│ 3 ┆ last_name,first_name_or_initial,... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ first_name_or_initial,number,dat... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ last_name,number │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ last_name,first_name_or_initial,... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ last_name,number,date_of_birth │
└────────┴─────────────────────────────────────┘
From here, we simply use a "left join" to merge the data back to the original dataset (as shown at the beginning.)
In pandas, the following code will split the string from col1 into many columns. Is there a way to do this in polars?
import pandas as pd

d = {'col1': ["a/b/c/d", "a/b/c/d"]}
df = pd.DataFrame(data=d)
df[["a", "b", "c", "d"]] = df["col1"].str.split('/', expand=True)
Here's an algorithm that will automatically adjust for the required number of columns -- and should be quite performant.
Let's start with this data. Notice that I've purposely added the empty string "" and a null value - to show how the algorithm handles these values. Also, the number of split strings varies widely.
import polars as pl
df = pl.DataFrame(
{
"my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"],
}
)
df
shape: (5, 1)
┌─────────────────────────────┐
│ my_str │
│ --- │
│ str │
╞═════════════════════════════╡
│ cat │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat/dog │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat/dog/aardvark/mouse/frog │
└─────────────────────────────┘
The Algorithm
The algorithm below may be a bit more than you need, but you can edit/delete/add as you need.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_column(
("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
.over("id")
.alias("col_nm")
)
.pivot(
index=['id', 'my_str'],
values='split_str',
columns='col_nm',
)
.with_column(
pl.col('^string_.*$').fill_null("")
)
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id ┆ my_str ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat ┆ dog ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ ┆ ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ ┆ ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ dog ┆ aardvark ┆ mouse ┆ frog │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
How it works
We first assign a row number id (which we'll need later), and use split to separate the strings. Note that the split strings form a list.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
)
shape: (5, 3)
┌─────┬─────────────────────────────┬────────────────────────────┐
│ id ┆ my_str ┆ split_str │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ list[str] │
╞═════╪═════════════════════════════╪════════════════════════════╡
│ 0 ┆ cat ┆ ["cat"] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ ["cat", "dog"] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ [""] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ ["cat", "dog", ... "frog"] │
└─────┴─────────────────────────────┴────────────────────────────┘
Next, we'll use explode to put each string on its own row. (Notice how the id column tracks the original row that each string came from.)
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
)
shape: (10, 3)
┌─────┬─────────────────────────────┬───────────┐
│ id ┆ my_str ┆ split_str │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╡
│ 0 ┆ cat ┆ cat │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ dog │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ dog │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ aardvark │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ mouse │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ frog │
└─────┴─────────────────────────────┴───────────┘
In the next step, we're going to generate our column names. I chose to call each column string_XX where XX is the offset with regards to the original string.
I've used the handy zfill expression so that 1 becomes 01. (This makes sure that string_02 comes before string_10 if you decide to sort your columns later.)
You can substitute your own naming in this step as you need.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_column(
("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
.over("id")
.alias("col_nm")
)
)
shape: (10, 4)
┌─────┬─────────────────────────────┬───────────┬───────────┐
│ id ┆ my_str ┆ split_str ┆ col_nm │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ dog ┆ string_01 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ dog ┆ string_01 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ aardvark ┆ string_02 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ mouse ┆ string_03 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ frog ┆ string_04 │
└─────┴─────────────────────────────┴───────────┴───────────┘
In the next step, we'll use the pivot function to place each string in its own column.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_column(
("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
.over("id")
.alias("col_nm")
)
.pivot(
index=['id', 'my_str'],
values='split_str',
columns='col_nm',
)
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id ┆ my_str ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat ┆ dog ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ dog ┆ aardvark ┆ mouse ┆ frog │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
All that remains is to use fill_null to replace the null values with an empty string "". Notice that I've used a regex expression in the col expression to target only those columns whose names start with "string_". (Depending on your other data, you may not want to replace null with "" everywhere in your data.)
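For completeness, here's that last step in isolation (a sketch; pivoted is a hypothetical name for the result of the previous pivot step):
pivoted.with_column(
    pl.col('^string_.*$').fill_null("")
)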
You can use the apply() method:
import polars as pl
from polars import col
df = pl.DataFrame({
'col1': ["a/b/c/d", "e/f/j/k"]
})
print(df)
df:
shape: (2, 1)
┌─────────┐
│ col1 │
│ --- │
│ str │
╞═════════╡
│ a/b/c/d │
├╌╌╌╌╌╌╌╌╌┤
│ e/f/j/k │
└─────────┘
With apply()
df = df.with_columns([
col('col1'),
*[col('col1').apply(lambda s, i=i: s.split('/')[i]).alias(col_name)
for i, col_name in enumerate(['a', 'b', 'c', 'd'])]
# or without 'for'
# col('col1').apply(lambda s: s.split('/')[0]).alias('a'),
# col('col1').apply(lambda s: s.split('/')[1]).alias('b'),
# col('col1').apply(lambda s: s.split('/')[2]).alias('c'),
# col('col1').apply(lambda s: s.split('/')[3]).alias('d')
])
print(df)
df:
shape: (2, 5)
┌─────────┬─────┬─────┬─────┬─────┐
│ col1 ┆ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str │
╞═════════╪═════╪═════╪═════╪═════╡
│ a/b/c/d ┆ a ┆ b ┆ c ┆ d │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ e/f/j/k ┆ e ┆ f ┆ j ┆ k │
└─────────┴─────┴─────┴─────┴─────┘
It works, but there is probably a cleaner way.
With this approach, you split the strings to turn col1 into a list column. Then you loop over the list elements and use .arr.get to extract each element into a separate column:
(df
 .with_column(pl.col("col1").str.split("/"))
 .with_columns(
     [pl.col("col1").arr.get(i).alias(str(i))
      for i in range(len(df[0, "col1"].split('/')))]
 )
)
One challenge is whether you will have the same number of elements in the list in each row. In this solution I've assumed you do, and taken the length of the list in the first row to drive the loop. A variant that handles varying lengths is sketched below.
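If the rows can have differing numbers of elements, a hedged variant (my sketch, not part of the original answer) is to size the loop with the longest list in the column; out-of-range .arr.get calls then yield nulls:
n_cols = df.select(pl.col("col1").str.split("/").arr.lengths().max()).to_series()[0]
(df
 .with_column(pl.col("col1").str.split("/"))
 .with_columns(
     [pl.col("col1").arr.get(i).alias(str(i)) for i in range(n_cols)]
 )
)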
You can use the struct datatype, as described in this post: https://stackoverflow.com/a/74219166:
import polars as pl
df = pl.DataFrame({
"my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"],
})
df.select(pl.col('my_str').str.split('/')
.arr.to_struct(n_field_strategy="max_width")).unnest('my_str')
Notice you must use n_field_strategy="max_width"; otherwise, unnest() will create only 1 column.
import polars as pl

# Create new column list (can be created dynamically as well)
new_cols = ['new_col1', 'new_col2', 'new_col3', ....., new_coln]

# Define expression
expr = [pl.col('col1').str.split('/').arr.get(i).alias(col)
        for i, col in enumerate(new_cols)]

# Apply expression
df.with_columns(expr)
In polars (as in pandas), I want to interchange/assign values between columns, row by row:
for i in range(len(k2)):
k2['column1'][i] == k2['column2'][i]
k2['column3'][i] == k2['column4'][i]
You can use alias to copy & rename columns:
import polars as pl
k2 = pl.DataFrame({"column1": [1,2,3],
"column2": [4,5,6],
"column3": [7,8,9],
"column4": [10,11,12]})
┌─────────┬─────────┬─────────┬─────────┐
│ column1 ┆ column2 ┆ column3 ┆ column4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╡
│ 1 ┆ 4 ┆ 7 ┆ 10 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 5 ┆ 8 ┆ 11 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 6 ┆ 9 ┆ 12 │
└─────────┴─────────┴─────────┴─────────┘
k2.with_columns([pl.col("column2").alias("column1"), pl.col("column4").alias("column3")])
which prints out
┌─────────┬─────────┬─────────┬─────────┐
│ column1 ┆ column2 ┆ column3 ┆ column4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╡
│ 4 ┆ 4 ┆ 10 ┆ 10 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 5 ┆ 11 ┆ 11 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 6 ┆ 12 ┆ 12 │
└─────────┴─────────┴─────────┴─────────┘