Python Polars: Update one dataframe with another (like the Pandas .update() function)

I have two dataframes. The first contains only zeros and the second contains actual values. I want to update the first dataframe with values from the second, like the Pandas .update() function does.
Here is the sample dataframe I am using for illustration. These dataframes represent research databases and are used to tabulate results.
import numpy as np
import polars as pl

# Named `records` rather than `dict` to avoid shadowing the built-in.
records = {
    'state': ['state 1', 'state 2', 'state 3', 'state 4', 'state 5', 'state 6', 'state 7', 'state 8', 'state 9', 'state 10'],
    'development': ['Low', 'Medium', 'Low', 'Medium', 'High', 'Low', 'Medium', 'Medium', 'Low', 'Medium'],
    'investment': ['50-500MN', '<50MN', '<50MN', '<50MN', '500MN+', '50-500MN', '<50MN', '50-500MN', '<50MN', '<50MN'],
    'population': [22, 19, 25, 24, 19, 21, 33, 36, 22, 36],
    'gdp': [18, 19, 29, 23, 22, 19, 35, 18, 26, 27]
}
df = pl.DataFrame(records)
df.head()
A table may be generated from a filtered dataframe, in which case some rows and/or columns may be missing because they have zero records. I do not want the code to suppress rows or columns with blank values, so I create a blank dataframe 'tabstr' using the following code:
row = df['development'].unique().to_list() # Faster than pl.col twice
col = df['investment'].unique().to_list()
data = np.zeros( (len(row), len(col)), float)
tabstr = pl.concat([
    pl.DataFrame({'development': row}),
    pl.DataFrame(data, schema=col)], how='horizontal')
Now I create a pivot table on a filtered dataframe
df2 = df.filter(pl.col('development') != 'High')
_ct = df2.pivot(index='development', columns='investment', values='gdp')
I am using the below code to update the blank table tabstr using pivot table _ct
(
    tabstr
    .join(_ct, on="development", how="left", suffix="_right")
    .with_columns(
        pl.coalesce([pl.col("<50MN_right"), pl.col("<50MN")]).alias("<50MN")
    )
    .drop("<50MN_right")
)
The above code updates a single column. How can I loop through the columns of tabstr and update them from the _ct columns, which carry the '_right' suffix after the join?

Perhaps this code could move you closer to your answer.
Let's write two generic functions for update. First, one in Lazy mode:
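from typing import Sequence

import polars as pl


def update_lazy(self: pl.LazyFrame,
                updt_df: pl.LazyFrame,
                on: str | Sequence[str]) -> pl.LazyFrame:
    if isinstance(on, str):
        on = [on]
    # Columns present in both frames, excluding the join keys.
    common = (set(self.columns) & set(updt_df.columns)) - set(on)
    result = (
        self
        .join(
            updt_df,
            on=on,
            how="left"
        )
        # Prefer the updating value; fall back to the original when it is null.
        # Each coalesced column keeps the "<name>_right" name at this point.
        .with_columns([
            pl.coalesce([col_nm + "_right", pl.col(col_nm)])
            for col_nm in common
        ])
        # Drop the originals, then give the coalesced columns their names back.
        .drop([col_nm for col_nm in common])
        .rename({col_nm + "_right": col_nm for col_nm in common})
    )
    return result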
And one in eager mode (that simply calls the Lazy mode):
def update_eager(self: pl.DataFrame,
                 updt_df: pl.DataFrame,
                 on: str | Sequence[str]) -> pl.DataFrame:
    return self.lazy().update(updt_df.lazy(), on).collect()
I'll next assign them as methods to LazyFrame and to DataFrame. (In your own code, you probably should use the namespace functionality in Polars -- but that's beyond the scope of this question.)
pl.DataFrame.update = update_eager
pl.LazyFrame.update = update_lazy
Now we can call our functions (in either Lazy or eager mode) as methods:
tabstr.update(_ct, on='development')
shape: (3, 4)
┌─────────────┬────────┬──────────┬───────┐
│ development ┆ 500MN+ ┆ 50-500MN ┆ <50MN │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞═════════════╪════════╪══════════╪═══════╡
│ High ┆ 0.0 ┆ 0.0 ┆ 0.0 │
│ Low ┆ 0.0 ┆ 18.0 ┆ 29.0 │
│ Medium ┆ 0.0 ┆ 18.0 ┆ 19.0 │
└─────────────┴────────┴──────────┴───────┘
You may need to tweak this code to suit your needs, but perhaps it will get you moving in the right direction.
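If it helps to check the result by hand, the "left join, then coalesce" idea can be mirrored with plain dictionaries. This is only a sketch of the logic, not Polars code: the names tabstr_rows, ct_rows and update_rows are made up for illustration, and the values are copied from the output table above.

```python
# Plain-Python sketch of "left join, then coalesce"; not Polars code.
# Values mirror the example: the blank table is all zeros, the pivot holds gdp.
columns = ["500MN+", "50-500MN", "<50MN"]

tabstr_rows = {
    "High": {c: 0.0 for c in columns},
    "Low": {c: 0.0 for c in columns},
    "Medium": {c: 0.0 for c in columns},
}

# The pivot on the filtered frame has no "High" row and no "500MN+" column.
ct_rows = {
    "Low": {"50-500MN": 18, "<50MN": 29},
    "Medium": {"50-500MN": 18, "<50MN": 19},
}


def update_rows(base, updates):
    """Return a copy of base in which non-missing update values win."""
    result = {}
    for key, base_row in base.items():
        # Left join: keys without a match simply contribute no values.
        update_row = updates.get(key, {})
        result[key] = {
            # Coalesce: take the update when present, else keep the old value.
            col: float(update_row[col]) if update_row.get(col) is not None else old
            for col, old in base_row.items()
        }
    return result


updated = update_rows(tabstr_rows, ct_rows)
for key, row in updated.items():
    print(key, row)
```

Every row and column of the blank table survives, which is the point of starting from tabstr rather than from the pivot.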

Related

Polars upsample/downsample and interpolate only small gaps

I have a time-series dataset that needs to be interpolated such that any gaps of more than 3 minutes are left as null values.
The problem I'm facing is that Polars upsample produces a lot of nulls even when there is data close to the time period. Here's a snippet of the dataframe.
utc gnd_p gnd_t app_sza azimuth xh2o xair xco2 xch4 xco xch4_s5p
0 2022-06-04 04:49:31 955.081699 293.84 77.009159 -109.292040 4118.807354 0.996515 421.510185 1.878339 0.0 0.0
1 2022-06-04 04:49:46 955.081655 293.84 76.971435 -109.250593 4119.081639 0.996508 421.543444 1.878761 0.0 0.0
Here's the Pandas code for the same operation:
output = sensor_dataframe.sort_values(by=['utc']) # sort according to time
output['utc'] = pd.to_datetime(output['utc'])
# Apply smoothing function for all data columns.
for column in output.columns[1::]:
    output[column] = scipy.signal.savgol_filter(pd.to_numeric(output[column]), 31, 3)
print(output)
output = output.set_index('utc')
output.index = pd.to_datetime(output.index)
output = output.resample(sampling_rate).mean()
sampling_delta = pd.to_timedelta(sampling_rate)
# The interpolating limit is dependent on the sampling rate.
interpolating_limit = int(MAX_DELTA_FOR_INTERPOLATION / sampling_delta)
if interpolating_limit != 0:
    output.interpolate(
        limit=interpolating_limit,
        inplace=True,
        limit_direction='both',
        limit_area='inside',
    )
Here's the output at a 10-second sampling rate.
gnd_p gnd_t app_sza azimuth xh2o xair xco2 xch4 xco xch4_s5p
utc
2022-06-04 04:49:30 955.081699 293.84 77.009159 -109.292040 4118.807354 0.996515 421.510185 1.878339 0.0 0.0
2022-06-04 04:49:40 955.081655 293.84 76.971435 -109.250593 4119.081639 0.996508 421.543444 1.878761 0.0 0.0
Here's the same attempt at a Polars version.
df = pl.from_pandas(sensor_dataframe)
q = df.lazy().with_column(pl.col('utc').str.strptime(pl.Datetime, fmt='%F %T').cast(pl.Datetime)).select([pl.col('utc'),
pl.exclude('utc').map(lambda x: savgol_filter(x.to_numpy(), 31, 3)).explode()])
df = q.collect()
df = df.upsample(time_column="utc", every="10s")
Here's the output of the above snippet:
│ 2022-06-04 04:49:31 ┆ 955.081699 ┆ 293.84 ┆ 77.009159 ┆ ... ┆ 421.510185 ┆ 1.878339 ┆ 0.0 ┆ 0.0 │
│ 2022-06-04 04:49:41 ┆ null ┆ null ┆ null ┆ ... ┆ null ┆ null ┆ null ┆ null │
│ 2022-06-04 04:49:51 ┆ null ┆ null ┆ null ┆ ... ┆ null ┆ null ┆ null ┆ null │
Polars just spits out a dataframe with a lot of nulls. I would have to interpolate to fill the values, but that would mean interpolating the entire dataset. Unfortunately, Polars provides no arguments or parameters on interpolate() to limit this, so every series gets interpolated in full, which is not the desired behavior.
I think the solution should have something to do with masks. Does anyone have experience with Polars and interpolation?
Reproducible code: https://pastebin.com/gQ1WU4zp
sample csv data: https://0bin.net/paste/3fX2AOM2#uQmEv2KvBK5Xk-2vuWxx2z0QgXlttdnaa78eFt8ra62
I'm not going to download your whole dataset, so let's use this as an example:
from datetime import datetime

import numpy as np
import polars as pl

np.random.seed(0)
df = pl.DataFrame(
    {
        "time": pl.date_range(
            low=datetime(2023, 2, 1),
            high=datetime(2023, 2, 2),
            interval="1m"),
        'data': list(np.random.choice([None, 1, 2, 3, 4], size=1441))
    }).filter(~pl.col('data').is_null())
By definition, upsample doesn't interpolate; as you've discovered, it just inserts a bunch of nulls to match the periods you want.
If you only want to interpolate when the pre-upsample gap is 3 minutes or less, make a helper column before the upsample.
Then use when/then/otherwise on the helper column to decide whether or not to interpolate.
(
    df
    .with_columns(
        # True when the gap to the previous sample is 3 minutes or less
        (pl.col('time') - pl.col('time').shift() <= pl.duration(minutes=3)).alias('small_gap'))
    .upsample(time_column="time", every="10s")
    # Propagate each flag back over the rows inserted before its sample
    .with_columns(pl.col('small_gap').backward_fill())
    .with_columns(
        pl.when(pl.col('small_gap'))
        .then(pl.exclude(['small_gap']).interpolate())
        .otherwise(pl.exclude(['small_gap'])))
    .select(pl.exclude('small_gap'))
)
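To see what "interpolate only across small gaps" means independently of Polars, here is a rough plain-Python sketch of the same idea. The function name upsample_small_gaps and the sample values are made up for illustration, and unlike upsample the grid simply starts at the first sample.

```python
from datetime import datetime, timedelta


def upsample_small_gaps(samples, every, max_gap):
    """Resample sorted (time, value) pairs onto a regular grid.

    Grid points between two samples are linearly interpolated only when
    those samples are at most max_gap apart; otherwise they stay None.
    """
    out = []
    t = samples[0][0]
    end = samples[-1][0]
    i = 0  # index of the last sample at or before t
    while t <= end:
        while i + 1 < len(samples) and samples[i + 1][0] <= t:
            i += 1
        t0, v0 = samples[i]
        if t == t0:
            out.append((t, float(v0)))  # an original sample, kept as-is
        else:
            t1, v1 = samples[i + 1]
            if t1 - t0 <= max_gap:
                frac = (t - t0) / (t1 - t0)
                out.append((t, v0 + frac * (v1 - v0)))
            else:
                out.append((t, None))  # gap too large: leave it empty
        t += every
    return out


samples = [
    (datetime(2023, 2, 1, 10, 0), 1.0),
    (datetime(2023, 2, 1, 10, 2), 3.0),   # 2-minute gap: filled
    (datetime(2023, 2, 1, 10, 10), 11.0),  # 8-minute gap: left empty
]
grid = upsample_small_gaps(samples, timedelta(minutes=1), timedelta(minutes=3))
values = [v for _, v in grid]
print(values)
```

The helper column in the Polars version plays the role of the `t1 - t0 <= max_gap` check here: it records, before upsampling, which gaps are allowed to be filled.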

Polars Looping through the rows in a dataset

I am trying to loop through a Polars recordset using the following code:
import polars as pl
mydf = pl.DataFrame(
{"start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
"Name": ["John", "Joe", "James"]})
print(mydf)
│start_date ┆ Name │
│ --- ┆ --- │
│ str ┆ str │
╞════════════╪═══════╡
│ 2020-01-02 ┆ John │
│ 2020-01-03 ┆ Joe │
│ 2020-01-04 ┆ James │
for row in mydf.rows():
    print(row)
('2020-01-02', 'John')
('2020-01-03', 'Joe')
('2020-01-04', 'James')
Is there a way to specifically reference 'Name' using the named column as opposed to the index? In Pandas this would look something like:
import pandas as pd
mydf = pd.DataFrame(
{"start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
"Name": ["John", "Joe", "James"]})
for index, row in mydf.iterrows():
    mydf['Name'][index]
'John'
'Joe'
'James'
You can specify that you want the rows to be named:
for row in mydf.rows(named=True):
    print(row)
It will give you a dict:
{'start_date': '2020-01-02', 'Name': 'John'}
{'start_date': '2020-01-03', 'Name': 'Joe'}
{'start_date': '2020-01-04', 'Name': 'James'}
You can then access row['Name'].
Note that:
- Previous versions returned a namedtuple instead of a dict.
- It's less memory-intensive to use iter_rows.
- Overall, it's not recommended to iterate through the data this way. As the Polars documentation puts it: "Row iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods."
You would use select for that:
names = mydf.select(['Name'])
for row in names:
    print(row)

Sorting a dataframe with multiple columns using Python and Polars

I have a data sorting problem where the original data consists of three 'blocks' containing a 'parent' row and two 'children' rows. A minimum working example looks like this:
import polars as pl
df_original = pl.DataFrame(
{
'Direction': ["Buy", "Sell", "Buy", "Sell", "Sell", "Buy"],
'Order ID': [None, '123_1', '123_0', None, '456_1', '456_0'],
'Parent Order ID': [123, None, None, 456, None, None],
}
)
I would like to order these based on the parent row. If the parent is a 'Buy', the next row should be the 'Sell' child order and the third row should be the 'Buy' child order.
For a parent 'Sell' order, it needs to be followed by the 'Buy' order and then the 'Sell' order.
I have tried it with polars.sort(), but I am missing a piece of logic and can't work out what it is.
The final result should look like this:
df_sorted = pl.DataFrame(
{
'Direction': ["Buy", "Sell", "Buy", "Sell", "Buy", "Sell"],
'Order ID': [None, '123_1', '123_0', None, '456_0', '456_1'],
'Parent Order ID': [123, None, None, 456, None, None],
}
)
If I understand the question correctly you want to alternate the order of "Buy"/"Sell".
This snippet produces your desired output.
df = pl.DataFrame(
{
'Direction': ["Buy", "Sell", "Buy", "Sell", "Sell", "Buy"],
'Order ID': [None, '123_1', '123_0', None, '456_1', '456_0'],
'Parent Order ID': [123, None, None, 456, None, None],
}
)
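# shift() leaves a null in the first row. In Polars versions where comparing
# against null yields null, both filters would drop that row, so treat it as True.
consecutive = (pl.col("Direction") != pl.col("Direction").shift()).fill_null(True)
df.filter(consecutive).vstack(df.filter(~consecutive))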
shape: (6, 3)
┌───────────┬──────────┬─────────────────┐
│ Direction ┆ Order ID ┆ Parent Order ID │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═══════════╪══════════╪═════════════════╡
│ Buy ┆ null ┆ 123 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Sell ┆ 123_1 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Buy ┆ 123_0 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Sell ┆ null ┆ 456 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Buy ┆ 456_0 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Sell ┆ 456_1 ┆ null │
└───────────┴──────────┴─────────────────┘
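The filter-and-vstack trick is essentially a stable partition, which can be sketched in plain Python to make the logic easier to follow. The tuples below are the sample rows from the question; the variable names are just for illustration.

```python
rows = [
    ("Buy", None, 123),
    ("Sell", "123_1", None),
    ("Buy", "123_0", None),
    ("Sell", None, 456),
    ("Sell", "456_1", None),
    ("Buy", "456_0", None),
]

# True where the direction differs from the previous row
# (the first row has no predecessor and counts as a change).
changed = [i == 0 or rows[i][0] != rows[i - 1][0] for i in range(len(rows))]

# Stable partition: rows that change direction first, repeated directions after.
reordered = (
    [row for row, flag in zip(rows, changed) if flag]
    + [row for row, flag in zip(rows, changed) if not flag]
)

for row in reordered:
    print(row)
```

Note that the repeated-direction rows are appended at the very end. That reproduces the desired order for this sample because the only repeat sits in the last block; with more blocks you would probably need a per-block sort key instead.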

How to use a polars column with offset string to add to another date column

Suppose you have
df=pl.DataFrame(
{
"date":["2022-01-01", "2022-01-02"],
"hroff":[5,2],
"minoff":[1,2]
}).with_column(pl.col('date').str.strptime(pl.Date,"%Y-%m-%d"))
and you want to make a new column that adds the hour and min offsets to the date column. The only thing I saw was the dt.offset_by method. I made an extra column
df=df.with_column((pl.col('hroff')+"h"+pl.col('minoff')+"m").alias('offset'))
and then tried
df.with_column(pl.col('date') \
.cast(pl.Datetime).dt.with_time_zone('UTC') \
.dt.offset_by(pl.col('offset')).alias('newdate'))
but that doesn't work because dt.offset_by only takes a fixed string, not another column.
What's the best way to do that?
Use pl.duration:
import polars as pl
df = pl.DataFrame({
"date": pl.Series(["2022-01-01", "2022-01-02"]).str.strptime(pl.Datetime(time_zone="UTC"), "%Y-%m-%d"),
"hroff": [5, 2],
"minoff": [1, 2]
})
print(df.select(
pl.col("date") + pl.duration(hours=pl.col("hroff"), minutes=pl.col("minoff"))
))
shape: (2, 1)
┌─────────────────────┐
│ date │
│ --- │
│ datetime[μs] │
╞═════════════════════╡
│ 2022-01-01 05:01:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-02 02:02:00 │
└─────────────────────┘
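As a quick cross-check of the values above, the same per-row offsets can be computed with the standard library's timedelta. This is only a sanity check on the arithmetic, not a replacement for pl.duration, and it ignores the time zone.

```python
from datetime import datetime, timedelta

rows = [
    {"date": datetime(2022, 1, 1), "hroff": 5, "minoff": 1},
    {"date": datetime(2022, 1, 2), "hroff": 2, "minoff": 2},
]

# Each row gets its own offset, built from that row's hour and minute values.
shifted = [
    row["date"] + timedelta(hours=row["hroff"], minutes=row["minoff"])
    for row in rows
]

for value in shifted:
    print(value)
```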

Polars - How can I make multiple joins across multiple DataFrames, examples included

import polars as pl
#Auctiondata which is used to create the AuctionDF
auctiondata = {"AuctionId": [2095293259, 2096131235, 2094319272, 2094265820, 2094902378, 2096005275],
"Bid": [9499998, 8499998, 8500000 , 1400832, 1400000, 872],
"Buyout": [9499998, 9499998, 8500000, 1450832, 1500000, 900],
"Quantity": [1, 1, 1, 1, 1, 1],
"Time_Left": ['Short', 'Very long', 'Long', 'Short', 'Long', 'Long'],
"ItemId": [24655, 24648, 3184, 14187,6580,1482],
"ItemRand": [-39, -19, 24, 2032, 1020,None],
"ItemSeed": [886505522, 483524644, 384031104, 1995900544,2119510144,None],
"SuffixFactor": [50, 36, 55680, 1664, 10368,None],
"Faction": ['Horde', 'Alliance', 'Alliance', 'Alliance', 'Horde','Horde'],
"RealmName": ['Mograine', 'Bloodfang', 'Firemaw', 'Firemaw', 'Mograine','Mograine'],
"BuyoutGold": ['45', '77', '24', '39', '120','42'],
"BuyoutSilver": ['40', '44', '12', '33', '12','51'],
"BuyoutCopper": ['12', '11', '21', '52', '32','42'],
"BidGold": ['39', '12', '11', '27', '99','23'],
"BidSilver": ['32', '14', '44', '12', '4','42'],
"BidCopper": ['21', '12', '32', '12', '45','33']
}
#itemsData which is used to create the itemsDF
itemsData = {"ID": [24655,24648,3184,14187,6580],
"Display_lang" : ['Consortium Robe','Astralaan Gloves','Hook Dagger','Raincaller Cuffs','Defender Tunic']}
#suffixData which is used to create the suffixDF
suffixData = {"ID": [19, 39],
"Name_lang": ['of Intellect', 'of the Invoker'],
"Enchantment[0]": [2804, 2804],
"Enchantment[1]": [0, 2824],
"Enchantment[2]": [0, 2822],
"Enchantment[3]": [0, 0],
"Enchantment[4]": [0, 0],
"AllocationPct[0]": [10000, 5259],
"AllocationPct[1]": [None, 6153],
"AllocationPct[2]": [None, 5259],
"AllocationPct[3]": [None, None],
"AllocationPct[4]": [None, None]
}
#propertiesData which is used to create the propertiesDF
propertiesData = {"ID": [24, 1020, 2032],
"Name_lang": ['of Strength', 'of the Whale', 'of Healing'],
"Enchantment[0]": [70, 98, 2312],
"Enchantment[1]": [0, 103, 0],
"Enchantment[2]": [0, 0, 0],
"Enchantment[3]": [0, 0, 0],
"Enchantment[4]": [0, 0, 0]
}
#enchantmentsData which is used to create the enchantmentDF
enchantmentsData = {"ID": [70, 98, 103, 2312, 2804, 2822, 2824],
"Name_lang" : ['+3 Strength','+4 Spirit','+5 Stamina','+7 Spell Power','+$i Intellect', '+$i Critical Strike Rating', '+$i Spell Power']}
#resultData which is used in order to create the resultDF
resultData = {"AuctionId" : [2095293259, 2096131235, 2094319272, 2094265820, 2094902378, 2096005275],
"ItemId" : [24655, 24648, 3184, 14187, 6580, 1482],
"ItemName" : ['Consortium Robe','Astralaan Gloves','Hook Dagger','Raincaller Cuffs','Defender Tunic',''],
"RealmName" : ['Mograine', 'Bloodfang', 'Firemaw', 'Firemaw', 'Mograine','Mograine'],
"Faction" : ['Horde', 'Alliance', 'Alliance', 'Alliance', 'Horde','Horde'],
"EnchantmentName" : ['of the Invoker','of Intellect','of Strength','of Healing','of the Whale',''],
"Stat0" : ['+26 Intellect','+36 Intellect','+3 Strength','+13 Healing Spells and +5 Damage Spells','+4 Spirit',''],
"Stat1" : ['+30 Spell Damage and Healing','','','','+5 Stamina',''],
"Stat2" : ['+26 Spell Critical Strike Rating','','','','',''],
"Stat3" : ['','','','','',''],
"Stat4" : ['','','','','',''],
"BuyoutGold" : ['45', '77', '24', '39', '120','42'],
"BuyoutSilver" : ['40', '44', '12', '33', '12','51'],
"BuyoutCopper" : ['12', '11', '21', '52', '32','42'],
"BidGold" : ['39', '12', '11', '27', '99','23'],
"BidSilver" : ['32', '14', '44', '12', '4','42'],
"BidCopper" : ['21', '12', '32', '12', '45','33']}
#"Main" DF
auctionDF = pl.DataFrame(auctiondata)
#The ID column of the DataFrame below refers to "ItemId" in auctionDF. The ItemId from auctionDF is not ALWAYS present in itemsDF, but 99% of the time it is.
itemsDF = pl.DataFrame(itemsData)
#All negative ItemRands (lower than 0) from auctionDF refer to the ID column of suffixDF, so I imagine one of the first things to do is to make the ID column in suffixDF negative for a later join?
suffixDF = pl.DataFrame(suffixData)
#All positive ItemRands (larger than 0) from auctionDF refer to the ID column of propertiesDF
propertiesDF = pl.DataFrame(propertiesData)
#All the various "Enchantment[X]" columns from suffixDF and propertiesDF refer to the ID column of enchantmentsDF
enchantmentsDF = pl.DataFrame(enchantmentsData)
#The reason ItemName is blank for ItemId 1482 is that it does not exist in itemsDF
resultDF = pl.DataFrame(resultData)
print(resultDF)
resultDF is the result I want to obtain.
So basically we have the main DataFrame, auctionDF:
ItemId refers to the ID column in itemsDF, whose Display_lang column provides the "ItemName" column in resultDF.
ItemRand refers to either propertiesDF (if ItemRand is positive) or suffixDF (if ItemRand is negative).
From propertiesDF we need the "Name_lang" column, which gives us the "EnchantmentName" column in resultDF.
We also need the Enchantment[0-4] columns from propertiesDF. These columns contain IDs that refer to enchantmentsDF, from which we need the Name_lang column.
In resultDF you will notice there are five StatX columns (Stat0-4); these contain the Name_lang values from enchantmentsDF.
Lastly, if the ItemRand from auctionDF is negative, we go to suffixDF instead, as a negative ItemRand refers to the ID column in suffixDF.
In suffixDF you will also find Enchantment[0-4], as in propertiesDF, and those also refer to the ID column in enchantmentsDF. However, you will quickly notice that the Name_lang for those values looks a bit different, such as "+$i Intellect". This is because we have to calculate the $i value ourselves.
suffixDF also has five additional columns, AllocationPct[0-4]. Each AllocationPct[0-4] that has a value is first divided by 10000 and then multiplied by the SuffixFactor.
So for negative rands (suffixDF), Enchantment0 and AllocationPct0 give Stat0, Enchantment1 and AllocationPct1 give Stat1, and so on.
Example of calculating an item with a negative rand:
ItemId 24655 from auctionDF has ItemRand -39 and SuffixFactor 50.
We take the -39 and look it up in suffixDF (ID column), where we see Enchantment0 = 2804.
We look up 2804 in the ID column of enchantmentsDF, which has Name_lang = +$i Intellect.
Now we calculate $i by looking at AllocationPct0 in suffixDF, which is 5259.
Then we divide 5259 by 10000 (5259/10000 = 0.5259).
Now we multiply 0.5259 by the SuffixFactor (50) and take the floor (0.5259*50 = 26.295, floor(26.295) = 26).
Now we have Stat0 = +26 Intellect.
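The calculation above can be written out as a short plain-Python check. The helper name resolve is made up for illustration; the templates and AllocationPct values are the ones for suffix ID 39 in the sample data.

```python
import math

suffix_factor = 50  # SuffixFactor of ItemId 24655 in the auction data

# (Name_lang template, AllocationPct) pairs for suffix ID 39, from the sample data
stats = [
    ("+$i Intellect", 5259),
    ("+$i Spell Power", 6153),
    ("+$i Critical Strike Rating", 5259),
]


def resolve(template, allocation_pct, factor):
    """Divide by 10000, multiply by the factor, floor, and fill in $i."""
    value = math.floor(allocation_pct / 10_000 * factor)
    return template.replace("$i", str(value))


resolved = [resolve(template, pct, suffix_factor) for template, pct in stats]
for stat in resolved:
    print(stat)
```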
I might have made slight mistakes in the data as I typed all this out manually.
If there are any questions or you think I might have made a mistake, feel free to ask.
Best regards
We'll take this in three steps.
Step 1: Normalize and Stack propertiesDF and suffixDF
When you see columns like Enchantment[0], Enchantment[1], etc. and lots of "0" or "null" values, you're essentially working with a database in non-normal form. This can make calculations awkward.
So, as our first step, we'll use melt to put each Enchantment and Allocation in its own row for each ItemRand. In general, melt is used to convert a "wide format" DataFrame (few rows, many columns) into a "long format" DataFrame (many rows, few columns).
We'll use hstack to horizontally stack the Enchantment and Allocation columns next to each other, and use concat to vertically stack the propertiesDF and suffixDF DataFrames after they are melted. We'll also change the sign of the ItemRand values in suffixDF.
As a final step, we'll join with the enchantmentsDF - to prepare for calculating the Stat values for items that came from suffixDF in the next step.
This step may seem intimidating at first glance, but you'll find that the code is rather repetitive: melting and stacking. I've laid out the code so that you can comment out sections and lines to follow the development of the algorithm.
You'll also note that I rename variables along the way, just to keep the code clean and understandable.
In the end, you'll see a clean, tidy DataFrame derived from suffixDF, propertiesDF, and enchantmentsDF. This is the goal.
enchantment_ids = (
pl.concat(
[
(
suffixDF.rename(
{f"Enchantment[{nbr}]": f"Stat{nbr}" for nbr in range(
0, 5)}
)
.melt(
id_vars=["ID", "Name_lang"],
value_vars=[f"Stat{nbr}" for nbr in range(0, 5)],
value_name="EnchantmentID",
variable_name="Stat_Nbr",
)
.hstack(
suffixDF.melt(
id_vars=None,
value_vars=[
f"AllocationPct[{nbr}]" for nbr in range(0, 5)],
value_name="Allocation",
).drop("variable"),
)
.filter(pl.col('EnchantmentID') != 0)
.select(
[
-pl.col("ID").alias("ItemRand"),
pl.col("Name_lang").alias("EnchantmentName"),
"EnchantmentID",
"Allocation",
"Stat_Nbr",
]
)
),
(
propertiesDF.rename(
{f"Enchantment[{nbr}]": f"Stat{nbr}" for nbr in range(
0, 5)}
).melt(
id_vars=["ID", "Name_lang"],
value_vars=[f"Stat{nbr}" for nbr in range(0, 5)],
value_name="EnchantmentID",
variable_name="Stat_Nbr",
)
.filter(pl.col('EnchantmentID') != 0)
.select(
[
pl.col("ID").alias("ItemRand"),
pl.col("Name_lang").alias("EnchantmentName"),
"EnchantmentID",
"Stat_Nbr",
]
)
),
],
how="diagonal",
)
.sort(["ItemRand", "Stat_Nbr"])
.join(
enchantmentsDF.select(
[
pl.col("ID").alias("EnchantmentID"),
pl.col("Name_lang").alias("Enchantment_type"),
]
),
how="left",
on="EnchantmentID",
)
)
enchantment_ids
shape: (8, 6)
┌──────────┬─────────────────┬───────────────┬────────────┬──────────┬────────────────────────────┐
│ ItemRand ┆ EnchantmentName ┆ EnchantmentID ┆ Allocation ┆ Stat_Nbr ┆ Enchantment_type │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 ┆ str ┆ str │
╞══════════╪═════════════════╪═══════════════╪════════════╪══════════╪════════════════════════════╡
│ -39 ┆ of the Invoker ┆ 2804 ┆ 5259.0 ┆ Stat0 ┆ +$i Intellect │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ -39 ┆ of the Invoker ┆ 2824 ┆ 6153.0 ┆ Stat1 ┆ +$i Spell Power │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ -39 ┆ of the Invoker ┆ 2822 ┆ 5259.0 ┆ Stat2 ┆ +$i Critical Strike Rating │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ -19 ┆ of Intellect ┆ 2804 ┆ 10000.0 ┆ Stat0 ┆ +$i Intellect │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 24 ┆ of Strength ┆ 70 ┆ null ┆ Stat0 ┆ +3 Strength │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1020 ┆ of the Whale ┆ 98 ┆ null ┆ Stat0 ┆ +4 Spirit │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1020 ┆ of the Whale ┆ 103 ┆ null ┆ Stat1 ┆ +5 Stamina │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2032 ┆ of Healing ┆ 2312 ┆ null ┆ Stat0 ┆ +7 Spell Power │
└──────────┴─────────────────┴───────────────┴────────────┴──────────┴────────────────────────────┘
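If the wide-to-long reshaping in Step 1 is hard to picture, here is a rough plain-Python sketch of what a melt does. The function below is a hypothetical stand-in written for illustration, not the Polars implementation, and it uses only the first two Enchantment columns of two propertiesDF rows.

```python
# Wide rows shaped like propertiesDF (only the first two Enchantment columns).
wide_rows = [
    {"ID": 24, "Name_lang": "of Strength", "Enchantment[0]": 70, "Enchantment[1]": 0},
    {"ID": 1020, "Name_lang": "of the Whale", "Enchantment[0]": 98, "Enchantment[1]": 103},
]


def melt(rows, id_vars, value_vars, variable_name, value_name):
    """Emit one long row per (value column, input row) pair."""
    long_rows = []
    for var in value_vars:  # this sketch emits all rows for one column, then the next
        for row in rows:
            long_row = {key: row[key] for key in id_vars}
            long_row[variable_name] = var
            long_row[value_name] = row[var]
            long_rows.append(long_row)
    return long_rows


long_rows = melt(
    wide_rows,
    id_vars=["ID", "Name_lang"],
    value_vars=["Enchantment[0]", "Enchantment[1]"],
    variable_name="Stat_Nbr",
    value_name="EnchantmentID",
)
# Drop the zero placeholders, as the filter in Step 1 does.
long_rows = [row for row in long_rows if row["EnchantmentID"] != 0]

for row in long_rows:
    print(row)
```

Each remaining row now describes exactly one enchantment, which is what makes the later join against enchantmentsDF straightforward.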
Step 2: Calculate Stats and then pivot
In the next step, we will calculate all Stats for the Items that had negative ItemRand. We'll need to join with the auctionDF to get the SuffixFactor for the calculations. You'll notice the familiar replace function.
Once everything is calculated, we'll de-normalize the DataFrame using pivot. One way to think of pivot is that pivot is the opposite of melt. pivot converts a "long format" (many rows, few columns) to a "wide format" DataFrame (few rows, many columns.) pivot will create the Stat0, Stat1, and Stat2 columns.
Note: Stat3 and Stat4 are not created because they are not needed. This reduces the width of your final DataFrame. (At the end, I'll show you how to keep Stat3 and Stat4 if you need them.)
The result again is a neat, tidy DataFrame of Stats for each AuctionID.
auction_enchantments = (
auctionDF.select(
[
"AuctionId",
"ItemRand",
"SuffixFactor",
]
)
.filter(pl.col("ItemRand").is_not_null())
.join(enchantment_ids, on="ItemRand", how="left")
.with_columns(
[
pl.when(pl.col("Allocation").is_null())
.then(pl.col("Enchantment_type"))
.otherwise(
pl.col("Enchantment_type").str.replace(
r"\$i",
(pl.col("Allocation") * pl.col("SuffixFactor") / 10_000)
.floor()
.cast(pl.Int64)
.cast(pl.Utf8),
)
)
]
)
.pivot(
index=["AuctionId", "EnchantmentName"],
values="Enchantment_type",
columns="Stat_Nbr",
)
)
auction_enchantments
shape: (5, 5)
┌────────────┬─────────────────┬────────────────┬─────────────────┬────────────────────────────┐
│ AuctionId ┆ EnchantmentName ┆ Stat0 ┆ Stat1 ┆ Stat2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str │
╞════════════╪═════════════════╪════════════════╪═════════════════╪════════════════════════════╡
│ 2095293259 ┆ of the Invoker ┆ +26 Intellect ┆ +30 Spell Power ┆ +26 Critical Strike Rating │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2096131235 ┆ of Intellect ┆ +36 Intellect ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2094319272 ┆ of Strength ┆ +3 Strength ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2094265820 ┆ of Healing ┆ +7 Spell Power ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2094902378 ┆ of the Whale ┆ +4 Spirit ┆ +5 Stamina ┆ null │
└────────────┴─────────────────┴────────────────┴─────────────────┴────────────────────────────┘
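The long-to-wide direction used in Step 2 can be sketched the same way. Again, this pivot function is a hypothetical plain-Python stand-in for illustration (single index column, no aggregation), fed with a few of the values from the table above.

```python
long_rows = [
    {"AuctionId": 2095293259, "Stat_Nbr": "Stat0", "value": "+26 Intellect"},
    {"AuctionId": 2095293259, "Stat_Nbr": "Stat1", "value": "+30 Spell Power"},
    {"AuctionId": 2094902378, "Stat_Nbr": "Stat0", "value": "+4 Spirit"},
    {"AuctionId": 2094902378, "Stat_Nbr": "Stat1", "value": "+5 Stamina"},
    {"AuctionId": 2094319272, "Stat_Nbr": "Stat0", "value": "+3 Strength"},
]


def pivot(rows, index, columns, values):
    """Turn each distinct value of `columns` into its own output column."""
    new_cols = []
    for row in rows:
        if row[columns] not in new_cols:
            new_cols.append(row[columns])
    wide = {}
    for row in rows:
        # One output row per index value; cells without data stay None.
        out = wide.setdefault(
            row[index], {index: row[index], **{col: None for col in new_cols}}
        )
        out[row[columns]] = row[values]
    return list(wide.values())


wide_rows = pivot(long_rows, index="AuctionId", columns="Stat_Nbr", values="value")
for row in wide_rows:
    print(row)
```

Only the Stat columns that actually occur in the long data are created, which mirrors why Stat3 and Stat4 are absent from the result above.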
Step 3: Putting it all together
Our goal all along was to create neat, tidy DataFrames so that this last step is simple. We simply join our auctionDF with our calculated enchantments from Step 2.
Much of this step is just cleanup: ordering the columns the way we want. (Note the slick use of regular expressions in col.) Also, we fill null values with empty strings "", for a tidier look.
(
auctionDF.join(
itemsDF.rename({"ID": "ItemId", "Display_lang": "ItemName"}),
on="ItemId",
how="left",
)
.join(auction_enchantments, on="AuctionId", how="left")
.with_columns([pl.col(pl.Utf8).fill_null("")])
.select(
[
"AuctionId",
"ItemId",
"ItemName",
"RealmName",
"Faction",
"EnchantmentName",
pl.col("^Stat.*$"),
pl.col("^Buyout.+$"),
pl.col("^Bid.+$"),
]
)
)
shape: (6, 15)
┌────────────┬────────┬──────────────────┬───────────┬──────────┬─────────────────┬────────────────┬─────────────────┬────────────────────────────┬────────────┬──────────────┬──────────────┬─────────┬───────────┬───────────┐
│ AuctionId ┆ ItemId ┆ ItemName ┆ RealmName ┆ Faction ┆ EnchantmentName ┆ Stat0 ┆ Stat1 ┆ Stat2 ┆ BuyoutGold ┆ BuyoutSilver ┆ BuyoutCopper ┆ BidGold ┆ BidSilver ┆ BidCopper │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞════════════╪════════╪══════════════════╪═══════════╪══════════╪═════════════════╪════════════════╪═════════════════╪════════════════════════════╪════════════╪══════════════╪══════════════╪═════════╪═══════════╪═══════════╡
│ 2095293259 ┆ 24655 ┆ Consortium Robe ┆ Mograine ┆ Horde ┆ of the Invoker ┆ +26 Intellect ┆ +30 Spell Power ┆ +26 Critical Strike Rating ┆ 45 ┆ 40 ┆ 12 ┆ 39 ┆ 32 ┆ 21 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2096131235 ┆ 24648 ┆ Astralaan Gloves ┆ Bloodfang ┆ Alliance ┆ of Intellect ┆ +36 Intellect ┆ ┆ ┆ 77 ┆ 44 ┆ 11 ┆ 12 ┆ 14 ┆ 12 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2094319272 ┆ 3184 ┆ Hook Dagger ┆ Firemaw ┆ Alliance ┆ of Strength ┆ +3 Strength ┆ ┆ ┆ 24 ┆ 12 ┆ 21 ┆ 11 ┆ 44 ┆ 32 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2094265820 ┆ 14187 ┆ Raincaller Cuffs ┆ Firemaw ┆ Alliance ┆ of Healing ┆ +7 Spell Power ┆ ┆ ┆ 39 ┆ 33 ┆ 52 ┆ 27 ┆ 12 ┆ 12 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2094902378 ┆ 6580 ┆ Defender Tunic ┆ Mograine ┆ Horde ┆ of the Whale ┆ +4 Spirit ┆ +5 Stamina ┆ ┆ 120 ┆ 12 ┆ 32 ┆ 99 ┆ 4 ┆ 45 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2096005275 ┆ 1482 ┆ ┆ Mograine ┆ Horde ┆ ┆ ┆ ┆ ┆ 42 ┆ 51 ┆ 42 ┆ 23 ┆ 42 ┆ 33 │
└────────────┴────────┴──────────────────┴───────────┴──────────┴─────────────────┴────────────────┴─────────────────┴────────────────────────────┴────────────┴──────────────┴──────────────┴─────────┴───────────┴───────────┘
Other Notes
The results do not exactly match your resultDF. (But I think the calculated results above are correct.)
There are quite a few advanced steps used in the algorithm above. But if you take your time and work through the algorithm, you'll understand what is happening.
I've also provided links to Polars documentation for crucial methods (like pivot and melt).
The replace step depends on Polars version 0.14.4 or above, so upgrade your version of Polars.
If you need the Stat3 and Stat4 columns, just comment out the two lines that contain .filter(pl.col('EnchantmentID') != 0). This will cause the Stat3 and Stat4 columns to appear in the final step.
