Polars upsample/downsample and interpolate only small gaps - python
I have a time series dataset that needs to be interpolated such that any gaps of more than 3 minutes are left as null values.
The problem I'm facing is that Polars upsample produces a lot of nulls even when there is data close to the time period. Here's a snippet of the dataframe:
utc gnd_p gnd_t app_sza azimuth xh2o xair xco2 xch4 xco xch4_s5p
0 2022-06-04 04:49:31 955.081699 293.84 77.009159 -109.292040 4118.807354 0.996515 421.510185 1.878339 0.0 0.0
1 2022-06-04 04:49:46 955.081655 293.84 76.971435 -109.250593 4119.081639 0.996508 421.543444 1.878761 0.0 0.0
Here's the pandas code for the same operation:
output = sensor_dataframe.sort_values(by=['utc'])  # sort according to time
output['utc'] = pd.to_datetime(output['utc'])

# Apply the smoothing function to all data columns.
for column in output.columns[1:]:
    output[column] = scipy.signal.savgol_filter(pd.to_numeric(output[column]), 31, 3)
print(output)

output = output.set_index('utc')
output.index = pd.to_datetime(output.index)
output = output.resample(sampling_rate).mean()

sampling_delta = pd.to_timedelta(sampling_rate)
# The interpolation limit depends on the sampling rate.
interpolating_limit = int(MAX_DELTA_FOR_INTERPOLATION / sampling_delta)
if interpolating_limit != 0:
    output.interpolate(
        limit=interpolating_limit,
        inplace=True,
        limit_direction='both',
        limit_area='inside',
    )
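As a sanity check, the limit computation above can be worked through by hand. A minimal sketch, assuming MAX_DELTA_FOR_INTERPOLATION is 3 minutes (matching the question) and a 10-second sampling rate:

```python
from datetime import timedelta

import pandas as pd

# Assumed values: a 3-minute interpolation cap and a 10-second sampling rate.
MAX_DELTA_FOR_INTERPOLATION = timedelta(minutes=3)
sampling_delta = pd.to_timedelta("10s")

# 180 s / 10 s = 18: up to 18 consecutive resampled nulls may be filled.
interpolating_limit = int(MAX_DELTA_FOR_INTERPOLATION / sampling_delta)
print(interpolating_limit)  # 18
```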
Here's the output at a 10-second sampling rate:
gnd_p gnd_t app_sza azimuth xh2o xair xco2 xch4 xco xch4_s5p
utc
2022-06-04 04:49:30 955.081699 293.84 77.009159 -109.292040 4118.807354 0.996515 421.510185 1.878339 0.0 0.0
2022-06-04 04:49:40 955.081655 293.84 76.971435 -109.250593 4119.081639 0.996508 421.543444 1.878761 0.0 0.0
Here's my attempt at a Polars version:
df = pl.from_pandas(sensor_dataframe)
q = (
    df.lazy()
    .with_column(pl.col('utc').str.strptime(pl.Datetime, fmt='%F %T').cast(pl.Datetime))
    .select([
        pl.col('utc'),
        pl.exclude('utc').map(lambda x: savgol_filter(x.to_numpy(), 31, 3)).explode(),
    ])
)
df = q.collect()
df = df.upsample(time_column="utc", every="10s")
Here's the output of the above snippet:
│ 2022-06-04 04:49:31 ┆ 955.081699 ┆ 293.84 ┆ 77.009159 ┆ ... ┆ 421.510185 ┆ 1.878339 ┆ 0.0 ┆ 0.0 │
│ 2022-06-04 04:49:41 ┆ null ┆ null ┆ null ┆ ... ┆ null ┆ null ┆ null ┆ null │
│ 2022-06-04 04:49:51 ┆ null ┆ null ┆ null ┆ ... ┆ null ┆ null ┆ null ┆ null │
Polars just spits out a dataframe with a lot of nulls. I would have to interpolate to fill the values, but that would mean interpolating the entire dataset. Unfortunately, Polars provides no arguments or parameters on interpolate(), so all of the series get interpolated, which is not the desired behaviour.
I think the solution should have something to do with masks. Does anyone have experience working with Polars and interpolation?
Reproducible code: https://pastebin.com/gQ1WU4zp
Sample CSV data: https://0bin.net/paste/3fX2AOM2#uQmEv2KvBK5Xk-2vuWxx2z0QgXlttdnaa78eFt8ra62
I'm not going to download your whole dataset, so let's use this as an example:
np.random.seed(0)
df = pl.DataFrame(
    {
        "time": pl.date_range(
            low=datetime(2023, 2, 1),
            high=datetime(2023, 2, 2),
            interval="1m",
        ),
        "data": list(np.random.choice([None, 1, 2, 3, 4], size=1441)),
    }
).filter(~pl.col('data').is_null())
upsample, by definition, doesn't interpolate; it just (as you've discovered) inserts a bunch of nulls to match the periods you want.
If you only want to interpolate when the pre-upsample gap is 3 minutes or less, then create a helper column before the upsample.
Then use when/then, looking at the helper column, to decide whether or not to interpolate:
(
    df
    .with_columns(
        (pl.col('time') - pl.col('time').shift() < pl.duration(minutes=3))
        .alias('small_gap'))
    .upsample(time_column="time", every="10s")
    .with_columns(pl.col('small_gap').backward_fill())
    .with_columns(
        pl.when(pl.col('small_gap'))
        .then(pl.exclude(['small_gap']).interpolate())
        .otherwise(pl.exclude(['small_gap'])))
    .select(pl.exclude('small_gap'))
)
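The core of the helper column is just a pairwise comparison of consecutive timestamps. A minimal plain-Python sketch of the same gap test, on toy timestamps rather than the question's data:

```python
from datetime import datetime, timedelta

# Toy timestamps: a 1-minute gap (small), then a 5-minute gap (too big to fill).
times = [
    datetime(2023, 2, 1, 0, 0),
    datetime(2023, 2, 1, 0, 1),
    datetime(2023, 2, 1, 0, 6),
]

# Mirrors (pl.col('time') - pl.col('time').shift() < pl.duration(minutes=3)):
# the first element has no predecessor, so it is None, like shift() producing null.
small_gap = [None] + [b - a < timedelta(minutes=3) for a, b in zip(times, times[1:])]
print(small_gap)  # [None, True, False]
```

After the upsample, backward_fill propagates these flags onto the newly inserted rows, which is why the flag is computed before upsampling.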
Related
How can I filter datetime field in time slots? [duplicate]
Is there a way to filter data in a period of time (i.e., start time and end time) using Polars?

import pandas as pd
import polars as pl

dr = pd.date_range(start='2020-01-01', end='2021-01-01', freq="30min")
df = pd.DataFrame({"timestamp": dr})
pf = pl.from_pandas(df)

The best try I've got was:

pf.filter((pl.col("timestamp").dt.hour() >= 9) & (pl.col("timestamp").dt.minute() >= 30))

It only gave me everything after 9:30, and if I append another filter after that:

pf.filter((pl.col("timestamp").dt.hour() >= 9) & (pl.col("timestamp").dt.minute() >= 30)).filter(pl.col("timestamp").dt.hour() < 16)

this does not give me the slice that falls right on 16:00. The Polars API does not seem to specifically deal with the time part of time series (only the date part); is there a better workaround here using Polars?
Good question! Firstly, we can create this kind of DataFrame in Polars:

from datetime import datetime, time
import polars as pl

start = datetime(2020, 1, 1)
stop = datetime(2021, 1, 1)
df = pl.DataFrame({'timestamp': pl.date_range(low=start, high=stop, interval="30m")})

To work on the time components of a datetime, we cast the timestamp column to the pl.Time dtype. To filter on a range of times, we then pass the upper and lower boundaries of time to is_between. In this example I've printed the original timestamp column, the timestamp column cast to pl.Time, and the filter condition:

(
    df
    .select(
        [
            pl.col("timestamp"),
            pl.col("timestamp").cast(pl.Time).alias('time_component'),
            pl.col("timestamp").cast(pl.Time).is_between(
                time(9, 30), time(16), include_bounds=True
            ),
        ]
    )
)

What you are after is:

(
    df
    .filter(
        pl.col("timestamp").cast(pl.Time).is_between(
            time(9, 30), time(16), include_bounds=True
        )
    )
)

See the API docs for the syntax on controlling behaviour at the boundaries: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Expr.is_between.html#polars.Expr.is_between
It is described in the Polars book here: https://pola-rs.github.io/polars-book/user-guide/howcani/timeseries/selecting_dates.html#filtering-by-a-date-range

It would look something like this:

start_date = "2022-03-22 00:00:00"
end_date = "2022-03-27 00:00:00"

df = pl.DataFrame(
    {
        "dates": [
            "2022-03-22 00:00:00",
            "2022-03-23 00:00:00",
            "2022-03-24 00:00:00",
            "2022-03-25 00:00:00",
            "2022-03-26 00:00:00",
            "2022-03-27 00:00:00",
            "2022-03-28 00:00:00",
        ]
    }
)

df.with_column(pl.col("dates").is_between(start_date, end_date)).filter(pl.col("is_between") == True)

shape: (4, 2)
┌─────────────────────┬────────────┐
│ dates               ┆ is_between │
│ ---                 ┆ ---        │
│ str                 ┆ bool       │
╞═════════════════════╪════════════╡
│ 2022-03-23 00:00:00 ┆ true       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-24 00:00:00 ┆ true       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-25 00:00:00 ┆ true       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-26 00:00:00 ┆ true       │
└─────────────────────┴────────────┘
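For intuition, the same 09:30-16:00 window test can be sketched with plain datetime objects; Polars' cast to pl.Time plus is_between applies this check column-wise:

```python
from datetime import datetime, time

# Toy timestamps around the boundaries of the 09:30-16:00 window.
stamps = [
    datetime(2020, 1, 1, 9, 0),
    datetime(2020, 1, 1, 9, 30),
    datetime(2020, 1, 1, 16, 0),
    datetime(2020, 1, 1, 16, 30),
]

# Inclusive on both ends, matching include_bounds=True in the answer above.
in_window = [time(9, 30) <= t.time() <= time(16, 0) for t in stamps]
print(in_window)  # [False, True, True, False]
```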
How to use a polars column with offset string to add to another date column
Suppose you have

df = pl.DataFrame(
    {
        "date": ["2022-01-01", "2022-01-02"],
        "hroff": [5, 2],
        "minoff": [1, 2]
    }).with_column(pl.col('date').str.strptime(pl.Date, "%Y-%m-%d"))

and you want to make a new column that adds the hour and minute offsets to the date column. The only thing I saw was the dt.offset_by method. I made an extra column

df = df.with_column((pl.col('hroff') + "h" + pl.col('minoff') + "m").alias('offset'))

and then tried

df.with_column(
    pl.col('date')
    .cast(pl.Datetime).dt.with_time_zone('UTC')
    .dt.offset_by(pl.col('offset')).alias('newdate'))

but that doesn't work because dt.offset_by only takes a fixed string, not another column. What's the best way to do that?
Use pl.duration:

import polars as pl

df = pl.DataFrame({
    "date": pl.Series(["2022-01-01", "2022-01-02"]).str.strptime(pl.Datetime(time_zone="UTC"), "%Y-%m-%d"),
    "hroff": [5, 2],
    "minoff": [1, 2]
})

print(df.select(
    pl.col("date") + pl.duration(hours=pl.col("hroff"), minutes=pl.col("minoff"))
))

shape: (2, 1)
┌─────────────────────┐
│ date                │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2022-01-01 05:01:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-02 02:02:00 │
└─────────────────────┘
Polars - How can I make multiple joins across multiple DataFrames, examples included
import polars as pl #Auctiondata which is used to create the AuctionDF auctiondata = {"AuctionId": [2095293259, 2096131235, 2094319272, 2094265820, 2094902378, 2096005275], "Bid": [9499998, 8499998, 8500000 , 1400832, 1400000, 872], "Buyout": [9499998, 9499998, 8500000, 1450832, 1500000, 900], "Quantity": [1, 1, 1, 1, 1, 1], "Time_Left": ['Short', 'Very long', 'Long', 'Short', 'Long', 'Long'], "ItemId": [24655, 24648, 3184, 14187,6580,1482], "ItemRand": [-39, -19, 24, 2032, 1020,None], "ItemSeed": [886505522, 483524644, 384031104, 1995900544,2119510144,None], "SuffixFactor": [50, 36, 55680, 1664, 10368,None], "Faction": ['Horde', 'Alliance', 'Alliance', 'Alliance', 'Horde','Horde'], "RealmName": ['Mograine', 'Bloodfang', 'Firemaw', 'Firemaw', 'Mograine','Mograine'], "BuyoutGold": ['45', '77', '24', '39', '120','42'], "BuyoutSilver": ['40', '44', '12', '33', '12','51'], "BuyoutCopper": ['12', '11', '21', '52', '32','42'], "BidGold": ['39', '12', '11', '27', '99','23'], "BidSilver": ['32', '14', '44', '12', '4','42'], "BidCopper": ['21', '12', '32', '12', '45','33'] } #itemsData which is used to create the itemsDF itemsData = {"ID": [24655,24648,3184,14187,6580], "Display_lang" : ['Consortium Robe','Astralaan Gloves','Hook Dagger','Raincaller Cuffs','Defender Tunic']} #suffixData which is used to create the suffixDF suffixData = {"ID": [19, 39], "Name_lang": ['of Intellect', 'of the Invoker'], "Enchantment[0]": [2804, 2804], "Enchantment[1]": [0, 2824], "Enchantment[2]": [0, 2822], "Enchantment[3]": [0, 0], "Enchantment[4]": [0, 0], "AllocationPct[0]": [10000, 5259], "AllocationPct[1]": [None, 6153], "AllocationPct[2]": [None, 5259], "AllocationPct[3]": [None, None], "AllocationPct[4]": [None, None] } #propertiesData which is used to create the propertiesDF propertiesData = {"ID": [24, 1020, 2032], "Name_lang": ['of Strength', 'of the Whale', 'of Healing'], "Enchantment[0]": [70, 98, 2312], "Enchantment[1]": [0, 103, 0], "Enchantment[2]": [0, 0, 0], "Enchantment[3]": 
[0, 0, 0], "Enchantment[4]": [0, 0, 0] } #enchantmentsData which is used to create the enchantmentDF enchantmentsData = {"ID": [70, 98, 103, 2312, 2804, 2822, 2824], "Name_lang" : ['+3 Strength','+4 Spirit','+5 Stamina','+7 Spell Power','+$i Intellect', '+$i Critical Strike Rating', '+$i Spell Power']} #resultData which is used in order to create the resultDF resultData = {"AuctionId" : [2095293259, 2096131235, 2094319272, 2094265820, 2094902378, 2096005275], "ItemId" : [24655, 24648, 3184, 14187, 6580, 1482], "ItemName" : ['Consortium Robe','Astralaan Gloves','Hook Dagger','Raincaller Cuffs','Defender Tunic',''], "RealmName" : ['Mograine', 'Bloodfang', 'Firemaw', 'Firemaw', 'Mograine','Mograine'], "Faction" : ['Horde', 'Alliance', 'Alliance', 'Alliance', 'Horde','Horde'], "EnchantmentName" : ['of the Invoker','of Intellect','of Strength','of Healing','of the Whale',''], "Stat0" : ['+26 Intellect','+36 Intellect','+3 Strength','+13 Healing Spells and +5 Damage Spells','+4 Spirit',''], "Stat1" : ['+30 Spell Damage and Healing','','','','+5 Stamina',''], "Stat2" : ['+26 Spell Critical Strike Rating','','','','',''], "Stat3" : ['','','','','',''], "Stat4" : ['','','','','',''], "BuyoutGold" : ['45', '77', '24', '39', '120','42'], "BuyoutSilver" : ['40', '44', '12', '33', '12','51'], "BuyoutCopper" : ['12', '11', '21', '52', '32','42'], "BidGold" : ['39', '12', '11', '27', '99','23'], "BidSilver" : ['32', '14', '44', '12', '4','42'], "BidCopper" : ['21', '12', '32', '12', '45','33']} #"Main" DF auctionDF = pl.DataFrame(auctiondata) #The ID column of the below Dataframe refrence to "ItemId" in auctionDF, it's not ALWAYS the ItemId from AuctionDF is within the itemsDF tho, but 99% of the time it is. itemsDF = pl.DataFrame(itemsData) #All negative ItemRands (ItemRands lower than 0) from AuctionDF refrences to the ID column of suffixDF, so I would imagine one of the first things to do is to make the ID column in suffixDF negative for a later join? 
suffixDF = pl.DataFrame(suffixData)

# All positive ItemRands (ItemRands larger than 0) from AuctionDF reference the ID column of propertiesDF
propertiesDF = pl.DataFrame(propertiesData)

# All the various Enchantment[X] columns from suffixDF and propertiesDF reference the ID column of enchantmentsDF
enchantmentsDF = pl.DataFrame(enchantmentsData)

# The reason ItemName is blank for ItemId 1482 is that it does not exist in itemsDF
resultDF = pl.DataFrame(resultData)
print(resultDF)

The resultDF is the result I want to obtain. So basically we have the main DataFrame, AuctionDF:
ItemId references the ID column in itemsDF, where we need Display_lang in order to make the ItemName column in the resultDF.
ItemRand references either propertiesDF (if the ItemRand is positive) or suffixDF (if the ItemRand is negative).
In propertiesDF we need the column Name_lang, which gives us the column EnchantmentName in the resultDF.
Furthermore, we also need to use the Enchantment[0-4] columns from propertiesDF; these columns contain an ID which references over to enchantmentsDF, where we need the column Name_lang.
In the resultDF you will notice there are 5 StatX columns (Stat0-4); those contain the value of Name_lang from enchantmentsDF.
Lastly, if the ItemRand from AuctionDF had been a negative value, we would have gone into suffixDF, as the negative ItemRand is a reference to the ID column in suffixDF.
In suffixDF you also find Enchantment[0-4], as in propertiesDF, and those also reference ID in enchantmentsDF. However, you will quickly notice that the Name_lang for those values looks a bit different, such as "+$i Intellect"; this is because we have to calculate the $i value ourselves.
The suffixDF also has 5 additional columns, AllocationPct[0-4]. All of these values first have to be divided by 10000; then, for each AllocationPct[0-4] that has a value, we multiply it by the SuffixFactor. So for negative rands (suffixDF), Enchantment0 and AllocationPct0 give Stat0, Enchantment1 and AllocationPct1 give Stat1, and so on.

Example calculation for an item with a negative rand:
ItemId 24655 from AuctionDF has ItemRand -39 and SuffixFactor 50.
We take the -39 and look it up in suffixDF (ID column); we see Enchantment0 = 2804.
We look up 2804 in the ID column of enchantmentsDF, which has Name_lang = "+$i Intellect".
Now we calculate $i by looking at AllocationPct0 in suffixDF, which is 5259.
We divide 5259 by 10000 (5259 / 10000 = 0.5259).
We multiply 0.5259 by the SuffixFactor (50) and floor it (floor(0.5259 * 50) = floor(26.295) = 26).
So Stat0 = "+26 Intellect".

I might have made slight mistakes in the data as I typed all this out manually. If there are any questions or you think I might have made a mistake, feel free to ask. Best regards
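The worked example above is easy to verify in code; a quick check, with the values taken from the question:

```python
import math

# ItemRand -39: AllocationPct0 = 5259, SuffixFactor = 50 (from the question).
allocation_pct = 5259
suffix_factor = 50

# floor(5259 / 10000 * 50) = floor(26.295) = 26
stat_value = math.floor(allocation_pct / 10_000 * suffix_factor)
print(f"+{stat_value} Intellect")  # +26 Intellect
```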
We'll take this in three steps.

Step 1: Normalize and stack propertiesDF and suffixDF

When you see columns like Enchantment[0], Enchantment[1], etc. and lots of "0" or "null" values, you're essentially working with a database in non-normal form. This can make calculations awkward. So, as our first step, we'll use melt to put each Enchantment and Allocation in its own row for each ItemRand. In general, melt is used to convert a "wide format" DataFrame (few rows, many columns) into a "long format" DataFrame (many rows, few columns).

We'll use hstack to horizontally stack the Enchantment and Allocation columns next to each other, and concat to vertically stack the propertiesDF and suffixDF DataFrames after they are melted. We'll also change the sign of the ItemRand values in suffixDF. As a final step, we'll join with enchantmentsDF, to prepare for calculating the Stat values for items that came from suffixDF in the next step.

This step may seem intimidating at first glance, but you'll find that the code is rather repetitive: melting and stacking. I've laid out the code so that you can comment out sections and lines to follow the development of the algorithm. You'll also note that I rename variables along the way, just to keep the code clean and understandable. In the end, you'll see a clean, tidy DataFrame derived from suffixDF, propertiesDF, and enchantmentsDF. This is the goal.
enchantment_ids = ( pl.concat( [ ( suffixDF.rename( {f"Enchantment[{nbr}]": f"Stat{nbr}" for nbr in range( 0, 5)} ) .melt( id_vars=["ID", "Name_lang"], value_vars=[f"Stat{nbr}" for nbr in range(0, 5)], value_name="EnchantmentID", variable_name="Stat_Nbr", ) .hstack( suffixDF.melt( id_vars=None, value_vars=[ f"AllocationPct[{nbr}]" for nbr in range(0, 5)], value_name="Allocation", ).drop("variable"), ) .filter(pl.col('EnchantmentID') != 0) .select( [ -pl.col("ID").alias("ItemRand"), pl.col("Name_lang").alias("EnchantmentName"), "EnchantmentID", "Allocation", "Stat_Nbr", ] ) ), ( propertiesDF.rename( {f"Enchantment[{nbr}]": f"Stat{nbr}" for nbr in range( 0, 5)} ).melt( id_vars=["ID", "Name_lang"], value_vars=[f"Stat{nbr}" for nbr in range(0, 5)], value_name="EnchantmentID", variable_name="Stat_Nbr", ) .filter(pl.col('EnchantmentID') != 0) .select( [ pl.col("ID").alias("ItemRand"), pl.col("Name_lang").alias("EnchantmentName"), "EnchantmentID", "Stat_Nbr", ] ) ), ], how="diagonal", ) .sort(["ItemRand", "Stat_Nbr"]) .join( enchantmentsDF.select( [ pl.col("ID").alias("EnchantmentID"), pl.col("Name_lang").alias("Enchantment_type"), ] ), how="left", on="EnchantmentID", ) ) enchantment_ids shape: (8, 6) ┌──────────┬─────────────────┬───────────────┬────────────┬──────────┬────────────────────────────┐ │ ItemRand ┆ EnchantmentName ┆ EnchantmentID ┆ Allocation ┆ Stat_Nbr ┆ Enchantment_type │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ i64 ┆ f64 ┆ str ┆ str │ ╞══════════╪═════════════════╪═══════════════╪════════════╪══════════╪════════════════════════════╡ │ -39 ┆ of the Invoker ┆ 2804 ┆ 5259.0 ┆ Stat0 ┆ +$i Intellect │ ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ -39 ┆ of the Invoker ┆ 2824 ┆ 6153.0 ┆ Stat1 ┆ +$i Spell Power │ ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ -39 ┆ of the Invoker ┆ 2822 ┆ 5259.0 ┆ Stat2 ┆ +$i Critical Strike Rating │ 
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ -19 ┆ of Intellect ┆ 2804 ┆ 10000.0 ┆ Stat0 ┆ +$i Intellect │ ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ 24 ┆ of Strength ┆ 70 ┆ null ┆ Stat0 ┆ +3 Strength │ ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ 1020 ┆ of the Whale ┆ 98 ┆ null ┆ Stat0 ┆ +4 Spirit │ ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ 1020 ┆ of the Whale ┆ 103 ┆ null ┆ Stat1 ┆ +5 Stamina │ ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ 2032 ┆ of Healing ┆ 2312 ┆ null ┆ Stat0 ┆ +7 Spell Power │ └──────────┴─────────────────┴───────────────┴────────────┴──────────┴────────────────────────────┘ Step 2: Calculate Stats and then pivot In the next step, we will calculate all Stats for the Items that had negative ItemRand. We'll need to join with the auctionDF to get the SuffixFactor for the calculations. You'll notice the familiar replace function. Once everything is calculated, we'll de-normalize the DataFrame using pivot. One way to think of pivot is that pivot is the opposite of melt. pivot converts a "long format" (many rows, few columns) to a "wide format" DataFrame (few rows, many columns.) pivot will create the Stat0, Stat1, and Stat2 columns. Note: Stat3 and Stat4 are not created because they are not needed. This reduces the width of your final DataFrame. (At the end, I'll show you how to keep Stat3 and Stat4 if you need them.) The result again is a neat, tidy DataFrame of Stats for each AuctionID. 
auction_enchantments = ( auctionDF.select( [ "AuctionId", "ItemRand", "SuffixFactor", ] ) .filter(pl.col("ItemRand").is_not_null()) .join(enchantment_ids, on="ItemRand", how="left") .with_columns( [ pl.when(pl.col("Allocation").is_null()) .then(pl.col("Enchantment_type")) .otherwise( pl.col("Enchantment_type").str.replace( r"\$i", (pl.col("Allocation") * pl.col("SuffixFactor") / 10_000) .floor() .cast(pl.Int64) .cast(pl.Utf8), ) ) ] ) .pivot( index=["AuctionId", "EnchantmentName"], values="Enchantment_type", columns="Stat_Nbr", ) ) auction_enchantments shape: (5, 5) ┌────────────┬─────────────────┬────────────────┬─────────────────┬────────────────────────────┐ │ AuctionId ┆ EnchantmentName ┆ Stat0 ┆ Stat1 ┆ Stat2 │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str ┆ str ┆ str │ ╞════════════╪═════════════════╪════════════════╪═════════════════╪════════════════════════════╡ │ 2095293259 ┆ of the Invoker ┆ +26 Intellect ┆ +30 Spell Power ┆ +26 Critical Strike Rating │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ 2096131235 ┆ of Intellect ┆ +36 Intellect ┆ null ┆ null │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ 2094319272 ┆ of Strength ┆ +3 Strength ┆ null ┆ null │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ 2094265820 ┆ of Healing ┆ +7 Spell Power ┆ null ┆ null │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ 2094902378 ┆ of the Whale ┆ +4 Spirit ┆ +5 Stamina ┆ null │ └────────────┴─────────────────┴────────────────┴─────────────────┴────────────────────────────┘ Step 3: Putting it all together Our goal all along was to create neat, tidy DataFrames so that this last step is simple. We simply join our auctionDF with our calculated enchantments from Step 2. Much of this step is just clean up: ordering the columns the way we want. 
(Note the slick use of regex expressions in col.) Also, we fill null values with empty strings "", for a tidier look. ( auctionDF.join( itemsDF.rename({"ID": "ItemId", "Display_lang": "ItemName"}), on="ItemId", how="left", ) .join(auction_enchantments, on="AuctionId", how="left") .with_columns([pl.col(pl.Utf8).fill_null("")]) .select( [ "AuctionId", "ItemId", "ItemName", "RealmName", "Faction", "EnchantmentName", pl.col("^Stat.*$"), pl.col("^Buyout.+$"), pl.col("^Bid.+$"), ] ) ) shape: (6, 15) ┌────────────┬────────┬──────────────────┬───────────┬──────────┬─────────────────┬────────────────┬─────────────────┬────────────────────────────┬────────────┬──────────────┬──────────────┬─────────┬───────────┬───────────┐ │ AuctionId ┆ ItemId ┆ ItemName ┆ RealmName ┆ Faction ┆ EnchantmentName ┆ Stat0 ┆ Stat1 ┆ Stat2 ┆ BuyoutGold ┆ BuyoutSilver ┆ BuyoutCopper ┆ BidGold ┆ BidSilver ┆ BidCopper │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │ ╞════════════╪════════╪══════════════════╪═══════════╪══════════╪═════════════════╪════════════════╪═════════════════╪════════════════════════════╪════════════╪══════════════╪══════════════╪═════════╪═══════════╪═══════════╡ │ 2095293259 ┆ 24655 ┆ Consortium Robe ┆ Mograine ┆ Horde ┆ of the Invoker ┆ +26 Intellect ┆ +30 Spell Power ┆ +26 Critical Strike Rating ┆ 45 ┆ 40 ┆ 12 ┆ 39 ┆ 32 ┆ 21 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ 2096131235 ┆ 24648 ┆ Astralaan Gloves ┆ Bloodfang ┆ Alliance ┆ of Intellect ┆ +36 Intellect ┆ ┆ ┆ 77 ┆ 44 ┆ 11 ┆ 12 ┆ 14 ┆ 12 │ 
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ 2094319272 ┆ 3184 ┆ Hook Dagger ┆ Firemaw ┆ Alliance ┆ of Strength ┆ +3 Strength ┆ ┆ ┆ 24 ┆ 12 ┆ 21 ┆ 11 ┆ 44 ┆ 32 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ 2094265820 ┆ 14187 ┆ Raincaller Cuffs ┆ Firemaw ┆ Alliance ┆ of Healing ┆ +7 Spell Power ┆ ┆ ┆ 39 ┆ 33 ┆ 52 ┆ 27 ┆ 12 ┆ 12 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ 2094902378 ┆ 6580 ┆ Defender Tunic ┆ Mograine ┆ Horde ┆ of the Whale ┆ +4 Spirit ┆ +5 Stamina ┆ ┆ 120 ┆ 12 ┆ 32 ┆ 99 ┆ 4 ┆ 45 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ 2096005275 ┆ 1482 ┆ ┆ Mograine ┆ Horde ┆ ┆ ┆ ┆ ┆ 42 ┆ 51 ┆ 42 ┆ 23 ┆ 42 ┆ 33 │ └────────────┴────────┴──────────────────┴───────────┴──────────┴─────────────────┴────────────────┴─────────────────┴────────────────────────────┴────────────┴──────────────┴──────────────┴─────────┴───────────┴───────────┘ Other Notes The results are not exactly as your resultsDF. (But I think the calculated results above are correct.) There are quite a few advanced steps used in the algorithm above. But if you take your time and work through the algorithm, you'll understand what is happening. I've also provided links to Polars documentation for crucial methods (like pivot and melt). 
The replace step depends on Polars version 0.14.4 or above, so upgrade your version of Polars. If you need the Stat3 and Stat4 columns, just comment out the two lines that contain .filter(pl.col('EnchantmentID') != 0). This will cause the Stat3 and Stat4 columns to appear in the final step.
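If the melt step in the answer above feels opaque, here is a minimal wide-to-long illustration. pandas is used for the sketch since its melt behaves analogously to the Polars melt used above; the column names and values are toy entries taken from the question's suffixData:

```python
import pandas as pd

# One suffix row in "wide" form: each enchantment slot is its own column.
wide = pd.DataFrame({"ID": [39], "Enchantment[0]": [2804], "Enchantment[1]": [2824]})

# melt turns each Enchantment[n] column into its own row, keyed by ID.
long = wide.melt(id_vars="ID", var_name="Stat_Nbr", value_name="EnchantmentID")
print(long)
```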
Why is it parsing datetime like this?
I've recently been using Polars for a project I'm starting to develop. I've come across several problems, but with the info here and in the docs I have solved those issues.

My issue: when I save the dataframe, it stores datetime data like this: 1900-01-01T18:00:00.000000000

When I load the saved dataframe, it shows like this in my console (I have checked, the type is object): 1900-01-01 18:00:00

Code:

# My column is a string like '1234', which means 12:34, so I do the following transformation:
df = df2.with_columns([
    pl.col('initial_band_time').apply(lambda x: datetime.datetime.strptime(x, '%H:%M')),
    pl.col('final_band_time').apply(lambda x: datetime.datetime.strptime(x, '%H:%M')),
])
df = df.drop('version').rename({'day_type': 'day'})
print(df)
print(df.dtypes)
# output: <class 'polars.datatypes.Datetime'>, <class 'polars.datatypes.Datetime'>

# I save it with write_csv:
df.write_csv('data/trp_occupation_level_emt_cleaned.csv', sep=",")
dfnew = pl.read_csv('data/trp_occupation_level_emt_cleaned.csv')

# print the new df
print(dfnew.head())
print(dfnew.dtypes)
# output: <class 'polars.datatypes.Utf8'>, <class 'polars.datatypes.Utf8'>

I know I can read the csv with parse_dates=True, but I consume this dataframe in a database, so I need to export it with the dates parsed.
Polars does not parse string data as dates by default, but you can easily turn it on with the parse_dates keyword argument:

pl.read_csv("myfile.csv", parse_dates=True)
It sounds like you want to specify the formatting of Date and Datetime fields in an output csv file, to conform with the formatting requirements of an external application (e.g., a database loader). We can do that easily using the strftime format function. Basically, we will convert the Date/Datetime fields to strings, formatted as we need them, just before we write the csv file. This way, the csv output writer will not alter them.

For example, let's start with this data:

from io import StringIO
import polars as pl

my_csv = """sample_id,initial_band_time,final_band_time
1,2022-01-01T18:00:00,2022-01-01T18:35:00
2,2022-01-02T19:35:00,2022-01-02T20:05:00
"""
df = pl.read_csv(StringIO(my_csv), parse_dates=True)
print(df)

shape: (2, 3)
┌───────────┬─────────────────────┬─────────────────────┐
│ sample_id ┆ initial_band_time   ┆ final_band_time     │
│ ---       ┆ ---                 ┆ ---                 │
│ i64       ┆ datetime[μs]        ┆ datetime[μs]        │
╞═══════════╪═════════════════════╪═════════════════════╡
│ 1         ┆ 2022-01-01 18:00:00 ┆ 2022-01-01 18:35:00 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2         ┆ 2022-01-02 19:35:00 ┆ 2022-01-02 20:05:00 │
└───────────┴─────────────────────┴─────────────────────┘

Now, we'll apply the strftime function with the format specifier %F %T:

df = df.with_column(pl.col(pl.Datetime).dt.strftime(fmt="%F %T"))
print(df)

shape: (2, 3)
┌───────────┬─────────────────────┬─────────────────────┐
│ sample_id ┆ initial_band_time   ┆ final_band_time     │
│ ---       ┆ ---                 ┆ ---                 │
│ i64       ┆ str                 ┆ str                 │
╞═══════════╪═════════════════════╪═════════════════════╡
│ 1         ┆ 2022-01-01 18:00:00 ┆ 2022-01-01 18:35:00 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2         ┆ 2022-01-02 19:35:00 ┆ 2022-01-02 20:05:00 │
└───────────┴─────────────────────┴─────────────────────┘

Notice that our Datetime fields have been converted to strings (the 'str' in the column header). And here's a pro tip: notice that I'm using a datatype wildcard expression in the col expression: pl.col(pl.Datetime). This way, you don't need to specify each Datetime field; Polars will automatically convert them all.

Now, when we write the csv file, we get the following output:

df.write_csv('/tmp/tmp.csv')

Output csv:

sample_id,initial_band_time,final_band_time
1,2022-01-01 18:00:00,2022-01-01 18:35:00
2,2022-01-02 19:35:00,2022-01-02 20:05:00

You may need to play around with the format specifier until you find one that your external application will accept. Here's a handy reference for format specifiers.

Here's another trick: you can do this step just before writing the csv file:

df.with_column(pl.col(pl.Datetime).dt.strftime(fmt="%F %T")).write_csv('/tmp/tmp.csv')

This way, your original dataset is not changed; only the copy that you intend to write to a csv file is. BTW, I use this trick all the time when writing csv files that I intend to use in spreadsheets. I often just want the "%F" (date) part of the datetime, not the "%T" (time) part. It just makes parsing easier in the spreadsheet.
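A quick way to preview a format specifier before wiring it into the csv step is plain datetime.strftime. Note that "%F %T" is shorthand for "%Y-%m-%d %H:%M:%S", and since Python delegates strftime to the C library, support for the shorthand forms is platform-dependent; the long form below is the portable spelling:

```python
from datetime import datetime

ts = datetime(2022, 1, 1, 18, 0, 0)

# "%Y-%m-%d %H:%M:%S" is the portable equivalent of "%F %T".
print(ts.strftime("%Y-%m-%d %H:%M:%S"))  # 2022-01-01 18:00:00
```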