PySpark: group and order, sum per group, then divide each row by the group total - python

I am new to PySpark and am confused about how to group data by a couple of columns, order it by another column, sum a column within each group, and then use that sum as the denominator for each row to calculate a per-row weight within the group.
This is being done in JupyterLab using a PySpark 3 notebook; there is no way to get around that.
Here is an example of the data...
+-------+-----+-----------+------------+------+--------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts |
+-------+-----+-----------+------------+------+--------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 |
| A | 1 | 1-A | 2019-10-10 | 9 | 13245 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 |
+-------+-----+-----------+------------+------+--------+
I'd like to group this by ntwrk, zip, zip-ntwrk, and event-date, and then order it by event-date and hour. There are 24 hours for each date, so for each zip-ntwrk combo I would want to see the date and the hour in order. Something like this...
+-------+-----+-----------+------------+------+--------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts |
+-------+-----+-----------+------------+------+--------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 |
| A | 1 | 1-A | 2019-10-10 | 9 | 3765 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 |
+-------+-----+-----------+------------+------+--------+
Now that everything is in order, I need to compute, for each day, the total of counts across its hours. That total is the denominator used to divide each hourly count, giving the ratio of each hour's count to the day's total. So something like this...
+-------+-----+-----------+------------+------+--------+-------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts | total |
+-------+-----+-----------+------------+------+--------+-------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 | 16127 |
| A | 1 | 1-A | 2019-10-10 | 9 | 3765 | 16127 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 | 28730 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 | 4973 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 | 4973 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 | 3765 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 | 27320 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 | 27320 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 | 728 |
+-------+-----+-----------+------------+------+--------+-------+
And now that we have the denominator, we can divide counts by total for each row to get the factor (counts / total = factor), and this would end up looking like...
+-------+-----+-----------+------------+------+--------+-------+--------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts | total | factor |
+-------+-----+-----------+------------+------+--------+-------+--------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 | 16127 | .766 |
| A | 1 | 1-A | 2019-10-10 | 9 | 3765 | 16127 | .233 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 | 28730 | 1 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 | 4973 | .02 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 | 4973 | .979 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 | 3765 | 1 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 | 27320 | .64 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 | 27320 | .359 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 | 728 | 1 |
+-------+-----+-----------+------------+------+--------+-------+--------+
That's what I'm trying to do, and any advice on how to get this done would be greatly appreciated.
Thanks

Use the sum window function: sum counts over a window partitioned by ntwrk, zip, and event-date.
Then divide counts by the total to get the factor.
Example:
from pyspark.sql.functions import *
from pyspark.sql import Window

# total counts per ntwrk/zip/event-date (the star import shadows Python's builtin sum with pyspark's sum)
w = Window.partitionBy("ntwrk", "zip", "event-date")

df1.withColumn("total", sum(col("counts")).over(w).cast("int")) \
   .orderBy("ntwrk", "zip", "event-date", "hour") \
   .withColumn("factor", format_number(col("counts") / col("total"), 3)) \
   .show()
#+-----+---+---------+----------+----+------+-----+------+
#|ntwrk|zip|zip-ntwrk|event-date|hour|counts|total|factor|
#+-----+---+---------+----------+----+------+-----+------+
#| A| 1| 1-A|2019-10-10| 1| 12362|25607| 0.483|
#| A| 1| 1-A|2019-10-10| 9| 13245|25607| 0.517|#input 13245 not 3765
#| A| 2| 2-A|2019-10-11| 1| 28730|28730| 1.000|
#| B| 3| 3-B|2019-10-10| 1| 100| 4973| 0.020|
#| B| 3| 3-B|2019-10-10| 4| 4873| 4973| 0.980|
#| B| 4| 4-B|2019-10-11| 1| 3765| 3765| 1.000|
#| C| 5| 5-C|2019-10-10| 1| 17493|27320| 0.640|
#| C| 5| 5-C|2019-10-10| 2| 9827|27320| 0.360|
#| C| 6| 6-C|2019-10-11| 1| 728| 728| 1.000|
#+-----+---+---------+----------+----+------+-----+------+
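If the factor should stay numeric for further calculations, note that format_number returns a string column; a minimal variant of the same window using round instead (same column names assumed):
from pyspark.sql import functions as F, Window

w = Window.partitionBy("ntwrk", "zip", "event-date")
df1 = (df1.withColumn("total", F.sum("counts").over(w))
          .withColumn("factor", F.round(F.col("counts") / F.col("total"), 3)))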


PySpark runs on a distributed architecture and hence may not retain row order, so you should always order the data the way you need it before showing the records.
Now, on your point about getting the percentage of records at various levels: you can achieve this with a window function, partitioning by the levels at which you want the data.
Like:
w = Window.partitionBy("ntwrk-zip", "hour")
df =df.withColumn("hourly_recs", F.count().over(w))
Also, you can refer to this tutorial on YouTube: https://youtu.be/JEBd_4wWyj0

Related

Get Min and Max from values of another column after a Groupby in PySpark

I'm trying to get the min and max values of a column after doing a groupby on two other columns in PySpark.
The dataset looks like:
| country | company | value |
|-------------------|----------------|-----------|
| arg | hh | 3 |
| arg | hh | 2 |
| arg | go | 4 |
| arg | go | 3 |
| bra | go | 1 |
| bra | go | 2 |
| bra | hh | 3 |
| bra | hh | 2 |
My current implementation is this one:
from pyspark.sql.functions import col, first, min, max

new_df = df.groupBy("country", "company").agg(
    first("value").alias("value"),
    min("value").alias("min_value"),
    max("value").alias("max_value")
)
But the result I'm getting is not correct, since I get this:
| country | company | value | min_value | max_value |
|-------------------|----------------|-----------|---------------|---------------|
| arg | hh | 3 | 3 | 3 |
| arg | hh | 2 | 2 | 2 |
| arg | go | 4 | 4 | 4 |
| arg | go | 3 | 3 | 3 |
| bra | go | 1 | 1 | 1 |
| bra | go | 2 | 2 | 2 |
| bra | hh | 3 | 3 | 3 |
| bra | hh | 2 | 2 | 2 |
And I wish to get something like:
| country | company | value | min_value | max_value |
|-------------------|----------------|-----------|---------------|---------------|
| arg | hh | 3 | 2 | 3 |
| arg | hh | 2 | 2 | 3 |
| arg | go | 4 | 3 | 4 |
| arg | go | 3 | 3 | 4 |
| bra | go | 1 | 1 | 2 |
| bra | go | 2 | 1 | 2 |
| bra | hh | 3 | 2 | 3 |
| bra | hh | 2 | 2 | 3 |
Do a join with the grouped dataframe
from pyspark.sql.functions import min, max

df.join(
    df.groupby('country', 'company').agg(min('value').alias('min_value'),
                                         max('value').alias('max_value')),
    on=['country', 'company']
)
which gives the (unordered) result you are looking for:
+-------+-------+-----+---------+---------+
|country|company|value|min_value|max_value|
+-------+-------+-----+---------+---------+
| bra| go| 1| 1| 2|
| bra| go| 2| 1| 2|
| bra| hh| 3| 2| 3|
| bra| hh| 2| 2| 3|
| arg| hh| 3| 2| 3|
| arg| hh| 2| 2| 3|
| arg| go| 4| 3| 4|
| arg| go| 3| 3| 4|
+-------+-------+-----+---------+---------+
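An equivalent approach, shown here only as a sketch, computes the min and max as window functions over the same grouping instead of joining back (same df and column names assumed):
from pyspark.sql import functions as F, Window

w = Window.partitionBy('country', 'company')
new_df = (df.withColumn('min_value', F.min('value').over(w))
            .withColumn('max_value', F.max('value').over(w)))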

Add column to dataframe based on value in other column

I have a Spark dataframe which shows (daily) how many times a product has been used. It looks like this:
| x_id | product | usage | yyyy_mm_dd | status |
|------|---------|-------|------------|--------|
| 10 | prod_go | 15 | 2020-10-10 | i |
| 10 | prod_rv | 7 | 2020-10-10 | fc |
| 10 | prod_mb | 0 | 2020-10-10 | n |
| 15 | prod_go | 0 | 2020-10-10 | n |
| 15 | prod_rv | 5 | 2020-10-10 | fc |
| 15 | prod_mb | 1 | 2020-10-10 | fc |
| 10 | prod_go | 20 | 2020-10-11 | i |
| 10 | prod_rv | 11 | 2020-10-11 | i |
| 10 | prod_mb | 3 | 2020-10-11 | fc |
| 15 | prod_go | 0 | 2020-10-11 | n |
| 15 | prod_rv | 5 | 2020-10-11 | fc |
| 15 | prod_mb | 1 | 2020-10-11 | fc |
The status column is based on usage: when usage is 0 the status is n, when usage is between 1 and 9 the status is fc, and when usage is >= 10 the status is i.
I would like to introduce two additional columns to this Spark dataframe, date_reached_fc and date_reached_i. These columns should hold the min(yyyy_mm_dd) on which an x_id first reached each status for a product.
Based on the sample data, the output would look like this:
| x_id | product | usage | yyyy_mm_dd | status | date_reached_fc | date_reached_i |
|------|---------|-------|------------|--------|-----------------|----------------|
| 10 | prod_go | 15 | 2020-10-10 | i | null | 2020-10-10 |
| 10 | prod_rv | 7 | 2020-10-10 | fc | 2020-10-10 | null |
| 10 | prod_mb | 0 | 2020-10-10 | n | null | null |
| 15 | prod_go | 0 | 2020-10-10 | n | null | null |
| 15 | prod_rv | 5 | 2020-10-10 | fc | 2020-10-10 | null |
| 15 | prod_mb | 1 | 2020-10-10 | fc | 2020-10-10 | null |
| 10 | prod_go | 20 | 2020-10-11 | i | null | 2020-10-10 |
| 10 | prod_rv | 11 | 2020-10-11 | i | 2020-10-10 | 2020-10-11 |
| 10 | prod_mb | 3 | 2020-10-11 | fc | 2020-10-11 | null |
| 15 | prod_go | 0 | 2020-10-11 | n | null | null |
| 15 | prod_rv | 5 | 2020-10-11 | fc | 2020-10-10 | null |
| 15 | prod_mb | 1 | 2020-10-11 | fc | 2020-10-10 | null |
The ordering is a bit different from your question, but the results should be correct. Basically, take min over a window, and use when to keep only the dates with the relevant status.
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'date_reached_fc',
    F.min(F.when(F.col('status') == 'fc', F.col('yyyy_mm_dd')))
     .over(Window.partitionBy('x_id', 'product').orderBy('yyyy_mm_dd', 'usage'))
).withColumn(
    'date_reached_i',
    F.min(F.when(F.col('status') == 'i', F.col('yyyy_mm_dd')))
     .over(Window.partitionBy('x_id', 'product').orderBy('yyyy_mm_dd', 'usage'))
).orderBy('x_id', 'product', 'yyyy_mm_dd', 'usage')

df2.show()
+----+-------+-----+----------+------+---------------+--------------+
|x_id|product|usage|yyyy_mm_dd|status|date_reached_fc|date_reached_i|
+----+-------+-----+----------+------+---------------+--------------+
| 10|prod_go| 15|2020-10-10| i| null| 2020-10-10|
| 10|prod_go| 20|2020-10-11| i| null| 2020-10-10|
| 10|prod_mb| 0|2020-10-10| n| null| null|
| 10|prod_mb| 3|2020-10-11| fc| 2020-10-11| null|
| 10|prod_rv| 7|2020-10-10| fc| 2020-10-10| null|
| 10|prod_rv| 11|2020-10-11| i| 2020-10-10| 2020-10-11|
| 15|prod_go| 0|2020-10-10| n| null| null|
| 15|prod_go| 0|2020-10-11| n| null| null|
| 15|prod_mb| 1|2020-10-10| fc| 2020-10-10| null|
| 15|prod_mb| 1|2020-10-11| fc| 2020-10-10| null|
| 15|prod_rv| 5|2020-10-10| fc| 2020-10-10| null|
| 15|prod_rv| 5|2020-10-11| fc| 2020-10-10| null|
+----+-------+-----+----------+------+---------------+--------------+

From my original dataframe I have obtained another two; how can I merge the columns I need into a final one?

I have a table with 4 columns. From this data I obtained another 2 tables with some rolling averages computed from the original table. Now I want to combine these 3 into a final table, but the indexes are no longer in order and I can't do it. I just started learning Python, I have zero experience, and I would really appreciate all the help I can get.
DF
+----+------------+-----------+------+------+
| | A | B | C | D |
+----+------------+-----------+------+------+
| 1 | Home Team | Away Team | Htgs | Atgs |
| 2 | dalboset | sopot | 1 | 2 |
| 3 | calnic | resita | 1 | 3 |
| 4 | sopot | dalboset | 2 | 2 |
| 5 | resita | sopot | 4 | 1 |
| 6 | sopot | dalboset | 2 | 1 |
| 7 | caransebes | dalboset | 1 | 2 |
| 8 | calnic | resita | 1 | 3 |
| 9 | dalboset | sopot | 2 | 2 |
| 10 | calnic | resita | 4 | 1 |
| 11 | sopot | dalboset | 2 | 1 |
| 12 | resita | sopot | 1 | 2 |
| 13 | sopot | dalboset | 1 | 3 |
| 14 | caransebes | dalboset | 2 | 2 |
| 15 | calnic | resita | 4 | 1 |
| 16 | dalboset | sopot | 2 | 1 |
| 17 | calnic | resita | 1 | 2 |
| 18 | sopot | dalboset | 4 | 1 |
| 19 | resita | sopot | 2 | 1 |
| 20 | sopot | dalboset | 1 | 2 |
| 21 | caransebes | dalboset | 1 | 3 |
| 22 | calnic | resita | 2 | 2 |
+----+------------+-----------+------+------+
CODE
df1 = df.groupby('Home Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df1 = df1.rename(columns={'Htgs': 'Htgs/3', 'Atgs': 'Htgc/3'})
df1

df2 = df.groupby('Away Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df2 = df2.rename(columns={'Htgs': 'Atgc/3', 'Atgs': 'Atgs/3'})
df2
Now I need a way to see the rolling-average columns next to the Home Team, Away Team, Htgs, and Atgs columns from the original table.
Done!
I created the new column directly in the data frame like this:

import pandas as pd

df = pd.read_csv('Fd.csv')
df['Htgs/3'] = df.groupby('Home Team')['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)

Htgs/3 will be the new column with the rolling average of Htgs for each Home Team, and for the rest I will do the same as in this part; see the sketch below.
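A sketch of the remaining columns, repeating the same pattern with the column names used in the snippets above (those exact names are an assumption):

# rolling mean over the last 4 matches (at least 3) per home team
df['Htgc/3'] = df.groupby('Home Team')['Atgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)

# same idea per away team
df['Atgc/3'] = df.groupby('Away Team')['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
df['Atgs/3'] = df.groupby('Away Team')['Atgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)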

Multi-Index Lookup Mapping

I'm trying to create a new column which has a value based on 2 indices of that row. I have 2 dataframes with equivalent multi-index on the levels I'm querying (but not of equal size). For each row in the 1st dataframe, I want the value of the 2nd df that matches the row's indices.
I originally thought perhaps I could use a .loc[] and filter off the index values, but I cannot seem to get this to change the output row-by-row. If I wasn't using a dataframe object, I'd loop over the whole thing to do it.
I have tried to use the .apply() method, but I can't figure out what function to pass to it.
Creating some toy data with the same structure:
import pandas as pd
import numpy as np

np.random.seed(1)

df = pd.DataFrame({'Aircraft': np.ones(15),
                   'DC': np.append(np.repeat(['A', 'B'], 7), 'C'),
                   'Test': np.array([10, 10, 10, 10, 10, 10, 20, 10, 10, 10, 10, 10, 10, 20, 10]),
                   'Record': np.array([1, 2, 3, 4, 5, 6, 1, 1, 2, 3, 4, 5, 6, 1, 1]),
                   # There are multiple "value" columns in my data, but I have simplified here
                   'Value': np.random.random(15)
                   })
df.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
df.sort_index(inplace=True)

v = pd.DataFrame({'Aircraft': np.ones(7),
                  'DC': np.repeat('v', 7),
                  'Test': np.array([10, 10, 10, 10, 10, 10, 20]),
                  'Record': np.array([1, 2, 3, 4, 5, 6, 1]),
                  'Value': np.random.random(7)
                  })
v.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
v.sort_index(inplace=True)

df['v'] = df.apply(lambda x: v.loc[df.iloc[x]])
Returns error for indexing on multi-index.
To set all values to a single "v" value:
df['v'] = float(v.loc[(slice(None), 'v', 10, 1), 'Value'])
So inputs look like this:
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | A | 10 | 1 | 0.847576 |
| | | | 2 | 0.860720 |
| | | | 3 | 0.017704 |
| | | | 4 | 0.082040 |
| | | | 5 | 0.583630 |
| | | | 6 | 0.506363 |
| | | 20 | 1 | 0.844716 |
| | B | 10 | 1 | 0.698131 |
| | | | 2 | 0.112444 |
| | | | 3 | 0.718316 |
| | | | 4 | 0.797613 |
| | | | 5 | 0.129207 |
| | | | 6 | 0.861329 |
| | | 20 | 1 | 0.535628 |
| | C | 10 | 1 | 0.121704 |
--------------------------------------------
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | v | 10 | 1 | 0.961791 |
| | | | 2 | 0.046681 |
| | | | 3 | 0.913453 |
| | | | 4 | 0.495924 |
| | | | 5 | 0.149950 |
| | | | 6 | 0.708635 |
| | | 20 | 1 | 0.874841 |
--------------------------------------------
And after the operation, I want this:
| Aircraft | DC | Test | Record | Value | v |
|----------|----|------|--------|----------|----------|
| 1.0 | A | 10 | 1 | 0.847576 | 0.961791 |
| | | | 2 | 0.860720 | 0.046681 |
| | | | 3 | 0.017704 | 0.913453 |
| | | | 4 | 0.082040 | 0.495924 |
| | | | 5 | 0.583630 | 0.149950 |
| | | | 6 | 0.506363 | 0.708635 |
| | | 20 | 1 | 0.844716 | 0.874841 |
| | B | 10 | 1 | 0.698131 | 0.961791 |
| | | | 2 | 0.112444 | 0.046681 |
| | | | 3 | 0.718316 | 0.913453 |
| | | | 4 | 0.797613 | 0.495924 |
| | | | 5 | 0.129207 | 0.149950 |
| | | | 6 | 0.861329 | 0.708635 |
| | | 20 | 1 | 0.535628 | 0.874841 |
| | C | 10 | 1 | 0.121704 | 0.961791 |
Edit:
Since you are on pandas 0.23.4, just change droplevel to reset_index with drop=True:
df_result = (df.reset_index('DC').assign(v=v.reset_index('DC', drop=True))
               .set_index('DC', append=True)
               .reorder_levels(v.index.names))
Original:
One way is to move the DC index level of df into columns, use assign to create the new column from v, then set DC back as an index level and reorder the levels:
df_result = (df.reset_index('DC').assign(v=v.droplevel('DC'))
               .set_index('DC', append=True)
               .reorder_levels(v.index.names))
Out[1588]:
Value v
Aircraft DC Test Record
1.0 A 10 1 0.847576 0.961791
2 0.860720 0.046681
3 0.017704 0.913453
4 0.082040 0.495924
5 0.583630 0.149950
6 0.506363 0.708635
20 1 0.844716 0.874841
B 10 1 0.698131 0.961791
2 0.112444 0.046681
3 0.718316 0.913453
4 0.797613 0.495924
5 0.129207 0.149950
6 0.861329 0.708635
20 1 0.535628 0.874841
C 10 1 0.121704 0.961791
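Another way to express the same lookup, reusing the df and v built above, is to merge on the shared index levels and ignore DC on the v side; this is only a sketch of an alternative, not the answer's method:

df_alt = (df.reset_index()
            .merge(v.reset_index().drop(columns='DC').rename(columns={'Value': 'v'}),
                   on=['Aircraft', 'Test', 'Record'], how='left')
            .set_index(['Aircraft', 'DC', 'Test', 'Record'])
            .sort_index())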

Rolling Status from Pandas Dataframe

+------+--------+------------+------------+---+---+---+
| area | locale | date | end date | i | t | o |
+------+--------+------------+------------+---+---+---+
| abc | abc25 | 2001-03-01 | 2001-04-01 | 1 | | |
| abc | abc25 | 2001-04-01 | 2001-05-01 | 1 | | |
| abc | abc25 | 2001-05-01 | 2001-06-01 | 1 | | |
| abc | abc25 | 2001-06-01 | 2001-07-01 | | 1 | |
| abc | abc25 | 2001-07-01 | 2001-08-01 | | | 1 |
| abc | abc25 | 2001-08-01 | 2001-09-01 | | 1 | |
| abc | abc25 | 2001-09-01 | 2001-05-01 | | 1 | |
| abc | abc25 | 2001-10-01 | 2001-11-01 | | 1 | |
| abc | abc25 | 2001-11-01 | 2001-12-01 | | | 1 |
| abc | abc25 | 2001-12-01 | | | | 1 |
| def | def25 | 2001-03-01 | 2001-04-01 | | | 1 |
| def | def25 | 2001-04-01 | 2001-05-01 | | | 1 |
| def | def25 | 2001-05-01 | 2001-06-01 | | | 1 |
| def | def25 | 2001-06-01 | 2001-07-01 | | 1 | |
| def | def25 | 2001-07-01 | 2001-08-01 | | 1 | |
| def | def25 | 2001-08-01 | 2001-09-01 | 1 | | |
| def | def25 | 2001-09-01 | 2001-05-01 | 1 | | |
| def | def25 | 2001-10-01 | 2001-11-01 | | 1 | |
| def | def25 | 2001-11-01 | 2001-12-01 | | | 1 |
| def | def25 | 2001-12-01 | | | | 1 |
+------+--------+------------+------------+---+---+---+
Here is a sample of the data I am working with. What I am attempting to do is add a status column. The status column is a bit tricky, though, and here are the criteria:
If any 2 periods of time are the same i/t/o then they get their associated status (let's say R/Y/G)
If you have two different statuses you choose "best"
Example Output:
+------+--------+------------+------------+---+---+---+--------+
| area | locale | date | end date | i | t | o | Status |
+------+--------+------------+------------+---+---+---+--------+
| abc | abc25 | 2001-03-01 | 2001-04-01 | 1 | | | NONE |
| abc | abc25 | 2001-04-01 | 2001-05-01 | 1 | | | R |
| abc | abc25 | 2001-05-01 | 2001-06-01 | 1 | | | R |
| abc | abc25 | 2001-06-01 | 2001-07-01 | | 1 | | Y |
| abc | abc25 | 2001-07-01 | 2001-08-01 | | | 1 | G |
| abc | abc25 | 2001-08-01 | 2001-09-01 | | 1 | | G |
| abc | abc25 | 2001-09-01 | 2001-05-01 | | 1 | | Y |
| abc | abc25 | 2001-10-01 | 2001-11-01 | | 1 | | Y |
| abc | abc25 | 2001-11-01 | 2001-12-01 | | | 1 | G |
| abc | abc25 | 2001-12-01 | | | | 1 | G |
| def | def25 | 2001-03-01 | 2001-04-01 | | | 1 | NONE |
| def | def25 | 2001-04-01 | 2001-05-01 | | | 1 | G |
| def | def25 | 2001-05-01 | 2001-06-01 | | | 1 | G |
| def | def25 | 2001-06-01 | 2001-07-01 | | 1 | | G |
| def | def25 | 2001-07-01 | 2001-08-01 | | 1 | | Y |
| def | def25 | 2001-08-01 | 2001-09-01 | 1 | | | Y |
| def | def25 | 2001-09-01 | 2001-05-01 | 1 | | | R |
| def | def25 | 2001-10-01 | 2001-11-01 | | 1 | | Y |
| def | def25 | 2001-11-01 | 2001-12-01 | | | 1 | G |
| def | def25 | 2001-12-01 | | | | 1 | G |
+------+--------+------------+------------+---+---+---+--------+
Now I looked up pandas rolling, but that might not be the best approach; I tried the following:
df.groupby('locale')['o'].rolling(2).sum()
which works on its own, but I can't seem to create a column out of it so I can say if that == 2 then it is whatever status. I also tried to just use this in an if statement:
if df.groupby('locale')['o'].rolling(2).sum() == 2.0:
    df['locale_status'] = 'Green'

This gives an error about the truth value of a Series.
I also tried:

if df.groupby('locale')['o'] == df.groupby('locale')['o'].shift():
    df['test'] = 'Green'

This results in an invalid type comparison.
I don't think this problem lends itself to vectorization/Pandas efficiency, but I'd love to be proven wrong by one of the ninjas on here. My solution involves some prep from pd.read_clipboard() you probably don't need.
Basically I replaced blanks with 0, used idxmax to get the 'current' letter, and found if there's a streak. I then looped through the rows to find the 'best' or 'streak', inside a groupby.
import numpy as np
import pandas as pd

# data cleaning - from clipboard, prob irrelevant to OP
df = pd.read_clipboard(sep='|', engine='python', header=1)
df = df.reset_index().iloc[1:-1, 1:-1]
df = df.rename(columns={' i ': 'i', ' t ': 't', ' o ': 'o'})
df = df.drop('Unnamed: 0', 1)
df = df.replace(' ', 0)

df['current'] = df[['i', 't', 'o']].astype(int).idxmax(1)
df['streak'] = df['current'] == df['current'].shift(1)

weights = {'i': 0, 't': 1, 'o': 2}
results = []
for val in df[' area '].unique():
    temp = df.loc[df.groupby(' area ').groups[val]].reset_index(drop=True)
    winner = []
    for idx, row in temp.iterrows():
        if idx == 0:
            winner.append(np.nan)
        else:
            current = row['current']
            if row['streak']:
                winner.append(current)
            else:
                last = temp.loc[idx - 1, 'current']
                if weights[last] > weights[current]:
                    winner.append(last)
                else:
                    winner.append(current)
    temp['winner'] = winner
    results.append(temp)

res = pd.concat(results)
res['winner'] = res['winner'].map({'i': 'R', 't': 'Y', 'o': 'G'})
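For what it's worth, a more vectorized sketch of the same rule, reusing the 'current' helper column and the ' area ' column name from the code above (an assumption-laden alternative, not the answer's approach):

# previous period's letter within each area; the first row per area stays NaN
prev = df.groupby(' area ')['current'].shift(1)
w = {'i': 0, 't': 1, 'o': 2}
# keep the current letter unless the previous one ranks strictly higher
best = np.where(df['current'].map(w) >= prev.map(w), df['current'], prev)
df['winner'] = pd.Series(best, index=df.index).where(prev.notna()).map({'i': 'R', 't': 'Y', 'o': 'G'})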
