I'm trying something I've never done before and I'm in need of some help.
Basically, I need to filter sections of a pandas DataFrame, transpose each filtered section, and then concatenate the resulting sections together.
Here's a representation of my dataframe:
df:
id | text_field | text_value
1  | Date       | 2021-06-23
1  | Hour       | 10:50
2  | Position   | City
2  | Position   | Countryside
3  | Date       | 2021-06-22
3  | Hour       | 10:45
I can then use some filtering method to isolate parts of my data:
df.groupby('id').filter(lambda x: True)
test = df.query(' id == 1 ')
test = test[["text_field","text_value"]]
test_t = test.set_index("text_field").T
test_t:
text_field | Date | Hour
text_value | 2021-06-23 | 10:50
If I repeat the process looking for rows with id == 3 and then concatenate the result with test_t, I'll have the following:
text_field | Date | Hour
text_value | 2021-06-23 | 10:50
text_value | 2021-06-22 | 10:45
I'm aware that performing this with rows where id == 2 will give me other columns, and that's alright; it's what I want as well.
What I can't figure out is how to do this for every "id" in my DataFrame. I wasn't able to create a function or for loop that works. Can somebody help me?
To summarize:
1 - I need to separate my DataFrame into sections according to the values in the "id" column
2 - After that, I need to remove the "id" column and transpose each section
3 - Finally, I need to concatenate every resulting DataFrame into one big DataFrame
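A literal sketch of those three steps, for reference (a minimal loop version; it assumes duplicate text_field values within one id, like Position for id == 2, should keep their first value, which matches the pivot_table answer that follows):

import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3, 3],
    'text_field': ['Date', 'Hour', 'Position', 'Position', 'Date', 'Hour'],
    'text_value': ['2021-06-23', '10:50', 'City', 'Countryside',
                   '2021-06-22', '10:45'],
})

pieces = []
for key, g in df.groupby('id'):
    t = g.set_index('text_field')['text_value']   # steps 1+2: isolate section, index by field
    t = t[~t.index.duplicated()]                  # keep first value per field (assumption above)
    pieces.append(t.rename(key))                  # row label = the id
result = pd.DataFrame(pieces)                     # step 3: stack the transposed sections
print(result)

For id == 2 this keeps only City; the pivot_table answer below behaves the same way via aggfunc='first'.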
You can use pivot_table:
df.pivot_table(
index='id', columns='text_field', values='text_value', aggfunc='first')
Output:
text_field Date Hour Position
id
1 2021-06-23 10:50 NaN
2 NaN NaN City
3 2021-06-22 10:45 NaN
It's not exactly clear how you want to deal with repeating values, though; it would be great to have some description of that (id == 2 would make a good example).
Update: If you want to ignore the ids and simply concatenate all the values:
pd.DataFrame(df.groupby('text_field')['text_value'].apply(list).to_dict())
Output:
Date Hour Position
0 2021-06-23 10:50 City
1 2021-06-22 10:45 Countryside
Related
I'm trying to convert some VBA scripts into a Python script, and I have been having trouble figuring a few things out, as the results seem different from what the Excel file gives.
So I have an example dataframe like this :
|Name     | A_Date    |
_______________________
|RAHEAL   | 04/30/2020|
|GIFTY    | 05/31/2020|
|ERIC     | 03/16/2020|
|PETER    | 05/01/2020|
|EMMANUEL | 12/15/2019|
|BABA     | 05/23/2020|
and I would want to achieve this result (VBA script result):
|Name     | A_Date    | Sold
____________________________
|RAHEAL   | 04/30/2020| No
|GIFTY    | 05/31/2020| Yes
|ERIC     | 03/16/2020| No
|PETER    | 05/01/2020| Yes
|EMMANUEL | 12/15/2019| No
|BABA     | 05/23/2020| Yes
By converting this VBA script :
Range("C2").Select
Selection.Value = _
"=IF(RC[-1]>=(INT(R2C2)-DAY(INT(R2C2))+1),""Yes"",""No"")"
Selection.AutoFill Destination:=Range("C2", "C" & Cells(Rows.Count, 1).End(xlUp).Row)
Range("C1").Value = "Sold"
ActiveSheet.Columns("C").Copy
ActiveSheet.Columns("C").PasteSpecial xlPasteValues
Simply: =IF(B2>=(INT($B$2)-DAY(INT($B$2))+1),"Yes","No")
To this Python script:
sales['Sold']=np.where(sales['A_Date']>=(sales['A_Date'] - pd.to_timedelta(sales.A_Date.dt.day, unit='d'))+ timedelta(days=1),'Yes','No')
But I keep getting "Yes" throughout... could anyone help me spot where I might have made a mistake?
import pandas as pd
df = pd.DataFrame({'Name':['RAHEAL','GIFTY','ERIC','PETER','EMMANUEL','BABA'],
'A_Date':['04/30/2020','05/31/2020','03/16/2020',
'05/01/2020','12/15/2019','05/23/2020']})
df['A_Date'] = pd.to_datetime(df['A_Date'])
print(df)
# True when A_Date falls on/after the first day of the month of the first date (B2)
df['Sold'] = df['A_Date'] >= df['A_Date'].iloc[0].replace(day=1)
df['Sold'] = df['Sold'].map({True:'Yes', False:'No'})
print(df)
output:
Name A_Date
0 RAHEAL 2020-04-30
1 GIFTY 2020-05-31
2 ERIC 2020-03-16
3 PETER 2020-05-01
4 EMMANUEL 2019-12-15
5 BABA 2020-05-23
Name A_Date Sold
0 RAHEAL 2020-04-30 Yes
1 GIFTY 2020-05-31 Yes
2 ERIC 2020-03-16 No
3 PETER 2020-05-01 Yes
4 EMMANUEL 2019-12-15 No
5 BABA 2020-05-23 Yes
If I read the formula right, it checks whether the A_Date value is >= 04/01/2020 (i.e. the first day of the month for the date in B2), so RAHEAL should be Yes too.
I don't know if you noticed (and whether this is intended), but if an A_Date value has a fractional part (i.e. a time), there is room for error when you calculate the value for the 1st of the month. If the time in B2 is, let's say, 10:00 AM, the cutoff value will be 04/01/2020 10:00. Then another value of, say, 04/01/2020 09:00 will be evaluated as False/No. This is how it works in your Excel formula as well.
EDIT (12 Jan 2021): Note, values in column A_Date are of type datetime.datetime or datetime.date. Presumably they are converted when reading the Excel file or explicitly afterwards.
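A minimal sketch of guarding against that pitfall, assuming the column holds pandas Timestamps (the sample values with times are hypothetical): normalize both sides to midnight before comparing.

import numpy as np
import pandas as pd

# hypothetical values with a time component, to show the pitfall
sales = pd.DataFrame({'A_Date': pd.to_datetime(
    ['2020-04-30 10:00', '2020-04-01 09:00', '2020-03-16 00:00'])})

# cutoff = first day of the month of the first date, at midnight
cutoff = sales['A_Date'].iloc[0].normalize().replace(day=1)
# normalize the compared values too, so 09:00 on the cutoff day still counts
sales['Sold'] = np.where(sales['A_Date'].dt.normalize() >= cutoff, 'Yes', 'No')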
I'm very much embarrassed I didn't see the simple, elegant solution that buran gave (+1). I did more of a literal translation.
first_date.toordinal() - 693594 is the integer date value for your initial date, current_date.toordinal() - 693594 is the integer date value for the current iteration of the dates column. I apply your cell formula logic to each A_Date row value and output as the corresponding Sold column value.
import pandas as pd
from datetime import datetime
def is_sold(current_date: datetime, first_date: datetime, day_no: int) -> str:
    # use of toordinal idea from #rjha94 https://stackoverflow.com/a/47478659
    # subtracting 693594 converts a Python ordinal to an Excel serial date number
    if current_date.toordinal() - 693594 >= first_date.toordinal() - 693594 - day_no + 1:
        return "Yes"
    else:
        return "No"

sales = pd.DataFrame({'Name': ['RAHEAL', 'GIFTY', 'ERIC', 'PETER', 'EMMANUEL', 'BABA'],
                      'A_Date': ['2020-04-30', '2020-05-31', '2020-03-16',
                                 '2020-05-01', '2019-12-15', '2020-05-23']})
sales['A_Date'] = pd.to_datetime(sales['A_Date'], errors='coerce')
# day_no is DAY($B$2) in the Excel formula, i.e. the day of the *first* date,
# so pass first.day rather than each row's own day
first = sales['A_Date'].iloc[0]
sales['Sold'] = sales['A_Date'].apply(lambda x: is_sold(x, first, first.day))
print(sales)
I have a dataframe and in one of its columns I need to pull out specific text and place it into its own column. From the dataframe below I need to take elements of the LAUNCH column and add them into a new column next to it; specifically, I need to extract the date from the rows that provide it, for example 'Mar-24'.
df =
|LAUNCH
0|Step-up Mar-24:x1.5
1|unknown
2|NTV:62.1%
3|Step-up Aug-23:N/A,
I would like the output to be something like this:
df =
|LAUNCH |DATE
0|Step-up Mar-24:x1.5 | Mar-24
1|unknown | nan
2|NTV:62.1% | nan
3|Step-up Aug-23:N/A, | Aug-23
And if this can be done, would it also be possible to display the date as something like 24-03-01 (yy-mm-dd) rather than Mar-24?
One way is to use str.extract, looking for a match on any month abbreviation:
months = (pd.to_datetime(pd.Series([*range(1, 13)]), format='%m')
          .dt.month_name()
          .str[:3]
          .values.tolist())
pat = rf"((?:{'|'.join(months)})-\d+)"
# '((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\\d+)'
df['DATE'] = df.LAUNCH.str.extract(pat)
print(df)
LAUNCH DATE
0 Step-up Mar-24:x1.5 Mar-24
1 unknown NaN
2 NTV:62.1% NaN
3 Step-up Aug-23:N/A Aug-23
Use str.extract with a named capturing group.
The code to add a new column with the extracting result can be e.g.:
df = pd.concat([df, df.LAUNCH.str.extract(
r'(?P<DATE>(?:Jan|Feb|Ma[ry]|Apr|Ju[nl]|Aug|Sep|Oct|Nov|Dec)-\d{2})')],
axis=1, sort=False)
The result, for your data, is:
LAUNCH DATE
0 Step-up Mar-24:x1.5 Mar-24
1 unknown NaN
2 NTV:62.1% NaN
3 Step-up Aug-23:N/A, Aug-23
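Regarding the follow-up about reformatting: assuming the two-digit part of 'Mar-24' is a year, the DATE column produced by either answer can be parsed with %b-%y and reformatted (a sketch; DATE_FMT is a hypothetical new column name):

# 'Mar-24' -> Timestamp('2024-03-01') -> '24-03-01'; NaN rows stay missing
df['DATE_FMT'] = (pd.to_datetime(df['DATE'], format='%b-%y')
                    .dt.strftime('%y-%m-%d'))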
I have a dataframe that is of the following type. I have all the columns except the final column, "Total_Previous_Points_P1", which I am hoping to create:
The data is sorted by the "Date" column.
Date     | Points_P1 | P1_id | P2_id | Total_Previous_Points_P1
---------+-----------+-------+-------+-------------------------
10/08/15 | 5         | 100   | 90    | 500
11/09/16 | 5         | 100   | 90    | 500
20/09/19 | 10        | 10000 | 360   | 4,200
...      |           | ...   | ...   | ...
n        |           |       |       |
Now, the column I want to create is the "Total_Previous_Points_P1" column shown above.
The way to create it:
For each row, take its date (call this DATE_VAL) and its P1_id (call this ID_VAL).
Then, for all rows with a date before DATE_VAL AND with P1_id == ID_VAL, sum up their points.
Put this sum in the final column, in the current row.
Is there a fast pandas pythonic way to do this? My data set is very large.
Thank you!
The solution by SIA computes the sum of Points_P1 including the
current value of Points_P1, whereas the requirement is to sum
previous points (for all rows before...).
Assuming that dates in each group are unique (in your sample they are),
the proper, pandasonic solution should include the following steps:
Sort by Date.
Group by P1_id, then for each group:
Take Points_P1 column.
Compute cumulative sum.
Subtract the current value of Points_P1.
So the whole code should be:
# cumsum keeps the original index, so the result aligns row-by-row with
# df.Points_P1 even though the frame was sorted by Date first
df['Total_Previous_Points_P1'] = df.sort_values('Date')\
    .groupby(['P1_id']).Points_P1.cumsum() - df.Points_P1
Edit
If Date is not unique (within a group of rows with the same P1_id), the case
is more complicated, as can be shown on the following source DataFrame:
Date Points_P1 P1_id
0 2016-11-09 5 100
1 2016-11-09 3 100
2 2015-10-08 5 100
3 2019-09-20 10 10000
4 2019-09-21 7 100
5 2019-07-10 12 10000
6 2019-12-10 12 10000
Note that for P1_id == 100 there are two rows for 2016-11-09.
In this case, start by computing "group" sums of previous points,
for each P1_id and Date:
sumPrev = df.groupby(['P1_id', 'Date']).Points_P1.sum()\
.groupby(level=0).apply(lambda gr: gr.shift(fill_value=0).cumsum())\
.rename('Total_Previous_Points_P1')
The result is:
P1_id Date
100 2015-10-08 0
2016-11-09 5
2019-09-21 13
10000 2019-07-10 0
2019-09-20 12
2019-12-10 22
Name: Total_Previous_Points_P1, dtype: int64
Then merge df with sumPrev on P1_id and Date (which in sumPrev form the index):
df = pd.merge(df, sumPrev, left_on=['P1_id', 'Date'], right_index=True)
To show the result, it is more instructive to sort df also on ['P1_id', 'Date']:
Date Points_P1 P1_id Total_Previous_Points_P1
2 2015-10-08 5 100 0
0 2016-11-09 5 100 5
1 2016-11-09 3 100 5
4 2019-09-21 7 100 13
5 2019-07-10 12 10000 0
3 2019-09-20 10 10000 12
6 2019-12-10 12 10000 22
As you can see:
The first sum for each P1_id is 0 (no points from previous dates).
E.g. for both rows with Date == 2016-11-09 the sum of previous
points is 5 (which comes from the row for Date == 2015-10-08).
Try:
df['Total_Previous_Points_P1'] = df.groupby(['P1_id'])['Points_P1'].cumsum()
How It Works
First, it groups the data using P1_id feature.
Then it accesses the Points_P1 values on the grouped dataframe and apply the cumulative sum function cumsum(), which returns the sum of points up to and including the current row for each group.
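As noted earlier in this thread, that cumulative sum includes the current row. A shift-based sketch that sums strictly previous rows instead (same caveat as above: dates are assumed unique within each P1_id group):

# Sort by date, then within each group shift down one row before cumsum,
# so each row only sums the rows strictly before it.
df['Total_Previous_Points_P1'] = (
    df.sort_values('Date')
      .groupby('P1_id')['Points_P1']
      .transform(lambda s: s.shift(fill_value=0).cumsum())
)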
I have a pandas DataFrame in Python to which I am applying a groupby. Then I want to apply a new groupby + sum on the previous result. To be more specific, first I am doing:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])\
    [['market', 'number_of_rooms']]
And then I want to do:
check_df = check_df.groupby(['market'])['number_of_rooms'].sum()
So, I am getting the following error:
AttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy'
objects, try using the 'apply' method
My initial data look like that:
hotel_code | market | number_of_rooms | ....
---------------------------------------------
001 | a | 200 | ...
001 | a | 200 |
002 | a | 300 | ...
Notice that I may have duplicates of pairs like (a - 200); that's why I need the first groupby.
What I want in the end is something like this:
Market | Rooms
--------------
a | 3000
b | 250
I'm just trying to translate the following sql query into python:
select a.market, sum(a.number_of_rooms)
from (
select market, number_of_rooms
from opinmind_dev..cg_mm_booking_dataset_full
group by hotel_code, market, number_of_rooms
) as a
group by market ;
Any ideas how I can fix that? If you need any more info, let me know.
PS: I am new to Python and data science.
IIUC, instead of:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])\
    [['market', 'number_of_rooms']]
You should simply do:
check_df = data_df.drop_duplicates(subset=['hotel_code', 'dp_id', 'market', 'number_of_rooms'])\
.loc[:, ['market', 'number_of_rooms']]\
.groupby('market')\
.sum()
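A quick end-to-end check of that pipeline with the sample rows from the question (the dp_id values here are hypothetical, since the question doesn't show them):

import pandas as pd

data_df = pd.DataFrame({
    'hotel_code': ['001', '001', '002'],
    'dp_id': [1, 1, 2],  # hypothetical; not shown in the question
    'market': ['a', 'a', 'a'],
    'number_of_rooms': [200, 200, 300],
})

check_df = (data_df
            .drop_duplicates(subset=['hotel_code', 'dp_id', 'market', 'number_of_rooms'])
            .loc[:, ['market', 'number_of_rooms']]
            .groupby('market')
            .sum())
print(check_df)
#         number_of_rooms
# market
# a                   500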
df = pd.DataFrame({'Market': [1,1,1,2,2,2,3,3], 'Rooms':range(8), 'C':np.random.rand(8)})
Market Rooms C
0 1 0 0.187793
1 1 1 0.325284
2 1 2 0.095147
3 2 3 0.296781
4 2 4 0.022262
5 2 5 0.201078
6 3 6 0.160082
7 3 7 0.683151
You need to move the column selection away from the grouped DataFrame. Either of the following should work.
df.groupby('Market').sum()[['Rooms']]
df[['Rooms']].groupby(df['Market']).sum()
Rooms
Market
1 3
2 12
3 13
If you select using ['Rooms'] instead of [['Rooms']] you will get a Series instead of a DataFrame.
The DataFrames produced use Market as their index. If you want to convert it back to a normal data column, use:
df.reset_index()
Market Rooms
0 1 3
1 2 12
2 3 13
If I understand your question correctly, you could simply do one of:
data_df.groupby('Market').agg({'Rooms': np.sum})
# or, keeping Market as a regular column:
data_df.groupby('Market', as_index=False).agg({'Rooms': np.sum})
import numpy as np
import pandas as pd

data_df = pd.DataFrame({'Market': ['A', 'B', 'C', 'B'],
                        'Hotel': ['H1', 'H2', 'H4', 'H5'],
                        'Rooms': [20, 40, 50, 34]
                        })
data_df.groupby('Market').agg({'Rooms': np.sum})
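For that sample frame, each market's rooms are summed (B is 40 + 34 = 74). Note that recent pandas versions warn about np.sum in agg; the string 'sum' is equivalent:

data_df.groupby('Market').agg({'Rooms': 'sum'})
#         Rooms
# Market
# A          20
# B          74
# C          50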
For some data preprocessing I have a huge dataframe where I need historical performance within groups. However, since it is for a predictive model that runs a week before the target, I cannot use any data from the week in between. There is a variable number of rows per day per group, which means I cannot always discard the last 7 values by using a shift on the expanding functions; I have to somehow condition on the datetime of the rows before each one. I can write my own function to apply to the groups, but in my experience that is usually very slow (albeit flexible). This is how I did it without conditioning on date, just looking at previous records:
df.loc[:, 'new_col'] = df_gr['old_col'].apply(lambda x: x.expanding(5).mean().shift(1))
The 5 means that I want a sample size of at least 5, otherwise the result should be NaN.
Small example with aggr_mean looking at the mean of all samples within group A at least a week earlier:
group | dt       | value | aggr_mean
A     | 01-01-16 | 5     | NaN
A     | 03-01-16 | 4     | NaN
A     | 08-01-16 | 12    | 5   (only looks at the first row)
A     | 17-01-16 | 11    | 7   (looks at the first three rows, since all are at least a week earlier)
new answer
using @JulienMarrec's better example
dt group value
2016-01-01 A 5
2016-01-03 A 4
2016-01-08 A 12
2016-01-17 A 11
2016-01-04 B 10
2016-01-05 B 5
2016-01-08 B 12
2016-01-17 B 11
Condition df to be more useful
d1 = df.drop(columns='group')  # drop('group', 1) in older pandas; the positional axis argument was removed in pandas 2.0
d1.index = [df.group, df.groupby('group').cumcount().rename('gidx')]
d1
create a custom function that does what old answer did. Then apply it within groupby
def lag_merge_asof(df, lag):
    # expanding mean of value, with dates shifted `lag` days forward so that
    # each row can only "see" values at least `lag` days older than itself
    d = df.set_index('dt').value.expanding().mean()
    d.index = d.index + pd.offsets.Day(lag)
    d = d.reset_index(name='aggr_mean')
    return pd.merge_asof(df, d, on='dt')
d1.groupby(level='group').apply(lag_merge_asof, lag=7)
we can get some formatting with this
d1.groupby(level='group').apply(lag_merge_asof, lag=7) \
.reset_index('group').reset_index(drop=True)
old answer
create a lookback dataframe by offsetting the dates by 7 days, then use it with pd.merge_asof
lookback = df.set_index('dt').value.expanding().mean()
lookback.index += pd.offsets.Day(7)
lookback = lookback.reset_index(name='aggr_mean')
lookback
pd.merge_asof(df, lookback, left_on='dt', right_on='dt')
Given this dataframe where I added another group in order to more clearly see what's happening:
dt group value
2016-01-01 A 5
2016-01-03 A 4
2016-01-08 A 12
2016-01-17 A 11
2016-01-04 B 10
2016-01-05 B 5
2016-01-08 B 12
2016-01-17 B 11
Let's load it:
df = pd.read_clipboard(index_col=0, sep='\s+', parse_dates=True)
Now we can use a groupby, resample daily, shift by 7 days, and take the expanding mean:
x = df.groupby('group')['value'].apply(lambda gp: gp.resample('1D').mean().shift(7).expanding().mean())
Now you can left-merge that back into your df:
merged = df.reset_index().set_index(['group','dt']).join(x, rsuffix='_aggr_mean', how='left')
merged
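For reference, a self-contained sketch of this answer with the sample data typed inline instead of read from the clipboard:

import pandas as pd

df = pd.DataFrame({
    'dt': pd.to_datetime(['2016-01-01', '2016-01-03', '2016-01-08', '2016-01-17',
                          '2016-01-04', '2016-01-05', '2016-01-08', '2016-01-17']),
    'group': list('AAAABBBB'),
    'value': [5, 4, 12, 11, 10, 5, 12, 11],
}).set_index('dt')

# daily resample per group, shift a week, then expanding mean
x = df.groupby('group')['value'].apply(
    lambda gp: gp.resample('1D').mean().shift(7).expanding().mean())

merged = df.reset_index().set_index(['group', 'dt']).join(
    x, rsuffix='_aggr_mean', how='left')
print(merged)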