Getting first non null value after group by function

I would like to return the first non null value of the utm_source column from each group after running a group by function.
This is the code I have written:
file[file['steps'] == 'Sign-ups'].sort_values(by=['ts']).groupby('anonymous_id')['utm_source'].apply(lambda x: x.first_valid_index())
This seems to return this:
anonymous_id
00003df1-be12-47b8-b3b8-d01c84a22fdf NaN
00009cc0-279f-4ccf-aea4-f6af1f2bb75a NaN
0000a6a0-00bc-475f-a9e5-9dcbb4309e78 NaN
0000c906-7060-4521-8090-9cd600b08974 638.0
0000c924-5959-4e2d-8757-0d10f96ca462 NaN
0000dc27-292c-4676-8a1b-4977f2ad1577 275.0
0000df7e-2579-4071-8aa5-814ab294bf9a 419.0
I am not quite sure what the values associated with the anon_id's are.
Here is a sample of my data:
{'anonymous_id': {0: '0000f8ea-3aa6-4423-9247-1d9580d378e1',
1: '00015d49-2cd8-41b1-bbe7-6aedbefdb098',
2: '0002226e-26a4-4f55-9578-2eff2999de7e',
3: '00022b83-240e-4ef9-aaad-ac84064bb902',
4: '00022b83-240e-4ef9-aaad-ac84064bb902'},
'ts': {0: '2018-04-11 06:59:20.206000',
1: '2019-05-18 05:59:11.874000',
2: '2018-09-10 18:19:25.260000',
3: '2017-10-11 08:20:18.092000',
4: '2017-10-11 08:20:31.466000'},
'utm_source': {0: nan, 1: 'facebook', 2: 'facebook', 3: nan, 4: nan},
'rank': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2},
'steps': {0: 'Sign-ups', 1: nan, 2: nan, 3: nan, 4: nan}}
So for each anonymous_id I would like to return the first non-null utm_source (chronologically, sorted by the ts column) associated with that anon_id.

If I understand correctly, you can first drop the null values and then take the first row of each group with groupby(...).first():
df.sort_values('ts').dropna(subset=['utm_source']).groupby('anonymous_id')['utm_source'].first()
Output for your example data:
anonymous_id
00015d49-2cd8-41b1-bbe7-6aedbefdb098 facebook
0002226e-26a4-4f55-9578-2eff2999de7e facebook
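As for the numbers in the question's output: first_valid_index() returns the index label of the first non-null row, not the value stored there, which is why integers such as 638.0 show up. The sketch below uses a small made-up frame (the ids "a", "b", "c" and their values are invented for illustration) to compare that result with two ways of getting the value itself. GroupBy.first() already skips nulls, so it can also be called without dropna when you want to keep the all-null groups as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three ids, one of which has no utm_source at all.
df = pd.DataFrame({
    "anonymous_id": ["a", "a", "a", "b", "b", "c"],
    "ts": pd.to_datetime([
        "2018-01-03", "2018-01-01", "2018-01-02",
        "2018-02-02", "2018-02-01", "2018-03-01",
    ]),
    "utm_source": ["google", np.nan, "facebook", np.nan, "twitter", np.nan],
})

ordered = df.sort_values("ts")
by_id = ordered.groupby("anonymous_id")["utm_source"]

# first_valid_index() gives the row label of the first non-null entry.
labels = by_id.apply(lambda s: s.first_valid_index())

# Dropping nulls first removes ids that never had a utm_source.
first_dropna = (
    ordered.dropna(subset=["utm_source"])
    .groupby("anonymous_id")["utm_source"]
    .first()
)

# first() skips nulls on its own and keeps all-null ids as NaN.
first_keep = by_id.first()

print(labels)
print(first_dropna)
print(first_keep)
```

Which variant fits depends on whether ids without any utm_source should appear in the result.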

Related

How to get categories of words containing unique 3-letter set from the columns of pandas dataframe in Python?

I have a dataframe df which looks as
Unnamed: 0 Characters Split A B C D Set Names
0 FROKDUWJU [FRO, KDU, WJU] FRO KDU WJU NaN {WJU, KDU, FRO}
1 IDJWPZSUR [IDJ, WPZ, SUR] IDJ WPZ SUR NaN {SUR, WPZ, IDJ}
2 UCFURKIRODCQ [UCF, URK, IRO, DCQ] UCF URK IRO DCQ {UCF, URK, DCQ, IRO}
3 ORI [ORI] ORI NaN NaN NaN {ORI}
4 PROIRKIQARTIBPO [PRO, IRK, IQA, RTI, BPO] PRO IRK IQA RTI {IQA, BPO, PRO, IRK, RTI}
5 QAZWREDCQIBR [QAZ, WRE, DCQ, IBR] QAZ WRE DCQ IBR {DCQ, QAZ, IBR, WRE}
6 PLPRUFSWURKI [PLP, RUF, SWU, RKI] PLP RUF SWU RKI {PLP, SWU, RKI, RUF}
7 FROIEUSKIKIR [FRO, IEU, SKI, KIR] FRO IEU SKI KIR {SKI, IEU, KIR, FRO}
8 ORIUWJZSRFRO [ORI, UWJ, ZSR, FRO] ORI UWJ ZSR FRO {UWJ, ORI, ZSR, FRO}
9 URKIFJVUR [URK, IFJ, VUR] URK IFJ VUR NaN {URK, VUR, IFJ}
10 RUFOFR [RUF, OFR] RUF OFR NaN NaN {OFR, RUF}
11 IEU [IEU] IEU NaN NaN NaN {IEU}
12 PIMIEU [PIM, IEU] PIM IEU NaN NaN {PIM, IEU}
The first column contains certain names. The Characters Split column contains the name split into every 3 letters in the form of a list. Columns A, B, C, and D contain the breakdown of those 3-letters. Column Set Names have the same 3-letters but in the form of a set.
Some of the 3-letter chunks are shared by several names. For example, "FRO" is present in the names at index 0, 7 and 8. I'd like to put names that have a 3-letter chunk in common into one category, preferably in the form of a list. Is it possible to get such a category for each unique 3-letter chunk? What would be a suitable way to do it?
df.to_dict() is as shown:
{'Unnamed: 0': {0: 'FROKDUWJU',
1: 'IDJWPZSUR',
2: 'UCFURKIRODCQ',
3: 'ORI',
4: 'PROIRKIQARTIBPO',
5: 'QAZWREDCQIBR',
6: 'PLPRUFSWURKI',
7: 'FROIEUSKIKIR',
8: 'ORIUWJZSRFRO',
9: 'URKIFJVUR',
10: 'RUFOFR',
11: 'IEU',
12: 'PIMIEU'},
'Characters Split': {0: ['FRO', 'KDU', 'WJU'],
1: ['IDJ', 'WPZ', 'SUR'],
2: ['UCF', 'URK', 'IRO', 'DCQ'],
3: ['ORI'],
4: ['PRO', 'IRK', 'IQA', 'RTI', 'BPO'],
5: ['QAZ', 'WRE', 'DCQ', 'IBR'],
6: ['PLP', 'RUF', 'SWU', 'RKI'],
7: ['FRO', 'IEU', 'SKI', 'KIR'],
8: ['ORI', 'UWJ', 'ZSR', 'FRO'],
9: ['URK', 'IFJ', 'VUR'],
10: ['RUF', 'OFR'],
11: ['IEU'],
12: ['PIM', 'IEU']},
'A': {0: 'FRO',
1: 'IDJ',
2: 'UCF',
3: 'ORI',
4: 'PRO',
5: 'QAZ',
6: 'PLP',
7: 'FRO',
8: 'ORI',
9: 'URK',
10: 'RUF',
11: 'IEU',
12: 'PIM'},
'B': {0: 'KDU',
1: 'WPZ',
2: 'URK',
3: nan,
4: 'IRK',
5: 'WRE',
6: 'RUF',
7: 'IEU',
8: 'UWJ',
9: 'IFJ',
10: 'OFR',
11: nan,
12: 'IEU'},
'C': {0: 'WJU',
1: 'SUR',
2: 'IRO',
3: nan,
4: 'IQA',
5: 'DCQ',
6: 'SWU',
7: 'SKI',
8: 'ZSR',
9: 'VUR',
10: nan,
11: nan,
12: nan},
'D': {0: nan,
1: nan,
2: 'DCQ',
3: nan,
4: 'RTI',
5: 'IBR',
6: 'RKI',
7: 'KIR',
8: 'FRO',
9: nan,
10: nan,
11: nan,
12: nan},
'Set Names': {0: {'FRO', 'KDU', 'WJU'},
1: {'IDJ', 'SUR', 'WPZ'},
2: {'DCQ', 'IRO', 'UCF', 'URK'},
3: {'ORI'},
4: {'BPO', 'IQA', 'IRK', 'PRO', 'RTI'},
5: {'DCQ', 'IBR', 'QAZ', 'WRE'},
6: {'PLP', 'RKI', 'RUF', 'SWU'},
7: {'FRO', 'IEU', 'KIR', 'SKI'},
8: {'FRO', 'ORI', 'UWJ', 'ZSR'},
9: {'IFJ', 'URK', 'VUR'},
10: {'OFR', 'RUF'},
11: {'IEU'},
12: {'IEU', 'PIM'}}}

You can explode 'Set Names', then groupby the exploded columns and merge the 'Unnamed: 0' into a list per group:
(df.explode('Set Names')
.groupby('Set Names')
['Unnamed: 0'].apply(list)
)
output:
Set Names
BPO [PROIRKIQARTIBPO]
DCQ [UCFURKIRODCQ, QAZWREDCQIBR]
FRO [FROKDUWJU, FROIEUSKIKIR, ORIUWJZSRFRO]
IBR [QAZWREDCQIBR]
IDJ [IDJWPZSUR]
... ...
WJU [FROKDUWJU]
WPZ [IDJWPZSUR]
WRE [QAZWREDCQIBR]
ZSR [ORIUWJZSRFRO]
If you want to filter the output to keep only groups with a minimum number of items (here, more than one):
(df.explode('Set Names')
.groupby('Set Names')
['Unnamed: 0'].apply(lambda g: list(g) if len(g) > 1 else None)
.dropna()
)
output:
Set Names
DCQ [UCFURKIRODCQ, QAZWREDCQIBR]
FRO [FROKDUWJU, FROIEUSKIKIR, ORIUWJZSRFRO]
IEU [FROIEUSKIKIR, IEU, PIMIEU]
ORI [ORI, ORIUWJZSRFRO]
RUF [PLPRUFSWURKI, RUFOFR]
URK [UCFURKIRODCQ, URKIFJVUR]
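For reference, here is a self-contained sketch of the same explode-then-groupby idea on a few names taken from the question. The column names "name" and "chunks" are made up for the example, and the chunks are built as lists rather than sets so the exploded order is deterministic. The filter uses a boolean mask on the list lengths instead of returning None from the lambda, which is one possible alternative.

```python
import pandas as pd

# A handful of names from the question; "name" and "chunks" are illustrative column names.
df = pd.DataFrame({"name": ["FROKDUWJU", "FROIEUSKIKIR", "IEU", "ORI"]})

# Split every name into consecutive 3-letter chunks.
df["chunks"] = df["name"].apply(
    lambda s: [s[i:i + 3] for i in range(0, len(s), 3)]
)

# One row per (name, chunk) pair, then collect the names per chunk.
groups = df.explode("chunks").groupby("chunks")["name"].apply(list)

# Keep only the chunks shared by more than one name.
shared = groups[groups.map(len) > 1]
print(shared)
```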

df.apply() raises IndexingError: Unalignable boolean Series provided as indexer

I am performing df.apply() on a dataframe and I am getting the following error:
IndexingError: ('Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).', 'occurred at index 4061')
This error comes from the following line of my df (at index 4061)
The relevant code is:
i = pd.DataFrame()
i = df1.apply(
lambda row: i.append(
df.loc[
(df1["ID"] == row["ID"])
& (df1["Date"] >= (row["Date"] + timedelta(-5)))
& (df1["Date"] <= (row["Date"] + timedelta(20)))
],
ignore_index=True,
inplace=True,
)
if row["Flag"] == 1
else None,
axis=1,
)
And an example of the first 5 rows of the df on which I am using the function:
{'ID': {1: 'A US Equity',
2: 'A US Equity',
3: 'A US Equity',
4: 'A US Equity',
5: 'A US Equity'},
'Date': {1: Timestamp('2020-12-22 00:00:00'),
2: Timestamp('2020-12-23 00:00:00'),
3: Timestamp('2020-12-24 00:00:00'),
4: Timestamp('2020-12-28 00:00:00'),
5: Timestamp('2020-12-29 00:00:00')},
'PX_Last': {1: 117.37, 2: 117.3, 3: 117.31, 4: 117.83, 5: 117.23},
'Short_Int': {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0},
'Total_Call_Volume': {1: 187.0, 2: 353.0, 3: 141.0, 4: 467.0, 5: 329.0},
'Total_Put_Volume': {1: 54.0, 2: 30.0, 3: 218.0, 4: 282.0, 5: 173.0},
'Put_OI': {1: 13354.0, 2: 13350.0, 3: 13522.0, 4: 13678.0, 5: 13785.0},
'Call_OI': {1: 8923.0, 2: 8943.0, 3: 8973.0, 4: 9075.0, 5: 9040.0},
'pct_chng': {1: -0.34810663949736975,
2: -0.059640453267451043,
3: 0.008525149190119485,
4: 0.4432699684596253,
5: -0.5092081812781091},
'Short_Int_Category': {1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Put/Call': {1: 0.2887700534759358,
2: 0.08498583569405099,
3: 1.5460992907801419,
4: 0.6038543897216274,
5: 0.5258358662613982},
'10% + Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'10%-20% Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'20%-30% Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'30% + Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'Time_to_pop': {1: nan, 2: nan, 3: nan, 4: nan, 5: nan}}
The row at index 4061 that is causing the error is:
ID ADI US Equity
Date 2021-02-24 00:00:00
PX_Last 161.76
Short_Int 15.1847
Total_Call_Volume 52502
Total_Put_Volume 1929
Put_OI 32219
Call_OI 45557
pct_chng 2.57451
Short_Int_Category 15-20
Put/Call 0.0367415
10% + Pop Flag 0
10%-20% Pop Flag 0
20%-30% Pop Flag 0
30% + Pop Flag 0
Flag 1
Time_to_pop NaN
Name: 4061, dtype: object
How do I perform the function without getting the error mentioned above?
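One likely cause, judging only from the code shown: the boolean mask is built from df1 but is used to index df, so the two indexes do not line up. Separately, DataFrame.append has no inplace argument (and was removed in pandas 2.0). The sketch below is a possible rewrite under the assumption that the window should be taken from df1 itself. The sample data is invented, and the rows are collected in a list and combined with pd.concat.

```python
from datetime import timedelta

import pandas as pd

# Hypothetical sample with the three columns the filter relies on.
df1 = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime(
        ["2021-01-01", "2021-01-10", "2021-03-01", "2021-01-05", "2021-01-06"]
    ),
    "Flag": [0, 1, 0, 0, 1],
})

windows = []
for _, row in df1[df1["Flag"] == 1].iterrows():
    # The mask is built from df1 and applied to df1, so the indexes always match.
    mask = (
        (df1["ID"] == row["ID"])
        & (df1["Date"] >= row["Date"] - timedelta(days=5))
        & (df1["Date"] <= row["Date"] + timedelta(days=20))
    )
    windows.append(df1.loc[mask])

# pd.concat raises on an empty list, so guard this if no row may be flagged.
result = pd.concat(windows, ignore_index=True)
print(result)
```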

Extract strings from a Dataframe looping over a single row

I'm reading multiple PDFs (using tabula) into data frames like this:
nan = float('nan')
DataFrame_as_dict = {'Unnamed: 0': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Unnamed: 1': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'NOTA DE NEGOCIAÇÃO': {0: nan,
1: nan,
2: 'Rico Investimentos - Grupo XP',
3: 'Av. Presidente Juscelino Kubitschek - Torre Sul, 1909 - 25o ANDAR VILA OLIMPIA 4543-907',
4: 'Tel. 3003-5465Fax: (55 11) 4007-2465',
5: 'Internet: www.rico.com.vc SAC: 0800-774-0402 e-mail: atendimento#rico.com.vc'},
'Unnamed: 3': {0: 'Nr. nota Folha',
1: '330736 1',
2: nan,
3: 'SÃO PAULO - SP',
4: nan,
5: nan},
'Unnamed: 4': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Unnamed: 5': {0: 'Data pregão',
1: '30/09/2015',
2: nan,
3: nan,
4: nan,
5: nan}}
df = pd.DataFrame(DataFrame_as_dict)
My intention is to store the value '330736 1' in the variable "number" and '30/09/2015' in a variable "date".
The issue is that, although these values will always be located in row 1, the columns vary in an unpredictable way across the multiple PDFs.
Therefore, I tried to loop over the columns of row 1 in order to extract these values regardless of which columns they are in:
list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1:2,i], str):
        if df.iloc[1:2,i].str.contains("/",na=False,regex=False).any():
            date = str(df.iloc[1:2,i]).strip()
        else:
            n_nota = str(df.iloc[1:2,i]).strip()
However, without success... Any thoughts?

In your original code, if isinstance(df.iloc[1:2,i], str) will never evaluate to True for two reasons:
Strings inside DataFrames are of type object
df.iloc[1:2,i] will always be a pandas Series.
Since object is such a flexible type, it's not as useful as str for identifying the data you want. In the code below, I simply used a space character to differentiate the data you want for n_nota. If this doesn't work with your data, a regex pattern may be a good approach.
list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1:2,i].values, object):
        (df.iloc[1:2,i].values)
        if "/" in str(df.iloc[1:2,i].values):
            date = str(df.iloc[1:2,i].values[0]).strip()
        elif " " in str(df.iloc[1:2,i].values):
            n_nota = str(df.iloc[1:2,i].values[0]).strip()
Edit: As noted below, the original code in the question text would have worked if each df.iloc[1:2,i] were replaced with df.iloc[1,i] as in:
list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1,i], str):
        if df.iloc[1:2,i].str.contains("/",na=False,regex=False).any():
            date = str(df.iloc[1,i]).strip()
        else:
            n_nota = str(df.iloc[1,i]).strip()
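A loop-free variant is also possible. This is only a sketch, assuming (as in the question) that row 1 holds exactly one value containing "/" and one other non-null value. The frame is a reduced copy of the question's data, and the name is_date is introduced here for illustration.

```python
import pandas as pd

nan = float("nan")
# Reduced copy of the question's frame: only three of the columns.
df = pd.DataFrame({
    "Unnamed: 3": ["Nr. nota Folha", "330736 1", nan],
    "Unnamed: 4": [nan, nan, nan],
    "Unnamed: 5": ["Data pregão", "30/09/2015", nan],
})

# Take row 1 as a Series, drop the empty cells, and normalise to stripped strings.
row = df.iloc[1].dropna().astype(str).str.strip()

# The date is the cell that contains a slash; the note number is the other one.
is_date = row.str.contains("/", regex=False)
date = row[is_date].iloc[0]
n_nota = row[~is_date].iloc[0]

print(date, n_nota)
```

If row 1 can contain further non-null cells, the second selection needs a stricter rule, for example a regex on the expected number format.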

Replace cell in DF where cell of DF is index of row desired in other DF

In df1, each cell value is the index of the row I want from df2.
I would like to replace each cell with the trial_ms value of the matching row in df2, and then rename each column in df1 after the df2 column that was used.
Reproducible DF's:
# df1
nan = np.nan
df1 = {'n1': {0: 1, 1: 2, 2: 8, 3: 2, 4: 8, 5: 8},
'n2': {0: nan, 1: 3.0, 2: 9.0, 3: nan, 4: 9.0, 5: nan},
'n3': {0: nan, 1: nan, 2: 10.0, 3: nan, 4: nan, 5: nan}}
df1 = pd.DataFrame().from_dict(df1)
# df2
df2 = {
'trial_ms': {1: -18963961, 2: 31992270, 3: -13028311},
'user_entries_error_no': {1: 2, 2: 6, 3: 2},
'user_entries_plybs': {1: 3, 2: 3, 3: 2},
'user_id': {1: 'seb', 2: 'seb', 3: 'seb'}}
df2 = pd.DataFrame().from_dict(df2)
Expected Output:
n1_trial_ms n2_trial_ms n3_trial_ms
31992270 NaN NaN
-13028311 -18934961 NaN
etc.
Attempt:
for index, row in ch.iterrows():
    print(row)
    b = df1.iloc[row]['trial_ms']
Gives me the error:
IndexError: positional indexers are out-of-bounds

I believe you need a dictionary built from the trial_ms column, whose keys are the index values of df2. Then replace each value in df1 using dict.get, so that values without a match become the missing value NaN:
d = df2['trial_ms'].to_dict()
df3 = df1.applymap(lambda x: d.get(x, np.nan)).add_suffix('_trial_ms')
print (df3)
n1_trial_ms n2_trial_ms n3_trial_ms
0 -18963961.0 NaN NaN
1 31992270.0 -13028311.0 NaN
2 NaN NaN NaN
3 31992270.0 NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
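Note that DataFrame.applymap is deprecated since pandas 2.1 in favour of DataFrame.map. A variant that should run on both older and newer versions is to apply Series.map column by column. The sketch below uses a shortened copy of the question's frames.

```python
import numpy as np
import pandas as pd

# Shortened versions of the frames from the question.
df1 = pd.DataFrame({"n1": [1, 2, 8], "n2": [np.nan, 3.0, 9.0]})
df2 = pd.DataFrame(
    {"trial_ms": [-18963961, 31992270, -13028311]}, index=[1, 2, 3]
)

# Lookup table: df2 index label -> trial_ms value.
d = df2["trial_ms"].to_dict()

# Map every column element-wise; labels missing from d become NaN.
df3 = df1.apply(
    lambda col: col.map(lambda x: d.get(x, np.nan))
).add_suffix("_trial_ms")
print(df3)
```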

Pandas: populating new columns from another column's values

I have a pandas.dataframe of SEC reports for multiple tickers & periods.
Reproducible dict for DF:
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'field': {0: 'taxonomyid',
1: 'cik',
2: 'companyname',
3: 'entityid',
4: 'primaryexchange'},
'value': {0: '50',
1: '0000023217',
2: 'CONAGRA BRANDS INC.',
3: '6976',
4: 'NYSE'},
'ticker': {0: 'CAG', 1: 'CAG', 2: 'CAG', 3: 'CAG', 4: 'CAG'},
'cik': {0: 23217, 1: 23217, 2: 23217, 3: 23217, 4: 23217},
'dcn': {0: '0000023217-18-000009',
1: '0000023217-18-000009',
2: '0000023217-18-000009',
3: '0000023217-18-000009',
4: '0000023217-18-000009'},
'fiscalyear': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019},
'fiscalquarter': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'receiveddate': {0: '10/2/2018',
1: '10/2/2018',
2: '10/2/2018',
3: '10/2/2018',
4: '10/2/2018'},
'periodenddate': {0: '8/26/2018',
1: '8/26/2018',
2: '8/26/2018',
3: '8/26/2018',
4: '8/26/2018'}}
The 'field' column contains the name of the reporting field (i.e. the indicator), and the 'value' column contains the value of that indicator. The other columns describe the SEC filing (ticker + date + fiscal periods uniquely identify a filing). There are about 60-70 indicators per filing (the number varies).
With the code below I've managed to create a pivoted dataframe whose columns are the features (say N of them for one submission). But the number of rows in this dataframe also equals the number of indicators N, with NaN in the off-diagonal positions.
# Adf - Initial dataframe
c = Adf.pivot(columns='field', values='value')
d = Adf[['ticker','cik','fiscalyear','fiscalquarter','dcn','receiveddate','periodenddate']]
e = pd.concat([d, c], sort=False, axis=1)
I want to use the indicator names from 'field' as new columns (going from narrow to wide format). In the end I want a dataframe with one row per SEC report.
So the expected output for the provided example is a 1-row dataframe with N new columns, where N is the number of unique indicators in the 'field' column of the initial dataframe:
{'ticker': {0: 'CAG'},
'cik': {0: 23217},
'dcn': {0: '0000023217-18-000009'},
'fiscalyear': {0: 2019},
'fiscalquarter': {0: 1},
'receiveddate': {0: '10/2/2018'},
'periodenddate': {0: '8/26/2018'},
'taxonomyid':{0:'50'},
'cik': {0: '0000023217'},
'companyname':{0: 'CONAGRA BRANDS INC.'},
'entityid':{0:'6976'},
'primaryexchange': {0:'NYSE'},
}
What is the proper way to create such columns, or to clean up the many NaN values in the resulting dataframe?

What worked for me was setting a new index on the DataFrame (the descriptive columns plus 'field') and then unstacking the 'field' level, so that each field becomes its own column:
aa = Adf.set_index(['ticker','cik', 'fiscalyear','fiscalquarter', 'dcn','receiveddate', 'periodenddate', 'field']).unstack()
aa = aa.reset_index()
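An equivalent formulation with DataFrame.pivot is sketched below. It assumes every filing has at most one row per field (pivot raises on duplicates) and pandas 1.1 or later, where index accepts a list. The frame is a reduced, hypothetical version of Adf, and the name id_cols is introduced for the example. The 'cik' field is left out here because its name would collide with the descriptive 'cik' column.

```python
import pandas as pd

# Reduced stand-in for Adf: three fields of a single filing.
Adf = pd.DataFrame({
    "field": ["taxonomyid", "companyname", "primaryexchange"],
    "value": ["50", "CONAGRA BRANDS INC.", "NYSE"],
    "ticker": ["CAG"] * 3,
    "dcn": ["0000023217-18-000009"] * 3,
    "fiscalyear": [2019] * 3,
})

# Columns that together identify one filing.
id_cols = ["ticker", "dcn", "fiscalyear"]

wide = (
    Adf.pivot(index=id_cols, columns="field", values="value")
    .reset_index()
    .rename_axis(columns=None)  # drop the leftover "field" label on the columns
)
print(wide)
```

Passing values="value" keeps other columns, such as 'Unnamed: 0', out of the result.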
