I have a dataframe df that looks like this:
Unnamed: 0 Characters Split A B C D Set Names
0 FROKDUWJU [FRO, KDU, WJU] FRO KDU WJU NaN {WJU, KDU, FRO}
1 IDJWPZSUR [IDJ, WPZ, SUR] IDJ WPZ SUR NaN {SUR, WPZ, IDJ}
2 UCFURKIRODCQ [UCF, URK, IRO, DCQ] UCF URK IRO DCQ {UCF, URK, DCQ, IRO}
3 ORI [ORI] ORI NaN NaN NaN {ORI}
4 PROIRKIQARTIBPO [PRO, IRK, IQA, RTI, BPO] PRO IRK IQA RTI {IQA, BPO, PRO, IRK, RTI}
5 QAZWREDCQIBR [QAZ, WRE, DCQ, IBR] QAZ WRE DCQ IBR {DCQ, QAZ, IBR, WRE}
6 PLPRUFSWURKI [PLP, RUF, SWU, RKI] PLP RUF SWU RKI {PLP, SWU, RKI, RUF}
7 FROIEUSKIKIR [FRO, IEU, SKI, KIR] FRO IEU SKI KIR {SKI, IEU, KIR, FRO}
8 ORIUWJZSRFRO [ORI, UWJ, ZSR, FRO] ORI UWJ ZSR FRO {UWJ, ORI, ZSR, FRO}
9 URKIFJVUR [URK, IFJ, VUR] URK IFJ VUR NaN {URK, VUR, IFJ}
10 RUFOFR [RUF, OFR] RUF OFR NaN NaN {OFR, RUF}
11 IEU [IEU] IEU NaN NaN NaN {IEU}
12 PIMIEU [PIM, IEU] PIM IEU NaN NaN {PIM, IEU}
The first column contains certain names. The Characters Split column contains the name split into 3-letter groups, in the form of a list. Columns A, B, C, and D contain the breakdown of those 3-letter groups. The Set Names column holds the same 3-letter groups, but in the form of a set.
Some of the 3-letter groups are common to different names. For example, "FRO" is present in the names at indices 0, 7, and 8. I'd like to put the names that share a 3-letter group into one category, preferably in the form of a list. Is it possible to have these categories for each unique 3-letter group? What would be a suitable way to do it?
df.to_dict() is as shown:
{'Unnamed: 0': {0: 'FROKDUWJU',
1: 'IDJWPZSUR',
2: 'UCFURKIRODCQ',
3: 'ORI',
4: 'PROIRKIQARTIBPO',
5: 'QAZWREDCQIBR',
6: 'PLPRUFSWURKI',
7: 'FROIEUSKIKIR',
8: 'ORIUWJZSRFRO',
9: 'URKIFJVUR',
10: 'RUFOFR',
11: 'IEU',
12: 'PIMIEU'},
'Characters Split': {0: ['FRO', 'KDU', 'WJU'],
1: ['IDJ', 'WPZ', 'SUR'],
2: ['UCF', 'URK', 'IRO', 'DCQ'],
3: ['ORI'],
4: ['PRO', 'IRK', 'IQA', 'RTI', 'BPO'],
5: ['QAZ', 'WRE', 'DCQ', 'IBR'],
6: ['PLP', 'RUF', 'SWU', 'RKI'],
7: ['FRO', 'IEU', 'SKI', 'KIR'],
8: ['ORI', 'UWJ', 'ZSR', 'FRO'],
9: ['URK', 'IFJ', 'VUR'],
10: ['RUF', 'OFR'],
11: ['IEU'],
12: ['PIM', 'IEU']},
'A': {0: 'FRO',
1: 'IDJ',
2: 'UCF',
3: 'ORI',
4: 'PRO',
5: 'QAZ',
6: 'PLP',
7: 'FRO',
8: 'ORI',
9: 'URK',
10: 'RUF',
11: 'IEU',
12: 'PIM'},
'B': {0: 'KDU',
1: 'WPZ',
2: 'URK',
3: nan,
4: 'IRK',
5: 'WRE',
6: 'RUF',
7: 'IEU',
8: 'UWJ',
9: 'IFJ',
10: 'OFR',
11: nan,
12: 'IEU'},
'C': {0: 'WJU',
1: 'SUR',
2: 'IRO',
3: nan,
4: 'IQA',
5: 'DCQ',
6: 'SWU',
7: 'SKI',
8: 'ZSR',
9: 'VUR',
10: nan,
11: nan,
12: nan},
'D': {0: nan,
1: nan,
2: 'DCQ',
3: nan,
4: 'RTI',
5: 'IBR',
6: 'RKI',
7: 'KIR',
8: 'FRO',
9: nan,
10: nan,
11: nan,
12: nan},
'Set Names': {0: {'FRO', 'KDU', 'WJU'},
1: {'IDJ', 'SUR', 'WPZ'},
2: {'DCQ', 'IRO', 'UCF', 'URK'},
3: {'ORI'},
4: {'BPO', 'IQA', 'IRK', 'PRO', 'RTI'},
5: {'DCQ', 'IBR', 'QAZ', 'WRE'},
6: {'PLP', 'RKI', 'RUF', 'SWU'},
7: {'FRO', 'IEU', 'KIR', 'SKI'},
8: {'FRO', 'ORI', 'UWJ', 'ZSR'},
9: {'IFJ', 'URK', 'VUR'},
10: {'OFR', 'RUF'},
11: {'IEU'},
12: {'IEU', 'PIM'}}}
You can explode 'Set Names', then groupby the exploded column and aggregate 'Unnamed: 0' into a list per group:
(df.explode('Set Names')
.groupby('Set Names')
['Unnamed: 0'].apply(list)
)
output:
Set Names
BPO [PROIRKIQARTIBPO]
DCQ [UCFURKIRODCQ, QAZWREDCQIBR]
FRO [FROKDUWJU, FROIEUSKIKIR, ORIUWJZSRFRO]
IBR [QAZWREDCQIBR]
IDJ [IDJWPZSUR]
... ...
WJU [FROKDUWJU]
WPZ [IDJWPZSUR]
WRE [QAZWREDCQIBR]
ZSR [ORIUWJZSRFRO]
If you want to filter the output to keep only groups with a minimum number of items (here > 1):
(df.explode('Set Names')
.groupby('Set Names')
['Unnamed: 0'].apply(lambda g: list(g) if len(g) > 1 else None)
.dropna()
)
output:
Set Names
DCQ [UCFURKIRODCQ, QAZWREDCQIBR]
FRO [FROKDUWJU, FROIEUSKIKIR, ORIUWJZSRFRO]
IEU [FROIEUSKIKIR, IEU, PIMIEU]
ORI [ORI, ORIUWJZSRFRO]
RUF [PLPRUFSWURKI, RUFOFR]
URK [UCFURKIRODCQ, URKIFJVUR]
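For what it's worth, a small variation on the same idea avoids the lambda/None/dropna round-trip by aggregating first and filtering afterwards - a sketch:
out = (df.explode('Set Names')
         .groupby('Set Names')['Unnamed: 0']
         .agg(list)                         # one list of names per 3-letter group
         .loc[lambda s: s.str.len() > 1])   # keep only groups shared by 2+ names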
I am performing df.apply() on a dataframe and I am getting the following error:
IndexingError: ('Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).', 'occurred at index 4061')
The error occurs at index 4061 of my df; the offending row is shown at the end of this question.
The relevant code is:
i = pd.DataFrame()
i = df1.apply(
lambda row: i.append(
df.loc[
(df1["ID"] == row["ID"])
& (df1["Date"] >= (row["Date"] + timedelta(-5)))
& (df1["Date"] <= (row["Date"] + timedelta(20)))
],
ignore_index=True,
inplace=True,
)
if row["Flag"] == 1
else None,
axis=1,
)
And an example of the first 5 rows of the df on which I am using the function:
{'ID': {1: 'A US Equity',
2: 'A US Equity',
3: 'A US Equity',
4: 'A US Equity',
5: 'A US Equity'},
'Date': {1: Timestamp('2020-12-22 00:00:00'),
2: Timestamp('2020-12-23 00:00:00'),
3: Timestamp('2020-12-24 00:00:00'),
4: Timestamp('2020-12-28 00:00:00'),
5: Timestamp('2020-12-29 00:00:00')},
'PX_Last': {1: 117.37, 2: 117.3, 3: 117.31, 4: 117.83, 5: 117.23},
'Short_Int': {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0},
'Total_Call_Volume': {1: 187.0, 2: 353.0, 3: 141.0, 4: 467.0, 5: 329.0},
'Total_Put_Volume': {1: 54.0, 2: 30.0, 3: 218.0, 4: 282.0, 5: 173.0},
'Put_OI': {1: 13354.0, 2: 13350.0, 3: 13522.0, 4: 13678.0, 5: 13785.0},
'Call_OI': {1: 8923.0, 2: 8943.0, 3: 8973.0, 4: 9075.0, 5: 9040.0},
'pct_chng': {1: -0.34810663949736975,
2: -0.059640453267451043,
3: 0.008525149190119485,
4: 0.4432699684596253,
5: -0.5092081812781091},
'Short_Int_Category': {1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Put/Call': {1: 0.2887700534759358,
2: 0.08498583569405099,
3: 1.5460992907801419,
4: 0.6038543897216274,
5: 0.5258358662613982},
'10% + Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'10%-20% Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'20%-30% Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'30% + Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'Time_to_pop': {1: nan, 2: nan, 3: nan, 4: nan, 5: nan}}
The row at index 4061 that is causing the error is:
ID ADI US Equity
Date 2021-02-24 00:00:00
PX_Last 161.76
Short_Int 15.1847
Total_Call_Volume 52502
Total_Put_Volume 1929
Put_OI 32219
Call_OI 45557
pct_chng 2.57451
Short_Int_Category 15-20
Put/Call 0.0367415
10% + Pop Flag 0
10%-20% Pop Flag 0
20%-30% Pop Flag 0
30% + Pop Flag 0
Flag 1
Time_to_pop NaN
Name: 4061, dtype: object
How do I perform the function without getting the error mentioned above?
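A likely cause, for context: the boolean masks above are built from df1 (df1["ID"] == ..., etc.) but are then used to index df with df.loc[...]; when the two frames don't share the same index, pandas raises exactly this "Unalignable boolean Series" error. (Note also that DataFrame.append takes no inplace argument, and appending inside apply never accumulates anything.) A minimal sketch of one way around it, assuming the windowed rows should come from the same frame the masks are built on - swap in df below if it shares df1's index and columns:
from datetime import timedelta
import pandas as pd

pieces = []
for _, row in df1.loc[df1["Flag"] == 1].iterrows():
    # build the mask on the same frame we index, so the indexes align
    mask = (
        (df1["ID"] == row["ID"])
        & (df1["Date"] >= row["Date"] + timedelta(-5))
        & (df1["Date"] <= row["Date"] + timedelta(20))
    )
    pieces.append(df1.loc[mask])

i = pd.concat(pieces, ignore_index=True) if pieces else pd.DataFrame()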
I'm reading multiple PDFs (using tabula) into data frames like this:
nan = float('nan')
DataFrame_as_dict = {'Unnamed: 0': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Unnamed: 1': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'NOTA DE NEGOCIAÇÃO': {0: nan,
1: nan,
2: 'Rico Investimentos - Grupo XP',
3: 'Av. Presidente Juscelino Kubitschek - Torre Sul, 1909 - 25o ANDAR VILA OLIMPIA 4543-907',
4: 'Tel. 3003-5465Fax: (55 11) 4007-2465',
5: 'Internet: www.rico.com.vc SAC: 0800-774-0402 e-mail: atendimento#rico.com.vc'},
'Unnamed: 3': {0: 'Nr. nota Folha',
1: '330736 1',
2: nan,
3: 'SÃO PAULO - SP',
4: nan,
5: nan},
'Unnamed: 4': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Unnamed: 5': {0: 'Data pregão',
1: '30/09/2015',
2: nan,
3: nan,
4: nan,
5: nan}}
df = pd.DataFrame(DataFrame_as_dict)
My intention is to store the value '330736 1' in a variable number and '30/09/2015' in a variable date.
The issue is that, although these values are always located in row 1, the columns they occupy vary unpredictably across the multiple PDFs.
Therefore, I tried to loop over the columns of row 1 in order to extract the data regardless of which columns hold it:
list_columns = df.columns
for i in range(len(list_columns)):
if isinstance(df.iloc[1:2,i], str):
if df.iloc[1:2,i].str.contains("/",na=False,regex=False).any():
date = str(df.iloc[1:2,i]).strip()
else:
n_nota = str(df.iloc[1:2,i]).strip()
However, without success... Any thoughts?
In your original code, if isinstance(df.iloc[1:2,i], str) will never evaluate to True for two reasons:
1. Strings inside a DataFrame are held in object-dtype columns.
2. df.iloc[1:2,i] will always be a pandas Series, not a str.
Since object is such a flexible type, it's not as useful as str for identifying the data you want. In the code below, I simply used a space character to differentiate the data you want for n_nota. If this doesn't work with your data, a regex pattern may be a good approach.
list_columns = df.columns
for i in range(len(list_columns)):
    values = df.iloc[1:2, i].values  # one-element array holding row 1's cell
    if "/" in str(values):           # dates contain a slash, e.g. 30/09/2015
        date = str(values[0]).strip()
    elif " " in str(values):         # the 'Nr. nota' value contains a space, e.g. '330736 1'
        n_nota = str(values[0]).strip()
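If the space heuristic turns out to be too loose on other PDFs, the regex approach mentioned above could look like this - a sketch, with the date and 'Nr. nota' patterns assumed from the sample data:
import re

for i in range(len(df.columns)):
    cell = str(df.iloc[1, i]).strip()
    if re.fullmatch(r"\d{2}/\d{2}/\d{4}", cell):   # e.g. 30/09/2015
        date = cell
    elif re.fullmatch(r"\d+ \d+", cell):           # e.g. '330736 1'
        n_nota = cell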
Edit: As noted below, the original code in the question would have worked if each df.iloc[1:2,i] were replaced with df.iloc[1,i], which returns the scalar cell value rather than a one-row Series:
list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1, i], str):
        if "/" in df.iloc[1, i]:
            date = df.iloc[1, i].strip()
        else:
            n_nota = df.iloc[1, i].strip()
In df1, each cell value is the index of a row I want from df2.
I would like to grab the value from df2's trial_ms column for that row, and then rename the columns of df1 to reflect the df2 column that was grabbed.
Reproducible DF's:
# df1
import numpy as np
import pandas as pd

nan = np.nan
df1 = {'n1': {0: 1, 1: 2, 2: 8, 3: 2, 4: 8, 5: 8},
'n2': {0: nan, 1: 3.0, 2: 9.0, 3: nan, 4: 9.0, 5: nan},
'n3': {0: nan, 1: nan, 2: 10.0, 3: nan, 4: nan, 5: nan}}
df1 = pd.DataFrame.from_dict(df1)
# df2
df2 = {
'trial_ms': {1: -18963961, 2: 31992270, 3: -13028311},
'user_entries_error_no': {1: 2, 2: 6, 3: 2},
'user_entries_plybs': {1: 3, 2: 3, 3: 2},
'user_id': {1: 'seb', 2: 'seb', 3: 'seb'}}
df2 = pd.DataFrame.from_dict(df2)
Expected Output:
n1_trial_ms  n2_trial_ms  n3_trial_ms
   31992270          NaN          NaN
  -13028311    -18963961          NaN
etc.
Attempt:
for index, row in ch.iterrows():
print(row)
b = df1.iloc[row]['trial_ms']
Gives me the error:
IndexError: positional indexers are out-of-bounds
I believe you need a dictionary built from the trial_ms column - its keys are the index of df2 - and then to look up each value of df1 with dict.get, which returns the missing value NaN when a key is not matched:
d = df2['trial_ms'].to_dict()
df3 = df1.applymap(lambda x: d.get(x, np.nan)).add_suffix('_trial_ms')
print (df3)
n1_trial_ms n2_trial_ms n3_trial_ms
0 -18963961.0 NaN NaN
1 31992270.0 -13028311.0 NaN
2 NaN NaN NaN
3 31992270.0 NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
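For what it's worth, Series.map has the same missing-key-to-NaN behavior when given a dict, so a column-wise variant of the same idea would be:
df3 = df1.apply(lambda col: col.map(d)).add_suffix('_trial_ms')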
I have a pandas DataFrame of SEC reports for multiple tickers and periods.
Reproducible dict for DF:
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'field': {0: 'taxonomyid',
1: 'cik',
2: 'companyname',
3: 'entityid',
4: 'primaryexchange'},
'value': {0: '50',
1: '0000023217',
2: 'CONAGRA BRANDS INC.',
3: '6976',
4: 'NYSE'},
'ticker': {0: 'CAG', 1: 'CAG', 2: 'CAG', 3: 'CAG', 4: 'CAG'},
'cik': {0: 23217, 1: 23217, 2: 23217, 3: 23217, 4: 23217},
'dcn': {0: '0000023217-18-000009',
1: '0000023217-18-000009',
2: '0000023217-18-000009',
3: '0000023217-18-000009',
4: '0000023217-18-000009'},
'fiscalyear': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019},
'fiscalquarter': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'receiveddate': {0: '10/2/2018',
1: '10/2/2018',
2: '10/2/2018',
3: '10/2/2018',
4: '10/2/2018'},
'periodenddate': {0: '8/26/2018',
1: '8/26/2018',
2: '8/26/2018',
3: '8/26/2018',
4: '8/26/2018'}}
The 'field' column contains the name of the reporting field (i.e. the indicator), and the 'value' column contains the value of that indicator. The other columns describe the SEC filing (ticker + date + fiscal periods form a unique set of features for a given filing). There are about 60-70 indicators per filing (the number varies).
With the code below I've managed to create a pivoted dataframe whose columns are the features (let's say N of them in total for one submission). But the length of this dataframe also equals the number of indicators N, with NaN everywhere off the diagonal.
# Adf - Initial dataframe
c = Adf.pivot(columns='field', values='value')
d = Adf[['ticker','cik','fiscalyear','fiscalquarter','dcn','receiveddate','periodenddate']]
e = pd.concat([d, c], sort=False, axis=1)
I want to use the indicator names from 'field' as new columns (going from narrow to wide format). In the end I want a dataframe with one row per SEC report.
So the expected output for the provided example is a 1-row dataframe with N new columns, where N is the number of unique indicators in the 'field' column of the initial dataframe:
{'ticker': {0: 'CAG'},
'cik': {0: 23217},
'dcn': {0: '0000023217-18-000009'},
'fiscalyear': {0: 2019},
'fiscalquarter': {0: 1},
'receiveddate': {0: '10/2/2018'},
'periodenddate': {0: '8/26/2018'},
'taxonomyid':{0:'50'},
'cik': {0: '0000023217'},
'companyname':{0: 'CONAGRA BRANDS INC.'},
'entityid':{0:'6976'},
'primaryexchange': {0:'NYSE'},
}
What is the proper way to create such columns, or what is the proper way to clean up the resulting dataframe's multiple NaNs?
What worked for me was setting a new index on the DataFrame and then unstacking the 'field' level:
aa = Adf.set_index(['ticker','cik', 'fiscalyear','fiscalquarter', 'dcn','receiveddate', 'periodenddate', 'field']).unstack()
aa = aa.reset_index()
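As a follow-up note: aa will carry a two-level column index (pairs like ('value', 'taxonomyid'), plus the exploded 'Unnamed: 0' row counter). A sketch of a variant that avoids both by selecting 'value' before unstacking - dropping 'Unnamed: 0' here is an assumption that the row counter isn't needed:
wide = (Adf.drop(columns=['Unnamed: 0'])   # drop the row counter
           .set_index(['ticker', 'cik', 'fiscalyear', 'fiscalquarter',
                       'dcn', 'receiveddate', 'periodenddate', 'field'])['value']
           .unstack()                       # field names become flat columns
           .reset_index())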