Multi-criteria pandas dataframe exceptions reporting - python

Given the following pandas df -
Holding Account | Entity ID | Holding Account Number | % Ownership | Entity ID % | Account # % | Ownership Audit Note
11 West Summit Drive LLC (80008660955) | 3423435 | 54353453454 | 100 | 100 | 100 | NaN
110 Goodwill LLC (91928475) | 7653453 | 65464565 | 50 | 50 | 50 | Partial Ownership [50.00%]
1110 Webbers St LLC (14219739) | 1235734 | 12343535 | 100 | 100 | 100 | NaN
120 Goodwill LLC (30271633) | 9572953 | 96839592 | 55 | 55 | 55 | Inactive Client [10.00%]
Objective - I am trying to create an Exceptions Report and only include rows that satisfy the following logic:
1. % Ownership != 100%, OR
2. Ownership Audit Note == "-" AND (Account # % == 100% OR Entity ID % == 100%)
Attempt - I am able to produce the individual components that make up my required logic, but I can't seem to bring them together:
# This gets me rows which meet 1.
df = df[df['% Ownership'].eq(100)==False]
# Something 'like' this would get me 2.
df = df[df['Ownership Audit Note'] == "-"] & df[df['Account # %'|'Entity ID %'] == "None"]
I am looking for some hints/tips to help me bring all this together in the most pythonic way.

Use:
df = df[df['% Ownership'].ne(100) | (df['Ownership Audit Note'].eq("-") & (df['Account # %'].eq(100) | df['Entity ID %'].eq(100)))]
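For readability, the same filter can be built from named boolean masks; this is just a restatement of the answer above (the mask names are illustrative), not a different method:
# Condition 1: ownership is not exactly 100%
not_fully_owned = df['% Ownership'].ne(100)
# Condition 2: no audit note ("-") but either percentage column is 100%
no_audit_note = df['Ownership Audit Note'].eq("-")
either_pct_100 = df['Account # %'].eq(100) | df['Entity ID %'].eq(100)
exceptions = df[not_fully_owned | (no_audit_note & either_pct_100)]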

Related

How to save pandas dataframe rows as separate files with the first row fixed for all?

I have a DataFrame with multiple columns and rows. The rows are student names with marks and the columns are marking criteria. I want to save the first row (the column names) along with each row in separate files, with the student's name as the file name.
Example of my data:
Marking_Rubric | Requirements and Delivery\nWeight 45.00% | Coding Standards\nWeight 10.00% | Documentation\nWeight 25.00% | Runtime - Effectiveness\nWeight 10.00% | Efficiency\nWeight 10.00% | Total | Comments
John Doe | 54 | 50 | 90 | 45 | 50 | 31 | Limited documentation
Jane Doe | 23 | 12 | 87 | 10 | 34 | 98 | No comments
Desired output:
Marking_Rubric | Requirements and Delivery | Coding Standards | Documentation | Runtime - Effectiveness | Efficiency | Total | Comments
John Doe | 54 | 50 | 90 | 45 | 50 | 31 | Limited documentation

Marking_Rubric | Requirements and Delivery | Coding Standards | Documentation | Runtime - Effectiveness | Efficiency | Total | Comments
Jane Doe | 23 | 12 | 87 | 10 | 34 | 98 | No comments
Just note that each file needs a unique name; otherwise files with the same name will overwrite each other.
# `````````````````````````````````````````````````````````````````````````
### create dummy data
import pandas as pd

column1_list = ['John Doe', 'John Doe', 'Not John Doe', 'special ß ß %&^ character name', 'no special character name again']
column2_list = [53, 23, 100, 0, 10]
column3_list = [50, 12, 200, 0, 10]
df = pd.DataFrame({'Marking_Rubric': column1_list,
                   'Requirements and Delivery': column2_list,
                   'Coding Standards': column3_list})
# `````````````````````````````````````````````````````````````````````````
### create a unique identifier that will be used as the file name, otherwise
### you will overwrite files that share the same name
df['row_number'] = df.index
df['Marking_Rubric_Rowed'] = df.Marking_Rubric + " " + df.row_number.astype(str)
df
Output 1
# `````````````````````````````````````````````````````````````````````````
### loop over the length of your dataframe and save each row as a csv
for x in range(0, len(df)):
    ### try to save the file
    try:
        ### take the current row, then select the column that supplies the file
        ### name; if you want another name, just change the column
        df[x:x+1].to_csv(df[x:x+1].Marking_Rubric_Rowed.iloc[0] + '.csv',  #### file name selected here
                         index=False)
    ### catch and print the exception if something went wrong
    except Exception as e:
        print(e)
        ### continue your loop; you could also put "break" to stop it
        continue
Output 2
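One practical caveat with this approach: rows such as 'special ß ß %&^ character name' in the dummy data can produce file names that the OS or downstream tools reject. A small, hedged sketch of sanitising the name before calling to_csv (the sanitize_filename helper below is illustrative and not part of the original answer):
import re

def sanitize_filename(name):
    ### keep letters, digits, spaces, dots, hyphens and underscores; replace everything else
    return re.sub(r'[^\w .-]', '_', name)

for x in range(0, len(df)):
    safe_name = sanitize_filename(df[x:x+1].Marking_Rubric_Rowed.iloc[0])
    df[x:x+1].to_csv(safe_name + '.csv', index=False)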

Python Pandas Question: Index / Match with Missing Values + Duplicates + Everything In Between

Basically, I have a smaller table of assets purchased this year and a table of assets the company holds. I want to pull the SYMBOL for certain CUSIPs from the holdings table and merge it into the purchases dataset, keyed on CUSIP. If a CUSIP in the purchases table has no match in the holdings table, the code can return blank or NaN. If there are duplicate CUSIPs in the holdings dataset, return the first value. I have tried 4 different ways of merging these tables without much luck; for some reason I run into a memory error.
The equivalent Excel formula would be:
=IFNA(INDEX(asset_holdings!ADMIN_SYMBOLS,MATCH(asset_purchases!CUSIP_n, asset_holdings!CUSIPs, 0)),"")
Holdings Table
CUSIP | SYMBOL
353187EV5 | 1A
74727PAY7 | 3A
80413TAJ8 | FE
02765UCR3 | 3G
000000000 | 3G
74727PAYA | 3E
000000000 | 4E
Purchase Table
CUSIP | SHARES
353187EV5 | 10
74727PAY7 | 67
80413TAJ8 | 35
02765UCR4 | 3666
74727PAY7 | 3613
74727PAYA | 13
000000000 | 14
Desired Result
CUSIP | SHARES | SYMBOL
353187EV5 | 10 | 1A
74727PAY7 | 67 | 3A
80413TAJ8 | 35 | FE
02765UCR4 | 3666 | ""
74727PAY7 | 3613 | 3A
74727PAYA | 13 | 3E
000000000 | 14 | 3G
C:\ProgramData\Continuum\Anaconda\lib\site-packages\pandas\core\reshape\merge.py in _get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
1140 join_func = _join_functions[how]
1141
-> 1142 return join_func(lkey, rkey, count, **kwargs)
1143
1144
pandas\_libs\join.pyx in pandas._libs.join.left_outer_join()
MemoryError:
What I tried:
dfnew = dfPurchases.merge(dfHoldings[['CUSIP','SYMBOL']],how='left', on='CUSIP')
dfPurchases = dfPurchases.set_index('CUSIP')
dfPurchases['SYMBOL'] = dfHoldings.lookup(dfHoldings['CUSIP'], df1['SYMBOL'])
Let me restate the question so you can check whether I have understood it correctly. You want a left outer join of the purchase dataset with the holdings dataset. But since your holdings dataset has duplicate CUSIP ids, it will not be a one-to-one join.
Now you have two options:
Accept multiple rows for one row of the purchase dataset
Make CUSIP id unique in the Holdings dataset and then perform the merge
First way:
import pandas as pd
left = pd.read_csv('purchase.csv')
right = pd.read_csv('holdings.csv')
result = pd.merge(left, right, on="CUSIP", how='left')
print(result)
But as per your question the above result isn't acceptable, so we will make the CUSIP column unique in the right dataset.
import pandas as pd
left = pd.read_csv('purchase.csv')
right = pd.read_csv('holdings.csv')
# keep='first' is the default, but it is spelled out explicitly for clarity
right_unique = right.drop_duplicates('CUSIP', keep='first')
result = pd.merge(left, right_unique, on="CUSIP", how='left', validate="many_to_one")
print(result)
Bonus: you can also explore the validate parameter by adding it to the first version and inspecting the validation errors.
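One small addition, sketched here on the assumption that the empty string in the desired result (rather than NaN) matters downstream: the "" that IFNA returns in the Excel formula can be reproduced after the merge with fillna:
result['SYMBOL'] = result['SYMBOL'].fillna('')  # blank instead of NaN for unmatched CUSIPs
print(result)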

Setting Character Limit on Pandas DataFrame Column

Background:
Given the following pandas df -
Holding Account    Model Type    Entity ID    Direct Owner ID
WF LLC | 100 Jones Street 26th Floor San Francisco Ca Ltd Liability - Income Based Gross USA Only (486941515)    51364633    4564564    5646546
RF LLC | Neuberger | LLC | Aukai Services LLC-Neuberger Smid - Income Accuring Net of Fees Worldwide Fund (456456218)    46256325    1645365    4926654
The ask:
What is the most pythonic way to enforce an 80-character limit on the values of the Holding Account column (dtype = object)?
Context: I am writing df to a .csv and then uploading it to a system with an 80-character limit. The values of the Holding Account column are unique, so I am happy to sacrifice whatever characters take a string over 80 characters.
My attempt:
This is what I attempted - df['column'] = df['column'].str[:80]
Why not just use .str, like you were doing?
df['Holding Account'] = df['Holding Account'].str[:80]
Output:
>>> df
Holding Account Model Type Entity ID Direct Owner ID
0 WF LLC | 100 Jones Street 26th Floor San Francisco Ca Ltd Liability - Income Bas 51364633 4564564 5646546
1 RF LLC | Neuberger | LLC | Aukai Services LLC-Neuberger Smid - Income Accuring N 46256325 1645365 4926654
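A quick, hedged sanity check before writing the .csv (not part of the original answer) is to confirm the longest remaining value:
df['Holding Account'].str.len().max()  # should now be <= 80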
Using a slice will lose some information. I would suggest creating a mapping table from the factorized codes; this also saves storage space on the server or in the db.
s = df['Holding Account'].factorize()[0]
d = dict(zip(s, df['Holding Account']))  # build the code -> original value mapping before overwriting
df['Holding Account'] = s
If you would like to get the data back, just do
df['new'] = df['Holding Account'].map(d)
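A quick round trip on toy data (the account names below are made up purely for illustration) shows how the mapping restores the originals:
import pandas as pd

df = pd.DataFrame({'Holding Account': ['Alpha Fund LLC | a very long descriptive name (123)',
                                       'Beta Fund LLC | another long descriptive name (456)']})
s = df['Holding Account'].factorize()[0]
d = dict(zip(s, df['Holding Account']))        # {0: 'Alpha Fund LLC | ...', 1: 'Beta Fund LLC | ...'}
df['Holding Account'] = s                      # column now holds short integer codes
df['restored'] = df['Holding Account'].map(d)  # identical to the original strings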

A way to make the [apply + lambda + loc] efficient

The dataframe (it contains data on the 2016 elections), loaded into pandas from a .csv, has the following structure:
In [2]: df
Out[2]:
county candidate votes ...
0 Ada Trump 10000 ...
1 Ada Clinton 900 ...
2 Adams Trump 12345 ...
.
.
n Total ... ... ...
The idea would be to calculate the top X counties with the highest percentage of votes in favor of candidate X (excluding the Total rows).
For example, suppose we want 100 counties and the candidate is Trump; the operation to carry out for each county is: 100 * sum of votes for Trump in the county / total votes in the county
I have implemented the following code, getting correct results:
In [3]: (df.groupby(by="county")
           .apply(lambda x: 100 * x.loc[(x.candidate == "Trump")
                                        & ~(x.county == "Total"), "votes"].sum() / x.votes.sum())
           .nlargest(100)
           .reset_index(name='percentage'))
Out[3]:
county percentage
0 Hayes 91.82
1 WALLACE 90.35
2 Arthur 89.37
.
.
99 GRANT 79.10
Using %%time, I realized that it is quite slow:
Out[3]:
CPU times: user 964 ms, sys: 24 ms, total: 988 ms
Wall time: 943 ms
Is there a way to make it faster?
You can try to amend your code to use only vectorized operations to speed up the process, like below:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3 = df2.nlargest(100).reset_index(name='percentage') # get the largest 100
df3.loc[df3.candidate == "Trump"] # Finally, filter by candidate
Edit:
If you want the top 100 counties with the highest percentages, you can slightly change the code as below:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3a = df2.reset_index(name='percentage') # get the percentage
df3a.loc[df3a.candidate == "Trump"].nlargest(100, 'percentage') # Finally, filter by candidate and get the top 100 counties with highest percentages for the candidate
you can try:
Supposing you don't have a 'Total' row with the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby(['county']).sum()/df['votes'].sum()*100).nlargest(100, 'votes')
Supposing you have a 'Total' row with the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby(['county']).sum()/df.loc[df['candidate'] != 'Total', 'votes'].sum()*100).nlargest(100, 'votes')
I could not test it because I don't have the data, but it doesn't use any apply, which should improve performance.
To rename the column you can add .rename(columns={'votes':'percentage'}) at the end.
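Another vectorized alternative, sketched under the assumption that each row is a single county/candidate pair and that per-county percentages are wanted (as in the original apply), uses transform so that no apply and no index alignment between two groupbys is needed:
df1 = df[df.county != "Total"].copy()                   # drop the Total row(s)
county_total = df1.groupby('county')['votes'].transform('sum')
df1['percentage'] = 100 * df1['votes'] / county_total   # each row's share of its county's votes
top100 = (df1[df1.candidate == "Trump"]
          .nlargest(100, 'percentage')[['county', 'percentage']]
          .reset_index(drop=True))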

pandas operations inside a for-loop

Here is a sample of my data
threats =
                                              binomial_name
continent      threat_type
Africa         Agriculture & Aquaculture                143
               Biological Resource Use                  102
               Climate Change                             3
               Commercial Development                    36
               Energy Production & Mining                30
...            ...                                      ...
South America  Human Intrusions                           1
               Invasive Species                           3
               Natural System Modifications               1
               Transportation Corridor                    2
               Unknown                                   38
I want to use a for loop to obtain the top 5 values for each continent and append them together into a data frame.
Here is my code -
continents = threats.continent.unique()
for i in continents:
    continen = (threats
                .query('continent == i')
                .groupby(['continent', 'threat_type'])
                .sort_values(by=('binomial_name'), ascending=False)
                .head())
    top5 = appended_data.append(continen)
I am however getting the error - KeyError: 'i'
Where am I going wrong?
So, the canonical way to do this:
df.groupby('continent', as_index=False).apply(
    lambda grp: grp.nlargest(5, 'binomial_name'))
If you want to do this in a loop, replace this part:
for i in continents:
    continen = threats[threats['continent'] == i].nlargest(5, 'binomial_name')
    appended_data.append(continen)
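As for the KeyError itself: inside DataFrame.query a bare name such as i is treated as a column name, so query('continent == i') looks for a column called i; referencing the loop variable needs the @ prefix, i.e. query('continent == @i'). A complete, hedged sketch of the loop version (assuming continent is an ordinary column, e.g. after threats = threats.reset_index()):
import pandas as pd

appended_data = []                                    # collect the per-continent frames here
for i in threats['continent'].unique():
    continen = (threats[threats['continent'] == i]
                .nlargest(5, 'binomial_name'))        # top 5 threat counts for this continent
    appended_data.append(continen)

top5 = pd.concat(appended_data, ignore_index=True)    # one dataframe covering all continents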
