Cannot insert subtotals into pandas dataframe - python

I'm rather new to Python and to Pandas. With the help of Google and StackOverflow, I've been able to get most of what I'm after. However, this one has me stumped. I have a dataframe that looks like this:
                     SalesPerson 1              SalesPerson 2              SalesPerson 3
                     Revenue  Number of Orders  Revenue  Number of Orders  Revenue  Number of Orders
In Process Stage 1      8347                 8     9941                 5     5105                 7
In Process Stage 2      3879                 2     3712                 3     1350                10
In Process Stage 3      7885                 4     6513                 8     2218                 2
Won Not Invoiced        4369                 1     1736                 5     4950                 9
Won Invoiced            7169                 5     5308                 3     9832                 2
Lost to Competitor      8780                 1     3836                 7     2851                 3
Lost to No Action       2835                 5     4653                 1     1270                 2
I would like to add subtotal rows for In Process, Won, and Lost, so that my data looks like:
                     SalesPerson 1              SalesPerson 2              SalesPerson 3
                     Revenue  Number of Orders  Revenue  Number of Orders  Revenue  Number of Orders
In Process Stage 1      8347                 8     9941                 5     5105                 7
In Process Stage 2      3879                 2     3712                 3     1350                10
In Process Stage 3      7885                 4     6513                 8     2218                 2
In Process Subtotal    20111                14    20166                16     8673                19
Won Not Invoiced        4369                 1     1736                 5     4950                 9
Won Invoiced            7169                 5     5308                 3     9832                 2
Won Subtotal           11538                 6     7044                 8    14782                11
Won Percent              27%               23%      20%               25%      54%               31%
Lost to Competitor      8780                 1     3836                 7     2851                 3
Lost to No Action       2835                 5     4653                 1     1270                 2
Lost Subtotal          11615                 6     8489                 8     4121                 5
Lost Percent             27%               23%      24%               25%      15%               14%
Total                  43264                26    35699                32    27576                35
So far, my code looks like:
def create_win_lose_table(dataframe):
    in_process_stagename_list = {'In Process Stage 1', 'In Process Stage 2', 'In Process Stage 3'}
    won_stagename_list = {'Won Invoiced', 'Won Not Invoiced'}
    lost_stagename_list = {'Lost to Competitor', 'Lost to No Action'}
    temp_Pipeline_df = dataframe.copy()
    for index, row in temp_Pipeline_df.iterrows():
        if index not in in_process_stagename_list:
            temp_Pipeline_df.drop([index], inplace=True)
    Pipeline_sum = temp_Pipeline_df.sum()
    # at the end I was going to concat the sum to the original dataframe, but that's where I'm stuck
I have only started to work on the in process dataframe. My thought was that once I figured that out I could then just duplicate that process for the Won and Lost categories. Any thoughts or approaches are welcome.
Thank you!
Jon

A simple example for you:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(5, 5))
df_total = pd.DataFrame(df.sum(axis=0)).transpose()
# DataFrame.append was removed in pandas 2.0; concatenate the total row instead
df_with_totals = pd.concat([df, df_total])
df_with_totals
0 1 2 3 4
0 0.743746 0.668769 0.894739 0.947641 0.753029
1 0.236587 0.862352 0.329624 0.637625 0.288876
2 0.817637 0.250593 0.363517 0.572789 0.785234
3 0.140941 0.221746 0.673470 0.792831 0.170667
4 0.965435 0.836577 0.790037 0.996000 0.229857
0 2.904346 2.840037 3.051388 3.946885 2.227662
You can use rename on the total row's index to call the summary row whatever you want.
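Applied to the table in the question, the same idea scales with groupby: map each row label to its category, sum within each category, and concat the subtotal rows back on. A minimal sketch, assuming flat column names (SP1_Revenue and SP1_Orders here stand in for the real MultiIndex columns, which work the same way):

```python
import pandas as pd

# Toy version of the pipeline table: one salesperson, revenue and order counts.
df = pd.DataFrame(
    {"SP1_Revenue": [8347, 3879, 7885, 4369, 7169, 8780, 2835],
     "SP1_Orders": [8, 2, 4, 1, 5, 1, 5]},
    index=["In Process Stage 1", "In Process Stage 2", "In Process Stage 3",
           "Won Not Invoiced", "Won Invoiced",
           "Lost to Competitor", "Lost to No Action"],
)

# Map each row label to its category prefix, sum within category,
# and append the subtotal rows back onto the original frame.
category = df.index.str.extract(r"^(In Process|Won|Lost)", expand=False)
subtotals = df.groupby(category).sum()
subtotals.index = subtotals.index + " Subtotal"

result = pd.concat([df, subtotals])
```

The subtotal rows land at the bottom here; re-sorting them into place under each category block is a presentation detail left out of the sketch.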

Related

NaN issue after merge two tables | Python

I tried to merge two tables on person_skills, but received a merged table with a lot of NaN values.
I'm sure the second table has no duplicate values, and I have tried to rule out possible issues caused by datatypes or NA values, but I still get the same wrong result.
Please help me by taking a look at the following code.
Table 1
lst_col = 'person_skills'
skills = skills.assign(**{lst_col:skills[lst_col].str.split(',')})
skills = skills.explode(['person_skills'])
skills['person_id'] = skills['person_id'].astype(int)
skills['person_skills'] = skills['person_skills'].astype(str)
skills.head(10)
person_id person_skills
0 1 Talent Management
0 1 Human Resources
0 1 Performance Management
0 1 Leadership
0 1 Business Analysis
0 1 Policy
0 1 Talent Acquisition
0 1 Interviews
0 1 Employee Relations
Table 2
standard_skills = df["person_skills"].str.split(',', expand=True)
series1 = pd.Series(standard_skills[0])
standard_skills = series1.unique()
standard_skills= pd.DataFrame(standard_skills, columns = ["person_skills"])
standard_skills.insert(0, 'skill_id', range(1, 1 + len(standard_skills)))
standard_skills['skill_id'] = standard_skills['skill_id'].astype(int)
standard_skills['person_skills'] = standard_skills['person_skills'].astype(str)
standard_skills = standard_skills.drop_duplicates(subset='person_skills').reset_index(drop=True)
standard_skills = standard_skills.dropna(axis=0)
standard_skills.head(10)
skill_id person_skills
0 1 Talent Management
1 2 SEM
2 3 Proficient with Microsoft Windows: Word
3 4 Recruiting
4 5 Employee Benefits
5 6 PowerPoint
6 7 Marketing
7 8 nan
8 9 Human Resources (HR)
9 10 Event Planning
Merged table
combine_skill = skills.merge(standard_skills,on='person_skills', how='left')
combine_skill.head(10)
person_id person_skills skill_id
0 1 Talent Management 1.0
1 1 Human Resources NaN
2 1 Performance Management NaN
3 1 Leadership NaN
4 1 Business Analysis NaN
5 1 Policy NaN
6 1 Talent Acquisition NaN
7 1 Interviews NaN
8 1 Employee Relations NaN
9 1 Staff Development NaN
Please let me know where I made mistakes, thanks!
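One likely cause, sketched below on made-up rows shaped like the question's data: str.split(',') keeps the whitespace after each comma, so the exploded skills in Table 1 carry leading spaces that don't match the stripped names in Table 2 (and Table 2 keeps only the first skill of each row, so later skills never get a skill_id at all).

```python
import pandas as pd

# Hypothetical rows shaped like the question's data.
skills = pd.DataFrame(
    {"person_id": [1], "person_skills": ["Talent Management, Human Resources"]}
)

# split(',') leaves the space after each comma in place ...
exploded = skills.assign(
    person_skills=skills["person_skills"].str.split(",")
).explode("person_skills")

# ... so " Human Resources" != "Human Resources" and a left merge yields NaN.
standard = pd.DataFrame(
    {"skill_id": [1, 2], "person_skills": ["Talent Management", "Human Resources"]}
)
bad = exploded.merge(standard, on="person_skills", how="left")

# Stripping the whitespace first makes every row match.
exploded["person_skills"] = exploded["person_skills"].str.strip()
good = exploded.merge(standard, on="person_skills", how="left")
```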

Data not showing in table form when using Jupyter Notebook

I ran the code below in Jupyter Notebook, expecting the output to appear like an Excel table, but instead the output was split up and not in a table. How can I get it to show up in table format?
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("Robbery_2014_to_2019.csv")
print(df.head())
Output:
X Y Index_ event_unique_id occurrencedate \
0 -79.270393 43.807190 17430 GO-2015134200 2015-01-23T14:52:00.000Z
1 -79.488281 43.764091 19205 GO-20142956833 2014-09-21T23:30:00.000Z
2 -79.215836 43.761856 15831 GO-2015928336 2015-03-23T11:30:00.000Z
3 -79.436264 43.642963 16727 GO-20142711563 2014-08-15T22:00:00.000Z
4 -79.369461 43.654526 20091 GO-20142492469 2014-07-12T19:00:00.000Z
reporteddate premisetype ucr_code ucr_ext \
0 2015-01-23T14:57:00.000Z Outside 1610 210
1 2014-09-21T23:37:00.000Z Outside 1610 200
2 2015-06-03T15:08:00.000Z Other 1610 220
3 2014-08-16T00:09:00.000Z Apartment 1610 200
4 2014-07-14T01:35:00.000Z Apartment 1610 100
offence ... occurrencedayofyear occurrencedayofweek \
0 Robbery - Business ... 23.0 Friday
1 Robbery - Mugging ... 264.0 Sunday
2 Robbery - Other ... 82.0 Monday
3 Robbery - Mugging ... 227.0 Friday
4 Robbery With Weapon ... 193.0 Saturday
occurrencehour MCI Division Hood_ID Neighbourhood \
0 14 Robbery D42 129 Agincourt North (129)
1 23 Robbery D31 27 York University Heights (27)
2 11 Robbery D43 137 Woburn (137)
3 22 Robbery D11 86 Roncesvalles (86)
4 19 Robbery D51 73 Moss Park (73)
Long Lat ObjectId
0 -79.270393 43.807190 2001
1 -79.488281 43.764091 2002
2 -79.215836 43.761856 2003
3 -79.436264 43.642963 2004
4 -79.369461 43.654526 2005
[5 rows x 29 columns]
Use display(df.head()) — it produces slightly nicer output than without display().
print() renders the frame as plain text, wrapping wide frames across several lines (which is the split-up output you are seeing), whereas display() — which Jupyter also applies automatically to the last expression in a cell — renders the DataFrame as a rich HTML table.
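A minimal illustration (the try/except is only so the snippet also runs outside IPython, and the small frame stands in for the CSV from the question):

```python
import pandas as pd

# Stand-in for pd.read_csv("Robbery_2014_to_2019.csv")
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# In a Jupyter cell, display() renders the frame as an HTML table;
# print() falls back to the plain-text representation that wraps wide frames.
try:
    from IPython.display import display
    display(df.head())
except ImportError:
    print(df.head())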

Same observations on data frame feature appear as independent

I have got the following DF:
carrier_name sol_carrier
aapt 702
aapt carrier 185
afrix 72
afr-ix 4
airtel 35
airtel 2
airtel dia and broadband 32
airtel mpls standard circuits 32
amt 6
anca test 1
appt 1
at tokyo 1
at&t 5041
att 2
batelco 723
batelco 2
batelco (manual) 4
beeline 1702
beeline - 01 6
beeline - 02 6
I need to get a unique list of carrier_name, so I have done some basic housekeeping, as I only want to keep the names with no whitespace at the beginning or end of the observation, with the following code:
carrier = pd.DataFrame(data['sol_carrier'].value_counts(dropna=False))
carrier['carrier_name'] = carrier.index
carrier['carrier_name'] = carrier['carrier_name'].str.strip()
carrier['carrier_name'] = carrier['carrier_name'].str.replace('[^a-zA-Z]', ' ', regex=True)
carrier['carrier_name'] = np.where(carrier['carrier_name'] == ' ', np.nan, carrier['carrier_name'])
carrier['carrier_name'] = carrier['carrier_name'].str.strip()
carrier = carrier.reset_index(drop=True)
carrier = carrier[['carrier_name', 'sol_carrier']]
carrier.sort_values(by='carrier_name')
What happens here is that I get a list of carrier_name but still see some duplicate observations, like airtel or beeline for example. I don't understand why this is happening, as both observations are the same, there are no more whitespaces at the beginning or end of the observation, and these observations are followed by their respective value_counts(), so there is no reason for them to be duplicated. Here is the same DF after the above code has been applied:
carrier_name sol_carrier
aapt 702
aapt carrier 185
afr ix 4
afrix 72
airtel 35
airtel 2
airtel dia and broadband 32
airtel mpls standard circuits 32
amt 6
anca test 1
appt 1
at t 5041
at tokyo 1
att 2
batelco 723
batelco 2
batelco manual 4
beeline 1702
beeline 6
beeline 6
That happens because you don't aggregate the results; you only change the values in the 'carrier_name' column. To aggregate the results, call
carrier.groupby('carrier_name').sol_carrier.sum()
or modify the 'data' dataframe and then call
data['sol_carrier'].value_counts()
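A minimal sketch of the difference, using made-up rows shaped like the question's data: the cleaning step rewrites the labels in place, so two raw values that normalise to the same name remain two rows until they are grouped.

```python
import pandas as pd

# Raw values that normalise to the same carrier name (hypothetical rows).
carrier = pd.DataFrame(
    {"carrier_name": ["airtel", "airtel*", "beeline", "beeline - 01", "beeline - 02"],
     "sol_carrier": [35, 2, 1702, 6, 6]}
)

# Cleaning alone rewrites the labels but keeps one row per raw value ...
carrier["carrier_name"] = (
    carrier["carrier_name"].str.replace("[^a-zA-Z]", " ", regex=True).str.strip()
)

# ... so the duplicates only disappear once the rows are aggregated.
totals = carrier.groupby("carrier_name", as_index=False)["sol_carrier"].sum()
```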

Sum based on grouping in pandas dataframe?

I have a pandas dataframe df which contains:
major men women rank
Art 5 4 1
Art 3 5 3
Art 2 4 2
Engineer 7 8 3
Engineer 7 4 4
Business 5 5 4
Business 3 4 2
Basically I need to find the total number of students, men and women combined, per major, regardless of the rank column. For Art, for example, the total should be all men + women, i.e. 23; Engineer 26; Business 17.
I have tried
df.groupby(['major_category']).sum()
But this separately sums the men and women rather than combining their totals.
Just add both columns and then groupby:
(df.men+df.women).groupby(df.major).sum()
major
Art 23
Business 17
Engineer 26
dtype: int64
melt() then groupby():
df.drop(columns='rank').melt('major').groupby('major', as_index=False)['value'].sum()
major value
0 Art 23
1 Business 17
2 Engineer 26
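Both answers can be run end to end; a self-contained version on the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {"major": ["Art", "Art", "Art", "Engineer", "Engineer", "Business", "Business"],
     "men": [5, 3, 2, 7, 7, 5, 3],
     "women": [4, 5, 4, 8, 4, 5, 4],
     "rank": [1, 3, 2, 3, 4, 4, 2]}
)

# Option 1: add the two columns first, then group the resulting Series by major.
combined = (df["men"] + df["women"]).groupby(df["major"]).sum()

# Option 2: drop rank, melt men/women into one value column, then group.
melted = (
    df.drop(columns="rank")
      .melt("major")
      .groupby("major", as_index=False)["value"].sum()
)
```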

Select rows with EXCEPT in sqlite3

I have a database with a table that contains the columns Name, Award, Winner (1 means won and 0 means did not win), and some other things that are irrelevant for this question.
I want to make a dataframe with the names of people who were nominated for an actress award (all awards with "Actress" in the name count) but never won, using sqlite3 in Python.
These are the first five rows of the dataframe:
Unnamed: 0 CeremonyNumber CeremonyYear CeremonyMonth CeremonyDay FilmYear Award Winner Name FilmDetails
0 0 1 1929 5 16 1927 Actor 1 Emil Jannings The Last Command
1 1 1 1929 5 16 1927 Actor 0 Richard Barthelmess The Noose
2 2 1 1929 5 16 1927 Actress 1 Janet Gaynor 7th Heaven
3 3 1 1929 5 16 1927 Actress 0 Louise Dresser A Ship Comes In
4 4 1 1929 5 16 1927 Actress 0 Gloria Swanson Sadie Thompson
I tried it with this query, but it did not give the correct result.
query = '''
select Name
from oscars
where Award like "Actress%"
except select Name
from oscars
where Award like "Actress%" and Winner == 1
'''
The outcome of this query should be a dataframe like this:
Name
0 Abigail Breslin
1 Adriana Barraza
2 Agnes Moorehead
3 Alfre Woodard
4 Ali MacGraw
In order to select all the actresses who were nominated and lost, you can filter directly with AND rather than EXCEPT. Something like this should work:
SELECT DISTINCT Name from Oscars WHERE Award LIKE "Actress%" AND Winner = 0
Note that this returns anyone with at least one losing nomination, including people who won in a different year; your EXCEPT query is the right shape for a strict "never won", so the two can legitimately differ.
Refer to the sqlite docs at https://www.sqlite.org/index.html for more information.
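The difference between the two queries can be checked with a tiny in-memory table (the rows below are made up, and Gloria Swanson's winning row is hypothetical, purely to show the distinction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oscars (Name TEXT, Award TEXT, Winner INTEGER)")
conn.executemany(
    "INSERT INTO oscars VALUES (?, ?, ?)",
    [
        ("Janet Gaynor", "Actress", 1),
        ("Louise Dresser", "Actress", 0),
        ("Gloria Swanson", "Actress", 0),
        ("Gloria Swanson", "Actress", 1),  # hypothetical win, for illustration
    ],
)

# Winner = 0: every losing nomination, even for someone who won another year.
lost_once = conn.execute(
    "SELECT DISTINCT Name FROM oscars WHERE Award LIKE 'Actress%' AND Winner = 0"
).fetchall()

# EXCEPT: nominated at least once, with no winning row at all.
never_won = conn.execute(
    "SELECT Name FROM oscars WHERE Award LIKE 'Actress%' "
    "EXCEPT SELECT Name FROM oscars WHERE Award LIKE 'Actress%' AND Winner = 1"
).fetchall()
```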
