Count with conditional in pandas - Python

I'm having a problem trying to count different variables for the same name. The thing is: I have a sheet with the names of all my workers and I need to count how many trainings each one had, but those trainings have different classifications: "Comercial", "Funcional" and others...
One of my columns is "Name" and the other is "Trainings". How can I filter those trainings and aggregate per name?
import pandas as pd
import numpy as np

# Load the 'Base' sheet from the workbook
xls = pd.ExcelFile('BASE_Indicadores_treinamento_2021 - V3.xlsx')
df = pd.read_excel(xls, 'Base')
display(df)

# Count rows per worker
df2 = df.groupby("Nome").agg({'Eixo': 'count'}).reset_index()
display(df2)
What I'm getting is the TOTAL of trainings per name, but I need the count of each of the categories I have in trainings (there are 5 of them). Does anyone know what I need to do?
Thanks!

df.groupby("Nome").agg('count') should give you the total number of trainings for each person.
df.groupby(["Nome","Eixo"]).agg({'Eixo':'count'}) should give you the count per person per training category.
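If you want the five categories side by side as columns, you can unstack that grouped count (a small sketch, assuming the Nome and Eixo columns from the question):

import pandas as pd

df = pd.read_excel('BASE_Indicadores_treinamento_2021 - V3.xlsx', sheet_name='Base')
# One row per worker, one column per training category, counts in the cells
counts = df.groupby(["Nome", "Eixo"]).size().unstack(fill_value=0)
print(counts)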

Problem solved!
Here's what I did:
import pandas as pd
import numpy as np

xls = pd.ExcelFile('BASE_Indicadores_treinamento_2021 - V3.xlsx')
df = pd.read_excel(xls, 'Base')
display(df)

# One boolean filter per training category
filt_funcional = df['Eixo'] == 'Funcional'
filt_comercial = df['Eixo'] == 'Comercial'
filt_liderança = df['Eixo'] == 'Liderança'
filt_negocio = df['Eixo'] == 'Negócio'
filt_obr_cert = df['Eixo'] == 'Obrigatórios e Certificações'

# Counts per worker for one category (repeat with each filter)
df.loc[filt_funcional]['Nome'].value_counts()
Much easier than I thought!
And to give credit, I only managed it because of this video: https://www.youtube.com/watch?v=txMdrV1Ut64
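For reference, pd.crosstab would get all five counts in one call instead of one filter per category (a sketch using the same column names):

import pandas as pd

df = pd.read_excel('BASE_Indicadores_treinamento_2021 - V3.xlsx', sheet_name='Base')
# Rows are workers, columns are the training categories, cells are counts
print(pd.crosstab(df['Nome'], df['Eixo']))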

Related

Append Pandas DataFrames in a Loop Function - Investpy

I am using investpy to get historical stock data for 2 stocks (TRP_pb, TRP_pc).
import investpy
import pandas as pd
import numpy as np
TRP_pb = investpy.get_stock_historical_data(stock='TRP_pb',
                                            country='canada',
                                            from_date='01/01/2022',
                                            to_date='01/04/2022')
print(TRP_pb.head())

TRP_pc = investpy.get_stock_historical_data(stock='TRP_pc',
                                            country='canada',
                                            from_date='01/01/2022',
                                            to_date='01/04/2022')
print(TRP_pc.head())
I can append the two tables by using the append method:
appendedtable = TRP_pb.append(TRP_pc, ignore_index=False)
What I am trying to do is to use a loop in order to combine these two tables.
Here is what I have tried so far:
preferredlist = ['TRP_pb','TRP_pc']
for i in preferredlist:
    new = investpy.get_stock_historical_data(stock=i,
                                             country='canada',
                                             from_date='01/01/2022',
                                             to_date='01/04/2022')
    new.append(new, ignore_index=True)
However, this doesn't work.
I would appreciate any help.
Since get_stock_historical_data returns a DataFrame, you can create an empty DataFrame before the for loop and concat inside the loop. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the way to go.)
preferredlist = ['TRP_pb','TRP_pc']
final_list = pd.DataFrame()
for i in preferredlist:
    new = investpy.get_stock_historical_data(stock=i,
                                             country='canada',
                                             from_date='01/01/2022',
                                             to_date='01/04/2022')
    final_list = pd.concat([final_list, new])
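One note on this pattern: concatenating inside the loop re-copies the accumulated frame on every iteration. Collecting the pieces in a list and calling pd.concat once at the end is the usual idiom (a sketch with the same tickers):

import investpy
import pandas as pd

preferredlist = ['TRP_pb', 'TRP_pc']
frames = []
for ticker in preferredlist:
    # Fetch each ticker's history and stash it for a single concat later
    frames.append(investpy.get_stock_historical_data(stock=ticker,
                                                     country='canada',
                                                     from_date='01/01/2022',
                                                     to_date='01/04/2022'))
combined = pd.concat(frames)
print(combined.head())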

Summing a column in a Python dataframe

This table from Wikipedia shows the 10 biggest box office hits. I can't seem to get the total of the 'worldwide_gross' column. Can someone help? Thank you.
import pandas as pd
boxoffice_df=pd.read_html('https://en.wikipedia.org/wiki/List_of_highest-grossing_films')
films = boxoffice_df[1]
films.rename(columns = {'Worldwide gross(2020 $)':'worldwide_gross'}, inplace = True)
films.worldwide_gross.sum(axis=0)
This is the output I get when I try calculating the total global earnings:
Total = films['worldwide_gross'].astype('Int32').sum()
Or convert the data types first:
films = films.convert_dtypes()
Total = films['worldwide_gross'].sum()
Or:
films = films.astype({"worldwide_gross": int})
Total = films['worldwide_gross'].sum()
You will have to keep only the digits in the worldwide_gross column using a regex, and then convert the column to float using Series.astype(float).
Add:
films.worldwide_gross = films.worldwide_gross.str.replace(r'\D', '', regex=True).astype(float)
Complete Code:
import pandas as pd
boxoffice_df=pd.read_html('https://en.wikipedia.org/wiki/List_of_highest-grossing_films')
films = boxoffice_df[1]
films.rename(columns = {'Worldwide gross(2020 $)':'worldwide_gross'}, inplace = True)
films.worldwide_gross = films.worldwide_gross.str.replace(r'\D', '', regex=True).astype(float)
films.worldwide_gross.sum(axis=0)
Here's one way you can do it.
This code will convert the values in the worldwide_gross to integers and then sum the column to get the total gross.
import pandas as pd
def get_gross(gross_text):
    # Keep everything after the '$', drop thousands separators
    pos = gross_text.index('$')
    return int(gross_text[pos+1:].replace(',', ''))
boxoffice_df=pd.read_html('https://en.wikipedia.org/wiki/List_of_highest-grossing_films')
films = boxoffice_df[1]
films.rename(columns = {'Worldwide gross(2020 $)':'worldwide_gross'}, inplace = True)
films['gross_numeric'] = films['worldwide_gross'].apply(lambda x: get_gross(x))
total_gross = films['gross_numeric'].sum()
print(f'Total gross: ${total_gross}')

Python: how to create a new dataset from an existing one based on a condition

For example:
I have this code:
import pandas
df = pandas.read_csv('covid_19_data.csv')
This dataset has a column called countryterritoryCode, which is the country code for each country (sample data from the dataset omitted here).
The dataset has information about COVID-19 cases from all the countries in the world.
How do I create a new dataset where only the USA info appears
(where countryterritoryCode == USA)?
import pandas
df = pandas.read_csv('covid_19_data.csv')
new_df = df[df["countryterritoryCode"] == "USA"]
or
new_df = df[df.countryterritoryCode == "USA"]
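df.query is an equivalent spelling if you prefer query strings (using the countryterritoryCode column named in the question):
new_df = df.query("countryterritoryCode == 'USA'")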
Use df.groupby if you want one group per country, then pull out the USA group:
df = pandas.read_csv('covid_19_data.csv')
df_new = df.groupby('countryterritoryCode').get_group('USA')

Pandas Correction Previous Row

I have a DataFrame like this:
import pandas as pd

# create dataframe
df = pd.DataFrame({"Date": range(0, 22),
                   "Country": ["USA"] * 22,
                   "Number": [0,0,0,0,0,1,1,3,5,6,4,6,7,8,7,10,25,50,75,60,45,100],
                   "Number is Corrected": [0,0,0,0,0,1,1,3,5,6,6,6,7,7,7,10,25,50,50,60,60,100]})
But this DataFrame has a problem: some numbers are wrong.
The previous number always has to be smaller than the next one, which (6,4,6,7,8,7...50,75,60,45,100) violates.
I can't use df.sort because it's not about sorting, it's about correction.
Edit: I added the corrected numbers in the "Number is Corrected" column.
Guessing from your "Number is Corrected" list, you could probably use this:
import pandas as pd

# create dataframe
df = pd.DataFrame({"Date": range(0, 22),
                   "Country": ["USA"] * 22,
                   "Number": [0,0,0,0,0,1,1,3,5,6,4,6,7,8,7,10,25,50,75,60,45,100]})
# "Number is Corrected": [0,0,0,0,0,1,1,3,5,6,6,6,7,7,7,10,25,50,50,60,60,100]

def correction():
    df['Number is Corrected'] = df['Number']
    cache = 0
    for num in range(len(df)):
        # Raise any value that falls below the running maximum seen so far
        if df.loc[num, 'Number is Corrected'] < cache:
            df.loc[num, 'Number is Corrected'] = cache
        else:
            cache = df.loc[num, 'Number is Corrected']
    print(df)

if __name__ == "__main__":
    correction()
But there is some inconsistency, as in your conversation with jezrael. You may need to update the logic once it's clearer what output you want. Good luck.
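For what it's worth, the loop above is computing a running maximum, so the same correction can be done in one vectorized step with cummax (a sketch that matches the loop's output rather than the hand-written "Number is Corrected" column, which has the inconsistencies noted above):

import pandas as pd

df = pd.DataFrame({"Date": range(0, 22),
                   "Country": ["USA"] * 22,
                   "Number": [0,0,0,0,0,1,1,3,5,6,4,6,7,8,7,10,25,50,75,60,45,100]})
# Running maximum: any value smaller than its predecessor is raised to it
df['Number is Corrected'] = df['Number'].cummax()
print(df)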

Pandas: fastest way to filter the DF by date

I have an efficiency question for you. I wrote some code to analyze a report that holds over 70k records and 400+ unique organizations, so that my supervisor can enter the year/month/date they are interested in and have it pop out the information.
The beginning of my code is:
import pandas as pd
import numpy as np
import datetime
main_data = pd.read_excel("UpdatedData.xlsx", encoding= 'utf8')
#column names from DF
epi_expose = "EpitheliumExposureSeverity"
sloughing = "EpitheliumSloughingPercentageSurface"
organization = "OrgName"
region = "Region"
date = "DeathOn"
#list storage of definitions
sl_list = ["",'None','Mild','Mild to Moderate']
epi_list= ['Moderate','Moderate to Severe','Severe']
#Create DF with four columns
df = main_data[[region, organization, epi_expose, sloughing, date]]
#filter it down to months
starting_date = datetime.date(2017,2,1)
ending_date = datetime.date(2017,2,28)
df = df[(df[date] > starting_date) & (df[date] < ending_date)]
I am then performing conditional filtering below to get counts by region and organization. It works, but it is slow. Is there a more efficient way to query my DF and build a DF that ONLY has the rows between the two dates? Or is this the most efficient way without altering how the database I am using is set up?
I can provide more of my code, but if I filter by month before exporting to Excel, the code runs in a matter of seconds, so I am not concerned about speed beyond getting the correct date fields.
Thank you!
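One common approach, for what it's worth, is to parse the date column once, make it a sorted index, and slice with .loc, which lets pandas binary-search the index instead of scanning every row (a sketch using the file and column names from the question):

import pandas as pd

main_data = pd.read_excel("UpdatedData.xlsx")
# Parse once, index, and sort so that .loc can slice by date range
main_data["DeathOn"] = pd.to_datetime(main_data["DeathOn"])
main_data = main_data.set_index("DeathOn").sort_index()
february = main_data.loc["2017-02-01":"2017-02-28"]
print(february.head())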
