How to correctly group columns? - python

I have a DataFrame with these columns:
DF.head():

Email          Month  Year
abc#Mail.com   1      2018
abb#Mail.com   1      2018
abd#Mail.com   2      2019
...
abbb#Mail.com  6      2019
What I want to do is to get the total of email addresses in each month for both years 2018 and 2019 (knowing that I don't need to filter, since I have only these two years).
This is what I've done, but I want to make sure that this is right:
Stats = DF.groupby(['Year','Month'])['Email'].count()
Any suggestions?

It depends on what you need.
If you need to exclude missing values, or if no missing values exist in the Email column, your solution is right; use GroupBy.count:
Stats = DF.groupby(['Year','Month'])['Email'].count()
If you need to count all rows in each group, including missing values (if any exist), use GroupBy.size:
Stats = DF.groupby(['Year','Month']).size()
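A minimal sketch of the difference, using a tiny made-up frame with one missing email:

import numpy as np
import pandas as pd

DF = pd.DataFrame({
    'Email': ['abc#Mail.com', 'abb#Mail.com', np.nan],
    'Month': [1, 1, 2],
    'Year': [2018, 2018, 2019],
})

# count ignores the NaN email: (2018, 1) -> 2, (2019, 2) -> 0
print(DF.groupby(['Year', 'Month'])['Email'].count())

# size counts every row in the group: (2018, 1) -> 2, (2019, 2) -> 1
print(DF.groupby(['Year', 'Month']).size())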

Related

Remove Date Grouping from Data

Looking to clean multiple data sets in a more automated way. The current format has the year in one column and the months as separate columns, with the numbers as values.
Below is an example of the current format, the original data has multiple years/months.
Current Format:

Year  Jan  Feb
2022  300  200
Below is an example of how I would like the new format to look. It combines month and year into one column and transposes the numbers into another column.
How would I go about doing this in Excel or Python? I have files with many years and multiple months.
New Format:

Date     Number
2022-01  300
2022-02  200
Check the solution below. You need to extend month_df for all twelve months; currently it just caters to the example.
import pandas as pd

df = pd.DataFrame({'Year': [2022], 'Jan': [300], 'Feb': [200]})
month_df = pd.DataFrame({'Char_Month': ['Jan', 'Feb'], 'Int_Month': ['01', '02']})

# Melt the wide table into long form: one row per (Year, month) pair
melted_df = pd.melt(df, id_vars=['Year'], value_vars=['Jan', 'Feb'],
                    var_name='Char_Month', value_name='Number')

# Map month names to zero-padded numbers, then build the Date column
out = (pd.merge(melted_df, month_df, on='Char_Month')
       .assign(Date=lambda d: d['Year'].astype(str) + '-' + d['Int_Month'])
       [['Date', 'Number']])
print(out)
Output:

      Date  Number
0  2022-01     300
1  2022-02     200
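An alternative sketch, assuming the month headers are standard English abbreviations: skip the lookup table and let pandas parse the month names directly.

import pandas as pd

df = pd.DataFrame({'Year': [2022], 'Jan': [300], 'Feb': [200]})

long_df = df.melt(id_vars='Year', var_name='Month', value_name='Number')
# Parse strings like "2022-Jan", then format back to "YYYY-MM"
long_df['Date'] = pd.to_datetime(
    long_df['Year'].astype(str) + '-' + long_df['Month'], format='%Y-%b'
).dt.strftime('%Y-%m')
print(long_df[['Date', 'Number']])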

Pivoting data in Python using Pandas

I am doing a time series analysis. I ran the code below to generate a random year in the dataframe, as the original data did not have year values:
from random import randint

# Generate a random year from 2019 to 2022 to create ideal conditions
wc['Random_date'] = wc.Monthdate.apply(lambda val: f'{val} {randint(2019, 2022)}')
And now I have a dataframe that looks like this:
wc.head()
The ID column is the index currently, and I would like to generate a pivoted dataframe that looks like this:
Random_date  Count_of_ID
Jul 3 2019   2
Jul 4 2019   3
I do understand that aggregation will need to be done after I pivot the data, but the following code is not working:
abscount = wc.pivot(index= 'Random_date', columns= 'Random_date', values= 'ID')
Here is the ending part of the error that I see:
Please help. Thanks.
You may check with
df['Random_date'].value_counts()
If you need a unique count:
df.reset_index().drop_duplicates('ID')['Random_date'].value_counts()
Or
df.reset_index().groupby('Random_date')['ID'].nunique()
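A minimal sketch of the difference between the two, on hypothetical data with ID as the index, as in the question:

import pandas as pd

df = pd.DataFrame(
    {'Random_date': ['Jul 3 2019', 'Jul 3 2019', 'Jul 4 2019',
                     'Jul 4 2019', 'Jul 4 2019']},
    index=pd.Index([101, 101, 102, 103, 104], name='ID'),
)

# Counts every row, even when the same ID appears twice on a date
print(df['Random_date'].value_counts())

# Counts each distinct ID once per date: Jul 3 2019 -> 1, Jul 4 2019 -> 3
print(df.reset_index().groupby('Random_date')['ID'].nunique())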

Python: calculating difference between rows

I have a dataframe listing revenue by company and year. See below:
Company  Acc_Name       Date  Value
A2M      Sales/Revenue  2016  167770000.0
A2M      Sales/Revenue  2017  360842000.0
A2M      Sales/Revenue  2018  68087000.0
A2M      Sales/Revenue  2019  963000000.0
A2M      Sales/Revenue  2020  143346000.0
In Python I want to create a new column showing the difference year on year, so 2017 will show the variance between 2017 and 2016.
I want to run this on a large dataframe with multiple companies.
Here is my solution, which creates a new column with the previous year's data and then simply takes the difference between them:
df["prev_val"] = df["Value"].shift(1) # creates new column with previous year data
df["Difference"] = df["Value"] - df["prev_val"]
Since you want to do this for several companies, make sure that you filter out other companies with
this_company_df = df[df["Company"] == "A2M"]
and order the data in ascending order with
this_company_df = this_company_df.sort_values(by=["Date"], ascending=True)
So, the final code should look something like this:
this_company_df = df[df["Company"] == "A2M"]
this_company_df = this_company_df.sort_values(by=["Date"], ascending=True)
this_company_df["prev_val"] = this_company_df["Value"].shift(1)
this_company_df["Difference"] = this_company_df["Value"] - this_company_df["prev_val"]
So, the result is stored in the "Difference" column. One more thing you could improve is to take care of the initial year, whose difference will be NaN, by setting it to 0.
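For instance, a one-line sketch of that fix:

this_company_df["Difference"] = this_company_df["Difference"].fillna(0)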
The simplest way to do it is:
revenues['Revenue_Change'] = revenues['Value'].diff(periods=1)
However, since your dataframe contains data for multiple companies, you can use this:
revenues['Revenue_Change'] = revenues.groupby('Company',sort=False)['Value'].diff(periods=1)
This sets the first entry for each company in the set to NaN.
If, by any chance, the dataframe is not in order, you can use
revenues = revenues.sort_values('Company')
Groupby will correctly calculate the YoY revenue change, even if a company's rows are separated from one another, as long as each company's revenues are in chronological order.
EDIT:
If everything is out of order, then sort by year, apply the groupby, and then sort by company name:
revenues = revenues.sort_values('Date')
revenues['Revenue_Change'] = revenues.groupby('Company',sort=False)['Value'].diff()
revenues = revenues.sort_values('Company')
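A minimal sketch with two hypothetical companies, showing that the groupby version keeps each company's differences separate:

import pandas as pd

revenues = pd.DataFrame({
    'Company': ['A2M', 'A2M', 'XYZ', 'XYZ'],
    'Acc_Name': ['Sales/Revenue'] * 4,
    'Date': [2016, 2017, 2016, 2017],
    'Value': [167770000.0, 360842000.0, 50000000.0, 65000000.0],
})

revenues = revenues.sort_values('Date')
# diff is computed per company, so each company's first year stays NaN
revenues['Revenue_Change'] = revenues.groupby('Company', sort=False)['Value'].diff()
print(revenues.sort_values('Company'))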

Looking to find the sum of a unique member's payments based on whether some dates fall within a certain time window in python

This is my first time asking in the community, although I have used the website for help extensively in the past. I was not able to find a solution to this specific problem, and I am a fairly amateur Python user, so I am having a hard time putting the logic into code, although I think the logic is clear enough. I am using Python via Google Colab for this and have shared a Google Sheet with data at the end.
In my scenario, we have a start month, a length of time, and a payout month. The end month can be calculated from the length. A person can be a part of multiple groups and thus can have multiple start, end, and payout months.
The goal is to find how much a member is expected to have paid as of today.
E.g. a group begins in Jan 2020, is 10 months long, and will end in Oct 2020. The monthly contribution is 5k. The payout month is, let's say, Mar 2020. While we technically should be getting 10 payments (a 10-month group), we will expect only 9 payments, i.e. 45k, because when the payout month comes around, the member is not expected to pay for that month. If, say, the group began in Dec 2020 and was 10 months long, then as of today we would only expect 5 payments (Dec to Apr 21).
These scenarios get complicated when, e.g., a member is part of 3 groups, so 3 start dates, 3 end dates, 3 payout dates, and likely 3 different instalment amounts. Let's say the start dates are Jan 20, Feb 20, and Mar 20, and all groups are 10 months long. Let's also say that there is a payout in Apr 20. In Apr 20 all the groups will be active (the end month has not been reached yet), so in Apr 20 (the payout month) we will expect no payments from any of the groups.
Meaning that if there are 3 groups and a payout falls between any group's start and end month, then we will not expect a payment for that group in that month. If two payouts fall between the start and end months of the groups, then we will not expect 6 payments for that month, 2 for each group, and so on. If, say, there are 3 groups and 1 payout falls between the dates of only 2 of them, then we will skip the instalments for only those two groups (whatever the instalment is for those groups).
The following Google Sheet has some sample data.
The Group ID column is entirely unique and will have no dups (you can think of it as an invoice, since all invoices are unique). The Member Code column can have duplicates, since a member can have more than one group. Do not worry about the days in the dates; what matters is the month and year. We have the start month, group length, and payout month. We also have how much money is owed monthly by a member for that group.
https://docs.google.com/spreadsheets/d/1nAXlifIQdYiN1MWTv7vs2FqbFu2v6ykCzQjrJNPTBWI/edit#gid=0
Any help or advice would be great.
EDITED -> I have tried the following but got an error (I coded the months, i.e. Jan 2020 = 1, Feb 2020 = 2, and so on, so I don't have to mess around with dates):
deal_list = df['Group ID'].tolist()

def instalment(deal_list):
    for member in df['Member Code'].unique():
        if df['Coded Payout Month'] >= df['Coded Start Month'] and df['Coded Payout Month'] <= df['Coded End Month']:
            count_months = count_months + 1
    return count_months * df['Instalment']

instalment(deal_list)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
EDITED - I have also tried the following just now (with help from Pandas: Groupby and iterate with conditionals within groups?). It sort of worked, in that it gave me a count of 1 for each row. I was trying to get the number of times each payout month appears within the dates of a group.
grouped = df.groupby('Member Code')
for g_idx, group in grouped:
    for r_idx, row in group.iterrows():
        if (((row['Coded Payout Month'] >= group['Coded Start Month']).any())
                & (row['Coded Payout Month'] <= group['Coded End Month']).any()):
            df.loc[r_idx, 'payout_cut'] = +1  # '=+ 1' assigns 1; it does not increment
print(df)
I found a way around it. Essentially, rather than trying to iterate through all the rows, I transformed my data into long form first in Google Sheets via transpose and filter (I filtered for all payout months for a member and transposed the results into the rows). I then pushed that into Colab and, through pd.melt, transformed the data back into unique rows per deal with the additional payouts as required. Then running the condition was simple enough, and finally I summed all the true values.
I can explain a bit more if anyone needs.
I took inspiration from here:
https://youtu.be/pKvWD0f18Pc
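For anyone wanting to stay in pandas, here is a minimal sketch of the window check, using hypothetical coded-month data shaped like the columns described above:

import pandas as pd

# Hypothetical coded-month data: one row per group a member belongs to
df = pd.DataFrame({
    'Member Code': ['M1', 'M1', 'M2'],
    'Coded Start Month': [1, 2, 3],
    'Coded End Month': [10, 11, 12],
    'Coded Payout Month': [4, 4, 5],
    'Instalment': [5000, 5000, 3000],
})

# For each row, count the member's payout months falling in [start, end]
def count_payouts(group):
    payouts = group['Coded Payout Month'].unique()
    return group.apply(
        lambda row: ((payouts >= row['Coded Start Month'])
                     & (payouts <= row['Coded End Month'])).sum(),
        axis=1,
    )

df['payout_cut'] = df.groupby('Member Code', group_keys=False).apply(count_payouts)
# Instalments skipped in payout months, priced at each group's instalment
df['skipped_amount'] = df['payout_cut'] * df['Instalment']
print(df)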

How to delete whitespace from an ndarray in python

I have yearly data sets with some missing data. I used this code to read the file, but I was unable to omit the whitespace present at the end of February. Can anyone help solve this problem?
import numpy as np
import pandas as pd

df1 = pd.read_fwf('DQ404.7_77.txt', widths=ws, header=9, nrows=31, keep_default_na=False)
df1 = df1.drop(columns='Day')  # columns= replaces the old positional axis argument
df2 = np.array(df1).T
What I want is to arrange all the data in one column with respect to the date. My data is uploaded at this link, which you can download:
https://drive.google.com/open?id=0B2rkXkOkG7ExbEVwZUpHR29LNFE
What I want is to get time series data from this file, and it should look like
Feb,25 13
Feb,26 13
Feb,27 13
Feb,28 13
March, 1 10
March, 2 10
March, 3 10
Not with empty strings between February and March.
So, after a lot of comments, it looks like df[df != ''] works for you.
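A tiny illustration of that filter on a flattened series (hypothetical values; the file is read as text, so the gaps arrive as empty strings):

import pandas as pd

s = pd.Series(['13', '13', '', '', '10', '10'])
s = s[s != '']      # drop the padding empty strings
s = s.astype(int)   # the values were read as text, so convert to numbers
print(s)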
