Cannot fill missing values of certain rows - python

I am working on imputing NaNs of rows based on certain columns. So my dataframe looks something like this:
Product | Store Name | January Sales | February Sales | March Sales
For example, January Sales would be NaN for a given combination of Product and Store Name, and I am imputing it from the averages of the other months. Other columns, such as February Sales, might also have NaNs in the same row.
The code that I used was:
indexes = df.index[df['January Sales'].isna()].to_list()
fillCols = df.iloc[:, 3:]
df.loc[indexes, 'January Sales'].fillna(fillCols.mean(axis=0), inplace=True)
But the above code doesn't seem to work: it won't impute any data, even though the individual pieces do work when run separately. How can I solve this problem?
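A likely reason the code above fails: df.loc[indexes, 'January Sales'] returns a copy, so fillna(..., inplace=True) never writes back into df, and fillCols.mean(axis=0) computes per-column means rather than per-row means. A minimal sketch of one way to do it with a direct assignment (the month column names are assumed from the description above):
# Fill missing January values with the row-wise mean of the other month columns
month_cols = ['February Sales', 'March Sales']
mask = df['January Sales'].isna()
df.loc[mask, 'January Sales'] = df.loc[mask, month_cols].mean(axis=1)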

Related

Automatically Map columns from one dataframe to another using pandas

I am trying to merge multiple dataframes into a master dataframe based on the columns in the master dataframe. For example:
MASTER DF:
PO ID | Sales year | Name | Acc year
10    | 1934       | xyz  | 1834
11    | 1942       | abc  | 1842
SLAVE DF:
PO ID | Yr   | Amount | Year
12    | 1935 | 365.2  | 1839
13    | 1966 | 253.9  | 1855
RESULTANT DF:
PO ID | Sales Year | Acc Year
10    | 1934       | 1834
11    | 1942       | 1842
12    | 1935       | 1839
13    | 1966       | 1855
Notice how I have manually mapped columns (Sales Year-->Yr and Acc Year-->Year) since I know they are the same quantity, only the column names are different.
I am trying to write some logic that can map them automatically based on some criteria (be it column names or the data type of that column) so that the user does not need to map them manually.
If I map them by column name, the paired columns have different names ((Sales Year, Yr) and (Acc Year, Year)). So to which column in the MASTER DF should the fourth column (Year) in the SLAVE DF be mapped?
Another way would be to map them based on their column values, but both columns hold similar year values, so that does not disambiguate them either.
The logic should be able to map Yr to Sales Year and map Year to Acc Year automatically.
Any idea/logic would be helpful.
Thanks in advance!
I think the safest approach is to manually rename the column names so both frames agree:
slave = slave.rename(columns={'Yr': 'Sales Year', 'Year': 'Acc Year'})
master = master.rename(columns={'Sales year': 'Sales Year', 'Acc year': 'Acc Year'})
One idea is to select the integer columns, keep only those whose values all fall between thresholds (here between 1800 and 2000), and finally set the column names:
df = df.set_index('PO ID')
df1 = df.select_dtypes('integer')  # only the integer columns are candidate year columns
mask = (df1.gt(1800) & df1.lt(2000)).all().reindex(df.columns, fill_value=False)
df = df.loc[:, mask].set_axis(['Sales Year', 'Acc Year'], axis=1)
Generally this is impossible, as there is no solid/consistent factor by which we can map the columns.
That said, one thing you can do is use cosine similarity to calculate how similar one string (in this case a column name) is to the strings in the other dataframe.
So in your case, we'll get 4 vectors for the first dataframe and 4 for the other one. Now calculate the cosine similarity between the first vector (PO ID) from the first dataframe and the first vector from the second dataframe (PO ID). This will return 100% as both strings are the same.
For each column, you'll get 4 confidence scores. Just pick the highest and map accordingly.
That way you get a makeshift logic for mapping the columns, although there are loopholes in it too. But it is better than nothing, since the user will have far fewer columns to map manually.
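For illustration, here is a rough sketch of that idea using scikit-learn (the vectorizer settings and variable names are my own assumptions, not part of the original answer):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

master_cols = ['PO ID', 'Sales year', 'Name', 'Acc year']
slave_cols = ['PO ID', 'Yr', 'Amount', 'Year']

# Vectorize the column names as character n-grams so short names still overlap
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
matrix = vec.fit_transform(master_cols + slave_cols)
master_vecs = matrix[:len(master_cols)]
slave_vecs = matrix[len(master_cols):]

# For each slave column, pick the master column with the highest similarity
scores = cosine_similarity(slave_vecs, master_vecs)
for i, col in enumerate(slave_cols):
    best = scores[i].argmax()
    print(f'{col!r} -> {master_cols[best]!r} (score {scores[i, best]:.2f})')
Note that very short abbreviations such as 'Yr' may still score poorly against 'Sales year', which is one of the loopholes mentioned above.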
Cheers!

Python: calculating difference between rows

I have a dataframe listing revenue by company and year. See below:
Company | Acc_Name      | Date | Value
A2M     | Sales/Revenue | 2016 | 167770000.0
A2M     | Sales/Revenue | 2017 | 360842000.0
A2M     | Sales/Revenue | 2018 | 68087000.0
A2M     | Sales/Revenue | 2019 | 963000000.0
A2M     | Sales/Revenue | 2020 | 143346000.0
In Python I want to create a new column showing the year-on-year difference, so 2017 will show the variance between 2017 and 2016.
I want to run this on a large dataframe with multiple companies.
Here is my solution, which creates a new column with the previous year's data and then simply takes the difference between them:
df["prev_val"] = df["Value"].shift(1)  # new column with the previous year's value
df["Difference"] = df["Value"] - df["prev_val"]
Since you want to do this for several companies, make sure that you filter out the other companies with
this_company_df = df[df["Company"] == "A2M"].copy()  # .copy() avoids SettingWithCopyWarning on later assignments
and order the data ascending with
this_company_df = this_company_df.sort_values(by=["Date"], ascending=True)
So the final code should look something like this:
this_company_df = df[df["Company"] == "A2M"].copy()
this_company_df = this_company_df.sort_values(by=["Date"], ascending=True)
this_company_df["prev_val"] = this_company_df["Value"].shift(1)
this_company_df["Difference"] = this_company_df["Value"] - this_company_df["prev_val"]
So the result is stored in the "Difference" column. One more thing you could improve is to take care of the initial year by setting it to 0, as sketched below.
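For instance (a small sketch, not part of the original answer):
this_company_df["Difference"] = this_company_df["Difference"].fillna(0)  # the first year has no previous value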
The simplest way to do it is
revenues['Revenue_Change'] = revenues['Value'].diff(periods=1)
However, since your dataframe contains data for multiple companies, you can use this:
revenues['Revenue_Change'] = revenues.groupby('Company',sort=False)['Value'].diff(periods=1)
This sets the first entry for each company in the set to NaN.
If, by any chance, the dataframe is not in order, you can use
revenues = revenues.sort_values('Company')
Groupby will correctly calculate YoY revenue change, even if entries are separated from one another, as long as the actual revenues are chronologically in order for each company.
EDIT:
If everything is out of order, then sort by the year, groupby and then sort by company name:
revenues = revenues.sort_values('Date')
revenues['Revenue_Change'] = revenues.groupby('Company',sort=False)['Value'].diff()
revenues = revenues.sort_values('Company')

For each NAME, calculate the average SNOW for each month

import pandas as pd
import numpy as np
# Show the specified columns and save it to a new file
col_list= ["STATION", "NAME", "DATE", "AWND", "SNOW"]
df = pd.read_csv('Data.csv', usecols=col_list)
df.to_csv('filteredData.csv')
df['year'] = pd.DatetimeIndex(df['DATE']).year
df2016 = df[(df.year==2016)]
df_2016 = df2016.groupby(['NAME', 'DATE'])['SNOW'].mean()
df_2016.to_csv('average2016.csv')
How come my dates are not ordered correctly here? Row 12 should be at the top, but it's at the bottom of May instead, and the same goes for row 25.
The average of SNOW per NAME/month is also not being displayed in my Excel sheet. Why is that? Basically, I'm trying to calculate the average SNOW for May in ADA 0.7 SE, MI US, then the average SNOW for June in ADA 0.7 SE, MI US, and so on.
I've spent all day and this is all I have got... Any help will be appreciated. Thanks in advance.
original data
https://gofile.io/?c=1gpbyT
Please try
Data
df = pd.read_csv(r'directory where the data is\data.csv')
df
Working
df.dtypes   # check the datatype of each column
df.columns  # list the columns
df['DATE'] = pd.to_datetime(df['DATE'])  # convert DATE from object to datetime
df.set_index(df['DATE'], inplace=True)   # set the date as the index
df['SNOW'] = df['SNOW'].fillna(0)  # fill all Not a Number values with zeros to make aggregation possible
# Group by month and NAME, calculate the mean of SNOW, and store the result in a new column
df['SnowMean'] = df.groupby([df.index.month, df.NAME])['SNOW'].transform('mean')
df
Checking
df.loc[:, ['DATE', 'SnowMean']]  # slice the relevant columns to check
I realize you have multiple years. If you want the mean per month in each year, extract the year as well and add it to the groupby keys, as follows:
df['SnowMeanPerYearPerMonth'] = df.groupby([df.index.month, df.index.year, df.NAME])['SNOW'].transform('mean')
df
Check again
pd.set_option('display.max_rows', 999)  # display up to 999 rows to check
df.loc[:, ['DATE', 'SnowMean', 'SnowMeanPerYearPerMonth']]  # slice the relevant columns to check

How can I group monthly over years in Python with pandas?

I have a dataset ranging from 2009 to 2019. The dates include year, month and day. I have two columns: one with dates and the other with values. I need to group my dataframe monthly, summing up the values in the other column. At the moment I am setting the date column as the index and using df.resample('M').sum().
The problem is that this groups my dataframe monthly for each year separately (so I have 128 values in the "date" column). How can I group my data by the 12 calendar months only, without taking the year into consideration?
Thank you very much in advance
I attached two images as examples: the dataframe I have and the one I want to obtain.
Use dt.month on your date column. Example:
df.groupby(df['date'].dt.month).agg({'value': 'sum'})
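If the date column has already been set as the index (as described in the question), a minimal variant of the same idea groups on the index's month instead (assuming a DatetimeIndex):
df.groupby(df.index.month).agg({'value': 'sum'})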

Margins created with pivot_table have issues with Period datatype

I have a large (10m+ rows) dataframe with three columns: sales dates (dtype: datetime64[ns]), customer names and sales per customer. Sales dates include day, month and year in the form yyyy-mm-dd (e.g. 2019-04-19). I discovered the pandas to_period function and would like to use the period[A-MAR] dtype. As the business year (ending in March) is different from the calendar year, that is exactly what I was looking for. With the to_period function I am able to assign the sales dates to the correct business year while avoiding creating new columns with additional information.
I convert the date column as follows:
df_input['Date'] = pd.DatetimeIndex(df_input['Date']).to_period("A-MAR")
Now a peculiar issue arises when I use pivot_table to aggregate the data and set margins=True. The aggfunc returns the correct values in the output table. However, the results in the last row (the total created by the margins) are wrong, as NaN is shown (or in my case a 0, as I set fill_value = 0). The function I use:
df_output = df_input.pivot_table(index="Customer",
                                 columns="Date",
                                 values="Sales",
                                 aggfunc={"Sales": np.sum},
                                 fill_value=0,
                                 margins=True)
When I do not convert the dates to a period but use a simple year (integer) instead, the margins are calculated correctly and no NaN appears in the last row of the pivot output table.
I searched all over the internet but could not find a solution that was working. I would like to keep working with the period datatype and just need the margins to be calculated correctly. I hope someone can help me out here. Thank you!
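One possible workaround (an assumption on my part, not a confirmed fix): convert the periods to strings just for the pivot, so the margins row aggregates plain labels while keeping the business-year assignment from to_period:
df_pivot = df_input.copy()
df_pivot['Date'] = df_pivot['Date'].astype(str)  # e.g. '2020' for the business year ending March 2020
df_output = df_pivot.pivot_table(index="Customer",
                                 columns="Date",
                                 values="Sales",
                                 aggfunc="sum",
                                 fill_value=0,
                                 margins=True)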
