I am trying to merge multiple dataframes to a master dataframe based on the columns in the master dataframes. For Example:
MASTER DF:
PO ID
Sales year
Name
Acc year
10
1934
xyz
1834
11
1942
abc
1842
SLAVE DF:
PO ID
Yr
Amount
Year
12
1935
365.2
1839
13
1966
253.9
1855
RESULTANT DF:
PO ID
Sales Year
Acc Year
10
1934
1834
11
1942
1842
12
1935
1839
13
1966
1855
Notice how I have manually mapped columns (Sales Year-->Yr and Acc Year-->Year) since I know they are the same quantity, only the column names are different.
I am trying to write some logic which can map them automatically based on some criteria (be it column names or the data type of that column) so that user does not need to map them manually.
If I map them by column name, both the columns have different names (Sales Year, Yr) and (Acc Year, Year). So to which column should the fourth column (Year) in the SLAVE DF be mapped in the MASTER DF?
Another way would be to map them based on their column values but again they are the same so cannot do that.
The logic should be able to map Yr to Sales Year and map Year to Acc Year automatically.
Any idea/logic would be helpful.
Thanks in advance!
I think safest is manually rename columns names.
df = df.rename(columns={'Yr':'Sales year','Sales year':'Sales Year',
'Year':'Acc Year','Acc Year':'Acc year'})
One idea is filter columns names for integers and if all values are between thresholds, here between 1800 and 2000, last set columns names:
df = df.set_index('PO ID')
df1 = df.select_dtypes('integer')
mask = (df1.gt(1800) & df1.lt(2000)).all().reindex(df.columns, fill_value=False)
df = df.loc[:, mask].set_axis(['Sales Year','Acc Year'], axis=1)
Generally this is impossible as there is no solid/consistent factor by which we can map the columns.
That being said what one can do is use cosine similarity to calculate how similar one string (in this case the column name) is to other strings in another dataframe.
So in your case, we'll get 4 vectors for the first dataframe and 4 for the other one. Now calculate the cosine similarity between the first vector(PO ID) from the first dataframe and first vector from second dataframe (PO ID). This will return 100% as both the strings are same.
For each and every column, you'll get 4 confidence scores. Just pick the highest and map them.
That way you can get a makeshift logic through which you can map the column although there are loopholes in this logic too. But it is better than nothing as that way the number of columns to be mapped by the user will be less as compared to mapping them all manually.
Cheers!
Related
How do I perform an arithmetic operation across rows and columns for a data frame like the one shown below?
For example I want to calculate gross margin (gross profit/Revenue) - this is basically dividing one row by another row. I want to do this across all columns.
I think you need to restructure your dataframe a little bit to do this most effectively. If you transposed your dataframe such that Revenue, etc were columns and the years were the index, you could do:
df["gross_margin"] = df["Gross profit"] / df["Revenue"]
If you don't want to make so many changes, you should at least set the metric as the index.
df = df.set_index("Metric")
And then you could:
gross_margin = df.loc["Gross profit", :] / df.loc["Revenue", :]
here is one way to do it
df2=df.T
df2['3']=df2.iloc[1:,2]/df2.iloc[1:,0]
df2=df2.T
df2.iloc[3,0] = 'Gross Margin'
df2
Metric 2012 2013 2014 2015 2016
0 Revenue 116707394.0 133084076.0 143328982.0 151271526.0 181910977.0
1 Cost_of_Sales -66538762.0 -76298147.0 -82099051.0 -83925957.0 -106583385.0
2 Gross_profit 501686320.0 56785929.0 612299310.0 67345569.0 75327592.0
3 Gross Margin 4.298668 0.426692 4.271985 0.445197 0.41409
import pandas as pd
import numpy as np
# Show the specified columns and save it to a new file
col_list= ["STATION", "NAME", "DATE", "AWND", "SNOW"]
df = pd.read_csv('Data.csv', usecols=col_list)
df.to_csv('filteredData.csv')
df['year'] = pd.DatetimeIndex(df['DATE']).year
df2016 = df[(df.year==2016)]
df_2016 = df2016.groupby(['NAME', 'DATE'])['SNOW'].mean()
df_2016.to_csv('average2016.csv')
How come my dates are not ordered correctly here? Row 12 should be on the top but it's on the bottom of May instead and same goes for row 25
The average of SNOW per NAME/month is also not being displayed on my excel sheet. Why is that? Basically, I'm trying to calculate the average SNOW for May in ADA 0.7 SE, MI US. Then calculate the average SNOW for June in ADA 0.7 SE, MI US. etc..
I've spent all day and this is all I have got... Any help will be appreciated. Thanks in advance.
original data
https://gofile.io/?c=1gpbyT
Please try
Data
df=pd.read_csv(r'directorywhere the data is\data.csv')
df
Working
df.dtypes# Checking the datatype on each column
df.columns#listing columns
df['DATE']=pd.to_datetime(df['DATE'])#Converting date from object to a date format
df.set_index(df['DATE'], inplace=True)#Seeting the date as index
df['SNOW'].fillna(0)#filling all Not a Number values with zeros to make aggregation possible
df['SnowMean']=df.groupby([df.index.month, df.NAME])['SNOW'].transform('mean')#Groupby name, month and calculate the mean of snow. Store the result in anew column called df['SnowMean']
df
Checking
df.loc[:,['DATE','Month','SnowMean']]# Slice relevant columns to check
I realize you have multiple years. If you wanted mean per month in each year, again extract the year and add it in the groups to groupby as follows
df['SnowMeanPerYearPerMonth']=df.groupby([df.index.month,df.index.year,df.NAME])['SNOW'].transform('mean')
df
Check again
pd.set_option('display.max_rows',999)#diaplay upto 999 rows to check
df.loc[:,['DATE','Month','Year','SnowMean']]# Slice relevant columns to check
I'm trying to add two-columns and trying to display their total in a new column and following as well
The total sum of sales in the month of Jan
The minimum sales amount in the month of Feb
The average (mean) sales for the month of Mar
and trying to create a data frame called d2 that only contains rows of data in d that don't have any missing (NaN) values
I have implemented the following code
import pandas as pd
new_val= pd.read_csv("/Users/mayur/574_repos_2019/ml-python-
class/assignments/data/assg-01-data.csv")
new_val['total'] = 'total'
new_val.to_csv('output.csv', index=False)
display(new_val)
d.head(5)# it's not showing top file lines of the .csv data
# .CSV file sample data
#account name street city state postal-code Jan Feb Mar total
#0118 Kerl, 3St . Waily Texas 28752.0 10000 62000 35000 total
#0118 mkrt, 1Wst. con Texas 22751.0 12000 88200 15000 total
It's giving me a total as a word.
When you used new_val['total'] = 'total' you basically told Pandas that you want a Column in your DataFrame called total where every variable is the string total.
What you want to fix is the variable assignment. For this I can give you quick and dirty solution that will hopefully make a more appealing solution be clearer to you.
You can iterate through your DataFrame and add the two columns to get the variable for the third.
for i,j in new_val.iterrows():
new_val.iloc[i]['total'] = new_val.iloc[i]['Jan'] + new_val.iloc[i]['Feb'] + new_val.iloc[i]['Mar']
Note, that this requires column total to have already been defined. This also requires iterating through your entire data set, so if your data set is large this is not the best option.
As mentioned by #Cavenfish, that new_val['total'] = 'total' creates a column total where value of every cell is the string total.
You should rather use new_val['total'] = new_val['Jan']+new_val['Feb']+new_val['Mar']
For treatment of NA values you can use a mask new_val.isna() which will generate boolean for all cells whether they are NA or not in your array. You can then apply any logic on top of it. For your example, the below should work:
new_val.isna().sum(axis=1)==4
Considering that you now have 4 columns in your dataframe Jan,Feb,Mar,total; it will return False in case one of the row contains NA. You can then apply this mask to new_val['total'] to assign default value in case NA is encountered in one of the columns for a row.
I have a number of dataframes all which contain columns labeled 'Date' and 'Cost' along with additional columns. I'd like to add the numerical data in the 'Cost' columns across the different frames based on lining up the dates in the 'Date' columns to provide a timeseries of total costs for each of the dates.
There are different numbers of rows in each of the dataframes.
This seems like something that Pandas should be well suited to doing, but I can't find a clean solution.
Any help appreciated!
Here are two of the dataframes:
df1:
Date Total Cost Funded Costs
0 2015-09-30 724824 940451
1 2015-10-31 757605 940451
2 2015-11-15 788051 940451
3 2015-11-30 809368 940451
df2:
Date Total Cost Funded Costs
0 2015-11-30 3022 60000
1 2016-01-15 3051 60000
I want to have the resulting dataframe have five rows (there are five different dates) and a single column with the total of the 'Total Cost' column from each of the dataframes. Initially I used the following:
totalFunding = df1['Total Cost'].values + df2['Total Cost'].values
This worked fine until there were different dates in each of the dataframes.
Thanks!
The solution posted below works great, except that I need to do this recursively as I have a number of data frames. I created the following function:
def addDataFrames(f_arg, *argv):
dfTotal = f_arg
for arg in argv:
dfTotal = dfTotal.set_index('Date').add(arg.set_index('Date'), fill_value = 0)
return dfTotal
Which works fine when adding the first two dataframes. However, the addition method appears to convert my Date column into an index in the resulting sum and therefore subsequent passes through the function fail. Here is what dfTotal looks like after the first two data frames are added together:
Total Cost Funded Costs Remaining Cost Total Employee Hours
Date
2015-09-30 1449648 1880902 431254 7410.6
2015-10-31 1515210 1880902 365692 7874.4
2015-11-15 1576102 1880902 304800 8367.2
2015-11-30 1618736 1880902 262166 8578.0
2015-12-15 1671462 1880902 209440 8945.2
2015-12-31 1721840 1880902 159062 9161.2
2016-01-15 1764894 1880902 116008 9495.0
Note that what was originally a column in the dataframe called 'Date' is now listed as the index causing df.set_index('Date') to generate an error on subsequent passes through my function.
DataFrame.add does exactly what you're looking for; it matches the DataFrames based on index, so:
df1.set_index('Date').add(df2.set_index('Date'), fill_value=0)
should do the trick. If you just want the Total Cost column and you want it as a DataFrame:
df1.set_index('Date').add(df2.set_index('Date'), fill_value=0)[['Total Cost']]
See also the documentation for DataFrame.add at:
http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.add.html
Solution found. As mentioned, the add method converted the 'Date' column into the dataframe index. This was resolved using:
dfTotal['Date'] = dfTotal.index
The complete function is then:
def addDataFrames(f_arg, *argv):
dfTotal = f_arg
for arg in argv:
dfTotal = dfTotal.set_index('Date').add(arg.set_index('Date'), fill_value = 0)
dfTotal['Date'] = dfTotal.index
return dfTotal
I have a pandas dataframe that looks like
Name Date Value
Sarah 11-01-2015 3
Sarah 11-02-2015 2
Sarah 11-03-2015 27
Bill 11-01-2015 42
Bill 11-02-2015 5
Bill 11-03-2015 15
.... (a couple hundred rows)
How do I get a 30 day (or x day) rolling sum of these values broken out by whoever is in the 'Name' column? The ideal output would have the same columns as the current dataframe, but instead of having the values for each row be what that person had as a value for that day, it would be the cumulative sum of what their values over the past 30 days.
I know I can do
result = pd.rolling_sum(df, 30)
to get the rolling sum overall. But how do I return a dataframe with that rolling sum grouped by the 'Name' column?
Figured it out using the grigri group_resample function.
df = group_resample(df,date_column='Date',groupby=group_by,value_column='Value',how='sum',freq='d')
df = df.unstack(group_by).fillna(0)
result = pd.rolling_mean(df,30)
Note that if you don't need a precise temporal window, or if your dataset has 1 line per [day , user] (which seems to be your case), then the standard groupby of pandas is perfectly suited. See this very similar question
Otherwise, something like:
df.groupby('Name').rolling('30D', on="Date").Value.sum()
should work.