How do I perform an arithmetic operation across rows and columns for a data frame like the one shown below?
For example, I want to calculate gross margin (gross profit / revenue), which is basically dividing one row by another row. I want to do this across all columns.
I think you need to restructure your dataframe a little bit to do this most effectively. If you transposed your dataframe such that Revenue, etc. were columns and the years were the index, you could do:
df["gross_margin"] = df["Gross profit"] / df["Revenue"]
If you don't want to make so many changes, you should at least set the metric as the index.
df = df.set_index("Metric")
And then you could:
gross_margin = df.loc["Gross profit", :] / df.loc["Revenue", :]
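For example, a minimal, self-contained sketch of the set_index approach (the metric labels and the sample numbers are assumptions, loosely based on the data shown further down; make sure the labels match the spelling used in your actual frame):

import pandas as pd

# Toy frame shaped like the one in the question: one row per metric,
# one column per year.
df = pd.DataFrame({
    "Metric": ["Revenue", "Cost_of_Sales", "Gross_profit"],
    "2012": [116707394.0, -66538762.0, 50168632.0],
    "2013": [133084076.0, -76298147.0, 56785929.0],
})

df = df.set_index("Metric")
gross_margin = df.loc["Gross_profit", :] / df.loc["Revenue", :]
print(gross_margin)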
Here is one way to do it:
df2 = df.T
# after the transpose, row 'Metric' holds the metric names and the year rows follow;
# column 0 is Revenue and column 2 is Gross profit
df2['3'] = df2.iloc[1:, 2] / df2.iloc[1:, 0]
df2 = df2.T
# label the new row in the Metric column
df2.iloc[3, 0] = 'Gross Margin'
df2
Metric 2012 2013 2014 2015 2016
0 Revenue 116707394.0 133084076.0 143328982.0 151271526.0 181910977.0
1 Cost_of_Sales -66538762.0 -76298147.0 -82099051.0 -83925957.0 -106583385.0
2 Gross_profit 501686320.0 56785929.0 612299310.0 67345569.0 75327592.0
3 Gross Margin 4.298668 0.426692 4.271985 0.445197 0.41409
I am trying to merge multiple dataframes to a master dataframe based on the columns in the master dataframes. For Example:
MASTER DF:
PO ID   Sales year   Name   Acc year
10      1934         xyz    1834
11      1942         abc    1842
SLAVE DF:
PO ID   Yr     Amount   Year
12      1935   365.2    1839
13      1966   253.9    1855
RESULTANT DF:
PO ID   Sales Year   Acc Year
10      1934         1834
11      1942         1842
12      1935         1839
13      1966         1855
Notice how I have manually mapped columns (Sales Year-->Yr and Acc Year-->Year) since I know they are the same quantity, only the column names are different.
I am trying to write some logic which can map them automatically based on some criteria (be it column names or the data type of that column) so that user does not need to map them manually.
If I map them by column name, the corresponding columns have different names ((Sales Year, Yr) and (Acc Year, Year)). So to which column in the MASTER DF should the fourth column (Year) in the SLAVE DF be mapped?
Another way would be to map them based on their column values, but both year columns hold similar values, so that cannot disambiguate them either.
The logic should be able to map Yr to Sales Year and map Year to Acc Year automatically.
Any idea/logic would be helpful.
Thanks in advance!
I think the safest approach is to manually rename the column names, i.e. rename the SLAVE DF's columns so they match the MASTER DF:
df = df.rename(columns={'Yr': 'Sales year', 'Year': 'Acc year'})
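Once the names line up, the merge itself is a plain concatenation. A minimal sketch, assuming the two frames are called master_df and slave_df (those names and the column subset are assumptions):

import pandas as pd

# Rename the slave's year columns to the master's names, then stack only the
# columns that should appear in the result.
slave_renamed = slave_df.rename(columns={'Yr': 'Sales year', 'Year': 'Acc year'})
cols = ['PO ID', 'Sales year', 'Acc year']
result = pd.concat([master_df[cols], slave_renamed[cols]], ignore_index=True)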
One idea is to filter the column names for integer dtypes and, if all values fall between thresholds (here between 1800 and 2000), finally set the column names:
df = df.set_index('PO ID')
# keep only the integer columns, then flag those whose values all look like years
df1 = df.select_dtypes('integer')
mask = (df1.gt(1800) & df1.lt(2000)).all().reindex(df.columns, fill_value=False)
df = df.loc[:, mask].set_axis(['Sales Year', 'Acc Year'], axis=1)
Generally this is impossible as there is no solid/consistent factor by which we can map the columns.
That being said what one can do is use cosine similarity to calculate how similar one string (in this case the column name) is to other strings in another dataframe.
So in your case, we'll get 4 vectors for the first dataframe and 4 for the other one. Now calculate the cosine similarity between the first vector (PO ID) from the first dataframe and the first vector from the second dataframe (PO ID). This will return 100% as both strings are the same.
For each and every column, you'll get 4 confidence scores. Just pick the highest and map them; a sketch follows below.
That way you can get a makeshift logic through which you can map the columns, although there are loopholes in this logic too. But it is better than nothing, as the number of columns the user has to map manually will be much smaller than mapping them all by hand.
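A minimal sketch of that idea, assuming scikit-learn is available and using character n-grams to vectorise the column names (the hard-coded column lists and the vectoriser settings are assumptions; ambiguous names such as Yr vs Year may still need a manual check):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

master_cols = ['PO ID', 'Sales year', 'Name', 'Acc year']   # from the MASTER DF
slave_cols = ['PO ID', 'Yr', 'Amount', 'Year']              # from the SLAVE DF

# Vectorise the column names with character n-grams so partial matches score > 0.
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
vectors = vec.fit_transform(master_cols + slave_cols)
master_vecs = vectors[:len(master_cols)]
slave_vecs = vectors[len(master_cols):]

# For every SLAVE column, pick the MASTER column with the highest similarity.
scores = cosine_similarity(slave_vecs, master_vecs)
mapping = {s: master_cols[row.argmax()] for s, row in zip(slave_cols, scores)}
print(mapping)  # identical names map with score 1.0; the rest are best guesses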
Cheers!
I'm trying to add two columns and display their total in a new column, and I am also trying to compute the following:
The total sum of sales in the month of Jan
The minimum sales amount in the month of Feb
The average (mean) sales for the month of Mar
and I am trying to create a data frame called d2 that only contains the rows of d that don't have any missing (NaN) values.
I have implemented the following code
import pandas as pd
new_val = pd.read_csv("/Users/mayur/574_repos_2019/ml-python-class/assignments/data/assg-01-data.csv")
new_val['total'] = 'total'
new_val.to_csv('output.csv', index=False)
display(new_val)
d.head(5)# it's not showing top file lines of the .csv data
# .CSV file sample data
#account name street city state postal-code Jan Feb Mar total
#0118 Kerl, 3St . Waily Texas 28752.0 10000 62000 35000 total
#0118 mkrt, 1Wst. con Texas 22751.0 12000 88200 15000 total
It's giving me a total as a word.
When you used new_val['total'] = 'total' you basically told Pandas that you want a column in your DataFrame called total where every value is the string 'total'.
What you want to fix is the variable assignment. For this I can give you quick and dirty solution that will hopefully make a more appealing solution be clearer to you.
You can iterate through your DataFrame and add the two columns to get the variable for the third.
for i, j in new_val.iterrows():
    # j is the row; write back into the frame with .loc and the row label
    new_val.loc[i, 'total'] = j['Jan'] + j['Feb'] + j['Mar']
Note that this iterates through your entire data set, so if your data set is large this is not the best option.
As mentioned by @Cavenfish, new_val['total'] = 'total' creates a column total where the value of every cell is the string 'total'.
You should rather use new_val['total'] = new_val['Jan']+new_val['Feb']+new_val['Mar']
For treatment of NA values you can use a mask, new_val.isna(), which generates a boolean for every cell indicating whether it is NA or not. You can then apply any logic on top of it. For your example, the below should work:
new_val.isna().sum(axis=1) == 0
Considering that you now have 4 columns in your dataframe (Jan, Feb, Mar, total), this is True only for rows without any NA and False as soon as one of the columns in a row is NA. You can then use this mask to keep the clean rows, or its inverse to assign a default value to new_val['total'] when NA is encountered in one of the columns for a row.
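Putting the pieces of the question together, a minimal sketch of the remaining tasks (it reuses the new_val frame from your code; the month column names Jan/Feb/Mar come from the sample data, everything else is an assumption):

# row-wise total of the three months
new_val['total'] = new_val['Jan'] + new_val['Feb'] + new_val['Mar']

jan_total = new_val['Jan'].sum()    # total sum of sales in January
feb_min = new_val['Feb'].min()      # minimum sales amount in February
mar_mean = new_val['Mar'].mean()    # average (mean) sales in March

# d2: only the rows without any missing (NaN) values
d2 = new_val.dropna()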
I have a weekly time series of multiple variables and I am trying to see at which percent rank the most recent 26-week correlation sits versus all previous 26-week correlations.
So I can generate a correlation matrix for the first 26-week period using the DataFrame.corr method in pandas, but I don't know how I can loop through all previous periods to find the different values for these correlations and then rank them.
I hope there is a better way to achieve this; if so, please let me know.
I have tried calculating parallel dataframes, but I couldn't write a formula to rank the most recent one, so I believe that the solution lies with multi-indexing.
import numpy as np
import pandas as pd

daterange = pd.date_range('20160701', periods=100, freq='1w')
np.random.seed(120)
df_corr = pd.DataFrame(np.random.rand(100, 5), index=daterange, columns=list('abcde'))
df_corr_chg = df_corr.diff()
df_corr_chg = df_corr_chg[1:]
df_corr_chg = df_corr_chg.replace(0, 0.01)
d = df_corr_chg.shape[0]
df_CCC = df_corr_chg[::-1]
for s in range(0, d - 26):
    i = df_CCC.iloc[s:26 + s]
I am looking for a multi-indexed table showing the correlations at different times
Example of output
e.g.
          a          b
a  1      1         -0.101713
   2      1         -0.031109
   n      1          0.471764
b  1     -0.101713   1
   2     -0.031109   1
   n      0.471764   1
Here is a recipe for how you could approach the problem.
I assume you have one price per week (otherwise just pre-aggregate your dataframe).
# in case your weeks are not numbered:
# sort your dataframe by symbol (EUR, SPX, ...) and date descending
df.sort_values(['symbol', 'date'], ascending=False, inplace=True)
# now add a pseudo counter per symbol and keep the 26 most recent rows of each
indexer = df.groupby('symbol').cumcount() < 26
# pivot to one column per symbol before calling .corr() so you get a matrix
df.loc[indexer].pivot(index='date', columns='symbol', values='pricecolumn').corr()
One more hint, in case you need to pre-aggregate your dataframe: you could add another auxiliary column with the week number to your frame, like:
df['week_number'] = df['datefield'].dt.week
Then I guess you would like to have the last price of each week. You could do that as follows:
df_last= df.sort_values(['symbol', 'week_number', 'date'], ascending=True, inplace=False).groupby(['symbol', 'week_number']).aggregate('last')
df_last.reset_index(inplace=True)
Then use df_last in place of the df above. Please check/change the field names I assumed.
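If you would rather stay with the wide frame from your own snippet, a minimal sketch using DataFrame.rolling could look like this (it reuses df_corr_chg from the question's code; the a/b column pair is just one example, and the percent-rank step is an assumption about what the ranking should mean):

# Trailing 26-week correlation matrix for every week, as a frame with a
# MultiIndex (level 0 = week, level 1 = column name).
rolling_corr = df_corr_chg.rolling(26).corr()

# Correlation of column 'a' with column 'b' over time, one value per window.
ab = rolling_corr.xs('a', level=1)['b'].dropna()

# Percent rank of the most recent 26-week correlation vs. all previous ones.
latest_pct_rank = ab.rank(pct=True).iloc[-1]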
I have a pandas dataframe that looks like
Name Date Value
Sarah 11-01-2015 3
Sarah 11-02-2015 2
Sarah 11-03-2015 27
Bill 11-01-2015 42
Bill 11-02-2015 5
Bill 11-03-2015 15
.... (a couple hundred rows)
How do I get a 30 day (or x day) rolling sum of these values broken out by whoever is in the 'Name' column? The ideal output would have the same columns as the current dataframe, but instead of having the values for each row be what that person had as a value for that day, it would be the cumulative sum of what their values over the past 30 days.
I know I can do
result = pd.rolling_sum(df, 30)
to get the rolling sum overall. But how do I return a dataframe with that rolling sum grouped by the 'Name' column?
Figured it out using the grigri group_resample function.
df = group_resample(df,date_column='Date',groupby=group_by,value_column='Value',how='sum',freq='d')
df = df.unstack(group_by).fillna(0)
result = pd.rolling_mean(df,30)
Note that if you don't need a precise temporal window, or if your dataset has one line per [day, user] (which seems to be your case), then the standard groupby of pandas is perfectly suited. See this very similar question.
Otherwise, something like:
df.groupby('Name').rolling('30D', on="Date").Value.sum()
should work.
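For completeness, a minimal, self-contained sketch of that answer on the sample data from the question (the date format and the sort step are assumptions):

import pandas as pd

df = pd.DataFrame({
    'Name': ['Sarah', 'Sarah', 'Sarah', 'Bill', 'Bill', 'Bill'],
    'Date': ['11-01-2015', '11-02-2015', '11-03-2015'] * 2,
    'Value': [3, 2, 27, 42, 5, 15],
})
# a time-based rolling window needs a real datetime column, sorted within each group
df['Date'] = pd.to_datetime(df['Date'], format='%m-%d-%Y')

rolling = (df.sort_values('Date')
             .groupby('Name')
             .rolling('30D', on='Date')['Value']
             .sum())
print(rolling)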
I have a Pandas DataFrame that includes rows that I want to drop based on values in a column "population":
data['population'].value_counts()
general population 21
developmental delay 20
sibling 2
general population + developmental delay 1
dtype: int64
Here, I want to drop the two rows that have sibling as the value. So, I believe the following should do the trick:
data = data.drop(data.population=='sibling', axis=0)
It does drop 2 rows, as you can see in the resulting value counts, but they were not the rows with the specified value.
data.population.value_counts()
developmental delay 20
general population 19
sibling 2
general population + developmental delay 1
dtype: int64
Any idea what is going on here?
DataFrame.drop accepts an index (a list of labels) as a parameter, not a boolean mask. Passing the mask makes pandas treat its values, True and False, as the labels 1 and 0 on your integer index, which is most likely why exactly two unrelated rows were dropped.
To use drop you should do:
data = data.drop(data.index[data.population == 'sibling'])
however it is much simpler to do
data = data[data.population != 'sibling']