Is there a way to do something equivalent to an Excel pivot table in Python? What I'm looking for is to take data that looks, effectively, like the following (and please excuse my lack of formatting; I have been awake and working on this for something like 14 hours now):
And make it look like this:
I'm obviously looking at multiple rows of sales for each category and year. I definitely need them totalled. I'm sure there's a way to do this without iterating through each and every line and keeping running totals of each possible combination, but for the life of me, I can't find it.
Use pandas' pivot_table. See the following code:
import pandas as pd

df = pd.DataFrame({"Category": ["Furniture", "Furniture", "Clothing", "Shoes", "Furniture", "Clothing", "Shoes"],
                   "Year": [2009, 2009, 2009, 2009, 2010, 2010, 2010],
                   "Sales": [50000, 5000, 10000, 20000, 70000, 30000, 10000]})

# aggfunc='sum' totals the sales for each Category/Year combination
# (the default aggfunc is 'mean', which would average them instead).
pd.pivot_table(df, index='Category', columns='Year', values='Sales', aggfunc='sum')
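For the sample data above, the summed pivot should look roughly like this (values computed from df; exact spacing may differ):
Year        2009   2010
Category
Clothing   10000  30000
Furniture  55000  70000
Shoes      20000  10000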
I am trying to write a function in Python using pandas. The function is named:
example_df(df, year = 2008):
Note: Year can be 2008, 2009, or 2010. The default value is 2010.
If the specified year is 2010, the function should take df and drop all columns except: (salary, tips, employees)
If the specified year is 2009, the function should take df and drop all columns except: (salary, tips, employees)
and rename the corresponding columns that differ from 2010 to the 2010 names.
Thanks
I'm trying to figure out how to drop pandas columns based on conditions
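A minimal sketch of one way such a function could look; the 2009 column names and the rename mapping below are hypothetical, since the post does not list them:
import pandas as pd

def example_df(df, year=2010):
    # Columns to keep, using the 2010 names.
    keep = ['salary', 'tips', 'employees']
    if year == 2010:
        return df[keep]
    if year == 2009:
        # Hypothetical 2009-to-2010 rename mapping -- replace with the real
        # column names from your data.
        rename_2009 = {'salary_2009': 'salary', 'tip_amount': 'tips', 'staff': 'employees'}
        return df.rename(columns=rename_2009)[keep]
    # 2008 handling would go here once its column names are known.
    raise ValueError(f"Unsupported year: {year}")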
I have a dataframe that I want to group by year and then by month within each year. Because the data are quite large (recorded from three decades ago until now), I would like the output presented as shown below for subsequent calculations, but without applying any aggregate function such as .mean().
However, I am unable to do so because groupby seems to require an aggregation; otherwise all I get is the object representation: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022BF79A52E0>
On the other hand, I am hesitant to import the data as a Series because I do not know how to set the parameters to get exactly the format shown below. Another reason is that I used the following lines to import the .csv into a dataframe:
df=pd.read_csv(r'file directory', index_col = 'date')
df.index = pd.to_datetime(df.index)
For some reason, if I define the date string format in pd.read_csv and then try to sort by year and month, the parsing gets confused when the records start with dates like 01/01/1990 and 01/02/1990. For example, it interprets 01/02/1990 as January 2nd (month first) rather than 1st February (day first), so that February record gets sorted into the January group.
Are there any ways to achieve the same format?
The methods shown in the post below do not seem to help me get the format I want: Pandas - Groupby dataframe store as dataframe without aggregating
IIUC:
You can use the dayfirst parameter of to_datetime() and set it to True, then create 'Year' and 'Month' columns, make them the index, and sort the index:
df = pd.read_csv(r'file directory')
df['date'] = pd.to_datetime(df['date'], dayfirst=True)  # parse dates as day/month/year
df['Year'] = df['date'].dt.year
df['Month'] = df['date'].dt.month
df = df.set_index(['Year', 'Month']).sort_index()
Or, in three steps, via assign():
df = pd.read_csv(r'file directory')
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = (df.assign(Year=df['date'].dt.year, Month=df['date'].dt.month)
        .set_index(['Year', 'Month'])
        .sort_index())
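To illustrate what dayfirst changes, here is a small self-contained sketch using the ambiguous dates mentioned in the question:
import pandas as pd

# Ambiguous day/month strings like those described above.
dates = pd.Series(['01/01/1990', '01/02/1990'])

print(pd.to_datetime(dates))                 # default: 1990-01-01, 1990-01-02 (month first)
print(pd.to_datetime(dates, dayfirst=True))  # 1990-01-01, 1990-02-01 (day first)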
You can iterate through the groups of the groupby result:
import pandas as pd
import numpy as np

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
                   'B': rand.randn(6),
                   'C': rand.randint(0, 20, 6)})

groupby_obj = df.groupby(['A'])

for k, gdf in groupby_obj:
    print('Groupby Key:', k)
    print('Dataframe:\n', gdf, '\n')
You can apply any DataFrame method to gdf.
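For example, as a small illustration beyond the original answer, you could collect a per-group result instead of printing each group:
# Sum column 'C' within each group; the groupby object can be iterated again.
sums_per_group = {k: gdf['C'].sum() for k, gdf in groupby_obj}
print(sums_per_group)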
I'm brand new to using pandas, and I've tried searching for a solution to this seemingly simple problem. I'm trying to conditionally add a column to some of the rows of one dataframe from another dataframe. Here's my data:
import pandas as pd
df_1 = pd.DataFrame(
    {
        'Acme ID': ["A-123", "A-345", "A-678"],
        'Active': ['Y', 'N', 'Y'],
        'Other Col': ["some", "other", "data"]})

df_2 = pd.DataFrame(
    {
        'Acme ID': ["A-123", "A-678"],
        'Active Date': ['2020-05-15', '2020-07-20']})
I'm trying to add the Active Date from df_2 to all rows in df_1 where the Active flag is 'Y'. The items in df_2 can join to the items in df_1 using the Acme ID column. Here's what I would expect the resulting dataframe to look like:
df_final = pd.DataFrame(
    {
        'Acme ID': ["A-123", "A-345", "A-678"],
        'Active': ['Y', 'N', 'Y'],
        'Other Col': ["some", "other", "data"],
        'Active Date': ['2020-05-15', pd.NaT, '2020-07-20']})
I've tried a number of different approaches like just iterating through df_1 (but I keep getting SettingWithCopyWarning) and I figure there's a better way. I've also tried using some of the other operations like assign, but they don't seem to like that the dataframes are different lengths. Any help would be greatly appreciated.
Thanks to @It_is_Chris, df_1.merge(df_2, on='Acme ID', how='left') was what I was looking for. I thought I had tried something similar to that, but I guess not.
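For reference, applying that one-liner to the sample frames defined above would look like this (a brief sketch; unmatched rows end up with NaN in 'Active Date'):
# Left merge keeps every row of df_1; 'A-345' has no match in df_2, so its
# 'Active Date' comes out as NaN.
df_final = df_1.merge(df_2, on='Acme ID', how='left')
print(df_final)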
Hi, I have a question regarding resampling in pandas.
In my data I have a date range from 31/12/2018 to 25/3/2019 with an interval of 7 days (e.g. 31/12/2018, 7/1/2019, 14/1/2019, etc.). I want to resample the sales corresponding to those dates to a new range of dates, say 30/4/2020 to 24/9/2020, with the same 7-day interval. Is there a way to do it using the pandas resample function? As shown in the picture, I want to resample the sales from the dataframe on the left and populate the dataframe on the right.
Just to be clear: the left dataframe consists of 13 rows and the right consists of 22 rows.
Let's try this to build the new 7-day date range:
dates = pd.date_range(start='30/4/2020', end='24/9/2020', freq='7D')  # 22 weekly dates
The new dataframe can be created from the old values; the explicit index is needed because the two ranges have different lengths (this assumes pandas and numpy are imported as pd and np). If you wish, you can also apply df2.fillna(0).
df2= pd.DataFrame( {"date": pd.date_range("2020-04-30",freq="7D",periods=22), "sales":df1.sales},index=np.arange(22) )
Or without using 'index':
df2= pd.DataFrame( {"date": pd.date_range("2020-04-30",freq="7D",periods=22), "sales": np.concatenate([df1.sales.values,np.zeros(9)])})
I have been able to use pandas groupby to create a new DataFrame but I'm getting an error when I create a barplot.
The groupby command:
invYr = invoices.groupby(['FinYear']).sum()[['Amount']]
Which creates a new DataFrame that looks correct to me.
New DataFrame invYr
Running:
sns.barplot(x='FinYear', y='Amount', data=invYr)
I get the error:
ValueError: Could not interpret input 'FinYear'
It appears that the issue is related to FinYear being the index, but unfortunately I have not been able to solve it even when using reindex.
import pandas as pd
import seaborn as sns
invoices = pd.DataFrame({'FinYear': [2015, 2015, 2014], 'Amount': [10, 10, 15]})
invYr = invoices.groupby(['FinYear']).sum()[['Amount']]
>>> invYr
         Amount
FinYear
2014         15
2015         20
The reason you are getting the error is that when you created invYr by grouping invoices, the FinYear column became the index and is no longer a column. There are a few solutions:
1) One solution is to pass the data directly. If you do not supply a data parameter, seaborn does not know which dataframe/series contains the columns 'FinYear' or 'Amount', since those are just strings. Passing, for example, y=invYr.Amount specifies both the dataframe/series and the column you'd like to graph; the trick here is accessing the index of the dataframe directly for the x values.
sns.barplot(x=invYr.index, y=invYr.Amount)
2) Alternatively, you can specify the data source and then refer to its columns by name. Note that the grouped dataframe has its index reset here so that FinYear becomes a column again.
sns.barplot(x='FinYear', y='Amount', data=invYr.reset_index())
3) A third solution is to specify as_index=False when you perform the groupby, making the column available in the grouped dataframe.
invYr = invoices.groupby('FinYear', as_index=False).Amount.sum()
sns.barplot(x='FinYear', y='Amount', data=invYr)
All of the solutions above produce the same plot.