I am trying to write a function in Python using pandas. The function signature is:
example_df(df, year=2010)
Note: year can be 2008, 2009, or 2010; the default is 2010.
If the specified year is 2010, the function should take df and drop all columns except salary, tips, and employees.
If the specified year is 2009, the function should likewise drop all columns except salary, tips, and employees,
and rename any columns whose 2009 names differ to the 2010 names.
Thanks
I'm trying to figure out how to drop pandas columns based on conditions
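A minimal sketch of what such a function could look like; the 2009-to-2010 column name mapping below is hypothetical, since the question does not give the actual 2009 names:

def example_df(df, year=2010):
    # Columns to keep, using the 2010 names
    keep = ['salary', 'tips', 'employees']
    # Hypothetical mapping from 2009 names to their 2010 equivalents
    rename_2009 = {'wages': 'salary', 'gratuities': 'tips', 'staff': 'employees'}
    if year == 2009:
        df = df.rename(columns=rename_2009)
    # Behaviour for 2008 is not specified in the question, so it is not handled here
    return df[keep]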
Related
I have a dataframe that I want to group by year, and then by month within each year. Because the data is quite large (recorded from three decades ago until now), I would like the output presented as shown below for subsequent calculations, but without any aggregate function such as .mean() applied.
However, I am unable to do so, because a groupby without an aggregation does not return a dataframe; it just shows this: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022BF79A52E0>
On the other hand, I am hesitant to import the data as a Series, because I do not know how to set the parameters to get exactly the format below. Another reason is that I used the following lines to import the .csv into a dataframe:
df = pd.read_csv(r'file directory', index_col='date')
df.index = pd.to_datetime(df.index)
For some weird reason, if I let the date strings be parsed during the import and then try to sort by year and month, the parsing gets confused when the records start with 01(day)/01(month)/1990 followed by 01(day)/02(month)/1990: it interprets the first number as the day and the second as the month and sorts January chronologically, but when it reaches February, where the day should be 01, it treats 01 as the month and 02 as the day, and moves that February record into the January group.
Are there any ways to achieve the format I want?
The methods shown in this post do not seem to help me get it: Pandas - Groupby dataframe store as dataframe without aggregating
IIUC:
You can use the dayfirst parameter of to_datetime() and set it to True, then create 'Year' and 'Month' columns, make them the index, and sort the index:
df = pd.read_csv(r'file directory')
df['date'] = pd.to_datetime(df['date'], dayfirst=True)  # dd/mm/yyyy dates
df['Year'] = df['date'].dt.year
df['Month'] = df['date'].dt.month
df = df.set_index(['Year', 'Month']).sort_index()
Or, in three steps via assign():
df = pd.read_csv(r'file directory')
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = (df.assign(Year=df['date'].dt.year, Month=df['date'].dt.month)
        .set_index(['Year', 'Month'])
        .sort_index())
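Either way, once 'Year' and 'Month' form a sorted MultiIndex, individual periods can be pulled out directly for the subsequent calculations; for example (assuming the data contains 1990):

jan_1990 = df.loc[(1990, 1)]  # all rows from January 1990
year_1990 = df.loc[1990]      # all rows from 1990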
You can iterate through the groups of the groupby result.
import pandas as pd
import numpy as np

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
                   'B': rand.randn(6),
                   'C': rand.randint(0, 20, 6)})

groupby_obj = df.groupby(['A'])

for k, gdf in groupby_obj:
    print('Groupby Key:', k)
    print('Dataframe:\n', gdf, '\n')
You can apply any DataFrame method to gdf; for instance, a per-group computation inside the loop could look like the sketch below.
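A small illustration using the frame built above:

for k, gdf in groupby_obj:
    # any DataFrame method works on each group's sub-frame
    print(k, 'mean of B:', gdf['B'].mean(), 'sum of C:', gdf['C'].sum())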
Is there a way to do something equivalent to an Excel pivot table in Python? What I'm looking for is to take data that says, effectively, the following (and please excuse my lack of formatting; I have been awake and working on this for something like 14 hours now):
And make it look like this:
I'm obviously looking at multiple rows of sales for each category and year, and I definitely need them totalled. I'm sure there's a way to do this without iterating through every line and keeping running totals of each possible combination, but for the life of me, I can't find it.
Use pandas' pivot_table.
See the following code:
import pandas as pd

df = pd.DataFrame({"Category": ["Furniture", "Furniture", "Clothing", "Shoes", "Furniture", "Clothing", "Shoes"],
                   "Year": [2009, 2009, 2009, 2009, 2010, 2010, 2010],
                   "Sales": [50000, 5000, 10000, 20000, 70000, 30000, 10000]})

# aggfunc='sum' totals the sales; the default aggfunc would average them
pd.pivot_table(df, index='Category', columns='Year', values='Sales', aggfunc='sum')
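If you also want grand totals, pivot_table can append them via margins, and fill_value fills in Category/Year combinations that have no sales:

pd.pivot_table(df, index='Category', columns='Year', values='Sales',
               aggfunc='sum', margins=True, fill_value=0)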
I have a dataframe that looks like this (df1):
I want to recreate the following dataframe (df2) to look like df1:
The number of years in df2 goes up to 2020.
So, essentially for each row in df2, a new row for each year should be created. Then, new columns should be created for each month. Finally, the value for % in each row should be copied to the column corresponding to the month in the "Month" column.
Any ideas?
Many thanks.
This is a pivot:
(df2.assign(Year=df2.Month.str[:4],
            Month=df2.Month.str[5:])
    .pivot(index='Year', columns='Month', values='%')
)
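To make this concrete, here is a self-contained version with hypothetical data (the original df1/df2 tables were posted as images), assuming the 'Month' column holds strings like '2019-01':

import pandas as pd

df2 = pd.DataFrame({'Month': ['2019-01', '2019-02', '2020-01', '2020-02'],
                    '%': [10.0, 12.5, 11.0, 13.0]})

result = (df2.assign(Year=df2.Month.str[:4],
                     Month=df2.Month.str[5:])
             .pivot(index='Year', columns='Month', values='%'))
print(result)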
More details about pivoting a dataframe here.
Hi, I have a question regarding resampling in pandas.
In my data I have a date range from 31/12/2018 to 25/3/2019 with an interval of 7 days (e.g. 31/12/2018, 7/1/2019, 14/1/2019, etc.). I want to resample the sales corresponding to those dates to a new range of dates, say 30/4/2020 to 24/9/2020, with the same 7-day interval. Is there a way to do it using pandas' resample function? As shown in the picture, I want to resample the sales from the dataframe on the left and populate the dataframe on the right.
Just to be clear: the left dataframe consists of 13 rows and the right consists of 22 rows.
Let's try this:
dates = pd.date_range(start='30/4/2020', end='24/9/2020', freq='7D')  # 22 weekly dates
The new dataframe can be created from the old values; the index argument is necessary because the two frames have different lengths. If you wish, you can apply df2.fillna(0) too.
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"date": pd.date_range("2020-04-30", freq="7D", periods=22),
                    "sales": df1.sales},
                   index=np.arange(22))
Or, without using index:
df2 = pd.DataFrame({"date": pd.date_range("2020-04-30", freq="7D", periods=22),
                    "sales": np.concatenate([df1.sales.values, np.zeros(9)])})
I have a dataframe created from a .csv document. Since one of the columns has dates, I have used pandas read_csv with parse_dates:
df = pd.read_csv('CSVdata.csv', encoding="ISO-8859-1", parse_dates=['Dates_column'])
The dates range from 2012 to 2016. I want to create a sub-dataframe containing only the rows from 2014.
The only way I have managed to do this, is with two subsequent Boolean filters:
df_a = df[df.Dates_column > pd.Timestamp('2014')]    # to create a dataframe from 01/Jan/2014 onwards
df = df_a[df_a.Dates_column < pd.Timestamp('2015')]  # to remove all the values from 01/Jan/2015 onwards
Is there a way of doing this in one step, more efficiently?
Many thanks!
You can use the dt accessor:
df = df[df.Dates_column.dt.year == 2014]
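If you prefer to keep the explicit date range, between() also does it in one step; note that both bounds are inclusive, which is fine for date-only data (a timestamp with a time of day on 31/Dec/2014 would need a later end bound):

df = df[df.Dates_column.between('2014-01-01', '2014-12-31')]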