I have 3 tables/DataFrames, all with the same column names. Basically they hold data from different months.
October (df1 name)
Sales_value  Sales_units  Unique_Customer_id  Countries  Month
1000         10           4                   1          Oct
20           2            4                   3          Oct

November (df2 name)
Sales_value  Sales_units  Unique_Customer_id  Countries  Month
2000         1000         40                  14         Nov
112          200          30                  10         Nov

December (df3 name)
Sales_value  Sales_units  Unique_Customer_id  Countries  Month
20009090     4809509      4500                30         Dec
etc. This is dummy data; each table has thousands of rows in reality. How do I combine these 3 tables so that the columns appear only once and all rows are kept, with the October rows first, then the November rows, then the December rows? When I use joins, the column names get repeated.
Expected output:
Sales_value  Sales_units  Unique_Customer_id  Countries  Month
1000         10           4                   1          Oct
20           2            4                   3          Oct
2000         1000         40                  14         Nov
112          200          30                  10         Nov
20009090     4809509      4500                30         Dec
pd.concat stacks the rows of the given frames in order, aligning them on their shared column names:
pd.concat([df1, df2, df3])
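For a runnable illustration with the dummy rows above (df3 omitted for brevity; ignore_index=True is optional and just rebuilds a clean 0..n-1 index instead of repeating each frame's own):

import pandas as pd

# dummy frames shaped like the October and November tables above
df1 = pd.DataFrame({'Sales_value': [1000, 20], 'Sales_units': [10, 2],
                    'Unique_Customer_id': [4, 4], 'Countries': [1, 3], 'Month': ['Oct', 'Oct']})
df2 = pd.DataFrame({'Sales_value': [2000, 112], 'Sales_units': [1000, 200],
                    'Unique_Customer_id': [40, 30], 'Countries': [14, 10], 'Month': ['Nov', 'Nov']})

combined = pd.concat([df1, df2], ignore_index=True)  # rows stacked in list order
print(combined)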
Related
I have a dataframe as such
Date      Search Volume
Jan 2004  80,000
Feb 2004  90,000
Mar 2004  100,000
Apr 2004  40,000
May 2004  60,000
Jun 2004  50,000
I wish to have an output like this:
Date      Search Volume  Total    Quarter
Jan 2004  80,000         270,000  2004Q1
Feb 2004  90,000         270,000  2004Q1
Mar 2004  100,000        270,000  2004Q1
Apr 2004  40,000         150,000  2004Q2
May 2004  60,000         150,000  2004Q2
Jun 2004  50,000         150,000  2004Q2
...
...
Aug 2022  50,000         100,000  2022Q3
Sep 2022  10,000         100,000  2022Q3
Oct 2022  40,000         100,000  2022Q3
So what I'm trying to do is sum every 3 rows (one quarter), create a new column called Total, and apply that sum to every row that belongs to the quarter. The other column should be Quarter, representing the quarter that the month belongs to.
I have tried this:
N = 3
keyvolume = keyvol.groupby(keyvol.index // 3).sum()
but this just collapses the frame to one row per group; I'm not sure how to broadcast the sums back to every row of the quarter, and I don't know how to generate the Quarter column.
Appreciate your help.
First convert the Search Volume column to numeric with Series.str.replace and a cast to integers (or floats). Then convert the dates to quarters with to_datetime and Series.dt.to_period, and build the new column with GroupBy.transform and sum per quarter:
import pandas as pd

def func(df):
    # strip the thousands separators, then cast to int
    df['Search Volume'] = df['Search Volume'].str.replace(',', '', regex=True).astype(int)
    # quarterly period for each date, e.g. 2004Q1
    q = pd.to_datetime(df['Date']).dt.to_period('q')
    # broadcast each quarter's sum back to all of its rows
    df['Total'] = df['Search Volume'].groupby(q).transform('sum')
    df['Quarter'] = q
    return df

out = func(df)
print(out)
       Date  Search Volume   Total Quarter
0  Jan 2004          80000  270000  2004Q1
1  Feb 2004          90000  270000  2004Q1
2  Mar 2004         100000  270000  2004Q1
3  Apr 2004          40000  150000  2004Q2
4  May 2004          60000  150000  2004Q2
5  Jun 2004          50000  150000  2004Q2
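For reference, a minimal runnable setup using the sample rows from the question:

import pandas as pd

df = pd.DataFrame({
    'Date': ['Jan 2004', 'Feb 2004', 'Mar 2004', 'Apr 2004', 'May 2004', 'Jun 2004'],
    'Search Volume': ['80,000', '90,000', '100,000', '40,000', '60,000', '50,000'],
})
out = func(df)  # func as defined above
print(out)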
I have a sample dataframe/table as below and I would like to do a simple pivot table in Python to calculate the % difference from the previous year.
DataFrame
Year  Month  Count  Amount  Retailer
2019  5      10     100     ABC
2019  3      8      80      XYZ
2020  3      8      80      ABC
2020  5      7      70      XYZ
...
Expected Output
     MONTH  %Diff
ABC  7      -0.2
XYZ  8      -0.125
Thanks,
EDIT: To reiterate, I want to produce the expected output table above, not a join of the two tables.
It looks like you need a groupby, not a pivot:
gdf = df.groupby('Retailer')[['Amount']].pct_change()
Then rename and merge with the original df:
df = gdf.rename(columns={'Amount': '%Diff'}).dropna().merge(df, how='left', left_index=True, right_index=True)
   %Diff  Year  Month  Count  Amount Retailer
2 -0.200  2020      3      8      80      ABC
3 -0.125  2020      5      7      70      XYZ
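A self-contained sketch of the same idea, using the sample frame from the question (rows are sorted by year first so pct_change compares consecutive years within each retailer):

import pandas as pd

df = pd.DataFrame({'Year': [2019, 2019, 2020, 2020], 'Month': [5, 3, 3, 5],
                   'Count': [10, 8, 8, 7], 'Amount': [100, 80, 80, 70],
                   'Retailer': ['ABC', 'XYZ', 'ABC', 'XYZ']})

df = df.sort_values(['Retailer', 'Year'])
df['%Diff'] = df.groupby('Retailer')['Amount'].pct_change()  # change vs previous year
print(df.dropna(subset=['%Diff']))  # only the 2020 rows carry a %Diff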
I have a dataframe with columns named like this:
id salary year emp_type salary1 year1 emp_type1 salary2 year2 emp_type2 .. salary9 year9 emp_type9
1 xx xx xx .. ..
2 .. ..
3
I want to pivot the columns to rows like this:
id salary year emp_type
-------------------------------------------------------------------
value of salary value of year value of emp_type
value of salary1 value of year1 value of emp_type1
.. .. ..
.. .. ..
value of salary9 value of year9 value of emp_type9
If columns are guaranteed to be in this order, you can simply create a new dataframe from the reshaped old one:
# drop the id column so the remaining value columns reshape into triples
new_df = pd.DataFrame(old_df.drop(columns='id').values.reshape((-1, 3)),
                      columns=['salary', 'year', 'emp_type'])
The new dataframe will not keep the old index, though.
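If you'd rather match columns by name than rely on their order, pd.wide_to_long handles this layout; a sketch on a hypothetical two-row frame (the bare salary/year/emp_type columns are renamed with a 0 suffix first so every triple follows the same suffix pattern):

import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'salary': [1000, 1500], 'year': [2011, 2012], 'emp_type': ['Type1', 'Type2'],
                   'salary1': [2000, 2500], 'year1': [2013, 2014], 'emp_type1': ['Type3', 'Type4']})

# give the un-suffixed columns an explicit 0 so wide_to_long sees a uniform pattern
df = df.rename(columns={'salary': 'salary0', 'year': 'year0', 'emp_type': 'emp_type0'})

long_df = pd.wide_to_long(df, stubnames=['salary', 'year', 'emp_type'],
                          i='id', j='slot').reset_index()
print(long_df)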
The solution given by @Marat should work. Here I used nine (salary, year, emp_type) triples and it works:
import pandas as pd

df = pd.DataFrame(['1000', 2011, 'Type1', '2000', 2012, 'Type2', '3000', 2013, 'Type3',
                   '4000', 2014, 'Type4', '5000', 2015, 'Type5', '6000', 2016, 'Type6',
                   '8000', 2018, 'Type7', '8000', 2018, 'Type8', '9000', 2019, 'Type9'])
df = pd.DataFrame(df.values.reshape(-1, 3), columns=['salary', 'year', 'emp_type'])
print(df)
Output:
  salary  year emp_type
0   1000  2011    Type1
1   2000  2012    Type2
2   3000  2013    Type3
3   4000  2014    Type4
4   5000  2015    Type5
5   6000  2016    Type6
6   8000  2018    Type7
7   8000  2018    Type8
8   9000  2019    Type9
I'm downloading data from FRED. I'm summing to get annual numbers, but I don't want incomplete years, so I need to sum only where the count of observations is 12 (the series is monthly).
import pandas_datareader.data as web
mnemonic = 'RSFSXMV'
df = web.DataReader(mnemonic, 'fred', 2000, 2020)
df['year'] = df.index.year
new_df = df.groupby(["year"])[mnemonic].sum().reset_index()
print(new_df)
I don't want 2019 to show up.
In your case, use transform with nunique to make sure each year has 12 unique months; if not, drop those rows before doing the groupby sum:
df['Month'] = df.index.month
m = df.groupby('year').Month.transform('nunique') == 12
new_df = df.loc[m].groupby(["year"])[mnemonic].sum().reset_index()
Or with isin:
df['Month'] = df.index.month
m = df.groupby('year').Month.nunique()
new_df = df.loc[df.year.isin(m.index[m == 12])].groupby(["year"])[mnemonic].sum().reset_index()
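A runnable sketch of the mask on stand-in data (a hypothetical monthly series where 2018 is complete and 2019 stops in May):

import pandas as pd
import numpy as np

idx = pd.date_range('2018-01-01', '2019-05-01', freq='MS')  # month starts
df = pd.DataFrame({'RSFSXMV': np.arange(len(idx))}, index=idx)
df['year'] = df.index.year
df['Month'] = df.index.month

m = df.groupby('year').Month.transform('nunique') == 12
print(df.loc[m].groupby('year')['RSFSXMV'].sum().reset_index())  # 2019 is dropped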
You could use the count aggregate function while grouping:
df['year'] = df.index.year
df = df.groupby('year').agg({'RSFSXMV': 'sum', 'year': 'count'})
which will give you:
      RSFSXMV  year
year
2000  2487790    12
2001  2563218    12
2002  2641870    12
2003  2770397    12
2004  2969282    12
2005  3196141    12
2006  3397323    12
2007  3531906    12
2008  3601512    12
2009  3393753    12
2010  3541327    12
2011  3784014    12
2012  3934506    12
2013  4043037    12
2014  4191342    12
2015  4252113    12
2016  4357528    12
2017  4561833    12
2018  4810502    12
2019  2042147     5
Then simply drop the rows whose year count is less than 12.
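A sketch of that last step, continuing from the agg above (note the monthly count landed in a column that is also named year):

complete = df[df['year'] == 12].drop(columns='year')  # keep complete years, drop the count
print(complete)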
You can use this to create the dataframe:
xyz = pd.DataFrame({'release' : ['7 June 2013', '2012', '31 January 2013',
'February 2008', '17 June 2014', '2013']})
I am trying to split the data and save it into 3 columns named day, month and year, using this command:
dataframe[['day','month','year']] = dataframe['release'].str.rsplit(expand=True)
The resulting dataframe is:
[screenshot of the misaligned dataframe in the original post]
As you can see, it works perfectly when it gets 3 strings, but whenever it gets fewer than 3 strings, it saves the data in the wrong place.
I have tried split and rsplit, both are giving the same result.
Any solution to get the data at the right place?
The last token is the year and it is present in every case, so it should be saved first; then the month if it is present, otherwise nothing; and likewise for the day.
You could split each value and reverse the tokens so the year always comes first:
In [17]: dataframe[['year', 'month', 'day']] = dataframe['release'].apply(
lambda x: pd.Series(x.split()[::-1]))
In [18]: dataframe
Out[18]:
           release  year     month  day
0      7 June 2013  2013      June    7
1             2012  2012       NaN  NaN
2  31 January 2013  2013   January   31
3    February 2008  2008  February  NaN
4     17 June 2014  2014      June   17
5             2013  2013       NaN  NaN
Try reversing the split tokens. DataFrames have no .reverse() method, but Series.str slicing works element-wise on the split lists, so [::-1] puts the year first for every row regardless of how many tokens it has:
dataframe[['year','month','day']] = dataframe['release'].str.split().str[::-1].apply(pd.Series)