This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I need your help with a pandas dataframe merge.
The first dataframe is:
D = { Year, Age, Location, column1, column2... }
2013, 20 , america, ..., ...
2013, 35, usa, ..., ...
2011, 32, asia, ..., ...
2008, 45, japan, ..., ...
Its shape is 38654 rows x 14 columns.
The second dataframe is:
D = { Year, Location, column1, column2... }
2008, usa, ..., ...
2008, usa, ..., ...
2009, asia, ..., ...
2009, asia, ..., ...
2010, japna, ..., ...
Its shape is 96 rows x 7 columns.
I want to merge or join these two different dataframes. How can I do it?
Thanks.
IIUC you need merge with the parameter how='left' for a left join on the columns Year and Location:
print (df1)
Year Age Location column1 column2
0 2013 20 america 7 5
1 2008 35 usa 8 1
2 2011 32 asia 9 3
3 2008 45 japan 7 1
print (df2)
Year Location column1 column2
0 2008 usa 8 9
1 2008 usa 7 2
2 2009 asia 8 2
3 2009 asia 0 1
4 2010 japna 9 3
df = pd.merge(df1,df2, on=['Year','Location'], how='left')
print (df)
Year Age Location column1_x column2_x column1_y column2_y
0 2013 20 america 7 5 NaN NaN
1 2008 35 usa 8 1 8.0 9.0
2 2008 35 usa 8 1 7.0 2.0
3 2011 32 asia 9 3 NaN NaN
4 2008 45 japan 7 1 NaN NaN
You can also check the merge documentation.
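If the automatic _x/_y labels on overlapping columns are distracting, merge's suffixes parameter controls them, and indicator=True adds a column showing which rows found a match. A minimal sketch with small made-up frames (the values here are illustrative, not the question's data):

```python
import pandas as pd

df1 = pd.DataFrame({'Year': [2013, 2008],
                    'Location': ['america', 'usa'],
                    'column1': [7, 8]})
df2 = pd.DataFrame({'Year': [2008],
                    'Location': ['usa'],
                    'column1': [9]})

# suffixes relabels the overlapping 'column1'; indicator adds a _merge column
out = pd.merge(df1, df2, on=['Year', 'Location'], how='left',
               suffixes=('_left', '_right'), indicator=True)
```

Rows with no match in df2 show up as left_only in _merge, which makes the NaN rows in the output above easy to audit.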
My df has USA states-related information. I want to rank the states based on their contribution.
My code:
df
State Value Year
0 FL 100 2012
1 CA 150 2013
2 MA 25 2014
3 FL 50 2014
4 CA 50 2015
5 MA 75 2016
Expected answer: compute state_capacity by summing each state's values across all years, then rank the states based on that capacity:
df
State Value Year State_Capa. Rank
0 FL 100 2012 150 2
1 CA 150 2013 200 1
2 MA 25 2014 100 3
3 FL 150 2014 200 2
4 CA 50 2015 200 1
5 MA 75 2016 100 3
My approach: I am able to compute the state capacity using groupby, but I ran into NaN when I mapped it back onto the df.
state_capacity = df[['State','Value']].groupby(['State']).sum()
df['State_Capa.'] = df['State'].map(dict(state_capacity))
df
State Value Year State_Capa.
0 FL 100 2012 NaN
1 CA 150 2013 NaN
2 MA 25 2014 NaN
3 FL 50 2014 NaN
4 CA 50 2015 NaN
5 MA 75 2016 NaN
Try with transform, then rank:
df['new'] = df.groupby('State').Value.transform('sum').rank(method='dense',ascending=False)
Out[42]:
0 2.0
1 1.0
2 3.0
3 2.0
4 1.0
5 3.0
Name: Value, dtype: float64
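A self-contained sketch of the transform-then-rank approach, assembling both the State_Capa. and Rank columns from the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'State': ['FL', 'CA', 'MA', 'FL', 'CA', 'MA'],
                   'Value': [100, 150, 25, 50, 50, 75],
                   'Year': [2012, 2013, 2014, 2014, 2015, 2016]})

# Per-state totals broadcast back to every row with transform
df['State_Capa.'] = df.groupby('State').Value.transform('sum')

# Dense rank of those totals, largest capacity first
df['Rank'] = df['State_Capa.'].rank(method='dense', ascending=False).astype(int)
```

With these numbers CA's total (200) ranks 1, FL's (150) ranks 2, and MA's (100) ranks 3 on every matching row.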
As mentioned in the comment, your question seems to have a problem. However, I guess this might be what you want:
df = pd.DataFrame({
'state': ['FL', 'CA', 'MA', 'FL', 'CA', 'MA'],
'value': [100, 150, 25, 50, 50, 75],
'year': [2012, 2013, 2014, 2014, 2015, 2016]
})
returns:
state value year
0 FL 100 2012
1 CA 150 2013
2 MA 25 2014
3 FL 50 2014
4 CA 50 2015
5 MA 75 2016
and
groupby_sum = df.groupby('state')[['value']].sum()
groupby_sum['rank'] = groupby_sum['value'].rank()
groupby_sum.reset_index()
returns:
state value rank
0 CA 200 3.0
1 FL 150 2.0
2 MA 100 1.0
How can we import Excel data with merged cells?
Please see the attached Excel sheet image.
The last column has 3 sub-columns. How can we import it without making changes to the Excel sheet?
You could try this:
import pandas as pd

# Store the file name in a variable
dataset = 'Merged_Column_Data.xlsx'
# Import the dataset and skip the merged header row
df = pd.read_excel(dataset, skiprows=1)
Unnamed: 0 Unnamed: 1 Unnamed: 2 Gold Silver Bronze
0 Great Britain GBR 2012 29 17 19
1 China CHN 2012 38 28 22
2 Russia RUS 2012 24 25 32
3 United States US 2012 46 28 29
4 Korea KOR 2012 13 8 7
# Create dictionary to handle unnamed columns
col_dict = {'Unnamed: 0':'Country', 'Unnamed: 1':'Country',
'Unnamed: 2':'Year',}
# Rename columns with dictionary
df.rename(columns=col_dict)
Country Country Year Gold Silver Bronze
0 Great Britain GBR 2012 29 17 19
1 China CHN 2012 38 28 22
2 Russia RUS 2012 24 25 32
3 United States US 2012 46 28 29
4 Korea KOR 2012 13 8 7
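Another option, assuming the merged cells form a two-row header, is to read both header rows with header=[0, 1] and flatten the resulting MultiIndex columns. The sheet itself isn't available here, so this sketch builds the MultiIndex frame that pd.read_excel(dataset, header=[0, 1]) would roughly produce; the group and sub-column names are hypothetical:

```python
import pandas as pd

# Stand-in for: df = pd.read_excel(dataset, header=[0, 1])
df = pd.DataFrame(
    [['Great Britain', 'GBR', 2012, 29, 17, 19]],
    columns=pd.MultiIndex.from_tuples([
        ('Country', 'Name'), ('Country', 'Code'), ('Year', ''),
        ('Medals', 'Gold'), ('Medals', 'Silver'), ('Medals', 'Bronze'),
    ]),
)

# Join the two header levels into flat names, dropping empty parts
df.columns = [' '.join(part for part in col if part).strip() for col in df.columns]
```

This avoids both skiprows and the duplicate 'Country' column names that the rename approach produces.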
I have a dataframe (totaldf) such that:
... Hom ... March Plans March Ships April Plans April Ships ...
0 CAD ... 12 5 4 13
1 USA ... 7 6 2 11
2 CAD ... 4 9 6 14
3 CAD ... 13 3 9 7
... ... ... ... ... ... ...
for all months of the year. I would like it to be:
... Hom ... Month Plans Ships ...
0 CAD ... March 12 5
1 USA ... March 7 6
2 CAD ... March 4 9
3 CAD ... March 13 3
4 CAD ... April 4 13
5 USA ... April 2 11
6 CAD ... April 6 14
7 CAD ... April 9 7
... ... ... ... ... ...
Is there an easy way to do this without splitting string entries?
I have played around with totaldf.unstack() but since there are multiple columns I'm unsure as to how to properly reindex the dataframe.
If you convert the columns to a MultiIndex you can use stack:
In [11]: df1 = df.set_index("Hom")
In [12]: df1.columns = pd.MultiIndex.from_tuples(df1.columns.map(lambda x: tuple(x.split())))
In [13]: df1
Out[13]:
March April
Plans Ships Plans Ships
Hom
CAD 12 5 4 13
USA 7 6 2 11
CAD 4 9 6 14
CAD 13 3 9 7
In [14]: df1.stack(level=0)
Out[14]:
Plans Ships
Hom
CAD April 4 13
March 12 5
USA April 2 11
March 7 6
CAD April 6 14
March 4 9
April 9 7
March 13 3
In [21]: res = df1.stack(level=0)
In [22]: res.index.names = ["Hom", "Month"]
In [23]: res.reset_index()
Out[23]:
Hom Month Plans Ships
0 CAD April 4 13
1 CAD March 12 5
2 USA April 2 11
3 USA March 7 6
4 CAD April 6 14
5 CAD March 4 9
6 CAD April 9 7
7 CAD March 13 3
You can use pd.wide_to_long, with a little extra work to get the right stubnames, given that, as mentioned in the docs:
The stub name(s). The wide format variables are assumed to start with the stub names.
So it will be necessary to slightly modify the column names so that the stubnames are at the beginning of each column name:
m = df.columns.str.contains('Plans|Ships')
cols = df.columns[m].str.split(' ')
df.columns.values[m] = [w+month for month, w in cols]
print(df)
Hom PlansMarch ShipsMarch PlansApril ShipsApril
0 CAD 12 5 4 13
1 USA 7 6 2 11
2 CAD 4 9 6 14
3 CAD 13 3 9 7
Now you can use pd.wide_to_long using ['Ships', 'Plans'] as stubnames in order to obtain the output you want:
((pd.wide_to_long(df.reset_index(), stubnames=['Ships', 'Plans'], i = 'index',
j = 'Month', suffix='\w+')).reset_index(drop=True, level=0)
.reset_index())
   Month  Hom  Ships  Plans
0 March CAD 5 12
1 March USA 6 7
2 March CAD 9 4
3 March CAD 3 13
4 April CAD 13 4
5 April USA 11 2
6 April CAD 14 6
7 April CAD 7 9
I would like to know how I can add a year-over-year growth rate to the following data in pandas.
Date Total Managed Expenditure
0 2001 503.2
1 2002 529.9
2 2003 559.8
3 2004 593.2
4 2005 629.5
5 2006 652.1
6 2007 664.3
7 2008 688.2
8 2009 732.0
9 2010 759.2
10 2011 769.2
11 2012 759.8
12 2013 760.6
13 2014 753.3
14 2015 757.6
15 2016 753.9
Use Series.pct_change():
df['Total Managed Expenditure'].pct_change()
Out:
0 NaN
1 0.053060
2 0.056426
3 0.059664
4 0.061194
5 0.035902
6 0.018709
7 0.035978
8 0.063644
9 0.037158
10 0.013172
11 -0.012220
12 0.001053
13 -0.009598
14 0.005708
15 -0.004884
Name: Total Managed Expenditure, dtype: float64
To assign it back:
df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()
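A quick runnable check on the first few rows, assuming the data above:

```python
import pandas as pd

df = pd.DataFrame({'Date': [2001, 2002, 2003],
                   'Total Managed Expenditure': [503.2, 529.9, 559.8]})

# Year-over-year growth rate: (current - previous) / previous
df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()
```

The first row has no previous year, so pct_change leaves it as NaN; multiply by 100 if you want percentages rather than fractions.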
In pandas I have a dataframe as follows (the first two lines are the two-level column header; the rest are rows):
2012 2013 2012 2013
women women men men
0 14 43 24 45
1 34 54 35 65
and would like to get it like
women men
2012 0 14 24
2012 1 34 35
2013 0 43 45
2013 1 54 65
Using df.stack and df.unstack I did not get anywhere.
Any elegant solution?
In [5]: df
Out[5]:
2012 2013
women men women men
0 0 1 2 3
1 4 5 6 7
The idea is to first stack the first level of the columns into the index, and then swap the two index levels (pandas.DataFrame.swaplevel):
In [6]: df.stack(level=0).swaplevel(0,1,axis=0)
Out[6]:
men women
2012 0 1 0
2013 0 3 2
2012 1 5 4
2013 1 7 6
df.stack is most likely what you want. See below; you do need to specify that you want the first level.
In [79]: df = pd.DataFrame(0., index=[0,1], columns=pd.MultiIndex.from_product([[2012,2013], ['women','men']]))
In [83]: df.stack(level=0)
Out[83]:
men women
0 2012 0 0
2013 0 0
1 2012 0 0
2013 0 0
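Putting the pieces together with the question's actual numbers: stack the year level, swap it to the front, and sort_index to get the 2012-before-2013 ordering asked for. A sketch:

```python
import pandas as pd

df = pd.DataFrame(
    [[14, 24, 43, 45], [34, 35, 54, 65]],
    columns=pd.MultiIndex.from_tuples(
        [(2012, 'women'), (2012, 'men'), (2013, 'women'), (2013, 'men')]
    ),
)

# Move the year level into the row index, bring it to the front, then sort
res = df.stack(level=0).swaplevel(0, 1).sort_index()
```

res now has a (year, row) index in the order 2012/0, 2012/1, 2013/0, 2013/1, matching the layout in the question.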