Adding a subindex to merged dataframes - python

I have 3 dataframes each with the same columns (years) and same indexes (countries).
Now I want to merge these 3 dataframes. But since all have the same columns it is appending those.
So I'd like to keep the country index and add a subindex for each dataframe, because each one holds a different measure for each year.
# dataframe 1
# CO2:
               2005    2010    2015    2020
country
Afghanistan  169405  210161  259855  319447
Albania         762     940    1154    1408
Algeria      158336  215865  294768  400126
# dataframe 2
# Arrivals + Departures:
                 2005      2010      2015      2020
country
Afghanistan    977896   1326120   1794547   2414943
Albania        103132    154219    224308    319440
Algeria       3775374   5307448   7389427  10159656
# dataframe 3
# Travel distance in km:
                    2005         2010         2015         2020
country
Afghanistan   9330447004  12529259781  16776152792  22337458954
Albania         63159063     82810491    107799357    139543748
Algeria      12254674181  17776784271  25782632480  37150057977
The result should be something like:
                                   2005         2010         2015         2020
country
Afghanistan co2                  169405       210161       259855       319447
            flights              977896      1326120      1794547      2414943
            traveldistance   9330447004  12529259781  16776152792  22337458954
Albania     ....
How can I do this?
NOTE: The years are an input so these are not fixed. They could just be 2005,2010 for example.
Thanks in advance.

I tried solving this with concat and groupby on your dataset.
First, concat the 3 dataframes:
l=[df,df2,df3]
f=pd.concat(l,keys=['CO2','Flights','traveldistance'],axis=0).reset_index().rename(columns={'level_0':'Category'})
Then use groupby to get the values:
result_df=f.groupby(['country', 'Category'])[f.columns[2:]].first()
Hope it helps and solves your problem.
Output looks like this
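A shorter variant of the same idea, as a sketch only: pd.concat can build the subindex directly from keys, and swaplevel plus sort_index then puts country first. The frame names co2, flights and distance and the level name "category" are made up here; the numbers are the sample rows from the question. Since the columns are never named explicitly, this should also work when the list of years changes.

```python
import pandas as pd

# Sample data from the question; the variable names are illustrative.
years = [2005, 2010, 2015, 2020]
countries = pd.Index(["Afghanistan", "Albania", "Algeria"], name="country")

co2 = pd.DataFrame(
    [[169405, 210161, 259855, 319447],
     [762, 940, 1154, 1408],
     [158336, 215865, 294768, 400126]],
    index=countries, columns=years)
flights = pd.DataFrame(
    [[977896, 1326120, 1794547, 2414943],
     [103132, 154219, 224308, 319440],
     [3775374, 5307448, 7389427, 10159656]],
    index=countries, columns=years)
distance = pd.DataFrame(
    [[9330447004, 12529259781, 16776152792, 22337458954],
     [63159063, 82810491, 107799357, 139543748],
     [12254674181, 17776784271, 25782632480, 37150057977]],
    index=countries, columns=years)

# keys= adds an outer index level; swaplevel moves country to the front,
# and sort_index groups the three measures under each country.
result = (
    pd.concat([co2, flights, distance],
              keys=["co2", "flights", "traveldistance"],
              names=["category", "country"])
    .swaplevel()
    .sort_index()
)
print(result)
```

A single measure for a single country can then be read with result.loc[("Albania", "flights")].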

Related

Using Query in Pandas to remove a vector of values

I work in R and this operation would be easy in tidyverse; However, I'm having trouble figuring out how to do it in Python and Pandas.
Let's say we're using the gapminder dataset
data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
gapminder = pd.read_csv(data_url)
and let's say that I want to filter out from the dataset all year values that are equal to 1952 and 1957. I would think that something like this would work, but it doesn't:
vector = [1952, 1957]
gapminder.query("year isin(vector)")
I realize here that I've made a vector in what is really a list. When I try to pass those two year values into an array as vector = pd.array(1952, 1957) That doesn't work either.
In R, for instance, you would have to do something simple like
vector = c(1952, 1957)
gapminder %>% filter(year %in% vector)
#or
gapminder %>% filter(year %in% c(1952, 1957))
So really this is a two part question: first, how can I create a vector of many values (if I were pulling these values from another dataset, I believe that I could just use pd.to_numpy) and then how do I then remove all rows based on that vector of observations from a dataframe?
I've looked at a lot of different variations for using query like here, for instance, https://www.geeksforgeeks.org/python-filtering-data-with-pandas-query-method/, but this has been surprisingly hard to find.
*Here I am updating my question: I found that this isn't working if I pull a vector from another dataset (or even from the same dataset); for instance:
vector = (1952, 1957)
#how to take a dataframe and make a vector
#how to make a vector
gapminder.vec = gapminder\
.query('year == [1952, 1958]')\
[['country']]\
.to_numpy()
gap_sum = gapminder.query("year != @gapminder.vec")
gap_sum
I receive the following error:
Thanks much!
James
You can use in or even == inside the query string like so:
# gapminder.query("year == @vector") returns the same result
print(gapminder.query("year in @vector"))
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
12 Albania 1952 1282697.0 Europe 55.230 1601.056136
13 Albania 1957 1476505.0 Europe 59.280 1942.284244
24 Algeria 1952 9279525.0 Africa 43.077 2449.008185
... ... ... ... ... ... ...
1669 Yemen Rep. 1957 5498090.0 Asia 33.970 804.830455
1680 Zambia 1952 2672000.0 Africa 42.038 1147.388831
1681 Zambia 1957 3016000.0 Africa 44.077 1311.956766
1692 Zimbabwe 1952 3080907.0 Africa 48.451 406.884115
1693 Zimbabwe 1957 3646340.0 Africa 50.469 518.764268
The @ symbol tells the query string to look for a variable named vector in the surrounding Python scope, outside of the context of the dataframe.
There are a couple of issues with the updated component of your question that I'll address:
The error you're seeing comes from using double square brackets to select a column. Double brackets force the selection to come back as a 2D table (a dataframe that contains a single column) instead of just the column itself. To resolve this, simply drop the extra brackets. The to_numpy call is also unnecessary.
In your gap_sum variable, you're checking whether the values in "year" are not in gapminder.vec, which is a pd.Series (an array, in more generic terms) of country names. Comparing years against country names doesn't make sense.
Don't use . notation to create variables in Python. You're not making a new variable; you're attaching a new attribute to an existing object. Use underscores instead, as is common practice in Python (e.g. gapminder_vec instead of gapminder.vec).
# countries that have years that are either 1952 or 1958
# will contain duplicate country names
gapminder_vec = gapminder.query('year == [1952, 1958]')['country']
# This won't actually filter anything, because `gapminder_vec` is
# a bunch of country names, not years.
gapminder.query("year not in @gapminder_vec")
Also to perform a filter rather than a subset:
vec = (1952, 1958)
# returns a subset containing the rows that have a year in `vec`
subset_with_years_in_vec = gapminder.query('year in @vec')
# returns a subset containing the rows that DO NOT have a year in `vec`
subset_without_years_in_vec = gapminder.query('year not in @vec')
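To try the in / not in forms without downloading the gapminder file, here is a small self-contained sketch. The toy dataframe and the name years_to_drop are invented for illustration; only the query syntax is the point.

```python
import pandas as pd

# Toy stand-in for gapminder (made-up rows, same column names).
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [1952, 1957, 1962, 1952, 1957, 1962],
})

# A plain Python list is enough; no special "vector" type is needed.
years_to_drop = [1952, 1957]

# @name refers to a Python variable rather than a column.
dropped = df.query("year in @years_to_drop")
kept = df.query("year not in @years_to_drop")
print(kept)
```

A tuple or a NumPy array can be referenced with @ in the same way.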
To filter out years 1952 and 1957 you can use:
print(gapminder.loc[~(gapminder.year.isin([1952, 1957]))])
Prints:
country year pop continent lifeExp gdpPercap
2 Afghanistan 1962 1.026708e+07 Asia 31.99700 853.100710
3 Afghanistan 1967 1.153797e+07 Asia 34.02000 836.197138
4 Afghanistan 1972 1.307946e+07 Asia 36.08800 739.981106
5 Afghanistan 1977 1.488037e+07 Asia 38.43800 786.113360
6 Afghanistan 1982 1.288182e+07 Asia 39.85400 978.011439
7 Afghanistan 1987 1.386796e+07 Asia 40.82200 852.395945
8 Afghanistan 1992 1.631792e+07 Asia 41.67400 649.341395
9 Afghanistan 1997 2.222742e+07 Asia 41.76300 635.341351
10 Afghanistan 2002 2.526840e+07 Asia 42.12900 726.734055
11 Afghanistan 2007 3.188992e+07 Asia 43.82800 974.580338
14 Albania 1962 1.728137e+06 Europe 64.82000 2312.888958
15 Albania 1967 1.984060e+06 Europe 66.22000 2760.196931
16 Albania 1972 2.263554e+06 Europe 67.69000 3313.422188
17 Albania 1977 2.509048e+06 Europe 68.93000 3533.003910
...
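The same isin approach also covers the case where the values come from another dataset. The sketch below assumes a hypothetical second dataframe called other with a year column; both frames are toy data, not the real gapminder file.

```python
import pandas as pd

# Toy stand-in for gapminder.
gap = pd.DataFrame({
    "country": ["Afghanistan"] * 3 + ["Albania"] * 3,
    "year": [1952, 1957, 1962] * 2,
})

# Hypothetical second dataset that supplies the years to remove.
other = pd.DataFrame({"year": [1952, 1957, 1957]})

# A single column is already array-like; unique() just drops repeats.
years_to_drop = other["year"].unique()

# isin builds a boolean mask; ~ inverts it to keep the non-matching rows.
mask = gap["year"].isin(years_to_drop)
without_years = gap.loc[~mask]
print(without_years)
```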

Sorting grouped DataFrame column without changing index sorting

I have a df as below:
I want only the top 5 countries from each year but keeping the year ascending.
First I grouped the df by year and country name and then ran the following code:
df.sort_values(['year','hydro_total'], ascending=False).groupby(['year']).head(5)
The result didn't keep the index ascending, instead, it sorted the year index too. How do I get the top 5 countries and keep the year's group ascending?
The CSV file is uploaded HERE .
You are sorting by both year and hydro_total in descending order. You need to sort year ascending and hydro_total descending:
(df.sort_values(['year','hydro_total'],
ascending=[True,False])
.groupby('year').head(5)
)
Output:
country year hydro_total hydro_per_person
440 Japan 1971 7240000.0 0.06890
160 China 1971 2580000.0 0.00308
240 India 1971 2410000.0 0.00425
760 North Korea 1971 788000.0 0.05380
800 Pakistan 1971 316000.0 0.00518
... ... ... ... ...
199 China 2010 62100000.0 0.04630
279 India 2010 9840000.0 0.00803
479 Japan 2010 7070000.0 0.05590
1119 Turkey 2010 4450000.0 0.06120
839 Pakistan 2010 2740000.0 0.01580
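For a version that runs without the CSV, here is a minimal sketch using six rows taken from the output above and the top 2 per year instead of the top 5. The rows are deliberately entered out of order to show that the sort does the work.

```python
import pandas as pd

# A few rows from the output above, deliberately unsorted.
df = pd.DataFrame({
    "country": ["Japan", "China", "India", "Japan", "China", "India"],
    "year": [2010, 2010, 2010, 1971, 1971, 1971],
    "hydro_total": [7070000.0, 62100000.0, 9840000.0,
                    7240000.0, 2580000.0, 2410000.0],
})

# One ascending flag per sort key: year ascending, hydro_total descending.
top2 = (
    df.sort_values(["year", "hydro_total"], ascending=[True, False])
      .groupby("year")
      .head(2)
)
print(top2)
```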

Calculating new rows in a Pandas Dataframe on two different columns

So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows on each country where the year adds 20 and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values though, I just want to append the new values.
This is how the dataframe looks:
Preferably I would also want to create a loop that runs the code a certain number of times
Super grateful for any help!
If you need to copy the values from the dataframe as an example you can have it here:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns
If you want to repeat the rows, I'd create a new dataframe, perform any operation on it (add 20 years, multiply the temperature by a constant or an array, etc.) and then use concat() to append it to the original dataframe:
import pandas as pd
tempChange=1.15
data = {'Country':['Afghanistan','Africa','Albania','Algeria','American Samoa'],'avgTemp':[14,24,13,23,27],'Year':[2012,2012,2012,2012,2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp']*tempChange
df_2['Year'] = df['Year']+20
df = pd.concat([df,df_2]) #ignore_index=True if you wish to not repeat the index value
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032
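The question also asks for a loop that repeats this a certain number of times. One way to sketch that, assuming each step builds on the previous step's values (the names n_steps and frames are made up here):

```python
import pandas as pd

temp_change = 1.15
n_steps = 3  # how many 20-year projections to append (illustrative)

df = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "avgTemp": [14.0, 13.0],
    "Year": [2012, 2012],
})

# Collect each step in a list and concatenate once at the end,
# which avoids growing the dataframe inside the loop.
frames = [df]
for _ in range(n_steps):
    nxt = frames[-1].copy()
    nxt["avgTemp"] = nxt["avgTemp"] * temp_change
    nxt["Year"] = nxt["Year"] + 20
    frames.append(nxt)

result = pd.concat(frames, ignore_index=True)
print(result)
```

If every projection should start from the original 2012 values instead, copy df rather than frames[-1] inside the loop and scale by the step number.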
where df is your data frame name:
df['tempChange'] = df['year']+ 20 * df['avgTemp']
This will add a new column to your df with the logic above. I'm not sure I understood your logic correctly, so the math may need some work.
I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20,axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp']*tempChange,axis=1)
This is how you apply to each row.

Pandas dataframe vertical merge

I have a query regarding merging two dataframes
For example i have 2 dataframes as below :
print(df1)
Year Location
0 2013 america
1 2008 usa
2 2011 asia
print(df2)
Year Location
0 2008 usa
1 2008 usa
2 2009 asia
My expected output :
Year Location
2013 america
2008 usa
2011 asia
Year Location
2008 usa
2008 usa
2009 asia
Output i am getting right now :
Year Location Year Location
2013 america 2008 usa
2008 usa 2008 usa
2011 asia 2009 asia
I have tried using pd.concat and pd.merge with no luck
Please help me with above
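Simply specify the axis along which to concatenate in pd.concat. For stacking rows vertically that is axis=0 (the default); axis=1 produces the side-by-side output you are getting now:
df_merged=pd.concat([df1,df2],axis=0)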
pd.concat([df1, df2]) should work. If all the column headings are the same, it will bind the second dataframe's rows below the first. This graphic from a pandas cheat sheet (https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) explains it pretty well:
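It's the same columns in the same order, so you can use df1.append(df2). Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions use pd.concat instead: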
output_df = pd.concat([df1, df2], ignore_index=False)
If you set ignore_index=True, you lose your original indexes and get 0..n-1 instead.
It works for MultiIndex too
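Putting the answers above side by side on the question's two dataframes, as a small sketch (the result variable names are made up):

```python
import pandas as pd

df1 = pd.DataFrame({"Year": [2013, 2008, 2011],
                    "Location": ["america", "usa", "asia"]})
df2 = pd.DataFrame({"Year": [2008, 2008, 2009],
                    "Location": ["usa", "usa", "asia"]})

# Default axis=0: df2's rows go below df1's, original index labels kept.
stacked = pd.concat([df1, df2])

# Same stacking, but with a fresh 0..n-1 index.
renumbered = pd.concat([df1, df2], ignore_index=True)

# axis=1 places the frames side by side, which is the unwanted output.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
```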

Creating a 3d matrix with pandas panel

My goal is to create a pandas panel, I currently have a csv, with the sample as follows:
Year From country To country Points
2005 Albania Albania 0
2005 Albania Bosnia & Herzegovina 0
2005 Albania Croatia 2
2005 Albania Cyprus 7
2005 Albania Denmark 0
I want to make a 3D array where the first axis is all the years range, which I have to search through the csv to find when 2005 turns to 2006, etc then the next axis will be the from country and the other axis will be the to country, and those axes will have the value of points... if that makes sense? Is the pandas panel the tool I should be using here, and how would scrape the years from the big dataframe to create a new dataframe for assumingly, all the years (2005 - 2016)
EDIT:
I found this picture, which is exactly what I'm trying to do for EACH year instead of the average of all the years. So it'd be like have one of those graphs for each year, 2005 - 2016
Format your dataframe so that the index is a MultiIndex with two levels. The to_panel method takes the Items from the columns, the Major_axis from the first level of the index, and the Minor_axis from the second level of the index. Note that Panel and to_panel were removed in pandas 0.25, so this only works on older versions.
df.set_index(['From country', 'To country', 'Year']).Points.unstack().to_panel()
<class 'pandas.core.panel.Panel'>
Dimensions: 1 (items) x 1 (major_axis) x 5 (minor_axis)
Items axis: 2005 to 2005
Major_axis axis: Albania to Albania
Minor_axis axis: Albania to Denmark
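On current pandas, where Panel is no longer available, one possible substitute is a dict of per-year pivot tables, optionally stacked into a 3D NumPy array. This is only a sketch: the 2005 rows come from the sample above, while the 2006 rows are invented so that there is more than one year to stack.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Year": [2005, 2005, 2005, 2006, 2006, 2006],
    "From country": ["Albania"] * 6,
    "To country": ["Albania", "Croatia", "Cyprus"] * 2,
    "Points": [0, 2, 7, 0, 5, 1],  # the 2006 values are made up
})

# One "from x to" table per year.
per_year = {
    year: group.pivot(index="From country", columns="To country",
                      values="Points")
    for year, group in df.groupby("Year")
}

# Align every table to the same axes, then stack along a new first axis.
years = sorted(per_year)
senders = sorted(df["From country"].unique())
receivers = sorted(df["To country"].unique())
cube = np.stack([
    per_year[y].reindex(index=senders, columns=receivers).to_numpy()
    for y in years
])
print(cube.shape)  # (years, from countries, to countries)
```

per_year[2005] is then the table for a single year, which matches the "one graph per year" goal from the edit.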
