Pandas merging/filtering dataframes

Pandas merging/filtering dataframes - python

I have two data frames,
the first is a list of the cities in europe that belong to the EU and
which country they're in:
cities_in_eu
country city
0 sweden stockholm
1 germany berlin
2 germany frankfurt
3 spain barcelona
4 spain madrid
5 france paris
...
assume the data goes on like this for many observations, with potentially
many observations of cities for each country.
the next data frame is all cities in europe, not exclusive
to belonging in the EU.
This data frame has information on the cities population:
cities_in_europe
country city population(100million)
sweden stockholm 2
germany berlin 8
germany frankfurt 5
spain barcelona 6
spain madrid 3
france paris 8
switzerland bern 1
russia moscow 6
...
(the numbers here are made up)
basically i want to test the difference in population between
EU cities and non-EU cities by filtering the data to only see
cities in/not in the EU.
Using only the data frame list of cities_in_eu, how would i
achieve this?

You could try this:
First, you will create a list of EU cities based on the cities_in_eu
EUcities = list(set(cities_in_eu.city))
Then you will create a table which contains all the population information of EU cities:
#create a list of booleans
filter = []
for city in cities_in_europe.city:
filter.append(True if city in EUcities else False)
filtered = pd.Series(filter)
#this one will remain only cities in EU
df_eu = cities_in_europe[filtered]
nonEU_filter = [not i for i in filter]
nonEU_filtered = pd.Series(nonEU_filter)
df_non_eu = cities_in_europe[nonEU_filtered]
There you go, now you have 2 df of EU cities with population and non-EU cities with population. Then you can do other stuff on these two

Related

Validate a dataframe based on another dataframe?

I have two dataframes:
Table1:
Table2:
How to find:
The country-city combinations that are present only in Table2 but not Table1.
Here [India-Mumbai] is the output.
For each country-city combination, that's present in both the tables, find the "Initiatives" that are present in Table2 but not Table1.
Here {"India-Bangalore": [Textile, Irrigation], "USA-Texas": [Irrigation]}

To answer the first question, we can use the merge method and keep only the NaN rows :
>>> df_merged = pd.merge(df_1, df_2, on=['Country', 'City'], how='left', suffixes = ['_1', '_2'])
>>> df_merged[df_merged['Initiative_2'].isnull()][['Country', 'City']]
Country City
13 India Mumbai
For the next question, we first need to remove the NaN rows from the previously merged DataFrame :
>>> df_both_table = df_merged[~df_merged['Initiative_2'].isnull()]
>>> df_both_table
Country City Initiative_1 Initiative_2
0 India Bangalore Plants Plants
1 India Bangalore Plants Textile
2 India Bangalore Plants Irrigtion
3 India Bangalore Industries Plants
4 India Bangalore Industries Textile
5 India Bangalore Industries Irrigtion
6 India Bangalore Roads Plants
7 India Bangalore Roads Textile
8 India Bangalore Roads Irrigtion
9 USA Texas Plants Plants
10 USA Texas Plants Irrigation
11 USA Texas Roads Plants
12 USA Texas Roads Irrigation
Then, we can filter on the rows that are strictly different on columns Initiative_1 and Initiative_2 and use a groupby to get the list of Innitiative_2 :
>>> df_unique_initiative_2 = df_both_table[~(df_both_table['Initiative_1'] == df_both_table['Initiative_2'])]
>>> df_list_initiative_2 = df_unique_initiative_2.groupby(['Country', 'City'])['Initiative_2'].unique().reset_index()
>>> df_list_initiative_2
Country City Initiative_2
0 India Bangalore [Textile, Irrigation, Plants]
1 USA Texas [Irrigation, Plants]
We do the same but this time on Initiative_1 to get the list as well :
>>> df_list_initiative_1 = df_unique_initiative_2.groupby(['Country', 'City'])['Initiative_1'].unique().reset_index()
>>> df_list_initiative_1
Country City Initiative_1
0 India Bangalore [Plants, Industries, Roads]
1 USA Texas [Plants, Roads]
To finish, we use the set to remove the last redondant Initiative_1 elements to get the expected result :
>>> df_list_initiative_2['Initiative'] = (df_list_initiative_2['Initiative_2'].map(set)-df_list_initiative_1['Initiative_1'].map(set)).map(list)
>>> df_list_initiative_2[['Country', 'City', 'Initiative']]
Country City Initiative
0 India Bangalore [Textile, Irrigation]
1 USA Texas [Irrigation]

Alternative approach (df1 your Table1, df2 your Table2):
combos_1, combos_2 = set(zip(df1.Country, df1.City)), set(zip(df2.Country, df2.City))
in_2_but_not_in_1 = [f"{country}-{city}" for country, city in combos_2 - combos_1]
initiatives = {
f"{country}-{city}": (
set(df2.Initiative[df2.Country.eq(country) & df2.City.eq(city)])
- set(df1.Initiative[df1.Country.eq(country) & df1.City.eq(city)])
)
for country, city in combos_1 & combos_2
}
Results:
['India-Delhi']
{'India-Bangalore': {'Irrigation', 'Textile'}, 'USA-Texas': {'Irrigation'}}
I think you got this "The country-city combinations that are present only in Table2 but not Table1. Here [India-Mumbai] is the output" wrong: The combinations India-Mumbai is not present in Table2?

How to maintain the same index after sorting a Pandas series?

I have the following Pandas series from the dataframe 'Reducedset':
Reducedset = Top15.iloc[:,10:20].mean(axis=1).sort_values(ascending=False)
Which gives me:
Country
United States 1.536434e+13
China 6.348609e+12
Japan 5.542208e+12
Germany 3.493025e+12
France 2.681725e+12
United Kingdom 2.487907e+12
Brazil 2.189794e+12
Italy 2.120175e+12
India 1.769297e+12
Canada 1.660647e+12
Russian Federation 1.565459e+12
Spain 1.418078e+12
Australia 1.164043e+12
South Korea 1.106715e+12
Iran 4.441558e+11
dtype: float64
I want to update the index, so that index of the dataframe Reducedset is in the same order as the series above.
How can I do this?
In other words, when I then look at the entire dataframe, the index order should be the same as in the series above and not like that below:
Reducedset
Rank Documents Citable documents Citations \
Country
China 1 127050 126767 597237
United States 2 96661 94747 792274
Japan 3 30504 30287 223024
United Kingdom 4 20944 20357 206091
Russian Federation 5 18534 18301 34266
Canada 6 17899 17620 215003
Germany 7 17027 16831 140566
India 8 15005 14841 128763
France 9 13153 12973 130632
South Korea 10 11983 11923 114675
Italy 11 10964 10794 111850
Spain 12 9428 9330 123336
Iran 13 8896 8819 57470
Australia 14 8831 8725 90765
Brazil 15 8668 8596 60702

The answer:
Reducedset = Top15.iloc[:,10:20].mean(axis=1).sort_values(ascending=False)
This first stage finds the mean of columns 10-20 for each row (axis=1) and sorts them in descending order (ascending = False)
Reducedset.reindex(Reducedset.index)
Here, we are resetting the index of the dataframe 'Reducedset' as the index of the amended dataframe above.

How to add a dictionary as the last element to a list of dictionaries?

I would like to add a dictionary to a list, which contains several other dictionaries.
I have a list of ten top travel cities:
City Country Population Area
0 Buenos Aires Argentina 2891000 4758
1 Toronto Canada 2800000 2731571
2 Pyeongchang South Korea 2581000 3194
3 Marakesh Morocco 928850 200
4 Albuquerque New Mexico 559277 491
5 Los Cabos Mexico 287651 3750
6 Greenville USA 84554 68
7 Archipelago Sea Finland 60000 8300
8 Walla Walla Valley USA 32237 33
9 Salina Island Italy 4000 27
10 Solta Croatia 1700 59
11 Iguazu Falls Argentina 0 672
I imported the excel with pandas:
import pandas as pd
travel_df = pd.read_excel('./cities.xlsx')
print(travel_df)
cities = travel_df.to_dict('records')
print(cities)
variables = list(cities[0].keys())
I would like to add a 12th element to the end of the list but don't know how to do so:
beijing = {"City" : "Beijing", "Country" : "China", "Population" : "24000000", "Ares" : "6490" }
print(beijing)

Try appending the new row to the DataFrame you read.
travel_df.append(beijing, ignore_index=True)

Weird behaviour with pandas cut, groupby and multiindex in Python

I have a dataframe like this one,
Continent % Renewable
Country
China Asia 2
United States North America 1
Japan Asia 1
United Kingdom Europe 1
Russian Federation Europe 2
Canada North America 5
Germany Europe 2
India Asia 1
France Europe 2
South Korea Asia 1
Italy Europe 3
Spain Europe 3
Iran Asia 1
Australia Australia 1
Brazil South America 5
where the % Renewableis a column created using the cut function,
Top15['% Renewable'] = pd.cut(Top15['% Renewable'], 5, labels=range(1,6))
when I group by Continentand % Renewable to count the number of countries in each subset I do,
count_groups = Top15.groupby(['Continent', '% Renewable']).size()
which is,
Continent % Renewable
Asia 1 4
2 1
Australia 1 1
Europe 1 1
2 3
3 2
North America 1 1
5 1
South America 5 1
The weird thing is the indexing now, if I index for a value that the category value is > 0 this gives me the value,
count_groups.loc['Asia', 1]
>> 4
if not,
count_groups.loc['Asia', 3]
>> IndexingError: Too many indexers
shouldn't it give me a 0 as there are no entries in that category? I would assume so as that dataframe was created using the groupby.
If not, can anyone suggest a procedure so I can preserve the 0 nr of countries for a category of % Renewable?

You have a Series with MultiIndex. Normally, we use tuples for indexing with MultiIndexes but pandas can be flexible about that.
In my opinion, count_groups.loc[('Asia', 3)] should raise a KeyError since this pair does not appear in the index but that's for pandas developers to decide I guess.
To return a default value from a Series, we can use get like we do in dictionaries:
count_groups.get(('Asia', 3), 0)
This will return 0 if the key does not exist.

How to update/create column in pandas based on values in a list

So, here is my dataframe
import pandas as pd
cols = ['Name','Country','Income']
vals = [['Steve','USA',40000],['Matt','UK',40000],['John','USA',40000],['Martin','France',40000],]
x = pd.DataFrame(vals,columns=cols)
I have another list:
europe = ['UK','France']
I want to create a new column 'Continent' if x.Country is in europe

You need numpy.where with condition with isin:
x['Continent'] = np.where(x['Country'].isin(europe), 'Europe', 'Not Europe')
print (x)
Name Country Income Continent
0 Steve USA 40000 Not Europe
1 Matt UK 40000 Europe
2 John USA 40000 Not Europe
3 Martin France 40000 Europe

Or you can using isin directly
x['New Column']='Not Europe'
x.loc[x.Country.isin(europe),'New Column']='Europe'
Out[612]:
Name Country Income New Column
0 Steve USA 40000 Not Europe
1 Matt UK 40000 Europe
2 John USA 40000 Not Europe
3 Martin France 40000 Europe

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas merging/filtering dataframes - python

Related

Validate a dataframe based on another dataframe?

How to maintain the same index after sorting a Pandas series?

How to add a dictionary as the last element to a list of dictionaries?

Weird behaviour with pandas cut, groupby and multiindex in Python

How to update/create column in pandas based on values in a list

Categories

Resources