I would like to ask your advice: how can I transform the first dataframe below into the second?
Continent, Country and Location are the names of the column index levels.
Polution_Level would be added as the column name for the values present in the first dataframe.
Continent   Asia   Asia      Africa      Europe
Country     Japan  China     Mozambique  Portugal
Location    Tokyo  Shanghai  Maputo      Lisbon
Date
01 Jan 20   250    435       45          137
02 Jan 20   252    457       43          144
03 Jan 20   253    463       42          138
Continent  Country   Location  Date       Polution_Level
Asia       Japan     Tokyo     01 Jan 20  250
Asia       Japan     Tokyo     02 Jan 20  252
Asia       Japan     Tokyo     03 Jan 20  253
...
Europe     Portugal  Lisbon    03 Jan 20  138
Thank you.
The following should do what you want.
Modules
import io
import pandas as pd
Create data
df = pd.read_csv(io.StringIO("""
Continent  Asia   Asia      Africa      Europe
Country    Japan  China     Mozambique  Portugal
Location   Tokyo  Shanghai  Maputo      Lisbon
Date
01 Jan 20  250    435       45          137
02 Jan 20  252    457       43          144
03 Jan 20  253    463       42          138
"""), sep=r"\s\s+", engine="python", header=[0, 1, 2], index_col=[0])
Verify the MultiIndex
df.columns
MultiIndex([( 'Asia', 'Japan', 'Tokyo'),
( 'Asia', 'China', 'Shanghai'),
('Africa', 'Mozambique', 'Maputo'),
('Europe', 'Portugal', 'Lisbon')],
names=['Continent', 'Country', 'Location'])
Transpose table and stack values
ndf = df.T.stack().reset_index()
ndf = ndf.rename({0: 'Polution_Level'}, axis=1)
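Equivalently, the whole reshape can be chained, naming the stacked values via Series.rename before resetting the index (a minimal sketch using the df built above):
ndf = (df.T                          # rows become Continent / Country / Location
         .stack()                    # pull Date into the index -> long Series
         .rename("Polution_Level")   # name the values column-to-be
         .reset_index())             # back to a flat dataframe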
I have a big dataframe (the following is an example):
country   value
portugal  86
germany   20
belgium   21
Uk        81
portugal  77
UK        87
I want to subtract 60 from the value whenever the country is portugal or UK; the dataframe should then look like this:
country   value
portugal  26
germany   20
belgium   21
Uk        21
portugal  17
UK        27
IIUC, use isin on the lowercased country strings to check whether each value is in a reference list, then slice the dataframe with loc for in-place modification:
df.loc[df['country'].str.lower().isin(['portugal', 'uk']), 'value'] -= 60
output:
country value
0 portugal 26
1 germany 20
2 belgium 21
3 Uk 21
4 portugal 17
5 UK 27
Use numpy.where:
In [1621]: import numpy as np
In [1622]: df['value'] = np.where(df['country'].str.lower().isin(['portugal', 'uk']), df['value'] - 60, df['value'])
In [1623]: df
Out[1623]:
country value
0 portugal 26
1 germany 20
2 belgium 21
3 Uk 21
4 portugal 17
5 UK 27
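Both answers rely on the same boolean mask; Series.mask is a third equivalent spelling (a sketch using the same column names as above):
# replace values where the mask is True with value - 60
m = df['country'].str.lower().isin(['portugal', 'uk'])
df['value'] = df['value'].mask(m, df['value'] - 60)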
My dataframe contains the number of matches for given fixtures, but only home matches for each team (i.e. the count for Argentina-Uruguay is 97, but for Uruguay-Argentina it is 80). In short, I want to sum the home-match counts for both orderings, so that I have the total number of matches between the teams concerned. The dataframe's top 30 rows look like this:
most_often = mc.groupby(["home_team", "away_team"]).size().reset_index(name="how_many").sort_values(by=['how_many'], ascending = False)
most_often = most_often.reset_index(drop=True)
most_often.head(30)
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
5 Argentina Paraguay 64
6 Belgium Netherlands 63
7 Netherlands Belgium 62
8 England Scotland 59
9 Argentina Brazil 58
10 Brazil Paraguay 58
11 Scotland England 58
12 Norway Sweden 56
13 England Wales 54
14 Sweden Denmark 54
15 Wales Scotland 54
16 Denmark Sweden 53
17 Argentina Chile 53
18 Scotland Wales 52
19 Scotland Northern Ireland 52
20 Sweden Norway 51
21 Wales England 50
22 England Northern Ireland 50
23 Wales Northern Ireland 50
24 Chile Uruguay 49
25 Northern Ireland England 49
26 Brazil Argentina 48
27 Brazil Chile 48
28 Brazil Uruguay 47
29 Chile Peru 46
The result I am after looks something like this:
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 107
5 Uganda Kenya 107
6 Belgium Netherlands 105
7 Netherlands Belgium 105
But this is only an example; I want to apply it to every team in the dataframe.
What should I do?
OK, you can follow the steps below.
Here is the initial df.
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
Here you need to create a sorted team list that will be the key for the aggregation:
import numpy as np
df1['sorted_list_team'] = list(zip(df1['home_team'], df1['away_team']))
df1['sorted_list_team'] = df1['sorted_list_team'].apply(lambda x: np.sort(np.unique(x)))
home_team away_team how_many sorted_list_team
0 Argentina Uruguay 97 [Argentina, Uruguay]
1 Uruguay Argentina 80 [Argentina, Uruguay]
2 Austria Hungary 69 [Austria, Hungary]
3 Hungary Austria 68 [Austria, Hungary]
4 Kenya Uganda 65 [Kenya, Uganda]
Here you convert each array to a tuple, so it is hashable and can be used for aggregation:
def converter(team_array):
    return (*team_array, )
df1['sorted_list_team'] = df1['sorted_list_team'].apply(converter)
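A side note on why this conversion is needed: numpy arrays and lists are unhashable, so they cannot serve as groupby keys. The built-in tuple does the same job as the custom converter:
# tuple() accepts any iterable, so no custom converter is strictly required
df1['sorted_list_team'] = df1['sorted_list_team'].apply(tuple)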
Do the aggregation, summing the 'how_many' values into another dataframe that I call 'df_sum':
df_sum = df1.groupby(['sorted_list_team']).agg({'how_many':'sum'}).reset_index()
sorted_list_team how_many
0 (Argentina, Brazil) 106
1 (Argentina, Chile) 53
2 (Argentina, Paraguay) 64
3 (Argentina, Uruguay) 177
4 (Austria, Hungary) 137
Then merge with df1 to attach the summed values. The column 'how_many' exists in both dataframes, so pandas renames the one coming from df_sum to 'how_many_y':
df1 = pd.merge(df1,df_sum[['sorted_list_team','how_many']], on='sorted_list_team',how='left').drop_duplicates()
As a final step, select only the columns you need from the resulting dataframe:
df1 = df1[['home_team','away_team','how_many_y']]
df1 = df1.drop_duplicates()
df1.head()
home_team away_team how_many_y
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 65
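A merge-free variant of the same idea passes an order-insensitive key Series straight to groupby and broadcasts the sums with transform, so no df_sum or merge step is needed (a sketch, assuming df1 as above):
import pandas as pd
# order-insensitive key per row, then broadcast the pair total back
key = pd.Series([tuple(sorted(p)) for p in zip(df1['home_team'], df1['away_team'])],
                index=df1.index)
df1['how_many_total'] = df1.groupby(key)['how_many'].transform('sum')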
I found a relatively straightforward approach that hopefully does what you want, though it is slightly different from your desired output. Your output contains repetitive information: once we no longer care about home vs. away, we just want the game counts, so let's get rid of that distinction (if we can...).
If we make a new column that combines the values from home_team and away_team in the same order each time, we can just sum how_many where that new column matches:
df['teams'] = pd.Series(map('-'.join,np.sort(df[['home_team','away_team']],axis=1)))
# this creates values like 'Argentina-Brazil' and 'Chile-Peru'
df[['how_many','teams']].groupby('teams').sum()
This code gave me the following:
how_many
teams
Argentina-Brazil 106
Argentina-Chile 53
Argentina-Paraguay 64
Argentina-Uruguay 177
Austria-Hungary 137
Belgium-Netherlands 125
Brazil-Chile 48
Brazil-Paraguay 58
Brazil-Uruguay 47
Chile-Peru 46
Chile-Uruguay 49
Denmark-Sweden 107
England-Northern Ireland 99
England-Scotland 117
England-Wales 104
Kenya-Uganda 65
Northern Ireland-Scotland 52
Northern Ireland-Wales 50
Norway-Sweden 107
Scotland-Wales 106
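If you do want those totals attached back to the original rows, with both orderings preserved as in your desired output, groupby.transform can broadcast the sums (continuing from the df with the teams column above):
# broadcast the per-pair total onto every original row
df['how_many_total'] = df.groupby('teams')['how_many'].transform('sum')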
So I have a df that looks like this:
Year  code   Country  Quan1jan  Quan2jan  Quan1feb  Quan2feb
2020  08123  Japan    500       26        400       28
2020  08123  Taiwan   450       245       4500      87
And I would like for it to look like this:
Year  month  code   Country  Quan1  Quan2
2020  jan    08123  Japan    500    26
2020  feb    08123  Japan    400    28
2020  jan    08123  Taiwan   450    245
2020  feb    08123  Taiwan   4500   87
It doesn’t matter if the data follows this same order, but I need it to be in this format.
I've tried to play around with melt and unstack, with no luck. Any help is very much appreciated.
Use wide_to_long:
pd.wide_to_long(
    df,
    ['Quan1', 'Quan2'],
    i=['Year', 'code', 'Country'],
    j='month',
    suffix=r'\w+'
).reset_index()
# Year code Country month Quan1 Quan2
#0 2020 8123 Japan jan 500 26
#1 2020 8123 Japan feb 400 28
#2 2020 8123 Taiwan jan 450 245
#3 2020 8123 Taiwan feb 4500 87
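If wide_to_long feels opaque, the same reshape can be approximated with melt plus a split of the variable names (a sketch assuming the column layout above; pivot_table does the final widening, so numeric dtypes may come back as floats):
# melt to long form, then split e.g. 'Quan1jan' into ('Quan1', 'jan')
long_df = df.melt(id_vars=['Year', 'code', 'Country'])
long_df[['quan', 'month']] = long_df['variable'].str.extract(r'(Quan\d)(\w+)')
out = (long_df.pivot_table(index=['Year', 'code', 'Country', 'month'],
                           columns='quan', values='value')
              .reset_index())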
df1:
Id Country P_Type Sales
102 Portugal Industries 1265
163 Portugal Office 1455
111 Portugal Clubs 1265
164 Portugal cars 1751
109 India House_hold 1651
104 India Office 1125
124 India Bakery 1752
112 India House_hold 1259
105 Germany Industries 1451
103 Germany Office 1635
103 Germany Clubs 1520
103 Germany cars 1265
df2:
Id Market Products Expenditure
123 Portugal ALL Wine 5642
136 Portugal St Wine 4568
158 India QA Housing 4529
168 India stm Housing 1576
749 Germany all Sports 4587
759 Germany sts Sports 4756
Output df:
Id Country P_Type Sales
102 Portugal Industries 1265
102 Portugal ALL Wine 5642
102 Portugal St Wine 4568
163 Portugal Office 1455
111 Portugal Clubs 1265
164 Portugal cars 1751
109 India House_hold 1651
109 India QA Housing 4529
109 India stm Housing 1576
104 India Office 1125
124 India Bakery 1752
112 India House_hold 1259
105 Germany Industries 1451
105 Germany all Sports 4587
105 Germany sts Sports 4756
103 Germany Office 1635
103 Germany Clubs 1520
103 Germany cars 1265
I need to append two dataframes, but the rows from df2 should be inserted at specific locations in df1.
For example, in df2 the Market column of the first two rows belongs to Portugal, and in df1 the first Portugal row has Id 102; those df2 rows should be appended after that first Portugal row and take the same Id.
The same goes for the other countries.
I think I would do it by creating a pseudo sort key, like this:
# placeholder sortkey, so df1.columns[:-1] below matches df2's width
df1['sortkey'] = df1['Country'].duplicated()
# align df2's column names with df1's original columns
df2 = df2.set_axis(df1.columns[:-1], axis=1)
# 0 = first row per country in df1, 2 = the remaining df1 rows
df1['sortkey'] = df1['Country'].duplicated().replace({True: 2, False: 0})
# df2 rows get sortkey 1, so within each country they sort between the
# first (0) and remaining (2) df1 rows; the sort key callable compares
# only the first word, so 'Portugal ALL' groups with 'Portugal'
df_sorted = (pd.concat([df1, df2.assign(sortkey=1)])
             .sort_values(['Country', 'sortkey'],
                          key=lambda x: x.astype(str).str.split(' ').str[0]))
# give every row of a country the Id of that country's first row
df_sorted['Id'] = df_sorted.groupby(df_sorted['Country'].str.split(' ').str[0])['Id'].transform('first')
print(df_sorted.drop('sortkey', axis=1))
Output:
Id Country P_Type Sales
8 105 Germany Industries 1451
4 105 Germany all Sports 4587
5 105 Germany sts Sports 4756
9 105 Germany Office 1635
10 105 Germany Clubs 1520
11 105 Germany cars 1265
4 109 India House_hold 1651
2 109 India QA Housing 4529
3 109 India stm Housing 1576
5 109 India Office 1125
6 109 India Bakery 1752
7 109 India House_hold 1259
0 102 Portugal Industries 1265
0 102 Portugal ALL Wine 5642
1 102 Portugal St Wine 4568
1 102 Portugal Office 1455
2 102 Portugal Clubs 1265
3 102 Portugal cars 1751
Note: this requires pandas 1.1.0+ for the key parameter of the sort_values method.
from itertools import chain
import numpy as np
import pandas as pd
# ensure the columns match for both dataframes
# (naming note: here df plays the role of the question's df1,
#  and df1 plays the role of the question's df2)
df1.columns = df.columns
# the Id from the first dataframe takes precedence, so we convert
# the Id in df1 to null
df1.Id = np.nan
# here we iterate through the groups of df:
# we take the first row of each group,
# then the rows from df1 for that particular group,
# then the rows from 1 to the end of the group;
# flatten the data with itertools.chain,
# concatenate, and fill down the null values in the Id column
merger = ((value.iloc[[0]],
           df1.loc[df1.Country.str.split().str[0].isin(value.Country)],
           value.iloc[1:])
          for key, value in df.groupby("Country", sort=False))
merger = chain.from_iterable(merger)
merger = pd.concat(merger, ignore_index=True).ffill().astype({"Id": "Int16"})
merger.head()
merger.head()
Id Country P_Type Sales
0 102 Portugal Industries 1265
1 102 Portugal ALL Wine 5642
2 102 Portugal St Wine 4568
3 163 Portugal Office 1455
4 111 Portugal Clubs 1265
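The pattern above leans on chain.from_iterable to flatten each (first row, matching df1 rows, remaining rows) triple into a single stream of frames for concat; a tiny standalone illustration of that flattening:
from itertools import chain

triples = [("a1", "b1", "c1"), ("a2", "b2", "c2")]
# one level of nesting is removed: a1, b1, c1, a2, b2, c2
print(list(chain.from_iterable(triples)))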
df2.rename(columns = {'Market':'Country','Products':'P_Type','Expenditure':'Sales'}, inplace = True)
def Insert_row(row_number, df, row_value):
    # Starting value of upper half
    start_upper = 0
    # End value of upper half
    end_upper = row_number
    # Start value of lower half
    start_lower = row_number
    # End value of lower half
    end_lower = df.shape[0]
    # Create a list of upper_half index labels
    upper_half = [*range(start_upper, end_upper, 1)]
    # Create a list of lower_half index labels
    lower_half = [*range(start_lower, end_lower, 1)]
    # Increment the value of lower half by 1
    lower_half = [x + 1 for x in lower_half]
    # Combine the two lists
    index_ = upper_half + lower_half
    # Update the index of the dataframe
    df.index = index_
    # Insert a row at the freed-up label
    df.loc[row_number] = row_value
    # Sort the index labels
    df = df.sort_index()
    # return the dataframe
    return df
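Insert_row splices a row in at a positional index by renumbering the index labels around it; a quick standalone check on a hypothetical toy frame:
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3]})
# insert the row [99] at position 2, i.e. between the 2 and the 3
print(Insert_row(2, toy, [99]))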
def proper_plc(index_2):
    # find the first df1 row whose Country is contained in the current
    # df2 Country value (ids, set by the loop below)
    index_1 = 0
    for ids1 in df1.Country:
        if ids1 in ids:
            break
        index_1 += 1
    # copy the df2 row and give it the Id of the matched df1 row
    abc = list(df2.loc[index_2])
    abc[0] = list(df1.loc[index_1])[0]
    return Insert_row(index_1 + 1, df1, abc)

index_2 = 0
for ids in df2.Country:
    df1 = proper_plc(index_2)
    index_2 += 1
Here is my problem:
Below you will find a pandas DataFrame. I would like to group by Date and then filter within the subgroups, but I am having a lot of difficulty with this (I've spent 3 hours on it and haven't found any solution).
This is what I am looking for:
I first have to group everything by date, then sort the scores from highest to lowest within each subgroup, and then select the two best scores, but they have to be from different countries.
(For example, if the two best are from the same country, then we select the next-highest score from a country different from the first.)
This is the DataFrame:
Date Name Score Country
2012 Paul 65 France
2012 Silvia 81 Italy
2012 David 80 UK
2012 Alphonse 46 France
2012 Giovanni 82 Italy
2012 Britney 53 UK
2013 Paul 32 France
2013 Silvia 59 Italy
2013 David 92 UK
2013 Alphonse 68 France
2013 Giovanni 23 Italy
2013 Britney 78 UK
2014 Paul 46 France
2014 Silvia 87 Italy
2014 David 89 UK
2014 Alphonse 76 France
2014 Giovanni 53 Italy
2014 Britney 90 UK
The result I am looking for is something like this:
Date Name Score Country
2012 Giovanni 82 Italy
2012 David 80 UK
2013 David 92 UK
2013 Alphonse 68 France
2014 Britney 90 UK
2014 Silvia 87 Italy
Here is the code that I started :
df = pd.DataFrame(
    {'Date': ["2012", "2012", "2012", "2012", "2012", "2012", "2013", "2013", "2013", "2013", "2013", "2013", "2014", "2014", "2014", "2014", "2014", "2014"],
     'Name': ["Paul", "Silvia", "David", "Alphonse", "Giovanni", "Britney", "Paul", "Silvia", "David", "Alphonse", "Giovanni", "Britney", "Paul", "Silvia", "David", "Alphonse", "Giovanni", "Britney"],
     'Score': [65, 81, 80, 46, 82, 53, 32, 59, 92, 68, 23, 78, 46, 87, 89, 76, 53, 90],
     'Country': ["France", "Italy", "UK", "France", "Italy", "UK", "France", "Italy", "UK", "France", "Italy", "UK", "France", "Italy", "UK", "France", "Italy", "UK"]})
df = df.set_index('Name').groupby('Date')[["Score", "Country"]].apply(lambda g: g.sort_values("Score", ascending=False))
And this is what I get. But as you can see, for example in 2012, the two best scores are from the same country (Italy), so what I still have to do is:
1. Select the max per country for each year
2. Select only the two best scores (and the countries have to be different).
I would be really thankful, because I really don't know how to do it. If somebody has some ideas, please share them :)
PS: please don't hesitate to tell me if this wasn't clear enough.
Use DataFrame.sort_values first, by both columns; then remove duplicates per Date/Country pair with DataFrame.drop_duplicates; and finally select the top values per group with GroupBy.head. Since drop_duplicates keeps the first occurrence and the sort is descending by Score, each country's best score survives:
df1 = (df.sort_values(['Date', 'Score'], ascending=[True, False])
         .drop_duplicates(['Date', 'Country'])
         .groupby('Date')
         .head(2))
print(df1)
Date Name Score Country
4 2012 Giovanni 82 Italy
2 2012 David 80 UK
8 2013 David 92 UK
9 2013 Alphonse 68 France
17 2014 Britney 90 UK
13 2014 Silvia 87 Italy
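For comparison, an equivalent spelling first takes the best row per (Date, Country) via idxmax, then the two best countries per Date (a sketch on the same df, before the set_index/groupby above; it matches the output here up to row order):
# best row per (Date, Country), then the two best countries per Date
best = df.loc[df.groupby(['Date', 'Country'])['Score'].idxmax()]
df1 = (best.sort_values(['Date', 'Score'], ascending=[True, False])
           .groupby('Date')
           .head(2))
print(df1)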