How to append two dataframes in pandas - python

df1:
Id Country P_Type Sales
102 Portugal Industries 1265
163 Portugal Office 1455
111 Portugal Clubs 1265
164 Portugal cars 1751
109 India House_hold 1651
104 India Office 1125
124 India Bakery 1752
112 India House_hold 1259
105 Germany Industries 1451
103 Germany Office 1635
103 Germany Clubs 1520
103 Germany cars 1265
df2:
Id Market Products Expenditure
123 Portugal ALL Wine 5642
136 Portugal St Wine 4568
158 India QA Housing 4529
168 India stm Housing 1576
749 Germany all Sports 4587
759 Germany sts Sports 4756
Output df:
Id Country P_Type Sales
102 Portugal Industries 1265
102 Portugal ALL Wine 5642
102 Portugal St Wine 4568
163 Portugal Office 1455
111 Portugal Clubs 1265
164 Portugal cars 1751
109 India House_hold 1651
109 India QA Housing 4529
109 India stm Housing 1576
104 India Office 1125
124 India Bakery 1752
112 India House_hold 1259
105 Germany Industries 1451
105 Germany all Sports 4587
105 Germany sts Sports 4756
103 Germany Office 1635
103 Germany Clubs 1520
103 Germany cars 1265
I need to append two dataframes, but the rows from df2 should be appended at specific locations in df1.
For example, the first two rows of df2 have "Market" values belonging to Portugal, and in df1 the first Portugal row has Id 102, so those df2 rows should be appended right after that first Portugal row and take the same Id.
The same applies to the other countries.

I think I would do it by creating a pseudo sort key like this:
df1['sortkey'] = df1['Country'].duplicated().replace({True: 2, False: 0})
df2 = df2.set_axis(df1.columns[:-1], axis=1)
df_sorted = (pd.concat([df1, df2.assign(sortkey=1)])
               .sort_values(['Country', 'sortkey'],
                            key=lambda x: x.astype(str).str.split(' ').str[0]))
df_sorted['Id'] = df_sorted.groupby(df_sorted['Country'].str.split(' ').str[0])['Id'].transform('first')
print(df_sorted.drop('sortkey', axis=1))
Output:
Id Country P_Type Sales
8 105 Germany Industries 1451
4 105 Germany all Sports 4587
5 105 Germany sts Sports 4756
9 105 Germany Office 1635
10 105 Germany Clubs 1520
11 105 Germany cars 1265
4 109 India House_hold 1651
2 109 India QA Housing 4529
3 109 India stm Housing 1576
5 109 India Office 1125
6 109 India Bakery 1752
7 109 India House_hold 1259
0 102 Portugal Industries 1265
0 102 Portugal ALL Wine 5642
1 102 Portugal St Wine 4568
1 102 Portugal Office 1455
2 102 Portugal Clubs 1265
3 102 Portugal cars 1751
Note: uses the key parameter of sort_values, which requires pandas >= 1.1.0.

from itertools import chain
import numpy as np
import pandas as pd

# ensure the columns match for both dataframes
df2.columns = df1.columns
# the Id from df1 takes precedence, so convert
# the Id in df2 to null
df2.Id = np.nan
# iterate through the country groups of df1:
# take the first row of each group,
# then the rows from df2 for that particular group,
# then the remaining rows of the group;
# flatten the pieces with itertools' chain,
# concatenate, and fill down the null values in the Id column
merger = ((value.iloc[[0]],
           df2.loc[df2.Country.str.split().str[0].isin(value.Country)],
           value.iloc[1:])
          for key, value in df1.groupby("Country", sort=False))
merger = chain.from_iterable(merger)
merger = pd.concat(merger, ignore_index=True).ffill().astype({"Id": "Int16"})
merger.head()
Id Country P_Type Sales
0 102 Portugal Industries 1265
1 102 Portugal ALL Wine 5642
2 102 Portugal St Wine 4568
3 163 Portugal Office 1455
4 111 Portugal Clubs 1265

df2.rename(columns={'Market': 'Country', 'Products': 'P_Type', 'Expenditure': 'Sales'}, inplace=True)

def Insert_row(row_number, df, row_value):
    # indices of the rows above the insertion point stay as they are
    upper_half = [*range(0, row_number)]
    # indices of the rows below shift down by one to make room
    lower_half = [x + 1 for x in range(row_number, df.shape[0])]
    # re-index the dataframe with the combined lists
    df.index = upper_half + lower_half
    # insert the new row at the freed-up label and restore index order
    df.loc[row_number] = row_value
    return df.sort_index()

def proper_plc(index_2, country):
    # find the first row in df1 whose Country appears in the df2 value
    index_1 = 0
    for ids1 in df1.Country:
        if ids1 in country:
            break
        index_1 += 1
    # take the df2 row and give it the Id of the matched df1 row
    row = list(df2.loc[index_2])
    row[0] = list(df1.loc[index_1])[0]
    return Insert_row(index_1 + 1, df1, row)

index_2 = 0
for ids in df2.Country:
    df1 = proper_plc(index_2, ids)
    index_2 += 1

Why does my replace method not work with the string [B]?

I have the following dataset called world_top_ten:
`
Most populous countries 2000 2015 2030[A]
0 China[B] 1270 1376 1416
1 India 1053 1311 1528
2 United States 283 322 356
3 Indonesia 212 258 295
4 Pakistan 136 208 245
5 Brazil 176 206 228
6 Nigeria 123 182 263
7 Bangladesh 131 161 186
8 Russia 146 146 149
9 Mexico 103 127 148
10 World total 6127 7349 8501
`
I am trying to replace the [B] with "":
world_top_ten['Most populous countries'].str.replace(r'"[B]"', '')
And it is returning me:
0 China[B]
1 India
2 United States
3 Indonesia
4 Pakistan
5 Brazil
6 Nigeria
7 Bangladesh
8 Russia
9 Mexico
10 World total
Name: Most populous countries, dtype: object
What am I doing wrong here?
Because [ and ] are special regex characters, escape them:
world_top_ten['Most populous countries'].str.replace(r'\[B\]', '', regex=True)
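As a self-contained check (the three-row sample below is made up, standing in for the world_top_ten column), comparing the original pattern with the escaped one:

```python
import pandas as pd

# a made-up sample standing in for world_top_ten['Most populous countries']
s = pd.Series(['China[B]', 'India', 'United States'])
# the original pattern searches for a quote, the letter B, and a quote
# ("B"), which never occurs, so nothing is replaced
unchanged = s.str.replace(r'"[B]"', '', regex=True)
# escaping the brackets matches the literal three characters [B]
cleaned = s.str.replace(r'\[B\]', '', regex=True)
print(unchanged.tolist())  # ['China[B]', 'India', 'United States']
print(cleaned.tolist())    # ['China', 'India', 'United States']
```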

I have the number of matches, but only home matches for a given team. How can I sum the values for duplicate pairs?

My dataframe contains the number of matches for given fixtures, but only home matches for a given team (i.e. the number of Argentina-Uruguay matches is 97, but for Uruguay-Argentina the number is 80). In short, I want to sum both numbers of home matches for the teams concerned, so that I have the total number of matches between them. The dataframe's top 30 rows look like this:
most_often = mc.groupby(["home_team", "away_team"]).size().reset_index(name="how_many").sort_values(by=['how_many'], ascending = False)
most_often = most_often.reset_index(drop=True)
most_often.head(30)
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
5 Argentina Paraguay 64
6 Belgium Netherlands 63
7 Netherlands Belgium 62
8 England Scotland 59
9 Argentina Brazil 58
10 Brazil Paraguay 58
11 Scotland England 58
12 Norway Sweden 56
13 England Wales 54
14 Sweden Denmark 54
15 Wales Scotland 54
16 Denmark Sweden 53
17 Argentina Chile 53
18 Scotland Wales 52
19 Scotland Northern Ireland 52
20 Sweden Norway 51
21 Wales England 50
22 England Northern Ireland 50
23 Wales Northern Ireland 50
24 Chile Uruguay 49
25 Northern Ireland England 49
26 Brazil Argentina 48
27 Brazil Chile 48
28 Brazil Uruguay 47
29 Chile Peru 46
Instead, I want something like this:
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 107
5 Uganda Kenya 107
6 Belgium Netherlands 125
7 Netherlands Belgium 125
But this is only an example, I want to apply it for every team, which I have on dataframe.
What should I do?
OK, you can follow the steps below.
Here is the initial df.
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
Here you need to create a sorted list that will be the key for aggregations.
df1['sorted_list_team'] = list(zip(df1['home_team'],df1['away_team']))
df1['sorted_list_team'] = df1['sorted_list_team'].apply(lambda x: np.sort(np.unique(x)))
home_team away_team how_many sorted_list_team
0 Argentina Uruguay 97 [Argentina, Uruguay]
1 Uruguay Argentina 80 [Argentina, Uruguay]
2 Austria Hungary 69 [Austria, Hungary]
3 Hungary Austria 68 [Austria, Hungary]
4 Kenya Uganda 65 [Kenya, Uganda]
Here you convert the list to a tuple, so it is hashable and can be used as a groupby key.
def converter(team_list):
    return (*team_list, )
df1['sorted_list_team'] = df1['sorted_list_team'].apply(converter)
Then do the aggregation to sum the 'how_many' values into another dataframe, called 'df_sum'.
df_sum = df1.groupby(['sorted_list_team']).agg({'how_many':'sum'}).reset_index()
sorted_list_team how_many
0 (Argentina, Brazil) 106
1 (Argentina, Chile) 53
2 (Argentina, Paraguay) 64
3 (Argentina, Uruguay) 177
4 (Austria, Hungary) 137
Finally, merge with 'df1' to get the summed result. The column 'how_many' exists in both dfs, so pandas renames df_sum's column to 'how_many_y':
df1 = pd.merge(df1,df_sum[['sorted_list_team','how_many']], on='sorted_list_team',how='left').drop_duplicates()
And final step you need select only columns that you need from result df.
df1 = df1[['home_team','away_team','how_many_y']]
df1 = df1.drop_duplicates()
df1.head()
home_team away_team how_many_y
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 65
I found a relatively straightforward approach that hopefully does what you want, though the output is slightly different from your desired one. Your desired output repeats information: once we no longer care about home-vs-away, we just want the game counts, so let's drop that distinction (if we can...).
If we make a new column that combines the values from home_team and away_team in the same order each time, we can just sum how_many wherever that new column matches:
df['teams'] = pd.Series(map('-'.join,np.sort(df[['home_team','away_team']],axis=1)))
# this creates values like 'Argentina-Brazil' and 'Chile-Peru'
df[['how_many','teams']].groupby('teams').sum()
This code gave me the following:
how_many
teams
Argentina-Brazil 106
Argentina-Chile 53
Argentina-Paraguay 64
Argentina-Uruguay 177
Austria-Hungary 137
Belgium-Netherlands 125
Brazil-Chile 48
Brazil-Paraguay 58
Brazil-Uruguay 47
Chile-Peru 46
Chile-Uruguay 49
Denmark-Sweden 107
England-Northern Ireland 99
England-Scotland 117
England-Wales 104
Kenya-Uganda 65
Northern Ireland-Scotland 52
Northern Ireland-Wales 50
Norway-Sweden 107
Scotland-Wales 106
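Both answers above collapse each pair to a single row. If you want the asker's exact shape instead, keeping both orderings with the summed total on each row, a frozenset key plus transform('sum') is one sketch (the small frame below is a made-up subset of the data):

```python
import pandas as pd

# a made-up four-row subset of the asker's data
df = pd.DataFrame({
    'home_team': ['Argentina', 'Uruguay', 'Austria', 'Hungary'],
    'away_team': ['Uruguay', 'Argentina', 'Hungary', 'Austria'],
    'how_many': [97, 80, 69, 68],
})
# a frozenset is order-insensitive, so (A, B) and (B, A) share one key
key = df[['home_team', 'away_team']].apply(frozenset, axis=1)
# transform('sum') writes the group total back onto every original row,
# so both orderings survive with the combined count
df['total'] = df.groupby(key)['how_many'].transform('sum')
print(df)
```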

Transform dataframe values with multilevel indices to single column

I would like to ask your advice, please.
How can I transform the first dataframe into the second one, below?
Continent, Country and Location are the names of the column index levels.
Polution_Level would be added as the column name for the values shown in the first dataframe.
Continent Asia Asia Africa Europe
Country Japan China Mozambique Portugal
Location Tokyo Shanghai Maputo Lisbon
Date
01 Jan 20 250 435 45 137
02 Jan 20 252 457 43 144
03 Jan 20 253 463 42 138
Continent Country Location Date Polution_Level
Asia Japan Tokyo 01 Jan 20 250
Asia Japan Tokyo 02 Jan 20 252
Asia Japan Tokyo 03 Jan 20 253
...
Europe Portugal Lisbon 03 Jan 20 138
Thank you.
The following should do what you want.
Modules
import io
import pandas as pd
Create data
df = pd.read_csv(io.StringIO("""
Continent Asia Asia Africa Europe
Country Japan China Mozambique Portugal
Location Tokyo Shanghai Maputo Lisbon
Date
01 Jan 20 250 435 45 137
02 Jan 20 252 457 43 144
03 Jan 20 253 463 42 138
"""), sep=r"\s\s+", engine="python", header=[0,1,2], index_col=[0])
Verify multiindex
df.columns
MultiIndex([( 'Asia', 'Japan', 'Tokyo'),
( 'Asia', 'China', 'Shanghai'),
('Africa', 'Mozambique', 'Maputo'),
('Europe', 'Portugal', 'Lisbon')],
names=['Continent', 'Country', 'Location'])
Transpose table and stack values
ndf = df.T.stack().reset_index()
ndf = ndf.rename({0: 'Polution_Level'}, axis=1)
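For a copy-paste check without the text parsing, the same reshape can be sketched by building the MultiIndex frame directly (values taken from the question):

```python
import pandas as pd

# rebuild the question's frame with a 3-level column MultiIndex
cols = pd.MultiIndex.from_tuples(
    [('Asia', 'Japan', 'Tokyo'), ('Asia', 'China', 'Shanghai'),
     ('Africa', 'Mozambique', 'Maputo'), ('Europe', 'Portugal', 'Lisbon')],
    names=['Continent', 'Country', 'Location'])
df = pd.DataFrame([[250, 435, 45, 137],
                   [252, 457, 43, 144],
                   [253, 463, 42, 138]],
                  index=pd.Index(['01 Jan 20', '02 Jan 20', '03 Jan 20'],
                                 name='Date'),
                  columns=cols)
# transpose so Continent/Country/Location become the row index,
# stack the dates into rows, and name the value column
ndf = df.T.stack().reset_index().rename(columns={0: 'Polution_Level'})
print(ndf.head(3))
```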

Pandas dataframe: split one column's data into 2 using a condition

I have one dataframe, shown below:
0
____________________________________
0 Country| India
60 Delhi
62 Mumbai
68 Chennai
75 Country| Italy
78 Rome
80 Venice
85 Milan
88 Country| Australia
100 Sydney
103 Melbourne
107 Perth
I want to split the data into 2 columns so that one column contains the country and the other the city. I have no idea where to start. I want something like below:
0 1
____________________________________
0 Country| India Delhi
1 Country| India Mumbai
2 Country| India Chennai
3 Country| Italy Rome
4 Country| Italy Venice
5 Country| Italy Milan
6 Country| Australia Sydney
7 Country| Australia Melbourne
8 Country| Australia Perth
Any idea how to do this?
Look for rows where | is present, pull them into another column, and fill down on the newly created column:
(
df.rename(columns={"0": "city"})
# this looks for rows that contain '|' and puts them into a
# new column called Country. rows that do not match will be
# null in the new column.
.assign(Country=lambda x: x.loc[x.city.str.contains(r"\|"), "city"])
# fill down on the Country column, this also has the benefit
# of linking the Country with the City,
.ffill()
# here we get rid of duplicate Country entries in city and Country
# this ensures that only Country entries are in the Country column
# and cities are in the City column
.query("city != Country")
# here we reverse the column positions to match your expected output
.iloc[:, ::-1]
)
Country city
60 Country| India Delhi
62 Country| India Mumbai
68 Country| India Chennai
78 Country| Italy Rome
80 Country| Italy Venice
85 Country| Italy Milan
100 Country| Australia Sydney
103 Country| Australia Melbourne
107 Country| Australia Perth
Use DataFrame.insert with Series.where and Series.str.startswith to replace non-matching values with missing values, forward fill them with ffill, and then remove rows whose values are equal in both columns, using Series.ne (not equal) in boolean indexing:
df.insert(0, 'country', df[0].where(df[0].str.startswith('Country')).ffill())
df = df[df['country'].ne(df[0])].reset_index(drop=True).rename(columns={0:'city'})
print (df)
country city
0 Country|India Delhi
1 Country|India Mumbai
2 Country|India Chennai
3 Country|Italy Rome
4 Country|Italy Venice
5 Country|Italy Milan
6 Country|Australia Sydney
7 Country|Australia Melbourne
8 Country|Australia Perth
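Both answers rely on the same fill-down idea; here is a compact, self-contained sketch of it (using a made-up six-row subset of the question's data):

```python
import pandas as pd

# made-up subset: country headers contain '|', everything else is a city
df = pd.DataFrame({0: ['Country| India', 'Delhi', 'Mumbai',
                       'Country| Italy', 'Rome', 'Venice']})
# mark the header rows (regex=False so '|' is a literal character)
is_country = df[0].str.contains('|', regex=False)
# keep headers, blank out cities, then fill the headers downward
df['country'] = df[0].where(is_country).ffill()
# drop the header rows themselves; what remains are city rows
out = df[~is_country].rename(columns={0: 'city'}).reset_index(drop=True)
print(out[['country', 'city']])
```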

How to filter a transposed pandas dataframe?

Say I have a transposed df like so
id 0 1 2 3
0 1361 Spain Russia South Africa China
1 1741 Portugal Cuba UK Ukraine
2 1783 Germany USA France Egypt
3 1353 Brazil Russia Japan Kenya
4 1458 India Romania Holland Nigeria
How could I get all rows where 'er' is present, so that it returns this:
id 0 1 2 3
2 1783 Germany USA France Egypt
4 1458 India Romania Holland Nigeria
because 'er' is contained in Germany and Nigeria.
Thanks!
Using contains on the string columns only (the integer id column has no .str accessor, so drop it first):
df[df.drop(columns='id').apply(lambda x: x.str.contains(pat='er')).any(axis=1)]
Out[96]:
id 0 1 2 3
2 1783 Germany USA France Egypt
4 1458 India Romania Holland Nigeria
Use apply + str.contains across rows:
df = df[df.apply(lambda x: x.str.contains('er').any(), axis=1)]
print(df)
id 0 1 2 3
2 1783 Germany USA France Egypt
4 1458 India Romania Holland Nigeria
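An apply-free variant is to stack the string columns into one Series, run str.contains once, and collapse the result back per row (the sketch below rebuilds the question's frame; column names are assumed as shown):

```python
import pandas as pd

df = pd.DataFrame({'id': [1361, 1741, 1783, 1353, 1458],
                   0: ['Spain', 'Portugal', 'Germany', 'Brazil', 'India'],
                   1: ['Russia', 'Cuba', 'USA', 'Russia', 'Romania'],
                   2: ['South Africa', 'UK', 'France', 'Japan', 'Holland'],
                   3: ['China', 'Ukraine', 'Egypt', 'Kenya', 'Nigeria']})
# stack flattens the string columns into one long Series, so a single
# str.contains call covers every cell; level 0 of the index is the row,
# so groupby(level=0).any() says whether any cell in that row matched
mask = (df.drop(columns='id').stack()
          .str.contains('er')
          .groupby(level=0).any())
print(df[mask])
```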
