Drop redundant rows for group in pandas - python

I have the following DataFrame:
import pandas as pd

data = {'id': ['A','A','A','A','A','A',
               'A','A','A','A','A','A',
               'B','B','B','B','B','B',
               'C','C','C','C','C','C',
               'D','D','D','D','D','D'],
        'city': ['London','London','London','London','London','London',
                 'New York','New York','New York','New York','New York','New York',
                 'Milan','Milan','Milan','Milan','Milan','Milan',
                 'Paris','Paris','Paris','Paris','Paris','Paris',
                 'Berlin','Berlin','Berlin','Berlin','Berlin','Berlin'],
        'year': [2000, 2001, 2002, 2003, 2004, 2005,
                 2000, 2001, 2002, 2003, 2004, 2005,
                 2000, 2001, 2002, 2003, 2004, 2005,
                 2000, 2001, 2002, 2003, 2004, 2005,
                 2000, 2001, 2002, 2003, 2004, 2005],
        't': [0,0,0,0,1,0,
              0,0,0,0,0,1,
              0,0,0,0,0,0,
              0,0,1,0,0,0,
              0,0,0,0,1,0]}
df = pd.DataFrame(data)
For each id - city group, I need to drop the rows with higher years after t=1. For instance, id = A is in London in year=2004 (t=1), so I want to drop the row for the group A - London where year=2005. Please note that if t never equals 1 for a group over 2000-2005, I want to keep all of its rows (see, for instance, id = B in Milan).
The desired output:
import pandas as pd

data = {'id': ['A','A','A','A','A',
               'A','A','A','A','A','A',
               'B','B','B','B','B','B',
               'C','C','C',
               'D','D','D','D','D'],
        'city': ['London','London','London','London','London',
                 'New York','New York','New York','New York','New York','New York',
                 'Milan','Milan','Milan','Milan','Milan','Milan',
                 'Paris','Paris','Paris',
                 'Berlin','Berlin','Berlin','Berlin','Berlin'],
        'year': [2000, 2001, 2002, 2003, 2004,
                 2000, 2001, 2002, 2003, 2004, 2005,
                 2000, 2001, 2002, 2003, 2004, 2005,
                 2000, 2001, 2002,
                 2000, 2001, 2002, 2003, 2004],
        't': [0,0,0,0,1,
              0,0,0,0,0,1,
              0,0,0,0,0,0,
              0,0,1,
              0,0,0,0,1]}
df = pd.DataFrame(data)

The idea is to use a cumulative sum per group, but the values need to be shifted first so that the row with t=1 itself is kept; then boolean indexing removes all rows after the first 1:
# if years are not sorted within groups, sort first:
# df = df.sort_values(['id', 'city', 'year'])
df = df[~df.groupby(['id', 'city'])['t'].transform(lambda x: x.shift().cumsum()).gt(0)]
print(df)
   id      city  year  t
0   A    London  2000  0
1   A    London  2001  0
2   A    London  2002  0
3   A    London  2003  0
4   A    London  2004  1
6   A  New York  2000  0
7   A  New York  2001  0
8   A  New York  2002  0
9   A  New York  2003  0
10  A  New York  2004  0
11  A  New York  2005  1
12  B     Milan  2000  0
13  B     Milan  2001  0
14  B     Milan  2002  0
15  B     Milan  2003  0
16  B     Milan  2004  0
17  B     Milan  2005  0
18  C     Paris  2000  0
19  C     Paris  2001  0
20  C     Paris  2002  1
24  D    Berlin  2000  0
25  D    Berlin  2001  0
26  D    Berlin  2002  0
27  D    Berlin  2003  0
28  D    Berlin  2004  1
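An equivalent spelling without the lambda (a sketch, starting from the original df and again assuming years are sorted within groups): the per-group cumulative sum minus the current t counts how many 1s appear in earlier rows, so keeping the rows where that count is zero retains everything up to and including the first t=1:
# number of t=1 rows seen *before* the current row within each id-city group
prior_ones = df.groupby(['id', 'city'])['t'].cumsum() - df['t']
print(df[prior_ones.eq(0)])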

Related

Pandas dataframe mapping one value to another in a row

I have two pandas dataframes: one with columns 'Name', 'Year' and 'Currency Rate', and the other with a 'Name' column followed by a long list of year columns (1950, 1951, ..., 2019, 2020).
The 'Year' column stores values 2000, 2001, ..., 2015, and each year column in the second dataframe stores the income for that year.
I want to merge these two dataframes, mapping the income for the year given by the 'Year' column into a new column named 'Income' and dropping all the other year columns. Is there a convenient way to do this?
What I am thinking of is splitting the second dataframe into 16 different years and then joining each to the first dataframe.
UPDATED to reflect OP's comment clarifying the question:
To restate the question based on my updated understanding:
There is a dataframe with columns Name, Year and Currency Rate (3 columns in total). Each row contains a Currency Rate in a given Year for a given Name. Each row is unique by (Name, Year) pair.
There is a second dataframe with columns Name, 1950, 1951, ..., 2020 (72 columns in total). Each cell contains an Income value for the corresponding Name for the row and the Year corresponding to the column name. Each row is unique by Name.
Question: How do we add an Income column to the first dataframe with each row containing the Income value from the second dataframe in the cell with (row, column) corresponding to the (Name, Year) pair of such row in the first dataframe?
Test case assumptions I have made:
Name in the first dataframe is a letter from 'a' to 'n' with some duplicates.
Year in the first dataframe is between 2000 and 2015 (as in the question).
Currency Rate in the first dataframe is arbitrary.
Name in the second dataframe is a letter from 'a' to 'z' (no duplicates).
The values in the second dataframe (which represent Income) are arbitrarily constructed using the ASCII offsets of the characters in the corresponding Name concatenated with the Year of the corresponding column name. This way we can visually "decode" them in the test results to confirm that the value from the correct location in the second dataframe has been loaded into the new Income column in the first dataframe.
table1 = [
    {'Name': 'a', 'Year': 2000, 'Currency Rate': 1.1},
    {'Name': 'b', 'Year': 2001, 'Currency Rate': 1.2},
    {'Name': 'c', 'Year': 2002, 'Currency Rate': 1.3},
    {'Name': 'd', 'Year': 2003, 'Currency Rate': 1.4},
    {'Name': 'e', 'Year': 2004, 'Currency Rate': 1.5},
    {'Name': 'f', 'Year': 2005, 'Currency Rate': 1.6},
    {'Name': 'g', 'Year': 2006, 'Currency Rate': 1.7},
    {'Name': 'h', 'Year': 2007, 'Currency Rate': 1.8},
    {'Name': 'i', 'Year': 2008, 'Currency Rate': 1.9},
    {'Name': 'j', 'Year': 2009, 'Currency Rate': 1.8},
    {'Name': 'k', 'Year': 2010, 'Currency Rate': 1.7},
    {'Name': 'l', 'Year': 2011, 'Currency Rate': 1.6},
    {'Name': 'm', 'Year': 2012, 'Currency Rate': 1.5},
    {'Name': 'm', 'Year': 2013, 'Currency Rate': 1.4},
    {'Name': 'n', 'Year': 2014, 'Currency Rate': 1.3},
    {'Name': 'n', 'Year': 2015, 'Currency Rate': 1.2}
]
# dict merge with | requires Python 3.9+
table2 = [
    {'Name': name} | {str(year): sum(ord(c) - ord('a') + 1 for c in name) * 10000 + year
                      for year in range(1950, 2021)}
    for name in set(['x', 'y', 'z']) | set(row['Name'] for row in table1)
]
import pandas as pd

df1 = pd.DataFrame(table1)
df2 = pd.DataFrame(table2).sort_values(by='Name')
print(df1)
print(df2)
# int() on a one-element Series is deprecated; select the single cell with .iloc[0]
df1['Income'] = df1.apply(
    lambda x: int(df2.loc[df2['Name'] == x['Name'], str(x['Year'])].iloc[0]), axis=1)
print(df1.to_string(index=False))
Output:
   Name  Year  Currency Rate
0     a  2000            1.1
1     b  2001            1.2
2     c  2002            1.3
3     d  2003            1.4
4     e  2004            1.5
5     f  2005            1.6
6     g  2006            1.7
7     h  2007            1.8
8     i  2008            1.9
9     j  2009            1.8
10    k  2010            1.7
11    l  2011            1.6
12    m  2012            1.5
13    m  2013            1.4
14    n  2014            1.3
15    n  2015            1.2
   Name    1950    1951    1952    1953    1954    1955    1956    1957    1958    1959    1960  ...    2009    2010    2011    2012    2013    2014    2015    2016    2017    2018    2019    2020
11    a   11950   11951   11952   11953   11954   11955   11956   11957   11958   11959   11960  ...   12009   12010   12011   12012   12013   12014   12015   12016   12017   12018   12019   12020
15    b   21950   21951   21952   21953   21954   21955   21956   21957   21958   21959   21960  ...   22009   22010   22011   22012   22013   22014   22015   22016   22017   22018   22019   22020
 0    c   31950   31951   31952   31953   31954   31955   31956   31957   31958   31959   31960  ...   32009   32010   32011   32012   32013   32014   32015   32016   32017   32018   32019   32020
10    d   41950   41951   41952   41953   41954   41955   41956   41957   41958   41959   41960  ...   42009   42010   42011   42012   42013   42014   42015   42016   42017   42018   42019   42020
 9    e   51950   51951   51952   51953   51954   51955   51956   51957   51958   51959   51960  ...   52009   52010   52011   52012   52013   52014   52015   52016   52017   52018   52019   52020
 1    f   61950   61951   61952   61953   61954   61955   61956   61957   61958   61959   61960  ...   62009   62010   62011   62012   62013   62014   62015   62016   62017   62018   62019   62020
 4    g   71950   71951   71952   71953   71954   71955   71956   71957   71958   71959   71960  ...   72009   72010   72011   72012   72013   72014   72015   72016   72017   72018   72019   72020
 3    h   81950   81951   81952   81953   81954   81955   81956   81957   81958   81959   81960  ...   82009   82010   82011   82012   82013   82014   82015   82016   82017   82018   82019   82020
 2    i   91950   91951   91952   91953   91954   91955   91956   91957   91958   91959   91960  ...   92009   92010   92011   92012   92013   92014   92015   92016   92017   92018   92019   92020
13    j  101950  101951  101952  101953  101954  101955  101956  101957  101958  101959  101960  ...  102009  102010  102011  102012  102013  102014  102015  102016  102017  102018  102019  102020
14    k  111950  111951  111952  111953  111954  111955  111956  111957  111958  111959  111960  ...  112009  112010  112011  112012  112013  112014  112015  112016  112017  112018  112019  112020
12    l  121950  121951  121952  121953  121954  121955  121956  121957  121958  121959  121960  ...  122009  122010  122011  122012  122013  122014  122015  122016  122017  122018  122019  122020
 5    m  131950  131951  131952  131953  131954  131955  131956  131957  131958  131959  131960  ...  132009  132010  132011  132012  132013  132014  132015  132016  132017  132018  132019  132020
 7    n  141950  141951  141952  141953  141954  141955  141956  141957  141958  141959  141960  ...  142009  142010  142011  142012  142013  142014  142015  142016  142017  142018  142019  142020
 8    x  241950  241951  241952  241953  241954  241955  241956  241957  241958  241959  241960  ...  242009  242010  242011  242012  242013  242014  242015  242016  242017  242018  242019  242020
 6    y  251950  251951  251952  251953  251954  251955  251956  251957  251958  251959  251960  ...  252009  252010  252011  252012  252013  252014  252015  252016  252017  252018  252019  252020
16    z  261950  261951  261952  261953  261954  261955  261956  261957  261958  261959  261960  ...  262009  262010  262011  262012  262013  262014  262015  262016  262017  262018  262019  262020

[17 rows x 72 columns]
Name  Year  Currency Rate  Income
   a  2000            1.1   12000
   b  2001            1.2   22001
   c  2002            1.3   32002
   d  2003            1.4   42003
   e  2004            1.5   52004
   f  2005            1.6   62005
   g  2006            1.7   72006
   h  2007            1.8   82007
   i  2008            1.9   92008
   j  2009            1.8  102009
   k  2010            1.7  112010
   l  2011            1.6  122011
   m  2012            1.5  132012
   m  2013            1.4  132013
   n  2014            1.3  142014
   n  2015            1.2  142015
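A vectorized alternative (my sketch, not part of the answer above, starting from df1 before the Income column is added): melt df2's year columns into long form and merge on the (Name, Year) pair, which avoids the row-wise apply:
# reshape df2 from wide (one column per year) to long (Name, Year, Income)
long2 = df2.melt(id_vars='Name', var_name='Year', value_name='Income')
long2['Year'] = long2['Year'].astype(int)  # the year column names were strings

# a left merge keeps every row of df1 and pulls in the matching Income
merged = df1[['Name', 'Year', 'Currency Rate']].merge(long2, on=['Name', 'Year'], how='left')
print(merged.to_string(index=False))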

Create new fake data with new primary keys from existing dataframe python

I have a dataframe as follows:
df1 = pd.DataFrame({'id': ['1a', '2b', '3c'], 'name': ['Anna', 'Peter', 'John'], 'year': [1999, 2001, 1993]})
I want to create new data by randomly re-arranging the values in each column, but for the id column I also need to add a random letter at the end of each value, then append the new data to the existing df1 as follows:
df1 = pd.DataFrame({'id': ['1a', '2b', '3c', '2by', '1ao', '1az', '3cc'],
                    'name': ['Anna', 'Peter', 'John', 'John', 'Peter', 'Anna', 'Anna'],
                    'year': [1999, 2001, 1993, 1999, 1999, 2001, 2001]})
Could anyone help me, please? Thank you very much.
Use DataFrame.sample and append a random letter with numpy.random.choice:
import string
import numpy as np

N = 5
df2 = (df1.sample(n=N, replace=True)
          .assign(id=lambda x: x['id'] + np.random.choice(list(string.ascii_letters), size=N)))
# DataFrame.append was removed in pandas 2.0; use pd.concat instead
df1 = pd.concat([df1, df2], ignore_index=True)
print(df1)
    id   name  year
0   1a   Anna  1999
1   2b  Peter  2001
2   3c   John  1993
3  1aY   Anna  1999
4  3cp   John  1993
5  3cE   John  1993
6  2bz  Peter  2001
7  3cu   John  1993
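One usage note (my addition, not part of the original answer): both the sampling and the letter choice are random, so your output will differ from the above; pass random_state to sample and seed NumPy if you need reproducible fake rows:
np.random.seed(42)  # makes the random letters repeatable
df2 = (df1.sample(n=N, replace=True, random_state=42)  # makes the sampled rows repeatable
          .assign(id=lambda x: x['id'] + np.random.choice(list(string.ascii_letters), size=N)))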

how to repeat row labels in pandas pivot table function and export it as excel

I have some sample data pivoted using pandas pivot_table():
df = pd.DataFrame([['USA', 'LA', 'April', 2, '1:2'],
                   ['USA', 'FL', 'April', 5, '5:6'],
                   ['USA', 'TX', 'April', 7, '1:3'],
                   ['Canada', 'Ontario', 'April', 2, '1:3'],
                   ['Canada', 'Toronto', 'April', 3, '1:5'],
                   ['USA', 'LA', 'May', 3, '4:5'],
                   ['USA', 'FL', 'May', 6, '4:5'],
                   ['USA', 'TX', 'May', 2, '1:4'],
                   ['Canada', 'Ontario', 'May', 6, '8:9'],
                   ['Canada', 'Toronto', 'May', 9, '3:4']],
                  columns=['Country', 'Cities', 'month', 'Count', 'Ratio'])
mux1 = pd.MultiIndex.from_product([df['month'].unique(), ['Count', 'Ratio']])
# aggfunc='first' keeps the non-numeric Ratio column; the default mean would drop it
data = (df.pivot_table(columns=['month'], values=['Count', 'Ratio'],
                       index=['Country', 'Cities'], aggfunc='first')
          .swaplevel(1, 0, axis=1)
          .reindex(mux1, axis=1))
                April       May
                Count Ratio Count Ratio
Country Cities
USA     LA          2   1:2     3   4:5
        FL          5   5:6     6   4:5
        TX          7   1:3     2   1:4
Canada  Ontario     2   1:3     6   8:9
        Toronto     3   1:5     9   3:4
How could I repeat the row labels in the pivoted data so that it looks like below, and export it to Excel?
                April       May
                Count Ratio Count Ratio
Country Cities
USA     LA          2   1:2     3   4:5
USA     FL          5   5:6     6   4:5
USA     TX          7   1:3     2   1:4
Canada  Ontario     2   1:3     6   8:9
Canada  Toronto     3   1:5     9   3:4
I've tried pd.option_context('display.multi_sparse', False), but it only changes how the frame is displayed; it does not affect the data exported to Excel.
The way I solved it, which may not be the optimal solution, was:
Switching the index order. In your case it would be: index=['Cities', 'Country']
data = df.pivot_table(columns=['month'], values=['Count', 'Ratio'], index=['Cities', 'Country'], aggfunc='first').swaplevel(1, 0, axis=1).reindex(mux1, axis=1)
Then adding data.reset_index().to_excel('file.xlsx', index=False) after building the table actually worked:
cfuper = np.round(pd.pivot_table(pndg_crss, columns=None,
                                 values='Results %',
                                 index=['Email', 'Title', 'Manager']))
cfuper.reset_index().to_excel('test.xlsx', sheet_name='Sheet1', index=False)
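Alternatively (a simpler route, as a sketch): DataFrame.to_excel accepts a merge_cells argument, and merge_cells=False writes the MultiIndex row labels on every row instead of merging them:
# with merge_cells=False the Country label is repeated on each exported row
data.to_excel('file.xlsx', merge_cells=False)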

How to target specific rows for replacement when the order is different between dataframes?

Say I have 2 dataframes that I want to fillna between, but the order of the rows is not exactly the same. Is it possible to target specific columns as part of the mapping?
Here's an example:
import pandas as pd
import numpy as np
data = pd.DataFrame({'year': ['2016', '2016', '2015', '2014', '2013'],
                     'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
                     'rep': ['john', 'john', 'claire', 'kyle', 'kyle']})
dataNew = pd.DataFrame({'year': ['2016', '2016', '2015', '2013', '2014'],
                        'country': ['usa', 'uk', 'fr', 'uk', 'fr'],
                        'sales': [21, 10, 20, 12, 10],
                        'rep': [np.nan, 'john', np.nan, np.nan, 'kyle']})
print(data)
print(dataNew)
print(dataNew.fillna(data))
My output is not right because, if you look, dataNew's country data is not in the same order (uk/usa are swapped, and so are fr/uk at the end). Is there a way to match on year, country and sales before replacing the NaN values in the rep column?
The output I'm looking for is below, with rep filled in like the first dataframe. I'm trying to understand how I could fill the NAs by selecting the matching cells in another df. To keep the question simple, I gave the first dataframe fewer columns, so the question is purely about mapping/searching.
   year country  sales     rep
0  2016      uk     10    john
1  2016     usa     21    john
2  2015      fr     20  claire
3  2014      fr     10    kyle
4  2013      uk     12    kyle
import pandas as pd
import numpy as np
data = pd.DataFrame({'year': ['2016', '2016', '2015', '2014', '2013'],
                     'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
                     'sales': [10, 21, 20, 10, 12],
                     'rep': ['john', 'john', 'claire', 'kyle', 'kyle']})
dataNew = pd.DataFrame({'year': ['2016', '2016', '2015', '2013', '2014'],
                        'country': ['usa', 'uk', 'fr', 'uk', 'fr'],
                        'sales': [21, 10, 20, 12, 10],
                        'rep': [np.nan, 'john', np.nan, np.nan, 'kyle']})

# join on these three columns and get the rep column from the other dataframe
cols = ['year', 'country', 'sales']
dataNew = dataNew.drop(columns='rep').join(data.set_index(cols), on=cols)
print(dataNew)
output:
   year country  sales     rep
0  2016     usa     21    john
1  2016      uk     10    john
2  2015      fr     20  claire
3  2013      uk     12    kyle
4  2014      fr     10    kyle
You can also use a MultiIndex along with fillna():
import pandas as pd
import numpy as np
data = pd.DataFrame({'year': ['2016', '2016', '2015', '2014', '2013'],
                     'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
                     'rep': ['john', 'john', 'claire', 'kyle', 'kyle']})
dataNew = pd.DataFrame({'year': ['2016', '2016', '2015', '2013', '2014'],
                        'country': ['usa', 'uk', 'fr', 'uk', 'fr'],
                        'sales': [21, 10, 20, 12, 10],
                        'rep': [np.nan, 'john', np.nan, np.nan, 'kyle']})

(dataNew.set_index(['year', 'country']).sort_index()
        .fillna(data.set_index(['year', 'country']).sort_index())
        .reset_index())
   year country  sales     rep
0  2013      uk     12    kyle
1  2014      fr     10    kyle
2  2015      fr     20  claire
3  2016      uk     10    john
4  2016     usa     21    john
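If you want to keep dataNew's original row order, a small variant of the same idea (a sketch, my addition): fill only the rep column, letting pandas align the two frames on a shared (year, country) index, then reset the index without sorting:
filled = dataNew.set_index(['year', 'country'])
# Series.fillna aligns on the MultiIndex, so the differing row order does not matter
filled['rep'] = filled['rep'].fillna(data.set_index(['year', 'country'])['rep'])
print(filled.reset_index())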

Create dictionary with two keys where both keys must be met to retrieve value

I have two dataframes:
df_one:

person city year col_x
ah     bos  1998
bc     bos  1996
dm     ny   2001
hh     la   1999

df_two:

person city range_a range_b
mk     bos  1995    2004
kk     bos  2004    2017
ab     ny   1977    2019
fc     dc   2001    2005
cc     dc   2006    2019
et     la   1995    2005
tr     mia  1997    2006
I'd like to fill df_one's col_x with values based on conditions in both df_one and df_two: take the city from df_one, match it to the city in df_two, and where the year in df_one falls within the range in df_two, place the person from df_two into col_x in df_one.
Example: "ah" from the first row in df_one - the city is bos, and the year is 1998 - so col_x would be mk for that row, because the city matches and 1998 falls between 1995 and 2004.
I'm not really sure where to start with this in pandas; I believe it may need some kind of nested dictionary with two keys, but I'm not sure if that's possible.
Here is a way to go about it.
First I created the data frames based on your description:
df1 = pd.DataFrame({'A': ['ah', 'bc', 'dm', 'hh'],
                    'B': ['bos', 'bos', 'ny', 'la'],
                    'C': [1998, 1996, 2001, 1999]})
df2 = pd.DataFrame({'A': ['mk', 'kk', 'ab', 'fc', 'cc', 'et', 'tr'],
                    'B': ['bos', 'bos', 'ny', 'dc', 'dc', 'la', 'la'],
                    'C': [1995, 2004, 1977, 2001, 2006, 1995, 1997],
                    'D': [2004, 2017, 2019, 2005, 2019, 2005, 2006]})
df1

    A    B     C
0  ah  bos  1998
1  bc  bos  1996
2  dm   ny  2001
3  hh   la  1999

df2

    A    B     C     D
0  mk  bos  1995  2004
1  kk  bos  2004  2017
2  ab   ny  1977  2019
3  fc   dc  2001  2005
4  cc   dc  2006  2019
5  et   la  1995  2005
6  tr   la  1997  2006
Then I passed the rows of df1 to a function (check_data). The function compares each row of df1 with all the rows in df2 and returns all the matching values from df2['A'] based on the condition you mentioned. Please read my comment on your original question: 'la' in df1 will match 2 values in df2.
Option 1: all matching values go into df1['D'], which comes out as a list.
Option 2: only the first of the matching values is kept, as a scalar.
You can choose which option you want to go for, or clarify further.
Option 1:
def check_data(row):
    return df2[(df2['B'] == row['B']) & (df2['C'] <= row['C']) & (df2['D'] >= row['C'])]['A'].values

df1['D'] = df1.apply(check_data, axis=1)
df1

    A    B     C         D
0  ah  bos  1998      [mk]
1  bc  bos  1996      [mk]
2  dm   ny  2001      [ab]
3  hh   la  1999  [et, tr]
Option 2:
def check_data(row):
    return df2[(df2['B'] == row['B']) & (df2['C'] <= row['C']) & (df2['D'] >= row['C'])]['A'].iloc[0]

df1['D'] = df1.apply(check_data, axis=1)
df1

    A    B     C   D
0  ah  bos  1998  mk
1  bc  bos  1996  mk
2  dm   ny  2001  ab
3  hh   la  1999  et
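As a vectorized alternative to the row-wise apply (my sketch, assuming a fresh df1 without the D column yet): merge on city, keep the rows whose year falls inside the range, then take the first match per df1 row, which reproduces Option 2 (ties such as 'la' keep the first match):
# merge on city (column B); suffix '_2' marks df2's overlapping columns
m = df1.reset_index().merge(df2, on='B', suffixes=('', '_2'))
m = m[(m['C_2'] <= m['C']) & (m['C'] <= m['D'])]
# keep the first matching df2 person per original df1 row, aligned back by index
df1['D'] = m.drop_duplicates('index').set_index('index')['A_2']
print(df1)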
You can use pandas dataframe merge, then use a lambda expression to compute the column according to your required logic.
Example:
import pandas as pd
First dataframe:
df_one = pd.DataFrame([
    {'A': 'ah', 'B': 'bos', 'C': 1998, 'D': ''},
    {'A': 'bc', 'B': 'bos', 'C': 1996, 'D': ''},
    {'A': 'dm', 'B': 'ny', 'C': 2001, 'D': ''},
    {'A': 'hh', 'B': 'la', 'C': 1999, 'D': ''},
])
print("df_one")
print(df_one)
Second dataframe:
df_two = pd.DataFrame([
    {'A': 'mk', 'B': 'bos', 'C': 1995, 'D': 2004},
    {'A': 'kk', 'B': 'bos', 'C': 2004, 'D': 2017},
    {'A': 'ab', 'B': 'ny', 'C': 1977, 'D': 2019},
    {'A': 'fc', 'B': 'dc', 'C': 2001, 'D': 2005},
    {'A': 'cc', 'B': 'dc', 'C': 2006, 'D': 2019},
    {'A': 'et', 'B': 'la', 'C': 1995, 'D': 2005},
    {'A': 'tr', 'B': 'la', 'C': 1997, 'D': 2006},
])
print("df_two")
print(df_two)
Merge the dataframes on column B:
df_merged = pd.merge(df_one, df_two, on='B')
print("df_merged")
print(df_merged)
Perform your logic:
df_merged['D_x'] = df_merged.apply(
    lambda x: x['A_y'] if x['C_y'] <= x['C_x'] <= x['D_y'] else '', axis=1)
print(df_merged)
Get required columns only:
result_columns = ['A_x', 'B', 'C_x', 'D_x']
df_result = df_merged[result_columns]
Rename the columns to desired format:
df_result = df_result.rename(columns={'A_x': 'A', 'C_x': 'C', 'D_x': 'D'})
Clean up records with blank value of D:
df_result = df_result[df_result['D'] != '']
print("df_result")
print(df_result)
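The row-wise apply can also be replaced with numpy.where (my variant, not the answer's exact approach), using NaN instead of '' so the cleanup step becomes a simple notna filter:
import numpy as np

in_range = (df_merged['C_y'] <= df_merged['C_x']) & (df_merged['C_x'] <= df_merged['D_y'])
df_merged['D_x'] = np.where(in_range, df_merged['A_y'], np.nan)
df_result = df_merged.loc[df_merged['D_x'].notna(), ['A_x', 'B', 'C_x', 'D_x']]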
