I have the following df:
country sport score
0 ita swim 15
1 fr run 25
2 ger golf 37
3 ita run 17
4 fr golf 58
5 fr run 35
I am interested in only some values of each category:
ctr = ['ita','fr']
sprt= ['run','golf']
I was hoping something like this would extract them:
df[(df['country']== x for x in ctr)&(df['sport']== x for x in sprt)]
It doesn't throw an error, but it returns an empty result.
Any suggestions?
I also tried:
df[(df['country']== {x for x in ctr})&(df['sport']== {x for x in sprt})]
EDIT:
The reason I want to use a loop is that I am actually interested in the top 3 scores of each combination, which I hoped to concat:
df1 = pd.concat(df[(df['country']== x for x in ctr)&(df['sport']== x for x in sprt)].sort_values(by=['score'],ascending=False).head(3))
Use Series.isin twice to check membership:
df1 = df[(df['country'].isin(ctr))&(df['sport'].isin(sprt))]
print (df1)
country sport score
1 fr run 25
3 ita run 17
4 fr golf 58
5 fr run 35
df2 = df1.sort_values('score', ascending=False).groupby(['country','sport']).head(3)
print (df2)
country sport score
4 fr golf 58
5 fr run 35
1 fr run 25
3 ita run 17
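For anyone who wants to run the answer end to end, here is a minimal sketch that rebuilds the example frame from the question (the constructor call is reconstructed from the printed table, so treat it as an assumption) and applies the same isin filter followed by a per-group head(3):

```python
import pandas as pd

# Rebuild the example frame from the question's printed table.
df = pd.DataFrame({
    "country": ["ita", "fr", "ger", "ita", "fr", "fr"],
    "sport": ["swim", "run", "golf", "run", "golf", "run"],
    "score": [15, 25, 37, 17, 58, 35],
})
ctr = ["ita", "fr"]
sprt = ["run", "golf"]

# isin() tests membership row by row and returns one boolean per row,
# so the two masks can be combined with &.
mask = df["country"].isin(ctr) & df["sport"].isin(sprt)

# Sort by score, then keep at most 3 rows per (country, sport) pair.
top3 = (
    df[mask]
    .sort_values("score", ascending=False)
    .groupby(["country", "sport"])
    .head(3)
)
print(top3)
```

Because groupby(...).head(3) keeps the rows in their sorted order, no explicit loop or pd.concat is needed.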
Hello, I have this pandas code (see below), but it gives me this error: TypeError: can only concatenate str (not "int") to str
import pandas as pd
import numpy as np
import os
_data0 = pd.read_excel("C:\\Users\\HP\\Documents\\DataScience task\\Gender_Age.xlsx")
_data0['Age' + 1]
I want to change the values in column 'Age'. For example, if I wanted to increase every element of 'Age' by 1, how do I do that? (The same for 'Number of Children'.)
The output I wanted:
First Name Last Name Age Number of Children
0 Kimberly Watson 36 2
1 Victor Wilson 35 6
2 Adrian Elliott 35 2
3 Richard Bailey 36 5
4 Blake Roberts 35 6
Original output:
First Name Last Name Age Number of Children
0 Kimberly Watson 24 1
1 Victor Wilson 23 5
2 Adrian Elliott 23 1
3 Richard Bailey 24 4
4 Blake Roberts 23 5
Try:
# Select the column first, then do the arithmetic on it.
_data0['Age'] = _data0['Age'] + 12
_data0['Number of Children'] = _data0['Number of Children'] + 1
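To see where the TypeError comes from, here is a small sketch with a hypothetical stand-in frame (two rows copied from the question's "Original output" table, not the real spreadsheet). The expression 'Age' + 1 is evaluated before pandas ever sees it, so Python tries to add an int to a string; the arithmetic has to happen on the selected column instead:

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet; values copied from the question.
data = pd.DataFrame({
    "First Name": ["Kimberly", "Victor"],
    "Age": [24, 23],
    "Number of Children": [1, 5],
})

# 'Age' + 1 fails because it adds an int to the string 'Age'.
try:
    data["Age" + 1]
except TypeError as exc:
    error_message = str(exc)

# Select the column, then add to the resulting Series.
data["Age"] = data["Age"] + 12
data["Number of Children"] = data["Number of Children"] + 1
print(data)
```

The increments 12 and 1 are simply the differences between the two tables shown in the question.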
I have some American Football data in a DataFrame like below:
df = pd.DataFrame({'Green Bay Packers' : ['30-18-0', '5-37', '10-71' ],
'Chicago Bears' : ['45-26-1', '5-20', '10-107']},
index=['Att - Comp - Int', 'Sacked - Yds Lost', 'Penalties - Yards'])
Green Bay Packers Chicago Bears
Att - Comp - Int 30-18-0 45-26-1
Sacked - Yds Lost 5-37 5-20
Penalties - Yards 10-71 10-107
You can see above that each row contains multiple data points that need to be split off.
What I'd like to do is find some way to split the rows up so that each data point is its own row. The final output would look like:
Green Bay Packers Chicago Bears
Att 30 45
Comp 18 26
Int 0 1
Sacked 5 5
Yds Lost 37 20
Penalties 10 10
Yards 71 107
Is there a way to do this efficiently? I tried some Regex but it just turned into a mess. Sorry if my formatting isn't perfect...2nd question ever posted here.
Try:
df = df.reset_index().apply(lambda x: x.str.split("-"))
df = pd.DataFrame(
{c: df[c].explode().str.strip() for c in df.columns},
).set_index("index")
df.index.name = None
print(df)
Prints:
Green Bay Packers Chicago Bears
Att 30 45
Comp 18 26
Int 0 1
Sacked 5 5
Yds Lost 37 20
Penalties 10 10
Yards 71 107
First reset the index, then stack all the columns and split them on '-'. You can additionally strip any whitespace left over after the split. Then unstack again, apply pd.Series.explode, reset the index, and drop the leftover level_0 column.
out = (df.reset_index()
.stack().str.split('-').apply(lambda x:[i.strip() for i in x])
.unstack()
.apply(pd.Series.explode)
.reset_index()
.drop(columns='level_0'))
index Green Bay Packers Chicago Bears
0 Att 30 45
1 Comp 18 26
2 Int 0 1
3 Sacked 5 5
4 Yds Lost 37 20
5 Penalties 10 10
6 Yards 71 107
Assuming you have same number of splits for every row, with pandas >= 1.3.0, you can explode multiple columns at the same time:
df = df.reset_index().apply(lambda s: s.str.split(' *- *'))
df.explode(df.columns.tolist()).set_index('index')
Green Bay Packers Chicago Bears
index
Att 30 45
Comp 18 26
Int 0 1
Sacked 5 5
Yds Lost 37 20
Penalties 10 10
Yards 71 107
Use .apply() on each column (including index) and for each column:
use .str.split() to split data points and
use .explode() to create rows for each split data point element
df_out = (df.reset_index()
.apply(lambda x: x.str.split(r'\s*-\s*').explode())
.set_index('index').rename_axis(index=None)
)
Result:
print(df_out)
Green Bay Packers Chicago Bears
Att 30 45
Comp 18 26
Int 0 1
Sacked 5 5
Yds Lost 37 20
Penalties 10 10
Yards 71 107
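The answers above leave the values as strings. As a rough sketch of the same split-and-explode idea, assuming every cell really holds integers separated by '-', the result can be converted to numbers in a final step:

```python
import pandas as pd

df = pd.DataFrame(
    {"Green Bay Packers": ["30-18-0", "5-37", "10-71"],
     "Chicago Bears": ["45-26-1", "5-20", "10-107"]},
    index=["Att - Comp - Int", "Sacked - Yds Lost", "Penalties - Yards"],
)

# Split the labels and every column on '-', giving one list per cell.
split = df.reset_index().apply(lambda s: s.str.split("-"))

# Explode each column, then strip the whitespace left around the labels.
out = split.apply(lambda s: s.explode().str.strip())

# Use the label column as the index and turn the values into integers.
out = out.set_index("index").rename_axis(index=None).astype(int)
print(out)
```

This only works when each row splits into the same number of pieces in every column, which holds for the sample data.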
I have these two dictionaries,
dico = {'Name': ['Arthur','Henri','Lisiane','Patrice','Zadig','Sacha'],
"Age": ["20","18","62","73",'21','20'],
"Studies": ['Economics','Maths','Psychology','Medical','Cinema','CS']
}
dico2 = {'Surname': ['Arthur1','Henri2','Lisiane3','Patrice4']}
dico = pd.DataFrame.from_dict(dico)
dico2 = pd.DataFrame.from_dict(dico2)
I would like to match the Surname column against the Name column and then add it to dico, to get the following output:
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
3 Patrice Nan 73 Medical
4 Zadig Nan 21 Cinema
5 Sacha Nan 20 CS
and ultimately delete the rows for which Surname is Nan
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
map_list = []
for name in dico['Name']:
    best_ratio = None
    for idx, surname in enumerate(dico2['Surname']):
        if best_ratio == None:
            best_ratio = fuzz.ratio(name, surname)
            best_idx = 0
        else:
            ratio = fuzz.ratio(name, surname)
            if ratio > best_ratio:
                best_ratio = ratio
                best_idx = idx
    map_list.append(dico2['Surname'][best_idx]) # obtain surname
dico['Surname'] = pd.Series(map_list) # add column
dico = dico[["Name", "Surname", "Age", "Studies"]] # reorder columns
#if the surname is not a great match, print "Nan"
dico = dico.drop(dico[dico.Surname == "NaN"].index)
but when I print(dico), the output is as follows:
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
3 Patrice Patrice4 73 Medical
4 Zadig Patrice4 21 Cinema
5 Sacha Patrice4 20 CS
I don't see why after the Patrice row, there's a mismatch, while I want it to be "Nan".
Let's try pd.MultiIndex.from_product to create the combinations, then score each pair with fuzz.ratio and keep only the pairs above a threshold to build a dict. After that we can use Series.map and DataFrame.dropna:
from fuzzywuzzy import fuzz
comb = pd.MultiIndex.from_product((dico['Name'],dico2['Surname']))
scores = comb.map(lambda x: fuzz.ratio(*x)) #or fuzz.partial_ratio(*x)
d = dict(a for a,b in zip(comb,scores) if b>90) #change threshold
out = dico.assign(SurName=dico['Name'].map(d)).dropna(subset=['SurName'])
print(out)
Name Age Studies SurName
0 Arthur 20 Economics Arthur1
1 Henri 18 Maths Henri2
2 Lisiane 62 Psychology Lisiane3
3 Patrice 73 Medical Patrice4
You could do the following thing. Define the function:
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['Surname'] = m
    m2 = df_1['Surname'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['Surname'] = m2
    return df_1
and run
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = fuzzy_merge(dico, dico2, 'Name', 'Surname',threshold=90, limit=2)
This returns:
Name Age Studies Surname
0 Arthur 20 Economics Arthur1
1 Henri 18 Maths Henri2
2 Lisiane 62 Psychology Lisiane3
3 Patrice 73 Medical Patrice4
4 Zadig 21 Cinema
5 Sacha 20 CS
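The loop in the question always keeps the best-scoring surname, even when the best score is poor, which is why Zadig and Sacha still get a match. Below is a minimal sketch of adding a threshold. It uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio, so the scores may differ from fuzzywuzzy's; the helper names similarity and best_match and the threshold of 90 are illustrative assumptions:

```python
from difflib import SequenceMatcher

import pandas as pd

dico = pd.DataFrame({
    "Name": ["Arthur", "Henri", "Lisiane", "Patrice", "Zadig", "Sacha"],
    "Age": ["20", "18", "62", "73", "21", "20"],
})
dico2 = pd.DataFrame({"Surname": ["Arthur1", "Henri2", "Lisiane3", "Patrice4"]})


def similarity(a, b):
    """Similarity on a 0-100 scale (difflib stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def best_match(name, candidates, threshold=90):
    """Return the closest candidate, or None if nothing reaches the threshold."""
    best = max(candidates, key=lambda c: similarity(name, c))
    return best if similarity(name, best) >= threshold else None


# Names without a good enough match get None, which dropna() then removes.
dico["Surname"] = dico["Name"].apply(
    best_match, candidates=dico2["Surname"].tolist()
)
out = dico.dropna(subset=["Surname"])
print(out)
```

With the sample data, Patrice is kept because 'Patrice4' is in dico2; only Zadig and Sacha fall below the threshold.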
Question: how to choose values based on two DataFrames.
>>> df[['age','name']]
age name
0 44 Anna
1 22 Bob
2 33 Cindy
3 44 Danis
4 55 Cindy
5 66 Danis
6 11 Anna
7 43 Bob
8 12 Cindy
9 19 Danis
10 11 Anna
11 32 Anna
12 55 Anna
13 33 Anna
14 32 Anna
>>> df2[['age','name']]
age name
5 66 Danis
4 55 Cindy
0 44 Anna
7 43 Bob
The expected result is all rows of df whose 'age' is higher than df2['age'] for the same 'name'.
expected result
age name
12 55 Anna
Per comments, use merge and filter dataframe:
df.merge(df2, on='name', suffixes=('', '_y')).query('age > age_y')[['name','age']]
Output:
name age
4 Anna 55
IIUC, you can use this to find the max age of all names:
pd.concat([df,df2]).groupby('name')['age'].max()
Output:
name
Anna 55
Bob 43
Cindy 55
Danis 66
Name: age, dtype: int64
Try this:
index = df[df['age'] > age].index
df.loc[index]
There are a few edge cases you don't mention how you would like to resolve, but generally what you want to do is iterate down the df and compare ages and use the larger. You could do so in the following manner:
rows = []
for x in range(len(df)):  # assumes df and df2 have the same length
    a, b = df.iloc[x], df2.iloc[x]  # compare rows by position
    # keep whichever row has the larger age
    rows.append(a if a['age'] > b['age'] else b)
df3 = pd.DataFrame(rows, columns=['age', 'name']).reset_index(drop=True)
Although you will need to modify this to reflect how you want to resolve names that are only in one list, or if the lists are of different sizes.
One solution that comes to mind is to merge, then filter:
df.merge(df2, on='name', suffixes=('', '_y')).query('age.gt(age_y)', engine='python')[['age','name']]
Out[175]:
age name
4 55 Anna
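The merge-based answers renumber the rows, so the original label 12 from the expected result is lost. As an alternative sketch, assuming each name appears only once in df2, the reference age can be looked up with Series.map and the filter applied to df directly, which keeps df's own index:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [44, 22, 33, 44, 55, 66, 11, 43, 12, 19, 11, 32, 55, 33, 32],
    "name": ["Anna", "Bob", "Cindy", "Danis", "Cindy", "Danis", "Anna", "Bob",
             "Cindy", "Danis", "Anna", "Anna", "Anna", "Anna", "Anna"],
})
df2 = pd.DataFrame(
    {"age": [66, 55, 44, 43], "name": ["Danis", "Cindy", "Anna", "Bob"]},
    index=[5, 4, 0, 7],
)

# Look up the reference age for each row's name (NaN if the name is not in df2).
ref_age = df["name"].map(df2.set_index("name")["age"])

# Keep rows whose age exceeds the reference; comparisons with NaN are False.
result = df[df["age"] > ref_age]
print(result)
```

Names that are missing from df2 are silently dropped here, since a comparison against NaN is always False.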
I have DataFrame data grouped by two columns (X, Y), with the count of elements in Z. The idea is to find the top 2 counts of elements across X, Y.
The DataFrame looks like:
mostCountYInX = df.groupby(['X','Y'],as_index=False).count()
C X Y Z
USA NY NY 5
USA NY BR 14
USA NJ JC 40
USA FL MI 3
IND MAH MUM 4
IND KAR BLR 2
IND KER TVM 2
CHN HK HK 3
CHN SH SH 3
Individually, I can extract the information I am looking for:
XTopCountInTopY = mostCountYInX[mostCountYInX['X'] == 'NY']
XTopCountInTopY = XTopCountInTopY.nlargest(2,'Y')
In the above, I knew the group I was looking for (X = NY) and got the top 2 records. Is there a way to print them together?
Say I am interested in IND and USA then the Output expected:
C X Y Z
USA NJ JC 40
USA NY BR 14
IND MAH MUM 4
IND KAR BLR 2
I think you need to group by the index with sort=False, then in apply use a lambda that sorts by Z with ascending=False and takes the top 2 rows, and finally call reset_index:
mask = df.index.isin(['USA','IND'])
df = df[mask].groupby(df[mask].index,sort=False).\
apply(lambda x: x.sort_values('Z',ascending=False)[:2]).\
reset_index(level=0,drop=True)
print(df)
X Y Z
USA NJ JC 40
USA NY BR 14
IND MAH MUM 4
IND KAR BLR 2
EDIT : After OP changed the Dataframe:
mask = df['C'].isin(['USA','IND'])
df = df[mask].groupby('C',sort=False).\
apply(lambda x: x.sort_values('Z',ascending=False)[:2]).\
reset_index(drop=True)
print(df)
C X Y Z
0 USA NJ JC 40
1 USA NY BR 14
2 IND MAH MUM 4
3 IND KAR BLR 2
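The same result can also be reached without apply. The sketch below rebuilds the frame from the question's table (with C as a regular column, which is an assumption about the final layout), sorts once by Z, and takes the first two rows per group with groupby(...).head(2):

```python
import pandas as pd

df = pd.DataFrame({
    "C": ["USA", "USA", "USA", "USA", "IND", "IND", "IND", "CHN", "CHN"],
    "X": ["NY", "NY", "NJ", "FL", "MAH", "KAR", "KER", "HK", "SH"],
    "Y": ["NY", "BR", "JC", "MI", "MUM", "BLR", "TVM", "HK", "SH"],
    "Z": [5, 14, 40, 3, 4, 2, 2, 3, 3],
})

wanted = ["USA", "IND"]
top2 = (
    df[df["C"].isin(wanted)]                           # keep only the groups of interest
    .sort_values("Z", ascending=False, kind="stable")  # largest counts first
    .groupby("C", sort=False)
    .head(2)                                           # first two rows of each group
    .reset_index(drop=True)
)
print(top2)
```

Note that IND has two rows tied at Z == 2, so which of them is kept depends on how ties are ordered by the sort.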