I have a dataframe as below :
Card_x Country Age Code Card_y
S INDIA Adult Garments S,E,D,G,M,A
S INDIA Adult Grocery D,S,G,A,M,E
I have a list as below:
lis1 = ['S', 'D', 'G', 'E', 'M', 'A']
Now I want my dataframe to look like this:
Explanation: group by Card_x, Country, Age and use the lis1 values as "Card_y".
Card_x Country Age Card_y
S INDIA Adult S,D,G,E,M,A
Can anyone help?
Note: the logic for calculating lis1 is below:
lis1 = []
for i in range(len(df)):
    l = df.Card_y.iloc[i].split(',')
    lis1.append(l)
sorted(lis1[0], key=lambda elem: sum(sublist.index(elem) for sublist in lis1) / len(lis1))
Basically, lis1 takes the rank of each Card_y value under each "Code", averages those ranks, and re-ranks the values by ascending average.
E.g.: S is ranked 1st for Code Garments and 2nd for Code Grocery, so its average is (1+2)/2 = 1.5.
D is ranked 3rd for Code Garments and 1st for Code Grocery, so its average is (3+1)/2 = 2.
Sorting by average rank, lowest first, gives the ranked list,
so it will be S,D,G,E,M,A
Try:
df_out = df.groupby(['Card_x','Country','Age'])['Card_y'].apply(lambda x: x.str.split(',', expand=True)
                                                                          .rename(columns = lambda x: x+1)
                                                                          .stack().reset_index(level=1))
df_out = df_out.groupby(['Card_x','Country','Age',0])['level_1'].mean().sort_values().reset_index(level=-1)
df_out.groupby(['Card_x','Country','Age'])[0].agg(','.join).rename('Card_y').reset_index()
Output:
Card_x Country Age Card_y
0 S INDIA Adult S,D,G,E,A,M
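Note that M and A both have an average rank of 5, so their relative order depends on how ties are broken. As a minimal alternative sketch, assuming the sample frame from the question, the question's own sorted(...) logic can be applied per group; Python's sort is stable, so tied cards keep the order of the group's first row. The helper name average_rank_order is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Card_x": ["S", "S"],
    "Country": ["INDIA", "INDIA"],
    "Age": ["Adult", "Adult"],
    "Code": ["Garments", "Grocery"],
    "Card_y": ["S,E,D,G,M,A", "D,S,G,A,M,E"],
})

def average_rank_order(card_lists):
    """Order cards by mean position across the rows of one group (hypothetical helper)."""
    rows = [s.split(",") for s in card_lists]
    # sorted() is stable, so tied averages keep the order of the first row.
    ordered = sorted(rows[0], key=lambda card: sum(r.index(card) for r in rows) / len(rows))
    return ",".join(ordered)

result = (
    df.groupby(["Card_x", "Country", "Age"])["Card_y"]
      .apply(average_rank_order)
      .reset_index()
)
print(result)
```

This assumes every row in a group lists the same set of cards; a card missing from one row would make list.index raise a ValueError.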
I have a CSV file and am using Python to get the highest average avocado price from the data. Everything works fine until I print the region:
avocadoesDB = pd.read_csv("avocado.csv")
avocadoesDB = pd.DataFrame(avocadoesDB)
avocadoesDB = avocadoesDB[['AveragePrice', 'type', 'year', 'region']]
regions = avocadoesDB[['AveragePrice', 'region']]
regionMax = max(regions['AveragePrice'])
region = regions.loc[regions['AveragePrice']==regionMax]
print(f"The highest average price for both types of potatoes is ${regionMax} from {region['region']}.")
Output:
The highest average price for both types of potatoes is $3.25 from 14125 SanFrancisco
Name: region, dtype: object.
Expected:
The highest average price for both types of potatoes is $3.25 from SanFrancisco.
So I tried the same method on a simple dataset and seem to have made it work. Here's the code snippet:
mx = max(df1['Salary'])
plc = df1.loc[df1['Salary']==mx]['Name']
print('Max Sal : ' + str(plc.iloc[0]))
Output:
Max Sal : Farah
According to this post on Stack Overflow, df1.loc[df1['Salary']==mx]['Name'] returns a Series object, so to retrieve the value itself you take its first element, for example with .iloc[0], if I understood the post correctly.
So for your code, you can replace
region = regions.loc[regions['AveragePrice']==regionMax]
print(f"The highest average price for both types of potatoes is ${regionMax} from {region['region']}.")
with
region = regions.loc[regions['AveragePrice']==regionMax]['region']
print(f"The highest average price for both types of potatoes is ${regionMax} from {region.iloc[0]}.")
This should work. Hope this helps!
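As an alternative sketch, not taken from the answer above, idxmax can locate the row with the highest price directly, so the region comes back as a plain value rather than a Series. The sample rows below are made up for illustration; the real data would come from avocado.csv.

```python
import pandas as pd

# Made-up sample rows; the real data comes from avocado.csv.
regions = pd.DataFrame({
    "AveragePrice": [1.10, 3.25, 2.40],
    "region": ["Albany", "SanFrancisco", "Boston"],
})

# idxmax returns the index label of the row holding the maximum price.
top = regions.loc[regions["AveragePrice"].idxmax()]
message = f"The highest average price is ${top['AveragePrice']} from {top['region']}."
print(message)
```

If several rows share the maximum price, idxmax picks the first one.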
The dataframe (which contains data on the 2016 elections), loaded into pandas from a .csv file, has the following structure:
In [2]: df
Out[2]:
county candidate votes ...
0 Ada Trump 10000 ...
1 Ada Clinton 900 ...
2 Adams Trump 12345 ...
.
.
n Total ... ... ...
The idea is to find the top N counties with the highest percentage of votes in favor of a given candidate (excluding the Total row).
For example, suppose we want 100 counties and the candidate is Trump; the value to compute for each county is: 100 * sum of votes for Trump / total votes in that county
I have implemented the following code, getting correct results:
In [3]: (df.groupby(by="county")
           .apply(lambda x: 100 * x.loc[(x.candidate == "Trump")
                                        & (x.county != "Total"), "votes"].sum() / x.votes.sum())
           .nlargest(100)
           .reset_index(name='percentage'))
Out[3]:
county percentage
0 Hayes 91.82
1 WALLACE 90.35
2 Arthur 89.37
.
.
99 GRANT 79.10
Using %%time I realized that it is quite slow:
Out[3]:
CPU times: user 964 ms, sys: 24 ms, total: 988 ms
Wall time: 943 ms
Is there a way to make it faster?
You can try to amend your code to use only vectorized operations to speed up the process, like below:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3 = df2.nlargest(100).reset_index(name='percentage') # get the largest 100
df3.loc[df3.candidate == "Trump"] # Finally, filter by candidate
Edit:
If you want the top 100 counties with the highest percentages for the candidate, you can change the code slightly, as below:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3a = df2.reset_index(name='percentage') # get the percentage
df3a.loc[df3a.candidate == "Trump"].nlargest(100, 'percentage') # Finally, filter by candidate and get the top 100 counties with highest percentages for the candidate
You can try the following.
Supposing you don't have a 'Total' row with the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby(['county'])[['votes']].sum()/df['votes'].sum()*100).nlargest(100, 'votes')
Supposing you have a 'Total' row with the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby(['county'])[['votes']].sum()/df.loc[df['county'] != 'Total', 'votes'].sum()*100).nlargest(100, 'votes')
I could not test it because I don't have the data, but it doesn't use any apply, which should improve performance.
To rename the column, you can add .rename(columns={'votes':'percentage'}) at the end.
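A minimal sketch of the vectorized idea, assuming the percentage is meant per county as in the question's groupby (candidate votes in a county divided by all votes in that county). The sample rows, including the candidate label on the Total row, are made up for illustration.

```python
import pandas as pd

# Made-up sample in the question's layout, including a Total row.
df = pd.DataFrame({
    "county": ["Ada", "Ada", "Adams", "Adams", "Total"],
    "candidate": ["Trump", "Clinton", "Trump", "Clinton", "All"],
    "votes": [10000, 900, 12345, 20000, 43245],
})

counties = df[df["county"] != "Total"]
# Total votes per county, and votes for one candidate per county.
total = counties.groupby("county")["votes"].sum()
trump = counties[counties["candidate"] == "Trump"].groupby("county")["votes"].sum()

# Division aligns on the county index; nlargest sorts in descending order.
percentage = (100 * trump / total).nlargest(100).reset_index(name="percentage")
print(percentage)
```

Both groupby calls use built-in aggregations, so no Python-level lambda runs per group.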
I have a big dataframe like:
product price serial category department origin
0 cookies 4 2345 breakfast food V
1 paper 0.5 4556 stationery work V
2 spoon 2 9843 kitchen household M
I want to convert to dict, but I just want an output like:
{serial: 2345}{serial: 4556}{serial: 9843} and {origin: V}{origin: V}{origin: M}
where key is column name and value is value
So far, I've tried df.to_dict('values') and selected dic['origin'], which returns
{0: V}{1:V}{2:M}
I've also tried df.to_dict('records'), but it gives me:
{product: cookies, price: 4, serial:2345, category: breakfast, department:food, origin:V}
and I don't know how to select only 'origin' or 'serial'.
You can do something like:
serial_dict = df[['serial']].to_dict('records')
origin_dict = df[['origin']].to_dict('records')
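A small runnable sketch of the same idea on a cut-down version of the question's sample data. The double brackets matter: they select a one-column DataFrame, so each record is a dict with a single key.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["cookies", "paper", "spoon"],
    "serial": [2345, 4556, 9843],
    "origin": ["V", "V", "M"],
})

# Selecting with a list keeps a one-column DataFrame, so each record has one key.
serial_dicts = df[["serial"]].to_dict("records")
origin_dicts = df[["origin"]].to_dict("records")
print(serial_dicts)
print(origin_dicts)
```

The result in each case is a list of dicts, one per row.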
I am working on an assignment for the Coursera Introduction to Data Science course. I have a dataframe with 'Country' as the index and 'Rank' as one of the columns. When I try to reduce the dataframe to only the rows for countries ranked 1-15, the following works but excludes Iran, which is ranked 13.
df.set_index('Country', inplace=True)
df.loc['Iran', 'Rank'] = 13  # I did this in case there was some sort of corruption in the original data
df_top15 = df.where(df.Rank < 16).dropna().copy()
return df_top15
When I try
df_top15 = df.where(df.Rank == 12).dropna().copy()
I get the row for Spain.
But when I try
df_top15 = df.where(df.Rank == 13).dropna().copy()
I just get the column headers, no row for Iran.
I also tried
df.Rank == 13
and got a series with False for all countries but Iran, which was True.
Any idea what could be causing this?
Your code works fine:
df = pd.DataFrame([['Italy', 5],
                   ['Iran', 13],
                   ['Tinbuktu', 20]],
                  columns=['Country', 'Rank'])
res = df.where(df.Rank < 16).dropna()
print(res)
  Country  Rank
0   Italy   5.0
1    Iran  13.0
However, I dislike this method because where first converts the non-matching values to NaN, so the dtype of your Rank series becomes float.
A better idea, in my opinion, is to use query or loc. Using either method obviates the need for dropna:
res = df.query('Rank < 16')
res = df.loc[df['Rank'] < 16]
print(res)
  Country  Rank
0   Italy     5
1    Iran    13
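One possible cause of the missing row, which the question does not confirm, is that Iran already has a NaN in some other column: dropna() removes any row containing a NaN, not only the rows that where() blanked out. A minimal sketch with made-up data, where the Citations column is a hypothetical stand-in for whichever column holds the missing value:

```python
import numpy as np
import pandas as pd

# Made-up data: Iran has a missing value in a column other than Rank.
df = pd.DataFrame(
    {"Rank": [12, 13, 20], "Citations": [100.0, np.nan, 50.0]},
    index=pd.Index(["Spain", "Iran", "Timbuktu"], name="Country"),
)

# where() keeps Iran, but dropna() then removes it because of the NaN in Citations.
via_where = df.where(df.Rank < 16).dropna()
# Boolean indexing filters on Rank only and leaves other NaN values alone.
via_loc = df.loc[df["Rank"] < 16]
print(via_where.index.tolist())
print(via_loc.index.tolist())
```

If this is what happens in the assignment data, filtering with loc or query keeps Iran without touching the other columns.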
I have two data frames. df1 looks like -
MovieName Actors
lights out Maria Bello
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis
df2 looks like -
ActorName Gender
Tom male
Emily female
Christopher male
I want to add two columns to df1, 'female_actors' and 'male_actors', which contain the count of female and male actors in that particular movie respectively. Whether an actor is male or female is determined from df2.
Here is what I am doing -
def func(actors, gender):
    actors = [act.split()[0] for act in actors.split('*')]
    n_gender = df2.Gender[df2.Gender==gender][df2.ActorName.isin(actors)].count()
    return n_gender
df1['male_actors'] = df1.Actors.apply(lambda x: func(x, 'male'))
df1['female_actors'] = df1.Actors.apply(lambda x: func(x, 'female'))
This code gives me a "list index out of range" error.
Please note that:
If a particular name isn't present in gender.csv, don't count it in the total.
If there is just one actor in a movie and that actor isn't present in gender.csv, then the count should be zero.
Result should be -
MovieName Actors male_actors female_actors
lights out Maria Bello 0 0
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis 2 1
Feel free to suggest some other approach.
How about this?
# df2 holds first names only, so compare against the first word of each actor's name
df1['Male'] = df1.Actors.apply(lambda x: len(pd.concat( [df2[(df2.ActorName == name.split(' ')[0]) & (df2.Gender == 'male')] for name in x.split('*')] )))
df1['Female'] = df1.Actors.apply(lambda x: len(pd.concat( [df2[(df2.ActorName == name.split(' ')[0]) & (df2.Gender == 'female')] for name in x.split('*')] )))
using str and join
d2 = df2.set_index('ActorName')
d1 = df1.set_index('MovieName')
method 1
split
d1.join(d1.Actors.str.split('*', expand=True).stack() \
          .str.split(expand=True)[0].map(d2.Gender) \
          .groupby(level='MovieName') \
          .value_counts().unstack()).fillna(0).reset_index()
method 2
extractall
d1.join(d1.Actors.str.extractall(r'((?P<first>[^*]+)\s+(?P<last>[^*]+))') \
          ['first'].map(d2.Gender).groupby(level='MovieName') \
          .value_counts().unstack()).fillna(0).reset_index()
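Another possible approach, sketched here under the assumption that df2 holds unique first names as in the sample data, is to explode the actor list into one row per actor and map each first name to a gender. It uses the column names from the question's expected result.

```python
import pandas as pd

df1 = pd.DataFrame({
    "MovieName": ["lights out", "legend"],
    "Actors": ["Maria Bello",
               "Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis"],
})
df2 = pd.DataFrame({
    "ActorName": ["Tom", "Emily", "Christopher"],
    "Gender": ["male", "female", "male"],
})

gender_by_name = df2.set_index("ActorName")["Gender"]

# One row per actor, reduced to the first name, then mapped to a gender (NaN if unknown).
first_names = df1["Actors"].str.split("*").explode().str.split().str[0]
genders = first_names.map(gender_by_name)

for g in ["male", "female"]:
    # Count matches per original row; rows without matches get 0.
    df1[f"{g}_actors"] = (genders == g).groupby(level=0).sum()
print(df1)
```

Because .str[0] yields NaN for an empty name instead of raising, this avoids the "list index out of range" error from the question, and names missing from df2 simply do not count.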