How to choose only the "male" attribute from a newly compiled dataframe? - python

I am working with the following dataframe which I created from a much larger csv file with additional information in columns not needed:
df_avg_tot_purch = df_purchase_data.groupby(["SN", "Gender"])["Price"].agg(lambda x: x.unique().mean())
df_avg_tot_purch.head()
This code results in the following:
SN Gender
Adairialis76 Male 2.28
Adastirin33 Female 4.48
Aeda94 Male 4.91
Aela59 Male 4.32
Aelaria33 Male 1.79
Name: Price, dtype: float64
I now need this result to show only the male gender. The point of the project is to take every individual (who may appear in several rows) and determine the average of their purchases. I did it this way because I also need to run the same thing for females and "others" in the column.

After groupby, the keys you grouped on become index levels, so you have to either reset the index to turn them back into normal columns, or explicitly use the index while subsetting:
df_avg_tot_purch[df_avg_tot_purch.index.isin(['Male'], level='Gender')]
or
df_avg_tot_purch = df_avg_tot_purch.reset_index()
df_avg_tot_purch[df_avg_tot_purch['Gender'] == 'Male']
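The two approaches above can be tried on a small frame. This is only a sketch: the rows below are made up for illustration, a plain mean() stands in for the question's lambda, and xs() is shown as one more option that was not part of the original answer.

```python
import pandas as pd

# Hypothetical purchase rows; names and prices are illustrative only
purchases = pd.DataFrame({
    "SN": ["Aela59", "Aela59", "Adastirin33", "Aeda94"],
    "Gender": ["Male", "Male", "Female", "Male"],
    "Price": [4.00, 4.64, 4.48, 4.91],
})

# Grouping on two keys gives a Series with a two-level index (SN, Gender)
avg = purchases.groupby(["SN", "Gender"])["Price"].mean()

# Option 1: cross-section on the Gender level (drops that level from the index)
males_xs = avg.xs("Male", level="Gender")

# Option 2: boolean mask built from the Gender level (keeps both levels)
males_mask = avg[avg.index.get_level_values("Gender") == "Male"]

# Option 3: turn the index levels back into columns and filter as usual
males_cols = avg.reset_index().query("Gender == 'Male'")
```

Swapping "Male" for another label reuses the same grouped result for the female and "other" runs.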

Related

Automatically Map columns from one dataframe to another using pandas

I am trying to merge multiple dataframes to a master dataframe based on the columns in the master dataframes. For Example:
MASTER DF:
PO ID  Sales year  Name  Acc year
10     1934        xyz   1834
11     1942        abc   1842
SLAVE DF:
PO ID  Yr    Amount  Year
12     1935  365.2   1839
13     1966  253.9   1855
RESULTANT DF:
PO ID  Sales Year  Acc Year
10     1934        1834
11     1942        1842
12     1935        1839
13     1966        1855
Notice how I have manually mapped columns (Sales Year-->Yr and Acc Year-->Year) since I know they are the same quantity, only the column names are different.
I am trying to write some logic which can map them automatically based on some criteria (be it column names or the data type of that column) so that user does not need to map them manually.
If I map them by column name, both the columns have different names (Sales Year, Yr) and (Acc Year, Year). So to which column should the fourth column (Year) in the SLAVE DF be mapped in the MASTER DF?
Another way would be to map them based on their column values but again they are the same so cannot do that.
The logic should be able to map Yr to Sales Year and map Year to Acc Year automatically.
Any idea/logic would be helpful.
Thanks in advance!
I think the safest option is to rename the columns manually:
df = df.rename(columns={'Yr':'Sales Year','Sales year':'Sales Year',
                        'Year':'Acc Year','Acc year':'Acc Year'})
Another idea is to select the integer columns, keep those whose values all fall between two thresholds (here 1800 and 2000), and then set the column names:
df = df.set_index('PO ID')
df1 = df.select_dtypes('integer')
mask = (df1.gt(1800) & df1.lt(2000)).all().reindex(df.columns, fill_value=False)
df = df.loc[:, mask].set_axis(['Sales Year','Acc Year'], axis=1)
In general this is impossible, as there is no solid, consistent factor by which the columns can be mapped.
That being said, one thing you can do is use cosine similarity to measure how similar one string (in this case a column name) is to the column names in the other dataframe.
In your case, you get 4 vectors for the first dataframe and 4 for the other one. Calculate the cosine similarity between the first vector (PO ID) from the first dataframe and the first vector from the second dataframe (PO ID). This returns 100%, as both strings are the same.
For each column, you get 4 confidence scores. Just pick the highest and map the columns accordingly.
That gives you a makeshift logic for mapping the columns, although it has loopholes too. Still, it is better than nothing, since the user will have fewer columns to map by hand than if they mapped them all manually.
Cheers!
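A minimal sketch of the idea above, assuming each column name is turned into a vector of character-bigram counts (the answer does not say how the vectors are built, so that choice is an assumption). The helper names and the 0.3 cut-off are hypothetical.

```python
from collections import Counter
from math import sqrt

def char_bigrams(name):
    # Vectorise a column name as counts of its lowercase character bigrams
    text = name.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine_similarity(a, b):
    va, vb = char_bigrams(a), char_bigrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

master_cols = ["PO ID", "Sales year", "Name", "Acc year"]
slave_cols = ["PO ID", "Yr", "Amount", "Year"]

def best_match(col, candidates, min_score=0.3):
    # Score the column against every candidate and keep the highest score;
    # below the (arbitrary) cut-off, leave the column for the user to map
    score, cand = max((cosine_similarity(col, c), c) for c in candidates)
    return (cand, score) if score >= min_score else (None, score)

mapping = {col: best_match(col, master_cols) for col in slave_cols}
```

The loopholes show up right away: "Yr" shares no bigram with any master column and stays unmapped, and "Year" lands on "Acc year" only because that name is shorter than "Sales year", not because the method knows which year is meant.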

Return a python DataFrame with full rows containing the highest sales for each company

Suppose I have the following data table:
import pandas as pd
data = {'Company':['ELCO','ELCO','ELCO','BOBCO','BOBCO','BOBCO','LAMECO','LAMECO','LAMECO'],
'Person':['Sam','Mikey','Amy','Vanessa','Carl','Sarah','Emily','Laura','Steve'],
'Sales':[220,123,312,125,263,321,243,275,198]}
df = pd.DataFrame(data)
df
How would I go about extracting the data to end up with a table that shows just the highest 'Sales' for each company whilst keeping the full rows for those highest sales figures? In other words, how would I get the smaller DataFrame shown at the bottom of the attached image using conditional logic etc.?
[Image: DataFrame Outputs]
You want groupby().idxmax() and loc:
df.loc[df.groupby('Company').Sales.idxmax()]
Output:
Company Person Sales
5 BOBCO Sarah 321
2 ELCO Amy 312
7 LAMECO Laura 275
Note: The above gives you only one salesperson per company. If you want all salespeople with the maximum sales in each company, you need transform:
df[df['Sales'] == df.groupby('Company').Sales.transform('max')]
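The difference between the two expressions only shows up when there is a tie. The frame below is a hypothetical variation on the question's data in which two ELCO people share the top figure.

```python
import pandas as pd

# Hypothetical data with a tie at the top for ELCO
df = pd.DataFrame({
    "Company": ["ELCO", "ELCO", "BOBCO", "BOBCO"],
    "Person": ["Sam", "Amy", "Carl", "Sarah"],
    "Sales": [312, 312, 263, 321],
})

# idxmax returns one row label per company (the first occurrence on ties)
one_per_company = df.loc[df.groupby("Company")["Sales"].idxmax()]

# transform('max') broadcasts each company's maximum back onto every row,
# so the comparison keeps every row that reaches that maximum
all_top = df[df["Sales"] == df.groupby("Company")["Sales"].transform("max")]
```

Here one_per_company keeps only Sam for ELCO, while all_top keeps both Sam and Amy.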

Grouping values based on another column and summing those values together

I'm currently working on a mock analysis of a mock MMORPG's microtransaction data. This is an example of a few lines of the CSV file:
PID  Username  Age  Gender  ItemID  Item Name       Price
0    Jack78    20   Male    108     Spikelord       3.53
1    Aisovyak  40   Male    143     Blood Scimitar  1.56
2    Glue42    24   Male    92      Final Critic    4.88
Here's where things get dicey- I successfully use the groupby function to get a result where purchases are grouped by the gender of their buyers.
test = purchase_data.groupby(['Gender', "Username"])["Price"].mean().reset_index()
gets me the result (truncated for readability)
Gender Username Price
0 Female Adastirin33 $4.48
1 Female Aerithllora36 $4.32
2 Female Aethedru70 $3.54
...
29 Female Heudai45 $3.47
.. ... ... ...
546 Male Yadanu52 $2.38
547 Male Yadaphos40 $2.68
548 Male Yalae81 $3.34
What I'm aiming for currently is to find the average amount of money spent by each gender as a whole. How I imagine this would be done is by creating a method that checks for the male/female/other tag in front of a username, and then adds the average spent by that person to a running total which I can then manipulate later. Unfortunately, I'm very new to Python- I have no clue where to even begin, or if I'm even on the right track.
Addendum: jezrael misunderstood the intent of this question. While he provided me with a method to clean up my output series, he did not provide me a method or even a hint towards my main goal, which is to group together the money spent by gender (Females are shown in all but my first snippet, but there are males further down the csv file and I don't want to clog the page with too much pasta) and put them towards a single variable.
Addendum2: Another solution suggested by jezrael,
purchase_data.groupby(['Gender'])["Price"].sum().reset_index()
creates
Gender Price
0 Female $361.94
1 Male $1,967.64
2 Other / Non-Disclosed $50.19
Sadly, using figures from this new series (which would yield the average price per purchase recorded in this csv) isn't quite what I'm looking for, due to the fact that certain users have purchased multiple items in the file. I'm hunting for a solution that lets me pull from my test frame the average amount of money spent per user, separated and grouped by gender.
It sounds like you are thinking in terms of database tables. By default, groupby() does not return one: the group labels become the row index rather than ordinary columns. You can change that with the as_index argument to groupby():
mean = purchase_data.groupby(['Gender', 'Username'], as_index=False)['Price'].mean()
gender = mean.groupby('Gender', as_index=False)['Price'].mean()
Then what you want is probably gender[['Gender','Price']]
Basically, sum per user, then average (mean) per gender.
In one line:
print(df.groupby(['Gender','Username'])['Price'].sum().reset_index()[['Gender','Price']].groupby('Gender').mean())
Or in several lines:
df1 = df.groupby(['Gender','Username'])['Price'].sum().reset_index()
df2 = df1[['Gender','Price']].groupby('Gender').mean()
print(df2)
Some notes,
I read your example from the clipboard
import pandas as pd
df = pd.read_clipboard()
which required either a separator or item names without spaces. I put an extra space into space lord for the test. Normally, you should provide an example file good enough to run the test, so you'd need one with at least one female in it.
To get the average spent per person, first take the mean price for each username.
Then, to get the average of those per-user averages for each gender, group again:
df1 = df.groupby(['Gender', 'Username'])['Price'].mean().groupby(level='Gender').mean().reset_index()
df1[['Gender', 'Price']]
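To see why the two-step grouping matters, here is a small sketch with made-up rows (the usernames and prices are hypothetical). It compares the average total spent per user, by gender, with the plain average price per purchase.

```python
import pandas as pd

# Hypothetical purchases: users A and C each bought two items
purchases = pd.DataFrame({
    "Username": ["A", "A", "B", "C", "C"],
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "Price": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Step 1: total spent by each user (the index becomes Gender, Username)
per_user_total = purchases.groupby(["Gender", "Username"])["Price"].sum()

# Step 2: average of those per-user totals within each gender
avg_spend_per_user = per_user_total.groupby(level="Gender").mean()

# For comparison: average price per purchase, ignoring who bought it
avg_price_per_purchase = purchases.groupby("Gender")["Price"].mean()
```

With these rows the per-user figure is 3.0 for Male and 10.0 for Female, while the per-purchase figure is 2.0 and 5.0, so repeat buyers change the answer.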

I have a code here, I want to find the total number of females and males in a certain csv file.

import pandas as pd
df = pd.read_csv('admission_data.csv')
df.head()
female = 0
male = 0
for row in df:
    if df['gender']).any()=='female':
        female = female+1
    else:
        male = male+1
print (female)
print male
The CSV file has 5 columns. [Image: picture of the CSV file]
I want to find the total number of females and males, the total number admitted, and the number of females and males admitted.
This is the code I have tried, along with several variations of it, but none of them seem to work. Thank you.
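Your if logic is wrong: iterating over a DataFrame with for row in df yields the column names, not the rows, and the condition never looks at an individual row's gender.
There is no need for a loop at all.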
print(df['gender'].tolist().count('female'))
print(df['gender'].tolist().count('male'))
Alternatively you can use value_counts as @Wen suggested:
print(df['gender'].value_counts()['male'])
print(df['gender'].value_counts()['female'])
Rule of thumb: 99% of the time there is no need to use explicit loops when working with pandas. If you find yourself using one, there is most probably a better (and faster) way.
You just need value_counts
df['gender'].value_counts()
I created the below csv file:
student_id,gender,major,admitted
35377,female,chemistry,False
56105,male,physics,True
31441,female,chemistry,False
51765,male,physics,True
31442,female,chemistry,True
Reading the csv file into dataframe:
import pandas as pd
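df = pd.read_csv('D:/path/test1.csv', sep=',')
# Count admitted students per gender; assign the result so it can be displayed
admitted_counts = df[df['admitted']==True].groupby(['gender','admitted']).size().reset_index(name='count')
admitted_counts
   gender  admitted  count
0  female      True      1
1    male      True      2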
Hope this helps!
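As a sketch of how the totals and the admitted counts could be collected in one table, the sample CSV above is read from an in-memory string here (so no file path is needed). The variable names are illustrative.

```python
import io
import pandas as pd

# The sample rows from the answer above, read from a string instead of a file
csv_text = """student_id,gender,major,admitted
35377,female,chemistry,False
56105,male,physics,True
31441,female,chemistry,False
51765,male,physics,True
31442,female,chemistry,True
"""
df = pd.read_csv(io.StringIO(csv_text))

# Total number of students per gender
totals = df["gender"].value_counts()

# Number admitted per gender: summing a boolean column counts the True values
admitted = df.groupby("gender")["admitted"].sum()

# Both counts side by side, aligned on gender
summary = pd.DataFrame({"total": totals, "admitted": admitted})

# Overall number admitted
total_admitted = int(df["admitted"].sum())
```

This relies on read_csv parsing the True/False strings as booleans, which it does for this sample.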
I think you can use this:
# This line creates a data frame that contains only the rows where gender is male
count_male = df[df['gender'] == "male"]
# The second line counts how many values there are in the gender column
count_male['gender'].size
(or)
# Build a boolean mask and sum it; each True counts as 1
count_male = df['gender'] == "male"
count_male.sum()
Take the values in the column gender, store in a list, and count the occurrences:
import pandas as pd
df = pd.read_csv('admission_data.csv')
print(list(df['gender']).count('female'))
print(list(df['gender']).count('male'))

Acquire the data from a row in a Pandas

Instructions given by Professor:
1. Using the list of countries by continent from World Atlas data, load in the countries.csv file into a pandas DataFrame and name this data set as countries.
2. Using the data available on Gapminder, load in the Income per person (GDP/capita, PPP$ inflation-adjusted) as a pandas DataFrame and name this data set as income.
3. Transform the data set to have years as the rows and countries as the columns. Show the head of this data set when it is loaded.
4. Graphically display the distribution of income per person across all countries in the world for any given year (e.g. 2000). What kind of plot would be best?
In the code below, I have some of these tasks completed, but I'm having a hard time understanding how to acquire data from a DataFrame row. I want to be able to acquire data from a row and then plot it. It may seem like a trivial concept, but I've been at it for a while and need assistance please.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
countries = pd.read_csv('2014_data/countries.csv')
countries.head(n=3)
income = pd.read_excel('indicator gapminder gdp_per_capita_ppp.xlsx')
income = income.T
def graph_per_year(year):
    stryear = str(year)
    dfList = income[stryear].tolist()
graph_per_year(1801)
Pandas uses three types of indexing.
If you are looking to use integer indexing, you will need to use .iloc
df_1
Out[5]:
consId fan-cnt
0 1155696024483 34.0
1 1155699007557 34.0
2 1155694005571 34.0
3 1155691016680 12.0
4 1155697016945 34.0
df_1.iloc[1,:] #go to the row with index 1 and select all the columns
Out[8]:
consId 1.155699e+12
fan-cnt 3.400000e+01
Name: 1, dtype: float64
And to go to a particular cell, you can use something along the following lines,
df_1.iloc[1, 1]
Out[9]: 34.0
You should go through the documentation for the other types of indexing, namely .loc (label-based) and the older .ix, as suggested by sohier-dane. Note that .ix has been removed from recent pandas versions.
To answer your first question, a bar graph with a year selector would be best. Keep the countries on the y axis and the per capita income on the x axis, perhaps with a dropdown to select the year for which the graph is drawn.
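For the question's transposed frame, where years are the row labels, a row can be pulled out by label or by position. The sketch below uses a tiny made-up table (country names and values are hypothetical) and assumes the year labels are integers; if they are strings, the lookup would be income.loc['1801'] instead.

```python
import pandas as pd

# Hypothetical income table: one row per country, one column per year
raw = pd.DataFrame({
    "Country": ["A", "B", "C"],
    1800: [600.0, 700.0, 800.0],
    1801: [610.0, 705.0, 820.0],
})

# Transpose so that years are the rows and countries are the columns
income = raw.set_index("Country").T

# income[1801] would look for a column named 1801; rows need .loc or .iloc
row_by_label = income.loc[1801]     # label-based: the row whose index is 1801
row_by_position = income.iloc[1]    # position-based: the second row

# Plain Python list of that year's values, one per country
values = row_by_label.tolist()
```

The resulting Series (or the list) is what would be passed on to a plotting call for that year.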
