I tried to calculate the median and counts of a specific column of my data frame:
large_depts = df[df['Department'].isin(Departments_top10)]\
    [['Total', 'Department']]\
    .groupby('Department')\
    .agg([np.median, np.size])
print(large_depts)
It said:
ValueError: no results
But when I checked the dataframe, there were values in it:
large_depts = df[df['Department'].isin(Departments_top10)]\
    [['Total', 'Department']]
print(large_depts)
        Total                Department
0  677,680.65  Boston Police Department
1  250,893.61  Boston Police Department
2  208,676.89  Boston Police Department
3  319,319.93  Boston Police Department
4  577,123.44  Boston Police Department
I found out that something goes wrong when I try to group by, but I don't know why:
large_depts = df[df['Department'].isin(Departments_top10)]\
    [['Total', 'Department']]\
    .groupby('Department')
print(large_depts)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000000000D1C0A08>
Here's the data: https://data.boston.gov/dataset/418983dc-7cae-42bb-88e4-d56f5adcf869/resource/31358fd1-849a-48e0-8285-e813f6efbdf1/download/employeeearningscy18full.csv
You don't need to select the Department column again, and you can change np.size to 'count' too. Try this code:
df[df['Department'].isin(Departments_top10)].groupby('Department')['Total'].agg([np.median, 'count'])
You have a couple of errors going on in your code above.
Your Total column is not a numeric type (as you pointed out in the comments, it's a string). I'm assuming you can change your Total column permanently, and then your code may work. I don't have access to your data, so I can't fully check whether your groupby calls work.
Here's code to change each string into a list (as asked in the comments); I'm not sure this is what you really want.
str2lst = lambda s: s.split(",")                 # split each string on commas
df['Total'] = [str2lst(i) for i in df['Total']]
EDIT: After looking at your DataFrame (and realizing that Total is a number, not a list), I uncovered several rows that contain the column names as values. Remove those rows and convert the string values to float type:
# drop the rows that repeat the header as data
df.drop([12556, 22124, 22123, 22122, 22121, 22125], inplace=True)

# strip the thousands separators, then cast to float
str2float = lambda s: s.replace(',', '')
df['Total'] = [float(str2float(i)) for i in df['Total']]
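Equivalently, pandas can do the whole cleanup in one vectorized pass. A minimal sketch of the same idea (assuming every Total is either a number with thousands separators or a stray header value):
# strip thousands separators; coerce anything non-numeric (e.g. the stray
# header rows) to NaN instead of dropping them by index, then drop the NaNs
df['Total'] = pd.to_numeric(df['Total'].str.replace(',', ''), errors='coerce')
df = df.dropna(subset=['Total'])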
Now running agg() exactly how you have it in the question will work. Here's my results:
                                    Total
                                   median    size
Department
BPS Facility Management         53183.315   668.0
BPS Special Education           49875.830   831.0
BPS Substitute Teachers/Nurs     6164.070  1196.0
BPS Transportation              20972.770   506.0
Boston Cntr - Youth & Families  44492.625   584.0
In your last code entry, groupby needs an aggregation to go with it. Think about it intuitively: how are you grouping your variables? If I asked you to group a set of cards together, you'd ask how. By color? Number? Suit? You told Python to group by Department, but you didn't tell it how to aggregate each group, so Python returned a generic DataFrameGroupBy object.
Try doing df...groupby('Department').count() and you'll see df grouped by Department.
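To see what's happening, note that the grouped object is lazy; it only computes once you ask for an aggregation (a minimal sketch using the large_depts frame from the question):
grouped = large_depts.groupby('Department')
print(grouped)           # <pandas.core.groupby...DataFrameGroupBy object ...>
print(grouped.count())   # the aggregation is what triggers the computation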
Related
I have a pandas dataframe that consists of 5000 rows with different countries and emission data, and looks like the following:
country   year   emissions
peru      2020   1000
          2019   900
          2018   800
The country label is an index.
e.g. df = emissions.loc[['peru']]
would give me a new dataframe consisting only of the emission data attached to peru.
My goal is to use a variable name instead of 'peru' and store the country-specific emission data into a new dataframe.
What I'm looking for is code that would work the same way as the code below:
country = 'zanzibar'
df = emissions.loc[[{country}]]
From what I can tell, the problem arises with the loc function, which does not seem to accept variables as input. Is there a way I could circumvent this problem?
In other words, I want to be able to create a new dataframe with country-specific emission data, based on a variable that matches one of the countries in my emissions index, all without having to change anything but the given variable.
One way could be to iterate through or maybe create a function in some way?
Thank you in advance for any help.
An alternative approach where you don't use a country name for your index:
emissions = pd.DataFrame({'Country': ['Peru', 'Peru', 'Peru', 'Chile', 'Chile', 'Chile'],
                          'Year': [2021, 2020, 2019, 2021, 2020, 2019],
                          'Emissions': [100, 200, 400, 300, 200, 100]})
country = 'Peru'
Then to filter:
df = emissions[emissions.Country == country]
or
df = emissions.loc[emissions.Country == country]
Giving:
Country Year Emissions
0 Peru 2021 100
1 Peru 2020 200
2 Peru 2019 400
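If you ever need more than one country at a time, the same boolean-mask pattern works with isin (a quick sketch on the frame above):
df = emissions[emissions.Country.isin(['Peru', 'Chile'])]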
You should be able to select by a certain string for your index. For example:
df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=['Peru', 'Peru', 'zanzibar', 'zanzibar'])
country = 'zanzibar'
df.loc[[country]]
This will return:
          a
zanzibar  3
zanzibar  4
In your case, removing the curly braces (while keeping the double square brackets) should work:
country = 'zanzibar'
df = emissions.loc[[country]]
I don't know if this solution matches your question exactly; in this case I will give a solution for turning each country name into a variable.
Because a variable name can't contain a space (" ") character, you have to replace each space with an underscore ("_") character, just in case some of your 'country' values use more than one word.
Example: United Kingdom becomes United_Kingdom
by using this code:
df['country'] = df['country'].replace(' ', '_', regex=True)
Once your country names are in the new format, you can get all the country names from the dataframe using .unique() and store them in a new variable with this code:
country_name = df['country'].unique()
After running that code, all the unique values of the 'country' column are stored in an array called country_name.
Next, use a for loop to generate a new variable for each country name:
for i in country_name:
    locals()[i] = df[df['country'] == i]
Here, locals() is used to create a variable whose name comes from a string (the entries of country_name are strings), and df[df['country'] == i] subsets the dataframe to the rows whose country equals each unique value from country_name.
After that, there is a new variable for each country name in the 'country' column.
Hopefully this can help to solve your problem.
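As an aside, a plain dictionary keyed by country name is usually safer than creating variables dynamically with locals(). A minimal sketch of the same idea:
# one sub-dataframe per country, no locals() involved
country_dfs = {name: df[df['country'] == name] for name in df['country'].unique()}
peru = country_dfs['peru']   # look up any country by its name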
I have a Dataframe with the following columns:
"Country Name"
"Indicator Code"
"Porcentaje de Registros"
(as shown in the image) For each country there are 32 indicator codes, each with its percentage value. The values are ordered in a descending way, and I need to keep the 15 highest values for each country; that means, for example, for Ecuador I need to know which are the 15 indicators with the highest values. I was trying the following:
countries = gender['Country Name'].drop_duplicates().to_list()
for countries in countries:
    test = RelevantFeaturesByID[RelevantFeaturesByID['Country Name']==countries].set_index(["Country Name", "Indicator Code"]).iloc[0:15]
test
But it just returns the first 15 rows for one country.
What am I doing wrong?
There is a misspelling in the loop statement: for countries in countries: reuses the same name, and that is certainly a problem. You also overwrite test on every iteration.
I am not sure I fully understood your aim, but this seems to be a good basis to start from:
# sort with respect to countries and their percentage
df = df.sort_values(by=[df.columns[0], df.columns[-1]], ascending=[True, False])
# choose unique values of the country names
countries = df[df.columns[0]].unique()

test = []
for country in countries:
    test.append(df.loc[df["Country Name"] == country].iloc[0:15])
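If the frame is sorted as above, groupby().head() gives a more compact equivalent (a sketch, untested against your data):
# top 15 rows per country in one call
test = (df.sort_values(['Country Name', 'Porcentaje de Registros'], ascending=[True, False])
          .groupby('Country Name')
          .head(15))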
I'm using some data from Kaggle about blue plaques in Europe. Many of these plaques describe famous people, but others describe places, events, or animals. The dataframe includes the years of both birth and death for the famous people, and I have added a new column that displays the age of the lead subject at their time of death with the following code:
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
This works for some of the dataset, but since some of the subjects don't have values for the columns 'lead_subject_died_in' and 'lead_subject_born_in', some of my results are funky.
I was trying to determine the most common age of death with this:
agecount = plaques['subject_age'].value_counts()
print(agecount)
and I got some crazy results: negative numbers, ages over 600, etc. How do I make it only count the values for people who actually have data in both of those columns?
By the way, I'm a beginner, so if the operations you suggest are very difficult, please explain what they're doing so that I can learn and use it in the future!
You can use the dropna function to remove the NaN values in those two columns:
# remove NaN values from these 2 columns
plaques = plaques.dropna(subset=['lead_subject_died_in', 'lead_subject_born_in'])
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']

# get the most frequent age
plaques['subject_age'].value_counts().idxmax()

# get the five most common ages
plaques['subject_age'].value_counts().head()
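Worth knowing: value_counts() skips NaN by default, so the subtraction alone already leaves out people with a missing year; dropping the rows up front just makes the intent explicit. A quick check (sketch):
# ages computed without dropping rows; NaN entries simply don't show up in the counts
ages = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
print(ages.value_counts().head())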
I have a pandas dataframe top3 with data as in the image below.
Using the two columns, STNAME and CENSUS2010POP, I need to find the sum for Wyoming (91738 + 75450 + 46133 = 213321), then the sum for Wisconsin (1825699), West Virginia, and so on, summing up the 3 counties for each state (and I need to sort them in ascending order after that).
I have tried this code to compute the answer:
topres=top3.groupby('STNAME').sum().sort_values(['CENSUS2010POP'], ascending=False)
Maybe you can suggest a more efficient way to do it? Maybe with a lambda expression?
You can use groupby:
df.groupby('STNAME').sum()
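Adding the column selection and the sort from your question, a minimal sketch (selecting the column first means only the population values are summed):
# sum the three county rows per state, then sort ascending
topres = top3.groupby('STNAME')['CENSUS2010POP'].sum().sort_values()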
Note: I'm starting the problem at the point before the top 3 counties per state have been selected, rather than jumping straight to their sum.
I found it helpful with this problem to use a list selection.
I created a data frame view of the counties with:
counties_df=census_df[census_df['SUMLEV'] == 50]
and a separate one of the states so I could get at their names.
states_df=census_df[census_df['SUMLEV'] == 40]
Then I was able to create that sum of the populations of the top 3 counties per state, by looping over all states and summing the largest 3.
res = [(x, counties_df[counties_df['STNAME'] == x]
           .nlargest(3, ['CENSUS2010POP'])['CENSUS2010POP']
           .sum())
       for x in states_df['STNAME']]
I converted that result to a data frame
dfObj = pd.DataFrame(res)
named its columns
dfObj.columns = ['STNAME','POP3']
sorted in place
dfObj.sort_values(by=['POP3'], inplace=True, ascending=False)
and returned the first 3
return dfObj['STNAME'].head(3).tolist()
Definitely, groupby is a more compact way to do the above, but I found this approach helped me break down the steps (and the associated course had not yet covered groupby).
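For comparison, a groupby version of the same computation might look like this (a sketch, untested against the full census file):
# top 3 county populations per state, summed, then the 3 largest states
top3_sum = (counties_df.sort_values('CENSUS2010POP', ascending=False)
            .groupby('STNAME')['CENSUS2010POP']
            .apply(lambda s: s.head(3).sum()))
result = top3_sum.nlargest(3).index.tolist()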
I'm currently working on a mock analysis of a mock MMORPG's microtransaction data. This is an example of a few lines of the CSV file:
PID  Username  Age  Gender  ItemID  Item Name       Price
0    Jack78    20   Male    108     Spikelord       3.53
1    Aisovyak  40   Male    143     Blood Scimitar  1.56
2    Glue42    24   Male    92      Final Critic    4.88
Here's where things get dicey. I successfully use the groupby function to get a result where purchases are grouped by the gender of their buyers.
test = purchase_data.groupby(['Gender', "Username"])["Price"].mean().reset_index()
gets me the result (truncated for readability)
Gender Username Price
0 Female Adastirin33 $4.48
1 Female Aerithllora36 $4.32
2 Female Aethedru70 $3.54
...
29 Female Heudai45 $3.47
.. ... ... ...
546 Male Yadanu52 $2.38
547 Male Yadaphos40 $2.68
548 Male Yalae81 $3.34
What I'm aiming for currently is to find the average amount of money spent by each gender as a whole. How I imagine this would be done is by creating a method that checks for the male/female/other tag in front of a username, and then adds the average spent by that person to a running total which I can then manipulate later. Unfortunately, I'm very new to Python- I have no clue where to even begin, or if I'm even on the right track.
Addendum: jezrael misunderstood the intent of this question. While he provided me with a method to clean up my output series, he did not provide me a method or even a hint towards my main goal, which is to group together the money spent by gender (Females are shown in all but my first snippet, but there are males further down the csv file and I don't want to clog the page with too much pasta) and put them towards a single variable.
Addendum2: Another solution suggested by jezrael,
purchase_data.groupby(['Gender'])["Price"].sum().reset_index()
creates
Gender Price
0 Female $361.94
1 Male $1,967.64
2 Other / Non-Disclosed $50.19
Sadly, using figures from this new series (which would yield the average price per purchase recorded in this csv) isn't quite what I'm looking for, due to the fact that certain users have purchased multiple items in the file. I'm hunting for a solution that lets me pull from my test frame the average amount of money spent per user, separated and grouped by gender.
It sounds to me like you think in terms of database tables. groupby() does not return one by default: the group labels are presented as row indices rather than as a column. But you can make it behave that way instead (note the as_index argument to groupby()):
mean = purchase_data.groupby(['Gender', 'Username'], as_index=False).mean()
gender = mean.groupby(['Gender'], as_index=False).mean()
Then what you want is probably gender[['Gender','Price']]
Basically, sum up per user, then average (mean) up per gender.
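To see why the two-step mean matters, here's a tiny made-up example (hypothetical data, not from the question's file): a user with several purchases counts once per gender rather than once per purchase.
import pandas as pd

df = pd.DataFrame({'Gender':   ['Female', 'Female', 'Male'],
                   'Username': ['a1', 'a1', 'b1'],
                   'Price':    [2.0, 4.0, 5.0]})

per_user = df.groupby(['Gender', 'Username'], as_index=False)['Price'].mean()
per_gender = per_user.groupby('Gender', as_index=False)['Price'].mean()
print(per_gender)   # Female 3.0 (a1's two purchases averaged first), Male 5.0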
In one line
print(df.groupby(['Gender','Username']).sum()['Price'].reset_index()[['Gender','Price']].groupby('Gender').mean())
Or in some lines
df1 = df.groupby(['Gender','Username']).sum()['Price'].reset_index()
df2 = df1[['Gender','Price']].groupby('Gender').mean()
print(df2)
Some notes: I read your example from the clipboard with
import pandas as pd
df = pd.read_clipboard()
which requires either a separator or item names without spaces; I put an extra space into 'Spikelord' for the test. Normally, you should provide an example file good enough to run the test on, so you'd need one with at least one female in it.
To get the average spent per person, you first need to find the mean per username.
Then, to get the average of that per-user spend per gender, do a groupby again:
df1 = df.groupby(by=['Gender', 'Username']).mean().groupby(by='Gender').mean()
df1['Gender'] = df1.index
df1.reset_index(drop=True, inplace=True)
df1[['Gender', 'Price']]
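A slightly more concise equivalent (a sketch; reset_index keeps Gender as a column directly, so the manual index shuffling isn't needed, and selecting 'Price' first also sidesteps the non-numeric columns):
df1 = (df.groupby(['Gender', 'Username'])['Price'].mean()
         .groupby('Gender').mean()
         .reset_index())
print(df1)   # columns: Gender, Price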