Pandas Groupby with no aggregation for single result - python

I have a DataFrame in pandas that shows the percentage of men in each city/state. The dataframe df looks like the following (note: this is not my actual data, but my datatypes are similar):
STATE CITY PERC_MEN
ALABAMA ABBEVILLE 41.3%
ALABAMA ADAMSVILLE 53.5%
....
WYOMING WRIGHT 46.6%
Each state/city combination has exactly one PERC_MEN value.
How do I show the city/population values for a given state? My code looks like the following (I need the first line where I groupby STATE because I do other stuff with the data)
for state, state_df in df.groupby(by=['STATE']):
    print(state_df.groupby(by=['CITY'])['PERC_MEN'])
However this prints <pandas.core.groupby.generic.SeriesGroupBy object at 0xXXXXXXX>
Normally for a groupby I use an aggregate like mean() or sum(), but is there a way to just return the value?

I wouldn't iterate over the DataFrame. Set an index and slice instead:
df=df.set_index(['STATE','CITY'])
df.xs(('ALABAMA', 'ABBEVILLE'), level=['STATE','CITY'])
or
df.loc[('ALABAMA', 'ABBEVILLE'),:]
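As a quick sketch of the indexing approach (the data below is made up to mirror the question's table, not the asker's real file):

```python
import pandas as pd

# Toy data mirroring the question's table (not the asker's real file)
df = pd.DataFrame({
    "STATE": ["ALABAMA", "ALABAMA", "WYOMING"],
    "CITY": ["ABBEVILLE", "ADAMSVILLE", "WRIGHT"],
    "PERC_MEN": ["41.3%", "53.5%", "46.6%"],
}).set_index(["STATE", "CITY"])

# A full index tuple with .loc returns the single matching row,
# so the value comes back directly with no aggregation step
row = df.loc[("ALABAMA", "ABBEVILLE"), :]
print(row["PERC_MEN"])  # 41.3%
```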

Related

How to create a Dataframe from rows with conditions from another existing Dataframe using pandas?

So I have this problem, because of the size of the dataframe that I am working on, clearly, I cannot upload it, but it has the following structure:
    country  coastline  EU   highest
1   Norway   yes        yes  1500
2   Turkey   yes        no   20100
...
41  Bolivia  no         no   999
42  Japan    yes        no   89
I have to solve several exercises with Pandas. One of them is, for example, showing the countries with the maximum, minimum, and average of "highest", but only for the countries that belong to the EU. I already solved the maximum and the minimum, but for the average I thought about creating a new dataframe built from only the rows that contain a "yes" in the EU column. I've tried a few things, but they haven't worked.
I thought this is the best way to solve it, but if anyone has another idea, I'm looking forward to reading it.
By the way, these are the examples that I said that I was able to solve:
print('Minimum outside the EU')
paises[(paises.EU == "no")].sort_values(by=['highest'], ascending=[True]).head(1)
Which gives me this:
   country  coastline  EU  highest
3  Belarus  no         no  345
As a last condition, this must be solved using pandas, since it is basically the chapter that we are working on in classes.
If you want to create a new dataframe based on a filter of your first, you can do this:
new_df = df[df['EU'] == 'yes'].copy()
This will look at the 'EU' column in the original dataframe df and return only the rows where it is 'yes'. I think it is good practice to add the .copy(), since we can sometimes get strange side effects if we later make changes to new_df (though that probably wouldn't happen here).
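To finish the average step the asker describes, here is a minimal sketch, using made-up rows shaped like the question's table and the paises name from the question's own code:

```python
import pandas as pd

# Illustrative rows shaped like the question's table
paises = pd.DataFrame({
    "country": ["Norway", "Turkey", "Bolivia", "Japan"],
    "coastline": ["yes", "yes", "no", "yes"],
    "EU": ["yes", "no", "no", "no"],
    "highest": [1500, 20100, 999, 89],
})

# Filter to EU members first, then take the mean of 'highest'
eu_df = paises[paises["EU"] == "yes"].copy()
print(eu_df["highest"].mean())
```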

Use a for-loop to create a new DataFrame in Python?

I am working on the Cars.csv DataFrame which can be found here: https://www.kaggle.com/ljanjughazyan/cars1
My goal is to create a new Data Frame with the column names: USA, Europe and Japan and to save the number of cars that are in each category.
for a in list(cars.origin.unique()):
    x = pd.DataFrame({a: [cars.loc[cars["origin"] == a, "origin"].size]})
I tried it with this code, but as a result I obtain a DataFrame with only one column, which is "Europe". So it kind of works, but I can't figure out why it just dismisses the other values. Why doesn't it work, and can it be done using a for-loop?
Thanks in advance!
I assume "Europe" is the last item in your list, because you are resetting x in every iteration of your for-loop.
So if you print(x) inside the loop, you should first see a DataFrame with just USA, then just Japan and then just Europe, which is your final result.
I would suggest putting the data into a Python dict and creating your DataFrame afterwards.
data = {}
for a in list(cars.origin.unique()):
    data[a] = [cars.loc[cars["origin"] == a, "origin"].size]
x = pd.DataFrame(data)
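For this particular counting task, value_counts is an even shorter alternative. Sketched here with a stand-in cars frame, since the Kaggle file isn't reproduced:

```python
import pandas as pd

# Stand-in for the Kaggle data: one row per car
cars = pd.DataFrame({"origin": ["USA", "Europe", "Japan", "USA", "USA", "Japan"]})

# value_counts counts each origin in one step;
# to_frame().T reshapes the counts into a single-row DataFrame
x = cars["origin"].value_counts().to_frame().T
print(x)
```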

Filtering pandas dataframe column of numpy arrays by nan values

I have a pandas DataFrame
ID Unique_Countries
0 123 [Japan]
1 124 [nan]
2 125 [US,Brazil]
...
I got the Unique_Countries column by aggregating over unique countries from each ID group. There were many IDs with only 'NaN' values in the original country column. They are now displayed as what you see in row 1. I would like to filter on these but can't seem to. When I type
df.Unique_Countries[1]
I get
array([nan], dtype=object)
I have tried several methods including
isnull() and
isnan()
but it gets messed up because it is a numpy array.
If a cell can have NaN anywhere, not only in the first position, try explode with groupby(...).all():
df[df.Unique_Countries.explode().notna().groupby(level=0).all()]
OR
df[df.Unique_Countries.explode().notna().all(level=0)]
Let's try
df.Unique_Countries.str[0].isna() #'nan' is True
df.Unique_Countries.str[0].notna() #'nan' is False
To pick only the non-NaN rows, just use the mask above:
df[df.Unique_Countries.str[0].notna()]
I believe that answers based on the string method contains would fail if a country name contains the substring nan.
In my opinion the solution should be this:
df.explode('Unique_Countries').dropna().groupby('ID', as_index=False).agg(list)
This code drops NaN from your dataframe and returns the dataset in its original form. I am not sure from your question whether you want to drop the NaNs or to find the IDs of the records that have NaN in the Unique_Countries column; for the latter you can use something like this:
long_ss = df.set_index('ID').squeeze().explode()
long_ss[long_ss.isna()]
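A quick sketch of the explode-based filter on a mock of the aggregated frame (the arrays below stand in for the asker's data):

```python
import numpy as np
import pandas as pd

# Mock of the aggregated frame: each cell is an object array of countries
df = pd.DataFrame({
    "ID": [123, 124, 125],
    "Unique_Countries": [np.array(["Japan"], dtype=object),
                         np.array([np.nan], dtype=object),
                         np.array(["US", "Brazil"], dtype=object)],
})

# explode flattens the arrays, notna flags real countries, and
# groupby(level=0).all() folds that back to one boolean per row
mask = df["Unique_Countries"].explode().notna().groupby(level=0).all()
kept = df[mask]
print(kept["ID"].tolist())  # [123, 125]
```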

Creating a dictionary from a dataframe having duplicate column value

I have a dataframe df1 having more than 500k records:
state lat-long
Florida (12.34,34.12)
texas (13.45,56.0)
Ohio (-15,49)
Florida (12.04,34.22)
texas (13.35,56.40)
Ohio (-15.79,49.34)
Florida (12.8764,34.2312)
The lat-long value can differ for a particular state. I need a dictionary like the one below; when a state appears more than once, I need to capture its first occurrence:
dict_state_lat_long = {"Florida":"(12.34,34.12)","texas":"(13.45,56.0)","Ohio":"(-15,49)"}
How can I get this in the most efficient way?
You can use DataFrame.groupby to group the dataframe by state, then apply the aggregate function first to select the first occurrence of lat-long in each group.
Then you can use to_dict() to convert the result to a Python dict.
Use this:
grouped = df.groupby("state")["lat-long"].agg("first")
dict_state_lat_long = grouped.to_dict()
print(dict_state_lat_long)
Output:
{'Florida': '(12.34,34.12)', 'Ohio': '(-15,49)', 'texas': '(13.45,56.0)'}
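An alternative sketch that avoids groupby entirely: drop_duplicates keeps the first row per state by default (sample data copied from the question):

```python
import pandas as pd

# Sample rows from the question; later duplicates must be ignored
df = pd.DataFrame({
    "state": ["Florida", "texas", "Ohio", "Florida"],
    "lat-long": ["(12.34,34.12)", "(13.45,56.0)", "(-15,49)", "(12.04,34.22)"],
})

# keep='first' is the default, so each state's first lat-long wins
d = df.drop_duplicates("state").set_index("state")["lat-long"].to_dict()
print(d)  # {'Florida': '(12.34,34.12)', 'texas': '(13.45,56.0)', 'Ohio': '(-15,49)'}
```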

How to sum up every 3 rows by column in Pandas Dataframe Python

I have a pandas dataframe top3 with data as in the image below.
Using the two columns STNAME and CENSUS2010POP, I need to find the sum for Wyoming (91738 + 75450 + 46133 = 213321), then the sum for Wisconsin (1825699), West Virginia, and so on, summing the 3 counties for each state (and sorting them in ascending order after that).
I have tried this code to compute the answer:
topres=top3.groupby('STNAME').sum().sort_values(['CENSUS2010POP'], ascending=False)
Maybe you can suggest a more efficient way to do it? Maybe with a lambda expression?
You can use groupby:
df.groupby('STNAME').sum()
Note: this assumes top3 already contains only the top 3 counties per state, and jumps straight to their sum.
I found it helpful with this problem to use a list comprehension.
I created a data frame view of the counties with:
counties_df=census_df[census_df['SUMLEV'] == 50]
and a separate one of the states so I could get at their names.
states_df=census_df[census_df['SUMLEV'] == 40]
Then I was able to create that sum of the populations of the top 3 counties per state, by looping over all states and summing the largest 3.
res = [(x, counties_df[(counties_df['STNAME']==x)].nlargest(3,['CENSUS2010POP'])['CENSUS2010POP'].sum()) for x in states_df['STNAME']]
I converted that result to a data frame
dfObj = pd.DataFrame(res)
named its columns
dfObj.columns = ['STNAME','POP3']
sorted in place
dfObj.sort_values(by=['POP3'], inplace=True, ascending=False)
and returned the first 3
return dfObj['STNAME'].head(3).tolist()
Definitely groupby is a more compact way to do the above, but I found this way helped me break down the steps (and the associated course had not yet dealt with groupby).
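For comparison, the whole pipeline above can be sketched with groupby and nlargest (toy census rows here; the real dataset has many more counties per state):

```python
import pandas as pd

# Toy census rows: SUMLEV 50 marks county-level records
census_df = pd.DataFrame({
    "SUMLEV": [50, 50, 50, 50, 50],
    "STNAME": ["Wyoming", "Wyoming", "Wyoming", "Wyoming", "Wisconsin"],
    "CENSUS2010POP": [91738, 75450, 46133, 10000, 1825699],
})

counties = census_df[census_df["SUMLEV"] == 50]

# Within each state, sum the three largest county populations
top3_sum = (counties.groupby("STNAME")["CENSUS2010POP"]
                    .apply(lambda s: s.nlargest(3).sum()))
print(top3_sum.sort_values(ascending=False).head(3))
```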
