How to groupby a dataframe when a row value is not present? - Python

I have a dataframe of country-wise open and solved complaints:

country   complaints              status
india     network issue           solved
usa       internet speed issue    open
uk        network issue           open
india     internet speed issue    solved
usa       network issue           open
uk        voice issue             solved
I want to group by country where status is "open". I tried:

df = df[df.status == "open"]

then

df.groupby("country", as_index=True).count()
The output I got is:

country   complaints
usa       2
uk        1

but the output I want is:

country   complaints
usa       2
uk        1
india     0

Since india has no open complaints, it disappears after the groupby. How can I take the data such that the groupby also returns india with a count of 0?

You can do:
df['status'].eq('open').groupby(df['country']).sum()
Output:
country
india 0
uk 1
usa 2
Name: status, dtype: int64
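Since .eq('open') converts status to booleans before grouping, countries whose rows are all False still appear with a sum of 0, which is why india survives here. A self-contained sketch rebuilding the question's data:

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    "country": ["india", "usa", "uk", "india", "usa", "uk"],
    "complaints": ["network issue", "internet speed issue", "network issue",
                   "internet speed issue", "network issue", "voice issue"],
    "status": ["solved", "open", "open", "solved", "open", "solved"],
})

# eq('open') yields a boolean Series; grouping it by country and summing
# counts the True values per group, so all-False groups sum to 0
open_counts = df["status"].eq("open").groupby(df["country"]).sum()
print(open_counts)
# country
# india    0
# uk       1
# usa      2
# Name: status, dtype: int64
```

The key point is that no rows are filtered away before the groupby, so every country keeps at least one (boolean) row to aggregate.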

If you only want to count the country values then, instead of groupby, you can use the value_counts() method, which counts occurrences of each value. Note that after filtering, countries with no open complaints drop out entirely, so reindex against the full list of countries to bring india back with 0. Your code will look like this:

open_df = df[df.status == "open"]
open_df.country.value_counts().reindex(df.country.unique(), fill_value=0)

Then your output will become:

india    0
usa      2
uk       1

Related

How to create a Pandas dataframe from another column in a dataframe by splitting it?

I have the following source dataframe:

Person  Country  Is Rich?
0       US       Yes
1       India    No
2       India    Yes
3       US       Yes
4       US       Yes
5       India    No
6       US       No
7       India    No
I need to convert it to another dataframe for plotting a bar graph like the one below (a bar chart of economic status per country), so the data is easy to access.
The dataframe to be created is like this:
Country  Rich  Poor
US       3     1
India    1     3

I am new to Pandas and exploratory data science. Please help!
You can try pivot_table
df['Is Rich?'] = df['Is Rich?'].replace({'Yes': 'Rich', 'No': 'Poor'})
out = df.pivot_table(index='Country', columns='Is Rich?', values='Person', aggfunc='count')
print(out)
Is Rich? Poor Rich
Country
India 3 1
US 1 3
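As a variation not shown in the answers, pd.crosstab builds the same count table directly from two columns, with no values column or aggfunc needed; a sketch on data matching the question:

```python
import pandas as pd

# Rebuild the question's data
df = pd.DataFrame({
    "Person": range(8),
    "Country": ["US", "India", "India", "US", "US", "India", "US", "India"],
    "Is Rich?": ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "No"],
})

# crosstab counts each (row, column) combination directly
out = pd.crosstab(df["Country"], df["Is Rich?"].map({"Yes": "Rich", "No": "Poor"}))
print(out)
```

This gives the same Poor/Rich counts per country as the pivot_table call, without needing the Person column at all.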
You could do:
converted = df.assign(Rich=df['Is Rich?'].eq('Yes')).eval('Poor = ~Rich').groupby('Country').agg({'Rich': 'sum', 'Poor': 'sum'})
print(converted)
Rich Poor
Country
India 1 3
US 3 1
However, if you want to plot it as a barplot, the following format might work best with a plotting library like seaborn:
plot_df = converted.reset_index().melt(id_vars='Country', value_name='No. of people', var_name='Status')
print(plot_df)
Country Status No. of people
0 India Rich 1
1 US Rich 3
2 India Poor 3
3 US Poor 1
Then, with seaborn:
import seaborn as sns
sns.barplot(x='Country', hue='Status', y='No. of people', data=plot_df)
Resulting plot: a grouped bar chart with Rich/Poor bars side by side for each country.

Return the highest frequency using pandas

I have a dataframe
name country gender
Ada US 1
Aby UK 0
Alan US 0
Eli US 1
Eddy US 1
Bing NW 0
Bing US 1
Eli UK 0
Eli US 0
Alan US 1
Ada UK 0
Some names are assigned different genders and countries, e.g. Eli appears with US and 1 and also with UK and 0.
I have used:

df.groupby('name')['gender']
df.groupby('name')['country']

After the groupby, I am hoping to return the "gender" and "country" with the highest frequency. For example, if Eli has two US and one UK, then the country should be US. The same rule applies to gender.
For gender I used a >= 0.5 rule:

df = df_inv.groupby('name')['gender'].mean()
df = df.reset_index()
df['gender'] = (df['gender'] >= 0.5).astype(int)
Is there easier way to write this code? Also, is there any solution for categorical variable like country?
You should group by two properties (name and country/gender), build a table, and choose the column with the maximum value in each row:
df.groupby(['name','country']).size().unstack().idxmax(1)
#name
#Aby UK
#Ada UK
#Alan US
#Bing NW
#Eddy US
#Eli US
df.groupby(['name','gender']).size().unstack().idxmax(1)
#name
#Aby 0
#Ada 0
#Alan 0
#Bing 0
#Eddy 1
#Eli 0
You can later join the results if you want.
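Joining the two results can be done with pd.concat; a sketch on the question's data (the column names in the combined frame are my own choice):

```python
import pandas as pd

# Rebuild the question's data
df = pd.DataFrame({
    "name": ["Ada", "Aby", "Alan", "Eli", "Eddy", "Bing",
             "Bing", "Eli", "Eli", "Alan", "Ada"],
    "country": ["US", "UK", "US", "US", "US", "NW",
                "US", "UK", "US", "US", "UK"],
    "gender": [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
})

# Most frequent country and gender per name, as in the answer above
country = df.groupby(["name", "country"]).size().unstack().idxmax(axis=1)
gender = df.groupby(["name", "gender"]).size().unstack().idxmax(axis=1)

# Stick the two Series side by side, one row per name
result = pd.concat({"country": country, "gender": gender}, axis=1)
print(result)
```

On ties (e.g. Alan with one 0 and one 1), idxmax keeps the first column, which matches the output shown above.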
We can do groupby with the mode function via agg:

df = df.groupby('name').agg({'country': lambda x: x.mode()[0], 'gender': lambda x: int(x.mean() > 0.5)})
Out[154]:
country gender
name
Aby UK 0
Ada UK 0
Alan US 0
Bing NW 0
Eddy US 1
Eli US 0
This should also do the work, using the most frequent value per group (note that max() would return the largest value, not the most common one); please check and confirm:

a = df.groupby('name')['gender'].agg(lambda s: s.mode()[0]).to_frame().reset_index()
b = df.groupby('name')['country'].agg(lambda s: s.mode()[0]).to_frame().reset_index()
df = b
df['gender'] = a['gender']
del a, b

Creating a new variable by aggregation in Python

I'm pretty new to Python and pandas, and know only the basics. I'm currently conducting research and I need your kind help.
Let’s say I have data on births, containing 2 variables: Date and Country.
Date Country
1.1.20 USA
1.1.20 USA
1.1.20 Italy
1.1.20 England
2.1.20 Italy
2.1.20 Italy
3.1.20 USA
3.1.20 USA
Now I want to create a third variable, let's call it 'Births', which contains the number of births in a country on a given date. In other words, I want to keep just one row for each date+country combination by counting the rows in each combination, so I end up with something like this:
Date Country Births
1.1.20 USA 2
1.1.20 Italy 1
1.1.20 England 1
2.1.20 Italy 2
3.1.20 USA 2
I’ve tried many things, but nothing seemed to work. Any help will be much appreciated.
Thanks,
Eran
I guess you can use the groupby method of your DataFrame, then use the size method to count the number of individuals in each group :
df.groupby(by=['Date', 'Country']).size().reset_index(name='Births')
Output:
Date Country Births
0 1.1.20 England 1
1 1.1.20 Italy 1
2 1.1.20 USA 2
3 2.1.20 Italy 2
4 3.1.20 USA 2
Also, the pandas documentation has several examples related to group-by operations: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
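For completeness, here is the suggested groupby/size approach run end-to-end on data rebuilt from the question, plus a one-call equivalent via DataFrame.value_counts (available in pandas 1.1+), which sorts by count rather than by key:

```python
import pandas as pd

# Rebuild the question's data
df = pd.DataFrame({
    "Date": ["1.1.20", "1.1.20", "1.1.20", "1.1.20",
             "2.1.20", "2.1.20", "3.1.20", "3.1.20"],
    "Country": ["USA", "USA", "Italy", "England", "Italy", "Italy", "USA", "USA"],
})

# size() counts the rows in each (Date, Country) group;
# reset_index(name=...) turns the counts into a 'Births' column
births = df.groupby(["Date", "Country"]).size().reset_index(name="Births")
print(births)

# Equivalent counting in one call (pandas >= 1.1), sorted by count descending
vc = df.value_counts(["Date", "Country"])
```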

How to output the top 5 of a specific column along with associated columns using python?

I've tried to use df2.nlargest(5, ['1960']), which gives me:
Country Name Country Code ... 2017 2018
0 IDA & IBRD total IBT ... 6335039629.0000 6412522234.0000
1 Low & middle income LMY ... 6306560891.0000 6383958209.0000
2 Middle income MIC ... 5619111361.0000 5678540888.0000
3 IBRD only IBD ... 4731120193.0000 4772284113.0000
6 Upper middle income UMC ... 2637690770.0000 2655635719.0000
This is somewhat right, but it outputs all the columns. I just want it to include only the columns "Country Name" and "1960", sorted by "1960".
So the output should look like this...
Country Name 1960
China 5000000000
India 499999999
USA 300000
France 100000
Germany 90000
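One way to get that output (a sketch, using a made-up frame standing in for df2, since the real data is not shown) is to subset the two columns before calling nlargest:

```python
import pandas as pd

# Hypothetical stand-in for df2 from the question
df2 = pd.DataFrame({
    "Country Name": ["China", "India", "USA", "France", "Germany", "Spain"],
    "Country Code": ["CHN", "IND", "USA", "FRA", "DEU", "ESP"],
    "1960": [5000000000, 499999999, 300000, 100000, 90000, 5000],
})

# Select only the two wanted columns, then take the five largest by "1960";
# nlargest already returns the rows sorted in descending order
top5 = df2[["Country Name", "1960"]].nlargest(5, "1960")
print(top5)
```

Subsetting after nlargest (df2.nlargest(5, '1960')[['Country Name', '1960']]) gives the same result; doing it first just avoids carrying the unused columns along.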

Pandas read_html returned column with NaN values in Python

I am trying to parse the table located here using pandas' read_html function. I was able to parse the table; however, the Capacity column came back as NaN, and I am not sure what the reason could be. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far:
wiki_url='Above url'
df1=pd.read_html(wiki_url,index_col=0)
Try something like this (include flavor as bs4):
df = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df = df[0]
print(df.head())
Image Stadium City State \
0 NaN Aggie Memorial Stadium Las Cruces NM
1 NaN Alamodome San Antonio TX
2 NaN Alaska Airlines Field at Husky Stadium Seattle WA
3 NaN Albertsons Stadium Boise ID
4 NaN Allen E. Paulson Stadium Statesboro GA
Team Conference Capacity \
0 New Mexico State Independent 30,343[1]
1 UTSA C-USA 65000
2 Washington Pac-12 70,500[2]
3 Boise State Mountain West 36,387[3]
4 Georgia Southern Sun Belt 25000
.............................
.............................
To remove anything inside square brackets use:
df.Capacity = df.Capacity.str.replace(r"\[.*\]", "", regex=True)
print(df.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Hope this helps.
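Even after the replace above, the Capacity values are still strings with thousands separators. A sketch (on hypothetical values matching the printed output) that strips both the footnotes and the commas, then converts to integers:

```python
import pandas as pd

# Hypothetical Capacity values as read_html returns them:
# bracketed footnotes and comma thousands separators mixed in
cap = pd.Series(["30,343[1]", "65000", "70,500[2]", "36,387[3]", "25000"],
                name="Capacity")

# Drop the bracketed footnotes, remove commas, then cast to int
clean = (cap.str.replace(r"\[.*?\]", "", regex=True)
            .str.replace(",", "", regex=False)
            .astype(int))
print(clean)
```

With a numeric dtype, the column can then be sorted or aggregated normally.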
Pandas is only able to get the superscript (for whatever reason) rather than the actual value. If you print all of df1 and check the Capacity column, you will see that some of the values are [1], [2], etc. (if they have footnotes) and NaN otherwise.
You may want to look into alternatives for fetching the data, or scrape the data yourself using BeautifulSoup, since pandas is picking up, and therefore returning, the wrong data.
The answer posted by #anky_91 was correct. I wanted to try another approach without using regex. Below is my solution:
df4=pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df4 = df4[0]
The solution was to take out the "r" presented by #anky_91 in line 1 and line 4:
print(df4.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Name: Capacity, dtype: object
