pandas - top count items after groupby on multiple columns

pandas - top count items after groupby on multiple columns - python

I have dataframe data grouped by two columns (X, Y) and then I have count of elements in Z. Idea here is to find the top 2 counts of elements across X, Y.
Dataframe should look like:
mostCountYInX = df.groupby(['X','Y'],as_index=False).count()
C X Y Z
USA NY NY 5
USA NY BR 14
USA NJ JC 40
USA FL MI 3
IND MAH MUM 4
IND KAR BLR 2
IND KER TVM 2
CHN HK HK 3
CHN SH SH 3
Individually, I can extract the information I am looking for:
XTopCountInTopY = mostCountYInX[mostCountYInX['X'] == 'NY']
XTopCountInTopY = XTopCountInTopY.nlargest(2,'Y')
In the above I knew group I am looking for which is X = NY and got the top 2 records. Is there a way to print them together?
Say I am interested in IND and USA then the Output expected:
C X Y Z
USA NJ JC 40
USA NY BR 14
IND MAH MUM 4
IND KAR BLR 2

I think you need groupby on index with parameter sort=False then apply using lambda function and sort_values on Z using parameter ascending=False then take top 2 values and reset_index as:
mask = df.index.isin(['USA','IND'])
df = df[mask].groupby(df[mask].index,sort=False).\
apply(lambda x: x.sort_values('Z',ascending=False)[:2]).\
reset_index(level=0,drop=True)
print(df)
X Y Z
USA NJ JC 40
USA NY BR 14
IND MAH MUM 4
IND KAR BLR 2
EDIT : After OP changed the Dataframe:
mask = df['C'].isin(['USA','IND'])
df = df[mask].groupby('C',sort=False).\
apply(lambda x: x.sort_values('Z',ascending=False)[:2]).\
reset_index(drop=True)
print(df)
C X Y Z
0 USA NJ JC 40
1 USA NY BR 14
2 IND MAH MUM 4
3 IND KAR BLR 2

Related

Replace the values in a column based on frequency

I have a dataframe (3.7 million rows) with a column with different country names
id Country
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 INDIA
6 USA
7 USA
8 ITALY
9 USA
10 RUSSIA
I want to replace INDIA and ITALY with "Miscellanous" because they occur less than 15% in the column
My alternate solution is to replace the names with there frequency using
df.column_name = df.column_name.map(df.column_name.value_counts())

Use:
df.loc[df.groupby('Country')['id']
.transform('size')
.div(len(df))
.lt(0.15),
'Country'] = 'Miscellanous'
Or
df.loc[df['Country'].map(df['Country'].value_counts(normalize=True)
.lt(0.15)),
'Country'] = 'Miscellanous'

If you want to put all country whose frequency is less than a threshold into the "Misc" category:
threshold = 0.15
freq = df['Country'].value_counts(normalize=True)
mappings = freq.index.to_series().mask(freq < threshold, 'Misc').to_dict()
df['Country'].map(mappings)

Here is another option
s = df.value_counts()
s = s/s.sum()
s = s.loc[s<.15].reset_index()
df = df.replace(s['Place'].tolist(),'Miscellanous')

You can use dictionary and map for this:
d = df.Country.value_counts(normalize=True).to_dict()
df.Country.map(lambda x : x if d[x] > 0.15 else 'Miscellanous' )
Output:
id
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 Miscellanous
6 USA
7 USA
8 Miscellanous
9 USA
10 RUSSIA
Name: Country, dtype: object

pandas Plot area by specifiying index column

My data looks like:
Club Count
0 AC Milan 2
1 Ajax 1
2 FC Barcelona 4
3 Bayern Munich 2
4 Chelsea 1
5 Dortmund 1
6 FC Porto 1
7 Inter Milan 1
8 Juventus 1
9 Liverpool 2
10 Man U 2
11 Real Madrid 7
I'm trying to plot an Area plot using Club as the X Axis, when plotting all data, it looks correct but the X axis displayed is the index and not the Clubs.
When specifying the index as Club(index=x), it shows correct, but the scale of the y axis is set from 0 to 0.05, assuming that's why nothing is displayed since the count is from 1 to 7 any suggestions ?
Code used:
data.columns = ['Club', 'Count']
x=data.Club
y=data.Count
print(data)
ax.margins(0, 10)
data.plot.area()
df = pd.DataFrame(y,index=x)
df.plot.area()
results:

Change to
df = pd.Series(y,index=x)
df.plot.area()

Folium FeatureGroup in Python

I am trying to create maps using Folium Feature group. The feature group will be from a pandas dataframe row. I am able to achieve this when there is one data in the dataframe. But when there are more than 1 in the dataframe, and loop through it in the for loop I am not able to acheive what I want. Please find attached the code in Python.
from folium import Map, FeatureGroup, Marker, LayerControl
mapa = Map(location=[35.11567262307692,-89.97423444615382], zoom_start=12,
tiles='Stamen Terrain')
feature_group1 = FeatureGroup(name='Tim')
feature_group2 = FeatureGroup(name='Andrew')
feature_group1.add_child(Marker([35.035075, -89.89969], popup='Tim'))
feature_group2.add_child(Marker([35.821835, -90.70503], popup='Andrew'))
mapa.add_child(feature_group1)
mapa.add_child(feature_group2)
mapa.add_child(LayerControl())
mapa
My dataframe contains the following:
Name Address
0 Dollar Tree #2020 3878 Goodman Rd.
1 Dollar Tree #2020 3878 Goodman Rd.
2 National Guard Products Inc 4985 E Raines Rd
3 434 SAVE A LOT C MID WEST 434 Kelvin 3240 Jackson Ave
4 WALGREENS 06765 108 E HIGHLAND DR
5 Aldi #69 4720 SUMMER AVENUE
6 Richmond, Christopher 1203 Chamberlain Drive
City State Zipcode Group
0 Horn Lake MS 38637 Johnathan Shaw
1 Horn Lake MS 38637 Tony Bonetti
2 Memphis TN 38118 Tony Bonetti
3 Memphis TN 38122 Tony Bonetti
4 JONESBORO AR 72401 Josh Jennings
5 Memphis TN 38122 Josh Jennings
6 Memphis TN 38119 Josh Jennings
full_address Color sequence \
0 3878 Goodman Rd.,Horn Lake,MS,38637,USA blue 1
1 3878 Goodman Rd.,Horn Lake,MS,38637,USA cadetblue 1
2 4985 E Raines Rd,Memphis,TN,38118,USA cadetblue 2
3 3240 Jackson Ave,Memphis,TN,38122,USA cadetblue 3
4 108 E HIGHLAND DR,JONESBORO,AR,72401,USA yellow 1
5 4720 SUMMER AVENUE,Memphis,TN,38122,USA yellow 2
6 1203 Chamberlain Drive,Memphis,TN,38119,USA yellow 3
Latitude Longitude
0 34.962637 -90.069019
1 34.962637 -90.069019
2 35.035367 -89.898428
3 35.165115 -89.952624
4 35.821835 -90.705030
5 35.148707 -89.903760
6 35.098829 -89.866838
The same when I am trying to loop through in the for loop, I am not able to achieve what I need. :
from folium import Map, FeatureGroup, Marker, LayerControl
mapa = Map(location=[35.11567262307692,-89.97423444615382], zoom_start=12,tiles='Stamen Terrain')
#mapa.add_tile_layer()
for i in range(0,len(df_addresses)):
feature_group = FeatureGroup(name=df_addresses.iloc[i]['Group'])
feature_group.add_child(Marker([df_addresses.iloc[i]['Latitude'], df_addresses.iloc[i]['Longitude']],
popup=('Address: ' + str(df_addresses.iloc[i]['full_address']) + '<br>'
'Tech: ' + str(df_addresses.iloc[i]['Group'])),
icon = plugins.BeautifyIcon(
number= str(df_addresses.iloc[i]['sequence']),
border_width=2,
iconShape= 'marker',
inner_icon_style= 'margin-top:2px',
background_color = df_addresses.iloc[i]['Color'],
)))
mapa.add_child(feature_group)
mapa.add_child(LayerControl())

This is an example dataset because I didn't want to format your df. That said, I think you'll get the idea.
print(df_addresses)
Latitude Longitude Group
0 34.962637 -90.069019 B
1 34.962637 -90.069019 B
2 35.035367 -89.898428 A
3 35.165115 -89.952624 B
4 35.821835 -90.705030 A
5 35.148707 -89.903760 A
6 35.098829 -89.866838 A
After I create the map object(maps), I perform a groupby on the group column. I then iterate through each group. I first create a FeatureGroup with the grp_name(A or B). And for each group, I iterate through that group's dataframe and create Markers and add them to the FeatureGroup
mapa = folium.Map(location=[35.11567262307692,-89.97423444615382], zoom_start=12,
tiles='Stamen Terrain')
for grp_name, df_grp in df_addresses.groupby('Group'):
feature_group = folium.FeatureGroup(grp_name)
for row in df_grp.itertuples():
folium.Marker(location=[row.Latitude, row.Longitude]).add_to(feature_group)
feature_group.add_to(mapa)
folium.LayerControl().add_to(mapa)
mapa

Regarding the stamenterrain query, if you're referring to the appearance in the control box you can remove this by declaring your map with tiles=None and adding the TileLayer separately with control set to false: folium.TileLayer('Stamen Terrain', control=False).add_to(mapa)

A query using pandas

Hello
i need to create a query that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE 2014 , using pandas This function should return a 5x2 DataFrame with the columns = ['STNAME', 'CTYNAME'] and the same index ID as the census_df (sorted ascending by index)
you'll find a description of my data in the picture :

Consider the following demo:
In [19]: df
Out[19]:
REGION STNAME CTYNAME POPESTIMATE2014 POPESTIMATE2015
0 0 Washington Washington 10 12
1 1 Washington Washington County 11 13
2 2 Alabama Alabama County 13 15
3 4 Alaska Alaska 14 12
4 3 Montana Montana 10 11
5 2 Washington Washington 15 19
In [20]: qry = "REGION in [1,2] and POPESTIMATE2015 > POPESTIMATE2014 and CTYNAME.str.contains('^Washington')"
In [21]: df.query(qry, engine='python')[['STNAME', 'CTYNAME']]
Out[21]:
STNAME CTYNAME
1 Washington Washington County
5 Washington Washington

Use boolean indexing with mask created by isin and startswith:
mask = df['REGION'].isin([1,2]) &
df['COUNTY'].str.startswith('Washington') &
(df['POPESTIMATE2015'] > df['POPESTIMATE2014'])
df = df.loc[mask, ['STNAME', 'CTYNAME']]

Pandas read JSON into Excel

I am trying to parse JSON data from an URL. I have fetched the data and parsed it into a dataframe. From the looks of it, I am missing a step.
Data Returns in JSON format in excel but my data frame returns two columns: entry number and JSON Text
import urllib.request
import json
import pandas
with urllib.request.urlopen("https://raw.githubusercontent.com/gavinr/usa-
mcdonalds-locations/master/mcdonalds.geojson") as url:
data = json.loads(url.read().decode())
print(data)
json_parsed = json.dumps(data)
print(json_parsed)
df=pandas.read_json(json_parsed)
writer = pandas.ExcelWriter('Mcdonaldsstorelist.xlsx')
df.to_excel(writer,'Sheet1')
writer.save()

I believe you can use json_normalize:
df = pd.io.json.json_normalize(data['features'])
df.head()
geometry.coordinates geometry.type properties.address \
0 [-80.140924, 25.789141] Point 1601 ALTON RD
1 [-80.218683, 25.765501] Point 1400 SW 8TH ST
2 [-80.185108, 25.849872] Point 8116 BISCAYNE BLVD
3 [-80.37197, 25.550894] Point 23351 SW 112TH AVE
4 [-80.36734, 25.579132] Point 10855 CARIBBEAN BLVD
properties.archCard properties.city properties.driveThru \
0 Y MIAMI BEACH Y
1 Y MIAMI Y
2 Y MIAMI Y
3 N HOMESTEAD Y
4 Y MIAMI Y
properties.freeWifi properties.phone properties.playplace properties.state \
0 Y (305)672-7055 N FL
1 Y (305)285-0974 Y FL
2 Y (305)756-0400 N FL
3 Y (305)258-7837 N FL
4 Y (305)254-3487 Y FL
properties.storeNumber properties.storeType properties.storeUrl \
0 14372 FREESTANDING http://www.mcflorida.com/14372
1 7408 FREESTANDING http://www.mcflorida.com/7408
2 11511 FREESTANDING http://www.mcflorida.com/11511
3 34014 FREESTANDING NaN
4 12215 FREESTANDING http://www.mcflorida.com/12215
properties.zip type
0 33139-2420 Feature
1 33135 Feature
2 33138 Feature
3 33032 Feature
4 33157 Feature
df.columns
Index(['geometry.coordinates', 'geometry.type', 'properties.address',
'properties.archCard', 'properties.city', 'properties.driveThru',
'properties.freeWifi', 'properties.phone', 'properties.playplace',
'properties.state', 'properties.storeNumber', 'properties.storeType',
'properties.storeUrl', 'properties.zip', 'type'],
dtype='object')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

pandas - top count items after groupby on multiple columns - python

Related

Replace the values in a column based on frequency

pandas Plot area by specifiying index column

Folium FeatureGroup in Python

A query using pandas

Pandas read JSON into Excel

Categories

Resources