Can pandas be used to pivot/transpose a data frame

I currently have a table that I need to transform. The data set looks like this:
Period Course Teacher
0 1 Sci Teach1
1 2 Art Teach1
2 H Span Teach1
3 1 PE Teach2
4 2 CS Teach2
5 H Conf Teach2
6 1 AP Art Teach3
7 2 Dig Art Teach3
8 H Health Teach3
I would like it to look like the following:
Teacher H P1 P2
0 Teacher1 Span Sci Art
1 Teacher2 Conf PE CS
2 Teacher3 Health AP Art Dig Art
Not sure how to go about this, or if it is something pandas would handle.
I have tried df.transpose which results in:
0 1 2 ... 6 7 8
Period 1 2 H ... 1 2 H
Course Sci Art Span ... AP Art Dig Art Health
Teacher Teach1 Teach1 Teach1 ... Teach3 Teach3 Teach3
I have tried using df.groupby(['Teacher']) which returns an error grouper, exclusions, obj = get_grouper
I have searched for pivot, melt, transform. Not sure of the language needed to search for examples of this type.

Here is a simple approach using df.pivot():
df = df.pivot(index="Teacher", columns="Period", values="Course").reset_index()
df.columns = ["Teacher", "P1", "P2", "H"]
df = df[["Teacher", "H", "P1", "P2"]]
print(df)
Teacher H P1 P2
0 Teach1 Span Sci Art
1 Teach2 Conf PE CS
2 Teach3 Health AP Art Dig Art

pivot_table is definitely what you are looking for:
>>> (df.assign(Period=df['Period'].mask(df['Period'].str.isdigit(), other='P'+df['Period']))
.pivot_table(index='Teacher', columns='Period', values='Course', aggfunc='first')
.rename_axis(columns=None).reset_index())
Teacher H P1 P2
0 Teach1 Span Sci Art
1 Teach2 Conf PE CS
2 Teach3 Health AP Art Dig Art
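For anyone who wants to reproduce the reshaping end to end, here is a minimal self-contained sketch. It rebuilds the question's table by hand and assumes Period is stored as strings; the `labels` mapping and the `wide` name are illustrative choices, not part of either answer above. Renaming the periods before pivoting avoids assigning column names by position afterwards.

```python
import pandas as pd

# Rebuild the table from the question (Period assumed to be strings).
df = pd.DataFrame({
    "Period": ["1", "2", "H"] * 3,
    "Course": ["Sci", "Art", "Span", "PE", "CS", "Conf",
               "AP Art", "Dig Art", "Health"],
    "Teacher": ["Teach1"] * 3 + ["Teach2"] * 3 + ["Teach3"] * 3,
})

# Hypothetical mapping from raw period codes to the wanted column names.
labels = {"1": "P1", "2": "P2", "H": "H"}

wide = (
    df.assign(Period=df["Period"].map(labels))
      .pivot(index="Teacher", columns="Period", values="Course")
      .rename_axis(columns=None)   # drop the "Period" label on the columns
      .reset_index()               # turn Teacher back into a regular column
)
print(wide)
```

pivot raises an error if a Teacher/Period pair occurs more than once; in that case pivot_table with an explicit aggfunc, as in the answer above, is the safer choice.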

Related

How to fit a statistical model with a one hot encoded variable

I have my data frame that initially looked like this:
item_id title user_id gender .....
0 1 Toy Story (1995) 308 M
1 4 Get Shorty (1995) 308 M
2 5 Copycat (1995) 308 M
Then I ran a mixed effects regression, which worked fine:
import statsmodels.api as sm
import statsmodels.formula.api as smf
md = smf.mixedlm("rating ~ C(gender) + C(genre) + C(gender)*C(genre)", data, groups=data["user_id"])
mdf=md.fit()
print(mdf.summary())
However, afterwards I did a one hot encoding on the gender variable and the dataframe became like this:
item_id title user_id gender_M gender_F .....
0 1 Toy Story (1995) 308 1 0
1 4 Get Shorty (1995) 308 1 0
2 5 Copycat (1995) 308 1 0
Would it be correct to run the model like this (replacing gender with gender_M and gender_F)? Is it the same? Or is there a better way?
md = smf.mixedlm("rating ~ gender_M + gender_F + C(genre) + C(gender)*C(genre)", data, groups=data["user_id"])
mdf=md.fit()
print(mdf.summary())

pandas Plot area by specifying index column

My data looks like:
Club Count
0 AC Milan 2
1 Ajax 1
2 FC Barcelona 4
3 Bayern Munich 2
4 Chelsea 1
5 Dortmund 1
6 FC Porto 1
7 Inter Milan 1
8 Juventus 1
9 Liverpool 2
10 Man U 2
11 Real Madrid 7
I'm trying to draw an area plot with Club on the x axis. When I plot all the data the shape looks correct, but the x axis shows the index instead of the clubs.
When I specify Club as the index (index=x), the x axis is correct, but the y axis runs from 0 to 0.05. I assume that is why nothing is displayed, since the counts range from 1 to 7. Any suggestions?
Code used:
data.columns = ['Club', 'Count']
x=data.Club
y=data.Count
print(data)
ax.margins(0, 10)
data.plot.area()
df = pd.DataFrame(y,index=x)
df.plot.area()
results:

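Passing an existing Series together with a new index makes pandas reindex it, so every value becomes NaN. Pass the raw values instead. Change to
df = pd.Series(y.values, index=x)
df.plot.area()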
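A minimal sketch of the alignment issue, using only the first three rows of the question's data. The names `misaligned` and `by_club` are made up for illustration, and the plotting call is left commented out so the snippet runs without matplotlib.

```python
import pandas as pd

# First three rows of the question's data.
data = pd.DataFrame({"Club": ["AC Milan", "Ajax", "FC Barcelona"],
                     "Count": [2, 1, 4]})
x = data.Club
y = data.Count

# y carries the index 0, 1, 2. Asking for club names as the index makes
# pandas look those labels up in y, find none, and fill with NaN.
misaligned = pd.Series(y, index=x)
print(misaligned)

# set_index moves Club into the index and keeps each count on its own row.
by_club = data.set_index("Club")["Count"]
print(by_club)

# Uncomment to draw the chart (requires matplotlib):
# by_club.plot.area()
```

An all-NaN series would explain the empty plot with a tiny default y range described in the question.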

Folium FeatureGroup in Python

I am trying to create maps using Folium FeatureGroup. Each feature group should come from a pandas dataframe row. I can achieve this when there is a single row in the dataframe, but when there is more than one and I loop through them in a for loop, I am not able to achieve what I want. The Python code is below.
from folium import Map, FeatureGroup, Marker, LayerControl
mapa = Map(location=[35.11567262307692,-89.97423444615382], zoom_start=12,
tiles='Stamen Terrain')
feature_group1 = FeatureGroup(name='Tim')
feature_group2 = FeatureGroup(name='Andrew')
feature_group1.add_child(Marker([35.035075, -89.89969], popup='Tim'))
feature_group2.add_child(Marker([35.821835, -90.70503], popup='Andrew'))
mapa.add_child(feature_group1)
mapa.add_child(feature_group2)
mapa.add_child(LayerControl())
mapa
My dataframe contains the following:
Name Address
0 Dollar Tree #2020 3878 Goodman Rd.
1 Dollar Tree #2020 3878 Goodman Rd.
2 National Guard Products Inc 4985 E Raines Rd
3 434 SAVE A LOT C MID WEST 434 Kelvin 3240 Jackson Ave
4 WALGREENS 06765 108 E HIGHLAND DR
5 Aldi #69 4720 SUMMER AVENUE
6 Richmond, Christopher 1203 Chamberlain Drive
City State Zipcode Group
0 Horn Lake MS 38637 Johnathan Shaw
1 Horn Lake MS 38637 Tony Bonetti
2 Memphis TN 38118 Tony Bonetti
3 Memphis TN 38122 Tony Bonetti
4 JONESBORO AR 72401 Josh Jennings
5 Memphis TN 38122 Josh Jennings
6 Memphis TN 38119 Josh Jennings
full_address Color sequence \
0 3878 Goodman Rd.,Horn Lake,MS,38637,USA blue 1
1 3878 Goodman Rd.,Horn Lake,MS,38637,USA cadetblue 1
2 4985 E Raines Rd,Memphis,TN,38118,USA cadetblue 2
3 3240 Jackson Ave,Memphis,TN,38122,USA cadetblue 3
4 108 E HIGHLAND DR,JONESBORO,AR,72401,USA yellow 1
5 4720 SUMMER AVENUE,Memphis,TN,38122,USA yellow 2
6 1203 Chamberlain Drive,Memphis,TN,38119,USA yellow 3
Latitude Longitude
0 34.962637 -90.069019
1 34.962637 -90.069019
2 35.035367 -89.898428
3 35.165115 -89.952624
4 35.821835 -90.705030
5 35.148707 -89.903760
6 35.098829 -89.866838
When I try the same thing by looping through the dataframe in a for loop, I am not able to achieve what I need:
from folium import Map, FeatureGroup, Marker, LayerControl, plugins
mapa = Map(location=[35.11567262307692,-89.97423444615382], zoom_start=12,tiles='Stamen Terrain')
#mapa.add_tile_layer()
for i in range(0,len(df_addresses)):
    feature_group = FeatureGroup(name=df_addresses.iloc[i]['Group'])
    feature_group.add_child(Marker([df_addresses.iloc[i]['Latitude'], df_addresses.iloc[i]['Longitude']],
                                   popup=('Address: ' + str(df_addresses.iloc[i]['full_address']) + '<br>'
                                          'Tech: ' + str(df_addresses.iloc[i]['Group'])),
                                   icon = plugins.BeautifyIcon(
                                       number= str(df_addresses.iloc[i]['sequence']),
                                       border_width=2,
                                       iconShape= 'marker',
                                       inner_icon_style= 'margin-top:2px',
                                       background_color = df_addresses.iloc[i]['Color'],
                                   )))
    mapa.add_child(feature_group)
mapa.add_child(LayerControl())

This is an example dataset because I didn't want to format your df. That said, I think you'll get the idea.
print(df_addresses)
Latitude Longitude Group
0 34.962637 -90.069019 B
1 34.962637 -90.069019 B
2 35.035367 -89.898428 A
3 35.165115 -89.952624 B
4 35.821835 -90.705030 A
5 35.148707 -89.903760 A
6 35.098829 -89.866838 A
After I create the map object (mapa), I perform a groupby on the Group column and iterate through each group. For each group I first create a FeatureGroup named after grp_name (A or B), then iterate through that group's dataframe, creating Markers and adding them to the FeatureGroup.
import folium
mapa = folium.Map(location=[35.11567262307692,-89.97423444615382], zoom_start=12,
                  tiles='Stamen Terrain')
for grp_name, df_grp in df_addresses.groupby('Group'):
    # one layer per group, so LayerControl shows one toggle per group
    feature_group = folium.FeatureGroup(grp_name)
    for row in df_grp.itertuples():
        folium.Marker(location=[row.Latitude, row.Longitude]).add_to(feature_group)
    feature_group.add_to(mapa)
folium.LayerControl().add_to(mapa)
mapa
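To see what the groupby loop hands you on each pass, here is a small pandas-only sketch. The `layers` dict is a stand-in for the FeatureGroup objects, and the coordinate tuples stand in for Markers; no folium calls are made, so it only illustrates the iteration pattern, not the map itself.

```python
import pandas as pd

# Example data from the answer above.
df_addresses = pd.DataFrame({
    "Latitude": [34.962637, 34.962637, 35.035367, 35.165115,
                 35.821835, 35.148707, 35.098829],
    "Longitude": [-90.069019, -90.069019, -89.898428, -89.952624,
                  -90.705030, -89.903760, -89.866838],
    "Group": ["B", "B", "A", "B", "A", "A", "A"],
})

# One entry per group (stand-in for one FeatureGroup per group),
# holding one coordinate pair per row (stand-in for one Marker per row).
layers = {}
for grp_name, df_grp in df_addresses.groupby("Group"):
    layers[grp_name] = [(row.Latitude, row.Longitude)
                        for row in df_grp.itertuples()]

for name, points in layers.items():
    print(name, len(points))
```

The loop in the question instead builds a new FeatureGroup for every row, which is why the layer control ends up with one entry per row rather than one per group.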

Regarding the Stamen Terrain query: if you're referring to its entry in the layer control box, you can remove it by declaring your map with tiles=None and adding the TileLayer separately with control=False: folium.TileLayer('Stamen Terrain', control=False).add_to(mapa)

Fuzzy matching inside a column

Suppose I have a list of sports like this:
sports=["futball","fitbal","football","tennis","tenis","tenisse","footbal","zennis","ping-pong"]
I would like to create a dataframe that matches each element of sports with its closest match if the fuzzy-matching score is greater than 0.5, and otherwise matches it with itself. (I want to use the function fuzzywuzzy.fuzz.ratio(x,y) for that.)
The result should look like :
pd.DataFrame({"sport":sports,"closest_match":["futball","futball","football","tennis","tennis","tennis","futball","tennis","ping-pong"]})
sport closest_match
0 futball futball
1 fitbal futball
2 football football
3 tennis tennis
4 tenis tennis
5 tenisse tennis
6 footbal futball
7 zennis tennis
8 ping-pong ping-pong
Thanks

Here is a solution using itertools.combinations:
import itertools
from fuzzywuzzy import fuzz
import pandas as pd
sports = ["futball", "fitbal", "football", "tennis", "tenis", "tenisse", "footbal", "zennis", "ping-pong"]
# keep every pair whose score exceeds 50 (fuzz.ratio returns an integer from 0 to 100)
pairs = [x for x in itertools.combinations(sports, 2) if fuzz.ratio(*x) > 50]
df = pd.DataFrame(pairs, columns=["sport","closest"])
df['ratio'] = [fuzz.ratio(*x) for x in pairs]
df = df.groupby(['sport'])[['closest','ratio']].agg('max').reset_index()
print(df)
output:
sport closest ratio
0 fitbal football 77
1 football footbal 93
2 futball football 80
3 tenis zennis 83
4 tenisse zennis 62
5 tennis zennis 91
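The output above covers only the sports that appear first in some pair, and it takes the max of each column separately. A sketch that keeps every sport and falls back to the word itself might look like the following. Note the swap: it uses difflib.SequenceMatcher from the standard library (scores from 0 to 1) as a stand-in for fuzz.ratio, so the scores will not match fuzzywuzzy exactly, and `best_match` is a hypothetical helper name.

```python
import difflib
import pandas as pd

sports = ["futball", "fitbal", "football", "tennis", "tenis",
          "tenisse", "footbal", "zennis", "ping-pong"]

def best_match(word, candidates, threshold=0.5):
    """Return the most similar other candidate, or word itself if none beats threshold."""
    best, best_score = word, threshold
    for other in candidates:
        if other == word:
            continue
        score = difflib.SequenceMatcher(None, word, other).ratio()
        if score > best_score:
            best, best_score = other, score
    return best

closest = {s: best_match(s, sports) for s in sports}
df = pd.DataFrame({"sport": sports,
                   "closest_match": [closest[s] for s in sports]})
print(df)
```

This pairs each word with its highest-scoring neighbour. It does not pick one canonical spelling per cluster, which the expected output in the question seems to want; that would need an extra grouping step.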

How to label the Pie chart?

My stud_alcoh data set is given below
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian legal_drinker
0 GP F 18 U GT3 A 4 4 AT_HOME TEACHER course mother True
1 GP F 17 U GT3 T 1 1 AT_HOME OTHER course father False
2 GP F 15 U LE3 T 1 1 AT_HOME OTHER other mother False
3 GP F 15 U GT3 T 4 2 HEALTH SERVICES home mother False
4 GP F 16 U GT3 T 3 3 OTHER OTHER home father False
number_of_drinkers = stud_alcoh.groupby('legal_drinker').size()
number_of_drinkers
legal_drinker
False 284
True 111
dtype: int64
I have to draw a pie chart of number_of_drinkers, with True as 111 and False as 284. I wrote number_of_drinkers.plot(kind='pie'), but the Y label and the numbers (284 and 111) are not shown on the chart.

This should work:
number_of_drinkers.plot(kind = 'pie', label = 'my label', autopct = '%.2f%%')
The autopct argument gives you a notation of percentage inside the plot, with the desired number of decimals indicated right before the letter "f". So you can change this, for example, to %.1f%% for only one decimal.
I personally don't know of a way to show the raw numbers inside but only the percentage, but to the best of my understanding this is the purpose of a pie.
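One possible way to get the raw counts into the wedges: matplotlib's autopct also accepts a callable that receives the percentage and returns the label text. The sketch below rebuilds the counts from the question by hand; `count_label` is a made-up helper name, and the plotting call is left commented out so the snippet runs without matplotlib.

```python
import pandas as pd

# Counts from the question, rebuilt by hand.
number_of_drinkers = pd.Series(
    [284, 111], index=pd.Index([False, True], name="legal_drinker")
)
total = int(number_of_drinkers.sum())

def count_label(pct):
    """Convert a wedge percentage back to its count and show both."""
    count = round(pct * total / 100)
    return f"{count} ({pct:.1f}%)"

print(count_label(100 * 284 / total))

# Uncomment to draw the chart (requires matplotlib):
# number_of_drinkers.plot.pie(autopct=count_label, label="")
```

Because the count is recovered from a percentage, it is rounded back to the nearest integer.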

Your question already has a good answer. You could also try this. I'm using the data frame you shared.
import pandas as pd
df = pd.read_clipboard()
df
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian legal_drinker
0 GP F 18 U GT3 A 4 4 AT_HOME TEACHER course mother True
1 GP F 17 U GT3 T 1 1 AT_HOME OTHER course father False
2 GP F 15 U LE3 T 1 1 AT_HOME OTHER other mother False
3 GP F 15 U GT3 T 4 2 HEALTH SERVICES home mother False
4 GP F 16 U GT3 T 3 3 OTHER OTHER home father False
number_of_drinkers = df.groupby('legal_drinker').size() # Series
number_of_drinkers
legal_drinker
False 4
True 1
dtype: int64
number_of_drinkers.plot.pie(label='counts', autopct='%1.1f%%') # Label the wedges with their percentage share
