Split column into multiple columns or separate "tables" in pandas - python

I'm very new to Python and don't have much of a clue what I'm doing. I have a series of data that describes the performance ('Leistung') of different people ('Leistungserbringer'). Each performance is also linked to a specific value ('Taxpunkte'). I'd like to display the top 10 performances for each person, ranked by the value of the performance.
byleistung = df.groupby('Leistungserbringer')
df2 = byleistung['Taxpunkte'].describe()
df2.sort_values(['mean'], ascending=[False])
count mean std min 25% 50% 75% max
Leistungserbringer
Larsson William 6188.0 99.799108 231.765598 2.50 15.81 31.61 111.71 3909.72
Karlsson Oliwer 5645.0 93.344057 277.989424 3.61 15.81 31.61 94.83 9122.68
McGregor Sean 1250.0 89.100800 136.175528 3.61 18.35 34.78 111.71 998.64
Groeneveld Arno 4045.0 84.859498 202.230230 1.93 15.81 31.61 63.23 3323.52
Heepe Simon 3776.0 82.662950 359.970010 3.61 15.81 31.61 50.47 13597.60
Bitar Wahib 7814.0 72.190337 142.399537 3.61 15.81 31.61 61.75 3634.15
Cox James 4746.0 72.036013 132.240942 2.50 15.81 31.61 50.65 1664.40
Carvalho Tomas 7415.0 60.868030 156.889297 2.86 15.81 15.81 41.50 2099.20
The 'count' is the number of performances the specific person did. In total there are 330 different performances these people have done. As an example:
byleistung = df.groupby('Leistung')
byleistung['Taxpktwert'].describe()
count unique top freq
Leistung
'(+) %-Zuschlag für Notfall B, ' 2 1 KVG 2
'+ Bronchoalveoläre Lavage (BAL)' 1 1 KVG 1
'+ Bürstenabstrich bei Bronchoskopie' 8 1 KVG 8
'+ Endobronchialer Ultraschall mit Punktion' 1 1 KVG 1
'XOLAIR Trockensub 150 mg c Solv Durchstf' 109 1 KVG 109
My DataFrame looks like this (it has 40'000 more rows):
df.head()
Leistungserbringer Anzahl Leistung AL TL Taxpktwert Taxpunkte
0 Groeneveld Arno 12 'Beratung' 147.28 87.47 KVG 234.75
1 Groeneveld Arno 12 'Konsilium' 147.28 87.47 KVG 234.75
2 Groeneveld Arno 12 'Ultra' 147.28 87.47 KVG 234.75
3 Groeneveld Arno 12 'O2-Druck' 147.28 87.47 KVG 234.75
4 Groeneveld Arno 12 'Funktion' 147.28 87.47 KVG 234.75
I want my end result to look something like this for each person. The ranking should be based on the product of counts per performance ('Anzahl') * value ('Taxpunkte'):
Leistungserbringer Leistung Anzahl Taxpunkte Total Taxpkt
Larsson William 1 x a x*a
2 y b y*b
.
.
10 z c z*c
...
McGregor Sean 1 x a x*a
2 y b y*b
.
.
10 z c z*c
Any hints or recommendations of approach would be greatly appreciated.
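One hedged sketch of a possible approach (not a verified answer; it assumes the column names shown above and that the total per performance is Anzahl * Taxpunkte):
# A sketch, not a verified answer: compute the total value per person and
# performance, then keep the 10 largest totals for each person.
df['Total Taxpkt'] = df['Anzahl'] * df['Taxpunkte']

totals = (df.groupby(['Leistungserbringer', 'Leistung'], as_index=False)
            [['Anzahl', 'Taxpunkte', 'Total Taxpkt']].sum())

top10 = (totals.sort_values('Total Taxpkt', ascending=False)
               .groupby('Leistungserbringer')
               .head(10)
               .sort_values(['Leistungserbringer', 'Total Taxpkt'],
                            ascending=[True, False]))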

Get the number of involved singers in a phase

I have a dataset like this
import pandas as pd
df = pd.read_csv("music.csv")
df
    name        date      singer                      language  phase
1   Yes or No   02.01.20  Benjamin Smith              en        1
2   Parabens    01.06.21  Rafael Galvao;Simon Murphy  pt;en     2
3   Love        12.11.20  Michaela Condell            en        1
4   Paz         11.07.19  Ana Perez; Eduarda Pinto    es;pt     3
5   Stop        12.01.21  Michael Conway;Gabriel Lee  en;en     1
6   Shalom      18.06.21  Shimon Cohen                hebr      1
7   Habibi      22.12.19  Fuad Khoury                 ar        3
8   viva        01.08.21  Veronica Barnes             en        1
9   Buznanna    23.09.20  Kurt Azzopardi              mt        1
10  Frieden     21.05.21  Gabriel Meier               dt        1
11  Uruguay     11.04.21  Julio Ramirez               es        1
12  Beautiful   17.03.21  Cameron Armstrong           en        3
13  Holiday     19.06.20  Bianca Watson               en        3
14  Kiwi        21.10.20  Lachlan McNamara            en        1
15  Amore       01.12.20  Vasco Grimaldi              it        1
16  La vie      28.04.20  Victor Dubois               fr        3
17  Yom         21.02.20  Ori Azerad; Naeem al-Hindi  hebr;ar   2
18  Elefthería  15.06.19  Nikolaos Gekas              gr        1
I convert it to 1NF.
import pandas as pd
import numpy as np

df = pd.read_csv("music.csv")
df['language'] = df['language'].str.split(';')
df['singer'] = df['singer'].str.split(';')
d = df.explode(['language', 'singer'])
d
And I create a dataframe. Now I would like to find out which phase has the most singers involved.
I used this
df = df.groupby('singer')
df['phase'].value_counts().idxmax()
But I could not get a solution.
The dataframe has 42 observations, so some singers appear more than once.
Source: convert data to 1NF
You do not need to split/explode, you can directly count the number of ; per row and add 1:
df['singer'].str.count(';').add(1).groupby(df['phase']).sum()
If you want the classical split/explode:
(df.assign(singer=df['singer'].str.split(';'))
.explode('singer')
.groupby('phase')['singer'].count()
)
output:
phase
1 12
2 4
3 6
Name: singer, dtype: int64
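If you only need the single phase with the most singers (the original goal), you can take the idxmax of either result; a small addition to the approach above, assuming the same df:
# Phase with the highest number of singers involved
counts = df['singer'].str.count(';').add(1).groupby(df['phase']).sum()
counts.idxmax()  # -> 1 for the data above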

Populate a dataframe column based on a column of other dataframe

I have a dataframe with the population of a region and I want to populate a column of another dataframe with the same distribution.
The first dataframe looks like this:
Municipio Population Population5000
0 Lisboa 3184984 1291
1 Porto 2597191 1053
2 Braga 924351 375
3 Setúbal 880765 357
4 Aveiro 814456 330
5 Faro 569714 231
6 Leiria 560484 227
7 Coimbra 541166 219
8 Santarém 454947 184
9 Viseu 378784 154
10 Viana do Castelo 252952 103
11 Vila Real 214490 87
12 Castelo Branco 196989 80
13 Évora 174490 71
14 Guarda 167359 68
15 Beja 158702 64
16 Bragança 140385 57
17 Portalegre 120585 49
18 Total 12332794 5000
Basically, the second dataframe has 5000 rows and I want to create a column with names corresponding to the Municipios from the first df.
My problem is that I don't know how to populate the column with the same occurrence distribution as in the first dataframe.
The final result would be something like this:
Municipio
0 Porto
1 Porto
2 Lisboa
3 Évora
4 Lisboa
5 Aveiro
...
4996 Viseu
4997 Lisboa
4998 Porto
4999 Guarda
5000 Beja
Can someone help me?
I would use a simple comprehension to build a list of size 5000, with each town name appearing as many times as its Population5000 value, and optionally shuffle it if you want a random order:
import random

lst = [m for m, n in df.loc[:len(df)-2, ['Municipio', 'Population5000']].to_numpy()
       for i in range(n)]
random.shuffle(lst)
result = pd.Series(1, index=lst, name='Municipio')
Initialized with random.seed(0), it gives:
Setúbal 1
Santarém 1
Lisboa 1
Setúbal 1
Aveiro 1
..
Santarém 1
Porto 1
Lisboa 1
Faro 1
Aveiro 1
Name: Municipio, Length: 5000, dtype: int64
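For completeness, a vectorized sketch of the same idea using numpy.repeat (assuming the same df with Municipio and Population5000 columns, and excluding the Total row as above):
import numpy as np
import pandas as pd

towns = df.loc[:len(df)-2, 'Municipio'].to_numpy()
counts = df.loc[:len(df)-2, 'Population5000'].to_numpy()

lst = np.repeat(towns, counts)              # each town repeated Population5000 times
np.random.default_rng(0).shuffle(lst)       # optional: random order
result = pd.Series(lst, name='Municipio')   # 5000 rows, same distribution as df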
You could just do a simple map:
map = dict(zip(DF1['Population5000'], DF1['Municipio']))
DF2['Municipio'] = DF2['Population5000'].map(map)
Or just change the Population5000 column name in the map (DF2) to whatever the column containing your population values is called:
map = dict(zip(municipios['Population5000'], municipios['Municipio']))
df['Municipio'] = municipios['Population5000'].map(map)
I tried this as suggested by Amen_90, but the column Municipio from the second dataframe only gets populated with 1 instance of every Municipio, when I wanted to have the same value_counts as in the column "Population5000" of my first dataframe.
df["Municipio"].value_counts()
Beja 1
Aveiro 1
Bragança 1
Vila Real 1
Porto 1
Santarém 1
Coimbra 1
Guarda 1
Leiria 1
Castelo Branco 1
Viseu 1
Total 1
Faro 1
Portalegre 1
Braga 1
Évora 1
Setúbal 1
Viana do Castelo 1
Lisboa 1
Name: Municipio, dtype: int64

Folium FeatureGroup in Python

I am trying to create maps using a Folium FeatureGroup. Each feature group will be built from a pandas dataframe row. I am able to achieve this when there is one row in the dataframe, but when there is more than one and I loop through them in a for loop, I am not able to achieve what I want. Please find the Python code below.
from folium import Map, FeatureGroup, Marker, LayerControl

mapa = Map(location=[35.11567262307692, -89.97423444615382], zoom_start=12,
           tiles='Stamen Terrain')

feature_group1 = FeatureGroup(name='Tim')
feature_group2 = FeatureGroup(name='Andrew')

feature_group1.add_child(Marker([35.035075, -89.89969], popup='Tim'))
feature_group2.add_child(Marker([35.821835, -90.70503], popup='Andrew'))

mapa.add_child(feature_group1)
mapa.add_child(feature_group2)
mapa.add_child(LayerControl())
mapa
My dataframe contains the following:
Name Address
0 Dollar Tree #2020 3878 Goodman Rd.
1 Dollar Tree #2020 3878 Goodman Rd.
2 National Guard Products Inc 4985 E Raines Rd
3 434 SAVE A LOT C MID WEST 434 Kelvin 3240 Jackson Ave
4 WALGREENS 06765 108 E HIGHLAND DR
5 Aldi #69 4720 SUMMER AVENUE
6 Richmond, Christopher 1203 Chamberlain Drive
City State Zipcode Group
0 Horn Lake MS 38637 Johnathan Shaw
1 Horn Lake MS 38637 Tony Bonetti
2 Memphis TN 38118 Tony Bonetti
3 Memphis TN 38122 Tony Bonetti
4 JONESBORO AR 72401 Josh Jennings
5 Memphis TN 38122 Josh Jennings
6 Memphis TN 38119 Josh Jennings
full_address Color sequence \
0 3878 Goodman Rd.,Horn Lake,MS,38637,USA blue 1
1 3878 Goodman Rd.,Horn Lake,MS,38637,USA cadetblue 1
2 4985 E Raines Rd,Memphis,TN,38118,USA cadetblue 2
3 3240 Jackson Ave,Memphis,TN,38122,USA cadetblue 3
4 108 E HIGHLAND DR,JONESBORO,AR,72401,USA yellow 1
5 4720 SUMMER AVENUE,Memphis,TN,38122,USA yellow 2
6 1203 Chamberlain Drive,Memphis,TN,38119,USA yellow 3
Latitude Longitude
0 34.962637 -90.069019
1 34.962637 -90.069019
2 35.035367 -89.898428
3 35.165115 -89.952624
4 35.821835 -90.705030
5 35.148707 -89.903760
6 35.098829 -89.866838
But when I try to do the same by looping through the dataframe in a for loop, I am not able to achieve what I need:
from folium import Map, FeatureGroup, Marker, LayerControl
from folium import plugins

mapa = Map(location=[35.11567262307692, -89.97423444615382], zoom_start=12,
           tiles='Stamen Terrain')
# mapa.add_tile_layer()

for i in range(0, len(df_addresses)):
    feature_group = FeatureGroup(name=df_addresses.iloc[i]['Group'])
    feature_group.add_child(Marker(
        [df_addresses.iloc[i]['Latitude'], df_addresses.iloc[i]['Longitude']],
        popup=('Address: ' + str(df_addresses.iloc[i]['full_address']) + '<br>'
               'Tech: ' + str(df_addresses.iloc[i]['Group'])),
        icon=plugins.BeautifyIcon(
            number=str(df_addresses.iloc[i]['sequence']),
            border_width=2,
            icon_shape='marker',
            inner_icon_style='margin-top:2px',
            background_color=df_addresses.iloc[i]['Color'],
        )))
    mapa.add_child(feature_group)

mapa.add_child(LayerControl())
This is an example dataset because I didn't want to format your df. That said, I think you'll get the idea.
print(df_addresses)
Latitude Longitude Group
0 34.962637 -90.069019 B
1 34.962637 -90.069019 B
2 35.035367 -89.898428 A
3 35.165115 -89.952624 B
4 35.821835 -90.705030 A
5 35.148707 -89.903760 A
6 35.098829 -89.866838 A
After I create the map object (mapa), I perform a groupby on the Group column. I then iterate through each group: I first create a FeatureGroup with the grp_name (A or B), and for each group I iterate through that group's dataframe, create Markers, and add them to the FeatureGroup.
import folium

mapa = folium.Map(location=[35.11567262307692, -89.97423444615382], zoom_start=12,
                  tiles='Stamen Terrain')

for grp_name, df_grp in df_addresses.groupby('Group'):
    feature_group = folium.FeatureGroup(grp_name)
    for row in df_grp.itertuples():
        folium.Marker(location=[row.Latitude, row.Longitude]).add_to(feature_group)
    feature_group.add_to(mapa)

folium.LayerControl().add_to(mapa)
mapa
Regarding the Stamen Terrain query: if you're referring to its appearance in the layer control box, you can remove it by declaring your map with tiles=None and adding the TileLayer separately with control set to False: folium.TileLayer('Stamen Terrain', control=False).add_to(mapa)
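A minimal sketch of that variant, assuming the same map setup as above:
import folium

# Declare the map without a default tile layer, then add one that is
# hidden from the LayerControl box.
mapa = folium.Map(location=[35.11567262307692, -89.97423444615382],
                  zoom_start=12, tiles=None)
folium.TileLayer('Stamen Terrain', control=False).add_to(mapa)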

Pandas - cumsum() on groupby() over multiple axis

I’m new to Pandas.
I have a data set that is horse racing results. Example here:
RaceID RaceDate RaceMeet Position Horse Jockey Trainer RaceLength race win HorseWinPercentage
446252 01/01/2008 Southwell (AW) 1 clear reef tom mclaughlin jane chapple-hyam 3101 1 1 0
447019 14/01/2008 Southwell (AW) 5 clear reef tom mclaughlin jane chapple-hyam 2654 1 0 100
449057 21/02/2008 Southwell (AW) 2 clear reef tom mclaughlin jane chapple-hyam 3101 1 0 50
463805 26/08/2008 Chelmsford (AW) 6 clear reef tom mclaughlin jane chapple-hyam 3080 1 0 33.33333333
469220 27/11/2008 Chelmsford (AW) 3 clear reef tom mclaughlin jane chapple-hyam 3080 1 0 25
470195 11/12/2008 Chelmsford (AW) 5 clear reef tom mclaughlin jane chapple-hyam 3080 1 0 20
471052 26/12/2008 Wolhampton (AW) 1 clear reef andrea atzeni jane chapple-hyam 2690 1 1 16.66666667
471769 07/01/2009 Wolhampton (AW) 6 clear reef ian mongan jane chapple-hyam 2690 1 0 28.57142857
472137 13/01/2009 Chelmsford (AW) 2 clear reef jamie spencer jane chapple-hyam 3080 1 0 25
472213 20/01/2009 Southwell (AW) 5 clear reef jamie spencer jane chapple-hyam 2654 1 0 22.22222222
476595 25/03/2009 Kempton (AW) 4 clear reef pat cosgrave jane chapple-hyam 2639 1 0 20
477674 08/04/2009 Kempton (AW) 5 clear reef pat cosgrave jane chapple-hyam 2639 1 0 18.18181818
479098 21/04/2009 Kempton (AW) 3 clear reef andrea atzeni jane chapple-hyam 2639 1 0 16.66666667
492913 14/11/2009 Wolhampton (AW) 1 clear reef andrea atzeni jane chapple-hyam 3639 1 1 15.38461538
493720 25/11/2009 Kempton (AW) 3 clear reef andrea atzeni jane chapple-hyam 3518 1 0 21.42857143
495863 29/12/2009 Southwell (AW) 1 clear reef shane kelly jane chapple-hyam 3101 1 1 20
I want to be able to group by multiple columns to count up wins and create combination win percentages, or results at a specific track and distance.
When I just need to group by a single column, it works great:
df['horse_win_count'] = df.groupby(['Horse'])['win'].cumsum()
df['horse_race_count'] = df.groupby(['Horse'])['race'].cumsum()
df['HorseWinPercentage2'] = df['horse_win_count'] / df['horse_race_count'] * 100
df['HorseWinPercentage'] = df.groupby('Horse')['HorseWinPercentage2'].shift(+1)
However, when I need to group by more than one column I get some really weird results.
For example, I want to create a win percentage for when a specific Jockey rides a specific Trainer's horse – groupby(['Jockey','Trainer']). Then I need to know the percentage as it changes for each individual row (race).
df['jt_win_count'] = df.groupby(['Jockey','Trainer'])['win'].cumsum()
df['jt_race_count'] = df.groupby(['Jockey','Trainer'])['race'].cumsum()
df['JTWinPercentage2'] = df['jt_win_count'] / df['jt_race_count'] * 100
df['JTWinPercentage'] = df.groupby(['Jockey','Trainer'])['JTWinPercentage2'].shift(+1)
df['JTWinPercentage'].fillna(0, inplace=True)
Or I want to count up the number of times a horse has won over that course and distance, so I need to groupby(['Horse','RaceMeet','RaceLength']):
df['CD'] = df.groupby(['RaceMeet','RaceLength','Horse'])['win'].cumsum()
df['CD'] = df.groupby(["RaceMeet","RaceLength","Horse"]).shift(+1)
I get results in the 10s of 1000s.
How can I group by several columns, make a computation, and shift the results back by one entry while still grouped?
And even better, can you explain why my code above doesn't work? Like I say, I'm new to Pandas and keen to learn.
Cheers.
This question was already asked here: Pandas DataFrame Groupby two columns and get counts, and here:
python pandas groupby() result
I do not really know what your goal is, though.
I guess you should first add another column with the new key you want to group by, for example: df['jockeyTrainer'] = df['Jockey'] + df['Trainer']. Then you can use this to group by. Or you can follow the information in the links.
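A short sketch of that suggestion, with column names assumed from the question (it builds a combined key and reuses the single-column pattern that already works for you):
# Combine the two keys into one column, then group by it as before.
df['jockeyTrainer'] = df['Jockey'] + '|' + df['Trainer']

df['jt_win_count'] = df.groupby('jockeyTrainer')['win'].cumsum()
df['jt_race_count'] = df.groupby('jockeyTrainer')['race'].cumsum()
df['JTWinPercentage2'] = df['jt_win_count'] / df['jt_race_count'] * 100
df['JTWinPercentage'] = df.groupby('jockeyTrainer')['JTWinPercentage2'].shift(1).fillna(0)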

Pandas DataFrame [cell=(label,value)], split into 2 separate dataframes

I found an awesome way to parse html with pandas. My data is in kind of a weird format (attached below). I want to split this data into 2 separate dataframes.
Notice how each cell combines a label and a value... is there any really efficient method to split all of these cells and create 2 dataframes, one for the labels and one for the (value) in parentheses?
NumPy has all those ufuncs; is there a way I can use them on string dtypes, since they can be converted to an np.array with DF.as_matrix()? I'm trying to steer clear of for loops. I could iterate through all the indices and populate an empty array, but that's pretty barbaric.
I'm using Beaker Notebook btw, it's really cool (HIGHLY RECOMMENDED)
import pandas as pd

# Set URL destination
url = "http://www.reef.org/print/db/stats"

# Process raw table
DF_raw = pd.read_html(url)[0]

# Get start/end indices of table
start_label = "10 Most Frequent Species"; start_idx = (DF_raw.iloc[:, 0] == start_label).argmax()
end_label = "Top 10 Sites for Species Richness"; end_idx = (DF_raw.iloc[:, 0] == end_label).argmax()

# Process table
DF_freqSpecies = pd.DataFrame(
    DF_raw.as_matrix()[(start_idx + 1):end_idx, :],
    columns=DF_raw.iloc[0, :]
)
DF_freqSpecies

# Split these into 2 separate DataFrames
#Split these into 2 separate DataFrames
Here's my naive way of doing it:
import re
import numpy as np

DF_species = pd.DataFrame(np.zeros_like(DF_freqSpecies), columns=DF_freqSpecies.columns)
DF_freq = pd.DataFrame(np.zeros_like(DF_freqSpecies).astype(str), columns=DF_freqSpecies.columns)

dims = DF_freqSpecies.shape
for i in range(dims[0]):
    for j in range(dims[1]):
        # Parse the current cell into its species name and frequency
        species, freq = re.split(r"\s\(\d", DF_freqSpecies.iloc[i, j])
        freq = float(freq[:-1])
        # Populate the split DataFrames
        DF_species.iloc[i, j] = species
        DF_freq.iloc[i, j] = freq
I want these 2 dataframes as my output:
(1) Species;
and (2) Frequencies
you can do it this way:
DF1:
In [182]: df1 = DF_freqSpecies.replace(r'\s*\(\d+\.*\d*\)', '', regex=True)
In [183]: df1.head()
Out[183]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska \
0 Bluehead Copper Rockfish
1 Blue Tang Lingcod
2 Stoplight Parrotfish Painted Greenling
3 Bicolor Damselfish Sunflower Star
4 French Grunt Plumose Anemone
0 Hawaii Tropical Eastern Pacific \
0 Saddle Wrasse King Angelfish
1 Hawaiian Whitespotted Toby Mexican Hogfish
2 Raccoon Butterflyfish Barberfish
3 Manybar Goatfish Flag Cabrilla
4 Moorish Idol Panamic Sergeant Major
0 South Pacific Northeast US and Eastern Canada \
0 Regal Angelfish Cunner
1 Bluestreak Cleaner Wrasse Winter Flounder
2 Manybar Goatfish Rock Gunnel
3 Brushtail Tang Pollock
4 Two-spined Angelfish Grubby Sculpin
0 South Atlantic States Central Indo-Pacific
0 Slippery Dick Moorish Idol
1 Belted Sandfish Three-spot Dascyllus
2 Black Sea Bass Bluestreak Cleaner Wrasse
3 Tomtate Blacklip Butterflyfish
4 Cubbyu Clark's Anemonefish
and DF2
In [193]: df2 = DF_freqSpecies.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=True)
In [194]: df2.head()
Out[194]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska Hawaii \
0 85 54.6 92
1 84.8 53.2 85.8
2 81 50.8 85.7
3 79.9 50.2 85.7
4 74.8 49.7 82.9
0 Tropical Eastern Pacific South Pacific Northeast US and Eastern Canada \
0 85.7 79 67.4
1 82.5 77.3 46.6
2 75.2 73.9 26.2
3 68.9 73.3 25.2
4 67.9 72.8 23.7
0 South Atlantic States Central Indo-Pacific
0 79.7 80.1
1 78.5 75.6
2 78.5 73.5
3 72.7 71.4
4 65.7 70.2
RegEx debugging and explanation:
We basically want to remove everything except the number in parentheses:
(\d+\.*\d*) - group(1) - our number
\((\d+\.*\d*)\) - our number in parentheses
.*\((\d+\.*\d*)\).* - the whole thing: anything before '(', then '(', our number, ')', and anything up to the end of the cell
The whole match is replaced with group(1) - our number.
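A tiny self-contained demo of the two replacements, using made-up cell values in the same 'Species (frequency)' shape as the table above:
import pandas as pd

sample = pd.DataFrame({'col': ['Bluehead (85)', 'Blue Tang (84.8)']})

labels = sample.replace(r'\s*\(\d+\.*\d*\)', '', regex=True)        # 'Bluehead', 'Blue Tang'
values = sample.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=True)  # '85', '84.8'
print(labels)
print(values)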
