How to get distinct rows from pandas dataframe? - python
I am having trouble getting distinct values from my dataframe. Below is the code I currently use; the issue is in line 25 (the third line of vier()). I would like to show the top 10 fastest drivers based on their average heat (go-kart heat) time.
Input:
HeatNumber,NumberOfKarts,KartNumber,DriverName,Laptime
334,11,5,Monique,00:53.862
334,11,5,Monique,00:59.070
334,11,5,Monique,00:47.832
334,11,5,Monique,00:47.213
334,11,5,Monique,00:51.975
334,11,5,Monique,00:46.423
334,11,5,Monique,00:49.539
334,11,5,Monique,00:49.935
334,11,5,Monique,00:45.267
334,11,12,Robert-Jan,00:55.606
334,11,12,Robert-Jan,00:52.249
334,11,12,Robert-Jan,00:50.965
334,11,12,Robert-Jan,00:53.878
334,11,12,Robert-Jan,00:48.802
334,11,12,Robert-Jan,00:48.766
334,11,12,Robert-Jan,00:46.003
334,11,12,Robert-Jan,00:46.257
334,11,12,Robert-Jan,00:47.334
334,11,20,Katja,00:56.222
334,11,20,Katja,01:01.005
334,11,20,Katja,00:50.296
334,11,20,Katja,00:48.004
334,11,20,Katja,00:51.203
334,11,20,Katja,00:47.672
334,11,20,Katja,00:50.243
334,11,20,Katja,00:50.453
334,11,20,Katja,01:06.192
334,11,13,Bensu,00:56.332
334,11,13,Bensu,00:54.550
334,11,13,Bensu,00:52.023
334,11,13,Bensu,00:52.518
334,11,13,Bensu,00:50.738
334,11,13,Bensu,00:50.359
334,11,13,Bensu,00:49.307
334,11,13,Bensu,00:49.595
334,11,13,Bensu,00:50.504
334,11,17,Marit,00:56.740
334,11,17,Marit,00:52.534
334,11,17,Marit,00:48.331
334,11,17,Marit,00:56.204
334,11,17,Marit,00:49.066
334,11,17,Marit,00:49.210
334,11,17,Marit,00:45.655
334,11,17,Marit,00:46.261
334,11,17,Marit,00:46.837
334,11,11,Niels,00:58.518
334,11,11,Niels,01:01.562
334,11,11,Niels,00:51.238
334,11,11,Niels,00:48.808
Code:
import pandas as pd
import matplotlib.pyplot as plt

# Data
df = pd.read_csv('dataset_kartanalyser.csv')
df = df.dropna(axis=0, how='any')
# Split 'MM:SS.mmm' laptimes and convert them to seconds
df = df.join(df['Laptime'].str.split(':', n=1, expand=True).rename(columns={0: 'M', 1: 'S'}))
df['M'] = df['M'].astype(int)
df['S'] = df['S'].astype(float)
df['Laptime'] = (df['M'] * 60) + df['S']
df.drop(['M', 'S'], axis=1, inplace=True)

# Functions
def twee():
    print("The total number of karts = " + str(df['KartNumber'].nunique()))
    print("The number of unique drivers = " + str(df['DriverName'].nunique()))
    print("The total number of heats = " + str(df['HeatNumber'].nunique()))

def drie():
    print("The 10 fastest drivers based on individual lap time are:")
    print((df.groupby('DriverName')['Laptime'].nsmallest(1)).nsmallest(10))

def vier():
    print('The 10 fastest drivers based on fastest average heat time:')
    print((df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)).nsmallest(10))

print(df)
HeatNumber NumberOfKarts KartNumber DriverName Laptime
0 334 11 5 Monique 53.862
1 334 11 5 Monique 59.070
2 334 11 5 Monique 47.832
3 334 11 5 Monique 47.213
4 334 11 5 Monique 51.975
... ... ... ... ... ...
4053 437 2 20 luuk 39.678
4054 437 2 20 luuk 39.872
4055 437 2 20 luuk 39.454
4056 437 2 20 luuk 39.575
4057 437 2 20 luuk 39.648
Output:
DriverName HeatNumber
giovanni 411 26.233
ryan 411 27.747
giovanni 408 27.938
papa 394 28.075
guus 406 28.998
Rob 427 29.371
Suus 427 29.416
Jan-jullius 394 29.428
Joep 427 29.934
Indy 423 29.991
The output I get is almost correct, except that the driver "giovanni" occurs twice. I would like to show only the fastest average heat time for each driver. Does anyone know how to do this?
OK, so add drop_duplicates on a column like this; you just need to add a sort as well:

df.sort_values('B', ascending=True).drop_duplicates('A', keep='first')

Applied to your case (the groupby result has to be turned back into a DataFrame with reset_index before you can sort on the Laptime column):

(df.groupby(['DriverName', 'HeatNumber'])['Laptime']
   .mean().round(3)
   .reset_index()
   .sort_values('Laptime', ascending=True)
   .drop_duplicates('DriverName', keep='first')
   .nsmallest(10, 'Laptime'))
You group the data by DriverName and HeatNumber. Look at the HeatNumbers: one of giovanni's rows is 411 and the other is 408, so pandas treats them as two distinct groups. If you collapse them to the driver level (keep only the smallest mean per driver), they become one.
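For completeness, a minimal sketch (using the df built in the question) that takes each heat's mean and then collapses to every driver's best heat average, so each driver appears exactly once:

# Average per (driver, heat), then keep each driver's best heat average
best_heat_avg = (
    df.groupby(['DriverName', 'HeatNumber'])['Laptime']
      .mean()
      .groupby(level='DriverName')  # regroup on the driver level only
      .min()
      .round(3)
      .nsmallest(10)
)
print(best_heat_avg)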
Related
How to extract a specific table from an url via Python?
I am currently looking at the following link: https://www.ishares.com/de/privatanleger/de/produkte/251931/ishares-stoxx-europe-600-ucits-etf-de-fund There is a table that displays all positions of the ETF. My goal is to extract that table and save it to an xlsx file. I wrote this code:

import requests
import pandas as pd

url = 'https://www.ishares.com/de/privatanleger/de/produkte/251931/ishares-stoxx-europe-600-ucits-etf-de-fund'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_excel('my data.xlsx')

However, whenever I try pd.read_html(html), it tells me that no tables could be found on the website. Does somebody know how to identify and pull the desired table via Python?
The problem is that the website uses cookies: with the default link you are redirected to a first page where you have to click a button to accept them, and only then do you reach the right page. I found the right URL that goes straight to the main page you are looking for; try this:

url = 'https://www.ishares.com/de/privatanleger/de/produkte/251931/ishares-stoxx-europe-600-ucits-etf-de-fund?switchLocale=y&siteEntryPassthrough=true'
html = requests.get(url).content
df_list = pd.read_html(html)
print(df_list)

Here is my output:

[Empty DataFrame
Columns: [Ex-Tag, Fälligkeitsdatum, Gesamtausschüttung]
Index: [],
Empty DataFrame
Columns: [Ex-Tag, Fälligkeitsdatum, Gesamtausschüttung]
Index: [],
            Unnamed: 0  2012  2013  2014  2015  ...  2017  2018  2019  2020  2021
0    Gesamtrendite (%)   177   210    74   108  ...   108  -110   276   -19   251
1  Vergleichsindex (%)   178   212    72    96  ...   106  -108   268   -20   249
[2 rows x 11 columns],
            Unnamed: 0                   ...  Von 31.Mär.2021Bis 31.Mär.2022
0    Gesamtrendite (%)  Per 31.Mär.2022  ...                             863
1  Vergleichsindex (%)  Per 31.Mär.2022  ...                             849
[2 rows x 6 columns],
            Unnamed: 0   1J   3J   5J  10J  Seit Auflage
0    Gesamtrendite (%)  197  894  538  940           651
1  Vergleichsindex (%)  178  879  522  920           636,
            Unnamed: 0  Seit 1.1.   1M   3M  ...    3J    5J    10J  Seit Auflage
0    Gesamtrendite (%)       -736  -80  -43  ...  2929  2994  14548         21694
1  Vergleichsindex (%)       -755  -92  -65  ...  2874  2896  14119         20904
[2 rows x 10 columns],
Empty DataFrame
Columns: [Emittententicker, Name, Sektor, Anlageklasse, Marktwert, Gewichtung (%), Nominalwert, Nominale, ISIN, Kurs, Standort, Börse, Marktwährung]
Index: [],
Empty DataFrame
Columns: [Kategorie, Fonds]
Index: [],
Empty DataFrame
Columns: [Kategorie, Fonds]
Index: [],
                       Börse Ticker  ...  Common Code (EOC)  iNAV ISIN
0                      Xetra   EXSA  ...         186 794 77          -
1  Bolsa Mexicana De Valores   EXSA  ...                  -          -
2             Borsa Italiana   EXSA  ...                  -          -
3         SIX Swiss Exchange   EXSA  ...                  -          -
[4 rows x 14 columns]]

Process finished with exit code 0
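Building on this, a short sketch that also writes the positions table to xlsx, as the question asked. Picking the table by a column name is an assumption on my part ('ISIN' appears in the positions table above), but it is more robust than a fixed list index:

import requests
import pandas as pd

url = ('https://www.ishares.com/de/privatanleger/de/produkte/251931/'
       'ishares-stoxx-europe-600-ucits-etf-de-fund'
       '?switchLocale=y&siteEntryPassthrough=true')
html = requests.get(url).content

# Identify the positions table by a column it is known to contain
df_list = pd.read_html(html)
positions = next(df for df in df_list if 'ISIN' in df.columns)
positions.to_excel('my data.xlsx')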
This is how you would use Selenium:

from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
# maximize the window size
driver.maximize_window()
# navigate to the url
driver.get("https://www.google.com/")
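If the cookie wall still gets in the way, a hedged sketch (assuming a working chromedriver setup) is to let Selenium render the page and hand the resulting HTML to pandas:

from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome()
driver.get('https://www.ishares.com/de/privatanleger/de/produkte/251931/'
           'ishares-stoxx-europe-600-ucits-etf-de-fund'
           '?switchLocale=y&siteEntryPassthrough=true')

# Parse every table out of the fully rendered page
df_list = pd.read_html(driver.page_source)
driver.quit()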
Iterate over specific rows, sum results and store in new row
I have a DataFrame in which I have already defined rows to be summed up, with the results stored in a new row. For example in year 1990:

Category    A    B    C    D  Year
E         147   78  476  531  1990
F         914  356  337  781  1990
G         117  874   15   69  1990
H          45  682  247   65  1990
I          20  255  465   19  1990

Here, the rows G and H should be summed up and the results stored in a new row. The same categories repeat every year from 1990 to 2019. I have already tried it with .iloc (e.g. [4:8], [50:54], [96:100] and so on), but with iloc I cannot select several separate ranges at once, and I can't manage to write a loop over the single years. Is there a way to sum the values in categories G and H for each year (1990-2019)?
I'm not sure about the "multiple index" you mean. It usually appears after some group and aggregate function; in your table it just looks like multiple columns. So, if I understand correctly, here is complete code showing how to use a compound condition on the DataFrame:

import io
import pandas as pd

data = """Category A B C D Year
E 147 78 476 531 1990
F 914 356 337 781 1990
G 117 874 15 69 1990
H 45 682 247 65 1990
I 20 255 465 19 1990"""

table = pd.read_csv(io.StringIO(data), sep=r"\s+")

years = table["Year"].unique()
for year in years:
    row = table[((table["Category"] == "G") | (table["Category"] == "H")) & (table["Year"] == year)]
    row = row[["A", "B", "C", "D"]].sum()
    row["Category"], row["Year"] = "sum", year
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    table = pd.concat([table, row.to_frame().T], ignore_index=True)
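The same result can be had without the Python-level loop; a sketch on the same table:

# One groupby replaces the per-year loop: slice G/H, sum per year, append
gh = table[table["Category"].isin(["G", "H"])]
sums = gh.groupby("Year", as_index=False)[["A", "B", "C", "D"]].sum()
sums["Category"] = "sum"
table = pd.concat([table, sums], ignore_index=True)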
If you are only interested in G/H, you can slice with isin combined with boolean indexing, then sum:

df[df['Category'].isin(['G', 'H'])].sum()

output:

Category      GH
A            162
B           1556
C            262
D            134
Year        3980
dtype: object

NB: note here the side effect of sum, which concatenates the two "G"/"H" strings into one "GH".

Or, better, set Category as the index and slice with loc:

df.set_index('Category').loc[['G', 'H']].sum()

output:

A        162
B       1556
C        262
D        134
Year    3980
dtype: int64
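If the "GH" artifact bothers you, a one-line variation: restrict the sum to numeric columns so the Category strings are left out entirely:

# numeric_only skips the string column, avoiding the concatenated "GH"
df[df['Category'].isin(['G', 'H'])].sum(numeric_only=True)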
Selecting top % of rows in pandas
I have a sample dataframe as below (the actual dataset is roughly 300k entries long):

     user_id  revenue
0        234      100
1       2873      200
2        827      489
3         12      237
4       8942    28934
..       ...      ...
96       498   892384
97      2345       92
98       239     2803
99      4985    98332
100      947     4588

which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue-generating users). The method that comes closest to mind for me is calculating the total number of users, working out 20% of that, sorting the dataframe with sort_values() and then using head() or nlargest(), but I'd like to know if there is a simpler, more elegant way. Can anybody propose one? Thank you!
Suppose you have dataframe df:

user_id  revenue
234           21
2873          20
827           23
12            23
8942          28
498           22
2345          20
239           24
4985          21
947           25

I've flattened the revenue distribution to show the idea. Now, calculating step by step:

df = pd.read_clipboard()
df = df.sort_values(by='revenue', ascending=False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum'] / df['revenue'].sum()
df

Result:

   user_id  revenue  revenue_cum  %revenue_cum
4     8942       28           28      0.123348
9      947       25           53      0.233480
7      239       24           77      0.339207
2      827       23          100      0.440529
3       12       23          123      0.541850
5      498       22          145      0.638767
0      234       21          166      0.731278
8     4985       21          187      0.823789
1     2873       20          207      0.911894
6     2345       20          227      1.000000

Only the 2 top users generate 23.3% of total revenue.
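To actually slice those top users out of the table above, one sketch keeps every row whose cumulative share before it is still under the threshold, so the crossing row is included:

# Running total *before* each row; keep rows while it is below 20%
mask = df['%revenue_cum'].shift(fill_value=0) < 0.20
top_users = df[mask]  # the 2 users covering the first ~23.3% of revenue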
This seems to be a case for df.quantile from the pandas documentation; if you are looking for the top 20%, all you need to do is pass the quantile value you desire. A case example from your dataset:

import pandas as pd
import numpy as np

df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})
df.quantile([0.8, 1], interpolation='nearest')

This would print the top 2 rows by value:

     user_id  revenue
0.8     2873      489
1.0     8942    28934
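A follow-up sketch: to select the qualifying rows themselves rather than the quantile values, compare each revenue against the 80th percentile:

# Keep every user at or above the 80th percentile of revenue
top_rows = df[df['revenue'] >= df['revenue'].quantile(0.8)]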
I usually find it useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:

# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)

# Add a column with the aggregated effect of the row:
df['cumulative_percentage'] = 100 * df.revenue.cumsum() / df.revenue.sum()

# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]

The original df will be nicely sorted, with a clear indication of the top contributing rows, and the created top_percent df will contain the rows that need to be analyzed in particular.
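If you instead read "top 20%" as the top 20% of users by head count (the interpretation the question itself sketches with nlargest), that is a two-liner:

# Top 20% of the users by count, ranked by revenue
n = int(len(df) * 0.2)
top_users_by_count = df.nlargest(n, 'revenue')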
I am assuming you are looking for the cumulative top 20% revenue-generating users. Here is a function that will get you the expected output and even more. Just specify your dataframe, the column name of the revenue, and the n_percent you are looking for:

import pandas as pd

def n_percent_revenue_generating_users(df, col, n_percent):
    df.sort_values(by=[col], ascending=False, inplace=True)
    df[f'{col}_cs'] = df[col].cumsum()
    df[f'{col}_csp'] = 100 * df[f'{col}_cs'] / df[col].sum()
    df_ = df[df[f'{col}_csp'] > n_percent]
    index_nearest = (df_[f'{col}_csp'] - n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output

n_percent_revenue_generating_users(df, 'revenue', 20)
Sort letters in ascending order ('a-z') in Python after using value_counts
I imported my data file and isolated the first letter of each word, then counted the words. My next step is to sort the letters in ascending order, 'a' to 'z'. This is the code that I have right now:

import pandas as pd

df = pd.read_csv("text.txt", names=['FirstNames'])
df
df['FirstLetter'] = df['FirstNames'].str[:1]
df
df['FirstLetter'] = df['FirstLetter'].str.lower()
df
df['FirstLetter'].value_counts()
df
df2 = df['FirstLetter'].index.value_counts()
df2

Using .index.value_counts() wasn't working for me. It returned this output:

Out[72]:
2047    1
4647    1
541     1
4639    1
2592    1
545     1
4643    1
2596    1
549     1
2600    1
2612    1
553     1
4651    1
2604    1
557     1
4655    1
2608    1
561     1
2588    1
4635    1
..

How can I fix this?
You can use the sort_index() function. This should work:

df['FirstLetter'].value_counts().sort_index()
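Putting the whole pipeline together, a minimal sketch assuming the text.txt from the question:

import pandas as pd

df = pd.read_csv("text.txt", names=['FirstNames'])
df['FirstLetter'] = df['FirstNames'].str[:1].str.lower()

# value_counts() sorts by frequency; sort_index() re-sorts alphabetically
counts = df['FirstLetter'].value_counts().sort_index()
print(counts)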
Format Pandas Pivot Table
I ran into a problem formatting a pivot table created with pandas. I made a matrix table between two columns (A, B) from my source data, using pandas.pivot_table with A as the columns and B as the index:

>> df = PD.read_excel("data.xls")
>> table = PD.pivot_table(df, index=["B"], values='Count', columns=["A"], aggfunc=[NUM.sum], fill_value=0, margins=True, dropna=True)
>> table

It returns:

      sum
A       1   2   3  All
B
1      23  52   0   75
2      16  35  12   65
3      56   0   0   56
All    95  87  12  196

And I would like to have a format like this:

           A          All_B
        1   2   3
     1  23  52   0   75
B    2  16  35  12   65
     3  56   0   0   56
All_A  95  87  12  196

How should I do this? Thanks very much in advance.
The table returned by pd.pivot_table is very convenient to work on (it has a single-level index/column) and normally does NOT require any further format manipulation. But if you insist on changing the format to the one you mentioned in the post, then you need to construct a multi-level index/column using pd.MultiIndex. Here is an example of how to do it.

Before manipulation:

import pandas as pd
import numpy as np

np.random.seed(0)
a = np.random.randint(1, 4, 100)
b = np.random.randint(1, 4, 100)
df = pd.DataFrame(dict(A=a, B=b, Val=np.random.randint(1, 100, 100)))

table = pd.pivot_table(df, index='A', columns='B', values='Val',
                       aggfunc=sum, fill_value=0, margins=True)
print(table)

B       1     2     3   All
A
1     454   649   770  1873
2     628   576   467  1671
3     376   247   481  1104
All  1458  1472  1718  4648

After:

multi_level_column = pd.MultiIndex.from_arrays([['A', 'A', 'A', 'All_B'], [1, 2, 3, '']])
multi_level_index = pd.MultiIndex.from_arrays([['B', 'B', 'B', 'All_A'], [1, 2, 3, '']])
table.index = multi_level_index
table.columns = multi_level_column
print(table)

           A              All_B
           1     2     3
B 1      454   649   770   1873
  2      628   576   467   1671
  3      376   247   481   1104
All_A   1458  1472  1718   4648
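If all you really need are the relabelled margins rather than the extra header level, a lighter sketch simply renames them in place:

# Rename just the margin labels, keeping the single-level layout
table2 = table.rename(columns={'All': 'All_B'}, index={'All': 'All_A'})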