Pandas Groupby Percentage of total - python

df["% Sales"] = df["Jan"]/df["Q1"]
q1_sales = df.groupby(["City"])["Jan","Feb","Mar", "Q1"].sum()
q1_sales.head()
Jan Feb Mar Q1
City
Los Angeles 44 40 54 138
I want code that gets each month's percentage of total sales for the quarter. I want it to look like the output below, where each month is divided by the total sales for the quarter.
Jan Feb Mar
City
Los Angeles 31.9% 29% 39.1%

Try div:
q1_sales[['Jan','Feb','Mar']].div(q1_sales['Q1']*0.01, axis='rows')
Output:
Jan Feb Mar
City
Los Angeles 31.884058 28.985507 39.130435
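If you want the output rendered like the target table above (one decimal place plus a % sign), plain string formatting on top of the div result is one option; a minimal sketch (for display only, since the values become strings):
# divide by the quarter total, then format each ratio as e.g. "31.9%"
pct = q1_sales[['Jan', 'Feb', 'Mar']].div(q1_sales['Q1'], axis='rows')
print(pct.applymap(lambda v: f"{v:.1%}"))  # applymap is named DataFrame.map in pandas >= 2.1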

Use:
new_df=q1_sales[q1_sales.columns.difference(['Q1'])]
new_df=(new_df.T/new_df.sum(axis=1)*100).T
print(new_df)
Feb Jan Mar
Los Angeles 28.985507 31.884058 39.130435

Related

Turning one column into multiple pro-rated columns

I have a data regarding an insurance customer's premium during a certain year.
User ID  Period From  Period to  Period from-period to  Total premium
A8856    Jan 2022     Apr 2022   4                      $600
A8857    Jan 2022     Feb 2022   2                      $400
And I'm trying to turn it into a pro-rated one. The output I'm expecting looks like this:
User ID  Period From  Total premium
A8856    Jan 2022     $150
A8856    Feb 2022     $150
A8856    Mar 2022     $150
A8856    Apr 2022     $150
A8857    Jan 2022     $200
A8857    Feb 2022     $200
What kind of code do you think I should use? I use Python, and any help is really appreciated.
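One way to do this is to build the month range for each row with pd.period_range and explode it to one row per month; a minimal sketch (column names and the "Mon YYYY" format are taken from the tables above, and the premium is assumed to split evenly across the months):
import pandas as pd
df = pd.DataFrame({
    "User ID": ["A8856", "A8857"],
    "Period From": ["Jan 2022", "Jan 2022"],
    "Period to": ["Apr 2022", "Feb 2022"],
    "Total premium": [600, 400],
})
# one list of monthly periods per row, covering Period From .. Period to inclusive
start = pd.to_datetime(df["Period From"], format="%b %Y")
end = pd.to_datetime(df["Period to"], format="%b %Y")
df["Period From"] = [list(pd.period_range(s, e, freq="M")) for s, e in zip(start, end)]
# split the premium evenly over the months, then explode to one row per month
df["Total premium"] = df["Total premium"] / df["Period From"].str.len()
out = df.explode("Period From")[["User ID", "Period From", "Total premium"]]
print(out)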

Pandas where function

I'm using the pandas where function to try to find each state's percentage of the total:
filter1 = df['state']=='California'
filter2 = df['state']=='Texas'
filter3 = df['state']=='Florida'
df['percentage']= df['total'].where(filter1)/df['total'].where(filter1).sum()
The output is
Year state total percentage
2014 California 914198.0 0.134925
2014 Florida 766441.0 NaN
2014 Texas 1045274.0 NaN
2015 California 874642.0 0.129087
2015 Florida 878760.0 NaN
How do I apply the other two filters there as well?
Don't use where; use groupby.transform:
df['percentage'] = df['total'].div(df.groupby('state')['total'].transform('sum'))
Output:
Year state total percentage
0 2014 California 914198.0 0.511056
1 2014 Florida 766441.0 0.465865
2 2014 Texas 1045274.0 1.000000
3 2015 California 874642.0 0.488944
4 2015 Florida 878760.0 0.534135
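If you do want to stay with where, you can compute each state's share separately and add the pieces together, filling the NaNs that where leaves on the non-matching rows; a sketch reusing df and the filters from the question:
# each term holds that state's share where its filter matches and 0 elsewhere
df['percentage'] = sum(
    (df['total'].where(f) / df['total'].where(f).sum()).fillna(0)
    for f in (filter1, filter2, filter3)
)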
You can also combine filters with the OR operator, e.g. df.loc[filter1 | filter2 | filter3], to apply multiple filters together. (Note that & would not work here: a row's state can only equal one of the three values, so the AND of the filters is never true.)

Error "6 columns passed, passed data had 286 columns "

I am web-scraping the table found on this website: https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html
Everything was good, but I had a small issue with the "Price" label and was unable to fix it. I've been trying for the past few hours, and the last error I ran into is "6 columns passed, passed data had 286 columns".
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import requests

page = requests.get("https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html")
soup = BeautifulSoup(page.content, "lxml")
gdp = soup.find_all("table", attrs={"class": "table flight-detail hidden-xs"})
print("Number of tables on site: ", len(gdp))
table1 = gdp[0]
# the head will form our column names
body = table1.find_all("tr")
print(len(body))
# Head values (column names) are the first items of the body list
head = body[0]        # 0th item is the header row
body_rows = body[1:]  # all other items become the rest of the rows
# Let's now iterate through the head HTML code and make a list of clean headings
headings = []  # declare empty list to keep the column names
for item in head.find_all("th"):  # loop through all th elements
    # convert the th element to text and strip "\n"
    item = (item.text).rstrip("\n")
    # append the clean column name to headings
    headings.append(item)
print(headings)

import re
all_rows = []  # will be a list of lists, one list per row
for row_num in range(len(body_rows)):  # a row at a time
    row = []  # this will hold the entries for one row
    for row_item in body_rows[row_num].find_all("td")[:-1]:  # loop through all row entries
        # row_item.text removes the tags from the entries;
        # the regex removes \xa0 (non-breaking space), \n (newline)
        # and a \t followed by the comma that separates thousands in numbers
        aa = re.sub("(\xa0)|(\n)|(\t),", "", row_item.text)
        # append aa to row - note one row entry is being appended
        row.append(aa)
    # append one row to all_rows
    all_rows.append(row)
    for row_item in body_rows[row_num].find_all("td")[-1].find("span").text:  # loop through the last row entry, price
        aa = re.sub("(\xa0)|(\n)|(\t),", "", row_item)
        row.append(aa)
    all_rows.append(row)

# We can now use the data in all_rows and headings to make a table;
# all_rows becomes our data and headings the column names
df = pd.DataFrame(data=all_rows, columns=headings)
#df.head()
#print(df)
df["Date"] = pd.to_datetime(df["Date"]).dt.strftime("%d/%m/%Y")
print(df)
If you could please run the code and tell me how to solve this issue so I can print everything with print(df). Previously, I was able to print everything except the price, which showed "\t\t\t\t\t\t\t" instead of the price.
Thank you.
The "6 columns passed" error happens because your last inner loop iterates over .find("span").text one character at a time, so every character of the price string is appended as its own column. To get the table into a pandas DataFrame, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = []
for tr in soup.select("tr:has(td)"):
    # one list entry per <td>, with whitespace collapsed
    row = [td.get_text(strip=True, separator=" ") for td in tr.select("td")]
    data.append(row)

df = pd.DataFrame(data, columns="From To Aircraft Seats Date Price".split())
print(df)
df.to_csv("data.csv", index=False)
Prints:
From To Aircraft Seats Date Price
0 Prague Vaclav Havel Airport Bratislava M R Stefanik Citation XLS+ 9 Thu Jun 03 00:00:00 UTC 2021 €3 300 (RRP €6 130)
1 Billund Odense Learjet 45 / 45XR 8 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €7 100)
2 La Roche/yon Les Ajoncs Nantes Atlantique Embraer Phenom 100 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €4 820)
3 London Biggin Hill Paris Le Bourget Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €6 980)
4 Prague Vaclav Havel Airport Salzburg (mozart) Gulfstream G200 9 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 800)
5 Palma De Mallorca Edinburgh Cessna C525 Citation CJ2 5 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €18 680)
6 Linz Blue Danube Linz Munich Munchen Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €3 600)
7 Geneva Cointrin Paris Le Bourget Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €9 240)
8 Vienna Schwechat Cologne-bonn Koln Bonn Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 590)
9 Cannes Mandelieu Geneva Cointrin Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 220)
10 Brussels National Cologne-bonn Koln Bonn Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €3 790)
11 Split Bari Palese Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 220)
12 Copenhagen Roskilde Aalborg Challenger 604 11 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €16 750)
13 Brussels National Leipzig Halle Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €6 690)
...
And saves data.csv (screenshot from LibreOffice not shown).
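As an aside, for a plain static HTML table pandas can often do the parsing by itself; a sketch, assuming the table is present in the raw HTML (the site may require request headers) and a parser such as lxml is installed:
import pandas as pd
url = "https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html"
# read_html returns a list of DataFrames, one per <table> element on the page
tables = pd.read_html(url)
print(tables[0].head())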

How to separate date values from a text column with special characters in a pandas dataframe? [closed]

I have a column with 4 values like below in a dataframe:
Input
India,Chennai - 24 Oct 1992
India,-Chennai, Oct 1992
(Asia) India,Chennai-22 Oct 1992
India,-Chennai, 1992
Output
Place                 Date
India Chennai         24 Oct 1992
India Chennai         Oct 1992
(Asia) India Chennai  22 Oct 1992
India Chennai         1992
I need to split the date/year (e.g. 23 Oct 1992, 1992) out as one column and the place text (India, Chennai) as a separate column.
I'm a bit confused about how to extract the values; I tried the replace and split options but couldn't achieve the result.
I would appreciate it if somebody could help!
Apologies for the format of the input and output data.
Use:
import re
df['Date'] = df['col'].str.split("(-|,)").str[-1]
df['Place'] = df.apply(lambda x: x['col'].split(x['Date']), axis=1).str[0].str.replace(',', ' ').str.replace('-', '')
Input
col
0 India,Chennai - 24 Oct 1992
1 India,-Chennai,Oct 1992
2 India,-Chennai, 1992
3 (Asia) India,Chennai-22 Oct 1992
Output
col Place Date
0 India,Chennai - 24 Oct 1992 India Chennai 24 Oct 1992
1 India,-Chennai,Oct 1992 India Chennai Oct 1992
2 India,-Chennai, 1992 India Chennai 1992
3 (Asia) India,Chennai-22 Oct 1992 (Asia) India Chennai 22 Oct 1992
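Another option is a single regular expression that peels the trailing date off each string; a sketch, assuming the date always sits at the end as "day month year", "month year", or a bare year:
import pandas as pd
df = pd.DataFrame({'col': ['India,Chennai - 24 Oct 1992',
                           'India,-Chennai, Oct 1992',
                           '(Asia) India,Chennai-22 Oct 1992',
                           'India,-Chennai, 1992']})
# optional day, optional three-letter month, mandatory four-digit year at the end
date_pat = r'((?:\d{1,2}\s+)?(?:[A-Za-z]{3}\s+)?\d{4})\s*$'
df['Date'] = df['col'].str.extract(date_pat, expand=False)
# strip the date and any trailing separators, then tidy the place text
df['Place'] = (df['col'].str.replace(r'\s*[-,]*\s*' + date_pat, '', regex=True)
                        .str.replace(r'[,-]+', ' ', regex=True)
                        .str.strip())
print(df)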
There are a lot of ways to create columns using the pandas library in Python: you can build a DataFrame from lists, from a list of dictionaries, or from a dictionary of lists.
For simple understanding, here I am going to use lists.
First, import pandas:
import pandas as pd
Create a list from the given data:
data = [['India', 'chennai', '24 Oct', 1992], ['India', 'chennai', '23 Oct', 1992],
        ['India', 'chennai', '23 Oct', 1992], ['India', 'chennai', '21 Oct', 1992]]
Create a DataFrame from the list:
df = pd.DataFrame(data, columns=['Country', 'City', 'Date', 'Year'], index=(0, 1, 2, 3))
Print it:
print(df)
The output will be:
Country City Date Year
0 India chennai 24 Oct 1992
1 India chennai 23 Oct 1992
2 India chennai 23 Oct 1992
3 India chennai 21 Oct 1992
Hope this helps.
The following assumes that the first digit is where we always want to split the text. If the assumption fails then the code also fails!
>>> import re
>>> text_array
['India,Chennai - 24 Oct 1992', 'India,-Chennai,23 Oct 1992', '(Asia) India,Chennai-22 Oct 1992', 'India,-Chennai, 1992']
# split at the first digit, keep the digit, split at only the first digit
>>> tmp = [re.split("([0-9]){1}", t, maxsplit=1) for t in text_array]
>>> tmp
[['India,Chennai - ', '2', '4 Oct 1992'], ['India,-Chennai,', '2', '3 Oct 1992'], ['(Asia) India,Chennai-', '2', '2 Oct 1992'], ['India,-Chennai, ', '1', '992']]
# join the last two fields together to get the digit back.
>>> r = [(i[0], "".join(i[1:])) for i in tmp]
>>> r
[('India,Chennai - ', '24 Oct 1992'), ('India,-Chennai,', '23 Oct 1992'), ('(Asia) India,Chennai-', '22 Oct 1992'), ('India,-Chennai, ', '1992')]
If you have control over how the input is generated, then I would suggest making the input more consistent; then we can parse it using a tool like pandas or directly with the csv module.
Hope this helps.
Regards,
Prasanth
Python code:
import re
import pandas as pd

input_dir = '/content/drive/My Drive/TestData'
csv_file = '{}/test001.csv'.format(input_dir)

# matches "24 Oct 1992", "Oct 1992" or a bare "1992"
p = re.compile(r'(?:[0-9]|[0-2][0-9]|[3][0-1])\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?:\d{4})|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?:\d{4})|(?:\d{4})', re.IGNORECASE)

places = []
dates = []
with open(csv_file, encoding='utf-8', errors='ignore') as f:
    for line in f:
        s = re.sub("[,-]", " ", line.strip())
        s = re.sub(r"\s+", " ", s)
        r = p.search(s)
        str_date = r.group()
        dates.append(str_date)
        place = s[0:s.find(str_date)]
        places.append(place)

data = {'Place': places, 'Date': dates}
df = pd.DataFrame(data)
print(df)
Output:
Place Date
0 India Chennai 24 Oct 1992
1 India Chennai Oct 1992
2 (Asia) India Chennai 22 Oct 1992
3 India Chennai 1992

How to remove duplicate rows based on partial strings in Python

If I have a dataframe as follows, in which 01 and 02, 03 and 04, and 05 and 06 are the same cities:
id city
01 New York City
02 New York
03 Tokyo City
04 Tokyo
05 Shanghai City
06 Shanghai
07 Beijing City
08 Paris
09 Berlin
How can I drop the duplicate cities and get the following dataframe? Thanks.
id city
01 New York
02 Tokyo
03 Shanghai
04 Beijing City
05 Paris
06 Berlin
Replace the "City" part with an empty string and apply a groupby, keeping the first row in each group.
df=pd.DataFrame({'id':[1,2,3,4],'city':['New York City','New York','Tokyo City','Tokyo']})
df looks like this
city id
0 New York City 1
1 New York 2
2 Tokyo City 3
3 Tokyo 4
Apply the replace and groupby to get the first row in each group:
df.city=df.city.str.replace('City','').str.strip()
df.groupby('city').first().sort_values('id')
Output:
city id
New York 1
Tokyo 3
Or use drop_duplicates on a subset of columns. Thanks @JR ibkr.
df.drop_duplicates(subset='city')
This is much easier in pandas now with drop_duplicates and the keep parameter.
# dataset
df = pd.DataFrame({'id':[1,2,3,4],'city':['New York City','New York','Tokyo City','Tokyo']})
# replace values
df.city = df.city.str.replace('City','').str.strip()
# drop duplicate (answer of original question)
df.drop_duplicates(subset=['city'])
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
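To make the keep parameter concrete: keep='first' (the default) keeps the earlier row in each duplicate group and keep='last' keeps the later one. A quick sketch with the same sample data:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'city': ['New York City', 'New York', 'Tokyo City', 'Tokyo']})
# normalize the names so 'New York City' and 'New York' collapse to one key
df.city = df.city.str.replace('City', '', regex=False).str.strip()
print(df.drop_duplicates(subset=['city'], keep='first'))  # keeps ids 1 and 3
print(df.drop_duplicates(subset=['city'], keep='last'))   # keeps ids 2 and 4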
