Pandas drop unique row in order to use groupby and qcut - python

How do I drop unique? It is interfering with groupby and qcut.
df0 = psql.read_frame(sql_query,conn)
df = df0.sort(['industry','C'], ascending=[False,True] )
Here is my dataframe:
id industry C
5 28 other industry 0.22
9 32 Specialty Eateries 0.60
10 33 Restaurants 0.84
1 22 Processed & Packaged Goods 0.07
0 21 Processed & Packaged Goods 0.14
8 31 Processed & Packaged Goods 0.43
11 34 Major Integrated Oil & Gas 0.07
14 37 Major Integrated Oil & Gas 0.50
15 38 Independent Oil & Gas 0.06
18 41 Independent Oil & Gas 0.06
19 42 Independent Oil & Gas 0.13
12 35 Independent Oil & Gas 0.43
16 39 Independent Oil & Gas 0.65
17 40 Independent Oil & Gas 0.91
13 36 Independent Oil & Gas 2.25
2 25 Food - Major Diversified 0.35
3 26 Beverages - Soft Drinks 0.54
4 27 Beverages - Soft Drinks 0.73
6 29 Beverages - Brewers 0.19
7 30 Beverages - Brewers 0.21
And I've used the following code from pandas and qcut to rank column 'C' which sadly went batsh*t on me.
df['rank'] = df.groupby(['industry'])['C'].transform(lambda x: pd.qcut(x,5, labels=range(1,6)))
After researching a bit, the reason qcut threw errors is that some industries appear only once, so their groups contain a single value (see the linked error reports).
Still, I want to be able to rank without throwing out the unique rows (a unique row should be assigned the value 1) if that is possible. But after so many tries, I am convinced that qcut can't handle single-row groups, so I am willing to settle for dropping them to let qcut happily do its thing.
But if there is another way, I'm very curious to know. I really appreciate your help.

Just in case anyone still wants to do this: you should be able to do it by keeping only the rows whose industry is duplicated:
df = df[df['industry'].duplicated(keep=False)]
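If you'd rather keep the single-row industries and give them rank 1, as the question asks, something along these lines should work. This is a sketch: `safe_qcut` and the toy dataframe are my own names and data, not from the question.

```python
import pandas as pd

# Toy data: industry 'B' appears only once, like the unique rows in the question.
df = pd.DataFrame({
    'industry': ['A', 'A', 'A', 'A', 'A', 'B'],
    'C': [0.06, 0.13, 0.43, 0.65, 0.91, 0.22],
})

def safe_qcut(x):
    # A single-row group can't be cut into 5 quantiles; assign rank 1 instead.
    if len(x) < 2:
        return pd.Series(1, index=x.index)
    return pd.Series(pd.qcut(x, 5, labels=range(1, 6)).astype(int), index=x.index)

df['rank'] = df.groupby('industry')['C'].transform(safe_qcut)
```

The five 'A' rows get ranks 1 through 5 and the lone 'B' row gets rank 1, so nothing has to be dropped.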

Related

pandas - using a dataframe to find the mean by dropping duplicates

I have a list of sales data which has a header which look like this:
Product ID SN Age Gender Item ID Item Name Price
0 0 Lisim78 20 Male 108 Extraction, 3.53
1 1 Lisovynya38 40 Male 143 Frenzied Scimitar 1.56
2 2 Ithergue48 24 Male 92 Final Critic 4.88
3 3 Chamassasya86 24 Male 100 Blindscythe 3.27
4 4 Iskosia90 23 Male 131 Fury 1.44
There are obviously a number of sales items which are sold multiple times. I'm trying to get the mean of the sales price. Here's the code I created:
average_price = purchase_data_df.groupby('Item ID')[["Price"]].mean()
print(average_price)
But this seems to only give the mean across each Item ID. How do I code to get the overall mean?
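Depending on which "overall mean" is wanted, two one-liners cover both readings. This is a sketch on a made-up slice of the data, with duplicated sales included:

```python
import pandas as pd

purchase_data_df = pd.DataFrame({
    'Item ID': [108, 143, 92, 108, 92],
    'Price':   [3.53, 1.56, 4.88, 3.53, 4.88],
})

# Overall mean across every sale (duplicates included):
overall = purchase_data_df['Price'].mean()

# Mean of distinct item prices (one row per Item ID, duplicates dropped):
distinct = purchase_data_df.drop_duplicates('Item ID')['Price'].mean()
```

The second variant matches the "find the mean by dropping duplicates" phrasing in the title.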

python: Arrange in pandas dataframe

I extract the data from a webpage but would like to arrange it into the pandas dataframe table.
finviz = requests.get('https://finviz.com/screener.ashx?v=152&o=ticker&c=0,1,2,3,4,5,6,7,10,11,12,14,16,17,19,21,22,23,24,25,31,32,33,38,41,48,65,66,67&r=1')
finz = html.fromstring(finviz.content)
col = finz.xpath('//table/tr/td[@class="table-top"]/text()')
data = finz.xpath('//table/tr/td/a[@class="screener-link"]/text()')
col holds the column names for the pandas dataframe, and each run of 28 data points in the data list should be arranged into one row: points 1 to 28 in the first row, points 29 to 56 in the second row, and so forth. How do I write this code elegantly?
datalist = []
for y in range(28):
    datalist.append(data[y])
>>> datalist
['1', 'Agilent Technologies, Inc.', 'Healthcare', 'Medical Laboratories & Research', 'USA', '23.00B', '29.27', '4.39', '4.53', '18.76', '1.02%', '5.00%', '5.70%', '324.30M', '308.52M', '2.07', '8.30%', '15.70%', '14.60%', '1.09', '1,775,149', '2', 'Alcoa Corporation', 'Basic Materials', 'Aluminum', 'USA', '1.21B', '-']
But the result is not in table form like a dataframe.
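For the reshaping described above, the two flat lists can also be chunked into 28-wide rows directly. This is a sketch with placeholder col/data lists standing in for the XPath results (and it assumes len(data) is a multiple of len(col)):

```python
import pandas as pd

# Placeholders for the XPath results from the question:
col = [f'c{i}' for i in range(28)]    # 28 column names
data = [str(i) for i in range(56)]    # two rows' worth of cells

# Slice the flat list into consecutive chunks of len(col) items each.
rows = [data[i:i + len(col)] for i in range(0, len(data), len(col))]
df = pd.DataFrame(rows, columns=col)
```

Each slice of 28 values becomes one dataframe row, so points 29 to 56 land in the second row as described.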
Pandas has a function to parse HTML: pd.read_html
You can try the following:
# Modules
import pandas as pd
import requests
# HTML content
finviz = requests.get('https://finviz.com/screener.ashx?v=152&o=ticker&c=0,1,2,3,4,5,6,7,10,11,12,14,16,17,19,21,22,23,24,25,31,32,33,38,41,48,65,66,67&r=1')
# Convert to dataframe
df = pd.read_html(finviz.content)[-2]
# Set 1st row to columns names
df.columns = df.iloc[0]
# Drop 1st row
df = df.drop(df.index[0])
# df = df.set_index('No.')
print(df)
# 0 No. Ticker Company Sector Industry Country ... Debt/Eq Profit M Beta Price Change Volume
# 1 1 A Agilent Technologies, Inc. Healthcare Medical Laboratories & Research USA ... 0.51 14.60 % 1.20 72.47 - 0.28 % 177333
# 2 2 AA Alcoa Corporation Basic Materials Aluminum USA ... 0.44 - 10.80 % 2.03 6.28 3.46 % 3021371
# 3 3 AAAU Perth Mint Physical Gold ETF Financial Exchange Traded Fund USA ... - - - 16.08 - 0.99 % 45991
# 4 4 AACG ATA Creativity Global Services Education & Training Services China ... 0.02 - 2.96 0.95 - 0.26 % 6177
# 5 5 AADR AdvisorShares Dorsey Wright ADR ETF Financial Exchange Traded Fund USA ... - - - 40.80 0.22 % 1605
# 6 6 AAL American Airlines Group Inc. Services Major Airlines USA ... - 3.70 % 1.83 12.81 4.57 % 16736506
# 7 7 AAMC Altisource Asset Management Corporation Financial Asset Management USA ... - -17.90 % 0.78 12.28 0.00 % 0
# 8 8 AAME Atlantic American Corporation Financial Life Insurance USA ... 0.28 - 0.40 % 0.29 2.20 3.29 % 26
# 9 9 AAN Aaron's, Inc. Services Rental & Leasing Services USA ... 0.20 0.80 % 1.23 22.47 - 0.35 % 166203
# 10 10 AAOI Applied Optoelectronics, Inc. Technology Semiconductor - Integrated Circuits USA ... 0.49 - 34.60 % 2.02 7.80 2.63 % 61303
# 11 11 AAON AAON, Inc. Industrial Goods General Building Materials USA ... 0.02 11.40 % 0.88 48.60 0.71 % 20533
# 12 12 AAP Advance Auto Parts, Inc. Services Auto Parts Stores USA ... 0.21 5.00 % 1.04 95.94 - 0.58 % 165445
# 13 13 AAPL Apple Inc. Consumer Goods Electronic Equipment USA ... 1.22 21.50 % 1.19 262.39 2.97 % 11236642
# 14 14 AAT American Assets Trust, Inc. Financial REIT - Retail USA ... 1.03 12.50 % 0.99 25.35 2.78 % 30158
# 15 15 AAU Almaden Minerals Ltd. Basic Materials Gold Canada ... 0.04 - 0.53 0.28 - 1.43 % 34671
# 16 16 AAWW Atlas Air Worldwide Holdings, Inc. Services Air Services, Other USA ... 1.33 - 10.70 % 1.65 22.79 2.70 % 56521
# 17 17 AAXJ iShares MSCI All Country Asia ex Japan ETF Financial Exchange Traded Fund USA ... - - - 60.13 1.18 % 161684
# 18 18 AAXN Axon Enterprise, Inc. Industrial Goods Aerospace/Defense Products & Services USA ... 0.00 0.20 % 0.77 71.11 2.37 % 187899
# 19 19 AB AllianceBernstein Holding L.P. Financial Asset Management USA ... 0.00 89.60 % 1.35 19.15 1.84 % 54588
# 20 20 ABB ABB Ltd Industrial Goods Diversified Machinery Switzerland ... 0.67 5.10 % 1.10 17.44 0.52 % 723739
# [20 rows x 29 columns]
I'll let you improve the data selection if the HTML page structure changes! The parent div id might be useful.
Explanation of "[-2]": read_html returns a list of dataframes:
list_df = pd.read_html(finviz.content)
print(type(list_df))
# <class 'list'>
# Elements types in the lists
print(type(list_df[0]))
# <class 'pandas.core.frame.DataFrame'>
So in order to get the desired dataframe, I select the 2nd element from the end with [-2]. This discussion explains negative indexes.

Trying to use first 23 rows of a Pandas data frame as headers and then pivot on the headers

I'm pulling in the data frame using tabula. Unfortunately, the data is arranged in rows as below. I need to take the first 23 rows and use them as column headers for the remainder of the data. I need each row to contain these 23 headers for each of about 60 clinics.
Col \
0 Date
1 Clinic
2 Location
3 Clinic Manager
4 Lease Cost
5 Square Footage
6 Lease Expiration
8 Care Provided
9 # of Providers (Full Time)
10 # FTE's Providing Care
11 # Providers (Part-Time)
12 Patients seen per week
13 Number of patients in rooms per provider
14 Number of patients in waiting room
15 # Exam Rooms
16 Procedure rooms
17 Other rooms
18 Specify other
20 Other data:
21 TI Needs:
23 Conclusion & Recommendation
24 Date
25 Clinic
26 Location
27 Clinic Manager
28 Lease Cost
29 Square Footage
30 Lease Expiration
32 Care Provided
33 # of Providers (Full Time)
34 # FTE's Providing Care
35 # Providers (Part-Time)
36 Patients seen per week
37 Number of patients in rooms per provider
38 Number of patients in waiting room
39 # Exam Rooms
40 Procedure rooms
41 Other rooms
42 Specify other
44 Other data:
45 TI Needs:
47 Conclusion & Recommendation
Val
0 9/13/2017
1 Gray Medical Center
2 1234 E. 164th Ave Thornton CA 12345
3 Jane Doe
4 $23,074.80 Rent, $5,392.88 CAM
5 9,840
6 7/31/2023
8 Family Medicine
9 12
10 14
11 1
12 750
13 4
14 2
15 31
16 1
17 X-Ray, Phlebotomist/blood draw
18 NaN
20 Facilities assistance needed. 50% of business...
21 Paint and Carpet (flooring is in good conditio...
23 Lay out and occupancy flow are good for this p...
24 9/13/2017
25 Main Cardiology
26 12000 Wall St Suite 13 Main CA 12345
27 John Doe
28 $9610.42 Rent, $2,937.33 CAM
29 4,406
30 5/31/2024
32 Cardiology
33 2
34 11, 2 - P.T.
35 2
36 188
37 0
38 2
39 6
40 0
41 1 - Pacemaker, 1 - Treadmill, 1- Echo, 1 - Ech...
42 Nurse Office, MA station, Reading Room, 2 Phys...
44 Occupied in Emerus building. Needs facilities ...
45 New build out, great condition.
47 Practice recently relocated from 84th and Alco...
I was able to get my data frame in a better place by fixing the headers. I'm re-posting the first 3 "groups" of data to better illustrate the structure of the data frame. Everything repeats (headers and values) for each clinic.
Try this:
df2 = pd.DataFrame(df[23:].values.reshape(-1, 23),
                   columns=df[:23][0])
print(df2)
Here the number 23 is the number of columns in each row of the resulting df2; you can replace it with the desired number of columns.
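The same reshape, shown runnable on a toy frame with 3 headers instead of 23 (the data is made up to mirror the question's single-column layout):

```python
import pandas as pd

# One column of values: 3 header rows, then 3 values per clinic.
df = pd.DataFrame(['Date', 'Clinic', 'Location',       # header rows
                   '9/13/2017', 'Gray', 'Thornton',    # clinic 1
                   '9/14/2017', 'Main', 'Wall St'])    # clinic 2

# First 3 rows become the column names; the rest reshape into 3-wide rows.
df2 = pd.DataFrame(df[3:].values.reshape(-1, 3), columns=df[:3][0])
```

Each clinic's block of values becomes one row under the repeated headers.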

pd.read_csv multiple tables and parse data frames using index=0

I am new to pandas/python. Have used excel and stata pretty extensively.
I get a .csv file with multiple tables in it from a supplier that will not change their format.
The tables have headers and a blank row in between them.
The number of rows in each table can vary.
The number of tables also seems to vary (I just discovered!).
There are 23 possible tables that can come in the file.
I have managed to create one big data frame from the file,
but I can't seem to group it by column 0.
Here is the code I have so far:
%matplotlib inline
import csv
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
df = pd.read_csv(r'C:\Users\file.csv',names=range(25))
table_names = ["WAREHOUSE","SUPPLIER","PRODUCT","BRAND","INVENTORY","CUSTOMER","CONTACT","CHAIN","ROUTE","INVOICE","INVOICETRANS","SURVEY","FORECAST","PURCHASE","PURCHASETRANS","PRICINGMARKET","PRICINGMARKETCUSTOMER","PRICINGLINE","PRICINGLINEPRODUCT","EMPLOYEE"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,1]: g.iloc[0] for k,g in df.groupby(groups)}
here is a sample of the .csv file with the first 3 tables:
Record Identifier Sender ID Receiver ID Action Warehouse ID Warehouse Name System Close Date DBA Address Address 2 City State Postal Code Phone Fax Primary Contact Email FEIN DUNS GLN
WAREHOUSE COX SUPPLIERX Change 1 Richmond 20160127 Company 700 Court Anywhere CA 99999 5555555555 5555555555 na na 0 50682020
Record Identifier Sender ID Receiver ID Sender Supplier ID Supplier Name Supplier Family
SUPPLIER COX SUPPLIERX 16 SUPPLIERX SUPPLIERX
Record Identifier Sender ID Receiver ID Supplier Product Number Sender Product ID Product Name Sender Brand ID Active Cases Per Pallet Cases Per Layer Case GTIN Carrier GTIN Unit GTIN Package Name Case Weight Case Height Case Width Case Length Case Ounces Case Equivalents Retail Units Per Case Consumable Units Per Case Selling Unit Of Measure Container Material
PRODUCT COX SUPPLIERX 53030 LAG DOGTOWN PALE ALE 4/6/12OZ NR 217 Active 70 10 7.2383E+11 7.2383E+11 7.2383E+11 4/6/12oz NR 31.9 9.5 10.75 15.5 288 1 4 24 Case Aluminum
PRODUCT COX SUPPLIERX 53071 LAG DOGTOWN PALE ALE 1/2 KEG 217 Active 8 8 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 2100008003 53122 LAG CAPPUCCINO STOUT 12/22OZ NR 221 Active 75 15 7.2383E+11 7.2383E+11 7.2383E+11 12/22oz NR 33.6 9.5 10.75 14.2083 264 0.916667 12 12 Case Aluminum
PRODUCT COX SUPPLIERX 53130 LAG SUCKS ALE 4/6/12OZ NR 1473 Active 70 10 7.23831E+11 7.2383E+11 7.2383E+11 4/6/12oz NR 31.9 9.5 10.75 15.5 288 1 4 24 Case Aluminum
PRODUCT COX SUPPLIERX 53132 LAG SUCKS ALE 12/32oz NR 1473 Active 50 10 7.23831E+11 7.2383E+11 7.2383E+11 12/32oz NR 38.2 9.5 10.75 20.6667 384 1.333333 12 12 Case Aluminum
PRODUCT COX SUPPLIERX 53170 LAG SUCKS ALE 1/4 KEG 1473 Inactive 1 1 0 1.11111E+11 KEG-1/4 BBL 87.2 11.75 17 17 992 3.444444 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53171 LAG FARMHOUSE SAISON 1/2 KEG 1478 Inactive 16 1 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53172 LAG SUCKS ALE 1/2 KEG 1473 Active 80 4 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53255 LAG FARMHOUSE HOP STOOPID ALE 12/22 222 Active 75 15 7.23831E+11 7.2383E+11 7.2383E+11 12/22oz NR 33.6 9.5 10.75 14.2083 264 0.916667 12 12 Case Aluminum
PRODUCT COX SUPPLIERX 53271 LAG FARMHOUSE HOP STOOPID 1/2 KEG 222 Active 8 8 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53330 LAG CENSORED ALE 4/6/12OZ NR 218 Active 70 10 7.23831E+11 7.2383E+11 7.2383E+11 4/6/12oz NR 31.9 9.5 10.75 15.5 288 1 4 24 Case Aluminum
PRODUCT COX SUPPLIERX 53331 LAG CENSORED ALE 2/12/12 OZ NR 218 Inactive 60 1 7.2383E+11 7.2383E+11 7.2383E+11 2/12/12oz NR 31.9 9.5 10.75 15.5 288 1 2 24 Case Aluminum
PRODUCT COX SUPPLIERX 53333 LAG CENSORED ALE 24/12 OZ NR 218 Inactive 70 1 7.2383E+11 24/12oz NR 31.9 9.5 10.75 15.5 288 1 1 24 Case Aluminum
The first thing you need is simply to load your data cleanly. I'm going to assume your input file is tab-separated, even though your code doesn't specify that. This code works for me:
from io import StringIO
import pandas as pd

subfiles = [StringIO()]
with open('t.txt') as bigfile:
    for line in bigfile:
        if line.strip() == "":  # blank line, new subfile
            subfiles.append(StringIO())
        else:  # continuation of same subfile
            subfiles[-1].write(line)

for subfile in subfiles:
    subfile.seek(0)
    table = pd.read_csv(subfile, sep='\t')
    print('*****************')
    print(table)
Basically what I do is to break up the original file into subfiles by looking for blank lines. Once that's done, reading the chunks with Pandas is straightforward, so long as you specify the correct sep character.
This worked; then I used the slicing approach to create tables:
df = pd.read_csv('filelocation.csv', delim_whitespace=True, names=range(25))
table_names = ["WAREHOUSE", "SUPPLIER", "PRODUCT"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0, 1]: g.iloc[0] for k, g in df.groupby(groups)}
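The marker-based split can be sketched end to end on a toy file. The table names and data below are made up, and this version deliberately keys each table on the marker value in column 0 (`g.iloc[0, 0]`) and keeps all the data rows with `g.iloc[1:]` rather than only the first row:

```python
from io import StringIO
import pandas as pd

# Toy file: two tables, each introduced by a marker row in column 0.
raw = StringIO(
    "WAREHOUSE,COX,Richmond\n"
    "w1,a,b\n"
    "SUPPLIER,COX,16\n"
    "s1,c,d\n"
    "s2,e,f\n"
)
df = pd.read_csv(raw, names=range(3))

table_names = ["WAREHOUSE", "SUPPLIER"]
groups = df[0].isin(table_names).cumsum()  # increments at each marker row
tables = {g.iloc[0, 0]: g.iloc[1:] for _, g in df.groupby(groups)}
```

Each entry in `tables` maps a table name to a dataframe holding that table's data rows.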

Add calculated column to a pandas pivot table

I have created a pandas data frame and then converted it into pivot table.
My pivot table looks like this:
Operators TotalCB Qd(cb) Autopass(cb)
Aircel India 55 11 44
Airtel Ghana 20 17 3
Airtel India 41 9 9
Airtel Kenya 9 4 5
Airtel Nigeria 24 17 7
AT&T USA 18 10 8
I was wondering how to add calculated columns so that I get my pivot table with Autopass% (Autopass(cb)/TotalCB*100), just like we can create them in Excel using the calculated field option.
I want my pivot table output to be something like below:
Operators TotalCB Qd(cb) Autopass(cb) Qd(cb)% Autopass(cb)%
Aircel India 55 11 44 20% 80%
Airtel Ghana 20 17 3 85% 15%
Airtel India 41 29 9 71% 22%
Airtel Kenya 9 4 5 44% 56%
AT&T USA 18 10 8 56% 44%
How do I define the function which calculates the percentage columns, and how do I apply it to my two columns, Qd(cb) and Autopass(cb), to get the additional calculated columns?
This should do it, assuming data is your pivoted dataframe:
data['Autopass(cb)%'] = data['Autopass(cb)'] / data['TotalCB'] * 100
data['Qd(cb)%'] = data['Qd(cb)'] / data['TotalCB'] * 100
Adding a new column to a dataframe is as simple as df['colname'] = new_series. Here we assign it with your requested formula; done as a vector operation, it creates the new series in one step.
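Runnable on a made-up slice of the pivot table from the question:

```python
import pandas as pd

# Two of the operators from the question's pivot table.
data = pd.DataFrame({
    'TotalCB':      [55, 20],
    'Qd(cb)':       [11, 17],
    'Autopass(cb)': [44, 3],
}, index=['Aircel India', 'Airtel Ghana'])

# Vectorized calculated columns, as in the answer above.
data['Autopass(cb)%'] = data['Autopass(cb)'] / data['TotalCB'] * 100
data['Qd(cb)%'] = data['Qd(cb)'] / data['TotalCB'] * 100
```

Aircel India comes out at 80% Autopass and Airtel Ghana at 85% Qd, matching the expected output.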
