pd.read_csv multiple tables and parse data frames using index=0 - python

I am new to pandas/Python; I have used Excel and Stata pretty extensively.
I get a .csv file with multiple tables in it from a supplier that will not change their format.
The tables have headers and a blank row in between them.
The number of rows in each table can vary.
The number of tables also seems to vary (I just discovered!).
There are 23 possible tables that can come in the file.
I have managed to create one big data frame from the file.
I can't seem to group it by column 0 (the Record Identifier).
Here is the code I have so far:
%matplotlib inline
import csv
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
df = pd.read_csv(r'C:\Users\file.csv',names=range(25))
table_names = ["WAREHOUSE","SUPPLIER","PRODUCT","BRAND","INVENTORY","CUSTOMER","CONTACT","CHAIN","ROUTE","INVOICE","INVOICETRANS","SURVEY","FORECAST","PURCHASE","PURCHASETRANS","PRICINGMARKET","PRICINGMARKETCUSTOMER","PRICINGLINE","PRICINGLINEPRODUCT","EMPLOYEE"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,1]: g.iloc[0] for k,g in df.groupby(groups)}
Here is a sample of the .csv file with the first 3 tables:
Record Identifier Sender ID Receiver ID Action Warehouse ID Warehouse Name System Close Date DBA Address Address 2 City State Postal Code Phone Fax Primary Contact Email FEIN DUNS GLN
WAREHOUSE COX SUPPLIERX Change 1 Richmond 20160127 Company 700 Court Anywhere CA 99999 5555555555 5555555555 na na 0 50682020
Record Identifier Sender ID Receiver ID Sender Supplier ID Supplier Name Supplier Family
SUPPLIER COX SUPPLIERX 16 SUPPLIERX SUPPLIERX
Record Identifier Sender ID Receiver ID Supplier Product Number Sender Product ID Product Name Sender Brand ID Active Cases Per Pallet Cases Per Layer Case GTIN Carrier GTIN Unit GTIN Package Name Case Weight Case Height Case Width Case Length Case Ounces Case Equivalents Retail Units Per Case Consumable Units Per Case Selling Unit Of Measure Container Material
PRODUCT COX SUPPLIERX 53030 LAG DOGTOWN PALE ALE 4/6/12OZ NR 217 Active 70 10 7.2383E+11 7.2383E+11 7.2383E+11 4/6/12oz NR 31.9 9.5 10.75 15.5 288 1 4 24 Case Aluminum
PRODUCT COX SUPPLIERX 53071 LAG DOGTOWN PALE ALE 1/2 KEG 217 Active 8 8 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 2100008003 53122 LAG CAPPUCCINO STOUT 12/22OZ NR 221 Active 75 15 7.2383E+11 7.2383E+11 7.2383E+11 12/22oz NR 33.6 9.5 10.75 14.2083 264 0.916667 12 12 Case Aluminum
PRODUCT COX SUPPLIERX 53130 LAG SUCKS ALE 4/6/12OZ NR 1473 Active 70 10 7.23831E+11 7.2383E+11 7.2383E+11 4/6/12oz NR 31.9 9.5 10.75 15.5 288 1 4 24 Case Aluminum
PRODUCT COX SUPPLIERX 53132 LAG SUCKS ALE 12/32oz NR 1473 Active 50 10 7.23831E+11 7.2383E+11 7.2383E+11 12/32oz NR 38.2 9.5 10.75 20.6667 384 1.333333 12 12 Case Aluminum
PRODUCT COX SUPPLIERX 53170 LAG SUCKS ALE 1/4 KEG 1473 Inactive 1 1 0 1.11111E+11 KEG-1/4 BBL 87.2 11.75 17 17 992 3.444444 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53171 LAG FARMHOUSE SAISON 1/2 KEG 1478 Inactive 16 1 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53172 LAG SUCKS ALE 1/2 KEG 1473 Active 80 4 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53255 LAG FARMHOUSE HOP STOOPID ALE 12/22 222 Active 75 15 7.23831E+11 7.2383E+11 7.2383E+11 12/22oz NR 33.6 9.5 10.75 14.2083 264 0.916667 12 12 Case Aluminum
PRODUCT COX SUPPLIERX 53271 LAG FARMHOUSE HOP STOOPID 1/2 KEG 222 Active 8 8 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53330 LAG CENSORED ALE 4/6/12OZ NR 218 Active 70 10 7.23831E+11 7.2383E+11 7.2383E+11 4/6/12oz NR 31.9 9.5 10.75 15.5 288 1 4 24 Case Aluminum
PRODUCT COX SUPPLIERX 53331 LAG CENSORED ALE 2/12/12 OZ NR 218 Inactive 60 1 7.2383E+11 7.2383E+11 7.2383E+11 2/12/12oz NR 31.9 9.5 10.75 15.5 288 1 2 24 Case Aluminum
PRODUCT COX SUPPLIERX 53333 LAG CENSORED ALE 24/12 OZ NR 218 Inactive 70 1 7.2383E+11 24/12oz NR 31.9 9.5 10.75 15.5 288 1 1 24 Case Aluminum

The first thing you need is simply to load your data cleanly. I'm going to assume your input file is tab-separated, even though your code doesn't specify that. This code works for me:
from io import StringIO   # on Python 2 this was cStringIO

import pandas as pd

subfiles = [StringIO()]

# Split the original file into in-memory subfiles at each blank line
with open('t.txt') as bigfile:
    for line in bigfile:
        if line.strip() == "":          # blank line, new subfile
            subfiles.append(StringIO())
        else:                           # continuation of the same subfile
            subfiles[-1].write(line)

# Each subfile now holds exactly one table; rewind and parse it
for subfile in subfiles:
    subfile.seek(0)
    table = pd.read_csv(subfile, sep='\t')
    print('*****************')
    print(table)
Basically what I do is to break up the original file into subfiles by looking for blank lines. Once that's done, reading the chunks with Pandas is straightforward, so long as you specify the correct sep character.

This worked; I then used the slicer to create the tables:
df = pd.read_csv('filelocation.csv', delim_whitespace=True, names=range(25))
table_names=["WAREHOUSE","SUPPLIER","PRODUCT"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,1]: g.iloc[0] for k,g in df.groupby(groups)}
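For reference, here is a minimal sketch that takes the slicing one step further so each entry in tables is a proper DataFrame with its own header row. It assumes the file is tab-separated (as in the answer above), that every header row begins with "Record Identifier" (as in the sample), and the file name is a placeholder:
import pandas as pd

df = pd.read_csv('filelocation.csv', sep='\t', header=None, names=range(25))

# A new table starts wherever column 0 reads "Record Identifier" (a header row)
groups = df[0].eq('Record Identifier').cumsum()

tables = {}
for _, chunk in df.groupby(groups):
    header = chunk.iloc[0]                     # the "Record Identifier ..." row
    body = chunk.iloc[1:].reset_index(drop=True)
    body.columns = header                      # promote that row to column names
    body = body.dropna(axis=1, how='all')      # drop the unused padding columns
    tables[body['Record Identifier'].iloc[0]] = body   # e.g. tables['PRODUCT']
Grouping on the header rows rather than on the table names also means the dictionary still works when a table you did not list shows up in the file.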

Related

pandas - using a dataframe to find the mean by dropping duplicates

I have a list of sales data with a header that looks like this:
Product ID SN Age Gender Item ID Item Name Price
0 0 Lisim78 20 Male 108 Extraction, 3.53
1 1 Lisovynya38 40 Male 143 Frenzied Scimitar 1.56
2 2 Ithergue48 24 Male 92 Final Critic 4.88
3 3 Chamassasya86 24 Male 100 Blindscythe 3.27
4 4 Iskosia90 23 Male 131 Fury 1.44
There are obviously a number of sales items which are sold multiple times. I'm trying to get the mean of the sales price. Here's the code I created:
average_price = purchase_data_df.groupby('Item ID')[["Price"]].mean()
print(average_price)
But this seems to only give the mean per Item ID. How do I get the overall mean?
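A minimal sketch of the two likely readings (assuming the frame is called purchase_data_df, as above): the overall mean of every recorded sale, and the mean over distinct items after dropping duplicate Item IDs:
# Overall mean price across every recorded sale
overall_mean = purchase_data_df['Price'].mean()

# Mean price counting each distinct Item ID only once
unique_item_mean = purchase_data_df.drop_duplicates(subset='Item ID')['Price'].mean()

print(overall_mean, unique_item_mean)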

Data not showing in table form when using Jupyter Notebook

I ran the code below in a Jupyter Notebook. I was expecting the output to appear like an Excel table, but instead the output was split up and not in a table. How can I get it to show up in table format?
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("Robbery_2014_to_2019.csv")
print(df.head())
Output:
X Y Index_ event_unique_id occurrencedate \
0 -79.270393 43.807190 17430 GO-2015134200 2015-01-23T14:52:00.000Z
1 -79.488281 43.764091 19205 GO-20142956833 2014-09-21T23:30:00.000Z
2 -79.215836 43.761856 15831 GO-2015928336 2015-03-23T11:30:00.000Z
3 -79.436264 43.642963 16727 GO-20142711563 2014-08-15T22:00:00.000Z
4 -79.369461 43.654526 20091 GO-20142492469 2014-07-12T19:00:00.000Z
reporteddate premisetype ucr_code ucr_ext \
0 2015-01-23T14:57:00.000Z Outside 1610 210
1 2014-09-21T23:37:00.000Z Outside 1610 200
2 2015-06-03T15:08:00.000Z Other 1610 220
3 2014-08-16T00:09:00.000Z Apartment 1610 200
4 2014-07-14T01:35:00.000Z Apartment 1610 100
offence ... occurrencedayofyear occurrencedayofweek \
0 Robbery - Business ... 23.0 Friday
1 Robbery - Mugging ... 264.0 Sunday
2 Robbery - Other ... 82.0 Monday
3 Robbery - Mugging ... 227.0 Friday
4 Robbery With Weapon ... 193.0 Saturday
occurrencehour MCI Division Hood_ID Neighbourhood \
0 14 Robbery D42 129 Agincourt North (129)
1 23 Robbery D31 27 York University Heights (27)
2 11 Robbery D43 137 Woburn (137)
3 22 Robbery D11 86 Roncesvalles (86)
4 19 Robbery D51 73 Moss Park (73)
Long Lat ObjectId
0 -79.270393 43.807190 2001
1 -79.488281 43.764091 2002
2 -79.215836 43.761856 2003
3 -79.436264 43.642963 2004
4 -79.369461 43.654526 2005
[5 rows x 29 columns]
Use display(df.head()); it produces slightly nicer output than print().
print() renders any object as plain text, so a wide DataFrame gets wrapped across several lines, as in your output above.
display() (from IPython, available in Jupyter) renders the DataFrame as a rich HTML table instead.
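A minimal sketch of that in a notebook cell (assuming the same CSV as in the question):
import pandas as pd
from IPython.display import display

df = pd.read_csv("Robbery_2014_to_2019.csv")

# Renders an HTML table in Jupyter instead of wrapped plain text
display(df.head())

# Alternatively, make head() the last expression in the cell; Jupyter shows
# the last expression's rich representation automatically.
df.head()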

How to get segment and cluster using number of variables in R

I have a data frame which consists of the following data:
cus_id sex city state product_type var1 var2 type score
CA-1 Male ABC New York type-1 10 10 P-1 750
CA-2 Female ABC Alaska type-2 10 9.5 P-2 850
CA-3 Male Denver dfdfd type-3 10 11.1 P-3 560
CA-4 Female esx Nsdfe type-4 15 15 P-3 734
CA-5 Male dfr dfdedc type-5 13 12.9 P-3 798
CA-6 Male xds Nsdfe type-6 14.5 10.8 P-2 700
CA-7 Female edf New York type-5 14.2 14 P-2 550
CA-8 Female xde New York type-5 04 04 P-1 650
CA-9 Male wer New York type-1 10 11 P-1 730
Using the above data frame, I want to create segments from the variables sex, city, state and score for the independent parameters below.
product_type: static, from type-1 to type-7
type: static, from P-1 to P-3
score: ranges from 100 to 1000 and can be broken up per the segments identified for product_type and type
I want to identify the cluster where the difference between var2 and var1 is smallest in percentage terms. For example, for cus_id CA-1 the match is 100%, so we would have the 100% segment with the matching sex, city, state and score variables.
I don't know how to build the clusters using k-means; I need an approach and suggestions.

Same observations on data frame feature appear as independent

I have got the following DF:
carrier_name sol_carrier
aapt 702
aapt carrier 185
afrix 72
afr-ix 4
airtel 35
airtel 2
airtel dia and broadband 32
airtel mpls standard circuits 32
amt 6
anca test 1
appt 1
at tokyo 1
at&t 5041
att 2
batelco 723
batelco 2
batelco (manual) 4
beeline 1702
beeline - 01 6
beeline - 02 6
I need to get a unique list of carrier_name, so I have done some basic housekeeping: I only want to keep the names with no whitespace at the beginning or end of the observation. Here is the code:
carrier = pd.DataFrame(data['sol_carrier'].value_counts(dropna=False))
carrier['carrier_name'] = carrier.index
carrier['carrier_name'] = carrier['carrier_name'].str.strip()
carrier['carrier_name'] = carrier['carrier_name'].str.replace('[^a-zA-Z]', ' ')
carrier['carrier_name'] = np.where(carrier['carrier_name']==' ',np.NaN,carrier['carrier_name'])
carrier['carrier_name'] = carrier['carrier_name'].str.strip()
carrier = carrier.reset_index(drop=True)
carrier = carrier[['carrier_name','sol_carrier']]
carrier.sort_values(by='carrier_name')
What happens here is that I still get a list of carrier_name with some duplicate observations, like airtel or beeline, for example. I don't understand why this is happening: both observations are the same, there are no more whitespaces at the beginning or the end of the observation, and each observation is followed by its respective value_counts(), so there is no reason for them to be duplicated. Here is the same DF after the above code has been applied:
carrier_name sol_carrier
aapt 702
aapt carrier 185
afr ix 4
afrix 72
airtel 35
airtel 2
airtel dia and broadband 32
airtel mpls standard circuits 32
amt 6
anca test 1
appt 1
at t 5041
at tokyo 1
att 2
batelco 723
batelco 2
batelco manual 4
beeline 1702
beeline 6
beeline 6
That happens because you don't aggregate the results; you just change the values in the 'carrier_name' column.
To aggregate the results, call
carrier.groupby('carrier_name').sol_carrier.sum()
or modify the 'data' dataframe and then call
data['sol_carrier'].value_counts()
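A minimal sketch of the second option (assuming data holds the raw names in its sol_carrier column, as in the question; regex=True is required for the pattern replace on recent pandas):
import numpy as np

cleaned = (
    data['sol_carrier']
    .str.strip()
    .str.replace('[^a-zA-Z]', ' ', regex=True)   # drop digits and punctuation
    .str.strip()
    .replace('', np.nan)
)

# Identical names now fall into the same bucket, so no duplicates remain
print(cleaned.value_counts(dropna=False))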

How to perform groupby and mean on categorical columns in Pandas

I'm working on a dataset called gradedata.csv in Python Pandas, where I've created a new binned column called 'status': 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is a listing of the first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do I compute the mean hours of exercise of female students with a 'status' of 'Pass'?
I've used the below code, but it isn't working.
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to Pandas. Can anyone please help me solve this?
You are very close. Note that your groupby key must be one of mapping, function, label, or list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))
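Equivalently, if only that one number is needed, a boolean filter gives it directly (a sketch assuming the same column names as above):
mask = (df['gender'] == 'female') & (df['status'] == 'Pass')
print(df.loc[mask, 'exercise'].mean())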
