I have the following DataFrame:
carrier_name sol_carrier
aapt 702
aapt carrier 185
afrix 72
afr-ix 4
airtel 35
airtel 2
airtel dia and broadband 32
airtel mpls standard circuits 32
amt 6
anca test 1
appt 1
at tokyo 1
at&t 5041
att 2
batelco 723
batelco 2
batelco (manual) 4
beeline 1702
beeline - 01 6
beeline - 02 6
I need to get a unique list of carrier_name, so I have done some basic housekeeping, since I only want to keep names with no whitespace at the beginning or end of the observation, using the following code:
carrier = pd.DataFrame(data['sol_carrier'].value_counts(dropna=False))
carrier['carrier_name'] = carrier.index
carrier['carrier_name'] = carrier['carrier_name'].str.strip()
carrier['carrier_name'] = carrier['carrier_name'].str.replace('[^a-zA-Z]', ' ', regex=True)
carrier['carrier_name'] = np.where(carrier['carrier_name'] == ' ', np.NaN, carrier['carrier_name'])
carrier['carrier_name'] = carrier['carrier_name'].str.strip()
carrier = carrier.reset_index(drop=True)
carrier = carrier[['carrier_name','sol_carrier']]
carrier.sort_values(by='carrier_name')
What happens here is that I get a list of carrier_name but still have some duplicate observations, like airtel or beeline for example. I don't understand why this is happening, as both observations are the same, there are no more whitespaces at the beginning or end of the observation, and these observations are followed by their respective value_counts(), so there is no reason for them to be duplicated. Here is the same DF after the above code has been applied:
carrier_name sol_carrier
aapt 702
aapt carrier 185
afr ix 4
afrix 72
airtel 35
airtel 2
airtel dia and broadband 32
airtel mpls standard circuits 32
amt 6
anca test 1
appt 1
at t 5041
at tokyo 1
att 2
batelco 723
batelco 2
batelco manual 4
beeline 1702
beeline 6
beeline 6
That happens because you don't aggregate the results; you just change the values in the 'carrier_name' column.
To aggregate the results, call:
carrier.groupby('carrier_name').sol_carrier.sum()
or modify the 'data' dataframe and then call:
data['sol_carrier'].value_counts()
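For instance, a minimal sketch of that groupby aggregation on a hypothetical slice of the frame with the duplicates from the question:

```python
import pandas as pd

# Hypothetical frame reproducing the duplicate names from the question
carrier = pd.DataFrame({
    "carrier_name": ["airtel", "airtel", "beeline", "beeline"],
    "sol_carrier": [35, 2, 6, 6],
})

# Summing within each cleaned name collapses the duplicates
deduped = carrier.groupby("carrier_name", as_index=False)["sol_carrier"].sum()
print(deduped)  # airtel -> 37, beeline -> 12
```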
I am trying to extract from the URL string "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte..."
the entire make name, i.e. "Mercedes-Benz",
BUT my pattern only returns the first letter, i.e. "M".
Please help me come up with the correct pattern to use on a pandas df.
Thank you
CODE:
URLS_by_City['Make'] = URLS_by_City['Page'].str.extract('.+([A-Z])\w+(?=[\/])+', expand=True)
Clean_Make = URLS_by_City.dropna(subset=["Make"])
Clean_Make  # WENT FROM 5K rows --> to 2688 rows
Page City Pageviews Unique Pageviews Avg. Time on Page Entrances Bounce Rate % Exit **Make**
71 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Jose 310 149 00:00:27 149 2.00% 47.74% **B**
103 /used/Audi/2015-Audi-SQ5-286f67180a0e09a872992... Menlo Park 250 87 00:02:36 82 0.00% 32.40% **A**
158 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Francisco 202 98 00:00:18 98 2.04% 48.02% **B**
165 /used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cf... San Francisco 194 93 00:00:42 44 2.22% 29.38% **A**
168 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... (not set) 192 91 00:00:11 91 2.20% 47.40% **B**
... ... ... ... ... ... ... ... ... ...
4995 /used/Subaru/2019-Subaru-Crosstrek-5717b3040a0... Union City 10 3 00:02:02 0 0.00% 30.00% **S**
4996 /used/Tesla/2017-Tesla-Model+S-15605a190a0e087... San Jose 10 5 00:01:29 5 0.00% 50.00% **T**
4997 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Las Vegas 10 4 00:00:09 2 0.00% 40.00% **T**
4998 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Austin 10 4 00:03:29 2 0.00% 40.00% **T**
4999 /used/Tesla/2018-Tesla-Model+3-5f29cdc70a0e09a... Orinda 10 4 00:04:00 1 0.00% 0.00% **T**
TRIED:
`example_url = "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1"
pattern = ".+([a-zA-Z0-9()])\w+(?=[/])+"
wanted_make = URLS_by_City['Page'].str.extract(pattern)
wanted_make
`
0
0 r
1 r
2 NaN
3 NaN
4 r
... ...
4995 r
4996 l
4997 l
4998 l
4999 l
The pattern worked in an online regex tool,
but unfortunately not in my Jupyter notebook.
EXAMPLE PATTERNS - I bolded what should match:
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2021-Audi-RS+5-b92922bd0a0e09a91b4e6e9a29f63e8f.htm
/used/LEXUS/2018-LEXUS-GS+350-dffb145e0a0e09716bd5de4955662450.htm
/used/Porsche/2014-Porsche-Boxster-0423401a0a0e09a9358a179195e076a9.htm
/used/Audi/2014-Audi-A6-1792929d0a0e09b11bc7e218a1fa7563.htm
/used/Honda/2018-Honda-Civic-8e664dd50a0e0a9a43aacb6d1ab64d28.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/used-inventory/index.htm
/new-inventory/index.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/
I have tried completing your requirement in a Jupyter Notebook.
Please find the code and screenshots below:
I created a dummy pandas dataframe (data_df); below is a screenshot of it.
I created a pattern based on the structure of the string to be extracted:
pattern = r"^/used/(.*)/(?=20[0-9]{2})"
I used the pattern to extract the required data from the URLs and saved it in another column of the same dataframe:
data_df['Car Maker'] = data_df['urls'].str.extract(pattern)
Below is a screenshot of the output.
I hope this is helpful.
I would use:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'([^/]+)/\d{4}\b')
This targets the URL path segment immediately before the portion with the year. You could also try this version:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'/[^/]+/([^/]+)')
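A quick sketch of the first pattern on a few of the sample URLs from the question (rows without a year segment come back as NaN):

```python
import pandas as pd

pages = pd.Series([
    "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm",
    "/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm",
    "/new-inventory/index.htm",  # no "/<year>" segment -> NaN
])

# Capture the path segment immediately before a 4-digit year
make = pages.str.extract(r'([^/]+)/\d{4}\b')[0]
print(make.tolist())  # ['Mercedes-Benz', 'Audi', nan]
```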
The code below will give you the Model and VIN values:
pattern2 = r'^/used/[a-zA-Z\-]*/([0-9]{4}[a-zA-Z0-9\-+]*)-[a-z0-9]*\.htm'
pattern3 = r'^/used/[a-zA-Z\-]*/[0-9]{4}[a-zA-Z0-9\-+]*-([a-z0-9]*)\.htm'
data_df['Model'] = data_df['urls'].str.extract(pattern2)
data_df['VIN'] = data_df['urls'].str.extract(pattern3)
Here is a screenshot of the output:
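Since the screenshot is not reproduced here, a sketch of those two patterns (raw strings, dot escaped) on two sample URLs taken from the question:

```python
import pandas as pd

urls = pd.Series([
    "/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm",
    "/used/Honda/2018-Honda-Civic-8e664dd50a0e0a9a43aacb6d1ab64d28.htm",
])

# Group 1 of each pattern captures the model string and the trailing hash
model = urls.str.extract(r'^/used/[a-zA-Z\-]*/([0-9]{4}[a-zA-Z0-9\-+]*)-[a-z0-9]*\.htm')[0]
vin = urls.str.extract(r'^/used/[a-zA-Z\-]*/[0-9]{4}[a-zA-Z0-9\-+]*-([a-z0-9]*)\.htm')[0]
print(model.tolist())  # ['2020-Audi-S8', '2018-Honda-Civic']
```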
So I have this df (table) coming from a PDF transformation, like this example:
    ElementRow  ElementColumn  ElementPage         ElementText    X1    Y1    X2    Y2
1           50              0            1  Emergency Contacts   917  8793  2191  8878
2           51              0            1             Contact  1093  1320  1451  1388
3           51              2            1        Relationship  2444  1320  3026  1388
4           51              7            1          Work Phone  3329  1320  3898  1388
5           51              9            1          Home Phone  4260  1320  4857  1388
6           51             10            1          Cell Phone  5176  1320  5684  1388
7           51             12            1      Priority Phone  6143  1320  6495  1388
8           51             14            1     Contact Address  6542  1320  7300  1388
9           51             17            1                City  7939  1320  7300  1388
10          51             18            1               State  8808  1320  8137  1388
11          51             21            1                 Zip  9134  1320  9294  1388
12          52              0            1        Silvia Smith  1093  1458  1973  1526
13          52              2            1              Mother  2444  1458  2783  1526
13          52              7            1     (123) 456-78910  5176  1458  4979  1526
14          52             10            1              Austin  7939  1458  8406  1526
15          52             15            1               Texas  8808  1458  8961  1526
16          52             20            1               76063  9134  1458  9421  1526
17          52              2            1    1234 Parkside Ct  6542  1458  9421  1526
18          53              0            1         Naomi Smith  1093  2350  1973  1526
19          53              2            1                Aunt  2444  2350  2783  1526
20          53              7            1     (123) 456-78910  5176  2350  4979  1526
21          53             10            1              Austin  7939  2350  8406  1526
22          53             15            1               Texas  8808  2350  8961  1526
23          53             20            1               76063  9134  2350  9421  1526
24          53              2            1    3456 Parkside Ct  6542  2350  9421  1526
25          54             40            1   End Employee Line  6542  2350  9421  1526
25          55              0            1  Emergency Contacts   917  8793  2350  8878
I'm trying to separate each record into its own row, taking the ElementRow column as a reference, keep the headers from the first rows, and then iterate through the remaining rows. The X1 column indicates which header each value belongs under. I would like the data laid out like this:
   Contact       Relationship  Work Phone  Cell Phone       Priority  ContactAddress    City    State  Zip
1  Silvia Smith  Mother                    (123) 456-78910            1234 Parkside Ct  Austin  Texas  76063
2  Naomi Smith   Aunt                      (123) 456-78910            3456 Parkside Ct  Austin  Texas  76063
Things I tried:
Taking the rows between the two markers. I tried to slice using the first index and the last index, but got this error:
emergStartIndex = df.index[df['ElementText'] == 'Emergency Contacts']
emergLastIndex = df.index[df['ElementText'] == 'End Employee Line']
emerRows_between = df.iloc[emergStartIndex:emergLastIndex]
TypeError: cannot do positional indexing on RangeIndex with these indexers [Int64Index([...
Slicing does work with this NumPy trick:
emerRows_between = df.iloc[np.r_[1:54,55:107]]
emerRows_between
but when I tried to replace the literal indexes with the index variables, I got this:
emerRows_between = df.iloc[np.r_[emergStartIndex:emergLastIndex]]
emerRows_between
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I also tried iterating row by row like this, but at some point the df reaches the end and I receive an index-out-of-bounds error:
emergencyContactRow1 = df[['ElementText','X1']].iloc[emergStartIndex+1].reset_index(drop=True)
emergencyContactRow2 = df[['ElementText','X1']].iloc[emergStartIndex+2].reset_index(drop=True)
emergencyContactRow3 = df[['ElementText','X1']].iloc[emergStartIndex+3].reset_index(drop=True)
emergencyContactRow4 = df[['ElementText','X1']].iloc[emergStartIndex+4].reset_index(drop=True)
emergencyContactRow5 = df[['ElementText','X1']].iloc[emergStartIndex+5].reset_index(drop=True)
emergencyContactRow6 = df[['ElementText','X1']].iloc[emergStartIndex+6].reset_index(drop=True)
emergencyContactRow7 = df[['ElementText','X1']].iloc[emergStartIndex+7].reset_index(drop=True)
emergencyContactRow8 = df[['ElementText','X1']].iloc[emergStartIndex+8].reset_index(drop=True)
emergencyContactRow9 = df[['ElementText','X1']].iloc[emergStartIndex+9].reset_index(drop=True)
emergencyContactRow10 = df[['ElementText','X1']].iloc[emergStartIndex+10].reset_index(drop=True)
frameEmergContact1 = [emergencyContactRow1, emergencyContactRow2, emergencyContactRow3, emergencyContactRow4, emergencyContactRow5, emergencyContactRow6, emergencyContactRow7, emergencyContactRow8, emergencyContactRow9, emergencyContactRow10]
df_emergContact1 = pd.concat(frameEmergContact1, axis=1)
df_emergContact1.columns = range(df_emergContact1.shape[1])
So how can I make this code dynamic, or avoid the index-out-of-bounds errors, while keeping my headers and taking as a reference only the first row after the Emergency Contacts row? I know I haven't used the X1 column yet, but I first have to resolve how to iterate through those multiple indexes.
Each stretch from an Emergency Contacts index to an End Employee Line index belongs to one person (one employee) in the whole dataframe, so the idea, after capturing all those values, is to also keep a counter of how many times data is captured between those two indexes.
It's a bit ugly, but this should do it. Basically you don't need the first row or the last two rows, so if you get rid of those and then pivot the X1 and ElementText columns you will be pretty close. Then it's a matter of getting rid of null values and promoting the first row to the header.
df = df.iloc[1:-2][['ElementText','X1','ElementRow']].pivot(columns='X1', values='ElementText')
df = pd.DataFrame([x[~pd.isnull(x)] for x in df.values.T]).T
df.columns = df.iloc[0]
df = df[1:]
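On a trimmed-down stand-in frame (hypothetical values, keeping only the columns the snippet uses), the pivot-then-promote approach looks like this:

```python
import pandas as pd

# Hypothetical miniature of the parsed PDF frame
df = pd.DataFrame({
    "ElementText": ["Emergency Contacts", "Contact", "City",
                    "Silvia Smith", "Austin", "End Employee Line", "Emergency Contacts"],
    "X1": [917, 1093, 7939, 1093, 7939, 6542, 917],
    "ElementRow": [50, 51, 51, 52, 52, 54, 55],
})

# Drop the first row and last two, then spread values across X1 positions
out = df.iloc[1:-2][["ElementText", "X1", "ElementRow"]].pivot(columns="X1", values="ElementText")
# Compact away the NaNs column by column, then promote the header row
out = pd.DataFrame([x[~pd.isnull(x)] for x in out.values.T]).T
out.columns = out.iloc[0]
out = out[1:]
print(out)  # one row: Contact='Silvia Smith', City='Austin'
```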
1. Split the dataframe into chunks wherever "Emergency Contacts" appears in the "ElementText" column.
2. Parse each chunk into the required format.
3. Append it to the output.
import numpy as np
import pandas as pd

list_of_df = np.array_split(data, data[data["ElementText"] == "Emergency Contacts"].index)
output = pd.DataFrame()
for frame in list_of_df:
    df = frame[~frame["ElementText"].isin(["Emergency Contacts", "End Employee Line"])].dropna()
    if df.shape[0] > 0:
        temp = pd.DataFrame(df.groupby("X1")["ElementText"].apply(list).tolist()).T
        temp.columns = temp.iloc[0]
        temp = temp.drop(0)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        output = pd.concat([output, temp], ignore_index=True)
>>> output
0 Contact Relationship Work Phone ... City State Zip
0 Silvia Smith Mother None ... Austin Texas 76063
1 Naomi Smith Aunt None ... Austin Texas 76063
I am trying to run a hypothesis test using an OLS model, on tweet count across the four groups in my data frame: Athletes, CEOs, Politicians, and Celebrities. Each name is labeled with its group in a single column.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

frames = [CEO_df, athletes_df, Celebrity_df, politicians_df]
final_df = pd.concat(frames)
final_df = final_df.reindex(columns=["name", "group", "tweet_count", "retweet_count", "favorite_count"])
final_df

model = ols("tweet_count ~ C(group)", data=final_df).fit()
table = sm.stats.anova_lm(model, typ=2)
print(table)
I want to do something along the lines of:
model=ols("tweet_count ~ C(Athlete) + C(Celebrity) + C(CEO) + C(Politicians)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
Is that even possible? How else will I be able to run a hypothesis test with those conditions?
Here is my printed final_df:
name group tweet_count retweet_count favorite_count
0 #aws_cloud # #ReInvent R “Ray” Wang 王瑞光 #1A CEO 6 6 0
1 Aaron Levie CEO 48 1140 18624
2 Andrew Mason CEO 24 0 0
3 Bill Gates CEO 114 78204 439020
4 Bill Gross CEO 36 486 1668
... ... ... ... ... ...
56 Tim Kaine Politician 48 8346 50898
57 Tim O'Reilly Politician 14 28 0
58 Trey Gowdy Politician 12 1314 6780
59 Vice President Mike Pence Politician 84 1146408 0
60 klay thompson Politician 48 41676 309924
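As a sketch of the separate-indicators idea on a hypothetical mini frame: pd.get_dummies builds one indicator column per group level, which is essentially the expansion C(group) performs internally.

```python
import pandas as pd

# Hypothetical stand-in for final_df
final_df = pd.DataFrame({
    "name": ["Aaron Levie", "Tim Kaine", "Bill Gates"],
    "group": ["CEO", "Politician", "CEO"],
    "tweet_count": [48, 48, 114],
})

# One indicator column per group level
dummies = pd.get_dummies(final_df["group"], prefix="is")
final_df = final_df.join(dummies)
print(final_df.columns.tolist())
```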
I ran the code below in a Jupyter Notebook. I was expecting the output to appear like an Excel table, but instead the output was split up and not in a table. How can I get it to show up in table format?
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("Robbery_2014_to_2019.csv")
print(df.head())
Output:
X Y Index_ event_unique_id occurrencedate \
0 -79.270393 43.807190 17430 GO-2015134200 2015-01-23T14:52:00.000Z
1 -79.488281 43.764091 19205 GO-20142956833 2014-09-21T23:30:00.000Z
2 -79.215836 43.761856 15831 GO-2015928336 2015-03-23T11:30:00.000Z
3 -79.436264 43.642963 16727 GO-20142711563 2014-08-15T22:00:00.000Z
4 -79.369461 43.654526 20091 GO-20142492469 2014-07-12T19:00:00.000Z
reporteddate premisetype ucr_code ucr_ext \
0 2015-01-23T14:57:00.000Z Outside 1610 210
1 2014-09-21T23:37:00.000Z Outside 1610 200
2 2015-06-03T15:08:00.000Z Other 1610 220
3 2014-08-16T00:09:00.000Z Apartment 1610 200
4 2014-07-14T01:35:00.000Z Apartment 1610 100
offence ... occurrencedayofyear occurrencedayofweek \
0 Robbery - Business ... 23.0 Friday
1 Robbery - Mugging ... 264.0 Sunday
2 Robbery - Other ... 82.0 Monday
3 Robbery - Mugging ... 227.0 Friday
4 Robbery With Weapon ... 193.0 Saturday
occurrencehour MCI Division Hood_ID Neighbourhood \
0 14 Robbery D42 129 Agincourt North (129)
1 23 Robbery D31 27 York University Heights (27)
2 11 Robbery D43 137 Woburn (137)
3 22 Robbery D11 86 Roncesvalles (86)
4 19 Robbery D51 73 Moss Park (73)
Long Lat ObjectId
0 -79.270393 43.807190 2001
1 -79.488281 43.764091 2002
2 -79.215836 43.761856 2003
3 -79.436264 43.642963 2004
4 -79.369461 43.654526 2005
[5 rows x 29 columns]
Use display(df.head()); it produces slightly nicer output than calling it without display().
The print function renders any kind of information, like a string or an estimated value, as plain text,
whereas display() will render the dataset as a formatted HTML table in the notebook.
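A minimal sketch of the difference; to_html() produces the markup that the notebook's rich display renders:

```python
import pandas as pd

df = pd.DataFrame({"X": [-79.270393, -79.488281], "Y": [43.807190, 43.764091]})

# print() emits the plain-text repr, which wraps wide frames across lines
print(df.head())

# display(df.head()) in a Jupyter cell renders the HTML table repr instead;
# outside a notebook you can inspect that HTML directly:
html = df.head().to_html()
```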
I'm pulling in the data frame using tabula. Unfortunately, the data is arranged in rows as below. I need to take the first 23 rows and use them as column headers for the remainder of the data. I need each row to contain these 23 headers for each of about 60 clinics.
Col \
0 Date
1 Clinic
2 Location
3 Clinic Manager
4 Lease Cost
5 Square Footage
6 Lease Expiration
8 Care Provided
9 # of Providers (Full Time)
10 # FTE's Providing Care
11 # Providers (Part-Time)
12 Patients seen per week
13 Number of patients in rooms per provider
14 Number of patients in waiting room
15 # Exam Rooms
16 Procedure rooms
17 Other rooms
18 Specify other
20 Other data:
21 TI Needs:
23 Conclusion & Recommendation
24 Date
25 Clinic
26 Location
27 Clinic Manager
28 Lease Cost
29 Square Footage
30 Lease Expiration
32 Care Provided
33 # of Providers (Full Time)
34 # FTE's Providing Care
35 # Providers (Part-Time)
36 Patients seen per week
37 Number of patients in rooms per provider
38 Number of patients in waiting room
39 # Exam Rooms
40 Procedure rooms
41 Other rooms
42 Specify other
44 Other data:
45 TI Needs:
47 Conclusion & Recommendation
Val
0 9/13/2017
1 Gray Medical Center
2 1234 E. 164th Ave Thornton CA 12345
3 Jane Doe
4 $23,074.80 Rent, $5,392.88 CAM
5 9,840
6 7/31/2023
8 Family Medicine
9 12
10 14
11 1
12 750
13 4
14 2
15 31
16 1
17 X-Ray, Phlebotomist/blood draw
18 NaN
20 Facilities assistance needed. 50% of business...
21 Paint and Carpet (flooring is in good conditio...
23 Lay out and occupancy flow are good for this p...
24 9/13/2017
25 Main Cardiology
26 12000 Wall St Suite 13 Main CA 12345
27 John Doe
28 $9610.42 Rent, $2,937.33 CAM
29 4,406
30 5/31/2024
32 Cardiology
33 2
34 11, 2 - P.T.
35 2
36 188
37 0
38 2
39 6
40 0
41 1 - Pacemaker, 1 - Treadmill, 1- Echo, 1 - Ech...
42 Nurse Office, MA station, Reading Room, 2 Phys...
44 Occupied in Emerus building. Needs facilities ...
45 New build out, great condition.
47 Practice recently relocated from 84th and Alco...
I was able to get my data frame into a better place by fixing the headers. I'm re-posting the first "groups" of data to better illustrate the structure of the data frame. Everything repeats (headers and values) for each clinic.
Try this:
df2 = pd.DataFrame(df[23:].values.reshape(-1, 23),
columns=df[:23][0])
print(df2)
Here 23 is the number of columns in each row of the resulting df; replace it with the number of columns you want.
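A toy version of the reshape, with 3 header rows instead of 23 (hypothetical values shortened from the question's data):

```python
import pandas as pd

# Single-column frame: 3 header rows followed by two records of 3 values each
df = pd.DataFrame({0: ["Date", "Clinic", "Location",
                       "9/13/2017", "Gray Medical Center", "1234 E. 164th Ave",
                       "9/13/2017", "Main Cardiology", "12000 Wall St"]})

# Reshape the value rows into records; the first 3 rows become the headers
df2 = pd.DataFrame(df[3:].values.reshape(-1, 3), columns=df[:3][0])
print(df2)
```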