Parsing Multiple Text Fields Using Regex and Compiling into Pandas DataFrame

I am attempting to parse a text file using python and regex to construct a specific pandas data frame. Below is a sample from the text file I am parsing and the ideal pandas DataFrame I am seeking.
Sample Text
Washington, DC November 27, 2019
USDA Truck Rate Report
WA_FV190
FIRST PRICE RANGE FOR WEEK OF NOVEMBER 20-26 2019
SECOND PRICE MOSTLY FOR TUESDAY NOVEMBER 26 2019
PERCENTAGE OF CHANGE FROM TUESDAY NOVEMBER 19 2019 SHOWN IN ().
In areas where rates are based on package rates, per-load rates were
derived by multiplying the package rate by the number of packages in
the most usual load in a 48-53 foot trailer.
CENTRAL AND WESTERN ARIZONA
-- LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LEAF LETTUCE SLIGHT SHORTAGE
--
ATLANTA 5100 5500
BALTIMORE 6300 6600
BOSTON 7000 7300
CHICAGO 4500 4900
DALLAS 3400 3800
MIAMI 6400 6700
NEW YORK 6600 6900
PHILADELPHIA 6400 6700
2019 2018
NOV 17-23 NOV 18-24
U.S. 25,701 22,956
IMPORTS 13,653 15,699
------------ --------------
sum 39,354 38,655
The ideal output should look something like:
Region CommodityGroup InboundCity Low High
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC ATLANTA 5100 5500
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BALTIMORE 6300 6600
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BOSTON 7000 7300
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC CHICAGO 4500 4900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC DALLAS 3400 3800
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC MIAMI 6400 6700
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC NEW YORK 6600 6900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC PHILADELPHIA 6400 6700
With my limited understanding of regular expressions, the closest I have come to isolating the desired text is the pattern saved in my regex tester for the USDA data.
I have been trying to replicate the solution from "How to parse complex text files using Python?" where applicable, but my regex experience is severely lacking. Any help you can provide will be greatly appreciated!

I came up with this regex (txt is your text from the question):
import re
import numpy as np
import pandas as pd
data = {'Region': [], 'CommodityGroup': [], 'InboundCity': [], 'Low': [], 'High': []}
for region, commodity_group, values in re.findall(r'([A-Z ]+)\n--(.*?)--\n(.*?)\n\n', txt, flags=re.S|re.M):
    for val in values.strip().splitlines():
        val = re.sub(r'(\d)\s{8,}.*', r'\1', val)
        inbound_city, low, high = re.findall(r'([A-Z ]+)\s*(\d*)\s+(\d+)', val)[0]
        data['Region'].append(region)
        data['CommodityGroup'].append(commodity_group)
        data['InboundCity'].append(inbound_city)
        data['Low'].append(np.nan if low == '' else int(low))
        data['High'].append(int(high))
df = pd.DataFrame(data)
print(df)
Prints:
Region CommodityGroup InboundCity Low High
0 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... ATLANTA 5100 5500
1 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... BALTIMORE 6300 6600
2 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... BOSTON 7000 7300
3 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... CHICAGO 4500 4900
4 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... DALLAS 3400 3800
5 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... MIAMI 6400 6700
6 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... NEW YORK 6600 6900
7 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... PHILADELPHIA 6400 6700
EDIT: This should now work even for the big document from your regex101 link.
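The two-stage idea in the answer above (one pattern to find each region block, a second pattern to split each city line) can be tried in isolation with only the standard library. This is a minimal sketch, not the answer's exact code: the sample string is a simplified, hypothetical excerpt that assumes the commodity line is wrapped in "--" on a single line, and the names block_re, row_re and rows are made up for the illustration.

```python
import re

# Hypothetical, simplified excerpt of the report (assumed layout).
sample = (
    "CENTRAL AND WESTERN ARIZONA\n"
    "--LETTUCE, BROCCOLI--\n"
    "ATLANTA 5100 5500\n"
    "NEW YORK 6600 6900\n"
    "\n"
)

# Stage 1: region line, commodity line between "--" markers, then the
# block of city lines up to the first blank line.
block_re = re.compile(r"([A-Z ]+)\n--(.*?)--\n(.*?)\n\n", flags=re.S)

# Stage 2: a city name (may contain spaces) followed by two integers.
row_re = re.compile(r"([A-Z ]+?)\s+(\d+)\s+(\d+)$")

rows = []
for region, commodity, body in block_re.findall(sample):
    for line in body.strip().splitlines():
        m = row_re.match(line.strip())
        if m:  # skip lines that are not "CITY low high"
            city, low, high = m.groups()
            rows.append((region.strip(), commodity.strip(), city, int(low), int(high)))

for row in rows:
    print(row)
```

A list of tuples like rows can be handed to pd.DataFrame together with a columns list to get the table the question asks for.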

Related

Trying to aggregate a query on data that has already been aggregated - not sure the best approach to use

I am reading data from an Excel spreadsheet, and I am able to narrow down the results to a specific date range using the method below. It returns only the rows matching the date criterion, with all three columns: "Date of inquiry Receipt", "Office" and "LocationType". What I also want is to count how many times each unique value occurs in the resulting "Office" column. For example, I need to find out that for all rows dated on or before 2021-04-04 there are the following counts: Central = 18, Central West = 12, East = 5, South = 3
If I were using good old-fashioned SQL, I could use a single command that would be something like:
"SELECT 'Office' FROM 2021_AutoReport.xlsx WHERE 'Date of inquiry Receipt' <= '2021-04-04', JOIN OUTER for SUM(Central), SUM(Central West), SUM(South), SUM(East)" ... I'm not a SQL pro, but hopefully you understand what I am trying to do and can advise how to do it with DataFrame queries.
Thanks so much for your help!
Example of what I have so far; I just need to know the right approach to answer my question:
import pandas as pd
df = pd.read_excel("2021_AutoReport.xlsx")
myfilteredInfo = df[df['Date of inquiry Receipt'] <= '2021-04-04']
print(myfilteredInfo)
Result:
Date of inquiry Receipt Office LocationType
2 2021-01-04 Central Laboratory
3 2021-02-23 Central Farm
4 2021-02-24 Central Laboratory
5 2021-02-24 Central Laboratory
6 2021-02-24 Central Laboratory
7 2021-02-26 Central West SalesOffice
8 2021-03-02 Central Laboratory
9 2021-03-03 Central West Other
10 2021-03-03 Central West SalesOffice
11 2021-03-04 Central Laboratory
12 2021-03-04 Central Laboratory
13 2021-03-08 Central Laboratory
14 2021-03-08 South Other
15 2021-03-09 Central West Laboratory
16 2021-03-11 Central Laboratory
17 2021-03-11 Central West Other
18 2021-03-16 East Laboratory
19 2021-03-16 East Laboratory
20 2021-03-19 Central West Other
21 2021-03-19 Central West Laboratory
22 2021-03-20 East Laboratory
23 2021-03-22 Central Laboratory
24 2021-03-22 East Laboratory
25 2021-03-23 Central Other
26 2021-03-24 Central Laboratory
27 2021-03-24 Central West Laboratory
28 2021-03-25 Central Other
29 2021-03-25 Central West Other
30 2021-03-25 Central Laboratory
31 2021-03-26 South Laboratory
32 2021-03-30 Central Other
33 2021-03-31 Central West Laboratory
34 2021-04-01 South Other
35 2021-04-01 Central West SalesOffice
36 2021-04-01 Central Laboratory
37 2021-04-01 East SalesOffice
38 2021-04-01 Central Laboratory
39 2021-04-01 Central West Laboratory

Use the value_counts method on the column you need (Office):
myfilteredInfo['Office'].value_counts()
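As a rough illustration of the filter-then-count approach, here is a small self-contained sketch. The five rows are made up for the example (they only mimic the shape of the question's spreadsheet), and the names cutoff, filtered, office_counts and same_counts are assumptions, not part of the original code.

```python
import pandas as pd

# Hypothetical rows shaped like the question's spreadsheet.
df = pd.DataFrame({
    "Date of inquiry Receipt": pd.to_datetime(
        ["2021-03-16", "2021-03-19", "2021-04-01", "2021-04-01", "2021-04-20"]
    ),
    "Office": ["East", "Central West", "Central", "East", "Central"],
})

# Keep only rows on or before the cutoff date.
cutoff = pd.Timestamp("2021-04-04")
filtered = df[df["Date of inquiry Receipt"] <= cutoff]

# Count occurrences of each office; roughly SQL's GROUP BY ... COUNT(*).
office_counts = filtered["Office"].value_counts()

# The same numbers via groupby, sorted by office name instead of by count.
same_counts = filtered.groupby("Office").size()

print(office_counts)
```

value_counts sorts by frequency, while groupby(...).size() sorts by the group key, so pick whichever ordering suits the report.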

... calculate the total number of each unique value existing in the resulting "office" column ...
This would do the job (counting the rows for each office in the filtered frame):
>>> myfilteredInfo.groupby('Office').agg(n_rows=('LocationType', 'count'))

write a python function using ```def``` from 3 pre-existing columns in a dataframe; columns 1 and 2 as inputs = column 3 as output

My dataframe looks like this. 3 columns. All I want to do is write a FUNCTION that, when the first two columns are inputs, the corresponding third column (GHG intensity) is the output. I want to be able to input any property name and year and achieve the corresponding GHG intensity value. I cannot stress enough that this has to be written as a function using def. Please help!
       Property Name                        Data Year  GHG Intensity (kg CO2e/sq ft)
467    GALLERY 37                           2018        7.50
477    Navy Pier, Inc.                      2016       22.50
1057   GALLERY 37                           2015        8.30
1491   Navy Pier, Inc.                      2015       23.30
1576   GALLERY 37                           2016        7.40
2469   The Chicago Theatre                  2016        4.50
3581   Navy Pier, Inc.                      2014       17.68
4060   Ida Noyes Hall                       2015       11.20
4231   Chicago Cultural Center              2015       13.70
4501   GALLERY 37                           2017        7.90
5303   Harpo Studios                        2015       18.70
5450   The Chicago Theatre                  2015         NaN
5556   Chicago Cultural Center              2016       10.30
6275   MARTIN LUTHER KING COMMUNITY CENTER  2015       14.10
6409   MARTIN LUTHER KING COMMUNITY CENTER  2018       12.70
6665   Ida Noyes Hall                       2017        8.30
7621   Ida Noyes Hall                       2018        8.40
7668   MARTIN LUTHER KING COMMUNITY CENTER  2017       12.10
7792   The Chicago Theatre                  2018        4.40
7819   Ida Noyes Hall                       2016       10.20
8664   MARTIN LUTHER KING COMMUNITY CENTER  2016       12.90
8701   The Chicago Theatre                  2017        4.40
9575   Chicago Cultural Center              2017        9.30
10066  Chicago Cultural Center              2018        7.50

Here is an example, with a different data frame to test:
import pandas as pd
df = pd.DataFrame(data={'x': [3, 5], 'y': [4, 12]})
def func(df, arg1, arg2, arg3):
    ''' arg1 and arg2 are input columns; arg3 is output column.'''
    df = df.copy()
    df[arg3] = df[arg1] ** 2 + df[arg2] ** 2
    return df
Results are:
print(func(df, 'x', 'y', 'z'))
x y z
0 3 4 25
1 5 12 169

You can try this code
def GHG_Intensity(PropertyName, Year):
    Intensity = df[(df['Property Name']==PropertyName) & (df['Data Year']==Year)]['GHG Intensity (kg CO2e/sq ft)'].to_list()
    return Intensity[0] if len(Intensity) else 'GHG Intensity Not Available'
print(GHG_Intensity('Navy Pier, Inc.', 2016))
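A variant of the lookup above, sketched under a few assumptions: the frame is passed in as an argument rather than read from a global, a missing or NaN value yields None instead of a message string, and only three rows from the question's table are reproduced. The names COL and ghg_intensity are made up for this illustration.

```python
import pandas as pd

COL = "GHG Intensity (kg CO2e/sq ft)"

# Three rows copied from the question's table.
df = pd.DataFrame({
    "Property Name": ["GALLERY 37", "Navy Pier, Inc.", "The Chicago Theatre"],
    "Data Year": [2018, 2016, 2015],
    COL: [7.50, 22.50, float("nan")],
})

def ghg_intensity(frame, property_name, year):
    """Return the GHG intensity for one property and year, or None if absent."""
    mask = (frame["Property Name"] == property_name) & (frame["Data Year"] == year)
    values = frame.loc[mask, COL].dropna()  # ignore rows where the value is NaN
    return float(values.iloc[0]) if not values.empty else None

print(ghg_intensity(df, "Navy Pier, Inc.", 2016))
```

Returning None keeps the return type numeric-or-nothing, which may be easier to handle downstream than mixing floats with a text message.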

Creating new rows from single cell strings in pandas dataframe

I have a pandas dataframe with output scraped directly from a USDA text file. Below is an example of the dataframe:
Date Region CommodityGroup InboundCity Low High
1/2/2019 Mexico Crossings Beans,Cucumbers,Eggplant,Melons Atlanta 4500 4700
1/2/2019 Eastern North Carolina Apples and Pears Baltimore 7000 8000
1/2/2019 Michigan Apples Boston 3800 4000
I am looking for a programmatic solution that splits the cells in the "CommodityGroup" column holding several commodities (separated by commas or "and" in the table above), creates a new row for each commodity, and duplicates the remaining column data in every new row. Desired example output:
Date Region CommodityGroup InboundCity Low High
1/2/2019 Mexico Crossings Beans Atlanta 4500 4700
1/2/2019 Mexico Crossings Cucumbers Atlanta 4500 4700
1/2/2019 Mexico Crossings Eggplant Atlanta 4500 4700
1/2/2019 Mexico Crossings Melons Atlanta 4500 4700
1/2/2019 Eastern North Carolina Apples Baltimore 7000 8000
1/2/2019 Eastern North Carolina Pears Baltimore 7000 8000
1/2/2019 Michigan Apples Boston 3800 4000
Any guidance you can provide in this pursuit will be greatly appreciated!

Use .str.split to split the column with a pattern ',| and ', which is ',' or ' and '. '|' is OR.
Use .explode to separate list elements into separate rows
Optionally, set ignore_index=True where the resulting index will be labeled 0, 1, …, n - 1, depending on your needs.
import pandas as pd
# data
data = {'Date': ['1/2/2019', '1/2/2019', '1/2/2019'],
        'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
        'CommodityGroup': ['Beans,Cucumbers,Eggplant,Melons', 'Apples and Pears', 'Apples'],
        'InboundCity': ['Atlanta', 'Baltimore', 'Boston'],
        'Low': [4500, 7000, 3800],
        'High': [4700, 8000, 4000]}
# create the dataframe
df = pd.DataFrame(data)
# split the CommodityGroup strings
df.CommodityGroup = df.CommodityGroup.str.split(',| and ')
# explode the CommodityGroup lists
df = df.explode('CommodityGroup')
# final
Date Region CommodityGroup InboundCity Low High
0 1/2/2019 Mexico Crossings Beans Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Cucumbers Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Eggplant Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Melons Atlanta 4500 4700
1 1/2/2019 Eastern North Carolina Apples Baltimore 7000 8000
1 1/2/2019 Eastern North Carolina Pears Baltimore 7000 8000
2 1/2/2019 Michigan Apples Boston 3800 4000
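If the scraped strings have inconsistent spacing around the separators, a slightly more tolerant pattern may help. The sketch below is an assumption-laden variant of the steps above: the first CommodityGroup value has a space added after one comma purely to show the effect, only two columns are kept for brevity, and the names pattern, parts and out are made up. It also uses ignore_index=True, which needs pandas 1.1 or later.

```python
import pandas as pd

# Reduced, slightly modified sample (note the space after the first comma).
df = pd.DataFrame({
    "Region": ["Mexico Crossings", "Eastern North Carolina", "Michigan"],
    "CommodityGroup": ["Beans, Cucumbers,Eggplant", "Apples and Pears", "Apples"],
})

# Split on a comma or on the word "and", swallowing surrounding whitespace.
pattern = r"\s*,\s*|\s+and\s+"
parts = df["CommodityGroup"].str.split(pattern)

# One row per commodity, with a fresh 0..n-1 index.
out = df.assign(CommodityGroup=parts).explode("CommodityGroup", ignore_index=True)
print(out)
```

Using assign leaves the original df untouched, which can be handy while experimenting with different split patterns.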

You can try this:
df = (df.set_index(['Date', 'Region', 'InboundCity', 'Low', 'High'])
        .apply(lambda x: x.str.split(',| and ').explode())
        .reset_index())
print(df)
Date Region InboundCity Low High CommodityGroup
0 1/2/2019 Mexico Crossings Atlanta 4500 4700 Beans
1 1/2/2019 Mexico Crossings Atlanta 4500 4700 Cucumbers
2 1/2/2019 Mexico Crossings Atlanta 4500 4700 Eggplant
3 1/2/2019 Mexico Crossings Atlanta 4500 4700 Melons
4 1/2/2019 Eastern North Carolina Baltimore 7000 8000 Apples
5 1/2/2019 Eastern North Carolina Baltimore 7000 8000 Pears
6 1/2/2019 Michigan Boston 3800 4000 Apples

Not able to extract complete city list

I am using the following code to extract the list of cities mentioned on this page, but it gives me just the first 23 cities.
Can't figure out where I am going wrong!
import requests, bs4
res = requests.get('http://www.citymayors.com/statistics/largest-cities-population-125.html')
text = bs4.BeautifulSoup(res.text, "lxml")
fields = text.select('td[bgcolor="silver"] > font[size="-2"] > b')
print(len(fields))
for field in fields:
    print(field.getText())
This is the output I am getting:
23
Tokyo/Yokohama
New York Metro
Sao Paulo
Seoul/Incheon
Mexico City
Osaka/Kobe/Kyoto
Manila
Mumbai
Delhi
Jakarta
Lagos
Kolkata
Cairo
Los Angeles
Buenos Aires
Rio de Janeiro
Moscow
Shanghai
Karachi
Paris
Istanbul
Nagoya
Beijing
But this webpage contains 125 cities.

lxml works fine for me: I get 124 cities using your own code, so the parser is not the problem. You are either using an old version of bs4 or hitting an encoding issue. Pass res.content (the raw bytes) so that BeautifulSoup detects the encoding itself instead of relying on the decoding requests applies for res.text. Your selector also misses one city; to get all 125:
import requests, bs4
res = requests.get('http://www.citymayors.com/statistics/largest-cities-population-125.html')
text = bs4.BeautifulSoup(res.content,"lxml")
rows = [row.select_one("td + td") for row in text.select("table tr + tr")]
print(len(rows))
for row in rows:
    print(row.get_text())
If we run it, you can see we get all the cities:
In [1]: import requests,bs4
In [2]: res = requests.get('http://www.citymayors.com/statistics/largest-cities-population-125.html')
In [3]: text = bs4.BeautifulSoup(res.text,"lxml")
In [4]: rows = [row.select_one("td + td")for row in text.select("table tr + tr")]
In [5]: print(len(rows))
125
In [6]: for row in rows:
   ...:     print(row.get_text())
   ...:
Tokyo/Yokohama
New York Metro
Sao Paulo
Seoul/Incheon
Mexico City
Osaka/Kobe/Kyoto
Manila
Mumbai
Delhi
Jakarta
Lagos
Kolkata
Cairo
Los Angeles
Buenos Aires
Rio de Janeiro
Moscow
Shanghai
Karachi
Paris
Istanbul
Nagoya
Beijing
Chicago
London
Shenzhen
Essen/Düsseldorf
Tehran
Bogota
Lima
Bangkok
Johannesburg/East Rand
Chennai
Taipei
Baghdad
Santiago
Bangalore
Hyderabad
St Petersburg
Philadelphia
Lahore
Kinshasa
Miami
Ho Chi Minh City
Madrid
Tianjin
Kuala Lumpur
Toronto
Milan
Shenyang
Dallas/Fort Worth
Boston
Belo Horizonte
Khartoum
Riyadh
Singapore
Washington
Detroit
Barcelona
Houston
Athens
Berlin
Sydney
Atlanta
Guadalajara
San Francisco/Oakland
Montreal.
Monterey
Melbourne
Ankara
Recife
Phoenix/Mesa
Durban
Porto Alegre
Dalian
Jeddah
Seattle
Cape Town
San Diego
Fortaleza
Curitiba
Rome
Naples
Minneapolis/St. Paul
Tel Aviv
Birmingham
Frankfurt
Lisbon
Manchester
San Juan
Katowice
Tashkent
Fukuoka
Baku/Sumqayit
St. Louis
Baltimore
Sapporo
Tampa/St. Petersburg
Taichung
Warsaw
Denver
Cologne/Bonn
Hamburg
Dubai
Pretoria
Vancouver
Beirut
Budapest
Cleveland
Pittsburgh
Campinas
Harare
Brasilia
Kuwait
Munich
Portland
Brussels
Vienna
San Jose
Damman
Copenhagen
Brisbane
Riverside/San Bernardino
Cincinnati
Accra
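The sibling selectors used in this answer can be tried offline on a tiny table. The HTML below is a made-up stand-in for the real page (the "Rank" and "City" header cells are assumptions), and the built-in "html.parser" is used so that lxml is not required for the experiment.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page's table.
html = """
<table>
  <tr><td>Rank</td><td>City</td></tr>
  <tr><td>1</td><td>Tokyo/Yokohama</td></tr>
  <tr><td>2</td><td>New York Metro</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "tr + tr" matches every row that directly follows another row, which
# skips the header row; "td + td" picks the first cell that follows
# another cell, i.e. the second column.
cells = [row.select_one("td + td") for row in soup.select("table tr + tr")]
cities = [cell.get_text(strip=True) for cell in cells if cell is not None]

print(cities)
```

Selecting by position like this avoids depending on presentational attributes such as bgcolor or font size, which may not be applied uniformly to every row.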

Matching rows from a table

I'm scraping this wikipedia page:
https://en.wikipedia.org/wiki/List_of_shopping_malls_in_the_South_Florida_metropolitan_area
And getting the data from the table, like this:
Location = response.xpath('//*[@id="mw-content-text"]/table/tr/td[2]/a/text()').extract()[0]
Name = response.xpath('//*[@id="mw-content-text"]/table/tr/td[1]/a/text()').extract()
Once I have them, the plan is to add those lists to a data frame. The issue is that at the end I get:
len(Name)
40
and
len(Location)
47
This is because in some rows the Location column holds several elements, for example the cell that reads: Coconut Grove, Miami. There I get two elements.

You can use read_html, which returns a list of DataFrames; the table you need is the first one:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_shopping_malls_in_the_South_Florida_metropolitan_area',
                  header=0)[0]
print(df)
Name Location
0 Aventura Mall Aventura
1 Bal Harbour Shops Bal Harbour
2 Bayside Marketplace Downtown Miami
3 Boynton Beach Mall Boynton Beach
4 CityPlace West Palm Beach
5 CocoWalk Coconut Grove, Miami
6 Coral Square Coral Springs
7 Dadeland Mall Kendall
8 Dolphin Mall Sweetwater
9 Downtown at the Gardens Palm Beach Gardens
10 The Falls Kendall
11 Galeria International Mall Downtown Miami
12 The Galleria at Fort Lauderdale Fort Lauderdale
13 The Gardens Mall Palm Beach Gardens
14 The Grand Doubletree Shops Downtown Miami
15 Las Olas Riverfront Fort Lauderdale
16 Las Olas Shops Fort Lauderdale
17 Lincoln Road Mall Miami Beach
18 Loehmann's Fashion Island Aventura
19 Mall of the Americas Miami
20 The Mall at 163rd Street North Miami Beach
21 The Mall at Wellington Green Wellington
22 Miami International Mall Doral
23 Miracle Marketplace Miami
24 Metrofare Shops & Cafe Government Center, Downtown Miami
25 Pembroke Lakes Mall Pembroke Pines
26 Pompano Citi Centre Pompano Beach
27 Sawgrass Mills Sunrise
28 Seminole Paradise Hollywood
29 The Shops at Fontainebleau Miami Beach
30 The Shops at Mary Brickell Village Brickell, Miami
31 The Shops at Midtown Miami Midtown Miami
32 The Shops at Pembroke Gardens Pembroke Pines
33 The Shops at Sunset Place South Miami
34 Southland Mall Cutler Bay
35 Town Center at Boca Raton Boca Raton
36 The Village at Gulfstream Park Hallandale Beach
37 Village of Merrick Park Coral Gables
38 Westfield Broward Plantation
39 Westland Mall Hialeah

You just need the correct xpath:
rows = response.xpath('//table[@class="wikitable"]//tr[not(./th)]')
for row in rows:
    print(''.join(row.xpath('.//td[1]//text()').extract()), ' | ', ''.join(row.xpath('.//td[2]//text()').extract()))
Aventura Mall | Aventura
Bal Harbour Shops | Bal Harbour
Bayside Marketplace | Downtown Miami
Boynton Beach Mall | Boynton Beach
CityPlace | West Palm Beach
CocoWalk | Coconut Grove, Miami
Coral Square | Coral Springs
Dadeland Mall | Kendall
Dolphin Mall | Sweetwater
Downtown at the Gardens | Palm Beach Gardens
The Falls | Kendall
Galeria International Mall | Downtown Miami
The Galleria at Fort Lauderdale | Fort Lauderdale
The Gardens Mall | Palm Beach Gardens
The Grand Doubletree Shops | Downtown Miami
Las Olas Riverfront | Fort Lauderdale
Las Olas Shops | Fort Lauderdale
Lincoln Road Mall | Miami Beach
Loehmann's Fashion Island | Aventura
Mall of the Americas | Miami
The Mall at 163rd Street | North Miami Beach
The Mall at Wellington Green | Wellington
Miami International Mall | Doral
Miracle Marketplace | Miami
Metrofare Shops & Cafe | Government Center, Downtown Miami
Pembroke Lakes Mall | Pembroke Pines
Pompano Citi Centre | Pompano Beach
Sawgrass Mills | Sunrise
Seminole Paradise | Hollywood
The Shops at Fontainebleau | Miami Beach
The Shops at Mary Brickell Village | Brickell, Miami
The Shops at Midtown Miami | Midtown Miami
The Shops at Pembroke Gardens | Pembroke Pines
The Shops at Sunset Place | South Miami
Southland Mall | Cutler Bay
Town Center at Boca Raton | Boca Raton
The Village at Gulfstream Park | Hallandale Beach
Village of Merrick Park | Coral Gables
Westfield Broward | Plantation
Westland Mall | Hialeah
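The reason the per-row approach keeps the two lists aligned can be shown without Scrapy. The sketch below uses only the standard library; the markup is a guess at the table's structure (a Location cell containing two links separated by a comma), and the names table and records are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one location cell holds two separate links.
html = """<table class="wikitable">
<tr><th>Name</th><th>Location</th></tr>
<tr><td><a>CocoWalk</a></td><td><a>Coconut Grove</a>, <a>Miami</a></td></tr>
<tr><td><a>Dadeland Mall</a></td><td><a>Kendall</a></td></tr>
</table>"""

table = ET.fromstring(html)

records = []
for tr in table.findall("tr"):
    cells = tr.findall("td")
    if not cells:
        continue  # header row has only <th> cells
    # Join every text fragment inside each cell so that a cell with
    # several links still produces exactly one value.
    name, location = ("".join(cell.itertext()).strip() for cell in cells[:2])
    records.append((name, location))

print(records)
```

Extracting td[2]/a/text() across the whole table returns one item per link, which is why the Location list ended up longer than the Name list; working row by row yields one pair per mall.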

If you want to treat the two parts of a location as a single value, you can replace the comma with an empty string in each location:
location = [loc.replace(',', '') for loc in location]
