Matching rows from a table - python

I'm scraping this wikipedia page:
https://en.wikipedia.org/wiki/List_of_shopping_malls_in_the_South_Florida_metropolitan_area
And getting the data from the table, like this:
Location = response.xpath('//*[@id="mw-content-text"]/table/tr/td[2]/a/text()').extract()
Name = response.xpath('//*[@id="mw-content-text"]/table/tr/td[1]/a/text()').extract()
Once I have it, the plan is to add those lists to a data frame. The issue is that at the end I get:
len(Name)
40
and
len(Location)
47
This is because some rows in the location column contain several elements, like the row where the location is Coconut Grove, Miami:
there I get two elements.

You can use read_html; the first DataFrame in the list it returns is the one you want:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_shopping_malls_in_the_South_Florida_metropolitan_area',
                  header=0)[0]
print(df)
Name Location
0 Aventura Mall Aventura
1 Bal Harbour Shops Bal Harbour
2 Bayside Marketplace Downtown Miami
3 Boynton Beach Mall Boynton Beach
4 CityPlace West Palm Beach
5 CocoWalk Coconut Grove, Miami
6 Coral Square Coral Springs
7 Dadeland Mall Kendall
8 Dolphin Mall Sweetwater
9 Downtown at the Gardens Palm Beach Gardens
10 The Falls Kendall
11 Galeria International Mall Downtown Miami
12 The Galleria at Fort Lauderdale Fort Lauderdale
13 The Gardens Mall Palm Beach Gardens
14 The Grand Doubletree Shops Downtown Miami
15 Las Olas Riverfront Fort Lauderdale
16 Las Olas Shops Fort Lauderdale
17 Lincoln Road Mall Miami Beach
18 Loehmann's Fashion Island Aventura
19 Mall of the Americas Miami
20 The Mall at 163rd Street North Miami Beach
21 The Mall at Wellington Green Wellington
22 Miami International Mall Doral
23 Miracle Marketplace Miami
24 Metrofare Shops & Cafe Government Center, Downtown Miami
25 Pembroke Lakes Mall Pembroke Pines
26 Pompano Citi Centre Pompano Beach
27 Sawgrass Mills Sunrise
28 Seminole Paradise Hollywood
29 The Shops at Fontainebleau Miami Beach
30 The Shops at Mary Brickell Village Brickell, Miami
31 The Shops at Midtown Miami Midtown Miami
32 The Shops at Pembroke Gardens Pembroke Pines
33 The Shops at Sunset Place South Miami
34 Southland Mall Cutler Bay
35 Town Center at Boca Raton Boca Raton
36 The Village at Gulfstream Park Hallandale Beach
37 Village of Merrick Park Coral Gables
38 Westfield Broward Plantation
39 Westland Mall Hialeah

You just need the correct xpath:
rows = response.xpath('//table[@class="wikitable"]//tr[not(./th)]')
for row in rows:
    print(''.join(row.xpath('.//td[1]//text()').extract()), '|', ''.join(row.xpath('.//td[2]//text()').extract()))
Aventura Mall | Aventura
Bal Harbour Shops | Bal Harbour
Bayside Marketplace | Downtown Miami
Boynton Beach Mall | Boynton Beach
CityPlace | West Palm Beach
CocoWalk | Coconut Grove, Miami
Coral Square | Coral Springs
Dadeland Mall | Kendall
Dolphin Mall | Sweetwater
Downtown at the Gardens | Palm Beach Gardens
The Falls | Kendall
Galeria International Mall | Downtown Miami
The Galleria at Fort Lauderdale | Fort Lauderdale
The Gardens Mall | Palm Beach Gardens
The Grand Doubletree Shops | Downtown Miami
Las Olas Riverfront | Fort Lauderdale
Las Olas Shops | Fort Lauderdale
Lincoln Road Mall | Miami Beach
Loehmann's Fashion Island | Aventura
Mall of the Americas | Miami
The Mall at 163rd Street | North Miami Beach
The Mall at Wellington Green | Wellington
Miami International Mall | Doral
Miracle Marketplace | Miami
Metrofare Shops & Cafe | Government Center, Downtown Miami
Pembroke Lakes Mall | Pembroke Pines
Pompano Citi Centre | Pompano Beach
Sawgrass Mills | Sunrise
Seminole Paradise | Hollywood
The Shops at Fontainebleau | Miami Beach
The Shops at Mary Brickell Village | Brickell, Miami
The Shops at Midtown Miami | Midtown Miami
The Shops at Pembroke Gardens | Pembroke Pines
The Shops at Sunset Place | South Miami
Southland Mall | Cutler Bay
Town Center at Boca Raton | Boca Raton
The Village at Gulfstream Park | Hallandale Beach
Village of Merrick Park | Coral Gables
Westfield Broward | Plantation
Westland Mall | Hialeah
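If the end goal is still a DataFrame, a minimal sketch (assuming the same Scrapy response object and the xpath above) is to join the text nodes of each cell while iterating rows, so Name and Location stay aligned:

import pandas as pd

rows = response.xpath('//table[@class="wikitable"]//tr[not(./th)]')
records = []
for row in rows:
    # joining all text nodes of a cell keeps "Coconut Grove, Miami" as a single value
    name = ''.join(row.xpath('.//td[1]//text()').extract()).strip()
    location = ''.join(row.xpath('.//td[2]//text()').extract()).strip()
    records.append({'Name': name, 'Location': location})

df = pd.DataFrame(records)  # both columns now have the same length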

If you want to treat the two words as one value, you can do a string replace over each element to drop the comma:
location = [loc.replace(',', '') for loc in location]

Related

inner join not working in pandas dataframes

I have the following 2 pandas dataframes:
city Population
0 New York City 20153634
1 Los Angeles 13310447
2 San Francisco Bay Area 6657982
3 Chicago 9512999
4 Dallas–Fort Worth 7233323
5 Washington, D.C. 6131977
6 Philadelphia 6070500
7 Boston 4794447
8 Minneapolis–Saint Paul 3551036
9 Denver 2853077
10 Miami–Fort Lauderdale 6066387
11 Phoenix 4661537
12 Detroit 4297617
13 Toronto 5928040
14 Houston 6772470
15 Atlanta 5789700
16 Tampa Bay Area 3032171
17 Pittsburgh 2342299
18 Cleveland 2055612
19 Seattle 3798902
20 Cincinnati 2165139
21 Kansas City 2104509
22 St. Louis 2807002
23 Baltimore 2798886
24 Charlotte 2474314
25 Indianapolis 2004230
26 Nashville 1865298
27 Milwaukee 1572482
28 New Orleans 1268883
29 Buffalo 1132804
30 Montreal 4098927
31 Vancouver 2463431
32 Orlando 2441257
33 Portland 2424955
34 Columbus 2041520
35 Calgary 1392609
36 Ottawa 1323783
37 Edmonton 1321426
38 Salt Lake City 1186187
39 Winnipeg 778489
40 San Diego 3317749
41 San Antonio 2429609
42 Sacramento 2296418
43 Las Vegas 2155664
44 Jacksonville 1478212
45 Oklahoma City 1373211
46 Memphis 1342842
47 Raleigh 1302946
48 Green Bay 318236
49 Hamilton 747545
50 Regina 236481
city W/L Ratio
0 Boston 2.500000
1 Buffalo 0.555556
2 Calgary 1.057143
3 Chicago 0.846154
4 Columbus 1.500000
5 Dallas–Fort Worth 1.312500
6 Denver 1.433333
7 Detroit 0.769231
8 Edmonton 0.900000
9 Las Vegas 2.125000
10 Los Angeles 1.655862
11 Miami–Fort Lauderdale 1.466667
12 Minneapolis-Saint Paul 1.730769
13 Montreal 0.725000
14 Nashville 2.944444
15 New York 1.517241
16 New York City 0.908870
17 Ottawa 0.651163
18 Philadelphia 1.615385
19 Phoenix 0.707317
20 Pittsburgh 1.620690
21 Raleigh 1.028571
22 San Francisco Bay Area 1.666667
23 St. Louis 1.375000
24 Tampa Bay 2.347826
25 Toronto 1.884615
26 Vancouver 0.775000
27 Washington, D.C. 1.884615
28 Winnipeg 2.600000
And I do a join like this:
result = pd.merge(df, nhl_df, on="city")
The result should have 28 rows; instead I have 24 rows.
One of the missing rows is, for example, Miami-Fort Lauderdale.
I have double-checked both dataframes and there are NO typographical errors. So why isn't it in the final dataframe?
city Population W/L Ratio
0 New York City 20153634 0.908870
1 Los Angeles 13310447 1.655862
2 San Francisco Bay Area 6657982 1.666667
3 Chicago 9512999 0.846154
4 Dallas–Fort Worth 7233323 1.312500
5 Washington, D.C. 6131977 1.884615
6 Philadelphia 6070500 1.615385
7 Boston 4794447 2.500000
8 Denver 2853077 1.433333
9 Phoenix 4661537 0.707317
10 Detroit 4297617 0.769231
11 Toronto 5928040 1.884615
12 Pittsburgh 2342299 1.620690
13 St. Louis 2807002 1.375000
14 Nashville 1865298 2.944444
15 Buffalo 1132804 0.555556
16 Montreal 4098927 0.725000
17 Vancouver 2463431 0.775000
18 Columbus 2041520 1.500000
19 Calgary 1392609 1.057143
20 Ottawa 1323783 0.651163
21 Edmonton 1321426 0.900000
22 Winnipeg 778489 2.600000
23 Las Vegas 2155664 2.125000
24 Raleigh 1302946 1.028571
You can check whether the characters really are the same by looking at each character's integer code point with ord. Here the dashes differ: one is – with code 150 and the other – with code 8211, which is why the values do not match:
a = df1.loc[10, 'city']
print (a)
Miami–Fort Lauderdale
print ([ord(x) for x in a])
[77, 105, 97, 109, 105, 150, 70, 111, 114, 116, 32, 76, 97, 117, 100, 101, 114, 100, 97, 108, 101]
b = df2.loc[11, 'city']
print (b)
Miami–Fort Lauderdale
print ([ord(x) for x in b])
[77, 105, 97, 109, 105, 8211, 70, 111, 114, 116, 32, 76, 97, 117, 100, 101, 114, 100, 97, 108, 101]
You can copy the dash character from each value into a replace call so the correct dash is substituted:
#first – is copied from b, second – from a
df2['city'] = df2['city'].replace('–','–', regex=True)
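Since the two dashes render identically, an equivalent but more explicit sketch (assuming, as the ord output above shows, that df1 holds code point 150 and df2 holds 8211) is to normalise by code point; either direction works as long as both frames end up with the same character:

# Sketch: replace the stray dash (code point 150) in df1 with the
# en dash U+2013 (code point 8211) that df2 already uses.
df1['city'] = df1['city'].str.replace(chr(150), chr(8211), regex=False)

# Miami–Fort Lauderdale now survives the merge.
result = pd.merge(df1, df2, on='city')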

How to extract city name with regex from team name in pandas dataframe

I have the following pandas dataframe, only showing one column
0 Atlantic Division
1 Tampa Bay Lightning*
2 Boston Bruins*
3 Toronto Maple Leafs*
4 Florida Panthers
5 Detroit Red Wings
6 Montreal Canadiens
7 Ottawa Senators
8 Buffalo Sabres
9 Metropolitan Division
10 Washington Capitals*
11 Pittsburgh Penguins*
12 Philadelphia Flyers*
13 Columbus Blue Jackets*
14 New Jersey Devils*
15 Carolina Hurricanes
16 New York Islanders
17 New York Rangers
18 Central Division
19 Nashville Predators*
20 Winnipeg Jets*
21 Minnesota Wild*
22 Colorado Avalanche*
23 St. Louis Blues
24 Dallas Stars
25 Chicago Blackhawks
26 Pacific Division
27 Vegas Golden Knights*
28 Anaheim Ducks*
29 San Jose Sharks*
30 Los Angeles Kings*
31 Calgary Flames
32 Edmonton Oilers
33 Vancouver Canucks
34 Arizona Coyotes
35 Atlantic Division
36 Montreal Canadiens*
37 Ottawa Senators*
38 Boston Bruins*
39 Toronto Maple Leafs*
40 Tampa Bay Lightning
41 Florida Panthers
42 Detroit Red Wings
43 Buffalo Sabres
44 Metropolitan Division
45 Washington Capitals*
46 Pittsburgh Penguins*
47 Columbus Blue Jackets*
48 New York Rangers*
49 New York Islanders
50 Philadelphia Flyers
51 Carolina Hurricanes
52 New Jersey Devils
53 Central Division
54 Chicago Blackhawks*
55 Minnesota Wild*
56 St. Louis Blues*
57 Nashville Predators*
58 Winnipeg Jets
59 Dallas Stars
60 Colorado Avalanche
61 Pacific Division
62 Anaheim Ducks*
63 Edmonton Oilers*
64 San Jose Sharks*
65 Calgary Flames*
66 Los Angeles Kings
67 Arizona Coyotes
68 Vancouver Canucks
69 Atlantic Division
70 Florida Panthers*
71 Tampa Bay Lightning*
72 Detroit Red Wings*
73 Boston Bruins
74 Ottawa Senators
75 Montreal Canadiens
76 Buffalo Sabres
77 Toronto Maple Leafs
78 Metropolitan Division
79 Washington Capitals*
80 Pittsburgh Penguins*
81 New York Rangers*
82 New York Islanders*
83 Philadelphia Flyers*
84 Carolina Hurricanes
85 New Jersey Devils
86 Columbus Blue Jackets
87 Central Division
88 Dallas Stars*
89 St. Louis Blues*
90 Chicago Blackhawks*
91 Nashville Predators*
92 Minnesota Wild*
93 Colorado Avalanche
94 Winnipeg Jets
95 Pacific Division
96 Anaheim Ducks*
97 Los Angeles Kings*
98 San Jose Sharks*
99 Arizona Coyotes
100 Calgary Flames
101 Vancouver Canucks
102 Edmonton Oilers
103 Atlantic Division
104 Montreal Canadiens*
105 Tampa Bay Lightning*
106 Detroit Red Wings*
107 Ottawa Senators*
108 Boston Bruins
109 Florida Panthers
110 Toronto Maple Leafs
111 Buffalo Sabres
112 Metropolitan Division
113 New York Rangers*
114 Washington Capitals*
115 New York Islanders*
116 Pittsburgh Penguins*
117 Columbus Blue Jackets
118 Philadelphia Flyers
119 New Jersey Devils
120 Carolina Hurricanes
121 Central Division
122 St. Louis Blues*
123 Nashville Predators*
124 Chicago Blackhawks*
125 Minnesota Wild*
126 Winnipeg Jets*
127 Dallas Stars
128 Colorado Avalanche
129 Pacific Division
130 Anaheim Ducks*
131 Vancouver Canucks*
132 Calgary Flames*
133 Los Angeles Kings
134 San Jose Sharks
135 Edmonton Oilers
136 Arizona Coyotes
137 Atlantic Division
138 Boston Bruins*
139 Tampa Bay Lightning*
140 Montreal Canadiens*
141 Detroit Red Wings*
142 Ottawa Senators
143 Toronto Maple Leafs
144 Florida Panthers
145 Buffalo Sabres
146 Metropolitan Division
147 Pittsburgh Penguins*
148 New York Rangers*
149 Philadelphia Flyers*
150 Columbus Blue Jackets*
151 Washington Capitals
152 New Jersey Devils
153 Carolina Hurricanes
154 New York Islanders
155 Central Division
156 Colorado Avalanche*
157 St. Louis Blues*
158 Chicago Blackhawks*
159 Minnesota Wild*
160 Dallas Stars*
161 Nashville Predators
162 Winnipeg Jets
163 Pacific Division
164 Anaheim Ducks*
165 San Jose Sharks*
166 Los Angeles Kings*
167 Phoenix Coyotes
168 Vancouver Canucks
169 Calgary Flames
170 Edmonton Oilers
Name: team, dtype: object
I need to create one additional column with the city name.
At first glance the regex seemed simple: the first word should be the city name and the rest the team name.
However some cities have two words (Los Angeles, St. Louis, etc.).
Is it possible to do this with regex, or does it have to be done manually?
Update: I tried the following:
nhl_df['city']=nhl_df['team'].str.extract(r'^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$')
But I get this error:
ValueError: Wrong number of items passed 2, placement implies 1
You can try something like this:
^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$
Here you should look for the city name in the first or second group.
This pattern relies on the assumption that the first part of a two-word city name has no more than 5 characters. The result might not be perfectly clean, but it seems to work fine on the given example.
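As a side note on the ValueError above: with two capturing groups, str.extract returns two columns, so they have to be combined before being assigned to a single column. A minimal sketch, assuming at most one of the groups matches per row:

# Extract both alternatives, then keep whichever group matched.
groups = nhl_df['team'].str.extract(r'^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$')
nhl_df['city'] = groups[0].fillna(groups[1])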
You can use
^([\w.]{1,5}(?:\s\w+)?\w*)
See the regex demo. Details:
^ - start of string
([\w.]{1,5}(?:\s\w+)?\w*) - Capturing group 1:
[\w.]{1,5} - one to five word or dot chars
(?:\s\w+)? - an optional occurrence of a whitespace and then one or more word chars
\w* - zero or more word chars.
Pandas test:
import pandas as pd
nhl_df = pd.DataFrame({"team":["Atlantic Division","Tampa Bay Lightning*","Boston Bruins*","Toronto Maple Leafs*","Florida Panthers","Detroit Red Wings","Montreal Canadiens","Ottawa Senators","Buffalo Sabres","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","Philadelphia Flyers*","Columbus Blue Jackets*","New Jersey Devils*","Carolina Hurricanes","New York Islanders","New York Rangers","Central Division","Nashville Predators*","Winnipeg Jets*","Minnesota Wild*","Colorado Avalanche*","St. Louis Blues","Dallas Stars","Chicago Blackhawks","Pacific Division","Vegas Golden Knights*","Anaheim Ducks*","San Jose Sharks*","Los Angeles Kings*","Calgary Flames","Edmonton Oilers","Vancouver Canucks","Arizona Coyotes","Atlantic Division","Montreal Canadiens*","Ottawa Senators*","Boston Bruins*","Toronto Maple Leafs*","Tampa Bay Lightning","Florida Panthers","Detroit Red Wings","Buffalo Sabres","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","Columbus Blue Jackets*","New York Rangers*","New York Islanders","Philadelphia Flyers","Carolina Hurricanes","New Jersey Devils","Central Division","Chicago Blackhawks*","Minnesota Wild*","St. Louis Blues*","Nashville Predators*","Winnipeg Jets","Dallas Stars","Colorado Avalanche","Pacific Division","Anaheim Ducks*","Edmonton Oilers*","San Jose Sharks*","Calgary Flames*","Los Angeles Kings","Arizona Coyotes","Vancouver Canucks","Atlantic Division","Florida Panthers*","Tampa Bay Lightning*","Detroit Red Wings*","Boston Bruins","Ottawa Senators","Montreal Canadiens","Buffalo Sabres","Toronto Maple Leafs","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","New York Rangers*","New York Islanders*","Philadelphia Flyers*","Carolina Hurricanes","New Jersey Devils","Columbus Blue Jackets","Central Division","Dallas Stars*","St. Louis Blues*","Chicago Blackhawks*","Nashville Predators*","Minnesota Wild*","Colorado Avalanche","Winnipeg Jets","Pacific Division","Anaheim Ducks*","Los Angeles Kings*","San Jose Sharks*","Arizona Coyotes","Calgary Flames","Vancouver Canucks","Edmonton Oilers","Atlantic Division","Montreal Canadiens*","Tampa Bay Lightning*","Detroit Red Wings*","Ottawa Senators*","Boston Bruins","Florida Panthers","Toronto Maple Leafs","Buffalo Sabres","Metropolitan Division","New York Rangers*","Washington Capitals*","New York Islanders*","Pittsburgh Penguins*","Columbus Blue Jackets","Philadelphia Flyers","New Jersey Devils","Carolina Hurricanes","Central Division","St. Louis Blues*","Nashville Predators*","Chicago Blackhawks*","Minnesota Wild*","Winnipeg Jets*","Dallas Stars","Colorado Avalanche","Pacific Division","Anaheim Ducks*","Vancouver Canucks*","Calgary Flames*","Los Angeles Kings","San Jose Sharks","Edmonton Oilers","Arizona Coyotes","Atlantic Division","Boston Bruins*","Tampa Bay Lightning*","Montreal Canadiens*","Detroit Red Wings*","Ottawa Senators","Toronto Maple Leafs","Florida Panthers","Buffalo Sabres","Metropolitan Division","Pittsburgh Penguins*","New York Rangers*","Philadelphia Flyers*","Columbus Blue Jackets*","Washington Capitals","New Jersey Devils","Carolina Hurricanes","New York Islanders","Central Division","Colorado Avalanche*","St. Louis Blues*","Chicago Blackhawks*","Minnesota Wild*","Dallas Stars*","Nashville Predators","Winnipeg Jets","Pacific Division","Anaheim Ducks*","San Jose Sharks*","Los Angeles Kings*","Phoenix Coyotes","Vancouver Canucks","Calgary Flames","Edmonton Oilers"]})
nhl_df['city']=nhl_df['team'].str.extract(r'^([\w.]{1,5}(?:\s\w+)?\w*)')
>>> nhl_df
team city
0 Atlantic Division Atlantic
1 Tampa Bay Lightning* Tampa Bay
2 Boston Bruins* Boston
3 Toronto Maple Leafs* Toronto
4 Florida Panthers Florida
.. ... ...
166 Los Angeles Kings* Los Angeles
167 Phoenix Coyotes Phoenix
168 Vancouver Canucks Vancouver
169 Calgary Flames Calgary
170 Edmonton Oilers Edmonton
^\S+(?=\s\S+$)
This regex gives you the first word of every team name that consists of exactly two words. The rest you have to sort out manually, because there is no way to tell from the pattern alone whether a middle word belongs to the city or to the team name.
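A minimal usage sketch with pandas (assumption: the first word is wrapped in a capturing group for str.extract; rows that are not two-word names come back as NaN and need manual handling):

# expand=False returns a Series because there is a single capturing group
nhl_df['city'] = nhl_df['team'].str.extract(r'^(\S+)(?=\s\S+$)', expand=False)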
Try using the regex below:
/(\d*\s*)([a-zA-Z\s]*)(\s)(\b([a-zA-Z*]*))$/
Check this:
function Replace(str) {
  var result = str.replace(/(\d*\s*)([a-zA-Z\s]*)(\s)(\b([a-zA-Z*]*))$/gim, function (a, $1, $2, $3, $4) {
    return `${$2}--${$4}`;
  });
  return result;
}

How do I change a list I read from a file so that it changes when the program is running but resets when the program ends?

I've been trying to write a program that reads a file and puts its contents into a list, which has worked so far: menu options 1, 2, and 6 work.
But option 3 (which should sort the list alphabetically) doesn't do anything to change the list.
This is after I've tried copying the sorted list into the "global" team_list. Note: I am NOT trying to change the file. I only want the list to change inside the while loop, so that if you pick option 3 the list is sorted alphabetically, and choosing option 2 afterwards displays the names in that order.
The .txt file:
Boston Americans
New York Giants
Chicago White Sox
Chicago Cubs
Chicago Cubs
Pittsburgh Pirates
Philadelphia Athletics
Philadelphia Athletics
Boston Red Sox
Philadelphia Athletics
Boston Braves
Boston Red Sox
Boston Red Sox
Chicago White Sox
Boston Red Sox
Cincinnati Reds
Cleveland Indians
New York Giants
New York Giants
New York Yankees
Washington Senators
Pittsburgh Pirates
St. Louis Cardinals
New York Yankees
New York Yankees
Philadelphia Athletics
Philadelphia Athletics
St. Louis Cardinals
New York Yankees
New York Giants
St. Louis Cardinals
Detroit Tigers
New York Yankees
New York Yankees
New York Yankees
New York Yankees
Cincinnati Reds
New York Yankees
St. Louis Cardinals
New York Yankees
St. Louis Cardinals
Detroit Tigers
St. Louis Cardinals
New York Yankees
Cleveland Indians
New York Yankees
New York Yankees
New York Yankees
New York Yankees
New York Yankees
New York Giants
Brooklyn Dodgers
New York Yankees
Milwaukee Braves
New York Yankees
Los Angeles Dodgers
Pittsburgh Pirates
New York Yankees
New York Yankees
Los Angeles Dodgers
St. Louis Cardinals
Los Angeles Dodgers
Baltimore Orioles
St. Louis Cardinals
Detroit Tigers
New York Mets
Baltimore Orioles
Pittsburgh Pirates
Oakland Athletics
Oakland Athletics
Oakland Athletics
Cincinnati Reds
Cincinnati Reds
New York Yankees
New York Yankees
Pittsburgh Pirates
Philadelphia Phillies
Los Angeles Dodgers
St. Louis Cardinals
Baltimore Orioles
Detroit Tigers
Kansas City Royals
New York Mets
Minnesota Twins
Los Angeles Dodgers
Oakland Athletics
Cincinnati Reds
Minnesota Twins
Toronto Blue Jays
Toronto Blue Jays
Atlanta Braves
New York Yankees
Florida Marlins
New York Yankees
New York Yankees
New York Yankees
Arizona Diamondbacks
Anaheim Angels
Florida Marlins
Boston Red Sox
Chicago White Sox
St. Louis Cardinals
Boston Red Sox
Philadelphia Phillies
START_DATE = 1903
END_DATE = 2009
FILE = 'WorldSeriesWinners.txt'

def main():
    team_list = file_to_list(FILE)
    quit_program = False
    while not quit_program:
        print('-' * 50)
        print('1. Search a team')
        print('2. Display team names')
        print('3. Sort team names in alphabetical order')
        print('4. Sort team names in reverse-alphabetical order')
        print('5. Remove a team name')
        print('6. Quit')
        user_response = int(input('Choose an option (1-6): '))
        while user_response < 1 or user_response > 6:
            user_response = int(input('Please choose a valid option (1-6): '))
        if user_response == 1:
            wins = 0
            chosen_team = input('Enter a team name: ')
            if chosen_team in team_list:
                for index in range(len(team_list)):
                    if team_list[index] == chosen_team:
                        wins = wins + 1
                print(f'The {chosen_team} won the world series {wins} times between {START_DATE} and {END_DATE}')
            else:
                print(f'The {chosen_team} are not on the list.')
        elif user_response == 2:
            display(team_list)
        elif user_response == 3:
            unique_team_names = set(team_list)
            new_team_list = list(unique_team_names)
            new_team_list.sort()
        elif user_response == 6:
            quit_program = True
            print('Goodbye')

def file_to_list(file_name):
    try:
        team_file = open(file_name, 'r')
        team_list = []
        team = team_file.readline()
        while team != '':
            team = team.rstrip('\n')
            team_list.append(team)
            team = team_file.readline()
        team_file.close()
        return team_list
    except IOError:
        print('File not found')

def display(team_list):
    unique_team_names = set(team_list)
    new_team_list = list(unique_team_names)
    for index in range(len(new_team_list)):
        print(new_team_list[index])

if __name__ == '__main__':
    main()
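For what it's worth, the sorted copy built in option 3 is never stored back into team_list, so the other options keep seeing the original order. A minimal standalone sketch of the difference (the file on disk is never touched either way, and display() already removes duplicates when printing):

team_list = ['New York Yankees', 'Boston Americans', 'Chicago Cubs', 'Chicago Cubs']

# what option 3 does now: the sorted copy lives under a separate name
new_team_list = sorted(set(team_list))
print(team_list)   # unchanged: original order, duplicates intact

# making the change visible to the rest of the loop: assign back to team_list
# (or call team_list.sort() in place if duplicates must be kept for option 1)
team_list = sorted(team_list)
print(team_list)   # now alphabetical for options 1 and 2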

Parsing Multiple Text Fields Using Regex and Compiling into Pandas DataFrame

I am attempting to parse a text file using python and regex to construct a specific pandas data frame. Below is a sample from the text file I am parsing and the ideal pandas DataFrame I am seeking.
Sample Text
Washington, DC November 27, 2019
USDA Truck Rate Report
WA_FV190
FIRST PRICE RANGE FOR WEEK OF NOVEMBER 20-26 2019
SECOND PRICE MOSTLY FOR TUESDAY NOVEMBER 26 2019
PERCENTAGE OF CHANGE FROM TUESDAY NOVEMBER 19 2019 SHOWN IN ().
In areas where rates are based on package rates, per-load rates were
derived by multiplying the package rate by the number of packages in
the most usual load in a 48-53 foot trailer.
CENTRAL AND WESTERN ARIZONA
-- LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LEAF LETTUCE SLIGHT SHORTAGE
--
ATLANTA 5100 5500
BALTIMORE 6300 6600
BOSTON 7000 7300
CHICAGO 4500 4900
DALLAS 3400 3800
MIAMI 6400 6700
NEW YORK 6600 6900
PHILADELPHIA 6400 6700
2019 2018
NOV 17-23 NOV 18-24
U.S. 25,701 22,956
IMPORTS 13,653 15,699
------------ --------------
sum 39,354 38,655
The ideal output should look something like:
Region CommodityGroup InboundCity Low High
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC ATLANTA 5100 5500
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BALTIMORE 6300 6600
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BOSTON 7000 7300
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC CHICAGO 4500 4900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC DALLAS 3400 3800
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC MIAMI 6400 6700
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC NEW YORK 6600 6900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC PHILADELPHIA 6400 6700
With my limited understanding of regex, this is the closest I have come to successfully isolating the desired text: regex tester for USDA data.
I have been trying to replicate the solution from How to parse complex text files using Python? where applicable, but my regex experience is severely lacking. Any help you can provide will be greatly appreciated!
I came up with this regex (txt is your text from the question):
import re
import numpy as np
import pandas as pd

data = {'Region': [], 'CommodityGroup': [], 'InboundCity': [], 'Low': [], 'High': []}
for region, commodity_group, values in re.findall(r'([A-Z ]+)\n--(.*?)--\n(.*?)\n\n', txt, flags=re.S|re.M):
    for val in values.strip().splitlines():
        val = re.sub(r'(\d)\s{8,}.*', r'\1', val)
        inbound_city, low, high = re.findall(r'([A-Z ]+)\s*(\d*)\s+(\d+)', val)[0]
        data['Region'].append(region)
        data['CommodityGroup'].append(commodity_group)
        data['InboundCity'].append(inbound_city)
        data['Low'].append(np.nan if low == '' else int(low))
        data['High'].append(int(high))

df = pd.DataFrame(data)
print(df)
Prints:
Region CommodityGroup InboundCity Low High
0 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... ATLANTA 5100 5500
1 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... BALTIMORE 6300 6600
2 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... BOSTON 7000 7300
3 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... CHICAGO 4500 4900
4 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... DALLAS 3400 3800
5 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... MIAMI 6400 6700
6 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... NEW YORK 6600 6900
7 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... PHILADELPHIA 6400 6700
EDIT: this should now work even for your big document from regex101.

Not able to extract complete city list

I am using the following code to extract the list of cities mentioned on this page, but it gives me just the first 23 cities.
Can't figure out where I am going wrong!
import requests, bs4

res = requests.get('http://www.citymayors.com/statistics/largest-cities-population-125.html')
text = bs4.BeautifulSoup(res.text, "lxml")
fields = text.select('td[bgcolor="silver"] > font[size="-2"] > b')
print len(fields)
for field in fields:
    print field.getText()
This is the output I am getting:
23
Tokyo/Yokohama
New York Metro
Sao Paulo
Seoul/Incheon
Mexico City
Osaka/Kobe/Kyoto
Manila
Mumbai
Delhi
Jakarta
Lagos
Kolkata
Cairo
Los Angeles
Buenos Aires
Rio de Janeiro
Moscow
Shanghai
Karachi
Paris
Istanbul
Nagoya
Beijing
But this webpage contains 125 cities.
lxml works fine for me; I get 124 cities using your own code, so it has nothing to do with the parser. You are either using an old version of bs4 or it is an encoding issue: you should pass .content and let requests handle the encoding. You are also missing a city with your logic. To get all 125:
import requests, bs4

res = requests.get('http://www.citymayors.com/statistics/largest-cities-population-125.html')
text = bs4.BeautifulSoup(res.content, "lxml")
rows = [row.select_one("td + td") for row in text.select("table tr + tr")]
print(len(rows))
for row in rows:
    print(row.get_text())
If we run it, you can see we get all the cities:
In [1]: import requests,bs4
In [2]: res = requests.get('http://www.citymayors.com/statistics/largest-cities-population-125.html')
In [3]: text = bs4.BeautifulSoup(res.text,"lxml")
In [4]: rows = [row.select_one("td + td")for row in text.select("table tr + tr")]
In [5]: print(len(rows))
125
In [6]: for row in rows:
...: print(row.get_text())
...:
Tokyo/Yokohama
New York Metro
Sao Paulo
Seoul/Incheon
Mexico City
Osaka/Kobe/Kyoto
Manila
Mumbai
Delhi
Jakarta
Lagos
Kolkata
Cairo
Los Angeles
Buenos Aires
Rio de Janeiro
Moscow
Shanghai
Karachi
Paris
Istanbul
Nagoya
Beijing
Chicago
London
Shenzhen
Essen/Düsseldorf
Tehran
Bogota
Lima
Bangkok
Johannesburg/East Rand
Chennai
Taipei
Baghdad
Santiago
Bangalore
Hyderabad
St Petersburg
Philadelphia
Lahore
Kinshasa
Miami
Ho Chi Minh City
Madrid
Tianjin
Kuala Lumpur
Toronto
Milan
Shenyang
Dallas/Fort Worth
Boston
Belo Horizonte
Khartoum
Riyadh
Singapore
Washington
Detroit
Barcelona
Houston
Athens
Berlin
Sydney
Atlanta
Guadalajara
San Francisco/Oakland
Montreal.
Monterey
Melbourne
Ankara
Recife
Phoenix/Mesa
Durban
Porto Alegre
Dalian
Jeddah
Seattle
Cape Town
San Diego
Fortaleza
Curitiba
Rome
Naples
Minneapolis/St. Paul
Tel Aviv
Birmingham
Frankfurt
Lisbon
Manchester
San Juan
Katowice
Tashkent
Fukuoka
Baku/Sumqayit
St. Louis
Baltimore
Sapporo
Tampa/St. Petersburg
Taichung
Warsaw
Denver
Cologne/Bonn
Hamburg
Dubai
Pretoria
Vancouver
Beirut
Budapest
Cleveland
Pittsburgh
Campinas
Harare
Brasilia
Kuwait
Munich
Portland
Brussels
Vienna
San Jose
Damman
Copenhagen
Brisbane
Riverside/San Bernardino
Cincinnati
Accra
