How can I replace the string 'Singapore' in the location1 column with the corresponding value from the location2 column? In this example, the replacement values are "Tokyo, Boston, Toronto" and "Hong Kong, Boston".
import pandas as pd
data = {'location1':["London, Paris", "Singapore", "London, New York", "Singapore", "Boston"],
'location2':["London, Paris", "Tokyo, Boston, Toronto", "London, New York", "Hong Kong, Boston", "Boston"]}
df = pd.DataFrame(data)
location1 location2
0 London, Paris London, Paris
1 Singapore Tokyo, Boston, Toronto
2 London, New York London, New York
3 Singapore Hong Kong, Boston
4 Boston Boston
Use .loc with a boolean mask; the assignment aligns on the index, so only the matching rows receive values from location2:
df.loc[df['location1'].eq('Singapore'), 'location1'] = df['location2']
print(df)
# Output:
location1 location2
0 London, Paris London, Paris
1 Tokyo, Boston, Toronto Tokyo, Boston, Toronto
2 London, New York London, New York
3 Hong Kong, Boston Hong Kong, Boston
4 Boston Boston
We can also do it with numpy.where:
>>> import numpy as np
>>> df["location1"] = np.where(df["location1"] == 'Singapore', df["location2"], df["location1"])
>>> df
location1 location2
0 London, Paris London, Paris
1 Tokyo, Boston, Toronto Tokyo, Boston, Toronto
2 London, New York London, New York
3 Hong Kong, Boston Hong Kong, Boston
4 Boston Boston
Try:
df['location1'] = df['location1'].mask(df['location1'] == 'Singapore')\
.fillna(df['location2'])
Output:
location1 location2
0 London, Paris London, Paris
1 Tokyo, Boston, Toronto Tokyo, Boston, Toronto
2 London, New York London, New York
3 Hong Kong, Boston Hong Kong, Boston
4 Boston Boston
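For comparison, the same replacement can also be written with Series.where, which keeps a value where the condition is True and takes the replacement elsewhere. This is only a minimal sketch on the question's sample data, not one of the approaches posted above; the name `keep` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'location1': ["London, Paris", "Singapore", "London, New York", "Singapore", "Boston"],
    'location2': ["London, Paris", "Tokyo, Boston, Toronto", "London, New York",
                  "Hong Kong, Boston", "Boston"],
})

# True for the rows whose location1 should stay unchanged.
keep = df['location1'].ne('Singapore')

# where() keeps values where `keep` is True and takes location2 elsewhere.
df['location1'] = df['location1'].where(keep, df['location2'])
print(df['location1'].tolist())
```

Compared with mask followed by fillna, this does not go through an intermediate NaN, so rows that already hold missing values in location1 are left as they are.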
I have the following pandas DataFrame (only one column is shown):
0 Atlantic Division
1 Tampa Bay Lightning*
2 Boston Bruins*
3 Toronto Maple Leafs*
4 Florida Panthers
5 Detroit Red Wings
6 Montreal Canadiens
7 Ottawa Senators
8 Buffalo Sabres
9 Metropolitan Division
10 Washington Capitals*
11 Pittsburgh Penguins*
12 Philadelphia Flyers*
13 Columbus Blue Jackets*
14 New Jersey Devils*
15 Carolina Hurricanes
16 New York Islanders
17 New York Rangers
18 Central Division
19 Nashville Predators*
20 Winnipeg Jets*
21 Minnesota Wild*
22 Colorado Avalanche*
23 St. Louis Blues
24 Dallas Stars
25 Chicago Blackhawks
26 Pacific Division
27 Vegas Golden Knights*
28 Anaheim Ducks*
29 San Jose Sharks*
30 Los Angeles Kings*
31 Calgary Flames
32 Edmonton Oilers
33 Vancouver Canucks
34 Arizona Coyotes
35 Atlantic Division
36 Montreal Canadiens*
37 Ottawa Senators*
38 Boston Bruins*
39 Toronto Maple Leafs*
40 Tampa Bay Lightning
41 Florida Panthers
42 Detroit Red Wings
43 Buffalo Sabres
44 Metropolitan Division
45 Washington Capitals*
46 Pittsburgh Penguins*
47 Columbus Blue Jackets*
48 New York Rangers*
49 New York Islanders
50 Philadelphia Flyers
51 Carolina Hurricanes
52 New Jersey Devils
53 Central Division
54 Chicago Blackhawks*
55 Minnesota Wild*
56 St. Louis Blues*
57 Nashville Predators*
58 Winnipeg Jets
59 Dallas Stars
60 Colorado Avalanche
61 Pacific Division
62 Anaheim Ducks*
63 Edmonton Oilers*
64 San Jose Sharks*
65 Calgary Flames*
66 Los Angeles Kings
67 Arizona Coyotes
68 Vancouver Canucks
69 Atlantic Division
70 Florida Panthers*
71 Tampa Bay Lightning*
72 Detroit Red Wings*
73 Boston Bruins
74 Ottawa Senators
75 Montreal Canadiens
76 Buffalo Sabres
77 Toronto Maple Leafs
78 Metropolitan Division
79 Washington Capitals*
80 Pittsburgh Penguins*
81 New York Rangers*
82 New York Islanders*
83 Philadelphia Flyers*
84 Carolina Hurricanes
85 New Jersey Devils
86 Columbus Blue Jackets
87 Central Division
88 Dallas Stars*
89 St. Louis Blues*
90 Chicago Blackhawks*
91 Nashville Predators*
92 Minnesota Wild*
93 Colorado Avalanche
94 Winnipeg Jets
95 Pacific Division
96 Anaheim Ducks*
97 Los Angeles Kings*
98 San Jose Sharks*
99 Arizona Coyotes
100 Calgary Flames
101 Vancouver Canucks
102 Edmonton Oilers
103 Atlantic Division
104 Montreal Canadiens*
105 Tampa Bay Lightning*
106 Detroit Red Wings*
107 Ottawa Senators*
108 Boston Bruins
109 Florida Panthers
110 Toronto Maple Leafs
111 Buffalo Sabres
112 Metropolitan Division
113 New York Rangers*
114 Washington Capitals*
115 New York Islanders*
116 Pittsburgh Penguins*
117 Columbus Blue Jackets
118 Philadelphia Flyers
119 New Jersey Devils
120 Carolina Hurricanes
121 Central Division
122 St. Louis Blues*
123 Nashville Predators*
124 Chicago Blackhawks*
125 Minnesota Wild*
126 Winnipeg Jets*
127 Dallas Stars
128 Colorado Avalanche
129 Pacific Division
130 Anaheim Ducks*
131 Vancouver Canucks*
132 Calgary Flames*
133 Los Angeles Kings
134 San Jose Sharks
135 Edmonton Oilers
136 Arizona Coyotes
137 Atlantic Division
138 Boston Bruins*
139 Tampa Bay Lightning*
140 Montreal Canadiens*
141 Detroit Red Wings*
142 Ottawa Senators
143 Toronto Maple Leafs
144 Florida Panthers
145 Buffalo Sabres
146 Metropolitan Division
147 Pittsburgh Penguins*
148 New York Rangers*
149 Philadelphia Flyers*
150 Columbus Blue Jackets*
151 Washington Capitals
152 New Jersey Devils
153 Carolina Hurricanes
154 New York Islanders
155 Central Division
156 Colorado Avalanche*
157 St. Louis Blues*
158 Chicago Blackhawks*
159 Minnesota Wild*
160 Dallas Stars*
161 Nashville Predators
162 Winnipeg Jets
163 Pacific Division
164 Anaheim Ducks*
165 San Jose Sharks*
166 Los Angeles Kings*
167 Phoenix Coyotes
168 Vancouver Canucks
169 Calgary Flames
170 Edmonton Oilers
Name: team, dtype: object
I need to create an additional column containing the city name.
At first glance the regex looks simple: the first word should be the city name and the rest is the team name.
However, some city names have two words (Los Angeles, St. Louis, etc.).
Is it possible to do this with a regex, or does it have to be done manually?
Update: I tried the following:
nhl_df['city']=nhl_df['team'].str.extract(r'^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$')
But I get this error:
ValueError: Wrong number of items passed 2, placement implies 1
You can try something like this:
^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$
Look for the city name in the first or the second group.
This pattern assumes that the first word of a two-word city name has no more than 5 characters. The result might not be perfectly clean, but it seems to work fine on the given example.
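The ValueError in the question arises because this pattern has two capturing groups: str.extract then returns a DataFrame with two columns, which cannot be assigned to a single column. Below is a minimal sketch of combining the two groups, assuming a small sample of the team names; the names `teams`, `groups` and `city` are illustrative.

```python
import pandas as pd

teams = pd.Series(["Tampa Bay Lightning*", "Boston Bruins*",
                   "St. Louis Blues", "Toronto Maple Leafs*"])

pattern = r'^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$'

# Two capturing groups produce a DataFrame with columns 0 and 1.
groups = teams.str.extract(pattern)

# Use group 1 where it matched, otherwise fall back to group 2.
city = groups[0].fillna(groups[1])
print(city.tolist())
```

The combined Series can then be assigned to a single column such as nhl_df['city'].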
You can use
^([\w.]{1,5}(?:\s\w+)?\w*)
Details:
^ - start of string
([\w.]{1,5}(?:\s\w+)?\w*) - Capturing group 1:
[\w.]{1,5} - one to five word or dot chars
(?:\s\w+)? - an optional occurrence of a whitespace and then one or more word chars
\w* - zero or more word chars.
Pandas test:
import pandas as pd
nhl_df = pd.DataFrame({"team":["Atlantic Division","Tampa Bay Lightning*","Boston Bruins*","Toronto Maple Leafs*","Florida Panthers","Detroit Red Wings","Montreal Canadiens","Ottawa Senators","Buffalo Sabres","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","Philadelphia Flyers*","Columbus Blue Jackets*","New Jersey Devils*","Carolina Hurricanes","New York Islanders","New York Rangers","Central Division","Nashville Predators*","Winnipeg Jets*","Minnesota Wild*","Colorado Avalanche*","St. Louis Blues","Dallas Stars","Chicago Blackhawks","Pacific Division","Vegas Golden Knights*","Anaheim Ducks*","San Jose Sharks*","Los Angeles Kings*","Calgary Flames","Edmonton Oilers","Vancouver Canucks","Arizona Coyotes","Atlantic Division","Montreal Canadiens*","Ottawa Senators*","Boston Bruins*","Toronto Maple Leafs*","Tampa Bay Lightning","Florida Panthers","Detroit Red Wings","Buffalo Sabres","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","Columbus Blue Jackets*","New York Rangers*","New York Islanders","Philadelphia Flyers","Carolina Hurricanes","New Jersey Devils","Central Division","Chicago Blackhawks*","Minnesota Wild*","St. Louis Blues*","Nashville Predators*","Winnipeg Jets","Dallas Stars","Colorado Avalanche","Pacific Division","Anaheim Ducks*","Edmonton Oilers*","San Jose Sharks*","Calgary Flames*","Los Angeles Kings","Arizona Coyotes","Vancouver Canucks","Atlantic Division","Florida Panthers*","Tampa Bay Lightning*","Detroit Red Wings*","Boston Bruins","Ottawa Senators","Montreal Canadiens","Buffalo Sabres","Toronto Maple Leafs","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","New York Rangers*","New York Islanders*","Philadelphia Flyers*","Carolina Hurricanes","New Jersey Devils","Columbus Blue Jackets","Central Division","Dallas Stars*","St. Louis Blues*","Chicago Blackhawks*","Nashville Predators*","Minnesota Wild*","Colorado Avalanche","Winnipeg Jets","Pacific Division","Anaheim Ducks*","Los Angeles Kings*","San Jose Sharks*","Arizona Coyotes","Calgary Flames","Vancouver Canucks","Edmonton Oilers","Atlantic Division","Montreal Canadiens*","Tampa Bay Lightning*","Detroit Red Wings*","Ottawa Senators*","Boston Bruins","Florida Panthers","Toronto Maple Leafs","Buffalo Sabres","Metropolitan Division","New York Rangers*","Washington Capitals*","New York Islanders*","Pittsburgh Penguins*","Columbus Blue Jackets","Philadelphia Flyers","New Jersey Devils","Carolina Hurricanes","Central Division","St. Louis Blues*","Nashville Predators*","Chicago Blackhawks*","Minnesota Wild*","Winnipeg Jets*","Dallas Stars","Colorado Avalanche","Pacific Division","Anaheim Ducks*","Vancouver Canucks*","Calgary Flames*","Los Angeles Kings","San Jose Sharks","Edmonton Oilers","Arizona Coyotes","Atlantic Division","Boston Bruins*","Tampa Bay Lightning*","Montreal Canadiens*","Detroit Red Wings*","Ottawa Senators","Toronto Maple Leafs","Florida Panthers","Buffalo Sabres","Metropolitan Division","Pittsburgh Penguins*","New York Rangers*","Philadelphia Flyers*","Columbus Blue Jackets*","Washington Capitals","New Jersey Devils","Carolina Hurricanes","New York Islanders","Central Division","Colorado Avalanche*","St. Louis Blues*","Chicago Blackhawks*","Minnesota Wild*","Dallas Stars*","Nashville Predators","Winnipeg Jets","Pacific Division","Anaheim Ducks*","San Jose Sharks*","Los Angeles Kings*","Phoenix Coyotes","Vancouver Canucks","Calgary Flames","Edmonton Oilers"]})
nhl_df['city']=nhl_df['team'].str.extract(r'^([\w.]{1,5}(?:\s\w+)?\w*)')
>>> nhl_df
team city
0 Atlantic Division Atlantic
1 Tampa Bay Lightning* Tampa Bay
2 Boston Bruins* Boston
3 Toronto Maple Leafs* Toronto
4 Florida Panthers Florida
.. ... ...
166 Los Angeles Kings* Los Angeles
167 Phoenix Coyotes Phoenix
168 Vancouver Canucks Vancouver
169 Calgary Flames Calgary
170 Edmonton Oilers Edmonton
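The pattern treats a first word of at most five characters as the start of a two-word city, so it is worth spot-checking names where that heuristic could misfire. Here is a small sketch using the standard re module on a few names from the question; the names `CITY`, `samples` and `cities` are illustrative.

```python
import re

CITY = re.compile(r'^([\w.]{1,5}(?:\s\w+)?\w*)')

samples = ["Boston Bruins*", "Tampa Bay Lightning*",
           "St. Louis Blues", "Vegas Golden Knights*"]

# Group 1 holds whatever the pattern considers to be the city.
cities = [CITY.match(name).group(1) for name in samples]
print(cities)
```

The first three names come back as expected, but "Vegas Golden Knights*" yields "Vegas Golden", because "Vegas" is short enough to be treated as the first half of a two-word city. Rows like that would need a manual correction.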
^\S+(?=\s\S+$)
This regex gives you the first word of every team name that consists of exactly two words. The others have to be sorted out manually, because the pattern alone cannot tell whether the middle word belongs to the city or to the team name.
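One way to combine this regex with the manual step is a small lookup of known multi-word cities. The sketch below is only an illustration: the list `MULTI_WORD_CITIES` is hypothetical and hand-maintained, the function name `city_of` is made up, and the final fallback assumes every remaining name starts with a one-word city.

```python
import re

# Matches the first word only when the name has exactly two words.
TWO_WORDS = re.compile(r'^\S+(?=\s\S+$)')

# Hypothetical, hand-maintained list of multi-word cities; extend as needed.
MULTI_WORD_CITIES = ("Tampa Bay", "New York", "New Jersey",
                     "St. Louis", "San Jose", "Los Angeles")


def city_of(team):
    """Return the city part of a team name such as 'Boston Bruins*'."""
    name = team.rstrip('*').strip()
    match = TWO_WORDS.match(name)
    if match:
        return match.group(0)
    for city in MULTI_WORD_CITIES:
        if name.startswith(city + ' '):
            return city
    # Assumption: anything else starts with a one-word city.
    return name.split()[0]


print(city_of("Tampa Bay Lightning*"))
```

With a DataFrame this could be applied as nhl_df['team'].map(city_of); division header rows such as "Atlantic Division" would still need to be filtered out separately.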
Try using the regex below:
/(\d*\s*)([a-zA-Z\s]*)(\s)(\b([a-zA-Z\*]*))$/
Check this:
function Replace(str) {
  var result = str.replace(/(\d*\s*)([a-zA-Z\s]*)(\s)(\b([a-zA-Z\*]*))$/gim, function (a, $1, $2, $3, $4) {
    return `${$2}--${$4}`;
  });
  return result;
}
I am attempting to parse a text file using python and regex to construct a specific pandas data frame. Below is a sample from the text file I am parsing and the ideal pandas DataFrame I am seeking.
Sample Text
Washington, DC November 27, 2019
USDA Truck Rate Report
WA_FV190
FIRST PRICE RANGE FOR WEEK OF NOVEMBER 20-26 2019
SECOND PRICE MOSTLY FOR TUESDAY NOVEMBER 26 2019
PERCENTAGE OF CHANGE FROM TUESDAY NOVEMBER 19 2019 SHOWN IN ().
In areas where rates are based on package rates, per-load rates were
derived by multiplying the package rate by the number of packages in
the most usual load in a 48-53 foot trailer.
CENTRAL AND WESTERN ARIZONA
-- LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LEAF LETTUCE SLIGHT SHORTAGE
--
ATLANTA 5100 5500
BALTIMORE 6300 6600
BOSTON 7000 7300
CHICAGO 4500 4900
DALLAS 3400 3800
MIAMI 6400 6700
NEW YORK 6600 6900
PHILADELPHIA 6400 6700
2019 2018
NOV 17-23 NOV 18-24
U.S. 25,701 22,956
IMPORTS 13,653 15,699
------------ --------------
sum 39,354 38,655
The ideal output should look something like:
Region CommodityGroup InboundCity Low High
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC ATLANTA 5100 5500
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BALTIMORE 6300 6600
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BOSTON 7000 7300
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC CHICAGO 4500 4900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC DALLAS 3400 3800
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC MIAMI 6400 6700
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC NEW YORK 6600 6900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC PHILADELPHIA 6400 6700
With my limited understanding of writing regex patterns, this is the closest I have come to successfully isolating the desired text: regex tester for USDA data
I have been trying to replicate the solution from "How to parse complex text files using Python?" where applicable, but my regex experience is severely lacking. Any help you can provide will be greatly appreciated!
I came up with this regex (txt is your text from the question):
import re
import numpy as np
import pandas as pd
data = {'Region':[], 'CommodityGroup':[], 'InboundCity':[], 'Low':[], 'High':[]}
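# Each block: a region header, a "-- commodity --" section, then city rows up to a blank line.
for region, commodity_group, values in re.findall(r'([A-Z ]+)\n--(.*?)--\n(.*?)\n\n', txt, flags=re.S|re.M):
    for val in values.strip().splitlines():
        # Drop anything that follows the prices after a gap of 8 or more whitespace characters.
        val = re.sub(r'(\d)\s{8,}.*', r'\1', val)
        inbound_city, low, high = re.findall(r'([A-Z ]+)\s*(\d*)\s+(\d+)', val)[0]
        data['Region'].append(region.strip())
        data['CommodityGroup'].append(commodity_group.strip())
        # [A-Z ]+ also captures the padding after the city name, so strip it.
        data['InboundCity'].append(inbound_city.strip())
        data['Low'].append(np.nan if low == '' else int(low))
        data['High'].append(int(high))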
df = pd.DataFrame(data)
print(df)
Prints:
Region CommodityGroup InboundCity Low High
0 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... ATLANTA 5100 5500
1 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... BALTIMORE 6300 6600
2 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... BOSTON 7000 7300
3 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... CHICAGO 4500 4900
4 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... DALLAS 3400 3800
5 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... MIAMI 6400 6700
6 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... NEW YORK 6600 6900
7 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... PHILADELPHIA 6400 6700
EDIT: This should now work even for your larger document from regex101.
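To see how the per-row pattern behaves on its own, here is a small sketch using only the re module. The helper name `parse_row` is hypothetical, the second sample line (a row with a single price) is made up for illustration, and returning None for a missing low price is a choice of this sketch, whereas the code above uses np.nan.

```python
import re

# Same per-row pattern as above: city, optional low price, high price.
ROW = re.compile(r'([A-Z ]+)\s*(\d*)\s+(\d+)')


def parse_row(line):
    """Return (city, low, high); low is None when only one price is present."""
    city, low, high = ROW.match(line).groups()
    # The city group also captures the padding spaces, so strip them.
    return city.strip(), (int(low) if low else None), int(high)


print(parse_row("ATLANTA               5100 5500"))
print(parse_row("NEW YORK                   6900"))
```

Because `[A-Z ]+` is greedy and includes spaces, the raw city group carries trailing padding, which is why the strip matters before the value goes into a DataFrame.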
I'm scraping this wikipedia page:
https://en.wikipedia.org/wiki/List_of_shopping_malls_in_the_South_Florida_metropolitan_area
And getting the data from the table, like this:
Location = response.xpath('//*[@id="mw-content-text"]/table/tr/td[2]/a/text()').extract()[0]
Name = response.xpath('//*[@id="mw-content-text"]/table/tr/td[1]/a/text()').extract()
Once I have them, the plan is to add those lists to a data frame. The issue is that at the end I get:
len(Name)
40
and
len(Location)
47
This is because in some rows the Location column contains several links, as in the row where it reads Coconut Grove, Miami:
there I get two elements.
You can use read_html, which returns a list of DataFrames; df is the first one:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_shopping_malls_in_the_South_Florida_metropolitan_area',
                  header=0)[0]
print(df)
Name Location
0 Aventura Mall Aventura
1 Bal Harbour Shops Bal Harbour
2 Bayside Marketplace Downtown Miami
3 Boynton Beach Mall Boynton Beach
4 CityPlace West Palm Beach
5 CocoWalk Coconut Grove, Miami
6 Coral Square Coral Springs
7 Dadeland Mall Kendall
8 Dolphin Mall Sweetwater
9 Downtown at the Gardens Palm Beach Gardens
10 The Falls Kendall
11 Galeria International Mall Downtown Miami
12 The Galleria at Fort Lauderdale Fort Lauderdale
13 The Gardens Mall Palm Beach Gardens
14 The Grand Doubletree Shops Downtown Miami
15 Las Olas Riverfront Fort Lauderdale
16 Las Olas Shops Fort Lauderdale
17 Lincoln Road Mall Miami Beach
18 Loehmann's Fashion Island Aventura
19 Mall of the Americas Miami
20 The Mall at 163rd Street North Miami Beach
21 The Mall at Wellington Green Wellington
22 Miami International Mall Doral
23 Miracle Marketplace Miami
24 Metrofare Shops & Cafe Government Center, Downtown Miami
25 Pembroke Lakes Mall Pembroke Pines
26 Pompano Citi Centre Pompano Beach
27 Sawgrass Mills Sunrise
28 Seminole Paradise Hollywood
29 The Shops at Fontainebleau Miami Beach
30 The Shops at Mary Brickell Village Brickell, Miami
31 The Shops at Midtown Miami Midtown Miami
32 The Shops at Pembroke Gardens Pembroke Pines
33 The Shops at Sunset Place South Miami
34 Southland Mall Cutler Bay
35 Town Center at Boca Raton Boca Raton
36 The Village at Gulfstream Park Hallandale Beach
37 Village of Merrick Park Coral Gables
38 Westfield Broward Plantation
39 Westland Mall Hialeah
You just need the correct xpath:
rows = response.xpath('//table[@class="wikitable"]//tr[not(./th)]')
for row in rows:
    print(''.join(row.xpath('.//td[1]//text()').extract()), ' | ', ''.join(row.xpath('.//td[2]//text()').extract()))
Aventura Mall | Aventura
Bal Harbour Shops | Bal Harbour
Bayside Marketplace | Downtown Miami
Boynton Beach Mall | Boynton Beach
CityPlace | West Palm Beach
CocoWalk | Coconut Grove, Miami
Coral Square | Coral Springs
Dadeland Mall | Kendall
Dolphin Mall | Sweetwater
Downtown at the Gardens | Palm Beach Gardens
The Falls | Kendall
Galeria International Mall | Downtown Miami
The Galleria at Fort Lauderdale | Fort Lauderdale
The Gardens Mall | Palm Beach Gardens
The Grand Doubletree Shops | Downtown Miami
Las Olas Riverfront | Fort Lauderdale
Las Olas Shops | Fort Lauderdale
Lincoln Road Mall | Miami Beach
Loehmann's Fashion Island | Aventura
Mall of the Americas | Miami
The Mall at 163rd Street | North Miami Beach
The Mall at Wellington Green | Wellington
Miami International Mall | Doral
Miracle Marketplace | Miami
Metrofare Shops & Cafe | Government Center, Downtown Miami
Pembroke Lakes Mall | Pembroke Pines
Pompano Citi Centre | Pompano Beach
Sawgrass Mills | Sunrise
Seminole Paradise | Hollywood
The Shops at Fontainebleau | Miami Beach
The Shops at Mary Brickell Village | Brickell, Miami
The Shops at Midtown Miami | Midtown Miami
The Shops at Pembroke Gardens | Pembroke Pines
The Shops at Sunset Place | South Miami
Southland Mall | Cutler Bay
Town Center at Boca Raton | Boca Raton
The Village at Gulfstream Park | Hallandale Beach
Village of Merrick Park | Coral Gables
Westfield Broward | Plantation
Westland Mall | Hialeah
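The key idea is to join every text node inside a cell instead of collecting only the link texts. This can be tried without Scrapy using the standard library's ElementTree; the sketch below is only a stand-in, and the markup in `HTML` is a hypothetical, simplified and well-formed miniature of the Wikipedia table.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the Wikipedia table (well-formed XML).
HTML = """
<table class="wikitable">
  <tr><th>Name</th><th>Location</th></tr>
  <tr><td><a>Aventura Mall</a></td><td><a>Aventura</a></td></tr>
  <tr><td><a>CocoWalk</a></td><td><a>Coconut Grove</a>, <a>Miami</a></td></tr>
</table>
"""

table = ET.fromstring(HTML.strip())
rows = []
for tr in table.findall("tr"):
    cells = tr.findall("td")
    if not cells:  # the header row contains only <th> cells
        continue
    # itertext() yields every text node in the cell, like .//text() in XPath.
    name = "".join(cells[0].itertext()).strip()
    location = "".join(cells[1].itertext()).strip()
    rows.append((name, location))

print(rows)
```

Each row yields exactly one name and one location, so both lists end up with the same length even when a cell contains several links.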
If what you want is to treat the two parts as one, you can replace the comma with an empty string in every element of the list:
location = [loc.replace(',', '') for loc in location]