I want to get just the text from a span:
html = <a class="business-name" data-analytics='{"click_id":1600,"target":"name","feature_click":""}' href="/new-york-ny/bpp/upper-eastside-orthodontists-20151" rel=""><span>Upper Eastside Orthodontists</span></a>
name = html.find('a', {'class', 'business-name'})
print(name.find('span').text)
but running it gives me this error:
print(name.find('span').text)
AttributeError: 'NoneType' object has no attribute 'text'
I want to get just the text: Upper Eastside Orthodontists
What you are actually looking for is not in the static/initial request. The page is rendered dynamically.
Luckily the data does come in under the <script> tags, and you can pull out the json and parse it from there:
import requests
from bs4 import BeautifulSoup
import re
import json
import pandas as pd
url = 'https://www.superpages.com/new-york-ny/dentists?page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
script = soup.find_all('script', {'type':"application/ld+json"})[-2]
p = re.compile('({.*})')
result = p.search(str(script))
data = json.loads(result.group(0))
df = pd.DataFrame(data['mainEntity']['itemListElement'])
Output:
print(df.to_string())
#type name url
0 ItemPage Upper Eastside Orthodontists https://www.superpages.com/new-york-ny/bpp/upper-eastside-orthodontists-20151
1 ItemPage Kara https://www.superpages.com/new-york-ny/bpp/kara-5721648
2 ItemPage Central Park West Dentistry https://www.superpages.com/new-york-ny/bpp/central-park-west-dentistry-471054528
3 ItemPage Majid Rajabi Khamesi Advanced Family Dental https://www.superpages.com/new-york-ny/bpp/majid-rajabi-khamesi-advanced-family-dental-542761105
4 ItemPage Robert Veligdan, DMD, PC https://www.superpages.com/new-york-ny/bpp/robert-veligdan-dmd-pc-21238912
5 ItemPage Irina Rossinski, DDS https://www.superpages.com/new-york-ny/bpp/irina-rossinski-dds-462447740
6 ItemPage Dr. Michael J. Wei https://www.superpages.com/new-york-ny/bpp/dr-michael-j-wei-504012551
7 ItemPage Manhattan Dental Spa https://www.superpages.com/new-york-ny/bpp/manhattan-dental-spa-22612348
8 ItemPage Expert Dental PC https://www.superpages.com/new-york-ny/bpp/expert-dental-pc-459327373
9 ItemPage Dr. Jonathan Freed, D.D.S., P.C. https://www.superpages.com/new-york-ny/bpp/dr-jonathan-freed-d-d-s-p-c-503142997
10 ItemPage Clifford S. Melnick, DMD PC https://www.superpages.com/new-york-ny/bpp/clifford-s-melnick-dmd-pc-512698216
11 ItemPage Ronald Birnbaum Dds https://www.superpages.com/new-york-ny/bpp/ronald-birnbaum-dds-2757412
12 ItemPage Concerned Dental Care https://www.superpages.com/new-york-ny/bpp/concerned-dental-care-453434343
13 ItemPage DownTown Dental Cosmetic Center https://www.superpages.com/new-york-ny/bpp/downtown-dental-cosmetic-center-468569119
14 ItemPage Beth Caunitz, D.D.S. https://www.superpages.com/new-york-ny/bpp/beth-caunitz-d-d-s-479935675
15 ItemPage Alice Urbankova DDS, P https://www.superpages.com/new-york-ny/bpp/alice-urbankova-dds-p-474879958
16 ItemPage Wu Darryl DDS PC https://www.superpages.com/new-york-ny/bpp/wu-darryl-dds-pc-8291524
17 ItemPage Gerald Rosen DDS https://www.superpages.com/new-york-ny/bpp/gerald-rosen-dds-470302208
18 ItemPage Group Health Dental https://www.superpages.com/new-york-ny/bpp/group-health-dental-15648711
19 ItemPage Dr. Shaun Massiah, DMD https://www.superpages.com/new-york-ny/bpp/dr-shaun-massiah-dmd-453290181
20 ItemPage Park 56 Dental https://www.superpages.com/new-york-ny/bpp/park-56-dental-479624928?lid=1001970746762
21 ItemPage Rubin Esther S https://www.superpages.com/new-york-ny/bpp/rubin-esther-s-462458952
22 ItemPage David P Pitman DMD https://www.superpages.com/new-york-ny/bpp/david-p-pitman-dmd-9139813
23 ItemPage Daniell Jason Mishaan, DMD https://www.superpages.com/new-york-ny/bpp/daniell-jason-mishaan-dmd-479623764
24 ItemPage Dolman Oral Surgery https://www.superpages.com/new-york-ny/bpp/dolman-oral-surgery-534333982
25 ItemPage Emagen Dental https://www.superpages.com/new-york-ny/bpp/emagen-dental-460512214
26 ItemPage The Exchange Dental Group https://www.superpages.com/new-york-ny/bpp/the-exchange-dental-group-462981940
27 ItemPage Joshua M. Wilges DDS & Associates https://www.superpages.com/new-york-ny/bpp/joshua-m-wilges-dds-associates-497873451
28 ItemPage Oren Rahmanan, DDS https://www.superpages.com/new-york-ny/bpp/oren-rahmanan-dds-472633138
29 ItemPage Victoria Veytsman, DDS https://www.superpages.com/new-york-ny/bpp/victoria-veytsman-dds-456826960
You could then iterate through each link to get the data from their page.
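The regex-and-json.loads step above can be sketched offline on a minimal snippet (the HTML below is a made-up stand-in for the real page; on many pages `script.string` is already valid JSON, so the regex is just a safety net):

```python
import json
import re

# Minimal made-up HTML standing in for the real page's ld+json block
html = '''<script type="application/ld+json">
{"mainEntity": {"itemListElement": [
  {"@type": "ItemPage", "name": "Upper Eastside Orthodontists",
   "url": "https://www.superpages.com/new-york-ny/bpp/upper-eastside-orthodontists-20151"}
]}}
</script>'''

# Extract the JSON payload between the script tags and parse it
m = re.search(r'<script type="application/ld\+json">(.*?)</script>', html, re.S)
data = json.loads(m.group(1))
names = [item['name'] for item in data['mainEntity']['itemListElement']]
print(names)  # ['Upper Eastside Orthodontists']
```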
The other option, which is a little tricky, is that the data is also in the HTML itself. It's only tricky in that you need to cut out the excess: there's the sponsored ad, and then more results after the initial 30 that don't follow the same HTML structure/pattern.
import requests
from bs4 import BeautifulSoup
import re
import json
import pandas as pd
url = 'https://www.superpages.com/new-york-ny/dentists?page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
businesses = soup.find_all('a', {'class':'business-name'})
rows = []
for each in businesses[1:31]:
    name = each.text
    address = each.find_next('div', {'class':'street-address'}).text
    phone = each.find_next('a', {'class':'phones phone primary'}).text.replace('Call Now','')
    row = {'name': name,
           'address': address,
           'phone': phone}
    rows.append(row)
df = pd.DataFrame(rows)
Output:
print(df.to_string())
name address phone
0 Upper Eastside Orthodontists 153 E 87th St Apt 1b, New York, NY, 10128 888-378-2976
1 Kara 30 E 60th St Rm 503, New York, NY, 10022 212-355-2195
2 Central Park West Dentistry 25 W 68th St, New York, NY, 10023 212-579-8885
3 Majid Rajabi Khamesi Advanced Family Dental 30 E 40th St Rm 705, New York, NY, 10016 212-481-2535
4 Robert Veligdan, DMD, PC 343 W 58th St, New York, NY, 10019 212-832-2330
5 Irina Rossinski, DDS 30 5th Ave Apt 1g, New York, NY, 10011 212-673-3700
6 Dr. Michael J. Wei 425 Madison Ave.20th Floor, New York, NY, 10017 646-798-6490
7 Manhattan Dental Spa 200 Madison Ave Ste 2201, New York, NY, 10016 212-683-2530
8 Expert Dental PC 110 E 40th St Rm 104, New York, NY, 10016 212-682-2965
9 Dr. Jonathan Freed, D.D.S., P.C. 315 Madison Ave Rm 509, New York, NY, 10017 212-682-5644
10 Clifford S. Melnick, DMD PC 41 W 58th St Apt 2e, New York, NY, 10019 212-355-1266
11 Ronald Birnbaum Dds 425 W 59th St, New York, NY, 10019 212-523-8030
12 Concerned Dental Care 30 E 40th St Rm 207, New York, NY, 10016 212-696-4979
13 DownTown Dental Cosmetic Center 160 Broadway, New York, NY, 10038 212-964-3337
14 Beth Caunitz, D.D.S. 30 East 40th Street, Suite 406, New York, NY, 10016 212-206-9002
15 Alice Urbankova DDS, P 630 5th Ave Ste 1860, New York, NY, 10111 212-765-7340
16 Wu Darryl DDS PC 41 Elizabeth St, New York, NY, 10013 212-925-7757
17 Gerald Rosen DDS 59 E 54th St, New York, NY, 10022 212-753-9860
18 Group Health Dental 230 W 41st St, New York, NY, 10036 212-398-9690
19 Dr. Shaun Massiah, DMD 50 W 97th St Apt 1c, New York, NY, 10025 212-222-5225
20 Park 56 Dental 120 E 56th St Rm 610, New York, NY, 10022 347-770-3915
21 Rubin Esther S 18 E 48th St, New York, NY, 10017 212-593-7272
22 David P Pitman DMD 57 W 57th St Ste 707, New York, NY, 10019 212-888-2833
23 Daniell Jason Mishaan, DMD 241 W 37th St, New York, NY, 10018 212-730-4440
24 Dolman Oral Surgery 16 E 52nd St Ste 402, New York, NY, 10022 212-696-0167
25 Emagen Dental 250 8th Ave Apt 2s, New York, NY, 10011 212-352-9300
26 The Exchange Dental Group 39 Broadway Rm 2115, New York, NY, 10006 212-422-9229
27 Joshua M. Wilges DDS & Associates 2 West 45th Street Suite 1708, New York, NY, 10036 646-590-2100
28 Oren Rahmanan, DDS 1 Rockefeller Plz Rm 2223, New York, NY, 10020 212-581-6736
29 Victoria Veytsman, DDS 509 Madison Ave Rm 1704, New York, NY, 10022 212-759-6700
Updated: Not sure I explained it well the first time.
I have a scheduling problem, or more accurately, a "first come first served" problem. A list of available assets are assigned a set of spaces, available in pairs (think cars:parking spots, diners:tables, teams:games). I need a rough simulation (random) that chooses the first two to arrive from available pairs, then chooses the next two from remaining available pairs, and so on, until all spaces are filled.
Started using teams:games to cut my teeth. The first pair is easy enough. How do I then whittle it down to fill the next two spots from among the remaining available entities? Tried a bunch of different things, but coming up short. Help appreciated.
import itertools
import numpy as np
import pandas as pd
a = ['Georgia','Oregon','Florida','Texas'], ['Georgia','Oregon','Florida','Texas']
b = [(x,y) for x,y in itertools.product(*a) if x != y]
c = pd.DataFrame(b)
c.columns = ['home', 'away']
print(c)
d = c.sample(n = 2, replace = False)
print(d)
The first result is all possible combinations. But once the first slots are filled, there can be no repeats. In the example below, once Oregon and Georgia are slated in, the only remaining options to choose from are Florida:Texas or Texas:Florida. Obviously the sample function alone produces duplicates frequently. I will need this to scale up to dozens, then hundreds of entities:slots. Many thanks in advance!
home away
0 Georgia Oregon
1 Georgia Florida
2 Georgia Texas
3 Oregon Georgia
4 Oregon Florida
5 Oregon Texas
6 Florida Georgia
7 Florida Oregon
8 Florida Texas
9 Texas Georgia
10 Texas Oregon
11 Texas Florida
home away
3 Oregon Georgia
5 Oregon Texas
Not exactly sure what you are trying to do, but if you want to randomly pair your unique entities you can simply shuffle them and then place them in a two-column dataframe. I wrote this with all the US states minus one (Wyoming):
states = ['Alaska','Alabama','Arkansas','Arizona','California',
'Colorado','Connecticut','District of Columbia','Delaware',
'Florida','Georgia','Hawaii','Iowa','Idaho','Illinois',
'Indiana','Kansas','Kentucky','Louisiana','Massachusetts',
'Maryland','Maine','Michigan','Minnesota','Missouri',
'Mississippi','Montana','North Carolina','North Dakota',
'Nebraska','New Hampshire','New Jersey','New Mexico',
'Nevada','New York','Ohio','Oklahoma','Oregon',
'Pennsylvania','Rhode Island','South Carolina',
'South Dakota','Tennessee','Texas','Utah','Virginia',
'Vermont','Washington','Wisconsin','West Virginia']
import random
import pandas as pd

a = states.copy()
random.shuffle(a)  # shuffle the copy that is actually sliced below
c = pd.DataFrame({'home': a[::2], 'away': a[1::2]})
print(c)
#Output
home away
0 West Virginia Minnesota
1 New Hampshire Louisiana
2 Nevada Florida
3 Alabama Indiana
4 Delaware North Dakota
5 Georgia Rhode Island
6 Oregon Pennsylvania
7 New York South Dakota
8 Maryland Kansas
9 Ohio Hawaii
10 Colorado Wisconsin
11 Iowa Idaho
12 Illinois Missouri
13 Arizona Mississippi
14 Connecticut Montana
15 District of Columbia Vermont
16 Tennessee Kentucky
17 Alaska Washington
18 California Michigan
19 Arkansas New Jersey
20 Massachusetts Utah
21 Oklahoma New Mexico
22 Virginia South Carolina
23 North Carolina Maine
24 Texas Nebraska
Not sure if this is exactly what you were asking for though.
If you need to schedule all the fixtures of the season, you can check this answer --> League fixture generator in python
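For the sequential "first come first served" filling described in the question, one sketch is to shuffle once and then pop two entities at a time off the remaining pool, so nothing can be assigned twice (`fill_slots` is a made-up helper name):

```python
import random

def fill_slots(entities, n_pairs=None):
    """Fill paired slots first-come-first-served: shuffle once, then
    draw two entities at a time from the remaining pool so nothing repeats."""
    pool = list(entities)
    random.shuffle(pool)
    pairs = []
    while len(pool) >= 2 and (n_pairs is None or len(pairs) < n_pairs):
        home = pool.pop()
        away = pool.pop()
        pairs.append((home, away))
    return pairs

teams = ['Georgia', 'Oregon', 'Florida', 'Texas']
print(fill_slots(teams))
```

Because each entity is removed from the pool as soon as it is drawn, this scales to hundreds of entities without any duplicate checking.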
I have a df that I need to filter into a new df, work on, and then update back into the original df, like this:
Street                                     City          State   Zip
4210 Nw Lake Dr                            Lees Summit   Mo      64064
9810 Scripps Lake Dr. Suite A San Diego    Ca - 92131
1124 Ethel St                              Glendale      Ca      91207
4000 E Bristol St Ste 3 Elkhart            In-46514
My intended output is:

Street                                     City          State   Zip
4210 Nw Lake Dr                            Lees Summit   Mo      64064
9810 Scripps Lake Dr. Suite A San Diego                  Ca      92131
1124 Ethel St                              Glendale      Ca      91207
4000 E Bristol St Ste 3 Elkhart                          In      46514
So firstly I filtered the original dataframe into a new df and worked on it with the following code:
Filter3_df = Final[Final['State'].isnull()].copy()
Filter3_df['temp'] = Filter3_df['City'].str.extract('([A-Za-z]+)')
mask2 = Filter3_df['temp'].notnull()
Filter3_df.loc[mask2, 'Zip'] = Filter3_df.loc[mask2, 'City'].str[-5:]
Filter3_df.loc[mask2, 'State'] = Filter3_df.loc[mask2, 'temp']
del Filter3_df['temp']
Filter3_df['City'] = float('NaN')
After this, the table for Filter3_df looks like this:

Street                                     City   State   Zip
9810 Scripps Lake Dr. Suite A San Diego           Ca      92131
4000 E Bristol St Ste 3 Elkhart                   In      46514
but when I update this filtered df back to the original df using

Final.update(Filter3_df)

I am not getting the intended output; instead I get:
Street                                     City          State   Zip
4210 Nw Lake Dr                            Lees Summit   Mo      64064
9810 Scripps Lake Dr. Suite A San Diego    Ca - 92131    Ca      92131
1124 Ethel St                              Glendale      Ca      91207
4000 E Bristol St Ste 3 Elkhart            In-46514      In      46514
Kindly let me know where I am going wrong.
From the docs, pandas.DataFrame.update:
Modify in place using non-NA values from another DataFrame.
Because update only uses non-NA values, the NaN you assigned to City is silently ignored. Replace Filter3_df['City'] = float('NaN'), which is NA for floats, with the value you really want:
Filter3_df['City'] = ""
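The NaN-versus-empty-string behaviour can be checked on a toy frame (the column values below are made up to mirror the question's data):

```python
import pandas as pd

final = pd.DataFrame({
    'City': ['Lees Summit', 'Ca - 92131'],
    'State': ['Mo', float('NaN')],
})
filt = final[final['State'].isnull()].copy()
filt['State'] = 'Ca'
filt['City'] = ''   # empty string is non-NA, so update() will apply it

final.update(filt)
print(final)  # row 1 now has City '' and State 'Ca'
```

Had `filt['City']` been set to NaN instead, `update` would have left the old 'Ca - 92131' value in place.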
I am trying to scrape the following website (https://www.english-heritage.org.uk/visit/blue-plaques/#?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0) and ultimately am interested in storing some of the data inside each 'li class="search-result-item"' to perform further analytics.
Example of one "search-result-item"
I want to capture the <h3>,<span class="plaque-role"> and <span class="plaque-location"> in a python dictionary:
<li class="search-result-item"><img class="search-result-image max-width" src="/siteassets/home/visit/blue-plaques/find-a-plaque/blue-plaques-f-j/helen-gwynne-vaughan-plaque.jpg?w=732&h=465&mode=crop&scale=both&cache=always&quality=60&anchor=&WebsiteVersion=20220516171525" alt="" title=""><div class="search-result-info"><h3>GWYNNE-VAUGHAN, Dame Helen (1879-1967)</h3><span class="plaque-role">Botanist and Military Officer</span><span class="plaque-location">Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden</span></div></li>
So far I am trying to isolate all the "search-result-item" but my current code prints absolutely nothing. If someone can help me sort that problem out and point me in the right direction to storing each data element into a python dictionary I would be very grateful.
from bs4 import BeautifulSoup
import requests
url = 'https://www.english-heritage.org.uk/visit/blue-plaques/#?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
#print(soup.prettify())
print(soup.find_all(class_='search-result-item')).get_text()
You're not getting anything because the search results are generated by JavaScript. Use the API endpoint they fetch the data from.
For example:
import requests
api_url = "https://www.english-heritage.org.uk/api/BluePlaqueSearch/GetMatchingBluePlaques?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0&_=1653043005731"
plaques = requests.get(api_url).json()["plaques"]
for plaque in plaques:
    print(plaque["title"])
    print(plaque["address"])
    print(f"https://www.english-heritage.org.uk{plaque['path']}")
    print("-" * 80)
Output:
GWYNNE-VAUGHAN, Dame Helen (1879-1967)
Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden
https://www.english-heritage.org.uk/visit/blue-plaques/helen-gwynne-vaughan/
--------------------------------------------------------------------------------
READING, Lady Stella (1894-1971)
41 Tothill Street, London, City of Westminster, SW1H 9LQ, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/stella-lady-reading/
--------------------------------------------------------------------------------
32 SOHO SQUARE
32 Soho Square, Soho, London, W1D 3AP, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/soho-square/
--------------------------------------------------------------------------------
14 BUCKINGHAM STREET
14 Buckingham Street, Covent Garden, London, WC2N 6DF, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/buckingham-street/
--------------------------------------------------------------------------------
ABRAHAMS, Harold (1899-1978)
Hodford Lodge, 2 Hodford Road, Golders Green, London, NW11 8NP, London Borough of Barnet
https://www.english-heritage.org.uk/visit/blue-plaques/abrahams-harold/
--------------------------------------------------------------------------------
ADAM, ROBERT and HOOD, THOMAS and GALSWORTHY, JOHN and BARRIE, SIR JAMES
1-3 Robert Street, Adelphi, Charing Cross, London, WC2N 6BN, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/adam-hood-galsworthy-barrie/
--------------------------------------------------------------------------------
ADAMS, Henry Brooks (1838-1918)
98 Portland Place, Marylebone, London, W1B 1ET, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/united-states-embassy/
--------------------------------------------------------------------------------
ADELPHI, The
The Adelphi Terrace, Charing Cross, London, WC2N 6BJ, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/adelphi/
--------------------------------------------------------------------------------
ALDRIDGE, Ira (1807-1867)
5 Hamlet Road, Upper Norwood, London, SE19 2AP, London Borough of Bromley
https://www.english-heritage.org.uk/visit/blue-plaques/aldridge-ira/
--------------------------------------------------------------------------------
ALEXANDER, Sir George (1858-1918)
57 Pont Street, Chelsea, London, SW1X 0BD, London Borough of Kensington And Chelsea
https://www.english-heritage.org.uk/visit/blue-plaques/george-alexander/
--------------------------------------------------------------------------------
ALLENBY, Field Marshal Edmund Henry Hynman, Viscount Allenby (1861-1936)
24 Wetherby Gardens, South Kensington, London, SW5 0JR, London Borough of Kensington And Chelsea
https://www.english-heritage.org.uk/visit/blue-plaques/field-marshal-viscount-allenby/
--------------------------------------------------------------------------------
ALMA-TADEMA, Sir Lawrence, O.M. (1836-1912)
44 Grove End Road, St John's Wood, London, NW8 9NE, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/lawrence-alma-tadema/
--------------------------------------------------------------------------------
Content is generated dynamically by JavaScript, so you won't find the elements/info you are looking for with BeautifulSoup; instead, use their API.
Example
import requests
url = 'https://www.english-heritage.org.uk/api/BluePlaqueSearch/GetMatchingBluePlaques?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0'
page = requests.get(url).json()
data = []
for e in page['plaques']:
    data.append({k: v for k, v in e.items() if k in ['title', 'professions', 'address']})
data
Output
[{'title': 'GWYNNE-VAUGHAN, Dame Helen (1879-1967)', 'address': 'Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden', 'professions': 'Botanist and Military Officer'}, {'title': 'READING, Lady Stella (1894-1971)', 'address': '41 Tothill Street, London, City of Westminster, SW1H 9LQ, City Of Westminster', 'professions': "Founder of the Women's Voluntary Service"}, {'title': '32 SOHO SQUARE', 'address': '32 Soho Square, Soho, London, W1D 3AP, City Of Westminster', 'professions': 'Botanists'}, {'title': '14 BUCKINGHAM STREET', 'address': '14 Buckingham Street, Covent Garden, London, WC2N 6DF, City Of Westminster', 'professions': 'Statesman, Diarist, Naval Official, Painter'}, {'title': 'ABRAHAMS, Harold (1899-1978)', 'address': 'Hodford Lodge, 2 Hodford Road, Golders Green, London, NW11 8NP, London Borough of Barnet', 'professions': 'Athlete'}, ...]
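If you want the fields keyed by the names from the question's HTML (h3, plaque-role, plaque-location), the reshaping can be sketched offline on a record shaped like the API output above (the sample values are copied from that output):

```python
# Sample record shaped like one entry of the API's "plaques" list
plaque = {
    'title': 'GWYNNE-VAUGHAN, Dame Helen (1879-1967)',
    'professions': 'Botanist and Military Officer',
    'address': 'Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden',
}

# Map the API field names back to the names used in the page's HTML
record = {
    'h3': plaque['title'],
    'plaque-role': plaque['professions'],
    'plaque-location': plaque['address'],
}
print(record['h3'])  # GWYNNE-VAUGHAN, Dame Helen (1879-1967)
```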
#!/usr/bin/env python
import re
import requests
from bs4 import BeautifulSoup
import csv
page = requests.get('https://salesweb.civilview.com/Sales/SalesSearch?countyId=32')
soup = BeautifulSoup(page.text, 'html.parser')
list_ = soup.find(class_='table-striped')
list_items = list_.find_all('tr')
content = list_items
d = re.findall(r"<td>\d*/\d*/\d+</td>",str(content))
#d = re.findall(r"<td>\d*/\d*/\d+</td>|<td>\d*?\s.+\d+</td>",str(content))
a = re.findall(r"<td>\d*?\s.+\d+?.*</td>",str(content))
res = d+a
for tup in res:
    tup = re.sub("<td>", '', str(tup))
    tup = re.sub("</td>", '', str(tup))
    print(tup)
I'm getting sale dates and then addresses when just printing to screen. I have tried several things to get this into a CSV, but I end up with all the data in one column or one row. I would like just two columns, sale dates and addresses, with one row per sale.
This is what I get just using print()
8/25/2021
9/1/2021
9/1/2021
9/1/2021
9/1/2021
9/1/2021
9/8/2021
9/8/2021
9/8/2021
9/8/2021
9/15/2021
9/15/2021
9/15/2021
9/15/2021
9/15/2021
9/15/2021
9/22/2021
9/29/2021
9/29/2021
9/29/2021
11/17/2021
4/30/3021
40 PAVILICA ROAD STOCKTON NJ 08559
129 KINGWOOD LOCKTOWN ROAD FRENCHTOWN NJ 08825
63 PHLOX COURT WHITEHOUSE STATION NJ 08889
41 WESTCHESTER TERRACE UNIT 11 CLINTON NJ 08809
461 LITTLE YORK MOUNT PLEASANT ROAD MILFORD NJ 08848
9 MAPLE AVENUE FRENCHTOWN NJ 08825
95 BARTON HOLLOW ROAD FLEMINGTON NJ 08822
27 WORMAN ROAD STOCKTON NJ 08559
30 COLD SPRINGS ROAD CALIFON NJ 07830
211 OLD CROTON ROAD FLEMINGTON NJ 08822
3 BRIAR LANE FLEMINGTON NJ 08822(VACANT)
61 N. FRANKLIN STREET LAMBERTVILLE NJ 08530
802 SPRUCE HILLS DRIVE GLEN GARDNER NJ 08826
2155 STATE ROUTE 31 GLEN GARDNER NJ 08826
80 SCHAAF ROAD BLOOMSBURY NJ 08804
9 CAMBRIDGE DRIVE MILFORD NJ 08848
5 VAN FLEET ROAD NESHANIC STATION NJ 08853
34 WASHINGTON STREET ANNANDALE NJ 08801
229 MILFORD MT PLEASANT ROAD MILFORD NJ 08848
1608 COUNTY ROAD 519 FRENCHTOWN NJ 08825
29 OLD SCHOOLHOUSE ROAD ASBURY NJ 08802
28 ROSE RUN LAMBERTVILLE NJ 08530
Any help would be great; I have been playing with this all day and can't seem to get it right no matter what I try.
My two cents:
#!/usr/bin/env python
import re
import requests
from bs4 import BeautifulSoup
import csv
separator = ','
page = requests.get('https://salesweb.civilview.com/Sales/SalesSearch?countyId=32')
soup = BeautifulSoup(page.text, 'html.parser')
list_ = soup.find(class_='table-striped')
list_items = list_.find_all('tr')
content = list_items
d = re.findall(r"<td>\d*/\d*/\d+</td>",str(content))
a = re.findall(r"<td>\d*?\s.+\d+?.*</td>",str(content))
for date, address in zip(d, a):
    print(re.sub("</td>|<td>", '', str(date)),
          separator,
          re.sub("</td>|<td>", '', str(address)))
Output, date and address are now in one row:
8/25/2021 , 40 PAVILICA ROAD STOCKTON NJ 08559
9/1/2021 , 129 KINGWOOD LOCKTOWN ROAD FRENCHTOWN NJ 08825
9/1/2021 , 63 PHLOX COURT WHITEHOUSE STATION NJ 08889
9/1/2021 , 41 WESTCHESTER TERRACE UNIT 11 CLINTON NJ 08809
9/1/2021 , 461 LITTLE YORK MOUNT PLEASANT ROAD MILFORD NJ 08848
9/1/2021 , 9 MAPLE AVENUE FRENCHTOWN NJ 08825
9/8/2021 , 95 BARTON HOLLOW ROAD FLEMINGTON NJ 08822
9/8/2021 , 27 WORMAN ROAD STOCKTON NJ 08559
9/8/2021 , 30 COLD SPRINGS ROAD CALIFON NJ 07830
9/8/2021 , 211 OLD CROTON ROAD FLEMINGTON NJ 08822
9/15/2021 , 3 BRIAR LANE FLEMINGTON NJ 08822(VACANT)
9/15/2021 , 61 N. FRANKLIN STREET LAMBERTVILLE NJ 08530
9/15/2021 , 802 SPRUCE HILLS DRIVE GLEN GARDNER NJ 08826
9/15/2021 , 2155 STATE ROUTE 31 GLEN GARDNER NJ 08826
9/15/2021 , 80 SCHAAF ROAD BLOOMSBURY NJ 08804
9/15/2021 , 9 CAMBRIDGE DRIVE MILFORD NJ 08848
9/22/2021 , 5 VAN FLEET ROAD NESHANIC STATION NJ 08853
9/29/2021 , 34 WASHINGTON STREET ANNANDALE NJ 08801
9/29/2021 , 229 MILFORD MT PLEASANT ROAD MILFORD NJ 08848
9/29/2021 , 1608 COUNTY ROAD 519 FRENCHTOWN NJ 08825
11/17/2021 , 29 OLD SCHOOLHOUSE ROAD ASBURY NJ 08802
4/30/3021 , 28 ROSE RUN LAMBERTVILLE NJ 08530
Extra, to export to CSV using pandas:
import pandas as pd
date_list = []
address_list = []
for date, address in zip(d, a):
    date_list.append(re.sub("</td>|<td>", '', str(date)))
    address_list.append(re.sub("</td>|<td>", '', str(address)))
df = pd.DataFrame([date_list, address_list]).T
df.columns = ['Date', 'Address']
df.to_csv('data.csv')
It seems to me that instead of using two regular expressions you should rather use one with named groups. I leave it to you to try.
Given that you have two corresponding lists of values, the simplest way would be instead of concatenating:
res = d+a
just going through pairs of them:
for tup, tup2 in zip(d, a):
    tup = re.sub("<td>", '', str(tup))
    tup = re.sub("</td>", '', str(tup))
    tup2 = re.sub("<td>", '', str(tup2))
    tup2 = re.sub("</td>", '', str(tup2))
    print(tup, tup2)
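Since the question already imports csv, the stdlib writer can also produce the two-column file directly; the d and a lists below are short made-up stand-ins shaped like the regex matches above:

```python
import csv
import re

# Made-up matches shaped like the <td>...</td> regex output above
d = ['<td>8/25/2021</td>', '<td>9/1/2021</td>']
a = ['<td>40 PAVILICA ROAD STOCKTON NJ 08559</td>',
     '<td>129 KINGWOOD LOCKTOWN ROAD FRENCHTOWN NJ 08825</td>']

with open('sales.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Date', 'Address'])
    for date, address in zip(d, a):
        # strip the literal opening/closing td tags before writing
        writer.writerow([re.sub('</?td>', '', date),
                         re.sub('</?td>', '', address)])
```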
#!/usr/bin/env python
import re
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
page = requests.get('https://salesweb.civilview.com/Sales/SalesSearch?countyId=32')
soup = BeautifulSoup(page.text, 'html.parser')
list_ = soup.find(class_='table-striped')
list_items = list_.find_all('tr')
content = list_items
d = re.findall(r"<td>\d*/\d*/\d+</td>",str(content)) #this is a list
#d = re.findall(r"<td>\d*/\d*/\d+</td>|<td>\d*?\s.+\d+</td>",str(content))
a = re.findall(r"<td>\d*?\s.+\d+?.*</td>",str(content)) #this is a list
## create a dataframe with the two lists and remove the tags
df = pd.DataFrame(list(zip(d, a)), columns=['sales_date', 'address'])
for cols in df.columns:
    # replace() removes the literal tags; lstrip/rstrip would strip characters
    df[cols] = df[cols].map(lambda x: x.replace('<td>', '').replace('</td>', ''))
df.to_csv("result.csv")
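Worth noting when cleaning tags like these: str.lstrip/str.rstrip strip a set of characters, not a literal substring, so str.replace (or re.sub) is the safer tool:

```python
s = '<td>date</td>'
# lstrip strips any of the characters '<', 't', 'd', '>' -- including the
# leading 'd' of the data itself:
print(s.lstrip('<td>'))   # 'ate</td>'
# replace removes the literal substring and leaves the data intact:
print(s.replace('<td>', '').replace('</td>', ''))  # 'date'
```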
I am trying to create two columns in my dataframe, Longitude and Latitude, which I want to find by using my address column called 'Details'.
I have tried:

from geopy.extra.rate_limiter import RateLimiter
locator = Nominatim(user_agent="MyGeocoder")
results['location'] = results['Details'].apply
results['point'] = results['location'].apply(lambda loc: tuple(loc['point']) if loc else None)
results[['latitude', 'longitude']] = pd.DataFrame(results['point'].tolist(), index=results.index)
But this gives the error "method object is not subscriptable"
I want to create a loop to get all coordinates for each address
Details Sale Price Post Code Year Sold
1 53 Eastbury Grove, London, W4 2JT Flat, Lease... 450000.0 W4 2020
2 Flat 148 Wedgwood House Lambeth Walk, London, ... 325000.0 E11 2020
3 63 Russell Road, Wimbledon, London, SW19 1QN ... 800000.0 W19 2020
4 Flat 2 9 Queens Gate Place, London, SW7 5NX F... 400000.0 W7 2020
5 83 Chingford Mount Road, London, E4 8LU Freeh... 182000.0 E4 2020
... ... ... ... ...
47 702 Rutherford Heights Rodney Road, London, SE... 554750.0 E17 2015
48 Flat 48 Highlands Court Highland Road, London,... 340000.0 E19 2015
49 5 Mount Nod Road, London, SW16 2LQ Flat, Leas... 395000.0 W16 2015
50 6 Woodmill Street, London, SE16 3GG Terraced,... 1010000.0 E16 2015
51 402 Rutherford Heights Rodney Road, London, SE... 403200.0 E17 2015
300 rows × 4 columns
Try this
import pandas as pd
import geopandas
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
locator = Nominatim(user_agent="myGeocoder")
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
def lat_long(row):
    # use the rate-limited geocode defined above to respect Nominatim's usage policy
    loc = geocode(row["Details"])
    row["latitude"] = loc.latitude if loc else None
    row["longitude"] = loc.longitude if loc else None
    return row

results = results.apply(lat_long, axis=1)
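Note that Nominatim returns None for addresses it cannot match, which would make loc.latitude raise AttributeError. The guard pattern can be sketched without hitting the network (fake_geocode is a made-up stand-in, not part of geopy):

```python
def fake_geocode(addr):
    """Made-up stand-in for locator.geocode: returns (lat, lon) or None."""
    known = {'London': (51.5074, -0.1278)}
    return known.get(addr)

def safe_lat_long(addr, geocode_fn):
    # Guard against addresses the geocoder cannot match
    loc = geocode_fn(addr)
    if loc is None:
        return (None, None)
    return loc

print(safe_lat_long('London', fake_geocode))   # (51.5074, -0.1278)
print(safe_lat_long('Nowhere', fake_geocode))  # (None, None)
```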