How to get all data using beautifulsoup? - python

I am trying to scrape all the addresses for "Recent Sales" in this page:
https://www.compass.com/agents/irene-vuong/
My current code looks like:
import requests
from bs4 import BeautifulSoup

url = 'https://www.compass.com/agents/irene-vuong/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

for item in soup.findAll('div', attrs={'class': 'uc-listingCard-content'}):
    new = item.find('a', attrs={'class': 'uc-listingCard-title'})
    print(new.text)
My output is :
256-258 Wyckoff Street
1320 Glenwood Road
1473 East 55th Street
145 Winter Avenue
25-02 Brookhaven Avenue
which are the addresses of the "current" listings.
My expected output is:
352 94th Street
1754 West 12th Street
2283 E 23rd st
2063 Brown Street
3423 Avenue U
2256 Stuart Street
Which are the addresses under "Recent Sales". No matter what, I only get current listing addresses, but not all listing addresses. I tried to use re.compile(r'Recent Sales') but it would not work. I'm not sure how to get to "Recent Sales".
Any help will be greatly appreciated.
+++++
I also tried to use text 'Recent Sales' as below:
for item in soup.findAll(text=re.compile(r'Recent Sales')).findNext():
    for i in item.find('div', attrs={'class': 'profile-acive-listings'}):
        new = i.find('a', attrs={'class': 'uc-listingCard-title'})
        print(new.text)
But I get an error of:
AttributeError: ResultSet object has no attribute 'findNext'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
+++ I also tried to use the data-tn attribute with the value recent-sales:
for item in soup.findAll('div', attrs={'data-tn': 'recent-sales'}):
    new = item.findAll('a', attrs={'class': 'uc-listingCard-title'})
    print(new.text)
But it won't return anything.

The "Recent Sales" section is rendered by JavaScript, so it is not in the HTML that requests receives. You can use Selenium instead: it renders the page in an automated browser, and from the rendered page you can get the full HTML and retrieve all of the listings.
Try this:
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get("https://www.compass.com/agents/irene-vuong/")
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')

for item in soup.findAll('div', attrs={'class': 'uc-listingCard-content'}):
    new = item.find('a', attrs={'class': 'uc-listingCard-title'})
    print(new.text)
This prints out:
256-258 Wyckoff Street
1320 Glenwood Road
1473 East 55th Street
145 Winter Avenue
25-02 Brookhaven Avenue
352 94th Street
1754 West 12th Street
2283 E 23rd St
2063 Brown Street
3423 Avenue U
2256 Stuart Street
East 61st Street
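If the recent sales are injected a moment after the initial page load, it can also help to wait for the listing cards explicitly before reading page_source. A small sketch on top of the code above (the class name is the same one used in the selector; browser is the driver created above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for at least one listing card to appear, then read the HTML
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "uc-listingCard-content"))
)
html = browser.page_source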
Edit:
If you want to parse the data from the raw HTML, you have to pull it out of a script tag.
Try this:
import json
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://www.compass.com/agents/irene-vuong/'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
script = soup.find_all("script")[4]
data = json.loads(script.text.split("window.__AGENT_PROFILE__ = ")[1])
data = data["data"]
df_sales = pd.DataFrame(data["closedDeals"]["sales"])
df_rentals = pd.DataFrame(data["closedDeals"]["rentals"])
This gives you Pandas dataframes with all the listing data like this.
listingIdSHA listingType location size price detailedInfo media dealInfo isOffMLS pageLink pageLinkSlug canonicalPageLink userListingCompliance
0 210837948508195937 2 {'prettyAddress': '352 94th Street', 'city': '... {'bedrooms': 4, 'bathrooms': 2.75} {'lastKnown': 1250000, 'formatted': '$1,250,000'} {'amenities': ['Driveway', 'Open Kitchen', 'Ga... [{'category': 0, 'thumbnailUrl': 'https://d278... {'disclaimer': 'No guarantee, warranty or repr... False /listing/352-94th-street-brooklyn-ny-11209/210... 352-94th-street-brooklyn-ny-11209 /listing/352-94th-street-brooklyn-ny-11209/210... {'descriptionCompliance': 0}
1 122690464561282785 2 {'prettyAddress': '1754 West 12th Street', 'ci... {'bedrooms': 4, 'bathrooms': 2} {'lastKnown': 1040000, 'formatted': '$1,040,000'} {'amenities': ['Basement', 'Private Outdoor Sp... [{'category': 0, 'thumbnailUrl': 'https://d278... {'disclaimer': 'No guarantee, warranty or repr... False /listing/1754-west-12th-street-brooklyn-ny-112... 1754-west-12th-street-brooklyn-ny-11223 /listing/1754-west-12th-street-brooklyn-ny-112... {'descriptionCompliance': 0}
2 NaN 2 {'prettyAddress': '2283 E 23rd St', 'neighborh... {'bedrooms': 3, 'bathrooms': 2} {'lastKnown': 800000, 'formatted': '$800,000'} NaN [{'category': 0, 'thumbnailUrl': 'https://d278... {'disclaimer': 'No guarantee, warranty or repr... False NaN 2283-e-23rd-st NaN NaN
3 235974146369023201 2 {'prettyAddress': '2063 Brown Street', 'city':... {'bedrooms': 3, 'bathrooms': 2} {'lastKnown': 755000, 'formatted': '$755,000'} NaN [{'category': 0, 'thumbnailUrl': 'https://d278... {'disclaimer': 'No guarantee, warranty or repr... False /listing/2063-brown-street-brooklyn-ny-11229/2... 2063-brown-street-brooklyn-ny-11229 /listing/2063-brown-street-brooklyn-ny-11229/2... {'descriptionCompliance': 0}
4 186865317970981409 2 {'prettyAddress': '3423 Avenue U', 'city': 'Br... {'bedrooms': 5, 'bathrooms': 2} {'lastKnown': 627000, 'formatted': '$627,000'} {'amenities': ['Hardwood Floors', 'Garage', 'C... [{'category': 0, 'thumbnailUrl': 'https://d278... {'disclaimer': 'No guarantee, warranty or repr... False /listing/3423-avenue-u-brooklyn-ny-11234/18686... 3423-avenue-u-brooklyn-ny-11234 /listing/3423-avenue-u-brooklyn-ny-11234/18686... {'descriptionCompliance': 0}
5 286987776170131617 2 {'prettyAddress': '2256 Stuart Street', 'city'... {'bedrooms': 3, 'bathrooms': 1} {'lastKnown': 533000, 'formatted': '$533,000'} NaN [{'category': 0, 'thumbnailUrl': 'https://d278... {'disclaimer': 'No guarantee, warranty or repr... False /listing/2256-stuart-street-brooklyn-ny-11229/... 2256-stuart-street-brooklyn-ny-11229 /listing/2256-stuart-street-brooklyn-ny-11229/...
To retrieve just the listing addresses, use this further step:
from pandas import json_normalize
df_sales = df_sales.location.apply(lambda x: dict(x))
df_sales = json_normalize(df_sales)
df_rentals = df_rentals.location.apply(lambda x: dict(x))
df_rentals = json_normalize(df_rentals)
Output:
prettyAddress city state zipCode geoId neighborhood subNeighborhoods
0 352 94th Street Brooklyn NY 11209 nyc NaN NaN
1 1754 West 12th Street Brooklyn NY 11223 nyc NaN NaN
2 2283 E 23rd St NaN NaN NaN nyc Sheepshead Bay [Sheepshead Bay]
3 2063 Brown Street Brooklyn NY 11229 nyc NaN NaN
4 3423 Avenue U Brooklyn NY 11234 nyc NaN NaN
5 2256 Stuart Street Brooklyn NY 11229 nyc NaN NaN
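As an aside, the hard-coded [4] index into the script tags can break if the page layout changes. A more defensive lookup could search the script contents for the marker string instead (a sketch, assuming the window.__AGENT_PROFILE__ assignment is still present on the page):

marker = "window.__AGENT_PROFILE__ = "
script = next(s for s in soup.find_all("script") if marker in s.text)
data = json.loads(script.text.split(marker, 1)[1])["data"]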
Edit:
You can get cleaner data like so:
df_sales = pd.DataFrame(data["closedDeals"]["sales"])
columns = ['listingIdSHA', 'listingType', 'location', 'size', 'price']
df_sales = df_sales[columns]

expanded_data = []
for column in ['location', 'size', 'price']:
    expanded = df_sales[column].apply(lambda x: dict(x))
    expanded_data.append(json_normalize(expanded))

expanded_data = pd.concat(expanded_data, axis=1)
df_sales_cleaned = pd.concat([df_sales[['listingIdSHA', 'listingType']], expanded_data], axis=1)
display(df_sales_cleaned)
Output:
listingIdSHA listingType prettyAddress city state zipCode geoId neighborhood subNeighborhoods bedrooms bathrooms lastKnown formatted
0 210837948508195937 2 352 94th Street Brooklyn NY 11209 nyc NaN NaN 4 2.75 1250000 $1,250,000
1 122690464561282785 2 1754 West 12th Street Brooklyn NY 11223 nyc NaN NaN 4 2.00 1040000 $1,040,000
2 NaN 2 2283 E 23rd St NaN NaN NaN nyc Sheepshead Bay [Sheepshead Bay] 3 2.00 800000 $800,000
3 235974146369023201 2 2063 Brown Street Brooklyn NY 11229 nyc NaN NaN 3 2.00 755000 $755,000
4 186865317970981409 2 3423 Avenue U Brooklyn NY 11234 nyc NaN NaN 5 2.00 627000 $627,000
5 286987776170131617 2 2256 Stuart Street Brooklyn NY 11229 nyc NaN NaN 3 1.00 533000 $533,000

I recently had a project where I used something like this too.
Try the code like this:
for item in soup.findAll(text=re.compile(r'Recent Sales')):
    section = item.findNext('div', {'class': 'profile-acive-listings'})
    for link in section.findAll('a', {'class': 'uc-listingCard-title'}):
        print(link.text)

Related

python combining 2 re.findall strings in columns and rows in csv

#!/usr/bin/env python
import re
import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://salesweb.civilview.com/Sales/SalesSearch?countyId=32')
soup = BeautifulSoup(page.text, 'html.parser')
list_ = soup.find(class_='table-striped')
list_items = list_.find_all('tr')
content = list_items

d = re.findall(r"<td>\d*/\d*/\d+</td>", str(content))
#d = re.findall(r"<td>\d*/\d*/\d+</td>|<td>\d*?\s.+\d+</td>", str(content))
a = re.findall(r"<td>\d*?\s.+\d+?.*</td>", str(content))
res = d + a

for tup in res:
    tup = re.sub("<td>", '', str(tup))
    tup = re.sub("</td>", '', str(tup))
    print(tup)
I'm getting the sale dates and then the addresses when just printing to screen. I have tried several things to write this to CSV, but I end up with all the data in one column or in one row. I would like two columns, sale date and address, with one row per sale.
This is what I get just using print()
8/25/2021
9/1/2021
9/1/2021
9/1/2021
9/1/2021
9/1/2021
9/8/2021
9/8/2021
9/8/2021
9/8/2021
9/15/2021
9/15/2021
9/15/2021
9/15/2021
9/15/2021
9/15/2021
9/22/2021
9/29/2021
9/29/2021
9/29/2021
11/17/2021
4/30/3021
40 PAVILICA ROAD STOCKTON NJ 08559
129 KINGWOOD LOCKTOWN ROAD FRENCHTOWN NJ 08825
63 PHLOX COURT WHITEHOUSE STATION NJ 08889
41 WESTCHESTER TERRACE UNIT 11 CLINTON NJ 08809
461 LITTLE YORK MOUNT PLEASANT ROAD MILFORD NJ 08848
9 MAPLE AVENUE FRENCHTOWN NJ 08825
95 BARTON HOLLOW ROAD FLEMINGTON NJ 08822
27 WORMAN ROAD STOCKTON NJ 08559
30 COLD SPRINGS ROAD CALIFON NJ 07830
211 OLD CROTON ROAD FLEMINGTON NJ 08822
3 BRIAR LANE FLEMINGTON NJ 08822(VACANT)
61 N. FRANKLIN STREET LAMBERTVILLE NJ 08530
802 SPRUCE HILLS DRIVE GLEN GARDNER NJ 08826
2155 STATE ROUTE 31 GLEN GARDNER NJ 08826
80 SCHAAF ROAD BLOOMSBURY NJ 08804
9 CAMBRIDGE DRIVE MILFORD NJ 08848
5 VAN FLEET ROAD NESHANIC STATION NJ 08853
34 WASHINGTON STREET ANNANDALE NJ 08801
229 MILFORD MT PLEASANT ROAD MILFORD NJ 08848
1608 COUNTY ROAD 519 FRENCHTOWN NJ 08825
29 OLD SCHOOLHOUSE ROAD ASBURY NJ 08802
28 ROSE RUN LAMBERTVILLE NJ 08530
Any help would be great. I have been playing with this all day and can't seem to get it right no matter what I try.
My two cents :
#!/usr/bin/env python
import re
import requests
from bs4 import BeautifulSoup
import csv

separator = ','

page = requests.get('https://salesweb.civilview.com/Sales/SalesSearch?countyId=32')
soup = BeautifulSoup(page.text, 'html.parser')
list_ = soup.find(class_='table-striped')
list_items = list_.find_all('tr')
content = list_items

d = re.findall(r"<td>\d*/\d*/\d+</td>", str(content))
a = re.findall(r"<td>\d*?\s.+\d+?.*</td>", str(content))

for date, address in zip(d, a):
    print(re.sub("</td>|<td>", '', str(date)),
          separator,
          re.sub("</td>|<td>", '', str(address)))
Output, date and address are now in one row:
8/25/2021 , 40 PAVILICA ROAD STOCKTON NJ 08559
9/1/2021 , 129 KINGWOOD LOCKTOWN ROAD FRENCHTOWN NJ 08825
9/1/2021 , 63 PHLOX COURT WHITEHOUSE STATION NJ 08889
9/1/2021 , 41 WESTCHESTER TERRACE UNIT 11 CLINTON NJ 08809
9/1/2021 , 461 LITTLE YORK MOUNT PLEASANT ROAD MILFORD NJ 08848
9/1/2021 , 9 MAPLE AVENUE FRENCHTOWN NJ 08825
9/8/2021 , 95 BARTON HOLLOW ROAD FLEMINGTON NJ 08822
9/8/2021 , 27 WORMAN ROAD STOCKTON NJ 08559
9/8/2021 , 30 COLD SPRINGS ROAD CALIFON NJ 07830
9/8/2021 , 211 OLD CROTON ROAD FLEMINGTON NJ 08822
9/15/2021 , 3 BRIAR LANE FLEMINGTON NJ 08822(VACANT)
9/15/2021 , 61 N. FRANKLIN STREET LAMBERTVILLE NJ 08530
9/15/2021 , 802 SPRUCE HILLS DRIVE GLEN GARDNER NJ 08826
9/15/2021 , 2155 STATE ROUTE 31 GLEN GARDNER NJ 08826
9/15/2021 , 80 SCHAAF ROAD BLOOMSBURY NJ 08804
9/15/2021 , 9 CAMBRIDGE DRIVE MILFORD NJ 08848
9/22/2021 , 5 VAN FLEET ROAD NESHANIC STATION NJ 08853
9/29/2021 , 34 WASHINGTON STREET ANNANDALE NJ 08801
9/29/2021 , 229 MILFORD MT PLEASANT ROAD MILFORD NJ 08848
9/29/2021 , 1608 COUNTY ROAD 519 FRENCHTOWN NJ 08825
11/17/2021 , 29 OLD SCHOOLHOUSE ROAD ASBURY NJ 08802
4/30/3021 , 28 ROSE RUN LAMBERTVILLE NJ 08530
Extra, to export to CSV using pandas:
import pandas as pd

date_list = []
address_list = []

for date, address in zip(d, a):
    date_list.append(re.sub("</td>|<td>", '', str(date)))
    address_list.append(re.sub("</td>|<td>", '', str(address)))

df = pd.DataFrame([date_list, address_list]).T
df.columns = ['Date', 'Address']
df.to_csv('data.csv')
It seems to me that instead of using two regular expressions you should rather use one with named groups. I leave it to you to try.
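For illustration only, such a pattern might look like the sketch below. It is not tested against the live table and assumes each date cell is eventually followed, within the same row, by an address cell that starts with a house number (content and separator as defined above):

row_re = re.compile(
    r"<td>(?P<date>\d+/\d+/\d+)</td>.*?<td>(?P<address>\d+\s[^<]+)</td>",
    re.DOTALL)

for m in row_re.finditer(str(content)):
    print(m.group("date"), separator, m.group("address"))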
Given that you have two corresponding lists of values, the simplest fix is, instead of concatenating them with
res = d + a
to go through them in pairs:
for tup, tup2 in zip(d, a):
    tup = re.sub("<td>", '', str(tup))
    tup = re.sub("</td>", '', str(tup))
    tup2 = re.sub("<td>", '', str(tup2))
    tup2 = re.sub("</td>", '', str(tup2))
    print(tup, tup2)
#!/usr/bin/env python
import re
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

page = requests.get('https://salesweb.civilview.com/Sales/SalesSearch?countyId=32')
soup = BeautifulSoup(page.text, 'html.parser')
list_ = soup.find(class_='table-striped')
list_items = list_.find_all('tr')
content = list_items

d = re.findall(r"<td>\d*/\d*/\d+</td>", str(content))  # this is a list
#d = re.findall(r"<td>\d*/\d*/\d+</td>|<td>\d*?\s.+\d+</td>", str(content))
a = re.findall(r"<td>\d*?\s.+\d+?.*</td>", str(content))  # this is a list

## create a dataframe from the two lists and strip the <td> tags
## (a regex sub is safer than lstrip/rstrip, which strip sets of characters, not substrings)
df = pd.DataFrame(list(zip(d, a)), columns=['sales_date', 'address'])
for cols in df.columns:
    df[cols] = df[cols].map(lambda x: re.sub(r'</?td>', '', x))
df.to_csv("result.csv")
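If you prefer the csv module that is already imported over pandas, a minimal sketch that writes the same two columns (reusing the d and a lists from above):

with open('result.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['sales_date', 'address'])
    for date, address in zip(d, a):
        # strip the <td> and </td> wrappers before writing the row
        writer.writerow([re.sub(r'</?td>', '', date), re.sub(r'</?td>', '', address)])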

Problem concatenating URL and scraping data

I am trying to append a URL in python to scrape details from the target URL.
I have the below code but it seems to be scraping the data from url1 rather than URL.
I have scraped the team names from the NFL website without any issue. The issue is with the spotrac URL, where I am appending the team name scraped from the NFL website.
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = 'https://www.nfl.com/teams/'
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')

team_name = []
team_name_list = soup.find_all('h4', class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')

for team in team_name_list:
    if team.find('p'):
        team_name.append(team.text)

for team in team_name:
    team = team.replace(" ", "-").lower()
    url1 = 'https://www.spotrac.com/nfl/rankings/'
    URL = url1 + str(team)
    print(URL)

    data = {
        'ajax': 'true',
        'mobile': 'false'
    }

    bs_soup = BeautifulSoup(requests.post(URL, data=data).content, 'html.parser')
    spotrac_df = pd.DataFrame(columns=['Name', 'Salary'])

    for h3 in bs_soup.select('h3'):
        spotrac_df = spotrac_df.append(pd.DataFrame({'Name': str(h3.text), 'Salary': str(h3.find_next(class_="rank-value").text)}, index=[0]), ignore_index=False)
I'm almost certain the problem is coming from the URL not appending properly. The scraping is taking the salaries etc from url1 rather than URL.
My console output (using Spyder IDE) is as below for print(URL)
The URL is appending correctly, but you have leading white space in your team names. I also made a few other changes and noted them in the code.
Lastly (and I used to do this too), creating an empty dataframe and then appending to it on each iteration isn't the best method. It's better to collect your rows as a list of dictionaries and, when done, call pandas once to construct the dataframe, so I changed that as well.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.nfl.com/teams/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

team_name = []
team_name_list = soup.find_all('h4', class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')

for team in team_name_list:
    if team.find('p'):
        team_name.append(team.text.strip())  # <- remove leading/trailing white space

url1 = 'https://www.spotrac.com/nfl/rankings/'  # <- since this is fixed, put it before the loop
spotrac_rows = []

for team in team_name:
    team = '-'.join(team.split()).lower()  # <- changed to split in case there are 2 spaces between city and team
    url = url1 + str(team)
    print(url)

    data = {
        'ajax': 'true',
        'mobile': 'false'
    }

    bs_soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')
    for h3 in bs_soup.select('h3'):
        spotrac_rows.append({'Name': str(h3.text), 'Salary': str(h3.find_next(class_="rank-value").text.strip())})  # <- remove white space from the salary

spotrac_df = pd.DataFrame(spotrac_rows)
Output:
print(spotrac_df)
Name Salary
0 Chandler Jones $21,333,333
1 Patrick Peterson $13,184,588
2 D.J. Humphries $12,800,000
3 DeAndre Hopkins $12,500,000
4 Larry Fitzgerald $11,750,000
5 Jordan Hicks $10,500,000
6 Justin Pugh $10,500,000
7 Kenyan Drake $8,483,000
8 Kyler Murray $8,080,601
9 Robert Alford $7,500,000
10 J.R. Sweezy $6,500,000
11 Corey Peters $4,437,500
12 Haason Reddick $4,288,444
13 Jordan Phillips $4,000,000
14 Isaiah Simmons $3,757,101
15 Maxx Williams $3,400,000
16 Zane Gonzalez $3,259,000
17 Devon Kennard $2,500,000
18 Budda Baker $2,173,184
19 De'Vondre Campbell $2,000,000
20 Andy Lee $2,000,000
21 Byron Murphy $1,815,795
22 Christian Kirk $1,607,691
23 Aaron Brewer $1,168,750
24 Max Garcia $1,143,125
25 Andy Isabella $1,052,244
26 Mason Cole $977,629
27 Zach Allen $975,855
28 Chris Banjo $887,500
29 Jonathan Bullard $887,500
... ...
2530 Khari Blasingame $675,000
2531 Kenneth Durden $675,000
2532 Cody Hollister $675,000
2533 Joey Ivie $675,000
2534 Greg Joseph $675,000
2535 Kareem Orr $675,000
2536 David Quessenberry $675,000
2537 Derick Roberson $675,000
2538 Shaun Wilson $675,000
2539 Cole McDonald $635,421
2540 Chris Jackson $629,570
2541 Kobe Smith $614,333
2542 Aaron Brewer $613,333
2543 Cale Garrett $613,333
2544 Tommy Hudson $613,333
2545 Kristian Wilkerson $613,333
2546 Khaylan Kearse-Thomas $612,500
2547 Nick Westbrook $612,333
2548 Kyle Williams $611,833
2549 Mason Kinsey $611,666
2550 Tucker McCann $611,666
2551 Cameron Scarlett $611,666
2552 Teair Tart $611,666
2553 Brandon Kemp $611,333
2554 Wyatt Ray $610,000
2555 Josh Smith $610,000
2556 Logan Woodside $610,000
2557 Rashard Davis $610,000
2558 Avery Gennesy $610,000
2559 Parker Hesse $610,000
[2560 rows x 2 columns]

Pandas drop duplicates with partially completed data in each row and combine data

I have a dataframe with duplicate IDs but the data is partially completed in multiple areas.
df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan, np.nan],
[1234, 'Customer A', np.nan, '333 Street', np.nan],
[1234, 'Customer A', '12345 Street', np.nan, np.nan],
[1234, 'Customer A', np.nan, np.nan, np.nan],
[1233, 'Customer B', '444 Street', '3335 Street', np.nan],
[1233, 'Customer B', '555 Street', '666 Street', np.nan],
[1233, 'Customer B', '553 Street', '666 Street', 'abc#email.com'],
[1235, 'Customer C', '1553 Street', '644 Street', 'abc#email.com'],
[1235, 'Customer C', '2553 Street', '644 Street', 'abc#email.com']],
columns=['ID', 'Customer', 'Billing Address', 'Shipping Address', 'Contact'])
df
ID Customer Billing Address Shipping Address Contact
0 1234 Customer A 123 Street NaN NaN
1 1234 Customer A NaN 333 Street NaN
2 1234 Customer A 12345 Street NaN NaN
3 1234 Customer A NaN NaN NaN
4 1233 Customer B 444 Street 3335 Street NaN
5 1233 Customer B 555 Street 666 Street NaN
6 1233 Customer B 553 Street 666 Street abc#email.com
7 1235 Customer C 1553 Street 644 Street abc#email.com
8 1235 Customer C 2553 Street 644 Street abc#email.com
I want to preserve all of the data, creating new columns where extra values exist, so that the result looks like the dataframe below:
I tried the following but it removes data that I want to preserve.
df.drop_duplicates(subset=['ID'], inplace=True)
df
ID Customer Billing Address Shipping Address Contact
0 1234 Customer A 123 Street NaN NaN
4 1233 Customer B 444 Street 3335 Street NaN
7 1235 Customer C 1553 Street 644 Street abc#email.com
EDIT: I added more data because it was unclear from the original post that there can be IDs with multiple rows.
Here's one approach using apply to create the new columns, building a dict for each group and wrapping it in pd.Series:
In [1057]: cols = ['Billing Address', 'Shipping Address']

In [1058]: (df.groupby(['ID', 'Customer'])
              .apply(lambda g: pd.Series({'%s %s' % (x, i + 1): v[x]
                                          for i, v in enumerate(g[cols].to_dict('records'))
                                          for x in v})))
Out[1058]:
                Billing Address 1 Billing Address 2 Shipping Address 1  \
ID   Customer
1233 Customer B        444 Street        555 Street         333 Street
1234 Customer A        123 Street               NaN                NaN

                Shipping Address 2
ID   Customer
1233 Customer B         666 Street
1234 Customer A         333 Street
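If you'd rather have ID and Customer back as ordinary columns instead of an index, a reset_index() on the result does that. A small follow-up sketch (wide is just a name I'm introducing here for the expression above):

wide = (df.groupby(['ID', 'Customer'])
          .apply(lambda g: pd.Series({'%s %s' % (x, i + 1): v[x]
                                      for i, v in enumerate(g[cols].to_dict('records'))
                                      for x in v})))
wide = wide.reset_index()   # ID and Customer become regular columns again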
Here is a potential solution, though it is not efficient at all in terms of memory used in the process.
The idea is to loop over the number of rows a single ID can have and repeatedly merge your dataframe with the nth row:
new_df = df.drop_duplicates(subset=['ID'])
temp_df = df.drop(new_df.index)
nth_address = 1

while len(temp_df) > 0:
    temp = temp_df.drop_duplicates(subset=['ID'])
    new_df = new_df.merge(temp, suffixes=('_' + str(nth_address), '_' + str(nth_address + 1)),
                          on='ID', how='left')
    temp_df = temp_df.drop(temp.index)
    nth_address += 1
ID Customer_1 Billing Address_1 Shipping Address_1 Customer_2 Billing Address_2 Shipping Address_2
0 1234 Customer A 123 Street NaN Customer A NaN 333 Street
1 1233 Customer B 444 Street 333 Street Customer B 555 Street 666 Street
To match your desired output, merge on ['ID', 'Customer'], since in your example they form the same key:
new_df = df.drop_duplicates(subset=['ID'])
temp_df = df.drop(new_df.index)
nth_address = 1

while len(temp_df) > 0:
    temp = temp_df.drop_duplicates(subset=['ID'])
    new_df = new_df.merge(temp, suffixes=('_' + str(nth_address), '_' + str(nth_address + 1)),
                          on=['ID', 'Customer'], how='left')
    temp_df = temp_df.drop(temp.index)
    nth_address += 1
ID Customer Billing Address_1 Shipping Address_1 Billing Address_2 Shipping Address_2
0 1234 Customer A 123 Street NaN NaN 333 Street
1 1233 Customer B 444 Street 333 Street 555 Street 666 Street

Pandas - remove numbers from start of string in series

I've got a series of addresses and would like a series with just the street name. The only catch is some of the addresses don't have a house number, and some do.
So if I have a series that looks like:
Idx
0 11000 SOUTH PARK
1 20314 BRAKER LANE
2 203 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
What function would I write to get
Idx
0 SOUTH PARK
1 BRAKER LANE
2 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
where any 'words' made entirely of numeric characters at the beginning of the string have been removed? As you can see above, I would like to retain the 3 that '3RD STREET' starts with. I'm thinking a regular expression but this is beyond me. Thanks!
You can use str.replace with regex ^\d+\s+ to remove leading digits:
s.str.replace(r'^\d+\s+', '')
Out[491]:
0 SOUTH PARK
1 BRAKER LANE
2 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
Name: Idx, dtype: object
str.replace(r'\d+\s', '') is what I came up with:
df = pd.DataFrame({'IDx': ['11000 SOUTH PARK',
'20314 BRAKER LANE',
'203 3RD ST',
'BIRMINGHAM PARK',
'E 12TH']})
df
Out[126]:
IDx
0 11000 SOUTH PARK
1 20314 BRAKER LANE
2 203 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
df.IDx = df.IDx.str.replace(r'\d+\s', '')
df
Out[128]:
IDx
0 SOUTH PARK
1 BRAKER LANE
2 3RD ST
3 BIRMINGHAM PARK
4 E 12TH
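One caveat if you are on a newer pandas release (2.0 or later): Series.str.replace treats the pattern as a literal string by default there, so pass regex=True explicitly, e.g. with the anchored pattern from the first answer:

df.IDx = df.IDx.str.replace(r'^\d+\s+', '', regex=True)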

How to compare fields from two CSV files with an arithmetic condition?

I have two csv files. The first file contains names of all countries with their capital cities,
CSV 1:
Capital Country Country Code
Budapest Hungary HUN
Rome Italy ITA
Dublin Ireland IRL
Paris France FRA
Berlin Germany DEU
.
.
.
CSV 2:
Second CSV file contains trip details of a bus,
Trip City Trip Country No. of pax
Budapest HUN 24
Paris FRA 36
Munich DEU 9
Florence ITA 5
Milan ITA 25
Rome ITA 2
Rome ITA 45
I would like to add a new column df["Tourism visit"] with the values of No. of pax, if the Trip City (from CSV 2) is the capital of a country (from CSV 1) and the number of pax is more than 10.
Thank you.
Try this:
df2['tourism'] = 0
df2.loc[df2['Trip City'].isin(df1['Capital']) & (df2['No. of pax'] > 10), 'tourism'] = df2.loc[df2['Trip City'].isin(df1['Capital'])& (df2['No. of pax'] > 10), 'No. of pax']
I get :
Trip_City Trip_Country No._of_pax tourism
0 Budapest HUN 24 24
1 Paris FRA 36 36
2 Munich DEU 9 0
3 Florence ITA 5 0
4 Milan ITA 25 0
5 Rome ITA 2 0
6 Rome ITA 45 45
(I had to add _s to get pd.read_clipboard() to work properly)
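The same assignment reads a bit easier if you compute the mask once. Just a sketch of an equivalent form, assuming the same column names:

mask = df2['Trip City'].isin(df1['Capital']) & (df2['No. of pax'] > 10)
df2['tourism'] = df2['No. of pax'].where(mask, 0)   # keep pax where the condition holds, else 0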
This might also help.
Import the dfs:
df1 = pd.read_csv("CSV1.csv")
df2 = pd.read_csv("CSV2.csv")
Make a dictionary out of the pandas Series:
my_dict = dict(zip(df1["Country_Code"], df1["Capital"]))
Define a function that tests your conditions (note I used np.logical_and() to combine the conditions; a plain and would also work here, since both operands are scalars):
import numpy as np

def isTourism(country_code, trip_city, No_of_pax):
    if np.logical_and((my_dict[country_code] == trip_city), (No_of_pax >= 10)):
        return "Yes"
    else:
        return "No"
Call the function with map:
df2["Tourism"] = list(map(isTourism, df2["Trip Country"], df2["Trip City"], df2["No. Of pax"]))
print(df2)
Trip City Trip Country No. Of pax Tourism
0 Budapest HUN 24 Yes
1 Paris FRA 36 Yes
2 Munich DEU 9 No
3 Florence ITA 5 No
4 Milan ITA 25 No
5 Rome ITA 2 No
6 Rome ITA 45 Yes
If you filter your second dataframe to only the values > 10, you could merge and sum as follows:
import pandas as pd
df1 = pd.DataFrame({'Capital': ['Budapest', 'Rome', 'Dublin', 'Paris',
'Berlin'],
'Country': ['Hungary', 'Italy', 'Ireland', 'France',
'Germany'],
'Country Code': ['HUN', 'ITA', 'IRL', 'FRA', 'DEU']
})
df2 = pd.DataFrame({'Trip City': ['Budapest', 'Paris', 'Munich', 'Florence',
'Milan', 'Rome', 'Rome'],
'Trip Country': ['HUN', 'FRA', 'DEU', 'ITA', 'ITA',
'ITA', 'ITA'],
'No. of pax': [24, 36, 9, 5, 25, 2, 45]
})
df2 = df2[df2['No. of pax'] > 10]
combined = df1.merge(df2,
left_on=['Capital', 'Country Code'],
right_on=['Trip City', 'Trip Country'],
how='left').groupby(['Capital', 'Country Code'],
sort=False,
as_index=False)['No. of pax'].sum()
print(combined)
This prints:
Capital Country Code No. of pax
0 Budapest HUN 24.0
1 Rome ITA 45.0
2 Dublin IRL NaN
3 Paris FRA 36.0
4 Berlin DEU NaN
