I have a data frame that looks like below:
City           State  Country
Chicago        IL     United States
Boston
San Diego      CA     United States
Los Angeles    CA     United States
San Francisco
Sacramento
Vancouver      BC     Canada
Toronto
And I have 3 lists of values that are ready to fill in the None cells:
city = ['Boston', 'San Francisco', 'Sacramento', 'Toronto']
state = ['MA', 'CA', 'CA', 'ON']
country = ['United States', 'United States', 'United States', 'Canada']
The elements of these lists correspond by position: the first items across all three lists belong together, and so forth. How can I fill in the empty cells and produce a result like the one below?
City           State  Country
Chicago        IL     United States
Boston         MA     United States
San Diego      CA     United States
Los Angeles    CA     United States
San Francisco  CA     United States
Sacramento     CA     United States
Vancouver      BC     Canada
Toronto        ON     Canada
My code gives me an error and I'm stuck.
if df.loc[df['City'] == 'Boston']:
'State' = 'MA'
Any solution is welcome. Thank you.
Create two mappings, one for <city : state>, and another for <city : country>.
city_map = dict(zip(city, state))
country_map = dict(zip(city, country))
There is no need to set City as the index here; after df.set_index('City'), the column df['City'] no longer exists, which is one way this approach breaks. Instead, use map to translate each city to its new value, and fillna so rows that already have data keep it (a bare map would overwrite Chicago's 'IL' with NaN, because 'Chicago' is not a key in city_map):
df['State'] = df['State'].fillna(df['City'].map(city_map))
df['Country'] = df['Country'].fillna(df['City'].map(country_map))
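Putting it all together, a minimal end-to-end sketch (assuming the blank cells are actually missing values such as None/NaN):

import pandas as pd

df = pd.DataFrame({
    'City': ['Chicago', 'Boston', 'San Diego', 'Los Angeles',
             'San Francisco', 'Sacramento', 'Vancouver', 'Toronto'],
    'State': ['IL', None, 'CA', 'CA', None, None, 'BC', None],
    'Country': ['United States', None, 'United States', 'United States',
                None, None, 'Canada', None],
})

city = ['Boston', 'San Francisco', 'Sacramento', 'Toronto']
state = ['MA', 'CA', 'CA', 'ON']
country = ['United States', 'United States', 'United States', 'Canada']

city_map = dict(zip(city, state))
country_map = dict(zip(city, country))

# fillna only touches the missing cells, so values already present survive
df['State'] = df['State'].fillna(df['City'].map(city_map))
df['Country'] = df['Country'].fillna(df['City'].map(country_map))
print(df)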
I have a dataset with two columns: date and text. The text column contains unstructured information. I have a list of city names to search for in a text column.
I need to get two sets of data:
list_city = ['New York', 'Los Angeles', 'Chicago']
When all values from the list match the text in a dataframe row
Sample example:
df_1
data text
06-02-2022 New York, Los Angeles, Chicago, Phoenix
05-02-2022 New York, Houston, Phoenix
04-02-2022 San Antonio, San Diego, Jacksonville
Need result df_1_res:
df_1_res
data text
06-02-2022 New York, Los Angeles, Chicago, Phoenix
I tried this design, it works, but it doesn't look very nice:
df_1_res = df_1.loc[df_1["text"].str.contains(list_city[0])
                    & df_1["text"].str.contains(list_city[1])
                    & df_1["text"].str.contains(list_city[2])]
When at least one value from the list matches the text in a dataframe row
Sample example:
df_2
data text
06-02-2022 New York, Los Angeles, Chicago, Phoenix
05-02-2022 New York, Houston, Phoenix
04-02-2022 San Antonio, San Diego, Jacksonville
Need result df_2_res:
df_2_res
data text
06-02-2022 New York, Los Angeles, Chicago, Phoenix
05-02-2022 New York, Houston, Phoenix
I tried this design, it works, but it doesn't look very nice:
df_2_res = df_2.loc[df_2["text"].str.contains(list_city[0])
                    | df_2["text"].str.contains(list_city[1])
                    | df_2["text"].str.contains(list_city[2])]
How can this be improved, given that the number of cities in the filtering list is expected to change?
Here is one way to do it.
Case #1: AND condition
import re

# use re.IGNORECASE to make findall case-insensitive
df_1.loc[df_1['text']
         .str.findall(r'|'.join(list_city), re.IGNORECASE)
         .str.len()
         .eq(len(list_city))]
data text
0 06-02-2022 New York, Los Angeles, Chicago, Phoenix
Case #2: OR condition
#create an OR condition using join
# filter using loc
df_2.loc[df_2['text'].str.contains(r'|'.join(list_city))]
data text
0 06-02-2022 New York, Los Angeles, Chicago, Phoenix
1 05-02-2022 New York, Houston, Phoenix
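If you prefer plain substring tests over regex counting, the asker's chained conditions generalize to any number of cities with numpy's logical reducers; a sketch (regex=False guards against regex metacharacters in city names):

import numpy as np

def contains_all(df, cities):
    # one boolean mask per city, combined with logical AND
    masks = [df['text'].str.contains(c, regex=False) for c in cities]
    return df.loc[np.logical_and.reduce(masks)]

def contains_any(df, cities):
    # the same masks, combined with logical OR
    masks = [df['text'].str.contains(c, regex=False) for c in cities]
    return df.loc[np.logical_or.reduce(masks)]

df_1_res = contains_all(df_1, list_city)
df_2_res = contains_any(df_2, list_city)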
Solution with pandas.Series.str.count
list_city = ['New York', 'Los Angeles', 'Chicago']
print('\nfirst task - ALL')
df1 = df[df['text'].str.count(r'|'.join(list_city)).eq(len(list_city))]
print(df1)
print('\nsecond task - ANY')
df1 = df[df['text'].str.count(r'|'.join(list_city)).gt(0)]
print(df1)
first task - ALL
data text
0 06-02-2022 New York, Los Angeles, Chicago, Phoenix
second task - ANY
data text
0 06-02-2022 New York, Los Angeles, Chicago, Phoenix
1 05-02-2022 New York, Houston, Phoenix
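Both regex-based answers assume each city appears at most once per row and contains no regex metacharacters. If the text column is a clean ', '-separated list, exact set comparison avoids both assumptions; a sketch:

wanted = set(list_city)
# turn each row's text into a set of city names
sets = df['text'].str.split(', ').map(set)

df_all = df[sets.map(wanted.issubset)]             # every listed city present
df_any = df[sets.map(lambda s: bool(wanted & s))]  # at least one listed city present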
The data I have is something similar to this:
country  population  area      city         city_population
USA      331893745   9833520   New York     8804190
USA      331893745   9833520   Los Angeles  3898747
USA      331893745   9833520   Chicago      2746388
UK       243610      66366000  London       7556900
UK       243610      66366000  Birmingham   984333
Canada   9984670     38532853  Toronto      2600000
Canada   9984670     38532853  Montreal     1600000
Canada   9984670     38532853  Calgary      1019942
I am looking for output like this:
country  population  area      cities
USA      331893745   9833520   {'New York': 8804190, 'Los Angeles': 3898747, 'Chicago': 2746388}
UK       243610      66366000  {'London': 7556900, 'Birmingham': 984333}
Canada   9984670     38532853  {'Toronto': 2600000, 'Montreal': 1600000, 'Calgary': 1019942}
So basically I want to group by the country column and then put city and city_population into a JSON-like column while keeping the other columns.
Any help is appreciated.
What you want is pandas' groupby function, which forms groups from rows that share the same values across several columns. These groups can then be transformed with other functions to fit your problem. In your case, apply a lambda that zips the city column and city_population into a dictionary (a JSON-like structure). The last two statements only restore a flat index and set the correct column name.
(df.groupby(by=['country', 'population', 'area'])
.apply(lambda x: dict(zip(x['city'], x['city_population'])))
.reset_index()
.rename(columns={0:'Cities'}))
Output:
country population area Cities
0 Canada 9984670 38532853 {'Toronto': 2600000, 'Montreal': 1600000, 'Calgary': 1019942}
1 UK 243610 66366000 {'London': 7556900, 'Birmingham': 984333}
2 USA 331893745 9833520 {'New York': 8804190, 'Los Angeles': 3898747, 'Chicago': 2746388}
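If an actual JSON string is needed rather than a Python dict (e.g. for export), a small variant; .tolist() converts the numpy integers to plain int so json.dumps accepts them:

import json

out = (df.groupby(['country', 'population', 'area'])
         .apply(lambda x: json.dumps(dict(zip(x['city'],
                                              x['city_population'].tolist()))))
         .reset_index(name='cities'))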
I have a dataframe column 'address' with values like this in each row:
3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)
Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)
I only need to keep the value Bronx / Queens / Manhattan / Staten Island from each row.
Is there any way to do this?
Thanks in advance.
One option, assuming the values are always in the same position, is .split(', ')[2]:
"3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)".split(', ')[2]
If the source file is a CSV (comma-separated values), I would have a look at pandas.read_csv('filename.csv') and leverage all the nice features that are in pandas.
If the values are not at the same position and you only need to know whether each value is in a given set of values or not:
import pandas as pd
df = pd.DataFrame(["The Bronx", "Queens", "Man"])
df.isin(["Queens", "The Bronx"])
You could add a column, let's call it 'district' and then populate it like this.
import pandas as pd
df = pd.DataFrame({'address':["3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)",
"Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)"]})
districts = ['Bronx','Queens','Manhattan', 'Staten Island']
df['district'] = ''
for district in districts:
    df.loc[df['address'].str.contains(district), 'district'] = district
print(df)
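The loop can also be collapsed into one vectorized call with Series.str.extract; a sketch assuming each address mentions at most one of the four names (the first alternative that matches wins):

# rows mentioning none of the districts get NaN, replaced by ''
pattern = '(' + '|'.join(districts) + ')'
df['district'] = df['address'].str.extract(pattern, expand=False).fillna('')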
I want to create a new column called Playercategory, based on the players' nationalities below:
Name Country
Kobe United States
John Italy
Charly Japan
Braven Japan / United States
Rocky Germany / United States
Bran Lithuania
Nick United States / Ukraine
Jonas Nigeria
If the player's nationality is 'United States', or United States combined with any country except a European one, then Playercategory == "American".
If the player's nationality is a European country, or a European country combined with any other country, then Playercategory == "Europe" (e.g. 'Italy', 'Italy / United States', 'Germany / United States', 'Lithuania / Australia', 'Belgium').
For all other players, Playercategory == "Non".
Expected Output:
Name Country Playercategory
Kobe United States American
John Italy Europe
Charly Japan Non
Braven Japan / United States American
Rocky Germany / United States Europe
Bran Lithuania Europe
Nick United States / Ukraine American
Jonas Nigeria Non
What I tried:
First I created a list with Europe countries:
euroCountries=['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czechia', 'Denmark',
'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland',
'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands',
'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden']
I know how to check a single condition, like below,
df["PlayerCatagory"] = np.where(df["Country"].isin(euroCountries), "Europe", "Non")
But I don't know how to combine the three conditions above to create PlayerCategory correctly.
I'd really appreciate your support!
Use numpy.select: first test for a match against euroCountries with Series.str.contains, then test for 'United States'. numpy.select evaluates the conditions in order, so a row containing both a European country and 'United States' (like 'Germany / United States') is labeled 'Europe', as required:
import numpy as np

m1 = df['Country'].str.contains('|'.join(euroCountries))
m2 = df['Country'].str.contains('United States')
Or you can test the split values with Series.str.split plus DataFrame.eq or DataFrame.isin, and then check whether at least one value per row matches with DataFrame.any:
df1 = df['Country'].str.split(' / ', expand=True)
m1 = df1.eq('United States').any(axis=1)
m2 = df1.isin(euroCountries).any(axis=1)
df["PlayerCatagory"] = np.select([m1, m2], ['Europe','American'], default='Non')
print (df)
Name Country PlayerCatagory
0 Kobe United States American
1 John Italy Europe
2 Charly Japan Non
3 Braven Japan / United States American
4 Rocky Germany / United States Europe
5 Bran Lithuania Europe
6 Nick United States / Ukraine American
7 Jonas Nigeria Non
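One caveat on the str.contains approach: it does plain substring matching, so a value like 'Ireland' would also match inside a hypothetical 'Northern Ireland'. A stricter sketch with word boundaries and re.escape:

import re

# \b anchors keep 'Ireland' from matching inside longer country names;
# re.escape guards against regex metacharacters in the list entries
pattern = r'\b(?:' + '|'.join(map(re.escape, euroCountries)) + r')\b'
m1 = df['Country'].str.contains(pattern)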
I'm trying to convert a table found on a website (full details below) to a CSV. I've started with the code below, but the table isn't returning anything. I suspect I'm not using the right selector for the table, but any additional help toward my ultimate goal is appreciated.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
url = 'https://www.privateequityinternational.com/database/#/pei-300'
page = requests.get(url) #gets info from page
soup = BeautifulSoup(page.content,'html.parser') #parses information
table = soup.findAll('table',{'class':'au-target pux--responsive-table'}) #collecting blocks of info inside of table
table
Output: []
In addition to the URL provided in the code above, I'm essentially trying to convert the PEI 300 table displayed on that page to a CSV file.
The data is loaded from an external URL via Ajax. You can use the requests and json modules to get it:
import json
import requests
url = 'https://ra.pei.blaize.io/api/v1/institutions/pei-300s?count=25&start=0'
data = requests.get(url).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for item in data['data']:
print('{:<5} {:<30} {}'.format(item['id'], item['name'], item['headquarters']))
Prints:
5611 Blackstone New York, United States
5579 The Carlyle Group Washington DC, United States
5586 KKR New York, United States
6701 TPG Fort Worth, United States
5591 Warburg Pincus New York, United States
1801 NB Alternatives New York, United States
6457 CVC Capital Partners Luxembourg, Luxembourg
6477 EQT Stockholm, Sweden
6361 Advent International Boston, United States
8411 Vista Equity Partners Austin, United States
6571 Leonard Green & Partners Los Angeles, United States
6782 Cinven London, United Kingdom
6389 Bain Capital Boston, United States
8096 Apollo Global Management New York, United States
8759 Thoma Bravo San Francisco, United States
7597 Insight Partners New York, United States
867 BlackRock New York, United States
5471 General Atlantic New York, United States
6639 Permira Advisers London, United Kingdom
5903 Brookfield Asset Management Toronto, Canada
6473 EnCap Investments Houston, United States
6497 Francisco Partners San Francisco, United States
6960 Platinum Equity Beverly Hills, United States
16331 Hillhouse Capital Group Hong Kong, Hong Kong
5595 Partners Group Baar-Zug, Switzerland
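The URL above only returns the first 25 records. To pull the whole list into a CSV, one option is to page through the endpoint by increasing start; a sketch that assumes (unverified) the API keeps the count/start semantics suggested by the query string and returns an empty data list past the end:

import pandas as pd
import requests

base = 'https://ra.pei.blaize.io/api/v1/institutions/pei-300s?count=25&start={}'
rows, start = [], 0
while True:
    chunk = requests.get(base.format(start)).json().get('data', [])
    if not chunk:
        break
    rows.extend(chunk)
    start += 25

# keep the three fields printed above and write them out
pd.DataFrame(rows)[['id', 'name', 'headquarters']].to_csv('pei300.csv', index=False)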
And a selenium version:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
url = 'https://www.privateequityinternational.com/database/#/pei-300'
driver.get(url) #gets info from page
time.sleep(5)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page,'html.parser') #parses information
table = soup.select_one('table.au-target.pux--responsive-table') #collecting blocks of info inside of table
dfs = pd.read_html(table.prettify())
df = pd.concat(dfs)
df.to_csv('file.csv')
print(df.head(25))
Prints:
Ranking Name City, Country (HQ)
0 1 Blackstone New York, United States
1 2 The Carlyle Group Washington DC, United States
2 3 KKR New York, United States
3 4 TPG Fort Worth, United States
4 5 Warburg Pincus New York, United States
5 6 NB Alternatives New York, United States
6 7 CVC Capital Partners Luxembourg, Luxembourg
7 8 EQT Stockholm, Sweden
8 9 Advent International Boston, United States
9 10 Vista Equity Partners Austin, United States
10 11 Leonard Green & Partners Los Angeles, United States
11 12 Cinven London, United Kingdom
12 13 Bain Capital Boston, United States
13 14 Apollo Global Management New York, United States
14 15 Thoma Bravo San Francisco, United States
15 16 Insight Partners New York, United States
16 17 BlackRock New York, United States
17 18 General Atlantic New York, United States
18 19 Permira Advisers London, United Kingdom
19 20 Brookfield Asset Management Toronto, Canada
20 21 EnCap Investments Houston, United States
21 22 Francisco Partners San Francisco, United States
22 23 Platinum Equity Beverly Hills, United States
23 24 Hillhouse Capital Group Hong Kong, Hong Kong
24 25 Partners Group Baar-Zug, Switzerland
It also saves the data to file.csv.
Note you need selenium and geckodriver; in this code the geckodriver executable path is set to c:/program/geckodriver.exe.
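A possible refinement: time.sleep(5) is a fixed, blind wait. Selenium's explicit waits block only as long as needed; a sketch of the same fetch using WebDriverWait:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get('https://www.privateequityinternational.com/database/#/pei-300')

# wait up to 15 seconds for the table to appear, then grab the page source
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, 'table.au-target.pux--responsive-table')))
page = driver.page_source
driver.close()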