Splitting a Pandas DataFrame column into two columns - python

I'm working on a simple web-scraping DataFrame project. I have an 8x1 DataFrame and I'm trying to split it into an 8x2 DataFrame. So far this is what my DataFrame looks like:
dframe = DataFrame(data, columns=['Active NPGL Teams'], index=[1, 2, 3, 4, 5, 6, 7, 8])
Active NPGL Teams
1 Baltimore Anthem (2015–present)
2 Boston Iron (2014–present)
3 DC Brawlers (2014–present)
4 Los Angeles Reign (2014–present)
5 Miami Surge (2014–present)
6 New York Rhinos (2014–present)
7 Phoenix Rise (2014–present)
8 San Francisco Fire (2014–present)
I would like to add a column, "Years Active", and move the "(2014–present)" / "(2015–present)" values into it. How do I split my data?

You can use
dframe['Active NPGL Teams'].str.split(r' (?=\()', expand=True)
0 1
1 Baltimore Anthem (2015–present)
2 Boston Iron (2014–present)
3 DC Brawlers (2014–present)
4 Los Angeles Reign (2014–present)
5 Miami Surge (2014–present)
6 New York Rhinos (2014–present)
7 Phoenix Rise (2014–present)
8 San Francisco Fire (2014–present)
The key is the regex r' (?=\()' which matches a space only if it is followed by an open parenthesis (lookahead assertion).
Another approach (about 5% slower but more flexible) is to use Series.str.extract:
dframe['Active NPGL Teams'].str.extract(r'^(?P<Team>.+) (?P<YearsActive>\(.+\))$',
                                        expand=True)
Team YearsActive
1 Baltimore Anthem (2015–present)
2 Boston Iron (2014–present)
3 DC Brawlers (2014–present)
4 Los Angeles Reign (2014–present)
5 Miami Surge (2014–present)
6 New York Rhinos (2014–present)
7 Phoenix Rise (2014–present)
8 San Francisco Fire (2014–present)
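To attach the result back to the frame as two named columns, here is a minimal sketch using the same regex on a small hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame mirroring the question's data.
dframe = pd.DataFrame(
    {'Active NPGL Teams': ['Baltimore Anthem (2015–present)',
                           'Boston Iron (2014–present)']},
    index=[1, 2])

# Split on the space that precedes "(" and assign both pieces at once.
dframe[['Team', 'Years Active']] = (
    dframe['Active NPGL Teams'].str.split(r' (?=\()', expand=True))

print(dframe[['Team', 'Years Active']])
```

Because the lookahead consumes only the separating space, the parentheses are kept intact in the second column.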

How to filter with conditions to add to new column

I am working with a DataFrame that looks like this:
home         away        home_score  away_score
Tampa Bay    Colorado    3           1
San Jose     Colombus    1           3
New England  San Jose    1           5
Colorado     Tampa Bay   2           0
New England  KC Wizards  2           1
My goal is to compare 'home_score' with 'away_score' and store the losing team's name, taken from 'home' or 'away', in a separate column.
For example, in the first row away_score is 1, so "Colorado" should go into the new column.
Desired outcome:
home       away      home_score  away_score  lost_team
Tampa Bay  Colorado  3           1           Colorado
I searched for a way to do this but could not find a suitable method.
You can use np.where:
import numpy as np

df['lost_team'] = np.where(df['home_score'] < df['away_score'], df['home'], df['away'])
print(df)
# Output
home away home_score away_score lost_team
0 Tampa Bay Colorado 3 1 Colorado
1 San Jose Colombus 1 3 San Jose
2 New England San Jose 1 5 New England
3 Colorado Tampa Bay 2 0 Tampa Bay
4 New England KC Wizards 2 1 KC Wizards
If a draw is possible, use np.select:
conds = [df['home_score'] < df['away_score'],
         df['home_score'] > df['away_score']]
choices = [df['home'], df['away']]
draw = df[['home', 'away']].agg(list, axis=1)
df['lost_team'] = np.select(condlist=conds, choicelist=choices, default=draw)
df = df.explode('lost_team')
print(df)
# Output
home away home_score away_score lost_team
0 Tampa Bay Colorado 3 1 Colorado
1 San Jose Colombus 1 3 San Jose
2 New England San Jose 1 5 New England
3 Colorado Tampa Bay 2 0 Tampa Bay
4 New England KC Wizards 2 1 KC Wizards
5 Team A Team B 0 0 Team A # Row 1
5 Team A Team B 0 0 Team B # Row 2
You can use pandas.DataFrame.apply with axis=1 to check the condition on each row and save the result:
df['lost_team'] = df.apply(
    lambda row: 'Equal' if row['home_score'] == row['away_score']
    else (row['away'] if row['home_score'] > row['away_score'] else row['home']),
    axis=1)
print(df)
home away home_score away_score lost_team
0 Tampa Bay Colorado 3 1 Colorado
1 San Jose Columbus 1 3 San Jose
2 New England San Jose 1 5 New England
3 Colorado Tampa Bay 2 0 Tampa Bay
4 New England KC Wizards 2 1 KC Wizards
5 Team A Team B 1 1 Equal

How do I change a list I read from a file so that it changes when the program is running but resets when the program ends?

I've been making a program that reads a file and puts its contents into a list, which has worked so far: menu options 1, 2, and 6 all work.
But option 3 (sorting the list alphabetically) doesn't change the list at all.
This is after I've tried copying the sorted list into the "global" team_list. Note that I am NOT trying to change the file; I only want to change the list shared by the features in the while loop, so that after picking option 3 to sort the names, choosing option 2 would display them alphabetically.
The .txt file:
Boston Americans
New York Giants
Chicago White Sox
Chicago Cubs
Chicago Cubs
Pittsburgh Pirates
Philadelphia Athletics
Philadelphia Athletics
Boston Red Sox
Philadelphia Athletics
Boston Braves
Boston Red Sox
Boston Red Sox
Chicago White Sox
Boston Red Sox
Cincinnati Reds
Cleveland Indians
New York Giants
New York Giants
New York Yankees
Washington Senators
Pittsburgh Pirates
St. Louis Cardinals
New York Yankees
New York Yankees
Philadelphia Athletics
Philadelphia Athletics
St. Louis Cardinals
New York Yankees
New York Giants
St. Louis Cardinals
Detroit Tigers
New York Yankees
New York Yankees
New York Yankees
New York Yankees
Cincinnati Reds
New York Yankees
St. Louis Cardinals
New York Yankees
St. Louis Cardinals
Detroit Tigers
St. Louis Cardinals
New York Yankees
Cleveland Indians
New York Yankees
New York Yankees
New York Yankees
New York Yankees
New York Yankees
New York Giants
Brooklyn Dodgers
New York Yankees
Milwaukee Braves
New York Yankees
Los Angeles Dodgers
Pittsburgh Pirates
New York Yankees
New York Yankees
Los Angeles Dodgers
St. Louis Cardinals
Los Angeles Dodgers
Baltimore Orioles
St. Louis Cardinals
Detroit Tigers
New York Mets
Baltimore Orioles
Pittsburgh Pirates
Oakland Athletics
Oakland Athletics
Oakland Athletics
Cincinnati Reds
Cincinnati Reds
New York Yankees
New York Yankees
Pittsburgh Pirates
Philadelphia Phillies
Los Angeles Dodgers
St. Louis Cardinals
Baltimore Orioles
Detroit Tigers
Kansas City Royals
New York Mets
Minnesota Twins
Los Angeles Dodgers
Oakland Athletics
Cincinnati Reds
Minnesota Twins
Toronto Blue Jays
Toronto Blue Jays
Atlanta Braves
New York Yankees
Florida Marlins
New York Yankees
New York Yankees
New York Yankees
Arizona Diamondbacks
Anaheim Angels
Florida Marlins
Boston Red Sox
Chicago White Sox
St. Louis Cardinals
Boston Red Sox
Philadelphia Phillies
START_DATE = 1903
END_DATE = 2009
FILE = 'WorldSeriesWinners.txt'

def main():
    team_list = file_to_list(FILE)
    quit_program = False
    while not quit_program:
        print('-' * 50)
        print('1. Search a team')
        print('2. Display team names')
        print('3. Sort team names in alphabetical order')
        print('4. Sort team names in reverse-alphabetical order')
        print('5. Remove a team name')
        print('6. Quit')
        user_response = int(input('Choose an option (1-6): '))
        while user_response < 1 or user_response > 6:
            user_response = int(input('Please choose a valid option (1-6): '))
        if user_response == 1:
            wins = 0
            chosen_team = input('Enter a team name: ')
            if chosen_team in team_list:
                for index in range(len(team_list)):
                    if team_list[index] == chosen_team:
                        wins = wins + 1
                print(f'The {chosen_team} won the world series {wins} times between {START_DATE} and {END_DATE}')
            else:
                print(f'The {chosen_team} are not on the list.')
        elif user_response == 2:
            display(team_list)
        elif user_response == 3:
            unique_team_names = set(team_list)
            new_team_list = list(unique_team_names)
            new_team_list.sort()
        elif user_response == 6:
            quit_program = True
            print('Goodbye')

def file_to_list(file_name):
    try:
        team_file = open(file_name, 'r')
        team_list = []
        team = team_file.readline()
        while team != '':
            team = team.rstrip('\n')
            team_list.append(team)
            team = team_file.readline()
        team_file.close()
        return team_list
    except IOError:
        print('File not found')

def display(team_list):
    unique_team_names = set(team_list)
    new_team_list = list(unique_team_names)
    for index in range(len(new_team_list)):
        print(new_team_list[index])

if __name__ == '__main__':
    main()
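The core of the problem in option 3 can be shown without the menu loop: sorting into a *new* list never touches team_list. A minimal sketch with a hypothetical three-team list:

```python
team_list = ['New York Giants', 'Boston Americans', 'Chicago Cubs']

# This builds a *new* sorted list; team_list itself is unchanged.
new_team_list = sorted(set(team_list))

# To make later menu options see the sorted order, rebind the name
# (or call team_list.sort() to sort the existing list in place).
team_list = new_team_list
print(team_list)
# → ['Boston Americans', 'Chicago Cubs', 'New York Giants']
```

Since the file is only read once at startup, rebinding team_list changes what the running program sees without modifying the .txt file.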

pandas join gives NaN values

I want to join 2 DataFrames
Zipcode Database (first 10 entries)
0 zip_code City State County Population
0 0 90001 Los Angeles California Los Angeles 54481
1 1 90002 Los Angeles California Los Angeles 44584
2 2 90003 Los Angeles California Los Angeles 58187
3 3 90004 Los Angeles California Los Angeles 67850
4 4 90005 Los Angeles California Los Angeles 43014
5 5 90006 Los Angeles California Los Angeles 62765
6 6 90007 Los Angeles California Los Angeles 45021
7 7 90008 Los Angeles California Los Angeles 30840
8 8 90009 Los Angeles California Los Angeles -
9 9 90010 Los Angeles California Los Angeles 1943
And data (first 10 entries)
buyer zip_code
0 SWEENEY,THOMAS R & MICHELLE H NaN
1 DOUGHERTY,HERBERT III & JENNIFER M NaN
2 WEST COAST RLTY SVCS INC NaN
3 LOVE,JULIE M NaN
4 SAHAR,DAVID NaN
5 SILBERSTERN,BRADLEY E TRUST 91199
6 LEE,SUSAN & JIMMY C 92025
7 FRAZZANO REAL ESTATE I NC NaN
8 RUV INVESTMENTS LLC 91730
9 KAOS KAPITAL LLC NaN
So the final table should have [buyer, zip_code, City, County]. I'm joining on zip code.
data_2 = data.join(zipcode_database[['City', 'County', 'zip_code']].set_index('zip_code'), on='zip_code')
But the city and county columns are NaN even for the tuples in data where zipcode is actually present.
buyer zip_code City County
10 LANDON AVE TRUST 37736 NaN NaN NaN
11 UMAR,AHMAD NaN NaN NaN
12 3 JPS INC 90717 NaN NaN
13 T & L HOLDINGS INC 95610 NaN NaN
14 CAHP HOLDINGS LLC 90808 NaN NaN
15 REBUILDING TOGETHER LONG BEACH 92344 NaN NaN
16 COLFIN AI-CA 4 LLC NaN NaN NaN
17 GUTIERREZ,HUGO 91381 NaN NaN
18 VALBRIDGE CAP GOLDEN GATE FUND NaN NaN NaN
19 SOLARES,OSCAR 92570 NaN NaN
Why is this the case? The zipcode database has all zipcodes from 90001 - 999950.
My first thought is the datatype of "zip_code" in both are different:
print(zipcode_database['zip_code'].dtype)
print(data['zip_code'].dtype)
Output:
int64
object
I thought of typecasting with astype, but that does not work with NaN values. Any thoughts?
You can cast NaN values to float, but not to int. In your case I would cast the zip_code field in both DataFrames to float and then join:
zipcode_database.zip_code = zipcode_database.zip_code.astype(float)
data.zip_code = data.zip_code.astype(float)
data_2 = data.join(zipcode_database[['City', 'County', 'zip_code']].set_index('zip_code'), on='zip_code')
I can't reproduce anything meaningful from your example data (no matching zip codes), but that should fix the issue.
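A self-contained sketch of that fix, using a hypothetical pair of small frames in place of the real data:

```python
import pandas as pd

zipcode_database = pd.DataFrame({
    'zip_code': [90001, 90002],          # int64 keys
    'City': ['Los Angeles', 'Los Angeles'],
    'County': ['Los Angeles', 'Los Angeles'],
})
data = pd.DataFrame({
    'buyer': ['RUV INVESTMENTS LLC', 'KAOS KAPITAL LLC'],
    'zip_code': ['90001', None],         # object: strings plus missing values
})

# NaN cannot live in an int64 column, so cast both keys to float
# before joining; '90001' becomes 90001.0 and None becomes NaN.
zipcode_database['zip_code'] = zipcode_database['zip_code'].astype(float)
data['zip_code'] = data['zip_code'].astype(float)

data_2 = data.join(
    zipcode_database[['City', 'County', 'zip_code']].set_index('zip_code'),
    on='zip_code')
print(data_2)
```

With matching float dtypes on both sides, rows with a present zip code pick up City and County, and rows with a missing zip code stay NaN as expected.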

Fillna with the group's most frequent value if the group occurs, else with the most frequent value of the entire column

I have a pandas dataframe
City State
0 Cambridge MA
1 NaN DC
2 Boston MA
3 Washignton DC
4 NaN MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 NaN FL
11 Washington DC
I want to fill NaNs with the most frequent city for each state, provided the state has appeared before, so I group by state and apply the following code:
df['City'] = df.groupby('State').transform(lambda x:x.fillna(x.value_counts().idxmax()))
The above code works if every state has occurred before; the output will be
City State
0 Cambridge MA
1 Washignton DC
2 Boston MA
3 Washignton DC
4 Cambridge MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washington DC
However, I want to add a condition: if a state has never occurred, its city should be the most frequent value in the entire City column. That is, if the dataframe is
City State
0 Cambridge MA
1 NaN DC
2 Boston MA
3 Washignton DC
4 NaN MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 NaN FL
11 Washington DC
12 NaN NY
NY has never occurred before, so I want the output to be
City State
0 Cambridge MA
1 Washignton DC
2 Boston MA
3 Washignton DC
4 Cambridge MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washington DC
12 Cambridge NY
The code above gives a ValueError: ('attempt to get argmax of an empty sequence') because "NY" has never occurred before.
IIUC:
def f(x):
    if x.count() <= 0:
        return np.nan
    return x.value_counts().index[0]
df['City'] = df.groupby('State')['City'].transform(f)
df['City'] = df['City'].fillna(df['City'].value_counts().idxmax())
Output:
City State
0 Cambridge MA
1 Washignton DC
2 Cambridge MA
3 Washignton DC
4 Cambridge MA
5 Miami FL
6 Cambridge MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washignton DC
12 Cambridge NY
You can solve this with the following code:
mode = df['City'].mode()[0]
df['City'] = df.groupby('State')['City'].apply(
    lambda x: x.fillna(x.value_counts().idxmax() if x.value_counts().max() >= 1 else mode,
                       inplace=False))
df['City'] = df['City'].fillna(df['City'].value_counts().idxmax())
output:
City State
0 Cambridge MA
1 Washignton DC
2 Boston MA
3 Washignton DC
4 Cambridge MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washington DC
12 Cambridge NY
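The same two-step idea can be sketched compactly on a small hypothetical frame: fill within each state first, then fall back to the overall mode for states with no known city at all.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Cambridge', None, 'Cambridge', 'Boston', None],
    'State': ['MA', 'MA', 'MA', 'MA', 'NY'],
})

# Fill within each state using that state's most frequent city;
# leave groups with no known city (NY here) untouched for now.
df['City'] = df.groupby('State')['City'].transform(
    lambda x: x.fillna(x.mode()[0]) if x.count() > 0 else x)

# Fall back to the most frequent city overall, which covers NY.
df['City'] = df['City'].fillna(df['City'].mode()[0])
print(df)
```

Guarding on x.count() avoids the "attempt to get argmax of an empty sequence" error, because value_counts()/mode() are only consulted for groups that have at least one non-null city.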

Not able to extract complete city list

I am using the following code to extract the list of cities mentioned on this page, but it gives me just the first 23 cities.
Can't figure out where I am going wrong!
import requests, bs4

res = requests.get('http://www.citymayors.com/statistics/largest-cities-population-125.html')
text = bs4.BeautifulSoup(res.text, "lxml")
fields = text.select('td[bgcolor="silver"] > font[size="-2"] > b')
print len(fields)
for field in fields:
    print field.getText()
This is the output I am getting:
23
Tokyo/Yokohama
New York Metro
Sao Paulo
Seoul/Incheon
Mexico City
Osaka/Kobe/Kyoto
Manila
Mumbai
Delhi
Jakarta
Lagos
Kolkata
Cairo
Los Angeles
Buenos Aires
Rio de Janeiro
Moscow
Shanghai
Karachi
Paris
Istanbul
Nagoya
Beijing
But this webpage contains 125 cities.
lxml works fine for me; I get 124 cities using your own code, so this has nothing to do with the parser. Either you are using an old version of bs4 or it is an encoding issue: call .content and let requests handle the encoding. Your selector also misses one city. To get all 125:
import requests, bs4

res = requests.get('http://www.citymayors.com/statistics/largest-cities-population-125.html')
text = bs4.BeautifulSoup(res.content, "lxml")
rows = [row.select_one("td + td") for row in text.select("table tr + tr")]
print(len(rows))
for row in rows:
    print(row.get_text())
If we run it, you can see we get all the cities:
In [1]: import requests,bs4
In [2]: res = requests.get('http://www.citymayors.com/statistics/largest-cities-population-125.html')
In [3]: text = bs4.BeautifulSoup(res.text,"lxml")
In [4]: rows = [row.select_one("td + td")for row in text.select("table tr + tr")]
In [5]: print(len(rows))
125
In [6]: for row in rows:
...: print(row.get_text())
...:
Tokyo/Yokohama
New York Metro
Sao Paulo
Seoul/Incheon
Mexico City
Osaka/Kobe/Kyoto
Manila
Mumbai
Delhi
Jakarta
Lagos
Kolkata
Cairo
Los Angeles
Buenos Aires
Rio de Janeiro
Moscow
Shanghai
Karachi
Paris
Istanbul
Nagoya
Beijing
Chicago
London
Shenzhen
Essen/Düsseldorf
Tehran
Bogota
Lima
Bangkok
Johannesburg/East Rand
Chennai
Taipei
Baghdad
Santiago
Bangalore
Hyderabad
St Petersburg
Philadelphia
Lahore
Kinshasa
Miami
Ho Chi Minh City
Madrid
Tianjin
Kuala Lumpur
Toronto
Milan
Shenyang
Dallas/Fort Worth
Boston
Belo Horizonte
Khartoum
Riyadh
Singapore
Washington
Detroit
Barcelona
Houston
Athens
Berlin
Sydney
Atlanta
Guadalajara
San Francisco/Oakland
Montreal.
Monterey
Melbourne
Ankara
Recife
Phoenix/Mesa
Durban
Porto Alegre
Dalian
Jeddah
Seattle
Cape Town
San Diego
Fortaleza
Curitiba
Rome
Naples
Minneapolis/St. Paul
Tel Aviv
Birmingham
Frankfurt
Lisbon
Manchester
San Juan
Katowice
Tashkent
Fukuoka
Baku/Sumqayit
St. Louis
Baltimore
Sapporo
Tampa/St. Petersburg
Taichung
Warsaw
Denver
Cologne/Bonn
Hamburg
Dubai
Pretoria
Vancouver
Beirut
Budapest
Cleveland
Pittsburgh
Campinas
Harare
Brasilia
Kuwait
Munich
Portland
Brussels
Vienna
San Jose
Damman
Copenhagen
Brisbane
Riverside/San Bernardino
Cincinnati
Accra
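The sibling-combinator selectors used above can be checked offline against a tiny hypothetical table (html.parser, no network needed):

```python
import bs4

html = """
<table>
  <tr><td>Rank</td><td>City</td></tr>
  <tr><td>1</td><td>Tokyo/Yokohama</td></tr>
  <tr><td>2</td><td>New York Metro</td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

# "tr + tr" matches every row that immediately follows another row,
# skipping the header; "td + td" then picks the second cell of each row.
cities = [row.select_one('td + td').get_text()
          for row in soup.select('table tr + tr')]
print(cities)
```

This selects by table position rather than by bgcolor/font attributes, which is why it does not silently drop rows whose markup differs slightly.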
