New column in pandas dataframe based on existing column values

I have a dataframe with a column named 'States' that lists various U.S. states. I need to create another column with a region specifier such as 'Atlantic Coast'. I have lists of the states that belong to various regions, so if the state in df['States'] matches a state in the list Atlantic_states, the specifier 'Atlantic Coast' should be inserted into the new column df['region specifier']. The code below shows the list I want to compare my dataframe values with and the output of the df['States'] column.
#list of states
Atlantic_states = ['Virginia',
'Massachusetts',
'Maine',
'New York',
'Rhode Island',
'Connecticut',
'New Hampshire',
'Maryland',
'Delaware',
'New Jersey',
'North Carolina',
'South Carolina',
'Georgia',
'Florida']
print(df['States'])
Out:
States
0 Virginia
1 Massachusetts
2 Maine
3 New York
4 Rhode Island
5 Connecticut
6 New Hampshire
7 Maryland
8 Delaware
9 New Jersey
10 North Carolina
11 South Carolina
12 Georgia
13 Florida
14 Wisconsin
15 Michigan
16 Ohio
17 Pennsylvania
18 Illinois
19 Indiana
20 Minnesota
21 New York
22 Washington
23 Oregon
24 California

Whilst Andy's answer works it is not the most efficient way of doing this. There is a handy method that can be called on almost all pandas Series-like objects: .isin(). Entries to this can be lists, dicts and pandas Series.
import numpy as np
import pandas as pd
df = pd.DataFrame(['Virginia', 'Massachusetts', 'Maine', 'New York', 'Rhode Island',
                   'Connecticut', 'New Hampshire', 'Maryland', 'Delaware',
                   'New Jersey', 'North Carolina', 'South Carolina', 'Georgia', 'Florida',
                   'Wisconsin', 'Michigan', 'Ohio', 'Pennsylvania', 'Illinois',
                   'Indiana', 'Minnesota', 'New York', 'Washington', 'Oregon',
                   'California'],
                  columns=['States'])
Atlantic_states = ['Virginia', 'Massachusetts', 'Maine', 'New York', 'Rhode Island',
                   'Connecticut', 'New Hampshire', 'Maryland', 'Delaware',
                   'New Jersey', 'North Carolina', 'South Carolina', 'Georgia',
                   'Florida']
# isin() builds a boolean mask; np.where() picks the label for each row
df['Coast'] = np.where(df['States'].isin(Atlantic_states), 'Atlantic Coast',
                       'Unknown')
df.head()
Out[1]:
States Coast
0 Virginia Atlantic Coast
1 Massachusetts Atlantic Coast
2 Maine Atlantic Coast
3 New York Atlantic Coast
4 Rhode Island Atlantic Coast
Benchmarks
Here are some timings for mapping a column of random integers to the first 10 letters of the alphabet:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(low=0, high=26, size=(1000000,1)),
columns=['numbers'])
letters = dict(zip(list(range(0, 10)), [i for i in 'abcdefghij']))
First, for apply:
%%timeit
def is_atlantic(state):
    return True if state in letters else False
df.numbers.apply(is_atlantic)
Out[]: 1 loops, best of 3: 432 ms per loop
Now for map as suggested by JohnE
%%timeit
df.numbers.map(letters)
Out[]: 10 loops, best of 3: 56.9 ms per loop
and finally for isin (also suggested by Nickil Maveli)
%%timeit
df.numbers.isin(letters)
Out[]: 10 loops, best of 3: 20.9 ms per loop
So we see that .isin() is much quicker than .apply() (roughly 20 times here) and more than twice as quick as .map().
Note: apply and isin just return boolean masks, whereas map fills in the desired strings. Even so, when assigning to another column, isin still wins out by about 2/3 of the time of map.
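If you have more than one region list, the same isin masks can be combined with np.select. This is only a rough sketch: Midwest_states and the shortened lists are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'States': ['Virginia', 'Ohio', 'Maine', 'Oregon']})

# Hypothetical, shortened region lists (assumptions for this example)
Atlantic_states = ['Virginia', 'Maine']
Midwest_states = ['Ohio', 'Illinois']

# One boolean mask per region, checked in order
conditions = [df['States'].isin(Atlantic_states),
              df['States'].isin(Midwest_states)]
labels = ['Atlantic Coast', 'Midwest']

# Rows that match no mask fall back to the default label
df['Coast'] = np.select(conditions, labels, default='Unknown')
print(df)
```

Each extra region then costs one more mask and one more label, rather than another pass with np.where.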

You have a couple of options. First, to directly answer the question as posed:
Option 1
Create a function that returns whether a state is in the Atlantic region or not:
def is_atlantic(state):
    return "Atlantic" if state in Atlantic_states else "Unknown"
Now use .apply() and assign the results to your new column:
df['Region'] = df['State'].apply(is_atlantic)
This returns a data frame that looks like this:
State Region
0 Virginia Atlantic
1 Massachusetts Atlantic
2 Maine Atlantic
3 New York Atlantic
4 Rhode Island Atlantic
5 Connecticut Atlantic
6 New Hampshire Atlantic
7 Maryland Atlantic
8 Delaware Atlantic
9 New Jersey Atlantic
10 North Carolina Atlantic
11 South Carolina Atlantic
12 Georgia Atlantic
13 Florida Atlantic
14 Wisconsin Unknown
15 Michigan Unknown
16 Ohio Unknown
17 Pennsylvania Unknown
18 Illinois Unknown
19 Indiana Unknown
20 Minnesota Unknown
21 New York Atlantic
22 Washington Unknown
23 Oregon Unknown
24 California Unknown
Option 2
The first option gets cumbersome if you have multiple lists you want to check against. Instead of having multiple lists, I recommend creating a single dictionary with the State as the key and the region as the value. With only 50 values this should be easy enough to maintain.
regions = {
'Virginia': 'Atlantic',
'Massachusetts': 'Atlantic',
'Maine': 'Atlantic',
'New York': 'Atlantic',
'Rhode Island': 'Atlantic',
'Connecticut': 'Atlantic',
'New Hampshire': 'Atlantic',
'Maryland': 'Atlantic',
'Delaware': 'Atlantic',
'New Jersey': 'Atlantic',
'North Carolina': 'Atlantic',
'South Carolina': 'Atlantic',
'Georgia': 'Atlantic',
'Florida': 'Atlantic',
'Wisconsin': 'Midwest',
'Michigan': 'Midwest',
'Ohio': 'Midwest',
'Pennsylvania': 'Midwest',
'Illinois': 'Midwest',
'Indiana': 'Midwest',
'Minnesota': 'Midwest',
'Washington': 'West',
'Oregon': 'West',
'California': 'West'
}
You can use .apply() again, with a slightly modified function:
def get_region(state):
    return regions[state]
df['Region'] = df['State'].apply(get_region)
This time your dataframe looks like this:
State Region
0 Virginia Atlantic
1 Massachusetts Atlantic
2 Maine Atlantic
3 New York Atlantic
4 Rhode Island Atlantic
5 Connecticut Atlantic
6 New Hampshire Atlantic
7 Maryland Atlantic
8 Delaware Atlantic
9 New Jersey Atlantic
10 North Carolina Atlantic
11 South Carolina Atlantic
12 Georgia Atlantic
13 Florida Atlantic
14 Wisconsin Midwest
15 Michigan Midwest
16 Ohio Midwest
17 Pennsylvania Midwest
18 Illinois Midwest
19 Indiana Midwest
20 Minnesota Midwest
21 New York Atlantic
22 Washington West
23 Oregon West
24 California West
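One thing to watch with regions[state]: a state that is missing from the dictionary raises a KeyError. A possible variant, sketched here with a shortened dictionary and an assumed 'Unknown' fallback, is to pass the dictionary straight to .map(), which leaves NaN for missing keys:

```python
import pandas as pd

# Shortened version of the regions dictionary above
regions = {'Virginia': 'Atlantic', 'Ohio': 'Midwest', 'Oregon': 'West'}

df = pd.DataFrame({'State': ['Virginia', 'Ohio', 'Oregon', 'Hawaii']})

# .map() looks each state up in the dict; states not found become NaN,
# which fillna() then replaces with a fallback label (an assumption here)
df['Region'] = df['State'].map(regions).fillna('Unknown')
print(df)
```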

Related

select random pairs from remaining unique values in a list

Updated: Not sure I explained it well first time.
I have a scheduling problem, or more accurately, a "first come first served" problem. A list of available assets is assigned a set of spaces, available in pairs (think cars:parking spots, diners:tables, teams:games). I need a rough simulation (random) that chooses the first two to arrive from available pairs, then chooses the next two from remaining available pairs, and so on, until all spaces are filled.
Started using teams:games to cut my teeth. The first pair is easy enough. How do I then whittle it down to fill the next two spots from among the remaining available entities? Tried a bunch of different things, but coming up short. Help appreciated.
import itertools
import numpy as np
import pandas as pd
a = ['Georgia','Oregon','Florida','Texas'], ['Georgia','Oregon','Florida','Texas']
b = [(x,y) for x,y in itertools.product(*a) if x != y]
c = pd.DataFrame(b)
c.columns = ['home', 'away']
print(c)
d = c.sample(n = 2, replace = False)
print(d)
The first result is all possible combinations. But once the first slots are filled, there can be no repeats. In the example below, once Oregon and Georgia are slated in, the only remaining options to choose from are Florida:Texas or Texas:Florida. Obviously, the sample function alone frequently produces duplicates. I will need this to scale up to dozens, then hundreds of entities:slots. Many thanks in advance!
home away
0 Georgia Oregon
1 Georgia Florida
2 Georgia Texas
3 Oregon Georgia
4 Oregon Florida
5 Oregon Texas
6 Florida Georgia
7 Florida Oregon
8 Florida Texas
9 Texas Georgia
10 Texas Oregon
11 Texas Florida
home away
3 Oregon Georgia
5 Oregon Texas

Not exactly sure what you are trying to do. But if you want to randomly pair your unique entities, you can simply order them randomly and then place them in a two-column dataframe. I wrote this with all the US states minus one (Wyoming):
states = ['Alaska','Alabama','Arkansas','Arizona','California',
'Colorado','Connecticut','District of Columbia','Delaware',
'Florida','Georgia','Hawaii','Iowa','Idaho','Illinois',
'Indiana','Kansas','Kentucky','Louisiana','Massachusetts',
'Maryland','Maine','Michigan','Minnesota','Missouri',
'Mississippi','Montana','North Carolina','North Dakota',
'Nebraska','New Hampshire','New Jersey','New Mexico',
'Nevada','New York','Ohio','Oklahoma','Oregon',
'Pennsylvania','Rhode Island','South Carolina',
'South Dakota','Tennessee','Texas','Utah','Virginia',
'Vermont','Washington','Wisconsin','West Virginia']
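import random
import pandas as pd
a = states.copy()
random.shuffle(a)  # shuffle the copy in place so that `states` keeps its original order
c = pd.DataFrame({'home': a[::2], 'away': a[1::2]})
print(c)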
#Output
home away
0 West Virginia Minnesota
1 New Hampshire Louisiana
2 Nevada Florida
3 Alabama Indiana
4 Delaware North Dakota
5 Georgia Rhode Island
6 Oregon Pennsylvania
7 New York South Dakota
8 Maryland Kansas
9 Ohio Hawaii
10 Colorado Wisconsin
11 Iowa Idaho
12 Illinois Missouri
13 Arizona Mississippi
14 Connecticut Montana
15 District of Columbia Vermont
16 Tennessee Kentucky
17 Alaska Washington
18 California Michigan
19 Arkansas New Jersey
20 Massachusetts Utah
21 Oklahoma New Mexico
22 Virginia South Carolina
23 North Carolina Maine
24 Texas Nebraska
Not sure if this is exactly what you were asking for though.
If you need to schedule all the fixtures of the season, you can check this answer --> League fixture generator in python
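For larger lists, the same shuffle-and-pair idea could be wrapped in a small helper. This is just a sketch: random_pairs, its seed parameter and the way it reports an unpaired leftover are my own assumptions, not something from the question.

```python
import random
import pandas as pd

def random_pairs(entities, seed=None):
    """Shuffle a copy of `entities` and pair neighbours; each entity is used at most once."""
    rng = random.Random(seed)
    pool = list(entities)
    rng.shuffle(pool)
    # With an odd count, the last entity has no partner and is returned separately
    leftover = pool.pop() if len(pool) % 2 else None
    pairs = pd.DataFrame({'home': pool[::2], 'away': pool[1::2]})
    return pairs, leftover

teams = ['Georgia', 'Oregon', 'Florida', 'Texas', 'Ohio']
pairs, leftover = random_pairs(teams, seed=0)
print(pairs)
print('unpaired:', leftover)
```

Because every entity appears exactly once in the shuffled pool, no team can be drawn twice, which is the problem sample() runs into on the full list of combinations.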

Cannot turn string into an integer python3

I'm attempting to convert the following into integers. I have literally tried everything and keep getting errors.
For instance:
pop2007 = pop2007.astype('int32')
ValueError: invalid literal for int() with base 10: '4,779,736'
Below is the DF I'm trying to convert. I've even attempted the .values method with no success.
pop2007
Alabama 4,779,736
Alaska 710,231
Arizona 6,392,017
Arkansas 2,915,918
California 37,253,956
Colorado 5,029,196
Connecticut 3,574,097
Delaware 897,934
Florida 18,801,310
Georgia 9,687,653
Idaho 1,567,582
Illinois 12,830,632
Indiana 6,483,802
Iowa 3,046,355
Kansas 2,853,118
Kentucky 4,339,367
Louisiana 4,533,372
Maine 1,328,361
Maryland 5,773,552
Massachusetts 6,547,629
Michigan 9,883,640
Minnesota 5,303,925
Mississippi 2,967,297
Missouri 5,988,927
Montana 989,415
Nebraska 1,826,341
Nevada 2,700,551
New Hampshire 1,316,470
New Jersey 8,791,894
New Mexico 2059179
New York 19378102
North Carolina 9535483
North Dakota 672591
Ohio 11536504
Oklahoma 3751351
Oregon 3831074
Pennsylvania 12702379
Rhode Island 1052567
South Carolina 4625364
South Dakota 814180
Tennessee 6346105
Texas 25,145,561
Utah 2,763,885
Vermont 625,741
Virginia 8,001,024
Washington 6,724,540
West Virginia 1,852,994
Wisconsin 5,686,986
Wyoming 563,626
Name: 3, dtype: object

You can't turn a string with commas into an integer. Try this.
my_int = '1,000,000'
my_int = int(my_int.replace(',', ''))
print(my_int)

Have you tried pop2007.str.replace(',', '') to remove the commas from your string values before converting to integers?
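On a whole Series, that could look roughly like the sketch below. The three-row pop2007 here is a stand-in for the real data, and the astype(str) step is an assumption in case some entries are already stored as numbers rather than strings.

```python
import pandas as pd

# A small stand-in for pop2007: some values have commas, some do not
pop2007 = pd.Series(['4,779,736', '710,231', '2059179'],
                    index=['Alabama', 'Alaska', 'New Mexico'])

# astype(str) guards against entries that are already numbers;
# regex=False treats ',' as a literal character
pop2007 = pop2007.astype(str).str.replace(',', '', regex=False).astype('int32')
print(pop2007)
```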

Is there a way to bin categorical data in pandas?

I've got a dataframe where one column is U.S. states. I'd like to create a new column and bin the states according to region, i.e., South, Southwest, etc. It looks like pd.cut is only used for continuous variables, so binning that way doesn't seem like an option. Is there a good way to create a column that's conditional on categorical data in another column?

import pandas as pd
def label_states(row):
    if row['state'] in ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island', 'Connecticut', 'New York', 'Pennsylvania', 'New Jersey']:
        return 'north-east'
    if row['state'] in ['Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio', 'North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri']:
        return 'midwest'
    if row['state'] in ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky', 'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas', 'Louisiana']:
        return 'south'
    return 'etc'
df = pd.DataFrame([{'state':"Illinois", 'data':"aaa"}, {'state':"Rhode Island",'data':"aba"}, {'state':"Georgia",'data':"aba"}, {'state':"Iowa",'data':"aba"}, {'state':"Connecticut",'data':"bbb"}, {'state':"Ohio",'data':"bbb"}])
df['label'] = df.apply(label_states, axis=1)
df
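A possible alternative to the chain of if statements, sketched with shortened lists: keep one region -> states dictionary, invert it into a state -> region lookup, and let .map() do the work. The names region_states and state_region are made up for this example.

```python
import pandas as pd

# Region -> states, shortened for the example
region_states = {
    'north-east': ['Rhode Island', 'Connecticut'],
    'midwest': ['Illinois', 'Iowa', 'Ohio'],
    'south': ['Georgia'],
}

# Invert to state -> region so each lookup is a single dict access
state_region = {state: region
                for region, states in region_states.items()
                for state in states}

df = pd.DataFrame({'state': ['Illinois', 'Rhode Island', 'Georgia', 'Hawaii']})

# States missing from the lookup become NaN and are then labelled 'etc'
df['label'] = df['state'].map(state_region).fillna('etc')
print(df)
```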

Assume that your df contains a State column (US state code) plus other columns; for the test (see below) I included only State Name.
Of course it can contain more columns and more than one row for each state.
To add region names (a new column), define a regions DataFrame containing two columns: State (US state code) and Region (region name).
Then merge these DataFrames and save the result back under df:
df = df.merge(regions, on='State')
A part of the result is:
State Name State Region
0 Alabama AL Southeast
1 Arizona AZ Southwest
2 Arkansas AR South
3 California CA West
4 Colorado CO Southwest
5 Connecticut CT Northeast
6 Delaware DE Northeast
7 Florida FL Southeast
8 Georgia GA Southeast
9 Idaho ID Northwest
10 Illinois IL Central
11 Indiana IN Central
12 Iowa IA East North Central
13 Kansas KS South
14 Kentucky KY Central
15 Louisiana LA South
Of course, there are numerous ways to assign US states to regions, so if you want to use another variant, define the regions DataFrame according to your classification.
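Note that merge defaults to an inner join, so rows whose State code is missing from regions are dropped. A minimal sketch of the merge with how='left', which keeps them; the tiny frames and the 'PR' row are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({'State Name': ['Alabama', 'Arizona', 'Puerto Rico'],
                   'State': ['AL', 'AZ', 'PR']})

# Hypothetical lookup table; the region names follow the sample output above
regions = pd.DataFrame({'State': ['AL', 'AZ'],
                        'Region': ['Southeast', 'Southwest']})

# how='left' keeps every row of df; codes absent from `regions` get NaN
df = df.merge(regions, on='State', how='left')
print(df)
```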

Derive a new pandas column based on a certain value of a row and apply until the next value appears again

In a pandas dataframe string column, I want to derive a new column based on the value of a row until the next value appears again. What is the most efficient / cleanest way to achieve this?
Input Dataframe:
import pandas as pd
df = pd.DataFrame({'neighborhood':['Chicago City', 'Wicker Park', 'Bucktown','Lincoln Park','West Loop','River North','Milwaukee City','Bay View','East Side','South Side','Bronzeville','North Side','New York City','Harlem','Midtown','Chinatown']})
My desired dataframe output would be:
neighborhood city
0 Chicago City Chicago
1 Wicker Park Chicago
2 Bucktown Chicago
3 Lincoln Park Chicago
4 West Loop Chicago
5 River North Chicago
6 Milwaukee City Milwaukee
7 Bay View Milwaukee
8 East Side Milwaukee
9 South Side Milwaukee
10 Bronzeville Milwaukee
11 North Side Milwaukee
12 New York City New York
13 Harlem New York
14 Midtown New York
15 Chinatown New York

1) If the first column contains 'City', copy it to the second column but cut out the ' City' part
2) Fill NA's with a forward fill method
import numpy as np
df['city'] = np.where(
df.neighborhood.str.contains('City'),
df.neighborhood.str.replace(' City', '', case = False),
None)
Result:
neighborhood city
0 Chicago City Chicago
1 Wicker Park None
2 Bucktown None
3 Lincoln Park None
4 West Loop None
5 River North None
6 Milwaukee City Milwaukee
7 Bay View None
8 East Side None
9 South Side None
10 Bronzeville None
11 North Side None
12 New York City New York
13 Harlem None
14 Midtown None
15 Chinatown None
df['city'] = df['city'].ffill()
Result:
neighborhood city
0 Chicago City Chicago
1 Wicker Park Chicago
2 Bucktown Chicago
3 Lincoln Park Chicago
4 West Loop Chicago
5 River North Chicago
6 Milwaukee City Milwaukee
7 Bay View Milwaukee
8 East Side Milwaukee
9 South Side Milwaukee
10 Bronzeville Milwaukee
11 North Side Milwaukee
12 New York City New York
13 Harlem New York
14 Midtown New York
15 Chinatown New York

Use .str.extract + ffill:
df['city'] = df.neighborhood.str.extract(r'(.*)\sCity').ffill()
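Spelled out on a shortened version of the input, as a sketch; expand=False is an addition of mine so that extract returns a Series rather than a one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'neighborhood': ['Chicago City', 'Wicker Park', 'Bucktown',
                                    'Milwaukee City', 'Bay View']})

# The capture group grabs everything before ' City'; rows without a match become NaN.
# ffill() then carries the last city name forward over those NaN rows.
df['city'] = df['neighborhood'].str.extract(r'(.*)\sCity', expand=False).ffill()
print(df)
```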

You can just map a custom-defined function that behaves as intended:
city = None
def generate(s):
    global city
    # Remember the most recent "... City" row, dropping the ' City' suffix
    if 'City' in s:
        city = s.replace(' City', '')
    return city
df['city'] = df['neighborhood'].map(generate)
This will return the intended output.

Filling out empty cells with lists of values

I have a data frame that looks like below:
City State Country
Chicago IL United States
Boston
San Diego CA United States
Los Angeles CA United States
San Francisco
Sacramento
Vancouver BC Canada
Toronto
And I have 3 lists of values that are ready to fill in the None cells:
city = ['Boston', 'San Francisco', 'Sacramento', 'Toronto']
state = ['MA', 'CA', 'CA', 'ON']
country = ['United States', 'United States', 'United States', 'Canada']
The elements of these lists correspond to each other by position: the first items across all 3 lists match each other, and so forth. How can I fill out the empty cells and produce a result like below?
City State Country
Chicago IL United States
Boston MA United States
San Diego CA United States
Los Angeles CA United States
San Francisco CA United States
Sacramento CA United States
Vancouver BC Canada
Toronto ON Canada
My code gives me an error and I'm stuck.
if df.loc[df['City'] == 'Boston']:
'State' = 'MA'
Any solution is welcome. Thank you.

Create two mappings, one for <city : state>, and another for <city : country>.
city_map = dict(zip(city, state))
country_map = dict(zip(city, country))
Next, use map to turn each City into its state and country, and fillna so that only the empty cells are filled (this assumes they are None/NaN) -
df['State'] = df['State'].fillna(df['City'].map(city_map))
df['Country'] = df['Country'].fillna(df['City'].map(country_map))
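Put together on a small frame, the idea can be sketched like this. It assumes the empty cells are NaN (empty strings would need converting first), keeps City as an ordinary column, and uses fillna so that existing values are left alone. The shortened data and the name state_map are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'City': ['Chicago', 'Boston', 'San Diego', 'Toronto'],
    'State': ['IL', np.nan, 'CA', np.nan],
    'Country': ['United States', np.nan, 'United States', np.nan],
})

city = ['Boston', 'Toronto']
state = ['MA', 'ON']
country = ['United States', 'Canada']

# City -> state and city -> country lookups built from the parallel lists
state_map = dict(zip(city, state))
country_map = dict(zip(city, country))

# map() builds the candidate values; fillna() uses them only where a cell is empty
df['State'] = df['State'].fillna(df['City'].map(state_map))
df['Country'] = df['Country'].fillna(df['City'].map(country_map))
print(df)
```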
