Writing to CSV in Python

My CSV writer currently does not write row by row; it just jumbles everything up. Any help would be great. Basically, I need a CSV with the four lines in the yields section below in one column.
tweets_df = tweets_df.dropna()
for i in tweets_df.ix[:, 0]:
    if regex_getter(i) is not None:
        print(regex_getter(i))
which yields:
Burlington, VT
Minneapolis, MN
Bloomington, IN
Irvine, CA
import csv

with open('Bernie.csv', 'w') as mycsvfile:
    writer = csv.writer(mycsvfile)
    for i in tweets_df.ix[:, 0]:
        if regex_getter(i) is not None:
            row = regex_getter(i)
            writer.writerow([i])
import re

def regex_getter(entry):
    txt = entry
    re1 = '((?:[a-z][a-z]+))'  # Word 1
    re2 = '(,)'  # Any Single Character 1
    re3 = '(\\s+)'  # White Space 1
    re4 = '((?:(?:AL)|(?:AK)|(?:AS)|(?:AZ)|(?:AR)|(?:CA)|(?:CO)|(?:CT)|(?:DE)|(?:DC)|(?:FM)|(?:FL)|(?:GA)|(?:GU)|(?:HI)|(?:ID)|(?:IL)|(?:IN)|(?:IA)|(?:KS)|(?:KY)|(?:LA)|(?:ME)|(?:MH)|(?:MD)|(?:MA)|(?:MI)|(?:MN)|(?:MS)|(?:MO)|(?:MT)|(?:NE)|(?:NV)|(?:NH)|(?:NJ)|(?:NM)|(?:NY)|(?:NC)|(?:ND)|(?:MP)|(?:OH)|(?:OK)|(?:OR)|(?:PW)|(?:PA)|(?:PR)|(?:RI)|(?:SC)|(?:SD)|(?:TN)|(?:TX)|(?:UT)|(?:VT)|(?:VI)|(?:VA)|(?:WA)|(?:WV)|(?:WI)|(?:WY)))(?![a-z])'  # US State 1
    rg = re.compile(re1 + re2 + re3 + re4, re.IGNORECASE | re.DOTALL)
    m = rg.search(txt)
    if m:
        word1 = m.group(1)
        c1 = m.group(2)
        ws1 = m.group(3)
        usstate1 = m.group(4)
        return str((word1 + c1 + ws1 + usstate1))
Here is what my data looks like without the regex method. The method basically takes out all data that is not in City, State format; it excludes everything not shaped like Raleigh, NC, for example.
for i in tweets_df.ix[:, 0]:
    print(i)
Indiana, USA
Burlington, VT
United States
Saint Paul - Minneapolis, MN
Inland Valley, The Pass, S. CA
In the Dreamatorium
Nova Scotia;Canada
North Carolina, USA
INTP. West Michigan
Los Angeles, California
Waterbury Connecticut
Right side of the tracks

I would do it this way:
states = {
'AK': 'Alaska',
'AL': 'Alabama',
'AR': 'Arkansas',
'AS': 'American Samoa',
'AZ': 'Arizona',
'CA': 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DC': 'District of Columbia',
'DE': 'Delaware',
'FL': 'Florida',
'GA': 'Georgia',
'GU': 'Guam',
'HI': 'Hawaii',
'IA': 'Iowa',
'ID': 'Idaho',
'IL': 'Illinois',
'IN': 'Indiana',
'KS': 'Kansas',
'KY': 'Kentucky',
'LA': 'Louisiana',
'MA': 'Massachusetts',
'MD': 'Maryland',
'ME': 'Maine',
'MI': 'Michigan',
'MN': 'Minnesota',
'MO': 'Missouri',
'MP': 'Northern Mariana Islands',
'MS': 'Mississippi',
'MT': 'Montana',
'NA': 'National',
'NC': 'North Carolina',
'ND': 'North Dakota',
'NE': 'Nebraska',
'NH': 'New Hampshire',
'NJ': 'New Jersey',
'NM': 'New Mexico',
'NV': 'Nevada',
'NY': 'New York',
'OH': 'Ohio',
'OK': 'Oklahoma',
'OR': 'Oregon',
'PA': 'Pennsylvania',
'PR': 'Puerto Rico',
'RI': 'Rhode Island',
'SC': 'South Carolina',
'SD': 'South Dakota',
'TN': 'Tennessee',
'TX': 'Texas',
'UT': 'Utah',
'VA': 'Virginia',
'VI': 'Virgin Islands',
'VT': 'Vermont',
'WA': 'Washington',
'WI': 'Wisconsin',
'WV': 'West Virginia',
'WY': 'Wyoming'
}
import io
import pandas as pd

# sample DF
data = """\
location
Indiana, USA
Burlington, VT
United States
Saint Paul - Minneapolis, MN
Inland Valley, The Pass, S. CA
In the Dreamatorium
Nova Scotia;Canada
North Carolina, USA
INTP. West Michigan
Los Angeles, California
Waterbury Connecticut
Right side of the tracks
"""
df = pd.read_csv(io.StringIO(data), sep=r'\|')
re_states = r'.*,\s*(?:' + '|'.join(states.keys()) + ')'
df.loc[df.location.str.contains(re_states), 'location'].to_csv('filtered.csv', index=False)
Explanation:
In [3]: df
Out[3]:
location
0 Indiana, USA
1 Burlington, VT
2 United States
3 Saint Paul - Minneapolis, MN
4 Inland Valley, The Pass, S. CA
5 In the Dreamatorium
6 Nova Scotia;Canada
7 North Carolina, USA
8 INTP. West Michigan
9 Los Angeles, California
10 Waterbury Connecticut
11 Right side of the tracks
generated RegEx:
In [9]: re_states
Out[9]: '.*,\\s*(?:VA|AK|ND|CA|CO|AR|MD|DC|KY|LA|OR|VT|IL|CT|OH|GA|WA|AS|NC|MN|NH|ID|HI|NA|MA|MS|WV|VI|FL|MO|MI|AL|ME|GU|NM|SD|WY|AZ|MP|DE|RI|PA|
NJ|WI|OK|TN|TX|KS|IN|NV|NY|NE|PR|UT|IA|MT|SC)'
Search mask:
In [10]: df.location.str.contains(re_states)
Out[10]:
0 False
1 True
2 False
3 True
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
Name: location, dtype: bool
Filtered DF:
In [11]: df.loc[df.location.str.contains(re_states)]
Out[11]:
location
1 Burlington, VT
3 Saint Paul - Minneapolis, MN
Now just spool it to CSV:
df.loc[df.location.str.contains(re_states), 'location'].to_csv('d:/temp/filtered.csv', index=False)
filtered.csv:
"Burlington, VT"
"Saint Paul - Minneapolis, MN"
UPDATE:
Starting from Pandas 0.20.1, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers.
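For the snippets above, that means replacing tweets_df.ix[:, 0]; a minimal sketch of the two modern spellings:
# positional: first column, all rows (drop-in replacement for tweets_df.ix[:, 0])
first_col = tweets_df.iloc[:, 0]

# label-based equivalent, assuming the column is actually named 'location'
first_col = tweets_df.loc[:, 'location']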

Related

How to check if a CSV dataset contains multiple values with Python?

I'm trying to check if the CSV file contains all the states from the U.S. As you can see in my code, I've imported the CSV file as a list in Python. I'm trying to solve this problem without using pandas or another module.
I've created a list of the states, but I'm wondering: what is the most efficient solution to check how many states the CSV dataset contains?
import csv

with open('president_county_candidate.csv', newline='', encoding='utf_8') as file:
    reader = csv.reader(file)
    data = list(reader)

print(data)
[['state', 'county', 'candidate', 'party', 'votes'], ['Delaware', 'Kent County', 'Joe Biden', 'DEM', '44518'], ['Delaware', 'Kent County', 'Donald Trump', 'REP', '40976'], ['Delaware', 'Kent County', 'Jo Jorgensen', 'LIB', '1044'], ['Delaware', 'Kent County', 'Howie Hawkins', 'GRN', '420'], ['Delaware', 'Kent County', ' Write-ins', 'WRI', '0'], ['Delaware', 'New Castle County', 'Joe Biden', 'DEM', '194245'], ['Delaware', 'New Castle County', 'Donald Trump', 'REP', '87687'], ['Delaware', 'New Castle County', 'Jo Jorgensen', 'LIB', '2932'], ['Delaware', 'New Castle County', 'Howie Hawkins', 'GRN', '1277'], ['Delaware', 'New Castle County', ' Write-ins', 'WRI', '0'], ['Delaware', 'Sussex County', 'Donald Trump', 'REP', '71196'], ['Delaware', 'Sussex County', 'Joe Biden', 'DEM', '56657'], ['Delaware', 'Sussex County', 'Jo Jorgensen', 'LIB', '1003'], ['Delaware', 'Sussex County', 'Howie Hawkins', 'GRN', '437'], ['District of Columbia', 'District of Columbia', 'Joe Biden', 'DEM', '31723'], ['District of Columbia', 'District of Columbia', 'Donald Trump', 'REP', '1239'], ['District of Columbia', 'District of Columbia', ' Write-ins', 'WRI', '206'], ['District of Columbia', 'District of Columbia', 'Howie Hawkins', 'GRN', '192'], ['District of Columbia', 'District of Columbia', 'Jo Jorgensen', 'LIB', '147'], ['District of Columbia', 'District of Columbia', 'Gloria La Riva', 'PSL', '77'], ['District of Columbia', 'District of Columbia', 'Brock Pierce', 'IND', '28'], ['District of Columbia', 'Ward 2', 'Joe Biden', 'DEM', '25228'], ['District of Columbia', 'Ward 2', 'Donald Trump', 'REP', '2466'], ['District of Columbia', 'Ward 2', ' Write-ins', 'WRI', '298'], ['District of Columbia', 'Ward 2', 'Jo Jorgensen', 'LIB', '229'], ['District of Columbia', 'Ward 2', 'Howie Hawkins', 'GRN', '96'], ['District of Columbia', 'Ward 2', 'Gloria La Riva', 'PSL', '37'], ['District of Columbia', 'Ward 2', 'Brock Pierce', 'IND', '32']]
states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Idaho', 'Hawaii',
'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
'Maine', 'Maryland', 'Massachusetts','Michigan','Minnesota','Mississippi',
'Missouri','Montana','Nebraska','Nevada','New Hampshire','New Jersey','New Mexico',
'New York', 'North Carolina','North Dakota','Ohio','Oklahoma','Oregon',
'Pennsylvania','Rhode Island','South Carolina','South Dakota','Tennessee','Texas',
'Utah','Vermont','Virginia','Washington','West Virginia', 'Wisconsin','Wyoming']
If your objective is only to
check if the CSV file contains all the states from the U.S
then you can find the unique set of states in your file and make sure the count is exactly 50.
number = len(set(record[0].lower() for record in data[1:]))
# Expected: number should be 50
This example will count every state found in your data list:
counter = {}
for state, *_ in data:
    if state in states:
        counter.setdefault(state, 0)
        counter[state] += 1

for state in states:
    print("{:<20} {}".format(state, counter.get(state, 0)))

print()
print("Total states found:", len(counter))
Prints:
Alabama 0
Alaska 0
Arizona 0
Arkansas 0
California 0
Colorado 0
Connecticut 0
Delaware 14
Florida 0
Georgia 0
Idaho 0
Hawaii 0
Illinois 0
Indiana 0
Iowa 0
Kansas 0
Kentucky 0
Louisiana 0
Maine 0
Maryland 0
Massachusetts 0
Michigan 0
Minnesota 0
Mississippi 0
Missouri 0
Montana 0
Nebraska 0
Nevada 0
New Hampshire 0
New Jersey 0
New Mexico 0
New York 0
North Carolina 0
North Dakota 0
Ohio 0
Oklahoma 0
Oregon 0
Pennsylvania 0
Rhode Island 0
South Carolina 0
South Dakota 0
Tennessee 0
Texas 0
Utah 0
Vermont 0
Virginia 0
Washington 0
West Virginia 0
Wisconsin 0
Wyoming 0
Total states found: 1
P.S.: To speed up, you can convert states from list to set beforehand.
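A minimal sketch of that tweak (set membership tests are average O(1), while list membership is O(n)):
states_set = set(states)           # one-time conversion
counter = {}
for state, *_ in data[1:]:         # skip the header row
    if state in states_set:        # O(1) lookup instead of scanning a list
        counter[state] = counter.get(state, 0) + 1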
First, a tip: it is probably easier to use csv.DictReader in this case, as it will give you labelled rows and automatically skip the first row. It's not necessary, but it makes the code easier to read.
import csv

with open('test.csv') as f:
    data = list(csv.DictReader(f))

print(data)
# prints: [
#   {'state': 'Delaware', ' county': ' Kent County', ' candidate': ' Joe Biden', ' party': ' DEM', ' votes': ' 44518'},
#   {'state': 'Delaware', ' county': ' Kent County', ' candidate': ' Donald Trump', ' party': ' REP', ' votes': ' 40976'},
#   ...
# ]
Then, you can use this expression to get all the states that are mentioned in the csv file:
states_in_csv = set(line['state'] for line in data)
print(states_in_csv)
# {'Delaware', 'District of Columbia', ... }
line['state'] for line in data is a generator expression that extracts just the 'state' field of each of those lines. set() makes the set of those states, i.e. removes all duplicates.
Then, you can easily test how many of your states are represented in your table. For example:
num_states = 0
for state in states:
    if state in states_in_csv:
        num_states += 1
print("number of states:", num_states)
This is very efficient because checking if a value is in a set is a constant time operation, so you don't have to search your whole table for each state.
It looks like your states list contains every state. If you just want to know how many states were in the table, you can simply use len(states_in_csv)
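Since states_in_csv and your states list are both hashable collections, the same count can also be written as a set intersection; a small sketch:
found = states_in_csv & set(states)   # states present in both
print("number of states:", len(found))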
You can use a hash map, in this case a Python dictionary, which is the most efficient data structure for this job. This snippet can help you:
counts = {}
for i in data:
    # verify that the state exists in the states list
    if i[0] in states:
        if i[0] in counts:
            counts[i[0]] += 1
        else:
            counts[i[0]] = 1

for k in counts:
    if counts[k] > 1:
        print(f"The state {k} appears {counts[k]} times")
import csv

with open('president_county_candidate.csv', newline='', encoding='utf_8') as file:
    reader = csv.reader(file)
    data = list(reader)
states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Idaho', 'Hawaii',
'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
'Maine', 'Maryland', 'Massachusetts','Michigan','Minnesota','Mississippi',
'Missouri','Montana','Nebraska','Nevada','New Hampshire','New Jersey','New Mexico',
'New York', 'North Carolina','North Dakota','Ohio','Oklahoma','Oregon',
'Pennsylvania','Rhode Island','South Carolina','South Dakota','Tennessee','Texas',
'Utah','Vermont','Virginia','Washington','West Virginia', 'Wisconsin','Wyoming']
for row in data:
    for i in states:
        if row[0] == i:
            print(row[0])

How to remove excel row in first excel based on data of second excel file pandas

I have one type of excel file with school data such as address, school name, principal's name, etc., and a second type of excel file with address, school name, rating, telephone number, etc. The question is: how can I delete particular rows in the first excel file based on the addresses in the second?
first excel file:
Unnamed: 0 School Address
0 0 Alabama School For Deaf 205 E South Street, Talladega, AL 35160
1 1 Helen Keller School 1101 Fort Lashley Avenue, Talladega, AL 35160
2 2 Tutwiler Prison 1209 Fort Lashley Ave., Talladega, AL 35160
3 3 Alabama School Of Fine Arts 8966 Us Hwy 231 N, Wetumpka, AL 36092
second:
School_Name ... Address
0 Pine View School ... 0 Mp 1361 Ak Hwy, Dot Lake, AK 99737
1 A.D. Henderson University School ... 1 168 3Rd Avenue, Eagle, AK 99738
2 School For Advanced Studies - South ... 2 249 Jon Summar Way, Tok, AK 99780
3 Tutwiler 3 1209 Fort Lashley Ave., Talladega, AL 35160
the output must be:
Unnamed: 0 School Address
0 0 Alabama School For Deaf 205 E South Street, Talladega, AL 35160
1 1 Helen Keller School 1101 Fort Lashley Avenue, Talladega, AL 35160
3 3 Alabama School Of Fine Arts 8966 Us Hwy 231 N, Wetumpka, AL 36092
I tried to use a for loop with pandas:
import pandas as pd
from pandas import ExcelWriter

writer = pd.ExcelWriter('US1234.xlsx', engine='xlsxwriter')
data = []
data_schools = []
df = pd.read_excel('DZ13288pubprin.xlsx')
states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY',
          'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH',
          'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']
lists = [[] for i in range(len(states))]
print(len(states))

def checking_top_100(nameofschool):
    for i in states:
        df2 = pd.read_excel('TOP-100.xlsx', sheet_name=[i])
        for a in df2[i]['SchoolName']:
            if nameofschool in a:
                pass
            else:
                return nameofschool

def sort_by_value(state, index):
    for i in range(len(df.SchoolName)):
        if df.LocationState[i] == state:
            # print(df.SchoolName[i])
            school_name = checking_top_100(df.SchoolName[i])
            lists[index].append(school_name)
            lists[index].append(
                df.LocationAddress[i] + ', ' + df.LocationCity[i] + ', ' + df.LocationState[i] + ' ' + df.LocationZip[i])
            # lists[index].append(df.EmailAddress[i])
    print(lists[index][0::2])

def data_to_excel(state, index):
    dfi = pd.DataFrame({
        'SchoolName': lists[index][0::2],
        # 'Principal Name': lists[index][1::3],
        # 'Email Address': lists[index][2::3],
        'Address': lists[index][1::2]
    })
    dfi.to_excel(writer, sheet_name=state)

# checking_top_100()
for i in range(len(states)):
    sort_by_value(states[i], i)
    data_to_excel(states[i], i)

writer.save()
I suggest you take a look at pandas.DataFrame.isin (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html). As this returns a boolean array (True or False) depending on whether or not the address is found in the second dataframe, you can then use boolean indexing to keep only the rows whose address is not found.
In other words, you could do something like:
dataframe1[~dataframe1.Address.isin(dataframe2.Address)]
This should give you the result you want.
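A runnable sketch of that idea on abbreviated versions of the two frames from the question (the column names School, School_Name and Address are taken from the samples above):
import pandas as pd

df1 = pd.DataFrame({
    'School': ['Alabama School For Deaf', 'Tutwiler Prison'],
    'Address': ['205 E South Street, Talladega, AL 35160',
                '1209 Fort Lashley Ave., Talladega, AL 35160'],
})
df2 = pd.DataFrame({
    'School_Name': ['Pine View School', 'Tutwiler'],
    'Address': ['0 Mp 1361 Ak Hwy, Dot Lake, AK 99737',
                '1209 Fort Lashley Ave., Talladega, AL 35160'],
})

# keep only the rows of df1 whose Address does not appear in df2
print(df1[~df1.Address.isin(df2.Address)])  # the Tutwiler Prison row is dropped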

How to generate a new data frame by matching common values in this case?

I have two data frames like this:
data_2019_dict = {'state': ['Ohio', 'Texas', 'Pennsylvania', 'Nevada', 'New York', 'Nevada', 'Ohio', 'Virginia', 'Louisiana', 'Florida', 'Nevada'],
'industry': ['Agriculture', 'Agriculture', 'Agriculture', 'Agriculture', 'Medicine', 'Medicine', 'Medicine', 'Medicine', 'Manufacture', 'Manufacture', 'Manufacture'],
'value': [3.6, 3.2, 2.9, 2.4, 3.1, 1.5, 1.4, 0.9, 4.4, 2.0, 1.9]}
data_2020_dict = {'state': ['Kansas', 'Texas', 'California', 'Idaho', 'Nevada', 'Ohio', 'Virginia', 'Louisiana', 'Texas', 'Nevada'],
'industry': ['Agriculture', 'Agriculture', 'Agriculture', 'Medicine', 'Medicine', 'Finance', 'Finance', 'Manufacture', 'Manufacture', 'Manufacture'],
'value': [2.3, 1.8, 1.6, 7.2, 5.9, 4.1, 0.2, 5.1, 2.3, 2.2]}
data_2019 = pd.DataFrame(data_2019_dict)
data_2020 = pd.DataFrame(data_2020_dict)
Each data frame shows which states performed well in which industries in that year. What I want to generate, but am stuck on, is: for each state, which industries performed well in both years? The resulting data frame will look like this:
# Manually generated for illustration
data_both_dict = {'state': ['Ohio', 'Texas', 'Pennsylvania', 'Nevada', 'Nevada', 'New York', 'Virginia', 'Louisiana', 'Florida', 'Kansas', 'California', 'Idaho'],
'common_industry': ['', 'Agriculture', '', 'Medicine', 'Manufacture', '', '', 'Manufacture', '', '', '', ''],
'common_industry_count': [0, 1, 0, 2, 2, 0, 0, 1, 0, 0, 0, 0]
}
data_both = pd.DataFrame(data_both_dict)
First use DataFrame.merge to keep rows common to both frames (matching on both columns), rename the column, and add counts with Series.value_counts and Series.map:
df = (data_2019.merge(data_2020, on=['state','industry'])
.rename(columns={'industry':'common_industry'}))
df['common_industry_count'] = df['state'].map(df['state'].value_counts())
df = df[['state','common_industry','common_industry_count']]
print (df)
state common_industry common_industry_count
0 Texas Agriculture 1
1 Nevada Medicine 2
2 Louisiana Manufacture 1
3 Nevada Manufacture 2
Then get all states with concat, remove duplicates with Series.drop_duplicates, and convert to a one-column DataFrame with Series.to_frame:
both = pd.concat([data_2019['state'], data_2020['state']]).drop_duplicates().to_frame()
print (both)
state
0 Ohio
1 Texas
2 Pennsylvania
3 Nevada
4 New York
7 Virginia
8 Louisiana
9 Florida
0 Kansas
2 California
3 Idaho
Finally, merge with a left join and replace missing values with Series.fillna:
df = both.merge(df, how='left')
df['common_industry_count'] = df['common_industry_count'].fillna(0).astype(int)
df['common_industry'] = df['common_industry'].fillna('')
print (df)
state common_industry common_industry_count
0 Ohio 0
1 Texas Agriculture 1
2 Pennsylvania 0
3 Nevada Medicine 2
4 Nevada Manufacture 2
5 New York 0
6 Virginia 0
7 Louisiana Manufacture 1
8 Florida 0
9 Kansas 0
10 California 0
11 Idaho 0

Is there a way to bin categorical data in pandas?

I've got a dataframe where one column is U.S. states. I'd like to create a new column and bin the states according to region, i.e., South, Southwest, etc. It looks like pd.cut is only used for continuous variables, so binning that way doesn't seem like an option. Is there a good way to create a column that's conditional on categorical data in another column?
import pandas as pd
def label_states(row):
    if row['state'] in ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island', 'Connecticut', 'New York', 'Pennsylvania', 'New Jersey']:
        return 'north-east'
    if row['state'] in ['Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio', 'North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri']:
        return 'midwest'
    if row['state'] in ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky', 'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas', 'Louisiana']:
        return 'south'
    return 'etc'

df = pd.DataFrame([{'state': "Illinois", 'data': "aaa"}, {'state': "Rhode Island", 'data': "aba"}, {'state': "Georgia", 'data': "aba"}, {'state': "Iowa", 'data': "aba"}, {'state': "Connecticut", 'data': "bbb"}, {'state': "Ohio", 'data': "bbb"}])
df['label'] = df.apply(label_states, axis=1)
df
Assume that your df contains:
State - US state code.
other columns; for the test (see below) I included only State Name.
Of course it can contain more columns and more than one row for each state.
To add region names (a new column), define regions DataFrame,
containing columns:
State - US state code.
Region - Region name.
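For example, a minimal regions frame could be built like this (a sketch; the codes and labels shown are illustrative, not a standard classification):
import pandas as pd

regions = pd.DataFrame({
    'State':  ['AL', 'AZ', 'AR', 'CA', 'CO', 'CT'],   # extend to all states
    'Region': ['Southeast', 'Southwest', 'South', 'West', 'Southwest', 'Northeast'],
})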
Then merge these DataFrames and save the result back under df:
df = df.merge(regions, on='State')
A part of the result is:
State Name State Region
0 Alabama AL Southeast
1 Arizona AZ Southwest
2 Arkansas AR South
3 California CA West
4 Colorado CO Southwest
5 Connecticut CT Northeast
6 Delaware DE Northeast
7 Florida FL Southeast
8 Georgia GA Southeast
9 Idaho ID Northwest
10 Illinois IL Central
11 Indiana IN Central
12 Iowa IA East North Central
13 Kansas KS South
14 Kentucky KY Central
15 Louisiana LA South
Of course, there are numerous variants of how to assign US states to regions,
so if you want to use another variant, define the regions DataFrame according
to your classification.

New column in pandas dataframe based on existing column values

I have a dataframe with a column named 'States' that lists various U.S. states. I need to create another column with a region specifier like 'Atlantic Coast'. I have lists of the states that belong to various regions, so if the state in df['States'] matches a state in the list Atlantic_states, the specifier 'Atlantic Coast' should be inserted into the new column df['region specifier']. My code below shows the list I want to compare my dataframe values against and the output of the df['States'] column.
#list of states
Atlantic_states = ['Virginia',
'Massachusetts',
'Maine',
'New York',
'Rhode Island',
'Connecticut',
'New Hampshire',
'Maryland',
'Delaware',
'New Jersey',
'North Carolina',
'South Carolina',
'Georgia',
'Florida']
print(df['States'])
Out:
States
0 Virginia
1 Massachusetts
2 Maine
3 New York
4 Rhode Island
5 Connecticut
6 New Hampshire
7 Maryland
8 Delaware
9 New Jersey
10 North Carolina
11 South Carolina
12 Georgia
13 Florida
14 Wisconsin
15 Michigan
16 Ohio
17 Pennsylvania
18 Illinois
19 Indiana
20 Minnesota
21 New York
22 Washington
23 Oregon
24 California
Whilst Andy's answer works, it is not the most efficient way of doing this. There is a handy method that can be called on almost all pandas Series-like objects: .isin(). It accepts lists, dicts and pandas Series.
import numpy as np
import pandas as pd

df = pd.DataFrame(['Virginia','Massachusetts','Maine','New York','Rhode Island',
'Connecticut','New Hampshire','Maryland', 'Delaware',
'New Jersey','North Carolina', 'South Carolina','Georgia','Florida',
'Wisconsin','Michigan', 'Ohio','Pennsylvania','Illinois',
'Indiana','Minnesota','New York','Washington','Oregon',
'California'],
columns=['States'])
Atlantic_states = ['Virginia', 'Massachusetts', 'Maine', 'New York','Rhode Island',
'Connecticut', 'New Hampshire', 'Maryland', 'Delaware',
'New Jersey', 'North Carolina', 'South Carolina', 'Georgia',
'Florida']
df['Coast'] = np.where(df['States'].isin(Atlantic_states), 'Atlantic Coast',
'Unknown')
df.head()
Out[1]:
States Coast
0 Virginia Atlantic Coast
1 Massachusetts Atlantic Coast
2 Maine Atlantic Coast
3 New York Atlantic Coast
4 Rhode Island Atlantic Coast
Benchmarks
Here are some timings for mapping the first 10 letters of the alphabet to some random int numbers:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(low=0, high=26, size=(1000000,1)),
columns=['numbers'])
letters = dict(zip(list(range(0, 10)), [i for i in 'abcdefghij']))
for apply
%%timeit
def is_atlantic(state):
    return True if state in letters else False
df.numbers.apply(is_atlantic)
Out[]: 1 loops, best of 3: 432 ms per loop
Now for map as suggested by JohnE
%%timeit
df.numbers.map(letters)
Out[]: 10 loops, best of 3: 56.9 ms per loop
and finally for isin (also suggested by Nickil Maveli)
%%timeit
df.numbers.isin(letters)
Out[]: 10 loops, best of 3: 20.9 ms per loop
So we see that .isin() is much quicker than .apply() and nearly three times as quick as .map().
Note: apply and isin just return boolean masks, while map fills in the desired strings. Even so, when assigning to another column, isin wins out, taking roughly two-thirds of the time map needs.
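For reference, when you actually want the filled strings rather than a mask, a map-based sketch on the States frame from above might look like this (region_of is a hypothetical dict; 'Unknown' is an assumed default):
region_of = {'Virginia': 'Atlantic Coast', 'Wisconsin': 'Midwest'}  # extend as needed
# states missing from the dict map to NaN, which fillna replaces with the default
df['Coast'] = df['States'].map(region_of).fillna('Unknown')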
You have a couple options. First, to directly answer the question as posed:
Option 1
Create a function that returns whether a state is in the Atlantic region or not
def is_atlantic(state):
    return "Atlantic" if state in Atlantic_states else "Unknown"
Now use .apply() and assign the result to your new column:
df['Region'] = df['State'].apply(is_atlantic)
This returns a data frame that looks like this:
State Region
0 Virginia Atlantic
1 Massachusetts Atlantic
2 Maine Atlantic
3 New York Atlantic
4 Rhode Island Atlantic
5 Connecticut Atlantic
6 New Hampshire Atlantic
7 Maryland Atlantic
8 Delaware Atlantic
9 New Jersey Atlantic
10 North Carolina Atlantic
11 South Carolina Atlantic
12 Georgia Atlantic
13 Florida Atlantic
14 Wisconsin Unknown
15 Michigan Unknown
16 Ohio Unknown
17 Pennsylvania Unknown
18 Illinois Unknown
19 Indiana Unknown
20 Minnesota Unknown
21 New York Atlantic
22 Washington Unknown
23 Oregon Unknown
24 California Unknown
Option 2
The first option gets cumbersome if you have multiple lists you want to check against. Instead of having multiple lists, I recommend creating a single dictionary with the State as the key and the region as the value. With only 50 values this should be easy enough to maintain.
regions = {
'Virginia': 'Atlantic',
'Massachusetts': 'Atlantic',
'Maine': 'Atlantic',
'New York': 'Atlantic',
'Rhode Island': 'Atlantic',
'Connecticut': 'Atlantic',
'New Hampshire': 'Atlantic',
'Maryland': 'Atlantic',
'Delaware': 'Atlantic',
'New Jersey': 'Atlantic',
'North Carolina': 'Atlantic',
'South Carolina': 'Atlantic',
'Georgia': 'Atlantic',
'Florida': 'Atlantic',
'Wisconsin': 'Midwest',
'Michigan': 'Midwest',
'Ohio': 'Midwest',
'Pennsylvania': 'Midwest',
'Illinois': 'Midwest',
'Indiana': 'Midwest',
'Minnesota': 'Midwest',
'Washington': 'West',
'Oregon': 'West',
'California': 'West'
}
You can use .apply() again, with a slightly modified function:
def get_region(state):
    return regions[state]

df['Region'] = df['State'].apply(get_region)
This time your dataframe looks like this:
State Region
0 Virginia Atlantic
1 Massachusetts Atlantic
2 Maine Atlantic
3 New York Atlantic
4 Rhode Island Atlantic
5 Connecticut Atlantic
6 New Hampshire Atlantic
7 Maryland Atlantic
8 Delaware Atlantic
9 New Jersey Atlantic
10 North Carolina Atlantic
11 South Carolina Atlantic
12 Georgia Atlantic
13 Florida Atlantic
14 Wisconsin Midwest
15 Michigan Midwest
16 Ohio Midwest
17 Pennsylvania Midwest
18 Illinois Midwest
19 Indiana Midwest
20 Minnesota Midwest
21 New York Atlantic
22 Washington West
23 Oregon West
24 California West
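One caveat worth noting: regions[state] raises a KeyError for any state missing from the dictionary. A safer sketch, assuming 'Unknown' as a fallback label, uses dict.get instead:
def get_region(state):
    # .get returns the fallback instead of raising KeyError
    return regions.get(state, 'Unknown')

df['Region'] = df['State'].apply(get_region)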
