I'm trying to check if the CSV file contains all the states from the U.S. As you can see in my code, I've imported the CSV file as a list in Python. I'm trying to solve this problem without using pandas or any other third-party module.
I've created a list of the states, but I'm wondering: what is the most efficient way to check how many states the CSV dataset contains?
import csv
with open('president_county_candidate.csv', newline='', encoding='utf_8') as file:
    reader = csv.reader(file)
    data = list(reader)
print(data)
[['state', 'county', 'candidate', 'party', 'votes'], ['Delaware', 'Kent County', 'Joe Biden', 'DEM', '44518'], ['Delaware', 'Kent County', 'Donald Trump', 'REP', '40976'], ['Delaware', 'Kent County', 'Jo Jorgensen', 'LIB', '1044'], ['Delaware', 'Kent County', 'Howie Hawkins', 'GRN', '420'], ['Delaware', 'Kent County', ' Write-ins', 'WRI', '0'], ['Delaware', 'New Castle County', 'Joe Biden', 'DEM', '194245'], ['Delaware', 'New Castle County', 'Donald Trump', 'REP', '87687'], ['Delaware', 'New Castle County', 'Jo Jorgensen', 'LIB', '2932'], ['Delaware', 'New Castle County', 'Howie Hawkins', 'GRN', '1277'], ['Delaware', 'New Castle County', ' Write-ins', 'WRI', '0'], ['Delaware', 'Sussex County', 'Donald Trump', 'REP', '71196'], ['Delaware', 'Sussex County', 'Joe Biden', 'DEM', '56657'], ['Delaware', 'Sussex County', 'Jo Jorgensen', 'LIB', '1003'], ['Delaware', 'Sussex County', 'Howie Hawkins', 'GRN', '437'], ['District of Columbia', 'District of Columbia', 'Joe Biden', 'DEM', '31723'], ['District of Columbia', 'District of Columbia', 'Donald Trump', 'REP', '1239'], ['District of Columbia', 'District of Columbia', ' Write-ins', 'WRI', '206'], ['District of Columbia', 'District of Columbia', 'Howie Hawkins', 'GRN', '192'], ['District of Columbia', 'District of Columbia', 'Jo Jorgensen', 'LIB', '147'], ['District of Columbia', 'District of Columbia', 'Gloria La Riva', 'PSL', '77'], ['District of Columbia', 'District of Columbia', 'Brock Pierce', 'IND', '28'], ['District of Columbia', 'Ward 2', 'Joe Biden', 'DEM', '25228'], ['District of Columbia', 'Ward 2', 'Donald Trump', 'REP', '2466'], ['District of Columbia', 'Ward 2', ' Write-ins', 'WRI', '298'], ['District of Columbia', 'Ward 2', 'Jo Jorgensen', 'LIB', '229'], ['District of Columbia', 'Ward 2', 'Howie Hawkins', 'GRN', '96'], ['District of Columbia', 'Ward 2', 'Gloria La Riva', 'PSL', '37'], ['District of Columbia', 'Ward 2', 'Brock Pierce', 'IND', '32']]
states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Idaho', 'Hawaii',
'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
'Maine', 'Maryland', 'Massachusetts','Michigan','Minnesota','Mississippi',
'Missouri','Montana','Nebraska','Nevada','New Hampshire','New Jersey','New Mexico',
'New York', 'North Carolina','North Dakota','Ohio','Oklahoma','Oregon',
'Pennsylvania','Rhode Island','South Carolina','South Dakota','Tennessee','Texas',
'Utah','Vermont','Virginia','Washington','West Virginia', 'Wisconsin','Wyoming']
If your objective is only to check whether the CSV file contains all the states from the U.S., then you can find the unique set of states in your file and make sure it counts exactly 50. (Note that your sample data also contains 'District of Columbia', which is not one of the 50 states, so a raw count can be inflated; intersecting with your states list is safer.)
number = len(set(record[0].lower() for record in data[1:]))
# Expected: number should be 50
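If the count comes up short, a set difference against your states list will also tell you which states are missing. A minimal sketch, assuming the states list defined above:
csv_states = set(record[0] for record in data[1:])  # data[1:] skips the header row
missing = set(states) - csv_states                  # states absent from the CSV
print("Missing states:", sorted(missing))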
This example will count every state found in your data list:
counter = {}
for state, *_ in data:
    if state in states:
        counter.setdefault(state, 0)
        counter[state] += 1

for state in states:
    print("{:<20} {}".format(state, counter.get(state, 0)))
print()
print("Total states found:", len(counter))
Prints:
Alabama              0
Alaska               0
Arizona              0
Arkansas             0
California           0
Colorado             0
Connecticut          0
Delaware             14
Florida              0
Georgia              0
Idaho                0
Hawaii               0
Illinois             0
Indiana              0
Iowa                 0
Kansas               0
Kentucky             0
Louisiana            0
Maine                0
Maryland             0
Massachusetts        0
Michigan             0
Minnesota            0
Mississippi          0
Missouri             0
Montana              0
Nebraska             0
Nevada               0
New Hampshire        0
New Jersey           0
New Mexico           0
New York             0
North Carolina       0
North Dakota         0
Ohio                 0
Oklahoma             0
Oregon               0
Pennsylvania         0
Rhode Island         0
South Carolina       0
South Dakota         0
Tennessee            0
Texas                0
Utah                 0
Vermont              0
Virginia             0
Washington           0
West Virginia        0
Wisconsin            0
Wyoming              0

Total states found: 1
P.S.: To speed this up, you can convert states from a list to a set beforehand, since membership tests on a set take constant time on average.
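For example, build the set once and test against it inside the loop:
states_set = set(states)

counter = {}
for state, *_ in data:
    # 'in' on a set is an O(1) average-case lookup, versus O(n) on a list
    if state in states_set:
        counter.setdefault(state, 0)
        counter[state] += 1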
First, a tip: it is probably easier to use csv.DictReader in this case, as it will give you labelled rows and automatically skip the header row. Not necessary, but it makes the code easier to read.
import csv
with open('test.csv') as f:
    data = list(csv.DictReader(f))
print(data)
# prints: [
# {'state': 'Delaware', ' county': ' Kent County', ' candidate': ' Joe Biden', ' party': ' DEM', ' votes': ' 44518'},
# {'state': 'Delaware', ' county': ' Kent County', ' candidate': ' Donald Trump', ' party': ' REP', ' votes': ' 40976'}
# ...
# ]
Then, you can use this expression to get all the states that are mentioned in the csv file:
states_in_csv = set(line['state'] for line in data)
print(states_in_csv)
# {'Delaware', 'District of Columbia', ... }
line['state'] for line in data is a generator expression that extracts just the 'state' field from each of those rows. set() builds the set of those states, i.e. removes all duplicates.
Then, you can easily test how many of your states are represented in your table. For example:
num_states = 0
for state in states:  # the list of all 50 state names from the question
    if state in states_in_csv:
        num_states += 1
print("number of states:", num_states)
This is very efficient because checking if a value is in a set is a constant time operation, so you don't have to search your whole table for each state.
It looks like your states list contains every state. If you just want to know how many states were in the table, you can simply use len(states_in_csv)
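For example, assuming the 50-entry states list from the question, a set intersection gives the count directly:
num_states = len(set(states) & states_in_csv)  # states present in both
print("number of states:", num_states)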
You can use a hash map, or in Python's case a dictionary, which is an efficient data structure for this job. This snippet can help you:
counts = {}  # avoid naming this 'dict', which would shadow the built-in
for row in data:
    # verify that the state exists in the reference list
    if row[0] in states:
        if row[0] in counts:
            counts[row[0]] += 1
        else:
            counts[row[0]] = 1

for state, count in counts.items():
    if count > 1:
        print(f"The state {state} appears {count} times")
import csv
with open('president_county_candidate.csv', newline='', encoding='utf_8') as file:
    reader = csv.reader(file)
    data = list(reader)
states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Idaho', 'Hawaii',
'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
'Maine', 'Maryland', 'Massachusetts','Michigan','Minnesota','Mississippi',
'Missouri','Montana','Nebraska','Nevada','New Hampshire','New Jersey','New Mexico',
'New York', 'North Carolina','North Dakota','Ohio','Oklahoma','Oregon',
'Pennsylvania','Rhode Island','South Carolina','South Dakota','Tennessee','Texas',
'Utah','Vermont','Virginia','Washington','West Virginia', 'Wisconsin','Wyoming']
for row in data:
    for name in states:
        if row[0] == name:  # compare the state column, not the whole row
            print(row[0])
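The inner loop is not needed, though: membership can be tested directly, and converting states to a set first makes each test constant time on average. A minimal sketch:
states_set = set(states)
for row in data:
    if row[0] in states_set:
        print(row[0])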
Related
I have one type of Excel file with school data such as address, school name, principal's name, etc., and a second type of Excel file with address, school name, rating, telephone number, etc. The question is: how can I delete particular rows in the first Excel file based on the addresses in the second?
first excel file:
Unnamed: 0 School Address
0 0 Alabama School For Deaf 205 E South Street, Talladega, AL 35160
1 1 Helen Keller School 1101 Fort Lashley Avenue, Talladega, AL 35160
2 2 Tutwiler Prison 1209 Fort Lashley Ave., Talladega, AL 35160
3 3 Alabama School Of Fine Arts 8966 Us Hwy 231 N, Wetumpka, AL 36092
second:
School_Name ... Address
0 Pine View School ... 0 Mp 1361 Ak Hwy, Dot Lake, AK 99737
1 A.D. Henderson University School ... 1 168 3Rd Avenue, Eagle, AK 99738
2 School For Advanced Studies - South ... 2 249 Jon Summar Way, Tok, AK 99780
3 Tutwiler 3 1209 Fort Lashley Ave., Talladega, AL 35160
the output must be:
Unnamed: 0 School Address
0 0 Alabama School For Deaf 205 E South Street, Talladega, AL 35160
1 1 Helen Keller School 1101 Fort Lashley Avenue, Talladega, AL 35160
3 3 Alabama School Of Fine Arts 8966 Us Hwy 231 N, Wetumpka, AL 36092
I tried to use a for loop and pandas:
import pandas as pd
from pandas import ExcelWriter
writer = pd.ExcelWriter('US1234.xlsx', engine='xlsxwriter')
data = []
data_schools = []
df = pd.read_excel('DZ13288pubprin.xlsx')
lists = [[] for i in range(2)]
states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY',
'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH',
'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']
print(len(states))
def checking_top_100(nameofschool):
    for i in states:
        df2 = pd.read_excel('TOP-100.xlsx', sheet_name=[i])
        for a in df2[i]['SchoolName']:
            if nameofschool in a:
                pass
            else:
                return nameofschool

def sort_by_value(state, index):
    for i in range(len(df.SchoolName)):
        if df.LocationState[i] == state:
            # print(df.SchoolName[i])
            school_name = checking_top_100(df.SchoolName[i])
            lists[index].append(school_name)
            lists[index].append(
                df.LocationAddress[i] + ', ' + df.LocationCity[i] + ', ' + df.LocationState[i] + ' ' + df.LocationZip[i])
            # lists[index].append(df.EmailAddress[i])
    print(lists[index][0::2])

def data_to_excel(state, index):
    dfi = pd.DataFrame({
        'SchoolName': lists[index][0::2],
        # 'Principal Name': lists[index][1::3],
        # 'Email Address': lists[index][2::3],
        'Address': lists[index][1::2]
    })
    dfi.to_excel(writer, sheet_name=state)

# checking_top_100()
for i in range(len(states)):
    sort_by_value(states[i], i)
    data_to_excel(states[i], i)
writer.save()
I suggest you take a look at pandas.DataFrame.isin (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html). It returns a boolean array (True or False) depending on whether or not each address is found in the second dataframe, so you can then use boolean indexing to keep only the rows whose address is not found there.
In other words, you could do something like:
dataframe1[dataframe1.Address.isin(dataframe2.Address) == False]
This should give you the result you want.
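As a minimal sketch, with hypothetical frames shaped like the question's data (~mask is the idiomatic pandas spelling of mask == False):
import pandas as pd

# Hypothetical frames mirroring the two Excel files from the question
dataframe1 = pd.DataFrame({
    'School': ['Alabama School For Deaf', 'Helen Keller School', 'Tutwiler Prison'],
    'Address': ['205 E South Street, Talladega, AL 35160',
                '1101 Fort Lashley Avenue, Talladega, AL 35160',
                '1209 Fort Lashley Ave., Talladega, AL 35160'],
})
dataframe2 = pd.DataFrame({
    'School_Name': ['Tutwiler'],
    'Address': ['1209 Fort Lashley Ave., Talladega, AL 35160'],
})

# keep only the rows whose Address does not appear in dataframe2
print(dataframe1[~dataframe1.Address.isin(dataframe2.Address)])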
from bs4 import BeautifulSoup
from urllib.request import urlopen
webpage = urlopen('https://en.wikipedia.org/wiki/List_of_largest_banks')
bs = BeautifulSoup(webpage,'html.parser')
print(bs)
spanList= bs.find_all('span',{'class':'flagicon'})
for span in spanList:
    print(span.a['title'])
It prints the list of countries in the first table, but after printing it gives this error:
Traceback (most recent call last):
File "C:/Users/Jegathesan/Desktop/python programmes/scrape5.py", line 10, in <module>
print(span.a['title'])
TypeError: 'NoneType' object is not subscriptable
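The error most likely means that one of the matched spans has no <a> child, so span.a is None and subscripting it fails. A defensive check, as a sketch:
for span in spanList:
    if span.a is not None:  # skip flag icons that carry no link
        print(span.a['title'])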
import pandas as pd
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_largest_banks")
for item in range(4):
    goal = df[item].iloc[:, 1].values.tolist()
    print(goal)
    print("*" * 100)
Output:
['Industrial and Commercial Bank of China', 'China Construction Bank', 'Agricultural Bank of China', 'Bank of China', 'Mitsubishi UFJ Financial Group', 'JPMorgan Chase', 'HSBC Holdings PLC', 'Bank of America', 'BNP Paribas', 'Crédit Agricole', 'Citigroup Inc.', 'Japan Post Bank', 'Wells Fargo', 'Sumitomo Mitsui Financial Group', 'Mizuho Financial Group', 'Banco Santander', 'Deutsche Bank', 'Société Générale', 'Groupe BPCE', 'Barclays', 'Bank of Communications', 'Postal Savings Bank of China', 'Royal Bank of Canada', 'Lloyds Banking Group', 'ING Group', 'Toronto-Dominion Bank', 'China Merchants Bank', 'Crédit Mutuel', 'Norinchukin Bank', 'UBS', 'Industrial Bank (China)', 'UniCredit', 'Goldman Sachs', 'Shanghai Pudong Development Bank', 'Intesa Sanpaolo', 'Royal Bank of Scotland Group', 'China CITIC Bank', 'China Minsheng Bank', 'Morgan Stanley', 'Scotiabank', 'Credit Suisse', 'Banco Bilbao Vizcaya Argentaria', 'Commonwealth Bank', 'Standard Chartered', 'Australia and New Zealand Banking Group', 'Rabobank', 'Nordea', 'Westpac', 'China Everbright Bank', 'Bank of Montreal', 'DZ Bank', 'National Australia Bank', 'Danske Bank', 'State Bank of India', 'Resona Holdings', 'Commerzbank', 'Sumitomo Mitsui Trust Holdings', 'Ping An Bank', 'Canadian Imperial Bank of Commerce', 'U.S. Bancorp', 'CaixaBank', 'Truist Financial', 'ABN AMRO Group', 'KB Financial Group Inc', 'Shinhan Bank', 'Sberbank of Russia', 'Nomura Holdings', 'DBS Bank', 'Itaú Unibanco', 'PNC Financial Services', 'Huaxia Bank', 'Nonghyup Bank', 'Capital One', 'Bank of Beijing', 'The Bank of New York Mellon', 'Banco do Brasil', 'Hana Financial Group', 'OCBC Bank', 'Banco Bradesco', 'Handelsbanken', 'Caixa Econômica Federal', 'KBC Bank', 'China Guangfa Bank', 'Nationwide Building Society', 'Woori Bank', 'DNB ASA', 'SEB Group', 'Bank of Shanghai', 'United Overseas Bank', 'Bank of Jiangsu', 'La Banque postale', 'Landesbank Baden-Württemberg', 'Erste Group', 'Industrial Bank of Korea', 'Qatar National Bank', 'Banco Sabadell', 'Swedbank', 'BayernLB', 'State Street Corporation', 'China Zheshang Bank', 'Bankia']
****************************************************************************************************
['China', 'United States', 'Japan', 'France', 'South Korea', 'United Kingdom', 'Canada', 'Germany', 'Spain', 'Australia', 'Brazil', 'Netherlands', 'Singapore', 'Sweden', 'Italy', 'Switzerland', 'Austria', 'Belgium', 'Denmark', 'Finland', 'India', 'Luxembourg', 'Norway', 'Russia']
****************************************************************************************************
['JPMorgan Chase', 'Industrial and Commercial Bank of China', 'Bank of America', 'Wells Fargo', 'China Construction Bank', 'HSBC Holdings PLC', 'Agricultural Bank of China', 'Citigroup Inc.', 'Bank of China', 'China Merchants Bank', 'Royal Bank of Canada', 'Banco Santander', 'Commonwealth Bank', 'Mitsubishi UFJ Financial Group', 'Toronto-Dominion Bank', 'BNP Paribas', 'Goldman Sachs', 'Sberbank of Russia', 'Morgan Stanley', 'U.S. Bancorp', 'HDFC Bank', 'Itaú Unibanco', 'Westpac', 'Scotiabank', 'ING Group', 'UBS', 'Charles Schwab', 'PNC Financial Services', 'Lloyds Banking Group', 'Sumitomo Mitsui Financial Group', 'Bank of Communications', 'Australia and New Zealand Banking Group', 'Banco Bradesco', 'National Australia Bank', 'Intesa Sanpaolo', 'Banco Bilbao Vizcaya Argentaria', 'Japan Post Bank', 'The Bank of New York Mellon', 'Shanghai Pudong Development Bank', 'Industrial Bank (China)', 'Bank of China (Hong Kong)', 'Bank of Montreal', 'Crédit Agricole', 'DBS Bank', 'Nordea', 'Capital One', 'Royal Bank of Scotland Group', 'Mizuho Financial Group', 'Credit Suisse', 'Postal Savings Bank of China', 'China Minsheng Bank', 'UniCredit', 'China CITIC Bank', 'Hang Seng Bank', 'Société Générale', 'Barclays', 'Canadian Imperial Bank of Commerce', 'Bank Central Asia', 'Truist Financial', 'Oversea-Chinese Banking Corp', 'State Bank of India', 'State Street', 'Deutsche Bank', 'KBC Bank', 'Danske Bank', 'Ping An Bank', 'Standard Chartered', 'United Overseas Bank', 'QNB Group', 'Bank Rakyat']
****************************************************************************************************
['United States', 'China', 'United Kingdom', 'Canada', 'Australia', 'Japan', 'France', 'Spain', 'Brazil', 'India', 'Singapore', 'Switzerland', 'Italy', 'Hong Kong', 'Indonesia', 'Russia', 'Netherlands']
****************************************************************************************************
Your initial code was searching for the span tags across the whole parsed HTML. The modified code below collects all the table tags found in the HTML into a list, then uses your original statement to get the span tags for one specific table (i.e. the first table).
from bs4 import BeautifulSoup
from urllib.request import urlopen
webpage = urlopen('https://en.wikipedia.org/wiki/List_of_largest_banks')
bs = BeautifulSoup(webpage,'html.parser')
# tableList extracts all "table" elements into a list
tableList = bs.find_all('table')
# spanList accesses table [index] in tableList and finds all spans;
# to access another table, change the list index
spanList = tableList[0].find_all('span', {'class': 'flagicon'})
for span in spanList:
    try:
        print(span.a['title'])
    except TypeError:  # span.a is None when the span contains no link
        print("title tag is not found.")
I have two data frames like this:
import pandas as pd

data_2019_dict = {'state': ['Ohio', 'Texas', 'Pennsylvania', 'Nevada', 'New York', 'Nevada', 'Ohio', 'Virginia', 'Louisiana', 'Florida', 'Nevada'],
'industry': ['Agriculture', 'Agriculture', 'Agriculture', 'Agriculture', 'Medicine', 'Medicine', 'Medicine', 'Medicine', 'Manufacture', 'Manufacture', 'Manufacture'],
'value': [3.6, 3.2, 2.9, 2.4, 3.1, 1.5, 1.4, 0.9, 4.4, 2.0, 1.9]}
data_2020_dict = {'state': ['Kansas', 'Texas', 'California', 'Idaho', 'Nevada', 'Ohio', 'Virginia', 'Louisiana', 'Texas', 'Nevada'],
'industry': ['Agriculture', 'Agriculture', 'Agriculture', 'Medicine', 'Medicine', 'Finance', 'Finance', 'Manufacture', 'Manufacture', 'Manufacture'],
'value': [2.3, 1.8, 1.6, 7.2, 5.9, 4.1, 0.2, 5.1, 2.3, 2.2]}
data_2019 = pd.DataFrame(data_2019_dict)
data_2020 = pd.DataFrame(data_2020_dict)
Each data frame shows which states performed well in which industries in a given year. What I want to generate, but am stuck on, is: for each state, which industries performed well in both years? The resulting data frame will look like this:
# Manually generated for illustration
data_both_dict = {'state': ['Ohio', 'Texas', 'Pennsylvania', 'Nevada', 'Nevada', 'New York', 'Virginia', 'Louisiana', 'Florida', 'Kansas', 'California', 'Idaho'],
'common_industry': ['', 'Agriculture', '', 'Medicine', 'Manufacture', '', '', 'Manufacture', '', '', '', ''],
'common_industry_count': [0, 1, 0, 2, 2, 0, 0, 1, 0, 0, 0, 0]
}
data_both = pd.DataFrame(data_both_dict)
First, use DataFrame.merge for common rows by both columns, rename the column, and add counts with Series.value_counts and Series.map:
df = (data_2019.merge(data_2020, on=['state','industry'])
.rename(columns={'industry':'common_industry'}))
df['common_industry_count'] = df['state'].map(df['state'].value_counts())
df = df[['state','common_industry','common_industry_count']]
print (df)
state common_industry common_industry_count
0 Texas Agriculture 1
1 Nevada Medicine 2
2 Louisiana Manufacture 1
3 Nevada Manufacture 2
Then get all states by concat, with duplicates removed by Series.drop_duplicates, converted to a one-column DataFrame by Series.to_frame:
both = pd.concat([data_2019['state'], data_2020['state']]).drop_duplicates().to_frame()
print (both)
state
0 Ohio
1 Texas
2 Pennsylvania
3 Nevada
4 New York
7 Virginia
8 Louisiana
9 Florida
0 Kansas
2 California
3 Idaho
Last, merge with a left join and replace missing values with Series.fillna:
df = both.merge(df, how='left')
df['common_industry_count'] = df['common_industry_count'].fillna(0).astype(int)
df['common_industry'] = df['common_industry'].fillna('')
print (df)
           state common_industry  common_industry_count
0           Ohio                                      0
1          Texas     Agriculture                      1
2   Pennsylvania                                      0
3         Nevada        Medicine                      2
4         Nevada     Manufacture                      2
5       New York                                      0
6       Virginia                                      0
7      Louisiana     Manufacture                      1
8        Florida                                      0
9         Kansas                                      0
10    California                                      0
11         Idaho                                      0
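Putting the steps together, the whole pipeline as one compact sketch:
common = (data_2019.merge(data_2020, on=['state', 'industry'])
                   .rename(columns={'industry': 'common_industry'}))
common['common_industry_count'] = common['state'].map(common['state'].value_counts())

all_states = pd.concat([data_2019['state'], data_2020['state']]).drop_duplicates().to_frame()

data_both = all_states.merge(common[['state', 'common_industry', 'common_industry_count']], how='left')
data_both['common_industry_count'] = data_both['common_industry_count'].fillna(0).astype(int)
data_both['common_industry'] = data_both['common_industry'].fillna('')
print(data_both)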
I've got a dataframe where one column is U.S. states. I'd like to create a new column and bin the states according to region, i.e., South, Southwest, etc. It looks like pd.cut is only used for continuous variables, so binning that way doesn't seem like an option. Is there a good way to create a column that's conditional on categorical data in another column?
import pandas as pd
def label_states(row):
    if row['state'] in ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island', 'Connecticut', 'New York', 'Pennsylvania', 'New Jersey']:
        return 'north-east'
    if row['state'] in ['Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio', 'North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri']:
        return 'midwest'
    if row['state'] in ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky', 'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas', 'Louisiana']:
        return 'south'
    return 'etc'
df = pd.DataFrame([{'state':"Illinois", 'data':"aaa"}, {'state':"Rhode Island",'data':"aba"}, {'state':"Georgia",'data':"aba"}, {'state':"Iowa",'data':"aba"}, {'state':"Connecticut",'data':"bbb"}, {'state':"Ohio",'data':"bbb"}])
df['label'] = df.apply(lambda row: label_states(row), axis=1)
df
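Since DataFrame.apply already passes each row to the function, the lambda can be dropped:
df['label'] = df.apply(label_states, axis=1)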
Assume that your df contains:
State - US state code,
and other columns; for the test (see below) I included only State Name.
Of course, it can contain more columns and more than one row for each state.
To add region names (a new column), define regions DataFrame,
containing columns:
State - US state code.
Region - Region name.
Then merge these DataFrames and save the result back under df:
df = df.merge(regions, on='State')
A part of the result is:
     State Name State              Region
0       Alabama    AL           Southeast
1       Arizona    AZ           Southwest
2      Arkansas    AR               South
3    California    CA                West
4      Colorado    CO           Southwest
5   Connecticut    CT           Northeast
6      Delaware    DE           Northeast
7       Florida    FL           Southeast
8       Georgia    GA           Southeast
9         Idaho    ID           Northwest
10     Illinois    IL             Central
11      Indiana    IN             Central
12         Iowa    IA  East North Central
13       Kansas    KS               South
14     Kentucky    KY             Central
15    Louisiana    LA               South
Of course, there are numerous variants of how to assign US states to regions, so if you want to use another variant, define the regions DataFrame according to your classification.
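A minimal sketch of such a regions DataFrame, with illustrative assignments matching the sample above:
import pandas as pd

# illustrative region mapping; extend to all states and adjust to your own classification
regions = pd.DataFrame({
    'State':  ['AL', 'AZ', 'AR', 'CA', 'CO', 'CT'],
    'Region': ['Southeast', 'Southwest', 'South', 'West', 'Southwest', 'Northeast'],
})

df = df.merge(regions, on='State')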
My CSV writer currently does not produce output row by row; it just jumbles everything up. Any help would be great. Basically, I need a CSV with the 4 lines in the yields section below in one column.
tweets_df=tweets_df.dropna()
for i in tweets_df.ix[:,0]:
    if regex_getter(i) != None:
        print(regex_getter(i))
yields
Burlington, VT
Minneapolis, MN
Bloomington, IN
Irvine, CA
with open('Bernie.csv', 'w') as mycsvfile:
    for i in tweets_df.ix[:,0]:
        if regex_getter(i) != None:
            row = regex_getter(i)
            writer.writerow([i])
import re

def regex_getter(entry):
    txt = entry
    re1='((?:[a-z][a-z]+))' # Word 1
    re2='(,)' # Any Single Character 1
    re3='(\\s+)' # White Space 1
    re4='((?:(?:AL)|(?:AK)|(?:AS)|(?:AZ)|(?:AR)|(?:CA)|(?:CO)|(?:CT)|(?:DE)|(?:DC)|(?:FM)|(?:FL)|(?:GA)|(?:GU)|(?:HI)|(?:ID)|(?:IL)|(?:IN)|(?:IA)|(?:KS)|(?:KY)|(?:LA)|(?:ME)|(?:MH)|(?:MD)|(?:MA)|(?:MI)|(?:MN)|(?:MS)|(?:MO)|(?:MT)|(?:NE)|(?:NV)|(?:NH)|(?:NJ)|(?:NM)|(?:NY)|(?:NC)|(?:ND)|(?:MP)|(?:OH)|(?:OK)|(?:OR)|(?:PW)|(?:PA)|(?:PR)|(?:RI)|(?:SC)|(?:SD)|(?:TN)|(?:TX)|(?:UT)|(?:VT)|(?:VI)|(?:VA)|(?:WA)|(?:WV)|(?:WI)|(?:WY)))(?![a-z])' # US State 1
    rg = re.compile(re1+re2+re3+re4,re.IGNORECASE|re.DOTALL)
    m = rg.search(txt)
    if m:
        word1=m.group(1)
        c1=m.group(2)
        ws1=m.group(3)
        usstate1=m.group(4)
        return str((word1 + c1 +ws1 + usstate1))
This is what my data looks like without the regex method; the method basically filters out all data that is not in City, State format. It excludes everything not like Raleigh, NC, for example.
for i in tweets_df.ix[:,0]:
    print(i)
Indiana, USA
Burlington, VT
United States
Saint Paul - Minneapolis, MN
Inland Valley, The Pass, S. CA
In the Dreamatorium
Nova Scotia;Canada
North Carolina, USA
INTP. West Michigan
Los Angeles, California
Waterbury Connecticut
Right side of the tracks
I would do it this way:
states = {
'AK': 'Alaska',
'AL': 'Alabama',
'AR': 'Arkansas',
'AS': 'American Samoa',
'AZ': 'Arizona',
'CA': 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DC': 'District of Columbia',
'DE': 'Delaware',
'FL': 'Florida',
'GA': 'Georgia',
'GU': 'Guam',
'HI': 'Hawaii',
'IA': 'Iowa',
'ID': 'Idaho',
'IL': 'Illinois',
'IN': 'Indiana',
'KS': 'Kansas',
'KY': 'Kentucky',
'LA': 'Louisiana',
'MA': 'Massachusetts',
'MD': 'Maryland',
'ME': 'Maine',
'MI': 'Michigan',
'MN': 'Minnesota',
'MO': 'Missouri',
'MP': 'Northern Mariana Islands',
'MS': 'Mississippi',
'MT': 'Montana',
'NA': 'National',
'NC': 'North Carolina',
'ND': 'North Dakota',
'NE': 'Nebraska',
'NH': 'New Hampshire',
'NJ': 'New Jersey',
'NM': 'New Mexico',
'NV': 'Nevada',
'NY': 'New York',
'OH': 'Ohio',
'OK': 'Oklahoma',
'OR': 'Oregon',
'PA': 'Pennsylvania',
'PR': 'Puerto Rico',
'RI': 'Rhode Island',
'SC': 'South Carolina',
'SD': 'South Dakota',
'TN': 'Tennessee',
'TX': 'Texas',
'UT': 'Utah',
'VA': 'Virginia',
'VI': 'Virgin Islands',
'VT': 'Vermont',
'WA': 'Washington',
'WI': 'Wisconsin',
'WV': 'West Virginia',
'WY': 'Wyoming'
}
import io
import pandas as pd

# sample DF
data = """\
location
Indiana, USA
Burlington, VT
United States
Saint Paul - Minneapolis, MN
Inland Valley, The Pass, S. CA
In the Dreamatorium
Nova Scotia;Canada
North Carolina, USA
INTP. West Michigan
Los Angeles, California
Waterbury Connecticut
Right side of the tracks
"""
df = pd.read_csv(io.StringIO(data), sep=r'\|')
re_states = r'.*,\s*(?:' + '|'.join(states.keys()) + ')'
df.loc[df.location.str.contains(re_states), 'location'].to_csv('filtered.csv', index=False)
Explanation:
In [3]: df
Out[3]:
location
0 Indiana, USA
1 Burlington, VT
2 United States
3 Saint Paul - Minneapolis, MN
4 Inland Valley, The Pass, S. CA
5 In the Dreamatorium
6 Nova Scotia;Canada
7 North Carolina, USA
8 INTP. West Michigan
9 Los Angeles, California
10 Waterbury Connecticut
11 Right side of the tracks
generated RegEx:
In [9]: re_states
Out[9]: '.*,\\s*(?:VA|AK|ND|CA|CO|AR|MD|DC|KY|LA|OR|VT|IL|CT|OH|GA|WA|AS|NC|MN|NH|ID|HI|NA|MA|MS|WV|VI|FL|MO|MI|AL|ME|GU|NM|SD|WY|AZ|MP|DE|RI|PA|NJ|WI|OK|TN|TX|KS|IN|NV|NY|NE|PR|UT|IA|MT|SC)'
Search mask:
In [10]: df.location.str.contains(re_states)
Out[10]:
0 False
1 True
2 False
3 True
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
Name: location, dtype: bool
Filtered DF:
In [11]: df.loc[df.location.str.contains(re_states)]
Out[11]:
location
1 Burlington, VT
3 Saint Paul - Minneapolis, MN
Now just spool it to CSV:
df.loc[df.location.str.contains(re_states), 'location'].to_csv('d:/temp/filtered.csv', index=False)
filtered.csv:
"Burlington, VT"
"Saint Paul - Minneapolis, MN"
UPDATE:
starting from Pandas 0.20.1 the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.
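For the snippets above, that means, for example (a sketch, assuming the tweets sit in the first column of tweets_df):
for i in tweets_df.iloc[:, 0]:  # positional selection; replaces the old tweets_df.ix[:, 0]
    print(i)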