Extract string from column following a specific pattern - python

Please forgive my pandas newbie question, but I have a column of U.S. towns and states, such as the truncated version shown below. (For some strange reason, the column is named 'Alabama[edit]', which is the state associated with town values 0-7 in the column.)
0 Auburn (Auburn University)[1]
1 Florence (University of North Alabama)
2 Jacksonville (Jacksonville State University)[2]
3 Livingston (University of West Alabama)[2]
4 Montevallo (University of Montevallo)[2]
5 Troy (Troy University)[2]
6 Tuscaloosa (University of Alabama, Stillman Co...
7 Tuskegee (Tuskegee University)[5]
8 Alaska[edit]
9 Fairbanks (University of Alaska Fairbanks)[2]
10 Arizona[edit]
11 Flagstaff (Northern Arizona University)[6]
12 Tempe (Arizona State University)
13 Tucson (University of Arizona)
14 Arkansas[edit]
15 Arkadelphia (Henderson State University, Ouach...
16 Conway (Central Baptist College, Hendrix Colle...
17 Fayetteville (University of Arkansas)[7]
18 Jonesboro (Arkansas State University)[8]
19 Magnolia (Southern Arkansas University)[2]
20 Monticello (University of Arkansas at Monticel...
21 Russellville (Arkansas Tech University)[2]
22 Searcy (Harding University)[5]
23 California[edit]
The towns that are in each state are below each state name, e.g. Fairbanks (column value 9) is a town in the state of Alaska.
What I want to do is to split up the town names based on the state names so that I have two columns 'State' and 'RegionName' where each state name is associated with each town name, like so:
RegionName State
0 Auburn (Auburn University)[1] Alabama
1 Florence (University of North Alabama) Alabama
2 Jacksonville (Jacksonville State University)[2] Alabama
3 Livingston (University of West Alabama)[2] Alabama
4 Montevallo (University of Montevallo)[2] Alabama
5 Troy (Troy University)[2] Alabama
6 Tuscaloosa (University of Alabama, Stillman Co... Alabama
7 Tuskegee (Tuskegee University)[5] Alabama
8 Fairbanks (University of Alaska Fairbanks)[2] Alaska
9 Flagstaff (Northern Arizona University)[6] Arizona
10 Tempe (Arizona State University) Arizona
11 Tucson (University of Arizona) Arizona
12 Arkadelphia (Henderson State University, Ouach... Arkansas
. . .etc.
I know that each state name is followed by a string '[edit]', which I assume I can use to do the split and assignment of the town names. But I don't know how to do this.
Also, I know that there's a lot of other data cleaning I need to do, such as removing the strings within parentheses and within the brackets '[]'. That can be done later; the important part is splitting up the states and towns and assigning each town to its proper U.S. state. Any advice would be most appreciated.

Without much context or access to your data, I'd suggest something along these lines. First, modify the code that reads your data:
df = pd.read_csv(..., header=None, names=['RegionName'])
# header=None reads the first row as data instead of using it as the column name
Now extract the state name using str.extract; this captures a name only when it is followed by the substring "[edit]", and leaves NaN on every other row. You can then forward fill the NaN values using ffill.
df['State'] = df['RegionName'].str.extract(
    r'(?P<State>.*)(?=\s*\[edit\])'
).ffill()
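To get from there to the two-column result in the question, the header rows themselves still have to be dropped. Here is a minimal sketch of the whole step, assuming a small hand-built sample in place of the real file. The `rows` list and the `is_header` and `result` names are illustrative, and the pattern is anchored slightly differently from the one above.

```python
import pandas as pd

# Hypothetical sample standing in for the real column.
rows = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
]
df = pd.DataFrame({"RegionName": rows})

# Rows ending in "[edit]" are state headers; everything else is a town.
is_header = df["RegionName"].str.endswith("[edit]")

# Capture the state name on header rows (NaN elsewhere), then forward fill.
df["State"] = (
    df["RegionName"].str.extract(r"^(.*?)\s*\[edit\]$", expand=False).ffill()
)

# Keep only the town rows and renumber the index.
result = df.loc[~is_header].reset_index(drop=True)
print(result)
```

Computing the header mask before filtering keeps the state rows around long enough for ffill to propagate their names downward.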

Related

Swap df1 column with df2 column, based on value

Goal: swap out df_hsa.stateabbr with df_state.state, based on df_state.abbr.
Is there such a function, where I mention source, destination, and based-on dataframe columns?
Do I need to order both DataFrames similarly?
df_hsa:
hsa stateabbr county
0 259 AL Butler
1 177 AL Calhoun
2 177 AL Cleburne
3 172 AL Chambers
4 172 AL Randolph
df_state:
abbr state
0 AL Alabama
1 AK Alaska
2 AZ Arizona
3 AR Arkansas
4 CA California
Desired Output:
df_hsa with state column instead of stateabbr.
hsa state county
0 259 Alabama Butler
1 177 Alabama Calhoun
2 177 Alabama Cleburne
3 172 Alabama Chambers
4 172 Alabama Randolph
You can simply join the two frames after setting the index of df_hsa to "stateabbr" and the index of df_state to "abbr":
df_hsa.set_index("stateabbr").join(df_state.set_index("abbr"))
output:
hsa county state
AL 259 Butler Alabama
AL 177 Calhoun Alabama
AL 177 Cleburne Alabama
AL 172 Chambers Alabama
AL 172 Randolph Alabama
If you also want the original index, you can add .set_index(df_hsa.index) at the end of the line.
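If the aim is to keep the row order, index and column position of df_hsa, mapping the column through a lookup Series is another option. This is a sketch rather than the answer's method. It assumes the two frames from the question rebuilt by hand, and that each abbreviation appears exactly once in df_state. The `lookup` and `out` names are illustrative.

```python
import pandas as pd

df_hsa = pd.DataFrame({
    "hsa": [259, 177, 177, 172, 172],
    "stateabbr": ["AL", "AL", "AL", "AL", "AL"],
    "county": ["Butler", "Calhoun", "Cleburne", "Chambers", "Randolph"],
})
df_state = pd.DataFrame({
    "abbr": ["AL", "AK", "AZ", "AR", "CA"],
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
})

# Build an abbr -> state lookup; abbreviations missing from it would map to NaN.
lookup = df_state.set_index("abbr")["state"]

# Replace the abbreviations with full names, then rename the column in place.
out = df_hsa.assign(stateabbr=df_hsa["stateabbr"].map(lookup)).rename(
    columns={"stateabbr": "state"}
)
print(out)
```

Neither frame needs to be sorted for this to work, since the lookup is by value rather than by position.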

DataFrame from Dictionary with variable length keys

So for this assignment I managed to create a dictionary, where the keys are State names (eg: Alabama, Alaska, Arizona), and the values are lists of regions for each state. The problem is that the lists of regions are of different lengths - so each state can have a different number of regions associated.
Example : 'Alabama': ['Auburn',
'Florence',
'Jacksonville',
'Livingston',
'Montevallo',
'Troy',
'Tuscaloosa',
'Tuskegee'],
'Alaska': ['Fairbanks'],
'Arizona': ['Flagstaff', 'Tempe', 'Tucson'],
How can I unload this into a pandas Dataframe? What I want is basically 2 columns - "State", "Region". Something similar to what you would get if you would do a "GroupBy" on state for the regions.
If you work on pandas 0.25+, you can use explode:
pd.Series(states).explode()
Output:
Alabama Auburn
Alabama Florence
Alabama Jacksonville
Alabama Livingston
Alabama Montevallo
Alabama Troy
Alabama Tuscaloosa
Alabama Tuskegee
Alaska Fairbanks
Arizona Flagstaff
Arizona Tempe
Arizona Tucson
dtype: object
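The result above is a Series indexed by state. To turn it into the two-column frame the question asks for, the index can be named and then reset. A sketch, assuming `states` is the dictionary from the question:

```python
import pandas as pd

states = {
    "Alabama": ["Auburn", "Florence", "Jacksonville", "Livingston",
                "Montevallo", "Troy", "Tuscaloosa", "Tuskegee"],
    "Alaska": ["Fairbanks"],
    "Arizona": ["Flagstaff", "Tempe", "Tucson"],
}

df = (
    pd.Series(states, name="Region")  # one list of regions per state
    .explode()                        # one row per region, state repeated in the index
    .rename_axis("State")             # name the index so reset_index labels the column
    .reset_index()                    # move State into a regular column
)
print(df)
```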
You can also use concat, which works in most pandas versions:
pd.concat(pd.DataFrame({'state':k, 'Region':v}) for k,v in states.items())
Output:
state Region
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
0 Alaska Fairbanks
0 Arizona Flagstaff
1 Arizona Tempe
2 Arizona Tucson
You can also do this by splitting the dictionary into lists of keys and values, although that is a slightly longer approach. For example:
import pandas as pd
Example = {'Alabama': ['Auburn','Florence','Jacksonville','Livingston','Montevallo','Troy','Tuscaloosa','Tuskegee'],
           'Alaska': ['Fairbanks'],
           'Arizona': ['Flagstaff', 'Tempe', 'Tucson']}
new_list_of_keys = []
new_list_of_values = []
keys = list(Example.keys())
values = list(Example.values())
# Repeat each state once per region so the two lists stay aligned
for i in range(len(keys)):
    for j in range(len(values[i])):
        new_list_of_values.append(values[i][j])
        new_list_of_keys.append(keys[i])
df = pd.DataFrame(list(zip(new_list_of_keys, new_list_of_values)), columns = ['State', 'Region'])
This will give output as:
State Region
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
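The same nested loop can be written as a single list comprehension. A sketch, assuming the `Example` dictionary above (the `pairs` name is illustrative):

```python
import pandas as pd

Example = {
    "Alabama": ["Auburn", "Florence", "Jacksonville", "Livingston",
                "Montevallo", "Troy", "Tuscaloosa", "Tuskegee"],
    "Alaska": ["Fairbanks"],
    "Arizona": ["Flagstaff", "Tempe", "Tucson"],
}

# One (state, region) tuple per region, in dictionary order.
pairs = [
    (state, region)
    for state, regions in Example.items()
    for region in regions
]
df = pd.DataFrame(pairs, columns=["State", "Region"])
print(df)
```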

Parsing Multiple Text Fields Using Regex and Compiling into Pandas DataFrame

I am attempting to parse a text file using python and regex to construct a specific pandas data frame. Below is a sample from the text file I am parsing and the ideal pandas DataFrame I am seeking.
Sample Text
Washington, DC November 27, 2019
USDA Truck Rate Report
WA_FV190
FIRST PRICE RANGE FOR WEEK OF NOVEMBER 20-26 2019
SECOND PRICE MOSTLY FOR TUESDAY NOVEMBER 26 2019
PERCENTAGE OF CHANGE FROM TUESDAY NOVEMBER 19 2019 SHOWN IN ().
In areas where rates are based on package rates, per-load rates were
derived by multiplying the package rate by the number of packages in
the most usual load in a 48-53 foot trailer.
CENTRAL AND WESTERN ARIZONA
-- LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LEAF LETTUCE SLIGHT SHORTAGE
--
ATLANTA 5100 5500
BALTIMORE 6300 6600
BOSTON 7000 7300
CHICAGO 4500 4900
DALLAS 3400 3800
MIAMI 6400 6700
NEW YORK 6600 6900
PHILADELPHIA 6400 6700
2019 2018
NOV 17-23 NOV 18-24
U.S. 25,701 22,956
IMPORTS 13,653 15,699
------------ --------------
sum 39,354 38,655
The ideal output should look something like:
Region CommodityGroup InboundCity Low High
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC ATLANTA 5100 5500
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BALTIMORE 6300 6600
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BOSTON 7000 7300
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC CHICAGO 4500 4900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC DALLAS 3400 3800
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC MIAMI 6400 6700
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC NEW YORK 6600 6900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC PHILADELPHIA 6400 6700
With my limited understanding of creating regex statements, this is the closest I have come to successfully isolating the desired text: regex tester for USDA data
I have been trying to replicate the solution from "How to parse complex text files using Python?" where applicable, but my regex experience is severely lacking. Any help you can provide will be greatly appreciated!
I came up with this regex (txt is your text from the question):
import re
import numpy as np
import pandas as pd
data = {'Region':[], 'CommodityGroup':[], 'InboundCity':[], 'Low':[], 'High':[]}
for region, commodity_group, values in re.findall(r'([A-Z ]+)\n--(.*?)--\n(.*?)\n\n', txt, flags=re.S|re.M):
    for val in values.strip().splitlines():
        val = re.sub(r'(\d)\s{8,}.*', r'\1', val)
        inbound_city, low, high = re.findall(r'([A-Z ]+)\s*(\d*)\s+(\d+)', val)[0]
        data['Region'].append(region)
        data['CommodityGroup'].append(commodity_group)
        data['InboundCity'].append(inbound_city)
        data['Low'].append(np.nan if low == '' else int(low))
        data['High'].append(int(high))
df = pd.DataFrame(data)
print(df)
Prints:
Region CommodityGroup InboundCity Low High
0 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... ATLANTA 5100 5500
1 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... BALTIMORE 6300 6600
2 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... BOSTON 7000 7300
3 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... CHICAGO 4500 4900
4 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... DALLAS 3400 3800
5 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... MIAMI 6400 6700
6 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... NEW YORK 6600 6900
7 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... PHILADELPHIA 6400 6700
EDIT: This should now work even for the larger document in your regex101 link.
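The inner loop does the per-line work, so it may help to see that step in isolation. This is a sketch only. `parse_rate_line` is a hypothetical helper and the spacing in the sample lines is made up. It reuses the per-line pattern from the code above and additionally strips the trailing whitespace that the city group picks up.

```python
import re

# Same per-line pattern as above: city, optional low price, high price.
LINE_RE = re.compile(r'([A-Z ]+)\s*(\d*)\s+(\d+)')

def parse_rate_line(line):
    """Hypothetical helper: return (city, low, high); low is NaN when absent."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    city, low, high = match.groups()
    # [A-Z ]+ also swallows the padding after the name, so strip it.
    low_value = float("nan") if low == "" else int(low)
    return city.strip(), low_value, int(high)

print(parse_rate_line("ATLANTA        5100    5500"))
print(parse_rate_line("NEW YORK       6600    6900"))
print(parse_rate_line("MIAMI          6700"))  # only one price on the line
```

When a line carries a single number, the optional `(\d*)` group matches the empty string and the number lands in the last group, which is why the low price falls back to NaN.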

Iterating through a list of identical elements

I have the following function, which prints each state together with its associated counties:
def answer():
    census_df.set_index(['STNAME', 'CTYNAME'])
    for name, state, cname in zip(census_df['STNAME'], census_df['STATE'], census_df['CTYNAME']):
        print(name, state, cname)
Alabama 1 Tallapoosa County
Alabama 1 Tuscaloosa County
Alabama 1 Walker County
Alabama 1 Washington County
Alabama 1 Wilcox County
Alabama 1 Winston County
Alaska 2 Alaska
Alaska 2 Aleutians East Borough
Alaska 2 Aleutians West Census Area
Alaska 2 Anchorage Municipality
Alaska 2 Bethel Census Area
Alaska 2 Bristol Bay Borough
Alaska 2 Denali Borough
Alaska 2 Dillingham Census Area
Alaska 2 Fairbanks North Star Borough
I would like to know the state with the most counties in it. I can iterate through each state like this:
counter = 0
counter2 = 0
for name, state, cname in zip(census_df['STNAME'], census_df['STATE'], census_df['CTYNAME']):
    if state == 1:
        counter += 1
    print(counter)
    if state == 1:
        counter2 += 1
    print(counter2)
and so on. I can build a range over the state numbers (rng = range(1, 56)) and iterate through it, but creating 56 lists is a nightmare. Is there an easier way of doing this?
Pandas allows us to do such operations without loops/iterating:
In [21]: df.STNAME.value_counts()
Out[21]:
Alaska 9
Alabama 6
Name: STNAME, dtype: int64
In [24]: df.STNAME.value_counts().head(1)
Out[24]:
Alaska 9
Name: STNAME, dtype: int64
or
In [18]: df.groupby('STNAME')['CTYNAME'].count()
Out[18]:
STNAME
Alabama 6
Alaska 9
Name: CTYNAME, dtype: int64
In [19]: df.groupby('STNAME')['CTYNAME'].count().idxmax()
Out[19]: 'Alaska'
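One caveat is visible in the question's sample output: the row "Alaska 2 Alaska" has the state's own name in CTYNAME, which suggests a state-level summary row rather than a county. The sketch below excludes such rows before counting. It uses a small hand-built frame in place of census_df, and it assumes that CTYNAME equal to STNAME always marks a summary row, which may not hold for every state.

```python
import pandas as pd

# Hypothetical miniature of census_df.
census_df = pd.DataFrame({
    "STNAME": ["Alabama", "Alabama", "Alabama",
               "Alaska", "Alaska", "Alaska", "Alaska"],
    "CTYNAME": ["Alabama", "Walker County", "Wilcox County",
                "Alaska", "Anchorage Municipality",
                "Bethel Census Area", "Denali Borough"],
})

# Drop rows that look like state-level summaries (assumption, see above).
counties = census_df[census_df["CTYNAME"] != census_df["STNAME"]]

# Count the remaining rows per state and take the label with the largest count.
county_counts = counties["STNAME"].value_counts()
top_state = county_counts.idxmax()
print(top_state, county_counts[top_state])
```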
