I am working through a pandas tutorial that deals with analyzing sales data (https://www.youtube.com/watch?v=eMOA1pPVUc4&list=PLFCB5Dp81iNVmuoGIqcT5oF4K-7kTI5vp&index=6). The data is already in a DataFrame; within it is one column called "Purchase Address" that contains street, city, and state/zip code. The format looks like this:
Purchase Address
917 1st St, Dallas, TX 75001
682 Chestnut St, Boston, MA 02215
...
My idea was to split the address string into a list and then drop the irrelevant list values. I used the command:
all_data['Splitted Address'] = all_data['Purchase Address'].str.split(',')
That worked for converting each address into a list of the form
[917 1st St, Dallas, TX 75001]
Now the whole column 'Splitted Address' looks like this, and I am stuck at this point. I simply want to drop list indices 0 and 2 and keep index 1, i.e. the city, in another column.
In the tutorial the solution was laid out using the .apply() method:
all_data['Column'] = all_data['Purchase Address'].apply(lambda x: x.split(',')[1])
This solution definitely looks more elegant than mine so far, but I wondered whether my approach can reach a solution with a comparable amount of effort.
Thanks in advance.
Use Series.str.split and then select the element by position with the .str indexer:
all_data['Column'] = all_data['Purchase Address'].str.split(',').str[1]
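Note that splitting on ',' leaves a leading space on the city ('917 1st St, Dallas, TX 75001' becomes ['917 1st St', ' Dallas', ' TX 75001']), so you may want to chain a strip; a small sketch:
all_data['Column'] = all_data['Purchase Address'].str.split(',').str[1].str.strip()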
I have a dataframe with a column location holding values like "Cape May Court House, NJ 08210".
In the screenshot you see the case with 5 spaces in the location column, but there are a lot more cells with 3 and 4 spaces, while the most common case is just two spaces: one between the city and the state, and one between the state and the post code.
I need to perform str.split() on the location column, but the varying number of spaces breaks this: if I substitute the spaces with commas, I get a different number of potential splits in each row.
So I need a way to turn the spaces inside city names into hyphens, so that I can do the split later, while not touching the other spaces (between city and state, and between state and post code). Any ideas?
I have written the code below with readability in mind. One way to solve the above query is to split the location column into city and state first, perform the operation on city, and merge it back with state.
import pandas as pd

df = pd.DataFrame({'location': ['Cape May Court House, NJ 08210',
                                'Van Buron Charter Township, MI 48111']})
# Split on the comma into separate city and state columns.
df[['city', 'state']] = df['location'].str.split(',', expand=True)
# Replace the spaces inside the city names with hyphens.
df['city'] = df['city'].str.replace(' ', '-')
# Stitch the pieces back together.
df['location_new'] = df['city'] + ',' + df['state']
df.head()
The final output will look like this, with the required result in the location_new column:
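Roughly, df.head() prints:
                               location                        city     state                          location_new
0        Cape May Court House, NJ 08210        Cape-May-Court-House  NJ 08210        Cape-May-Court-House, NJ 08210
1  Van Buron Charter Township, MI 48111  Van-Buron-Charter-Township  MI 48111  Van-Buron-Charter-Township, MI 48111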
I am new to pandas DataFrames. I would like to apply a function to an old dataframe (df1), extracting values from another dataframe (df2).
DF2 looks like this (the actual one has ~500 rows)
Judge old_court_name new_court_name
John eighth circuit first circuit
Ruth us court claims. fifth circuit
Ben district connecticut district ohio
Then I've written a function
def addJudgeCourt(df1, Judge, old_court_name, new_court_name):
How do I tell pandas to supply the last three arguments by iterating over dataframe2? Thanks!
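For reference, a minimal sketch of a common pattern, using DataFrame.itertuples (the body of addJudgeCourt is not shown, so it is assumed here to return the updated df1):
# Iterate df2 row by row; Judge, old_court_name and new_court_name are
# the columns shown above. addJudgeCourt is assumed to return df1.
for row in df2.itertuples(index=False):
    df1 = addJudgeCourt(df1, row.Judge, row.old_court_name, row.new_court_name)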
I have two location-based pandas DataFrames.
df1: which has a column consisting of a full address, such as "Avon Road, Ealing, London, UK". The address format varies.
df1.address[0] --> "Avon Road, Ealing, London, UK"
df2: which just has cities of the UK, such as "London".
df2.city[5] --> "London"
I want to locate the city within the full address of the first dataframe. The result would go into my first dataframe like so:
df1.city[0] --> "London"
Approach 1: For each city in df2, check whether df1 addresses contain that city, and store the matching df1 indices together with the df2 city in a list.
I am not certain how I would go about doing this, but I assume I would use code like this to detect a partial string match and locate the indices:
df1['address'].str.contains("London",na=False).index.values
Approach 2: For each df1 address, check if any of the words match the cities in df2 and store the value of df2 in a list.
I would assume this approach is more intuitive, but would it be computationally more expensive? Assume df1 has millions of addresses.
Apologies if this is a stupid or easy problem! Any direction to the most efficient code would be helpful :)
Approach 2 is indeed a good start. However, using a Python set rather than a list should be much faster, since membership tests are O(1) on average.
Here is some example code:
import re

# Keep the known cities in a set for fast membership tests.
cityIndex = set(df2.city)

addressLocations = []
for address in df1.address:
    location = None
    # Warning: this tokenization ignores characters like '-' in the cities.
    for word in re.findall(r'[a-zA-Z0-9]+', address):
        if word in cityIndex:
            location = word
            break
    addressLocations.append(location)

df1['city'] = addressLocations
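If the loop is still too slow on millions of addresses, here is a vectorized sketch of the same idea using Series.str.extract (assuming df2.city fits comfortably into a single alternation pattern):
import re

# One regex matching any known city as a whole word; escaping guards
# against cities containing regex metacharacters.
pattern = r'\b(' + '|'.join(re.escape(c) for c in df2.city) + r')\b'
df1['city'] = df1['address'].str.extract(pattern, expand=False)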
I am very new to Python and need help!
I want to search a column of a data frame for the items in a list and, if one is found, store that item in a new column. My location column is messy, and I am trying to extract a state abbreviation if there is one.
So far I have been able to find the rows where the search terms occur (I'm not sure if this is 100% correct); how would I take the search term that was found and store it in a new column?
state_search=('CO', 'CA', 'WI', 'VA', 'NY', 'PA', 'MA', 'TX',)
pattern = '|'.join(state_search)
state_jobs_df=jobs_data_df.loc[jobs_data_df['location'].str.contains(pattern), :]
I want to take the state that was found and store that in a new 'state' column. Thanks for any help.
print (jobs_data_df)
location
0 Madison, WI 53702
1 Senior Training Leader located in Raynham, MA ...
2 Dixon CA
3 Camphill, PA Weekends and nights
4 Charlottesville, VA Some travel required
5 Houston, TX
6 Denver, CO 80215
7 Respiratory Therapy Primary Location : TX- Som...
Use Series.str.extract with word boundaries, then filter out the missing rows with Series.notna or DataFrame.dropna:
pat = '|'.join(r"\b{}\b".format(x) for x in state_search)
jobs_data_df['state'] = jobs_data_df['location'].str.extract('('+ pat + ')', expand=False)
jobs_data_df = jobs_data_df[jobs_data_df['state'].notna()]
Or:
jobs_data_df = jobs_data_df.dropna(subset=['state'])
It's a bit hack-y, but a simpler solution might take a form similar to:
for row in dataRows:
    for state in state_search:
        if state in row:
            # put state in the correct column here
            break  # break exits only the inner loop
It's probably helpful to think about how the underlying program would have to approach the problem (checking each row for a string that matches one of your states, then doing something with it), and go at that directly. Unless you're dealing with a huge load of data, it may not be worth going crazy fancy with regular expressions or the like.
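Fleshed out for the data above (a sketch; note that the plain in test does substring matching with no word boundaries):
# Collect the first matching state per location; None when no match.
states = []
for row in jobs_data_df['location']:
    found = None
    for state in state_search:
        if state in row:  # substring match, no word boundaries
            found = state
            break  # exits only the inner loop
    states.append(found)
jobs_data_df['state'] = states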
New to Python, using 3.x
I have a large CSV containing a list of customer names and addresses.
[Name, City, State]
I want to create a 4th column that is a count of the total number of customers living in each customer's state.
So for example:
Joe, Dallas, TX
Steve, Austin, TX
Alex, Denver, CO
would become:
Joe, Dallas, TX, 2
Steve, Austin, TX, 2
Alex, Denver, CO, 1
I am able to read the file in, and then use groupby to create a Series that contains the values for the 4th column, but I can't figure out how to take that series and match it against the million+ rows in my actual file.
import pandas as pd

mydata = pd.read_csv(r'C:\Users\customerlist.csv', index_col=False)
mydata = mydata.drop_duplicates(subset='name', keep='first')
mydata['state'] = mydata['state'].str.strip()
stateinstalls = mydata.groupby(mydata.state, as_index=False).size()
stateinstalls gives me a Series with the values [2, 1], but I lose sight of the corresponding states ([TX, CO]), which end up in the index. I need the pairs together, so that I can then go back, iterate through all rows of my spreadsheet, and say something like:
if mydata['state'].isin(stateinstalls(0))
    mydata[row] = stateinstalls(1)
I feel very lost. I know there has to be a far simpler way to do this, ideally in place within the array (like a COUNTIF-type function).
Any pointers are much appreciated.
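A minimal sketch of the usual approach (assuming the columns are named name, city, and state as above; state_count is a made-up name): GroupBy.transform('size') broadcasts each group's row count back onto every row of that group, so there is no need to match the grouped Series against the rows by hand.
# Add the per-state customer count as a fourth column.
mydata['state_count'] = mydata.groupby('state')['state'].transform('size')
With the sample rows this yields 2 for Joe and Steve (TX) and 1 for Alex (CO).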