Quickest way to find partial string match between two pandas dataframes - python

I have two location-based pandas DataFrames.
df1: This one has a column with a full address, such as "Avon Road, Ealing, London, UK". The address format varies.
df1.address[0] --> "Avon Road, Ealing, London, UK"
df2: This one just has cities of the UK, such as "London".
df2.city[5] --> "London"
I want to locate the city of the first dataframe, given the full address. This would go on my first dataframe as such.
df1.city[0] --> "London"
Approach 1: For each city in df2, check if df1's addresses contain that city, and store the matching df1 indexes and the df2 city in a list.
I am not certain how I would go about doing this, but I assume I would use code like this to figure out if there is a partial string match and locate the indexes:
df1['address'].str.contains("London",na=False).index.values
Approach 2: For each df1 address, check if any of the words match the cities in df2 and store the value of df2 in a list.
I would assume this approach is more intuitive, but would it be computationally more expensive? Assume df1 has millions of addresses.
Apologies if this is a stupid or easy problem! Any direction to the most efficient code would be helpful :)

Approach 2 is indeed a good start. However, using a Python set (with constant-time membership tests) rather than a list should be much faster.
Here is some example code:
import re

cityIndex = set(df2.city)

addressLocations = []
for address in df1.address:
    location = None
    # Warning: this tokenization ignores characters like '-', so hyphenated
    # (and multi-word) city names will not match
    for word in re.findall(r'[a-zA-Z0-9]+', address):
        if word in cityIndex:
            location = word
            break
    addressLocations.append(location)
df1['city'] = addressLocations
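If df1 really has millions of addresses, a vectorized alternative worth trying is to build a single regex alternation from df2 and let Series.str.extract pull the city out. A minimal sketch, using small stand-in frames for df1 and df2 (the str.extract call is standard pandas, the rest is illustrative):
import re
import pandas as pd

df1 = pd.DataFrame({'address': ['Avon Road, Ealing, London, UK']})
df2 = pd.DataFrame({'city': ['London', 'Manchester']})

# Escape each city name and sort longest-first so a multi-word city is
# preferred over a shorter city that happens to be its prefix.
pattern = '|'.join(sorted(map(re.escape, df2['city']), key=len, reverse=True))
df1['city'] = df1['address'].str.extract(rf'\b({pattern})\b', expand=False)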

Related

how to spot delete spaces in pandas column

I have a dataframe with a column location which looks like this:
In the screenshot you can see the case with 5 spaces in the location column, but there are a lot more cells with 3 and 4 spaces, while the most common case is just two spaces: one between the city and the state, and one between the state and the postcode.
I need to perform str.split() on the location column, but due to the different number of spaces it will not work: whether I substitute the spaces with nothing or with commas, I get a different number of potential splits per row.
So I need to find a way to turn the spaces that are inside city names into hyphens, so that I am able to do the split later, but at the same time not touch the other spaces (between city and state, and between state and postcode). Any ideas?
I wrote the code below with readability in mind. One way to solve this is to first split the location column into city and state, perform the replacement on city, and merge it back with state.
import pandas as pd

df = pd.DataFrame({'location': ['Cape May Court House, NJ 08210',
                                'Van Buron Charter Township, MI 48111']})
df[['city', 'state']] = df['location'].str.split(",", expand=True)
df['city'] = df['city'].str.replace(" ", '_')
df['location_new'] = df['city'] + ',' + df['state']
df.head()
The final output will look like this, with the required output in column location_new:
                               location                        city     state                          location_new
0        Cape May Court House, NJ 08210        Cape_May_Court_House  NJ 08210        Cape_May_Court_House, NJ 08210
1  Van Buron Charter Township, MI 48111  Van_Buron_Charter_Township  MI 48111  Van_Buron_Charter_Township, MI 48111
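For what it's worth, the merge-back step can likely be skipped entirely: a regex replacement that only touches spaces occurring before the comma does the same job in one pass. A sketch, assuming the same df as above and not tested against edge cases such as locations with more than one comma:
df['location_new'] = df['location'].str.replace(r' (?=[^,]*,)', '_', regex=True)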

Search for variable name using iloc function in pandas dataframe

I have a pandas dataframe that consists of 5000 rows with different countries and emission data, and looks like the following:
country  year  emissions
peru     2020       1000
         2019        900
         2018        800
The country label is an index.
e.g. df = emission.loc[['peru']]
would give me a new dataframe consisting only of the emission data attached to peru.
My goal is to use a variable name instead of 'peru' and store the country-specific emission data into a new dataframe.
what I search for is a code that would work the same way as the code below:
country = 'zanzibar'
df = emissions.loc[[{country}]]
From what I can tell the problem arises with the loc/iloc indexing, which does not seem to accept a variable as input. Is there a way I could circumvent this problem?
In other words, I want to be able to create a new dataframe with country-specific emission data, based on a variable that matches one of the countries in my emission index, all without having to change anything but the given variable.
One way could be to iterate through or maybe create a function in some way?
Thank you in advance for any help.
An alternative approach where you don't use a country name as your index:
emissions = pd.DataFrame({'Country': ['Peru', 'Peru', 'Peru', 'Chile', 'Chile', 'Chile'],
                          'Year': [2021, 2020, 2019, 2021, 2020, 2019],
                          'Emissions': [100, 200, 400, 300, 200, 100]})
country = 'Peru'
Then to filter:
df = emissions[emissions.Country == country]
or
df = emissions.loc[emissions.Country == country]
Giving:
  Country  Year  Emissions
0    Peru  2021        100
1    Peru  2020        200
2    Peru  2019        400
You should be able to select by a certain string for your index. For example:
df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=['Peru', 'Peru', 'zanzibar', 'zanzibar'])
country = 'zanzibar'
df.loc[country]
This will return:
          a
zanzibar  3
zanzibar  4
In your case, removing one set of square brackets and passing the bare variable (not a set) should work:
country = 'zanzibar'
df = emissions.loc[country]
I don't know if this solution is quite what your question asks; in this case I will give a solution that turns each country name into a variable.
But, because a variable name can't contain a space (" ") character, you have to replace the space character with an underscore ("_") character.
(Just in case some of your 'country' values are country names of more than one word.)
Example:
the United Kingdom becomes United_Kingdom
by using this code:
df['country'] = df['country'].replace(' ', '_', regex=True)
After your country names are changed to the new format, you can get all the unique country names from the dataframe using .unique() and store them in a new variable with this code:
country_name = df['country'].unique()
After running that code, all the unique values in the 'country' column are stored in a variable called country_name.
Next, use a for loop to generate a new variable for each country name:
for i in country_name:
    locals()[i] = df[df['country'] == i]
Here, locals() is used to turn each string in country_name into an actual variable name, and df[df['country'] == i] subsets the dataframe to the rows whose country equals that value.
After that, there is a new dataframe variable for each country name in the 'country' column.
Hopefully this can help to solve your problem.
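As a side note, creating variables via locals() is generally discouraged; a dictionary keyed by country name gives the same lookup without touching the namespace. A minimal sketch, assuming the df and country_name from above:
frames = {name: df[df['country'] == name] for name in country_name}
peru_df = frames['peru']  # hypothetical key; use any name present in the data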

Extract part from an address in pandas dataframe column

I am working through a pandas tutorial that deals with analyzing sales data (https://www.youtube.com/watch?v=eMOA1pPVUc4&list=PLFCB5Dp81iNVmuoGIqcT5oF4K-7kTI5vp&index=6). The data is already in dataframe format; within the dataframe is one column called "Purchase Address" that contains street, city and state/zip code. The format looks like this:
Purchase Address
917 1st St, Dallas, TX 75001
682 Chestnut St, Boston, MA 02215
...
My idea was to convert the data to a string and then drop the irrelevant list values. I used the command:
all_data['Splitted Address'] = all_data['Purchase Address'].str.split(',')
That worked for converting the data to a comma separated list of the form
[917 1st St, Dallas, TX 75001]
Now the whole column 'Splitted Address' looks like this, and I am stuck at this point. I simply want to drop list indices 0 and 2 and keep index 1, i.e. the city, in another column.
In the tutorial the solution was laid out using the .apply() method:
all_data['Column'] = all_data['Purchase Address'].apply(lambda x: x.split(',')[1])
This solution definitely looks more elegant than mine so far, but I wondered whether I can reach a solution with a comparable amount of effort using my approach.
Thanks in advance.
Use Series.str.split and select the element by indexing:
all_data['Column'] = all_data['Purchase Address'].str.split(',').str[1]
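One small caveat: splitting on ',' leaves a leading space on the city, so a strip may be wanted. A sketch on the same all_data frame:
all_data['Column'] = all_data['Purchase Address'].str.split(',').str[1].str.strip()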

Searching for an item within a list in a column and saving that item to a new column

I am very new to Python and need help!
I want to search a column of a data frame for an item in a list and, if found, store that item in a new column. My location column is messy, and I am trying to extract a state abbreviation if there is one.
So far I have been able to find the rows where the search terms are found (I'm not sure if this is 100% correct); how would I take the search term that was found and store it in a new column?
state_search = ('CO', 'CA', 'WI', 'VA', 'NY', 'PA', 'MA', 'TX')
pattern = '|'.join(state_search)
state_jobs_df = jobs_data_df.loc[jobs_data_df['location'].str.contains(pattern), :]
I want to take the state that was found and store that in a new 'state' column. Thanks for any help.
print (jobs_data_df)
location
0 Madison, WI 53702
1 Senior Training Leader located in Raynham, MA ...
2 Dixon CA
3 Camphill, PA Weekends and nights
4 Charlottesville, VA Some travel required
5 Houston, TX
6 Denver, CO 80215
7 Respiratory Therapy Primary Location : TX- Som...
Use Series.str.extract with word boundaries, then keep the non-missing rows with Series.notna or DataFrame.dropna:
pat = '|'.join(r"\b{}\b".format(x) for x in state_search)
jobs_data_df['state'] = jobs_data_df['location'].str.extract('('+ pat + ')', expand=False)
jobs_data_df = jobs_data_df[jobs_data_df['state'].notna()]
Or:
jobs_data_df = jobs_data_df.dropna(subset=['state'])
It's a bit hack-y, but a simpler solution might take a form similar to:
for row in dataRows:
    for state in state_search:
        if state in row:
            # put state in correct column here
            break  # should break just the inner loop; if that doesn't happen, delete this line
It's probably helpful to think about how the underlying program would have to approach the problem (checking each row for a string that matches one of your states, then doing something with it), and go at that directly. Unless you're dealing with a huge load of data, it may not be worth going crazy fancy with regular expressions or the like.
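For reference, a runnable version of that loop, assuming the jobs_data_df and state_search from the question (plain substring checks lack the word boundaries of the regex answer above, so they can false-match inside longer words):
found_states = []
for loc in jobs_data_df['location']:
    match = None
    for state in state_search:
        if state in loc:
            match = state
            break  # stop at the first state found in this row
    found_states.append(match)
jobs_data_df['state'] = found_states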

Append column to dataframe containing count of duplicates on another row

New to Python, using 3.x
I have a large CSV containing a list of customer names and addresses.
[Name, City, State]
I am wanting to create a 4th column that is a count of the total number of customers living in the current customer's state.
So for example:
Joe, Dallas, TX
Steve, Austin, TX
Alex, Denver, CO
would become:
Joe, Dallas, TX, 2
Steve, Austin, TX, 2
Alex, Denver, CO, 1
I am able to read the file in, and then use groupby to create a Series that contains the values for the 4th column, but I can't figure out how to take that series and match it against the million+ rows in my actual file.
import pandas as pd

mydata = pd.read_csv(r'C:\Users\customerlist.csv', index_col=False)
mydata = mydata.drop_duplicates(subset='name', keep='first')
mydata['state'] = mydata['state'].str.strip()
stateinstalls = mydata.groupby(mydata.state, as_index=False).size()
stateinstalls gives me a series [2, 1], but I lose the corresponding states ([TX, CO]). It needs to be something like a tuple, so that I can then go back, iterate through all rows of my spreadsheet, and say something like:
if mydata['state'].isin(stateinstalls(0)):
    mydata[row] = stateinstalls(1)
I feel very lost. I know there has to be a far simpler way to do this, ideally in place within the dataframe (like a COUNTIF-type function).
Any pointers are much appreciated.
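There is indeed a far simpler way: pandas can broadcast each group's size back onto every row with groupby/transform, which avoids any manual matching. A minimal sketch using the example data from the question (column names are assumptions):
import pandas as pd

mydata = pd.DataFrame({'name': ['Joe', 'Steve', 'Alex'],
                       'city': ['Dallas', 'Austin', 'Denver'],
                       'state': ['TX', 'TX', 'CO']})

# transform('size') returns one value per row, aligned with the original index
mydata['state_count'] = mydata.groupby('state')['state'].transform('size')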
