I am working through a pandas tutorial that deals with analyzing sales data (https://www.youtube.com/watch?v=eMOA1pPVUc4&list=PLFCB5Dp81iNVmuoGIqcT5oF4K-7kTI5vp&index=6). The data is already in a DataFrame; within it is one column called "Purchase Address" that contains street, city, and state/zip code. The format looks like this:
Purchase Address
917 1st St, Dallas, TX 75001
682 Chestnut St, Boston, MA 02215
...
My idea was to split the address string into a list and then drop the irrelevant list values. I used the command:
all_data['Splitted Address'] = all_data['Purchase Address'].str.split(',')
That worked for converting each address into a list of the form
[917 1st St, Dallas, TX 75001]
Now the whole column 'Splitted Address' looks like this, and I am stuck at this point. I simply want to drop list indices 0 and 2 and keep index 1, i.e. the city, in another column.
In the tutorial the solution was laid out using the .apply() method:
all_data['Column'] = all_data['Purchase Address'].apply(lambda x: x.split(',')[1])
This solution definitely looks more elegant than mine so far, but I wondered whether my approach can reach a solution with a comparable amount of effort.
Thanks in advance.
Use Series.str.split and then select the element by position with the .str indexer:
all_data['Column'] = all_data['Purchase Address'].str.split(',').str[1]
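Note that splitting on ',' leaves a leading space on the city ('917 1st St, Dallas, TX 75001' becomes ['917 1st St', ' Dallas', ' TX 75001']), so you may want to chain a strip; a small sketch:
all_data['Column'] = all_data['Purchase Address'].str.split(',').str[1].str.strip()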
I have a dataframe with a column location holding values like "Cape May Court House, NJ 08210".
In the screenshot you see the case with 5 spaces in the location column, but there are a lot more cells with 3 and 4 spaces, while the most common case is just two spaces: one between the city and the state, and one between the state and the post code.
I need to perform str.split() on the location column, but the varying number of spaces breaks this: if I substitute the spaces with commas, I get a different number of potential splits in each row.
So I need a way to turn the spaces inside city names into hyphens, so that I can do the split later, while not touching the other spaces (between city and state, and between state and post code). Any ideas?
I have written the code below with readability in mind. One way to solve the above query is to split the location column into city and state first, perform the operation on city, and merge it back with state.
import pandas as pd

df = pd.DataFrame({'location': ['Cape May Court House, NJ 08210',
                                'Van Buron Charter Township, MI 48111']})
# Split on the comma into separate city and state columns.
df[['city', 'state']] = df['location'].str.split(',', expand=True)
# Replace the spaces inside the city names with hyphens.
df['city'] = df['city'].str.replace(' ', '-')
# Stitch the pieces back together.
df['location_new'] = df['city'] + ',' + df['state']
df.head()
The final output will look like this, with the required result in the location_new column:
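Roughly, df.head() prints:
                               location                        city     state                          location_new
0        Cape May Court House, NJ 08210        Cape-May-Court-House  NJ 08210        Cape-May-Court-House, NJ 08210
1  Van Buron Charter Township, MI 48111  Van-Buron-Charter-Township  MI 48111  Van-Buron-Charter-Township, MI 48111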
I am new to pandas DataFrames. I would like to apply a function to an old dataframe (df1), extracting values from another dataframe (df2).
DF2 looks like this (the actual one has ~500 rows)
Judge old_court_name new_court_name
John eighth circuit first circuit
Ruth us court claims. fifth circuit
Ben district connecticut district ohio
Then I've written a function
def addJudgeCourt(df1, Judge, old_court_name, new_court_name):
How do I tell pandas to supply the last three arguments by iterating over dataframe2? Thanks!
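For reference, a minimal sketch of a common pattern, using DataFrame.itertuples (the body of addJudgeCourt is not shown, so it is assumed here to return the updated df1):
# Iterate df2 row by row; Judge, old_court_name and new_court_name are
# the columns shown above. addJudgeCourt is assumed to return df1.
for row in df2.itertuples(index=False):
    df1 = addJudgeCourt(df1, row.Judge, row.old_court_name, row.new_court_name)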
I have two location-based pandas DataFrames.
df1: which has a column consisting of a full address, such as "Avon Road, Ealing, London, UK". The address format varies.
df1.address[0] --> "Avon Road, Ealing, London, UK"
df2: which just has cities of the UK, such as "London".
df2.city[5] --> "London"
I want to locate the city within the full address of the first dataframe. The result would go into my first dataframe like so:
df1.city[0] --> "London"
Approach 1: For each city in df2, check whether df1 addresses contain that city, and store the matching df1 indices together with the df2 city in a list.
I am not certain how I would go about doing this, but I assume I would use code like this to detect a partial string match and locate the indices:
df1['address'].str.contains("London",na=False).index.values
Approach 2: For each df1 address, check if any of the words match the cities in df2 and store the value of df2 in a list.
I would assume this approach is more intuitive, but would it be computationally more expensive? Assume df1 has millions of addresses.
Apologies if this is a stupid or easy problem! Any direction to the most efficient code would be helpful :)
Approach 2 is indeed a good start. However, using a Python set rather than a list should be much faster, since membership tests are O(1) on average.
Here is some example code:
import re

# Keep the known cities in a set for fast membership tests.
cityIndex = set(df2.city)

addressLocations = []
for address in df1.address:
    location = None
    # Warning: this tokenization ignores characters like '-' in the cities.
    for word in re.findall(r'[a-zA-Z0-9]+', address):
        if word in cityIndex:
            location = word
            break
    addressLocations.append(location)

df1['city'] = addressLocations
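If the loop is still too slow on millions of addresses, here is a vectorized sketch of the same idea using Series.str.extract (assuming df2.city fits comfortably into a single alternation pattern):
import re

# One regex matching any known city as a whole word; escaping guards
# against cities containing regex metacharacters.
pattern = r'\b(' + '|'.join(re.escape(c) for c in df2.city) + r')\b'
df1['city'] = df1['address'].str.extract(pattern, expand=False)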
I am very new to Python and need help!
I want to search a column of a data frame for the items in a list and, if one is found, store that item in a new column. My location column is messy, and I am trying to extract a state abbreviation if there is one.
So far I have been able to find the rows where the search terms occur (I'm not sure if this is 100% correct); how would I take the search term that was found and store it in a new column?
state_search=('CO', 'CA', 'WI', 'VA', 'NY', 'PA', 'MA', 'TX',)
pattern = '|'.join(state_search)
state_jobs_df=jobs_data_df.loc[jobs_data_df['location'].str.contains(pattern), :]
I want to take the state that was found and store that in a new 'state' column. Thanks for any help.
print (jobs_data_df)
location
0 Madison, WI 53702
1 Senior Training Leader located in Raynham, MA ...
2 Dixon CA
3 Camphill, PA Weekends and nights
4 Charlottesville, VA Some travel required
5 Houston, TX
6 Denver, CO 80215
7 Respiratory Therapy Primary Location : TX- Som...
Use Series.str.extract with word boundaries, then filter out the missing rows with Series.notna or DataFrame.dropna:
pat = '|'.join(r"\b{}\b".format(x) for x in state_search)
jobs_data_df['state'] = jobs_data_df['location'].str.extract('('+ pat + ')', expand=False)
jobs_data_df = jobs_data_df[jobs_data_df['state'].notna()]
Or:
jobs_data_df = jobs_data_df.dropna(subset=['state'])
It's a bit hack-y, but a simpler solution might take a form similar to:
for row in dataRows:
    for state in state_search:
        if state in row:
            # put state in the correct column here
            break  # break exits only the inner loop
It's probably helpful to think about how the underlying program would have to approach the problem (checking each row for a string that matches one of your states, then doing something with it), and go at that directly. Unless you're dealing with a huge load of data, it may not be worth going crazy fancy with regular expressions or the like.
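Fleshed out for the data above (a sketch; note that the plain in test does substring matching with no word boundaries):
# Collect the first matching state per location; None when no match.
states = []
for row in jobs_data_df['location']:
    found = None
    for state in state_search:
        if state in row:  # substring match, no word boundaries
            found = state
            break  # exits only the inner loop
    states.append(found)
jobs_data_df['state'] = states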
New to Python, using 3.x
I have a large CSV containing a list of customer names and addresses.
[Name, City, State]
I want to create a 4th column that is a count of the total number of customers living in each customer's state.
So for example:
Joe, Dallas, TX
Steve, Austin, TX
Alex, Denver, CO
would become:
Joe, Dallas, TX, 2
Steve, Austin, TX, 2
Alex, Denver, CO, 1
I am able to read the file in, and then use groupby to create a Series that contains the values for the 4th column, but I can't figure out how to take that series and match it against the million+ rows in my actual file.
import pandas as pd

mydata = pd.read_csv(r'C:\Users\customerlist.csv', index_col=False)
mydata = mydata.drop_duplicates(subset='name', keep='first')
mydata['state'] = mydata['state'].str.strip()
stateinstalls = mydata.groupby(mydata.state, as_index=False).size()
stateinstalls gives me a Series with the values [2, 1], but I lose sight of the corresponding states ([TX, CO]), which end up in the index. I need the pairs together, so that I can then go back, iterate through all rows of my spreadsheet, and say something like:
if mydata['state'].isin(stateinstalls(0))
    mydata[row] = stateinstalls(1)
I feel very lost. I know there has to be a far simpler way to do this, ideally in place within the array (like a COUNTIF-type function).
Any pointers are much appreciated.
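A minimal sketch of the usual approach (assuming the columns are named name, city, and state as above; state_count is a made-up name): GroupBy.transform('size') broadcasts each group's row count back onto every row of that group, so there is no need to match the grouped Series against the rows by hand.
# Add the per-state customer count as a fourth column.
mydata['state_count'] = mydata.groupby('state')['state'].transform('size')
With the sample rows this yields 2 for Joe and Steve (TX) and 1 for Alex (CO).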