How to selectively delete spaces in a pandas column - Python

I have a DataFrame with a column location which looks like this:
The screenshot shows a case with 5 spaces in the location column, but there are many more cells with 3 and 4 spaces, while the most common case is just two spaces: one between the city and the state, and one between the state and the post code.
I need to perform str.split() on the location column, but due to the varying number of spaces it will not work: if I split on spaces (or substitute them with commas), I get a different number of fields per row.
So I need a way to turn the spaces that are inside city names into hyphens, so that I can do the split later, while leaving the other spaces (between city and state, and between state and post code) untouched. Any ideas?

I have written this code with readability in mind. One way to solve the above query is to first split the location column into city and state, perform the replacement on city, and merge it back with state.
import pandas as pd

df = pd.DataFrame({'location': ['Cape May Court House, NJ 08210',
                                'Van Buron Charter Township, MI 48111']})
# Split on the comma: the city goes left, "STATE ZIP" goes right.
df[['city', 'state']] = df['location'].str.split(",", expand=True)
# Replace the spaces inside the city name.
df['city'] = df['city'].str.replace(" ", "_")
df['location_new'] = df['city'] + ',' + df['state']
df.head()
The final output will look like this, with the required result in column location_new:
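If the end goal is three separate columns rather than an underscore-joined string, a slightly different sketch (using the same sample data) avoids touching the city's internal spaces entirely: split once on the comma, then split the remainder on its single internal space.

```python
import pandas as pd

df = pd.DataFrame({'location': ['Cape May Court House, NJ 08210',
                                'Van Buron Charter Township, MI 48111']})

# City is everything before the comma; the rest is " STATE ZIP".
df[['city', 'rest']] = df['location'].str.split(',', n=1, expand=True)
# The remainder has exactly one internal space, so a second split is safe.
df[['state', 'zip']] = df['rest'].str.strip().str.split(' ', n=1, expand=True)
df = df.drop(columns='rest')
```

With `n=1` each split produces exactly two fields, no matter how many spaces the city name contains.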

Related

How to add large number of column names with spaces to statsmodels.api.GLM.from_formula

I want to add a large number of column names to the below code in place of "Protocol"
model = sm.GLM.from_formula("Output~Protocol", family = sm.families.Binomial(), data=df2)
Some of the column names have spaces in them, which also causes an error. I want to find something like adding a dot after ~, as in R.

Quickest way to find partial string match between two pandas dataframes

I have two location-based pandas DataFrames.
df1: Which has a column that consists of a full address, such as "Avon Road, Ealing, London, UK". The address varies in format.
df1.address[0] --> "Avon Road, Ealing, London, UK"
df2: Which just has cities of UK, such as "London".
df2.city[5] --> "London"
I want to locate the city of the first dataframe, given the full address. This would go on my first dataframe as such.
df1.city[0] --> "London"
Approach 1: For each city in df2, check if df1 contains that city and store the indexes of df1 and the city of df2 in a list.
I am not certain how I would go about doing this, but I assume I would use this code to figure out if there is a partial string match and locate the indexes:
df1['address'].str.contains("London",na=False).index.values
Approach 2: For each df1 address, check if any of the words match the cities in df2 and store the value of df2 in a list.
I would assume this approach is more intuitive, but would it be computationally more expensive? Assume df1 has millions of addresses.
Apologies if this is a stupid or easy problem! Any direction to the most efficient code would be helpful :)
Approach 2 is indeed a good start. However, using a hash-based container such as a Python set rather than a list should be much faster, since membership tests are O(1) instead of a scan of the whole city list.
Here is an example code:
import re

cityIndex = set(df2.city)
addressLocations = []
for address in df1.address:
    location = None
    # Warning: this ignores characters like '-' inside city names
    for word in re.findall(r'[a-zA-Z0-9]+', address):
        if word in cityIndex:
            location = word
            break
    addressLocations.append(location)
df1['city'] = addressLocations
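A vectorized alternative (a sketch with made-up sample frames) is to build one regex alternation from df2.city and let pandas' str.extract do the scan, which avoids the Python-level loop over addresses:

```python
import re
import pandas as pd

df1 = pd.DataFrame({'address': ['Avon Road, Ealing, London, UK',
                                'High Street, Manchester, UK']})
df2 = pd.DataFrame({'city': ['London', 'Manchester', 'Leeds']})

# One pattern with every city as an alternative; re.escape protects
# cities containing regex metacharacters, and \b avoids matching a
# city name embedded inside a longer word.
pattern = r'\b(' + '|'.join(map(re.escape, df2['city'])) + r')\b'
df1['city'] = df1['address'].str.extract(pattern, expand=False)
```

Rows whose address contains no known city end up as NaN, which is easy to filter afterwards.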

Regular expression to rename the column by stripping the column name

I have a df with many columns, and each column has repeated values because it's survey data. As an example my data looks like this:
df:
Q36r9: sales platforms - Before purchasing a new car Q36r32: Advertising letters - Before purchasing a new car
Not Selected Selected
So I want to strip the text from the column names. For example, from the first column I want to get the text between ":" and "-", so it should be "sales platforms". In the second part I want to convert the values of the columns: "Selected" should be replaced with the name of the column and "Not Selected" with NaN.
so desired output would be like this:
sales platforms Advertising letters
NaN Advertising letters
Edited: Another problem, if I have a column name like:
Q40r1c3: WeChat - Looking for a new car - And now if you think again - Which social media platforms or sources would you use in each situation?
If I just want the text between ":" and the first "-", it should extract "WeChat".
IIUC,
we can take advantage of a regex capture using .*, which greedily matches everything between ":" and the last "-":
import re
df.columns = [re.search(':(.*)-', i).group(1).strip() for i in df.columns]
print(df.columns)
sales platforms Advertising letters
0 Not Selected None
Edit:
To handle column names with several hyphens, we need to avoid greedy matching and use +? instead:
+? quantifier: matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
Q36r9: sales platforms - Before purchasing a new car Q40r1c3: WeChat - Looking for a new car - And now if you think again - Which social media platforms or sources would you use in each situation?
0 1
import re
[re.search(':(.+?)-',i).group(1).strip() for i in df.columns]
['sales platforms', 'WeChat']
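The second half of the question, mapping "Selected" to the column name and "Not Selected" to NaN, can also be done without a loop. A sketch (with a small made-up frame) using the same lazy regex on the Index itself, then where/mask per column:

```python
import pandas as pd

df = pd.DataFrame({
    'Q36r9: sales platforms - Before purchasing a new car': ['Not Selected', 'Selected'],
    'Q36r32: Advertising letters - Before purchasing a new car': ['Selected', 'Selected'],
})

# Lazy match between ':' and the first '-', applied to the column Index.
df.columns = df.columns.str.extract(r':\s*(.+?)\s*-', expand=False)

# Per column: keep only 'Selected', then substitute the column's own name.
df = df.apply(lambda s: s.where(s == 'Selected').mask(s == 'Selected', s.name))
```

`s.where(s == 'Selected')` turns everything else into NaN, and `mask` swaps the surviving "Selected" for the column label `s.name`.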

Removing white space at the beginning of values in multiple columns

I found a solution to this:
df['Name'] = df['Name'].str.lstrip()
df['Parent'] = df['Parent'].str.lstrip()
I have this DataFrame df (there is a white space to the left of "A" and "C" in the second row, which doesn't show well here). I would like to remove that space.
Mark Name Parent age
10 A C 1
12 A C 2
13 B D 3
I tried
df['Name'].str.lstrip()
df['Parent'].str.lstrip()
then tried
df.to_excel('test.xlsx')
but the result in Excel still had the white spaces
I then tried defining another variable
x = df['Name'].str.lstrip()
x.to_excel('test.xlsx')
that worked fine in Excel, but x is a new object and only had that one column
I then tried repeating the same for 'Parent' and played around with joining multiple DataFrames back to the original, but I still couldn't get it to work, and that seems too convoluted anyway
Finally, even if my first attempt had worked, I would like to be able to strip the white space in one go, without a separate call for each column name
You could try using
df['Name'] = df['Name'].str.replace(" ", "")
though note this would delete all whitespace, including spaces inside the values.
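To strip only leading whitespace, and from several columns in one go (a sketch using the sample frame from the question): lstrip is a method, so it needs parentheses, and the result has to be assigned back, which is why the earlier attempts left the file unchanged.

```python
import pandas as pd

df = pd.DataFrame({'Mark': [10, 12, 13],
                   'Name': ['A', ' A', 'B'],
                   'Parent': ['C', ' C', 'D'],
                   'age': [1, 2, 3]})

# Call lstrip() (note the parentheses) on each target column and
# assign the result back into the frame in a single step.
cols = ['Name', 'Parent']
df[cols] = df[cols].apply(lambda s: s.str.lstrip())
```

After this, `df.to_excel('test.xlsx')` writes the cleaned values, since the frame itself was modified.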

pandas DataFrame conditional string split

I have a column of influenza virus names within my DataFrame. Here is a representative sampling of the name formats present:
(A/Egypt/84/2001(H1N2))
A/Brazil/1759/2004(H3N2)
A/Argentina/126/2004
I am only interested in getting out A/COUNTRY/NUMBER/YEAR from the strain names, e.g. A/Brazil/1759/2004. I have tried doing:
df['Strain Name'] = df['Original Name'].str.split("(")
However, if I try accessing .str[0], then I miss out case #1. If I do .str[1], I miss out cases #2 and #3.
Is there a solution that works for all three cases? Or is there some way to apply a condition in string splits, without iterating over each row in the data frame?
So, based on EdChum's recommendation, I'll post my answer here.
Minimal data frame required for tackling this problem:
Index Strain Name Year
0 (A/Egypt/84/2001(H1N2)) 2001
1 A/Brazil/1759/2004(H3N2) 2004
2 A/Argentina/126/2004 2004
Code for getting the strain names only, without parentheses or anything else inside the parentheses:
df['Strain Name'] = df['Strain Name'].str.split('(').apply(lambda x: max(x, key=len))
This code works for the particular cases shown here; the trick is that the isolate's strain name is the longest piece left after splitting on the opening parenthesis ("(").
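An alternative sketch that does not depend on the longest-piece assumption: describe the A/COUNTRY/NUMBER/YEAR shape directly with str.extract (here assuming country names contain only letters and spaces):

```python
import pandas as pd

df = pd.DataFrame({'Original Name': ['(A/Egypt/84/2001(H1N2))',
                                     'A/Brazil/1759/2004(H3N2)',
                                     'A/Argentina/126/2004']})

# Capture TYPE/COUNTRY/NUMBER/YEAR wherever it sits in the string,
# so leading parentheses and trailing subtype tags are ignored.
df['Strain Name'] = df['Original Name'].str.extract(
    r'([A-Z]/[A-Za-z ]+/\d+/\d+)', expand=False)
```

Because extract searches for the pattern anywhere in the value, all three formats from the question yield the same clean result.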
