Problem
How to replace X with _, given the following dataframe:
data = {'street':['13XX First St', '2XXX First St', '47X Second Ave'],
'city':['Ashland', 'Springfield', 'Ashland']}
df = pd.DataFrame(data)
The streets need to be edited, replacing each X with an underscore _.
Notice that the number of digits changes, as does the number of Xs. Also, street names such as Xerxes should not be edited to _er_es, but rather left unedited. Only the street-number section should change.
Desired Output
data = {'street':['13__ First St', '2___ First St', '47_ Second Ave'],
'city':['Ashland', 'Springfield', 'Ashland']}
df = pd.DataFrame(data)
Progress
Some potential regex building blocks include:
1. [0-9]+ to capture numbers
2. X+ to capture Xs
3. ([0-9]+)(X+) to capture groups
df['street'].str.replace(r"([0-9]+)(X+)", r"\2", regex=True)
I'm pretty weak with regex, so my approach may not be the best. Preemptive thank you for any guidance or solutions!
IIUC, this would do:
def repl(m):
    return m.group(1) + '_'*len(m.group(2))

df['street'].str.replace(r"^([0-9]+)(X*)", repl, regex=True)
Output:
0 13__ First St
1 2___ First St
2 47_ Second Ave
Name: street, dtype: object
IIUC, we can pass a function into the repl argument, much like re.sub:
def repl(m):
    return '_' * len(m.group())

df['street'].str.replace(r'X+', repl, regex=True)
out:
0 13__ First St
1 2___ First St
2 47_ Second Ave
Name: street, dtype: object
If you need to match only after numbers, use a lookbehind so the digit itself is not consumed by the replacement:
df['street'].str.replace(r'(?<=\d)X+', repl, regex=True)
Assuming 'X' only occurs in the 'street' column, re.sub applied to each value will do:
streetresult = df['street'].apply(lambda s: re.sub('X', '_', s))
Your desired output should be the result.
Code I tested:
import pandas as pd
import re

data = {'street': ['13XX First St', '2XXX First St', '47X Second Ave'],
        'city': ['Ashland', 'Springfield', 'Ashland']}
df = pd.DataFrame(data)
streetresult = df['street'].apply(lambda s: re.sub('X', '_', s))
print(streetresult)
Related
This should be easy, but I'm stumped.
I have a df that includes a column of PLACENAMES. Some of these have multiple word names:
Able County
Baker County
Charlie County
St. Louis County
All I want to do is to create a new column in my df that has just the name, without the "county" word:
Able
Baker
Charlie
St. Louis
I've tried a variety of things:
1. places['name_split'] = places['PLACENAME'].str.split()
2. places['name_split'] = places['PLACENAME'].str.split()[:-1]
3. places['name_split'] = places['PLACENAME'].str.rsplit(' ',1)[0]
4. places = places.assign(name_split = lambda x: ' '.join(x['PLACENAME'].str.split()[:-1]))
1. Works - splits the names into a list ['St.','Louis','County']
2. The list slice is ignored, resulting in the same list ['St.','Louis','County'] rather than ['St.','Louis']
3. Raises a ValueError: Length of values (2) does not match length of index (41414)
4. Raises a TypeError: sequence item 0: expected str instance, list found
I've also defined a function and called it with .assign():
def processField(namelist):
words = namelist[:-1]
name = ' '.join(words)
return name
places = places.assign(name_split = lambda x: processField(x['PLACENAME']))
This also raises a TypeError: sequence item 0: expected str instance, list found
This seems to be a very simple goal and I've probably overthought it, but I'm just stumped. Suggestions about what I should be doing would be deeply appreciated.
Apply the Series.str.rpartition function, which splits on the last occurrence of the separator (a space by default):
places['name_split'] = places['PLACENAME'].str.rpartition()[0]
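A quick runnable check of the rpartition approach, using the sample names from the question:

```python
import pandas as pd

# sample names from the question
places = pd.DataFrame({'PLACENAME': ['Able County', 'Baker County',
                                     'Charlie County', 'St. Louis County']})

# rpartition splits on the LAST space, so multi-word names stay intact
places['name_split'] = places['PLACENAME'].str.rpartition()[0]
print(places['name_split'].tolist())  # ['Able', 'Baker', 'Charlie', 'St. Louis']
```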
Use str.replace to remove the last word and the preceding spaces:
places['new'] = places['PLACENAME'].str.replace(r'\s*\w+$', '', regex=True)
# or
places['new'] = places['PLACENAME'].str.replace(r'\s*\S+$', '', regex=True)
# or, only match 'County'
places['new'] = places['PLACENAME'].str.replace(r'\s*County$', '', regex=True)
Output:
PLACENAME new
0 Able County Able
1 Baker County Baker
2 Charlie County Charlie
3 St. Louis County St. Louis
Currently each column that I want to delimit stores address information (street, city, zip code), and I want to separate each section into its own column, e.g. streets, cities, and zip codes will each have their own column. However, I need to repeat this process for all columns stored in my list address_columns.
What the data looks like:
address_columns = ['QUESTION_47', 'QUESTION_56', 'QUESTION_65', 'QUESTION_83', 'QUESTION_92']
(how each column looks, using fake addresses)
QUESTION 47
64 Fordham St, Toms River, NJ 08753
7352 Poor House St. Hartford, CT 06106
8591 Peninsula Lane, Copperas Cove, TX 76522
Rough idea of how to implement my problem:
Step One.
all strings before the first comma go in the first column
all strings before the second comma go in the second column
all strings before third comma go in the third column
etc..
Step Two.
identify text after the last comma and put it in the last additional column
Step Three.
Repeat for all columns in the list
You can use the pandas.Series.str.split method with the expand=True argument. For example:
import pandas as pd
df = pd.DataFrame({'q47': ['11 4 st,seattle,22222', '11 9st,chicago,23456']})
dfnew = df['q47'].str.split(',', expand=True).rename(columns={0:'street', 1:'city', 2:'zip'})
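To repeat this for every column in address_columns, one sketch (the suffix slicing assumes every column name starts with 'QUESTION_'; the street/city/zip names are my own) is to loop and concatenate:

```python
import pandas as pd

# sample frame mirroring the question's data (shortened)
df = pd.DataFrame({
    'QUESTION_47': ['64 Fordham St, Toms River, NJ 08753',
                    '8591 Peninsula Lane, Copperas Cove, TX 76522'],
    'QUESTION_56': ['1234 Some House St, City, CT 11111',
                    '90 Yet Other St, City, AB 12345'],
})
address_columns = ['QUESTION_47', 'QUESTION_56']

parts = []
for col in address_columns:
    split = df[col].str.split(',', expand=True)
    # assumes 'QUESTION_' prefix; keep the question number as a suffix
    split.columns = ['street_' + col[9:], 'city_' + col[9:], 'zip_' + col[9:]]
    # drop the space left behind after each comma
    parts.append(split.apply(lambda s: s.str.strip()))
result = pd.concat(parts, axis=1)
```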
Let's say this is your data:
df = pd.DataFrame({'QUESTION_47': {0: '64 Fordham St, Toms River, NJ 08753',
1: '7352 Poor House St, Hartford, CT 06106',
2: '8591 Peninsula Lane, Copperas Cove, TX 76522'},
'QUESTION_56': {0: '1234 Some House St, City, CT 11111',
1: '234 Some Other St, City, AB 23456',
2: '90 Yet Other St, City, AB 12345'}})
You can expand each column into three, suffixed with the question number, and then stack them horizontally:
# holder dataframe
df_all = pd.DataFrame()
# loop over columns in dataframe
for c in df.columns:
    df_ = pd.DataFrame()
    ext = c[8:]  # extract the question number from the column name
    df_[['street'+ext, 'city'+ext, 'zipcode'+ext]] = df[c].str.split(',', expand=True)
    # concatenate the new expansion to the previously accumulated questions
    df_all = pd.concat([df_all, df_], axis=1)
Output:
I have a dataframe with some duplicate tags, separated by commas, in the "Tags" column. Is there a way to remove the duplicate strings from the series? I want the output in row 400 to have just Museum, Drinking, Shopping.
I can't simply split on a comma and remove duplicates, because some tags in the series contain similar words. For example, with [Museum, Art Museum, Shopping], splitting and dropping the repeated 'Museum' strings would affect the distinct 'Art Museum' string.
Desired Output
You can split on the comma and convert to a set(), which removes duplicates, after removing leading/trailing whitespace with str.strip(). Then you can apply() this to your column:
df['Tags']=df['Tags'].apply(lambda x: ', '.join(set([y.strip() for y in x.split(',')])))
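One caveat: set() does not preserve tag order, so the joined result can come out in any order. Sorting the set makes the output deterministic, at the cost of alphabetical rather than original order:

```python
import pandas as pd

# single-row sample; sorted() fixes the otherwise arbitrary set order
df = pd.DataFrame({'Tags': ['Museum, Art Museum, Shopping, Museum']})
df['Tags'] = df['Tags'].apply(
    lambda x: ', '.join(sorted(set(y.strip() for y in x.split(',')))))
print(df['Tags'].iloc[0])  # Art Museum, Museum, Shopping
```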
You can create a function that removes duplicates from a given string. Then apply this function to your column Tags.
def remove_dup(strng):
    '''
    Split the input string on ', ', drop duplicates while preserving order,
    and rejoin the remaining tags.
    '''
    return ', '.join(dict.fromkeys(strng.split(', ')))
df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
DEMO:
import pandas as pd
my_dict = {'Tags':["Museum, Art Museum, Shopping, Museum",'Drink, Drink','Shop','Visit'],'Country':['USA','USA','USA', 'USA']}
df = pd.DataFrame(my_dict)
df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
df
Output:
Tags Country
0 Museum, Art Museum, Shopping USA
1 Drink USA
2 Shop USA
3 Visit USA
Without a code example in the question, I've thrown together something that should work.
import pandas as pd
test = [['Museum', 'Art Museum', 'Shopping', "Museum"]]
df = pd.DataFrame()
df[0] = test
df[0] = df[0].apply(set)
Out[35]:
0
0 {Museum, Shopping, Art Museum}
One approach that avoids apply:
# in your code just s = df['Tags']
s = pd.Series(['','', 'Tour',
'Outdoors, Beach, Sports',
'Museum, Drinking, Drinking, Shopping'])
(s.str.split(r',\s+', expand=True)
.stack()
.reset_index()
.drop_duplicates(['level_0',0])
.groupby('level_0')[0]
.agg(','.join)
)
Output:
level_0
0
1
2 Tour
3 Outdoors,Beach,Sports
4 Museum,Drinking,Shopping
Name: 0, dtype: object
There may be fancier ways of doing this kind of thing, but this will do the job.
Make it lower-case:
data['tags'] = data['tags'].str.lower()
Split every row in the tags column on commas; this returns a list of strings:
data['tags'] = data['tags'].str.split(',')
Map str.strip over every element of each list (removing trailing spaces), then apply set() to the current words to remove duplicates:
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip, x)))
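Putting the steps together on sample data (the lower-case 'tags' column name is assumed):

```python
import pandas as pd

# assumed sample data with a lower-case 'tags' column
data = pd.DataFrame({'tags': ['Museum, Drinking, Drinking, Shopping']})

data['tags'] = data['tags'].str.lower()     # 1. lower-case
data['tags'] = data['tags'].str.split(',')  # 2. split each row into a list
# 3-4. strip whitespace from each word, then dedupe via set()
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip, x)))
```

Note the result is a set per row, so the original tag order is not kept.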
With the following dataframe, I'm trying to create a new guest_1 column that takes the first two words in each item of the guest column. At the bottom, you can see my desired output.
Is there some sort of "if doesn't exist, then..." logic I can apply here?
I've tried the following, but the obvious difficulty is accounting for a person with a single word for a name.
df['guest_1'] = df.guest.str.split().str.get(0) + ' ' + df.guest.str.split().str.get(1)
df = pd.DataFrame(
{'date': ['2018-11-21','2018-02-26'],
'guest': ['Anthony Scaramucci & Michael Avenatti', 'Robyn'],
})
df['guest_1'] = ['Anthony Scaramucci', 'Robyn']
You can split, slice, and join. This will gracefully handle out-of-bounds slices:
df.guest.str.split().str[:2].str.join(' ')
df['guest_1'] = df.guest.str.split().str[:2].str.join(' ')
df
date guest guest_1
0 2018-11-21 Anthony Scaramucci & Michael Avenatti Anthony Scaramucci
1 2018-02-26 Robyn Robyn
I am trying to figure out how to remove a word from a group of words in a column and insert that removed word into a new column. I figured out how to remove part of a column and insert it into a new row, but I cannot figure out how to target a specific word. I assume I can target it by placement ("Mr." is always the 2nd word), or maybe by taking the word between the first "," and the ".", which is also constant in my data set.
Name Age New_Name
Doe, Mr. John 23 Mr.
Anna, Mrs. Fox 33 Mrs.
EDITED the above to add another row
How would I remove the "Mr." from the name column and insert it into the "New_Name" column?
So far I have come up with:
data['New_name'] = data.Name.str[:2]
This doesn't allow me to specifically target "Mr." though.
I think I have to use a string.split, but the exact code is eluding me.
If the Mr. is always in the same position, as indicated by your example, this can be accomplished with a list comprehension:
df['New_Name'] = [x.split(' ')[1] for x in df['Name']]
and
df['Name'] = [' '.join(x.split(' ')[::2]) for x in df['Name']]
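A quick runnable check of both comprehensions against the question's sample rows:

```python
import pandas as pd

# sample rows from the question
df = pd.DataFrame({'Name': ['Doe, Mr. John', 'Anna, Mrs. Fox'], 'Age': [23, 33]})

# the title is always the 2nd space-separated word
df['New_Name'] = [x.split(' ')[1] for x in df['Name']]
# keep every other word (last name + first name), skipping the title
df['Name'] = [' '.join(x.split(' ')[::2]) for x in df['Name']]
print(df['Name'].tolist())  # ['Doe, John', 'Anna, Fox']
```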
First, extract the title from the name (it sits between the comma and the dot) and store it in another column. Then repeat the operation to remove the title from the 'Name' column:
import pandas as pd
df = pd.DataFrame({'Name':['Doe, Mr. John', 'Anna, Ms. Fox'], 'Age':[23,33]})
df['New_Name'] = df['Name'].apply(lambda x: x[x.find(',')+len(','):x.rfind('.')]+'.')
df['Name'] = df['Name'].apply(lambda x: x.replace(x[x.find(',')+len(','):x.rfind('.')]+'.',''))
print(df)
Output:
Age Name New_Name
0 23 Doe, John Mr.
1 33 Anna, Fox Ms.
You can use pandas str.replace and str.extract methods
First extract title to form new column
df['New_Name'] = df['Name'].str.extract(r',\s([A-Za-z]+\.)', expand=False)
Then use replace to remove the title, collapsing the surrounding spaces to one:
df['Name'] = df['Name'].str.replace(r'\s([A-Za-z]+\.)\s', ' ', regex=True)
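A minimal run of the extract/replace pair on the question's sample rows:

```python
import pandas as pd

# sample rows from the question
df = pd.DataFrame({'Name': ['Doe, Mr. John', 'Anna, Mrs. Fox'], 'Age': [23, 33]})

# pull the title (word ending in a literal dot, after the comma) into its own column
df['New_Name'] = df['Name'].str.extract(r',\s([A-Za-z]+\.)', expand=False)
# then drop it from Name, collapsing the surrounding spaces to one
df['Name'] = df['Name'].str.replace(r'\s([A-Za-z]+\.)\s', ' ', regex=True)
```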
You get:
Age Name New_Name
0 23 Doe, John Mr.
name = "Doe, Mr. John"
# if you always expect a title (Mr/Ms) between comma and dot
# split to lastname, title and firstname and strip spaces
newname = [ n.strip() for n in name.replace(".", ",").split(",") ]
print(newname)
#> ['Doe', 'Mr', 'John']
Then you can print a title and a first-name/last-name column, or another combination of them.