Python, pandas replace entire column with regex match of string - python

I'm using pandas to analyze data from 3 different sources, which are imported into dataframes and require modification to account for human error, as this data was all entered by humans and contains errors.
Specifically, I'm working with street names. Until now, I have been using .str.replace() to remove street types (st., street, blvd., ave., etc.), as shown below. This isn't working well enough, and I decided I would like to use regex to match a pattern, and transform that entire column from the original street name, to the pattern matched by regex.
df['street'] = df['street'].str.replace(r' avenue+', '', regex=True)
I've decided I would like to use regex to identify (and remove all other characters from the address column's fields): any number of integers, followed by a space, and then the first 3 number of alphabetic characters.
For example, "3762 pearl street" might become "3762 pea" if x is 3 with the following regex:
(\d+ )+\w{0,3}
How can I use panda's .str.replace to do this? I don't want to specify WHAT I want to replace with the second argument. I want to replace the original string with the pattern matched from regex.
Something that, in my mind, might work like this:
df['street'] = df['street'].str.replace(ORIGINAL STRING, r' (\d+ )+\w{0,3}, regex=True)
which might make 43 milford st. into "43 mil".
Thank you, please let me know if I'm being unclear.

you could use the extract method to overwrite the column with its own content
pat = r'(\d+\s[a-zA-Z]{3})'
df['street'] = df['street'].str.extract(pat)
Just an observation: The regex you shared (\d+ )+\w{0,3} matches the following patterns and returns some funky stuff as well
1131 1313 street
121 avenue
1 1 1 1 1 1 avenue
42
I've changed it up a bit based on what you described, but i'm not sure if that works for all your datapoints.

Related

How to escape double inverted commas in json values from pandas dataframe for parsing?

so this is a long question but please hold on as I try my best to explain my problem:
I have a dataframe which has one column with rows as json, and I am able to parse them correctly
id | email | phone no | details
-------------------------------------------------
0 10 | abc#g.com | 123 | {"name" : "John "Smart" Wick", "address" : "123 c "dumb" road"}
1 12 | xyz#g.com | 789 | {"name" : "Peter Parker", "address" : "L "check" street"}
I want this json to be distributed to columns as:
id
email
phone no
name
address
10
abc#g.com
123
John "Smart" Wick
123 c "dumb" road
12
xyz#g.com
789
Peter Parker
L "check" street
To break the json keys into columns, I am able to do as:
# Check: 1
result = df.pop('details').apply(json.loads).apply(pd.Series).join(df)
This works always until when I come across a situation like that above where there are inverted commas as part of the json value in any field. This data is for representation purpose in reality I have millions of records and the column 'details' has 10+ key/value pairs.
For a hot fix, this is what I have done:
# check: 2
df['details'] = df['details'].str.replace('John "Smart" "Wick','John Smart Wick')
df['details'] = df['details'].str.replace('123 c "dumb" road','123 c dumb road')
df['details'] = df['details'].str.replace('L "check" street','L check street')
Then I run the code at #check: 1 and it works fine and replace it again this time the other way around. In a million records, there are just 2 such cases which leads to such a problem breaking the code so I found those 2 notorious records and for a hot-fix replaced the data to remove the inverted commas and later after processing re-introduced them.
What I want is to have a way such that no matter how many times such issues happen it doesn't create a problem and passes #check: 1 easily and return the original value without me having to catch such records manually and replacing it for the thing to run. I was wondering if regex can do this, and I tried few things but those were not good enough and kept throwing error.
I am able to solve the issue at my level but a universal way to handle all such exceptions in json key/value pair for a column in pandas dataframe will be a great thing to learn. I know the json is not clean here so basically a way to clean it for any such scenarios so that we can do the splitting of the key/value into individual columns.
Thanks for any help.
Edit: I have put this in comment also, if I add escape characters then it works fine like:
df['details'] = df['details'].str.replace('John "Smart" "Wick','John \\"Smart\\" Wick')
df['details'] = df['details'].str.replace('123 c "dumb" road','123 c \\"dumb\\" road')
df['details'] = df['details'].str.replace('L "check" street','L check \\"check\\" street')
This will work too but it still requires me to identify the records manually and add a replace command for those records with escape characters. Can this be done in a loop for the entire 'details' column to self-identify such cases and add escape characters wherever required?
Since there are only two fields in the stringified JSONs, you can use contextual matching with regex to make sure you match any text between the two names or end of the string.
Here is the regex you can use to match and capture the necessary bits:
(?s)("(?:name|address)"\s*:\s*")(.*?)(?="(?:\s*,\s*"(?:name|address)"|}$))
See the regex demo. The matches contain two adjacent groups where the first one needs to be kept as is, and all " chars in the second group should be prepended with a literal backslash.
Use Series.str.replace to perform this manipulation:
import pandas as pd
df = pd.DataFrame(
{'text':['{"name" : "John "Smart" Wick", "address" : "123 c "dumb" road"}']}
)
rx = r'(?s)("(?:name|address)"\s*:\s*")(.*?)(?="(?:\s*,\s*"(?:name|address)"|}$))'
df['text'] = df['text'].str.replace(rx, lambda x: x.group(1) + x.group(2).replace('"',r'\"'), regex=True)
# -> df
# text
# 0 {"name" : "John \"Smart\" Wick", "address" : "123 c \"dumb\" road"}

How to replace string if some characters are the same on pandas?

I have an Import/export trade data of the country. From initial data, some country names have a weird symbol: ��.
For this reason, I am struggling to replace those strings.
Currently, I am replacing country names to their 3 letter country code. For example, China = CHI, Russian Federation = RUS. My code works fine for most of the country names.
Except: C��ina, ��etnam, Turk��, T��rkey, Uzbekist��n, Uzb��kistan etc.
I can manually format it for the first time, however, this data is updating every month, and size is now almost 2 billion rows.
for i,j in all_3n.items():
df['Country'] = df['Country'].str.replace(j,i)
This is the code how I am replacing now. Furthermore, how to replace the whole string, not only the founded string?
For example, for lookup I have Russia and string in the database is Russian Federation, it is returning me RUSn Federation. any ideas on how to overcome these two challenges? Thanks
You should use the code '\uFFFD' for the replacement character �:
df['Country'] = df['Country'].str.replace('\uFFFD', '')

Is there an easy way to remove end of the string in rows of a dataframe?

I'm new into Python/pandas and I'm losing my hair with Regex. I would like to use str.replace() to modify strings into a dataframe.
I have a 'Names' column into dataframe df which looks like this:
Jeffrey[1]
Mike[3]
Philip(1)
Jeffrey[2]
etc...
I would like to remove in each single row of the column the end of the string which follows either the '[' or the '('...
I thought to use something like this below but I have hard time to understand regex, any tip with regard to a nice regex summary for beginner is welcome.
df['Names']=df['Names'].str.replace(r'REGEX??', '')
Thanks!
Extract only the alphabetic letters with Series.str.extract:
df['Names'] = df['Names'].str.extract('([A-Za-z]+)')
Names
0 Jeffrey
1 Mike
2 Philip
3 Jeffrey
This regex would work, with $ indicates the end of the string:
df['Names'] = df['Names'].str.extract('(.*)[\[|\(]\d+[\]\)]$')
You could use split to take everything before the first [ or ( characters.
df['Names'].str.split('\[|\(').str[0]
Names
0 Jeffrey
1 Mike
2 Philip
3 Jeffrey

Find and replace for millions of rows with regex in Python

I have 2 set of data.
First one which serves as dictionary has two columns keyword and id and 180000 rows. Below is some sample data.
Also, note that some of the keyword are as small as 2 character and as big as 700 characters and there is no fixed length of the keywords. Although id has fixed pattern of 3 digit number with a hash symbol before and after the number.
keyword id
salesman #123#
painter #486#
senior painter #215#
Second file has one column which is corpus and it runs into 22 million records and length of each record varies between 10 to 1000. Below is sample data which can be considered as input.
corpus
I am working as a salesman. salesmanship is not my forte, however i have become a good at it
I have been a painter since i was 19
are you the salesman?
Output
corpus
I am working as a #123#. salesmanship is not my forte, however i have become a good at it
I have been a #486# since i was 19
are you the #123#?
Please note that i want to replace complete word and not overlapped words. so in the first sentence salesman was replaced with #123# where as salesmanship was not replaced with #123#ship. This requires me to add regular expression '\b' before and after the keyword. This is why Regex is important for the search function
So this is a search and replace operation for multi-million rows and has regex. I have read
Mass string replace in python?
and
Speed up millions of regex replacements in Python 3, however it is taking me days to do this find and replace, which i can't afford as this is a weekly task. I want to be able to do this much faster. Below is my code
Id = df_dict.Id.tolist()
#convert to list with regex
keyword = [r'\b'+ x + r'\b' for x in df_dict.keyword]
#light on memory to clean file
del df_dict
#replace
df_corpus["corpus_text"].replace(keyword, Id, regex=False,inplace=True)

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently is repr()s? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
parse item.title for bandRaw, venue, date
if bandRaw == str(band)
send venue name + ", Dallas, TX" to google for geocoding
return lat,long
list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
pat = re.compile('([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individual, just call them on the info object:
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
if m:
bandRaw, venue, date = m.groups()
if band == bandRaw:
place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
lines.append(",".join([band, venue, date, lat, lng]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.

Categories

Resources