I have a list of strings and I would like to remove the " characters from the last two strings
"racist superman"|"rudy"|"mancuso"|"king"|"bach"|"racist"|"superman"|"love"|"rudy mancuso poo bear black white official music video"|"iphone x by pineapple"|"lelepons"|"hannahstocking"|"rudymancuso"|"inanna"|"anwar"|"sarkis"|"shots"|"shotsstudios"|"alesso"|"anitta"|"brazil"|"Getting My Driver's License | Lele Pons"
My code looks like this. It does strip the "" from the other strings and removes the "|", just not for those last two.
Note: the input tags_str for the function is read from a file.
def extract_tags(tags_str):
    b = [n.strip('""').strip().replace('""', '') for n in tags_str.split("|")]
    return b
['racist superman', 'rudy', 'mancuso', 'king', 'bach', 'racist', 'superman', 'love', 'rudy mancuso poo bear black white official music video', 'iphone x by pineapple', 'lelepons', 'hannahstocking', 'rudymancuso', 'inanna', 'anwar', 'sarkis', 'shots', 'shotsstudios', 'alesso', 'anitta', 'brazil', "Getting My Driver's License", 'Lele Pons']
As you can see, the first strip() gets rid of the "" and the second strip() gets rid of the whitespace. However, "Getting My Driver's License"
still shows up with double quotes, and I expected the replace('""', '') to remove them, but that's not the case.
The preferred output is:
['racist superman', 'rudy', 'mancuso', 'king', 'bach', 'racist', 'superman', 'love', 'rudy mancuso poo bear black white official music video', 'iphone x by pineapple', 'lelepons', 'hannahstocking', 'rudymancuso', 'inanna', 'anwar', 'sarkis', 'shots', 'shotsstudios', 'alesso', 'anitta', 'brazil', 'Getting My Driver's License', 'Lele Pons']
Edit:
Thanks for the answers/comments. It got fixed with
b = [n.strip('""').strip().replace("'", '') for n in tags_str.split("|")]
since it was a single quote instead of a double quote.
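For reference, the double quotes around that one entry were never in the string itself; they only show up in the printed repr, because the string contains an apostrophe. A quick standalone check makes that visible:
s = "Getting My Driver's License"
print(s)         # Getting My Driver's License  (no double quotes in the string)
print([s])       # ["Getting My Driver's License"]  (repr adds them because of the apostrophe)
print('"' in s)  # False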
Instead of using the single quote (apostrophe, ' ), you could use " ´ " instead.
This would avoid the whole issue.
I have a list with address information.
The placement of the items in the list can be random.
address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']
I want to extract each item of the list into a new string.
r = re.compile(".region")
region = list(filter(r.match, address))
It works, but there is more than one pattern for "region". For example, it can appear as "South reg." or "South r-n".
How can I combine multiple patterns?
Also, the digit 4 in the list is the building number. It can be only digits, or something like 4k1.
How can I extract the building number?
Hopefully I understood the requirement correctly.
For extracting the region, I chose to get it by the first word, but if you can be sure of the regions that are accepted, it would be better to construct the regex based on the valid values, not the first word.
Also, for the building extraction, I am not sure which characters you want to keep versus which you want to remove. In this case I chose to keep only alphanumeric characters, meaning that everything else is stripped.
CODE
import re

list1 = [' South region', ' district KTS', ' -4k-1.', ' app. 106', ' ent. 1', ' st. 15']

def GetFirstWord(list2, column):
    return re.search(r'\w+', list2[column].strip()).group()

def KeepAlpha(list2, column):
    return re.sub(r'[^A-Za-z0-9 ]+', '', list2[column].strip())

print(GetFirstWord(list1, 0))
print(KeepAlpha(list1, 2))
OUTPUT
South
4k1
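To the "How can I combine multiple patterns?" part: alternation in a single regex is usually enough. A sketch, assuming the accepted spellings are "region", "reg." and "r-n" (adjust the alternatives to your real data):
import re

address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']

# one compiled pattern with all accepted spellings of "region" as alternatives
region_re = re.compile(r'\b\w+\s+(?:region\b|reg\.|r-n\b)')

regions = [item.strip() for item in address if region_re.search(item)]
print(regions)  # ['South region']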
I have text like this: Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,Handsome cello wrapped hard magnet, Ideal for home or office.
I removed the punctuation from this text with the following code.
import string
string.punctuation

def remove_punctuation(text):
    punctuationfree = "".join([i for i in text if i not in string.punctuation])
    return punctuationfree

# storing the punctuation-free text
df_Train['BULLET_POINTS'] = df_Train['BULLET_POINTS'].apply(lambda x: remove_punctuation(x))
df_Train.head()
Here, df_Train is a pandas DataFrame whose "BULLET_POINTS" column contains the kind of text data mentioned above.
The result I got is Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan EksiogluHandsome cello wrapped hard magnet Ideal for home or office
Notice how the two words Eksioglu and Handsome are combined because there is no space after the comma. I need a way to overcome this issue.
In this case, it makes sense to replace all the special chars with a space, and then strip the result and shrink multiple spaces to a single space:
df['BULLET_POINTS'] = df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
Or, if you have chunks of punctuation + whitespace to handle:
df['BULLET_POINTS'].str.replace(r'[\W_]+', ' ', regex=True).str.strip()
Output:
>>> df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
0 Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu Handsome cello wrapped hard magnet Ideal for home or office
Name: BULLET_POINTS, dtype: object
The (?:[^\w\s]|_)+ regex matches one or more occurrences of any char that is neither a word char nor whitespace, plus underscores (i.e. one or more non-alphanumeric, non-whitespace chars), and replaces each such run with a space.
The [\W_]+ pattern is similar but also matches whitespace, so runs of punctuation mixed with spaces collapse to a single space.
The .str.strip() part is necessary as the replacement might result in leading/trailing spaces.
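A self-contained way to try it (a small sketch with a one-row DataFrame standing in for df_Train, using the second pattern):
import pandas as pd

df = pd.DataFrame({'BULLET_POINTS': [
    "Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,"
    "Handsome cello wrapped hard magnet, Ideal for home or office."
]})

# collapse each run of punctuation, underscores, and whitespace to one space, then trim
df['BULLET_POINTS'] = (df['BULLET_POINTS']
                       .str.replace(r'[\W_]+', ' ', regex=True)
                       .str.strip())
print(df.loc[0, 'BULLET_POINTS'])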
I'm currently trying to scrape a website for some information but am running into some issues.
I currently have a bs4.element.Tag element with some html and text in it, and when I do "variable.text", I get the following text:
\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\tArnold, Maryland\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n(410) 757-9836 \n\n Get directions\n\n Favorite court \n\n\n\nTennis Court Details\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\t\tLocation type:\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\tClub\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\tMatches played here:\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t0\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\t
What I want is to get rid of all the whitespace characters (\n and \t) so I can get the relevant information into a list or some other iterable form.
I've tried a bunch of regex commands already, but the one that got me closest to my goal was re.split('[\t\n]', variable.text), which gave me the following:
['',
'',
'Ulmstead Club',
'',
'',
'',
'',
'',
'911 Lynch Dr',
'',
'',
'',
'',
'',
'',
'',
'Arnold, Maryland',
'',
'',
'',
'',
I've cut off a lot of the output to save some space.
I'm super lost and any help would be greatly appreciated
Try splitting on [\t\n]+:
re.split('[\t\n]+', variable.text.strip())
This should work, as it eliminates the empty-string entries in the output list.
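A quick sanity check on a shortened version of the string (just a sketch; the real variable.text is much longer):
import re

sample = '\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\tArnold, Maryland'
print(re.split('[\t\n]+', sample.strip()))
# ['Ulmstead Club', '911 Lynch Dr', 'Arnold, Maryland']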
My guess is that this simple expression might also be helpful:
(?:\\n|\\t)
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?:\\n|\\t)"
test_str = "\\n\\nUlmstead Club\\n\\t\\t\\t\\t\\t911 Lynch Dr\\n\\n\\t\\t\\t\\t\\t\\tArnold, Maryland\\t\\t\\t\\t\\t 21012\\n\\t\\t\\t\\t\\tUnited States\\n(410) 757-9836 \\n\\n Get directions\\n\\n Favorite court \\n\\n\\n\\nTennis Court Details\\n\\n\\n\\n\\n\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tLocation type:\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tClub\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tMatches played here:\\t\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t0\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t"
subst = ""
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
You could use the str.replace() function to get rid of the \n and \t; you don't really need a regular expression to do so (I have replaced the \n and \t with spaces for the next step):
variable.text = variable.text.replace("\n"," ")
variable.text = variable.text.replace("\t"," ")
If you then want to split your data into a list, you could split it on whitespace and use remove() to delete any extra empty strings in the list (note that I am not 100% sure how you want your data separated; I have just made the solution that fit my logic of how it should be split):
result = re.split(r"[\s]\s+", variable.text)
while '' in result:
    result.remove('')
Here is the full code example:
import re
teststring ="\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\tArnold, Maryland\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n(410) 757-9836 \n\n Get directions\n\n Favorite court \n\n\n\nTennis Court Details\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\t\tLocation type:\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\tClub\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\tMatches played here:\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t0\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\t"
teststring = teststring.replace("\n"," ")
teststring = teststring.replace("\t"," ")
# split any fields with more than 1 whitespace between them
result = re.split(r"[\s]\s+", teststring)
# remove any empty string fields from the list
while '' in result:
    result.remove('')
print(result)
Result is:
['Ulmstead Club', '911 Lynch Dr', 'Arnold, Maryland', '21012', 'United States', '(410) 757-9836', 'Get directions', 'Favorite court', 'Tennis Court Details', 'Location type:', 'Club', 'Matches played here:', '0']
I would run 2 regex replacements on the string, first 1 and then 2.
Find \s*(?:\r?\n)\s*
Replace \n
https://regex101.com/r/EGTyKB/1
Find [ ]*\t+[ ]*
Replace \t
https://regex101.com/r/XIyi44/1
This clears out all the whitespace cruft and turns it into a readable block of text:
Ulmstead Club
911 Lynch Dr
Arnold, Maryland 21012
United States
(410) 757-9836
Get directions
Favorite court
Tennis Court Details
Location type:
Club
Matches played here:
0
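In Python, the same two passes might look like this (a sketch, assuming variable.text holds the scraped string from the question):
import re

text = variable.text  # the scraped string from the question

# pass 1: collapse each newline and its surrounding whitespace to a single newline
text = re.sub(r'\s*(?:\r?\n)\s*', '\n', text)
# pass 2: collapse runs of tabs (and the spaces around them) to a single tab
text = re.sub(r'[ ]*\t+[ ]*', '\t', text)

print(text.strip())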
So I am attempting to use Python to write text to a Microsoft Word document. The code works perfectly except when it runs up against a non-ASCII character. When that happens, I am greeted by the following error:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
I attempted to solve this issue by using regular expressions to pluck out and replace the non-ASCII characters. re.sub(pattern, repl, string, count=0, flags=0) seemed like it would work. Here is the code that I threw together:
match1 = re.search(ur'ʼ', bodyHTML)
match2 = re.search(ur'ï¬', bodyHTML)
match3 = re.search(ur'fl', bodyHTML)

if match1:
    print 'Match 1'
    bodyHTML = re.sub(ur'ʼ', "'", bodyHTML)
if match3:
    print 'Match 3'
    bodyHTML = re.sub(ur'fl', 'fl', bodyHTML)
if match2:
    print 'Match 2'
    bodyHTML = re.sub(ur'ï¬', 'fi', bodyHTML)
"match1" works perfectly. Whenever there is an ʼ in the text, it is replaced by an apostrophe.
"match2" and "match3" are a different story. Here's an example:
After a short hike we had our ï¬rst glimpse of the museum
Naturally, this triggers a response from match2. But instead of producing
After a short hike we had our first glimpse of the museum
It spits out
After a short hike we had our fiürst glimpse of the museum
This happens several times. "signiï¬cant" becomes "signifiücant" and so on.
I am not sure why this is happening.
I am also running into an issue where match2 is steamrolling any match from match3. In other words
the ripples on the pond and so did the shimmering reflections in the glass walls
becomes
the ripples on the pond and so did the shimmering refiéected in the glass walls
instead of
the ripples on the pond and so did the shimmering reflected in the glass walls
I'm not really sure why match2 is dominating, especially because I put the match3 if statement before the one for match2 specifically so it would remove all of the sections with "fl" and leave only the "ï¬" snippets for match2 to mop up.
As for the other non-ASCII characters popping up after running the code... I have no idea.
Any help is much appreciated.
Thank you
For specific 'translation' I use chr like this:
import re

def process_special_characters(result):
    '''sub out known yuck for legit stuff (smart quotes, etc.), keep track of how many changes'''
    total = 0
    result = re.subn(chr(133), chr(95), result)      # strange underbar
    total += result[1]
    result = re.subn(chr(145), chr(39), result[0])   # smart quote
    total += result[1]
    result = re.subn(chr(146), chr(39), result[0])   # other smart quote
    total += result[1]
    result = re.subn(chr(150), chr(45), result[0])   # strange hyphen
    total += result[1]
    result = re.subn(chr(8212), chr(45), result[0])  # strange long hyphen
    total += result[1]
    result = re.subn(chr(160), chr(32), result[0])   # non-breaking space
    total += result[1]
    return result[0], total
You could easily make a dict and just have the key be your "from" and the value be the "to", and loop over it. If it's a short list this works fine. Note, using subn allows you to keep track of the number of changes. You can always toss all that logic and just go with sub; it's just that in my case the business side wanted a count of changes.
You could also have lines like this if it is easier to read:
result = re.subn(chr(133), "_", result) #strange underbar
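Putting those two suggestions together, a dict-driven version might look like this (a sketch, reusing the same codepoints as the function above):
import re

# mapping of "from" characters to "to" replacements (same ones as above)
TRANSLATIONS = {
    chr(133): "_",    # strange underbar
    chr(145): "'",    # smart quote
    chr(146): "'",    # other smart quote
    chr(150): "-",    # strange hyphen
    chr(8212): "-",   # strange long hyphen
    chr(160): " ",    # non-breaking space
}

def process_special_characters(result):
    total = 0
    for src, dst in TRANSLATIONS.items():
        result, count = re.subn(re.escape(src), dst, result)
        total += count
    return result, total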
Good luck! Just toss a comment in if you have more questions and I'll update.
I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")

for item in feed:
    print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (and this is my pseudo-Python, just kind of notes I'm writing):
list = bandRaw, venue, date, latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band):
        send venue name + ", Dallas, TX" to google for geocoding
        return lat, long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else:
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re

s = 'Michael Schenker Group (House of Blues Dallas 3/26)'  # trailing ) put back in

pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call group() on the info object:
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regard to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, in which case it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
import geocoders   # from GeoPy
import feedparser  # from www.feedparser.org

us = geocoders.GeocoderDotUS()

feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)

lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:  # band is the user's chosen band from the question
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))

result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
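If the end goal is a real .csv file, the standard csv module handles the quoting and commas for you. A Python 3 sketch, assuming rows is a list of (band, venue, date, lat, lng) tuples collected much like lines above (rows and the output filename are placeholders):
import csv

rows = []  # e.g. filled with (band, venue, date, lat, lng) tuples inside the loop above

with open("concerts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["band", "venue", "date", "lat", "long"])
    writer.writerows(rows)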