I'm trying to remove all commas that are inside quotes (") with python:
'please,remove all the commas between quotes,"like in here, here, here!"'
^ ^
I tried this, but it only removes the first comma inside the quotes:
re.sub(r'(".*?),(.*?")',r'\1\2','please,remove all the commas between quotes,"like in here, here, here!"')
Output:
'please,remove all the commas between quotes,"like in here here, here!"'
How can I make it remove all the commas inside the quotes?
Assuming you don't have unbalanced or escaped quotes, you can use this regex based on negative lookahead:
>>> str = r'foo,bar,"foobar, barfoo, foobarfoobar"'
>>> re.sub(r'(?!(([^"]*"){2})*[^"]*$),', '', str)
'foo,bar,"foobar barfoo foobarfoobar"'
This regex matches a comma only when it is inside double quotes: the negative lookahead asserts that the comma is NOT followed by an even number of quotes.
Notes on the lookahead (?!...):
([^"]*"){2} matches a pair of quotes
(([^"]*"){2})* matches 0 or more pairs of quotes
[^"]*$ makes sure there are no more quotes after the last matched quote
So (?!...) asserts that an even number of quotes does not lie ahead, which means the matched comma sits inside a quoted string.
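For readers who find the one-liner hard to parse, here is a minimal sketch of the same lookahead written with re.VERBOSE so each piece carries a comment. The names INSIDE_QUOTES_COMMA and strip_quoted_commas are made up for illustration, and the sketch still assumes balanced, unescaped quotes.

```python
import re

# Hypothetical name; same idea as the regex above, spelled out piece by piece.
INSIDE_QUOTES_COMMA = re.compile(r'''
    ,                      # a comma ...
    (?!                    # ... that is NOT followed by:
        (?:[^"]*"[^"]*")*  #   zero or more complete pairs of quotes
        [^"]*              #   then only non-quote characters
        $                  #   up to the end of the string
    )
''', re.VERBOSE)


def strip_quoted_commas(text):
    """Remove commas that sit inside double quotes (balanced quotes assumed)."""
    return INSIDE_QUOTES_COMMA.sub('', text)


print(strip_quoted_commas('please,remove all the commas between quotes,"like in here, here, here!"'))
```

Because the lookahead rescans the rest of the string for every comma, this is likely to be slower on long inputs than the split-based answers further down.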
You can pass a function as the repl argument instead of a replacement string. Just get the entire quoted string and do a simple string replace on the commas.
>>> s = 'foo,bar,"foobar, barfoo, foobarfoobar"'
>>> re.sub(r'"[^"]*"', lambda m: m.group(0).replace(',', ''), s)
'foo,bar,"foobar barfoo foobarfoobar"'
Here is another option I came up with if you don't want to use regex.
input_str = 'please,remove all the commas between quotes,"like in here, here, here!"'

def noCommas(string):
    quotes = False  # True while we are inside a quoted section
    output = ''
    for char in string:
        if char == '"':
            quotes = not quotes  # toggle on every quote, opening or closing
        if char != ',' or not quotes:
            output += char  # drop only the commas found inside quotes
    return output

print(noCommas(input_str))
What about doing it without regex?
input_str = '...'
parts = input_str.split('"')
# Odd-indexed parts lie inside quotes (assuming balanced quotes); remove commas only there.
parts[1::2] = [part.replace(',', '') for part in parts[1::2]]
result = '"'.join(parts)
Looping over the string character by character, as in the answer above, is very slow if you want to apply the algorithm to something like a 5 MB CSV file.
This version seems reasonably fast and finds the same quoted sections as the for loop (here it replaces each ; inside quotes with a ,):
#!/bin/python3
data = 'hoko foko; moko soko; "aaa mo; bia"; "ee mo"; "eka koka"; "koni; masa"; "co co"; ehe mo; "bi; ko"; ko ma\n "ka ku"; "ki; ko"\n "ko;ma"; "ki ma"\n"ehe;";koko'
first_split=data.split('"')
split01=[]
split02=[]
for slc in first_split[0::2]:
    split01.append(slc)
for slc in first_split[1::2]:
    slc_new=",".join(slc.split(";"))
    split02.append(slc_new)
resultlist = [item for sublist in zip(split01, split02) for item in sublist]
if len(split01) > len (split02):
    resultlist.append(split01[-1])
if len(split01) < len (split02):
    resultlist.append(split02[-1])
result='"'.join(resultlist)
print(data)
print(split01)
print(split02)
print(result)
Results in:
hoko foko; moko soko; "aaa mo; bia"; "ee mo"; "eka koka"; "koni; masa"; "co co"; ehe mo; "bi; ko"; ko ma
"ka ku"; "ki; ko"
"ko;ma"; "ki ma"
"ehe;";koko
['hoko foko; moko soko; ', '; ', '; ', '; ', '; ', '; ehe mo; ', '; ko ma\n ', '; ', '\n ', '; ', '\n', ';koko']
['aaa mo, bia', 'ee mo', 'eka koka', 'koni, masa', 'co co', 'bi, ko', 'ka ku', 'ki, ko', 'ko,ma', 'ki ma', 'ehe,']
hoko foko; moko soko; "aaa mo, bia"; "ee mo"; "eka koka"; "koni, masa"; "co co"; ehe mo; "bi, ko"; ko ma
"ka ku"; "ki, ko"
"ko,ma"; "ki ma"
"ehe,";koko
I have a dataset that has a "tags" column in which each row is a list of tags. For example, the first entry looks something like this
df['tags'][0]
result = "[' Leisure Trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']"
I have been able to remove the trailing whitespace from all elements and only the leading whitespace from the first element (so I get something like the below).
['Leisure trip', ' Couple', ' Duplex Double Room', ' Stayed 6 nights']
Does anyone know how to remove the leading whitespace from all but the first element in these lists? They are not of uniform length or anything. Below is the code I have used to get the final result above:
clean_tags_list = []
for item in reviews['Tags']:
    string = item.replace("[", "")
    string2 = string.replace("'", "")
    string3 = string2.replace("]", "")
    string4 = string3.replace(",", "")
    string5 = string4.strip()
    string6 = string5.lstrip()
    #clean_tags_list.append(string4.split(" "))
    clean_tags_list.append(string6.split(" "))
clean_tags_list[0]
['Leisure trip', ' Couple', ' Duplex Double Room', ' Stayed 6 nights']
IIUC, you want to apply strip to the first element and rstrip to the others. First convert your 'string list' to an actual list with ast.literal_eval, then apply strip and rstrip by position:
from ast import literal_eval
df.tags.apply(literal_eval).apply(lambda x: [item.strip() if i == 0 else item.rstrip() for i, item in enumerate(x)])
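The literal_eval step can be tried on a single value without pandas. This is a small sketch using the sample string from the question; the variable names are illustrative only. It shows both readings of the request: strip only the first element fully, or strip every element.

```python
from ast import literal_eval

# One cell of the "tags" column, copied from the question.
raw = "[' Leisure Trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']"

tags = literal_eval(raw)  # safely parse the text into a real list of strings

# Reading 1: strip the first element, right-strip the rest.
first_only = [tag.strip() if i == 0 else tag.rstrip() for i, tag in enumerate(tags)]

# Reading 2: strip leading and trailing whitespace from every element.
all_stripped = [tag.strip() for tag in tags]

print(first_only)
print(all_stripped)
```

If the goal is simply clean tags, the second reading is probably what is wanted.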
If I understand correctly, you can use the code below :
import pandas as pd
df = pd.DataFrame({'tags': [[' Leisure Trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']]})
df['tags'] = df['tags'].apply(lambda x: [x[0].strip()] + [e.rstrip() for e in x[1:]])
>>> print(df)
I was also able to figure it out with the below code. (I know that this isn't very efficient but it worked).
will_clean_tag_list = []
for row in clean_tags_list:
    for col in range(len(row)):
        row[col] = row[col].strip()
    will_clean_tag_list.append(row)
Thank you all for the insight! This has been my first post and I really appreciate the help.
The Question:
Given a list of strings create a function that returns the same list but split along any of the following delimiters ['&', 'OR', 'AND', 'AND/OR', 'IFT'] into a list of lists of strings.
Note the delimiters can be mixed inside a string, there can be many adjacent delimiters, and the list is a column from a dataframe.
EX//
function(["Mary & had a little AND lamb", "Twinkle twinkle ITF little OR star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
My Solution Attempt
Start by replacing any kind of delimiter with a &. I include spaces on either side so that other words like HANDY don't get affected. Next, split each string along the & delimiter, knowing that every other kind of delimiter has been replaced.
def clean_and_split(lolon):
    # Constants
    banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR '}
    # Loop through each list of strings
    for i in range(len(lolon)):
        # Loop through each delimiter and replace it with ' & '
        for word in banned_list:
            lolon[i] = lolon[i].replace(word, ' & ')
        # Split the string along the ' & ' delimiter
        lolon[i] = lolon[i].split('&')
    return lolon
The problem is that side-by-side delimiters often get replaced in a way that leaves an empty string in the middle. Also, certain combinations of delimiters don't get removed. This is because when the replace method reads ' OR OR OR ', it replaces the first ' OR ' (since it matches) but won't replace the second, because once the first match is consumed it only sees 'OR ' without the leading space.
EX//
clean_and_split(["Mario AND Luigi AND & Peach"]) >> [['Mario ', ' Luigi ', ' ', ' Peach']]
clean_and_split(["Mario OR OR OR Luigi", "Testing AND AND PlsWork "])
>> [['Mario ', ' OR ', ' Luigi '], ['Testing', 'AND PlsWork ']]
The workaround is to make banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR ', ' AND ', ' OR ', ' ITF ', ' AND/OR '}, forcing the code to loop through everything twice.
Alternate Solution?
Split the column along a list of delimiters. The problem with this is that back to back delimiters don't get caught
df['Correct_Column'].str.split('(?: AND | IFT | OR | & )')
EX//
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'AND had a little', 'IFT lamb'], ['Twinkle twinkle', '& little', '& star']]
There HAS to be a more elegant way!
This is where a lookahead and lookbehind are useful, as they won't eat up the spaces you use to match correctly:
import re
text = 'Mary & had a little AND OR lamb, white as ITF snow OR'
replaced = re.sub(r'(?<!\S)(?:&|OR|AND/OR|AND|ITF)(?!\S)', '&', text)
parts = [stripped for s in replaced.split('&') if (stripped := s.strip())]
print(parts)
Result:
['Mary', 'had a little', 'lamb, white as', 'snow']
However, note that:
the parts = line may solve most of your problems anyway, using your own method;
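a lookbehind requires a fixed-width pattern in Python's re module (a lookahead does not), so something like (?<=\s|^) won't work; a negative lookbehind such as (?<!\S) also accepts the start of the string; the OR at the end of the text leaves an empty string at the end of the split;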
the lookahead/lookbehind correctly deals with 'AND OR', but still finds an empty string in between, which is removed on the parts = line;
the walrus operator is in the parts = line as a simple way to filter out empty strings; stripped := s.strip() is not truthy if the result is an empty string, so stripped will only show up in the list if it is not an empty string.
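As a possible alternative to replace-then-split, here is a sketch that splits in one pass with re.split, treating a whole run of adjacent delimiters as a single separator. The names DELIMITERS, SPLIT_RE and split_on_delimiters are hypothetical, and the delimiter list includes both the IFT and ITF spellings only because the question uses both.

```python
import re

# Assumed delimiter list; longer alternatives come first so AND/OR is tried before AND.
DELIMITERS = ["AND/OR", "AND", "OR", "IFT", "ITF", "&"]
_alternatives = "|".join(re.escape(d) for d in DELIMITERS)

# One or more delimiters in a row, each standing alone between whitespace
# (or at the start/end of the string), plus any whitespace that follows.
SPLIT_RE = re.compile(rf"(?:(?<!\S)(?:{_alternatives})(?!\S)\s*)+")


def split_on_delimiters(strings):
    """Split each string on runs of delimiters and drop empty pieces."""
    result = []
    for s in strings:
        pieces = (piece.strip() for piece in SPLIT_RE.split(s))
        result.append([piece for piece in pieces if piece])
    return result


print(split_on_delimiters([
    "Mary & AND had a little OR IFT lamb",
    "Twinkle twinkle AND & ITF little OR & star",
]))
```

Since the lookarounds require whitespace or a string boundary on both sides, words such as HANDY or ORDER are left alone. On a dataframe column, the same function could be applied to the column's values, though that part is not shown here.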
I have a particularly long, nasty string that looks something like this:
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
and so on. The key defining feature is that each "nameOfString" is followed by a \n with two spaces after it. The first nameOfString has two spaces in front of it as well.
I'm trying to create a list that would look something like this:
niceList = [nameOfString1, Inc_(stuff), nameOfString2, Inc_(Stuff)] and so on.
I've tried to use newString = nastyString.split() as well as newString = nastyString.replace('\n ', ''), but ultimately, these solutions can't work because each nameOfString has a space after the comma and before the 'I' of Inc. Furthermore, not all the nameOfStrings have an 'Inc,' but most do have some sort of space in their name.
Would really appreciate some guidance or direction on how I could tackle this issue, thanks!
Maybe you can try something like this:
[word for word in nastyString.replace("\n", "").replace(",", "").strip().split(' ') if word !='']
Output:
['nameOfString1', 'Inc_(stuff)', 'nameOfString2', 'Inc_(stuff)']
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
# replace '\n' with ','
nastyString = nastyString.replace('\n', ',')
# split at ',' and `strip()` all extra spaces
niceList = [v.strip() for v in nastyString.split(',') if v.strip()]
output:
niceList
['nameOfString1', 'Inc_(stuff)', 'nameOfString2', 'Inc_(stuff)']
Update: OP shared new input:
That's awesome, never knew about the strip function. However, I actually am trying to including the "Inc" section, so I was hoping for output of: ['nameOfString1, Inc_(stuff)', 'nameOfString2, Inc_(stuff)'] and so on, any advice?
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
niceList = [v.strip() for v in nastyString.split('\n') if v.strip()]
new output:
niceList
['nameOfString1, Inc_(stuff)', 'nameOfString2, Inc_(stuff)']
You can use regular expressions:
import re
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
new_string = [i for i in re.split(r"[\s,]", nastyString) if i]
Output:
['nameOfString1', 'Inc_(stuff)', 'nameOfString2', 'Inc_(stuff)']
If you don't want to replace '\n' explicitly, do this:
import re
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
word = re.findall(r'.', nastyString)  # '.' does not match '\n', so the newlines are dropped
s = ""
for i in word:
    s += i
print(s)
Output: 'nameOfString1, Inc_(stuff) nameOfString2, Inc_(stuff) '
Now you can use split():
print(s.split(','))
Sorry if this post is a bit confusing to read; this is my first post on this site and this is a hard question to ask, but I have tried my best. I have also tried googling and I cannot find anything.
I am trying to make my own command-line-like application in Python, and I would like to know how to split a string on spaces unless a "\" is in front of the space, and to delete the backslash.
This is what i mean.
>>> c = "play song I\ Want\ To\ Break\ Free"
>>> print c.split(" ")
['play', 'song', 'I\\', 'Want\\', 'To\\', 'Break\\', 'Free']
When I split c with a space it keeps the backslash however it removes the space.
This is how I want it to be like:
>>> c = "play song I\ Want\ To\ Break\ Free"
>>> print c.split(" ")
['play', 'song', 'I ', 'Want ', 'To ', 'Break ', 'Free']
If someone can help me that would be great!
Also if it needs Regular expressions could you please explain it more because I have never used them before.
Edit:
Now that this has been solved, I forgot to ask: is there a way to detect whether the backslash itself has been escaped?
It looks like you're writing a commandline parser. If that's the case, may I recommend shlex.split? It properly splits a command string according to shell lexing rules, and handles escapes properly. Example:
>>> import shlex
>>> shlex.split(r'play song I\ Want\ To\ Break\ Free')
['play', 'song', 'I Want To Break Free']
Just split on the space, then replace any string ending with a backslash with one ending in a space instead:
[s[:-1] + ' ' if s.endswith('\\') else s for s in c.split(' ')]
This is a list comprehension; c is split on spaces, and each resulting string is checked for a trailing backslash; if it has one, the last character is removed and a space is added.
One slight disadvantage: if the original string ends with a backslash (no space), that last backslash is also replaced by a space.
Demo:
>>> c = r"play song I\ Want\ To\ Break\ Free"
>>> [s[:-1] + ' ' if s.endswith('\\') else s for s in c.split(' ')]
['play', 'song', 'I ', 'Want ', 'To ', 'Break ', 'Free']
To handle escaped backslashes, you'd count the number of backslashes. An even number means the backslash is escaped:
[s[:-1] + ' ' if s.endswith('\\') and (len(s) - len(s.rstrip('\\'))) % 2 == 1 else s
for s in c.split(' ')]
Demo:
>>> c = r"play song I\ Want\ To\ Break\\ Free"
>>> [s[:-1] + ' ' if s.endswith('\\') and (len(s) - len(s.rstrip('\\'))) % 2 == 1 else s
... for s in c.split(' ')]
['play', 'song', 'I ', 'Want ', 'To ', 'Break\\\\', 'Free']
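If counting trailing backslashes feels fragile, another option is a small character-by-character scanner. The sketch below is only an illustration: split_escaped is a made-up helper, and unlike the list comprehension above it joins an escaped space into the surrounding token (closer to what shlex.split does) instead of producing separate 'I ', 'Want ' pieces.

```python
def split_escaped(command, sep=" ", escape="\\"):
    """Split on sep, except where sep is preceded by an unescaped escape character.

    A doubled escape character (for example two backslashes) yields one literal
    escape character and does not protect the separator that follows it.
    """
    tokens = []
    current = []
    escaped = False  # True when the previous character was an unescaped backslash
    for ch in command:
        if escaped:
            current.append(ch)  # take the character literally, whatever it is
            escaped = False
        elif ch == escape:
            escaped = True  # remember the backslash, but do not emit it
        elif ch == sep:
            tokens.append("".join(current))
            current = []
        else:
            current.append(ch)
    if escaped:
        current.append(escape)  # keep a dangling trailing backslash as-is
    tokens.append("".join(current))
    return tokens


print(split_escaped(r"play song I\ Want\ To\ Break\ Free"))
print(split_escaped(r"play song Break\\ Free"))
```

The escaped flag answers the follow-up question directly: a backslash is itself escaped exactly when it is read while the flag is already set.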
I have the following list:
[('Steve Buscemi', 'Mr. Pink'), ('Chris Penn', 'Nice Guy Eddie'), ...]
I need to convert it to a string in the following format:
"(Steve Buscemi, Mr. Pink), (Chris Penn, Nice Guy Eddie), ..."
I tried doing
str = ', '.join(item for item in items)
but run into the following error:
TypeError: sequence item 0: expected string, tuple found
How would I do the above formatting?
', '.join('(' + ', '.join(i) + ')' for i in L)
Output:
'(Steve Buscemi, Mr. Pink), (Chris Penn, Nice Guy Eddie)'
You're close.
str = '(' + '), ('.join(', '.join(names) for names in items) + ')'
Output:
'(Steve Buscemi, Mr. Pink), (Chris Penn, Nice Guy Eddie)'
Breaking it down: the outer parentheses are added separately, while the inner ones are generated by the first '), ('.join. The list of names inside each pair of parentheses is created with a separate ', '.join.
s = ', '.join( '(%s)'%(', '.join(item)) for item in items )
You can simply use:
print str(items)[1:-1].replace("'", '') #Removes all apostrophes in the string
You want to omit the first and last characters which are the square brackets of your list. As mentioned in many comments, this leaves single quotes around the strings. You can remove them with a replace.
NB: As noted by @ovgolovin, this will remove all apostrophes, even those in the names.
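To make that caveat concrete, here is a short sketch comparing the str(...) slicing trick with the join-based approach on a name that contains an apostrophe. The sample data is invented for the demonstration.

```python
# Invented sample data: the first name contains an apostrophe.
items = [("Peter O'Toole", "T. E. Lawrence"), ("Chris Penn", "Nice Guy Eddie")]

# str() trick: drop the list brackets, then delete every single quote.
# Python shows the first name in double quotes because it contains an apostrophe,
# so the double quotes survive while the apostrophe itself is lost.
sliced = str(items)[1:-1].replace("'", "")

# join-based approach: build the text from the values, never from their repr.
joined = ", ".join("(" + ", ".join(pair) + ")" for pair in items)

print(sliced)  # ("Peter OToole", T. E. Lawrence), (Chris Penn, Nice Guy Eddie)
print(joined)  # (Peter O'Toole, T. E. Lawrence), (Chris Penn, Nice Guy Eddie)
```

The join version never touches the contents of the strings, so it is the safer choice when the names are not under your control.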
you were close...
print ",".join(str(i) for i in items)
or
print str(items)[1:-1]
or
print ",".join(map(str,items))