Append sections of string to list in Python - python

I have a particularly long, nasty string that looks something like this:
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
and so on. The key defining feature is that each "nameOfString" is followed by a \n with two spaces after it. The first nameOfString has two spaces in front of it as well.
I'm trying to create a list that would look something like this:
niceList = [nameOfString1, Inc_(stuff), nameOfString2, Inc_(Stuff)] and so on.
I've tried to use newString = nastyString.split() as well as newString = nastyString.replace('\n ', ''), but ultimately, these solutions can't work because each nameOfString has a space after the comma and before the 'I' of Inc. Furthermore, not all the nameOfStrings have an 'Inc,' but most do have some sort of space in their name.
Would really appreciate some guidance or direction on how I could tackle this issue, thanks!

Maybe you can try something like this:
[word for word in nastyString.replace("\n", "").replace(",", "").strip().split(' ') if word !='']
Output:
['nameOfString1', 'Inc_(stuff)', 'nameOfString2', 'Inc_(stuff)']

nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
# replace '\n' with ','
nastyString = nastyString.replace('\n', ',')
# split at ',' and `strip()` all extra spaces
niceList = [v.strip() for v in nastyString.split(',') if v.strip()]
output:
niceList
['nameOfString1', 'Inc_(stuff)', 'nameOfString2', 'Inc_(stuff)']
Update: OP shared new input:
That's awesome, never knew about the strip function. However, I am actually trying to include the "Inc" section, so I was hoping for output of: ['nameOfString1, Inc_(stuff)', 'nameOfString2, Inc_(stuff)'] and so on. Any advice?
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
niceList = [v.strip() for v in nastyString.split('\n') if v.strip()]
new output:
niceList
['nameOfString1, Inc_(stuff)', 'nameOfString2, Inc_(stuff)']
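As a variation on the answer above, str.splitlines() can stand in for split('\n'). This is only a minimal sketch, assuming each entry sits on its own line as in the question; the helper name split_entries is made up for illustration.

```python
def split_entries(text):
    """Return one stripped entry per non-blank line of text."""
    # splitlines() breaks the text at line endings such as '\n';
    # strip() then removes the padding around each entry, and
    # the condition skips lines that contain only whitespace.
    return [line.strip() for line in text.splitlines() if line.strip()]


nastyString = '  nameOfString1, Inc_(stuff)\n  nameOfString2, Inc_(stuff)\n  '
niceList = split_entries(nastyString)
print(niceList)
```

Because the comma is never used as a separator here, names with or without an "Inc" part are kept whole.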

You can use regular expressions:
import re
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
new_string = [i for i in re.split(r"[\n\s,]", nastyString) if i]
Output:
['nameOfString1', 'Inc_(stuff)', 'nameOfString2', 'Inc_(stuff)']

If you don't want to replace '\n', you can do this instead:
import re
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
# '.' matches every character except a newline, so the '\n' characters are dropped
word = re.findall(r'.', nastyString)
s = ""
for i in word:
    s += i
print(s)
Output: ' nameOfString1, Inc_(stuff) nameOfString2, Inc_(stuff) '
Now you can use split():
print(s.split(','))

Related

Each row in DataFrame column is a list. How to remove leading whitespace from second to end entries

I have a dataset that has a "tags" column in which each row is a list of tags. For example, the first entry looks something like this
df['tags'][0]
result = "[' Leisure Trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']"
I have been able to remove the trailing whitespace from all elements and only the leading whitespace from the first element (so I get something like the below).
['Leisure trip', ' Couple', ' Duplex Double Room', ' Stayed 6 nights']
Does anyone know how to remove the leading whitespace from all but the first element in these lists? They are not of uniform length or anything. Below is the code I have used to get the final result above:
clean_tags_list = []
for item in reviews['Tags']:
    string = item.replace("[", "")
    string2 = string.replace("'", "")
    string3 = string2.replace("]", "")
    string4 = string3.replace(",", "")
    string5 = string4.strip()
    string6 = string5.lstrip()
    #clean_tags_list.append(string4.split(" "))
    clean_tags_list.append(string6.split(" "))
clean_tags_list[0]
['Leisure trip', ' Couple', ' Duplex Double Room', ' Stayed 6 nights']
IIUC you want to apply strip for the first element and right strip for the other ones. Then, first convert your 'string list' to an actual list with ast.literal_eval and apply strip and rstrip:
from ast import literal_eval
df.tags.agg(literal_eval).apply(lambda x: [item.strip() if x.index(item) == 0 else item.rstrip() for item in x])
If I understand correctly, you can use the code below :
import pandas as pd
df = pd.DataFrame({'tags': [[' Leisure Trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']]})
df['tags'] = df['tags'].apply(lambda x: [x[0].strip()] + [e.rstrip() for e in x[1:]])
>>> print(df)
I was also able to figure it out with the below code. (I know that this isn't very efficient but it worked).
will_clean_tag_list = []
for row in clean_tags_list:
    for col in range(len(row)):
        row[col] = row[col].strip()
    will_clean_tag_list.append(row)
Thank you all for the insight! This has been my first post and I really appreciate the help.
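For reference, the strip-everything approach can also be sketched without the chain of replace() calls. This assumes each row is the string form of a Python list, as shown in the question; the helper name clean_tags is hypothetical.

```python
from ast import literal_eval


def clean_tags(raw):
    """Parse a stringified list of tags and strip whitespace from each tag."""
    # literal_eval safely turns "[' a ', ' b ']" into a real Python list.
    tags = literal_eval(raw)
    # strip() removes both leading and trailing whitespace from every tag.
    return [tag.strip() for tag in tags]


raw = "[' Leisure Trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']"
print(clean_tags(raw))
```

With pandas, a function like this could presumably be applied to the whole column with df['tags'].apply(clean_tags).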

Remove white space after detokenizing a string with apostrophe

I want to remove the white space in words like can't or won't either through regex or when detokenizing
from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()
detok = MosesDetokenizer()
pattern= "[^\w ]+ "
text= "i can ' t use this cause they won ' t fit"
string= re.sub(pattern, '', text)
tk = tok.tokenize(string)
output= detok.detokenize(tk, return_str = True)
print(output)
"i can 't use this cause they won' t fit"
Any ideas on how I can remove the white space after 'can' and 'won' so that I get can't and won't? When I use output = (' '.join(tk)).strip() to detokenize, I get double white space, one before and one after the apostrophe. Example: i can ' t use this cause they won ' t fit
I think you can simply do something like:
output = "i can 't use this cause they won' t fit"
output = output.replace(" '", "'").replace("' ", "'")
print(output)
"i can't use this cause they won't fit"
@BenT I can't say about the regex, but you can apply the following operation to your output:
output = "i can 't use this cause they won' t fit"
output = "'".join(output.split(" '"))
output = "'".join(output.split("' "))
print(output)
"i can't use this cause they won't fit"
One line solution is also there:
output = output.replace("' ", "'").replace(" '", "'")
print(output)
"i can't use this cause they won't fit"

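Since the question also asks about a regex, here is a minimal sketch of that route. It assumes every apostrophe in the text belongs inside a word; apostrophes used as quotation marks would be glued to their neighbours too.

```python
import re

text = "i can ' t use this cause they won ' t fit"
# Remove any whitespace on either side of an apostrophe.
fixed = re.sub(r"\s*'\s*", "'", text)
print(fixed)
```

The same pattern also handles the one-sided cases such as "can 't" and "won' t".
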
How to split list elements to a line separated by space

I have a list in python as :
values = ['Maths\n', 'English\n', 'Hindi\n', 'Science\n', 'Physical_Edu\n', 'Accounts\n', '\n']
print("".join(values))
I want the output to be:
Subjects: Maths English Hindi Science Physical_Edu Accounts
I am new to Python. I used the join() method but was unable to get the expected output.
You could map the str.strip function to every element in the list and join them afterwards.
values = ['Maths\n', 'English\n', 'Hindi\n', 'Science\n', 'Physical_Edu\n', 'Accounts\n', '\n']
print("Subjects:", " ".join(map(str.strip, values)))
Using a regular expression approach:
import re
lst = ['Maths\n', 'English\n', 'Hindi\n', 'Science\n', 'Physical_Edu\n', 'Accounts\n', '\n']
rx = re.compile(r'.*')
print("Subjects: {}".format(" ".join(match.group(0) for item in lst for match in [rx.match(item)])))
# Subjects: Maths English Hindi Science Physical_Edu Accounts
But it is better to use strip() (or even better, rstrip()), as the other answers do:
string = "Subjects: {}".format(" ".join(map(str.rstrip, lst)))
print(string)
strip() each element of the list and then join() them with a space in between.
a = ['Maths\n', 'English\n', 'Hindi\n', 'Science\n', 'Physical_Edu\n', 'Accounts\n', '\n']
print("Subjects: " +" ".join(map(lambda x:x.strip(), a)))
Output:
Subjects: Maths English Hindi Science Physical_Edu Accounts
As pointed out by @miindlek, you can also achieve the same thing by using map(str.strip, a) in place of map(lambda x: x.strip(), a).
What you can do is use this example to strip the newlines and join them using:
joined_string = " ".join(stripped_array)
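One detail the answers above leave open: the final '\n' element strips to an empty string, so a plain join leaves a trailing space. A minimal sketch that skips empty elements, using the list from the question (the names subjects and line are made up here):

```python
values = ['Maths\n', 'English\n', 'Hindi\n', 'Science\n', 'Physical_Edu\n', 'Accounts\n', '\n']

# Strip each element and keep only the ones that are not empty afterwards.
subjects = [v.strip() for v in values if v.strip()]
line = "Subjects: " + " ".join(subjects)
print(line)
```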

Tuple conversion to a string

I have the following list:
[('Steve Buscemi', 'Mr. Pink'), ('Chris Penn', 'Nice Guy Eddie'), ...]
I need to convert it to a string in the following format:
"(Steve Buscemi, Mr. Pink), (Chris Penn, Nice Guy Eddie), ..."
I tried doing
str = ', '.join(item for item in items)
but run into the following error:
TypeError: sequence item 0: expected string, tuple found
How would I do the above formatting?
', '.join('(' + ', '.join(i) + ')' for i in items)
Output:
'(Steve Buscemi, Mr. Pink), (Chris Penn, Nice Guy Eddie)'
You're close.
str = '(' + '), ('.join(', '.join(names) for names in items) + ')'
Output:
'(Steve Buscemi, Mr. Pink), (Chris Penn, Nice Guy Eddie)'
Breaking it down: The outer parentheses are added separately, while the inner ones are generated by the first '), ('.join. The names inside each pair of parentheses are joined with a separate ', '.join.
s = ', '.join( '(%s)'%(', '.join(item)) for item in items )
You can simply use:
print(str(items)[1:-1].replace("'", ''))  # removes all apostrophes in the string
You want to omit the first and last characters, which are the square brackets of your list. As mentioned in many comments, this leaves single quotes around the strings. You can remove them with a replace.
NB: As noted by @ovgolovin, this will remove all apostrophes, even those in the names.
You were close...
print(",".join(str(i) for i in items))
or
print(str(items)[1:-1])
or
print(",".join(map(str, items)))
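To illustrate why joining the tuple elements directly is safer than editing the output of str(), here is a small sketch. The helper name format_pairs and the second tuple are hypothetical, chosen only to show a name that contains an apostrophe.

```python
def format_pairs(pairs):
    """Render each tuple as '(a, b)' and join the results with ', '."""
    # The inner join builds the text inside the parentheses;
    # the outer join separates the parenthesised groups.
    return ', '.join('({})'.format(', '.join(pair)) for pair in pairs)


# The second entry is made-up sample data with an apostrophe in the name.
items = [('Steve Buscemi', 'Mr. Pink'), ("Example O'Name", 'Some Role')]
result = format_pairs(items)
print(result)
```

Because no quotes are ever added, nothing has to be stripped afterwards and apostrophes inside names survive.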

Remove extra spaces in middle of string split join Python

I have the following string which forces my Python script to quit:
"625 625 QUAIL DR UNIT B"
I need to delete the extra spaces in the middle of the string so I am trying to use the following split join script:
import arcgisscripting
import logging
logger = logging.getLogger()
gp = arcgisscripting.create(9.3)
gp.OverWriteOutput = True
gp.Workspace = "C:\ZP4"
fcs = gp.ListWorkspaces("*","Folder")
for fc in fcs:
    print fc
    rows = gp.UpdateCursor(fc + "//Parcels.shp")
    row = rows.Next()
    while row:
        Name = row.GetValue('SIT_FULL_S').join(s.split())
        print Name
        row.SetValue('SIT_FULL_S', Name)
        rows.updateRow(row)
        row = rows.Next()
    del row
    del rows
Your source code and your error do not match: the error states that you didn't define the variable SIT_FULL_S.
I am guessing that what you want is:
Name = ' '.join(row.GetValue('SIT_FULL_S').split())
Use the re module...
>>> import re
>>> str = 'A    B   C'
>>> re.sub(r'\s+', ' ', str)
'A B C'
I believe you should use a regular expression to match every place where there are two or more spaces and replace each occurrence with a single space.
This can be done with a short piece of code:
re.sub(r'\s{2,}', ' ', your_string)
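To compare the split()/join() idiom with the regex approach, here is a small sketch in plain Python, without the ArcGIS cursor. The spacing in the sample address is assumed, since the exact number of extra spaces in the original value is not known.

```python
import re

# Sample value; the amount of extra whitespace is an assumption.
address = "625   625  QUAIL DR   UNIT B"

# split() with no arguments splits on any run of whitespace and drops
# leading and trailing whitespace, so join() rebuilds a clean string.
joined = " ".join(address.split())

# The regex version collapses each whitespace run to one space;
# strip() is needed separately to trim the ends.
subbed = re.sub(r"\s+", " ", address).strip()

print(joined)
print(subbed)
```

Both give the same result here; the split()/join() form needs no import and trims the ends for free.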
It's a bit unclear, but I think what you need is:
" ".join(row.GetValue('SIT_FULL_S').split())
