I have written a small Python program inside my Google app. I am using it to extract specific fields from a string like this:
"+CMGL: 14,"REC READ","+918000459019",,"11/11/04,18:27:53+22"
C
"
I am using the split function for this, but it is not splitting the string. Any clues why?
It gives me a result like this: [u'+CMGL: 14,"REC READ","+918000459019",,"11/11/04,18:27:53+22"\n C ']
def prog (self,strgs):
    self.response.out.write(strgs)
    temp1= strgs
    self.response.out.write(temp1)
    message_split=temp1.split('\n')
    #self.response.out.write(message_split)
    temp=message_split
    self.response.out.write(temp)
    message_split_second=strgs.split(',')
    m_list=message_split[1:]
    self.response.out.write(message_split_second)
    collect_strings=''
    for j in m_list:
        collect_strings=collect_strings+j
    message_txt=collect_strings
    message_date=message_split_second[0]
    message_date=message_date.replace('"',"")
    dates=message_date
    message_time=message_split_second[0]
    message_time=message_time.split('/n')
    message_time=message_time[0]
    message_time=message_time.replace('"',"")
    temp=message_time.split('+')
    message_time=temp[0]
    times=message_time
    cell_number=message_split_second[0]
    cell_number=cell_number.replace('"',"")
    cellnum=cell_number
    return message_txt,dates,times,cellnum
The splits in the first part of your function ought to work. Here's an experiment I just did in Python 2.6:
>>> s = '+CMGL: 14,"REC READ","+918000459019",,"11/11/04,18:27:53+22"\n C '
>>> s.split('\n')
['+CMGL: 14,"REC READ","+918000459019",,"11/11/04,18:27:53+22"', ' C ']
>>> s.split(',')
['+CMGL: 14', '"REC READ"', '"+918000459019"', '', '"11/11/04', '18:27:53+22"\n C ']
If your self.response.out.write calls aren't doing the same thing, try reducing the function to the very shortest thing that displays the odd behaviour. And check that you know exactly what's being passed in as the strgs argument.
I can't see much wrong with the rest, except that at one point you try to split on /n when you probably meant to use \n.
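As a rough illustration of how the four values could be pulled out once the splits behave, here is a minimal sketch in Python 3. It assumes the header line always has the five comma-separated fields shown in the question; the function name parse_cmgl is made up for this example, and the csv module is used only so that the comma inside the quoted timestamp is not treated as a separator.

```python
import csv


def parse_cmgl(raw):
    """Split one '+CMGL' record into (message text, date, time, number)."""
    # The first line is the header; everything after it is the message body.
    header, _, body = raw.partition("\n")
    # csv.reader respects the double quotes, so "11/11/04,18:27:53+22"
    # stays together as a single field.
    fields = next(csv.reader([header]))
    date, _, time_with_zone = fields[4].partition(",")
    time = time_with_zone.split("+")[0]  # drop the "+22" suffix
    return body.strip(), date, time, fields[2]


sample = '+CMGL: 14,"REC READ","+918000459019",,"11/11/04,18:27:53+22"\n C '
print(parse_cmgl(sample))
```

Splitting on "\n" first and parsing only the header keeps any commas in the message text itself from disturbing the field positions.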
I have a dataframe with a list of poorly spelled clothing types. I want them all in the same format. For example, I have "trous", "trouse" and "trousers", and I would like to replace the first two with "trousers".
I have tried using string.replace. It finds the first "trous" and changes it to "trousers" as it should, and "trouse" works as well, but when it gets to "trousers" it produces "trousersersers"! I think it is matching every string that contains trous, trouse or trousers and changing all of them.
Is there a way I can limit string.replace to look for exactly "trous"?
Here's what I've tried so far. As you can see, I have a good few changes to make. Most of them work fine, but the likes of trousers and t-shirts, which have several similar changes to be made, are causing the trouble.
newTypes=[]
for string in types:
    underwear = string.replace(('UNDERW'), 'UNDERWEAR').replace('HANKY', 'HANKIES').replace('TIECLI', 'TIECLIPS').replace('FRAGRA', 'FRAGRANCES').replace('ROBE', 'ROBES').replace('CUFFLI', 'CUFFLINKS').replace('WALLET', 'WALLETS').replace('GIFTSE', 'GIFTSETS').replace('SUNGLA', 'SUNGLASSES').replace('SCARVE', 'SCARVES').replace('TROUSE ', 'TROUSERS').replace('SHIRT', 'SHIRTS').replace('CHINO', 'CHINOS').replace('JACKET', 'JACKETS').replace('KNIT', 'KNITWEAR').replace('POLO', 'POLOS').replace('SWEAT', 'SWEATERS').replace('TEES', 'T-SHIRTS').replace('TSHIRT', 'T-SHIRTS').replace('SHORT', 'SHORTS').replace('ZIP', 'ZIP-TOPS').replace('GILET ', 'GILETS').replace('HOODIE', 'HOODIES').replace('HOODZIP', 'HOODIES').replace('JOGGER', 'JOGGERS').replace('JUMP', 'SWEATERS').replace('SWESHI', 'SWEATERS').replace('BLAZE ', 'BLAZERS').replace('BLAZER ', 'BLAZERS').replace('WC', 'WAISTCOATS').replace('TTOP', 'T-SHIRTS').replace('TROUS', 'TROUSERS').replace('COAT', 'COATS').replace('SLIPPE', 'SLIPPERS').replace('TRAINE', 'TRAINERS').replace('DECK', 'SHOES').replace('FLIP', 'SLIDERS').replace('SUIT', 'SUITS').replace('GIFTVO', 'GIFTVOUCHERS')
    newTypes.append(underwear)
types = newTypes
Assuming you're okay with not using string.replace(), you can simply do this:
lst = ["trousers", "trous", "trouse"]
for i in range(len(lst)):
    if "trous" in lst[i]:
        lst[i] = "trousers"
print(lst)
# Prints ['trousers', 'trousers', 'trousers']
This checks if the shortest substring, trous, is part of the string, and if so converts the entire string to trousers.
Use a dict that maps each string to be replaced to its replacement:
d = {
    'trous': 'trousers',
    'trouse': 'trousers',
    # ...
}
newtypes = [d.get(string, string) for string in types]
d.get(string, string) returns string unchanged if it is not a key in d, so only exact matches are replaced.
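If the abbreviations can also appear inside longer strings (for example next to a colour or size), an exact dict lookup on the whole string would miss them. One possible alternative, sketched here under that assumption, is a regular expression with word boundaries, so that TROUS is replaced only when it stands alone as a word and never inside TROUSERS. The CANONICAL mapping and the normalise function are illustrative names, and the mapping lists only a few of the abbreviations from the question.

```python
import re

# Hypothetical mapping: abbreviation -> canonical spelling.
CANONICAL = {
    "TROUS": "TROUSERS",
    "TROUSE": "TROUSERS",
    "TSHIRT": "T-SHIRTS",
    "TEES": "T-SHIRTS",
}

# \b ... \b only matches whole words, so "TROUS" inside "TROUSERS" is ignored.
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CANONICAL) + r")\b"
)


def normalise(text):
    """Replace every whole-word abbreviation in text with its canonical form."""
    return _pattern.sub(lambda match: CANONICAL[match.group(1)], text)


types = ["TROUS", "TROUSE", "TROUSERS", "BLUE TSHIRT"]
print([normalise(item) for item in types])
```

Because each replacement happens in a single pass over the original string, an already-correct value such as TROUSERS is never touched a second time.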
I want to apply a regex function to clean text in a dataframe column.
ie:
import html
import re

re1 = re.compile(r' +')
def fixup(x):
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>','u_n').replace(' #.# ','.').replace(
        ' #-# ','-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))
df['text'] = df['text'].apply(fixup).values.astype(str)
However, when I run this I get a 'MemoryError' (in a Jupyter notebook).
I have 128GB of RAM, and the file used to create the dataframe was 4GB.
I can also see from the profiler that memory use is below 20% when this error is thrown.
The error message gives no more detail than 'MemoryError:' at the line where I apply the fixup function.
Any ideas to help debug?
Break the replace chain into individual replace operations. Not only will that make your code more readable and maintainable, but the intermediate results will also be discarded immediately after use, instead of being kept until all modifications are done:
replacements = [('#39;', "'"), ('amp;', '&'), ('#146;', "'")]  # ... and the remaining pairs
for replacement in replacements:
    x = x.replace(*replacement)
P.S. Shouldn't 'amp;' be '&amp;'?
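To make the loop form concrete, here is a small self-contained sketch of a fixup function written that way. The REPLACEMENTS list contains only a few of the pairs from the question, chosen for illustration, and the sketch is not claimed to cure the MemoryError; it only shows the structure being suggested.

```python
import html
import re

MULTI_SPACE = re.compile(r" +")

# A shortened, illustrative subset of the (old, new) pairs from the question.
REPLACEMENTS = [
    ("#39;", "'"),
    ("amp;", "&"),
    ("nbsp;", " "),
    ("<br />", "\n"),
]


def fixup(text):
    """Apply each replacement in turn, then unescape and collapse spaces."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return MULTI_SPACE.sub(" ", html.unescape(text))


print(repr(fixup("fish #39;n#39; chips<br />Tom amp; Jerry    again")))
```

Testing the function on a handful of rows, for example with df['text'].head().apply(fixup), is a cheap way to check whether one particular input triggers the error.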
I am trying to write some Python code that will replace some unwanted string using RegEx. The code I have written has been taken from another question on this site.
I have a text:
text_1=u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.'
I want to remove all the \u2019m, \u2019s, \u2019ve, etc.
The code that I've written is given below:
rep={"\n":" ","\n\n":" ","\n\n\n":" ","\n\n\n\n":" ",u"\u201c":"", u"\u201d":"", u"\u2019[a-z]":"", u"\u2013":"", u"\u2018":""}
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text_1)
The code works perfectly for:
u"\u201c":"", u"\u201d":"", u"\u2013":"" and u"\u2018":""
However, it does not work for:
u"\u2019[a-z]": re.escape turns the [a-z] into \\[a\\-z\\], which no longer matches.
The output I am looking for is:
text_1=u'I winning, I enjoyed none of it. That why I withdrawing from the market,wrote Arment.'
How do I achieve this?
The information about the newlines completely changes the answer. For this, I think building the expression using a loop is actually less legible than just using better formatting in the pattern itself.
import re

replacements = {'newlines': ' ',
                'deletions': ''}
pattern = re.compile(u'(?P<newlines>\n+)|'
                     u'(?P<deletions>\u201c|\u201d|\u2019[a-z]?|\u2013|\u2018)')
def lookup(match):
    return replacements[match.lastgroup]
text = pattern.sub(lookup, text_1)
The problem here is actually the escaping. This code does what you want more directly:
remove = (u"\u201c", u"\u201d", u"\u2019[a-z]?", u"\u2013", u"\u2018")
pattern = re.compile("|".join(remove))
text = pattern.sub("", text_1)
I've added the ? to the \u2019 match, as I suppose that's what you want as well, given your test string.
For completeness, I think I should also link to the Unidecode package which may actually be more closely what you're trying to achieve by removing these characters.
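For reference, here is a Python 3 sketch that combines the two ideas above (named groups for the newline case, an unescaped pattern for the deletions) and runs against the test string from the question. One detail is an assumption on my part: a pattern of the form \u2019[a-z]? consumes at most one letter after the apostrophe, so \u2019ve would leave a stray "e" behind; the sketch uses [a-z]* to remove the whole suffix. The function name clean is made up for the example.

```python
import re

text_1 = ("I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. "
          "That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.")

# Runs of newlines become one space; quotes, dashes and
# apostrophe suffixes ('m, 's, 've ...) are deleted.
pattern = re.compile(
    "(?P<newlines>\n+)|"
    "(?P<deletions>[\u201c\u201d\u2013\u2018]|\u2019[a-z]*)"
)
replacements = {"newlines": " ", "deletions": ""}


def clean(text):
    """Replace each match with the string registered for its named group."""
    return pattern.sub(lambda match: replacements[match.lastgroup], text)


print(clean(text_1))
# I winning, I enjoyed none of it. That why I withdrawing from the market, wrote Arment.
```

The space before "wrote" survives because only the closing quote itself is deleted; if the exact output in the question is required, that space would have to be added to the pattern as well.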
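The simplest way, if the text contains literal backslash escapes rather than the decoded characters, is this regex:
X = re.compile(r'(\\)(.*?) ')  # a literal backslash and everything up to the next space
text = re.sub(X, ' ', text_1)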
Basically, I print a long message but I want to group all of those words into 5 character long strings.
For example "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner." I want to make that
"iPhon 6isn' tsimp lybig ger-i t'sbe terri never yway. Large r,yet drama tical lythi nner. "
As suggested by @vaultah, this is achieved by splitting the string on whitespace and joining the pieces back together without spaces, then using a for loop to append each slice to a list. A more elegant solution is to use a comprehension.
text = "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner."
joined_text = ''.join(text.split())
chunks_of_five = [joined_text[start:start+5] for start in range(0, len(joined_text), 5)]
' '.join(chunks_of_five)
I'm sure you can use the re module to get back dashes and apostrophes as they're meant to be
Simply do the following.
import re
sentence="iPhone 6 isn't simply bigger - it's better in every way. Larger, yet dramatically thinner."
sentence = re.sub(' ', '', sentence)
count=0
new_sentence=''
for i in sentence:
    if(count%5==0 and count!=0):
        new_sentence=new_sentence+' '
    new_sentence=new_sentence+i
    count=count+1
print new_sentence
Output:
iPhon e6isn 'tsim plybi gger- it'sb etter ineve ryway .Larg er,ye tdram atica llyth inner .
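The same grouping can be wrapped in a small reusable function with the group size as a parameter. This is a Python 3 sketch; the name group_chars and the default size of 5 are choices made for the example.

```python
def group_chars(text, size=5):
    """Remove all whitespace, then re-space the characters in groups of size."""
    compact = "".join(text.split())
    return " ".join(compact[i:i + size] for i in range(0, len(compact), size))


print(group_chars("iPhone 6 isn't simply bigger"))
# iPhon e6isn 'tsim plybi gger
```

The final group is simply shorter when the number of characters is not a multiple of size.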
I'm sure this is simple, but I'm not good with regexes or string manipulation and I want to learn :)
I have a string that I get using snimpy. It looks like this:
ARRIS DOCSIS 3.0 Touchstone WideBand Cable Modem <<HW_REV: 1; VENDOR: Arris Interactive, L.L.C.; BOOTR: 1.2.1.62; SW_REV: 7.3.123; MODEL: CM820A>>
I want to look into that string and use the information in an if statement to print some output. I want to check whether the model is a CM820A and then check the firmware version, SW_REV. If it is not the right version I want to print the version; otherwise I move on to the next string I get from my loop.
host.sysDescr is what returns the above string. As of now I know how to find all the CM820A modems, but things get sloppy when I try to verify the firmware version.
sysdesc = host.sysDescr
if "CM820A" in str(sysdesc):
    if "7.5.125" not in str(sysdesc):
        print("Modem CM820A " + modem + " at version " + version)
        print(" Sysdesc = " + sysdesc)
    if "7.5.125" in sysdesc:
        print ("Modem CM820A " + modem + " up to date")
Right now I can easily see whether a CM820A has the right version, but I can't print only the version of the out-of-date modems. I was only able to print the whole string, which contains a lot of useless information. I just want to print the SW_REV value from that string.
Question
I need help with how to do this. Then I will understand it better and be able to rewrite the whole thing, which I am currently using only to learn Python but want to put to practical use.
All you need is split(). You can split your string on a separator character, for example:
>>> l = s.split(';')
>>> l
['ARRIS DOCSIS 3.0 Touchstone WideBand Cable Modem <<HW_REV: 1', ' VENDOR: Arris Interactive, L.L.C.', ' BOOTR: 1.2.1.62', ' SW_REV: 7.3.123', ' MODEL: CM820A>>']
>>> for i in l:
...     if 'BOOTR' in i:
...         print i.split(':')
...
[' BOOTR', ' 1.2.1.62']
Then you can get the second element easily with indexing!
This answer will simply explain how to retrieve your desired information.
You will need to perform multiple splits on your data.
First, notice that the information in your string is subdivided by semicolons, so:
description_list = sysdesc.split(";")
will create a list of the major sections. Since the sysdesc string has a standard format, you can then access the proper substring by position:
sub_string = description_list[3]
Now split that substring on the colon:
revision_list = sub_string.split(":")
Now just reference:
revision_list[1].strip()
whenever you want to print it. The strip() call removes the leading space left over from the split.
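Putting the splits together, the fields between << and >> can be collected into a dict so that the check does not depend on SW_REV always being the fourth section. This is a minimal Python 3 sketch; parse_sysdescr is a hypothetical helper name, and "7.5.125" is the target version taken from the question.

```python
import re

sysdesc = ("ARRIS DOCSIS 3.0 Touchstone WideBand Cable Modem "
           "<<HW_REV: 1; VENDOR: Arris Interactive, L.L.C.; "
           "BOOTR: 1.2.1.62; SW_REV: 7.3.123; MODEL: CM820A>>")


def parse_sysdescr(text):
    """Return the 'KEY: value' pairs found between << and >> as a dict."""
    match = re.search(r"<<(.*)>>", text)
    if match is None:
        return {}
    fields = {}
    for part in match.group(1).split(";"):
        # partition splits on the first colon only, so values may contain colons.
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields


info = parse_sysdescr(sysdesc)
if info.get("MODEL") == "CM820A" and info.get("SW_REV") != "7.5.125":
    print("Modem CM820A at version " + info["SW_REV"])
```

With the values in a dict, printing only the firmware version of an out-of-date modem is a single lookup of info["SW_REV"].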