Remove -#### in zipcodes

How do I remove the +4 from zipcodes, in python?
I've got data like
85001
52804-3233
Winston-Salem
And I want that to become
85001
52804
Winston-Salem

>>> zipcode = '52804-3233'
>>> zipcode[:5]
'52804'
...and of course, when you parse the lines of your original data, you should add some rule to distinguish the zipcodes that need fixing from other strings. I don't know what your data looks like, so I can't help much there (you could check whether a line contains only digits and the '-' symbol, maybe?).
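One way to express such a rule is sketched below. It assumes that a value needing the fix is exactly five digits, a dash, and four digits; the names ZIP_PLUS4 and strip_plus4 are made up for this illustration.

```python
import re

# Assumption: a value to fix is exactly five digits, a dash, four digits.
ZIP_PLUS4 = re.compile(r"\d{5}-\d{4}")

def strip_plus4(value):
    """Return the 5-digit zipcode for ZIP+4 input, anything else unchanged."""
    if ZIP_PLUS4.fullmatch(value):
        return value[:5]
    return value

lines = ["85001", "52804-3233", "Winston-Salem"]
cleaned = [strip_plus4(line) for line in lines]
print(cleaned)  # ['85001', '52804', 'Winston-Salem']
```

Because fullmatch has to cover the whole string, values such as Winston-Salem or a malformed 12345-678 are left alone.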

>>> import re
>>> s = "52804-3233"
>>> # regex to remove a dash and the 4 digits after it, when they follow 5 digits:
>>> re.sub(r'(\d{5})-\d{4}', r'\1', s)
'52804'
The \1 is a so-called backreference; it gets replaced by the first group, which is the 5-digit zipcode in this case.

You could try something like this:
cleaned = []
for value in inputs:
    if value[:5].isnumeric():
        # Keep only the first 5 characters of the string
        value = value[:5]
    cleaned.append(value)
Just take the first 5 characters of anything that has digits in its first 5 positions.

re.sub(r'-\d{4}$', '', zipcode)

This matches every item of the form 00000-0000 that has a space or other word boundary before and after it, and replaces it with the first five digits. The other regexes posted will also match some number formats that you might not want. Note the raw string: in a regular string literal, \b is a backspace character, not a word boundary.
re.sub(r'\b(\d{5})-\d{4}\b', r'\1', zipcode)
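To see what the word boundaries change, here is a small comparison. The two sample strings are hypothetical; the second is a longer digit run that merely contains a ZIP+4-shaped substring.

```python
import re

# Hypothetical inputs: a ZIP+4 inside a sentence, and a longer digit run
# that merely contains a ZIP+4-shaped substring.
samples = ["call 52804-3233 now", "152804-32339"]

loose = [re.sub(r"(\d{5})-\d{4}", r"\1", s) for s in samples]
strict = [re.sub(r"\b(\d{5})-\d{4}\b", r"\1", s) for s in samples]

print(loose)   # ['call 52804 now', '1528049']
print(strict)  # ['call 52804 now', '152804-32339']
```

The pattern without boundaries rewrites the middle of the longer number, while the bounded one leaves it untouched.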

Or without regex:
output = [line[:5] if line[:5].isnumeric() and line[6:].isnumeric() else line for line in text if line]

Related

Want to replace comma with decimal point in text file where after each number there is a comma in python

e.g.
Arun,Mishra,108,23,34,45,56,Mumbai
The output I want is
Arun,Mishra,108.23,34,45,56,Mumbai
I tried text.replace(',','.'), but that replaces all the commas, including the delimiters, with dots.

You can use regex for this kind of task:
import re
old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
new_str = re.sub(r'(\d+)(,)(\d+)', r'\1.\3', old_str, count=1)
# new_str == 'Arun,Mishra,108.23,34,45,56,Mumbai'
The search pattern r'(\d+)(,)(\d+)' finds a comma between two numbers. There are three capture groups, so they can be used in the replacement r'\1.\3' (\1 and \3 are the first and third groups). old_str is the input string, and count=1 tells re.sub to replace only the first occurrence (thus keeping 34, 45).
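Since the question is about a text file, the same substitution can be applied line by line. This is only a sketch: io.StringIO stands in for the opened file, the second row is invented sample data, and fix_line is a hypothetical helper name.

```python
import io
import re

# Assumption: the first comma that sits between two digit runs on each
# line is the misplaced decimal separator.
DECIMAL_COMMA = re.compile(r"(\d+),(\d+)")

def fix_line(line):
    return DECIMAL_COMMA.sub(r"\1.\2", line, count=1)

# io.StringIO stands in for an open text file; the second row is made up.
source = io.StringIO(
    "Arun,Mishra,108,23,34,45,56,Mumbai\n"
    "Asha,Rao,99,5,12,13,14,Pune\n"
)
fixed = [fix_line(line.rstrip("\n")) for line in source]
print(fixed)
```

With a real file you would iterate over the object returned by open() in the same way.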

It may be instructive to show how this can be done without additional module imports.
The idea is to search the string for all/any commas. Once the index of a comma has been identified, examine the characters either side (checking for digits). If such a pattern is observed, modify the string accordingly
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pos = 1
while (pos := s.find(',', pos, len(s)-1)) > 0:
    if s[pos-1].isdigit() and s[pos+1].isdigit():
        s = s[:pos] + '.' + s[pos+1:]
        break
    pos += 1
print(s)
Output:
Arun,Mishra,108.23,34,45,56,Mumbai

Assuming you have a plain CSV file as in your single line example, we can assume there are 8 columns and you want to 'merge' columns 3 and 4 together. You can do this with a regular expression - as shown below.
Here I explicitly match the 8 columns into 8 groups - matching everything that is not a comma as a column value and then write out the 8 columns again with commas separating all except columns 3 and 4 where I put the period/dot you require.
$ echo "Arun,Mishra,108,23,34,45,56,Mumbai" | sed -r "s/([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)/\1,\2,\3.\4,\5,\6,\7,\8/"
Arun,Mishra,108.23,34,45,56,Mumbai
This regex is for your exact data. A generic regex that replaces any comma between two adjacent sets of digits might give false matches on other data, so I think explicitly matching your exact columns is the safest way to do it.
You can take the above regex and use it in your Python code as shown below.
import re
inLine = 'Arun,Mishra,108,23,34,45,56,Mumbai'
outLine = re.sub(r'([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)'
, r'\1,\2,\3.\4,\5,\6,\7,\8', inLine, 0)
print(outLine)
As Tim Biegeleisen pointed out in an original comment, if you have access to the original source data you would be better fixing the formatting there. Of course that is not always possible.

First split the string using s.split(','), then join the 3rd and 4th elements with a '.', and finally join the list back into a string.
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
ls = s.split(',')
ls[2] = '.'.join([ls[2], ls[3]])
ls.pop(3)
s = ','.join(ls)
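The same split-and-join idea can be wrapped in a small function so it can be reused per line. This is a sketch; merge_decimal and its index parameter are hypothetical names, and it assumes the field position of the number is known in advance.

```python
def merge_decimal(line, index=2, sep=","):
    """Join field `index` and the field after it with a decimal point."""
    fields = line.split(sep)
    if len(fields) <= index + 1:
        return line  # not enough fields to merge, leave the line alone
    # Replace the two adjacent fields with one combined field.
    fields[index:index + 2] = [fields[index] + "." + fields[index + 1]]
    return sep.join(fields)

print(merge_decimal("Arun,Mishra,108,23,34,45,56,Mumbai"))
# Arun,Mishra,108.23,34,45,56,Mumbai
```

Lines that are too short to contain both fields are returned unchanged instead of raising an IndexError.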

It changes every comma to a dot if the comma has digits directly before and after it.
txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."
commaindex = txt.find(",")
while commaindex != -1:
    # Only replace a comma that has a digit on both sides.
    if 0 < commaindex < len(txt) - 1 and txt[commaindex-1].isnumeric() and txt[commaindex+1].isnumeric():
        txt = txt[:commaindex] + "." + txt[commaindex+1:]
    commaindex = txt.find(",", commaindex+1)
print(txt)

Regular expression to retrieve string parts within parentheses separated by commas

I have a string from which I want to take the values within the parentheses. Then I want to get the individual values, which are separated by commas.
Example: x(142,1,23ERWA31)
I would like to get:
142
1
23ERWA31
Is it possible to get everything with one regex?
I have found a method to do so, but it is ugly.
This is how I did it in python:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
secondResult = re.search(r"(?<=\()(.*?)(?=\))", firstResult.group(0))
finalResult = [x.strip() for x in secondResult.group(0).split(',')]
for i in finalResult:
    print(i)
142
1
23ERWA31

This works for your example string:
import re
string = "x(142,1,23ERWA31)"
l = re.findall(r'([^(,)]+)(?!.*\()', string)
print(l)
Result: a plain list
['142', '1', '23ERWA31']
The expression matches a sequence of characters that are not an opening parenthesis, a comma, or a closing parenthesis. To prevent the leading x from being picked up, the sequence may not be followed by a ( anywhere further in the string. This also makes it work if your preamble x consists of more than a single character.
Using findall rather than search makes sure all items are found, and as a bonus it returns a plain list of the results.

You can make this a lot simpler. You are running your first Regex but then not taking the result. You want .group(1) (inside the brackets), not .group(0) (the whole match). Once you have that you can just split it on ,:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
for e in firstResult.group(1).split(','):
    print(e)
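If a string can hold more than one parenthesised group, the capture-then-split idea extends naturally. The following is a sketch; paren_values is a hypothetical helper, and it assumes the parentheses are not nested.

```python
import re

def paren_values(text):
    """Return one list of comma-separated values per parenthesised group."""
    return [
        [part.strip() for part in inner.split(",")]
        for inner in re.findall(r"\(([^()]*)\)", text)
    ]

print(paren_values("x(142,1,23ERWA31)"))  # [['142', '1', '23ERWA31']]
print(paren_values("f(a, b) g(c)"))       # [['a', 'b'], ['c']]
```

With a single capture group in the pattern, re.findall returns only the text inside the parentheses, so no second regex is needed.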

A little wonky looking, and it also assumes there is always a grouping of 3 values in the parentheses, but try this regex
\((.*?),(.*?),(.*?)\)
To extract all the group matches to a single object, your code would then look like
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?),(.*?),(.*?)\)", string).groups()
You can then index the firstResult object like a list (it is a tuple)
>>> print(firstResult[2])
23ERWA31

Regular expression help to find space after a long string

My code is as follows:
list = re.findall(r"PROGRAM S\d\d", contents)
If I print the list, I only get the part up to S51, but I want to capture everything.
I want to find everything like "PROGRAM S51_Mix_Station". I know how to match the digits, but I don't know how to match everything up to the next space; usually there is a space after the last character.
Thanks in advance.

You can also use \w+:
import re
s = "PROGRAM S51_Mix_Station"
new_data = re.findall(r'^PROGRAM\s\w+_\w+_\w+', s)
final_data = new_data[0] if new_data else new_data
Output:
'PROGRAM S51_Mix_Station'

Ok, thanks. I found another solution.
lista = re.findall(r"PROGRAM S\d\d\S+", contents)
The \S+ matches the run of non-whitespace characters that follows the digits.

You could use this:
list = re.findall(r"PROGRAM S\d\d[^ ]*", contents)
This matches PROGRAM S followed by two digits, followed by any number of non-space characters. If you want the match to stop at any whitespace character, not only at spaces, then @Wiktor's comment would be better, i.e. use PROGRAM S\d\d\S*.
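The difference between the two character classes shows up as soon as a name is followed by a tab or a newline instead of a space. The sample text below is hypothetical.

```python
import re

# Hypothetical sample text: the first name is followed by a tab, the
# second by a newline, the third by a plain space.
contents = "PROGRAM S51_Mix_Station\tPROGRAM S52_Pump\nPROGRAM S53_Valve end"

not_space = re.findall(r"PROGRAM S\d\d[^ ]*", contents)
not_whitespace = re.findall(r"PROGRAM S\d\d\S*", contents)

print(not_space)
# ['PROGRAM S51_Mix_Station\tPROGRAM', 'PROGRAM S53_Valve']
print(not_whitespace)
# ['PROGRAM S51_Mix_Station', 'PROGRAM S52_Pump', 'PROGRAM S53_Valve']
```

With [^ ]* the first match runs across the tab and swallows the start of the next entry, so the second program is lost; \S* stops at every kind of whitespace.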

Slice substrings from long string to a list in python

In Python I have a long string like this (from which I removed all line breaks)
stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
What I want to do is to search this string for all occurrences of "key:", then extract the "values" following "key:".
One further complication for me is that I don't know how long these values belonging to key are (e.g. key:12/eas9 and key:43/e3). All I do know is that they do have to end with a digit whereas the rest of the string does not contain any digits.
This is why my idea was to slice from the indices of key plus the next say 10 characters (e.g. key:12/eas9g) and then work backward until isdigit() is false.
I tried to split my initial string (that did contain breaks):
stringA_split = re.split("\n", stringA)
for linex in stringA_split:
    index_start = linex.rfind("key:")
    index_end = index_start + 8
    print(linex[index_start:index_end])
    # then work backward
However, inserting line breaks does not help in any way as they are meaningless from a pdf-to-txt conversion.
How would I then solve this (e.g. as a start with getting all indices of '"key:"' and slice this to a list)?

>>> import re
>>> re.findall(r'key:(\d+[^\d]+[\d])', stringA)
['12/eas9', '43/e3']
\d+ # One or more digits.
[^\d]+ # One or more characters that are not digits (equivalent to \D+).
[\d] # The final digit
(\d+[^\d]+[\d]) # The group of the expression above
r'key:(\d+[^\d]+[\d])' # 'key:' followed by the group expression
If you want key: in your result:
>>> re.findall(r'(key:\d+[^\d]+[\d])', stringA)
['key:12/eas9', 'key:43/e3']
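The piece-by-piece breakdown above can also live inside the pattern itself by using re.VERBOSE, which ignores whitespace and allows comments. This is a sketch of an equivalent pattern; KEY_VALUE is just an illustrative name.

```python
import re

KEY_VALUE = re.compile(
    r"""
    key:        # literal marker that precedes every value
    (           # start of the captured value
        \d+     #   one or more leading digits
        \D+     #   one or more non-digits
        \d      #   the single digit that ends the value
    )
    """,
    re.VERBOSE,
)

stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
values = KEY_VALUE.findall(stringA)
print(values)  # ['12/eas9', '43/e3']
```

This relies on the question's statement that a value ends with a digit and that the rest of the string contains no digits.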

I'm not 100% sure I understand your definition of what defines a value, but I think this will get you what you described
import re
stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
for v in stringA.split('key:'):
    ma = re.match(r'(\d+/.*\d+)', v)
    if ma:
        print(ma.group(1))
This returns:
12/eas9
43/e3

You can apply just one RE that gets all the keys into a list of tuples:
import re
p = re.compile(r'key:(\d+)/([^\d]+\d)')
ret = p.findall(stringA)
After the execution, you have:
ret
[('12', 'eas9'), ('43', 'e3')]

edit: a better answer was posted above. I misread the original question when proposing to reverse here, which really wasn't necessary. Good luck!
If you know that the format is always key:, what if you reversed the string and searched for :yek? You'd isolate all the keys and could then reverse them back
import re
# \w is alphanumeric, you may want to add some symbols
rex = re.compile(r"\w*:yek")
word = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
matches = re.findall(rex, word[::-1])
matches = [match[::-1] for match in matches]

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any possible string and should cope with spaces and any other odd things that might appear in the long string (this should be possible with regex, right?). Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same. So is someDomainName! But there's another http://www. in the string.

import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
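Here is how that pattern behaves on a string that also contains a second http://www. link, as mentioned in the update. The long_string value and the example.org address are made up for the illustration.

```python
import re

# Hypothetical long string: it contains a second, unrelated http://www. link.
long_string = (
    "junk http://www.example.org/page more junk "
    "http://www.someDomainName.com/2134 trailing text"
)
results = re.findall(r"\bhttp://www\.someDomainName\.com/\d+\b", long_string)
print(results)  # ['http://www.someDomainName.com/2134']
```

Because the domain name is spelled out literally, the other link is never considered a candidate.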

>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
In the above pattern, there are 5 groups:
Group 0 is the complete string that is matched.
The rest are numbered in the order of the brackets you see (so you are looking for the second one, (\\w*)).
If you want, you can capture only the part of the string you are interested in. Remove the brackets from the parts of the pattern that you don't want and just keep (\\w*):
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In the above example, you won't have groups 2, 3 and 4 as in the previous example, because we have captured only 1 group. And yes, group 0 is always available: it is the complete string that matches.
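If counting brackets gets confusing, named groups are an alternative that avoids relying on group numbers. This is a sketch; the group names domain and number are arbitrary choices, not something from the question.

```python
import re

# (?P<name>...) gives a group a name in addition to its number.
pattern = re.compile(r"http://www\.(?P<domain>\w*)\.com/(?P<number>\d+)")
match = pattern.search("see http://www.someDomainName.com/2134 for details")
if match:
    print(match.group(0))          # whole match
    print(match.group("domain"))   # same as match.group(1)
    print(match.group("number"))   # same as match.group(2)
    print(match.groupdict())
```

Named groups still keep their positional numbers, so match.group(1) and match.group("domain") return the same text.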

Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()

If you are sure that there are no dots in SomeDomainName, you can just find the first occurrence of the string ".com/" and take everything from that index on.
This saves you from using regexes, which are harder to maintain.
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/')+5:])
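One caveat with find: it returns -1 when ".com/" is absent, so the slice would quietly start at index 4 instead of failing. A variant using str.partition makes that case explicit. This is a sketch, and after_dot_com is a hypothetical helper name.

```python
def after_dot_com(text):
    """Return what follows the first '.com/', or None if it is absent."""
    head, sep, tail = text.partition(".com/")
    # sep is the empty string when the separator was not found.
    return tail if sep else None

print(after_dot_com("http://www.aejlidjaelidjl.com/alieilael"))  # alieilael
print(after_dot_com("no link here"))                             # None
```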
