Reinstate lost leading zeroes in Python list - python

I have a list of geographical postcodes that take the format xxxx (a string of numbers).
However, in the process of gathering and treating the data, the leading zero has been lost in cases where the postcode begins with '0'.
I need to reinstate the leading '0' in such cases.
Postcodes either occur singularly as xxxx, or they occur as a range in my list, xxxx-xxxx.
Have:
v = ['821-322', '877', '2004-2218', '2022']
Desired output:
['0821-0322', '0877', '2004-2218', '2022']
  ^    ^       ^
Attempt:
for i in range(len(v)):
    v[i] = re.sub(pattern, '0' + pattern, v)
However, I'm struggling with the regex pattern, and how to simply get the desired result.
There is no requirement to use re.sub(). Any simple solution will do.

You should use f-string formatting instead!
Here is a one-liner to solve your problem:
>>> v = ['821-322', '877', '2004-2218', '2022']
>>> ["-".join([f'{i:0>4}' for i in x.split("-")]) for x in v]
['0821-0322', '0877', '2004-2218', '2022']
A more verbose example is this:
v = ['821-322', '877', '2004-2218', '2022']
newv = []
for number in v:
    num_holder = []
    # Split the numbers on "-", returns a list of one if no split occurs
    for num in number.split("-"):
        # Append the formatted string to num_holder
        num_holder.append(f'{num:0>4}')
    # After each number has been formatted correctly, create a
    # new string which is joined together with a "-" once again and append it to newv
    newv.append("-".join(num_holder))
print(newv)
You can read more about how f-strings work in the Python documentation, which also describes the format specification "mini-language" used by the formatter.
The short version of the explanation is this:
f'{num:0>4}'
f tells the interpreter that a format string follows.
{} inside the string marks a replacement field whose content should be evaluated.
num inside the braces is a reference to a variable.
: tells the formatter that a format specifier follows.
0 is the fill character used to pad the string.
> is the alignment of num within the new string; > means right-aligned.
4 is the minimum number of characters that the resulting string should have. If num is already 4 or more characters long, the formatter leaves it unchanged.
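As a minimal sketch of the format spec described above, the snippet below pads the same value three equivalent ways and then applies the padding to the whole list with re.sub, since the question mentioned it. The names code, via_fstring, via_rjust, via_zfill and padded are made up for illustration.

```python
import re

code = '821'

# Fill character '0', right-aligned, minimum width 4.
via_fstring = f'{code:0>4}'
# The same padding rule spelled out with string methods.
via_rjust = code.rjust(4, '0')
via_zfill = code.zfill(4)

v = ['821-322', '877', '2004-2218', '2022']
# Pad every run of digits; the '-' separator is left untouched.
padded = [re.sub(r'\d+', lambda m: m.group().zfill(4), item) for item in v]

print(via_fstring, via_rjust, via_zfill)
print(padded)
```

All three single-value variants leave a string alone if it is already four or more characters long, which is why '2004-2218' passes through unchanged.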

Related

How can I change how Python sort deals with punctuation?

I'm currently trying to rewrite an R script in Python. I've been tripped up because it looks like R and Python sort some punctuation differently. Specifically '&' and '_'. At some point in my program I sort by an identifier column in a Pandas dataframe.
As an example in Python:
t = ["1&2","1_2"]
sorted(t)
results in
['1&2', '1_2']
Comparatively in R:
t <- c("1&2","1_2")
sort(t)
results in
[1] "1_2" "1&2"
According to various resources (https://www.dconc.gov/home/showpublisheddocument/1481/635548543552170000) Python is doing the correct thing, but unfortunately I need to do the wrong thing here (changing R is not in scope).
Is there a straightforward way to change how Python sorts this? Specifically, I need to be able to do this on pandas DataFrames when sorting by an ID column.
You can skip straight to FINALLY and use the code provided there to sort Python lists of strings the way R would sort them, or read the answer from top to bottom and learn a bit about Python along the way:
As Rawson already mentioned in a comment on your question (with a helpful link), you can define the sort order for any characters you want to take out of the usual ordering:
t = ['1&2', '1_2']
print(sorted(t))
alphabet = {"_":-2, "&":-1}
def sortkey(word):
    return [ alphabet.get(chr, ord(chr)) for chr in word ]
    # which means:
    # return [ alphabet[chr] if chr in alphabet else ord(chr) for chr in word ]
print(sortkey(t[0]), sortkey(t[1]))
print(sorted(t, key=sortkey))
gives:
['1&2', '1_2']
[49, -1, 50] [49, -2, 50]
['1_2', '1&2']
Use negative values for the redefined characters so that ord() can still be used for every character you have not redefined (advantage: this avoids possible problems with Unicode strings).
If you want to redefine many characters and only use printable ones, you can also define your own alphabet string as follows:
# '_' and '&' have swapped places compared with the usual order
alphabet = """0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%_'()*+,-./:;<=>?@[\\]^&`{|}~"""
and then sort by it:
print(sorted(t, key=lambda s: [alphabet.index(c) for c in s]))
If you have a large amount of data to sort, consider turning the alphabet into a dictionary:
dict_alphabet = { alphabet[i]:i for i in range(len(alphabet)) }
print(sorted(t, key=lambda s: [dict_alphabet[c] for c in s ]))
or, best of all, use Python's built-in character translation for strings:
alphabet = """0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%_'()*+,-./:;<=>?@[\\]^&`{|}~"""
table = str.maketrans(alphabet, ''.join(sorted(alphabet)))
print(sorted(t, key=lambda s: s.translate(table)))
By the way: you can get a list of printable Python characters using the string module:
import string
print(string.printable) # includes Form-Feed, Tab, VT, ...
FINALLY
Below is ready-to-use Python code for sorting lists of strings exactly as they would be sorted in R:
Rcode = """\
s <- "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!#$%&()*+,-./:;<=>?@[\\]^_`{|}~"
paste(sort(unlist(strsplit(s, ""))), collapse = "")"""
RsortOrder = "_-,;:!?.()[]{}@*/\\&#%`^+<=>|~$0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"
# ^--- result of running the R-code online (TIO)
# print(''.join(sorted("_-,;:!?.()[]{}@*/\\&#%`^+<=>|~$0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ")))
PythonSort = "!#$%&()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
# ===========================================
alphabet = RsortOrder
table = str.maketrans(alphabet, ''.join(sorted(alphabet)))
print(">>>",sorted(["1&2","1_2"], key=lambda s: s.translate(table)))
printing
>>> ['1_2', '1&2']
Run the R code online using TIO, or generate your own RsortOrder by running the provided R code with your specific locale setting in R, as suggested in the comments to your question by juanpa.arrivillaga.
Alternatively, you can use the Python locale module so that Python sorts with the same locale setting that R uses:
( https://stackoverflow.com/questions/1097908/how-do-i-sort-unicode-strings-alphabetically-in-python )
import locale
# this reads the environment and inits the right locale
locale.setlocale(locale.LC_ALL, "")
# locale.strxfrm(string)
# Transforms a string to one that can be used in locale-aware comparisons.
# For example, strxfrm(s1) < strxfrm(s2) is equivalent to strcoll(s1, s2) < 0.
# This function can be used when the same string is compared repeatedly,
# e.g. when collating a sequence of strings.
print("###",sorted(["1&2","1_2"], key=locale.strxfrm))
prints
### ['1_2', '1&2']
Use a custom key for sorting. Here, we can just swap the & and _. We do the swap with a list comprehension that breaks the string into its individual characters and exchanges the & and _ characters. Then we rebuild the string with ''.join().
t = ["1&2","1_2", "5&3"]
def swap_chars(s):
    return ''.join([c if
                    c not in ['&', '_']
                    else '_' if c == '&'
                    else '&' for c in s])
sorted(t, key = swap_chars)
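The same two-character swap can also be sketched with a translation table, which may be a little easier to read. This is only an illustration: swap_table, r_like_key, ids and sorted_ids are made-up names, and the table covers just & and _, not R's full collation order.

```python
# Map '&' to '_' and '_' to '&'; every other character is left as it is.
swap_table = str.maketrans('&_', '_&')

def r_like_key(s):
    """Sort key that makes '_' compare as '&' and vice versa."""
    return s.translate(swap_table)

ids = ["1&2", "1_2", "5&3"]
sorted_ids = sorted(ids, key=r_like_key)
print(sorted_ids)  # ['1_2', '1&2', '5&3']
```

For a pandas DataFrame, sort_values accepts a key argument that receives the whole column, so something along the lines of key=lambda col: col.str.translate(swap_table) may work; that variant is an assumption and was not run here.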

Create a new variable instance each time I split a string in Python

I have a string in a variable x that includes ">" symbols. I would like to create a new variable each time the string is split at the ">" symbol.
The string I have in the variable x is as such (imported from a simple .txt file):
>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
The expected output is:
print(var_1)
>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
print(var_2)
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
print(var_3)
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
To achieve this I am using a simple for loop
count = 3
for v in range(0, count+1):
    globals()[f"var_{v}"] = x.split('>')
print(var_3)
This way I am successfully getting a new variable for each count (each count is == to the number of ">").
However the output I am currently getting is:
print(var_1)
['', 'AF1785813GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA', 'AF1785815GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG', 'AF1785814GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
print(var_2)
['', 'AF1785813GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA', 'AF1785815GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG', 'AF1785814GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
print(var_3)
['', 'AF1785813GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA', 'AF1785815GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG', 'AF1785814GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
How can I troubleshoot the for loop in order to achieve the expected output?
Try to iterate the split result:
for i, token in enumerate(x.split('>')):
    # do not include empty string
    if token:
        globals()[f"var_{i}"] = token
# then deal with the vars
print(var_1)
print(var_2)
..
I would use re.findall here:
import re
inp = """>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA"""
vars = re.findall(r'>[^>]+', inp)
print(vars)
# ['>AF1785813\nGTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA\n',
# '>AF1785815\nGTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG\n',
# '>AF1785814\nGTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
Note that re.findall returns all matches inside a single neat list, which can then be iterated or accessed later as needed.
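As a hedged sketch of "accessed later as needed", the matches can be stored in a dictionary keyed by the header line instead of in separate variables. The sequences below are shortened, made-up stand-ins for the question's data, and fasta_text and records are illustrative names only.

```python
import re

# Shortened, made-up sequences; the header lines follow the question's format.
fasta_text = """>AF1785813
GTGTGGAGGG
AAAGG
>AF1785815
GTGTGGAGTG
"""

records = {}
for block in re.findall(r'>[^>]+', fasta_text):
    # The first line is the header; everything after it is sequence data.
    header, _, body = block.partition('\n')
    # Drop the leading '>' and remove line breaks inside the sequence.
    records[header[1:]] = ''.join(body.split())

print(records['AF1785815'])
```

A dictionary avoids writing into globals() and still lets each record be looked up by name.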
Use a regular expression to match the > character followed by the rest of its line and everything after it, up until the next > character or the end of the string.
[^\n]*: This matches zero or more characters that are not newline characters.
[^>]*: This matches zero or more characters that are not the > character.
import re
x = ">AF1785813\nGTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA\n>AF1785815\nGTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG\n>AF1785814\nGTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA"
substrings = re.findall(">[^\n]*\n[^>]*", x)
for i, substring in enumerate(substrings, start = 1):
    globals()[f"var_{i}"] = substring
output:
>>> print(var_1)
>>> print(var_2)
>>> print(var_3)
>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA

Slice substrings from long string to a list in python

In Python I have a long string like the following (from which I removed all line breaks):
stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
What I want to do is to search this string for all occurrences of "key:", then extract the "values" following "key:".
One further complication for me is that I don't know how long these values belonging to key are (e.g. key:12/eas9 and key:43/e3). All I do know is that they do have to end with a digit whereas the rest of the string does not contain any digits.
This is why my idea was to slice from the indices of key plus the next say 10 characters (e.g. key:12/eas9g) and then work backward until isdigit() is false.
I tried to split my initial string (that did contain breaks):
stringA_split = re.split("\n", stringA)
for linex in stringA_split:
    index_start = linex.rfind("key:")
    index_end = index_start + 8
    print(linex[index_start:index_end])
    #then work backward
However, inserting line breaks does not help in any way as they are meaningless from a pdf-to-txt conversion.
How would I then solve this (e.g. as a start, by getting all indices of "key:" and slicing from them into a list)?
import re
>>> re.findall(r'key:(\d+[^\d]+[\d])', stringA)
['12/eas9', '43/e3']
\d+ # One or more digits.
[^\d]+ # One or more characters that are not digits (equivalent to \D+).
[\d] # The final digit
(\d+[^\d]+[\d]) # The group of the expression above
r'key:(\d+[^\d]+[\d])' # 'key:' followed by the group expression
If you want key: in your result:
>>> re.findall(r'(key:\d+[^\d]+[\d])', stringA)
['key:12/eas9', 'key:43/e3']
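Since the question also asks about the indices of "key:", here is a minimal sketch that uses re.finditer with the same idea (digits, then non-digits, then one final digit) to collect both the values and the positions where each "key:" starts. The names pattern, values and starts are made up for this example.

```python
import re

stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'

# Same structure as above: digits, non-digits, one closing digit.
pattern = re.compile(r'key:(\d+\D+\d)')

values = []
starts = []
for m in pattern.finditer(stringA):
    values.append(m.group(1))   # the text after 'key:'
    starts.append(m.start())    # index where this 'key:' begins

print(values)  # ['12/eas9', '43/e3']
print(starts)  # [6, 23]
```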
I'm not 100% sure I understand your definition of what defines a value, but I think this will get you what you described
import re
stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
for v in stringA.split('key:'):
    ma = re.match(r'(\d+\/.*\d+)', v)
    if ma:
        print(ma.group(1))
This returns:
12/eas9
43/e3
You can apply a single regular expression that collects all the keys into a list of tuples:
import re
p = re.compile(r'key:(\d+)/([^\d]+\d)')
ret = p.findall(stringA)
After the execution, you have:
ret
[('12', 'eas9'), ('43', 'e3')]
Edit: a better answer was posted above. I misread the original question when I proposed reversing the string, which really wasn't necessary. Good luck!
If you know that the format is always key:, what if you reversed the string and searched for :yek? You would isolate all the keys and could then reverse them back:
import re
# \w is alphanumeric, you may want to add some symbols
rex = re.compile(r"\w*:yek")
word = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
matches = re.findall(rex, word[::-1])
matches = [match[::-1] for match in matches]

Python - Parse strings with variable repeating substring

I am trying to do something which I thought would be simple (and probably is), however I am hitting a wall. I have a string that contains document numbers. In most cases the format is ######-#-### however in some cases, where the single digit should be, there are multiple single digits separated by a comma (i.e. ######-#,#,#-###). The number of single digits separated by a comma is variable. Below is an example:
For the string below:
('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
I need to return:
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
I have only gotten as far as returning the strings that match the ######-#-### pattern:
import re
p = re.compile(r'\d{6}-\d{1}-\d{3}')
m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
print(m)
Thanks in advance for any help!
Matt
Perhaps something like this:
>>> import re
>>> s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> for m in it:
...     a, b, c = m.groups()
...     for x in b.split(','):
...         print(a + x + c)
...
030421-1-001
030421-2-001
030421-1-002
030421-1-002
030421-2-002
030421-3-002
030421-1-003
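Or using a list comprehension (re-create the iterator first, because the loop above has already exhausted it):
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> [a+x+c for a, b, c in (m.groups() for m in it) for x in b.split(',')]
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']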
Use '\d{6}-\d(,\d)*-\d{3}'.
* means "as many as you want (0 included)".
It is applied to the previous element, here '(,\d)'.
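One thing to watch for when trying this pattern with re.findall: if the pattern contains a capturing group, findall returns the group's content rather than the whole match. A minimal sketch, assuming you want the full document numbers, is to make the group non-capturing with (?: ... ). The names s, whole_matches and group_only are made up for illustration.

```python
import re

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

# Non-capturing group: findall returns each whole match.
whole_matches = re.findall(r'\d{6}-\d(?:,\d)*-\d{3}', s)

# Capturing group: findall returns only what the group captured.
group_only = re.findall(r'\d{6}-\d(,\d)*-\d{3}', s)

print(whole_matches)
print(group_only)
```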
I wouldn't use a single regular expression to try and parse this. Since it is essentially a list of strings, you might find it easier to replace the "&" with a comma globally in the string and then use split() to put the elements into a list.
Looping over the list will allow you to write a single function to parse and fix each string; you can then push the result onto a new list and display your string.
string = string.replace('&', ',')
initialList = string.split(',')
newList = []
for item in initialList:
    # myfunction is a placeholder for your own parse-and-fix function
    newItem = myfunction(item)
    newList.append(newItem)
newstring = ','.join(newList)
(\d{6}-)((?:\d,?)+)(-\d{3})
We use three capturing groups. The first and last parts are matched the easy way. The centre part is one or more digits, each optionally followed by a ','. A repeated capturing group keeps only its last repetition, so the inner group is made non-capturing with ?: and the outer group captures the whole run. What we're left with is the following result:
>>> p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
>>> m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
>>> m
[('030421-', '1,2', '-001'), ('030421-', '1', '-002'), ('030421-', '1,2,3', '-002'), ('030421-', '1', '-003')]
You'll have to manually process the 2nd term to split them up and join them, but a list comprehension should be able to do that.
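A minimal sketch of that last step, assuming the three-group pattern above: split the middle group on ',' and glue each digit back between the first and last groups. The names s, p and expanded are illustrative only.

```python
import re

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')

# For every match, emit one document number per digit in the middle group.
expanded = [
    prefix + digit + suffix
    for prefix, middle, suffix in p.findall(s)
    for digit in middle.split(',')
]

print(expanded)
```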

Replacing reoccurring characters in strings in Python 3.1

Is it possible to replace a single character inside a string that occurs many times?
Input:
Sentence=("This is an Example. Thxs code is not what I'm having problems with.") #Example input
                                 ^
Sentence=("This is an Example. This code is not what I'm having problems with.") #Desired output
Replace the 'x' in "Thxs" with an i, without replacing the x in "Example".
You can do it by including some context:
s = s.replace("Thxs", "This")
Alternatively you can keep a list of words that you don't wish to replace:
import re
whitelist = ['example', 'explanation']
def replace_except_whitelist(m):
    s = m.group()
    if s in whitelist:
        return s
    else:
        return s.replace('x', 'i')
s = 'Thxs example'
result = re.sub(r"\w+", replace_except_whitelist, s)
print(result)
Output:
This example
Sure, but you essentially have to build up a new string out of the parts you want:
>>> s = "This is an Example. Thxs code is not what I'm having problems with."
>>> s[22]
'x'
>>> s[:22] + "i" + s[23:]
"This is an Example. This code is not what I'm having problems with."
For information about the notation used here, see a good primer on Python slice notation.
If you know which occurrence of x you want to replace (the first, the second, the third, or the last), you can combine str.find (or str.rfind if you wish to start from the end of the string) with slicing and str.replace. Call str.find on the character as many times as needed to reach a position before the occurrence you want to replace (for the sentence you suggest, just once), then slice the string in two and replace only one occurrence in the second slice.
An example is worth a thousand words, or so they say. In the following, I assume you want to substitute the (n+1)th occurrence of the character.
>>> s = "This is an Example. Thxs code is not what I'm having problems with."
>>> n = 1
>>> pos = 0
>>> for i in range(n):
...     pos = s.find('x', pos) + 1
...
>>> s[:pos] + s[pos:].replace('x', 'i', 1)
"This is an Example. This code is not what I'm having problems with."
Note that you need to add an offset to pos, otherwise you will replace the occurrence of x you have just found.
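The same find-and-slice idea can be wrapped in a small helper. This is only a sketch: replace_nth is a hypothetical name, and it counts occurrences from 1 rather than using the (n+1) convention above.

```python
def replace_nth(text, old, new, n):
    """Replace only the n-th (1-based) occurrence of old in text."""
    pos = -1
    for _ in range(n):
        # Search just past the previous hit so it is not found again.
        pos = text.find(old, pos + 1)
        if pos == -1:
            # Fewer than n occurrences: return the text unchanged.
            return text
    return text[:pos] + new + text[pos + len(old):]

s = "This is an Example. Thxs code is not what I'm having problems with."
fixed = replace_nth(s, 'x', 'i', 2)
print(fixed)
```

Returning the text unchanged when there are fewer than n occurrences is a design choice made for this sketch; raising an error would be just as reasonable.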
