Algorithm to extract number of varying length from title of file


I have a list of 400,000 file names (column in excel) of the format
xxx.Number.Date.zzz.txt
and I want to extract the Number from the string
Normally I would just set it to take the 5th through 9th character in that string, but the numbers vary in length (2 - 4 digits) and I am not sure how to design an algorithm that can tell how long the number is.
Using python3 if anyone is interested, but really I just need help with the pseudocode
I looked at this previous question, but it did not really answer the question in terms that I can use since it seems like it is using bash functions or I did not understand the explanation:
Extract number of variable length from string

If the format of the file name is always xxx.Number.Date.zzz.txt, and we only care about Number, then you could split the string on "." into a list and take the element at index 1 (the second element). Example:
file = "xxx.4432.Date.zzz.txt"
num = file.split(".")[1]
print(num) # prints 4432
You could write this in a loop to go through your Excel column (check out openpyxl if you haven't yet).
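The loop could look roughly like the sketch below. It assumes the file names have already been read into a plain Python list (the `filenames` list is made-up sample data, and `extract_number` is a hypothetical helper name). It also checks that the second field really is numeric, so malformed names give `None` instead of garbage.

```python
def extract_number(filename):
    """Return the digits between the first two dots, or None if absent."""
    parts = filename.split(".")
    # A well-formed name has at least two fields, and the second is numeric.
    if len(parts) > 1 and parts[1].isdigit():
        return parts[1]
    return None


# Hypothetical sample data standing in for the Excel column.
filenames = ["xxx.12.Date.zzz.txt", "xxx.4432.Date.zzz.txt", "malformed"]
numbers = [extract_number(name) for name in filenames]
print(numbers)
```

Because the split happens on the dots, the length of the number (2 to 4 digits, or anything else) does not matter.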

You can use a regular expression (available in most languages):
.*?\.(\d+)\.
which matches the number between the first two dots:
import re
re.match(r'.*?\.(\d+)\.', 'xxx.12345.Date.zzz.txt').group(1)
#'12345'
An explanation on regex101.
This can also be done in pure Python (easily translatable to other languages):
s = 'xxx.12345.Date.zzz.txt'
out = ''
in_num = False
for c in s:
    if in_num:
        if c == '.':
            break
        out += c
    elif c == '.':
        in_num = True
giving out as: '12345'.
Note that with this second method, we do not verify that the characters between the first two full stops are digits.

Related

string matching with interchangeable characters

I am trying to do a simple string matching between two strings, a small string to a bigger string. The only catch is that I want to equate two characters in the small string to be the same. In particular if there is a character 'I' or a character 'L' in the smaller string, then I want it to be considered interchangeably.
For example let's say my small string is
s = 'AKIIMP'
and then the bigger string is:
b = 'MPKGEXAKILMP'
I want to write a function that will take the two strings and checks if the smaller one is in the big one. In this particular example even though the smaller string s is not a substring in b because there is no exact match, however in my case it should match with it because like I mentioned characters 'I' and 'L' would be used interchangeably and therefore the result should find a match.
Any idea of how I could proceed with this?

s.replace('I', 'L') in b.replace('I', 'L')
will evaluate to True in your example.

You could do it with regular expressions:
import re
s = 'AKIIMP'
b = 'MPKGEXAKILMP'
p = re.sub('[IL]', '[IL]', s)
if re.search(p, b):
    print(f'{s!r} is in {b!r}')
else:
    print('Not found')
This is not as elegant as @Deepstop's answer, but it provides a bit more flexibility in terms of what characters you equate.
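One caveat with building a pattern from the search string is that any regex metacharacters in it (such as `.`) would be interpreted rather than matched literally. A minimal sketch that escapes everything outside the interchangeable group is shown below; `build_pattern` and `contains_equivalent` are hypothetical names, and the `group` parameter is an assumption added for flexibility.

```python
import re


def build_pattern(small, group="IL"):
    """Replace each character from `group` with a character class; escape the rest."""
    char_class = "[" + re.escape(group) + "]"
    return "".join(char_class if ch in group else re.escape(ch) for ch in small)


def contains_equivalent(small, big, group="IL"):
    """True if `small` occurs in `big`, treating characters in `group` as equal."""
    return re.search(build_pattern(small, group), big) is not None


print(contains_equivalent("AKIIMP", "MPKGEXAKILMP"))  # True
print(contains_equivalent("A.IIMP", "MPKGEXAKILMP"))  # False: '.' is literal here
```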

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But, it gives wrong output for first string. Also, when I tried another one (here):
(.)(.*?)*?\1
But, this one again gives wrong output. What's going wrong here?
The python code is a print statement:
print(re.sub(r'(.)(.*?)\1', r'\g<2>', s)) # s is the string

It can be solved without regular expression, like below
>>> s, s1 = 'abbcabb', 'abca'
>>> ''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>> ''.join([i for i in s if s.count(i) == 1])
'c'

re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
    newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
    if newS == s:
        break
    s = newS
print(newS)

Regular expressions don't seem to be the ideal solution here:
they don't handle overlapping matches, so a loop is needed (like in this answer), and each pass builds a new string, so performance suffers
they're overkill here; we just need to count the characters
I like this answer, but using count repeatedly in a list comprehension loops over all elements each time.
It can be solved without regular expression and without O(n**2) complexity, only O(n) using collections.Counter
first count the characters of the string very easily & quickly
then filter the string testing if the count matches using the counter we just created.
like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)

EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like @jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because with the first pattern not being greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will start with the longest candidate for subpattern A and backtrack until it finds one that still allows the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.

The site explains it well, hover and use the explanation section.
(.)(.*?)\1 does not remove or match every double occurrence. It matches one character, followed by anything in the middle, up to the point where that same character is encountered again.
So, for abbcabb, the "sandwiched" portion is bbc, between the two a's.
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
    if i not in result:
        result.append(i)
    else:
        result.remove(i)
print(''.join(result))
Note that this keeps the "last" odd occurrence of each character and not the first.
For the "first" odd occurrence, you should use a counter as suggested in this answer. Just change the condition to check for odd counts. Pseudocode: count[letter] % 2 == 1
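A rough sketch of that counter-based variant follows. It assumes the goal is to keep each character with an odd total count exactly once, at the position where it first appears; `keep_odd` is a hypothetical name, not something from the answers above.

```python
from collections import Counter


def keep_odd(s):
    """Keep the first occurrence of every character whose total count is odd."""
    counts = Counter(s)
    seen = set()
    kept = []
    for ch in s:
        if counts[ch] % 2 == 1 and ch not in seen:
            kept.append(ch)
            seen.add(ch)
    return "".join(kept)


print(keep_odd("abbcabb"))  # c
print(keep_odd("abca"))     # bc
print(keep_odd("abaca"))    # abc (the append/remove loop would give bca)
```

The last call shows where the two approaches differ: the ordering follows first appearances here, and last odd appearances in the loop above.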

How do I find the first of a few characters in a string in python

How do I find the first of a few characters in a string in python? I have used find() and index() but they find only one character. How do I find the first position of a single character out of the few characters I want to be searched for?
So I want to find the position of the first operator (out of the 4 arithmetic operators) in an inputted string; otherwise it should return -1.
Sorry if this is a very stupid question but I have been searching and trying out multiple options over the past few days. I am also a beginner in python.
I tried this but I know it's wrong:
>>> str1 ='12-23+23*12/12'
>>> str1.find('+') or str1.find('-') or str1.find('*') or str1.find('/')
This returns the position of the + operator, the first one listed in the expression, rather than the first operator in the string.
Also, I have tried
for x in str1:
    if (x=='+' or x=='-' or x=='*' or x=='/'):
        print(str1[x])
I know this is wrong.
I am a beginner and I'm trying to learn over a summer course I have taken. So I have not much knowledge on the topic.

str1 = '12-23+23*12/12+65'
str2 = str1  # working copy; each '+' found is removed from it
while True:
    if '+' not in str2:  # taking '+' as an example
        break
    found = str2.find('+')
    print(found)  # index within str2, which shrinks after each removal
    str2 = str2.replace('+', '', 1)
    # str2 = str2[:found] + str2[found+1:]
There is another way to do what I did in the last line of the code. I have added it in the comment above.
str2 = str2[:found] + str2[found+1:]

You can do something like below:
validInput = set(['+','-','*','/'])
checkString = '12-23+23*12*12'
def checkInput(input):
    if input not in validInput:
        raise Exception
    return input
def findSign(sign,string):
    sign = checkInput(sign)
    if sign in string:
        return string.find(sign)
    return -1
print(findSign('/',checkString))
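Both answers above look for one operator at a time. To get the position of whichever operator comes first, which is what the question asks for, one option is to take the smallest non-negative result of `find()` across all four operators; another is a regex character class. The sketch below shows both; the function names are hypothetical.

```python
import re

OPERATORS = "+-*/"


def first_operator_index(expr):
    """Smallest index of any operator in expr, or -1 if there is none."""
    positions = [expr.find(op) for op in OPERATORS]
    found = [p for p in positions if p != -1]
    return min(found) if found else -1


def first_operator_index_re(expr):
    """Same result using a regular expression character class."""
    match = re.search(r"[+\-*/]", expr)
    return match.start() if match else -1


print(first_operator_index("12-23+23*12/12"))     # 2
print(first_operator_index_re("12-23+23*12/12"))  # 2
print(first_operator_index("1234"))               # -1
```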

Most Frequent Character - User Submitted String without Dictionaries or Counters

Currently, I am in the midst of writing a program that calculates all of the non white space characters in a user submitted string and then returns the most frequently used character. I cannot use collections, a counter, or the dictionary. Here is what I want to do:
Split the string so that white space is removed. Then count each character and return a value. I would have something to post here but everything I have attempted thus far has been met with critical failure. The closest I came was this program here:
strin=input('Enter a string: ')
fc=[]
nfc=0
for ch in strin:
    i=0
    j=0
    while i<len(strin):
        if ch.lower()==strin[i].lower():
            j+=1
        i+=1
    if j>nfc and ch!=' ':
        nfc=j
        fc=ch
print('The most frequent character in string is: ', fc )
If you can fix this code or tell me a better way of doing it that meets the required criteria that would be helpful. And, before you say this has been done a hundred times on this forum please note I created an account specifically to ask this question. Yes there are a ton of questions like this but some that are reading from a text file or an existing string within the program. And an overwhelmingly large amount of these contain either a dictionary, counter, or collection which I cannot presently use in this chapter.

Just do it "the old way". Create a list (okay it's a collection, but a very basic one so shouldn't be a problem) of 26 zeroes and increase according to position. Compute max index at the same time.
strin="lazy cat dog whatever"
l=[0]*26
maxindex=-1
maxvalue=0
for c in strin.lower():
    pos = ord(c)-ord('a')
    if 0<=pos<=25:
        l[pos]+=1
        if l[pos]>maxvalue:
            maxindex=pos
            maxvalue = l[pos]
print("max count {} for letter {}".format(maxvalue,chr(maxindex+ord('a'))))
result:
max count 3 for letter a

As an alternative to Jean's solution (this one doesn't use a list, which is what lets his make a single pass over the string), you could just use str.count here, which does pretty much what you're trying to do:
strin = input("Enter a string: ").strip()
maxcount = float('-inf')
maxchar = ''
for char in strin:
    c = strin.count(char) if not char.isspace() else 0
    if c > maxcount:
        maxcount = c
        maxchar = char
print("Char {}, Count {}".format(maxchar, maxcount))
If lists are available, I'd use Jean's solution. He doesn't call an O(N) function N times :-)
P.S.: you could compact this into one line if you use max:
max(((strin.count(i), i) for i in strin if not i.isspace()))
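If the repeated `str.count` calls are a concern, a small tweak is to count each distinct character only once by remembering which ones were already handled. The sketch below keeps them in a plain string, so it still avoids dictionaries and `collections`; `most_frequent` is a hypothetical name, and case is ignored as in the question's code.

```python
def most_frequent(text):
    """Return (char, count) for the most common non-space character."""
    lowered = text.lower()
    seen = ""  # characters already counted; a plain string, no dict needed
    best_char, best_count = "", 0
    for ch in lowered:
        if ch.isspace() or ch in seen:
            continue
        seen += ch
        n = lowered.count(ch)
        if n > best_count:  # strict '>' means the earliest character wins ties
            best_char, best_count = ch, n
    return best_char, best_count


print(most_frequent("lazy cat dog whatever"))  # ('a', 3)
```

This is still not a single pass, but `count` now runs once per distinct character rather than once per character.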

To keep track of several counts for different characters, you have to use a collection (even if it is a global namespace implemented as a dictionary in Python).
To print the most frequent non-space character while supporting arbitrary Unicode strings:
import sys
text = input("Enter a string (case is ignored)").casefold() # default caseless matching
# count non-space character frequencies
counter = [0] * (sys.maxunicode + 1)
for nonspace in map(ord, ''.join(text.split())):
    counter[nonspace] += 1
# find the most common character
print(chr(max(range(len(counter)), key=counter.__getitem__)))
A similar list in Cython was the fastest way to find frequency of each character.

Python Regular Expressions Findall

To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because there will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!

Use this:
O,2,((?:[0-9],?){0,}),X
See it in action: http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print(matches)
Outputs:
['9,6,7,11,8']
Explained:
By wrapping the entire value capture between 2, and ,X in (), you end up capturing that as well. I then used the (?: ) to ignore the inner captured set.
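Since the question also mentions picking the shortest match out of many, here is a hedged sketch of that last step. The `data` string is made-up sample input with two matching lines, and "shortest" is assumed to mean the fewest comma-separated numbers.

```python
import re

# Hypothetical sample data: two lines match the O,2,...,X shape.
data = """O,2,9,6,7,11,8,X
O,2,5,3,X
O,4,6,9,3,1,7,5,O"""

pattern = re.compile(r"O,2,((?:[0-9],?)*),X")
matches = pattern.findall(data)

# Fewest moves wins; default=None covers the case of no matches at all.
shortest = min(matches, key=lambda m: len(m.split(",")), default=None)
print(matches)   # ['9,6,7,11,8', '5,3']
print(shortest)  # 5,3
```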

You don't have to use regex:
split the string on commas into a list
check that item 0 == 'O' and item 1 == '2'
check that the last item == 'X'
check that each item in items[2:-1] is a number (str.isdigit)
That's all.
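The steps above can be sketched as follows. `numbers_after_prefix` and its parameters are hypothetical names, and the function is meant to be applied to one line at a time.

```python
def numbers_after_prefix(line, first="O", second="2", last="X"):
    """Return the numbers between the prefix and the final item, or None."""
    items = line.strip().split(",")
    if len(items) < 3:
        return None
    if items[0] != first or items[1] != second or items[-1] != last:
        return None
    middle = items[2:-1]
    # Every item between the prefix and the last item must be numeric.
    if all(item.isdigit() for item in middle):
        return middle
    return None


print(numbers_after_prefix("O,2,9,6,7,11,8,X"))     # ['9', '6', '7', '11', '8']
print(numbers_after_prefix("O,4,1,8,6,7,9,5,3,X"))  # None
```

Returning a list rather than a joined string makes it easy to compare lengths with `len()` when looking for the shortest sequence.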
