How to ignore numbers when doing string.startswith() in python? - python

I have a directory with a large number of files. The file names are similar to the following: the(number)one(number), where (number) can be any number. There are also files with the name: the(number), where (number) can be any number. I was wondering how I can count the number of files with the additional "one(number)" at the end of their file name.
Let's say I have the list of file names, I was thinking of doing
for n in list:
if n.startswith(the(number)one):
add one to a counter
Is there anyway for it to accept any number in the (number) space when doing a startswith?
Example:
the34one5
the37one2
the444one3
the87one8
the34
the32
This should return 4.

Use a regex matching 'one\d+' using the re module.
import re
for n in list:
if re.search(r"one\d+", n):
add one to a counter
If you want to make it very accurate, you can even do:
for n in list:
if re.search(r"^the\d+one\d+$", n):
add one to a counter
Which will even take care of any possible non digit chars between "the" and "one" and won't allow anything else before 'the' and after the last digit'.
You should start learning regexp now:
they let you make some complex text analysis in a blink that would be hard to code manually
they work almost the same from one language to another, making you more flexible
if you encounter some code using them, you will be puzzled if you didn't cause it's not something you can guess
the sooner you know them, the sooner you'll learn when NOT (hint) to use them. Which is eventually as important as knowing them.

The easiest way to do this probably is glob.glob():
number = len(glob.glob("/path/to/files/the*one*"))
Note that * here will match any string, not just numbers.

The same as a one-liner and also answering the question as it should match 'the' as well:
import re
count = len([name for name in list if re.match('the\d+one', name)])

Related

How can I join different segments of a list?

I'm having trouble in a school project because I don't know how to join elements of a list in segments. Here's an example: Let's say I have the following list:
list = ["T","h","i","s","I","s","A","L","i","s","t",]
How could I join this list so that the program outputs the following?:
Output: ["This","Is","A","List"]
Assuming list is your input, and without giving you the answer outright since it's a school project you should do yourself, here are some hints.
You'll want to check if a character is uppercase to know when the start of a word is. With python, you can use isupper() (ex: 'C'.isupper() would return True).
Python strings are iterable.
You can add a character to the end of a string using += (ex: myWord += 'a')
You can add a string to a list using append (ex: myList.append(myWord))
Remember this is a learning experience and there's no real value to being given the answer outright, if that's what you were hoping for. Best of luck and welcome to StackOverflow.
You can use regex for this
import re
list = ["T","h","i","s","I","s","A","L","i","s","t",]
sep=[s for s in re.split("([A-Z][^A-Z]*)", ''.join(list)) if s]
print(sep)

Fastest way to count instances of substrings in string Python3.6

I have been working on a program which requires the counting of sub-strings (up to 4000 sub-strings of 2-6 characters located in a list) inside a main string (~400,000 characters). I understand this is similar to the question asked at Counting substrings in a string, however, this solution does not work for me. Since my sub-strings are DNA sequences, many of my sub-strings are repetitive instances of a single character (e.g. 'AA'); therefore, 'AAA' will be interpreted as a single instance of 'AA' rather than two instances if i split the string by 'AA'. My current solution is using nested loops, but I'm hoping there is a faster way as this code takes 5+ minutes for a single main string. Thanks in advance!
def getKmers(self, kmer):
self.kmer_dict = {}
kmer_tuples = list(product(['A', 'C', 'G', 'T'], repeat = kmer))
kmer_list = []
for x in range(len(kmer_tuples)):
new_kmer = ''
for y in range(kmer):
new_kmer += kmer_tuples[x][y]
kmer_list.append(new_kmer)
for x in range(len(kmer_list)):
self.kmer_dict[kmer_list[x]] = 0
for x in range(len(self.sequence)-kmer):
for substr in kmer_list:
if self.sequence[x:x+kmer] == substr:
self.kmer_dict[substr] += 1
break
return self.kmer_dict
For counting overlapping substrings of DNA, you can use Biopython:
>>> from Bio.Seq import Seq
>>> Seq('AAA').count_overlap('AA')
2
Disclaimer: I wrote this method, see commit 97709cc.
However, if you're looking for really high performance, Python probably isn't the right language choice (although an extension like Cython could help).
Of course Python is fully able to perform these string searches. But instead of re-inventing all the wheels you will need, one screw at a time, you would be better of using a more specialized tool inside Python to deal with your problem - it looks like the BioPython project is the most activelly maintained and complete to deal with this sort of problem.
Short post with an example resembling your problem:
https://dodona.ugent.be/nl/exercises/1377336647/
Link to the BioPython project documentation: https://biopython.org/wiki/Documentation
(if the problem were simply overlapping strings, then the 3rd party "regex" module would be a way to go - https://pypi.org/project/regex/ - as the built-in regex engine in Python's re module can't deal with overlapping sequence either)

python string replacement, all possible combinations #2

I have sentences like the following:
((wouldyou)) give me something ((please))
and a bunch of keywords, stored in arrays / lists:
keywords["wouldyou"] = ["can you", "would you", "please"]
keywords["please"] = ["please", "ASAP"]
I want to replace every occurrence of variables in parentheses with a suitable set of strings stored in an array and get every possible combination back. The amount of variables and keywords is undefined.
James helped me with the following code:
def filler(word, from_char, to_char):
options = [(c,) if c != from_char else (from_char, to_char) for c in word.split(" ")]
return (' '.join(o) for o in product(*options))
list(filler('((?please)) tell me something ((?please))', '((?please))', ''))
It works great but only replaces one specific variable with empty strings. Now I want to go through various variables with different set of keywords. The desired result should look something like this:
can you give me something please
would you give me something please
please give me something please
can you give me something ASAP
would you give me something ASAP
please give me something ASAP
I guess it has something to do with to_ch, but I have no idea how to compare through list items at this place.
The following would work. It uses itertools.product to construct all of the possible pairings (or more) of your keywords.
import re, itertools
text = "((wouldyou)) give me something ((please))"
keywords = {}
keywords["wouldyou"] = ["can you", "would you", "please"]
keywords["please"] = ["please", "ASAP"]
# Get a list of bracketed terms
lsources = re.findall("\(\((.*?)\)\)", text)
# Build a list of the possible substitutions
ldests = []
for source in lsources:
ldests.append(keywords[source])
# Generate the various pairings
for lproduct in itertools.product(*ldests):
output = text
for src, dest in itertools.izip(lsources, lproduct):
# Replace each term (you could optimise this using a single re.sub)
output = output.replace("((%s))" % src, dest)
print output
You could further improve it by avoiding the need to do multiple replace() and assignment calls with one re.sub() call.
This scripts gives the following output:
can you give me something please
can you give me something ASAP
would you give me something please
would you give me something ASAP
please give me something please
please give me something ASAP
It was tested using Python 2.7. You will need to think how to solve it if multiple identical keywords were used. Hopefully you find this useful.
This is a job for Captain Regex!
Partial, pseudo-codey, solution...
One direct, albeit inefficient (like O(n*m) where n is number of words to replace and m is average number of replacements per word), way to do this would be to use the regex functionality in the re module to match the words, then use the re.sub() method to swap them out. Then you could just embed that in nested loops. So (assuming you get your replacements into a dict or something first), it would look something like this:
for key in repldict:
regexpattern = # construct a pattern on the fly for key
for item in repldict[key]:
newstring = re.sub(regexpattern, item)
And so forth. Only, you know, like with correct syntax and stuff. And then just append the newstring to a list, or print it, or whatever.
For creating the regexpatterns on the fly, string concatenation just should do it. Like a regex to match left parens, plus the string to match, plus a regex to match right parens.
If you do it that way, then you can handle the optional features just by looping over a second version of the regex pattern which appends a question mark to the end of the left parens, then does whatever you want to do with that.

in python find index in list if combination of strings exist

I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Lets say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)
//Transform the string into list
for i in l:
if "E" in l:
p = l.index("E")
if isinstance((p+1), int () is True:
if isinstance((p+2), int () is True:
delp = p+3
a = p-3
del l[delp:]
new = "".join(l)
new = new.replace("."," ")
print (new)
get in index where "E" and check if after "E" there are 2 integers.
Then delete everything after the second integer.
However this will not work if there is an "E" anyplace else.
at the moment the result I get is:
this is tEst
because it is finding index for the first "E" on the list and deleting everything after index+3
I guess my question is how do I get the index in the list if a combination of strings exists.
but I can't seem to find how.
thanks for everyone answers.
I was going in other direction but it is also not working.
if someone could see why it would be awesome. It is much better to learn by doing then just coping what others write :)
this is what I came up with:
for i in l:
if i=="E" and isinstance((i+1), int ) is True:
p = l.index(i)
print (p)
anyone can tell me why this isn't working. I get an error.
Thank you so much
Have you ever heard of a Regular Expression?
Check out python's re module. Link to the Docs.
Basically, you can define a "regex" that would match "E and then two integers" and give you the index of it.
After that, I'd just use python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods for str.replace to swap the periods for spaces, and str.title to put them in Title Case
An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match('(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() method of the the re module. Regex patterns tend to use a lot of special characters, so it's common in Python to prepend the regex pattern string with 'r', which tells the Python interpreter not to interpret the special characters as escape characters. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
# The '0' means we want the first match found
index_of_Exx = match_object.end(0)
truncated_filename = filename[:index_of_Exx]
# Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
The reason why this isn't working:
for i in l:
if i=="E" and isinstance((i+1), int ) is True:
p = l.index(i)
print (p)
...is because 'i' contains a character from the string 'l', not an integer. You compare it with 'E' (which works), but then try to add 1 to it, which errors out.

Generate random string from regex character set

I assume there's some beautiful Pythonic way to do this, but I haven't quite figured it out yet. Basically I'm looking to create a testing module and would like a nice simple way for users to define a character set to pull from. I could potentially concatenate a list of the various charsets associated with string, but that strikes me as a very unclean solution. Is there any way to get the charset that the regex represents?
Example:
def foo(regex_set):
re.something(re.compile(regex_set))
foo("[a-z]")
>>> abcdefghijklmnopqrstuvwxyz
The compile is of course optional, but in my mind that's what this function would look like.
Paul McGuire, author of Pyparsing, has written an inverse regex parser, with which you could do this:
import invRegex
print(''.join(invRegex.invert('[a-z]')))
# abcdefghijklmnopqrstuvwxyz
If you do not want to install Pyparsing, there is also a regex inverter that uses only modules from the standard library with which you could write:
import inverse_regex
print(''.join(inverse_regex.ipermute('[a-z]')))
# abcdefghijklmnopqrstuvwxyz
Note: neither module can invert all regex patterns.
And there are differences between the two modules:
import invRegex
import inverse_regex
print(repr(''.join(invRegex.invert('.'))))
print(repr(''.join(inverse_regex.ipermute('.'))))
yields
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~ \t\n\r\x0b\x0c'
Here is another difference, this time pyparsing enumerates a larger set of matches:
x = list(invRegex.invert('[a-z][0-9]?.'))
y = list(inverse_regex.ipermute('[a-z][0-9]?.'))
print(len(x))
# 26884
print(len(y))
# 1100
A regex is not needed here. If you want to have users select a character set, let them just pick characters. As I said in my comment, simply listing all the characters and putting checkboxes by them would be sufficent. If you want something that is more compact, or just looks cooler, you could do something like one of these:
Of course, if you actually use this, what you come up with will undoubtedly look better than these (And they will also actually have all the letters in them, not just "A").
If you need, you could include a button to invert the selection, select all, clear selection, save selection, or anything else you need to do.
if its just simple ranges you could manually parse it
def range_parse(rng):
min,max = rng.split("-")
return "".join(chr(i) for i in range(ord(min),ord(max)+1))
print range_parse("a-z")+range_parse('A-Z')
but its gross ...
Another solution I thought of to simplify the problem:
Stick your own [ and ] on the line as part of the prompt, and disallow those characters in the input. After you scan the input and verify it doesn't contain anything matching [\[\]], you can prepend [ and append ] to the string, and use it like a regex against a string of all the characters needed ("abcdefghijklmnopqrstuvwxyz", fort instance).

Categories

Resources