Splitting up input based on separators and storing the values - python

I am new to Python. I was wondering how I could take something like
"James-Dean-Winchester"
or
"James:Dean:Winchester"
or simply
"James Dean Winchester"
and have Python detect which format is in use, split the input accordingly, and store the parts in variables to be modified later. Could I store the separator characters (":", "-", " ") in a list and apply that list to the text I want to split, or is there an easier way of doing it?
Update: I should have added that there will only ever be one type of separator.

you could define a function that performs the split and returns the separator in addition to the separated array:
def multiSepSplit(string,separators=["-",":"," "]):
    return max([(string.split(sep),sep) for sep in separators],key=lambda s:len(s[0]))
multiSepSplit("James-Dean-Winchester")
# (['James', 'Dean', 'Winchester'], '-')
multiSepSplit("James Dean Winchester")
# (['James', 'Dean', 'Winchester'], ' ')
multiSepSplit("James:Dean:Winchester")
# (['James', 'Dean', 'Winchester'], ':')
It works by performing every split in a list comprehension over the separators and taking the one with the largest number of elements in the resulting list.
Each entry in the list is actually a tuple holding the resulting list (s[0]) and the separator that was used (s[1]). If none of the separators occurs in the string, every split has one element and max returns the first entry, i.e. the one for "-".
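Because the update says only one kind of separator ever appears, a simpler variant is to look for the first candidate that occurs in the string. This is only a sketch: the name split_on_first_separator and the fallback to a one-element list with None as the separator are assumptions made for illustration, not part of the answer above.

```python
def split_on_first_separator(text, separators=("-", ":", " ")):
    """Split text on the first candidate separator that occurs in it."""
    for sep in separators:
        if sep in text:
            return text.split(sep), sep
    # No candidate found: return the whole string as a single part.
    return [text], None


parts, sep = split_on_first_separator("James:Dean:Winchester")
first, middle, last = parts  # unpack into variables for later modification
print(first, middle, last, repr(sep))
```

Unpacking into three names only works when the input really has three parts; keeping the list and indexing into it is safer for input of unknown length.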

If you do not know which delimiter is in play for each string, you need to write some logic for this.
One suggestion is to maintain a list of potential delimiters (sorted by preference / popularity) and test whether they occur in your string more than once.
Below is an example.
delimiters = list('-: ')
test_list = ['James-Dean-Winchester', 'April:May:June',
             'John Abraham Smith', 'Joe:Ambiguous-Connor']
def get_delimiter(x, delim):
    for sep in delim:
        if x.count(sep) > 1:
            return sep
    else:
        return None
result = [get_delimiter(i, delimiters) for i in test_list]
# ['-', ':', ' ', None]
You can then pair each entry of test_list with its entry in result via zip, which iterates over both lists in parallel.
You can split a string by a delimiter using, for example, 'mystr1-mystr2-mystr3'.split('-').
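The pairing with zip and the final split might look like the sketch below. The result list is copied from the output shown above, and the name split_values is made up for this example. The guard for None matters: calling split(None) would silently split on whitespace rather than leave the ambiguous string alone.

```python
test_list = ['James-Dean-Winchester', 'April:May:June',
             'John Abraham Smith', 'Joe:Ambiguous-Connor']
result = ['-', ':', ' ', None]  # output of get_delimiter shown above

split_values = {}
for text, sep in zip(test_list, result):
    # Leave the string whole when no unambiguous delimiter was found.
    split_values[text] = text.split(sep) if sep is not None else [text]

print(split_values['April:May:June'])
print(split_values['Joe:Ambiguous-Connor'])
```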

Related

Permutate removal of defined substrings with varying length from strings

I am trying to generate all permutations from a list of strings where certain substrings of characters are removed. I have a list of certain chemical compositions and I want all compositions resulting from that list where one of those elements is removed. A short excerpt of this list looks like this:
AlCrHfMoNbN
AlCrHfMoTaN
AlCrHfMoTiN
AlCrHfMoVN
AlCrHfMoWN
...
What I am trying to get is
AlCrHfMoNbN --> CrHfMoNbN
AlHfMoNbN
AlCrMoNbN
AlCrHfNbN
AlCrHfMoN
AlCrHfMoTaN --> CrHfMoTaN
AlHfMoTaN
AlCrMoTaN
AlCrHfTaN
AlCrHfMoN
for each composition. I just need the right column. As you can see some of the resulting compositions are duplicates and this is intended. The list of elements that need to be removed is
Al, Cr, Hf, Mo, Nb, Ta, Ti, V, W, Zr
As you see some have a length of two characters and some of only one.
There is a question that asks about something very similar, however my problem is more complex:
Getting a list of strings with character removed in permutation
I tried adjusting the code to my needs:
def f(s, c, start):
    i = s.find(c, start)
    return [s] if i < 0 else f(s, c, i+1) + f(s[:i]+s[i+1:], c, i)
s = 'AlCrHfMoNbN'
print(f(s, 'Al', 0))
But this simple approach only leads to ['AlCrHfMoNbN', 'lCrHfMoNbN']. So only one character is removed whereas I need to remove a defined string of characters with a varying length. Also I am limited to a single input object s - instead of hundreds that I need to process - so cycling through by hand is not an option.
To sum up, I need a change in the code that makes it possible to:
input a list of strings separated by either line breaks or whitespace
remove from those strings the substrings defined by a second list (just like above)
write the resulting "reduced" items to one continuous list, preferably as a single column without commas and the like
Since I only have some experience with Python and Bash I strongly prefer a solution with these languages.
IIUC, all you need is str.replace:
input_list = ['AlCrHfMoNbN', 'AlCrHfMoTaN']
removals = ['Al', 'Cr', 'Hf', 'Mo', 'Nb', 'Ta', 'Ti', 'V', 'W', 'Zr']
result = {}
for i in input_list:
    result[i] = [i.replace(r,'') for r in removals if r in i]
Output:
{'AlCrHfMoNbN': ['CrHfMoNbN',
'AlHfMoNbN',
'AlCrMoNbN',
'AlCrHfNbN',
'AlCrHfMoN'],
'AlCrHfMoTaN': ['CrHfMoTaN',
'AlHfMoTaN',
'AlCrMoTaN',
'AlCrHfTaN',
'AlCrHfMoN']}
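To cover the input and output format asked for in the question, the same str.replace idea can read a whitespace-separated block of text and print a single column. This is a sketch under assumptions: raw_text is a made-up stand-in for the real input, and str.replace removes every occurrence of the substring, which is only safe as long as each element symbol appears at most once per composition.

```python
raw_text = """AlCrHfMoNbN
AlCrHfMoTaN AlCrHfMoVN"""
removals = ['Al', 'Cr', 'Hf', 'Mo', 'Nb', 'Ta', 'Ti', 'V', 'W', 'Zr']

reduced = []
for composition in raw_text.split():  # splits on newlines and spaces alike
    for element in removals:
        if element in composition:
            reduced.append(composition.replace(element, ''))

print('\n'.join(reduced))  # one reduced composition per line, no commas
```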
If you have gawk, set FPAT to [A-Z][a-z]* so that each element is treated as a field, and use a simple loop to generate the permutations. Also set OFS to the empty string so there are no spaces in the output records.
$ gawk 'BEGIN{FPAT="[A-Z][a-z]*";OFS=""} {for(i=1;i<NF;++i){p=$i;$i="";print;$i=p}}' file
CrHfMoNbN
AlHfMoNbN
AlCrMoNbN
AlCrHfNbN
AlCrHfMoN
CrHfMoTaN
AlHfMoTaN
AlCrMoTaN
AlCrHfTaN
AlCrHfMoN
CrHfMoTiN
AlHfMoTiN
AlCrMoTiN
AlCrHfTiN
AlCrHfMoN
CrHfMoVN
AlHfMoVN
AlCrMoVN
AlCrHfVN
AlCrHfMoN
CrHfMoWN
AlHfMoWN
AlCrMoWN
AlCrHfWN
AlCrHfMoN
I've also written a portable one with extra spaces and explanatory comments:
awk '{
    # separate last element from others
    sub(/[A-Z][a-z]*$/, " &")
    # from the beginning of line
    # we will match each element and print a line where it is omitted
    for (i=0; match(substr($1,i), /[A-Z][a-z]*/); i+=RLENGTH)
        print substr($1,1,i) substr($1,i+RLENGTH+1) $2
        #     ^ before match ^ after match          ^ last element
}' file
This doesn't use your attempt, but it works when we assume that your elements always begin with an uppercase letter (and consist otherwise only of lowercase letters):
import re
def f(s):
    # split string by elements
    elements = re.findall('[A-Z][^A-Z]*', s)
    # make a list of strings, where the first string has the first element removed, the second string the second, ...
    r = []
    for i in range(len(elements)):
        r.append(''.join(elements[:i]+elements[i+1:]))
    # return this list
    return r
Of course this still only works for one string. So if you have a list of strings l and you want to apply it to every string in it, just use a for loop like this:
# your list of strings
l = ["AlCrHfMoNbN", "AlCrHfMoTaN", "AlCrHfMoTiN", "AlCrHfMoVN", "AlCrHfMoWN"]
# iterate through your input list
for s in l:
    # call above function
    r = f(s)
    # print out the result if you want to
    for i in r:
        print(i)

How to convert the tuple elements to string, without losing the tuple properties?

I created a tuple by reading data from a MySQL table. The elements are of mixed data types, and to be able to apply a few string operations (upper case, removal of special characters, etc.) I need to convert all of those elements to strings.
I tried "".str() and .join() but the resultant is a pure string and I lose information about individual elements.
Something like:
(ABC, XYZ, 234, QWE, 578) <-- mixed datatypes but I can do tuple[0] to just fetch ABC
The cursor returns multiple records. struct_address_str[0] returns the first record (like the example above), and struct_address_str[0][0] returns the first element of that row. After I make the first transformation, struct_address_str[0][0] no longer returns the first element but the first character of the element.
That is, after the transformation, tuple[0][0] gives me A, whereas I want the output to be ABC.
How do I get this working?
Following is the code that I am using:
cursor = conn.cursor();
### Structure Address Data ###
cursor.execute("SELECT id,... FROM ...");
#converted the cursor to list
struct_address = list(cursor.fetchall())
#converted all the list elements to string
struct_address_str = [str(i) for i in struct_address]
#Checking the values
print(struct_address_str[0][1], sep="\n")
print(struct_address_str[0][2], sep="\n")
print(struct_address_str[0], sep="\n")
#converted all the list elements to uppercase
struct_address_upper = [i.upper() for i in struct_address_str]
#removing all the special characters
#cli_add_no_sp_char = [s.translate(str.maketrans('', '', '\'(#),-\".')) for s in cli_address_upper]
struct_add_no_sp_char = [s.translate(str.maketrans('\'(#),-\"./', ' ', '')) for s in struct_address_upper]
What about:
struct_address_str = [[str(i) for i in x] for x in struct_address]
And then again:
struct_address_upper = [[i.upper() for i in x] for x in struct_address_str]
Of course you could combine the two in one line by using "str(i).upper()". I would probably define a function sanitize(i) that performs all the needed operations and then use it in the list comprehension.
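A sanitize function along the lines suggested above might look like this sketch. The sample rows are invented stand-ins for cursor.fetchall(), and the choice to map each special character to a space is an assumption about what the question intends. Note that the first two arguments of str.maketrans must have equal length.

```python
# Characters to blank out; each one is mapped to a single space.
UNWANTED = '\'(#),-"./'
TABLE = str.maketrans(UNWANTED, ' ' * len(UNWANTED))


def sanitize(value):
    """Convert one field to an upper-case string without special characters."""
    return str(value).upper().translate(TABLE)


# Hypothetical rows standing in for cursor.fetchall()
struct_address = [('abc', 'x.y-z', 234), ('qwe', 'rst', 578)]

# Convert field by field so each row keeps its tuple structure.
clean = [tuple(sanitize(field) for field in row) for row in struct_address]
print(clean[0][0])
print(clean[0])
```

Because the conversion is applied to each field rather than to the whole row, clean[0][0] is still the first element of the first record, not its first character.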

How to remove falsy values when splitting a string with a non-whitespace separator

According to the docs:
str.split(sep=None, maxsplit=-1)
If sep is given, consecutive delimiters are not grouped together and
are deemed to delimit empty strings (for example, '1,,2'.split(',')
returns ['1', '', '2']). The sep argument may consist of multiple
characters (for example, '1<>2<>3'.split('<>') returns ['1', '2', '3']).
Splitting an empty string with a specified separator returns
[''].
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
So to use the keyword argument sep=, is the following the pythonic way to remove the falsy items?
[w for w in s.strip().split(' ') if w]
If it's just whitespaces (\s\t\n), str.split() will suffice but let's say we are trying to split on another character/substring, the if-condition in the list comprehension is necessary. Is that right?
If you want to be obtuse, you could use filter(None, x) to remove falsey items:
>>> list(filter(None, '1,2,,3,'.split(',')))
['1', '2', '3']
Probably less Pythonic. It might be clearer to iterate over the items specifically:
for w in '1,2,,3,'.split(','):
    if w:
        …
This makes it clear that you're skipping the empty items and not relying on the fact that str.split sometimes skips empty items.
I'd just as soon use a regex, either to skip consecutive runs of the separator (but watch out for the end):
>>> re.split(r',+', '1,2,,3,')
['1', '2', '3', '']
or to find everything that's not a separator:
>>> re.findall(r'[^,]+', '1,2,,3,')
['1', '2', '3']
If you want to go way back in Python's history, there were two separate functions, split and splitfields. I think the name explains the purpose. The first splits on any whitespace, useful for arbitrary text input, and the second behaves predictably on some delimited input. They were implemented in pure Python before v1.6.
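For comparison, the options discussed above can be put side by side. This is a small sketch; the helper name split_nonempty is made up for the example and is not a standard-library function.

```python
import re


def split_nonempty(text, sep):
    """Split text on sep and drop the empty strings left by repeated separators."""
    return [part for part in text.split(sep) if part]


sample = '1,2,,3,'
print(split_nonempty(sample, ','))            # list comprehension with a filter
print(list(filter(None, sample.split(','))))  # filter(None, ...) drops falsy items
print(re.findall(r'[^,]+', sample))           # runs of non-separator characters
```

All three calls give the same list for this sample. The regex version only handles single-character separators as written, whereas the first two also work with multi-character separators such as '<>'.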
Well, I think you might just need a hand in understanding the documentation. Your example pretty much demonstrates the difference between the two algorithms mentioned in the documentation. Not using the sep keyword argument is more or less like using sep=' ' and then throwing out the empty strings (except that tabs, newlines and other whitespace also count as separators). When you pass sep=' ' and the string has multiple spaces in a row, each pair of adjacent spaces delimits an empty string, so empty strings appear in the result. Representing those gaps as empty strings is good practice in this case, because it keeps the function's return type consistent: it always returns a list of strings.
Below shows how a string consisting only of spaces (three here) is treated differently...
>>> empty = '   '
>>> s = 'this is an irritating string with random spacing .'
>>> empty.split()
[]
>>> empty.split(' ')
['', '', '', '']
For your question, just use split() with no sep argument.
Your string
s = 'this is an irritating string with random spacing .'
contains runs of more than one whitespace character, which is why split(' ') returns empty strings.
You would have to remove the extra whitespace from s to get the desired result.

How does raw_input().strip().split() in Python work in this code?

Hopefully, the community might explain this better to me. Below is the objective, I am trying to make sense of this code given the objective.
Objective: Initialize your list and read in the value of n followed by n lines of commands, where each command will be of the types listed above. Iterate through each command in order and perform the corresponding operation on your list.
Sample input:
12
insert 0 5
insert 1 10
etc.
Sample output:
[5, 10]
etc.
The first line contains an integer, n, denoting the number of commands.
Each of the n subsequent lines contains one of the commands described above.
Code:
n = int(raw_input().strip())
List = []
for number in range(n):
    args = raw_input().strip().split(" ")
    if args[0] == "append":
        List.append(int(args[1]))
    elif args[0] == "insert":
        List.insert(int(args[1]), int(args[2]))
So this is my interpretation of the variable "args": you take the raw input from the user, then remove the white space from the raw input. Once that is removed, the split function puts the string into a list.
If my raw input was "insert 0 5," wouldn't strip() turn it into "insert05" ?
In Python you call the split(delimiter) method on a string to get a list of the pieces between occurrences of the delimiter you specify (by default it splits on runs of whitespace), and the strip() method removes the whitespace at the beginning and end of a string.
So step by step the operations are:
raw_input() #' insert 0 5 '
raw_input().strip() #'insert 0 5'
raw_input().strip().split() #['insert', '0', '5']
You can use split(';'), for example, if you want to split strings delimited by semicolons, such as 'insert;0;5'.
Let's take an example. Say you take the raw input
string=' I am a coder '
Since the input arrives as a string, strip() is applied to it first, i.e. string.strip() makes it
string='I am a coder'
since spaces at the front and end are removed.
Now, split() is used to split the stripped string into a list
i.e.
string=['I', 'am', 'a', 'coder']
Nope, that would be .replace(" ", ""); .strip() just gets rid of white space at the beginning and end of the string.
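The difference between the three calls can be checked directly. A short sketch, written for Python 3 (where raw_input() is called input()); the variable line stands in for what the user would type, and items is a made-up name for the list being built.

```python
line = '  insert 0 5  '  # stands in for the text returned by raw_input()

print(repr(line.strip()))           # only the outer whitespace is removed
print(repr(line.replace(' ', '')))  # this is what removes every space
print(line.strip().split(' '))      # the stripped text cut into words

args = line.strip().split(' ')
items = []
if args[0] == 'insert':
    # args[1] is the index, args[2] the value, both still strings here
    items.insert(int(args[1]), int(args[2]))
print(items)
```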

Find Certain String Indices

I have this string and I need to get a specific number out of it.
E.G. encrypted = "10134585588147, 3847183463814, 18517461398"
How would I pull out only the second integer out of the string?
You are looking for the "split" method. Turn a string into a list by specifying a smaller part of the string on which to split.
>>> encrypted = '10134585588147, 3847183463814, 18517461398'
>>> encrypted_list = encrypted.split(', ')
>>> encrypted_list
['10134585588147', '3847183463814', '18517461398']
>>> encrypted_list[1]
'3847183463814'
>>> encrypted_list[-1]
'18517461398'
Then you can just access the indices as normal. Note that lists can be indexed forwards or backwards. By providing a negative index, we count from the right rather than the left, selecting the last index (without any idea how big the list is). Note this will produce IndexError if the list is empty, though. If you use Jon's method (below), there will always be at least one index in the list unless the string you start with is itself empty.
Edited to add:
What Jon is pointing out in the comment is that if you are not sure the string will be well-formatted (e.g., always separated by exactly one comma followed by exactly one space), you can replace all the commas with spaces (encrypted.replace(',', ' ')) and then call split without arguments, which splits on any run of whitespace characters. As usual, you can chain these together:
encrypted.replace(',', ' ').split()
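Putting the pieces together, pulling out the second number as an integer could be sketched as follows. The names numbers and second, and the choice to fall back to None when fewer than two values are present, are assumptions for this example.

```python
encrypted = "10134585588147, 3847183463814, 18517461398"

# Normalise commas to spaces, split on any whitespace, convert to int.
numbers = [int(part) for part in encrypted.replace(',', ' ').split()]

# Guard against short input instead of risking an IndexError.
second = numbers[1] if len(numbers) > 1 else None
print(second)
```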
