Function for list elements concatenation? - python

I want to create a function which concatenates all the strings within a list and returns the resulting string. I tried something like this
def join_strings(x):
for i in x:
word = x[x.index(i)] + x[x.index(i) + 1]
return word
#set any list with strings and name it n.
print join_strings(n)
but it doesn't work and I can't figure out why. Any solution to the problem or fix of my thought? I thank you in advance!

For real work, use ''.join(x).
The problem with your code is that you overwrite word on every iteration instead of keeping the previously joined strings.
Try:
def join_strings(x):
    word = ''
    for i in x:
        word += i
    return word
This is an example of a general pattern: using an accumulator, something that holds information and is updated across loop iterations or recursive calls. This method will work almost as is (except for the word = '' part) for joining lists and tuples and more, or for summing anything. In fact, it is close to being a reimplementation of the built-in sum function. A closer one would be:
def sum(iterable, s=0):
    acc = s
    for t in iterable:
        acc += t  # add the current element, not the start value
    return acc
Of course, for strings you can achieve the same effect using ''.join(x), and in general (numbers, lists, etc.) you can use the sum function. An even more general case would be to replace += with a general operation:
from operator import add
def reduce(iterable, s=0, op=add):
    acc = s
    for t in iterable:
        acc = op(acc, t)  # combine the running result with the current element
    return acc
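As a rough sketch of how these pieces relate: on a list of strings, the accumulator loop, ''.join and the standard library's functools.reduce should all give the same result. The names join_with_loop and words are made up for this illustration.

```python
from functools import reduce
from operator import add

def join_with_loop(strings):
    """Concatenate strings with an explicit accumulator."""
    result = ''
    for s in strings:
        result += s
    return result

words = ['ab', 'cd', 'ef']          # hypothetical sample data
looped = join_with_loop(words)      # accumulator pattern
joined = ''.join(words)             # idiomatic way for strings
reduced = reduce(add, words, '')    # general fold with an explicit start value
print(looped, joined, reduced)
```

Note that the built-in sum refuses to add strings and points you to ''.join instead, so for strings the choice is really between join and an explicit loop or reduce.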

Related

Program to create a string from two given strings by concatenating the characters that are not contained by both strings

Write a Python program to create a string from two given strings by concatenating the characters that are not contained by both strings. The characters from the 1st string should appear before the characters from the 2nd string. Return the resulting string.
Sample input: ‘0abcxyz’, ‘abcxyz1’
Expected Output: ‘01’
I have already got the results but would like to learn if there is a better way to achieve the same results.
var14_1, var14_2 = '0abcxyz', 'abcxyz1'
def concat(var14_1,var14_2):
    res = []
    [res.append(s) for s in var14_1 if s not in var14_2]
    [res.append(s) for s in var14_2 if s not in var14_1]
    print(''.join(res))
concat(var14_1,var14_2)
The above code is returning the results as 01 which is as expected. However I would like to know if there is any other way to arrive at this solution without having to use "for loop" twice. Your feedback will immensely help in improving my python skills. Thanks in advance!
It would be nicer not to use list comprehensions just to call res.append() over and over; let them build the lists directly:
var14_1, var14_2 = '0abcxyz', 'abcxyz1'
r1 = [s for s in var14_1 if s not in var14_2]
r2 = [s for s in var14_2 if s not in var14_1]
res = r1 + r2
print(''.join(res))
To use one for loop you could convert strings to sets and get common chars
common = set('0abcxyz') & set('abcxyz1')
and then you can use one for with concatenated strings var14_1 + var14_2
common = set('0abcxyz') & set('abcxyz1')
res = [s for s in var14_1 + var14_2 if s not in common]
print(''.join(res))
Try this.
@furas pointed out you don't need list() while using set, so updated for that.
var14_1, var14_2 = '0abcxyz', 'abcxyz1'
def concat(first, second):
    return ''.join(set(first).symmetric_difference(set(second)))
print(concat(var14_1, var14_2))
Taking the set of an object creates an unordered collection of its unique elements. set has a method called symmetric_difference() that finds the elements belonging to exactly one of the two sets. Because sets are unordered, the order of the characters in the joined result is not guaranteed.
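If the order matters (characters of the first string before those of the second), one possible sketch is to use the symmetric difference only as a filter and walk the concatenated strings once. The name concat_ordered is hypothetical, not part of any answer above.

```python
def concat_ordered(first, second):
    """Keep characters that occur in only one of the two strings, in order."""
    only_one = set(first) ^ set(second)   # ^ is the symmetric difference operator
    return ''.join(ch for ch in first + second if ch in only_one)

result = concat_ordered('0abcxyz', 'abcxyz1')
print(result)
```

This keeps a single loop while still reading the characters in their original order.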

Finding the shortest word in a string

I'm new to coding and I'm working on a question that asks to find the shortest word within a sentence. I'm confused what the difference between:
def find_short(s):
for x in s.split():
return min(len(x))
and
def find_short(s):
return min(len(x) for x in s.split())
is, because the former gives me an error and the latter seems to work fine. Are they not virtually the same thing?
Are they not virtually the same thing?
No, they are not the same thing. If s equals "hello world", in the first iteration, x would be "hello". And there are two things wrong here:
You are trying to return in the very first iteration rather than going over all the elements (words) to find out what's the shortest.
min(len(x)) is like saying min(5), which is not only a bad argument to pass to min(..) but also doesn't make sense. You'd want to pass a collection of elements from which min will calculate the minimum.
The second approach is actually correct. See this answer of mine to get an idea of how to interpret it. In short, you are calculating length of every word, putting that into a list (actually a generator), and then asking min to run its minimum computation on it.
There's an easier approach to see why your second expression works. Try printing the result of the following:
print([len(x) for x in s.split()])
The function min takes an iterable (such as a list) as its parameter.
In your first block, you have
def find_short(s):
    for x in s.split():
        return min(len(x))
min is called once, on the length of the first word, so it crashes because it expects an iterable rather than a single number.
Your second block is a little different
def find_short(s):
    return min(len(x) for x in s.split())
Inside min, you have len(x) for x in s.split(), which produces the lengths of all the words one by one and hands them to min. With these values, min is able to return the smallest.
No, they are not the same thing.
In the first piece of code you enter the for loop and try to calculate the min of the first word's length. min(5) doesn't make sense, does it? And even if it could be calculated, return would have stopped the function right there (the other words' lengths would not have been taken into consideration).
In the second one, len(x) for x in s.split() is a generator expression yielding the lengths of all the words in your sentence. And min will calculate the minimal element of this sequence.
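A small sketch of the difference, using a made-up sentence: the generator expression gives min something to iterate over, while a bare integer makes it raise TypeError.

```python
sentence = "hello big world"   # hypothetical input

# A generator expression feeds min() one word length at a time.
shortest_length = min(len(word) for word in sentence.split())

# A single integer is not iterable, so min() rejects it.
try:
    min(5)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

print(shortest_length, error_name)
```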
Yes, the examples given are very different.
The first example effectively says:
Take the string s, split it by spaces, and then take each word, x, found and return the minimum value of just the length of x.
The second example effectively says:
Find the minimum value in the list generated by len(x) for x in s.split().
That first example generates an error because the min function expects either an iterable or at least two values to compare, and it is given a single integer.
That second example works because len(x) for x in s.split() turns a string, like say "Python types with ducks?", into a series of word lengths (in my example, it would produce 6, 5, 4, 6). This series is produced lazily by what is called a generator expression, and it is what the min function then uses to find the minimum value.
Another way to write that first example so that it works like you would expect is like this
def find_short(s):
    min_length = float("inf")
    for x in s.split():
        if len(x) < min_length:
            min_length = len(x)
    return min_length
However, notice how you have to keep track of a variable that you do not have to define when using the generator expression in your second example. Although this is not a big deal when you are learning programming for the first time, it becomes a bigger deal when you start making larger, more complex programs.
Sidenote:
Any value that follows the return keyword is what a function "outputs", and thus no more code gets executed.
For example, in your first example (and assuming that the error was not generated), your loop would only ever execute once regardless of the string you give it because it does not check that you actually have found the value you want. What I mean by that is that any time your code encounters a return statement, it means that your function is done.
That is why in my example find_short function, I have an if statement to check that I have the value that I want before committing to the return statement that exits the function entirely.
There are mainly two mistakes here.
First off, it seems you are returning the length of the string, not the string itself.
So your function will return 4 instead of 'book', for example.
I will get to how you can fix it shortly.
But answering your question:
min() is a function that expects an iterable (something like an array).
In your first method, you are splitting the text, and calling return min(len(word)) for each word.
So, if the call were successful, it would return on the first iteration.
But it is not successful, because min(3) throws an exception: 3 is not iterable.
In your second approach you are creating a series of values to pass to the min function.
So your code first resolves len(x) for x in s.split(), producing something like 3,2,3,4,1,3,5 as input for min, which returns the minimum value.
If you would like to return the shortest word, you could try:
def find_short(s):
    y = s.split()
    y.sort(key=lambda a: len(a))
    return y[0]
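As an alternative sketch, min accepts the same kind of key function as sort, so the shortest word can be picked without sorting the whole list. The name find_shortest_word and the sample sentence are assumptions for illustration.

```python
def find_shortest_word(sentence):
    """Return the first of the shortest words in the sentence."""
    return min(sentence.split(), key=len)

shortest = find_shortest_word("turn the page of a book")
print(shortest)
```

Like the sort-based version, this fails on an empty sentence (min raises ValueError on an empty sequence), so you may want to guard against that.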

Iterating through a string and returning "x" number of characters in the string

I'm trying to iterate through a string and return only the number of characters that is passed to the function.
It will receive two parameters, x and chars. The function will return a string that is comprised of the values in chars from index 0 up to the index x-1.
The return I want to receive is
print rangeLoopStringParam1("bobsyouruncle", 5)
# -> "bobsy"
print rangeLoopStringParam1("supercalifragilisticexpialidoshus", 8)
# -> "supercal"
This is the code I have so far but I feel like I'm not getting very far. Can anyone guide me through this problem?
def rangeLoopStringParam1(chars, x):
for i in range(0, x, len(chars)):
chars += chars
return chars
What's wrong with list slicing?
def stringRange(chars, x):
    return chars[:x]
What's going on here is that we are treating the string as a sequence of characters, and we are telling Python to give us the first x elements of that sequence (the first x characters of the string). Keep in mind that slicing follows the syntax list[start:end] (in our case, start is implied to be 0). This means that, since indexing starts at 0, slicing really returns all elements with index from 0 (inclusive) to x (non-inclusive).
For a better, more detailed explanation, see this great answer by Greg Hewgill on how list slices work.
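A few slices on the string from the question, as a quick sketch of the start (inclusive) and end (exclusive) behaviour described above. The variable names are made up.

```python
text = "bobsyouruncle"

first_five = text[:5]     # indexes 0..4
middle = text[4:8]        # indexes 4..7
last_five = text[-5:]     # negative start counts from the end
whole = text[:100]        # an end past the string length is simply clamped

print(first_five, middle, last_five, whole)
```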
As an extra note, you can also do this with a lambda function.
yourFunction = lambda string, num: string[:num]
print(yourFunction("bobsyouruncle", 5))
returns
bobsy
for reference:
Python3 Docs
Python course Doc
Python has some pretty powerful built in functionality. One of them (slicing) does exactly what you're asking for.
def rangeLoopStringParam1(chars, x):
    return chars[:x]
You can slice your string directly using:
new_string = string[:n] with n being the length you want.

Python searching a large list speed

I have run into a speed issue searching through a very large list. I have a file with a lot of errors and very strange words in it. I am trying to use difflib to find the closest match in a dictionary file I have that has 650,000 words in it. This approach below works really well but is very very slow and I was wondering if there is a better way to approach this problem. This is the code:
from difflib import SequenceMatcher
headWordList = [ #This is a list of 650,000 words]
openFile = open("sentences.txt","r")
for line in openFile:
sentenceList.append[line]
percentage = 0
count = 0
for y in sentenceList:
if y not in headwordList:
for x in headwordList:
m = SequenceMatcher(None, y.lower(), x)
if m.ratio() > percentage:
percentage = m.ratio()
word = x
if percentage > 0.86:
sentenceList[count] = word
count=count+1
Thanks for the help, software engineering is not even close to my strong suit. Much appreciated.
Two things that might provide some small help:
1) Use the approach in this SO answer to read through your large file the most efficiently.
2) Change your code from
for x in headwordList:
    m = SequenceMatcher(None, y.lower(), x)
to
yLower = y.lower()
for x in headwordList:
    m = SequenceMatcher(None, yLower, x)
You're converting each sentence to lowercase 650,000 times. No need for that.
You should change headwordList into a set.
The test word in headwordList will be very slow. It must do a string comparison on each word in headwordList, one word at a time. It will take time proportional to the length of the list; if you double the length of the list, you will double the amount of time it takes to do the test (on average).
With a set, it always takes the same amount of time to do the in test; it doesn't depend on the number of elements in the set. So that will be a huge speedup.
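A minimal sketch of that change, with a three-word stand-in for the real 650,000-word list. The names headword_list, tokens and unknown are assumptions for illustration.

```python
headword_list = ["happy", "new", "year"]   # stand-in for the 650,000-word list
headwords = set(headword_list)             # built once, before the loop

tokens = ["Happy", "nwe", "year"]          # hypothetical words read from the file

# Each "in" test is now a hash lookup instead of a scan of the whole list.
unknown = [t for t in tokens if t.lower() not in headwords]
print(unknown)
```

Only the words left in unknown need the expensive SequenceMatcher pass.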
Now, this whole loop can be simplified:
for x in headwordList:
m = SequenceMatcher(None, y.lower(), x)
if m.ratio() > percentage:
percentage = m.ratio()
word = x
if percentage > 0.86:
sentenceList[count] = word
All this does is find the word from headwordList that has the highest ratio, and keep it (but only keep it if the ratio is over 0.86). Here's a faster way to do this. I'm going to change the name headwordList to just headwords as I want you to make it be a set and not a list.
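def check_ratio(pair):
    # max() passes each (matcher, word) tuple to the key function
    return pair[0].ratio()
y = y.lower() # do the .lower() call one time
# As written, the next line is a SyntaxError: the generator expression needs its own parentheses (see the EDIT below)
m, word = max((SequenceMatcher(None, y, word), word) for word in headwords, key=check_ratio)
percentage = max(percentage, m.ratio()) # remember best ratio
if m.ratio() > 0.86:
    sentence_list.append(word)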
This might seem a bit tricky but it is the fastest way to do this in Python. We will call the built-in max() function to find the SequenceMatcher result that has the highest ratio. First, we build a "generator expression" that tries all the words in headwords, calling SequenceMatcher() on each. But when we are done, we also want to know what the word was. So the generator expression produces tuples, where the first value in the tuple is the SequenceMatcher result and the second value is the word. The max() function cannot know that what we care about is the ratio, so we have to tell it that; we do this by making a function that tests what we care about, then passing that function as the key= argument. Now max() finds the value with the highest ratio for us. max() consumes all the values produced by the generator expression and returns a single value, which we then unpack into the variables m and word.
In Python, it is best practice to use variable names like sentence_list rather than sentenceList. Please see these guidelines: http://www.python.org/dev/peps/pep-0008/
It is not good practice to use an incrementing index variable and assign into indexed positions in a list. Rather, start with an empty list and use the .append() method function to append values.
Also, you might do better to build a dictionary of words and their ratios.
Note that your original code seems to have a bug: as soon as any word has a percentage over 0.86, all words are saved in sentenceList no matter what their ratio is. The code I wrote, above, only saves words where the word's own ratio was high enough.
EDIT: This is to answer a question about generator expressions needing to be parenthesized.
Whenever I get that error message, I usually split out the generator expression by itself and assign it to a variable. Like this:
def check_ratio(pair):
    # max() passes each (matcher, word) tuple to the key function
    return pair[0].ratio()
y = y.lower() # do the .lower() call one time
genexp = ((SequenceMatcher(None, y, word), word) for word in headwords)
m, word = max(genexp, key=check_ratio)
percentage = max(percentage, m.ratio()) # remember best ratio
if m.ratio() > 0.86:
    sentence_list.append(word)
That's what I suggest. But if you don't mind a complicated line looking even busier, you can simply add an extra pair of parentheses as the error message suggests, so the generator expression is fully parenthesized. Like so:
m, word = max(((SequenceMatcher(None, y, word), word) for word in headwords), key=check_ratio)
Python lets you omit the explicit parentheses around a generator expression when you pass the expression to a function, but only if it is the only argument to that function. As we are also passing a key= argument, we need a fully parenthesized generator expression.
But I think it's easier to read if you split out the genexp on its own line.
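The parenthesization rule can be seen on a toy example (the words list is made up): a generator expression may borrow the call's parentheses only when it is the sole argument.

```python
words = ["pear", "fig", "banana"]   # hypothetical data

# Sole argument: the call's own parentheses are enough.
longest_length = max(len(w) for w in words)

# With a key= argument, the generator expression needs its own parentheses.
pairs = ((len(w), w) for w in words)
length, word = max(pairs, key=lambda pair: pair[0])

print(longest_length, length, word)
```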
EDIT: @Peter Wood pointed out that the documentation suggests reusing a SequenceMatcher for speed. I don't have time to test this, but I think this is the right way to do it.
Happily, the code got simpler! Always a good sign.
EDIT: I just tested the code. This code works for me; see if it works for you.
from difflib import SequenceMatcher
headwords = [
    # This is a list of 650,000 words
    # Dummy list:
    "happy",
    "new",
    "year",
]
def words_from_file(filename):
    with open(filename, "rt") as f:
        for line in f:
            for word in line.split():
                yield word
def _match(matcher, s):
    matcher.set_seq2(s)
    return (matcher.ratio(), s)
ratios = {}
best_ratio = 0
matcher = SequenceMatcher()
for word in words_from_file("sentences.txt"):
    matcher.set_seq1(word.lower())
    if word not in headwords:
        ratio, word = max(_match(matcher, word.lower()) for word in headwords)
        best_ratio = max(best_ratio, ratio) # remember best ratio
        if ratio > 0.86:
            ratios[word] = ratio
print(best_ratio)
print(ratios)
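A different technique that may be worth trying is difflib.get_close_matches from the standard library, which wraps the "best match above a cutoff" search and checks cheap upper bounds on the ratio before computing the full one. This is only a sketch: closest_headword is a hypothetical helper, and the three-word list stands in for the real dictionary.

```python
from difflib import get_close_matches

headwords = ["happy", "new", "year"]   # stand-in for the 650,000-word list

def closest_headword(word, cutoff=0.86):
    """Return the best headword match at or above the cutoff, or None."""
    matches = get_close_matches(word.lower(), headwords, n=1, cutoff=cutoff)
    return matches[0] if matches else None

fixed = closest_headword("Hapy")
missing = closest_headword("zzzz")
print(fixed, missing)
```

Whether this is actually faster on 650,000 words would need measuring on your data.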
1) I would store headwordList as a set, not a list, allowing for faster access as it is a hashed data structure.
2) You have sentenceList defined as a list then attempt to use it as a dictionary with sentenceList[x] = y. I would define a different structure specifically for counts.
3) You construct sentenceList which doesn't need to be done.
for line in file:
if line not in headwordList...
4) You never tokenize line which means you store the entire line before the newline character in sentenceList and see if it is in a wordlist
This is a data structures question. What you want to do is turn your list into something with faster element lookup. For example, a binary search tree would work well here: lookup time is only O(log n), as opposed to O(n) for a list, which is far faster by comparison.
There's a fairly simple explanation here:
http://interactivepython.org/runestone/static/pythonds/Trees/balanced.html
But if you are not familiar with tree concepts, you might want to start few chapters earlier:
http://interactivepython.org/runestone/static/pythonds/Trees/trees.html
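Python has no built-in binary search tree, but the same O(log n) lookup idea can be sketched with a sorted list and the standard bisect module. Note this swaps a sorted list in for the tree; the function name contains and the sample words are assumptions.

```python
from bisect import bisect_left

sorted_words = sorted(["year", "happy", "new"])   # sort once, up front

def contains(sorted_items, target):
    """Binary search: True if target is present in the sorted list."""
    i = bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target

found = contains(sorted_words, "new")
not_found = contains(sorted_words, "old")
print(found, not_found)
```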

python nested generator objects content

I have a problem with Python.
I'm trying to understand which are the information stored in an object that I discovered be a generator.
I don't know anything about Python, but I have to understand how this code works in order to convert it to Java.
The code is the following:
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)
def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    pairs = [(text[:i+1], text[i+1:]) for i in range(min(len(text), L))]
    return pairs
def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    productw = 1
    for w in words:
        productw = productw * Pw(w)
    return productw
while I understood how the methods Pwords and splits work (the function Pw(w) simply get a value from a matrix), I'm still trying to understand how the "candidates" object, in the "segment" method is built and what it contains.
As well as, how the "max()" function analyzes this object.
I hope that someone could help me because I didn't find any feasible solution here to print this object.
Thanks a lot to everybody.
Mauro.
A generator is quite a simple abstraction. It behaves like a single-use custom iterator.
gen = (f(x) for x in data)
means that gen is an iterator whose next value is always f(x), where x is the corresponding value of data.
A generator expression like the one in segment is similar to a list comprehension, with small differences:
it is single use
it doesn't create the whole sequence up front
the code runs only during iteration
For easier debugging you can try replacing the generator expression with a list comprehension
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = [[first]+segment(rem) for first,rem in splits(text)]
    return max(candidates, key=Pwords)
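The three differences can be observed directly with a toy function that records when it is called. This is just an illustrative sketch; score and calls are made-up names.

```python
calls = []

def score(x):
    calls.append(x)   # record when the function actually runs
    return x * x

gen = (score(x) for x in [3, 1, 2])
ran_before = list(calls)   # snapshot: nothing has run yet, the generator is lazy
best = max(gen)            # max() pulls the values one by one
leftover = list(gen)       # the generator is now exhausted

print(ran_before, best, leftover, calls)
```

This is also why printing candidates shows only a generator object: its contents do not exist until something such as max() iterates over it.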
