Creating a list from another list - Python

I'm trying to create a program where the user inputs a list of strings, each one on a separate line. I want to be able to return, for example, the third word of the second line. The input below would then return "blue".
input_string("""The cat in the hat
Red fish blue fish """)
Currently I have this:
def input_string(input):
    words = input.split('\n')
So I can output a certain line using words[n], but how do I output a specific word in a specific line? I've been trying to make words[1][2] work, but my attempts at creating a multidimensional array have failed.
I've been trying to split each words[n] for a few hours now and google hasn't helped. I apologize if this is completely obvious, but I just started using Python a few days ago and am completely stuck.

It is as simple as:
input_string = ("""The cat in the hat
Red fish blue fish """)
words = [i.split(" ") for i in input_string.split('\n')]
It generates:
[['The', 'cat', 'in', 'the', 'hat', ''], ['Red', 'fish', 'blue', 'fish', '']]
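The trailing empty strings come from the trailing space in the input; splitting with no argument avoids them. A minimal sketch of that variant:

```python
# split('\n') separates the lines; split() with no argument splits on any
# whitespace run and drops empty strings, unlike split(" ")
input_string = """The cat in the hat
Red fish blue fish """
words = [line.split() for line in input_string.split('\n')]
print(words)        # [['The', 'cat', 'in', 'the', 'hat'], ['Red', 'fish', 'blue', 'fish']]
print(words[1][2])  # blue
```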

It sounds like you want to split on os.linesep (the line separator for the current OS) before you split on space. Something like:
import os

def input_string(input):
    words = []
    for line in input.split(os.linesep):
        words.append(line.split())
    return words
That will give you a list of word lists for each line.

There is a method called splitlines() as well. It will split on newlines. If you don't pass it any arguments, it will remove the newline character. If you pass it True, it will keep it there, but separate the lines nonetheless.
words = [line.split() for line in input_string.splitlines()]
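A small sketch of both behaviours, with the keepends argument shown as True:

```python
text = "The cat in the hat\nRed fish blue fish"
print(text.splitlines())      # ['The cat in the hat', 'Red fish blue fish']
print(text.splitlines(True))  # ['The cat in the hat\n', 'Red fish blue fish']

words = [line.split() for line in text.splitlines()]
print(words[1][2])            # blue
</imports>```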

Try this:
lines = input.split('\n')
words = []
for line in lines:
    words.append(line.split(' '))
In English:
construct a list of lines; this is analogous to reading from a file.
loop over each line, splitting it into a list of words.
append each list of words to another list; this produces a list of lists.

Related

Creating a dictionary of words as keys mapped to their instances in a set of 'documents'

I need to take a list of tuples of sentences preprocessed as follows (the 0 is an integer corresponding to the publication, and the set at the end contains all unique words in the sentence):
(0, 'political commentators on both sides of the political divide agreed that clinton tried to hammer home the democrats theme that trump is temperamentally unfit with the line about his tweets and nuclear weapons', {'weapons', 'political', 'theme', 'line', 'and', 'sides', 'commentators', 'of', 'tried', 'about', 'is', 'agreed', 'clinton', 'the', 'home', 'to', 'divide', 'tweets', 'that', 'democrats', 'unfit', 'on', 'temperamentally', 'both', 'hammer', 'his', 'nuclear', 'with', 'trump'})
and returns a dictionary with the words as the keys and a list of integers, the "index" positions of the word, as the values. I.e. if this sentence was the 12th in the list, the dictionary value would contain 12 for all the words present in it.
I know that I need to enumerate the original set of documents and then take the words from the set in the tuple, but I'm having a hard time finding the proper syntax to iterate into the sets of words within the tuples. Right now I'm stumped as to even where to start. If you want to see my code for how I produced the tuples from an original document of lines, here it is.
import re

def makeDocuments(filename):
    with open(filename) as f:
        g = [l for l in f]
    return [tuple([int(l[0:2]), re.sub(r'\W', ' ', (l[2:-1])), set(re.findall(r'[a-zA-Z%]+', l))]) for l in g]
A test case was provided for me, upon searching for a given key the results should look something like:
assert index['happiness'] == [16495, 66139, 84943,
                              85998, 91589, 93472,
                              120070, 133078, 193349]
where the word 'happiness' occurs inside the sentences at those index positions.
Parsing that string by hand is hard, and you have pretty much just done a brute-force extraction of the data. Instead of trying to guess whether that will work on all possible input, you can use Python's ast module to convert literals (what you type into a Python program to represent strings, tuples, sets and so forth) into Python objects for processing. After that, it's just a question of associating the words in the newly created tuple with the index.
import ast

def makeDocuments(filename):
    catalog = {}
    with open(filename) as f:
        for line in f:
            index, text, words = ast.literal_eval(line)
            for word in words:
                if word not in catalog:
                    catalog[word] = []
                catalog[word].append(index)
    return catalog
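To illustrate what literal_eval does with one such line, here is a sketch using a made-up, much shorter tuple than the real data:

```python
import ast

# hypothetical stand-in for one line of the real input file
line = "(0, 'the cat sat', {'the', 'cat', 'sat'})"
index, text, words = ast.literal_eval(line)
print(index)  # 0
print(words)  # the set {'the', 'cat', 'sat'} (display order may vary)
```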

Create a list of alphabetically sorted UNIQUE words and display the first N words in python

I am new to Python, apologies for the simple question. My task is the following:
Create a list of alphabetically sorted unique words and display the first 5 words
I have text variable, which contains a lot of text information
I did
test = text.split()
sorted(test)
As a result, I receive a list, which starts from symbols like $ and numbers.
How do I get to the words and print N of them?
I'm assuming that by "word" you mean strings consisting only of alphabetical characters. In that case, you can use the built-in filter to first get rid of the unwanted strings, turn the result into a set, sort it, and then print your stuff.
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: x.isalpha(), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', 'of', 'peak']
But the problem with this is that it will still ignore words like mountain's, because of that pesky '. A regex solution might actually be far better in such a case-
For now, we'll be going for this regex - ^[A-Za-z']+$, which means the string must only contain alphabets and ', you may add more to this regex according to what you deem as "words". Read more on regexes here.
We'll be using re.match instead of .isalpha this time.
import re

WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words matching the pattern (alphabets and apostrophes)
words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', "mountain's", 'of']
Keep in mind however, this gets tricky when you have a string like hi! What's your name?. hi!, name? are all words except they are not fully alphabetic. The trick to this is to split them in such a way that you get hi instead of hi!, name instead of name? in the first place.
Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question
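One common workaround, rather than splitting on spaces and then validating, is to extract word-like tokens directly with re.findall using the same character class; punctuation like the ! and ? is simply left behind. A sketch, not a full tokenizer:

```python
import re

text = "hi! What's your name?"
tokens = re.findall(r"[A-Za-z']+", text)
print(tokens)  # ['hi', "What's", 'your', 'name']
```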
I am a newbie here, apologies for mistakes. Thank you.
test = '''The coronavirus outbreak has hit hard the cattle farmers in Pabna and Sirajganj as they are now getting hardly any customer for the animals they prepared for the last year targeting the Eid-ul-Azha this year.
Normally, cattle traders flock in large numbers to the belt -- one of the biggest cattle producing areas of the country -- one month ahead of the festival, when Muslims slaughter animals as part of their efforts to honour Prophet Ibrahim's spirit of sacrifice.
But the scene is different this year.'''
test = test.lower().split()
test2 = sorted([j for j in test if j.isalpha()])
print(test2[:5])
You can slice the sorted returned list up to position 5
sorted(test)[:5]
or if looking only for words
sorted([i for i in test if i.isalpha()])[:5]
or by regex (with import re)
sorted([i for i in test if re.search(r"[a-zA-Z]", i)])[:5]
By using a slice of a list you can get all list elements up to a specific index, in this case 5.

Splitting elements within a list and separate strings, then counting the length

If I have several lines of code, such that
"Jane, I don't like cavillers or questioners; besides, there is something truly forbidding in a child taking up her elders in that manner.
Be seated somewhere; and until you can speak pleasantly, remain silent."
I mounted into the window- seat: gathering up my feet, I sat cross-legged, like a Turk; and, having drawn the red moreen curtain nearly close, I was shrined in double retirement.
and I want to split the 'string' or sentences for each line by the ";" punctuation, I would do
for line in open("jane_eyre_sentences.txt"):
    words = line.strip("\n")
    words_split = words.split(";")
However, now I would get strings of text such that,
["Jane, I don't like cavillers or questioners", 'besides, there is something truly forbidding in a child taking up her elders in that manner.']
['Be seated somewhere', 'and until you can speak pleasantly, remain silent."']
['I mounted into the window- seat: gathering up my feet, I sat cross-legged, like a Turk', 'and, having drawn the red moreen curtain nearly close, I was shrined in double retirement.']
So it has now created two separate elements in each list.
How would I actually separate these lists?
I know I need a 'for' loop because it needs to process all the lines. I will need to use another 'split' method, and I have tried "\n" as well as ',', but it will not generate an answer; Python says "AttributeError: 'list' object has no attribute 'split'". What does that mean?
Once I have separate strings, I want to calculate the length of each string, so I would do len(), etc.
You can iterate through the list of created sentence parts like this:
for line in open("jane_eyre_sentences.txt"):
    words = line.strip("\n")
    for sentence_part in words.split(";"):
        print(sentence_part)       # will print the elements of the list
        print(len(sentence_part))  # will print the length of the sentence parts
Alternatively, if you just need the length of each of the parts:
for line in open("jane_eyre_sentences.txt"):
    words = line.strip("\n")
    sentence_part_lengths = [len(sentence_part) for sentence_part in words.split(";")]
Edit: With further information from your second post.
for count, line in enumerate(open("jane_eyre_sentences.txt")):
    words = line.strip("\n")
    if ";" in words:
        wordssplit = words.split(";")
        number_of_words_per_split = [(x, len(x.split())) for x in wordssplit]
        print("Line {}: ".format(count), number_of_words_per_split)
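For example, run against the asker's second sample line (hard-coded in a list here so the sketch needs no file):

```python
lines = ['Be seated somewhere; and until you can speak pleasantly, remain silent.']
results = []
for count, line in enumerate(lines):
    words = line.strip("\n")
    if ";" in words:
        wordssplit = words.split(";")
        # pair each part with its word count
        results.append((count, [(x, len(x.split())) for x in wordssplit]))
print(results)
# [(0, [('Be seated somewhere', 3), (' and until you can speak pleasantly, remain silent.', 8)])]
```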

When splitting text lines after a full stop, how do I specify when NOT to split like after the title 'Dr.'? [duplicate]

This question already has answers here:
How can I split a text into sentences?
(20 answers)
Closed 7 years ago.
#!/usr/bin/python
# Opening the file
myFile = open('Text File.txt', 'r')
# Printing the file's original text first
for line in myFile.readlines():
    print line
# Splitting the text
varLine = line
splitLine = varLine.split(". ")
# Printing the edited text
print splitLine
# Closing the file
myFile.close()
When opening a .txt file in a Python program, I want the output of the text to be formatted like a sentence i.e. after a full stop is shown a new line is generated. That is what I have achieved so far, however I do not know how to prevent this happening when full stops are used not at the end of a sentence such as with things like 'Dr.', or 'i.e.' etc.
The best way, if you control the input, is to use two spaces at the end of each sentence (as people should, IMHO), then split on '.  ' (a full stop followed by two spaces) so you don't touch the Dr. or i.e.
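A quick sketch of that two-space convention:

```python
# each sentence ends with '.' followed by TWO spaces; 'Dr.' has only one
text = "Hello, Dr. Brown.  Nice to meet you.  I'm Bob."
print(text.split('.  '))  # ['Hello, Dr. Brown', 'Nice to meet you', "I'm Bob."]
```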
If you don't control the input... I'm not sure this is really Pythonic, but here's one way you could do it: use a placeholder to identify all locations where you want to save the period. Below, I assume 'XYZ' never shows up in my text. You can make it as complex as you like, and it will be better the more complex it is (less likely to run into it that way).
sentence = "Hello, Dr. Brown. Nice to meet you. I'm Bob."
targets = ['Dr.', 'i.e.', 'etc.']
placeholder = 'XYZ'
replacements = [t.replace('.', placeholder) for t in targets]
# replacements now looks like: ['DrXYZ', 'iXYZeXYZ', 'etcXYZ']
for i in range(len(targets)):
    sentence = sentence.replace(targets[i], replacements[i])
# sentence now looks like: "Hello, DrXYZ Brown. Nice to meet you. I'm Bob."
output = sentence.split('. ')
# output now looks like: ['Hello, DrXYZ Brown', 'Nice to meet you', "I'm Bob."]
output = [o.replace(placeholder, '.') for o in output]
print(output)
>>> ['Hello, Dr. Brown', 'Nice to meet you', "I'm Bob."]
Use the in keyword to check.
'.' in "Dr."
# True
'.' in "Bob"
# False

Grouping related search keywords

I have a log file containing search queries entered into my site's search engine. I'd like to "group" related search queries together for a report. I'm using Python for most of my webapp - so the solution can either be Python based or I can load the strings into Postgres if it is easier to do this with SQL.
Example data:
dog food
good dog trainer
cat food
veterinarian
Groups should include:
cat:
cat food
dog:
dog food
good dog trainer
food:
dog food
cat food
etc...
Ideas? Some sort of "indexing algorithm" perhaps?
f = open('data.txt', 'r')
raw = f.readlines()

# generate set of all possible groupings
groups = set()
for lines in raw:
    data = lines.strip().split()
    for items in data:
        groups.add(items)

# parse input into groups
for group in groups:
    print "Group \'%s\':" % group
    for line in raw:
        if line.find(group) != -1:
            print line.strip()
    print

# consider storing into a dictionary instead of just printing
This could be heavily optimized, but this will print the following result, assuming you place the raw data in an external text file:
Group 'trainer':
good dog trainer
Group 'good':
good dog trainer
Group 'food':
dog food
cat food
Group 'dog':
dog food
good dog trainer
Group 'cat':
cat food
Group 'veterinarian':
veterinarian
Well it seems that you just want to report every query that contains a given word. You can do this easily in plain SQL by using the wildcard matching feature, i.e.
SELECT * FROM queries WHERE querystring LIKE '%dog%';
The only problem with the query above is that it also finds queries with query strings like "dogbah", you need to write a couple of alternatives using OR to cater for the different cases assuming your words are separated by whitespace.
Not a concrete algorithm, but what you're looking for is basically an index created from words found in your text lines.
So you'll need some sort of parser to recognize words, then you put them in an index structure and link each index entry to the line(s) where it is found. Then, by going over the index entries, you have your "groups".
Your algorithm needs the following parts (if done by yourself):
a parser for the data, breaking it up into lines and the lines into words.
A data structure to hold key-value pairs (like a hashtable). The key is a word; the value is a dynamic array of lines (if you keep the parsed lines in memory, pointers or line numbers suffice).
in pseudocode (generation):
create empty set S for name-value pairs
for each line L parsed
    for each word W in line L
        seek W in set S -> Item
        if not found -> add word W -> (empty array) to set S
        add line L reference to array in Item
    endfor
endfor

(lookup (word: W))
seek W in set S into Item
if found return array from Item
else return empty array
Modified version of @swanson's answer (not tested):
from collections import defaultdict
from itertools import chain

# generate set of all possible words
lines = open('data.txt').readlines()
words = set(chain.from_iterable(line.split() for line in lines))

# parse input into groups
groups = defaultdict(list)
for line in lines:
    for word in words:
        if word in line:
            groups[word].append(line)
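Both this and the find-based version above still substring-match, so they share the "dogbah" problem noted in the SQL answer; splitting each line into words instead of testing word in line avoids it. A sketch on the sample data:

```python
from collections import defaultdict

lines = ['dog food', 'good dog trainer', 'cat food', 'veterinarian']
groups = defaultdict(list)
for line in lines:
    for word in line.split():  # word-level split, so 'dog' won't match 'dogbah'
        groups[word].append(line)
print(groups['dog'])   # ['dog food', 'good dog trainer']
print(groups['food'])  # ['dog food', 'cat food']
```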
