How to sort a text file alphabetically with Python?

I have a text file that contains English words translated into Arabic, and I want to arrange it alphabetically.
Text file:
entrenched = ترسخ
hypotenuse =وتر
conquered = التغلب
tempted = يغري
intentional = متعمد
PS: some entries contain more than one word, like this:
indistinguishable = لا يمكن تمييزه
spot the subtle = بقعة الخفية
I want to check just the first character of every line and sort on that. I tried this, but it didn't work:
def sorting(filename):
    infile = open(filename, encoding="utf8")
    words = []
    for line in infile:
        temp = line.split()
        for i in temp:
            words.append(i)
    infile.close()
    words.sort()
    outfile = open("result.txt", "w", encoding="utf8")
    for i in words:
        outfile.writelines(i)
        outfile.writelines("\n")
    outfile.close()

sorting("words.txt")

Had to figure out your code indenting :-(
Your code was splitting each line and adding all the individual words to words. I assume you want to sort by the first word of each line, so my code below sorts the lines using the key= argument of sorted(), with a function getfirstword that works out the first word of a line. Often you'd use a lambda function for this, but a named function is much easier to debug.
I put the strings into a variable and split the variable, just so it's a minimal reproducible example in one file.
datatext = """entrenched = ترسخ
hypotenuse =وتر
conquered = التغلب
tempted = يغري
intentional = متعمد
"""

def getfirstword(s):
    '''
    given a string, split off the first word
    '''
    words = s.split(maxsplit=1)
    if len(words) > 0:
        return words[0]
    return ""

# takes a complete text - splits it into lines
def sorting(text):
    lines = text.splitlines()
    sortedlines = sorted(lines, key=getfirstword)
    for line in sortedlines:
        print(line)

# could use:
# datatext = open("words.csv").read()
sorting(datatext)
Result:
conquered = التغلب
entrenched = ترسخ
hypotenuse =وتر
intentional = متعمد
tempted = يغري
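
If you want the sorted lines written back to a file, as in your original result.txt, here is a minimal sketch reusing getfirstword from above (the filenames match your original call):

def sort_file(infilename, outfilename):
    # read all lines, sort them by their first word, write them out
    with open(infilename, encoding="utf8") as infile:
        lines = infile.read().splitlines()
    lines.sort(key=getfirstword)
    with open(outfilename, "w", encoding="utf8") as outfile:
        for line in lines:
            outfile.write(line + "\n")

sort_file("words.txt", "result.txt")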

File Names Chain in Python

I CANNOT USE ANY IMPORTED LIBRARY. I have this task where I have some directories containing some files; every file contains, besides some words, the name of the next file to be opened in its first line. Once every word of every file contained in a directory has been read, they have to be processed in a way that returns a single string; that string contains, in its first position, the most frequent first letter of every word seen before, in its second position the most frequent second letter, and so on. I have managed to do this with a directory containing 3 files, but it's not using any chain-like mechanism, rather a passing of local variables. Some of my college colleagues suggested I use slicing of lists, but I can't figure out how. I CANNOT USE ANY IMPORTED LIBRARY.
This is what I got:
'''
The objective of the homework assignment is to design and implement a function
that reads some strings contained in a series of files and generates a new
string from all the strings read.
The strings to be read are contained in several files, linked together to
form a closed chain. The first string in each file is the name of another
file that belongs to the chain: starting from any file and following the
chain, you always return to the starting file.
Example: the first line of file "A.txt" is "B.txt," the first line of file
"B.txt" is "C.txt," and the first line of "C.txt" is "A.txt," forming the
chain "A.txt"-"B.txt"-"C.txt".
In addition to the string with the name of the next file, each file also
contains other strings separated by spaces, tabs, or carriage return
characters. The function must read all the strings in the files in the chain
and construct the string obtained by concatenating the characters with the
highest frequency in each position. That is, in the string to be constructed,
at position p, there will be the character with the highest frequency at
position p of each string read from the files. In the case where there are
multiple characters with the same frequency, consider the alphabetical order.
The generated string has a length equal to the maximum length of the strings
read from the files.
Therefore, you must write a function that takes as input a string "filename"
representing the name of a file and returns a string.
The function must construct the string according to the directions outlined
above and return the constructed string.
Example: if the contents of the three files A.txt, B.txt, and C.txt in the
directory test01 are as follows
test01/A.txt     test01/B.txt     test01/C.txt
----------------------------------------------
test01/B.txt     test01/C.txt     test01/A.txt
house            home             kite
garden           park             hello
kitchen          affair           portrait
balloon                           angel
surfing
the function most_frequent_chars ("test01/A.txt") will return "hareennt".
'''
def file_names_list(filename):
    intermezzo = []
    lista_file = []
    a_file = open(filename)
    lines = a_file.readlines()
    for line in lines:
        intermezzo.extend(line.split())
        del intermezzo[1:]
        lista_file.append(intermezzo[0])
        intermezzo.pop(0)
    return lista_file

def words_list(filename):
    lista_file = []
    a_file = open(filename)
    lines = a_file.readlines()[1:]
    for line in lines:
        lista_file.extend(line.split())
    return lista_file

def stuff_list(filename):
    file_list = file_names_list(filename)
    the_rest = words_list(filename)
    second_file_name = file_names_list(file_list[0])
    the_lists = words_list(file_list[0]) and words_list(second_file_name[0])
    the_rest += the_lists[0:]
    return the_rest

def most_frequent_chars(filename):
    huge_words_list = stuff_list(filename)
    maxOccurs = ""
    list_of_chars = []
    for i in range(len(max(huge_words_list, key=len))):
        for item in huge_words_list:
            try:
                list_of_chars.append(item[i])
            except IndexError:
                pass
        maxOccurs += max(sorted(set(list_of_chars)), key=list_of_chars.count)
        list_of_chars.clear()
    return maxOccurs

print(most_frequent_chars("test01/A.txt"))
This assignment is relatively easy if the code has a good structure. Here is a full implementation:
def read_file(fname):
    with open(fname, 'r') as f:
        return list(filter(None, [y.rstrip(' \n').lstrip(' ') for x in f for y in x.split()]))

def read_chain(fname):
    seen = set()
    new = fname
    result = []
    while new not in seen:
        A = read_file(new)
        seen.add(new)
        new, words = A[0], A[1:]
        result.extend(words)
    return result

def most_frequent_chars(fname):
    all_words = read_chain(fname)
    result = []
    for i in range(max(map(len, all_words))):
        chars = [word[i] for word in all_words if i < len(word)]
        result.append(max(sorted(set(chars)), key=chars.count))
    return ''.join(result)

print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"
In the code above, we define 3 functions:
- read_file: a simple function to read the contents of a file and return a list of strings. The call x.split() takes care of any spaces or tabs used to separate words. The final list(filter(None, ...)) makes sure that empty strings are removed from the result.
- read_chain: a simple routine to iterate through the chain of files and return all the words contained in them.
- most_frequent_chars: the main routine, where the most frequent characters are counted carefully.
PS. This line of code you had is very interesting:
maxOccurs += max(sorted(set(list_of_chars)), key = list_of_chars.count)
I edited my code to include it.
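
If you want to reproduce the example locally, here is a small test-setup sketch (the filenames and words are taken from the docstring above; os is used only to create the directory here, not in the solution itself):

import os

# build the test01 chain from the assignment's example
os.makedirs("test01", exist_ok=True)
chain = {
    "test01/A.txt": "test01/B.txt\nhouse\ngarden\nkitchen\nballoon\nsurfing\n",
    "test01/B.txt": "test01/C.txt\nhome\npark\naffair\n",
    "test01/C.txt": "test01/A.txt\nkite\nhello\nportrait\nangel\n",
}
for name, content in chain.items():
    with open(name, "w") as f:
        f.write(content)

print(most_frequent_chars("test01/A.txt"))  # prints: hareennt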
Space complexity optimization
The space complexity of the previous code can be improved by orders of magnitude if the files are scanned without storing all the words:
def scan_file(fname, database):
    with open(fname, 'r') as f:
        next_file = None
        for x in f:
            for y in x.split():
                if next_file is None:
                    next_file = y
                else:
                    for i, c in enumerate(y):
                        while len(database) <= i:
                            database.append({})
                        if c in database[i]:
                            database[i][c] += 1
                        else:
                            database[i][c] = 1
    return next_file

def most_frequent_chars(fname):
    database = []
    seen = set()
    new = fname
    while new not in seen:
        seen.add(new)
        new = scan_file(new, database)
    return ''.join(max(sorted(d.keys()), key=d.get) for d in database)

print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"
Now we scan the files tracking the frequency of the characters in database, without storing intermediate arrays.
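For comparison only (the assignment forbids imports, so this is just an illustrative sketch of the same idea), the per-position counting could be written with collections.Counter, reusing read_file from the first solution:

from collections import Counter

def most_frequent_chars_counter(fname):
    database = []  # one Counter per character position
    seen, current = set(), fname
    while current not in seen:
        seen.add(current)
        content = read_file(current)  # reuses read_file from above
        current, words = content[0], content[1:]
        for word in words:
            for i, c in enumerate(word):
                while len(database) <= i:
                    database.append(Counter())
                database[i][c] += 1
    # the (-count, char) key picks the highest count, breaking ties alphabetically
    return ''.join(min(d, key=lambda c: (-d[c], c)) for d in database)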
Ok, here's my solution:
def parsi_file(filename):
    visited_files = set()
    words_list = []
    # Getting words from all files
    while filename not in visited_files:
        visited_files.add(filename)
        with open(filename) as f:
            filename = f.readline().strip()
            words_list += f.read().split()  # handles words separated by spaces, tabs or newlines
    # Creating dictionaries of letters:count for each index
    letters_dicts = []
    for word in words_list:
        for i in range(len(word)):
            if i > len(letters_dicts) - 1:
                letters_dicts.append({})
            letter = word[i]
            if letters_dicts[i].get(letter):
                letters_dicts[i][letter] += 1
            else:
                letters_dicts[i][letter] = 1
    # Sorting dicts and getting the "best" letter
    code = ""
    for dic in letters_dicts:
        sorted_letters = sorted(dic, key=lambda letter: (-dic[letter], letter))
        code += sorted_letters[0]
    return code
We first get the words_list from all the files in the chain.
Then, for each index, we create a dictionary of the letters found at that index in all words, with their counts.
Next we sort the dictionary keys by descending count (-count), then by alphabetical order.
Finally we take the first letter (the one with the max count) and add it to the "code" word for this file chain.
Edit: in terms of efficiency, parsing through all the words once per index would get worse as the number of words grows, so the code above builds the dictionaries for every index in a single pass over the list of words.
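Usage is the same as for the other solutions (assuming the test01 chain from the assignment):

print(parsi_file("test01/A.txt"))  # expected output: hareennt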

How can I merge two snippets of text that both contain a desired keyword?

I have a program that pulls out the text around a specific keyword. I'm trying to modify it so that if two keywords are close enough together, it shows one longer snippet of text instead of two individual snippets.
My current code, below, adds words after the keyword to a list and resets the counter if another keyword is found. However, I've found two problems with this. The first is that the data rate limit in my Spyder notebook is exceeded, and I haven't been able to deal with that. The second is that though this makes a longer snippet, it doesn't get rid of the duplicate.
Does anyone know a way to get rid of the duplicate snippet, or how to merge the snippets in a way that doesn't exceed the data rate limit (or how to change the Spyder rate limit)? Thank you!!
def occurs(word1, word2, file, filewrite):
    import os
    infile = open(file, 'r')  # opens the file, reads it, splits it into lines
    lines = infile.read().splitlines()
    infile.close()
    wordlist = [word1, word2]  # this list allows for multiple words
    wordsString = ' '.join(lines)  # joins the lines back together
    words = wordsString.split()  # splits the file into individual words
    f = open(file, 'w')
    f.write("start")
    f.write(os.linesep)
    g = open(filewrite, 'w')
    g.write("start")
    g.write(os.linesep)
    for item in wordlist:  # multiple words
        # go through the words and find each one containing the keyword
        matches = [i for i, w in enumerate(words) if w.lower().find(item) != -1]
        for m in matches:  # for each instance of the word, print out the surrounding words
            snippet = []
            s = ""
            l = " ".join(words[max(0, m - 20):m + 1])
            j = 0       # words added since the last keyword
            offset = 1  # position relative to the match
            while j < 20 and m + offset < len(words):
                snippet.append(words[m + offset])
                j = j + 1
                if words[m + offset] == word1 or words[m + offset] == word2:
                    j = 0  # reset the counter when another keyword is found
                offset = offset + 1
            print(snippet)
            k = " ".join(snippet)
            f.write(f"{s}...{l} {k}...")  # writes the data to the external file
            f.write(os.linesep)
            g.write(str(m))
            g.write(os.linesep)
    f.close()
    g.close()
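
One way to get rid of the duplicates is to treat each match as a word-index window and merge windows that overlap before writing anything out. A minimal sketch of that idea (the function and variable names are mine; it assumes keywords is a set of lowercase words and words is the list built in your code):

def merged_snippets(words, keywords, context=20):
    # collect a (start, end) word-index window around every keyword match
    windows = [(max(0, i - context), min(len(words), i + context + 1))
               for i, w in enumerate(words) if w.lower() in keywords]
    windows.sort()
    merged = []
    for start, end in windows:
        if merged and start <= merged[-1][1]:  # overlaps the previous window
            merged[-1][1] = max(merged[-1][1], end)  # extend it instead of starting a new one
        else:
            merged.append([start, end])
    return ["..." + " ".join(words[s:e]) + "..." for s, e in merged]

Each keyword match then appears in exactly one snippet, and matches closer together than the context size share a single longer snippet, so the same stretch of text is never printed twice.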

python - interpreting two spaces as one from text file

I'm trying to make a program that translates Morse code from a text file. In theory it should be pretty easy, but the problem is that I find the formatting of the text file a bit silly (it's school work, so I can't change that). What I mean is that in the file one space separates two characters (like this: -. ---) but two spaces mark the end of a word (so a space in the translated text). Like this: .--. .-.. . .- ... .  .... . .-.. .--.  .-.-.-
This is what I have, but it gives me translated text without the spaces.
translator = {}  # alphabet and the equivalent code, which I got from another file
message = []
translated = ""
msg_file = open("msg.txt", "r")
for line in msg_file:
    line = line.rstrip()
    part = line.rsplit(" ")
    message.extend(part)
for i in message:
    if i in translator.keys():
        translated += translator[i]
print(translated)
I also don't know how to handle the line break (\n).
Why don't you split on two spaces to get the words, then on space to get the characters? Something like:
translated = ""  # store for the translated text
with open("msg.txt", "r") as f:  # open your file for reading
    for line in f:  # read the file line by line
        words = line.split("  ")  # split by two spaces to get our words
        parsed = []  # storage for our parsed words
        for word in words:  # iterate over the words
            chars = []  # store for our translated characters
            for char in word.split(" "):  # get characters by splitting on one space, and iterate over them
                chars.append(translator.get(char, " "))  # translate the character
            parsed.append("".join(chars))  # join the characters and add the word to `parsed`
        translated += " ".join(parsed)  # join the parsed words and add them to `translated`
        # uncomment if you want to add a new line after each line of the file:
        # translated += "\n"
print(translated)  # print the translated string
Of course, all this assuming your translator dict has proper mapping.
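For reference, a minimal translator mapping that covers the example line (a hand-written sketch; in the real task the dict is built from your other file):

# just enough of the Morse table to decode the example message
translator = {
    ".--.": "P", ".-..": "L", ".": "E", ".-": "A",
    "...": "S", "....": "H", ".-.-.-": ".",
}
# ".--. .-.. . .- ... .  .... . .-.. .--.  .-.-.-" then decodes to "PLEASE HELP ."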
Split on a double space first to get a list of words in each line, then split the words on a single space to get the characters to feed your translator:
translator = {}  # alphabet and the equivalent code, which I got from another file
message = []
with open("msg.txt", "r") as msg_file:
    for line in msg_file:
        line = line.strip()
        words = line.split("  ")  # two spaces: word boundary
        decoded_words = []
        for word in words:
            characters = word.split()  # one space: character boundary
            decoded = []
            for char in characters:
                decoded.append(translator[char])
            decoded_words.append("".join(decoded))
        message.append(" ".join(decoded_words))
print("\n".join(message))

Printing 5 words before and after a specific word in a file in Python

I have a folder which contains some other folders, and these folders contain some text files. (The language is Persian.) I want to print the 5 words before and after a keyword, with the keyword in the middle. I wrote the code, but it gives the 5 words at the start and the end of the line, not the words around the keyword. How can I fix it?
Note: I've only included the end of the code, which relates to the question above. The start of the code opens and normalizes the files.
def c():
    y = "آرامش"
    text = normal_text(folder_path)  # the first function, to open and normalize the files
    for i in text:
        for line in i:
            if y in line:
                z = line.split()
                print(z[-6], z[-5], z[-4], z[-3], z[-2], z[-1], y,
                      z[+1], z[+2], z[+3], z[+4], z[+5], z[+6])
What I expect is something like this:
word word word word word keyword word word word word word
Each sentence on a new line.
Try this. It splits the line into words, then calculates how many words to show before and after (at most 5, or however many are available) and prints them:
words = line.split()
if y in words:
    index = words.index(y)
    before = index - min(index, 5)
    after = index + min(len(words) - 1 - index, 5) + 1
    print(words[before:after])
You need to get the word indices based on your keyword's index. You can use the list.index() method to get the intended index, then use simple slicing to get the expected words:
for f in normal_text(folder_path):
    for line in f:
        if keyword in line:
            words = line.split()
            index = words.index(keyword)
            print(words[max(0, index - 5):min(index + 6, len(words))])
Or, as a more memory-efficient approach, you can use a generator function to produce the word windows lazily:
def get_words(keyword):
    for f in normal_text(folder_path):
        for line in f:
            if keyword in line:
                words = line.split()
                index = words.index(keyword)
                yield words[max(0, index - 5):min(index + 6, len(words))]
Then you can simply loop over the results to print or process them:
y = "آرامش"
for words in get_words(y):
    print(words)  # or do other stuff
def c():
    y = "آرامش"
    text = normal_text(folder_path)  # the first function, to open and normalize the files
    for i in text:
        for line in i:
            split_line = line.split()
            if y in split_line:
                index = split_line.index(y)
                print(' '.join(split_line[max(0, index - 5):min(index + 6, len(split_line))]))
Assuming the keyword must be an exact word.
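If the keyword can carry trailing punctuation in the text (e.g. "آرامش،"), words.index() won't find it. A hedged tweak (the helper name is mine) is to strip punctuation before comparing:

def find_keyword(words, keyword):
    # return the index of the first word that equals the keyword once punctuation is stripped
    for i, w in enumerate(words):
        if w.strip('.,;:!?؟،') == keyword:  # includes the Arabic question mark and comma
            return i
    return -1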

Python: How to replace the last word of a line from a text file?

How would I go about replacing the last word of a specific line in a text file that has been loaded into Python? I know that if I accessed the line as a list of words I'd use [-1] to get the last one, but I don't know how to do it on a string. An example of the text file is:
A I'm at the shop with Bill.
B I'm at the shop with Sarah.
C I'm at the shop with nobody.
D I'm at the shop with Cameron.
If you want a more powerful editing option, Regex is your friend.
import re

# Matches the last word and captures any remaining non-word characters,
# so we don't lose punctuation. This includes newline chars.
pattern = re.compile(r'\w+(\W*)$')

line_num = 2        # the line number we want to operate on
new_name = 'Steve'  # who are we at the shops with? Not Allan.

with open('shopping_friends.txt') as f:
    lines = f.readlines()

# substitute the last word for your friend's name, keeping punctuation after it
lines[line_num] = re.sub(pattern, new_name + r'\1', lines[line_num])

# now do something with your modified data here
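
If you then want to persist the change, write the modified lines back out (a quick sketch reusing the filename from above; readlines() kept the newline characters, so writelines() is enough):

with open('shopping_friends.txt', 'w') as f:
    f.writelines(lines)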
Please check this; it's not 100% Pythonic, it's more of an overview:
file_text = '''I'm at the shop with Bill.
I'm at the shop with Sarah.
I'm at the shop with nobody.
I'm at the shop with Cameron.
I'm at the shop with nobody.'''

def rep_last_word_line(textin, search_for, replace_with, line_nr):
    if isinstance(textin, str):
        textin = textin.splitlines()
    else:
        # remove the endline from all lines - just in case
        textin = [ln.replace('\n', ' ').replace('\r', '') for ln in textin]
    if textin[line_nr] is not None:
        line = textin[line_nr].replace('\n', ' ').replace('\r', '')
        splited_line = line.split()
        last_word = splited_line[-1]
        if last_word[0:len(search_for)] == search_for:
            splited_line[-1] = last_word.replace(search_for, replace_with)
            textin[line_nr] = ' '.join(splited_line)
    return '\r\n'.join(textin)

print(rep_last_word_line(file_text, 'nobody', 'Steve', 2))
print('=' * 80)
print(rep_last_word_line(file_text, 'nobody', 'Steve', 4))
print('=' * 80)

# read content from a file instead
f = open('in.txt', 'r')
# file_text = f.read().splitlines()  # read the text, then use str.splitlines to split lines without the endline character
file_text = f.readlines()  # this will keep the endline character
print(rep_last_word_line(file_text, 'nobody', 'Steve', 2))
If you have blank lines, you might want to try splitlines():
lines = your_text.splitlines()
lastword = lines[line_to_replace].split()[-1]
lines[line_to_replace] = lines[line_to_replace].replace(lastword, word_to_replace_with)
Keep in mind that your changes are in lines now and not your_text, so if you want your changes to persist, you'll have to write out lines instead (note that splitlines() strips the newline characters, so they are added back here):
with open('shops.txt', 'w') as s:
    for l in lines:
        s.write(l + '\n')
Assuming you have a file "example.txt":
with open("example.txt") as myfile:
    mylines = list(myfile)
lastword = mylines[2].split()[-1]
mylines[2] = mylines[2].replace(lastword, "Steve.")
(Edit: fixed an off-by-one error... assuming that by "3rd line" he means human-style counting rather than zero-indexed.)
(Note that the with statement yields a file object which will be automatically closed even if, for example, the length of myfile is less than 3; this file object is also an iterator, which gets converted into a plain list so you can pick a specific line.)
