The requirement for the code is as follows.
There is a text file called "lts_en_us_12.9.0_phonRules_Default copy.txt" that contains a set of phonology rules like the ones below (these are the first several rows of the file).
word {. .} {{;}} {rewrite}
word {{#} .} {{$0} {;}} {rewrite}
word {{;} .} {{$0} {;}} {rewrite}
word {.pau .} {{$0} {;}} {rewrite}
word {. #} {{;} {$1}} {rewrite}
word {. {;}} {{;} {$1}} {rewrite}
word {. .pau} {{;} {$1}} {rewrite}
word {{;} {;}} {{;}} {rewrite}
aux-verb-reduction {{;} k aI1 n d {;} {^1 &} v {;}} {{$0} k aI1 n . d & {$8}} {optional}
aux-verb-reduction {{;} w A1 n t {;} t {u1 &} {;}} {{$0} w A1 . n & {$8}} {optional}
aux-verb-reduction {{;} l E1 t {;} m i1 {;}} {{$0} l E1 . m i {$7}} {optional}
aux-verb-reduction {{;} g I1 v {;} m i1 {;}} {{$0} g I1 . m i {$7}} {optional}
Our goal is to build a text file called symbols.txt. This can be done by iterating over the set of phonology rules and keeping a running list of every symbol that occurs in them. For instance, if the rules are just
{aspiration} {; k} {; k_h} {rewrite}
{low_vowel} {A} {#} {optional}
{unreleased} {b} {b_c} {rewrite}
then the symbols file would be
ε 0
. 1
# 2
A 3
b 4
b_c 5
k 6
k_h 7
Note the top 2 lines here: the epsilon and the . are going to be included at the top, regardless. These are default symbols, whether or not they ever occur in the rules, and have to be included in your symbols.txt.
So, for the real file, the expected output should be:
ε 0
. 1
. .2
; 3
# . 4
# ; 5 # (Note: the second pound sign on this line starts a comment.) Here "$0" becomes "#" because a dollar sign means: take the i-th element of the left-hand {} group and put it where the dollar sign sits. In this case, "$0" means the 0th element of the left-hand group goes in the position of "$0".
; . 6
; .7
. pau . 8
. ; 9
. # 10
; # 11
. ; 12
; ; 13
. .pau 14
; . 15
I am new to cs and coding, so I asked Chatgpt3 to help and here is the code Chatgpt3 generated:
symbols = [("ε", 0), (".", 1)]
with open("lts_en_us_12.9.0_phonRules_Default copy.txt", "r") as f:
    for line in f:
        parts = line.strip().split(" ")
        for part in parts:
            if part not in [s[0] for s in symbols]:
                symbols.append((part, len(symbols)))
with open("symbols.txt", "w") as f:
    for symbol in symbols:
        f.write(f"{symbol[0]} {symbol[1]}\n")
But the problem is that the output of that code is:
ε 0
. 1
word 2
{. 3
.} 4
{{;}} 5
{rewrite} 6
{{#} 7
{{$0} 8
{;}} 9
{{;} 10
{.pau 11
#} 12
{$1}} 13
.pau} 14
aux-verb-reduction 15
k 16
aI1 17
n 18
d 19
This is very different from the expected output. To be specific, we don't want "word" and "rewrite": we only want the symbols from the middle two parts of each rule, not the first part and the last part. In addition, the output should be listed in alphabetical order and indexed as in the example output.
So, how could I do that? Or how can I edit this version of the code to meet the requirement?
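One possible approach, as a rough sketch rather than a definitive answer: split each rule into its top-level fields, keep only the middle two, strip the braces, and sort what is left. The helper names `top_level_fields` and `collect_symbols` and the `skip` parameter are made up for this illustration. It assumes that `$0`, `$1`, ... are back-references rather than symbols, and it leaves the treatment of `;` to the caller, because the small example above omits `;` from its symbols file while the expected output for the real file lists it.

```python
def top_level_fields(line):
    """Split a rule on whitespace that is not inside braces."""
    fields, current, depth = [], [], 0
    for ch in line.strip():
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch.isspace() and depth == 0:
            if current:
                fields.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        fields.append("".join(current))
    return fields


def collect_symbols(rules, skip=()):
    """Return [(symbol, index), ...] with ε and . first, then the rest sorted."""
    found = set()
    for rule in rules:
        fields = top_level_fields(rule)
        if len(fields) < 4:
            continue  # not a complete rule line
        for field in fields[1:3]:  # the middle two parts only
            for token in field.replace("{", " ").replace("}", " ").split():
                if not token.startswith("$"):  # assumed: $n is a back-reference
                    found.add(token)
    found -= {"ε", "."} | set(skip)
    ordered = ["ε", "."] + sorted(found)
    return [(symbol, index) for index, symbol in enumerate(ordered)]


rules = [
    "{aspiration} {; k} {; k_h} {rewrite}",
    "{low_vowel} {A} {#} {optional}",
    "{unreleased} {b} {b_c} {rewrite}",
]
for symbol, index in collect_symbols(rules, skip={";"}):
    print(symbol, index)
```

To use it on the real file, pass the open file object as `rules` and write each pair to symbols.txt (opened with encoding="utf-8" so that ε is written correctly).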
Related
I have a text file containing pairs of an integer and a string. I want to sort them by the integer, largest first, but I am not getting the correct result.
TextFile.txt
87,Toronto
45,USA
45,PAKISTAN
33,India
38,Jerry
30,Tom
23,Jim
7,Love
38,Hate
30,Stress
My code
def sort_outputFiles():
    print('********* Sorting now **************')
    my_file = open("TextFile.txt", "r",encoding='utf-8')
    data = my_file.read()
    data_into_list = data.split("\n")
    my_file.close()
    score=[]
    links=[]
    for d in data_into_list:
        if d != '':
            s=d.split(',')
            score.append(s[0])
            links.append(s[1])
    n = len(score)
    for i in range(n):
        for j in range(0, n-i-1):
            if score[j] < score[j+1]:
                score[j], score[j+1] = score[j+1], score[j]
                links[j], links[j+1] = links[j+1], links[j]
    for l,s in zip(links,score):
        print(l," ",s)
My output
********* Sorting now **************
Toronto 87
Love 7
USA 45
PAKISTAN 45
Jerry 38
Hate 38
India 33
Tom 30
Stress 30
Jim 23
Expected output
********* Sorting now **************
Toronto 87
USA 45
PAKISTAN 45
Jerry 38
Hate 38
India 33
Tom 30
Stress 30
Jim 23
Love 7
There is an error at line 2 of the output: that entry should come last.
You are comparing strings, not numbers.
In a dictionary (the physical book), words are sorted by lowest first letter; if that is a tie, by lowest second letter, and so on. This is called lexicographical order.
So the string "aaaa" < "ab". Likewise, "1111" < "12", and in your data "7" > "45".
To fix this, convert the string to a number (use int(s[0]) instead of s[0] in the score.append call).
Then 1111 > 12 and 7 < 45, and your code will give the correct result.
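A small illustration of the difference, using a few of the scores from the question (the variable names are just for this example):

```python
scores = ["87", "45", "33", "7"]

# Compared as strings: character by character, so "7" sorts above "45".
as_text = sorted(scores, reverse=True)

# Compared as numbers: key=int converts each string before comparing.
as_numbers = sorted(scores, key=int, reverse=True)

print(as_text)     # ['87', '7', '45', '33']
print(as_numbers)  # ['87', '45', '33', '7']
```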
You can use Python's sorted() function to sort the list. If you use the key parameter, you can specify a custom sorting behaviour, and if you use the reverse parameter, you can sort in descending order.
Also, you can use the csv module to make reading your input file easier.
import csv
with open("TextFile.txt", "r", encoding="utf-8", newline="") as csvfile:
    lines = list(csv.reader(csvfile))
for line in sorted(lines, key=lambda l: int(l[0]), reverse=True):
    print(f"{line[1]} {line[0]}")
Output:
Toronto 87
USA 45
PAKISTAN 45
Jerry 38
Hate 38
India 33
Tom 30
Stress 30
Jim 23
Love 7
Not sure about your intentions, but a compact implementation would be something like:
with open('TextFile.txt', 'r') as f:
    # split each non-empty line into [score, name]; strip() removes the newline
    d = [l.strip().split(',') for l in f if l.strip()]
d = [(dd[1], int(dd[0])) for dd in d]
d_sorted = sorted(d, key=lambda x: x[1], reverse=True)
print(d_sorted)
Newer programmer here, deeply appreciate any help this knowledgeable community is willing to provide.
I have a column of 140,000 text strings (company names) in a pandas dataframe. I want to strip all whitespace in and around the strings, remove all punctuation, substitute specific substrings, and convert everything to lowercase. I then want to take the first 10 characters of each string and store them in a new dataframe column.
Here is a reproducible example.
import string
import pandas as pd
data = ["West Georgia Co",
"W.B. Carell Clockmakers",
"Spine & Orthopedic LLC",
"LRHS Saint Jose's Grocery",
"Optitech#NYCityScape"]
df = pd.DataFrame(data, columns = ['co_name'])
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text
# applying remove_punctuations function
df['co_name_transform'] = df['co_name'].apply(remove_punctuations)
# this next step replaces 'Saint' with 'st' to standardize,
# and I may want to make other substitutions but this is a common one.
df['co_name_transform'] = df.co_name_transform.str.replace('Saint', 'st')
# replace whitespace
df['co_name_transform'] = df.co_name_transform.str.replace(' ', '')
# make lowercase
df['co_name_transform'] = df.co_name_transform.str.lower()
# select first 0:10 of strings
df['co_name_transform'] = df.co_name_transform.str[0:10]
print(df)
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
How can I put all these steps into a single function like this?
def clean_text(df[col]):
    for co in co_name:
        do_all_the_steps
    return df[new_col]
Thank you
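You don't need a function to do this. Try the following one-liner (on recent pandas versions, str.replace treats the pattern as a literal string unless regex=True is passed).
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]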
Final output will be.
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
You can do all the steps in the function you pass to the apply method:
import re
df['co_name_transform'] = df['co_name'].apply(lambda s: re.sub(r'[\W_]+', '', s).replace('Saint', 'st').lower()[:10])
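If you do want everything in one named function, as the question asks, a minimal sketch could look like the following. The name `clean_name` and its `substitutions` and `length` parameters are made up for this example, and it assumes the substitutions should be applied before lowercasing, as in the question's own steps.

```python
import re


def clean_name(text, substitutions=None, length=10):
    """Strip punctuation and whitespace, substitute, lowercase, truncate."""
    if substitutions is None:
        substitutions = {"Saint": "st"}
    # Remove everything that is not a letter or digit (also drops spaces).
    text = re.sub(r"[\W_]+", "", text)
    # Apply the case-sensitive substitutions before lowercasing.
    for old, new in substitutions.items():
        text = text.replace(old, new)
    return text.lower()[:length]


for name in ["West Georgia Co", "W.B. Carell Clockmakers",
             "LRHS Saint Jose's Grocery", "Optitech#NYCityScape"]:
    print(name, "->", clean_name(name))
```

With the dataframe from the question, this would be applied as df['co_name'].apply(clean_name).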
Another solution, similar to the previous one, but with the list of "to_replace" in one dictionary, so you can add more items to replace. Also, the previous solution won't give the first 10.
data = ["West Georgia Co",
"W.B. Carell Clockmakers",
"Spine & Orthopedic LLC",
"LRHS Saint Jose's Grocery",
"Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape"]
df = pd.DataFrame(data, columns = ['co_name'])
to_replace = {'[^A-Za-z0-9-]+':'','Saint':'st'}
for i in to_replace :
    df['co_name'] = df['co_name'].str.replace(i,to_replace[i], regex=True).str.lower()
df['co_name'][0:10]
Result :
0 westgeorgiaco
1 wbcarellclockmakers
2 spineorthopedicllc
3 lrhssaintjosesgrocery
4 optitechnycityscape
5 optitechnycityscape
6 optitechnycityscape
7 optitechnycityscape
8 optitechnycityscape
9 optitechnycityscape
Name: co_name, dtype: object
Previous solution (won't show the first 10)
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
Result :
0 westgeorgi
1 wbcarellcl
2 spineortho
3 lrhssaintj
4 optitechny
5 optitechny
6 optitechny
7 optitechny
8 optitechny
9 optitechny
10 optitechny
11 optitechny
12 optitechny
Name: co_name_transform, dtype: object
I am attempting to make a Caesar cipher that changes the key for each letter. I currently have a working cipher that shifts the entire string by a single key (I run it for every key in turn), but I would like it to use a different shift for each letter: for the string "ABC", it would shift A by 1, B by 2 and C by 3, resulting in BDF.
I already have a working cipher and am just not sure how to make the shift change for each letter.
import collections
import string

def caesar(rotate_string, number_to_rotate_by):
    upper = collections.deque(string.ascii_uppercase)
    lower = collections.deque(string.ascii_lowercase)
    upper.rotate(number_to_rotate_by)
    lower.rotate(number_to_rotate_by)
    upper = ''.join(list(upper))
    lower = ''.join(list(lower))
    return rotate_string.translate(str.maketrans(string.ascii_uppercase, upper)).translate(str.maketrans(string.ascii_lowercase, lower))
#print (caesar("This is simple", 2))
our_string = "ABC"
for i in range(len(string.ascii_uppercase)):
    print (i, "|", caesar(our_string, i))
Outcome is this:
0 | ABC
1 | ZAB
2 | YZA
3 | XYZ
4 | WXY
5 | VWX
6 | UVW
7 | TUV
8 | STU
9 | RST
10 | QRS
11 | PQR
12 | OPQ
13 | NOP
14 | MNO
15 | LMN
16 | KLM
17 | JKL
18 | IJK
19 | HIJ
20 | GHI
21 | FGH
22 | EFG
23 | DEF
24 | CDE
25 | BCD
What I would like is to have it a shift of 1 or 0 for the first letter, then 2 for the second, and so on.
Good effort! Note that the mapping you want doesn't just rearrange the letters of the alphabet, so it can never be achieved by rotating the alphabet. In your example, upper would become the following mapping:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
BDFHJLNPRTVXZBDFHJLNPRTVXZ
Also note this cipher is not easily reversible, i.e. it's not clear whether to reverse 'B'->'A' or 'B'->'N'.
(Side note: If we treat the letters ZABCDEFGHIJKLMNOPQRSTUVWXY as the numbers 0-25, this cipher multiplies by two (modulo 26): (x*2)%26. If instead of 2 we multiply by any number that is divisible by neither 2 nor 13, the resulting cipher will always be reversible. Can you see why? Hints: [1], [2].)
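The side note can be checked by brute force. The sketch below (the helper names are made up for this illustration) lists the values a multiplier produces and calls the cipher reversible when all 26 are distinct; the comparison with math.gcd shows that this happens exactly when the multiplier shares no factor with 26.

```python
from math import gcd


def multiplied_alphabet(multiplier, modulus=26):
    """Values (x * multiplier) % modulus for x = 0 .. modulus - 1."""
    return [(x * multiplier) % modulus for x in range(modulus)]


def is_reversible(multiplier, modulus=26):
    """True if no two inputs are mapped to the same output."""
    return len(set(multiplied_alphabet(multiplier, modulus))) == modulus


for m in (2, 3, 13):
    print(m, is_reversible(m), gcd(m, 26))
```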
When you feel confused about a piece of code, often it's a good sign it's time to refactor a part of it into a separate function, e.g. like this:
(Playground: https://ideone.com/wNSADR)
import string

def letter_index(letter):
    """Determines the position of the given letter in the English alphabet
    'a' -> 0
    'A' -> 0
    'z' -> 25
    """
    if letter not in string.ascii_letters:
        raise ValueError("The argument must be an English letter")
    if letter in string.ascii_lowercase:
        return ord(letter) - ord('a')
    return ord(letter) - ord('A')

def caesar(s):
    """Ciphers the string s by shifting 'A'->'B', 'B'->'D', 'C'->'F', etc
    The shift is cyclic, i.e. 'A' comes after 'Z'.
    """
    ret = ""
    for letter in s:
        index = letter_index(letter)
        new_index = 2*index + 1
        if new_index >= len(string.ascii_lowercase):
            # The letter is shifted farther than 'Z'
            new_index %= len(string.ascii_lowercase)
        # Subtracting index gives 'a' or 'A', so the case is preserved
        new_letter = chr(ord(letter) - index + new_index)
        ret += new_letter
    return ret

print('caesar("ABC"):', caesar("ABC"))
print('caesar("abc"):', caesar("abc"))
print('caesar("XYZ"):', caesar("XYZ"))
Output:
caesar("ABC"): BDF
caesar("abc"): bdf
caesar("XYZ"): VXZ
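If the intent is instead to shift each letter by its position in the string (first letter by 1, second by 2, and so on), which is another way to read the question, a sketch could look like this. The name `shift_by_position` and the `first_shift` parameter are hypothetical; non-letters are passed through unchanged but still count as a position.

```python
import string


def shift_by_position(text, first_shift=1):
    """Shift the n-th character by first_shift + n, wrapping after 'Z'/'z'."""
    result = []
    for position, letter in enumerate(text):
        shift = first_shift + position
        if letter in string.ascii_uppercase:
            base = ord('A')
        elif letter in string.ascii_lowercase:
            base = ord('a')
        else:
            result.append(letter)  # leave spaces, digits, etc. alone
            continue
        result.append(chr(base + (ord(letter) - base + shift) % 26))
    return "".join(result)


print(shift_by_position("ABC"))                 # BDF
print(shift_by_position("ABC", first_shift=0))  # ACE
```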
Resources:
chr
ord
Hi, I have a file which contains the data shown below. I want to replace the integers that occur after 'A' (fourth column), i.e. 2, 3, 15, 25, 115, 1215, with other integers which I have in a dictionary (key, value). The number of whitespace characters after 'A' ranges from 0 to 3. I tried the str.replace(old, new) method in Python, but it replaces every instance of those integers in the file.
This is the replacement i want to do inside the file.
replacements = {2:0,3:5,15:7,25:30,115:120,1215:1220}
Name 1 N ASHA A 2 35 23
Name 2 R MONA A 3 25 56
Name 3 P TERY A 15 23 32
Name 4 Q JACK A 25 56 25
Name 5 D TOM A 115 57 45
Name 3 P SEN A1215 45 56
Suggest me some ways to do it.
replacements = {2:0,3:5,15:7,25:30,115:120,1215:1220}
s="""Name 1 N ASHA A 2 35 23
Name 2 R MONA A 3 25 56
Name 3 P TERY A 15 23 32
Name 4 Q JACK A 25 56 25
Name 5 D TOM A 115 57 45
Name 3 P SEN A1215 45 56"""
res = []
for line in s.splitlines():
    spl = line.split()
    if len(spl) == 8:
        ints = map(int,spl[-3:])
        res.append(" ".join(spl[:-3]+[str(replacements.get(k, str(k))) for k in ints]))
    else:
        spl[-3] = spl[-3].replace("A","")
        ints = map(int,spl[-3:])
        res.append(" ".join(spl[:-3]+["A"]+[str(replacements.get(k, str(k))) for k in ints]))
print(res)
['Name 1 N ASHA A 0 35 23', 'Name 2 R MONA A 5 30 56', 'Name 3 P TERY A 7 23 32', 'Name 4 Q JACK A 30 56 30', 'Name 5 D TOM A 120 57 45', 'Name 3 P SEN A 1220 45 56']
I'm not sure whether you want to use the data or write it to a file, but if your file looks like your example, this replaces the numbers using the dict. If the length of the split line is different, we know a number is attached to the A without a space, so we separate the two before replacing.
The output always has a space after the A, so if you write it to a file and have to work on that file again, it will be a lot easier.
I would just remove the map and use strings as keys and values unless you actually want ints.
If you want to keep the exact same format and only want to change the first number:
replacements = {"2":"0","3":"5","15":"7","25":"30","115":"120","1215":"1220"}
s="""Name 1 N ASHA A 2 35 23
Name 2 R MONA A 3 25 56
Name 3 P TERY A 15 23 32
Name 4 Q JACK A 25 56 25
Name 5 D TOM A 115 57 45
Name 3 P SEN A1215 45 56"""
res = []
for line in s.splitlines():
    spl = line.rsplit(None, 3)
    end = spl[-3:]
    if "A" == end[0][0]:
        k = end[0][1:]
        res.append(line.replace(k,replacements.get(k,k)))
    else:
        k = end[0]
        res.append(line.replace(k,replacements.get(k,k)))
print(res)
['Name 1 N ASHA A 0 35 03', 'Name 2 R MONA A 5 25 56', 'Name 3 P TERY A 7 23 32', 'Name 4 Q JACK A 30 56 30', 'Name 5 D TOM A 120 57 45', 'Name 3 P SEN A1220 45 56']
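Note that line.replace changes every occurrence of the digits in the line, which is why 23 became 03 and the last 25 became 30 in the output above. A sketch that touches only the number directly after the standalone 'A', using a regular expression, might look like this. The names `after_a` and `replace_after_a` are made up for this example, and it assumes no other column consists of a word-initial 'A' followed by digits. The whitespace after 'A' is kept as is, so column widths can shift when the new number has a different length.

```python
import re

replacements = {"2": "0", "3": "5", "15": "7",
                "25": "30", "115": "120", "1215": "1220"}

# A at the start of a word, optional whitespace, then the number to look up.
after_a = re.compile(r"(\bA\s*)(\d+)")


def replace_after_a(line):
    def swap(match):
        prefix, number = match.groups()
        # Unknown numbers are left unchanged.
        return prefix + replacements.get(number, number)
    return after_a.sub(swap, line, count=1)


for line in ["Name 1 N ASHA A 2 35 23",
             "Name 4 Q JACK A 25 56 25",
             "Name 3 P SEN A1215 45 56"]:
    print(replace_after_a(line))
```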
Edited based on additional info regarding all other numbers.
This is entirely dependent on the specific characteristics of your file that you mention in your comments.
replacements = {2:0,3:5,15:7,25:30,115:120,1215:1220}
with open('input.txt', 'r') as fin, open('output.txt', 'w') as fout:
    pos_a = 22 # 0-indexed position of 'A' in every line
    for line in fin:
        left_side = line[:pos_a + 1]
        num_to_convert = line[pos_a + 1: pos_a + 5]
        right_side = line[pos_a + 5:]
        # String formatting to preserve padding as per original file
        newline = '{}{:>4}{}'.format(left_side,
                                     replacements[int(num_to_convert)],
                                     right_side)
        fout.write(newline)
If there's a possibility that one of the values in the column will not be in your replacements dict, and you want to keep that value unchanged, then instead of replacements[int(num_to_convert)], use replacements.get(int(num_to_convert), num_to_convert)
Regex101
^[\w\d\s]{23}([\d\s]{1,4}).*$
Debuggex Demo
Note: This is more of a fixed length parsing
Python
import re
replacements = {2:0,3:5,15:7,25:30,115:120,1215:1220}
searchString = "Name 1 N ASHA A 2 35 23 "
replace_search = re.search(r'^[\w\d\s]{23}([\d\s]{1,4}).*$', searchString, re.IGNORECASE)
if replace_search:
    result = replace_search.group(1)
    convert_result = int(result)
    dictionary_lookup = int(replacements[convert_result])
    # right-align the new number in a 4-character field
    replace_result = '% 4d' % dictionary_lookup
    regex_replace = r"\g<1>" + replace_result + r"\g<3>"
    line = re.sub(r"^([\w\d\s]{23})([\d\s]{1,4})(.*)$", regex_replace, searchString)
    print(line)
I have two files. One (e-number.txt) that contains a long species list and some info on every species and one (artsliste.txt) that contains species from a certain location.
I want to extract the info in e-number.txt for all the species listed in artsliste.txt.
In short: print the corresponding line.
I feel like I am close, and feel like it can't be too hard, but I might have started out all wrong.
The latest code I have:
ellenberg=open('e-number.txt').read()
arter=open('artsliste.txt','r')
for line in arter:
    art = arter.readline()
    if art in ellenberg:
        print(ellenberg)
artsliste.txt contains stuff like this:
Acer pseudoplatanus
Acer platanoides
Aeculus hippocastaneum
Adoxa moschatellina
Sambucus nigra
Aegopodium podagraria
Anthriscus sylvestris
e-number.txt contains stuff like this:
Acaena novae-zelandiae 2527 . 8 . 3 . 6 . 3 . 0 Acae nova Acaena novae-zelandiae
Acer campestre 3 5 5 5 5 7 7 6 6 0 0 Acer camp Acer campestre
Acer platanoides 4 4 4 . 5 . 7 . 7 0 0 Acer plat Acer platanoides
Acer pseudoplatanus 5 4 4 6 5 . 6 7 6 0 0 Acer pseu Acer pseudoplatanus
I would like my output to look like:
Acer pseudoplatanus 5 4 4 6 5 . 6 7 6 0 0 Acer pseu Acer pseudoplatanus
Oenanthe crocata 1363 . 7 . 8 . 6 . 7 . 1 Oenanthe crocata
Trifolium medium 2087 7 7 4 4 6 6 3 4 0 0 Trifolium medium
I feel like there must be a function that can print the line that was found. Otherwise I guess I have to set up another search function inside the one I already have, and that doesn't make any sense.
Hope someone can get me in the right direction.
Best regards.
From the readline documentation:
f.readline() reads a single line from the file; a newline character
(\n) is left at the end of the string
art = arter.readline().strip()
should help.
UPDATED according to your comment
Try this:
for line in arter:
    # the loop already yields each line; calling readline() here would skip lines
    art = line.strip()
    index = ellenberg.find(art)
    if art and index > -1:
        line_end_index = ellenberg.find('\n', index)
        print(ellenberg[index:line_end_index])
ONE MORE UPDATE:
This code will print full relevant line only if line starts with art else it will print chunk from entry point of art to the end of line.
To print full line you could use following code:
ellenberg=open('e-number.txt').readlines()
arter=open('artsliste.txt','r')
for line in arter:
    line = line.strip()
    if len(line) > 0:
        for ellenberg_line in ellenberg:
            if line in ellenberg_line:
                # each line already ends with a newline
                print(ellenberg_line, end='')
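As a possible alternative, here is a sketch that reads the species list once and makes a single pass over e-number.txt. It assumes every line in e-number.txt starts with the species name, as in the sample shown in the question; the function name `matching_lines` is made up for this example.

```python
def matching_lines(species_names, ellenberg_lines):
    """Return the e-number lines that start with one of the species names."""
    # str.startswith accepts a tuple and checks every entry in it.
    names = tuple(name.strip() for name in species_names if name.strip())
    return [line.rstrip("\n") for line in ellenberg_lines
            if line.startswith(names)]


species = ["Acer pseudoplatanus\n", "Acer platanoides\n", "\n", "Sambucus nigra\n"]
ellenberg = [
    "Acaena novae-zelandiae 2527 . 8 . 3 . 6 . 3 . 0 Acae nova Acaena novae-zelandiae\n",
    "Acer campestre 3 5 5 5 5 7 7 6 6 0 0 Acer camp Acer campestre\n",
    "Acer platanoides 4 4 4 . 5 . 7 . 7 0 0 Acer plat Acer platanoides\n",
    "Acer pseudoplatanus 5 4 4 6 5 . 6 7 6 0 0 Acer pseu Acer pseudoplatanus\n",
]
for line in matching_lines(species, ellenberg):
    print(line)
```

With the real files, both arguments can be open file objects, e.g. matching_lines(open('artsliste.txt'), open('e-number.txt')). The matches come out in the order of e-number.txt, not in the order of the species list.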