How to make a translating function in Python?

I want to ask about translating a string using Python. I have a CSV file that contains an abbreviation dictionary like this:
before, after
ROFL, Rolling on floor laughing
STFU, Shut the freak up
LMK, Let me know
...
I want to replace every word in a string that appears in the "before" column with the matching entry in the "after" column. I tried the following code, but it doesn't change anything.
def replace_abbreviation(tweet):
    dictionary = pd.read_csv("dict.csv", encoding='latin1')
    dictionary['before'] = dictionary['before'].apply(lambda val: unicodedata.normalize('NFKD', val).encode('ascii', 'ignore').decode())
    tmp = dictionary.set_index('before').to_dict('split')
    tweet = tweet.translate(tmp)
    return tweet
For example:
Input = "lmk your test result please"
Output = "let me know your test result please"

You can read the contents to a dict and then use the following code.
res = {}
with open('dict.csv') as file:
    next(file) # skip the first line "before, after"
    for line in file:
        k, v = line.strip().split(', ')
        res[k] = v

def replace(tweet):
    return ' '.join(res.get(x.upper(), x) for x in tweet.split())

print(replace('stfu and lmk your test result please'))
Output
Shut the freak up and Let me know your test result please
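A likely reason the code in the question changed nothing is that str.translate maps single characters (Unicode code points), not whole words, so a table keyed by abbreviations is never consulted. Below is a minimal sketch of a word-level alternative. The ABBREVIATIONS dict and the expand_abbreviations name are made up for illustration and stand in for the CSV contents; the regex callback is a swapped-in technique that also handles abbreviations followed by punctuation, which a plain split() would miss.

```python
import re

# Hypothetical in-memory copy of the CSV rows shown in the question.
ABBREVIATIONS = {
    "ROFL": "Rolling on floor laughing",
    "STFU": "Shut the freak up",
    "LMK": "Let me know",
}

def expand_abbreviations(tweet, mapping=ABBREVIATIONS):
    """Replace every whole word found in mapping (case-insensitively)."""
    def lookup(match):
        word = match.group()
        # Fall back to the original word when it is not an abbreviation.
        return mapping.get(word.upper(), word)
    return re.sub(r"\w+", lookup, tweet)

print(expand_abbreviations("lmk your test result, rofl!"))
# prints: Let me know your test result, Rolling on floor laughing!
```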

Related

Get the full word(s) by knowing only just a part of it

I am searching through a text file line by line and I want to get back all strings that contain the prefix AAAXX1234. For example, my text file has these lines:
Hello my ID is [123423819::AAAXX1234_3412] #I want that(AAAXX1234_3412)
Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212] #I want both of them(AAAXX1234_3413, AAAXX1234_4212)
Hello my ID is [123423819::XXWWF1234_3098] #I don't care about that
The code I have just checks whether the line starts with "Hello my ID is":
with open(file_hrd,'r',encoding='utf-8') as hrd:
    hrd=hrd.readlines()
    for line in hrd:
        if line.startswith("Hello my ID is"):
            #do something
Try this:
import re
with open(file_hrd,'r',encoding='utf-8') as hrd:
    res = []
    for line in hrd:
        # raw string so that \d reaches the regex engine unchanged
        res += re.findall(r'AAAXX1234_\d+', line)
print(res)
Output:
['AAAXX1234_3412', 'AAAXX1234_3413', 'AAAXX1234_4212']
I'd suggest parsing your lines and extracting the information into meaningful parts. That way, you can use a simple startswith on the ID part of your line. This also lets you control where you look for these prefixes, e.g. in case the lines contain additional data that could theoretically include something that looks like an ID.
Something like this:
if line.startswith('Hello my ID is '):
    idx_start = line.index('[')
    idx_end = line.index(']', idx_start)
    idx_separator = line.index(':', idx_start, idx_end)
    num = line[idx_start + 1:idx_separator]
    ids = line[idx_separator + 2:idx_end].split(':')
    print(num, ids)
This would give you the following output for your three example lines:
123423819 ['AAAXX1234_3412']
738281937 ['AAAXX1234_3413', 'AAAXX1234_4212']
123423819 ['XXWWF1234_3098']
With that information, you can then check the ids for a prefix:
if any(x.startswith('AAAXX1234') for x in ids):
    print('do something')
Using regular expressions through the re module and its findall() function should be enough:
import re
with open('file.txt') as file:
    prefix = 'AAAXX1234'
    lines = file.read().splitlines()
    output = list()
    for line in lines:
        # rf'' keeps \d raw while still interpolating the prefix
        output.extend(re.findall(rf'{prefix}_[\d]+', line))
You can do it with findall and the regex r'AAAXX1234_[0-9]+'. It finds every part of the string that starts with AAAXX1234_ and then grabs all of the digits after it. Change + to * if you want it to match 'AAAXX1234_' on its own as well.
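The regex answers above can be wrapped in a small reusable function. This is only a sketch: the find_ids name and the in-memory sample list are assumptions for illustration, and re.escape plus the leading \b are additions that guard against prefixes containing regex metacharacters or matching in the middle of a longer token.

```python
import re

def find_ids(lines, prefix="AAAXX1234"):
    """Collect every token of the form <prefix>_<digits> from the given lines."""
    # re.escape protects against prefixes that contain regex metacharacters.
    pattern = re.compile(rf"\b{re.escape(prefix)}_\d+")
    found = []
    for line in lines:
        found.extend(pattern.findall(line))
    return found

# Sample lines copied from the question.
sample = [
    "Hello my ID is [123423819::AAAXX1234_3412]",
    "Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212]",
    "Hello my ID is [123423819::XXWWF1234_3098]",
]
print(find_ids(sample))
# prints: ['AAAXX1234_3412', 'AAAXX1234_3413', 'AAAXX1234_4212']
```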

output differences between 2 texts when lines are dissimilar

I am relatively new to Python, so apologies in advance for sounding a bit ditzy sometimes. I'll try to google and attempt your tips as much as I can before asking even more questions.
Here is my situation: I am working with R and stylometry to find out the (likely) authorship of a text. What I'd like to do is see if there is a difference in the stylometry of a novel in the second edition, after one of the (assumed) co-authors died and therefore could not have contributed. In order to research that I need
Text edition 1
Text edition 2
and for python to output
words that appear in text 1 but not in text 2
words that appear in text 2 but not in text 1
And I would like to have the words each time they appear so not just 'the' once, but every time the program encounters it when it differs from the first edition (yep I know I'm asking for a lot sorry)
I have tried approaching this via
file1 = open("FRANKENST18.txt", "r")
file2 = open("FRANKENST31.txt", "r")
file3 = open("frankoutput.txt", "w")
list1 = file1.readlines()
list2 = file2.readlines()
file3.write("here: \n")
for i in list1:
    for j in list2:
        if i==j:
            file3.write(i)
but of course this doesn't work because the texts are two giant balls of texts and not separate lines that can be compared, plus the first text has far more lines than the second one. Is there a way to go from lines to 'words' or the text in general to overcome that? Can I put an entire novel in a string lol? I assume not.
I have also attempted to use difflib, but I've only started coding a few weeks ago and I find it quite complicated. For example, I used fraxel's script as a base for:
from difflib import Differ
s1 = open("FRANKENST18.txt", "r")
s1 = open("FRANKENST31.txt", "r")
def appendBoldChanges(s1, s2):
    #"Adds <b></b> tags to words that are changed"
    l1 = s1.split(' ')
    l2 = s2.split(' ')
    dif = list(Differ().compare(l1, l2))
    return " ".join(['<b>'+i[2:]+'</b>' if i[:1] == '+' else i[2:] for i in dif
                     if not i[:1] in '-?'])
print appendBoldChanges
but I couldn't get it to work.
So my question is is there any way to output the differences between texts that are not similar in lines like this? It sounded quite do-able but I've greatly underestimated how difficult I found Python haha.
Thanks for reading, any help is appreciated!
EDIT: posting my current code just in case it might help fellow learners that are googling for answers:
file1 = open("1stein.txt")
originaltext1 = file1.read()
wordlist1={}
import string
text1 = [x.strip(string.punctuation) for x in originaltext1.split()]
text1 = [x.lower() for x in text1]
for word1 in text1:
    if word1 not in wordlist1:
        wordlist1[word1] = 1
    else:
        wordlist1[word1] += 1
for k,v in sorted(wordlist1.items()):
    #print "%s %s" % (k, v)
    col1 = ("%s %s" % (k, v))
    print col1
file2 = open("2stein.txt")
originaltext2 = file2.read()
wordlist2={}
import string
text2 = [x.strip(string.punctuation) for x in originaltext2.split()]
text2 = [x.lower() for x in text2]
for word2 in text2:
    if word2 not in wordlist2:
        wordlist2[word2] = 1
    else:
        wordlist2[word2] += 1
for k,v in sorted(wordlist2.items()):
    #print "%s %s" % (k, v)
    col2 = ("%s %s" % (k, v))
    print col2
what I hope still to edit and output is something like this:
using the dictionaries' key and value system (applied to col1 and col2): {apple 3, bridge 7, chair 5} - {apple 1, bridge 9, chair 5} = {apple 2, bridge -2, chair 0}?
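The per-word subtraction described in the edit could be sketched with collections.Counter. The count_difference name and the toy word lists are assumptions for illustration. Counter.subtract is used because it keeps zero and negative counts, whereas the - operator on Counters drops them.

```python
from collections import Counter

def count_difference(words1, words2):
    """Return per-word counts of edition 1 minus edition 2 (may be negative)."""
    diff = Counter(words1)
    diff.subtract(Counter(words2))  # in place; keeps zero and negative counts
    return diff

# Toy data matching the counts in the example above.
edition1 = ["apple"] * 3 + ["bridge"] * 7 + ["chair"] * 5
edition2 = ["apple"] * 1 + ["bridge"] * 9 + ["chair"] * 5
diff = count_difference(edition1, edition2)
print(dict(diff))
# prints: {'apple': 2, 'bridge': -2, 'chair': 0}
```

A positive value means the word is more frequent in the first edition, a negative value that it is more frequent in the second.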
You want to output:
words that appear in text 1 but not in text 2
words that appear in text 2 but not in text 1
Interesting. A set difference is what you need.
import re
s1 = open("FRANKENST18.txt", "r").read()
s2 = open("FRANKENST31.txt", "r").read()
# "+" makes the pattern match whole words instead of single letters
words_s1 = re.findall("[A-Za-z]+",s1)
words_s2 = re.findall("[A-Za-z]+",s2)
set_s1 = set(words_s1)
set_s2 = set(words_s2)
words_in_s1_but_not_in_s2 = set_s1 - set_s2
words_in_s2_but_not_in_s1 = set_s2 - set_s1
words_in_s1 = '\n'.join(words_in_s1_but_not_in_s2)
words_in_s2 = '\n'.join(words_in_s2_but_not_in_s1)
with open("s1_output","w") as s1_output:
    s1_output.write(words_in_s1)
with open("s2_output","w") as s2_output:
    s2_output.write(words_in_s2)
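A set difference reports each differing word only once. If every occurrence is wanted, as the question asks, a word-level diff with difflib.SequenceMatcher may be closer to the goal. This is a sketch under assumptions: the word_level_changes name and the two sample sentences are made up, the word pattern is a simplification, and SequenceMatcher can be slow on inputs as long as a whole novel.

```python
import difflib
import re

def word_level_changes(text1, text2):
    """Return (removed, added): words only in text1 / only in text2, repeats kept."""
    words1 = re.findall(r"[A-Za-z']+", text1.lower())
    words2 = re.findall(r"[A-Za-z']+", text2.lower())
    removed, added = [], []
    matcher = difflib.SequenceMatcher(None, words1, words2, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.extend(words1[i1:i2])   # present in text1, gone in text2
        if tag in ("insert", "replace"):
            added.extend(words2[j1:j2])     # new in text2
    return removed, added

removed, added = word_level_changes("The cat sat on the mat.", "The dog sat on a mat.")
print(removed)  # prints: ['cat', 'the']
print(added)    # prints: ['dog', 'a']
```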
Let me know if this isn't exactly what you're looking for, but it seems like you want to iterate through lines of a file, which you can do very easily in python. Here's an example, where I omit the newline character at the end of each line, and add the lines to a list:
f = open("filename.txt", 'r')
lines = []
for line in f:
    lines.append(line.rstrip('\n'))  # strip the newline from the line, not from the file object
f.close()
Hope this helps!
I'm not completely sure if you're trying to compare the differences in words as they occur or lines as they occur, however one way you could do this is by using a dictionary. If you want to see which lines change you could split the lines on periods by doing something like:
text = 'this is a sentence. this is another sentence.'
sentences = text.split('.')
This will split the string you have (which contains the entire text I assume) on the periods and will return an array (or list) of all the sentences.
You can then create a dictionary with dict = {}, loop over each sentence in the previously created array, make it a key in the dictionary with a corresponding value (could be anything since most sentences probably don't occur more than once). After doing this for the first version you can go through the second version and check which sentences are the same. Here is some code that will give you a start (assuming version1 contains all the sentences from the first version):
for sentence in version1:
    dict[sentence] = 1 #put a counter for e
You can then loop over the second version and check if the same sentence is found in the first, with something like:
for sentence in version2:
    if sentence in dict: #if the sentence is in the dictionary
        pass
        #or do whatever you want here
    else: #if the sentence isn't
        print(sentence)
Again not sure if this is what you're looking for but hope it helps
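Put together, the sentence-by-sentence comparison described in this answer might look like the sketch below. The changed_sentences name and the sample texts are assumptions, and a set replaces the dictionary since only membership is tested. Splitting on periods is naive: abbreviations such as "Mr." would be cut apart too.

```python
def changed_sentences(text1, text2):
    """Return the sentences of text2 that do not occur verbatim in text1."""
    # Naive split on periods; surrounding whitespace is ignored.
    seen = {s.strip() for s in text1.split('.') if s.strip()}
    return [s.strip() for s in text2.split('.')
            if s.strip() and s.strip() not in seen]

version1 = "This is a sentence. This is another sentence."
version2 = "This is a sentence. This one is new."
print(changed_sentences(version1, version2))
# prints: ['This one is new']
```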

Mincemeat map function returning dictionary

I am using a map reduce implementation called mincemeat.py. It contains a map function and a reduce function. First, here is what I am trying to accomplish. I am doing a Coursera course on big data that includes a programming assignment. There are hundreds of files containing data of the form paperid:::author1::author2::author3:::papertitle
We have to go through all the files and report, for a particular author, the word he has used most often. So I wrote the following code for it.
import re
import glob
import mincemeat
from collections import Counter
text_files = glob.glob('test/*')
def file_contents(file_name):
    f = open(file_name)
    try:
        return f.read()
    finally:
        f.close()
datasource = dict((file_name, file_contents(file_name)) for file_name in text_files)
def mapfn(key, value):
    for line in value.splitlines():
        wordsinsentence = line.split(":::")
        authors = wordsinsentence[1].split("::")
        # print authors
        words = str(wordsinsentence[2])
        words = re.sub(r'([^\s\w-])+', '', words)
        # re.sub(r'[^a-zA-Z0-9: ]', '', words)
        words = words.split(" ")
        for author in authors:
            for word in words:
                word = word.replace("-"," ")
                word = word.lower()
                yield author, word
def reducefn(key, value):
    return Counter(value)
s = mincemeat.Server()
s.datasource = datasource
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="changeme")
# print results
i = open('outfile','w')
i.write(str(results))
i.close()
My problem is that the reduce function should receive an author name and all the words he has used in his titles, for every author. So I expected output like
{authorname: Counter({'word1':countofword1,'word2':countofword2,'word3':countofword3,..})}
But what I get is
authorname: (authorname, Counter({'word1': countofword1,'word2':countofword2}))
Can someone tell why it is happening like that? I don't need help to solve the question, I need help to know why it is happening like that!
I ran your code and it works as expected. The output looks like {authorname : Counter({'word1':countofword1,'word2':countofword2,'word3':countofword3,..})}.
That said, remove the code from here, as it violates the Coursera Code of Honor.
Check your value data structure in reducefn before Counter.
def reducefn(key, value):
    print(value)
    return Counter(value)
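To check what reducefn returns independently of mincemeat, it can be called directly with a hand-made list. This is only a sketch: the author string and word list are hypothetical stand-ins for whatever the map phase delivers, and nothing here exercises mincemeat itself.

```python
from collections import Counter

def reducefn(key, value):
    # value is assumed to be a flat list of words emitted for one author.
    return Counter(value)

# Hypothetical input standing in for the map phase output for one author.
counts = reducefn("some author", ["data", "mining", "data", "graphs"])
print(counts.most_common(1))
# prints: [('data', 2)]
```

If the printed value inside the real reducefn is not a flat list of strings, the Counter will count whatever elements it is given instead, which is worth ruling out first.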

Python Replacing Words from Definitions in Text File

I've got an old Informix database that was written for COBOL. All the field names are codes, so my SQL queries look like this:
SELECT uu00012 FROM uu0001;
This is pretty hard to read.
I have a text file with the field definitions like
uu00012 client
uu00013 date
uu00014 f_name
uu00015 l_name
I would like to swap out the codes for the more readable English names, maybe by running a Python script on the queries and saving a file with the English names.
What's the best way to do this?
If each piece is definitely a separate word, re.sub is definitely the way to go here:
import re
# create a mapping of old vars to new vars.
with open('definitions') as f:
    d = dict(x.split() for x in f)
def my_replace(match):
    # if the match is in the dictionary, replace it; otherwise, return the match unchanged.
    return d.get(match.group(), match.group())
with open('inquiry') as f:
    for line in f:
        # each line already ends with a newline, so suppress print's own
        print(re.sub(r'\w+', my_replace, line), end='')
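The same idea can be tried without any files. In this sketch the DEFINITIONS string, build_mapping and translate_sql are illustrative names, with the definitions copied from the question. Because \w+ matches whole identifiers, the table name uu0001 is left alone even though it is a prefix of uu00012, which a plain str.replace would not guarantee.

```python
import re

# Hypothetical in-memory stand-in for the definitions file.
DEFINITIONS = """uu00012 client
uu00013 date
uu00014 f_name
uu00015 l_name"""

def build_mapping(text):
    """Map each code to its readable name, skipping blank or malformed lines."""
    pairs = (line.split() for line in text.splitlines())
    return {p[0]: p[1] for p in pairs if len(p) == 2}

def translate_sql(sql, mapping):
    # \w+ matches whole identifiers, so 'uu0001' is not touched by 'uu00012'.
    return re.sub(r"\w+", lambda m: mapping.get(m.group(), m.group()), sql)

mapping = build_mapping(DEFINITIONS)
print(translate_sql("SELECT uu00012, uu00014 FROM uu0001;", mapping))
# prints: SELECT client, f_name FROM uu0001;
```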
Conceptually, I would probably first build a mapping of codings -> English (in memory or o.
Then, for each coding in your map, scan your file and replace it with the code's mapped English equivalent.
infile = open('filename.txt','r')
namelist = []
for each in infile.readlines():
    # split() with no argument also drops the trailing newline
    code, name = each.split()[:2]
    namelist.append((code, name))
This will give you a list of (key, value) pairs.
I don't know what you want to do with the results from there, though; you need to be more explicit.
dictionary = '''uu00012 client
uu00013 date
uu00014 f_name
uu00015 l_name'''
dictionary = dict(map(lambda x: (x[1], x[0]), [x.split() for x in dictionary.split('\n')]))
def process_sql(sql, d):
    for k, v in d.items():
        sql = sql.replace(k, v)
    return sql
sql = process_sql('SELECT f_name FROM client;', dictionary)
build dictionary:
{'date': 'uu00013', 'l_name': 'uu00015', 'f_name': 'uu00014', 'client': 'uu00012'}
Then run through your SQL and replace the human-readable names with the coded ones. The result is:
SELECT uu00014 FROM uu00012;
import re
f = open("dictfile.txt")
d = {}
for mapping in f.readlines():
    l, r = mapping.split(" ")
    d[re.compile(l)] = r.strip("\n")
sql = open("orig.sql")
out = open("translated.sql", "w")  # file() only exists in Python 2
for line in sql.readlines():
    for r in d.keys():
        line = r.sub(d[r], line)
    out.write(line)
out.close()

Help parsing text file in python

I've been struggling with this one for some time now. I have many text files in a specific format, from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parameters for parsing and ensuring I get all the info correctly.
the format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of whitespace, and 2) separating the fields from each other. See my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',(tmp_PA,tmp_K,tmp_DETAILS))
At the minute, I can pull the K & PA fields no problem using this. However, my DETAILS field only gets one word; I need the entire sentence, or at least 25 chars of it.
Thanks very much for reading and I hope you can help! :)
K
You are splitting the whole line into words. You need to split into first word, second word and the rest. Like line.split(None, 2).
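A small sketch of what line.split(None, 2) does with one of the sample lines. The parse_record name is made up for illustration. With a maxsplit of 2 the third piece keeps the rest of the line, including its trailing newline, so it is stripped separately.

```python
def parse_record(line):
    """Split a data line into (k, pa, details); return None if it has fewer than 3 fields."""
    parts = line.split(None, 2)  # at most 2 splits, on runs of whitespace
    if len(parts) != 3:
        return None
    k, pa, details = parts
    return k, pa, details.rstrip()  # the last piece still carries the newline

print(parse_record("2 4565434 i need this sentace as one DB record\n"))
# prints: ('2', '4565434', 'i need this sentace as one DB record')
```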
I would probably use regular expressions, and use the opposite logic: if the line starts with a number 1 through 5, use it; otherwise skip it. Like:
import re
pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r') # No calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using a prepared statement. Parsing SQL is orders of magnitude slower than executing it.
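One way to reuse a single parameterized statement for many rows is executemany. The sketch below assumes the database is SQLite (the ? placeholders in the question suggest it, but that is a guess) and uses an in-memory database with text columns purely for illustration.

```python
import sqlite3

# Assumption: SQLite; table and column names follow the question.
conn = sqlite3.connect(":memory:")
cu = conn.cursor()
cu.execute("CREATE TABLE bugInfo2 (pa TEXT, k TEXT, details TEXT)")

rows = [
    ("4565434", "2", "i need this sentace as one DB record"),
    ("4456788", "2", "and this one"),
]
# One parameterized statement, executed once per row.
cu.executemany("INSERT INTO bugInfo2 (pa, k, details) VALUES (?, ?, ?)", rows)
conn.commit()

row_count = cu.execute("SELECT COUNT(*) FROM bugInfo2").fetchone()[0]
print(row_count)
# prints: 2
```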
If I understand correctly your file format, you can try this script
filename = 'bug.txt'
f = open(filename,'r')  # file() only exists in Python 2
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        tokens = line.split()
        if tokens == ['K','PA','DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None,2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K,PA,tokens[2]))
f.close()
for r in records:
    print(r) # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.
Here is my attempt using re
import re
stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re
stuff = open("source", "r").readlines()
# create a regular expression using subpatterns.
# 'first', 'second' and 'third' are our own tags,
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    result = juicy_info.search(line)
    if result: # do stuff with data here, just use the tags we declared earlier.
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
import re
reg = re.compile(r'K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3*r'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')
with open('XX.txt') as f:
    mat = reg.search(f.read())
for tripl in ((2,1,3),(5,4,6),(8,7,9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches the following characters:
blank, '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol that matches more than what needs to be matched, with the risk of matching stray newlines in places where they shouldn't be.
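The difference can be seen on a string with a stray newline between two fields. This is a small illustration only; the sample text is made up.

```python
import re

text = "2\n4565434 details"   # a stray newline between the first two fields

loose = re.match(r"(\d)\s+(\d+)", text)       # \s+ also consumes the newline
strict = re.match(r"(\d)[ \t]+(\d+)", text)   # [ \t]+ stays on one line

print(loose is not None)   # prints: True
print(strict is None)      # prints: True
```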
Edit
It may be sufficient to do:
import re
reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$',re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2,1,3))
