I am a beginner in Python, and I am trying to collect the names from a txt file and write them into another txt file using NLTK. The issue is that only the first names are returned, without the surnames. Is there anything I can do? Here's the code:
import nltk

# function start
def extract_entities(text):
    ind = len(text) - 7
    sub = text[ind:]
    print(sub)
    output.write('\nPRODID==' + sub + '\n\n')
    for sent in nltk.sent_tokenize(text):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'label'):
                output.write(chunk.label() + ':' + ' '.join(c[0] for c in chunk.leaves()) + '\n')
# function end

# main program
# -*- coding: utf-8 -*-
import sys
import codecs

sys.stdout = codecs.getwriter("iso-8859-1")(sys.stdout, 'xmlcharrefreplace')
if sys.stdout.encoding != 'cp850':
    sys.stdout = codecs.getwriter('cp850')(sys.stdout.buffer, 'strict')
if sys.stderr.encoding != 'cp850':
    sys.stderr = codecs.getwriter('cp850')(sys.stderr.buffer, 'strict')

file = open('C:\Python34\Description.txt', 'r')
output = open('C:\Python34\out.txt', 'w')

for line in file:
    if not line:
        continue
    extract_entities(line)

file.close()
output.close()
Thanks in advance for your answers!
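One possible direction, sketched here only as an illustration and not as a confirmed fix: NLTK's ne_chunk often splits a first name and a surname into separate PERSON chunks, so merging consecutive PERSON chunks can recover full names. The helper below is hypothetical and assumes the standard NLTK tokenizers and tagger data are installed.

import nltk

def extract_full_names(text):
    """Merge consecutive PERSON chunks so 'John' + 'Smith' becomes 'John Smith'."""
    names = []
    for sent in nltk.sent_tokenize(text):
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
        current = []
        for node in tree:
            if hasattr(node, 'label') and node.label() == 'PERSON':
                # collect the tokens of this PERSON chunk
                current.extend(leaf[0] for leaf in node.leaves())
            elif current:
                # a non-PERSON node ends the current name
                names.append(' '.join(current))
                current = []
        if current:
            names.append(' '.join(current))
    return names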
I am fairly new to Python and I am trying to capture the last line of a syslog file, but I have been unable to do so. This is a huge log file, so I want to avoid loading the complete file into memory. I just want to read the last line of the file and capture the timestamp for further analysis.
The code below captures all the timestamps into a Python dict, which takes a really long time to reach the last timestamp; once it completed, my plan was to reverse the list and take the first object, at index[0].
The lastFile function uses the glob module and gives me the latest log file name, which is fed into recentEdit in the main function.
Is there a better way of doing this?
Script1:
#!/usr/bin/python
import glob
import os
import re


def main():
    syslogDir = (r'Location/*')
    listOfFiles = glob.glob(syslogDir)
    recentEdit = lastFile(syslogDir)
    print(recentEdit)
    astack = []
    with open(recentEdit, "r") as f:
        for line in f:
            result = [re.findall(r'\d{4}.\d{2}.\d{2}T\d{2}.\d{2}.\d{2}.\d+.\d{2}.\d{2}', line)]
            print(result)


def lastFile(i):
    listOfFiles = glob.glob(i)
    latestFile = max(listOfFiles, key=os.path.getctime)
    return(latestFile)


if __name__ == '__main__': main()
Script2:
###############################################################################
###############################################################################
#The readline() gives me the first line of the log file which is also not what I am looking for:
#!/usr/bin/python
import glob
import os
import re


def main():
    syslogDir = (r'Location/*')
    listOfFiles = glob.glob(syslogDir)
    recentEdit = lastFile(syslogDir)
    print(recentEdit)
    with open(recentEdit, "r") as f:
        fLastLine = f.readline()
        print(fLastLine)
    # astack = []
    # with open(recentEdit, "r") as f:
    #     for line in f:
    #         result = [re.findall(r'\d{4}.\d{2}.\d{2}T\d{2}.\d{2}.\d{2}.\d+.\d{2}.\d{2}', line)]
    #         print(result)


def lastFile(i):
    listOfFiles = glob.glob(i)
    latestFile = max(listOfFiles, key=os.path.getctime)
    return(latestFile)


if __name__ == '__main__': main()
I really appreciate your help!!
Sincerely.
If you want to go directly to the end of the file, follow these steps:
1. Every time your program runs, persist (store) the index of the last '\n'.
2. If you have persisted the index of the last '\n', you can seek directly to that index using
file.seek(yourpersistedindex)
3. After this, when you call file.readline() you will get the line starting from yourpersistedindex.
4. Store this index every time you run your script.
For example, if your file log.txt has content like:
timestamp1 \n
timestamp2 \n
timestamp3 \n
import pickle

lastNewLineIndex = None
# here trying to read the lastNewLineIndex
try:
    rfile = open('pickledfile', 'rb')
    lastNewLineIndex = pickle.load(rfile)
    rfile.close()
except:
    pass

logfile = open('log.txt', 'r')
newLastNewLineIndex = None

if lastNewLineIndex:
    # seek(index) will take the file pointer to the index
    logfile.seek(lastNewLineIndex)
    # will read the line starting from the index we provided in the seek function
    lastLine = logfile.readline()
    print(lastLine)
    # tell() gives you the current index
    newLastNewLineIndex = logfile.tell()
    logfile.close()
else:
    counter = 0
    text = logfile.read()
    for c in text:
        if c == '\n':
            newLastNewLineIndex = counter
        counter += 1

# here saving the new LastNewLineIndex
wfile = open('pickledfile', 'wb')
pickle.dump(newLastNewLineIndex, wfile)
wfile.close()
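A common alternative, sketched here under the assumption that the log ends with complete, newline-terminated lines: seek close to the end of the file and keep only the text after the last newline, so the whole file is never read.

import os

def read_last_line(path, chunk_size=4096):
    """Read only the tail of the file and return its last line.
    If the last line is longer than chunk_size it will be truncated."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - chunk_size))
        lines = f.read().splitlines()
    return lines[-1].decode('utf-8', 'replace') if lines else ''

The timestamp can then be pulled out of that single line with the same re.findall() pattern used in the question.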
I am using the pep8 module of Python inside my code.
import pep8
pep8_checker = pep8.StyleGuide(format='pylint')
pep8_checker.check_files(paths=['./test.py'])
r = pep8_checker.check_files(paths=['./test.py'])
This is the output:
./test.py:6: [E265] block comment should start with '# '
./test.py:23: [E265] block comment should start with '# '
./test.py:24: [E302] expected 2 blank lines, found 1
./test.py:30: [W293] blank line contains whitespace
./test.py:35: [E501] line too long (116 > 79 characters)
./test.py:41: [E302] expected 2 blank lines, found 1
./test.py:53: [E501] line too long (111 > 79 characters)
./test.py:54: [E501] line too long (129 > 79 characters)
But this result is printed to the terminal, and the final value assigned to 'r' is 8 (i.e. the total number of errors).
I want to store these errors in a variable. How can I do this?
EDIT:
here is the test.py file: http://paste.fedoraproject.org/347406/59337502/raw/
There are at least two ways to do this. The simplest is to redirect sys.stdout to a text file, then read the file at your leisure:
import pep8
import sys
saved_stdout = sys.stdout
sys.stdout = open('pep8.out', 'w')
pep8_checker = pep8.StyleGuide(format='pylint')
pep8_checker.check_files(paths=['./test.py'])
r = pep8_checker.check_files(paths=['./test.py'])
sys.stdout.close()
sys.stdout = saved_stdout
# Now you can read "pep.out" into a variable
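For example, a minimal way to load that file back into a variable (using the same 'pep8.out' name as above):

with open('pep8.out') as f:
    testout = f.read()  # the pylint-formatted report as a single string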
Alternatively you can write to a variable using StringIO:
import pep8
import sys
# The module name changed between python 2 and 3
if sys.version_info.major == 2:
    from StringIO import StringIO
else:
    from io import StringIO
saved_stdout = sys.stdout
sys.stdout = StringIO()
pep8_checker = pep8.StyleGuide(format='pylint')
pep8_checker.check_files(paths=['./test.py'])
r = pep8_checker.check_files(paths=['./test.py'])
testout = sys.stdout.getvalue()
sys.stdout.close()
sys.stdout = saved_stdout
# testout contains the output. You might wish to use testout.split("\n")
I have some Python code that lists pull requests on GitHub. If I print the parsed JSON output to the console, I get the expected results, but when I write the parsed JSON to a CSV file, I don't get the same results. They are cut off after the sixth result (and that varies).
What I'm trying to do is overwrite the CSV each time with the latest output.
Also, I'm dealing with Unicode output, which is why I use unicodecsv. I don't know if this is throwing the CSV output off.
I will list both versions of the relevant piece of code: one with the print statement and one with the CSV code.
Thanks for any help.
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)
import csv
import unicodecsv
for pr in result:
    data = pr.as_dict()
    changes = (gh.repository('my-repo', repo).pull_request(data['number'])).as_dict()
    if changes['commits'] == 1 and changes['changed_files'] == 1:
        # keep print to console for testing purposes
        print "Login: " + changes['user']['login'] + '\n' + "Title: " + changes['title'] + '\n' + "Changed Files: " + str(changes['changed_files']) + '\n' + "Commits: " + str(changes['commits']) + '\n'
With csv:
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)
import csv
import unicodecsv
for pr in result:
    data = pr.as_dict()
    changes = (gh.repository('my-repo', repo).pull_request(data['number'])).as_dict()
    if changes['commits'] == 1 and changes['changed_files'] == 1:
        with open('c:\pull.csv', 'r+') as f:
            csv_writer = unicodecsv.writer(f, encoding='utf-8')
            csv_writer.writerow(['Login', 'Title', 'Changed files', 'Commits'])
            for i in changes['user']['login'], changes['title'], str(changes['changed_files']), str(changes['commits']):
                csv_writer.writerow([changes['user']['login'], changes['title'], changes['changed_files'], changes['commits']])
The problem is with the way you write data to the file.
Every time you open the file in r+ mode, you overwrite the rows written on the previous iteration.
And for dealing with JSON
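As an illustration only, a minimal sketch of that fix: open the CSV once in write mode before the loop, write the header a single time, and write one row per matching pull request (result, gh and repo are assumed to be set up as in the question):

import unicodecsv

with open('c:\\pull.csv', 'wb') as f:
    csv_writer = unicodecsv.writer(f, encoding='utf-8')
    # header written exactly once
    csv_writer.writerow(['Login', 'Title', 'Changed files', 'Commits'])
    for pr in result:
        data = pr.as_dict()
        changes = gh.repository('my-repo', repo).pull_request(data['number']).as_dict()
        if changes['commits'] == 1 and changes['changed_files'] == 1:
            csv_writer.writerow([changes['user']['login'], changes['title'],
                                 changes['changed_files'], changes['commits']])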
I've been programming for a couple of months, so I'm not an expert. I have two huge text files (omni, ~20 GB, ~2.5M lines; dbSNP, ~10 GB, ~60M lines). Their first few lines, not necessarily tab-delimited, start with "#" (the header), and the rest of the lines are organized in tab-delimited columns (the actual data).
The first two columns of each line contain the chromosome number and the position on the chromosome, while the third column contains an identification code. In the "omni" file I don't have the ID, so I need to find the position in the dbSNP file (a database) and create a copy of the first file, completed with the IDs.
Because of memory limits, I decided to read the two files line by line and restart from the last line read. I am not satisfied with the efficiency of my code, because I feel it is slower than it could be. I'm pretty sure that is my fault, due to lack of experience. Is there a way to make it faster in Python? Could the problem be the repeated opening and closing of the files?
I usually launch the script in GNOME Terminal (Python 2.7.6, Ubuntu 14.04) like this:
python -u Replace_ID.py > Replace.log 2> Replace.err
Thank you very much in advance.
omni (Omni example):
...
#CHROM POS ID REF ALT ...
1 534247 . C T ...
...
dbSNP (dbSNP example):
...
#CHROM POS ID REF ALT ...
1 10019 rs376643643 TA T ...
...
The output should be exactly the same as the Omni file, but with the rs ID after the position.
Code:
SNPline = 0        # line in dbSNP file
SNPline2 = 0       # temporary copy
omniline = 0       # line in omni file
line_offset = []   # beginnings of every line in dbSNP file (stackoverflow.com/a/620492)
offset = 0

with open("dbSNP_build_141.vcf") as dbSNP:  # database
    for line in dbSNP:
        line_offset.append(offset)
        offset += len(line)
    dbSNP.seek(0)

with open("Omni_replaced.vcf", "w") as outfile:
    outfile.write("")

with open("Omni25_genotypes_2141_samples.b37.v2.vcf") as omni:
    for line in omni:
        omniline += 1
        print str(omniline)  # log
        if line[0] == "#":  # if line is header
            with open("Omni_replaced.vcf", "a") as outfile:
                outfile.write(line)  # write as it is
        else:
            split_omni = line.split('\t')  # tab-delimited columns
            with open("dbSNP_build_141.vcf") as dbSNP:
                SNPline2 = SNPline  # restart from last line found
                dbSNP.seek(line_offset[SNPline])
                for line in dbSNP:
                    SNPline2 = SNPline2 + 1
                    split_dbSNP = line.split('\t')
                    if line[0] == "#":
                        print str(omniline) + "#" + str(SNPline2)  # keep track of what's happening
                        rs_found = 0  # it does not contain the rs ID
                    else:
                        if split_omni[0] + split_omni[1] == split_dbSNP[0] + split_dbSNP[1]:  # if chromosome and position match
                            print str(omniline) + "." + str(SNPline2)  # log
                            SNPline = SNPline2 - 1
                            with open("Omni_replaced.vcf", "a") as outfile:
                                split_omni[2] = split_dbSNP[2]  # replace the ID
                                outfile.write("\t".join(split_omni))
                            rs_found = 1  # ID found
                            break
                        else:
                            rs_found = 0  # ID not found
            if rs_found == 0:  # if ID was not found in dbSNP, then:
                with open("Omni_replaced.vcf", "a") as outfile:
                    outfile.write("\t".join(split_omni))  # keep the line unedited
            else:  # if ID was found:
                pass  # no need to do anything, line already written

print "End."
Here is my contribution to your problem.
First of all, here is what I understand about your problem, just to check that I'm correct:
You have two files, each a tab-separated values file. The first, dbSNP, contains data whose third column is an identifier corresponding to the gene's chromosome number (column 1) and the gene's position on the chromosome (column 2).
The task therefore consists of taking the omni file and filling in the ID column with the values coming from the dbSNP file (based on the chromosome number and the gene position).
The problem comes from the size of the files.
You tried to keep the file position of each line so you could seek directly to the right line, in order to avoid putting the whole dbSNP file in memory. As this approach is not quick enough for you, due to the repeated file openings, here is what I propose.
Parse the dbSNP file once and retain only the essential information, i.e. the pairs (number, position): ID.
From your example, that corresponds to:
1 534247 rs201475892
1 569624 rs6594035
1 689186 rs374789455
This corresponds to less than 10% of the file size in terms of memory, so starting from a 20 GB file you will load less than 2 GB into memory, which is probably affordable (I don't know what sort of loading you tried before).
So here is my code to do this. Do not hesitate to ask for explanations, as, unlike you, I use object-oriented programming.
import argparse

# description of this script
__description__ = "This script parse a Database file in order to find the genes identifiers and provide them to a other file.[description to correct]\nTake the IDs from databaseFile and output the targetFile content enriched with IDs"


# -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
# classes used to handle and operate on data
class ChromosomeIdentifierIndex():

    def __init__(self):
        self.chromosomes = {}

    def register(self, chromosomeNumber, positionOnChromosome, identifier):
        if not chromosomeNumber in self.chromosomes:
            self.chromosomes[chromosomeNumber] = {}
        self.chromosomes[chromosomeNumber][positionOnChromosome] = identifier

    def __setitem__(self, ref, ID):
        """ Allows the alternative syntax chrsIndex[number, position] = id for chrsIndex.register(number, position, id) """
        chromosomeNumber, positionOnChromosome = ref[0], ref[1]
        self.register(chromosomeNumber, positionOnChromosome, ID)

    def __getitem__(self, ref):
        """ Allows getting IDs using the syntax: chromosomeIDindex[chromosomenumber, positionOnChromosome] """
        chromosomeNumber, positionOnChromosome = ref[0], ref[1]
        try:
            return self.chromosomes[chromosomeNumber][positionOnChromosome]
        except:
            return "."

    def __repr__(self):
        for chrs in self.chromosomes.keys():
            print "Chromosome : ", chrs
            for position in self.chromosomes[chrs].keys():
                print "\t", position, "\t", self.chromosomes[chrs][position]


class Chromosome():

    def __init__(self, string):
        self.values = string.split("\t")
        self.chrs = self.values[0]
        self.position = self.values[1]
        self.ID = self.values[2]

    def __str__(self):
        return "\t".join(self.values)

    def setID(self, ID):
        self.ID = ID
        self.values[2] = ID


class DefaultWritter():
    """ Used to print if no output is specified """

    def __init__(self):
        pass

    def write(self, string):
        print string

    def close(self):
        pass


# -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
# the code executed when the script is called
if __name__ == "__main__":
    # initialisation
    parser = argparse.ArgumentParser(description=__description__)
    parser.add_argument("databaseFile", help="A batch file that contains much information, including the IDs.")
    parser.add_argument("targetFile", help="A file that contains information, but is missing the IDs.")
    parser.add_argument("-o", "--output", help="The output file of the script. If no output is specified, the output will be printed on the screen.")
    parser.add_argument("-l", "--logs", help="The log file of the script. If no log file is specified, the logs will be printed on the screen.")
    args = parser.parse_args()

    output = None
    if args.output == None:
        output = DefaultWritter()
    else:
        output = open(args.output, 'w')

    logger = None
    if args.logs == None:
        logger = DefaultWritter()
    else:
        logger = open(args.logs, 'w')

    # start of the process
    idIndex = ChromosomeIdentifierIndex()

    # build the index by reading the database file
    with open(args.databaseFile, 'r') as database:
        for line in database:
            if not line.startswith("#"):
                chromosome = Chromosome(line)
                idIndex[chromosome.chrs, chromosome.position] = chromosome.ID

    # read the target, replace the ID and output the result
    with open(args.targetFile, 'r') as target:
        for line in target:
            if not line.startswith("#"):
                chromosome = Chromosome(line)
                chromosome.setID(idIndex[chromosome.chrs, chromosome.position])
                output.write(str(chromosome))
            else:
                output.write(line)

    output.close()
    logger.close()
The main idea is to parse the dbSNP file once and collect all the IDs in a dictionary, then read the omni file line by line and output the result.
You can call the script like this:
python replace.py ./dbSNP_example.vcf ./Omni_example.vcf -o output.vcf
The argparse module I import and use to handle the parameters also provides automatic help, so a description of the parameters is available with
python replace.py -h
or
python replace.py --help
I believe that this approach will be faster than yours, as each file is read only once and the rest of the work happens in RAM, and I invite you to test it.
NB: I don't know if you are familiar with object-oriented programming, so I have to mention that all the classes are in the same file here for the sake of posting on Stack Overflow. In a real-life use case, good practice would be to put each class in a separate file, like "Chromosome.py", "ChromosomeIdentifierIndex.py" and "DefaultWritter.py" for example, and then import them into the "replace.py" file, as in the sketch below.
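For instance, replace.py could then start with imports along these lines (the module file names are only illustrative, matching the class names used above):

# hypothetical layout if each class lived in its own module
from Chromosome import Chromosome
from ChromosomeIdentifierIndex import ChromosomeIdentifierIndex
from DefaultWritter import DefaultWritter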
So I'm having trouble with decoding. I found in other threads how to do it for simple strings, with u'string'.encode, but I can't find a way to make it work with files.
Any help would be appreciated!
Here's the code.
text = file.read()
text.replace(txt.encode('utf-8'), novo_txt.encode('utf-8'))
file.seek(0) # rewind
file.write(text.encode('utf-8'))
and here's the whole code, in case it helps.
#!/usr/bin/env python
# coding: utf-8
"""
Script to help translate some code's method names from
Portuguese to English.
"""
from multiprocessing import Pool
from mock import MagicMock
from goslate import Goslate
import fnmatch
import logging
import os
import re
import urllib2

_MAX_PEERS = 1

try:
    os.remove('traducoes.log')
except OSError:
    pass

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
handler = logging.FileHandler('traducoes.log')
logger.addHandler(handler)


def fileWalker(ext, dirname, names):
    """
    Find the files with the correct extension
    """
    pat = "*" + ext[0]
    for f in names:
        if fnmatch.fnmatch(f, pat):
            ext[1].append(os.path.join(dirname, f))


def encontre_text(file):
    """
    find in the string the words which have '_' in them
    """
    text = file.read().decode('utf-8')
    return re.findall(r"\w+(?<=_)\w+", text)
    # return re.findall(r"\"\w+\"", text)


def traduza_palavra(txt):
    """
    Translate the word/phrase to English
    """
    try:
        # try to connect to Google
        response = urllib2.urlopen('http://google.com', timeout=2)
        pass
    except urllib2.URLError as err:
        print "No network connection "
        exit(-1)
    if txt[0] != '_':
        txt = txt.replace('_', ' ')
    txt = txt.replace('media'.decode('utf-8'), 'média'.decode('utf-8'))
    gs = Goslate()
    # txt = gs.translate(txt, 'en', gs.detect(txt))
    txt = gs.translate(txt, 'en', 'pt-br')  # ensuring the source language is Brazilian Portuguese
    txt = txt.replace(' en ', ' br ')
    return txt.replace(' ', '_')  # .lower()


def subistitua(file, txt, novo_txt):
    """
    should rewrite the file with the new text in the future
    """
    text = file.read()
    text.replace(txt.encode('utf-8'), novo_txt.encode('utf-8'))
    file.seek(0)  # rewind
    file.write(text.encode('utf-8'))


def magica(File):
    """
    Thread pool. Every single thread should play around here with
    one element from the list of files
    """
    global _DONE
    if _MAX_PEERS == 1:  # not feasible with multithreading
        logger.info('\n---- File %s' % File)
        with open(File, "r+") as file:
            list_txt = encontre_text(file)
            for txt in list_txt:
                novo_txt = traduza_palavra(txt)
                if txt != novo_txt:
                    logger.info('%s -> %s [%s]' % (txt, novo_txt, File))
                    subistitua(file, txt, novo_txt)
            file.close()
    print File.ljust(70) + '[OK]'.rjust(5)


if __name__ == '__main__':
    try:
        response = urllib2.urlopen('http://www.google.com.br', timeout=1)
    except urllib2.URLError as err:
        print "No network connection "
        exit(-1)

    root = './app'
    ex = ".py"
    files = []
    os.path.walk(root, fileWalker, [ex, files])
    print '%d files found to be translated' % len(files)
    try:
        if _MAX_PEERS > 1:
            _pool = Pool(processes=_MAX_PEERS)
            result = _pool.map_async(magica, files)
            result.wait()
        else:
            result = MagicMock()
            result.successful.return_value = False
            for f in files:
                pass
                magica(f)
            result.successful.return_value = True
    except AssertionError, e:
        print e
    else:
        pass
    finally:
        if result.successful():
            print 'Translated all files'
        else:
            print 'Some files were not translated'
Thank you all for the help!
In Python 2, reading from files produces regular (byte) string objects, not unicode objects. There is no need to call .encode() on these; in fact, that'll only trigger an automatic decode to Unicode first, which can fail.
Rule of thumb: use a unicode sandwich. Whenever you read data, you decode to unicode at that stage. Use unicode values throughout your code. Whenever you write data, encode at that point. You can use io.open() to open file objects that encode and decode automatically for you.
That also means you can use unicode literals everywhere; for your regular expressions, for your string literals. So use:
def encontre_text(file):
    text = file.read()  # assume `io.open()` was used
    return re.findall(ur"\w+(?<=_)\w+", text)  # use a unicode pattern
and
def subistitua(file, txt, novo_txt):
    text = file.read()  # assume `io.open()` was used
    text = text.replace(txt, novo_txt)
    file.seek(0)  # rewind
    file.write(text)
as all string values in the program are already unicode, and
txt = txt.replace(u'media', u'média')
as u'..' unicode string literals don't need decoding anymore.
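For completeness, a minimal sketch of the io.open() approach applied to the file handling in magica(), assuming the files are UTF-8 encoded:

import io

# io.open() decodes on read and encodes on write, so the rest of the
# code works purely with unicode values and no manual .encode()/.decode()
with io.open(File, 'r+', encoding='utf-8') as f:
    text = f.read()                         # already unicode
    f.seek(0)
    f.write(text.replace(txt, novo_txt))    # still unicode
    f.truncate()                            # in case the new text is shorter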