I'm trying to write a program in Python 3. I have saved two raw text files in the same directory as "programm.py", but the program can't find them.
I'm using Emacs, and I wrote:
from __future__ import division
import nltk, sys, matplotlib, numpy, re, pprint, codecs
from os import path
text1 = "/home/giovanni/Scrivania/Giovanni/programmi/Esame/Milton.txt"
text2 = "/home/giovanni/Scrivania/Giovanni/programmi/Esame/Sksp.txt"
from nltk import ngrams
s_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
w_tokenizer = nltk.word_tokenize("text")
print(text1)
But when I import it in Python 3, it doesn't print text1 (I added the print to check that it works):
>>> import programma1
>>>
And if I then ask for text1 in the interpreter, it can't find the name:
>>> import programma1
>>> text1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'text1' is not defined
What can I do?
There are a few independent issues here. As @Yash Kanojia correctly pointed out, to get the contents of the files you need to open and read them; the variables currently hold only their paths.
The reason you can't access text1 from the interpreter is that it is defined in the namespace of the programma1 module, not in your interactive session. After import programma1, you reach it as programma1.text1.
I've also moved all the import statements to the top of programma1.py, as that is considered good practice :)
Full code:
programma1.py:
from __future__ import division
import nltk, sys, matplotlib, numpy, re, pprint, codecs
from nltk import ngrams
from os import path
with open("/home/giovanni/Scrivania/Giovanni/programmi/Esame/Milton.txt") as file1:
    text1 = file1.read()
with open("/home/giovanni/Scrivania/Giovanni/programmi/Esame/Sksp.txt") as file2:
    text2 = file2.read()
s_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
w_tokenizer = nltk.word_tokenize("text")
#print(text1)
main.py:
import programma1
print(programma1.text1)
EDIT:
I presume you wanted to tokenize the contents of the files rather than the literal string "text". If so, replace this line:
w_tokenizer = nltk.word_tokenize("text")
with this line
w_tokenizer = nltk.word_tokenize(text1 + "\n" + text2)
Hope this helps.
with open('/home/giovanni/Scrivania/Giovanni/programmi/Esame/Milton.txt') as f:
    data = f.read()
    print(data)
I have a long file that follows a fixed structure, and I want to parse it to extract an identifier called sample.
The file, named paths_text.txt, looks like this:
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
My code runs fine like this:
import os
os.chdir('/groups/cgsd/alexandre/python_code')
import re
with open('./src/paths_text.txt') as f:
    for line in f:
        sample = re.search(r'pfg\d+',line)
        print(sample)
But when I add an underscore to the pattern, the search returns None. Why?
import os
os.chdir('/groups/cgsd/alexandre/python_code')
import re
with open('./src/paths_text.txt') as f:
    for line in f:
        sample = re.search(r'pfg\d+_',line)
        print(sample)
Because there is a G or a T between pfg001 and the underscore. \d+ matches only digits, so the pattern fails at that letter before it ever reaches the _.
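To illustrate the explanation above, here is a minimal sketch using one of the paths from the question. The `[A-Z]` character class is an assumption on my part (that exactly one uppercase letter follows the digits, as G and T do in the sample paths); adjust it if your sample names differ.

```python
import re

# One line taken from the paths_text.txt sample in the question
line = "/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz"

# \d+ stops at the letter G, so the literal underscore cannot match
no_match = re.search(r'pfg\d+_', line)
print(no_match)  # None

# Allow one uppercase letter between the digits and the underscore (assumed format)
match = re.search(r'pfg\d+[A-Z]_', line)
print(match.group())  # pfg001G_
```

If you only want the sample name without the trailing underscore, the original pattern r'pfg\d+' followed by an optional [A-Z] would likely be enough.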
Code:
import os
import re
import time
import csv
from TexSoup import TexSoup
path = os.getcwd()
texFile = path + '\\Paper16.tex'
print(texFile)
soup = TexSoup(open(texFile, 'r'))
This produces no output when I try print(soup), and I believe it is because the first line of the file starts with %.
I think this is a bug in TexSoup.
If you remove the first line, or comment out the second line instead, TexSoup is able to parse the file and print(soup) produces output.
In addition, if you terminate the first line by adding braces:
%{\documentstyle[aps,epsf,rotate,epsfig,preprint]{revtex}}
TexSoup is also able to parse the file.
I have an HTML file called test.html that contains a single word: בדיקה.
I open test.html and print its contents using this block of code:
file = open("test.html", "r")
print file.read()
But it prints ??????. Why does this happen, and how can I fix it?
By the way, when I open a plain text file it works fine.
Edit: I tried this:
>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????
import codecs
f=codecs.open("test.html", 'r')
print f.read()
Try something like this.
I ran into this problem today as well. I am on Windows with Chinese as the default system language, so open() falls back to a non-UTF-8 default encoding; others with a similar setup may hit the same Unicode error. Simply pass encoding='utf-8':
with open("test.html", "r", encoding='utf-8') as f:
    text = f.read()
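As a rough, self-contained illustration of the point above (Python 3), the sketch below writes the word from the question to a temporary test.html and reads it back with an explicit encoding. The temporary directory and the variable names are assumptions made only so the example can run anywhere.

```python
import os
import tempfile

word = "בדיקה"  # the word from the question

# Hypothetical location: a fresh temporary directory, so no real file is touched
path = os.path.join(tempfile.mkdtemp(), "test.html")

# Write and read with the same explicit encoding instead of the platform default
with open(path, "w", encoding="utf-8") as f:
    f.write(word)

with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print(text == word)  # True: the round trip preserves the Hebrew text

# The raw bytes on disk: each Hebrew letter takes two bytes in UTF-8
with open(path, "rb") as f:
    raw = f.read()
print(len(raw), len(word))
```

Note that even when the file is decoded correctly, printing may still show question marks if the console itself cannot display Hebrew; that is a terminal setting rather than a file-reading problem.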
You can make use of the following code:
from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup
f=codecs.open("test.html", 'r', 'utf-8')
document = BeautifulSoup(f.read(), "html.parser").get_text()  # name the parser explicitly
print(document)
If you want to delete all the blank lines in between and get all the words as a string (also avoid special characters, numbers) then also include:
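import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Assumption: "stop" was not defined in the original snippet; NLTK's English stop words are used here
stop = set(stopwords.words("english"))
st = ""  # accumulates the accepted words
docwords = word_tokenize(document)
for line in docwords:
    line = line.rstrip()
    # keep only purely alphabetic (A-Z, a-z) tokens
    if line and re.match("^[A-Za-z]+$", line):
        if line not in stop and len(line) > 1:
            st = st + " " + line
print(st)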
You can read HTML page using 'urllib'.
#python 2.x
import urllib
page = urllib.urlopen("your path ").read()
print page
Use codecs.open with the encoding parameter.
import codecs
f = codecs.open("test.html", 'r', 'utf-8')
CODE:
import codecs
path="D:\\Users\\html\\abc.html"
file=codecs.open(path,"rb")
file1=file.read()
file1=str(file1)
You can simply use requests (url is the address of the page you want to fetch):
import requests
page = requests.get(url).text
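You can use urllib in Python 3 in the same way as in
https://stackoverflow.com/a/27243244/4815313, with a few changes:
#python3
import urllib.request  # urlopen lives in the urllib.request submodule
# "/path/" is a placeholder; urlopen needs a full URL including a scheme (http://, file://, ...)
page = urllib.request.urlopen("/path/").read()
print(page)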
import re
import string
import shutil
import os
import os.path
import time
import datetime
import math
import urllib
from array import array
import random
filehandle = urllib.urlopen('http://www.google.com/') #open webpage
s = filehandle.read() #read
print s #display
#what i plan to do with it once i get the first part working
#results = re.findall('[<td style="font-weight:bold;" nowrap>$][0-9][0-9][0-9][.][0-9][0-9][</td></tr></tfoot></table>]',s)
#earnings = '$ '
#for money in results:
#earnings = earnings + money[1]+money[2]+money[3]+'.'+money[5]+money[6]
#print earnings
#raw_input()
This is the code that I have so far. I have looked at the other forums that suggest solutions, such as checking the name of the script (mine is parse_Money.py); I have tried urllib.request.urlopen, and I have tried running it on Python 2.5, 2.6, and 2.7. If anybody has any suggestions, they would be really welcome. Thanks, everyone!
--Matt
---EDIT---
I also tried this code and it worked, so I'm thinking it's some kind of syntax error. If anybody with a sharp eye can point it out, I would be very appreciative.
import shutil
import os
import os.path
import time
import datetime
import math
import urllib
from array import array
import random
b = 3
#find URL
URL = raw_input('Type the URL you would like to read from[Example: http://www.google.com/] :')
while b == 3:
    #get file name
    file1 = raw_input('Enter a file name for the downloaded code:')
    filepath = file1 + '.txt'
    if os.path.isfile(filepath):
        print 'File already exists'
        b = 3
    else:
        print 'Filename accepted'
        b = 4
file_path = filepath
#open file
FileWrite = open(file_path, 'a')
#access URL
filehandle = urllib.urlopen(URL)
#display source code
for lines in filehandle.readlines():
    FileWrite.write(lines)
    print lines
print 'The above has been saved in both a text and html file'
#close files
filehandle.close()
FileWrite.close()
It appears that the urlopen function is available in the urllib.request module, not in the urllib module as you're expecting.
Rule of thumb: if you're getting an AttributeError, that attribute is not present in the particular module.
EDIT: Thanks to AndiDog for pointing out that this solution is valid for Python 3.x and not applicable to Python 2.x!
The urlopen function is actually in the urllib2 module. Try import urllib2 and use urllib2.urlopen
I see that you are using Python 2, or at least intend to.
The urlopen helper function is available in both urllib and urllib2 in Python 2.
What you need to do is execute this script with the correct version of Python:
C:\Python26\python.exe yourscript.py
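If the script has to run under whichever interpreter happens to be picked up, one common workaround (a sketch, not something the answers above proposed) is to try the Python 3 location of urlopen first and fall back to the Python 2 one. It makes no network request; it only shows which module the name was imported from.

```python
import sys

try:
    # Python 3: urlopen lives in the urllib.request submodule
    from urllib.request import urlopen
except ImportError:
    # Python 2: fall back to urllib2, as suggested in the answers above
    from urllib2 import urlopen

# Report the interpreter's major version and where urlopen came from
print(sys.version_info[0], urlopen.__module__)
```

After this import, urlopen('http://www.google.com/').read() should work on either version, with the caveat that Python 3 returns bytes rather than a str.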
I'm new to Python 3 and could really use a little help. I have a txt file containing:
InstallPrompt=
DisplayLicense=
FinishMessage=
TargetName=D:\somewhere
FriendlyName=something
I have a Python script that, in the end, should change just two lines to:
TargetName=D:\new
FriendlyName=Big
Could anyone help me, please? I have searched for a solution, but I didn't find anything I could use. The text to be replaced can vary in length.
import fileinput
# inplace=True redirects print() into the file being read
for line in fileinput.FileInput("file", inplace=True):
    sline = line.strip().split("=")
    if sline[0].startswith("TargetName"):
        sline[1] = r"D:\new"  # value requested in the question
    elif sline[0].startswith("FriendlyName"):
        sline[1] = "Big"
    line = '='.join(sline)
    print(line)
A very simple solution for what you're doing:
#!/usr/bin/env python3
import re
import sys
with open(sys.argv[1], 'r') as f:
    for line in f:
        # .+ does not match the newline, so each line keeps its own line ending
        line = re.sub(r'TargetName=.+', r'TargetName=D:\\new', line)
        line = re.sub(r'FriendlyName=.+', r'FriendlyName=Big', line)
        print(line, end='')
You would invoke this from the command line as ./test.py myfile.txt > output.txt
Writing to a temporary file and then renaming it is the best way to make sure you won't end up with a damaged file if something goes wrong:
import os
from tempfile import NamedTemporaryFile
fname = "lines.txt"
with open(fname) as fin, NamedTemporaryFile(dir='.', delete=False) as fout:
    for line in fin:
        if line.startswith("TargetName="):
            line = "TargetName=D:\\new\n"
        elif line.startswith("FriendlyName="):
            line = "FriendlyName=Big\n"
        fout.write(line.encode('utf8'))
# Swap the files only after both are closed; os.replace also overwrites an existing file on Windows
os.replace(fout.name, fname)
Is this a config (.ini) file you're trying to parse? The format looks suspiciously similar, except without a header section. You can use configparser, though it may add extra space around the "=" sign (i.e. "TargetName=D:\new" vs. "TargetName = D:\new"), but if those changes don't matter to you, using configparser is way easier and less error-prone than trying to parse it by hand every time.
txt (ini) file:
[section name]
FinishMessage=
TargetName=D:\something
FriendlyName=something
Code:
import sys
from configparser import ConfigParser
def main():
    # SafeConfigParser and readfp() were deprecated aliases, removed in Python 3.12
    cp = ConfigParser()
    cp.optionxform = str # Preserves case sensitivity
    with open(sys.argv[1], 'r') as f:
        cp.read_file(f)
    section = 'section name'
    options = {'TargetName': r'D:\new',
               'FriendlyName': 'Big'}
    for option, value in options.items():
        cp.set(section, option, value)
    with open(sys.argv[1], 'w') as f:
        cp.write(f)
if __name__ == '__main__':
    main()
txt (ini) file (after):
[section name]
FinishMessage =
TargetName = D:\new
FriendlyName = Big
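The configparser answer above mentions two drawbacks: the original file has no section header, and writing adds spaces around "=". A possible workaround, sketched here under my own assumptions (the update_headerless_ini helper and the dummy section name "root" are hypothetical, not part of the answer), is to prepend a temporary header in memory and write with space_around_delimiters=False.

```python
import configparser
import io


def update_headerless_ini(text, updates, section="root"):
    """Apply updates to key=value text that has no [section] header."""
    # interpolation=None keeps characters such as % in values untouched
    cp = configparser.ConfigParser(interpolation=None)
    cp.optionxform = str  # preserve the case of option names
    # Prepend a dummy header so configparser accepts the text
    cp.read_string("[" + section + "]\n" + text)
    for option, value in updates.items():
        cp.set(section, option, value)
    buf = io.StringIO()
    # Write "key=value" without spaces around the delimiter
    cp.write(buf, space_around_delimiters=False)
    # Drop the dummy header line again before returning
    lines = buf.getvalue().splitlines()
    return "\n".join(lines[1:]).strip() + "\n"


sample = (
    "InstallPrompt=\n"
    "DisplayLicense=\n"
    "FinishMessage=\n"
    "TargetName=D:\\somewhere\n"
    "FriendlyName=something\n"
)
result = update_headerless_ini(sample, {"TargetName": r"D:\new", "FriendlyName": "Big"})
print(result, end="")
```

The function works on strings only, so reading the real file and writing the result back (ideally via a temporary file) is left to the caller.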
The subs_names.py script works on both Python 2.6+ and Python 3.x:
#!/usr/bin/env python
from __future__ import print_function
import sys, fileinput
# new values go here
substitutions = dict(TargetName=r"D:\new", FriendlyName="Big")
inplace = '-i' in sys.argv # make substitutions in place
if inplace:
    sys.argv.remove('-i')
for line in fileinput.input(inplace=inplace):
    name, sep, value = line.partition("=")
    if name in substitutions:
        print(name, sep, substitutions[name], sep='')
    else:
        print(line, end='')
Example:
$ python3.1 subs_names.py input.txt
InstallPrompt=
DisplayLicense=
FinishMessage=
TargetName=D:\new
FriendlyName=Big
If you are satisfied with the output, add the -i parameter to make the changes in place:
$ python3.1 subs_names.py -i input.txt