NLTK in xls files - Python

There is no problem accessing the file, but while reading it I get the following error:
from nltk.corpus.reader import WordListCorpusReader
reader= WordListCorpusReader("C:\\Users\samet\\nltk_data\\corpora\\bilgi\samet",
["politika.xls"])
a = reader.words()
print (a)

You'll want to make sure the file you're trying to load (politika.xls) is saved with utf-8 encoding. First I'll detail how I replicated your error, then I'll show an approach to solve it.
I was able to replicate your error as follows:
1. Create a new text document, "temp.txt".
2. Open it, add a few lines of random text, then save and close it.
3. Rename "temp.txt" to "temp.xls".
4. Open "temp.xls".
5. Save as... "temp.xlsx".
6. Close the file.
7. Rename "temp.xlsx" to "politika.xls".
8. Try running your code (with the path corrected).
9. Receive your error: "UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte"
There may be a more straightforward approach, but from the above error condition, this worked to fix it:
1. Create a backup copy of "politika.xls".
2. Rename "politika.xls" to "old_politika.xls".
3. Create a new text file, "politika.txt". (Steps 3.1 - 3.4 may or may not be needed.)
   3.1. Open "politika.txt".
   3.2. Save as...
   3.3. Select Encoding >> (either ANSI or UTF-8 should work).
   3.4. Save and close the file.
4. Rename "politika.txt" to "politika.csv".
5. Open up "old_politika.xls".
6. Select and copy the data.
7. Open up "politika.csv".
8. Paste the data. Save and exit.
9. Rename "politika.csv" to "politika.xls".
10. Run your program. (See below for code / potential correction.)
Also, you'll want to fix your directory path. Make sure you use the escape character "\" before each "\" in the path. You were missing a "\" in front of "\samet" in 2 places. Corrected code below:
from nltk.corpus.reader import WordListCorpusReader
reader= WordListCorpusReader("C:\\Users\\samet\\nltk_data\\corpora\\bilgi\\samet",
["politika.xls"])
a = reader.words()
print (a)
I hope this helps.
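If the word list is really plain text saved in a legacy Windows encoding, the copy-and-paste steps could probably be replaced by re-encoding the file in Python. This is only a minimal sketch: it assumes the file is text (not a binary Excel workbook), and it guesses cp1254 (Windows Turkish) as the original encoding. The helper names reencode_to_utf8 and is_valid_utf8 are made up for illustration.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the whole file decodes cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def reencode_to_utf8(src, dst, source_encoding="cp1254"):
    """Decode src with source_encoding and write the same text to dst as UTF-8.

    source_encoding is an assumption; replace it with the encoding the
    file was actually saved in.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
    return text
```

Calling reencode_to_utf8("old_politika.xls", "politika.xls") would then stand in for the manual copy and paste, provided the old file really is plain text in that encoding. is_valid_utf8 gives a quick check before handing the file to the corpus reader.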

Related

What is the best way to read a JSON file and obtain the values without the invisible characters in Python?

I have a simple JSON file that I was supposed to use as a configuration file, it contains the default directories for whoever is running the script using their MacBooks:
{
"main_sheet_path": "/Users/jammer/Documents/Studios⁩/⁨CAT/⁨000-WeeklyReports⁩/2020/",
"reference_sheet_path": "/Users/jammer/Documents/DownloadedFiles/"
}
I read the JSON file and obtain the values using this code:
import json

with open('reportconfig.json','r') as j:
    config_data = json.load(j)
main_sheet_path = str(config_data.get('main_sheet_path'))
reference_sheet_path = str(config_data.get('reference_sheet_path'))
I use the path to check for a source file's existence before doing anything with it:
source_file = 'source.xlsx'
source_file = main_sheet_path + filename
if not os.path.isfile(source_file):
    print('ERROR: Source file \'' + source_file + '\' NOT FOUND!')
    return
Note that the filename is inputted as a parameter when the script is run (there are multiple files, the script has to know which one to target).
The file is there for sure, but the script never seems to "see" it, so I get the "ERROR" that I print in the above code. Why do I think there are invisible characters? Because when I copy and paste what was printed in the "error" notice into the terminal, the last few characters of the file name always get replaced by some invisible characters, and hitting backspace erases characters where the cursor isn't supposed to be.
How do I know for sure that the file is there and that my problem is with reading the JSON file and not in the Directory names or anywhere else in the code? Because I finally gave up on using a JSON config file and went with a configuration file like this instead:
#!/usr/local/bin/python3.7
# -*- coding: utf-8 -*-
file_paths = { "main_sheet_path": "/Users/jammer/Documents/Studios⁩/⁨CAT/⁨000-WeeklyReports⁩/2020/",
"reference_sheet_path": "/Users/jammer/Documents/DownloadedFiles/"
}
I then just import the file and obtain the values like this:
import reportconfig as cfg
main_sheet_path = cfg.file_paths['main_sheet_path']
reference_sheet_path = cfg.file_paths['reference_sheet_path']
...
This workaround works perfectly — I don't get the "error" that the file isn't there when it is and the rest of the script is executed as expected. When the file isn't there, I get the proper "error" I expect and copying-and-pasting the full path and filename from the "error message" gives me the complete file name and hitting the backspace erases the right characters (no funny behavior, no invisible characters).
But could anyone please tell me how to read the JSON file properly without getting those pesky invisible characters? I've spent hours trying to figure it out, including searching seemingly related questions on Stack Overflow, but couldn't find the answer. TIA!
I think there is just a typo in this code:
source_file = 'source.xlsx'
source_file = main_sheet_path + filename
Maybe filename is set to some other file that is not present, which would give you this error.
Try setting filename = 'source.xlsx'.
Maybe that will help.
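Separately from the filename question, if the values in the JSON file do contain invisible Unicode formatting characters (for example the isolate marks U+2068 and U+2069), one option is to strip them after loading. A minimal sketch, assuming a flat JSON object like the one in the question; the function names strip_format_chars and load_clean_config are hypothetical. It removes every character in Unicode category Cf, which also covers things like zero-width joiners, so it may be too aggressive for paths in scripts that rely on them.

```python
import json
import unicodedata


def strip_format_chars(text):
    """Remove Unicode 'format' characters (category Cf) such as U+2068/U+2069."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def load_clean_config(path):
    """Load a flat JSON object and strip invisible characters from string values."""
    with open(path, "r", encoding="utf-8") as fh:
        data = json.load(fh)
    return {
        key: strip_format_chars(value) if isinstance(value, str) else value
        for key, value in data.items()
    }
```

With this, main_sheet_path = load_clean_config('reportconfig.json')['main_sheet_path'] should come back without the hidden characters, whatever their origin.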

unable to read file from external location in python

I am trying to read a txt file (kept in another location) in Python, but I am getting an error.
FileNotFoundError
in ()
----> 1 employeeFile=open("C:‪/Users/xxxxxxxx/Desktop/python/files/employee.txt","r")
2 print(employeeFile.read())
3 employeeFile.close()
FileNotFoundError: [Errno 2] No such file or directory: 'C:\u202a/Users/xxxxxxxx/Desktop/python/files/employee.txt'
Code used:
employeeFile=open("C:‪/Users/xxxxxxxx/Desktop/python/files/employee.txt","r")
print(employeeFile.read())
employeeFile.close()
I tried using both forward slashes (/) and backslashes (\), but I get the same error. Please let me know what is missing in the code.
I'm guessing you copy and pasted from a Windows property pane, switching backslashes to forward slashes manually. Problem is, the properties dialog shoves a Unicode LEFT-TO-RIGHT EMBEDDING character into the path so the display is consistent, even in locales with right-to-left languages (e.g. Arabic, Hebrew).
You can read more about this on Raymond Chen's blog, The Old New Thing. The solution is to delete that invisible character from your path string. Selecting everything from the initial " to the first forward slash, deleting it, then retyping "C:/, should do the trick.
As your error message suggests, there's a weird character between the colon and the forward slash (C:[some character]/). Other than that the code is fine.
employeeFile = open("C:/Users/xxxxxxxx/Desktop/python/files/employee.txt", "r")
You can copy and paste this line and use it.
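To see this kind of character before it causes a FileNotFoundError, it can help to print the path with ascii() or to list the formatting characters it contains. A small sketch; find_invisible is a made-up helper name, and the U+202A is written as an escape so that it stays visible in the source.

```python
import unicodedata


def find_invisible(text):
    """Return (index, code point, name) for every format character in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


# The path as pasted, with the hidden character spelled out as an escape.
pasted = "C:\u202a/Users/xxxxxxxx/Desktop/python/files/employee.txt"

print(ascii(pasted))           # non-ASCII characters are shown as escapes
print(find_invisible(pasted))  # position and name of each hidden character

# Dropping the character gives the path that was intended.
cleaned = pasted.replace("\u202a", "")
```

Retyping the start of the path by hand, as suggested above, has the same effect as the replace() call.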

Jupyter Notebook encoding error?

I started to learn pandas by following this tutorial:
https://github.com/jvns/pandas-cookbook
Right in the first chapter I try very elementary example of reading a csv file. The example goes like this:
import pandas as pd
broken_df = pd.read_csv("..\data\bikes.csv")
I get a lengthy error message, which ends with a line:
FileNotFoundError: File b'..\\data\x08ikes.csv' does not exist
So although I write 'bikes.csv', which I have in the correct folder, the program seems to be searching for a file called 'x08ikes.csv'. Could this be an encoding error? sys.getdefaultencoding() returns 'utf-8'.
I am using Anaconda3 for 64bit Windows, version 4.4.0. My browser is Brave. Any ideas what is going wrong here?
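The backslash character '\' has special meaning in Python string literals: it "escapes" the character that follows. Here '\b' is the escape sequence for backspace (\x08), which is why the error message shows '\x08ikes.csv'; '\d' is not a recognized escape, so it was left alone. There are three ways around this: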
Escape the escapes:
You can use the backslash to escape the next backslash, telling Python "this is just another character"
broken_df = pd.read_csv("..\\data\\bikes.csv")
Use a raw string:
Placing r at the beginning of a string tells Python to interpret everything in the string as-is
broken_df = pd.read_csv(r"..\data\bikes.csv")
Use forward slashes:
This is specific to file paths. You can write the path to your file using forward slashes instead of backslashes.
broken_df = pd.read_csv("../data/bikes.csv")
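The three spellings above can be checked without pandas. This is just an illustrative sketch using the path from the question; PureWindowsPath is used only to compare paths, so it runs on any operating system.

```python
from pathlib import PureWindowsPath

# "\b" is a single backspace character (code point 8), not a backslash plus "b".
assert "\b" == "\x08" and len("\b") == 1

escaped = "..\\data\\bikes.csv"  # doubled backslashes
raw = r"..\data\bikes.csv"       # raw string literal
forward = "../data/bikes.csv"    # forward slashes

# The first two spellings produce exactly the same string.
assert escaped == raw

# What the original literal turned into: "\b" collapsed into one character.
broken = "..\\data" + "\b" + "ikes.csv"
print(repr(broken))  # the backspace shows up as \x08, as in the error message

# Windows paths accept either separator, so all three name the same file.
assert PureWindowsPath(raw) == PureWindowsPath(forward)
```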
Another thing you can try is to upload bikes.csv into the Jupyter Home "Files" tab. Open it, and you may still see the message. Then go to File->New to get a new blank file. Open the original bikes.csv in Notepad, then copy and paste the content into the new file in Jupyter Notebook. This may help to resolve it.
Then you can run the following code.
import pandas as pd
broken_df = pd.read_csv(r"..\data\bikes.csv")

How do I fix an error in my python code where it is complaining about a string being wanted?

I was actually using code from a course at Udacity.com on Data Wrangling. The code file is very short so I was able to copy what they did and I still get an error. They use python 2.7.x. The course is about a year old, so maybe something about the functions or modules in the 2.7 branch has changed. I mean the code used by the instructors works.
I know that using the csv module or function would solve the issue but they want to demonstrate the use of a custom parse function. In addition, they are using the enumerate function. Here is the link to the gist.
This should be very simple and basic and that is why it is frustrating me. I know they are reading the file, which is a csv file, as binary, with the "rb" parameter to the line
with open("file.csv", "rb") as f:
You don't have matching characters in your csv file and the dictionaries in your test function. In particular, in your csv file you are using an em dash (U+2014) and in your firstline and tenthline dictionaries you are using a hyphen-minus (U+002D).
hex(ord(d[0]['US Chart Position'].decode('utf-8')))
'0x2014' # output: code point for the em dash character in csv file
hex(ord(firstline['US Chart Position']))
'0x2d' # output: code point for hyphen-minus
To fix it, just copy and paste the — character from the csv in your gist into the dictionaries in your source code to replace the - characters.
Make sure to include this comment at the top of your file:
# -*- coding: utf-8 -*-
This will ensure that Python knows to expect non-ascii characters in the source code.
Alternatively, you could replace all the — (em dash) characters in the csv file with hyphens:
sed 's/—/-/g' beatles-diskography.csv > beatles-diskography2.csv
Then, remember to use the new file name in your source code.
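The same replacement can be done in Python instead of sed. The sketch below is written for Python 3 (the course in the question uses 2.7), and the names normalize_dashes and code_points are hypothetical helpers, not part of the course code.

```python
EM_DASH = "\u2014"       # the character used in the csv file
HYPHEN_MINUS = "\u002d"  # the ordinary "-" typed on a keyboard


def normalize_dashes(text):
    """Replace every em dash with a hyphen-minus."""
    return text.replace(EM_DASH, HYPHEN_MINUS)


def code_points(text):
    """Return the code point of each character, to spot look-alike characters."""
    return [hex(ord(ch)) for ch in text]
```

Applying normalize_dashes to each value before comparing it with the expected dictionaries avoids editing either the csv file or the test data, and code_points makes it easy to confirm which dash a given string really contains.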

Python - ShiftJIS errors in DOS

I have csv files that I need to edit in Python and that have to remain in Shift-JIS. When I test my code by entering each section into the Python interpreter, the files get edited fine and they stay in Shift-JIS. I run the following lines in the Python interpreter:
import sys, codecs
reload(sys)
sys.setdefaultencoding('shift_jis')
I put these lines in a script and run them from the DOS prompt and of course the shift-JIS characters I add get messed up. If I run chcp at the DOS prompt, it tells me that I'm running chcp 932, shift-JIS. Does anybody know what's not working?
In case anyone needs to know, this is the remedy:
In this case Python was using Unicode when I needed Shift-JIS. What worked for me was declaring the strings as Unicode, encoding them as Shift-JIS, and then writing them to the file. This worked every time.
For example:
name = u"テスト "
newstring = name + other_string_data
newstring = newstring.encode('shift_jis')
The string then gets encoded as Shift-JIS and written. This isn't the most elegant way to do it, but I hope it helps somebody. It took me about 2 hours to figure out.
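For anyone on Python 3, where sys.setdefaultencoding no longer exists, a roughly equivalent approach is to let open() do the encoding. This is a sketch under that assumption, not the code from the answer above; append_shift_jis and read_shift_jis are made-up names.

```python
def append_shift_jis(path, text):
    """Append text to path, encoding it as Shift-JIS as it is written."""
    # newline="" keeps line endings exactly as given, which suits csv data.
    with open(path, "a", encoding="shift_jis", newline="") as fh:
        fh.write(text)


def read_shift_jis(path):
    """Read the whole file back, decoding it from Shift-JIS."""
    with open(path, "r", encoding="shift_jis", newline="") as fh:
        return fh.read()
```

Because the encoding is named explicitly on each open() call, the result does not depend on the console code page or on how the script is launched.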
