Error using langdetect in python: "No features in text" - python

Hey, I have a CSV with multilingual text. All I want is to append a column with the detected language. So I coded it as below:
from langdetect import detect
import csv
with open('C:\\Users\\dell\\Downloads\\stdlang.csv') as csvinput:
    with open('C:\\Users\\dell\\Downloads\\stdlang.csv') as csvoutput:
        writer = csv.writer(csvoutput, lineterminator='\n')
        reader = csv.reader(csvinput)
        all = []
        row = next(reader)
        row.append('Lang')
        all.append(row)
        for row in reader:
            row.append(detect(row[0]))
            all.append(row)
        writer.writerows(all)
But I am getting the error as LangDetectException: No features in text
The traceback is as follows
runfile('C:/Users/dell/.spyder2-py3/temp.py', wdir='C:/Users/dell/.spyder2-py3')
Traceback (most recent call last):
File "<ipython-input-25-5f98f4f8be50>", line 1, in <module>
runfile('C:/Users/dell/.spyder2-py3/temp.py', wdir='C:/Users/dell/.spyder2-py3')
File "C:\Users\dell\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Users\dell\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 89, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/dell/.spyder2-py3/temp.py", line 21, in <module>
row.append(detect(row[0]))
File "C:\Users\dell\Anaconda3\lib\site-packages\langdetect\detector_factory.py", line 130, in detect
return detector.detect()
File "C:\Users\dell\Anaconda3\lib\site-packages\langdetect\detector.py", line 136, in detect
probabilities = self.get_probabilities()
File "C:\Users\dell\Anaconda3\lib\site-packages\langdetect\detector.py", line 143, in get_probabilities
self._detect_block()
File "C:\Users\dell\Anaconda3\lib\site-packages\langdetect\detector.py", line 150, in _detect_block
raise LangDetectException(ErrorCode.CantDetectError, 'No features in text.')
LangDetectException: No features in text.
This is what my CSV looks like:
1)skunkiest smokiest yummiest strain pain killer and mood lifter
2)Relaxation, euphorique, surélevée, somnolence, concentré, picotement, une augmentation de l’appétit, soulager la douleur Giggly, physique, esprit sédation
3)Reduzierte Angst, Ruhe, gehobener Stimmung, zerebrale Energie, Körper Sedierung
4)Calmante, relajante muscular, Relajación Mental, disminución de náuseas
5)重いフルーティーな幸せ非常に強力な頭石のバースト
Please help me with this.

You can use something like this to detect which line in your file is throwing the error:
for row in reader:
    try:
        language = detect(row[0])
    except:
        language = "error"
        print("This row throws an error:", row[0])
    row.append(language)
    all.append(row)
What you're going to see is that it probably fails at "重いフルーティーな幸せ非常に強力な頭石のバースト". My guess is that detect() isn't able to 'identify' any characters to analyze in that row, which is what the error implies.
Other things, like when the input is only a URL, also cause this error.

The error occurs when the text passed to detect() contains no letters. At least one letter must be present.
To reproduce, run any of the commands below:
detect('.')
detect(' ')
detect('5')
detect('/')
So you may want to apply some text pre-processing first to drop records in which the row[0] value is an empty string, a null value, whitespace, a number, a special character, or otherwise doesn't include any letters.
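The pre-processing described above could be sketched roughly as follows. The helper name has_letters and the sample rows are hypothetical, and the check only approximates the "at least one letter" condition; langdetect's internal feature extraction may treat some inputs differently.

```python
def has_letters(text):
    """Return True if text is a string with at least one alphabetic character."""
    return isinstance(text, str) and any(ch.isalpha() for ch in text)


# Hypothetical sample rows: one real sentence plus the failing inputs listed above.
rows = [["skunkiest smokiest yummiest strain"], ["."], [" "], ["5"], ["/"], [""]]

# Keep only the rows whose first column has something to analyze.
usable = [row for row in rows if has_letters(row[0])]
print(usable)
```

Rows that pass this filter can then be handed to detect(); keeping a try/except around the call is still a reasonable safety net.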

The problem is a null text or something like ' ' with no value.
Check for this in a condition, either in a list comprehension:
from langdetect import detect
textlang = [detect(elem) for elem in textlist if len(elem) > 50]
textlang = [detect(elem) if len(elem) > 50 else 'no' for elem in textlist]
or with a loop:
texl70 = df5['Titletext']
langdet = []
for i in range(len(df5)):
    try:
        lang=detect(texl70[i])
    except:
        lang='no'
        print("This row throws error:", texl70[i])
    langdet.append(lang)

The error occurs when the string has no letters. If you want to ignore that row and continue processing:
for i in df.index:
    text = df.iloc[i][1]
    try:
        lang = detect(text)
    except:
        continue

Related

Python readlines() function ignores line written by program

I want to read lines and check whether a specific number is in them, but when reading and printing the list of lines I can't get the first line, where my test string was written by the same program:
Code I use to write stuff to the file:
with open('db.txt', 'a') as f:
    f.write(f"Request's channel id from guild {guild.id}:{request_channel_id} \n")
and the code I'm using to read the file and check the lines is:
with open('db.txt', 'r') as f:
    index = 0
    for line in f:
        index += 1
        if str(message.guild.id) in line or str(message.channel.id) in line:
            break
    content = f.readlines()
    print(content)
    content = content[index]
    content.strip(":")
The second block of code returns [], an empty list, even though I opened the file and the line is there. But when I type directly into the file with my keyboard, the code "sees" the random stuff I wrote.
.txt file content:
Id do canal de request servidor 833434062248869899: 888273958263222332
a.a
all
a
a
a
a
Error:
['a.a\n', '\n', 'all\n', 'a\n', 'a\n', 'a\n', 'a']
Ignoring exception in on_message
Traceback (most recent call last):
File "C:\Users\CARVALHO\AppData\Local\Programs\Python\Python39\lib\site-packages\discord\client.py", line 343, in _run_event
await coro(*args, **kwargs)
File "C:\Users\CARVALHO\desktop\gabriel\codando\music_bot\main.py", line 48, in on_message
request_channel_id = int(content[1])
IndexError: string index out of range

How do I test to see if a line of a .txt exists or not, and then raise an exception?

I've been working on a function for hours and I just can't see the solution since this is my first time working with opening .txt files, etc.
My function will open a .txt file of 50 names, with the first line (a header) being NAMES. And then it will create a list of those names. I need to test the function so if the line 'NAMES' is not in the txt file it will raise an exception. I have been working on the 'NAMES' part for longer than I care to admit and can't see what I am doing wrong. Here is my code:
EOF = ''
def load_names(fName):
    global line, names
    print(end = f"Opening file {fName} ...")
    try:
        f = open(fName, 'r')
    except:
        raise Exception(f"OOPS! File {fName} not found")
    print(end = "reading...")
    line = f.readline().strip()
    while line.strip() != 'NAMES':
        line = f.readline().strip()
        while line != EOF and line.strip() != 'NAMES':
            raise Exception("!! Oops! Missing line 'NAMES' !!" )
    names = [] # To collect names from file
    line = f.readline().strip() # Read in first name
    while line != EOF:
        if line =='\n':
            print("!! Oops! Blank line not allowed !!")
        names.append(line.strip())
        line = f.readline()
    f.close()
    print(end = "closed.\n")
    return names
The 'blank line not allowed' check works when tested, but the way I have this code written now, even if I open a file that does have the 'NAMES' line in it, it still raises Exception("!! Oops! Missing line 'NAMES' !!"). I'm not sure how to do it, basically.
The files I am testing this with look like:
With NAMES -
NAMES
Mike
James
Anna
Without NAMES -
Mike
James
Anna
Your function seems much too complex, and the exception handling is wrong. For example, it will replace a "permission denied" exception with a generic Exception with an incorrect message saying the file wasn't found, even though that was not the reason for the error.
Instead, probably
avoid the use of global variables;
don't trap errors you can't meaningfully handle;
don't strip() more than once - save the stripped value so you can use it again;
simply read one line at a time, and check that it passes all your criteria.
def load_names(fName):
    with open(fName, 'r') as f:
        seen_names = False
        names = []
        for line in f:
            line = line.strip()
            if line == 'NAMES':
                seen_names = True
            elif line == '':
                raise Exception("!! Oops! Blank line not allowed !!")
            else:
                names.append(line)
    if not seen_names:
        raise Exception("!! Oops! Missing line 'NAMES' !!" )
    return names
If you actually want NAMES to be the first line in the file, it's not hard to change the code to require that instead. Maybe then it does make sense to read the first line separately before the loop.
def load_names(fName):
    with open(fName, 'r') as f:
        seen_names = False
        names = []
        for line in f:
            line = line.strip()
            if not seen_names:
                if line == 'NAMES':
                    seen_names = True
                else:
                    raise Exception("!! Oops! Missing line 'NAMES' !!" )
            elif line == '':
                raise Exception("!! Oops! Blank line not allowed !!")
            else:
                names.append(line)
    return names
Concretely, the bug in your original attempt is that the code continues to read lines until it gets one which doesn't contain NAMES, and then complains.
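For the variant where NAMES must be the first line, reading the header separately before the loop might look like the sketch below. It is one possible shape rather than the only way to do it; it assumes an empty file should also count as a missing header, which the question does not specify.

```python
def load_names(fName):
    with open(fName, 'r') as f:
        # next() with a default yields '' for an empty file
        # instead of raising StopIteration.
        header = next(f, '').strip()
        if header != 'NAMES':
            raise Exception("!! Oops! Missing line 'NAMES' !!")
        names = []
        for line in f:
            line = line.strip()
            if line == '':
                raise Exception("!! Oops! Blank line not allowed !!")
            names.append(line)
    return names
```

Checking the header once up front removes the need for a seen_names flag inside the loop.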

How can I print a single row element in Python DictReader

I've got a program to find DNA matches. We've been given a one-line text file in which to find the longest run of each STR (Short Tandem Repeat), and then match the result against a database supplied as a .csv file, as below:
> name,AGATC,TTTTTTCT,AATG,TCTAG,GATA,TATC,GAAA,TCTG
> Albus,15,49,38,5,14,44,14,12
> Draco,9,13,8,26,15,25,41,39
> Ginny,37,47,10,23,5,48,28,23
> Harry,46,49,48,29,15,5,28,40
After getting the longest-run counts (as integers), I'm trying to find a match in this CSV file and print the value in the first column (name) when I find the matching STR counts. I'm checking the program with print(), and it prints 'No match' until it reaches the matching name, then stops with the error below:
Traceback (most recent call last):
File "dna.py", line 37, in <module>
main()
File "dna.py", line 27, in main
print(row[0])
KeyError: 0
My program probably finds the match but can't print it and then exits. Could you please help me? TIA
def main():
    for i in SEQ.keys():
        SEQ[i] = find_longest_sequence(i)
    print(SEQ.items())
    with open(sys.argv[1],'r') as f:
        db = csv.DictReader(f)
        for row in db:
            if int(row['AGATC']) == SEQ['AGATC'] and int(row['AATG']) == SEQ['AATG'] and int(row['TATC']) == SEQ['TATC']:
                print(row[0])
            else:
                print("No match")
I used the data you provided and made a test.csv
name,AGATC,TTTTTTCT,AATG,TCTAG,GATA,TATC,GAAA,TCTG
Albus,15,49,38,5,14,44,14,12
Draco,9,13,8,26,15,25,41,39
Ginny,37,47,10,23,5,48,28,23
Harry,46,49,48,29,15,5,28,40
Then I tested with the following code (py csv-test.py test.csv):
import csv
import sys
with open(sys.argv[1], "r") as f:
    db = csv.DictReader(f)
    for row in db:
        if int(row['AGATC']) == 15:
            print(row['name'])
And the result was "Albus".
PS. With:
print(row[0])
I get the same KeyError as you do.
Am I missing something?
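The reason for the KeyError is that csv.DictReader yields each row as a dictionary keyed by the header names, so there is no key 0. A small sketch illustrates this, using an in-memory sample (a trimmed-down, hypothetical stand-in for test.csv):

```python
import csv
import io

# In-memory stand-in for the CSV file, with fewer columns for brevity.
sample = io.StringIO(
    "name,AGATC,AATG,TATC\n"
    "Albus,15,38,44\n"
    "Draco,9,8,25\n"
)
reader = csv.DictReader(sample)
first = next(reader)

# Rows are keyed by column header, not by position.
print(first['name'])

# The first column's value can still be reached without hard-coding 'name'.
print(first[reader.fieldnames[0]])

# Positional access fails because 0 is not one of the keys.
try:
    first[0]
except KeyError as exc:
    print("KeyError:", exc)
```

If positional access is really what is wanted, csv.reader returns each row as a list, where row[0] works.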

How am I supposed to loop big files in openpyxl? iter_rows gives errors

I have been building a web scraper in python/selenium/openpyxl as part of my job for the last two months. Considering it's my first time doing something like this, it has been relatively successful: it has produced results. But one thing that I have never been able to figure out how to do properly is how to loop through an .xlsx document.
I am working with files with 500k+ rows. I have split them up to 100k each though so that the size wouldn't become an issue.
So my problem is, when I loop through the document in this way:
wb = load_workbook(filePath, read_only=True)
ws = wb.active
while (currentRow <= docLength):
    adress = ws.cell(row=currentRow, column=1)
    car = ws.cell(row=currentRow, column=2)
    #Scrape info and append into other document
    currentRow += 1
It ends up consuming way too much memory, and the script becomes much slower after only 100 rows...
But if I do this, I get ParseError after a few hundred rows every time! It's very frustrating, as I think this is the right way to do it.
wb = load_workbook(filePath, read_only=True)
ws = wb.active
try:
    for row in ws.iter_rows(min_row=1, max_col=10, max_row=docLength, values_only=True):
        for x, cell in enumerate(row):
            if x == 0:
                adress = cell
            if x == 1:
                car = cell
        #Scrape info and append into other document
        currentRow += 1
except xml.etree.ElementTree.ParseError:
    ???
The full exception error:
Traceback (most recent call last):
File "C:\python\scrape.py", line 263, in <module>
for row in ws.iter_rows(min_row=1, max_col=10, max_row=docLength, values_only=True):
File "C:\Users\x\AppData\Local\Programs\Python\Python38-32\lib\site-packages\openpyxl-3.0.3-py3.8.egg\openpyxl\worksheet\_read_only.py", line 78, in _cells_by_row
File "C:\Users\x\AppData\Local\Programs\Python\Python38-32\lib\site-packages\openpyxl-3.0.3-py3.8.egg\openpyxl\worksheet\_reader.py", line 142, in parse
File "C:\Users\x\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1227, in iterator
yield from pullparser.read_events()
File "C:\Users\x\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1302, in read_events
raise event
File "C:\Users\x\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1274, in feed
self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 769027
Why is this happening? :(
And why is it reporting such a high column number? Is it because of the XML formatting?

How to solve ' Exception has occurred: TypeError coercing to Unicode: need string or buffer, tuple found ' in Python

I am trying to calculate average goals by team from a dataset of matches and ran into the following error: 'Exception has occurred: TypeError
coercing to Unicode: need string or buffer, tuple found'. My code is:
matches = open('matches.csv', 'r')
data_read = csv.reader(matches, delimiter = ',')
matches = []
for row in data_read:
    matches.append((row[0], row[1], row[2], row[3]))
team=['Bandari','Chemelil','Gor Mahia','Kakamega Homeboyz','Kariobangi Sharks','Kenya CB',
      'Leopards','Mathare Utd.','Mount Kenya United', 'Nzoia Sugar','Posta Rangers','Sofapaka',
      'Sony Sugar','Tusker','Ulinzi Stars','Vihiga United', 'Western Stima', 'Zoo']
results=[]
for file in matches:
    avgs=[]
    **for object in team:**
        goalsscored=0
        with open(file) as f:
            reader=csv.DictReader(f)
            rows=[ row for row in reader if row['Home_Team']==object]
            for row in rows:
                for rows in row[HTgoals]:
                    goalsscored=goalsscored + int(row['HTgoals'])
        with open(file) as f:
            reader=csv.DictReader(f)
            rows2=[ row for row in reader if row['Away_Team']==object]
            for row in rows2:
                for rows2 in row['ATgoals']:
                    goalsscored=goalsscored + int(row['ATgoals'])
        kk=df.apply(pd.value_counts)
        avgs.append(goalsscored/kk)
    results.append(avgs)
The error I get, which pops up at the line enclosed in double asterisks, is
Exception has occurred: TypeError
coercing to Unicode: need string or buffer, tuple found
File "C:\Users\HP\PycharmProjects\betapp1\model_1.py", line 28, in <module>
with open(file) as f:
File "C:\Users\HP\Anaconda2\Lib\runpy.py", line 72, in _run_code
exec code in run_globals
File "C:\Users\HP\Anaconda2\Lib\runpy.py", line 82, in _run_module_code
mod_name, mod_fname, mod_loader, pkg_name)
File "C:\Users\HP\Anaconda2\Lib\runpy.py", line 252, in run_path
return _run_module_code(code, init_globals, run_name, path_name)
My dataset consists of 4 values per row: the home team, the away team, goals scored by the home team, and goals scored by the away team. An example is below:
Gor Mahia,Tusker,1,0
Mount Kenya United,Zoo,1,0
Sony Sugar,Western Stima,4,0
I expect the output to be a list with the average number of goals each team scores, but I'm not getting any output.
It looks like your error is here:
for file in matches:
    avgs=[]
    for object in team:
        goalsscored=0
        with open(file) as f: # error is here
            reader=csv.DictReader(f)
The open function is expecting file to be a string or file buffer, but it is a tuple that looks something like this (row[0], row[1], row[2], row[3]). If you specify the filenames that you are trying to open, then we can help further.
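Since the rows are already available from csv.reader, one possible direction is to accumulate the totals directly from those rows rather than calling open() on each tuple. The sketch below is written for Python 3 (the traceback shows Python 2, where / on integers truncates) and uses an in-memory stand-in for matches.csv built from the three sample rows in the question. It assumes "average" means goals scored divided by matches played, which the question does not state explicitly.

```python
import csv
import io
from collections import defaultdict

# Stand-in for open('matches.csv', 'r'); each row is
# home team, away team, home goals, away goals.
matches_file = io.StringIO(
    "Gor Mahia,Tusker,1,0\n"
    "Mount Kenya United,Zoo,1,0\n"
    "Sony Sugar,Western Stima,4,0\n"
)

goals = defaultdict(int)   # total goals scored per team
played = defaultdict(int)  # number of matches played per team

for home, away, home_goals, away_goals in csv.reader(matches_file):
    goals[home] += int(home_goals)
    goals[away] += int(away_goals)
    played[home] += 1
    played[away] += 1

# Assumed definition: average = goals scored / matches played.
averages = {team: goals[team] / played[team] for team in played}
print(averages)
```

Only one pass over the file is needed, and no filename other than matches.csv is involved.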
