Searching a set of keywords in Russian in Python

I am having trouble importing a set of Russian keywords into code I'm writing to extract and count those keywords in a corpus of historical texts that I'm working on.
My code looks like this:
f = open('keyword_rayoni.txt', 'r', 'utf-8')
allKeywords = f.read().lower().split("\n")
f.close()
print(allKeywords)
I get a TypeError: an integer is required (got type str)
I used the same code on an English set of keywords and it worked. I also tried to set the encoding for the Russian keywords to UTF-8, but it didn't solve the problem. Could you please help?

You are calling the open function incorrectly. Run help(open) in a Python console to see its documentation. There you will see that the third positional argument is buffering, a different parameter which takes an int, but you are passing it the string 'utf-8'. Pass the encoding as a keyword argument instead:
f = open('blah.txt', 'r', encoding='utf-8')
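A minimal sketch of the corrected read, assuming the keyword file holds one keyword per line. The file name and the sample keywords are made up for illustration, and the sample file is written first so the snippet runs on its own.

```python
import os
import tempfile

# Hypothetical sample data standing in for keyword_rayoni.txt.
sample = "Район\nГуберния\nУезд\n"

path = os.path.join(tempfile.mkdtemp(), "keywords_sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# encoding is passed by keyword; the third positional argument is buffering.
with open(path, "r", encoding="utf-8") as f:
    # splitlines() avoids the trailing empty string that split("\n") leaves behind.
    all_keywords = f.read().lower().splitlines()

print(all_keywords)
```

Using a with block also closes the file automatically, so no explicit f.close() is needed.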

Related

ÙˆØµÙ„Ù‰ characters showing when writing the text obtained through web scraping into a csv file [duplicate]

I'm attempting to extract article information using the python newspaper3k package and then write to a CSV file. While the info is downloaded correctly, I'm having issues with the output to CSV. I don't think I fully understand unicode, despite my efforts to read about it.
from newspaper import Article, Source
import csv
first_article = Article(url="http://www.bloomberg.com/news/articles/2016-09-07/asian-stock-futures-deviate-as-s-p-500-ends-flat-crude-tops-46")
first_article.download()
if first_article.is_downloaded:
    first_article.parse()
    first_article.nlp()
article_array = []
collate = {}
collate['title'] = first_article.title
collate['content'] = first_article.text
collate['keywords'] = first_article.keywords
collate['url'] = first_article.url
collate['summary'] = first_article.summary
print(collate['content'])
article_array.append(collate)
keys = article_array[0].keys()
with open('bloombergtest.csv', 'w') as output_file:
    csv_writer = csv.DictWriter(output_file, keys)
    csv_writer.writeheader()
    csv_writer.writerows(article_array)
    output_file.close()
When I print collate['content'], which is first_article.text, the console outputs the article's content just fine. Everything shows up correctly, apostrophes and all. When I write to the CSV, the content cell text has odd characters in it. For example:
â€œAt the end of the day, Europeâ€™s economy isnâ€™t in great shape, inflation doesnâ€™t look exciting and there are a bunch of political risks to reckon with.
So far I have tried:
with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file:
to no avail. I also tried utf-16 instead of utf-8, but that just resulted in the cells being written in an odd order. It didn't create the cells correctly in the CSV, although the output looked correct. I've also tried .encode('utf-8') on various variables, but nothing has worked.
What's going on? Why would the console print the text correctly, while the CSV file has odd characters? How can I fix this?

Add encoding='utf-8-sig' to open(). Excel requires the UTF-8-encoded BOM code point (Byte Order Mark, U+FEFF) signature to interpret a file as UTF-8; otherwise, it assumes the default localized encoding.
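A minimal sketch of that fix, using a made-up row in place of the scraped article so that it runs without newspaper3k. The file name and row contents are assumptions for illustration only.

```python
import csv
import os
import tempfile

# Hypothetical row standing in for the scraped article data.
rows = [{"title": "Example", "content": "Europe\u2019s economy isn\u2019t in great shape"}]

path = os.path.join(tempfile.mkdtemp(), "example.csv")

# utf-8-sig writes the BOM bytes EF BB BF first;
# newline="" lets the csv module control the line endings itself.
with open(path, "w", encoding="utf-8-sig", newline="") as output_file:
    writer = csv.DictWriter(output_file, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# Read the file back as raw bytes to confirm the signature is present.
with open(path, "rb") as f:
    raw = f.read()

print(raw[:3])
```

The first three bytes of the file are the UTF-8 signature, which is what Excel looks for when deciding how to decode the file.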

Changing with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file: to with open('bloombergtest.csv', 'w', encoding='utf-8-sig') as output_file:, worked, as recommended by Leon and Mark Tolonen.

That's most probably a problem with the software that you use to open or print the CSV file - it doesn't "understand" that CSV is encoded in UTF-8 and assumes ASCII, latin-1, ISO-8859-1 or a similar encoding for it.
You can aid that software in recognizing the CSV file's encoding by placing a BOM sequence in the beginning of your file (which, in general, is not recommended for UTF-8).

Input a ASCII text file / Convert every character it to ASCII value in Python

I am a total beginner in programming, so please forgive my mistakes.
I am trying to create a Python program which takes an ASCII text file as input, converts every single character into its ASCII number value, and keeps only the odd results. Finally, I must visualize the output for each character, using * as bars with percentages (see picture).
I have managed to go this far,
f = open(r"c:\python\7_ASCII\Sample.txt", "r")
result = ' '.join((str(ord(x)) for x in f))
print(f)
which gives me the following error:
TypeError: ord() expected a character, but string of length 320 found
I've tried many methods, such as list comprehensions, but the error keeps appearing.
Any ideas?

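You are not reading the contents of your text file. Iterating over the file object f yields whole lines, not single characters, which is why ord() receives a string of length 320. To read the contents you have to use f.read():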
f = open(r"c:\python\7_ASCII\Sample.txt", "r")
my_text = f.read()
print(my_text)
With this you can iterate over your text and apply the logic. However, bear in mind that you can also open your files using with, which closes the file automatically:
with open(r"c:\python\7_ASCII\Sample.txt", "r") as reader:
    print(reader.read())
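Once the text is in a string, the rest of the task might look like the sketch below. The sample text stands in for the file contents, and the bar format (one star per 10 percent) is a guess, since the picture from the question is not available.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of Sample.txt.
my_text = "abcabca"

# ord() takes one character at a time, so iterate over the string itself.
codes = [ord(ch) for ch in my_text]

# Keep only the odd code values, as the question asks.
odd_codes = [code for code in codes if code % 2 == 1]

counts = Counter(odd_codes)
total = len(odd_codes)

for code, count in sorted(counts.items()):
    percent = 100 * count / total
    # One star per 10 percent is an arbitrary scale chosen for this sketch.
    bar = "*" * round(percent / 10)
    print(f"{chr(code)!r} ({code}): {bar} {percent:.1f}%")
```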

Receiving Data From Universal Robot and Decoding

I am working on a project where we are looking to get some data from a universal robot such as position and force data and then store that data in a text file for later reference. We can receive the data just fine, but turning it into readable coordinates is an issue. An example data string is below:
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\xbf\x00\x00\x80\xbf\x00\x00\x80\xbf\x00\x00\x80\xbf\x00\x00\x80\xbf\x00\x00\x80\xbf\x00\x00\xc0?\x00\x00\x16C\x00\x00\xc0?\x00\x00\x16C\x00\x00\x00?\xcd\xcc\xcc>\x00\x00\x96C\x00\x00\xc8A\x1e\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x88\xfb\x7f?\xd0M><\xc0G\x9e:tNT?\r\x11\x07\xbc\xb9\xfd\x7f?~\xa0\xa1:\x03\x02+?\x16\xeb\x7f\xbf#\xce\xcc\xbc9\xdfl\xbbq\xc3\x8a>i\x19T<\xf3\xf9\x7f\xbf\xb4k\x87\xbb->\xc2>\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80?\xdb\x0f\xc9#\xa7\xdcU#\xa7\xdcU#\xa7\xdcU#\xa7\xdcU#\xa7\xdcU#\xa7\xdcU#\xfe\xff\xff\xff\xfe\xff\xff\xff\xfe\xff\xff\xff\xfe\xff\xff\xff\xfe\xff\xff\xff\xff\xff\xff\xff\xecb\xc7#\xecb\xc7#\xecb\xc7#\
*not entire string received
At first I thought it was hex so I tried the code:
packet_12 = packet_12.encode('hex')
x = str(packet_12)
x = struct.unpack('!d', packet_12.decode('hex'))[0]
all_data.write("X=", x * 1000)
But to no avail. I tried several different decoding methods using codecs and .encode, but none worked. I found on a different post here the two code blocks below:
y = codecs.decode(packet_12, 'utf-8', errors='ignore')
packet_12 = s.recv(8)
z = str(packet_12)
x = ''.join('%02x' % ord(c) for c in packet_12)
Neither worked for my application. Finally I tried saving the entire string in a .txt file and opening it with Python and decoding it with the code below, but again nothing seemed to happen.
with io.open('C:\\Users\\myuser\\Desktop\\decode.txt', 'r', encoding='utf8') as f:
    text = f.read()
with io.open('C:\\Users\\myuser\\Desktop\\decode', 'w', encoding='utf8') as f:
    f.write(text)
I am aware I might be missing something incredibly simple, such as using the wrong decoding type, or I might even have gibberish as the robot output, but any help is appreciated.

The easiest way to receive data from the robot with python is to use Universal Robots' Real-Time-Data-Exchange Interface. They offer some python examples for receiving and sending data.
Check out my GitHub repo for an example code which is based on the official code from UR:
https://github.com/jonenfabian/Read_Data_From_Universal_Robots
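As a side note on why the text decoders in the question produce nothing readable: the sample looks like packed binary numbers rather than encoded text. The sketch below unpacks twelve bytes copied from the sample packet as little-endian 32-bit floats. That layout is only a guess from the sample, not the documented format of any Universal Robots interface, so the field widths, byte order, and offsets should be taken from the interface documentation.

```python
import struct

# Twelve consecutive bytes copied from the sample packet in the question.
chunk = b"\x00\x00\x80\xbf\x00\x00\xc0?\x00\x00\x16C"

# "<3f" means three little-endian 32-bit floats. This layout is an
# assumption based on the sample, not a documented packet format.
values = struct.unpack("<3f", chunk)

print(values)
```

Under that assumption the bytes decode to plain numbers, which is why struct, and not a text codec such as utf-8, is the right tool for this kind of data.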

Writing a string to CSV using line escapes in python 3

Working in Python 3.7.
I'm currently pulling data from an API (Qualys's API, fetching a report) to be specific. It returns a string with all the report data in a CSV format with each new line designated with a '\r\n' escape.
(i.e. 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n')
The problem I'm having is writing this string properly to a CSV file. Every iteration of code I've tried writes the data cell by cell when viewed in Excel, with the \r\n left wherever it was in the string and everything on one row rather than on new lines.
(i.e |foo|bar|stuff\r\n|more stuff|data|report\r\n|etc|etc|etc\r\n|)
I'm just making the switch from 2 to 3, so I'm almost positive it's a syntactical error or an error in my understanding of how Python 3 handles newline delimiters, or something along those lines. But even after reviewing documentation, answers here, and blog posts, I either can't get my head around it or I'm consistently missing something.
current code:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res)) #returns string
    #input('pause')
    f_csv = open(title,'w', newline='\r\n')
    f_csv.write(res)
    f_csv.close()
but i've also tried:
with open(title, 'w', newline='\r\n') as f:
    writer = csv.writer(f,<tried encoding here, no luck>)
    writer.writerows(res)
    #anyone else looking at this, this didn't work because of the difference
    #between writerow() and writerows()
and I've also tried various ways to declare newline, such as:
newline=''
newline='\n'
etc...
and various other iterations along these lines. Any suggestions or guidance or... anything at this point would be awesome.
edit:
Ok, I've continued to work on it, and this kinda works:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res)) #returns string
    reader = csv.reader(res.split(r'\r\n'), delimiter=',')
    with open(title, 'w') as outfile:
        writer = csv.writer(outfile, delimiter= '\n')
        writer.writerow(reader)
But it's ugly, and it does create errors in the output CSV (some rows, less than 1%, don't parse as a CSV row, probably because of a formatting error somewhere). More concerning, it behaves unpredictably when a "\" is present in the data.
I would really be interested in a solution that works better, is more Pythonic, and is more consistent.
Any ideas?

Based on your comments, the data you're being served doesn't actually include carriage returns or newlines, it includes the text representing the escapes for carriage returns and newlines (so it really has a backslash, r, backslash, n in the data). It's otherwise already in the form you want, so you don't need to involve the csv module at all, just interpret the escapes to their correct value, then write the data directly.
This is relatively simple using the unicode-escape codec (which also handles ASCII escapes):
import codecs # Needed for text->text decoding
# ... retrieve data here, store to res ...
# Converts backslash followed by r to carriage return, by n to newline,
# and so on for other escapes
decoded = codecs.decode(res, 'unicode-escape')
# newline='' means don't perform line ending conversions, so you keep \r\n
# on all systems, no adding, no removing characters
# You may want to explicitly specify an encoding like UTF-8, rather than
# relying on the system default, so your code is portable across locales
with open(title, 'w', newline='') as f:
    f.write(decoded)
If the strings you receive are actually wrapped in quotes (so print(repr(s)) includes quotes on either end), it's possible they're intended to be interpreted as JSON strings. In that case, just replace the import and creation of decoded with:
import json
decoded = json.loads(res)

If I understand your question correctly, can't you just replace the string?
with open(title, 'w') as f: f.write(res.replace("\r\n","\n"))

Check out this answer:
Python csv string to array
According to the csv module's documentation, the writer uses \r\n as its line terminator by default, and the reader recognizes either \r or \n as the end of a line. Your string should work fine with it. If you load the string into a csv.reader object (for example through io.StringIO), you can then write the rows back out with csv.writer.

Python strings use the single \n newline character. Normally, \r\n is converted to \n when a file is read, and on write \n is converted to \n or \r\n depending on your system default and the newline= parameter.
In your case, \r wasn't removed when you read the data from the web interface. When you opened the file with newline='\r\n', Python expanded the \n as it was supposed to, but the \r passed through, and now your newline is \r\r\n. You can see that by rereading the text file in binary mode:
>>> res = 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
>>> open('test', 'w', newline='\r\n').write(res)
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\r\n,more stuff,data,report\r\r\n,etc,etc,etc\r\r\n'
Since you already have the line endings you want, just write in binary mode and skip the conversions:
>>> open('test', 'wb').write(res.encode())
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
Note that res.encode() with no argument uses UTF-8 in Python 3. Pass an encoding explicitly if your data needs a different one.
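If you would rather stay in text mode, a minimal sketch of the same idea is to open the file with newline='' so that no translation happens on write. The temporary file path is an assumption for illustration, and res is the example string from above.

```python
import os
import tempfile

res = "foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n"

path = os.path.join(tempfile.mkdtemp(), "report.csv")

# newline="" disables newline translation on write, so "\r\n" is stored as-is.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(res)

# Reread in binary mode to inspect the exact bytes on disk.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
```

This keeps the explicit encoding on the open() call and avoids the doubled \r shown earlier.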

file_put_contents and iconv equivalents in Python?

What I want is extremely simple and can be done in PHP with literally one line of code:
file_put_contents('target.txt', iconv('windows-1252', 'utf-8', file_get_contents('source.txt')));
In Python I spent a whole day trying to figure out how to achieve the same trivial thing, to no avail. When I try to read or write files I usually get UnicodeDecodeError, "str has no method decode" and a dozen similar errors. It seems like I have scanned every thread on SO, but I still do not know how to do this.

Are you specifying the "encoding" keyword argument when you call open?
with open('source.txt', encoding='windows-1252') as f_in:
    with open('target.txt', 'w', encoding='utf-8') as f_out:
        f_out.write(f_in.read())

Since Python 3.5 you can write:
from pathlib import Path
Path('target.txt').write_text(
    Path('source.txt').read_text(encoding='windows-1252'),
    encoding='utf-8'
)
