Django __istartswith doesn't work. Python 2.7, Django 1.8 - python

I am working with Cyrillic text. There is a parser written in Python 2.7 which saves strings from another site into the database; they are saved as unicode, e.g. [u'\u041a\u043e\u043d\u0446\u0435\u0440\u0442\u044b'], with type:
<type 'lxml.etree._ElementUnicodeResult'>
In templates the text is shown normally (Russian in this case), but search doesn't work: Django filters (any of them) do not return results for a normal key, even though the relevant data is present. Is this correct behaviour? What is the right solution?
I tried a Cyrillic letter, and there should be results for this search. The type of the string is unicode, yet the filter finds nothing among the many results starting with that letter, so something is wrong, but I can't understand why. At the same time, English search keys give relevant results for English words, while Russian ones do not.
var = u'б'
print(type(var))
result = Event.objects.filter(title__istartswith=var)
EDIT: actually a simple exact filter works, but anything more complex doesn't.
EDIT: and title__contains also works, so I'd better rename the question to avoid getting beaten. Still, the behaviour is pretty strange: sometimes it returns nothing even though the key is the exact same title. Why? And in any case I still need the starts-with functionality, which doesn't work.
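A possible workaround, if the case-insensitive lookup itself is what misbehaves on your backend (SQLite, for instance, only performs case-insensitive LIKE for ASCII characters), is to build the prefix match out of two case-sensitive filters. A minimal sketch, assuming the Event model from the question:
# -*- coding: utf-8 -*-
from django.db.models import Q

var = u'б'
# Match titles starting with either the lower- or upper-case form;
# for a one-letter prefix this is equivalent to title__istartswith.
result = Event.objects.filter(
    Q(title__startswith=var.lower()) | Q(title__startswith=var.upper())
)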

Related

How to transform the emoji code to the unicode? [duplicate]

I'm trying to build a way to find emojis on Twitter and relate them to the Unicode table one can find at unicode.org, but I'm finding it hard to identify them because of what I think are encoding problems, or simply my misunderstanding of this topic. In short, I built a "library" of emojis from the table found at http://www.unicode.org/emoji/charts/full-emoji-list.html, which contains the title and the code point (code) of each emoji. I scraped this in R with the rvest library.
The problem comes when I grab the information from Twitter with the twitteR API in R: the codes for the emojis do not look at all like the ones in this table.
Let's take as an example the emoji for 100 (one hundred points, the red icon). It is number 1468 in the table linked above, and its code point is:
U+1F4AF
Now, when I grab it from Twitter, it is shown like this in the status class that the API has built in to work with tweets:
\xed��\xed��
Then I convert it to a dataframe, also with a built-in function from the twitteR API. For example:
tweet$toDataFrame()
The emoji becomes this:
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
I tried to convert it with the iconv function in R, with the following code:
iconv(tweet$text, from="UTF-8", to="ASCII", "byte")
and I only manage to make it look like this:
<ed><a0><bd><ed><b2><af>
So, wrapping up, at the end of my tests I got the following results:
<ed><a0><bd><ed><b2><af>
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
\xed��\xed��
None of which look like the code point specified by the table:
U+1F4AF
Is there any possibility to transform between the two strings?
What am I missing? Why is twitter returning this information for emojis?
I didn't know anything about encoding before, but after days of reading I think I know what is going on. I don't understand perfectly how the encoding for emoji works, but I stumbled upon the same problem and solved it.
You want to map \xed��\xed�� to its name-decoded version: hundred points. A sensible way is to scrape a dictionary online and use a key, such as the Unicode code point, to replace it. In this case it would be U+1F4AF.
The conversions you show are not different encodings but different notation for the same encoded emoji:
as.data.frame(tweet) returns <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>.
iconv(tweet, from="UTF-8", to="ASCII", "byte") returns <ed><a0><bd><ed><b2><af>.
So using Unicode directly isn't feasible. Another way is to use a dictionary that already encodes emoji in the <ed>...<ed>... way, like the one here: emoji list. Voilà! Except her list is incomplete, because it comes from a dictionary that contains fewer emoticons.
The fast solution is to simply scrape a more complete dictionary and map each <ed>...<ed>... sequence to its corresponding English text translation. I have done that already and posted it here.
Still, the fact that nobody else had posted a list with the proper encoding bugged me. In fact, most dictionaries I found use not an <ed>...<ed>... representation but rather <f0>.... It turns out both represent the same Unicode code point U+1F4AF: <f0>... is the standard UTF-8 encoding of the code point, while <ed>...<ed>... encodes each UTF-16 surrogate separately; the bytes are just read differently.
Long answer. The tweet is read as UTF-16 and then converted to UTF-8, and here is where the conversions diverge: when the UTF-16 data is converted pair-of-bytes by pair-of-bytes (one code unit at a time), each surrogate gets encoded on its own and the result is the <ed>...<ed>... form; when the full four-byte surrogate pair is read as one character, the result is the standard UTF-8 <f0>... form.
So a slower (but more deliberate) way to solve your problem is to scrape the <f0>... dictionary, convert it to UTF-16, and convert it back to UTF-8 pair by pair; you'll end up with the two <ed>... units. These two <ed>... units are the high-low surrogate pair representation of the Unicode code point U+xxxxx.
As an example:
unicode <- 0x1F4Af
# Multibyte Version
intToUtf8(unicode)
# Byte-pair Version
hilo <- unicode2hilo(unicode)
intToUtf8(hilo)
Returns:
[1] "\xf0\u009f\u0092�"
[1] "\xed��\xed��"
Which, again, using iconv(..., 'utf-8', 'latin1', 'byte'), is the same as:
[1] "<f0><9f><92><af>"
[1] "<ed><a0><bd><ed><b2><af>"
PS1.:
The function unicode2hilo is a simple linear transformation from a code point to its hi-lo surrogate pair, and hilo2unicode is the inverse:
unicode2hilo <- function(unicode) {
    # Split a code point above U+FFFF into its high and low surrogates.
    hi <- floor((unicode - 0x10000) / 0x400) + 0xd800
    lo <- (unicode - 0x10000) + 0xdc00 - (hi - 0xd800) * 0x400
    hilo <- paste('0x', as.hexmode(c(hi, lo)), sep = '')
    return(hilo)
}
hilo2unicode <- function(hi, lo) {
    # Recombine a surrogate pair into the original code point.
    unicode <- (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
    unicode <- paste('0x', as.hexmode(unicode), sep = '')
    return(unicode)
}
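As a sanity check outside R, the same hi-lo arithmetic can be reproduced in a few lines of Python 3 (just an illustration; the printed byte values match the <f0>... and <ed>...<ed>... notations above):
cp = 0x1F4AF
# Split the code point into its UTF-16 surrogate pair, mirroring unicode2hilo.
hi = 0xD800 + ((cp - 0x10000) >> 10)
lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)
print(hex(hi), hex(lo))                       # 0xd83d 0xdcaf
# Standard UTF-8: one four-byte sequence for the whole code point.
print(chr(cp).encode('utf-8'))                # b'\xf0\x9f\x92\xaf'
# Surrogate-by-surrogate encoding: two three-byte sequences.
pair = chr(hi) + chr(lo)
print(pair.encode('utf-8', 'surrogatepass'))  # b'\xed\xa0\xbd\xed\xb2\xaf'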
PS2.:
I would recommend using iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve special characters like áäà.
PS3.:
To replace an emoji with its English text, tag, hash, or anything else you want to map it to, I would suggest using DFS over a graph of emojis, because some emojis' Unicode is the concatenation of other, simpler code points. For example, <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> is man cartwheeling, while independently <f0><9f><a4><b8> is person cartwheeling, <e2><80><8d> is an invisible joiner, <e2><99><82> is the male sign, and <ef><b8><8f> is an invisible variation selector. And while "man cartwheeling" and "person cartwheeling" plus "male sign" are obviously semantically related, I prefer the more faithful translation.
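A lighter-weight alternative to a full DFS, sketched here in Python 3 with a hypothetical two-entry dictionary, is to sort the dictionary keys by length so that composed sequences match before their components:
# Hypothetical dictionary: a ZWJ sequence and its first component.
emoji_names = {
    u'\U0001F938\u200D\u2642\uFE0F': 'man cartwheeling',
    u'\U0001F938': 'person cartwheeling',
}

def decode(text):
    # Longest keys first, so 'man cartwheeling' wins over its parts.
    for emoji in sorted(emoji_names, key=len, reverse=True):
        text = text.replace(emoji, emoji_names[emoji])
    return text

print(decode(u'\U0001F938\u200D\u2642\uFE0F'))  # man cartwheeling
print(decode(u'\U0001F938'))                    # person cartwheeling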
The answer provided by Felipe Suárez Colmenares is excellent because it describes the mechanics of this issue, but I wanted to point you here, to a dictionary I made with the <ed> R encoding specifically for Twitter. I also have code showing how to go through tweets and identify prose versions of emojis. I thought this might be easier for people who stumble onto this problem in the future. The dictionary is up to date with the most recent Unicode version (9), and once an even newer one comes out I'll update it then too.
Please try typing this: iconv(tweet$text, "latin1", "ASCII", sub="")
There is also a similar discussion here:
Emoticons in Twitter Sentiment Analysis in r
Regards,
Magda

Formatting text that is meant to be replaced

This is a rather generic question, but I have a text file that I want to edit using a script.
What are some ways to format text, so that it will visually stand out but still be recognized by my script?
It works fine when I use text_to_be_replaced, but it is hard to find when you have a large file.
I tried searching, and it seems the common ways are:
%text_to_be_replaced%
<text_to_be_replaced>
$(text_to_be_replaced)
But maybe there is a commonly used/widely accepted way to format text for visibility?
The language the script is written in is Python, if that matters... but I'm looking for a more-or-less generic solution which will work 90% of the time.
I'm not aware of any generic standard here, but if it's meant to be replaced, you can use the new string formatting method as follows:
string = 'some text {add_text_here} some more text'
Then to replace it when you need to:
value = 'formatted'
string = string.format(add_text_here=value)
Now print it out:
>>> string
'some text formatted some more text'
In fact, this is quite neat, as the addition of curly {brackets} around the text that needs to be replaced also makes it stand out a little.
At first I thought that {{curly braces}} would be fine, but then I went with $ALLCAPS.
First of all, caps really stand out, while lowercase may be confused with the rest of the code.
And while it $REALLYSTANDSOUT, it shouldn't cause any problems, since it's just a "bookmark" in a text file and will be replaced with the appropriate content determined by the script.
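For what it's worth, Python's standard library supports exactly this $ALLCAPS convention through string.Template, so the bookmarks stay readable and machine-replaceable at the same time. A minimal sketch (USERNAME and ORDERID are made-up placeholder names):
from string import Template

text = 'Dear $USERNAME, your order $ORDERID has shipped.'
# $ALLCAPS bookmarks are valid Template placeholders as-is.
filled = Template(text).substitute(USERNAME='Alice', ORDERID='12345')
print(filled)  # Dear Alice, your order 12345 has shipped.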

Python - non-English characters don't work in one case

Despite trying to find a solution to my problem on both English-language and native-language sites, I was unable to find one.
I'm querying an online dictionary to get translated words; however, non-English characters are displayed as escape codes, e.g. \x86 or \x84. Yet if I just do print(the_same_non-english_character), the letter is displayed in its proper form. I use Python 3.3.2, and the HTML source of the site I extract the words from has charset=UTF-8 set.
Moreover, if I use e.g. replace("x86", "non-english_character"), nothing gets replaced, though replacing normal characters works.
You need to escape it with a \. In the string, "\x86" denotes a single character produced by an escape sequence, so replacing the three literal characters "x86" matches nothing:
In [1]: s= "\x86"
In [2]: s.replace("\x86","non-english_character")
Out[2]: 'non-english_character'

Python, not understanding double quotes with nothing inside

got a bit of a noob question.
I'm trying to get Metagoofil working, because it keeps saying "error downloading webpage", etc.
A google search found that I can change a bit of the code in one of the config files and it will work properly again.
I'm having a problem though: this seems to be the code I want to use.
self.url = url.replace("/url?q=", "", 1).split("&amp")[0]
BUT, it doesn't seem to like me having those two quotation marks together with nothing in between (based on the syntax highlighting). When they are as above, it highlights everything from .split(" onward as if that were the string.
My question is: how can I write the double quotation marks together with nothing in the middle and have them register as their own string, so that the highlighting doesn't swallow the .split(" part?
The literal "" is a valid string with a length of zero. I guess your syntax highlighter is not working properly.

Passing JSON strings larger than 80 characters

I'm having a problem passing strings that exceed 80 characters in JSON. When I pass a string that's exactly 80 characters long it works like magic, but once I add an 81st letter it craps out. I've tried looking at the JSON object in Firebug, and it seems to think the string is an array, because it has an expander next to it; clicking the expander, though, does nothing. I've searched online for caps on JSON string sizes and workarounds but am coming up empty :(. Anybody know anything about this?
edit:
It actually doesn't matter what the string is... using "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz" yields the same results.
Here's my code: (I'm using python)
result = {"test": "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"}
self.response.out.write(simplejson.dumps(result))
would you happen to know the class that encodes strings properly for python? Thanks so much :)
What is the 81st character? It sounds like the string isn't properly escaped, making the JSON decoder think it is an array. If you could post the string here, or at least the 20 or so characters around position 80, I could probably tell you what is wrong. Also, tell us how the JSON string was made: in most languages you can get a class that will make proper JSON strings out of objects and arrays. For example, PHP has json_encode().
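For what it's worth, string length by itself is not a limit in JSON; if the Python side encodes with json/simplejson, strings far longer than 80 characters round-trip fine. A quick sanity check:
import json  # simplejson has the same interface

result = {"test": "abcdefghijklmnopqrstuvwxyz" * 5}  # 130 characters
encoded = json.dumps(result)  # dumps() escapes the string properly

# Decodes back to a string, not an array, regardless of length.
assert json.loads(encoded)["test"] == result["test"]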
