Sphinx and Python, multilanguage search

I have a table in UTF-8 called tr_transcriptions. It stores dozens of records whose 'text' field is in one of several languages; 34 languages are supported at the moment:
Afrikaans, Korean, Arabic, Malayalam, Bahasa, Mandarin, Bahasama, Norwegian, Croatian,
Polish, Czech, Portuguese, Danish, Romanian, Dutch, Russian, English, Slovak, Flemish,
Spanish, French, Swedish, German, Tagalog, Greek, Tamil, Hindi, Telugu, Hungarian, Thai, Italian, Turkish, Kannada, Vietnamese
I want to give users the ability to search through this table, but I have a few issues with that: I can't get it working with Sphinx. Here's my config file:
source transcriptions
{
    type          = mysql

    sql_host      = localhost
    sql_user      = root
    sql_pass      = pass
    sql_db        = transcriptions
    sql_port      = 3306  # optional, default is 3306

    sql_query_pre = SET NAMES utf8
    sql_query_pre = SET CHARACTER_SET_RESULTS=utf8

    sql_query     = \
        SELECT * \
        FROM tr_transcriptions
}
index tr_transcriptions
{
    source        = transcriptions
    charset_type  = utf-8
    charset_table = U+00C0->a, U+00C1->a, U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00E0->a, U+00E1->a,\
U+00E2->a, U+00E3->a, U+00E4->a, U+00E5->a, U+0100->a, U+0101->a, U+0102->a, U+0103->a,\
U+010300->a, U+0104->a, U+0105->a, U+01CD->a, U+01CE->a, U+01DE->a, U+01DF->a, U+01E0->a,\
U+01E1->a, U+01FA->a, U+01FB->a, U+0200->a, U+0201->a, U+0202->a, U+0203->a, U+0226->a,\
U+0227->a, U+023A->a, U+0250->a, U+04D0->a, U+04D1->a, U+1D2C->a, U+1D43->a, U+1D44->a,\
U+1D8F->a, U+1E00->a, U+1E01->a, U+1E9A->a, U+1EA0->a, U+1EA1->a, U+1EA2->a, U+1EA3->a,\
U+1EA4->a, U+1EA5->a, U+1EA6->a, U+1EA7->a, U+1EA8->a, U+1EA9->a, U+1EAA->a, U+1EAB->a,\
U+1EAC->a, U+1EAD->a, U+1EAE->a, U+1EAF->a, U+1EB0->a, U+1EB1->a, U+1EB2->a, U+1EB3->a,\
U+1EB4->a, U+1EB5->a, U+1EB6->a, U+1EB7->a, U+2090->a, U+2C65->a, U+0180->b, U+0181->b,\
U+0182->b, U+0183->b, U+0243->b, U+0253->b, U+0299->b, U+16D2->b, U+1D03->b, U+1D2E->b,\
U+1D2F->b, U+1D47->b, U+1D6C->b, U+1D80->b, U+1E02->b, U+1E03->b, U+1E04->b, U+1E05->b,\
U+1E06->b, U+1E07->b, U+00C7->c, U+00E7->c, U+0106->c, U+0107->c, U+0108->c, U+0109->c,\
U+010A->c, U+010B->c, U+010C->c, U+010D->c, U+0187->c, U+0188->c, U+023B->c, U+023C->c,\
U+0255->c, U+0297->c, U+1D9C->c, U+1D9D->c, U+1E08->c, U+1E09->c, U+212D->c, U+2184->c,\
U+010E->d, U+010F->d, U+0110->d, U+0111->d, U+0189->d, U+018A->d, U+018B->d, U+018C->d,\
U+01C5->d, U+01F2->d, U+0221->d, U+0256->d, U+0257->d, U+1D05->d, U+1D30->d, U+1D48->d,\
U+1D6D->d, U+1D81->d, U+1D91->d, U+1E0A->d, U+1E0B->d, U+1E0C->d, U+1E0D->d, U+1E0E->d,\
U+1E0F->d, U+1E10->d, U+1E11->d, U+1E12->d, U+1E13->d, U+00C8->e, U+00C9->e, U+00CA->e,\
U+00CB->e, U+00E8->e, U+00E9->e, U+00EA->e, U+00EB->e, U+0112->e, U+0113->e, U+0114->e,\
U+0115->e, U+0116->e, U+0117->e, U+0118->e, U+0119->e, U+011A->e, U+011B->e, U+018E->e,\
U+0190->e, U+01DD->e, U+0204->e, U+0205->e, U+0206->e, U+0207->e, U+0228->e, U+0229->e,\
U+0246->e, U+0247->e, U+0258->e, U+025B->e, U+025C->e, U+025D->e, U+025E->e, U+029A->e,\
U+1D07->e, U+1D08->e, U+1D31->e, U+1D32->e, U+1D49->e, U+1D4B->e, U+1D4C->e, U+1D92->e,\
U+1D93->e, U+1D94->e, U+1D9F->e, U+1E14->e, U+1E15->e, U+1E16->e, U+1E17->e, U+1E18->e,\
U+1E19->e, U+1E1A->e, U+1E1B->e, U+1E1C->e, U+1E1D->e, U+1EB8->e, U+1EB9->e, U+1EBA->e,\
U+1EBB->e, U+1EBC->e, U+1EBD->e, U+1EBE->e, U+1EBF->e, U+1EC0->e, U+1EC1->e, U+1EC2->e,\
U+1EC3->e, U+1EC4->e, U+1EC5->e, U+1EC6->e, U+1EC7->e, U+2091->e, U+0191->f, U+0192->f,\
U+1D6E->f, U+1D82->f, U+1DA0->f, U+1E1E->f, U+1E1F->f, U+011C->g, U+011D->g, U+011E->g,\
U+011F->g, U+0120->g, U+0121->g, U+0122->g, U+0123->g, U+0193->g, U+01E4->g, U+01E5->g,\
U+01E6->g, U+01E7->g, U+01F4->g, U+01F5->g, U+0260->g, U+0261->g, U+0262->g, U+029B->g,\
U+1D33->g, U+1D4D->g, U+1D77->g, U+1D79->g, U+1D83->g, U+1DA2->g, U+1E20->g, U+1E21->g,\
U+0124->h, U+0125->h, U+0126->h, U+0127->h, U+021E->h, U+021F->h, U+0265->h, U+0266->h,\
U+029C->h, U+02AE->h, U+02AF->h, U+02B0->h, U+02B1->h, U+1D34->h, U+1DA3->h, U+1E22->h,\
U+1E23->h, U+1E24->h, U+1E25->h, U+1E26->h, U+1E27->h, U+1E28->h, U+1E29->h, U+1E2A->h,\
U+1E2B->h, U+1E96->h, U+210C->h, U+2C67->h, U+2C68->h, U+2C75->h, U+2C76->h, U+00CC->i,\
U+00CD->i, U+00CE->i, U+00CF->i, U+00EC->i, U+00ED->i, U+00EE->i, U+00EF->i, U+010309->i,\
U+0128->i, U+0129->i, U+012A->i, U+012B->i, U+012C->i, U+012D->i, U+012E->i, U+012F->i,\
U+0130->i, U+0131->i, U+0197->i, U+01CF->i, U+01D0->i, U+0208->i, U+0209->i, U+020A->i,\
U+020B->i, U+0268->i, U+026A->i, U+040D->i, U+0418->i, U+0419->i, U+0438->i, U+0439->i,\
U+0456->i, U+1D09->i, U+1D35->i, U+1D4E->i, U+1D62->i, U+1D7B->i, U+1D96->i, U+1DA4->i,\
U+1DA6->i, U+1DA7->i, U+1E2C->i, U+1E2D->i, U+1E2E->i, U+1E2F->i, U+1EC8->i, U+1EC9->i,\
U+1ECA->i, U+1ECB->i, U+2071->i, U+2111->i, U+0134->j, U+0135->j, U+01C8->j, U+01CB->j,\
U+01F0->j, U+0237->j, U+0248->j, U+0249->j, U+025F->j, U+0284->j, U+029D->j, U+02B2->j,\
U+1D0A->j, U+1D36->j, U+1DA1->j, U+1DA8->j, U+0136->k, U+0137->k, U+0198->k, U+0199->k,\
U+01E8->k, U+01E9->k, U+029E->k, U+1D0B->k, U+1D37->k, U+1D4F->k, U+1D84->k, U+1E30->k,\
U+1E31->k, U+1E32->k, U+1E33->k, U+1E34->k, U+1E35->k, U+2C69->k, U+2C6A->k, U+0139->l,\
U+013A->l, U+013B->l, U+013C->l, U+013D->l, U+013E->l, U+013F->l, U+0140->l, U+0141->l,\
U+0142->l, U+019A->l, U+01C8->l, U+0234->l, U+023D->l, U+026B->l, U+026C->l, U+026D->l,\
U+029F->l, U+02E1->l, U+1D0C->l, U+1D38->l, U+1D85->l, U+1DA9->l, U+1DAA->l, U+1DAB->l,\
U+1E36->l, U+1E37->l, U+1E38->l, U+1E39->l, U+1E3A->l, U+1E3B->l, U+1E3C->l, U+1E3D->l,\
U+2C60->l, U+2C61->l, U+2C62->l, U+019C->m, U+026F->m, U+0270->m, U+0271->m, U+1D0D->m,\
U+1D1F->m, U+1D39->m, U+1D50->m, U+1D5A->m, U+1D6F->m, U+1D86->m, U+1DAC->m, U+1DAD->m,\
U+1E3E->m, U+1E3F->m, U+1E40->m, U+1E41->m, U+1E42->m, U+1E43->m, U+00D1->n, U+00F1->n,\
U+0143->n, U+0144->n, U+0145->n, U+0146->n, U+0147->n, U+0148->n, U+0149->n, U+019D->n,\
U+019E->n, U+01CB->n, U+01F8->n, U+01F9->n, U+0220->n, U+0235->n, U+0272->n, U+0273->n,\
U+0274->n, U+1D0E->n, U+1D3A->n, U+1D3B->n, U+1D70->n, U+1D87->n, U+1DAE->n, U+1DAF->n,\
U+1DB0->n, U+1E44->n, U+1E45->n, U+1E46->n, U+1E47->n, U+1E48->n, U+1E49->n, U+1E4A->n,\
U+1E4B->n, U+207F->n, U+00D2->o, U+00D3->o, U+00D4->o, U+00D5->o, U+00D6->o, U+00D8->o,\
U+00F2->o, U+00F3->o, U+00F4->o, U+00F5->o, U+00F6->o, U+00F8->o, U+01030F->o, U+014C->o,\
U+014D->o, U+014E->o, U+014F->o, U+0150->o, U+0151->o, U+0186->o, U+019F->o, U+01A0->o,\
U+01A1->o, U+01D1->o, U+01D2->o, U+01EA->o, U+01EB->o, U+01EC->o, U+01ED->o, U+01FE->o,\
U+01FF->o, U+020C->o, U+020D->o, U+020E->o, U+020F->o, U+022A->o, U+022B->o, U+022C->o,\
U+022D->o, U+022E->o, U+022F->o, U+0230->o, U+0231->o, U+0254->o, U+0275->o, U+043E->o,\
U+04E6->o, U+04E7->o, U+04E8->o, U+04E9->o, U+04EA->o, U+04EB->o, U+1D0F->o, U+1D10->o,\
U+1D11->o, U+1D12->o, U+1D13->o, U+1D16->o, U+1D17->o, U+1D3C->o, U+1D52->o, U+1D53->o,\
U+1D54->o, U+1D55->o, U+1D97->o, U+1DB1->o, U+1E4C->o, U+1E4D->o, U+1E4E->o, U+1E4F->o,\
U+1E50->o, U+1E51->o, U+1E52->o, U+1E53->o, U+1ECC->o, U+1ECD->o, U+1ECE->o, U+1ECF->o,\
U+1ED0->o, U+1ED1->o, U+1ED2->o, U+1ED3->o, U+1ED4->o, U+1ED5->o, U+1ED6->o, U+1ED7->o,\
U+1ED8->o, U+1ED9->o, U+1EDA->o, U+1EDB->o, U+1EDC->o, U+1EDD->o, U+1EDE->o, U+1EDF->o,\
U+1EE0->o, U+1EE1->o, U+1EE2->o, U+1EE3->o, U+2092->o, U+2C9E->o, U+2C9F->o, U+01A4->p,\
U+01A5->p, U+1D18->p, U+1D3E->p, U+1D56->p, U+1D71->p, U+1D7D->p, U+1D88->p, U+1E54->p,\
U+1E55->p, U+1E56->p, U+1E57->p, U+2C63->p, U+024A->q, U+024B->q, U+02A0->q, U+0154->r,\
U+0155->r, U+0156->r, U+0157->r, U+0158->r, U+0159->r, U+0210->r, U+0211->r, U+0212->r,\
U+0213->r, U+024C->r, U+024D->r, U+0279->r, U+027A->r, U+027B->r, U+027C->r, U+027D->r,\
U+027E->r, U+027F->r, U+0280->r, U+0281->r, U+02B3->r, U+02B4->r, U+02B5->r, U+02B6->r,\
U+1D19->r, U+1D1A->r, U+1D3F->r, U+1D63->r, U+1D72->r, U+1D73->r, U+1D89->r, U+1DCA->r,\
U+1E58->r, U+1E59->r, U+1E5A->r, U+1E5B->r, U+1E5C->r, U+1E5D->r, U+1E5E->r, U+1E5F->r,\
U+211C->r, U+2C64->r, U+00DF->s, U+015A->s, U+015B->s, U+015C->s, U+015D->s, U+015E->s,\
U+015F->s, U+0160->s, U+0161->s, U+017F->s, U+0218->s, U+0219->s, U+023F->s, U+0282->s,\
U+02E2->s, U+1D74->s, U+1D8A->s, U+1DB3->s, U+1E60->s, U+1E61->s, U+1E62->s, U+1E63->s,\
U+1E64->s, U+1E65->s, U+1E66->s, U+1E67->s, U+1E68->s, U+1E69->s, U+1E9B->s, U+0162->t,\
U+0163->t, U+0164->t, U+0165->t, U+0166->t, U+0167->t, U+01AB->t, U+01AC->t, U+01AD->t,\
U+01AE->t, U+021A->t, U+021B->t, U+0236->t, U+023E->t, U+0287->t, U+0288->t, U+1D1B->t,\
U+1D40->t, U+1D57->t, U+1D75->t, U+1DB5->t, U+1E6A->t, U+1E6B->t, U+1E6C->t, U+1E6D->t,\
U+1E6E->t, U+1E6F->t, U+1E70->t, U+1E71->t, U+1E97->t, U+2C66->t, U+00D9->u, U+00DA->u,\
U+00DB->u, U+00DC->u, U+00F9->u, U+00FA->u, U+00FB->u, U+00FC->u, U+010316->u, U+0168->u,\
U+0169->u, U+016A->u, U+016B->u, U+016C->u, U+016D->u, U+016E->u, U+016F->u, U+0170->u,\
U+0171->u, U+0172->u, U+0173->u, U+01AF->u, U+01B0->u, U+01D3->u, U+01D4->u, U+01D5->u,\
U+01D6->u, U+01D7->u, U+01D8->u, U+01D9->u, U+01DA->u, U+01DB->u, U+01DC->u, U+0214->u,\
U+0215->u, U+0216->u, U+0217->u, U+0244->u, U+0289->u, U+1D1C->u, U+1D1D->u, U+1D1E->u,\
U+1D41->u, U+1D58->u, U+1D59->u, U+1D64->u, U+1D7E->u, U+1D99->u, U+1DB6->u, U+1DB8->u,\
U+1E72->u, U+1E73->u, U+1E74->u, U+1E75->u, U+1E76->u, U+1E77->u, U+1E78->u, U+1E79->u,\
U+1E7A->u, U+1E7B->u, U+1EE4->u, U+1EE5->u, U+1EE6->u, U+1EE7->u, U+1EE8->u, U+1EE9->u,\
U+1EEA->u, U+1EEB->u, U+1EEC->u, U+1EED->u, U+1EEE->u, U+1EEF->u, U+1EF0->u, U+1EF1->u,\
U+01B2->v, U+0245->v, U+028B->v, U+028C->v, U+1D20->v, U+1D5B->v, U+1D65->v, U+1D8C->v,\
U+1DB9->v, U+1DBA->v, U+1E7C->v, U+1E7D->v, U+1E7E->v, U+1E7F->v, U+2C74->v, U+0174->w,\
U+0175->w, U+028D->w, U+02B7->w, U+1D21->w, U+1D42->w, U+1E80->w, U+1E81->w, U+1E82->w,\
U+1E83->w, U+1E84->w, U+1E85->w, U+1E86->w, U+1E87->w, U+1E88->w, U+1E89->w, U+1E98->w,\
U+02E3->x, U+1D8D->x, U+1E8A->x, U+1E8B->x, U+1E8C->x, U+1E8D->x, U+2093->x, U+00DD->y,\
U+00FD->y, U+00FF->y, U+0176->y, U+0177->y, U+0178->y, U+01B3->y, U+01B4->y, U+0232->y,\
U+0233->y, U+024E->y, U+024F->y, U+028E->y, U+028F->y, U+02B8->y, U+1E8E->y, U+1E8F->y,\
U+1E99->y, U+1EF2->y, U+1EF3->y, U+1EF4->y, U+1EF5->y, U+1EF6->y, U+1EF7->y, U+1EF8->y,\
U+1EF9->y, U+0179->z, U+017A->z, U+017B->z, U+017C->z, U+017D->z, U+017E->z, U+01B5->z,\
U+01B6->z, U+0224->z, U+0225->z, U+0240->z, U+0290->z, U+0291->z, U+1D22->z, U+1D76->z,\
U+1D8E->z, U+1DBB->z, U+1DBC->z, U+1DBD->z, U+1E90->z, U+1E91->z, U+1E92->z, U+1E93->z,\
U+1E94->z, U+1E95->z, U+2128->z, U+2C6B->z, U+2C6C->z, U+00C6->U+00E6, U+01E2->U+00E6,\
U+01E3->U+00E6, U+01FC->U+00E6, U+01FD->U+00E6, U+1D01->U+00E6, U+1D02->U+00E6,\
U+1D2D->U+00E6, U+1D46->U+00E6, U+00E6, U+0400->U+0435, U+0401->U+0435, U+0402->U+0452,\
U+0452, U+0403->U+0433, U+0404->U+0454, U+0454, U+0405->U+0455, U+0455, U+0406->U+0456,\
U+0407->U+0456, U+0457->U+0456, U+0456, U+0408..U+040B->U+0458..U+045B, U+0458..U+045B,\
U+040C->U+043A, U+040D->U+0438, U+040E->U+0443, U+040F->U+045F, U+045F, U+0450->U+0435,\
U+0451->U+0435, U+0453->U+0433, U+045C->U+043A, U+045D->U+0438, U+045E->U+0443,\
U+0460->U+0461, U+0461, U+0462->U+0463, U+0463, U+0464->U+0465, U+0465, U+0466->U+0467,\
U+0467, U+0468->U+0469, U+0469, U+046A->U+046B, U+046B, U+046C->U+046D, U+046D,\
U+046E->U+046F, U+046F, U+0470->U+0471, U+0471, U+0472->U+0473, U+0473, U+0474->U+0475,\
U+0476->U+0475, U+0477->U+0475, U+0475, U+0478->U+0479, U+0479, U+047A->U+047B, U+047B,\
U+047C->U+047D, U+047D, U+047E->U+047F, U+047F, U+0480->U+0481, U+0481, U+048A->U+0438,\
U+048B->U+0438, U+048C->U+044C, U+048D->U+044C, U+048E->U+0440, U+048F->U+0440,\
U+0490->U+0433, U+0491->U+0433, U+0492->U+0433,\
U+0493->U+0433, U+0494->U+0433, U+0495->U+0433, U+0496->U+0436, U+0497->U+0436,\
U+0498->U+0437, U+0499->U+0437, U+049A->U+043A, U+049B->U+043A, U+049C->U+043A,\
U+049D->U+043A, U+049E->U+043A, U+049F->U+043A, U+04A0->U+043A, U+04A1->U+043A,\
U+04A2->U+043D, U+04A3->U+043D, U+04A4->U+043D, U+04A5->U+043D, U+04A6->U+043F,\
U+04A7->U+043F, U+04A8->U+04A9, U+04A9, U+04AA->U+0441, U+04AB->U+0441, U+04AC->U+0442,\
U+04AD->U+0442, U+04AE->U+0443, U+04AF->U+0443, U+04B0->U+0443, U+04B1->U+0443,\
U+04B2->U+0445, U+04B3->U+0445, U+04B4->U+04B5, U+04B5, U+04B6->U+0447, U+04B7->U+0447,\
U+04B8->U+0447, U+04B9->U+0447, U+04BA->U+04BB, U+04BB, U+04BC->U+04BD, U+04BE->U+04BD,\
U+04BF->U+04BD, U+04BD, U+04C0->U+04CF, U+04CF, U+04C1->U+0436, U+04C2->U+0436,\
U+04C3->U+043A, U+04C4->U+043A, U+04C5->U+043B, U+04C6->U+043B, U+04C7->U+043D,\
U+04C8->U+043D, U+04C9->U+043D, U+04CA->U+043D, U+04CB->U+0447, U+04CC->U+0447,\
U+04CD->U+043C, U+04CE->U+043C, U+04D0->U+0430, U+04D1->U+0430, U+04D2->U+0430,\
U+04D3->U+0430, U+04D4->U+00E6, U+04D5->U+00E6, U+04D6->U+0435, U+04D7->U+0435,\
U+04D8->U+04D9, U+04DA->U+04D9, U+04DB->U+04D9, U+04D9, U+04DC->U+0436, U+04DD->U+0436,\
U+04DE->U+0437, U+04DF->U+0437, U+04E0->U+04E1, U+04E1, U+04E2->U+0438, U+04E3->U+0438,\
U+04E4->U+0438, U+04E5->U+0438, U+04E6->U+043E, U+04E7->U+043E, U+04E8->U+043E,\
U+04E9->U+043E, U+04EA->U+043E, U+04EB->U+043E, U+04EC->U+044D, U+04ED->U+044D,\
U+04EE->U+0443, U+04EF->U+0443, U+04F0->U+0443, U+04F1->U+0443, U+04F2->U+0443,\
U+04F3->U+0443, U+04F4->U+0447, U+04F5->U+0447, U+04F6->U+0433, U+04F7->U+0433,\
U+04F8->U+044B, U+04F9->U+044B, U+04FA->U+0433, U+04FB->U+0433, U+04FC->U+0445,\
U+04FD->U+0445, U+04FE->U+0445, U+04FF->U+0445, U+0410..U+0418->U+0430..U+0438,\
U+0419->U+0438, U+0430..U+0438, U+041A..U+042F->U+043A..U+044F, U+043A..U+044F,\
U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z, U+FF41..U+FF5A->a..z, 0..9, A..Z->a..z, a..z
}
But when I search for e.g. Arabic text, it doesn't give me any results, although MySQL's LIKE does.
Also, I wanted to ask: is there a better search server for this kind of task than Sphinx?
Thanks a lot
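A note on the config above: charset_table defines exactly which characters Sphinx indexes, and anything not listed is treated as a word separator. The table above only covers Latin and Cyrillic, so Arabic (and Thai, Devanagari, Tamil, etc.) text is never indexed at all, even though MySQL's LIKE still sees it. A minimal, untested sketch of the kind of ranges that would need adding (the exact per-language blocks are an assumption):
# Sketch only: keep the basic Arabic block as indexable characters,
# in addition to the Latin/Cyrillic mappings already listed:
charset_table = ..., U+0600..U+06FF
# CJK scripts such as Mandarin are usually handled separately via
# unigram indexing, e.g.:
# ngram_len   = 1
# ngram_chars = U+3000..U+2FA1F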


Encoding foreign alphabet characters

I am getting data from an XML provided by an API that, for some reason, garbles Czech/Slovak characters (e.g. instead of the correct Czech/Slovak "ý" it uses "Ã½"). Therefore, instead of providing the
correct output to the user -> "Zelený"
the output is -> "ZelenÃ½"
I went through multiple StackOverflow posts, other fora and tutorials, but I still cannot figure out how to turn "ZelenÃ½" into "Zelený" (this is just one of the weird character sequences used by the XML, so I cannot use str.replace).
I figured out that the correct encoding for the Czech/Slovak language is "windows-1250".
My code:
def change_encoding(what):
    what = what.encode("windows-1250")
    return what

clean_xml_input = change_encoding(xml_input)
This produces the error:
'charmap' codec can't encode characters in position 5-6: character
maps to <undefined>
"Zelený".encode("Windows-1252").decode("utf-8") #'Zelený'
"Zelený".encode("windows-1254").decode("utf-8") #'Zelený'
"Zelený".encode("iso-8859-1").decode("utf-8") #'Zelený'
"Zelený".encode("iso-8859-9").decode("utf-8") #'Zelený'
In case it is helpful, here is a brute-force search over candidate encodings:
from itertools import permutations
all_encoding = ['ASMO-708',
'big5',
'cp1025',
'cp866',
'cp875',
'csISO2022JP',
'DOS-720',
'DOS-862',
'EUC-CN',
'EUC-JP',
'euc-jp',
'euc-kr',
'GB18030',
'gb2312',
'hz-gb-2312',
'IBM00858',
'IBM00924',
'IBM01047',
'IBM01140',
'IBM01141',
'IBM01142',
'IBM01143',
'IBM01144',
'IBM01145',
'IBM01146',
'IBM01147',
'IBM01148',
'IBM01149',
'IBM037',
'IBM1026',
'IBM273',
'IBM277',
'IBM278',
'IBM280',
'IBM284',
'IBM285',
'IBM290',
'IBM297',
'IBM420',
'IBM423',
'IBM424',
'IBM437',
'IBM500',
'ibm737',
'ibm775',
'ibm850',
'ibm852',
'IBM855',
'ibm857',
'IBM860',
'ibm861',
'IBM863',
'IBM864',
'IBM865',
'ibm869',
'IBM870',
'IBM871',
'IBM880',
'IBM905',
'IBM-Thai',
'iso-2022-jp',
'iso-2022-jp',
'iso-2022-kr',
'iso-8859-1',
'iso-8859-13',
'iso-8859-15',
'iso-8859-2',
'iso-8859-3',
'iso-8859-4',
'iso-8859-5',
'iso-8859-6',
'iso-8859-7',
'iso-8859-8',
'iso-8859-8-i',
'iso-8859-9',
'Johab',
'koi8-r',
'koi8-u',
'ks_c_5601-1987',
'macintosh',
'shift_jis',
'us-ascii',
'utf-16',
'utf-16BE',
'utf-32',
'utf-32BE',
'utf-7',
'utf-8',
'windows-1250',
'windows-1251',
'Windows-1252',
'windows-1253',
'windows-1254',
'windows-1255',
'windows-1256',
'windows-1257',
'windows-1258',
'windows-874',
'x-Chinese-CNS',
'x-Chinese-Eten',
'x-cp20001',
'x-cp20003',
'x-cp20004',
'x-cp20005',
'x-cp20261',
'x-cp20269',
'x-cp20936',
'x-cp20949',
'x-cp50227',
'x-EBCDIC-KoreanExtended',
'x-Europa',
'x-IA5',
'x-IA5-German',
'x-IA5-Norwegian',
'x-IA5-Swedish',
'x-iscii-as',
'x-iscii-be',
'x-iscii-de',
'x-iscii-gu',
'x-iscii-ka',
'x-iscii-ma',
'x-iscii-or',
'x-iscii-pa',
'x-iscii-ta',
'x-iscii-te',
'x-mac-arabic',
'x-mac-ce',
'x-mac-chinesesimp',
'x-mac-chinesetrad',
'x-mac-croatian',
'x-mac-cyrillic',
'x-mac-greek',
'x-mac-hebrew',
'x-mac-icelandic',
'x-mac-japanese',
'x-mac-korean',
'x-mac-romanian',
'x-mac-thai',
'x-mac-turkish',
'x-mac-ukrainian']
for i, j in permutations(all_encoding, 2):
    try:
        if "ZelenÃ½".encode(i).decode(j) == 'Zelený':
            print(f'encode with `{i}` and decode with `{j}`')
    except:
        pass

Output not displaying full list of elements appended

from csv import reader

def func(sku_list):
    values = []
    with open(sku_list, 'r', encoding='utf-8') as pr:
        rows = reader(pr)
        for sku in rows:
            values.append(sku[1])
    return values

if __name__ == '__main__':
    dir_path = "C:/Users/XXXX/Downloads/"
    vendors = dir_path + 'file.csv'
    new_prices = func(vendors)
    print(new_prices)
sku_list is a CSV file of brand name / SKU pairs that I downloaded from my DB. For some reason, as the code iterates through the rows and grabs just the SKU value (hence sku[1]), it stops well short of the length I expect the list to be.
The CSV is 85,892 rows long, but when I print out the values appended to the list, it simply returns this:
['SKU', 'MWGB4896', 'MWGB4872', 'MWGB4848', 'MWGB3648', 'WGB4896', 'WGB4872', 'WGB4848', 'WGB3648', 'WGB2436', 'WGB1824', 'BKGB4896NT', 'BKGB4872NT', 'BKGB4848NT', 'BKGB3648NT', 'BKGB2436NT', 'BKGB1824NT', 'WFC2418G', 'WFC2418', 'WFC3624', 'WFC2418LB', 'WFC3648LB', 'WFC3624LB', 'WFC3624G', 'WFC3648G', 'WFC3648', 'LOWFC3624LB', 'LOWFC3624G', 'LOWFC3624', 'LOWFC2418LB', 'LOWFC2418G', 'LOWFC2418', 'LOWFC3648LB', 'LOWFC3648', 'LOWFC3648G', 'WM-7-B', 'WM-7-G', 'WM-7-BK', 'WMC-7', 'WM-7-R', 'APS-50', 'APS-70', 'APS-60', 'APS-84', 'SS15W', 'SC15W', 'SB15W', 'MFL-2W', 'WP-48', 'WP-40', 'WP-36', 'MP-48', 'MP-40', 'MP-36', 'OP-40', 'OP-36', 'OP-48', 'FFVSU96-2', 'FFVSU144-2', 'FFVSU192-2', '1-WA-1B', '1-WA-1BP', 'WCS-12', 'WCS-144', 'OPLD3416LSPP-2', 'OPLD3416LSPP-4', 'OPLD3416LSPP-5', 'OPLD3416LSPP-7', 'OPLD3416LSPP-8', 'OPLD1818LSPP-2', 'OPLD1818LSPP-4', 'OPLD1818LSPP-5', 'OPLD1818LSPP-7', 'OPLD1818LSPP-8', 'OPLD1818L-2', 'OPLD1818L-5', 'OPLD1818L-4', 'OPLD3416L-2', 'OPLD3416L-4', 'OPLD3416L-5', 'OPLD3416L-7', 'OPLD3416L-8', 'OPLD1818L-7', 'OPLD1818L-8', 'OPLD3416SPP-8-892', 'OPLD3416SPP-8-897', 'OPLD3416SPP-8-878', 'OPLD3416SPP-8-885', 'OPLD3416SPP-8-887', 'OPLD3416SPP-8-890', 'OPLD3416SPP-8-845', 'OPLD3416SPP-8-854', 'OPLD3416SPP-8-856', 'OPLD3416SPP-8-876', 'OPLD3416SPP-8-802', 'OPLD3416SPP-8-706', 'OPLD3416SPP-8-705', 'OPLD3416SPP-8-704', 'OPLD3416SPP-8-837', 'OPLD3416SPP-8-831', 'OPLD3416SPP-8-819', 'OPLD3416SPP-8-812', 'OPLD3416SPP-8-685', 'OPLD3416SPP-8-683', 'OPLD3416SPP-8-679', 'OPLD3416SPP-8-531', 'OPLD3416SPP-8-703', 'OPLD3416SPP-8-702', 'OPLD3416SPP-8-701', 'OPLD3416SPP-8-700', 'OPLD3416SPP-7-892', 'OPLD3416SPP-7-897', 'OPLD3416SPP-7-887', 'OPLD3416SPP-7-890', 'OPLD3416SPP-8-530', 'OPLD3416SPP-7-845', 'OPLD3416SPP-7-854', 'OPLD3416SPP-7-831', 'OPLD3416SPP-7-837', 'OPLD3416SPP-7-878', 'OPLD3416SPP-7-885', 'OPLD3416SPP-7-856', 'OPLD3416SPP-7-876', 'OPLD3416SPP-7-703', 'OPLD3416SPP-7-702', 'OPLD3416SPP-7-705', 'OPLD3416SPP-7-704', 'OPLD3416SPP-7-802', 'OPLD3416SPP-7-706', 'OPLD3416SPP-7-819', 'OPLD3416SPP-7-812', 'OPLD3416SPP-7-530', 'OPLD3416SPP-7-679', 'OPLD3416SPP-7-531', 'OPLD3416SPP-7-685', 'OPLD3416SPP-7-683', 'OPLD3416SPP-7-701', 'OPLD3416SPP-7-700', 'OPLD3416SPP-5-878', 'OPLD3416SPP-5-885', 'OPLD3416SPP-5-887', 'OPLD3416SPP-5-890', 'OPLD3416SPP-5-892', 'OPLD3416SPP-5-897', 'OPLD3416SPP-5-812', 'OPLD3416SPP-5-819', 'OPLD3416SPP-5-831', 'OPLD3416SPP-5-837', 'OPLD3416SPP-5-845', 'OPLD3416SPP-5-854', 'OPLD3416SPP-5-856', 'OPLD3416SPP-5-876', 'OPLD1818SPP-8-819', 'OPLD1818SPP-8-831', 'OPLD1818SPP-8-802', 'OPLD1818SPP-8-812', 'OPLD1818SPP-8-854', 'OPLD1818SPP-8-856', 'OPLD1818SPP-8-837', 'OPLD1818SPP-8-845', 'OPLD1818SPP-8-701', 'OPLD1818SPP-8-702', 'OPLD1818SPP-8-685', 'OPLD1818SPP-8-700', 'OPLD1818SPP-8-705', 'OPLD1818SPP-8-706', 'OPLD1818SPP-8-703', 'OPLD1818SPP-8-704', 'OPLD1818SPP-8-887', 'OPLD1818SPP-8-885', 'OPLD1818SPP-8-878', 'OPLD1818SPP-8-876', 'OPLD1818SPP-8-897', 'OPLD1818SPP-8-892', 'OPLD1818SPP-8-890', 'OPLD3416SPP-4-837', 'OPLD3416SPP-4-831', 'OPLD3416SPP-4-854', 'OPLD3416SPP-4-845', 'OPLD3416SPP-4-802', 'OPLD3416SPP-4-706',
'OPLD3416SPP-4-819', 'OPLD3416SPP-4-812', 'OPLD3416SPP-4-890', 'OPLD3416SPP-4-887', 'OPLD3416SPP-4-897', 'OPLD3416SPP-4-892', 'OPLD3416SPP-4-876', 'OPLD3416SPP-4-856', 'OPLD3416SPP-4-885', 'OPLD3416SPP-4-878', 'OPLD3416SPP-5-531', 'OPLD3416SPP-5-679', 'OPLD3416SPP-5-683', 'OPLD3416SPP-5-685', 'OPLD3416SPP-5-530', 'OPLD3416SPP-5-704', 'OPLD3416SPP-5-705', 'OPLD3416SPP-5-706', 'OPLD3416SPP-5-802', 'OPLD3416SPP-5-700', 'OPLD3416SPP-5-701', 'OPLD3416SPP-5-702', 'OPLD3416SPP-5-703', 'OPLD3416SPP-2-837', 'OPLD3416SPP-2-831', 'OPLD3416SPP-2-819', 'OPLD3416SPP-2-812', 'OPLD3416SPP-2-802', 'OPLD3416SPP-2-706', 'OPLD3416SPP-2-705', 'OPLD3416SPP-2-704', 'OPLD3416SPP-2-890', 'OPLD3416SPP-2-887', 'OPLD3416SPP-2-885', 'OPLD3416SPP-2-878', 'OPLD3416SPP-2-876', 'OPLD3416SPP-2-856', 'OPLD3416SPP-2-854', 'OPLD3416SPP-2-845', 'OPLD3416SPP-4-531', 'OPLD3416SPP-4-679', 'OPLD3416SPP-4-530', 'OPLD3416SPP-2-892', 'OPLD3416SPP-2-897', 'OPLD3416SPP-4-704', 'OPLD3416SPP-4-705', 'OPLD3416SPP-4-702', 'OPLD3416SPP-4-703', 'OPLD3416SPP-4-700', 'OPLD3416SPP-4-701', 'OPLD3416SPP-4-683', 'OPLD3416SPP-4-685', 'OPLD3416SPP-2-530', 'OPLD3416SPP-2-531', 'OPLD3416SPP-2-679', 'OPLD3416SPP-2-683', 'OPLD3416SPP-2-685', 'OPLD3416SPP-2-700', 'OPLD3416SPP-2-701', 'OPLD3416SPP-2-702', 'OPLD3416SPP-2-703', 'OPLD1818SPP-7-819', 'OPLD1818SPP-7-831', 'OPLD1818SPP-7-837', 'OPLD1818SPP-7-845', 'OPLD1818SPP-7-705', 'OPLD1818SPP-7-706', 'OPLD1818SPP-7-802', 'OPLD1818SPP-7-812', 'OPLD1818SPP-7-701', 'OPLD1818SPP-7-702', 'OPLD1818SPP-7-703', 'OPLD1818SPP-7-704', 'OPLD1818SPP-7-679', 'OPLD1818SPP-7-683', 'OPLD1818SPP-7-685', 'OPLD1818SPP-7-700', 'OPLD1818SPP-8-531', 'OPLD1818SPP-8-530', 'OPLD1818SPP-8-683', 'OPLD1818SPP-8-679', 'OPLD1818SPP-7-897', 'OPLD1818SPP-7-887', 'OPLD1818SPP-7-885', 'OPLD1818SPP-7-892', 'OPLD1818SPP-7-890', 'OPLD1818SPP-7-856', 'OPLD1818SPP-7-854', 'OPLD1818SPP-7-878', 'OPLD1818SPP-7-876', 'OPLD1818SPP-5-819', 'OPLD1818SPP-5-831', 'OPLD1818SPP-5-802', 'OPLD1818SPP-5-812', 'OPLD1818SPP-5-705', 'OPLD1818SPP-5-706', 'OPLD1818SPP-5-703', 'OPLD1818SPP-5-704', 'OPLD1818SPP-5-701', 'OPLD1818SPP-5-702', 'OPLD1818SPP-5-685', 'OPLD1818SPP-5-700', 'OPLD1818SPP-5-679', 'OPLD1818SPP-5-683', 'OPLD1818SPP-5-530', 'OPLD1818SPP-5-531', 'OPLD1818SPP-7-531', 'OPLD1818SPP-7-530', 'OPLD1818SPP-5-897', 'OPLD1818SPP-5-892', 'OPLD1818SPP-5-890', 'OPLD1818SPP-5-887', 'OPLD1818SPP-5-885', 'OPLD1818SPP-5-878', 'OPLD1818SPP-5-876', 'OPLD1818SPP-5-856', 'OPLD1818SPP-5-854', 'OPLD1818SPP-5-845', 'OPLD1818SPP-5-837', 'OPLD1818SPP-4-701', 'OPLD1818SPP-4-702', 'OPLD1818SPP-4-703', 'OPLD1818SPP-4-704', 'OPLD1818SPP-4-705', 'OPLD1818SPP-4-706', 'OPLD1818SPP-4-802', 'OPLD1818SPP-4-812', 'OPLD1818SPP-4-530', 'OPLD1818SPP-4-531', 'OPLD1818SPP-4-679', 'OPLD1818SPP-4-683', 'OPLD1818SPP-4-685', 'OPLD1818SPP-4-700', 'OPLD1818SPP-4-887', 'OPLD1818SPP-2-837', 'OPLD1818SPP-2-845', 'OPLD1818SPP-2-854', 'OPLD1818SPP-2-856', 'OPLD1818SPP-2-802', 'OPLD1818SPP-2-812', 'OPLD1818SPP-2-819', 'OPLD1818SPP-2-831', 'OPLD1818SPP-2-890', 'OPLD1818SPP-2-892', 'OPLD1818SPP-2-897', 'OPLD1818SPP-2-876', 'OPLD1818SPP-2-878', 'OPLD1818SPP-2-885', 'OPLD1818SPP-2-887', 'OPLD1818SPP-2-531', 'OPLD1818SPP-2-530', 'OPLD1818SPP-2-683', 'OPLD1818SPP-2-679', 'OPLD1818SPP-2-704', 'OPLD1818SPP-2-703',
'OPLD1818SPP-2-706', 'OPLD1818SPP-2-705', 'OPLD1818SPP-2-700', 'OPLD1818SPP-2-685', 'OPLD1818SPP-2-702', 'OPLD1818SPP-2-701', 'OPLD1818SPP-4-876', 'OPLD1818SPP-4-878', 'OPLD1818SPP-4-854', 'OPLD1818SPP-4-856', 'OPLD1818SPP-4-837', 'OPLD1818SPP-4-845', 'OPLD1818SPP-4-819', 'OPLD1818SPP-4-831', 'OPLD1818SPP-4-897', 'OPLD1818SPP-4-890', 'OPLD1818SPP-4-892', 'OPLD1818SPP-4-885', 'PLD4832DPP-2-845', 'PLD4832DPP-2-837', 'PLD4832DPP-2-856', 'PLD4832DPP-2-854', 'PLD4832DPP-2-878', 'PLD4832DPP-2-876', 'PLD4832DPP-2-887', 'PLD4832DPP-2-885', 'PLD4832DPP-2-892', 'PLD4832DPP-2-890', 'PLD4832DPP-2-897', 'PLD4832DPP-4-531', 'PLD4832DPP-4-530', 'PLD4832DPP-4-679', 'PLD4832DPP-4-683', 'PLD4832DPP-4-685', 'PLD4832DPP-4-700', 'PLD4832DPP-4-701', 'PLD4832DPP-4-702', 'PLD4832DPP-4-703', 'PLD4832DPP-4-704', 'PLD4832DPP-4-705', 'PLD4832DPP-4-706', 'PLD4832DPP-4-802', 'PLD4832DPP-4-812', 'PLD4832DPP-4-819', 'PLD4832DPP-4-831', 'PLD4832DPP-4-837', 'PLD4832DPP-4-845', 'PLD4832DPP-4-878', 'PLD4832DPP-4-876', 'PLD4832DPP-4-856', 'PLD4832DPP-4-854', 'PLD4832DPP-4-892', 'PLD4832DPP-4-890', 'PLD4832DPP-4-887', 'PLD4832DPP-4-885', 'PLD4832DPP-4-897', 'PLD4832DPP-5-683', 'PLD4832DPP-5-679', 'PLD4832DPP-5-531', 'PLD4832DPP-5-530', 'PLD4832DPP-5-701', 'PLD4832DPP-5-702', 'PLD4832DPP-5-685', 'PLD4832DPP-5-700', 'PLD4832DPP-5-705', 'PLD4832DPP-5-706', 'PLD4832DPP-5-703', 'PLD4832DPP-5-704', 'PLD4832DPP-5-819', 'PLD4832DPP-5-831', 'PLD4832DPP-5-802', 'PLD4832DPP-5-812', 'PLD4832DPP-5-854', 'PLD4832DPP-5-856', 'PLD4832DPP-5-837', 'PLD4832DPP-5-845', 'PLD4832DPP-2-701', 'PLD4832DPP-2-702', 'PLD4832DPP-2-685', 'PLD4832DPP-2-700', 'PLD4832DPP-2-679', 'PLD4832DPP-2-683', 'PLD4832DPP-2-530', 'PLD4832DPP-2-531', 'PLD4832DPP-2-819', 'PLD4832DPP-2-831', 'PLD4832DPP-2-802', 'PLD4832DPP-2-812', 'PLD4832DPP-2-705', 'PLD4832DPP-2-706', 'PLD4832DPP-2-703', 'PLD4832DPP-2-704', 'PLD4226DPP-8-887', 'PLD4226DPP-8-890', 'PLD4226DPP-8-892', 'PLD4226DPP-8-897', 'PLD4226DPP-8-856', 'PLD4226DPP-8-876', 'PLD4226DPP-8-878', 'PLD4226DPP-8-885', 'PLD4226DPP-8-831', 'PLD4226DPP-8-837', 'PLD4226DPP-8-845', 'PLD4226DPP-8-854', 'PLD4226DPP-8-706', 'PLD4226DPP-8-802', 'PLD4226DPP-8-812', 'PLD4226DPP-8-819', 'PLD4226DPP-5-892', 'PLD4226DPP-5-897', 'PLD4226DPP-5-887', 'PLD4226DPP-5-890', 'PLD4226DPP-7-530', 'PLD4226DPP-7-683', 'PLD4226DPP-7-685', 'PLD4226DPP-7-531', 'PLD4226DPP-7-679', 'PLD4226DPP-7-702', 'PLD4226DPP-7-703', 'PLD4226DPP-7-700', 'PLD4226DPP-7-701', 'PLD4226DPP-5-705', 'PLD4226DPP-5-704', 'PLD4226DPP-5-703', 'PLD4226DPP-5-702', 'PLD4226DPP-5-819', 'PLD4226DPP-5-812', 'PLD4226DPP-5-802', 'PLD4226DPP-5-706', 'PLD4226DPP-5-854', 'PLD4226DPP-5-845', 'PLD4226DPP-5-837', 'PLD4226DPP-5-831', 'PLD4226DPP-5-885', 'PLD4226DPP-5-878', 'PLD4226DPP-5-876', 'PLD4226DPP-5-856', 'PLD4226DPP-7-892', 'PLD4226DPP-7-897', 'PLD4226DPP-8-530', 'PLD4226DPP-8-531', 'PLD4226DPP-8-679', 'PLD4226DPP-8-683', 'PLD4226DPP-8-685', 'PLD4226DPP-8-700', 'PLD4226DPP-8-701', 'PLD4226DPP-8-702', 'PLD4226DPP-8-703', 'PLD4226DPP-8-704', 'PLD4226DPP-8-705', 'PLD4226DPP-7-705', 'PLD4226DPP-7-704', 'PLD4226DPP-7-802', 'PLD4226DPP-7-706', 'PLD4226DPP-7-819', 'PLD4226DPP-7-812', 'PLD4226DPP-7-837', 'PLD4226DPP-7-831', 'PLD4226DPP-7-854', 'PLD4226DPP-7-845', 'PLD4226DPP-7-876', 'PLD4226DPP-7-856', 'PLD4226DPP-7-885', 'PLD4226DPP-7-878', 'PLD4226DPP-7-890', 'PLD4226DPP-7-887', 'PLD4226DPP-2-892', 'PLD4226DPP-2-897', 'PLD4226DPP-2-887', 'PLD4226DPP-2-890', 'PLD4226DPP-2-878', 'PLD4226DPP-2-885', 'PLD4226DPP-2-856', 'PLD4226DPP-2-876', 'PLD4226DPP-4-683', 'PLD4226DPP-4-685', 
'PLD4226DPP-4-531', 'PLD4226DPP-4-679', 'PLD4226DPP-4-530', 'PLD4226DPP-2-705', 'PLD4226DPP-2-704', 'PLD4226DPP-2-703', 'PLD4226DPP-2-702', 'PLD4226DPP-2-701', 'PLD4226DPP-2-700', 'PLD4226DPP-2-685', 'PLD4226DPP-2-683', 'PLD4226DPP-2-854', 'PLD4226DPP-2-845', 'PLD4226DPP-2-837', 'PLD4226DPP-2-831', 'PLD4226DPP-2-819', 'PLD4226DPP-2-812', 'PLD4226DPP-2-802', 'PLD4226DPP-2-706', 'PLD4226DPP-4-892', 'PLD4226DPP-4-897', 'PLD4226DPP-4-878', 'PLD4226DPP-4-885', 'PLD4226DPP-4-887', 'PLD4226DPP-4-890', 'PLD4226DPP-5-683', 'PLD4226DPP-5-685', 'PLD4226DPP-5-700', 'PLD4226DPP-5-701', 'PLD4226DPP-5-530', 'PLD4226DPP-5-531', 'PLD4226DPP-5-679', 'PLD4226DPP-4-705', 'PLD4226DPP-4-704', 'PLD4226DPP-4-802', 'PLD4226DPP-4-706', 'PLD4226DPP-4-701', 'PLD4226DPP-4-700', 'PLD4226DPP-4-703', 'PLD4226DPP-4-702', 'PLD4226DPP-4-854', 'PLD4226DPP-4-845', 'PLD4226DPP-4-876', 'PLD4226DPP-4-856', 'PLD4226DPP-4-819', 'PLD4226DPP-4-812', 'PLD4226DPP-4-837', 'PLD4226DPP-4-831', 'PLD4226DPP-2-530', 'PLD4226DPP-2-679', 'PLD4226DPP-2-531', 'PLD5438DPP-5-683', 'PLD5438DPP-5-685', 'PLD5438DPP-5-531', 'PLD5438DPP-5-679', 'PLD5438DPP-5-702', 'PLD5438DPP-5-703', 'PLD5438DPP-5-700', 'PLD5438DPP-5-701', 'PLD5438DPP-5-706', 'PLD5438DPP-5-802', 'PLD5438DPP-5-704', 'PLD5438DPP-5-705', 'PLD5438DPP-5-831', 'PLD5438DPP-5-837', 'PLD5438DPP-5-812', 'PLD5438DPP-5-819', 'PLD5438DPP-5-876', 'PLD5438DPP-5-856', 'PLD5438DPP-5-854', 'PLD5438DPP-5-845', 'PLD5438DPP-5-890', 'PLD5438DPP-5-887', 'PLD5438DPP-5-885', 'PLD5438DPP-5-878', 'PLD5438DPP-5-897', 'PLD5438DPP-5-892', 'PLD5438DPP-7-679', 'PLD5438DPP-7-531', 'PLD5438DPP-7-530', 'PLD5438DPP-4-530', 'PLD5438DPP-4-531', 'PLD5438DPP-4-679', 'PLD5438DPP-4-683', 'PLD5438DPP-4-685', 'PLD5438DPP-4-700', 'PLD5438DPP-4-701', 'PLD5438DPP-4-702', 'PLD5438DPP-4-703', 'PLD5438DPP-4-704', 'PLD5438DPP-4-705', 'PLD5438DPP-4-706', 'PLD5438DPP-4-802', 'PLD5438DPP-4-812', 'PLD5438DPP-4-819', 'PLD5438DPP-4-837', 'PLD5438DPP-4-831', 'PLD5438DPP-4-854', 'PLD5438DPP-4-845', 'PLD5438DPP-4-876', 'PLD5438DPP-4-856', 'PLD5438DPP-4-885', 'PLD5438DPP-4-878', 'PLD5438DPP-4-890', 'PLD5438DPP-4-887', 'PLD5438DPP-4-897', 'PLD5438DPP-4-892', 'PLD5438DPP-5-530',
'PLD5438DPP
None
The final SKU in there, PLD5438DPP, should be PLD5438DPP-5-683; for some reason the list cuts off there, at element 564 of 85,892, and the program terminates without an error code.
I cannot attach the file of SKUs, as this is for my job; I'm just hoping someone can shed light on what I am doing that causes the list to cut off like that.
This may or may not be relevant, but when I call .append(sku) instead of sku[1] and grab the whole row, the same issue occurs at element 292, roughly half the number of elements appended when grabbing only the single field.
The issue turned out to be on my local machine, which was unable to print such a long list (85,892 elements) to the console.
For those with similar issues, check whether any of our specs overlap; that may help determine the cause:
VSCode:
Version: 1.52.0 (user setup)
Commit: 940b5f4bb5fa47866a54529ed759d95d09ee80be
Date: 2020-12-10T22:45:11.850Z
Electron: 9.3.5
Chrome: 83.0.4103.122
Node.js: 12.14.1
V8: 8.3.110.13-electron.0
OS: Windows_NT x64 10.0.18363
Python:
3.9.0
see discussion in comments for other details
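If you run into the same symptom, one quick way to confirm that the list itself is intact and only the console is truncating the output is to print the length and write the list to a file instead (the output file name below is made up):
print(len(new_prices))  # should report 85892 if every row was read

# A file has no scrollback limit, unlike the integrated terminal
with open(dir_path + 'skus_dump.txt', 'w', encoding='utf-8') as out:
    out.write('\n'.join(new_prices))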

invert regex pattern in python

I'm trying to keep only the Arabic characters in a string, but the following function doesn't work for me:
import re

def remove_any_non_arabic_char(text):
    non_arabic_char = re.compile('^[\u0627-\u064a]')
    text = re.sub(non_arabic_char, "", text)
    print(text)
for example:
s = "Kühn xvii, 346] قال جالينوس: [1] قد اتفق جل من فسر هذا الكتا"
The desired output of remove_any_non_arabic_char(s) should be قال جالينوس قد اتفق جل من فسر هذا الكتا, but the input is returned unchanged.
What should I do?
First, you need to fix your regex as suggested in the comments. Then, for a more robust solution, expand your Unicode character selection to cover the whole Arabic block. Finally, you need to keep at least one space between Arabic words so the Arabic text stays legible:
import re

def remove_any_non_arabic_char(text):
    non_arabic_char = re.compile(r'[^\s\u0600-\u06FF]')
    text_with_no_spaces = re.sub(non_arabic_char, "", text)
    text_with_single_spaces = " ".join(re.split(r"\s+", text_with_no_spaces))
    return text_with_single_spaces
text_1 = "Kühn xvii, 346] قال جالينوس: [1] قد اتفق جل من فسر هذا الكتا"
text_2 = '''
تغيّر مفهوم كلمة (أدب) من العصر الجاهلي jahili (pre-Islamic) era إلى الآن عبر
مراحل periods التاريخ المتعددة. ففي الجاهلية، كانت كلمة أدب تعني (الدعوة إلى
الطعام). وبعدها، استخدم الرسول محمد (عليه السلام) الكلمة بمعنى "التهذيب والتربية"
education and mannerism. وفي العصر الأموي، اتصلت had to do كلمة أدب
بالتاريخ والفقه والقرآن والحديث. أما في العصرالعباسي، فأصبحت تعني تعلّم الشعر
والنثر prose واتسع الأدب ليشمل أنواع المعرفة وألوانها وخصوصاً علم البلاغة واللغة.
أما في الوقت الحالي، فأصبحت كلمة أدب ذات صلة pertinent بالكلام البليغ
الجميل المؤثر that impacts في أحاسيس القاريء أو السامع.
'''
# Isleem, N. M., & Abuhakema, G. M. (2020). Kalima wa Nagham: A Textbook for
# Teaching Arabic, Volume 2 (Vol. 3). University of Texas Press. (page 5)
print('text_1: \n', remove_any_non_arabic_char(text_1))
print('\ntext_2: \n\n', remove_any_non_arabic_char(text_2))
Running the code on the two texts above in Jupyter yields only the Arabic words, separated by single spaces (the original post shows the output as a screenshot).
Notice that punctuation marks shared between Arabic and English (like periods and brackets) have also been removed. To keep those, you would need to introduce more complex conditionals.
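For example, a sketch of one such variant, which simply whitelists a few shared marks inside the negated character class (the exact set of marks to keep is an assumption):
import re

def remove_non_arabic_keep_punct(text):
    # Keep whitespace, the Arabic block, and a few marks shared with English.
    pattern = re.compile(r'[^\s\u0600-\u06FF.,:;()\[\]]')
    return " ".join(re.split(r"\s+", re.sub(pattern, "", text)))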

Tokenize tweet based on Regex

I have the following example text / tweet:
RT @trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh url_that_cannot_be_posted_on_SO
I want to follow the procedure of Table 1 in Li, T, van Dalen, J, & van Rees, P.J. (Pieter Jan). (2017). More than just noise? Examining the information content of stock microblogs on financial markets. Journal of Information Technology. doi:10.1057/s41265-016-0034-2 in order to clean up the tweet.
They clean the tweet up in such a way that the final result is:
{RT|123456} {USER|56789} {TICKER|AAPL} {NUMBER|2012} notooopen nottalk patent {COMPANY|GOOG} notdefinetli treatment {HASH|samsung} {EMOTICON|POS} haha {URL}
I use the following script to tokenize the tweet based on the regex:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

emoticon_string = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{#\|\\] # mouth
      |
      [\)\]\(\[dDpP/\:\}\{#\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
    )"""

regex_strings = (
    # URL:
    r"""http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"""
    ,
    # Twitter username:
    r"""(?:@[\w_]+)"""
    ,
    # Hashtags:
    r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
    ,
    # Cashtags:
    r"""(?:\$+[\w_]+[\w\'_\-]*[\w_]+)"""
    ,
    # Remaining word types:
    r"""
    (?:[+\-]?\d+[,/.:-]\d+[+\-]?) # Numbers, including fractions, decimals.
    |
    (?:[\w_]+)                    # Words without apostrophes or dashes.
    |
    (?:\.(?:\s*\.){1,})           # Ellipsis dots.
    |
    (?:\S)                        # Everything else that isn't whitespace.
    """
)

word_re = re.compile(r"""(%s)""" % "|".join(regex_strings), re.VERBOSE | re.I | re.UNICODE)
emoticon_re = re.compile(regex_strings[1], re.VERBOSE | re.I | re.UNICODE)

######################################################################

class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        try:
            s = str(s)
        except UnicodeDecodeError:
            s = str(s).encode('string_escape')
            s = unicode(s)
        # Tokenize:
        words = word_re.findall(s)
        if not self.preserve_case:
            words = map((lambda x: x if emoticon_re.search(x) else x.lower()), words)
        return words

if __name__ == '__main__':
    tok = Tokenizer(preserve_case=False)
    test = ' RT @trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh url_that_cannot_be_posted_on_SO'
    tokenized = tok.tokenize(test)
    print("\n".join(tokenized))
This yields the following output:
rt
@trader
$aapl
2012
is
oooopen
to
‘
talk
’
about
patents
with
goog
definitely
not
the
treatment
#samsung
got
:-)
heh
url_that_cannot_be_posted_on_SO
How can I adjust this script to get:
rt
{USER|trader}
{CASHTAG|aapl}
{NUMBER|2012}
is
oooopen
to
‘
talk
’
about
patents
with
goog
definitely
not
the
treatment
{HASHTAG|samsung}
got
{EMOTICON|:-)}
heh
{URL|url_that_cannot_be_posted_on_SO}
Thanks in advance for helping me out big time!
You really need to use named capturing groups (mentioned by thebjorn) and use groupdict() to get name-value pairs upon each match. It requires some post-processing, though:
All pairs where the value is None must be discarded.
If self.preserve_case is false, the value can be turned to lower case at once.
If the group name is WORD, ELLIPSIS or ELSE, the value is added to words as is.
If the group name is HASHTAG, CASHTAG, USER or URL, the value is first stripped of $, # and @ chars at the start and then added to words as a {<GROUP_NAME>|<VALUE>} item.
All other matches are added to words as a {<GROUP_NAME>|<VALUE>} item.
Note that \w matches underscores by default, so [\w_] = \w. I optimized the patterns a little bit.
Here is a fixed code snippet:
import re

emoticon_string = r"""
    (?P<EMOTICON>
      [<>]?
      [:;=8]              # eyes
      [-o*']?             # optional nose
      [][()dDpP/:{}#|\\]  # mouth
      |
      [][()dDpP/:}{#|\\]  # mouth
      [-o*']?             # optional nose
      [:;=8]              # eyes
      [<>]?
    )"""

regex_strings = (
    # URL:
    r"""(?P<URL>https?://(?:[-a-zA-Z0-9_$#.&+!*(),]|%[0-9a-fA-F][0-9a-fA-F])+)"""
    ,
    # Twitter username:
    r"""(?P<USER>@\w+)"""
    ,
    # Hashtags:
    r"""(?P<HASHTAG>\#+\w+[\w'-]*\w+)"""
    ,
    # Cashtags:
    r"""(?P<CASHTAG>\$+\w+[\w'-]*\w+)"""
    ,
    # Remaining word types:
    r"""
    (?P<NUMBER>[+-]?\d+(?:[,/.:-]\d+[+-]?)?)  # Numbers, including fractions, decimals.
    |
    (?P<WORD>\w+)                             # Words without apostrophes or dashes.
    |
    (?P<ELLIPSIS>\.(?:\s*\.)+)                # Ellipsis dots.
    |
    (?P<ELSE>\S)                              # Everything else that isn't whitespace.
    """
)

word_re = re.compile(r"""({}|{})""".format(emoticon_string, "|".join(regex_strings)), re.VERBOSE | re.I | re.UNICODE)
#print(word_re.pattern)
emoticon_re = re.compile(regex_strings[1], re.VERBOSE | re.I | re.UNICODE)

######################################################################

class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        try:
            s = str(s)
        except UnicodeDecodeError:
            s = str(s).encode('string_escape')
            s = unicode(s)
        # Tokenize:
        words = []
        for x in word_re.finditer(s):
            for key, val in x.groupdict().items():
                if val:
                    if not self.preserve_case:
                        val = val.lower()
                    if key in ['WORD', 'ELLIPSIS', 'ELSE']:
                        words.append(val)
                    elif key in ['HASHTAG', 'CASHTAG', 'USER', 'URL']:  # Add more here if needed
                        words.append("{{{}|{}}}".format(key, re.sub(r'^[@#$]+', '', val)))
                    else:
                        words.append("{{{}|{}}}".format(key, val))
        return words

if __name__ == '__main__':
    tok = Tokenizer(preserve_case=False)
    test = ' RT @trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh http://some.site.here.com'
    tokenized = tok.tokenize(test)
    print("\n".join(tokenized))
With test = ' RT @trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh http://some.site.here.com', it outputs:
rt
{USER|trader}
{CASHTAG|aapl}
{NUMBER|2012}
is
oooopen
to
‘
talk
’
about
patents
with
goog
definitely
not
the
treatment
{HASHTAG|samsung}
got
{EMOTICON|:-)}
heh
{URL|http://some.site.here.com}
See the regex demo online.
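If you also want an {RT|...} item like in the paper's target output, one presumably straightforward extension (not part of the original answer) is an extra named group placed before the WORD branch in regex_strings:
# Hypothetical extra alternative, to be inserted above (?P<WORD>\w+):
r"""(?P<RT>\bRT\b)"""
tokenize() then needs no changes: the RT key is not in any of the special-cased lists, so it falls through to the final else branch and is emitted as {RT|rt}.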

How to train a sense2vec model

The documentation of sense2vec mentions 3 primary files, the first of them being merge_text.py. I have tried several types of input (txt, csv, bzipped files), since merge_text.py tries to open files compressed by bzip2.
The file can be found at:
https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py
What type of input format does this script require?
Further, if anyone could please suggest how to train the model.
I extended and adjusted the code samples from sense2vec.
You go from this input text:
"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money."
To this:
as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motif|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN
Double line breaks are interpreted as separate documents.
URLs are recognized as such, stripped down to domain.tld and marked as |URL.
Nouns (including nouns that are part of noun phrases) are lemmatized (e.g. motives becomes motifs).
Words with POS tags like DET (determiner) and PUNCT (punctuation) are dropped.
Here's the code. Let me know if you have questions.
I'll probably publish it on github.com/woltob soon.
import spacy
import re

nlp = spacy.load('en')
nlp.matcher = None

LABELS = {
    'ENT': 'ENT',
    'PERSON': 'PERSON',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}

pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')

def strip_meta(text):
    text = text.replace('per cent', 'percent')
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text

def transform_doc(doc):
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    for np in doc.noun_chunks:
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''

def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3) + '|URL'
        else:
            return word.text.lower().strip() + '|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Dropping PUNCT (punctuation such as commas) and DET (determiners like "the")
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    return text + '|' + tag

corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money.
'''

corpus_stripped = strip_meta(corpus)
doc = nlp(corpus_stripped)
corpus_ = []
for word in doc:
    # Only lemmatize NOUN and PROPN
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the original word's first character plus the lemma,
        # then add the trailing whitespace, if it was there.
        lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
        corpus_.append(lemma_)
    else:
        # All other words are added normally.
        corpus_.append(word.text_with_ws)

result = transform_doc(nlp(''.join(corpus_)))

sense2vec_filename = 'text.txt'
with open(sense2vec_filename, 'w') as file:
    file.write(result)
print(result)
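To actually train embeddings on the resulting file, one option (not covered in the original answer) is plain gensim word2vec over the token|POS lines; the sketch below assumes an older gensim release where the dimensionality parameter is still called size:
from gensim.models import Word2Vec

# Each line of text.txt is one sentence of token|TAG items.
sentences = [line.split() for line in open('text.txt', encoding='utf-8')]
model = Word2Vec(sentences, size=128, window=5, min_count=1, workers=4)
print(model.wv.most_similar('money|NOUN'))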
You could visualise your model using Gensim in Tensorboard using this approach:
https://github.com/ArdalanM/gensim2tensorboard
I'll also adjust this code to work with the sense2vec approach (e.g. the words become lowercase in the preprocessing step, just comment it out in the code).
Happy coding,
woltob
The input file should be a bzipped JSON. To use a plain text file, just edit merge_text.py as follows:
def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            yield line.decode('utf-8', errors='ignore')
            # yield ujson.loads(line)['body']
