Sphinx and Python, multi-language search
I have a UTF-8 table called tr_transcriptions. It stores dozens of records with a 'text' field in different languages. There are 34 languages supported at the moment:
Afrikaans, Korean, Arabic, Malayalam, Bahasa, Mandarin, Bahasama, Norwegian, Croatian,
Polish, Czech, Portuguese, Danish, Romanian, Dutch, Russian, English, Slovak, Flemish,
Spanish, French, Swedish, German, Tagalog, Greek, Tamil, Hindi, Telugu, Hungarian, Thai, Italian, Turkish, Kannada, Vietnamese
I want to give users the ability to search through this table, but I have a few issues with that: I can't get it working with Sphinx. Here's my config file:
source transcriptions
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass = pass
sql_db = transcriptions
sql_port = 3306 # optional, default is 3306
sql_query_pre = SET NAMES utf8
sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
sql_query = \
SELECT * \
FROM tr_transcriptions
}
index tr_transcriptions
{
source = transcriptions
charset_type = utf-8
charset_table = U+00C0->a, U+00C1->a, U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00E0->a, U+00E1->a,\
U+00E2->a, U+00E3->a, U+00E4->a, U+00E5->a, U+0100->a, U+0101->a, U+0102->a, U+0103->a,\
U+010300->a, U+0104->a, U+0105->a, U+01CD->a, U+01CE->a, U+01DE->a, U+01DF->a, U+01E0->a,\
U+01E1->a, U+01FA->a, U+01FB->a, U+0200->a, U+0201->a, U+0202->a, U+0203->a, U+0226->a,\
U+0227->a, U+023A->a, U+0250->a, U+04D0->a, U+04D1->a, U+1D2C->a, U+1D43->a, U+1D44->a,\
U+1D8F->a, U+1E00->a, U+1E01->a, U+1E9A->a, U+1EA0->a, U+1EA1->a, U+1EA2->a, U+1EA3->a,\
U+1EA4->a, U+1EA5->a, U+1EA6->a, U+1EA7->a, U+1EA8->a, U+1EA9->a, U+1EAA->a, U+1EAB->a,\
U+1EAC->a, U+1EAD->a, U+1EAE->a, U+1EAF->a, U+1EB0->a, U+1EB1->a, U+1EB2->a, U+1EB3->a,\
U+1EB4->a, U+1EB5->a, U+1EB6->a, U+1EB7->a, U+2090->a, U+2C65->a, U+0180->b, U+0181->b,\
U+0182->b, U+0183->b, U+0243->b, U+0253->b, U+0299->b, U+16D2->b, U+1D03->b, U+1D2E->b,\
U+1D2F->b, U+1D47->b, U+1D6C->b, U+1D80->b, U+1E02->b, U+1E03->b, U+1E04->b, U+1E05->b,\
U+1E06->b, U+1E07->b, U+00C7->c, U+00E7->c, U+0106->c, U+0107->c, U+0108->c, U+0109->c,\
U+010A->c, U+010B->c, U+010C->c, U+010D->c, U+0187->c, U+0188->c, U+023B->c, U+023C->c,\
U+0255->c, U+0297->c, U+1D9C->c, U+1D9D->c, U+1E08->c, U+1E09->c, U+212D->c, U+2184->c,\
U+010E->d, U+010F->d, U+0110->d, U+0111->d, U+0189->d, U+018A->d, U+018B->d, U+018C->d,\
U+01C5->d, U+01F2->d, U+0221->d, U+0256->d, U+0257->d, U+1D05->d, U+1D30->d, U+1D48->d,\
U+1D6D->d, U+1D81->d, U+1D91->d, U+1E0A->d, U+1E0B->d, U+1E0C->d, U+1E0D->d, U+1E0E->d,\
U+1E0F->d, U+1E10->d, U+1E11->d, U+1E12->d, U+1E13->d, U+00C8->e, U+00C9->e, U+00CA->e,\
U+00CB->e, U+00E8->e, U+00E9->e, U+00EA->e, U+00EB->e, U+0112->e, U+0113->e, U+0114->e,\
U+0115->e, U+0116->e, U+0117->e, U+0118->e, U+0119->e, U+011A->e, U+011B->e, U+018E->e,\
U+0190->e, U+01DD->e, U+0204->e, U+0205->e, U+0206->e, U+0207->e, U+0228->e, U+0229->e,\
U+0246->e, U+0247->e, U+0258->e, U+025B->e, U+025C->e, U+025D->e, U+025E->e, U+029A->e,\
U+1D07->e, U+1D08->e, U+1D31->e, U+1D32->e, U+1D49->e, U+1D4B->e, U+1D4C->e, U+1D92->e,\
U+1D93->e, U+1D94->e, U+1D9F->e, U+1E14->e, U+1E15->e, U+1E16->e, U+1E17->e, U+1E18->e,\
U+1E19->e, U+1E1A->e, U+1E1B->e, U+1E1C->e, U+1E1D->e, U+1EB8->e, U+1EB9->e, U+1EBA->e,\
U+1EBB->e, U+1EBC->e, U+1EBD->e, U+1EBE->e, U+1EBF->e, U+1EC0->e, U+1EC1->e, U+1EC2->e,\
U+1EC3->e, U+1EC4->e, U+1EC5->e, U+1EC6->e, U+1EC7->e, U+2091->e, U+0191->f, U+0192->f,\
U+1D6E->f, U+1D82->f, U+1DA0->f, U+1E1E->f, U+1E1F->f, U+011C->g, U+011D->g, U+011E->g,\
U+011F->g, U+0120->g, U+0121->g, U+0122->g, U+0123->g, U+0193->g, U+01E4->g, U+01E5->g,\
U+01E6->g, U+01E7->g, U+01F4->g, U+01F5->g, U+0260->g, U+0261->g, U+0262->g, U+029B->g,\
U+1D33->g, U+1D4D->g, U+1D77->g, U+1D79->g, U+1D83->g, U+1DA2->g, U+1E20->g, U+1E21->g,\
U+0124->h, U+0125->h, U+0126->h, U+0127->h, U+021E->h, U+021F->h, U+0265->h, U+0266->h,\
U+029C->h, U+02AE->h, U+02AF->h, U+02B0->h, U+02B1->h, U+1D34->h, U+1DA3->h, U+1E22->h,\
U+1E23->h, U+1E24->h, U+1E25->h, U+1E26->h, U+1E27->h, U+1E28->h, U+1E29->h, U+1E2A->h,\
U+1E2B->h, U+1E96->h, U+210C->h, U+2C67->h, U+2C68->h, U+2C75->h, U+2C76->h, U+00CC->i,\
U+00CD->i, U+00CE->i, U+00CF->i, U+00EC->i, U+00ED->i, U+00EE->i, U+00EF->i, U+010309->i,\
U+0128->i, U+0129->i, U+012A->i, U+012B->i, U+012C->i, U+012D->i, U+012E->i, U+012F->i,\
U+0130->i, U+0131->i, U+0197->i, U+01CF->i, U+01D0->i, U+0208->i, U+0209->i, U+020A->i,\
U+020B->i, U+0268->i, U+026A->i, U+040D->i, U+0418->i, U+0419->i, U+0438->i, U+0439->i,\
U+0456->i, U+1D09->i, U+1D35->i, U+1D4E->i, U+1D62->i, U+1D7B->i, U+1D96->i, U+1DA4->i,\
U+1DA6->i, U+1DA7->i, U+1E2C->i, U+1E2D->i, U+1E2E->i, U+1E2F->i, U+1EC8->i, U+1EC9->i,\
U+1ECA->i, U+1ECB->i, U+2071->i, U+2111->i, U+0134->j, U+0135->j, U+01C8->j, U+01CB->j,\
U+01F0->j, U+0237->j, U+0248->j, U+0249->j, U+025F->j, U+0284->j, U+029D->j, U+02B2->j,\
U+1D0A->j, U+1D36->j, U+1DA1->j, U+1DA8->j, U+0136->k, U+0137->k, U+0198->k, U+0199->k,\
U+01E8->k, U+01E9->k, U+029E->k, U+1D0B->k, U+1D37->k, U+1D4F->k, U+1D84->k, U+1E30->k,\
U+1E31->k, U+1E32->k, U+1E33->k, U+1E34->k, U+1E35->k, U+2C69->k, U+2C6A->k, U+0139->l,\
U+013A->l, U+013B->l, U+013C->l, U+013D->l, U+013E->l, U+013F->l, U+0140->l, U+0141->l,\
U+0142->l, U+019A->l, U+01C8->l, U+0234->l, U+023D->l, U+026B->l, U+026C->l, U+026D->l,\
U+029F->l, U+02E1->l, U+1D0C->l, U+1D38->l, U+1D85->l, U+1DA9->l, U+1DAA->l, U+1DAB->l,\
U+1E36->l, U+1E37->l, U+1E38->l, U+1E39->l, U+1E3A->l, U+1E3B->l, U+1E3C->l, U+1E3D->l,\
U+2C60->l, U+2C61->l, U+2C62->l, U+019C->m, U+026F->m, U+0270->m, U+0271->m, U+1D0D->m,\
U+1D1F->m, U+1D39->m, U+1D50->m, U+1D5A->m, U+1D6F->m, U+1D86->m, U+1DAC->m, U+1DAD->m,\
U+1E3E->m, U+1E3F->m, U+1E40->m, U+1E41->m, U+1E42->m, U+1E43->m, U+00D1->n, U+00F1->n,\
U+0143->n, U+0144->n, U+0145->n, U+0146->n, U+0147->n, U+0148->n, U+0149->n, U+019D->n,\
U+019E->n, U+01CB->n, U+01F8->n, U+01F9->n, U+0220->n, U+0235->n, U+0272->n, U+0273->n,\
U+0274->n, U+1D0E->n, U+1D3A->n, U+1D3B->n, U+1D70->n, U+1D87->n, U+1DAE->n, U+1DAF->n,\
U+1DB0->n, U+1E44->n, U+1E45->n, U+1E46->n, U+1E47->n, U+1E48->n, U+1E49->n, U+1E4A->n,\
U+1E4B->n, U+207F->n, U+00D2->o, U+00D3->o, U+00D4->o, U+00D5->o, U+00D6->o, U+00D8->o,\
U+00F2->o, U+00F3->o, U+00F4->o, U+00F5->o, U+00F6->o, U+00F8->o, U+01030F->o, U+014C->o,\
U+014D->o, U+014E->o, U+014F->o, U+0150->o, U+0151->o, U+0186->o, U+019F->o, U+01A0->o,\
U+01A1->o, U+01D1->o, U+01D2->o, U+01EA->o, U+01EB->o, U+01EC->o, U+01ED->o, U+01FE->o,\
U+01FF->o, U+020C->o, U+020D->o, U+020E->o, U+020F->o, U+022A->o, U+022B->o, U+022C->o,\
U+022D->o, U+022E->o, U+022F->o, U+0230->o, U+0231->o, U+0254->o, U+0275->o, U+043E->o,\
U+04E6->o, U+04E7->o, U+04E8->o, U+04E9->o, U+04EA->o, U+04EB->o, U+1D0F->o, U+1D10->o,\
U+1D11->o, U+1D12->o, U+1D13->o, U+1D16->o, U+1D17->o, U+1D3C->o, U+1D52->o, U+1D53->o,\
U+1D54->o, U+1D55->o, U+1D97->o, U+1DB1->o, U+1E4C->o, U+1E4D->o, U+1E4E->o, U+1E4F->o,\
U+1E50->o, U+1E51->o, U+1E52->o, U+1E53->o, U+1ECC->o, U+1ECD->o, U+1ECE->o, U+1ECF->o,\
U+1ED0->o, U+1ED1->o, U+1ED2->o, U+1ED3->o, U+1ED4->o, U+1ED5->o, U+1ED6->o, U+1ED7->o,\
U+1ED8->o, U+1ED9->o, U+1EDA->o, U+1EDB->o, U+1EDC->o, U+1EDD->o, U+1EDE->o, U+1EDF->o,\
U+1EE0->o, U+1EE1->o, U+1EE2->o, U+1EE3->o, U+2092->o, U+2C9E->o, U+2C9F->o, U+01A4->p,\
U+01A5->p, U+1D18->p, U+1D3E->p, U+1D56->p, U+1D71->p, U+1D7D->p, U+1D88->p, U+1E54->p,\
U+1E55->p, U+1E56->p, U+1E57->p, U+2C63->p, U+024A->q, U+024B->q, U+02A0->q, U+0154->r,\
U+0155->r, U+0156->r, U+0157->r, U+0158->r, U+0159->r, U+0210->r, U+0211->r, U+0212->r,\
U+0213->r, U+024C->r, U+024D->r, U+0279->r, U+027A->r, U+027B->r, U+027C->r, U+027D->r,\
U+027E->r, U+027F->r, U+0280->r, U+0281->r, U+02B3->r, U+02B4->r, U+02B5->r, U+02B6->r,\
U+1D19->r, U+1D1A->r, U+1D3F->r, U+1D63->r, U+1D72->r, U+1D73->r, U+1D89->r, U+1DCA->r,\
U+1E58->r, U+1E59->r, U+1E5A->r, U+1E5B->r, U+1E5C->r, U+1E5D->r, U+1E5E->r, U+1E5F->r,\
U+211C->r, U+2C64->r, U+00DF->s, U+015A->s, U+015B->s, U+015C->s, U+015D->s, U+015E->s,\
U+015F->s, U+0160->s, U+0161->s, U+017F->s, U+0218->s, U+0219->s, U+023F->s, U+0282->s,\
U+02E2->s, U+1D74->s, U+1D8A->s, U+1DB3->s, U+1E60->s, U+1E61->s, U+1E62->s, U+1E63->s,\
U+1E64->s, U+1E65->s, U+1E66->s, U+1E67->s, U+1E68->s, U+1E69->s, U+1E9B->s, U+0162->t,\
U+0163->t, U+0164->t, U+0165->t, U+0166->t, U+0167->t, U+01AB->t, U+01AC->t, U+01AD->t,\
U+01AE->t, U+021A->t, U+021B->t, U+0236->t, U+023E->t, U+0287->t, U+0288->t, U+1D1B->t,\
U+1D40->t, U+1D57->t, U+1D75->t, U+1DB5->t, U+1E6A->t, U+1E6B->t, U+1E6C->t, U+1E6D->t,\
U+1E6E->t, U+1E6F->t, U+1E70->t, U+1E71->t, U+1E97->t, U+2C66->t, U+00D9->u, U+00DA->u,\
U+00DB->u, U+00DC->u, U+00F9->u, U+00FA->u, U+00FB->u, U+00FC->u, U+010316->u, U+0168->u,\
U+0169->u, U+016A->u, U+016B->u, U+016C->u, U+016D->u, U+016E->u, U+016F->u, U+0170->u,\
U+0171->u, U+0172->u, U+0173->u, U+01AF->u, U+01B0->u, U+01D3->u, U+01D4->u, U+01D5->u,\
U+01D6->u, U+01D7->u, U+01D8->u, U+01D9->u, U+01DA->u, U+01DB->u, U+01DC->u, U+0214->u,\
U+0215->u, U+0216->u, U+0217->u, U+0244->u, U+0289->u, U+1D1C->u, U+1D1D->u, U+1D1E->u,\
U+1D41->u, U+1D58->u, U+1D59->u, U+1D64->u, U+1D7E->u, U+1D99->u, U+1DB6->u, U+1DB8->u,\
U+1E72->u, U+1E73->u, U+1E74->u, U+1E75->u, U+1E76->u, U+1E77->u, U+1E78->u, U+1E79->u,\
U+1E7A->u, U+1E7B->u, U+1EE4->u, U+1EE5->u, U+1EE6->u, U+1EE7->u, U+1EE8->u, U+1EE9->u,\
U+1EEA->u, U+1EEB->u, U+1EEC->u, U+1EED->u, U+1EEE->u, U+1EEF->u, U+1EF0->u, U+1EF1->u,\
U+01B2->v, U+0245->v, U+028B->v, U+028C->v, U+1D20->v, U+1D5B->v, U+1D65->v, U+1D8C->v,\
U+1DB9->v, U+1DBA->v, U+1E7C->v, U+1E7D->v, U+1E7E->v, U+1E7F->v, U+2C74->v, U+0174->w,\
U+0175->w, U+028D->w, U+02B7->w, U+1D21->w, U+1D42->w, U+1E80->w, U+1E81->w, U+1E82->w,\
U+1E83->w, U+1E84->w, U+1E85->w, U+1E86->w, U+1E87->w, U+1E88->w, U+1E89->w, U+1E98->w,\
U+02E3->x, U+1D8D->x, U+1E8A->x, U+1E8B->x, U+1E8C->x, U+1E8D->x, U+2093->x, U+00DD->y,\
U+00FD->y, U+00FF->y, U+0176->y, U+0177->y, U+0178->y, U+01B3->y, U+01B4->y, U+0232->y,\
U+0233->y, U+024E->y, U+024F->y, U+028E->y, U+028F->y, U+02B8->y, U+1E8E->y, U+1E8F->y,\
U+1E99->y, U+1EF2->y, U+1EF3->y, U+1EF4->y, U+1EF5->y, U+1EF6->y, U+1EF7->y, U+1EF8->y,\
U+1EF9->y, U+0179->z, U+017A->z, U+017B->z, U+017C->z, U+017D->z, U+017E->z, U+01B5->z,\
U+01B6->z, U+0224->z, U+0225->z, U+0240->z, U+0290->z, U+0291->z, U+1D22->z, U+1D76->z,\
U+1D8E->z, U+1DBB->z, U+1DBC->z, U+1DBD->z, U+1E90->z, U+1E91->z, U+1E92->z, U+1E93->z,\
U+1E94->z, U+1E95->z, U+2128->z, U+2C6B->z, U+2C6C->z, U+00C6->U+00E6, U+01E2->U+00E6,\
U+01E3->U+00E6, U+01FC->U+00E6, U+01FD->U+00E6, U+1D01->U+00E6, U+1D02->U+00E6,\
U+1D2D->U+00E6, U+1D46->U+00E6, U+00E6, U+0400->U+0435, U+0401->U+0435, U+0402->U+0452,\
U+0452, U+0403->U+0433, U+0404->U+0454, U+0454, U+0405->U+0455, U+0455, U+0406->U+0456,\
U+0407->U+0456, U+0457->U+0456, U+0456, U+0408..U+040B->U+0458..U+045B, U+0458..U+045B,\
U+040C->U+043A, U+040D->U+0438, U+040E->U+0443, U+040F->U+045F, U+045F, U+0450->U+0435,\
U+0451->U+0435, U+0453->U+0433, U+045C->U+043A, U+045D->U+0438, U+045E->U+0443,\
U+0460->U+0461, U+0461, U+0462->U+0463, U+0463, U+0464->U+0465, U+0465, U+0466->U+0467,\
U+0467, U+0468->U+0469, U+0469, U+046A->U+046B, U+046B, U+046C->U+046D, U+046D,\
U+046E->U+046F, U+046F, U+0470->U+0471, U+0471, U+0472->U+0473, U+0473, U+0474->U+0475,\
U+0476->U+0475, U+0477->U+0475, U+0475, U+0478->U+0479, U+0479, U+047A->U+047B, U+047B,\
U+047C->U+047D, U+047D, U+047E->U+047F, U+047F, U+0480->U+0481, U+0481, U+048A->U+0438,\
U+048B->U+0438, U+048C->U+044C, U+048D->U+044C, U+048E->U+0440, U+048F->U+0440,\
U+0490->U+0433, U+0491->U+0433, U+0490->U+0433, U+0491->U+0433, U+0492->U+0433,\
U+0493->U+0433, U+0494->U+0433, U+0495->U+0433, U+0496->U+0436, U+0497->U+0436,\
U+0498->U+0437, U+0499->U+0437, U+049A->U+043A, U+049B->U+043A, U+049C->U+043A,\
U+049D->U+043A, U+049E->U+043A, U+049F->U+043A, U+04A0->U+043A, U+04A1->U+043A,\
U+04A2->U+043D, U+04A3->U+043D, U+04A4->U+043D, U+04A5->U+043D, U+04A6->U+043F,\
U+04A7->U+043F, U+04A8->U+04A9, U+04A9, U+04AA->U+0441, U+04AB->U+0441, U+04AC->U+0442,\
U+04AD->U+0442, U+04AE->U+0443, U+04AF->U+0443, U+04B0->U+0443, U+04B1->U+0443,\
U+04B2->U+0445, U+04B3->U+0445, U+04B4->U+04B5, U+04B5, U+04B6->U+0447, U+04B7->U+0447,\
U+04B8->U+0447, U+04B9->U+0447, U+04BA->U+04BB, U+04BB, U+04BC->U+04BD, U+04BE->U+04BD,\
U+04BF->U+04BD, U+04BD, U+04C0->U+04CF, U+04CF, U+04C1->U+0436, U+04C2->U+0436,\
U+04C3->U+043A, U+04C4->U+043A, U+04C5->U+043B, U+04C6->U+043B, U+04C7->U+043D,\
U+04C8->U+043D, U+04C9->U+043D, U+04CA->U+043D, U+04CB->U+0447, U+04CC->U+0447,\
U+04CD->U+043C, U+04CE->U+043C, U+04D0->U+0430, U+04D1->U+0430, U+04D2->U+0430,\
U+04D3->U+0430, U+04D4->U+00E6, U+04D5->U+00E6, U+04D6->U+0435, U+04D7->U+0435,\
U+04D8->U+04D9, U+04DA->U+04D9, U+04DB->U+04D9, U+04D9, U+04DC->U+0436, U+04DD->U+0436,\
U+04DE->U+0437, U+04DF->U+0437, U+04E0->U+04E1, U+04E1, U+04E2->U+0438, U+04E3->U+0438,\
U+04E4->U+0438, U+04E5->U+0438, U+04E6->U+043E, U+04E7->U+043E, U+04E8->U+043E,\
U+04E9->U+043E, U+04EA->U+043E, U+04EB->U+043E, U+04EC->U+044D, U+04ED->U+044D,\
U+04EE->U+0443, U+04EF->U+0443, U+04F0->U+0443, U+04F1->U+0443, U+04F2->U+0443,\
U+04F3->U+0443, U+04F4->U+0447, U+04F5->U+0447, U+04F6->U+0433, U+04F7->U+0433,\
U+04F8->U+044B, U+04F9->U+044B, U+04FA->U+0433, U+04FB->U+0433, U+04FC->U+0445,\
U+04FD->U+0445, U+04FE->U+0445, U+04FF->U+0445, U+0410..U+0418->U+0430..U+0438,\
U+0419->U+0438, U+0430..U+0438, U+041A..U+042F->U+043A..U+044F, U+043A..U+044F,\
U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z, U+FF41..U+FF5A->a..z, 0..9, A..Z->a..z, a..z\
But when I search for, e.g., Arabic text, it doesn't return any results, whereas MySQL's LIKE does.
I also wanted to ask: is there a search server better suited to this kind of task than Sphinx?
Thanks a lot
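A note on the Arabic case: in Sphinx, any character that is not listed in charset_table is treated as a word separator, and the table above only maps Latin and Cyrillic code points, so Arabic text is never indexed as words at all; that would explain why LIKE finds the rows but Sphinx returns nothing. A minimal sketch of the kind of entries that would have to be added for the basic Arabic letter blocks (the ranges below are an illustration, not a complete mapping):

charset_table = 0..9, A..Z->a..z, a..z, \
    U+0621..U+063A, U+0641..U+064A  # basic Arabic letters, kept as-is
# Chinese, Thai and similar scripts have no spaces between words; Sphinx's
# ngram_len / ngram_chars options are the usual way to handle those.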
Related
Encoding foreign alphabet characters
I am getting data from an XML provided by an API that, for some reason, lists Czech/Slovak characters in a different encoding (e.g. instead of the correct "ý" it uses "ý"). Therefore, instead of providing the correct output to the user, "Zelený", the output is "Zelený". I went through multiple StackOverflow posts, other forums and tutorials, but I still cannot figure out how to turn "Zelený" into "Zelený" (this is just one of the weird characters used by the XML, so I cannot use str.replace). I figured out that the correct encoding for the Czech/Slovak language is "windows-1250".
My code:

def change_encoding(what):
    what = what.encode("windows-1250")
    return what

clean_xml_input = change_encoding(xml_input)

This produces the error:

'charmap' codec can't encode characters in position 5-6: character maps to <undefined>
"Zelený".encode("Windows-1252").decode("utf-8") #'Zelený' "Zelený".encode("windows-1254").decode("utf-8") #'Zelený' "Zelený".encode("iso-8859-1").decode("utf-8") #'Zelený' "Zelený".encode("iso-8859-9").decode("utf-8") #'Zelený' If it is helpful from itertools import permutations all_encoding = ['ASMO-708', 'big5', 'cp1025', 'cp866', 'cp875', 'csISO2022JP', 'DOS-720', 'DOS-862', 'EUC-CN', 'EUC-JP', 'euc-jp', 'euc-kr', 'GB18030', 'gb2312', 'hz-gb-2312', 'IBM00858', 'IBM00924', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'IBM037', 'IBM1026', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'IBM437', 'IBM500', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM860', 'ibm861', 'IBM863', 'IBM864', 'IBM865', 'ibm869', 'IBM870', 'IBM871', 'IBM880', 'IBM905', 'IBM-Thai', 'iso-2022-jp', 'iso-2022-jp', 'iso-2022-kr', 'iso-8859-1', 'iso-8859-13', 'iso-8859-15', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-8-i', 'iso-8859-9', 'Johab', 'koi8-r', 'koi8-u', 'ks_c_5601-1987', 'macintosh', 'shift_jis', 'us-ascii', 'utf-16', 'utf-16BE', 'utf-32', 'utf-32BE', 'utf-7', 'utf-8', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'windows-874', 'x-Chinese-CNS', 'x-Chinese-Eten', 'x-cp20001', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-cp20261', 'x-cp20269', 'x-cp20936', 'x-cp20949', 'x-cp50227', 'x-EBCDIC-KoreanExtended', 'x-Europa', 'x-IA5', 'x-IA5-German', 'x-IA5-Norwegian', 'x-IA5-Swedish', 'x-iscii-as', 'x-iscii-be', 'x-iscii-de', 'x-iscii-gu', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-or', 'x-iscii-pa', 'x-iscii-ta', 'x-iscii-te', 'x-mac-arabic', 'x-mac-ce', 'x-mac-chinesesimp', 'x-mac-chinesetrad', 'x-mac-croatian', 'x-mac-cyrillic', 'x-mac-greek', 'x-mac-hebrew', 'x-mac-icelandic', 'x-mac-japanese', 'x-mac-korean', 'x-mac-romanian', 'x-mac-thai', 'x-mac-turkish', 'x-mac-ukrainian'] for i,j in permutations(all_encoding, 2): try: if("Zelený".encode(i).decode(j) == 'Zelený'): print(f'encode with `{i}` and decode with `{j}`') except: pass
Output not displaying full list of elements appended
from csv import reader def func(sku_list): values = [] with open(sku_list, 'r', encoding = 'utf-8') as pr: rows = reader(pr) for sku in rows: values.append(sku[1]) return(values) if __name__ == '__main__': dir_path = "C:/Users/XXXX/Downloads/" vendors = dir_path + 'file.csv' new_prices = func(vendors) print(new_prices) sku_list is a csv file filled with pairs of brand names and their skus that I have downloaded from my db, for some reason as it iters through the rows and grabs just the sku value, hence sku[1], it stops well short of the actual length I expect the list to be sku_list is 85,892 tuples long but when I print out the values appended to the list values it simply returns this: ['SKU', 'MWGB4896', 'MWGB4872', 'MWGB4848', 'MWGB3648', 'WGB4896', 'WGB4872', 'WGB4848', 'WGB3648', 'WGB2436', 'WGB1824', 'BKGB4896NT', 'BKGB4872NT', 'BKGB4848NT', 'BKGB3648NT', 'BKGB2436NT', 'BKGB1824NT', 'WFC2418G', 'WFC2418', 'WFC3624', 'WFC2418LB', 'WFC3648LB', 'WFC3624LB', 'WFC3624G', 'WFC3648G', 'WFC3648', 'LOWFC3624LB', 'LOWFC3624G', 'LOWFC3624', 'LOWFC2418LB', 'LOWFC2418G', 'LOWFC2418', 'LOWFC3648LB', 'LOWFC3648', 'LOWFC3648G', 'WM-7-B', 'WM-7-G', 'WM-7-BK', 'WMC-7', 'WM-7-R', 'APS-50', 'APS-70', 'APS-60', 'APS-84', 'SS15W', 'SC15W', 'SB15W', 'MFL-2W', 'WP-48', 'WP-40', 'WP-36', 'MP-48', 'MP-40', 'MP-36', 'OP-40', 'OP-36', 'OP-48', 'FFVSU96-2', 'FFVSU144-2', 'FFVSU192-2', '1-WA-1B', '1-WA-1BP', 'WCS-12', 'WCS-144', 'OPLD3416LSPP-2', 'OPLD3416LSPP-4', 'OPLD3416LSPP-5', 'OPLD3416LSPP-7', 'OPLD3416LSPP-8', 'OPLD1818LSPP-2', 'OPLD1818LSPP-4', 'OPLD1818LSPP-5', 'OPLD1818LSPP-7', 'OPLD1818LSPP-8', 'OPLD1818L-2', 'OPLD1818L-5', 'OPLD1818L-4', 'OPLD3416L-2', 'OPLD3416L-4', 'OPLD3416L-5', 'OPLD3416L-7', 'OPLD3416L-8', 'OPLD1818L-7', 'OPLD1818L-8', 'OPLD3416SPP-8-892', 'OPLD3416SPP-8-897', 'OPLD3416SPP-8-878', 'OPLD3416SPP-8-885', 'OPLD3416SPP-8-887', 'OPLD3416SPP-8-890', 'OPLD3416SPP-8-845', 'OPLD3416SPP-8-854', 'OPLD3416SPP-8-856', 'OPLD3416SPP-8-876', 'OPLD3416SPP-8-802', 'OPLD3416SPP-8-706', 'OPLD3416SPP-8-705', 'OPLD3416SPP-8-704', 'OPLD3416SPP-8-837', 'OPLD3416SPP-8-831', 'OPLD3416SPP-8-819', 'OPLD3416SPP-8-812', 'OPLD3416SPP-8-685', 'OPLD3416SPP-8-683', 'OPLD3416SPP-8-679', 'OPLD3416SPP-8-531', 'OPLD3416SPP-8-703', 'OPLD3416SPP-8-702', 'OPLD3416SPP-8-701', 'OPLD3416SPP-8-700', 'OPLD3416SPP-7-892', 'OPLD3416SPP-7-897', 'OPLD3416SPP-7-887', 'OPLD3416SPP-7-890', 'OPLD3416SPP-8-530', 'OPLD3416SPP-7-845', 'OPLD3416SPP-7-854', 'OPLD3416SPP-7-831', 'OPLD3416SPP-7-837', 'OPLD3416SPP-7-878', 'OPLD3416SPP-7-885', 'OPLD3416SPP-7-856', 'OPLD3416SPP-7-876', 'OPLD3416SPP-7-703', 'OPLD3416SPP-7-702', 'OPLD3416SPP-7-705', 'OPLD3416SPP-7-704', 'OPLD3416SPP-7-802', 'OPLD3416SPP-7-706', 'OPLD3416SPP-7-819', 'OPLD3416SPP-7-812', 'OPLD3416SPP-7-530', 'OPLD3416SPP-7-679', 'OPLD3416SPP-7-531', 'OPLD3416SPP-7-685', 'OPLD3416SPP-7-683', 'OPLD3416SPP-7-701', 'OPLD3416SPP-7-700', 'OPLD3416SPP-5-878', 'OPLD3416SPP-5-885', 'OPLD3416SPP-5-887', 'OPLD3416SPP-5-890', 'OPLD3416SPP-5-892', 'OPLD3416SPP-5-897', 'OPLD3416SPP-5-812', 'OPLD3416SPP-5-819', 'OPLD3416SPP-5-831', 'OPLD3416SPP-5-837', 'OPLD3416SPP-5-845', 'OPLD3416SPP-5-854', 'OPLD3416SPP-5-856', 'OPLD3416SPP-5-876', 'OPLD1818SPP-8-819', 'OPLD1818SPP-8-831', 'OPLD1818SPP-8-802', 'OPLD1818SPP-8-812', 'OPLD1818SPP-8-854', 'OPLD1818SPP-8-856', 'OPLD1818SPP-8-837', 'OPLD1818SPP-8-845', 'OPLD1818SPP-8-701', 'OPLD1818SPP-8-702', 'OPLD1818SPP-8-685', 'OPLD1818SPP-8-700', 'OPLD1818SPP-8-705', 'OPLD1818SPP-8-706', 'OPLD1818SPP-8-703', 'OPLD1818SPP-8-704', 'OPLD1818SPP-8-887', 
'OPLD1818SPP-8-885', 'OPLD1818SPP-8-878', 'OPLD1818SPP-8-876', 'OPLD1818SPP-8-897', 'OPLD1818SPP-8-892', 'OPLD1818SPP-8-890', 'OPLD3416SPP-4-837', 'OPLD3416SPP-4-831', 'OPLD3416SPP-4-854', 'OPLD3416SPP-4-845', 'OPLD3416SPP-4-802', 'OPLD3416SPP-4-706', 'OPLD3416SPP-4-819', 'OPLD3416SPP-4-812', 'OPLD3416SPP-4-890', 'OPLD3416SPP-4-887', 'OPLD3416SPP-4-897', 'OPLD3416SPP-4-892', 'OPLD3416SPP-4-876', 'OPLD3416SPP-4-856', 'OPLD3416SPP-4-885', 'OPLD3416SPP-4-878', 'OPLD3416SPP-5-531', 'OPLD3416SPP-5-679', 'OPLD3416SPP-5-683', 'OPLD3416SPP-5-685', 'OPLD3416SPP-5-530', 'OPLD3416SPP-5-704', 'OPLD3416SPP-5-705', 'OPLD3416SPP-5-706', 'OPLD3416SPP-5-802', 'OPLD3416SPP-5-700', 'OPLD3416SPP-5-701', 'OPLD3416SPP-5-702', 'OPLD3416SPP-5-703', 'OPLD3416SPP-2-837', 'OPLD3416SPP-2-831', 'OPLD3416SPP-2-819', 'OPLD3416SPP-2-812', 'OPLD3416SPP-2-802', 'OPLD3416SPP-2-706', 'OPLD3416SPP-2-705', 'OPLD3416SPP-2-704', 'OPLD3416SPP-2-890', 'OPLD3416SPP-2-887', 'OPLD3416SPP-2-885', 'OPLD3416SPP-2-878', 'OPLD3416SPP-2-876', 'OPLD3416SPP-2-856', 'OPLD3416SPP-2-854', 'OPLD3416SPP-2-845', 'OPLD3416SPP-4-531', 'OPLD3416SPP-4-679', 'OPLD3416SPP-4-530', 'OPLD3416SPP-2-892', 'OPLD3416SPP-2-897', 'OPLD3416SPP-4-704', 'OPLD3416SPP-4-705', 'OPLD3416SPP-4-702', 'OPLD3416SPP-4-703', 'OPLD3416SPP-4-700', 'OPLD3416SPP-4-701', 'OPLD3416SPP-4-683', 'OPLD3416SPP-4-685', 'OPLD3416SPP-2-530', 'OPLD3416SPP-2-531', 'OPLD3416SPP-2-679', 'OPLD3416SPP-2-683', 'OPLD3416SPP-2-685', 'OPLD3416SPP-2-700', 'OPLD3416SPP-2-701', 'OPLD3416SPP-2-702', 'OPLD3416SPP-2-703', 'OPLD1818SPP-7-819', 'OPLD1818SPP-7-831', 'OPLD1818SPP-7-837', 'OPLD1818SPP-7-845', 'OPLD1818SPP-7-705', 'OPLD1818SPP-7-706', 'OPLD1818SPP-7-802', 'OPLD1818SPP-7-812', 'OPLD1818SPP-7-701', 'OPLD1818SPP-7-702', 'OPLD1818SPP-7-703', 'OPLD1818SPP-7-704', 'OPLD1818SPP-7-679', 'OPLD1818SPP-7-683', 'OPLD1818SPP-7-685', 'OPLD1818SPP-7-700', 'OPLD1818SPP-8-531', 'OPLD1818SPP-8-530', 'OPLD1818SPP-8-683', 'OPLD1818SPP-8-679', 'OPLD1818SPP-7-897', 'OPLD1818SPP-7-887', 'OPLD1818SPP-7-885', 'OPLD1818SPP-7-892', 'OPLD1818SPP-7-890', 'OPLD1818SPP-7-856', 'OPLD1818SPP-7-854', 'OPLD1818SPP-7-878', 'OPLD1818SPP-7-876', 'OPLD1818SPP-5-819', 'OPLD1818SPP-5-831', 'OPLD1818SPP-5-802', 'OPLD1818SPP-5-812', 'OPLD1818SPP-5-705', 'OPLD1818SPP-5-706', 'OPLD1818SPP-5-703', 'OPLD1818SPP-5-704', 'OPLD1818SPP-5-701', 'OPLD1818SPP-5-702', 'OPLD1818SPP-5-685', 'OPLD1818SPP-5-700', 'OPLD1818SPP-5-679', 'OPLD1818SPP-5-683', 'OPLD1818SPP-5-530', 'OPLD1818SPP-5-531', 'OPLD1818SPP-7-531', 'OPLD1818SPP-7-530', 'OPLD1818SPP-5-897', 'OPLD1818SPP-5-892', 'OPLD1818SPP-5-890', 'OPLD1818SPP-5-887', 'OPLD1818SPP-5-885', 'OPLD1818SPP-5-878', 'OPLD1818SPP-5-876', 'OPLD1818SPP-5-856', 'OPLD1818SPP-5-854', 'OPLD1818SPP-5-845', 'OPLD1818SPP-5-837', 'OPLD1818SPP-4-701', 'OPLD1818SPP-4-702', 'OPLD1818SPP-4-703', 'OPLD1818SPP-4-704', 'OPLD1818SPP-4-705', 'OPLD1818SPP-4-706', 'OPLD1818SPP-4-802', 'OPLD1818SPP-4-812', 'OPLD1818SPP-4-530', 'OPLD1818SPP-4-531', 'OPLD1818SPP-4-679', 'OPLD1818SPP-4-683', 'OPLD1818SPP-4-685', 'OPLD1818SPP-4-700', 'OPLD1818SPP-4-887', 'OPLD1818SPP-2-837', 'OPLD1818SPP-2-845', 'OPLD1818SPP-2-854', 'OPLD1818SPP-2-856', 'OPLD1818SPP-2-802', 'OPLD1818SPP-2-812', 'OPLD1818SPP-2-819', 'OPLD1818SPP-2-831', 'OPLD1818SPP-2-890', 'OPLD1818SPP-2-892', 'OPLD1818SPP-2-897', 'OPLD1818SPP-2-876', 'OPLD1818SPP-2-878', 'OPLD1818SPP-2-885', 'OPLD1818SPP-2-887', 'OPLD1818SPP-2-531', 'OPLD1818SPP-2-530', 'OPLD1818SPP-2-683', 'OPLD1818SPP-2-679', 'OPLD1818SPP-2-704', 'OPLD1818SPP-2-703', 'OPLD1818SPP-2-706', 'OPLD1818SPP-2-705', 
'OPLD1818SPP-2-700', 'OPLD1818SPP-2-685', 'OPLD1818SPP-2-702', 'OPLD1818SPP-2-701', 'OPLD1818SPP-4-876', 'OPLD1818SPP-4-878', 'OPLD1818SPP-4-854', 'OPLD1818SPP-4-856', 'OPLD1818SPP-4-837', 'OPLD1818SPP-4-845', 'OPLD1818SPP-4-819', 'OPLD1818SPP-4-831', 'OPLD1818SPP-4-897', 'OPLD1818SPP-4-890', 'OPLD1818SPP-4-892', 'OPLD1818SPP-4-885', 'PLD4832DPP-2-845', 'PLD4832DPP-2-837', 'PLD4832DPP-2-856', 'PLD4832DPP-2-854', 'PLD4832DPP-2-878', 'PLD4832DPP-2-876', 'PLD4832DPP-2-887', 'PLD4832DPP-2-885', 'PLD4832DPP-2-892', 'PLD4832DPP-2-890', 'PLD4832DPP-2-897', 'PLD4832DPP-4-531', 'PLD4832DPP-4-530', 'PLD4832DPP-4-679', 'PLD4832DPP-4-683', 'PLD4832DPP-4-685', 'PLD4832DPP-4-700', 'PLD4832DPP-4-701', 'PLD4832DPP-4-702', 'PLD4832DPP-4-703', 'PLD4832DPP-4-704', 'PLD4832DPP-4-705', 'PLD4832DPP-4-706', 'PLD4832DPP-4-802', 'PLD4832DPP-4-812', 'PLD4832DPP-4-819', 'PLD4832DPP-4-831', 'PLD4832DPP-4-837', 'PLD4832DPP-4-845', 'PLD4832DPP-4-878', 'PLD4832DPP-4-876', 'PLD4832DPP-4-856', 'PLD4832DPP-4-854', 'PLD4832DPP-4-892', 'PLD4832DPP-4-890', 'PLD4832DPP-4-887', 'PLD4832DPP-4-885', 'PLD4832DPP-4-897', 'PLD4832DPP-5-683', 'PLD4832DPP-5-679', 'PLD4832DPP-5-531', 'PLD4832DPP-5-530', 'PLD4832DPP-5-701', 'PLD4832DPP-5-702', 'PLD4832DPP-5-685', 'PLD4832DPP-5-700', 'PLD4832DPP-5-705', 'PLD4832DPP-5-706', 'PLD4832DPP-5-703', 'PLD4832DPP-5-704', 'PLD4832DPP-5-819', 'PLD4832DPP-5-831', 'PLD4832DPP-5-802', 'PLD4832DPP-5-812', 'PLD4832DPP-5-854', 'PLD4832DPP-5-856', 'PLD4832DPP-5-837', 'PLD4832DPP-5-845', 'PLD4832DPP-2-701', 'PLD4832DPP-2-702', 'PLD4832DPP-2-685', 'PLD4832DPP-2-700', 'PLD4832DPP-2-679', 'PLD4832DPP-2-683', 'PLD4832DPP-2-530', 'PLD4832DPP-2-531', 'PLD4832DPP-2-819', 'PLD4832DPP-2-831', 'PLD4832DPP-2-802', 'PLD4832DPP-2-812', 'PLD4832DPP-2-705', 'PLD4832DPP-2-706', 'PLD4832DPP-2-703', 'PLD4832DPP-2-704', 'PLD4226DPP-8-887', 'PLD4226DPP-8-890', 'PLD4226DPP-8-892', 'PLD4226DPP-8-897', 'PLD4226DPP-8-856', 'PLD4226DPP-8-876', 'PLD4226DPP-8-878', 'PLD4226DPP-8-885', 'PLD4226DPP-8-831', 'PLD4226DPP-8-837', 'PLD4226DPP-8-845', 'PLD4226DPP-8-854', 'PLD4226DPP-8-706', 'PLD4226DPP-8-802', 'PLD4226DPP-8-812', 'PLD4226DPP-8-819', 'PLD4226DPP-5-892', 'PLD4226DPP-5-897', 'PLD4226DPP-5-887', 'PLD4226DPP-5-890', 'PLD4226DPP-7-530', 'PLD4226DPP-7-683', 'PLD4226DPP-7-685', 'PLD4226DPP-7-531', 'PLD4226DPP-7-679', 'PLD4226DPP-7-702', 'PLD4226DPP-7-703', 'PLD4226DPP-7-700', 'PLD4226DPP-7-701', 'PLD4226DPP-5-705', 'PLD4226DPP-5-704', 'PLD4226DPP-5-703', 'PLD4226DPP-5-702', 'PLD4226DPP-5-819', 'PLD4226DPP-5-812', 'PLD4226DPP-5-802', 'PLD4226DPP-5-706', 'PLD4226DPP-5-854', 'PLD4226DPP-5-845', 'PLD4226DPP-5-837', 'PLD4226DPP-5-831', 'PLD4226DPP-5-885', 'PLD4226DPP-5-878', 'PLD4226DPP-5-876', 'PLD4226DPP-5-856', 'PLD4226DPP-7-892', 'PLD4226DPP-7-897', 'PLD4226DPP-8-530', 'PLD4226DPP-8-531', 'PLD4226DPP-8-679', 'PLD4226DPP-8-683', 'PLD4226DPP-8-685', 'PLD4226DPP-8-700', 'PLD4226DPP-8-701', 'PLD4226DPP-8-702', 'PLD4226DPP-8-703', 'PLD4226DPP-8-704', 'PLD4226DPP-8-705', 'PLD4226DPP-7-705', 'PLD4226DPP-7-704', 'PLD4226DPP-7-802', 'PLD4226DPP-7-706', 'PLD4226DPP-7-819', 'PLD4226DPP-7-812', 'PLD4226DPP-7-837', 'PLD4226DPP-7-831', 'PLD4226DPP-7-854', 'PLD4226DPP-7-845', 'PLD4226DPP-7-876', 'PLD4226DPP-7-856', 'PLD4226DPP-7-885', 'PLD4226DPP-7-878', 'PLD4226DPP-7-890', 'PLD4226DPP-7-887', 'PLD4226DPP-2-892', 'PLD4226DPP-2-897', 'PLD4226DPP-2-887', 'PLD4226DPP-2-890', 'PLD4226DPP-2-878', 'PLD4226DPP-2-885', 'PLD4226DPP-2-856', 'PLD4226DPP-2-876', 'PLD4226DPP-4-683', 'PLD4226DPP-4-685', 'PLD4226DPP-4-531', 'PLD4226DPP-4-679', 
'PLD4226DPP-4-530', 'PLD4226DPP-2-705', 'PLD4226DPP-2-704', 'PLD4226DPP-2-703', 'PLD4226DPP-2-702', 'PLD4226DPP-2-701', 'PLD4226DPP-2-700', 'PLD4226DPP-2-685', 'PLD4226DPP-2-683', 'PLD4226DPP-2-854', 'PLD4226DPP-2-845', 'PLD4226DPP-2-837', 'PLD4226DPP-2-831', 'PLD4226DPP-2-819', 'PLD4226DPP-2-812', 'PLD4226DPP-2-802', 'PLD4226DPP-2-706', 'PLD4226DPP-4-892', 'PLD4226DPP-4-897', 'PLD4226DPP-4-878', 'PLD4226DPP-4-885', 'PLD4226DPP-4-887', 'PLD4226DPP-4-890', 'PLD4226DPP-5-683', 'PLD4226DPP-5-685', 'PLD4226DPP-5-700', 'PLD4226DPP-5-701', 'PLD4226DPP-5-530', 'PLD4226DPP-5-531', 'PLD4226DPP-5-679', 'PLD4226DPP-4-705', 'PLD4226DPP-4-704', 'PLD4226DPP-4-802', 'PLD4226DPP-4-706', 'PLD4226DPP-4-701', 'PLD4226DPP-4-700', 'PLD4226DPP-4-703', 'PLD4226DPP-4-702', 'PLD4226DPP-4-854', 'PLD4226DPP-4-845', 'PLD4226DPP-4-876', 'PLD4226DPP-4-856', 'PLD4226DPP-4-819', 'PLD4226DPP-4-812', 'PLD4226DPP-4-837', 'PLD4226DPP-4-831', 'PLD4226DPP-2-530', 'PLD4226DPP-2-679', 'PLD4226DPP-2-531', 'PLD5438DPP-5-683', 'PLD5438DPP-5-685', 'PLD5438DPP-5-531', 'PLD5438DPP-5-679', 'PLD5438DPP-5-702', 'PLD5438DPP-5-703', 'PLD5438DPP-5-700', 'PLD5438DPP-5-701', 'PLD5438DPP-5-706', 'PLD5438DPP-5-802', 'PLD5438DPP-5-704', 'PLD5438DPP-5-705', 'PLD5438DPP-5-831', 'PLD5438DPP-5-837', 'PLD5438DPP-5-812', 'PLD5438DPP-5-819', 'PLD5438DPP-5-876', 'PLD5438DPP-5-856', 'PLD5438DPP-5-854', 'PLD5438DPP-5-845', 'PLD5438DPP-5-890', 'PLD5438DPP-5-887', 'PLD5438DPP-5-885', 'PLD5438DPP-5-878', 'PLD5438DPP-5-897', 'PLD5438DPP-5-892', 'PLD5438DPP-7-679', 'PLD5438DPP-7-531', 'PLD5438DPP-7-530', 'PLD5438DPP-4-530', 'PLD5438DPP-4-531', 'PLD5438DPP-4-679', 'PLD5438DPP-4-683', 'PLD5438DPP-4-685', 'PLD5438DPP-4-700', 'PLD5438DPP-4-701', 'PLD5438DPP-4-702', 'PLD5438DPP-4-703', 'PLD5438DPP-4-704', 'PLD5438DPP-4-705', 'PLD5438DPP-4-706', 'PLD5438DPP-4-802', 'PLD5438DPP-4-812', 'PLD5438DPP-4-819', 'PLD5438DPP-4-837', 'PLD5438DPP-4-831', 'PLD5438DPP-4-854', 'PLD5438DPP-4-845', 'PLD5438DPP-4-876', 'PLD5438DPP-4-856', 'PLD5438DPP-4-885', 'PLD5438DPP-4-878', 'PLD5438DPP-4-890', 'PLD5438DPP-4-887', 'PLD5438DPP-4-897', 'PLD5438DPP-4-892', 'PLD5438DPP-5-530', 'PLD5438DPP None the final sku in there PLD5438DPP should be PLD5438DPP-5-683 and for some reason the list cuts off there, which is only element 564/85,892, and the program terminates without an error code I cannot attach the file of skus, this is for my job, just hoping someone can shed light on what I am doing to cause the list to cut short like that this may or may not also be relevant but when I call .append(sku) as opposed to sku[1] and grab the whole tuple the same issue occurs but at element 292, exactly half of the amount of element appended when only doing half the tuple
The issue seems to be one on my local machine: it was unable to print such a long list (this one was of size 85,892). For those with similar issues, see if any of our specs overlap, as that may help determine the cause:

VSCode: Version: 1.52.0 (user setup)
Commit: 940b5f4bb5fa47866a54529ed759d95d09ee80be
Date: 2020-12-10T22:45:11.850Z
Electron: 9.3.5
Chrome: 83.0.4103.122
Node.js: 12.14.1
V8: 8.3.110.13-electron.0
OS: Windows_NT x64 10.0.18363
Python: 3.9.0

See the discussion in the comments for other details.
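If the list itself is intact and only the console display is being cut off, a quick way to confirm that is to check the length, or to write the result to a file and inspect it there instead of printing it. A small sketch (the output file name is just an example):

print(len(new_prices))   # should report 85892 if every row was appended

with open('skus_out.txt', 'w', encoding='utf-8') as out:
    out.write('\n'.join(new_prices))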
Invert regex pattern in Python
I'm trying to filter a string so that only the Arabic characters remain, but the following function doesn't work for me:

import re

def remove_any_non_arabic_char(text):
    non_arabic_char = re.compile('^[\u0627-\u064a]')
    text = re.sub(non_arabic_char, "", text)
    print(text)

For example:

s = "Kühn xvii, 346] قال جالينوس: [1] قد اتفق جل من فسر هذا الكتا"

The desired output of remove_any_non_arabic_char(s) should be

قال جالينوس قد اتفق جل من فسر هذا الكتا

but the input stays unchanged. What should I do?
First, you need to fix your regex as suggested in the comments. Then, for a more efficient solution, you will need to expand your Unicode character selection to include all Arabic character mappings. Finally, you need to keep at least one space between Arabic words so that the Arabic text stays legible:

import re

def remove_any_non_arabic_char(text):
    non_arabic_char = re.compile(r'[^\s\u0600-\u06FF]')
    text_with_no_spaces = re.sub(non_arabic_char, "", text)
    text_with_single_spaces = " ".join(re.split(r"\s+", text_with_no_spaces))
    return text_with_single_spaces

text_1 = "Kühn xvii, 346] قال جالينوس: [1] قد اتفق جل من فسر هذا الكتا"

text_2 = '''
تغيّر مفهوم كلمة (أدب) من العصر الجاهلي jahili (pre-Islamic) era إلى الآن عبر مراحل periods التاريخ المتعددة. ففي الجاهلية، كانت كلمة أدب تعني (الدعوة إلى الطعام). وبعدها، استخدم الرسول محمد (عليه السلام) الكلمة بمعنى "التهذيب والتربية" education and mannerism. وفي العصر الأموي، اتصلت had to do كلمة أدب بالتاريخ والفقه والقرآن والحديث. أما في العصرالعباسي، فأصبحت تعني تعلّم الشعر والنثر prose واتسع الأدب ليشمل أنواع المعرفة وألوانها وخصوصاً علم البلاغة واللغة. أما في الوقت الحالي، فأصبحت كلمة أدب ذات صلة pertinent بالكلام البليغ الجميل المؤثر that impacts في أحاسيس القاريء أو السامع.
'''
# Isleem, N. M., & Abuhakema, G. M. (2020). Kalima wa Nagham: A Textbook for
# Teaching Arabic, Volume 2 (Vol. 3). University of Texas Press. (page 5)

print('text_1: \n', remove_any_non_arabic_char(text_1))
print('\ntext_2: \n\n', remove_any_non_arabic_char(text_2))

Running the code on the two texts above in Jupyter prints both texts with everything except the Arabic stripped out. Notice that punctuation marks shared between Arabic and English (like periods and brackets) have also been removed. To keep those, you would need to introduce more complex conditionals.
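One simple variant (a sketch only; the function name is made up) is to add the shared punctuation marks to the negated character class so they survive the substitution:

import re

# Keep Arabic letters, whitespace, and a few punctuation marks shared with
# English (period, comma, brackets, parentheses); extend the class as needed.
keep_shared_punct_re = re.compile(r'[^\s.,\[\]()\u0600-\u06FF]')

def remove_non_arabic_keep_punct(text):
    stripped = keep_shared_punct_re.sub("", text)
    return " ".join(stripped.split())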
Tokenize tweet based on Regex
I have the following example text / tweet:

RT #trader $AAPL 2012 is o´o´o´o´o´pen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh url_that_cannot_be_posted_on_SO

I want to follow the procedure of Table 1 in Li, T., van Dalen, J., & van Rees, P. J. (2017). More than just noise? Examining the information content of stock microblogs on financial markets. Journal of Information Technology. doi:10.1057/s41265-016-0034-2, in order to clean up the tweet. They clean the tweet up in such a way that the final result is:

{RT|123456} {USER|56789} {TICKER|AAPL} {NUMBER|2012} notooopen nottalk patent {COMPANY|GOOG} notdefinetli treatment {HASH|samsung} {EMOTICON|POS} haha {URL}

I use the following script to tokenize the tweet based on the regex:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

emoticon_string = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{#\|\\] # mouth
      |
      [\)\]\(\[dDpP/\:\}\{#\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
    )"""

regex_strings = (
    # URL:
    r"""http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"""
    ,
    # Twitter username:
    r"""(?:#[\w_]+)"""
    ,
    # Hashtags:
    r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
    ,
    # Cashtags:
    r"""(?:\$+[\w_]+[\w\'_\-]*[\w_]+)"""
    ,
    # Remaining word types:
    r"""
    (?:[+\-]?\d+[,/.:-]\d+[+\-]?)  # Numbers, including fractions, decimals.
    |
    (?:[\w_]+)                     # Words without apostrophes or dashes.
    |
    (?:\.(?:\s*\.){1,})            # Ellipsis dots.
    |
    (?:\S)                         # Everything else that isn't whitespace.
    """
    )

word_re = re.compile(r"""(%s)""" % "|".join(regex_strings), re.VERBOSE | re.I | re.UNICODE)
emoticon_re = re.compile(regex_strings[1], re.VERBOSE | re.I | re.UNICODE)

######################################################################

class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        try:
            s = str(s)
        except UnicodeDecodeError:
            s = str(s).encode('string_escape')
            s = unicode(s)
        # Tokenize:
        words = word_re.findall(s)
        if not self.preserve_case:
            words = map((lambda x: x if emoticon_re.search(x) else x.lower()), words)
        return words

if __name__ == '__main__':
    tok = Tokenizer(preserve_case=False)
    test = ' RT #trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh url_that_cannot_be_posted_on_SO'
    tokenized = tok.tokenize(test)
    print("\n".join(tokenized))

This yields the following output:

rt #trader $aapl 2012 is oooopen to ‘ talk ’ about patents with goog definitely not the treatment #samsung got :-) heh url_that_cannot_be_posted_on_SO

How can I adjust this script to get:

rt {USER|trader} {CASHTAG|aapl} {NUMBER|2012} is oooopen to ‘ talk ’ about patents with goog definitely not the treatment {HASHTAG|samsung} got {EMOTICON|:-)} heh {URL|url_that_cannot_be_posted_on_SO}

Thanks in advance for helping me out big time!
You really need to use named capturing groups (mentioned by thebjorn), and use groupdict() to get name-value pairs upon each match. It requires some post-processing though:

- All pairs where the value is None must be discarded.
- If self.preserve_case is false, the value can be turned to lower case at once.
- If the group name is WORD, ELLIPSIS or ELSE, the values are added to words as is.
- If the group name is HASHTAG, CASHTAG, USER or URL, the values are first stripped of the leading $, # and # chars and then added to words as {<GROUP_NAME>|<VALUE>} items.
- All other matches are added to words as {<GROUP_NAME>|<VALUE>} items.

Note that \w matches underscores by default, so [\w_] = \w. I optimized the patterns a little bit.

Here is a fixed code snippet:

import re

emoticon_string = r"""
    (?P<EMOTICON>
      [<>]?
      [:;=8]              # eyes
      [-o*']?             # optional nose
      [][()dDpP/:{}#|\\]  # mouth
      |
      [][()dDpP/:}{#|\\]  # mouth
      [-o*']?             # optional nose
      [:;=8]              # eyes
      [<>]?
    )"""

regex_strings = (
    # URL:
    r"""(?P<URL>https?://(?:[-a-zA-Z0-9_$#.&+!*(),]|%[0-9a-fA-F][0-9a-fA-F])+)"""
    ,
    # Twitter username:
    r"""(?P<USER>#\w+)"""
    ,
    # Hashtags:
    r"""(?P<HASHTAG>\#+\w+[\w'-]*\w+)"""
    ,
    # Cashtags:
    r"""(?P<CASHTAG>\$+\w+[\w'-]*\w+)"""
    ,
    # Remaining word types:
    r"""
    (?P<NUMBER>[+-]?\d+(?:[,/.:-]\d+[+-]?)?)  # Numbers, including fractions, decimals.
    |
    (?P<WORD>\w+)                             # Words without apostrophes or dashes.
    |
    (?P<ELLIPSIS>\.(?:\s*\.)+)                # Ellipsis dots.
    |
    (?P<ELSE>\S)                              # Everything else that isn't whitespace.
    """
    )

word_re = re.compile(r"""({}|{})""".format(emoticon_string, "|".join(regex_strings)), re.VERBOSE | re.I | re.UNICODE)
# print(word_re.pattern)
emoticon_re = re.compile(regex_strings[1], re.VERBOSE | re.I | re.UNICODE)

######################################################################

class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        try:
            s = str(s)
        except UnicodeDecodeError:
            s = str(s).encode('string_escape')
            s = unicode(s)
        # Tokenize:
        words = []
        for x in word_re.finditer(s):
            for key, val in x.groupdict().items():
                if val:
                    if not self.preserve_case:
                        val = val.lower()
                    if key in ['WORD', 'ELLIPSIS', 'ELSE']:
                        words.append(val)
                    elif key in ['HASHTAG', 'CASHTAG', 'USER', 'URL']:  # Add more here if needed
                        words.append("{{{}|{}}}".format(key, re.sub(r'^[##$]+', '', val)))
                    else:
                        words.append("{{{}|{}}}".format(key, val))
        return words

if __name__ == '__main__':
    tok = Tokenizer(preserve_case=False)
    test = ' RT #trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh http://some.site.here.com'
    tokenized = tok.tokenize(test)
    print("\n".join(tokenized))

With test = ' RT #trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh http://some.site.here.com', it outputs:

rt {USER|trader} {CASHTAG|aapl} {NUMBER|2012} is oooopen to ‘ talk ’ about patents with goog definitely not the treatment {HASHTAG|samsung} got {EMOTICON|:-)} heh {URL|http://some.site.here.com}

See the regex demo online.
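The paper's target format in the question uses {EMOTICON|POS} rather than the literal emoticon. If you need that, one possible extra step (a sketch; the emoticon sets below are only illustrative) is to map each matched emoticon onto a sentiment bucket before formatting it:

# Illustrative emoticon-to-sentiment mapping; extend the sets as needed.
POSITIVE_EMOTICONS = {':-)', ':)', ';)', ':d', ':p'}
NEGATIVE_EMOTICONS = {':-(', ':('}

def bucket_emoticon(raw):
    if raw in POSITIVE_EMOTICONS:
        return '{EMOTICON|POS}'
    if raw in NEGATIVE_EMOTICONS:
        return '{EMOTICON|NEG}'
    return '{{EMOTICON|{}}}'.format(raw)

# In the tokenizer above, the EMOTICON branch could call this instead:
#     words.append(bucket_emoticon(val))
print(bucket_emoticon(':-)'))   # {EMOTICON|POS}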
How to train a sense2vec model
The documentation of sense2vec mentions three primary files, the first of them being merge_text.py. I have tried several types of input (txt, csv, bzipped files), since merge_text.py tries to open files compressed by bzip2. The file can be found at https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py
What type of input format does this script require? Further, could anyone please suggest how to train the model?
I extended and adjusted the code samples from sense2vec. You go from this input text:

"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are good at money and arithmetic. Faced with the painful choice of losing money maintaining current production at US$60 per barrel or taking two million barrels per day off the market and losing much more money - it's an easy choice: take the path that is less painful. If there are secondary reasons like hurting US tight oil producers or hurting Iran and Russia, that's great, but it's really just about the money."

To this:

as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motif|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN

A few things to note:

- Double line breaks are interpreted as separate documents.
- URLs are recognized as such, stripped down to domain.tld and marked as |URL.
- Nouns (including nouns that are part of noun phrases) are lemmatized ("motives" becomes "motif" here).
- Words with POS tags like DET (determinate article) and PUNCT (punctuation) are dropped.

Here's the code. Let me know if you have questions. I'll probably publish it on github.com/woltob soon.
import spacy
import re

nlp = spacy.load('en')
nlp.matcher = None

LABELS = {
    'ENT': 'ENT',
    'PERSON': 'PERSON',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}

pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')

def strip_meta(text):
    text = text.replace('per cent', 'percent')
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text

def transform_doc(doc):
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    for np in doc.noun_chunks:
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''

def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3) + '|URL'
        else:
            return word.text.lower().strip() + '|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Dropping PUNCTUATION such as commas and DET like "the"
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    # if not word.pos_:
    #     tag = '?'
    return text + '|' + tag

corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are good at money and arithmetic. Faced with the painful choice of losing money maintaining current production at US$60 per barrel or taking two million barrels per day off the market and losing much more money - it's an easy choice: take the path that is less painful. If there are secondary reasons like hurting US tight oil producers or hurting Iran and Russia, that's great, but it's really just about the money.
'''

corpus_stripped = strip_meta(corpus)

doc = nlp(corpus_stripped)
corpus_ = []
for word in doc:
    # only lemmatize NOUN and PROPN
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the original word with the length of the lemma, then add the white space, if it was there.
        lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
        # print(word.text, lemma_)
        corpus_.append(lemma_)
        # print(word.text, word.text[:len(word.lemma_)] + word.text_with_ws[len(word.text):])
    # All other words are added normally.
    else:
        corpus_.append(word.text_with_ws)

result = transform_doc(nlp(''.join(corpus_)))

sense2vec_filename = 'text.txt'
file = open(sense2vec_filename, 'w')
file.write(result)
file.close()
print(result)

You could visualise your model with Gensim in TensorBoard using this approach: https://github.com/ArdalanM/gensim2tensorboard
I'll also adjust this code to work with the sense2vec approach (e.g. the words become lowercase in the preprocessing step; just comment it out in the code).
Happy coding, woltob
The input file should be a bzipped JSON. To use a plain text file, just edit merge_text.py as follows:

def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            yield line.decode('utf-8', errors='ignore')
            # yield ujson.loads(line)['body']
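For reference, the unmodified script expects a bz2-compressed file with one JSON object per line, each carrying the text under a 'body' key (that is what the commented-out ujson.loads(line)['body'] reads). A minimal sketch of producing such a file (the file name and sample texts are only placeholders):

import bz2
import json

docs = [
    "First document text ...",
    "Second document text ...",
]

# One JSON object per line, text under the 'body' key, bz2-compressed so
# merge_text.py can open it unchanged.
with bz2.open('comments.bz2', 'wt', encoding='utf-8') as f:
    for text in docs:
        f.write(json.dumps({'body': text}) + '\n')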