Change Unicode escapes hard-coded in a CSV to the corresponding characters - python

I have a CSV in which one column contains hard-coded Unicode escape sequences:
["Investir dans un parc d'activit\u00e9s"]
["S\u00e9curiser, restaurer et g\u00e9rer 1 372 ha de milieux naturels impact\u00e9s par la construction de l'autoroute"]
["Am\u00e9liorer la consommation \u00e9nerg\u00e9tique de b\u00e2timents publics"]
["Favoriser la recherche, am\u00e9liorer la qualit\u00e9 des traitements et assurer un \u00e9gal acc\u00e8s des soins \u00e0 tous les patients de Franche-Comt\u00e9."]
I'm trying to replace them with the corresponding characters, but I can't seem to manage it. I tried
df['Objectif(s)'] = df['Objectif(s)'].replace('\u00e9', 'é')
but the column doesn't change.
Seeing that the code below works, I tried to loop over the rows to fix it, with no success:
s = "d'activit\u00e9s"
print(s) # d'activités
print(s.replace('\u00e9', 'é' )) # d'activités
for case in df['Objectif(s)']:
    s = str(case)
    df['Objectif(s)'][case] = s # ["Investir dans un parc d'activit\u00e9s"]

If '\u00e9' is actually written into the file as the six ordinary characters \ u 0 0 e 9 by the source of the data, you need a plain string replace.
The trick is that the backslash in the first argument of replace must itself be escaped; otherwise Python reads '\u00e9' as the single character é and never finds the literal sequence:
s.replace('\\u00e9', 'é')
or use a raw string literal by prefixing the string with r:
s.replace(r'\u00e9', 'é' )
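If the column contains many different escapes (\u00e9, \u00e8, \u00e0 and so on), replacing them one at a time gets tedious. Below is a minimal sketch that decodes all of them at once, assuming every escape has the four-hex-digit \uXXXX form seen in the sample rows. decode_unicode_escapes is a hypothetical helper name, not something pandas provides.

```python
import json
import re


def decode_unicode_escapes(text):
    """Replace every literal backslash-u-XXXX sequence with the character it names."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",               # a backslash, 'u', then four hex digits
        lambda m: chr(int(m.group(1), 16)),    # hex code point -> character
        text,
    )


# One cell exactly as it appears in the file (raw string: the backslash is literal)
raw = r"""["Investir dans un parc d'activit\u00e9s"]"""

decoded = decode_unicode_escapes(raw)
print(decoded)

# The sample cells also happen to be valid JSON lists, so json.loads decodes them too
parsed = json.loads(raw)
print(parsed[0])
```

With pandas, the helper could be applied to the whole column with df['Objectif(s)'].map(decode_unicode_escapes), assuming every cell is a string.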

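Series.replace only swaps cells whose whole value equals the search string, which is why nothing changes. Use the .str accessor to replace substrings, and search for the literal backslash sequence with a raw string (regex=False keeps pandas from interpreting \u00e9 as a regular-expression escape for é). Try replacing
df['Objectif(s)'] = df['Objectif(s)'].replace('\u00e9', 'é')
with
df['Objectif(s)'] = df['Objectif(s)'].str.replace(r'\u00e9', 'é', regex=False)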

Related

How to add a suffix (like measure units cm, m, km, etc) at the end of a float in my list?

I have to make a program that lets the user save people's personal information such as their last names, first names, genders, heights and so on. I am not allowed to use dictionaries, so there is no point in recommending them.
I don't know how to add the suffix "cm" when I print the person's height. Here is the code that requests the height (it starts at line 68 of my program):
taille = input("Entrez la taille de la personne en cm (0 à 250) :\n").strip()
isTailleValid = validation_taille(taille)
while not isTailleValid:
    taille = input("Taille invalide, entrez bien une valeur entre 0 et 250 :\n").strip()
    isTailleValid = validation_taille(taille)
taille = float(taille)
personInf[Personne_taille] = taille
This is where the program requests the height (French: taille). After that, it adds the input to a list called Liste_info under the Personne_taille index by doing:
Liste_info.append(personInf)
Now, when I call the function that prints the result, the height shows up as a plain number with no unit.
Is there any way to add " cm" at the end of 175?
while not isTailleValid:
    taille = input("Taille invalide, entrez bien une valeur entre 0 et 250 :\n").strip()
    isTailleValid = validation_taille(taille)
personInf[Personne_taille] = taille + "cm"
I think you have to convert the float to a str and then concatenate the strings:
while not isTailleValid:
    taille = input("Taille invalide, entrez bien une valeur entre 0 et 250 :\n").strip()
    isTailleValid = validation_taille(taille)
personInf[Personne_taille] = str(taille) + " cm"
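An alternative is to keep the height stored as a float and add the unit only when printing, so the value stays usable for comparisons or calculations. This is a minimal sketch; format_taille is a hypothetical helper, not part of the original program.

```python
def format_taille(taille):
    """Return the height as text with a ' cm' suffix."""
    # The 'g' format drops a trailing '.0' but keeps real decimals such as 175.5
    return f"{taille:g} cm"


print(format_taille(175.0))  # 175 cm
print(format_taille(175.5))  # 175.5 cm
```

The stored value personInf[Personne_taille] remains a number, and only the printing function needs to call the helper.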

Pandas Excel row extract changes fields

I am working on a small translation program. I have an .xlsx file with 5 columns, each in a different language (English, French, German, Spanish, Italian).
The program provides a drop-down list in which each row of the .xlsx is one of the available options (English only). Selecting an option takes the English value and adds it to a list.
I then use the following to extract the whole row of other languages based on the selected English value, separated by a delimiter (;):
instructionList = ['Avoid contact with light-coloured fabrics, leather and upholstery. Colour may transfer due to the nature of indigo-dyed denim.']
for i in range(len(instructionList)):
    newCompInst.append(translationFile.loc[translationFile['English'] == instructionList[i]].to_string(index=False, header=False))
newInst = [i.replace(' ', ',;') for i in newInst ]
strippedInst = [item.lstrip() for item in newInst ]
print('strippedInst: ', strippedInst)
The output I get from this code is:
strippedInst: ['Avoid contact with light-coloured fabrics, lea...,;Bviter le contact avec les tissus clairs, le c...,;Kontakt mit hellen Stoffen, Leder und Polsterm...,;Evitar el contacto con tejidos de colores clar...,;Evitare il contatto con capi dai colori delica...']
After running this code, every language is cut off partway and the rest of the sentence is replaced with '...' (note the English entry in strippedInst and compare it with what was passed into the loop in instructionList).
The output is only cut when the sentence is long. I tried shorter phrases and they all come through fine.
This is the Expected output:
strippedInst:
['
Avoid contact with light-coloured fabrics, leather and upholstery. Colour may transfer due to the nature of indigo-dyed denim.,;
Éviter le contact avec les tissus clairs, le cuir et les tissus d'ameublement. Les couleurs peuvent déteindre en raison de la nature de la teinture indigo du denim.,;
Kontakt mit hellen Stoffen, Leder und Polstermöbeln vermeiden. Aufgrund der Indigofärbung kann sich die Farbe übertragen,;
Evitar el contacto con tejidos de colores claros, con cuero y con tapicerías. El tinte índigo de los vaqueros podría transferirse a dichas superficies.,;
Evitare il contatto con capi dai colori delicati, pelli e tappezzerie. Si potrebbe verificare una perdita del colore blu intenso del tessuto di jeans.,
']
EDIT:
Here is the entire standalone working function:
import pandas as pd
excel_file = 'C:/Users/user/Desktop/Translation_Table_Edit.xlsx'
translationFile = pd.read_excel(excel_file, encoding='utf-8')
compList = ['Avoid contact with light-coloured fabrics, leather and upholstery. Colour may transfer due to the nature of indigo-dyed denim.', 'Do not soak']
newComp = []
def myFunction():
    global newComp
    for i in range(len(compList)):
        newComp.append(translationFile.loc[translationFile['English'] == compList[i]].to_string(index=False, header=False))
    newComp = [i.replace(' ', ';') for i in newComp]
myFunction()
strippedComp = [item.lstrip() for item in newComp]
print(strippedComp)
This outputs the following:
['Avoid contact with light-coloured fabrics, lea...;�viter le contact avec les tissus clairs, le c...;Kontakt mit hellen Stoffen, Leder und Polsterm...;Vermijd contact met lichtgekleurde stoffen, le...;Evitar el contacto con tejidos de colores clar...;Evitare il contatto con capi dai colori delica...', 'Do not soak;Ne pas laisser tremper;Nicht einweichen;Niet weken;No dejar en remojo;Non lasciare in ammollo']
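The issue lies in calling to_string on a DataFrame: it formats the frame for display and truncates long cells (by default to 50 characters, the display.max_colwidth option), which is where the '...' comes from. Instead, first extract the values into an array (df_sub.iloc[0].values), and then join the elements of that array (';'.join(...)).
This should do the trick: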
def myFunction():
    global newComp
    for i in range(len(compList)):
        df_sub = translationFile.loc[translationFile['English'] == compList[i]]
        if df_sub.shape[0] > 0:
            newComp.append(';'.join(df_sub.iloc[0].values))
EDIT: suggested code improvements
In addition, (in my opinion) your code could be improved as follows (using pandas functionality instead of looping, adhering to PEP 8 naming conventions, and avoiding global variables):
import pandas as pd
# read_excel needs no encoding argument: .xlsx files declare their own encoding
df_translations = pd.read_excel('./Translation_Table_Edit.xlsx')
to_translate = ['Avoid contact with light-coloured fabrics, leather and upholstery. Colour may transfer due to the nature of indigo-dyed denim.',
                'Do not soak']
def get_translations(df_translations, to_translate, language='English'):
    """Looks up translations for all items in to_translate.
    Returns a list of semicolon-separated translations (empty if none are found)."""
    df_sub = df_translations[df_translations[language].isin(to_translate)].copy() # filter translations
    df_sub = df_sub.apply(lambda x: x.str.strip()) # strip each cell
    # format and combine translations into a list
    ret = []
    for translation in df_sub.values:
        ret.append(';'.join(translation))
    return ret
translations = get_translations(df_translations, to_translate)

How to understand regular expression with python?

I'm new to Python. Could anybody help me create a regular expression for a string like this:
test_string = """pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi mi DP1CSS
madre madre NCFS000"""
How can I return a tuple like this?
> ([madre, NCFS00],[antigua, AQ0FS0])
I would like to return each word with its associated tag, given test_string. This is what I've done:
# -*- coding: utf-8 -*-
import re
# Named test_string rather than str, which would shadow the built-in type
test_string = """pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi mi DP1CSS
madre madre NCFS000"""
tupla1 = re.findall(r'(\w+)\s\w+\s(AQ0FS0)', test_string)
print(tupla1)
tupla2 = re.findall(r'(\w+)\s\w+\s(NCFS00)', test_string)
print(tupla2)
The output is the following:
[('antigua', 'AQ0FS0')] [('madre', 'NCFS00')]
The problem with this output is that, for a given test_string, I need to preserve the order of occurrence of the tags: I should only print a tuple if the tags appear in this order: AQ0FS0, then NCFS000 (in other words, feminine adjective followed by feminine noun).
^([a-zA-Z]+)\s+[a-zA-Z]+\s+([\w]+(?=\d$)\d)
I don't really know the basis for this selection, but you can get it like this: just grab the captures. Don't forget to set the g and m flags (in Python, re.findall with re.MULTILINE). See the demo:
http://regex101.com/r/nA6hN9/38
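In Python, the same idea might look like the sketch below. It assumes Python 3 and applies the pattern above with re.MULTILINE (the equivalent of the m flag; re.findall already returns every match, so no g flag is needed), then checks that the tags occur in the requested order. The names pairs, tags and wanted are illustrative only.

```python
import re

test_string = """pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi mi DP1CSS
madre madre NCFS000"""

# Capture the word (group 1) and its tag (group 2) when the tag ends in a digit.
pattern = re.compile(r"^([a-zA-Z]+)\s+[a-zA-Z]+\s+([\w]+(?=\d$)\d)", re.MULTILINE)

# findall returns the (word, tag) pairs in order of occurrence.
pairs = pattern.findall(test_string)
print(pairs)

# Only accept the result when the tags appear in the wanted order.
wanted = ["AQ0FS0", "NCFS000"]
tags = [tag for _, tag in pairs]
if tags == wanted:
    print(tuple(pairs))
```

Note that [a-zA-Z]+ does not match accented words such as "según", so that line is skipped; replacing it with \w+ would include it.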

How to correctly parse a tiled map in a text file?

I'm making an RPG with Python and pygame for a school project. To create the few maps, I chose the tile-mapping technique I have seen in some tutorials, using a *.txt file.
However, I have to cut some sprites (trees, houses, ...) into several pieces. The problem is that I'm running out of characters to represent them all!
I also remember that it's possible to treat several characters as one (e.g. read "100" as one token and not as one "1" and two "0"s) and/or to put spaces between tokens in the file (e.g. "170 0 2 5 12 48", which is read as six sprites).
But I really don't know how to adapt my program to work this way. I'm pretty sure I need to change how the file is read, but how?
Here's the reading function:
class Niveau:
    def __init__(self, fichier):
        self.fichier = fichier
        self.structure = 0

    def generer(self):
        """Generate the level from the file.
        Builds a top-level list containing one list per line to display."""
        # Open the file
        with open(self.fichier, "r") as fichier:
            structure_niveau = []
            # Go through the lines of the file
            for ligne in fichier:
                ligne_niveau = []
                # Go through the sprites (letters) in the line
                for sprite in ligne:
                    # Ignore the end-of-line "\n"
                    if sprite != '\n':
                        # Add the sprite to the list for this line
                        ligne_niveau.append(sprite)
                # Add the line to the level list
                structure_niveau.append(ligne_niveau)
            # Save this structure
            self.structure = structure_niveau

    def afficher(self, fenetre):
        """Display the level based on
        the structure list built by generer()"""
        # Load the images (only the arrival one contains transparency)
        Rocher = pygame.image.load(image_Rocher).convert()
        Buisson = pygame.image.load(image_Buisson).convert()
        # Go through the level list
        num_ligne = 0
        for ligne in self.structure:
            # Go through the list for each line
            num_case = 0
            for sprite in ligne:
                # Compute the actual position in pixels
                x = (num_case+0.5) * taille_sprite
                y = (num_ligne+1) * taille_sprite
                if sprite == 'R': # R = Rocher (rock)
                    fenetre.blit(Rocher, (x,y))
                if sprite == 'B': # B = Buisson (bush)
                    fenetre.blit(Buisson,(x,y))
                num_case += 1
            num_ligne += 1
I think what you want is str.split():
for ligne in fichier:
    ligne_niveau = []
    # Go through the sprites (whitespace-separated tokens) in the line
    for sprite in ligne.split(): # note split here
        ligne_niveau.append(sprite) # no need to check for line end
    # Add the line to the level list
    structure_niveau.append(ligne_niveau)
split without any arguments treats any run of consecutive whitespace (including tabs '\t' and newlines '\n') as a single separator and ignores leading and trailing whitespace. For example:
"\ta 13 b \t22 6e\n".split() == ['a', '13', 'b', '22', '6e']
Note that the "sprites" don't have to be the same length, so there's no need for fill characters like extra 0s or *s. You can also simplify using a list comprehension:
def generer(self):
    with open(self.fichier) as fichier:
        self.structure = [ligne.split() for ligne in fichier]
Alternatively, consider using a comma-separated value format - Python has a whole module (csv) for that:
a,13,b,22,6e
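A minimal sketch of the csv approach follows. It uses an in-memory string in place of the map file so that it runs on its own; with a real file, the handle from open(self.fichier, newline='') would be passed to csv.reader instead. The second sample row is made up for illustration.

```python
import csv
import io

# Stand-in for the level file: one line per map row, sprites separated by commas
fichier = io.StringIO("a,13,b,22,6e\nR,B,170,0,2\n")

# csv.reader yields each row as a list of strings
structure = [ligne for ligne in csv.reader(fichier)]
print(structure)
```

As with split(), every sprite arrives as a string, so the comparisons in afficher (sprite == 'R', sprite == '170', ...) keep working unchanged.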

Cleaning text files with regex of python

I have a huge file with lines like this one:
"En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
How can I remove these sinographic characters from the lines of the file, so that I get a new file whose lines contain Roman alphabet characters only?
I was thinking of using regular expressions.
Is there a character class for all Roman alphabet characters, e.g. Arabic numerals, a-nA-N and other (punctuation)?
I find this regex cheat sheet very handy for situations like these.
# -*- coding: utf-8
import re
import string
u = u"En.!?+ 123 g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
p = re.compile(r"[^\w\s\d{}]".format(re.escape(string.punctuation)))
for m in p.finditer(u):
    print m.group()
>>> 茅
>>> 茅
>>> 猫
>>> 猫
I'm also a huge fan of the unidecode module.
from unidecode import unidecode
u = u"En.!?+ 123 g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
print unidecode(u)
>>> En.!?+ 123 gMao nMao ral un trMao s bon hotel La terrasse du bar prMao s du lobby
You can use the string module.
>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.digits
'0123456789'
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>>
It seems the characters you want to remove are Chinese. If all your strings are unicode, you can use the simple range [\u4e00-\u9fa5] to remove them. This does not cover every Chinese character, but it is enough here.
>>> s = u"En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
>>> s
u'En g\u8305n\u8305ral un tr\u732bs bon hotel La terrasse du bar pr\u732bs du lobby'
>>> import re
>>> re.sub(ur'[\u4e00-\u9fa5]', '', s)
u'En gnral un trs bon hotel La terrasse du bar prs du lobby'
>>>
You can do it without regexes.
To keep only ascii characters:
# -*- coding: utf-8 -*-
import unicodedata
unistr = u"En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
unistr = unicodedata.normalize('NFD', unistr) # to preserve `e` in `é`
ascii_bytes = unistr.encode('ascii', 'ignore')
To remove everything except ascii letters, numbers, punctuation:
from string import ascii_letters, digits, punctuation, whitespace
to_keep = set(map(ord, ascii_letters + digits + punctuation + whitespace))
all_bytes = range(0x100)
to_remove = bytearray(b for b in all_bytes if b not in to_keep)
text = ascii_bytes.translate(None, to_remove).decode()
# -> En gnral un trs bon hotel La terrasse du bar prs du lobby
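On Python 3, where every str is Unicode and the ur prefix no longer exists, the range-based removal with [\u4e00-\u9fa5] suggested earlier could be sketched as follows. The second half rests on an assumption that goes beyond the answers here: the stray characters look like UTF-8 text that was decoded as GBK, and if that is the case the accents can be recovered instead of deleted.

```python
import re

s = "En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"

# Drop every character in the common CJK range
cleaned = re.sub(r"[\u4e00-\u9fa5]", "", s)
print(cleaned)

# Assumption: the text is UTF-8 that was wrongly decoded as GBK.
# Re-encoding as GBK and decoding as UTF-8 would then restore the accents.
repaired = s.encode("gbk").decode("utf-8")
print(repaired)
```

If the assumption does not hold for some lines, the encode or decode step raises a UnicodeError, so on a real file it would be worth wrapping it in try/except and falling back to the removal approach.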
