Pandas Excel row extract changes fields

Pandas Excel row extract changes fields - python

I work on a small translation program. I have a .xlsx file attached with 5 columns each in different Language(English, French, German, Spanish, Italian).
The program provides a drop down list with with each row from the .xlsx being one of the available options(English Only). Selecting one of the options takes the English Value and adds it to a list.
I then use following to later extract the whole row of other languages based on the English selected and split by deliminator(;):
instructionList = ['Avoid contact with light-coloured fabrics, leather and upholstery. Colour may transfer due to the nature of indigo-dyed denim.']
for i in range(len(instructionList)):
newCompInst.append(translationFile.loc[translationFile['English'] == instructionList[i]].to_string(index=False, header=False))
newInst = [i.replace(' ', ',;') for i in newInst ]
strippedInst = [item.lstrip() for item in newInst ]
print('strippedInst: ', strippedInst)
The output I get from the following code is:
strippedInst: ['Avoid contact with light-coloured fabrics, lea...,;Bviter le contact avec les tissus clairs, le c...,;Kontakt mit hellen Stoffen, Leder und Polsterm...,;Evitar el contacto con tejidos de colores clar...,;Evitare il contatto con capi dai colori delica...']
After running this code all of the languages get cut in half and the rest of the sentence gets replaced with '...' - (NOTE the ENGLISH in the 'strippedInst' and compare with what has been inputed to the loop (instructionList).
The output gets cut only when the sentence is long. I tried running smaller phrases and it all seems to come through fine.
This is the Expected output:
strippedInst:
['
Avoid contact with light-coloured fabrics, leather and upholstery. Colour may transfer due to the nature of indigo-dyed denim.,;
Éviter le contact avec les tissus clairs, le cuir et les tissus d'ameublement. Les couleurs peuvent déteindre en raison de la nature de la teinture indigo du denim.,;
Kontakt mit hellen Stoffen, Leder und Polstermöbeln vermeiden. Aufgrund der Indigofärbung kann sich die Farbe übertragen,;
Evitar el contacto con tejidos de colores claros, con cuero y con tapicerías. El tinte índigo de los vaqueros podría transferirse a dichas superficies.,;
Evitare il contatto con capi dai colori delicati, pelli e tappezzerie. Si potrebbe verificare una perdita del colore blu intenso del tessuto di jeans.,
']
EDIT:
Here is the entire standalone working function:
import pandas as pd
excel_file = 'C:/Users/user/Desktop/Translation_Table_Edit.xlsx'
translationFile = pd.read_excel(excel_file, encoding='utf-8')
compList = ['Avoid contact with light-coloured fabrics, leather and upholstery. Colour may transfer due to the nature of indigo-dyed denim.', 'Do not soak']
newComp = []
def myFunction():
global newComp
for i in range(len(compList)):
newComp.append(translationFile.loc[translationFile['English'] == compList[i]].to_string(index=False, header=False))
newComp = [i.replace(' ', ';') for i in newComp]
myFunction()
strippedComp = [item.lstrip() for item in newComp]
print(strippedComp)
This outputs following:
['Avoid contact with light-coloured fabrics, lea...;�viter le contact avec les tissus clairs, le c...;Kontakt mit hellen Stoffen, Leder und Polsterm...;Vermijd contact met lichtgekleurde stoffen, le...;Evitar el contacto con tejidos de colores clar...;Evitare il contatto con capi dai colori delica...', 'Do not soak;Ne pas laisser tremper;Nicht einweichen;Niet weken;No dejar en remojo;Non lasciare in ammollo']

The issues lies with calling to_string on a dataframe. Instead, first extract the values into an array (df_sub.iloc[0].values), and then join the elements of that list (';'.join(...)).
This should do the trick:
def myFunction():
global newComp
for i in range(len(compList)):
df_sub = translationFile.loc[translationFile['English'] == compList[i]]
if df_sub.shape[0] > 0:
newComp.append(';'.join(df_sub.iloc[0].values))
EDIT: suggested code improvements
In addition, (in my opinion) your code could be improved by the following (using pandas functionality instead of looping, adherence to naming convention in pep8, avoiding use of global variables):
import pandas as pd
df_translations = pd.read_excel('./Translation_Table_Edit.xlsx', encoding='utf-8')
to_translate = ['Avoid contact with light-coloured fabrics, leather and upholstery. Colour may transfer due to the nature of indigo-dyed denim.',
'Do not soak']
def get_translations(df_translations, to_translate, language='English'):
"""Looks up translatios for all items in to_translate.
Returns a list with semi-colon separated translations. None if no translations found."""
df_sub = df_translations[df_translations[language].isin(to_translate)].copy() # filter translations
df_sub = df_sub.apply(lambda x: x.str.strip()) # strip each cell
# format and combine translations into a list
ret = []
for translation in df_sub.values:
ret.append(';'.join(translation))
return ret
translations = get_translations(df_translations, to_translate)

Related

Python Pandas create new columns from existing one avoiding row iteration

Heading ##I have this df['title'] column:
Apartamento en Venta
Proyecto Nuevo de Apartamentos
Proyecto Nuevo de Apartamentos
Lote en Venta
Casa Campestre en Venta
Proyecto Nuevo de Apartamentos
Based on this column I want to create three new ones:
df['property_type'] => (House, Apartment, Lot, etc)
df['property_status'] => (New, Used)
df['ofert_type'] => (Sale, Rent)
I'm achieving this through row iteration and splitting:
df['tipo_inmueble'] = ''
df['estado_inmueble'] = ''
df['tipo_oferta'] = ''
for data in range(len(df)):
if 'Proyecto Nuevo de' in df.loc[data,'title']:
df.loc[data,'property_type'] = df.loc[data,'title'].split('Proyecto Nuevo de')[1]
df.loc[data,'property_type'] = str(df.loc[data,'property_type']).split(' ')[1][:-1]
df.loc[data,'property_status'] = 'new'
df.loc[data,'ofert_type'] = 'sale'
else:
df.loc[data,'property_type'] = df.loc[data,'title'].split(' en ')[0]
df.loc[data,'property_status'] = 'used'
df.loc[data,'ofert_type'] = df.loc[data,'title'].split(' en ')[1].split(' ')[0].lower()
But it seems this approach takes too much time to process the entire data frame. I'm in search of a more "pandas" solution.
Thank you for your help

You can make a function and use the .apply function- might be faster although you are still iterating.
def property_split(row):
if row['delta_points'] == 'apartment:
return 1
else:
return 0
df['apartment'] = df.apply (lambda row: property_split(row), axis=1)

why is the second loop never executed ?

Hi, i am actually working on a python program and i need to read a csv file and use data.append(line) to fill a data Array.
I wrote this following part of the program :
print "Lecture du fichier", table1
lecfi = csv.reader(open(table1,'r'),skipinitialspace = 'true',delimiter='\t')
# delimiter = caractere utilisé pour séparer les différentes valeurs
tempSize = 0
tempLast = ""
oldSize = 0
#on initialise la taille du fichier et la derniere ligne du fichier
if os.path.exists(newFilePath):
tempSize = os.path.getsize(newFilePath)
else:
tempSize = 0
if os.path.exists(newFilePath) and tempSize != 0:
#Si le fichier tampon n'existe pas, on le créer
#Lecture du fichier tampon
lecofi = csv.reader(open(newFilePath,'r'),skipinitialspace = 'true',delimiter='\t')
csvFileArray = []
for lo in lecofi:
csvFileArray.append(lo)
tempLast = str(csvFileArray[0])
tempLast = tempLast[2:-2]
oldSize = csvFileArray[1]
print "Tempon de Last : ", tempLast
print "Taille du fichier : ", str(oldSize)
#on récupere la ligne représentant la derniere ligne de l'ancien fichier
else:
#si le fichier n'existe pas, on lui laisse cette valeur par défaut pour le traitement suivant
tempLast = None
# remplissage des données du fichier pulse dans la variable data
cpt = 0
indLast = 0
fileSize = os.path.getsize(table1)
if oldSize != fileSize:
for lecline in lecfi:
cpt = cpt + 1
last = str(lecline)
if tempLast != None and last == tempLast:
print "TEMPLAST != NONE", cpt
indLast = cpt
print "Indice de la derniere ligne : ", indLast
print last, tempLast
print "Variable indLast : ", indLast
i = 0
for co in lecfi:
print "\nCOOOOOOO : ", co
if i == indLast:
data.append(co[0])
i=i+1
for da in data:
print "\n Variable data : ", da
now look at the prints :
Lecture du fichier Data_Q1/2018-05-23/2018-5-23_13-1-35_P_HOURS_Q1
Tempon de Last : ['(2104.72652']
Taille du fichier : ['20840448']
TEMPLAST != NONE 317127
Indice de la derniere ligne : 317127
['(2104.72652'] ['(2104.72652']
Variable indLast : 317127
It seems like the program doesn't care about what's following my for loop. I assume that it can be a really basic mistake but i can't get it.
Any help ?

You are trying to iterate over the CSV twice without reseting it. this is the reason your data array is empty.
The first time you actually iterates over the file:
for lecline in lecfi:
The second time, the original iterator already reached it's end and is empty:
for co in lecfi:
As mentioned in the comments by Johnny Mopp one possible solution is using the following method:
Python csv.reader: How do I return to the top of the file?
Hope this explains your issue.

Here:
for lecline in lecfi:
cpt = cpt + 1
# ...
you are reading the whole file. After this loop, the file pointer is at the end of the file and there's nothing more to be read. Hence here:
i = 0
for co in lecfi:
# ...
this second loop is never executed, indeed. You'd need to either reset the file pointer, or close and reopen the file, or read it in a list right from the start and iterate over this list instead.
FWIW, note that opening files and not closing them is bad practice and can lead to file corruption (not that much in your case since you're only reading but...). A proper implementation would look like:
with open(table1) as tablefile:
lecfi = csv.reader(tablefile, ....)
for lecline in lecfi:
# ....
tablefile.seek(0)
for lecline in lecfi:
# ....
Also, this:
lecofi = csv.reader(open(newFilePath,'r'),skipinitialspace = 'true',delimiter='\t')
csvFileArray = []
for lo in lecofi:
csvFileArray.append(lo)
would be better rewritten as:
with open(newFilePath) as newFile:
lecofi = csv.reader(newFile, ...)
csvFileArray = list(lecofi)

Python: Split CSV with character count

Need help in importing CSV file into python.
My CSV file
0,Donc, 2 jours, je me suis rendu compte que Musikfest est le lendemain de voir dmb, quel problème. Signifie que je ne peux pas aller ...
0,Le son est définitivement gâché.Noooooo mon bb
0,Il est le mien! Haha il me suit: ') m'aime et me veut.haha.i wana vivre en Amérique annie
I want to split the above file into 2 columns
Coloumn1 ---- Coloumn2
0 ---- Donc, 2 jours, je me suis rendu compte que Musikfest est le
lendemain de voir dmb, quel problème. Signifie que je ne peux pas
aller ...
0 ---- Le son est définitivement gâché.Noooooo mon bb
0 ---- Il est le mien! Haha il me suit: ') m'aime et me veut.haha.i wana
vivre en Amérique annie
Since my text has commas embedded and my value for the text is always the first character. Is it possible to read my CSV file with splitting first character and rest of the text?

You can use string.split() and specify a max split of 1. By this I mean, if you just want to split the line on the first comma, then do not read the file as a CSV. Instead read it line by line and split the line using string.split(',', 1)

You should use csv library to work with csv files: https://docs.python.org/3/library/csv.html#csv.reader
import csv
result = []
with open('test.csv') as csvfile:
csvreader = csv.reader(csvfile)
for row in csvreader:
result.append((row[0], ''.join(row[1:])))
print(result)

Change unicode hardwrited in csv to corresponding character

I have a csv with 1 column having hard writed unicode character :
["Investir dans un parc d'activit\u00e9s"]
["S\u00e9curiser, restaurer et g\u00e9rer 1 372 ha de milieux naturels impact\u00e9s par la construction de l'autoroute"]
["Am\u00e9liorer la consommation \u00e9nerg\u00e9tique de b\u00e2timents publics"]
["Favoriser la recherche, am\u00e9liorer la qualit\u00e9 des traitements et assurer un \u00e9gal acc\u00e8s des soins \u00e0 tous les patients de Franche-Comt\u00e9."]
I'm trying to fix/replace them with the corresponding char, but I can't seems to make it, I tried with
df['Objectif(s)'] = df['Objectif(s)'].replace('\u00e9', 'é')
but the column don't change
Seing that the code below work, I tried to loop over the row to fix it with no success
s = "d'activit\u00e9s"
print(s) # d'activités
print(s.replace('\u00e9', 'é' )) # d'activités
for case in df['Objectif(s)']:
s = str(case)
df['Objectif(s)'][case] = s # ["Investir dans un parc d'activit\u00e9s"]

if this '\u00e9' is actually written into the file as \ u 0 0 e 9 as normal characters by the source of the data, you need to do a string replace.
the trick here is that you need to escape the \ character in the replace function first parameter
s.replace('\\u00e9', 'é' )
or use a "raw string literal" by prefixing r
s.replace(r'\u00e9', 'é' )

Try replacing
df['Objectif(s)'] = df['Objectif(s)'].replace('\u00e9', 'é')
to
df['Objectif(s)'] = df['Objectif(s)'].str.replace('\u00e9', 'é')

How to correctly parse a tiled map in a text file?

I'm making a RPG with Python and pygame for a school project. In order to create the few maps, I have chosen the Tile Mapping techniques I have seen in some tutorials, using a *.txt file.
However, I have to cut some sprites (trees, houses, ...) into several pieces. The problem is that I'm running out of characters to represent them all!
I also remember that it's possible to take several characters as one (ex : take "100" as one an not as one "1" and two "0"s) and/or to put spaces between characters in the file (e.g. "170 0 2 5 12 48" which is read as six sprites).
But I really don't know how to adapt my program to do this way. I'm pretty sure that I need to modify the way the file is read, but how?
Here's the reading function :
class Niveau:
def __init__(self, fichier):
self.fichier = fichier
self.structure = 0
def generer(self):
"""MÃ©thode permettant de gÃ©nÃ©rer le niveau en fonction du fichier.
On crÃ©e une liste gÃ©nÃ©rale, contenant une liste par ligne Ã afficher"""
#On ouvre le fichier
with open(self.fichier, "r") as fichier:
structure_niveau = []
#On parcourt les lignes du fichier
for ligne in fichier:
ligne_niveau = []
#On parcourt les sprites (lettres) contenus dans le fichier
for sprite in ligne:
#On ignore les "\n" de fin de ligne
if sprite != '\n':
#On ajoute le sprite Ã la liste de la ligne
ligne_niveau.append(sprite)
#On ajoute la ligne Ã la liste du niveau
structure_niveau.append(ligne_niveau)
#On sauvegarde cette structure
self.structure = structure_niveau
def afficher(self, fenetre):
"""MÃ©thode permettant d'afficher le niveau en fonction
de la liste de structure renvoyÃ©e par generer()"""
#Chargement des images (seule celle d'arrivÃ©e contient de la transparence)
Rocher = pygame.image.load(image_Rocher).convert()
Buisson = pygame.image.load(image_Buisson).convert()
#On parcourt la liste du niveau
num_ligne = 0
for ligne in self.structure:
#On parcourt les listes de lignes
num_case = 0
for sprite in ligne:
#On calcule la position rÃ©elle en pixels
x = (num_case+0.5) * taille_sprite
y = (num_ligne+1) * taille_sprite
if sprite == 'R': #R = Rocher
fenetre.blit(Rocher, (x,y))
if sprite == 'B':
fenetre.blit(Buisson,(x,y))
num_case += 1
num_ligne += 1

I think what you want is str.split():
for ligne in fichier:
ligne_niveau = []
#On parcourt les sprites (lettres) contenus dans le fichier
for sprite in ligne.split(): # note split here
ligne_niveau.append(sprite) # no need to check for line end
#On ajoute la ligne Ã la liste du niveau
structure_niveau.append(ligne_niveau)
split without any arguments will join all consecutive whitespace (including tabs '\t' and newlines '\n') into a single split. For example:
"\ta 13 b \t22 6e\n".split() == ['a', '13', 'b', '22', '6e']
Note that the "sprites" don't have to be the same length, so there's no need for fill characters like extra 0s or *s. You can also simplify using a list comprehension:
def generer(self):
with open(self.fichier) as fichier:
self.structure = [ligne.split() for ligne in fichier]
Alternatively, consider using a comma-separated value format - Python has a whole module (csv) for that:
a,13,b,22,6e

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas Excel row extract changes fields - python

Related

Python Pandas create new columns from existing one avoiding row iteration

why is the second loop never executed ?

Python: Split CSV with character count

Change unicode hardwrited in csv to corresponding character

How to correctly parse a tiled map in a text file?

Categories

Resources