Date formatting in Python

I want to know if it is possible to convert these dates (they are in pt-BR) to numerical form, in either EN or pt-BR format.
Início Fim
0 15 de março de 1985 14 de fevereiro de 1986
1 14 de fevereiro de 1986 23 de novembro de 1987
2 23 de novembro de 1987 15 de janeiro de 1989
3 16 de janeiro de 1989 14 de março de 1990
4 15 de março de 1990 23 de janeiro de 1992

We can set the LC_TIME locale to pt_PT (the exact locale name varies by OS, e.g. 'pt_PT.UTF-8' on many Linux systems); to_datetime will then parse the month names with a format string:
import locale
import pandas as pd

locale.setlocale(locale.LC_TIME, 'pt_PT')

df = pd.DataFrame({
    'Início': ['15 de março de 1985', '14 de fevereiro de 1986',
               '23 de novembro de 1987', '16 de janeiro de 1989',
               '15 de março de 1990'],
    'Fim': ['14 de fevereiro de 1986', '23 de novembro de 1987',
            '15 de janeiro de 1989', '14 de março de 1990',
            '23 de janeiro de 1992']
})
cols = ['Início', 'Fim']
df[cols] = df[cols].apply(pd.to_datetime, format='%d de %B de %Y')
df:
Início Fim
0 1985-03-15 1986-02-14
1 1986-02-14 1987-11-23
2 1987-11-23 1989-01-15
3 1989-01-16 1990-03-14
4 1990-03-15 1992-01-23

1. Split the string inside each cell by " de ".
2. Replace the 2nd element with the corresponding number (I suggest using a dictionary for this).
3. Join the list back into a string. I suggest using str.join, but string concatenation or formatted strings work too.
Let's use an example.
date = "23 de novembro de 1987"
dates = date.split(" de ") # ['23', 'novembro', '1987']
dates[1] = "11" # ['23', '11', '1987']
numerical_date = '/'.join(dates) # "23/11/1987"
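Putting the three steps together, with a month mapping supplied here for illustration (the original only suggests using a dictionary):

```python
# Portuguese month names mapped to their two-digit numbers
MONTHS = {
    'janeiro': '01', 'fevereiro': '02', 'março': '03', 'abril': '04',
    'maio': '05', 'junho': '06', 'julho': '07', 'agosto': '08',
    'setembro': '09', 'outubro': '10', 'novembro': '11', 'dezembro': '12',
}

def to_numeric(date):
    # '23 de novembro de 1987' -> ['23', 'novembro', '1987']
    day, month, year = date.split(' de ')
    return '/'.join([day, MONTHS[month], year])

print(to_numeric('23 de novembro de 1987'))  # 23/11/1987
```

This avoids touching the process-wide locale, which is a side effect the setlocale approach carries.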


Add a substring at a specified position within a string when a pattern matches but that substring is absent

import re, datetime

# Pick one example at a time; each assignment overwrites the previous one
input_text = 'del dia 10 a las 10:00 am hasta el 15 de noviembre de 2020'  # example 1
input_text = 'de el 10 hasta el 15 a las 20:00 pm de noviembre del año 2020'  # example 2
input_text = 'desde el 10 hasta el 15 de noviembre del año 2020'  # example 3
input_text = 'del 10 a las 10:00 am hasta el 15 a las 20:00 pm de noviembre de 2020'  # example 4

identificate_day_or_month = r"\b(\d{1,2})\b"
identificate_hours = r"[\s|]*(\d{1,2}):(\d{1,2})[\s|]*(?:a.m.|a.m|am|p.m.|p.m|pm)[\s|]*"
months = r"(?:enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre|este mes|mes que viene|siguiente mes|mes siguiente|mes pasado|pasado mes|anterior año|mes anterior)"
identificate_years = r"(?:del[\s|]*del[\s|]*año|de[\s|]*el[\s|]*año|del[\s|]*del[\s|]*ano|de[\s|]*el[\s|]*ano|del|de)[\s|]*(?:el|)[\s|]*(?:este[\s|]*año[\s|]*\d*|este[\s|]*año|año[\s|]*que[\s|]*viene|siguiente[\s|]*año|año[\s|]*siguiente|año[\s|]*pasado|pasado[\s|]*año|anterior[\s|]*año|año[\s|]*anterior|este[\s|]*ano[\s|]*\d*|este[\s|]*ano|ano[\s|]*que[\s|]*viene|siguiente[\s|]*ano|ano[\s|]*siguiente|ano[\s|]*pasado|pasado[\s|]*ano|anterior[\s|]*ano|ano[\s|]*anterior|este[\s|]*\d*|año \d*|ano \d*|el \d*|\d*)"

# Identification pattern: the character sequence I am trying to match
identification_re_0 = r"(?:(?<=\s)|^)(?:desde[\s|]*el|desde|del|de[\s|]*el|de )[\s|]*(?:día|dia|)[\s|]*" + identificate_day_or_month + identificate_hours + r"[\s|]*(?:hasta|al|a )[\s|]*(?:el|)[\s|]*" + identificate_day_or_month + identificate_hours + r"[\s|]*(?:del|de[\s|]*el|de)[\s|]*(?:mes|)[\s|]*(?:de|)[\s|]*(?:" + identificate_day_or_month + r"|" + months + r"|este mes|mes[\s|]*que[\s|]*viene|siguiente[\s|]*mes|mes[\s|]*siguiente|mes[\s|]*pasado|pasado[\s|]*mes|anterior[\s|]*mes|mes[\s|]*anterior)[\s|]*" + r"(?:" + identificate_years + r"|" + r")"

# Replace matches in the input string, building in corrections where necessary
input_text = re.sub(identification_re_0,
                    lambda m: ...,  # replacement function still to be written; this is the question
                    input_text,
                    flags=re.IGNORECASE)  # must be passed as flags=; the positional slot is count
print(repr(input_text))  # --> output
What I am trying to achieve: if the pattern identification_re_0 matches but is incomplete, that is, without the times indicated, then the missing times should be completed with "a las 00:00 am", which represents the beginning of the indicated day.
Within the same input string there may be more than one occurrence of this pattern that needs this treatment, so the number of replacements in re.sub() has not been limited. I have also added the re.IGNORECASE flag, since capitalization should not matter when recognizing times within a text.
The correct output in each of the cases should be:
'del dia 10 a las 10:00 am hasta el 15 a las 00:00 am de noviembre de 2020' #for the example 1
'de el 10 a las 00:00 am hasta el 15 a las 20:00 pm de noviembre del año 2020' #for the example 2
'desde el 10 a las 00:00 am hasta el 15 a las 00:00 am de noviembre del año 2020' #for the example 3
'del 10 a las 10:00 am hasta el 15 a las 20:00 pm de noviembre de 2020' #for the example 4, NOT modify
In example 1 , "a las 00:00 am" has been added to the first date (reading from left to right).
In example 2 , "a las 00:00 am" has been added to the second date.
And in example 3, "a las 00:00 am" has been added to both dates that make up the time interval.
Note that in example 4 it was not necessary to add anything, since the times associated with the dates are already indicated (following the model pattern).
You can capture the parts of the string that have to be replaced, then replace them in the original text.
In the regex, (?!a\slas) asserts that the following words are not "a las".
Sample code:
import re

def replacer(string, capture_data, replaced_data):
    for i in range(len(capture_data)):
        string = string.replace(capture_data[i], replaced_data[i])
    return string

text = 'del dia 10 a las 10:00 am hasta el 15 de noviembre de 2020'  # example 1
text1 = 'de el 10 hasta el 15 a las 20:00 pm de noviembre del año 2020'  # example 2
text2 = 'desde el 10 hasta el 15 de noviembre del año 2020'  # example 3
text3 = 'del 10 a las 10:00 am hasta el 15 a las 20:00 pm de noviembre de 2020'  # example 4

re_exp = r'[A-Za-z]+\s\d{2}\s(?!a\slas)'
capture_data = re.findall(re_exp, text3)
replaced_data = [i + "a las 00:00 am " for i in capture_data]
print(replacer(text3, capture_data, replaced_data))
>>> del 10 a las 10:00 am hasta el 15 a las 20:00 pm de noviembre de 2020
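The same idea can be expressed directly with re.sub, reusing the answer's regex (a sketch; it assumes the day numbers are always two digits, as in the examples):

```python
import re

def add_missing_times(text):
    # "<word> <two-digit day> " not already followed by "a las"
    # gets the default midnight time appended after it.
    pattern = r'[A-Za-z]+\s\d{2}\s(?!a\slas)'
    return re.sub(pattern, lambda m: m.group(0) + 'a las 00:00 am ', text)

print(add_missing_times('desde el 10 hasta el 15 de noviembre del año 2020'))
# desde el 10 a las 00:00 am hasta el 15 a las 00:00 am de noviembre del año 2020
```

Unlike the str.replace loop, re.sub performs each insertion exactly where the match occurred, so repeated fragments cannot be over-replaced.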

Add a comma after two words in pandas

I have the following texts in a df column:
La Palma
La Palma Nueva
La Palma, Nueva Concepcion
El Estor
El Estor Nuevo
Nuevo Leon
San Jose
La Paz Colombia
Mexico Distrito Federal
El Estor, Nuevo Lugar
What I need is to add a comma at the end of each row, but only when the row contains exactly two words. I found a partial solution on Stack Overflow, but it appends the comma unconditionally:
df['Column3'] = df['Column3'].apply(lambda x: str(x) + ',')
Given:
words
0 La Palma
1 La Palma Nueva
2 La Palma, Nueva Concepcion
3 El Estor
4 El Estor Nuevo
5 Nuevo Leon
6 San Jose
7 La Paz Colombia
8 Mexico Distrito Federal
9 El Estor, Nuevo Lugar
Doing:
df.words = df.words.apply(lambda x: x+',' if len(x.split(' ')) == 2 else x)
print(df)
Outputs:
words
0 La Palma,
1 La Palma Nueva
2 La Palma, Nueva Concepcion
3 El Estor,
4 El Estor Nuevo
5 Nuevo Leon,
6 San Jose,
7 La Paz Colombia
8 Mexico Distrito Federal
9 El Estor, Nuevo Lugar
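The same condition can be written without apply, using pandas string methods and a boolean mask (a vectorized sketch of the lambda above):

```python
import pandas as pd

df = pd.DataFrame({'words': ['La Palma', 'La Palma Nueva',
                             'La Palma, Nueva Concepcion', 'San Jose']})

# Rows whose whitespace-split token count is exactly 2 get a trailing comma
mask = df['words'].str.split().str.len() == 2
df.loc[mask, 'words'] = df.loc[mask, 'words'] + ','

print(df['words'].tolist())
# ['La Palma,', 'La Palma Nueva', 'La Palma, Nueva Concepcion', 'San Jose,']
```

Note that str.split() counts tokens, so "La Palma, Nueva Concepcion" has four and is left untouched, matching the output above.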

In pandas how to create a dataframe from a list of dictionaries?

In python3 and pandas I have a list of dictionaries in this format:
a = [{'texto27/2': 'SENADO: PLS 00143/2016, de autoria de Telmário Mota, fala sobre maternidade e sofreu alterações em sua tramitação. Tramitação: Comissão de Assuntos Sociais. Situação: PRONTA PARA A PAUTA NA COMISSÃO. http://legis.senado.leg.br/sdleg-getter/documento?dm=2914881'}, {'texto27/3': 'SENADO: PEC 00176/2019, de autoria de Randolfe Rodrigues, fala sobre maternidade e sofreu alterações em sua tramitação. Tramitação: Comissão de Constituição, Justiça e Cidadania. Situação: PRONTA PARA A PAUTA NA COMISSÃO. http://legis.senado.leg.br/sdleg-getter/documento?dm=8027142'}, {'texto6/4': 'SENADO: PL 05643/2019, de autoria de Câmara dos Deputados, fala sobre violência sexual e sofreu alterações em sua tramitação. Tramitação: Comissão de Direitos Humanos e Legislação Participativa. Situação: MATÉRIA COM A RELATORIA. http://legis.senado.leg.br/sdleg-getter/documento?dm=8015569'}]
I tried to transform it into a dataframe with these commands:
import pandas as pd
df_lista_sentencas = pd.DataFrame(a)
df_lista_sentencas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
texto27/2 1 non-null object
texto27/3 1 non-null object
texto6/4 1 non-null object
dtypes: object(3)
memory usage: 100.0+ bytes
But the generated dataframe has blank lines:
df_lista_sentencas.reset_index()
index texto27/2 texto27/3 texto6/4
0 0 SENADO: PLS 00143/2016, de autoria de Telmário... NaN NaN
1 1 NaN SENADO: PEC 00176/2019, de autoria de Randolfe... NaN
2 2 NaN NaN SENADO: PL 05643/2019, de autoria de Câmara do...
I would like to generate something like this:
texto27/2 texto27/3 texto6/4
SENADO: PLS 00143/2016, de autoria de Telmário... SENADO: PEC 00176/2019, de autoria de Randolfe.. SENADO: PL 05643/2019, de autoria de Câmara do...
Please, does anyone know how I can create a dataframe without blank lines?
One option is bfill, which back-fills the NaNs so the first row holds all three values; then keep only that row:
df = df_lista_sentencas.bfill().iloc[[0]]
print(df)
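Another option is to merge the dictionaries into one before building the frame, since each dictionary contributes a distinct column (a sketch with the long strings shortened for readability):

```python
import pandas as pd

a = [{'texto27/2': 'SENADO: PLS 00143/2016 ...'},
     {'texto27/3': 'SENADO: PEC 00176/2019 ...'},
     {'texto6/4': 'SENADO: PL 05643/2019 ...'}]

# Flatten the list of one-key dicts into a single dict, then build one row
merged = {k: v for d in a for k, v in d.items()}
df = pd.DataFrame([merged])

print(df.shape)  # (1, 3)
```

This never creates the NaN cells in the first place, so no filling step is needed.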

Removing entire row if there are repetitive values in specific columns

I have read a CSV file (containing names and addresses of customers) and assigned the data to a DataFrame.
Description of the CSV file (and of the DataFrame):
the DataFrame contains several rows and 7 columns.
Database example
Client_id Client_Name Address1 Address3 Post_Code City_Name Full_Address
C0000001 A 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000001 A 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000001 A 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000002 B 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
C0000002 B 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
C0000002 B 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
C0000003 C 11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ
C0000003 C 11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ
C0000003 C 11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ
C0000004 D 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000005 E 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
So far I have written this code to generate the aforementioned table:
import pandas as pd
import glob

Excel_file = 'Address.xlsx'
Address_Info = pd.read_excel(Excel_file)

# Rename the columns
Address_Info.columns = ['Client_ID', 'Client_Name', 'Address_ID', 'Street_Name',
                        'Post_Code', 'City_Name', 'Country']

# Extract specific columns into a new dataframe
Bin_Address = Address_Info[['Client_Name', 'Address_ID', 'Street_Name',
                            'Post_Code', 'City_Name', 'Country']].copy()

# Clean existing whitespace from the ends of the strings
Bin_Address = Bin_Address.apply(lambda x: x.str.strip(), axis=1)

# Add a new column (Full_Address) that concatenates the address columns into one,
# e.g. Karlaplan 13,115 20,STOCKHOLM,Stockholms län, Sweden
Bin_Address['Full_Address'] = Bin_Address[Bin_Address.columns[1:]].apply(
    lambda x: ','.join(x.dropna().astype(str)), axis=1)
Bin_Address['latitude'] = 'None'
Bin_Address['longitude'] = 'None'

# Remove repetitive addresses
#Temp = list(dict.fromkeys(Bin_Address.Full_Address))

# Remove repetitive values (I believe the modification should be here)
Temp = list(dict.fromkeys(Address_Info.Client_ID))
I am looking to remove the entire row when the values in the Client_id, Client_Name and Full_Address columns repeat. So far the code doesn't show any error, but I haven't got the expected output (I believe the modification belongs in the last line of the attached code).
The expected output is
Client_id Client_Name Address1 Address3 Post_Code City_Name Full_Address
C0000001 A 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000002 B 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
C0000003 C 11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ
C0000004 D 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000005 E 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
You can use the built-in pandas method drop_duplicates(). It also offers plenty of options out of the box:
<your_dataframe>.drop_duplicates(subset=["Client_id", "Client_Name", "Full_Address"])
For duplicates, you can also choose whether to keep the first or the last occurrence:
<your_dataframe>.drop_duplicates(subset=["Client_id", "Client_Name", "Full_Address"], keep="first")  # "first" or "last"
By default it keeps the first occurrence.
Try:
df = df.drop_duplicates(['Client_id', 'Client_Name', 'Full_Address'])
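On a toy frame (an illustrative subset of the table above) the effect looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    'Client_id':    ['C0000001', 'C0000001', 'C0000002'],
    'Client_Name':  ['A', 'A', 'B'],
    'Full_Address': ['37 RUE DE LA GARE,L-7535, MERSCH',
                     '37 RUE DE LA GARE,L-7535, MERSCH',
                     'RUE EDWARD STEICHEN,L-1855,LUXEMBOURG'],
})

# Rows identical across all three subset columns collapse to one
deduped = df.drop_duplicates(subset=['Client_id', 'Client_Name', 'Full_Address'])
print(len(deduped))  # 2
```
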

csv.reader fooled by line terminators inside data fields

I have a CSV file downloaded from the internet that I need to parse. Python with csv.reader seems to be the tool of choice; however, my input has line terminators (both \r and \n) inside some data fields, which produces incomplete rows. The field data are surrounded by quotation marks, so the issue ought to be avoidable - but how?
I tried with and without the dialect='excel', no difference. I know I need to apply iconv to my input data, too.
import csv
with open(INFILNAM, 'rb') as csvfile:
    infil = csv.reader(csvfile, dialect='excel', delimiter=';', quotechar='"')
    for row in infil:
        print ', '.join(row)
Sample of the output, with "aaa " preceding each line for clarity:
aaa , LF0121, La Trancli�re, Base ULM Acc�s priv�, 9/03/2011, 20/02/2012, 20/02/2012, 46.095556, 5.286667, N 46 05 44, E 005 17 12, 820 ft, Tour de piste imp�ratif du c�t� autoroute. Ne pas voler au dessus du village de la Trancli�re. Presque toujours d�gag�., 1, herbe, '36, 40, 640, '18-36, , , , , , 123.5, , , , roger.emeriat#wanadoo.fr, +33 4 74 46 84 34, Village le plus proche : essence , hotels, etc = Pont D'ain � 4, 5 km. En cas de brume : altiport de corlier a environ 15 km
aaa
aaa Infos suppl�mentaires : Laurent Pradel St� Vectral. repr�sentant appareil savannah dans la r�gion. Possibilit� essai de l'appareil en vol. T�l : 04 74 35 60 00 email vectral#wanadoo.fr,
aaa , LF0125, Lavours, Base ULM Autorisation OBLIGATOIRE , 8/03/2011, 24/06/2015, 25/06/2015, 45.795, 5.77361099999996, N 45 47 42, E 005 46 25, 768 ft, TDP � l'est
aaa Eviter les villages en rouge sur la photo, faire la vent-arri�re sur le Rh�ne.
aaa attention aux rouleaux par vent de travers, 1, herbe, '01, 20, 450, '01-19, herbe, , , , 'Inconnue, 123.5, , , , , +33 4 79 42 11 57, attention, base Hydro ULM club de Lavours � proximit� ,
aaa , LF0151, Corbonod Seyssel, A�rodrome Priv� Avec Restrictions Autorisation OBLIGATOIRE , 6/09/2011, 10/01/2015, 11/01/2015, 45.960629840733, 5.817818641662598, N 45 57 38, E 005 49 04, 1175 ft, Arriv�e dans axe de la piste puis, du centre vent arri�re main gauche
aaa Suivre le plan imperatif (photo jointe), 1, dur, '01, 15, 400, '01-19, herbe, , , , 'Inconnue, 123.5, , 6, Restauration � proximit�, dudunoyer#yahoo.fr, +33 6 07 38 20 15, PPR obligatoire pour tous,ULM et avion (arr�t� pr�fectoral). Contacter le Pdt de l'AAC gestionnaire.
The file is being parsed fine; the newlines inside the fields are simply being printed by your print statement.
Replacing print ', '.join(row) with print row gives the following output:
['Obsol\xe8te', 'Code terrain', 'Toponyme', 'Type', 'Date creation', 'Derni\xe8re modification', 'Date validation', 'Position', 'Latitude', 'Longitude', 'Altitude', 'Consignes', 'Nombre de pistes', 'Nature premi\xe8re piste', 'Axe pr\xe9f\xe9rentiel premi\xe8re piste', 'Largeur premi\xe8re piste', 'Longueur premi\xe8re piste', 'Orientation premi\xe8re piste', 'Nature deuxi\xe8me piste', 'Axe pr\xe9f\xe9rentiel deuxi\xe8me piste', 'Largeur deuxi\xe8me piste', 'Longueur deuxi\xe8me piste', 'Orientation deuxi\xe8me piste', 'Radio', 'Carburant', 'Facilit\xe9s', 'Facilit\xe9s en clair', 'Email de contact', 'T\xe9l\xe9phone', 'Informations compl\xe9mentaires']
['', 'LF0121', 'La Trancli\xe8re', 'Base ULM Acc\xe8s priv\xe9', '8/03/2011', '6/09/2015', '12/09/2015', '46.095556, 5.286667', 'N 46 05 44', 'E 005 17 12', '820 ft', 'Tour de piste imp\xe9ratif du c\xf4t\xe9 autoroute. Ne pas voler au dessus du village de la Trancli\xe8re. Presque toujours d\xe9gag\xe9.', '1', 'herbe', "'36", '40', '640', "'18-36", 'herbe', '', '', '', "'Inconnue", '123.5', '', '', '', 'roger.emeriat#wanadoo.fr', '+33 4 74 46 84 34', "Village le plus proche : essence , hotels, etc = Pont D'ain \xe0 4, 5 km. En cas de brume : altiport de Corlier a environ 15 km\r\r\r\n\r\nInfos suppl\xe9mentaires : Laurent Pradel St\xe9 Vectral. repr\xe9sentant appareil savannah dans la r\xe9gion. Possibilit\xe9 essai de l'appareil en vol. T\xe9l : 04 74 35 60 00 email vectral#wanadoo.fr", '']
['', 'LF0125', 'Lavours', 'Base ULM Autorisation OBLIGATOIRE ', '8/03/2011', '24/06/2015', '25/06/2015', '45.795, 5.77361099999996', 'N 45 47 42', 'E 005 46 25', '768 ft', "TDP \xe0 l'est\r\r\nEviter les villages en rouge sur la photo, faire la vent-arri\xe8re sur le Rh\xf4ne.\r\r\nattention aux rouleaux par vent de travers", '1', 'herbe', "'01", '20', '450', "'01-19", 'herbe', '', '', '', "'Inconnue", '123.5', '', '', '', '', '+33 4 79 42 11 57', 'attention, base Hydro ULM club de Lavours \xe0 proximit\xe9 ', '']
['', 'LF0151', 'Corbonod Seyssel', 'A\xe9rodrome Priv\xe9 Avec Restrictions Autorisation OBLIGATOIRE ', '6/09/2011', '10/01/2015', '11/01/2015', '45.960629840733, 5.817818641662598', 'N 45 57 38', 'E 005 49 04', '1175 ft', 'Arriv\xe9e dans axe de la piste puis, du centre vent arri\xe8re main gauche \r\r\nSuivre le plan imperatif (photo jointe)', '1', 'dur', "'01", '15', '400', "'01-19", 'herbe', '', '', '', "'Inconnue", '123.5', '', '6', 'Restauration \xe0 proximit\xe9', 'dudunoyer#yahoo.fr', '+33 6 07 38 20 15', "PPR obligatoire pour tous,ULM et avion (arr\xeat\xe9 pr\xe9fectoral). Contacter le Pdt de l'AAC gestionnaire.\r\r\nRadio obligatoire pour tous + qualif montagne souhait\xe9e pour avion. Interdit ULM classe 1\r\r\nSurvol interdit \xe0 l'Est du Rh\xf4ne entre Seyssel et Culoz: Zone Biotope Natura 2000\r\r\n\xc9colage et TDP interdit\r\r\nRadio 123.5 car piste l\xe9g\xe8rement convexe.\r\r\r\n\r\nCamping 1.5 KM\r\r\nResto + h\xf4tel : 2Km", '']
As you can see, the new lines within quotes have not created new rows.
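In Python 3 the equivalent idiom is to open the file with newline='' so that the csv module, not the file object, interprets line endings; quoted fields may then contain \r and \n safely. A minimal sketch, using io.StringIO in place of a real file:

```python
import csv
import io

# A quoted field containing an embedded newline, with a \r\n row terminator
data = 'a;"line one\nline two";c\r\n'

# newline='' leaves line endings untranslated for the csv module to handle
rows = list(csv.reader(io.StringIO(data, newline=''),
                       delimiter=';', quotechar='"'))
print(rows)  # [['a', 'line one\nline two', 'c']]
```

For a file on disk, the same applies: open(path, newline='', encoding=...) with the appropriate encoding replaces both the 'rb' mode and the external iconv step.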
