import re, datetime
input_text = 'del dia 10 a las 10:00 am hasta el 15 de noviembre de 2020' #example 1
input_text = 'de el 10 hasta el 15 a las 20:00 pm de noviembre del año 2020' #example 2
input_text = 'desde el 10 hasta el 15 de noviembre del año 2020' #example 3
input_text = 'del 10 a las 10:00 am hasta el 15 a las 20:00 pm de noviembre de 2020' #example 4
identificate_day_or_month = r"\b(\d{1,2})\b"
identificate_hours = r"[\s|]*(\d{1,2}):(\d{1,2})[\s|]*(?:a.m.|a.m|am|p.m.|p.m|pm)[\s|]*"
months = r"(?:enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre|este mes|mes que viene|siguiente mes|mes siguiente|mes pasado|pasado mes|anterior año|mes anterior)"
identificate_years = r"(?:del[\s|]*del[\s|]*año|de[\s|]*el[\s|]*año|del[\s|]*del[\s|]*ano|de[\s|]*el[\s|]*ano|del|de)[\s|]*(?:el|)[\s|]*(?:este[\s|]*año[\s|]*\d*|este[\s|]*año|año[\s|]*que[\s|]*viene|siguiente[\s|]*año|año[\s|]*siguiente|año[\s|]*pasado|pasado[\s|]*año|anterior[\s|]*año|año[\s|]*anterior|este[\s|]*ano[\s|]*\d*|este[\s|]*ano|ano[\s|]*que[\s|]*viene|siguiente[\s|]*ano|ano[\s|]*siguiente|ano[\s|]*pasado|pasado[\s|]*ano|anterior[\s|]*ano|ano[\s|]*anterior|este[\s|]*\d*|año \d*|ano \d*|el \d*|\d*)"
#Identification pattern conformed to the sequence of characters with which I am trying to define the search pattern
identification_re_0 = r"(?:(?<=\s)|^)(?:desde[\s|]*el|desde|del|de[\s|]*el|de )[\s|]*(?:día|dia|)[\s|]*" + identificate_day_or_month + identificate_hours + r"[\s|]*(?:hasta|al|a )[\s|]*(?:el|)[\s|]*" + identificate_day_or_month + identificate_hours + r"[\s|]*(?:del|de[\s|]*el|de)[\s|]*(?:mes|)[\s|]*(?:de|)[\s|]*(?:" + identificate_day_or_month + r"|" + months + r"|este mes|mes[\s|]*que[\s|]*viene|siguiente[\s|]*mes|mes[\s|]*siguiente|mes[\s|]*pasado|pasado[\s|]*mes|anterior[\s|]*mes|mes[\s|]*anterior)[\s|]*" + r"(?:" + identificate_years + r"|" + r")"
#Replacement in the input string by a string with built-in corrections where necessary
input_text = re.sub(identification_re_0,
                    lambda m: ,  # <-- the replacement logic I am asking about
                    input_text, flags=re.IGNORECASE)  # note: the 4th positional argument of re.sub() is count, so the flag must be passed as flags=
print(repr(input_text)) # --> output
I was trying to get the following behaviour: if the pattern identification_re_0 is found incomplete, that is, without the times indicated, then the times are completed with "a las 00:00 am", which represents the beginning of the indicated day.
Within the same input string there may be more than one occurrence of this pattern to process, which is why the number of replacements in re.sub() has not been limited. I have also added the re.IGNORECASE flag, since capitalization should not matter when recognizing times within a text.
The correct output in each of the cases should be as follows:
'del dia 10 a las 10:00 am hasta el 15 a las 00:00 am de noviembre de 2020' #for the example 1
'de el 10 a las 00:00 am hasta el 15 a las 20:00 pm de noviembre del año 2020' #for the example 2
'desde el 10 a las 00:00 am hasta el 15 a las 00:00 am de noviembre del año 2020' #for the example 3
'del 10 a las 10:00 am hasta el 15 a las 20:00 pm de noviembre de 2020' #for the example 4, NOT modify
In example 1, "a las 00:00 am" has been added to the first date (reading from left to right).
In example 2, "a las 00:00 am" has been added to the second date.
And in example 3, "a las 00:00 am" has been added to both dates that make up the time interval.
Note that in example 4 it was not necessary to add anything, since the times associated with the dates are already indicated (following the model pattern).
You can capture the parts of the string that have to be replaced, and then replace them in the original text.
In regex, (?!a\slas) is a negative lookahead that matches only when the following words are not "a las".
Sample code:
import re

def replacer(string, capture_data, replaced_data):
    for i in range(len(capture_data)):
        string = string.replace(capture_data[i], replaced_data[i])
    return string
text = 'del dia 10 a las 10:00 am hasta el 15 de noviembre de 2020'
text1 = 'de el 10 hasta el 15 a las 20:00 pm de noviembre del año 2020' # example 2
text2 = 'desde el 10 hasta el 15 de noviembre del año 2020' # example 3
text3 = 'del 10 a las 10:00 am hasta el 15 a las 20:00 pm de noviembre de 2020'
re_exp = r'[A-Za-z]+\s\d{2}\s(?!a\slas)'
capture_data = re.findall(re_exp, text3)
replaced_data = [i + "a las 00:00 am " for i in capture_data]
print(replacer(text3, capture_data, replaced_data))
>>> del 10 a las 10:00 am hasta el 15 a las 20:00 pm de noviembre de 2020
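The same negative lookahead can also drive re.sub() directly, skipping the findall/replace round trip. A minimal sketch, assuming (my simplification, not the full identification_re_0 pattern) that the day number is preceded by "el" or "dia" and that a time, when present, starts with "a las":

```python
import re

# Simplified pattern (an assumption, not the full identification_re_0):
# a 1-2 digit day preceded by "el"/"dia" and NOT already followed by "a las".
pattern = r'((?:el|dia)\s\d{1,2}\s)(?!a\slas)'

text = 'desde el 10 hasta el 15 de noviembre del año 2020'  # example 3
print(re.sub(pattern, r'\1a las 00:00 am ', text))
# → 'desde el 10 a las 00:00 am hasta el 15 a las 00:00 am de noviembre del año 2020'
```

Example 4, where both times are already present, passes through unchanged, because the lookahead blocks the match after each day number.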
I have read a CSV file (that has names and addresses of customers) and assigned the data to a DataFrame.
Description of the CSV file (or the DataFrame):
The DataFrame contains several rows and 7 columns.
Database example:
Client_id Client_Name Address1 Address3 Post_Code City_Name Full_Address
C0000001 A 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000001 A 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000001 A 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000002 B 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
C0000002 B 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
C0000002 B 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
C0000003 C 11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ
C0000003 C 11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ
C0000003 C 11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ
C0000004 D 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000005 E 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
So far I have written the following code to generate the aforementioned table:
import pandas as pd
import glob
Excel_file = 'Address.xlsx'
Address_Info = pd.read_excel(Excel_file)
# rename the columns name
Address_Info.columns = ['Client_ID', 'Client_Name','Address_ID','Street_Name','Post_Code','City_Name','Country']
# extract specific columns into a new dataframe
Bin_Address= Address_Info[['Client_Name','Address_ID','Street_Name','Post_Code','City_Name','Country']].copy()
# Clean existing whitespace from the ends of the strings (column-wise, string columns only)
Bin_Address = Bin_Address.apply(lambda col: col.str.strip() if col.dtype == 'object' else col)
# Adding a new column called (Full_Address) that concatenate address columns into one
# for example Karlaplan 13,115 20,STOCKHOLM,Stockholms län, Sweden
Bin_Address['Full_Address'] = Bin_Address[Bin_Address.columns[1:]].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)
Bin_Address['Full_Address']=Bin_Address[['Full_Address']].copy()
Bin_Address['latitude'] = 'None'
Bin_Address['longitude'] = 'None'
# Remove repetitive addresses
#Temp = list( dict.fromkeys(Bin_Address.Full_Address) )
# Remove repetitive values ( I do beleive the modification should be here)
Temp = list( dict.fromkeys(Address_Info.Client_ID) )
I am looking to remove the entire row if there are repeated values in the Client_ID, Client_Name, and Full_Address columns. So far the code doesn't show any error, but at the same time I haven't got the expected output (I believe the modification should be in the last line of the attached code).
The expected output is:
Client_id Client_Name Address1 Address3 Post_Code City_Name Full_Address
C0000001 A 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000002 B 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
C0000003 C 11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ
C0000004 D 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000005 E 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
You can use the built-in pandas method drop_duplicates(). It also comes with several options you can apply out of the box.
<your_dataframe>.drop_duplicates(subset=["Client_ID", "Client_Name", "Full_Address"])
You also have an option for when a row is a duplicate: whether to keep the first or the last occurrence.
<your_dataframe>.drop_duplicates(subset=["Client_ID", "Client_Name", "Full_Address"], keep="first")  # "first" or "last"
By default it will always keep the first value.
Try:
df = df.drop_duplicates(['Client_ID', 'Client_Name', 'Full_Address'])
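As an illustration, a minimal sketch with made-up rows mirroring the question's data (column names as renamed in the question's code):

```python
import pandas as pd

# Hypothetical data: three identical rows per client, as in the question.
df = pd.DataFrame({
    'Client_ID':    ['C0000001'] * 3 + ['C0000002'] * 3,
    'Client_Name':  ['A'] * 3 + ['B'] * 3,
    'Full_Address': ['37 RUE DE LA GARE,L-7535, MERSCH'] * 3
                    + ['RUE EDWARD STEICHEN,L-1855,LUXEMBOURG'] * 3,
})

# Keep only the first occurrence of each (Client_ID, Client_Name, Full_Address)
deduped = df.drop_duplicates(subset=['Client_ID', 'Client_Name', 'Full_Address'])
print(deduped)  # two rows remain, one per client
```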
I have a CSV file downloaded from the internet that I need to parse. Python with csv.reader seems to be the tool of choice; however, my input has line terminators (both \r and \n) inside some data fields. This makes for apparently incomplete lines. The field data are surrounded by quote characters, so the issue ought to be avoidable - but how?
I tried with and without dialect='excel'; no difference. I know I need to apply iconv to my input data, too.
import csv
with open(INFILNAM, 'rb') as csvfile:
    infil = csv.reader(csvfile, dialect='excel', delimiter=';', quotechar='"')
    for row in infil:
        print ', '.join(row)
Sample of the output, with "aaa " preceding each line for clarity:
aaa , LF0121, La Trancli�re, Base ULM Acc�s priv�, 9/03/2011, 20/02/2012, 20/02/2012, 46.095556, 5.286667, N 46 05 44, E 005 17 12, 820 ft, Tour de piste imp�ratif du c�t� autoroute. Ne pas voler au dessus du village de la Trancli�re. Presque toujours d�gag�., 1, herbe, '36, 40, 640, '18-36, , , , , , 123.5, , , , roger.emeriat#wanadoo.fr, +33 4 74 46 84 34, Village le plus proche : essence , hotels, etc = Pont D'ain � 4, 5 km. En cas de brume : altiport de corlier a environ 15 km
aaa
aaa Infos suppl�mentaires : Laurent Pradel St� Vectral. repr�sentant appareil savannah dans la r�gion. Possibilit� essai de l'appareil en vol. T�l : 04 74 35 60 00 email vectral#wanadoo.fr,
aaa , LF0125, Lavours, Base ULM Autorisation OBLIGATOIRE , 8/03/2011, 24/06/2015, 25/06/2015, 45.795, 5.77361099999996, N 45 47 42, E 005 46 25, 768 ft, TDP � l'est
aaa Eviter les villages en rouge sur la photo, faire la vent-arri�re sur le Rh�ne.
aaa attention aux rouleaux par vent de travers, 1, herbe, '01, 20, 450, '01-19, herbe, , , , 'Inconnue, 123.5, , , , , +33 4 79 42 11 57, attention, base Hydro ULM club de Lavours � proximit� ,
aaa , LF0151, Corbonod Seyssel, A�rodrome Priv� Avec Restrictions Autorisation OBLIGATOIRE , 6/09/2011, 10/01/2015, 11/01/2015, 45.960629840733, 5.817818641662598, N 45 57 38, E 005 49 04, 1175 ft, Arriv�e dans axe de la piste puis, du centre vent arri�re main gauche
aaa Suivre le plan imperatif (photo jointe), 1, dur, '01, 15, 400, '01-19, herbe, , , , 'Inconnue, 123.5, , 6, Restauration � proximit�, dudunoyer#yahoo.fr, +33 6 07 38 20 15, PPR obligatoire pour tous,ULM et avion (arr�t� pr�fectoral). Contacter le Pdt de l'AAC gestionnaire.
The file is being parsed fine; the embedded newlines are simply reproduced by your print statement.
Replacing print ', '.join(row) with print row gives the following output:
['Obsol\xe8te', 'Code terrain', 'Toponyme', 'Type', 'Date creation', 'Derni\xe8re modification', 'Date validation', 'Position', 'Latitude', 'Longitude', 'Altitude', 'Consignes', 'Nombre de pistes', 'Nature premi\xe8re piste', 'Axe pr\xe9f\xe9rentiel premi\xe8re piste', 'Largeur premi\xe8re piste', 'Longueur premi\xe8re piste', 'Orientation premi\xe8re piste', 'Nature deuxi\xe8me piste', 'Axe pr\xe9f\xe9rentiel deuxi\xe8me piste', 'Largeur deuxi\xe8me piste', 'Longueur deuxi\xe8me piste', 'Orientation deuxi\xe8me piste', 'Radio', 'Carburant', 'Facilit\xe9s', 'Facilit\xe9s en clair', 'Email de contact', 'T\xe9l\xe9phone', 'Informations compl\xe9mentaires']
['', 'LF0121', 'La Trancli\xe8re', 'Base ULM Acc\xe8s priv\xe9', '8/03/2011', '6/09/2015', '12/09/2015', '46.095556, 5.286667', 'N 46 05 44', 'E 005 17 12', '820 ft', 'Tour de piste imp\xe9ratif du c\xf4t\xe9 autoroute. Ne pas voler au dessus du village de la Trancli\xe8re. Presque toujours d\xe9gag\xe9.', '1', 'herbe', "'36", '40', '640', "'18-36", 'herbe', '', '', '', "'Inconnue", '123.5', '', '', '', 'roger.emeriat#wanadoo.fr', '+33 4 74 46 84 34', "Village le plus proche : essence , hotels, etc = Pont D'ain \xe0 4, 5 km. En cas de brume : altiport de Corlier a environ 15 km\r\r\r\n\r\nInfos suppl\xe9mentaires : Laurent Pradel St\xe9 Vectral. repr\xe9sentant appareil savannah dans la r\xe9gion. Possibilit\xe9 essai de l'appareil en vol. T\xe9l : 04 74 35 60 00 email vectral#wanadoo.fr", '']
['', 'LF0125', 'Lavours', 'Base ULM Autorisation OBLIGATOIRE ', '8/03/2011', '24/06/2015', '25/06/2015', '45.795, 5.77361099999996', 'N 45 47 42', 'E 005 46 25', '768 ft', "TDP \xe0 l'est\r\r\nEviter les villages en rouge sur la photo, faire la vent-arri\xe8re sur le Rh\xf4ne.\r\r\nattention aux rouleaux par vent de travers", '1', 'herbe', "'01", '20', '450', "'01-19", 'herbe', '', '', '', "'Inconnue", '123.5', '', '', '', '', '+33 4 79 42 11 57', 'attention, base Hydro ULM club de Lavours \xe0 proximit\xe9 ', '']
['', 'LF0151', 'Corbonod Seyssel', 'A\xe9rodrome Priv\xe9 Avec Restrictions Autorisation OBLIGATOIRE ', '6/09/2011', '10/01/2015', '11/01/2015', '45.960629840733, 5.817818641662598', 'N 45 57 38', 'E 005 49 04', '1175 ft', 'Arriv\xe9e dans axe de la piste puis, du centre vent arri\xe8re main gauche \r\r\nSuivre le plan imperatif (photo jointe)', '1', 'dur', "'01", '15', '400', "'01-19", 'herbe', '', '', '', "'Inconnue", '123.5', '', '6', 'Restauration \xe0 proximit\xe9', 'dudunoyer#yahoo.fr', '+33 6 07 38 20 15', "PPR obligatoire pour tous,ULM et avion (arr\xeat\xe9 pr\xe9fectoral). Contacter le Pdt de l'AAC gestionnaire.\r\r\nRadio obligatoire pour tous + qualif montagne souhait\xe9e pour avion. Interdit ULM classe 1\r\r\nSurvol interdit \xe0 l'Est du Rh\xf4ne entre Seyssel et Culoz: Zone Biotope Natura 2000\r\r\n\xc9colage et TDP interdit\r\r\nRadio 123.5 car piste l\xe9g\xe8rement convexe.\r\r\r\n\r\nCamping 1.5 KM\r\r\nResto + h\xf4tel : 2Km", '']
As you can see, the new lines within quotes have not created new rows.
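To see that behaviour in isolation, a minimal sketch (Python 3 here, with io.StringIO standing in for the file; with a real file, open it with newline='' as the csv docs recommend):

```python
import csv
import io

# One logical record whose middle field contains an embedded \r\n.
data = 'col1;"line one\r\nline two";col3\r\n'
rows = list(csv.reader(io.StringIO(data, newline=''), delimiter=';', quotechar='"'))
print(rows)
# → [['col1', 'line one\r\nline two', 'col3']]
```

The quoted field keeps its internal line break, and the reader still yields a single row.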