Split a column in pandas twice to multiple columns - python

I have a column "Nome_propriedade" containing complete addresses: establishment name, street, neighborhood, city and state.
Each value always ends with the name of the city and the state, following this pattern:
Nome_propriedade
"Rod. BR 386, bairro Olarias/Conventos, Lajeado/RS"
"Fazenda da Várzea - zona rural, Serro/MG"
"Cidade do Rock - Jacarepaguá, Rio de Janeiro/RJ"
"Área de extração de carnaúba - Povoado Areal, zona rural, Santa Cruz do Piauí/PI"
"Pastelaria - Av. Vicente de Carvalho, 995, Loja Q, Vila da Penha, Rio de Janeiro/RJ"
I want to create two new columns, "city" and "state", and fill them with the last values found in column "Nome_propriedade". I also want to strip those values from Nome_propriedade.
Nome_propriedade City State
Rod. BR 386, bairro Olarias/Conventos Lajeado RS
Fazenda da Várzea - zona rural Serro MG
Cidade do Rock - Jacarepaguá... Rio de Janeiro RJ
Área de extração de carnaúba - Povoado A... Santa Cruz do Piauí PI
Pastelaria - Av. Vicente de Carvalho, 99... Rio de Janeiro RJ
Does anyone know how I can create these two columns?
I cannot do a general split because I only want to separate out the city and state information. The rest of the address should remain unchanged.

What do you think about:
import pandas as pd
propiedades = ["Rod. BR 386, bairro Olarias/Conventos, Lajeado/RS",
"Fazenda da Várzea - zona rural, Serro/MG",
"Cidade do Rock - Jacarepaguá, Rio de Janeiro/RJ",
"Área de extração de carnaúba - Povoado Areal, zona rural, Santa Cruz do Piauí/PI",
"Pastelaria - Av. Vicente de Carvalho, 995, Loja Q, Vila da Penha, Rio de Janeiro/RJ"]
df = pd.DataFrame({"Nome_propriedade":propiedades})
# Take the text after the last comma, trim the leading space, then split "City/State"
df[["City", "State"]] = df["Nome_propriedade"].apply(lambda x: x.split(",")[-1].strip()).str.split("/", expand=True)
UPDATE
If you then want to remove this information from Nome_propriedade, you can add this line:
df["Nome_propriedade"] = df["Nome_propriedade"].apply(lambda x :",".join(x.split(",")[:-1]))

You need to split the string in the column by a comma, take the last element of the resulting list, and split that by /. The resulting list gives you your two columns.
pd.DataFrame(list(df['Nome_propriedade'].str.split(',').apply(lambda x: x[-1]).str.split('/')), columns=['city', 'state'])
Output:
city state
0 Lajeado RS
1 Serro MG
2 Rio de Janeiro RJ
3 Santa Cruz do Piauí PI
4 Rio de Janeiro RJ

Here is an effective solution avoiding the tedious apply and simply sticking with str-operations.
df["Nome_propriedade"], x = df["Nome_propriedade"].str.rsplit(', ', 1).str
df["City"], df['State'] = x.str.split('/').str
Full example:
import pandas as pd
propiedades = [
"Rod. BR 386, bairro Olarias/Conventos, Lajeado/RS",
"Fazenda da Várzea - zona rural, Serro/MG",
"Cidade do Rock - Jacarepaguá, Rio de Janeiro/RJ",
"Área de extração de carnaúba - Povoado Areal, zona rural, Santa Cruz do Piauí/PI",
"Pastelaria - Av. Vicente de Carvalho, 995, Loja Q, Vila da Penha, Rio de Janeiro/RJ"
]
df = pd.DataFrame({
"Nome_propriedade":propiedades
})
df["Nome_propriedade"], x = df["Nome_propriedade"].str.rsplit(', ', 1).str
df["City"], df['State'] = x.str.split('/').str
# Stripping Nome_propriedade to len 40 to fit screen
print(df.assign(Nome_propriedade=df['Nome_propriedade'].str[:40]))
Returns:
Nome_propriedade City State
0 Rod. BR 386, bairro Olarias/Conventos Lajeado RS
1 Fazenda da Várzea - zona rural Serro MG
2 Cidade do Rock - Jacarepaguá Rio de Janeiro RJ
3 Área de extração de carnaúba - Povoado A Santa Cruz do Piauí PI
4 Pastelaria - Av. Vicente de Carvalho, 99 Rio de Janeiro RJ
If you'd like to keep the items:
df["City"], df['State'] = df["Nome_propriedade"]\
.str.rsplit(', ', 1).str[-1]\
.str.split('/').str
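The unpacking of the `.str` accessor used above relies on behaviour of older pandas versions and, as far as I know, may fail on recent releases (2.0 and later). A minimal sketch of the same idea using `expand=True` instead, assuming every value ends in ", City/State" (the shortened sample list is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Nome_propriedade": [
    "Rod. BR 386, bairro Olarias/Conventos, Lajeado/RS",
    "Fazenda da Várzea - zona rural, Serro/MG",
    "Cidade do Rock - Jacarepaguá, Rio de Janeiro/RJ",
]})

# Split once from the right on ", ": column 0 is the address, column 1 is "City/State"
parts = df["Nome_propriedade"].str.rsplit(", ", n=1, expand=True)
df["Nome_propriedade"] = parts[0]

# Split "City/State" on its last "/" into the two new columns
df[["City", "State"]] = parts[1].str.rsplit("/", n=1, expand=True)

print(df)
```

Because the first split already isolates the last comma-separated element, the "/" in "Olarias/Conventos" never takes part in the second split.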

The easiest approach I can see is, for a single example:
example = 'some, stuff, here, city/state'
elements = example.split(',')
city, state = elements[-1].split('/')
To apply this to the column in your dataframe:
df['city_state'] = df.Nome_propriedade.apply(lambda r: r.split(',')[-1].split('/'))
df['city'] = [cs[0] for cs in df['city_state']]
df['state'] = [cs[1] for cs in df['city_state']]
For example:
example2 = 'another, thing here city2/state2'
df = pd.DataFrame({'address': [example, example2],
'other': [1, 2]})
df['city_state'] = df.address.apply(lambda r: r.split()[-1].split('/'))
df['city'] = [cs[0] for cs in df['city_state']]
df['state'] = [cs[1] for cs in df['city_state']]
df.drop(columns=['city_state'], inplace=True)
print(df)
# address other city state
# 0 some, stuff, here, city/state 1 city state
# 1 another, thing here city2/state2 2 city2 state2
Note: some of the other answers provide a more efficient way to unpack the result into your dataframe. I'll leave this here because I think breaking it out into steps is illustrative, but for efficiency's sake, I'd go with one of the others.

Related

AttributeError: 'ChatBot' object has no attribute 'input'

I'm having trouble finding the error in my code:
from chatterbot import ChatBot
from chatterbot.trainers import ChatterBotCorpusTrainer
from chatterbot.comparisons import JaccardSimilarity
from chatterbot.comparisons import LevenshteinDistance
from chatterbot.conversation import Statement
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
# Create an instance of the ChatBot class
chatbot = ChatBot(
    'Jazz',
    storage_adapter='chatterbot.storage.SQLStorageAdapter',
    database='./database.sqlite5',  # database file (created automatically if it does not exist)
    input_adapter='chatterbot.input.TerminalAdapter',  # the question is read from the terminal
    output_adapter='chatterbot.output.TerminalAdapter',  # the response is written to the terminal
    trainer='chatterbot.trainers.ListTrainer',
    # A logic adapter is a class that returns a response to a given question.
    # You can use as many logic adapters as you like.
    logic_adapters=[
        #'chatterbot.logic.MathematicalEvaluation',  # logic adapter that answers math questions in English
        #'chatterbot.logic.TimeLogicAdapter',  # logic adapter that answers questions about the current time in English
        {
            "import_path": "chatterbot.logic.BestMatch",
            "statement_comparison_function": "chatterbot.comparisons.levenshtein_distance",
            "response_selection_method": "chatterbot.response_selection.get_most_frequent_response"
        }
        #{
        #    'import_path': 'chatterbot.logic.LowConfidenceAdapter',
        #    'threshold': 0.51,
        #    'default_response': 'Disculpa, no te he entendido bien. ¿Puedes ser más específico?.'
        #},
        #{
        #    'import_path': 'chatterbot.logic.SpecificResponseAdapter',
        #    'input_text': 'Eso es todo',
        #    'output_text': 'Perfecto. Hasta la próxima'
        #},
    ],
    preprocessors=[
        'chatterbot.preprocessors.clean_whitespace'
    ],
    #read_only=True,
)
trainer = ChatterBotCorpusTrainer(chatbot)
trainer.train("chatterbot.corpus.spanish")
trainer.train("./PreguntasYRespuestas.yml")
#chatbot.train([
# '¿Cómo estás?',
# 'Bien.',
# 'Me alegro.',
# 'Gracias.',
# 'De nada.',
# '¿Y tú?'
#])
levenshtein_distance = LevenshteinDistance(None)
disparate=Statement('No te he entendido')  # convert a phrase into a Statement object
entradaDelUsuario=""  # holds what the user has typed
entradaDelUsuarioAnterior=""
while entradaDelUsuario!="adios":
    entradaDelUsuario = chatbot.input.process_input_statement()  # read the user's input
    statement, respuesta = chatbot.generate_response(entradaDelUsuario)
    if levenshtein_distance.compare(entradaDelUsuario,disparate)>0.51:
        print('¿Qué debería haber dicho?')
        entradaDelUsuarioCorreccion = chatbot.input.process_input_statement()
        chatbot.train([entradaDelUsuarioAnterior.text,entradaDelUsuarioCorreccion.text])
        print("He aprendiendo que cuando digas {} debo responder {}".format(entradaDelUsuarioAnterior.text,entradaDelUsuarioCorreccion.text))
    entradaDelUsuarioAnterior=entradaDelUsuario
    print("\n%s\n\n" % respuesta)
I have tried to follow the tutorial. I am new to Python and would appreciate help finding the error, since the following appears when I run the code:
AttributeError: 'ChatBot' object has no attribute 'input'

Tweepy: bad Authentication Data

I've been trying to do some basic sentiment analysis on some tweets about La Sagrada Familia, and cannot for the life of me figure out why I get this bad authentication data error:
Traceback (most recent call last):
File "saDemo.py", line 15, in <module>
public_tweets = api.search('')
File "/Users/declancasey/opt/miniconda3/lib/python3.8/site-packages/tweepy/binder.py", line 252, in _call
return method.execute()
File "/Users/declancasey/opt/miniconda3/lib/python3.8/site-packages/tweepy/binder.py", line 234, in execute
raise TweepError(error_msg, resp, api_code=api_error_code)
tweepy.error.TweepError: [{'code': 215, 'message': 'Bad Authentication data.'}]
I've seen other people have issues related to the keys they're using. I kept using my original keys for a few days in case they hadn't yet been activated, but I still got the same error. I've regenerated my keys several times over the past few days, changed the formatting, and tried commenting out different lines, but I keep getting this error. I'm using Python 3.8 on macOS Big Sur; any help would be appreciated. My code is below:
import tweepy
from textblob import TextBlob
consumer_key = "XXXX"
consumer_secret = "XXXX"
access_token = "XXXX"
access_token_secret = "XXXX"
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
public_tweets = api.search("La Sagrada Familia")
for tweet in public_tweets:
    print(tweet.text)
    analysis = TextBlob(tweet.text)
    print(analysis.sentiment)
I used your code (on Python 3.7.4) and could not reproduce the error.
I suggest you check your API keys by logging into your Twitter developer account, or try using a different key.
import tweepy
from textblob import TextBlob
consumer_key = key[0]
consumer_secret = key[1]
access_token = key[2]
access_token_secret = key[3]
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
public_tweets = api.search("La Sagrada Familia")
for tweet in public_tweets:
    print(tweet.text)
    analysis = TextBlob(tweet.text)
    print(analysis.sentiment)
This is the output I got: (I've removed the URLs from output)
El Evangelio del dia 27 DE DICIEMBRE – DOMINGO -La Sagrada Familia: Jesús, María y José – Ciclo B SAN JUAN EVAN…
Sentiment(polarity=0.0, subjectivity=0.0)
Lecturas LSE: La Sagrada Familia (B), 27 de diciembre de 2020 a través de #YouTube
Sentiment(polarity=0.0, subjectivity=0.0)
LAUDES 2020/12/27 #LaudesFrayNelson para la Fiesta de la Sagrada Familia
Sentiment(polarity=0.0, subjectivity=0.0)
LECTIO 2020/12/27 LECTURA ESPIRITUAL. #LectioFrayNelson para la Fiesta de la Sagrada Familia
Sentiment(polarity=0.0, subjectivity=0.0)
RT #DespertaFerro11: Els que estigueu demà prop de la Sagrada Família, passeu a donar suport a #JuntsXCat! Es necessita el teu aval! #Junts…
Sentiment(polarity=0.0, subjectivity=0.0)
RT #HdadBorriquita: 🔴 ACTUALIDAD | La imagen de San Juan Evangelista se encuentra en el Altar Mayor de la Parroquia de San Agustín con moti…
Sentiment(polarity=0.0, subjectivity=0.0)
RT #alvpe: Basílica de la Sagrada Família, Barcelona.
Sentiment(polarity=0.0, subjectivity=0.0)
RT #DespertaFerro11: Els que estigueu demà prop de la Sagrada Família, passeu a donar suport a #JuntsXCat! Es necessita el teu aval! #Junts…
Sentiment(polarity=0.0, subjectivity=0.0)
RT #DespertaFerro11: Els que estigueu demà prop de la Sagrada Família, passeu a donar suport a #JuntsXCat! Es necessita el teu aval! #Junts…
Sentiment(polarity=0.0, subjectivity=0.0)
RT #bsainzm: hasta que una mujer alzó la mano. El padre agradeció a la mujer y pidió a la familia de migrantes que pasara al frente. La "fa…
Sentiment(polarity=0.0, subjectivity=0.0)
RT #DespertaFerro11: Els que estigueu demà prop de la Sagrada Família, passeu a donar suport a #JuntsXCat! Es necessita el teu aval! #Junts…
Sentiment(polarity=0.0, subjectivity=0.0)
RT #SemiCuenca: El Oratorio y la Capilla del Seminario se “visten” de Navidad. Sagrada Familia, sed guía y estímulo para padres e hijos. Qu…
Sentiment(polarity=0.0, subjectivity=0.0)
RT #DespertaFerro11: Els que estigueu demà prop de la Sagrada Família, passeu a donar suport a #JuntsXCat! Es necessita el teu aval! #Junts…
Sentiment(polarity=0.0, subjectivity=0.0)
La sagrada familia de Nazaret, la de Jesús, era a ojos de su tiempo una familia desestructurada: un hijo inesperado…
Sentiment(polarity=0.0, subjectivity=0.0)
RT #DespertaFerro11: Els que estigueu demà prop de la Sagrada Família, passeu a donar suport a #JuntsXCat! Es necessita el teu aval! #Junts…
Sentiment(polarity=0.0, subjectivity=0.0)

KeyError: word fransız not in vocabulary

When I try to run the code below, I get a KeyError:
KeyError: word fransız not in vocabulary.
What is the issue?
import numpy as np
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize,word_tokenize
import string
text="Victor Marie Hugo, Romantik akıma bağlı Fransız şair, romancı ve oyun yazarı. En büyük ve ünlü Fransız yazarlardan biri kabul edilir. Hugo'nun Fransa'daki edebi ünü ilk olarak şiirlerinden sonra da romanlarından ve tiyatro oyunlarından gelir. Pek çok şiirinin içinde özellikle Les Contemplations ve La Légende des siècles büyük saygı görür. Fransa dışında en çok Sefiller ve Notre Dame'ın Kamburu romanlarıyla tanınır.Gençliğinde şiddetli bir kral yanlısı olsa da, görüşü yıllar içinde değişti ve tutkulu bir cumhuriyet destekçisi oldu. Eserleri zamanının politik ve sosyal sorunlarına ve de sanatsal akımlarına değinir. Hugo'nun cenazesi 1885'te Panthéon'da gömüldü. Hugo hakkında en çok eser yazılan ilk 100 kişi listesinde yer almaktadır. Victor Hugo, Joseph Léopold Sigisbert Hugo (1773–1828) ve Sophie Trébuchet (1772–1821) çiftinin üçüncü oğluydu; Abel Joseph Hugo (1798–1855) ve Eugène Hugo (1800–1837) isminde iki ağabeyi vardı. 1802'de Besançon'da doğdu. Napolyon'un bir kahraman olduğunu düşünen serbest fikirli bir cumhuriyetçiydi. Annesi 1812'de Napolyon'a karşı komplo kurduğu için idam edilen General Victor Lahorie ile sevgili olduğu düşünülen Katolik bir Kralcıydı.Hugo'nun çocukluğu ülkede siyasi karmaşıklığın olduğu bir dönemde geçti. Doğumundan iki yıl sonra Napolyon İmparator ilan edilmiş, 18 yaşındayken de Bourbon Monarşisi yeniden tahta geçirilmişti. Hugo'nun ailesinin ters dini ve politik görüşleri Fransa'da egemenlik mücadelesi veren kuvvetleri yansıtıyordu. Hugo'nun babası İspanya'da yenilene kadar orduda yüksek rütbeli bir subaydı.Babası subay olduğu sürece aile sık sık taşındı ve bu yolculuklar sırasında Hugo pek çok şey öğrendi. Çocukluğunda Napoli'ye giderken geniş Alpler'deki geçitleri ve karlı zirveleri, muhteşem Akdeniz mavisini ve şenlikler yapılan Roma'yı gördü. 5 yaşında olmasına rağmen bu 6 aylık geziyi her zaman aklında tuttu. 
Aile Napoli'de birkaç ay kalıp doğruca Paris'e döndü.Hugo'nun annesi Sophie evliliğinin başında kocasına İtalya (Leopold Napoli'ye yakın bir vilayette valiydi) ve İspanya'ya (üç vilayette görev almıştı) kadar eşlik etti. Askeri hayatın getirdiği yorucu yolculuklar ve kocasının inancının zayıflığı nedeniyle ters düşmelerinden dolayı Sophie 1803'te Leopold'dan bir süreliğine ayrılıp üç çocuğuyla Paris'e yerleşti. Bundan sonra Hugo'nun eğitimi ve yetişmesi üzerine eğildi. Bu yüzden Hugo'nun kariyerinin ilk dönemindeki şiir ve kurgu çalışmaları annesinin inancının ve krala bağlılığının yansımasıydı. Ama başını Fransa'daki 1848 Devrimi'nin çektiği olaylar sırasında Katolik Kralcı yanlısı eğitime başkaldırıp Cumhuriyetçiliği ve Özgür düşünceyi desteklemeye başladı.Gençliğinde aşık oldu ve annesinin isteklerine karşı gelip çocukluk arkadaşı Adèle Foucher (1803–1868) ile gizlice nişanlandı. Annesi ile yakın ilişkisinden dolayı Adèle ile evlenmek için annesinin ölümüne (1821) kadar bekledi ve 1822'de evlendi.Adèle ve Victor Hugo'nun ilk çocuğu Leopold 1823'te doğdu ama doğduktan kısa süre sonra öldü. Sonraki sene kızları 28 Ağustos 1824'te Léopoldine doğdu. Onu 4 Kasım 1826'da doğan Charles, 28 Ekim 1828'de doğan François-Victor, ve 24 Ağustos 1830'da doğan Adèle takip etti.Hugo'nun en büyük ve en sevdiği kızı Léopoldine, Charles Vacquerie ile evliliğinden kısa süre sonra 19 yaşındayken 1843'te öldü. 4 Eylül 1843'te Seine nehrinde boğuldu. Gemi alabaro olduğundan ağır eteği tarafından dibe doğru çekildi ve kocası Charles Vacquerie de onu kurtarmaya çalışırken öldü. O zaman metresi ile Fransa'nın güneyinde seyahat etmekte olan Hugo kızının ölümünü oturduğu cafede okuduğu bir gazeteden öğrendi. Kızının ölümü Hugo'yu oldukça harap etti.III. Napolyon'un 1851 yılının sonundaki askeri darbesi sebebiyle sürgüne çıktı. Fransa'dan ayrıldıktan sonra, Channel Adaları'na gitmeden önce kısa bir süre Brüksel'de yaşadı. 1852'den 1855'e kadar Jersey'de yaşadı. 
1855'te 15 yıl yaşayacağı Guernsey'e taşındı. III. Napolyon 1859'da genel af ilan ettiğinde ülkesine dönme fırsatı elde ettiyse de sürgünde kalmayı tercih etti. Kaybedilen Fransa-Prusya Savaşı'nın sonucu olarak III. Napolyon iktidardan çekilmek zorunda kalınca ülkesine döndü. Paris Kuşatması'ndan sonra hayatının geri kalanını Fransa'da geçirmek için geri dönmeden önce tekrar Guernsey'e taşınıp 1872 ve 1873 arası orada kaldı. Hugo ilk romanını (Han d'Islande, 1823) evliliğinden bir yıl sonra yayımladı. Üç yıl sonra da ikinci romanı (Bug-Jargal, 1826) basıldı. 1829 ve 1840 arasında zamanının en iyi şairlerinden biri olarak ününü pekiştiren beş şiir kitabı (Les Orientales, 1829; Les Feuilles d'automne, 1831; Les Chants du crépuscule, 1835; Les Voix intérieures, 1837; ve Les Rayons et les ombres, 1840) yayınladı."
punctuations = ",;:()[]/{}''"
sentence="!.?"
no_punct = ""
for char in text:
    if char not in punctuations:
        no_punct = no_punct + char
t_sen = ""
for char in no_punct:
    if char in sentence:
        t_sen = no_punct.split(char)
corpus=[]
for cumle in t_sen:
    corpus.append(cumle.split())
model=Word2Vec(corpus,size=30,window=5,min_count=5,sg=1)
model.wv.most_similar('fransız')
I am wondering whether your model returns anything for 'Fransız':
model.wv.most_similar('Fransız')
You are not doing any preprocessing on the input vocabulary, so you cannot expect to find words that differ in casing (in your case, a lowercase word versus a capitalized one).
Another possible reason (thanks for the suggestion, @gojomo) is the min_count parameter. Here it is 5, which puts the threshold above the number of times the word occurs in the text, 3 (counting both the lowercase and capitalized versions).
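Both points can be checked without training a model. The sketch below only imitates the effect of a min_count cut-off with a plain counter; it is not gensim's implementation, and the `vocabulary` helper and the tiny corpus are made up for illustration:

```python
from collections import Counter

def vocabulary(sentences, min_count=5, lowercase=True):
    """Return the tokens that occur at least `min_count` times."""
    counts = Counter(
        (token.lower() if lowercase else token)
        for sentence in sentences
        for token in sentence
    )
    return {token for token, n in counts.items() if n >= min_count}

# Hypothetical tokenized corpus: the word appears three times in total
corpus = [
    ["Fransız", "şair"],
    ["ünlü", "Fransız", "yazar"],
    ["fransız", "edebiyatı"],
]

print("fransız" in vocabulary(corpus, min_count=5))  # too rare for the cut-off
print("fransız" in vocabulary(corpus, min_count=1))  # kept once the cut-off is lowered
print("fransız" in vocabulary(corpus, min_count=1, lowercase=False))
```

With lowercasing, all three occurrences are counted under one key. Without it, the capitalized and lowercase forms are separate vocabulary entries, so the lookup has to use exactly the casing that appears in the text.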

Convert physical addresses to Geographic locations Latitude and Longitude

I have read a CSV file (containing customer addresses) into a DataFrame.
Description of the CSV file (or the DataFrame):
The DataFrame contains several rows and 5 columns.
Data example:
Address1 Address3 Post_Code City_Name Full_Address
10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ
I have written some code (geocoding with Python) in order to convert physical addresses to geographic locations (latitude and longitude), but the code keeps raising errors.
So far I have written this code:
import folium
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
# Read the CSV, by the way the csv file contains 43 columns
ERP_Data = pd.read_csv("test.csv")
# Extracting the address information into a new DataFrame
Address_info= ERP_Data[['Address1','Address3','Post_Code','City_Name']].copy()
# Adding a new column called (Full_Address) that concatenate address columns into one
# for example Karlaplan 13,115 20,STOCKHOLM,Stockholms län, Sweden
Address_info['Full_Address'] = Address_info[Address_info.columns[1:]].apply(
    lambda x: ','.join(x.dropna().astype(str)), axis=1)
locator = Nominatim(user_agent="myGeocoder") # holds the Geocoding service, Nominatim
# 1 - convenient function to delay between geocoding calls
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
# 2 - create location column
Address_info['location'] = Address_info['Full_Address'].apply(geocode)
# 3 - create longitude, latitude and altitude from location column (returns tuple)
Address_info['point'] = Address_info['location'].apply(lambda loc: tuple(loc.point) if loc else None)
# 4 - split point column into latitude, longitude and altitude columns
Address_info[['latitude', 'longitude', 'altitude']] = pd.DataFrame(Address_info['point'].tolist(), index=Address_info.index)
# using Folium to map out the points we created
folium_map = folium.Map(location=[49.61167,6.13], zoom_start=12,)
An example of the full output error is :
RateLimiter caught an error, retrying (0/2 tries). Called with (*('44 AVENUE JOHN FITZGERALD KENNEDY,L-1855,LUXEMBOURG',), **{}).
Traceback (most recent call last):
File "e:\Anaconda3\lib\urllib\request.py", line 1317, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "e:\Anaconda3\lib\http\client.py", line 1244, in request
self._send_request(method, url, body, headers, encode_chunked)
File "e:\Anaconda3\lib\http\client.py", line 1290, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "e:\Anaconda3\lib\http\client.py", line 1239, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "e:\Anaconda3\lib\http\client.py", line 1026, in _send_output
self.send(msg)
File "e:\Anaconda3\lib\http\client.py", line 966, in send
self.connect()
File "e:\Anaconda3\lib\http\client.py", line 1414, in connect
server_hostname=server_hostname)
File "e:\Anaconda3\lib\ssl.py", line 423, in wrap_socket
session=session
File "e:\Anaconda3\lib\ssl.py", line 870, in _create
self.do_handshake()
File "e:\Anaconda3\lib\ssl.py", line 1139, in do_handshake
self._sslobj.do_handshake()
socket.timeout: _ssl.c:1059: The handshake operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "e:\Anaconda3\lib\site-packages\geopy\geocoders\base.py", line 355, in _call_geocoder
page = requester(req, timeout=timeout, **kwargs)
File "e:\Anaconda3\lib\urllib\request.py", line 525, in open
response = self._open(req, data)
File "e:\Anaconda3\lib\urllib\request.py", line 543, in _open
'_open', req)
File "e:\Anaconda3\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "e:\Anaconda3\lib\urllib\request.py", line 1360, in https_open
context=self._context, check_hostname=self._check_hostname)
File "e:\Anaconda3\lib\urllib\request.py", line 1319, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error _ssl.c:1059: The handshake operation timed out>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "e:\Anaconda3\lib\site-packages\geopy\extra\rate_limiter.py", line 126, in __call__
return self.func(*args, **kwargs)
File "e:\Anaconda3\lib\site-packages\geopy\geocoders\osm.py", line 387, in geocode
self._call_geocoder(url, timeout=timeout), exactly_one
File "e:\Anaconda3\lib\site-packages\geopy\geocoders\base.py", line 378, in _call_geocoder
raise GeocoderTimedOut('Service timed out')
geopy.exc.GeocoderTimedOut: Service timed out
Expected output is
Address1 Address3 Post_Code City_Name Full_Address Latitude Longitude
10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH 49.7508296 6.1085476
10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH 49.7508296 6.1085476
10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH 49.7508296 6.1085476
10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG 49.6302147 6.1713374
11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ 49.5217917 6.0101385
I've updated your code:
Added Address_info = Address_info.apply(lambda x: x.str.strip(), axis=1), which removes whitespace before and after each string.
Added a function with try-except to handle the lookup.
from geopy.exc import GeocoderTimedOut, GeocoderQuotaExceeded
import time
ERP_Data = pd.read_csv("test.csv")
# Extracting the address information into a new DataFrame
Address_info= ERP_Data[['Address1','Address3','Post_Code','City_Name']].copy()
# Clean existing whitespace from the ends of the strings
Address_info = Address_info.apply(lambda x: x.str.strip(), axis=1) # ← added
# Adding a new column called (Full_Address) that concatenate address columns into one
# for example Karlaplan 13,115 20,STOCKHOLM,Stockholms län, Sweden
Address_info['Full_Address'] = Address_info[Address_info.columns[1:]].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)
locator = Nominatim(user_agent="myGeocoder") # holds the Geocoding service, Nominatim
# 1 - convenient function to delay between geocoding calls
# geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
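def geocode_me(location):
    time.sleep(1.1)  # pause between requests (Nominatim allows at most one per second)
    try:
        return locator.geocode(location)
    except (GeocoderTimedOut, GeocoderQuotaExceeded) as e:
        # Test the caught exception instance; the bare class name is always truthy
        if isinstance(e, GeocoderQuotaExceeded):
            print(e)
        else:
            print(f'Location not found: {e}')
        return None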
# 2- create location column
Address_info['location'] = Address_info['Full_Address'].apply(lambda x: geocode_me(x)) # ← note the change here
# 3 - create longitude, latitude and altitude from location column (returns tuple)
Address_info['point'] = Address_info['location'].apply(lambda loc: tuple(loc.point) if loc else None)
# 4 - split point column into latitude, longitude and altitude columns
Address_info[['latitude', 'longitude', 'altitude']] = pd.DataFrame(Address_info['point'].tolist(), index=Address_info.index)
Output:
Address1 Address3 Post_Code City_Name Full_Address location point latitude longitude altitude
10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535,MERSCH (Rue de la Gare, Mersch, Canton Mersch, 7535, Lëtzebuerg, (49.7508296, 6.1085476)) (49.7508296, 6.1085476, 0.0) 49.750830 6.108548 0.0
10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535,MERSCH (Rue de la Gare, Mersch, Canton Mersch, 7535, Lëtzebuerg, (49.7508296, 6.1085476)) (49.7508296, 6.1085476, 0.0) 49.750830 6.108548 0.0
10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535,MERSCH (Rue de la Gare, Mersch, Canton Mersch, 7535, Lëtzebuerg, (49.7508296, 6.1085476)) (49.7508296, 6.1085476, 0.0) 49.750830 6.108548 0.0
10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG (Rue Edward Steichen, Grünewald, Weimershof, Neudorf-Weimershof, Luxembourg, Canton Luxembourg, 2540, Lëtzebuerg, (49.6302147, 6.1713374)) (49.6302147, 6.1713374, 0.0) 49.630215 6.171337 0.0
11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898,FOETZ (Rue du Brill, Mondercange, Canton Esch-sur-Alzette, 3898, Luxembourg, (49.5217917, 6.0101385)) (49.5217917, 6.0101385, 0.0) 49.521792 6.010139 0.0
10000052 3 RUE DU PUITS ROMAIN L-8070 BERTRANGE 3 RUE DU PUITS ROMAIN,L-8070,BERTRANGE (Rue du Puits Romain, Z.A. Bourmicht, Bertrange, Canton Luxembourg, 8070, Lëtzebuerg, (49.6084531, 6.0771901)) (49.6084531, 6.0771901, 0.0) 49.608453 6.077190 0.0
Note & Additional Resources:
The output includes the address that caused the error in your TraceBack
RateLimiter caught an error, retrying (0/2 tries). Called with (*('3 RUE DU PUITS ROMAIN ,L-8070 ,BERTRANGE ',)
Note all the extra whitespace in the address. I've added a line of code to remove whitespace from the beginning and end of the strings
GeocoderTimedOut, a real pain?
Geopy: catch timeout error
Final:
In the end, the service times out because the day's request limit has been exceeded (HTTP Error 429: Too Many Requests).
Review Nominatim Usage Policy
Suggestion: Use a different Geocoder
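Since the sample data repeats the same address on several rows, one way to stay further below the request limit is to look up each distinct address only once. This is a minimal sketch of that idea, not part of the answer's code: `geocode_unique` and `fake_geocode` are hypothetical names, and the stub stands in for a real geocoder call such as `geocode_me` so the example runs offline.

```python
import pandas as pd

def geocode_unique(addresses, geocode):
    """Call `geocode` once per distinct address and map the results back to every row."""
    lookup = {address: geocode(address) for address in addresses.dropna().unique()}
    return addresses.map(lookup.get)

calls = []  # records every lookup so the number of requests can be inspected

def fake_geocode(address):
    """Stand-in for a real geocoder; returns a made-up point for one town only."""
    calls.append(address)
    return (49.75, 6.11) if "MERSCH" in address else None

addresses = pd.Series([
    "37 RUE DE LA GARE,L-7535,MERSCH",
    "37 RUE DE LA GARE,L-7535,MERSCH",
    "9 RUE DU BRILL,L-3898,FOETZ",
])

points = geocode_unique(addresses, fake_geocode)
print(len(calls))  # 2 lookups for 3 rows
print(points)
```

Swapping `fake_geocode` for the real lookup function keeps the delay and error handling in one place while sending fewer requests.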

Unicode elements in list save to file

I have two questions:
1) What have I done wrong in the script below? The result is not encoded properly and all non-standard characters are stored incorrectly. When I print out the data list, it gives me a proper list of unicode objects:
[u'Est-ce que tu peux traduire \xc3\xa7a pour moi? \n \n \n Can you translate this for me?'], [u'Chicago est tr\xc3\xa8s diff\xc3\xa9rente de Boston. \n \n \n Chicago is very different from Boston.'],
After that I strip all extra spaces and newlines, and the result in the file looks like this (it looks the same when printed and when saved to file):
Est-ce que tu peux traduire ça pour moi?;Can you translate this for me?
Chicago est très différente de Boston.;Chicago is very different from Boston.
2) What scripting language other than Python would you recommend?
import requests
import unicodecsv, os
from bs4 import BeautifulSoup
import re
import html5lib
countries = ["fr"] #,"id","bn","my","chin","de","es","fr","hi","ja","ko","pt","ru","th","vi","zh"]
for country in countries:
    f = open("phrase_" + country + ".txt","w")
    w = unicodecsv.writer(f, encoding='utf-8')
    toi = 1
    print country
    while toi<2:
        url = "http://www.englishspeak.com/"+ country +"/english-phrases.cfm?newCategoryShowed=" + str(toi) + "&sortBy=28"
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html5lib')
        soup.unicode
        [s.extract() for s in soup('script')]
        [s.extract() for s in soup('style')]
        [s.extract() for s in soup('head')]
        [s.extract() for s in soup("table" , { "height" : "102" })]
        [s.extract() for s in soup("td", { "class" : "copyLarge"})]
        [s.extract() for s in soup("td", { "width" : "21%"})]
        [s.extract() for s in soup("td", { "colspan" : "3"})]
        [s.extract() for s in soup("td", { "width" : "25%"})]
        [s.extract() for s in soup("td", { "class" : "blacktext"})]
        [s.extract() for s in soup("div", { "align" : "center"})]
        data = []
        rows = soup.find_all('tr', {"class": re.compile("Data.")})
        for row in rows:
            cols = row.find_all('td')
            cols = [ele.text.strip() for ele in cols]
            data.append([ele for ele in cols if ele])
        wordsList = []
        for index, item in enumerate(data):
            str_tmp = "".join(data[index]).encode('utf-8')
            str_tmp = re.sub(r' +\n\s+', ';', str_tmp)
            str_tmp = re.sub(r' +', ' ', str_tmp)
            wordsList.append(str_tmp.decode('utf-8'))
            print str_tmp
        w.writerow(wordsList)
        toi += 1
You should use r.text, not r.content, because content is the raw bytes and text is the decoded text:
soup = BeautifulSoup(r.text, 'html5lib')
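To see why the difference between bytes and decoded text matters, here is a small sketch. It is written for Python 3 (the thread itself uses Python 2), and the use of `latin-1` as the wrong codec is my assumption; it happens to reproduce the `\xc3\xa7` pattern visible in the list from the question:

```python
# UTF-8 bytes, as a server might send them
raw = "Est-ce que tu peux traduire ça pour moi?".encode("utf-8")

# Decoding with the wrong codec turns each UTF-8 byte into its own character
wrong = raw.decode("latin-1")

# Decoding with the codec the bytes were written in recovers the text
right = raw.decode("utf-8")

# Text that was already mis-decoded this way can be repaired by reversing the step
repaired = wrong.encode("latin-1").decode("utf-8")

print(repr(wrong))
print(right)
print(repaired == right)
```

The mis-decoded string contains the two characters `\xc3\xa7` where a single `ç` belongs, which matches what the printed list shows.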
You can then just write the UTF-8 encoded text to the file:
with open("out.txt","w") as f:
    for d in data:
        d = " ".join(d).encode("utf-8")
        d = re.sub(r'\n\s+', ';', d)
        d = re.sub(r' +', ' ', d)
        f.write(d)
Output:
Fais attention en conduisant. ;Be careful driving.Fais attention. ;Be careful.Est-ce que tu peux traduire ça pour moi? ;Can you translate this for me?Chicago est très différente de Boston. ;Chicago is very different from Boston.Ne t'inquiète pas. ;Don't worry.Tout le monde le sais. ;Everyone knows it.Tout est prêt. ;Everything is ready.Excellent. ;Excellent.De temps en temps. ;From time to time.Bonne idée. ;Good idea.Il l'aime beaucoup. ;He likes it very much.A l'aide! ;Help!Il arrive bientôt. ;He's coming soon.Il a raison. ;He's right.Il est très ennuyeux. ;He's very annoying.Il est très célèbre. ;He's very famous.Comment ça va? ;How are you?Comment va le travail? ;How's work going?Dépêche-toi! ;Hurry!J'ai déjà mangé. ;I ate already.Je ne vous entends pas. ;I can't hear you.Je ne sais pas m'en servir. ;I don't know how to use it.Je ne l'aime pas. ;I don't like him.Je ne l'aime pas. ;I don't like it.Je ne parle pas très bien. ;I don't speak very well.Je ne comprends pas. ;I don't understand.Je n'en veux pas. ;I don't want it.Je ne veux pas ça. ;I don't want that.Je ne veux pas te déranger. ;I don't want to bother you.Je me sens bien. ;I feel good.Je sors du travail à six heures. ;I get off of work at 6.J'ai mal à la tête. ;I have a headache.J'espère que votre femme et vous ferez un bon voyage. ;I hope you and your wife have a nice trip.Je sais. ;I know.Je l'aime. ;I like her.J'ai perdu ma montre. ;I lost my watch.Je t'aime. ;I love you.J'ai besoin de changer de vêtements. ;I need to change clothes.J'ai besoin d'aller chez moi. ;I need to go home.Je veux seulement un en-cas. ;I only want a snack.Je pense que c'est bon. ;I think it tastes good.Je pense que c'est très bon. ;I think it's very good.Je pensais que les vêtements étaient plus chers. ;I thought the clothes were cheaper.J'allais quitter le restaurant quand mes amis sont arrivés. ;I was about to leave the restaurant when my friends arrived.Je voudrais faire une promenade. 
;I'd like to go for a walk.Si vous avez besoin de mon aide, faites-le-moi savoir s'il vous plaît. ;If you need my help, please let me know.Je t'appellerai vendredi. ;I'll call you when I leave.Je reviendrai plus tard. ;I'll come back later.Je paierai. ;I'll pay.Je vais le prendre. ;I'll take it.Je t'emmenerai à l'arrêt de bus. ;I'll take you to the bus stop.Je suis un Américain. ;I'm an American.Je nettoie ma chambre. ;I'm cleaning my room.J'ai froid. ;I'm cold.Je viens te chercher. ;I'm coming to pick you up.Je vais partir. ;I'm going to leave.Je vais bien, et toi? ;I'm good, and you?Je suis content. ;I'm happy.J'ai faim. ;I'm hungry.Je suis marié. ;I'm married.Je ne suis pas occupé. ;I'm not busy.Je ne suis pas marié. ;I'm not married.Je ne suis pas encore prêt. ;I'm not ready yet.Je ne suis pas sûr. ;I'm not sure.Je suis désolé, nous sommes complets. ;I'm sorry, we're sold out.J'ai soif. ;I'm thirsty.Je suis très occupé. Je n'ai pas le temps maintenant. ;I'm very busy. I don't have time now.Est-ce que Monsieur Smith est un Américain? ;Is Mr. Smith an American?Est-ce que ça suffit? ;Is that enough?C'est plus long que deux kilomètres. ;It's longer than 2 miles.Je suis ici depuis deux jours. ;I've been here for two days.J'ai entendu dire que le Texas était beau comme endroit. ;I've heard Texas is a beautiful place.Je n'ai jamais vu ça avant. ;I've never seen that before.Juste un peu. ;Just a little.Juste un moment. ;Just a moment.Laisse-moi vérifier. ;Let me check.laisse-moi y réfléchir. ;Let me think about it.Allons voir. ;Let's go have a look.Pratiquons l'anglais. ;Let's practice English.Pourrais-je parler à madame Smith s'il vous plaît? ;May I speak to Mrs. Smith please?Plus que ça. ;More than that.Peu importe. ;Never mind.La prochaine fois. ;Next time.Non, merci. ;No, thank you.Non. ;No.N'importe quoi. ;Nonsense.Pas récemment. ;Not recently.Pas encore. ;Not yet.Rien d'autre. ;Nothing else.Bien sûr. ;Of course.D'accord. 
;Okay.S'il vous plaît remplissez ce formulaire. ;Please fill out this form.S'il vous plaît emmenez-moi à cette adresse. ;Please take me to this address.S'il te plaît écris-le. ;Please write it down.Vraiment? ;Really?Juste ici. ;Right here.Juste là. ;Right there.A bientôt. ;See you later.A demain. ;See you tomorrow.A ce soir. ;See you tonight.Elle est jolie. ;She's pretty.Désolé de vous déranger. ;Sorry to bother you.Arrête! ;Stop!Tente ta chance. ;Take a chance.Réglez ça dehors. ;Take it outside.Dis-moi. ;Tell me.Merci Mademoiselle. ;Thank you miss.Merci Monsieur. ;Thank you sir.Merci beaucoup. ;Thank you very much.Merci. ;Thank you.Merci pour tout. ;Thanks for everything.Merci pour ton aide. ;Thanks for your help.Ça a l'air super. ;That looks great.Ça sent mauvais. ;That smells bad.C'est pas mal. ;That's alright.Ça suffit. ;That's enough.C'est bon. ;That's fine.C'est tout. ;That's it.Ce n'est pas juste. ;That's not fair.Ce n'est pas vrai. ;That's not right.C'est vrai. ;That's right.C'est dommage. ;That's too bad.C'est trop. ;That's too many.C'est trop. ;That's too much.Le livre est sous la table. ;The book is under the table.Ils vont revenir tout de suite. ;They'll be right back.Ce sont les mêmes. ;They're the same.Ils sont très occupés. ;They're very busy.Ça ne marche pas. ;This doesn't work.C'est très difficile. ;This is very difficult.C'est très important. ;This is very important.Essaie-le/la. ;Try it.Très bien, merci. ;Very good, thanks.Nous l'aimons beaucoup. ;We like it very much.Voudriez-vous prendre un message s'il vous plaît? ;Would you take a message please?Oui, vraiment. ;Yes, really.Vos affaires sont toutes là. ;Your things are all here.Tu es belle. ;You're beautiful.Tu es très sympa. ;You're very nice.Tu es très intelligent. ;You're very smart.
Also, you don't actually use the data in your list comprehensions, so they seem a little pointless.
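One reading of this remark, as a minimal sketch: the original comprehensions are not reproduced here, so `pairs`, `buf_a` and `buf_b` are hypothetical stand-ins (in-memory buffers instead of the real file `f`). A list comprehension run only for its side effects builds a list of return values that is thrown away, while a plain `for` loop does the same writes without it.

```python
import io
import re

# Hypothetical sample data standing in for the phrase pairs.
pairs = ["Fais  attention. ;Be careful.", "Tout est   prêt. ;Everything is ready."]

# Comprehension used only for its side effect: the list it builds
# (the return values of write) is never used afterwards.
buf_a = io.StringIO()
discarded = [buf_a.write(re.sub(r' +', ' ', d)) for d in pairs]

# Plain loop: same writes, no throwaway list.
buf_b = io.StringIO()
for d in pairs:
    d = re.sub(r' +', ' ', d)  # collapse runs of spaces, as in the code above
    buf_b.write(d)
```

Both buffers end up with identical content, so the comprehension adds nothing except an unused list.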
