I would like to know how I can find a word which has the next one with the first letter capitalised.
For example:
ID Testo
141 Vivo in una piccola città
22 Gli Stati Uniti sono una grande nazione
153 Il Regno Unito ha votato per uscire dall'Europa
64 Hugh Laurie ha interpretato Dr. House
12 Mi piace bere birra.
My expected output would be:
ID Testo Estratte
141 Vivo in una piccola città []
22 Gli Stati Uniti sono una grande nazione [Gli Stati, Stati Uniti]
153 Il Regno Unito ha votato per uscire dall'Europa [Il Regno, Regno Unito]
64 Hugh Laurie ha interpretato Dr. House [Hugh Laurie, Dr House]
12 Mi piace bere birra. []
To extract letter capitalised I do:
df['Estratte'] = df['Testo'].str.findall(r'\b([A-Z][a-z]*)\b')
However this column collect only single words since the code does not look at the next word.
Could you please tell me which condition I should add to look at the next word?
Sometime regex is not always good , let us try split with explode
s=df.Testo.str.split(' ').explode()
s2=s.groupby(level=0).shift(-1)
assign=(s + ' ' + s2)[s.str.istitle() & s2.str.isttimeitle()].groupby(level=0).agg(list)
Out[244]:
1 [Gli Stati, Stati Uniti]
2 [Il Regno, Regno Unito]
3 [Hugh Laurie, Dr. House]
Name: Testo, dtype: object
df['New']=assign
# notice after assign the not find row will be assign as NaN
Maybe you could use my code below
def getCapitalize(myStr):
words = myStr.split()
for i in range(0, len(words) - 1):
if (words[i][0].isupper() and words[i+1][0].isupper()):
yield f"{words[i]} {words[i+1]}"
This function will create a generator and you will have to convert to a list or wtv
import re
import pandas as pd
x = {141 : 'Vivo in una piccola città', 22: 'Gli Stati Uniti sono una grande nazione',
153 : 'Il Regno Unito ha votato per uscire dall\'Europa', 64 : 'Hugh Laurie ha interpretato Dr. House', 12 :'Mi piace bere birra.'}
df = pd.DataFrame(x.items(), columns = ['id', 'testo'])
caps = []
vals = df.testo
for string in vals:
string = string.split(' ')
string = string[1:]
string = ' '.join(string)
caps.append(re.findall('([A-Z][a-z]+)', string))
df['Estratte'] = caps```
Why not match a word starting with capital letter but not at the start of line
df.Testo.str.findall('(?<!^)([A-Z]\w+)')
or
df.Testo.str.findall('(?<!^)[A-Z][a-z]+')
0 []
1 [Stati, Uniti]
2 [Regno, Unito, Europa]
3 [Laurie, Dr, House]
4 []
I think the simplest is to use regex, search (pattern-space-pattern), with overlapping:
import regex as re
df['Estratte'] = df.Testo.apply(lambda x: re.findall('[A-Z][a-z]+[ ][A-Z][a-z]+', x, overlapped=True))
Related
I am trying to extract the information of a tweet from an ID. With the ID, I want to get the tweet creation date, the tweet, location, followers, friends, favorites, their profile description, if they're verified, and the language, but I'm having trouble doing it. Next, I will show the steps I follow to carry out what I want.
I have made the following code. To start with, I have the IDs of the tweets in a txt file and I read them as follows:
# Read txt file
txt = '/content/drive/MyDrive/Mini-proyecto Texto/archivo.txt'
with open(txt) as archivo:
lines = archivo.readlines()
Next, I add each of the IDs to a list:
# Add the IDs to a list
IDs = []
for i in lines:
IDs.append(i.rsplit())
#print(i.rsplit())
IDs
#[['1206924075374956547'],
# ['1210912199402819584'],
# ['1210643148998938625'],
# ['1207776839697129472'],
# ['1203627609759920128'],
# ['1205895318212136961'],
# ['1208145724879364100'], ...
Finally, start extracting the information you require as follows:
# Extract information from tweets
tweets_df2 = pd.DataFrame()
for i in IDs:
try:
info_tweet = api.get_status(i, tweet_mode="extended")
except:
pass
tweets_df2 = tweets_df2.append(pd.DataFrame({'ID': info_tweet.id,
'Tweet': info_tweet.full_text,
'Creado_tweet': info_tweet.created_at,
'Locacion_usuario': info_tweet.user.location,
'Seguidores_usuario': info_tweet.user.followers_count,
'Amigos_usuario': info_tweet.user.friends_count,
'Favoritos_usuario': info_tweet.user.favourites_count,
'Descripcion_usuario': info_tweet.user.description,
'Verificado_usuario': info_tweet.user.verified,
'Idioma': info_tweet.lang}, index=[0]))
tweets_df2 = tweets_df2.reset_index(drop=True)
tweets_df2
The following image is the output of the tweets_df2 variable, but I don't understand why the values are repeated over and over again. Does anyone know what's wrong with my code?
If you need the txt I provide you with the link of the drive. https://drive.google.com/file/d/1vyohQMpLqlKqm6b4iTItcVVqL2wBMXWp/view?usp=sharing
Thank you very much in advance for your time :3
Your code basically runs fine for me with a few adjustments. I noticed that your identation is not correct and your list of IDs is a list of lists (with just one element) rather than one list of elements.
Try this:
api = tweepy.Client(consumer_key=api_key,
consumer_secret=api_key_secret,
access_token=access_token,
access_token_secret=access_token_secret,
bearer_token=bearer_token,
wait_on_rate_limit=True,
)
auth = tweepy.OAuth1UserHandler(
api_key, api_key_secret, access_token, access_token_secret
)
api = tweepy.API(auth)
txt = "misocorpus-misogyny.txt"
with open(txt) as archivo:
lines = archivo.readlines()
IDs = []
for i in lines:
IDs.append(i.strip())# <-- use strip() to remove /n rather than split
tweets_df2 = pd.DataFrame()
for i in IDs:
try:
info_tweet = api.get_status(i, tweet_mode="extended")
except:
pass
tweets_df2 = tweets_df2.append(pd.DataFrame({'ID': info_tweet.id,
'Tweet': info_tweet.full_text,
'Creado_tweet': info_tweet.created_at,
'Locacion_usuario': info_tweet.user.location,
'Seguidores_usuario': info_tweet.user.followers_count,
'Amigos_usuario': info_tweet.user.friends_count,
'Favoritos_usuario': info_tweet.user.favourites_count,
'Descripcion_usuario': info_tweet.user.description,
'Verificado_usuario': info_tweet.user.verified,
'Idioma': info_tweet.lang}, index=[0]))
tweets_df2 = tweets_df2.reset_index(drop=True)
tweets_df2
Result:
ID Tweet Creado_tweet Locacion_usuario Seguidores_usuario Amigos_usuario Favoritos_usuario Descripcion_usuario Verificado_usuario Idioma
0 1206924075374956547 Las feminazis quieren por poco que este chico ... 2019-12-17 13:08:17+00:00 Argentina 1683 2709 28982 El Progresismo es un Cáncer que quiere destrui... False es
1 1210912199402819584 #CarlosVerareal #Galois2807 Los halagos con pi... 2019-12-28 13:15:40+00:00 Ecuador 398 1668 3123 Cuando te encuentres n una situación imposible... False es
2 1210643148998938625 #drummniatico No se vaya asustar! Ese es el gr... 2019-12-27 19:26:34+00:00 Samborondon - Ecuador 1901 1432 39508 Todo se alinea a nuestro favor. 💙💙💙💙 False es
3 1210643148998938625 #drummniatico No se vaya asustar! Ese es el gr... 2019-12-27 19:26:34+00:00 Samborondon - Ecuador 1901 1432 39508 Todo se alinea a nuestro favor. 💙💙💙💙 False es
4 1203627609759920128 Mostritas #Feminazi amenazando como ellas sabe... 2019-12-08 10:49:19+00:00 Lima, Perú 2505 3825 45087 Latam News Report. Regional and World affairs.... False
I am crawling several websites and extract the names of the products. In some names there are errors like this:
Malecon 12 Jahre 0,05 ltr.<br>Reserva Superior
Bols Watermelon Lik\u00f6r 0,7l
Hayman\u00b4s Sloe Gin
Ron Zacapa Edici\u00f3n Negra
Havana Club A\u00f1ejo Especial
Caol Ila 13 Jahre (G&M Discovery)
How can I fix that?
I am using xpath and re.search to get the names.
In every Python file, this is the first code: # -*- coding: utf-8 -*-
Edit:
This is the sourcecode, how I get the information.
if '"articleName":' in details:
closer_to_product = details.split('"articleName":', 1)[1]
closer_to_product_2 = closer_to_product.split('"imageTitle', 1)[0]
if debug_product == 1:
print('product before try:' + repr(closer_to_product_2))
try:
found_product = re.search(f'{'"'}(.*?)'f'{'",'}'closer_to_product_2).group(1)
except AttributeError:
found_product = ''
if debug_product == 1:
print('cleared product: ', '>>>' + repr(found_product) + '<<<')
if not found_product:
print(product_detail_page, found_product)
items['products'] = 'default'
else:
items['products'] = found_product
Details
product_details = information.xpath('/*').extract()
product_details = [details.strip() for details in product_details]
Where is a problem (Python 3.8.3)?
import html
strings = [
'Bols Watermelon Lik\u00f6r 0,7l',
'Hayman\u00b4s Sloe Gin',
'Ron Zacapa Edici\u00f3n Negra',
'Havana Club A\u00f1ejo Especial',
'Caol Ila 13 Jahre (G&M Discovery)',
'Old Pulteney \\u00b7 12 Years \\u00b7 40% vol',
'Killepitsch Kr\\u00e4uterlik\\u00f6r 42% 0,7 L']
for str in strings:
print( html.unescape(str).
encode('raw_unicode_escape').
decode('unicode_escape') )
Bols Watermelon Likör 0,7l
Hayman´s Sloe Gin
Ron Zacapa Edición Negra
Havana Club Añejo Especial
Caol Ila 13 Jahre (G&M Discovery)
Old Pulteney · 12 Years · 40% vol
Killepitsch Kräuterlikör 42% 0,7 L
Edit Use .encode('raw_unicode_escape').decode('unicode_escape') for doubled Reverse Solidi, see Python Specific Encodings
My list is formatted like:
gymnastics_school,participant_name,all-around_points_earned
I need to divide it up by schools but keep the scores.
import collections
def main():
names = ["gymnastics_school", "participant_name", "all_around_points_earned"]
Data = collections.namedtuple("Data", names)
data = []
with open('state_meet.txt','r') as f:
for line in f:
line = line.strip()
items = line.split(',')
items[2] = float(items[2])
data.append(Data(*items))
These are examples of how they're set up:
Lanier City Gymnastics,Ben W.,55.301
Lanier City Gymnastics,Alex W.,54.801
Lanier City Gymnastics,Sky T.,51.2
Lanier City Gymnastics,William G.,47.3
Carrollton Boys,Cameron M.,61.6
Carrollton Boys,Zachary W.,58.7
Carrollton Boys,Samuel B.,58.6
La Fayette Boys,Nate S.,63
La Fayette Boys,Kaden C.,62
La Fayette Boys,Cohan S.,59.1
La Fayette Boys,Cooper J.,56.101
La Fayette Boys,Avi F.,53.401
La Fayette Boys,Frederic T.,53.201
Columbus,Noah B.,50.3
Savannah Metro,Levi B.,52.801
Savannah Metro,Taylan T.,52
Savannah Metro,Jacob S.,51.5
SAAB Gymnastics,Dawson B.,58.1
SAAB Gymnastics,Dean S.,57.901
SAAB Gymnastics,William L.,57.101
SAAB Gymnastics,Lex L.,52.501
Suwanee Gymnastics,Colin K.,57.3
Suwanee Gymnastics,Matthew B.,53.201
After processing it should look like:
Lanier City Gymnastics:participants(4)
as it own list
Carrollton Boys(3)
as it own list
La Fayette Boys(6)
etc.
I would recommend putting them in dictionaries:
data = {}
with open('state_meet.txt','r') as f:
for line in f:
line = line.strip()
items = line.split(',')
items[2] = float(items[2])
if items[0] in data:
data[items[0]].append(items[1:])
else:
data[items[0]] = [items[1:]]
Then access schools could be done in the following way:
>>> data['Lanier City Gymnastics']
[['Ben W.',55.301],['Alex W.',54.801],['Sky T'.,51.2],['William G.',47.3]
EDIT:
Assuming you need the whole dataset as a list first, then you want to divide it into smaller lists you can generate the dictionary from the list:
data = []
with open('state_meet.txt','r') as f:
for line in f:
line = line.strip()
items = line.split(',')
items[2] = float(items[2])
data.append(items)
#perform median or other operation on your data
nested_data = {}
for items in data:
if items[0] in data:
data[items[0]].append(items[1:])
else:
data[items[0]] = [items[1:]]
nested_data[item[0]]
When you need to get a subset of a list you can use slicing:
mylist[start:stop:step]
where start, stop and step are optional (see link for more comprehensive introduction)
names = [
'LIC. SEBASTIÁN LASTIRI',
'ING. AGR. ROBERTO DANIEL RODRÍGUEZ',
'C.P.N. JULIO DOMINGO BURAK',
'INGENIERO HIDRÁULICO VÍCTOR AGUSTÍN PORRINO'
]
I have such list with names, i need to remove prefix like ('lic', 'c.p.n' etc) from name (this is just sample there is a lot of prefixes in such format)
output shell be like this :
'SEBASTIÁN LASTIRI'
I have tried to :
for i in names:
if '.' in i:
i.split('.')[1]
But it works only when there is one dot in prefix
How to solve this
Here is the solution for your issue:
import re
names = [
'LIC. SEBASTIÁN LASTIRI',
'ING. AGR. ROBERTO DANIEL RODRÍGUEZ',
'C.P.N. JULIO DOMINGO BURAK',
'INGENIERO HIDRÁULICO VÍCTOR AGUSTÍN PORRINO'
]
new_names = [re.sub("^\s+", "", i.split(".")[-1]) for i in names]
print new_names # [SEBASTIÁN LASTIRI', ROBERTO DANIEL RODRÍGUEZ', JULIO DOMINGO BURAK', 'INGENIERO HIDRÁULICO VÍCTOR AGUSTÍN PORRINO']
You can use the following code:
import re
names = [
'LIC. SEBASTIAN LASTIRI',
'ING. AGR. ROBERTO DANIEL RODRIGUEZ',
'C.P.N. JULIO DOMINGO BURAK',
'INGENIERO HIDRAULICO VICTOR AGUSTIN PORRINO'
]
for i in names:
res = re.split(r'\.\s*(?=[^.]+$)', i)
if len(res) > 1:
print res[1]
else:
print res[0]
Output:
SEBASTIAN LASTIRI
ROBERTO DANIEL RODRIGUEZ
JULIO DOMINGO BURAK
INGENIERO HIDRAULICO VICTOR AGUSTIN PORRINO
A simple filter to only use words without a dot in the end.
names = [
'LIC. SEBASTIÁN LASTIRI',
'ING. AGR. ROBERTO DANIEL RODRÍGUEZ',
'C.P.N. JULIO DOMINGO BURAK',
'INGENIERO HIDRÁULICO VÍCTOR AGUSTÍN PORRINO'
]
names_formated = [' '.join([sub for sub in name.split() if sub[-1] != '.']) for name in names]
In the below case i want to match string "Singapore" where "S" should always be capital and rest of the words may be in lower or in uppercase. but in the below string "s" is in lower case and it gets matched in search condition. can any body let me know how to implement this?
import re
st = "Information in sinGapore "
if re.search("S""(?i)(ingapore)" , st):
print "matched"
Singapore => matched
sIngapore => notmatched
SinGapore => matched
SINGAPORE => matched
As commented, the Ugly way would be:
>>> re.search("S[iI][Nn][Gg][Aa][Pp][Oo][Rr][Ee]" , "SingaPore")
<_sre.SRE_Match object at 0x10cea84a8>
>>> re.search("S[iI][Nn][Gg][Aa][Pp][Oo][Rr][Ee]" , "Information in sinGapore")
The more elegant way would be matching Singapore case-insensitive, and then checking that the first letter is S:
reg=re.compile("singapore", re.I)
>>> s="Information in sinGapore"
>>> reg.search(s) and reg.search(s).group()[0]=='S'
False
>>> s="Information in SinGapore"
>>> reg.search(s) and reg.search(s).group()[0]=='S'
True
Update
Following your comment - you could use:
reg.search(s).group().startswith("S")
Instead of:
reg.search(s).group()[0]==("S")
If it seems more readable.
Since you want to set a GV code according to the catched phrase (unique name or several name blank separated, I know that), there must be a step in which the code is choosen in a dictionary according to the catched phrase.
So it's easy to make a profit of this step to perform the test on the first letter (must be uppercased) or the first name in the phrase that no regex is capable of.
I choosed certain conditions to constitute the test. For example, a dot in a first name is not mandatory, but uppercased letters are. These conditions will be easily changed at need.
EDIT 1
import re
def regexize(cntry):
def doot(x):
return '\.?'.join(ch for ch in x) + '\.?'
to_join = []
for c in cntry:
cspl = c.split(' ',1)
if len(cspl)==1: # 'Singapore','Austria',...
to_join.append('(%s)%s'
% (doot(c[0]), doot(c[1:])))
else: # 'Den LMM','LMM Den',....
to_join.append('(%s) +%s'
% (doot(cspl[0]),
doot(cspl[1].strip(' ').lower())))
pattern = '|'.join(to_join).join('()')
return re.compile(pattern,re.I)
def code(X,CNTR,r = regexize):
r = regexize(CNTR)
for ma in r.finditer(X):
beg = ma.group(1).split(' ')[0]
if beg==ma.group(1):
GV = countries[beg[0]+beg[1:].replace('.','').lower()] \
if beg[0].upper()==beg[0] else '- bad match -'
else:
try:
k = (ki for ki in countries.iterkeys()
if beg.replace('.','')==ki.split(' ')[0]).next()
GV = countries[k]
except StopIteration:
GV = '- bad match -'
yield ' {!s:15} {!s:^13}'.format(ma.group(1), GV)
countries = {'Singapore':'SG','Austria':'AU',
'Swiss':'CH','Chile':'CL',
'Den LMM':'DN','LMM Den':'LM'}
s = (' Singapore SIngapore SiNgapore SinGapore'
' SI.Ngapore SIngaPore SinGaporE SinGAPore'
' SINGaporE SiNg.aPoR singapore sIngapore'
' siNgapore sinGapore sINgap.ore sIngaPore'
' sinGaporE sinGAPore sINGaporE siNgaPoRe'
' Austria Aus.trIA aUSTria AUSTRiA'
' Den L.M.M Den Lm.M DEn Lm.M.'
' DEN L.MM De.n L.M.M. Den LmM'
' L.MM DEn LMM DeN LM.m Den')
print '\n'
print '\n'.join(res for res in code(s,countries))
EDIT 2
I improved the code. It's shorter and more readable.
The instruction assert(.....] is to verify that the keys of the dictionaru are well formed for the purpose.
import re
def doot(x):
return '\.?'.join(ch for ch in x) + '\.?'
def regexize(labels,doot=doot,
wg2 = '(%s) *( %s)',wnog2 = '(%s)(%s)',
ri = re.compile('(.(?!.*? )|[^ ]+)( ?) *(.+\Z)')):
to_join = []
modlabs = {}
for K in labels.iterkeys():
g1,g2,g3 = ri.match(K).groups()
to_join.append((wg2 if g2 else wnog2)
% (doot(g1), doot(g3.lower())))
modlabs[g1+g2+g3.lower()] = labels[K]
return (re.compile('|'.join(to_join), re.I), modlabs)
def code(X,labels,regexize = regexize):
reglab,modlabs = regexize(labels)
for ma in reglab.finditer(X):
a,b = tuple(x for x in ma.groups() if x)
k = (a + b.lower()).replace('.','')
GV = modlabs[k] if k in modlabs else '- bad match -'
yield ' {!s:15} {!s:^13}'.format(a+b, GV)
countries = {'Singapore':'SG','Austria':'AU',
'Swiss':'CH','Chile':'CL',
'Den LMM':'DN','LMM Den':'LM'}
assert(all('.' not in k and
(k.count(' ')==1 or k[0].upper()==k[0])
for k in countries))
s = (' Singapore SIngapore SiNgapore SinGapore'
' SI.Ngapore SIngaPore SinGaporE SinGAPore'
' SINGaporE SiNg.aPoR singapore sIngapore'
' siNgapore sinGapore sINgap.ore sIngaPore'
' sinGaporE sinGAPore sINGaporE siNgaPoRe'
' Austria Aus.trIA aUSTria AUSTRiA'
' Den L.M.M Den Lm.M DEn Lm.M.'
' DEN L.MM De.n L.M.M. Den LmM'
' L.MM DEn LMM DeN LM.m Den')
print '\n'.join(res for res in code(s,countries))
You could write a simple lambda to generate the ugly-but-all-re-solution:
>>> leading_cap_re = lambda s: s[0].upper() + ''.join('[%s%s]' %
(c.upper(),c.lower())
for c in s[1:])
>>> leading_cap_re("Singapore")
'S[Ii][Nn][Gg][Aa][Pp][Oo][Rr][Ee]'
For multi-word cities, define a string-splitting version:
>>> leading_caps_re = lambda s : r'\s+'.join(map(leading_cap_re,s.split()))
>>> print leading_caps_re('Kuala Lumpur')
K[Uu][Aa][Ll][Aa]\s+L[Uu][Mm][Pp][Uu][Rr]
Then your code could just be:
if re.search(leading_caps_re("Singapore") , st):
...etc...
and the ugliness of the RE would be purely internal.
interestingly
/((S)((?i)ingapore))/
Does the right thing in perl but doesn't seem to work as needed in python. To be fair the python docs spell it out clearly, (?i) alters the whole regexp
This is the BEST answer:
(?-i:S)(?i)ingapore
ClickHere for proof: