tuple index out of range for regexp_replace - pyspark-sql - python

SELECT url,
regexp_replace(title, '(http|ftp|file|https)://[-a-z0-9+&##/\%?=~_-|!:,.;/]*|\<.*?\>|(=+)\s*(.*?)\s*(=+)|&\w+;', '') AS text_body
FROM df_table_doc
0 https://demo.com New Arch {Onboarding}..Lets (Onboard) it..
1 https://example.com New Arch (Onboarding)
Adding the pattern \{.*?\} to replace anything within {} fails with:
IndexError: tuple index out of range
IndexError Traceback (most recent call last)
<ipython-input-1-20460659c049> in <module>
----> 1 get_ipython().run_cell_magic('spark_sql', '--limit 200', "select url, regexp_replace(title, '(http|ftp|file|https)://[-a-z0-9+&##/\\%?=~_-|!:,.;/]*|\\<.*?\\>|\\{.*?\\}|(=+)\\s*(.*?)\\s*(=+)|&\\w+;', '') as text_body\n from df_table_doc\n")
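The traceback shows the query passing through the spark_sql cell magic rather than going straight to Spark. One plausible cause, assuming the magic runs the cell text through Python's str.format before submitting it (an assumption; nothing above confirms this), is that {.*?} is read as a replacement field with no arguments to fill it. The sketch below reproduces that failure in plain Python and shows that doubled braces survive formatting; the sample title is taken from the data above.

```python
import re

pattern = r"\{.*?\}"  # the fragment added to the query

# str.format treats {...} as a replacement field; with no arguments,
# looking up positional index 0 raises IndexError.
try:
    pattern.format()
    format_failed = False
except IndexError:
    format_failed = True

# Doubling the braces escapes them, so formatting yields the original regex.
escaped = r"\{{.*?\}}"
restored = escaped.format()

# The regex itself is fine: it strips anything inside braces.
title = "New Arch {Onboarding}..Lets (Onboard) it.."
cleaned = re.sub(restored, "", title)
print(format_failed, restored, cleaned)
```

If the assumption holds, writing the fragment as \{{.*?\}} inside the magic cell may be enough to get the query through.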

Related

sequence item 0: expected str instance, tuple found(2)

I analyzed the data from the previous step and tried to apply topic modeling. Here is the
code I am using:
Judging by the error, I think the join expects strings but found a tuple
instead. I don't know how to fix this part.
class FacebookAccessException(Exception): pass
def get_profile(request, token=None):
    ...
    response = json.loads(urllib_response)
    if 'error' in response:
        raise FacebookAccessException(response['error']['message'])
    access_token = response['access_token'][-1]
    return access_token
#Join the review
word_list = ",".join([",".join(i) for i in sexualhomicide['tokens']])
word_list = word_list.split(",")
This is the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\Users\Public\Documents\ESTsoft\CreatorTemp\ipykernel_13792\3474859476.py in <module>
1 #Join the review
----> 2 word_list = ",".join([",".join(i) for i in sexualhomicide['tokens']])
3 word_list = word_list.split(",")
C:\Users\Public\Documents\ESTsoft\CreatorTemp\ipykernel_13792\3474859476.py in <listcomp>(.0)
1 #Join the review
----> 2 word_list = ",".join([",".join(i) for i in sexualhomicide['tokens']])
3 word_list = word_list.split(",")
TypeError: sequence item 0: expected str instance, tuple found
This is the code that prints 'sexualhomicide':
print(sexualhomicide['cleaned_text'])
print("="*30)
print(twitter.pos(sexualhomicide['cleaned_text'][0],Counter('word')))
I can't upload the output of this code. The upload fails because the output is classified as spam.
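The TypeError says that the inner ",".join(i) received tuples rather than strings. A minimal sketch of one fix, assuming each entry of sexualhomicide['tokens'] is a list of (word, tag) tuples like those a POS tagger such as twitter.pos returns; the sample data is made up for illustration:

```python
# Hypothetical stand-in for sexualhomicide['tokens']: one list of
# (word, tag) tuples per document.
tokens = [
    [("police", "Noun"), ("arrest", "Verb")],
    [("court", "Noun")],
]

# Keep only the word from each tuple, flattening all documents into one list.
word_list = [word for doc in tokens for word, _tag in doc]
print(word_list)
```

Building the list directly also avoids the join-then-split round trip, which would break on any token that itself contains a comma.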

Trying to build a DataFrame from an API, but the function raises a NameError

import requests # get connection
import pandas as pd
import json
def get_info(data):
    data=[]
    source=[]
    published_date=[]
    adx_keywords=[]
    byline=[]
    title=[]
    abstract=[]
    des_facet=[]
    per_facet=[]
    media=[]
    Api_Key=''
    url='https://api.nytimes.com/svc/mostpopular/v2/viewed/7.json?api-key=' # key redacted
    response=requests.get(url).json()
    for i in response['results']:
        source.append(i['source'])
        published_date.append(i['published_date'])
        adx_keywords.append(i['adx_keywords'])
        byline.append(i['byline'])
        title.append(i['title'])
        abstract.append(i['abstract'])
        des_facet.append(i['des_facet'])
        per_facet.append(i['per_facet'])
        media.append(i['media'])
    data=data.append({'source':source,'published_date':published_date,'adx_keywords':adx_keywords,byline':byline, 'title':title,'abstract':abstract,'des_facet':des_facet,
    'per_facet':per_facet,'media':media})
    df=df.append(d)
    return df
df
NameError Traceback (most recent call last)
<ipython-input-292-00cf07b74dcd> in <module>()
----> 1 df
NameError: name 'df' is not defined
Your quotes are in the wrong place: the opening quote before byline is missing.
before:
data=data.append({'source':source,'published_date':published_date,'adx_keywords':adx_keywords,byline':byline, 'title':title,'abstract':abstract,'des_facet':des_facet,
'per_facet':per_facet,'media':media})
after:
data=data.append({'source':source,'published_date':published_date,'adx_keywords':adx_keywords,'byline':byline, 'title':title, 'abstract':abstract,'des_facet':des_facet,
'per_facet':per_facet,'media':media})
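Fixing the quote removes the syntax error, but the question's function has other problems. list.append returns None, so data=data.append(...) discards the list. Also, df is only ever a local name inside get_info, so evaluating df at the top level raises the NameError shown. A minimal sketch of one alternative, assuming the response has the shape used in the question (a 'results' list of dicts); build_records and the sample payload are hypothetical and only for illustration:

```python
FIELDS = ["source", "published_date", "adx_keywords", "byline", "title",
          "abstract", "des_facet", "per_facet", "media"]

def build_records(response):
    """Return one dict per article, keeping only the fields of interest."""
    return [{field: item.get(field) for field in FIELDS}
            for item in response["results"]]

# Made-up payload standing in for requests.get(url).json()
sample = {"results": [{"source": "New York Times", "title": "Example",
                       "byline": "By A. Writer"}]}
records = build_records(sample)
print(records[0]["title"])
```

Passing the list to pd.DataFrame(records) then gives one row per article, and assigning the result to df at the top level makes that name available outside the function.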

Looping a Bloomberg function over a list of tickers

I would like to loop a Bloomberg IntraDayBar request over a dynamic list of 22 tickers and then combine the results into one DataFrame.
This code generates the list of tickers:
bquery = blp.BlpQuery().start()
dates = pd.bdate_range(end='today', periods=31)
time = datetime.datetime.now()
bcom_info = bquery.bds("BCOM Index", "INDX_MEMBERS")
bcom_info['ticker'] = bcom_info['Member Ticker and Exchange Code'].astype(str) + ' Comdty'
I would like to create a DataFrame that returns the volume for each ticker, which is contained in the 'TRADE' event_type, effectively looping the code below over each of the tickers in bcom_info.
bquery.bdib(bcom_info['ticker'], event_type='TRADE', interval=60, start_datetime=dates[0], end_datetime=time)
I tried this but couldn't get it to work:
def bloom_func(x, func):
    bloomberg = bquery
    return bquery.bdib(x, func, event_type='TRADE', interval=60, start_datetime=dates[0], end_datetime=time)
for d in bcom_info['ticker']:
    x[d] = bloom_func(d)
It generates the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-80-fcbf4acd6840> in <module>
2
3 for d in tickers:
----> 4 x[d] = bloom_func(d)
TypeError: bloom_func() missing 1 required positional argument: 'func'
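The TypeError arises because bloom_func is declared with two positional parameters (x, func) but is called with only the ticker. A minimal sketch of the loop, assuming the request needs only the ticker; fetch_bars is a hypothetical stand-in for the bquery.bdib call so the example runs without a Bloomberg session, and the tickers are made up:

```python
def fetch_bars(ticker):
    # Placeholder for:
    # bquery.bdib(ticker, event_type='TRADE', interval=60,
    #             start_datetime=dates[0], end_datetime=time)
    return [{"ticker": ticker, "volume": 0}]

def collect(tickers, fetch=fetch_bars):
    """Call fetch once per ticker and gather the results in a dict."""
    results = {}  # created up front so results[t] = ... has a target
    for t in tickers:
        results[t] = fetch(t)
    return results

bars = collect(["CLA Comdty", "GCA Comdty"])
print(list(bars))
```

If each call returns a DataFrame, pd.concat(bars) stacks the dict's values into one frame with the ticker as the outer index level.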

My script doesn't scrape all of Yelp's restaurants

My script stops scraping after the 449th Yelp restaurant.
Entire Code: https://pastebin.com/5U3irKZp
for idx, item in enumerate(yelp_containers, 1):
    print("--- Restaurant number #", idx)
    restaurant_title = item.h3.get_text(strip=True)
    restaurant_title = re.sub(r'^[\d.\s]+', '', restaurant_title)
    restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[1]
The error I am getting is:
Traceback (most recent call last):
File "/Users/kenny/MEGA/Python/yelp scraper.py", line 41, in <module>
restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[1]
IndexError: list index out of range
The problem is that some restaurants are missing the address, for example this one:
What you should do is first check whether the address list has enough elements before indexing it. Change this line of code:
restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[1]
to these:
restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')
restaurant_address = restaurant_address[1] if len(restaurant_address) > 1 else restaurant_address[0]
I ran your parser for all pages and it worked.

Appending traceback.format_exc() to a list is adding "\" to single quotes (')

I am assigning the stack trace from traceback.format_exc() to a list as shown below. The strange thing I notice is that after appending, all the single quotes (') get escaped (\'), as can be seen in the output below.
I searched Google (https://github.com/behave/behave/issues/336) and tried assigning (traceback.format_exc(), sys.getfilesystemencoding()), which didn't work either. I am very curious why this is happening and how to fix it.
import traceback
clonedRadarsdetailslist = []
clonedRadardetails = {}
try:
#raise
(updateproblemoutput,updateproblempassfail) = r.UpdateProblem(problemID=newRadarID, componentName=componentName, componentVersion=componentVersion,assigneeID=assignee,state=state,substate=substate,milestone=milestone, category=category,priority=priority,resolution=re_solution )
except:
clonedRadardetails['updatedFailedReason'] = traceback.format_exc()
clonedRadarsdetailslist.append(clonedRadardetails)
print clonedRadarsdetailslist
Output:
['{\'clonedRadar\': 40171867, \'clonedStatus\': \'PASS\', \'clonedRadarFinalStatus\': \'PASS\', \'updatedFailedReason\': \'Traceback (most recent call last):\\n File "./cloneradar.py", line 174, in clone\\n (updatetitleoutput,updatetitlepassfail) = r.UpdateProble(problemID=newRadarID,title=title )\\nAttributeError: \\\'RadarWS\\\' object has no attribute \\\'UpdateProble\\\'\\n\', \'clonedRadarFinalStatusReason\': \'N/A\', \'updateStatus\': \'FAIL\', \'clonedStatusfailReason\': \'N/A\'}', '{\'clonedRadar\': 40171867, \'clonedStatus\': \'PASS\', \'clonedRadarFinalStatus\': \'PASS\', \'updatedFailedReason\': \'Traceback (most recent call last):\\n File "./cloneradar.py", line 174, in clone\\n (updatetitleoutput,updatetitlepassfail) = r.UpdateProble(problemID=newRadarID,title=title )\\nAttributeError: \\\'RadarWS\\\' object has no attribute \\\'UpdateProble\\\'\\n\', \'clonedRadarFinalStatusReason\': \'N/A\', \'updateStatus\': \'FAIL\', \'clonedStatusfailReason\': \'N/A\'}']
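A likely explanation, based on how Python displays containers: printing a list shows the repr() of each element, and the repr of a string that contains both quote types escapes the single quotes and the newlines. The strings themselves are unchanged. The entries in the output above also look like dicts that were converted to strings before being appended, which would add a second layer of escaping. A small sketch in Python 3 syntax, with made-up sample text:

```python
text = 'File "./x.py"\nAttributeError: \'RadarWS\' object'
items = [text]

as_list = str(items)     # what print(items) displays: repr of each element
as_item = str(items[0])  # what print(items[0]) displays: the raw text

print(as_list)
print(as_item)
```

Printing the elements one at a time, or serializing the list with json.dumps, shows the text without the extra backslashes on the single quotes.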
