rdd.first() does not give an error but rdd.collect() does - python

I am working in pyspark and have the following code, where I am processing tweet and making an RDD with the user_id and text. Below is the code
"""
# Construct an RDD of (user_id, text) here.
"""
import json
def safe_parse(raw_json):
try:
json_object = json.loads(raw_json)
if 'created_at' in json_object:
return json_object
else:
return;
except ValueError as error:
return;
def get_usr_txt (line):
tmp = safe_parse (line)
return ((tmp.get('user').get('id_str'),tmp.get('text')));
usr_txt = text_file.map(lambda line: get_usr_txt(line))
print (usr_txt.take(5))
and the output looks okay (as shown below)
[('470520068', "I'm voting 4 #BernieSanders bc he doesn't ride a CAPITALIST PIG adorned w/ #GoldmanSachs $. SYSTEM RIGGED CLASS WAR "), ('2176120173', "RT #TrumpNewMedia: .#realDonaldTrump #America get out & #VoteTrump if you don't #VoteTrump NOTHING will change it's that simple!\n#Trump htt…"), ('145087572', 'RT #Libertea2012: RT TODAY: #Colorado’s leading progressive voices to endorse #BernieSanders! #Denver 11AM - 1PM in MST CO State Capitol…'), ('23047147', '[VID] Liberal Tears Pour After Bernie Supporter Had To Deal With Trump Fans '), ('526506000', 'RT #justinamash: .#tedcruz is the only remaining candidate I trust to take on what he correctly calls the Washington Cartel. ')]
However, as soon as I do
print (usr_txt.count())
I get an error like below
Py4JJavaError Traceback (most recent call last)
<ipython-input-60-9dacaf2d41b5> in <module>()
8 usr_txt = text_file.map(lambda line: get_usr_txt(line))
9 #print (usr_txt.take(5))
---> 10 print (usr_txt.count())
11
/usr/local/spark/python/pyspark/rdd.py in count(self)
1054 3
1055 """
-> 1056 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
1057
1058 def stats(self):
What am I missing? Is the RDD not created properly? or there is something else? how do I fix it?

You have returned None from safe_parse method when there is no created_at element in the parsed json line or when there is an error in parsing. This created error while getting elements from the parsed jsons in (tmp.get('user').get('id_str'),tmp.get('text')). That caused the error to occur
The solution is to check for None in get_usr_txt method
def get_usr_txt (line):
tmp = safe_parse(line)
if(tmp != None):
return ((tmp.get('user').get('id_str'),tmp.get('text')));
Now the question is why print (usr_txt.take(5)) showed the result and print (usr_txt.count()) caused the error
Thats because usr_txt.take(5) considered only the first five rdds and not the rest and didn't have to deal with None datatype.

Related

sequence item 0: expected str instance, tuple found(2)

I analyzed the data in the precedent and tried to use topic modeling. Here is a
syntax I am using:
According to the error, I think it means that the string should go in when
joining, but the tuple was found. I don't know how to fix this part.
class FacebookAccessException(Exception): pass
def get_profile(request, token=None):
...
response = json.loads(urllib_response)
if 'error' in response:
raise FacebookAccessException(response['error']['message'])
access_token = response['access_token'][-1]
return access_token
#Join the review
word_list = ",".join([",".join(i) for i in sexualhomicide['tokens']])
word_list = word_list.split(",")
This is Error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\Users\Public\Documents\ESTsoft\CreatorTemp\ipykernel_13792\3474859476.py in <module>
1 #Join the review
----> 2 word_list = ",".join([",".join(i) for i in sexualhomicide['tokens']])
3 word_list = word_list.split(",")
C:\Users\Public\Documents\ESTsoft\CreatorTemp\ipykernel_13792\3474859476.py in <listcomp>(.0)
1 #Join the review
----> 2 word_list = ",".join([",".join(i) for i in sexualhomicide['tokens']])
3 word_list = word_list.split(",")
TypeError: sequence item 0: expected str instance, tuple found
This is print of 'sexual homicide'
print(sexualhomicide['cleaned_text'])
print("="*30)
print(twitter.pos(sexualhomicide['cleaned_text'][0],Counter('word')))
I can't upload the results of this syntax. Error occurs because it is classified as spam during the upload process.

How can i get my files to be opened?

Hi there im working on a function that merges two separate .txt files and outputs a personalized letter. The problem is, is that i can include my text within the funciton module and it works perfectly. But when i try to open them in the function and to be used by the function i get this
error message:
Traceback (most recent call last):
File "/Users/nathandavis9752/CP104/davi0030_a10/src/q2_function.py", line 25, in
data = cleanData(q2)
File "/Users/nathandavis9752/CP104/davi0030_a10/src/q2_function.py", line 17, in cleanData
return [item.strip().split('\n\n') for item in query.split('--')]
AttributeError: 'file' object has no attribute 'split'
code:
letter = open('letter.txt', 'r')
q2 = open('q2.txt', 'r')
def cleanData(query):
return [item.strip().split('\n\n') for item in query.split('--')]
def writeLetter(template, variables, replacements):
# replace ith variable with ith replacement variable
for i in range(len(variables)):
template = template.replace(variables[i], replacements[i])
return template
data = cleanData(q2)
print (data)
variables = ['[fname]', '[lname]', '[street]', '[city]']
letters = [writeLetter(letter, variables, person) for person in data]
for i in letters:
print (i)
q2.txt file:
Michael
dawn
lock hart ln
Dublin
--
kate
Nan
webster st
king city
--
raj
zakjg
late Road
Toronto
--
dave
porter
Rock Ave
nobleton
letter.txt file:
[fname] [lname]
[street]
[city]
Dear [fname]:
As a fellow citizen of [city], you and all your neighbours
on [street] are invited to a celebration this Saturday at
[city]'s Central Park. Bring beer and food!
You are trying to split a file buffer rather than a string.
def cleanData(query):
return [item.strip().split('\n\n') for item in query.read().split('--')]

Not iterating through whole dictionary

So basically, I have an api from which i have several dictionaries/arrays. (http://dev.c0l.in:5984/income_statements/_all_docs)
When getting the financial information for each company from the api (e.g. sector = technology and statement = income) python is supposed to return 614 technology companies, however i get this error:
Traceback (most recent call last):
File "C:\Users\samuel\Desktop\Python Project\Mastercopy.py", line 83, in <module>
user_input1()
File "C:\Users\samuel\Desktop\Python Project\Mastercopy.py", line 75, in user_input1
income_statement_fn()
File "C:\Users\samuel\Desktop\Python Project\Mastercopy.py", line 51, in income_statement_fn
if is_response ['sector'] == user_input3:
KeyError: 'sector'
on a random company (usually on one of the 550-600th ones)
Here is the function for income statements
def income_statement_fn():
user_input3 = raw_input("Which sector would you like to iterate through in Income Statement?: ")
print 'Starting...'
for item in income_response['rows']:
is_url = "http://dev.c0l.in:5984/income_statements/" + item['id']
is_request = urllib2.urlopen(is_url).read()
is_response = json.loads(is_request)
if is_response ['sector'] == user_input3:
csv.writerow([
is_response['company']['name'],
is_response['company']['sales'],
is_response['company']['opening_stock'],
is_response['company']['purchases'],
is_response['company']['closing_stock'],
is_response['company']['expenses'],
is_response['company']['interest_payable'],
is_response['company']['interest_receivable']])
print 'loading...'
print 'done!'
print end - start
Any idea what could be causing this error?
(I don't believe that it is the api itself)
Cheers
Well, on testing the url you pass in the urlopen call, with a random number, I got this:
{"error":"not_found","reason":"missing"}
In that case, your function will return exactly the error you get. If you want your program to handle the error nicely and add a "missing" line instead of actual data, you could do that for instance:
def income_statement_fn():
user_input3 = raw_input("Which sector would you like to iterate through in Income Statement?: ")
print 'Starting...'
for item in income_response['rows']:
is_url = "http://dev.c0l.in:5984/income_statements/" + item['id']
is_request = urllib2.urlopen(is_url).read()
is_response = json.loads(is_request)
if is_response.get('sector', False) == user_input3:
csv.writerow([
is_response['company']['name'],
is_response['company']['sales'],
is_response['company']['opening_stock'],
is_response['company']['purchases'],
is_response['company']['closing_stock'],
is_response['company']['expenses'],
is_response['company']['interest_payable'],
is_response['company']['interest_receivable']])
print 'loading...'
else:
csv.writerow(['missing data'])
print 'done!'
print end - start
The problem seems to be with the final row of your income_response data
{"id":"_design/auth","key":"_design/auth","value":{"rev":"1-3d8f282ec7c26779194caf1d62114dc7"}}
This does not have a sector value. You need to alter your code to handle this line, for example by ignoring any line where the sector key is not present.
You could easily have debugged this with a few print statements - for example insert
print item['id'], is_response.get('sector', None)
into your code before the part that outputs the CSV.
A KeyError means that the key you tried to use does not exist in the dictionary. When checking for a key, it is much safer to use .get(). So you would replace this line:
if is_response['sector'] == user_input3:
With this:
if is_response.get('sector') == user_input3:

IndexError: list index out of range [python irc bot]

For my IRC bot when I try to use the `db addcol command I will get that Index error but I got no idea what is wrong with it.
#hook.command(adminonly=True, autohelp=False)
def db(inp,db=None):
split = inp.split(' ')
action = split[0]
if "init" in action:
result = db.execute("create table if not exists users(nick primary key, host, location, greeting, lastfm, fines, battlestation, desktop, horoscope, version)")
db.commit()
return result
elif "addcol" in action:
table = split[1]
col = split[2]
if table is not None and col is not None:
db.execute("ALTER TABLE {} ADD COLUMN {}".format(table,col))
db.commit
return "Added Column"
That is the command I am trying to execute and here is the error:
Unhandled exeption in thread started by <function run at 0xb70d6844
Traceback (most recent call last):
File "core/main.py", line 68, in run
out = func(input.inp, **kw)
File "plugins/core_admin_global.py", :ine 308, in db
col = split[2]
IndexError: list index out of range
You will find the whole code at my GIT repository.
Edit: This bot is just a little thing I have been playing around with while learning python so don't expect me to be too knowledgeable about this.
Yet another edit:
The command I am trying to add, just replace desktop with mom.
#hook.command(autohelp=False)
def desktop(inp, nick=None, conn=None, chan=None,db=None, notice=None):
"desktop http://url.to/desktop | # nick -- Shows a users Desktop."
if inp:
if "http" in inp:
database.set(db,'users','desktop',inp.strip(),'nick',nick)
notice("Saved your desktop.")
return
elif 'del' in inp:
database.set(db,'users','desktop','','nick',nick)
notice("Deleted your desktop.")
return
else:
if '#' in inp: nick = inp.split('#')[1].strip()
else: nick = inp.strip()
result = database.get(db,'users','desktop','nick',nick)
if result:
return '{}: {}'.format(nick,result)
else:
if not '#' in inp: notice(desktop.__doc__)
return 'No desktop saved for {}.'.format(nick)

NodeBox error for a verb in python

I downloaded the package http://nodebox.net/code/index.php/Linguistics#verb_conjugation
I'm getting an error even when I tried to get a tense of a verb .
import en
print en.is_verb('use')
#prints TRUE
print en.verb.tense('use')
KeyError Traceback (most recent call last)
/home/cse/version2_tense.py in <module>()
----> 1
2
3
4
5
/home/cse/en/__init__.pyc in tense(self, word)
124
125 def tense(self, word):
--> 126 return verb_lib.verb_tense(word)
127
128 def is_tense(self, word, tense, negated=False):
/home/cse/en/verb/__init__.pyc in verb_tense(v)
175
176 infinitive = verb_infinitive(v)
--> 177 a = verb_tenses[infinitive]
178 for tense in verb_tenses_keys:
179 if a[verb_tenses_keys[tense]] == v:
KeyError: ''
The reason you are getting this error is because there is a mistake in the ~/Library/Application Support/NodeBox/en/verb/verb.txt file they are using to create the dictionary.
use is the infinitive form, however, "used" is entered as the infinitive.
at line 5857:
used,,,uses,,using,,,,,used,used,,,,,,,,,,,,
should be:
use,,,uses,,using,,,,,used,used,,,,,,,,,,,,
after editing and saving the file:
import en
print en.is_verb("use")
print en.verb.infinitive('use')
print en.verb.tense('use')
gives:
True
use
infinitive
extra:
import en
print 'use %s' % en.verb.tense("use")
print 'uses %s' % en.verb.tense("uses")
print 'using %s' % en.verb.tense('using')
print 'used %s' % en.verb.tense('used')
use infinitive
uses 3rd singular present
using present participle
used past

Categories

Resources