Extract tweets from a text file (Python)

Sorry, I am just trying to store 'id_str' from each tweet in a new list called ids, but I'm getting the following error:
Traceback (most recent call last):
File "extract_tweet.py", line 17, in
print tweet['id_str']
KeyError: 'id_str'
My code is:
import json
import sys

if __name__ == '__main__':
    tweets = []
    for line in open(sys.argv[1]):
        try:
            tweets.append(json.loads(line))
        except:
            pass

    ids = []
    for tweet in tweets:
        ids.append(tweet['id_str'])

The JSON data for a tweet is sometimes missing fields. Try something like this:
ids = []
for tweet in tweets:
    if 'id_str' in tweet:
        ids.append(tweet['id_str'])
or equivalently,
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
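Folded back into your script, the whole thing might look like this (a sketch that keeps the Python 2 style and the bare except of your original, with only the ids loop changed):

import json
import sys

if __name__ == '__main__':
    tweets = []
    for line in open(sys.argv[1]):
        try:
            tweets.append(json.loads(line))
        except:
            # skip lines that are not valid JSON
            pass

    # keep only the tweets that actually carry an 'id_str' field
    ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]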

import json

tweets = []
tweets.append(
    json.loads('{"a": 1}')
)
tweet = tweets[0]
print(tweet)
print( tweet['id_str'] )
--output:--
{'a': 1}
Traceback (most recent call last):
File "1.py", line 9, in <module>
print( tweet['id_str'] )
KeyError: 'id_str'
And:
my_dict = {u"id_str": 1}
print my_dict["id_str"]
--output:--
1
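If you just want to check for the field without raising the exception, dict.get() returns a default instead (a small illustrative sketch):

tweet = {"a": 1}
print tweet.get('id_str')         # prints None, no KeyError
print tweet.get('id_str', 'n/a')  # prints 'n/a' as an explicit default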

Related

ValueError with NLTK

Using NLTK, I'm trying to print a line of text if the last word of the line has an "NN" POS tag, but I'm getting: "ValueError: too many values to unpack" on the following code. Any ideas why? Thanks in advance.
import nltk
from nltk.tokenize import word_tokenize

def end_of_line():
    filename = raw_input("Please enter a text file.> ")
    with open(filename) as f:
        for line in f:
            linewords = nltk.tokenize.word_tokenize(line)
            lw_tagged = nltk.tag.pos_tag(linewords)
            last_lw_tagged = lw_tagged.pop()
            for (word, tag) in last_lw_tagged:
                if tag == "NN":
                    print line

end_of_line()
Traceback (most recent call last):
File "/private/var/folders/ly/n5ph6rcx47q8zz_j4pcj3b880000gn/T/Cleanup At Startup/endofline-477697124.590.py", line 15, in <module>
end_of_line()
File "/private/var/folders/ly/n5ph6rcx47q8zz_j4pcj3b880000gn/T/Cleanup At Startup/endofline-477697124.590.py", line 11, in end_of_line
for (word, tag) in last_lw_tagged:
ValueError: too many values to unpack
Instead of this:
for (word, tag) in last_lw_tagged:
    if tag == "NN":
Do this:
if last_lw_tagged[1] == "NN":
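Put together, the fixed function might look like this (a sketch that keeps the Python 2 style of the question; the blank-line guard is an addition, not part of the original code):

import nltk
from nltk.tokenize import word_tokenize

def end_of_line():
    filename = raw_input("Please enter a text file.> ")
    with open(filename) as f:
        for line in f:
            linewords = word_tokenize(line)
            if not linewords:
                continue  # skip blank lines so pop() has something to return
            lw_tagged = nltk.tag.pos_tag(linewords)
            word, tag = lw_tagged.pop()  # the last (word, tag) pair of the line
            if tag == "NN":
                print line

end_of_line()

pos_tag returns a list of (word, tag) tuples, so pop() hands back a single 2-tuple; iterating over it was what triggered the "too many values to unpack" error.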

Can't get rid of TypeError: 'str' object is not callable

I'm trying to make/train a Twitter sentiment analyser in an IPython notebook and am having serious problems with one section of code:
import csv

# Read the tweets one by one and process them
inpTweets = csv.reader(open('SampleTweets.csv', 'rb'), delimiter=',', quotechar='|')
tweets = []
for row in inpTweets:
    sentiment = row[0]
    tweet = row[1]
    processedTweet = processTweet(tweet)
    featureVector = getFeatureVector(processedTweet, stopwords)
    tweets.append((featureVector, sentiment));
# end loop
And I'm getting this error:
TypeError Traceback (most recent call last)
<ipython-input-10-bbcb1b9f05f4> in <module>()
7 sentiment = row[0]
8 tweet = row[1]
----> 9 processedTweet = processTweet(tweet)
10 featureVector = getFeatureVector(processedTweet, stopwords)
11 tweets.append((featureVector, sentiment));
TypeError: 'str' object is not callable
And help would be seriously great, thanks!
Here your processTweet must be a str, hence you can't call it like a function.
Example -
>>> a = 'apple'
>>> a(0)
Traceback (most recent call last):
File "<pyshell#212>", line 1, in <module>
a(0)
TypeError: 'str' object is not callable
But indexing it is fine. Callable means you are using it as a function, like sum, etc.
>>> a[0]
'a'
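In your case the name processTweet has most likely been rebound to a string somewhere earlier, for example in a previous notebook cell. A minimal sketch of how that happens (the rebinding line is hypothetical, not taken from your code):

def processTweet(tweet):
    # ... whatever preprocessing you do ...
    return tweet.lower()

processTweet = processTweet("some tweet")  # the name now refers to a str, not the function
processTweet("another tweet")              # TypeError: 'str' object is not callable

Restart the kernel (or rename the variable) so the name refers to the function definition when the loop runs.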

KeyErrors while reading Twitter json files in Python

I am trying to analyze a JSON file with data I have collected from Twitter, but when I try to search for a keyword it says it is not found, even though I can see it is there. I tried this two different ways; I'll post them below. Any advice would be great.
Attempt #1:
import sys
import os
import numpy as np
import scipy
import matplotlib.pyplot as plt
import json
import pandas as pan

tweets_data = []  # list to collect the parsed tweets
tweets_file = open('twitter_data.txt', "r")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue

tweets = pan.DataFrame()
tweets['text'] = map(lambda tweet: tweet['text'], tweets_data)
Attempt #2: Same previous steps, but did a loop instead
t=tweets[0]
tweet_text = [t['text'] for t in tweets]
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <lambda>
KeyError: 'text'
If I print tweets_data, this is what I see. 'text', etc., is definitely there. Am I missing a character?
>>> print(tweet_data[0])
{u'contributors': None, u'truncated': False, u'text': u'RT
#iHippieVibes: \u2b50\ufe0fFAV For This Lace Cardigan \n\nUSE Discount
code for 10% off: SOLO\n\nFree Shipping\n\nhttp://t.co/d8kiIt3J5f
http://t.c\u2026', u'in_reply_to_status....
(pasted only part of the output)
Thanks! Any suggestions would be greatly appreciated.
Not all your tweets have a 'text' key. Filter those out or use dict.get() to return a default:
tweet_text = [t['text'] for t in tweets if 'text' in t]
or
tweet_text = [t.get('text', '') for t in tweets]
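If you are building the pandas DataFrame as in Attempt #1, you can construct it straight from the filtered list; a sketch reusing the names from the question:

import json
import pandas as pan

tweets_data = []
with open('twitter_data.txt', 'r') as tweets_file:
    for line in tweets_file:
        try:
            tweets_data.append(json.loads(line))
        except ValueError:
            continue  # skip lines that are not valid JSON

# keep only the tweets that actually have a 'text' field
tweets = pan.DataFrame({'text': [t['text'] for t in tweets_data if 'text' in t]})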

PyMarc Invalid Literal Error

I'm trying to parse a MARC file downloaded from the Library of Congress. I've successfully downloaded the record using PyZ3950, but when I try to parse the file using PyMarc, I get the following error:
Traceback (most recent call last):
File "test.py", line 13, in <module>
for record in reader:
File "build/bdist.macosx-10.9-intel/egg/pymarc/reader.py", line 83, in next
ValueError: invalid literal for int() with base 10: '<PyZ3'
And here is my full code:
from PyZ3950 import zoom, zmarc
from pymarc import MARCReader

conn = zoom.Connection('z3950.loc.gov', 7090)
conn.databaseName = 'VOYAGER'
conn.preferredRecordSyntax = 'USMARC'

query = zoom.Query('CCL', 'ti="1066 and all that"')
res = conn.search(query)
reader = MARCReader(str(res))
for record in reader:
    print record.title()
conn.close()
Your statement:
res = conn.search(query)
returns a ResultSet, according to http://www.panix.com/~asl2/software/PyZ3950/zoom.html
Each record r in the ResultSet has its data in r.data.
So you have to feed the MARCReader with each r.data, or with all of them concatenated.
This will work:
from PyZ3950 import zoom, zmarc
from pymarc import MARCReader

conn = zoom.Connection('z3950.loc.gov', 7090)
conn.databaseName = 'VOYAGER'
conn.preferredRecordSyntax = 'USMARC'

query = zoom.Query('CCL', 'ti="1066 and all that"')
res = conn.search(query)

marc = ''
for r in res:
    marc = marc + r.data

reader = MARCReader(marc)
for record in reader:
    print record.title()
conn.close()

Creating loops from xml data

Please look at the following code:
from xml.dom import minidom

xmldoc = minidom.parse("C:\Users\...\xml")  # This is just the address to the document
soccerfeed = xmldoc.getElementsByTagName("SoccerFeed")[0]
soccerdocument = soccerfeed.getElementsByTagName("SoccerDocument")[0]
competition = soccerdocument.getElementsByTagName("Competition")[0]
country = competition.getElementsByTagName("Country")[0].firstChild.data
name = competition.getElementsByTagName("Name")[0].firstChild.data
season = competition.getElementsByTagName("Stat")[1].firstChild.data
matchday = competition.getElementsByTagName('Stat')[3].firstChild.data
lst = [country, name, season, "matchday: " + matchday]
print lst

# Match Data
MatchData = soccerdocument.getElementsByTagName("MatchData")[0]
for MatchInfo in MatchData:
    MatchInfo = MatchData.getElementsByTagName("MatchInfo")[0]
    Attendance = MatchInfo.getElementsByTagName("Attendance")[0].firstChild.data
    Result = MatchInfo.getElementsByTagName("Result")[0]
    print (MatchInfo, "Attendance: " + Attendance)
So I just wrote this code to parse some data from an XML file. I keep getting the following error:
Traceback (most recent call last):
File "C:\Users\Javi\Desktop\csvfile.py", line 28, in <module>
for MatchInfo in MatchData:
TypeError: iteration over non-sequence
How do I fix this?
Loop over the return value of getElementsByTagName.
Replace the following line:
MatchData = soccerdocument.getElementsByTagName("MatchData")[0]
with:
MatchData = soccerdocument.getElementsByTagName("MatchData")
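With that change you iterate over the list of MatchData elements; the rest of the loop might then look like this (a sketch adapted from the question's code):

# getElementsByTagName returns a list of elements, which is iterable
MatchData = soccerdocument.getElementsByTagName("MatchData")
for match in MatchData:
    MatchInfo = match.getElementsByTagName("MatchInfo")[0]
    Attendance = MatchInfo.getElementsByTagName("Attendance")[0].firstChild.data
    Result = MatchInfo.getElementsByTagName("Result")[0]
    print (MatchInfo, "Attendance: " + Attendance)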
