Extracting data from .txt file - python

For my programming assignment, one of the functions involves taking input from a text file (twitter data) and returning a tuple of the tweet information (see doctests for correct results on a sample file).
Sample text file: http://pastebin.com/z5ZkN3WH
Full description of the function is as follows:
The parameter is the full name of a file. Open the file specified by the parameter, which is formatted as described in the data files section, read all of the data from it, and return a dictionary. The keys of the dictionary should be the names of the candidates, and the items in the list associated with each candidate are the tweets they have sent. A tweet tuple should have the form (candidate, tweet text, date, source, favorite count, retweet count). The date, favorite count, and retweet count should be integers, and the rest of the items in the tuple should be strings.
My code so far is below:
def extract_data(metadata):
    """ list of str -> tuple of str/int
    Return extracted metadata in specified format.
    """
    date = int(metadata[1])
    source = metadata[3]
    favs = int(metadata[4])
    retweets = int(metadata[5])
    return date, source, favs, retweets
def read_tweets(file):
    """ (filename) -> dict of {str: list of tweet tuples}
    Read tweets from file and categorize into dictionary.
    >>> read_tweets('very_short_data.txt')
    {'Donald Trump': [('Donald Trump', 'Join me live in Springfield, Ohio!\\nhttps://t (dot) co/LREA7WRmOx\\n', 1477604720, 'Twitter for iPhone', 5251, 1895)]}
    """
    result = {}
    with open(file) as data:
        tweets = data.read().split('<<<EOT')
        for i, tweet in enumerate(tweets):
            line = tweet.splitlines()
            content = ' '.join(line[2:])
            meta = line[1].split(',')
            if ':' in line[0]:
                author = line[0]
                metadata = extract_data(meta)
            else:
                metadata = extract_data(meta)
            candidate = author
            result[candidate] = [(candidate, content, metadata)]
    return result
This currently results in an error: "date = int(metadata[1]) IndexError: list index out of range". I am not sure why, or what to do next. Any help would be appreciated.
Thanks

I don't think it is a good idea to split by EOT, considering that candidates with no tweets don't have an EOT marker. It is better to loop through the contents line by line instead of reading all the data at once; it makes it a lot easier (see the sketch after these comments).
Doing the same assignment, stuck on this function as well :(
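To illustrate that comment, here is a minimal sketch of a line-by-line reader. It assumes the layout of the sample file (a "Name:" header line, then for each tweet a comma-separated metadata line, the tweet text, and a closing '<<<EOT' line) and reuses the extract_data helper above; treat it as a starting point, not a definitive solution.
def read_tweets(file):
    """ (filename) -> dict of {str: list of tweet tuples} """
    result = {}
    candidate, meta, text_lines = None, None, []
    with open(file) as data:
        for raw in data:
            line = raw.rstrip('\n')
            if meta is None and line.endswith(':'):
                # candidate header line, e.g. 'Donald Trump:'
                candidate = line[:-1]
                result[candidate] = []
            elif meta is None and line.strip():
                # comma-separated metadata line for the next tweet
                meta = line.split(',')
            elif line == '<<<EOT':
                # end of one tweet: build the tuple and reset state
                date, source, favs, retweets = extract_data(meta)
                text = '\n'.join(text_lines) + '\n'
                result[candidate].append(
                    (candidate, text, date, source, favs, retweets))
                meta, text_lines = None, []
            elif meta is not None:
                text_lines.append(line)
    return result
Candidates with no tweets are handled naturally here: their header creates an empty list and no metadata line ever arrives for them.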


How to build a function that returns elements matching in txt. file and dictionary

I am new to Python, so apologies in advance if my question seems foolish.
I am trying to build a function that searches for keys and values of a nested dictionary (built from info in a csv file) inside a .txt file and returns all matching words. So far this is what I tried:
text = ['da#8970095-v4',
        'd#30/04/2019',
        'h#2.0',
        'power of attorney']

clientlist = {'hong kong co.': {'Client Code': '897',
                                'Matter Code': '0095',
                                'Matter Name': 'Incorporation of Brazilian Subsidiary'},
              'shanghai co.': {'Client Code': '965',
                               'Matter Code': '3569',
                               'Matter Name': 'Corporate Matters'}}
def term_tracker(document, term_variations):
    terms = []
    # If term_variations is a dictionary
    if isinstance(term_variations, dict) == True:
        for term in term_variations:
            if any([str(term) in i for i in document]):
                terms.append(term)
    # If term_variations is a list
    if isinstance(term_variations, list) == True:
        for term in term_variations:
            # If we find a term in the document, append that term to a list
            if any([str(term) in i for i in document]):
                terms.append(term)
    return terms
For some reason my output is a blank list:
In: term_tracker(text, clientlist[clientname]) #text = .txt file
Out: []
I could build lists with information collected from my nested dictionary (e.g., only with keys, or only with values), but I am trying to keep my code as clean as possible and therefore want to avoid this.
The following is another part of my code that I am also having issues with. When I use my term_tracker function inside the client_summary variable and then try to write a .txt file with the information included in this variable, my .txt file comes out without the information that the function should return.
def string_cleaner(document):
    document = document.replace('[', '')
    document = document.replace(']', '')
    document = document.replace("'", '')
    document = document.replace('"', '')
    return document
for documents in samples:
    filename = 'Time Sheet-' + time.strftime("%Y%m%d-%H%M%S")
    infile = open(path + 'Sample docs' + '/' + documents, 'r')
    .
    .
    .
    client_summary = ['Client: ' + str(term_tracker(text, clientlist[clientname]['Client Code']))]
    client_summary = string_cleaner(str(client_summary))
    outfile = open(path + 'Automated work descriptions/' + filename, 'w', encoding='utf-8')
    outfile.write(client_summary)
    outfile.close()
If I run client_summary my editor returns the output I want. However, this information is not being written in my .txt file. I assume this has to do with the problem I am having with my function because if I try the following alternative I get the information I want written in a .txt file:
client_codes_only = [val['Client Code'] for val in clientlist.values()]
>>> ['897', '965']
.
.
.
client_summary = ['Client: ' + str(term_tracker(text, client_codes_only))]
client_summary = string_cleaner(str(client_summary))
>>> 'Client: 965'
Can anyone help me identify why my code is not giving the expected result (or suggest another, more efficient way to achieve my goal)?
Thanks in advance!
Your function is checking the keys of the dictionary, and you want the values.
Substitute this:
if any([str(term_variations[term]) in i for i in document]):
Wherever you have term, replace it with term_variations[term].
It's worth noting that, with your example data, this logic matches '0095' against 'da#8970095-v4' in your "text" list.
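Put together, a minimal sketch of the corrected function (the list branch is unchanged):
def term_tracker(document, term_variations):
    terms = []
    # If term_variations is a dictionary, match its values, not its keys
    if isinstance(term_variations, dict):
        for term in term_variations:
            if any([str(term_variations[term]) in i for i in document]):
                terms.append(term_variations[term])
    # If term_variations is a list, match the items directly
    if isinstance(term_variations, list):
        for term in term_variations:
            if any([str(term) in i for i in document]):
                terms.append(term)
    return terms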
2nd part of question:
For starters, if Hong Kong Co. is your client lookup, then this line of code: client_summary = ['Client: ' + str(term_tracker(text, clientlist[clientname]['Client Code']))]
is passing term_tracker(text, '897') into your function. Since '897' is a plain string, neither the dict branch nor the list branch runs, so term_tracker() returns an empty list and nothing useful is written to your file.
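One illustrative workaround (an assumption on my part, not the only fix) is to wrap the single code in a list so the list branch of term_tracker fires:
client_summary = ['Client: ' + str(term_tracker(text, [clientlist[clientname]['Client Code']]))]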

Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

I am starting on a Python task and am facing a problem while using gensim. I am trying to load files from my disk and process them (split them and lowercase them).
The code I have is below:
import glob
import os
from gensim import corpora

dictionary_arr = []
for file_path in glob.glob(os.path.join(path, '*.txt')):
    with open(file_path, "r") as myfile:
        text = myfile.read()
        for words in text.lower().split():
            dictionary_arr.append(words)
dictionary = corpora.Dictionary(dictionary_arr)
The list (dictionary_arr) contains the list of all words across all the files; I then use gensim's corpora.Dictionary to process the list. However, I get this error:
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
I can't understand what the problem is. A little guidance would be appreciated.
In dictionary.py, the initializer is:
def __init__(self, documents=None):
    self.token2id = {}  # token -> tokenId
    self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
    self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared
    self.num_docs = 0  # number of documents processed
    self.num_pos = 0  # total number of corpus positions
    self.num_nnz = 0  # total number of non-zeroes in the BOW matrix
    if documents is not None:
        self.add_documents(documents)
The add_documents function builds a dictionary from a collection of documents, where each document is a list of tokens:
def add_documents(self, documents):
    for docno, document in enumerate(documents):
        if docno % 10000 == 0:
            logger.info("adding document #%i to %s" % (docno, self))
        _ = self.doc2bow(document, allow_update=True)  # ignore the result, here we only care about updating token ids
    logger.info("built %s from %i documents (total %i corpus positions)" %
                (self, self.num_docs, self.num_pos))
So, if you initialize Dictionary this way, you must pass a collection of documents, not a single document. For example,
dic = corpora.Dictionary([a.split()])
is OK.
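Applied to the loop in the question, that means collecting one token list per file instead of one flat list of words. A sketch under that assumption (path as in the question):
documents = []
for file_path in glob.glob(os.path.join(path, '*.txt')):
    with open(file_path, 'r') as myfile:
        # one tokenized document per file
        documents.append(myfile.read().lower().split())
dictionary = corpora.Dictionary(documents)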
Dictionary needs tokenized strings for its input:
dataset = ['driving car ',
           'drive car carefully',
           'student and university']

# be sure to split sentences before feeding them into Dictionary
dataset = [d.split() for d in dataset]
vocab = Dictionary(dataset)
Hello everyone, I ran into the same problem. This is what worked for me:
#Tokenize the sentence into words
tokens = [word for word in sentence.split()]
#Create dictionary
dictionary = corpora.Dictionary([tokens])
print(dictionary)

MongoDB field value to variable in python

I am taking entries from MongoDB and I want to do some modifications, data crunching, etc., and then update the documents. In this particular example, for every document in the collection, e.g.
{u'time': 1405694995.310651, u'text': u'HOHO,r\u012bt ar evitu uz positivus ar vip bi\u013ceti kabat\u0101:)', u'_id': ObjectId('53cd621d51f4fbe9f6e04da4'), u'name': u'Madara B\u013cas\u0101ne', u'screenName': u'miumiumadara'}
I am trying to take its text value as a string, count the keyword occurrences in it, and then add the count to that exact document as a new field.
I am struggling with getting the text field as a string so it can be operated on, and I also haven't found a Python solution for adding a new field holding the count variable to a document. In the Mongo shell the commands are easy, but here I don't know. Anything for me to look for?
db = conn.posit2014
collection = db.ceturtdiena
cursor = db.all.find()
for text_fromDB in cursor:
    print text_fromDB
    source_text = text_fromDB.translate(None, '#!#£$%^&*()_:""?><.,/\|+-')
    source_text = source_text.lower()
    source_words = source_text.split()
    count = 0
    word_list = []
    with open('pozit.txt') as inputfile:
        for line in inputfile:
            word_list.append(line.strip())
    for word in word_list:
        if word in source_words:
            count += 1
    # add count variable to each document
    # {$set : {value:'count'}}
AFAIK text_fromDB is just a dict, so you can do this (if you mean to update the document):
text_fromDB['count'] = value
collection.update({'_id': text_fromDB['_id']}, {"$set": text_fromDB})
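Putting both pieces together, a minimal sketch (assuming the ceturtdiena collection and the pozit.txt word list from the question, and the legacy pymongo update() API):
with open('pozit.txt') as inputfile:
    word_list = [line.strip() for line in inputfile]

for doc in collection.find():
    # take the text field as a string and count keyword hits
    source_words = doc['text'].lower().split()
    count = sum(1 for word in word_list if word in source_words)
    # write the count back onto this exact document
    collection.update({'_id': doc['_id']}, {'$set': {'count': count}})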
I'm not sure if I understand everything you're asking. Let's go one piece at a time. To get the text field from your collection as a normal string, try this:
collection = db.ceturtdiena
for doc in collection.find():
    text = str(doc['text'])
    print(text)

Using BeautifulSoup to find a tag and evaluate whether it fits some criteria

I am writing a program to extract text from a website and write it into a text file. Each entry in the text file should have 3 values separated by a tab. The first value is hard-coded to XXXX, the 2nd value should initialize to the first item on the website with a p class of "style4", and the third value is the next item on the website with a p class of "style5". The logic I'm trying to introduce is: look for the first "style4" and write the associated string into the text file, then find the next "style5" and write its string into the text file. Then look at the next p class; if it's "style4", start a new line, and if it's another "style5", write it into the text file with the first style5 entry but separated with a comma (alternatively, the program could just skip the next style5).
I'm stuck on the part of the program just described, that is, getting the program to look for the next p class and evaluate it against style4 and style5. Since I was having problems with finding and evaluating the p class tag, I chose to pull my code out of the loop and just try to accomplish the first iteration of the task for starters. Here's my code so far:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
next_vendor = soup.find('p', {'class': 'style4'})
print next_vendor
next_commodity = next_vendor.find_next('p', {'class': 'style5'})
print next_commodity
next = next_commodity.find_next('p')
print next
I'd appreciate any help anybody can provide! Thanks in advance!
I am not entirely sure what you expect your output to look like. I am assuming that you are trying to get the data in the webpage in the format:
Alphabet \t Vendor \t Category
You can do this:
# The basic things
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
Get the td of interest:
table = soup.find('table')
data = table.find_all('tr')[-1]
data = data.find_all('td')[1:]
Now we will create a nested output dictionary with the alphabet letters as keys and an inner dict as the value. The inner dict has the vendor name as key and category information as its value:
output_dict = {}
current_alphabet = ""
current_vendor = ""
for td in data:
    for p in td.find_all('p'):
        print p.text.strip()
        if p.get('class')[0] == 'style6':
            current_alphabet = p.text.strip()
            vendors = {}
            output_dict[current_alphabet] = vendors
            continue
        if p.get('class')[0] == 'style4':
            print "Here"
            current_vendor = p.text.strip()
            category = []
            output_dict[current_alphabet][current_vendor] = category
            continue
        output_dict[current_alphabet][current_vendor].append(p.text.strip())
This gets the output_dict in the format:
{ ...
  u'W': { u'WTI - Weatherproofing Technologies': [u'Roofing'],
          u'Wenger Corporation': [u'Musical Instruments and Equipment'],
          u'Williams Scotsman, Inc': [u'Modular/Portable Buildings'],
          u'Witt Company': [u'Interactive Technology']
  },
  u'X': { u'Xerox': [u"Copiers & MFD's", u'Printers']
  }
}
Skipping the earlier parts for brevity. Now it is just a matter of walking this dictionary and writing it out to a tab-separated file.
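For instance, a short sketch of that last step (the output file name is illustrative):
with open('contracts.txt', 'w') as out:
    for alphabet, vendors in output_dict.items():
        for vendor, categories in vendors.items():
            # XXXX <tab> vendor <tab> comma-joined categories
            out.write('XXXX\t%s\t%s\n' % (vendor, ', '.join(categories)))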
Hope this helps.
Agree with @shaktimaan. Using a dictionary or list is a good approach here. My attempt is slightly different.
import requests as rq
from bs4 import BeautifulSoup as bsoup
import csv

url = "http://www.kcda.org/KCDA_Awarded_Contracts.htm"
r = rq.get(url)
soup = bsoup(r.content)

primary_line = soup.find_all("p", {"class": ["style4", "style5"]})
final_list = {}
for line in primary_line:
    txt = line.get_text().strip().encode("utf-8")
    if txt != "\xc2\xa0":
        if line["class"][0] == "style4":
            key = txt
            final_list[key] = []
        else:
            final_list[key].append(txt)

with open("products.csv", "wb") as ofile:
    f = csv.writer(ofile)
    for item in final_list:
        f.writerow([item, ", ".join(final_list[item])])
For the scrape, we isolate the style4 and style5 tags right away. I did not bother going for the style6 alphabet headers. We then get the text inside each tag. If the text is not a whitespace of sorts (these are all over the tables, probably obfuscation or bad mark-up), we then check whether it's style4 or style5. If it's the former, we assign it as a key to a blank list; if it's the latter, we append it to the blank list of the most recent key. Obviously the key changes every time we hit a new style4, so it's a relatively safe approach.
The last part is easy: we just use ", ".join on the value part of each key-value pair to concatenate the list into one string. We then write it to a CSV file.
Due to the dictionary being unsorted, the resulting CSV file will not be sorted alphabetically.
Changing it to a tab-delimited file is up to you; that's simple enough (see the one-liner below). Hope this helps!
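For example, csv.writer accepts a delimiter argument, so the tab-delimited variant is a one-line change:
f = csv.writer(ofile, delimiter="\t")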

JSON to python dictionary: Printing values

Noob here. I have a large number of JSON files; each is a series of blog posts in a different language. The key-value pairs are metadata about the posts, e.g. {'author': 'John Smith', 'translator': 'Jane Doe'}. What I want to do is convert each to a Python dictionary, then extract the values so that I have a list of all the authors and translators across all the posts.
for lang in languages:
    f = 'posts-' + lang + '.json'
    file = codecs.open(f, 'rt', 'utf-8')
    line = string.strip(file.next())
    postAuthor[lang] = []
    postTranslator[lang] = []
    while (line):
        data = json.loads(line)
        print data['author']
        print data['translator']
When I tried this method, I kept getting a KeyError for 'translator' and I'm not sure why. I've never worked with the json module before, so I tried a more complex method to see what happened:
postAuthor[lang].append(data['author'])
for translator in data.keys():
    if not data.has_key('translator'):
        postTranslator[lang] = ""
    postTranslator[lang] = data['translator']
It keeps returning an error that strings do not have an append function. This seems like a simple task and I'm not sure what I'm doing wrong.
See if this works for you:
import json

# you have lots of "posts", so let's assume
# you've stored them in some list. We'll use
# the example text you gave as one of the entries
# in said list
posts = ["{'author':'John Smith', 'translator':'Jane Doe'}"]

# strictly speaking, the single quotes in your example aren't
# valid JSON, so you'll want to switch the single quotes
# out for double quotes; you can verify this with something
# like http://jsonlint.com/
# luckily, you can easily swap out all the quotes programmatically,
# so let's loop through the posts and store the authors and translators
# in two lists
authors = []
translators = []
for post in posts:
    double_quotes_post = post.replace("'", '"')
    json_data = json.loads(double_quotes_post)
    author = json_data.get('author', None)
    translator = json_data.get('translator', None)
    if author: authors.append(author)
    if translator: translators.append(translator)
# and there you have it, a list of authors and translators
