I am taking entries from MongoDB and I want to do some modifications, data crunching, etc., and then update them. In this particular example, for every document in the collection, such as
{u'time': 1405694995.310651, u'text': u'HOHO,r\u012bt ar evitu uz positivus ar vip bi\u013ceti kabat\u0101:)', u'_id': ObjectId('53cd621d51f4fbe9f6e04da4'), u'name': u'Madara B\u013cas\u0101ne', u'screenName': u'miumiumadara'}
I want to take its text value as a string, count how many keywords it contains, and then add that count as a new field on that exact document.
I am struggling with getting the text field as a string so it can be operated on, and I also haven't found a way in Python to add a new field with the count variable to a document. In the Mongo shell the commands are easy, but here I don't know. Anything for me to look for?
db = conn.posit2014
collection = db.ceturtdiena
cursor = db.all.find()
for text_fromDB in cursor:
    print text_fromDB
    source_text = text_fromDB.translate(None, '#!#£$%^&*()_:""?><.,/\|+-')
    source_text = source_text.lower()
    source_words = source_text.split()
    count = 0
    word_list = []
    with open('pozit.txt') as inputfile:
        for line in inputfile:
            word_list.append(line.strip())
    for word in word_list:
        if word in source_words:
            count += 1
    # add count variable to each document
    # {$set : {value:'count'}}
AFAIK text_fromDB is just a dict, so you can do this (if you mean to update the document):
text_fromDB['count'] = count
collection.update({'_id': text_fromDB['_id']}, {"$set": text_fromDB})
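One note: on PyMongo 3 and newer, update() is deprecated in favour of update_one()/update_many(), and you only need to send the new field rather than the whole document. A minimal sketch, assuming PyMongo 3+:

# Writes only the count field back to this one document (PyMongo 3+)
collection.update_one({'_id': text_fromDB['_id']}, {'$set': {'count': count}})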
I'm not sure I understand everything you're asking. Let's go one piece at a time. To get the text field from your collection as a normal string, try this:
collection = db.ceturtdiena
for doc in collection.find():
    text = str(doc['text'])
    print(text)
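For the second piece, counting keywords and writing the count back onto each document, here is a rough end-to-end sketch rather than a drop-in solution; it assumes the collection name and the pozit.txt keyword file from your question, plus PyMongo 3+ for update_one:

import re

# Load the keyword list once, outside the loop (one keyword per line)
with open('pozit.txt') as inputfile:
    word_list = [line.strip().lower() for line in inputfile]

collection = db.ceturtdiena
for doc in collection.find():
    # doc is a dict; doc['text'] is the tweet text as a string
    source_words = re.sub(r'[^\w\s]', ' ', doc['text']).lower().split()
    count = sum(1 for word in word_list if word in source_words)
    # Store the keyword count on this particular document
    collection.update_one({'_id': doc['_id']}, {'$set': {'count': count}})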
My loop generates a series of strings which are sentences retrieved from a database. My data structure in the database needs to have duplicates, but I want to omit the duplicates in the output. Assume my loop and results are as follows:
for text in document:
    print(text)
Output:
He goes to school.
He works here.
we are friends.
He goes to school.
they are leaving us alone.
..........
How can I set up a condition so that the program reads all the output generated and, if it finds duplicate results (e.g. "He goes to school."), it shows only one record instead of multiple similar records?
already_printed = set()
for text in document:
    if text not in already_printed:
        print(text)
        already_printed.add(text)
You can use a set. Like:
values = set(document)
for text in values:
    print(text)
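Note that a set does not keep the original order of the sentences. If order matters, on Python 3.7+ you can deduplicate while keeping each sentence's first occurrence with a dict:

# dict keys preserve insertion order, so duplicates collapse to their first occurrence
for text in dict.fromkeys(document):
    print(text)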
Or you can use a list:
temp_list = []
for text in document:
    if text not in temp_list:
        temp_list.append(text)
        print(text)
Or you can use a dict:
temp_dict = {}
for text in document:
    if text not in temp_dict:
        temp_dict[text] = 1
        print(text)
Split document by '\n' or read it row by row into a list arr = []; i.e. in the for loop store arr.append(row.lower()). arr = list(set(arr)) will then remove the duplicates.
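A minimal sketch of that idea, assuming document is one newline-separated string rather than a list:

arr = []
for row in document.split('\n'):
    arr.append(row.lower())
arr = list(set(arr))  # removes duplicates, but does not preserve order
for text in arr:
    print(text)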
If the case does not matter, you can take a set of the lowercased items:
for text in set(i.lower() for i in document):
    print(text)
Use Python's built-in set to remove duplicates:
documents = ["He goes to school", "He works here. we are friends", "He goes to school", "they are leaving us alone"]
list(set(documents))
I am trying to retrieve items from a SQLite db table in Python where there is a match in a particular field. In other words if I search for 'rabbit' I want to retrieve all entries that have the string 'rabbit' in a particular column. My code looks like this:
Python server code for endpoint:
if self.path=='/getOne':
    form = cgi.FieldStorage(
        fp=self.rfile,
        headers=self.headers,
        environ={'REQUEST_METHOD': 'POST',
                 'CONTENT_TYPE': self.headers['Content-Type'],
                 })
    value = []
    for key in form.keys():
        value.append("%" + form.getvalue(key) + "%")
    print 'LOOK my value', value
    c.execute('select * from appointments where description=?', value)
    res = c.fetchall()
    # _json = json.dumps(res)
    # print 'This is res from _get_all_appts', res
    # print 'From line 18: ', _json
    # self.wfile.write(_json)
    print "I'm ya huckleberry", res
    return
What is printing in console:
From line 18: [["15:01", "asdf", "2020-05-07"], ["14:01", "test", "2020-04-04"]]
LOOK my value ['%test%']
I'm ya huckleberry []
LOOK my value ['%test%']
I'm ya huckleberry []
As you can see, what is printing out on line 18 are the entries in my table.
My value ['%test%'] should return the second entry, since I want to return any entry that contains the string test in that particular column, but I get nothing. I come from a JS background and would easily do this with string interpolation/template strings. Is there anything anyone can suggest that would help bring about the desired effect? I would greatly appreciate it. Thank you ahead of time!
For the answer to your first part of the question, the LIKE operator might do the trick:
SELECT *
FROM appointments
WHERE description LIKE '%rabbit%';
If this does not meet your expectations, then you could try looking into full text search with SQLite. One reason the above query might fall short for you is that it would match the substring rabbit occurring anywhere. For example, it would also match rabbits. Full text search would get around most of these edge cases.
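If you do go the full-text-search route, here is a rough sketch using SQLite's FTS5 extension; the extra column names (time, date) are guesses based on the rows you printed, and your SQLite build must include FTS5:

import sqlite3

conn = sqlite3.connect('appointments.db')  # database path assumed
c = conn.cursor()

# Build a full-text index mirroring the appointments table
c.execute("CREATE VIRTUAL TABLE IF NOT EXISTS appointments_fts USING fts5(time, description, date)")
c.execute("INSERT INTO appointments_fts SELECT time, description, date FROM appointments")
conn.commit()

# MATCH is token-based with the default tokenizer, so 'rabbit' will not also match 'rabbits'
c.execute("SELECT * FROM appointments_fts WHERE appointments_fts MATCH ?", ('rabbit',))
print(c.fetchall())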
To make the term inside the LIKE expression dynamic, you would use Python code along these lines:
param = 'rabbit'
t = ('%'+param+'%',)
c.execute('select * from appointments where description like ?', t)
c.fetchall()
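Applied to your handler, the only change needed is the operator in the SQL string, since your value list already wraps the form value in % wildcards (this assumes the form posts a single field, so value holds exactly one item):

c.execute('select * from appointments where description like ?', value)
res = c.fetchall()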
For my programming assignment, one of the functions involves taking input from a text file (twitter data) and returning a tuple of the tweet information (see doctests for correct results on a sample file).
Sample text file: http://pastebin.com/z5ZkN3WH
Full description of function is as follows:
The parameter is the full name of a file. Open the file specified by the parameter, which is formatted as described in the data files section, and read all of the data from it. The keys of the dictionary should be the names of the candidates, and the items in the list associated with each candidate are the tweets they have sent. A tweet tuple should have the form (candidate, tweet text, date, source, favorite count, retweet count). The date, favorite count, and retweet count should be integers, and the rest of the items in the tuple should be strings.
My code so far is below:
def extract_data(metadata):
    """ list of str -> tuple of str/int
    Return extracted metadata in specified format.
    """
    date = int(metadata[1])
    source = metadata[3]
    favs = int(metadata[4])
    retweets = int(metadata[5])
    return date, source, favs, retweets

def read_tweets(file):
    """ (filename) -> dict of {str: list of tweet tuples}
    Read tweets from file and categorize into dictionary.
    >>> read_tweets('very_short_data.txt')
    {'Donald Trump': [('Donald Trump', 'Join me live in Springfield, Ohio!\\nhttps://t (dot) co/LREA7WRmOx\\n', 1477604720, 'Twitter for iPhone', 5251, 1895)]}
    """
    result = {}
    with open(file) as data:
        tweets = data.read().split('<<<EOT')
    for i, tweet in enumerate(tweets):
        line = tweet.splitlines()
        content = ' '.join(line[2:])
        meta = line[1].split(',')
        if ':' in line[0]:
            author = line[0]
            metadata = extract_data(meta)
        else:
            metadata = extract_data(meta)
        candidate = author
        result[candidate] = [(candidate, content, metadata)]
    return result
This currently results in an error: "date = int(metadata[1]) IndexError: list index out of range". I am not sure why, or what to do next. Any help would be appreciated.
Thanks
I don't think it is a good idea to split by EOT, since candidates with no tweets don't have an EOT marker. It is better to loop through the contents line by line instead of reading all the data at once; it makes it a lot easier.
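For illustration only, a hedged line-by-line sketch is below. The format details (a candidate line ending with ':', a comma-separated metadata line, then the tweet text up to a <<<EOT line) are inferred from your code rather than from the actual data file, so adjust them as needed; extract_data is your function from above:

def read_tweets(file):
    """Read tweets line by line into {candidate: [tweet tuples]}."""
    result = {}
    candidate = None
    meta = None
    content_lines = []
    with open(file) as data:
        for raw in data:
            line = raw.rstrip('\n')
            if not line and meta is None:
                continue                        # skip blank separator lines, if any
            if meta is None and line.endswith(':'):   # a new candidate section starts
                candidate = line.rstrip(':')
                result[candidate] = []
                content_lines = []
            elif line == '<<<EOT':              # end of one tweet
                date, source, favs, retweets = extract_data(meta)
                text = '\n'.join(content_lines) + '\n'
                result[candidate].append((candidate, text, date, source, favs, retweets))
                meta = None
                content_lines = []
            elif meta is None:                  # first line after a candidate/EOT is metadata
                meta = line.split(',')
            else:                               # everything else is tweet text
                content_lines.append(line)
    return result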
Doing the same assignment, stuck on this function as well :(
I am a newbie in Python and I am stuck on a kind of database-engine problem in Python.
I have a text file database table (person.text) with fields separated by a delimiter (|). For example:
sue|45|Happy Lane|456-3245
John|43|67 Drury Lane|897-3456
Mark|21|Stuffy Street|345-7896
Now I want functionality that takes queries from the user, fetches data from this text file, and displays it. Queries can be select or update (with or without a where clause).
For example, if the user gives the input "select name from person", then the output would be
sue
John
Mark
I am confused about which data structure I should use.
Instead of data = f.split("|") it should be data = line.split("|").
I need the user to type a query as input, which must then be interpreted by the code. If I use the above solution, how can I match my SQL queries against the dictionary?
Thanks
If you want to maintain your current text-based database, you would probably need to parse it manually and then store it in a local dictionary (defaultdict makes it easy) to allow for keyword searching. I added a numeric primary key to help store the data in a keyword-searchable form:
from collections import defaultdict

person = defaultdict(lambda: dict())

with open("path/to/person.txt", "r") as f:
    primary_key = 0
    for line in f:
        data = line.split("|")
        person[primary_key]["name"] = data[0]
        person[primary_key]["age"] = data[1]
        person[primary_key]["address"] = data[2]
        person[primary_key]["pnum"] = data[3]
        primary_key += 1
Then you would have a local dictionary named person that can be searched through using keywords inputted by the end user.
Searching:
query = "select name from person"
query_items = query.split(" ")
if query_items[0] == "select":
table = eval(query_items[3])
for value in table.values():
print(value[query_items[1]])
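With the sample file above, that query prints sue, John, and Mark. Two details worth guarding against, sketched here as a suggestion rather than a fix to the code above: line.split("|") keeps the trailing newline on the last field, and eval on user-typed input is risky, so looking the table name up in a dict of known tables is safer:

# Inside the parsing loop: strip the newline so pnum comes out clean
data = line.rstrip('\n').split("|")

# Instead of eval, look the table name up among known tables
tables = {"person": person}
if query_items[0] == "select":
    table = tables[query_items[3]]
    for value in table.values():
        print(value[query_items[1]])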
I have an abstract which I've split into sentences in Python. I want to write to 2 tables. One has the following columns: abstract ID (which is the file number I extracted from my document), sentence ID (automatically generated), and each sentence of this abstract on its own row.
I would want a table that looks like this
abstractID SentenceID Sentence
a9001755 0000001 Myxococcus xanthus development is regulated by(1st sentence)
a9001755 0000002 The C signal appears to be the polypeptide product (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How do I write the sentences (each on its own row) to the table and assign the sentence ID as shown above?
This is my code:
import glob
import re
import json

org = "NSF Org"
fileNo = "File"
AbstractString = "Abstract"
abstractFlag = False
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt'
files = glob.glob(path)

for name in files:
    fileA = open(name, 'r')
    for line in fileA:
        if line.find(fileNo) != -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name, 'r')
    content = fileA.read().split(':')
    abstract = content[len(content)-1]
    abstract = abstract.replace('\n', '')
    abstract = abstract.split()
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough to refactor, but feels obviously wrong. It's marked below.
import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0)  # reset file pointer

    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:]
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments

    fh.close()

    # ' '.join collapses runs of whitespace into single spaces
    concat_abstract = ' '.join(abstract.replace('\n', '').split())
    for s_id, sentence in enumerate(concat_abstract.split('.')):
        # Adjust numeric width arguments to prettify table
        print absID.ljust(15),
        print '{:06d}'.format(s_id).ljust(15),
        print sentence
In that section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not because the loop will keep overwriting your variables as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it if this is in its header) rather than looping over it.
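For illustration, a hedged sketch of that "search the whole file as one string" idea, using a regular expression to pull each header field once; the field layout is only assumed from your line[14:] slicing, so the patterns will need adjusting to your actual files:

import re

with open(filename, 'r') as fh:
    text = fh.read()

# Grab the first occurrence of each header field instead of looping and overwriting.
# The patterns assume header lines shaped roughly like "File       : a9001755".
file_match = re.search(r'^File\s*:\s*(\S+)', text, re.MULTILINE)
org_match = re.search(r'^NSF Org\s*:\s*(\S+)', text, re.MULTILINE)
absID = file_match.group(1) if file_match else None
nsfOrg = org_match.group(1) if org_match else None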
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, and collecting a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which for reasons cited is best to avoid. In the smaller version, you can see how I substituted in string literals in places where they are only used once and called print statements immediately instead of storing the results for later. The results are usually more concise and easily understood.