I have a log file containing search queries entered into my site's search engine. I'd like to "group" related search queries together for a report. I'm using Python for most of my webapp - so the solution can either be Python-based or I can load the strings into Postgres if it is easier to do this with SQL.
Example data:
dog food
good dog trainer
cat food
veterinarian
Groups should include:
cat:
cat food
dog:
dog food
good dog trainer
food:
dog food
cat food
etc...
Ideas? Some sort of "indexing algorithm" perhaps?
with open('data.txt') as f:
    raw = f.readlines()

# generate the set of all words that can serve as group keys
groups = set()
for line in raw:
    for word in line.strip().split():
        groups.add(word)

# parse the input into groups and print each group's matching queries
for group in groups:
    print("Group '%s':" % group)
    for line in raw:
        if group in line:
            print(line.strip())
    print()
# consider storing the results in a dictionary instead of just printing
This could be heavily optimized, but it will print the following result, assuming you place the raw data in an external text file named data.txt:
Group 'trainer':
good dog trainer
Group 'good':
good dog trainer
Group 'food':
dog food
cat food
Group 'dog':
dog food
good dog trainer
Group 'cat':
cat food
Group 'veterinarian':
veterinarian
Well, it seems that you just want to report every query that contains a given word. You can do this easily in plain SQL using wildcard matching, i.e.
SELECT * FROM queries WHERE querystring LIKE '%dog%';
The only problem with the query above is that it also finds queries with query strings like "dogbah". Assuming your words are separated by whitespace, you need to write a couple of alternatives joined with OR to cover the different positions of the word, e.g. querystring LIKE 'dog %' OR querystring LIKE '% dog' OR querystring LIKE '% dog %' OR querystring = 'dog'. In Postgres you can instead use the regex match operator with word boundaries: querystring ~ '\ydog\y'.
Not a concrete algorithm, but what you're looking for is basically an index created from words found in your text lines.
So you'll need some sort of parser to recognize words, then you put them in an index structure and link each index entry to the line(s) where it is found. Then, by going over the index entries, you have your "groups".
Your algorithm needs the following parts (if you build it yourself):
A parser for the data that breaks it up into lines and breaks the lines up into words.
A data structure to hold key-value pairs (like a hash table). The key is a word; the value is a dynamic array of lines (if you keep the parsed lines in memory, pointers or line numbers suffice).
in pseudocode (generation):

create empty set S of key-value pairs
for each line L parsed
    for each word W in line L
        seek W in set S -> Item
        if not found -> add entry (word W -> empty array) to set S
        add reference to line L to the array in Item
    endfor
endfor

(lookup (word: W))

seek W in set S -> Item
if found, return array from Item
else return empty array
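In Python this pseudocode maps almost directly onto a dict; a small sketch (the dict plays the role of set S):
def build_index(lines):
    index = {}                            # set S: word -> array of lines
    for line in lines:                    # for each line L parsed
        for word in line.split():         # for each word W in line L
            index.setdefault(word, []).append(line)
    return index

def lookup(index, word):
    return index.get(word, [])            # empty array when W is not in S
With the example data, lookup(build_index(raw), 'dog') would return both "dog food" and "good dog trainer".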
Modified version of #swanson's answer (not tested):
from collections import defaultdict
from itertools import chain

# generate the set of all words occurring in the queries
lines = [line.strip() for line in open('data.txt')]
words = set(chain.from_iterable(line.split() for line in lines))

# parse input into groups, matching whole words only
groups = defaultdict(list)
for line in lines:
    for word in words:
        if word in line.split():
            groups[word].append(line)
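To turn the dictionary back into the printed report from the first answer, one could then iterate over it:
for word, matching_lines in groups.items():
    print("Group '%s':" % word)
    for line in matching_lines:
        print(line)
    print()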
I have a text and I have been given a task in Python that involves reading it:
Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:
If Mr. Churchill is in the novel, then include {'Churchill' : 2}
If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}
The file is .txt and it contains around 10-15 paragraphs.
Do you have ideas about how it can be improved? (It gives me an error after some words; I guess this happens because one of the 'Mr.' occurrences is at the end of a line.)
orig_text = open('emma.txt', encoding='UTF-8')
lines = orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
    wordsdirty = line.split()
    try:
        print(wordsdirty[wordsdirty.index('Mr.') + 1])
    except ValueError:
        continue
Try this:
text = "When did Mr. Churchill told Mr. James Brown about the fish"
m = [x[0] for x in re.findall('(Mr\.( [A-Z][a-z]*)+)', text)]
You get:
['Mr. Churchill', 'Mr. James Brown']
To solve the line issue, simply read the entire file:
text = file.read()
Then, to count the occurrences (Counter comes from the collections module), simply run:
Counter(m)
Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
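Putting the pieces together, a minimal end-to-end sketch (the file name and encoding come from the question; the [4:] slice drops the leading 'Mr. ' as described):
import re
from collections import Counter

# read the whole novel at once so a 'Mr.' at the end of a line is not lost
text = open('emma.txt', encoding='UTF-8').read()

# keep the full match of each tuple and drop the leading 'Mr. '
names = [x[0][4:] for x in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]
counts = Counter(names)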
This can be done easily using a regex with a capturing group.
Take a look here for reference; in this scenario you might want to do something like:
# retrieve a list of strings that match your regex
matches = re.findall(r"Mr\. ([a-zA-Z]+)", your_entire_file)  # not sure about the regex
# then create a dictionary and count the occurrences of each match
# if you are allowed to use modules, this can be done using Counter
Counter(matches)
To access the entire file like that, you might want to map it to memory; take a look at this question.
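A sketch of the memory-mapping idea (file name taken from the earlier question; note that a memory-mapped file behaves like bytes, so the pattern must be a bytes literal):
import mmap
import re
from collections import Counter

with open('emma.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # re can search a memory-mapped file directly with a bytes pattern
    matches = re.findall(rb"Mr\. ([a-zA-Z]+)", mm)

counts = Counter(matches)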
I have a pandas dataframe called df. It has a column called article. The article column contains 600 strings, each of the strings represent a news article.
I want to only KEEP those articles whose first four sentences contain the keywords "COVID-19" AND ("China" OR "Chinese"), but I'm unable to find a way to do this on my own.
(in the string, sentences are separated by \n. An example article looks like this:)
\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission.\ .......
First we define a function to return a boolean based on whether your keywords appear in a given sentence:
def contains_covid_kwds(sentence):
    kw1 = 'COVID-19'
    kw2 = 'China'
    kw3 = 'Chinese'
    return kw1 in sentence and (kw2 in sentence or kw3 in sentence)
Then we create a boolean series by applying this function (using Series.apply) to the articles in your df.article column.
Note that we use a lambda function to truncate the article passed to contains_covid_kwds at the fifth occurrence of '\n', i.e. after your first four sentences (more info on how this works here):
series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))
Then we pass the boolean series to df.loc, in order to select the rows where the series evaluated to True:
filtered_df = df.loc[series]
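To see what the truncation lambda keeps, here is a quick illustration (assuming each article starts with '\n', as in the example):
s = '\none\ntwo\nthree\nfour\nfive'
# replace the first four '\n' with '#' so that find() locates the fifth one
s[:s.replace('\n', '#', 4).find('\n')]   # -> '\none\ntwo\nthree\nfour'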
You can use the pandas apply method and do it the way I did.
string = "\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission."
df = pd.DataFrame({'article':[string]})
def findKeys(string):
    string_list = string.strip().lower().split('\n')
    flag = 0
    keywords = ['china', 'covid-19', 'wuhan']
    # Checking if the article has more than 4 sentences
    if len(string_list) > 4:
        # iterating over the first four sentences
        for i in range(4):
            # iterating over the keywords list
            for key in keywords:
                # checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    # Else block is executed when the article has 4 or fewer sentences
    else:
        # iterating over all sentences in string_list
        for i in range(len(string_list)):
            # iterating over the keywords list
            for key in keywords:
                # checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    if flag == 0:
        return False
    else:
        return True
and then call the pandas apply method on df:
df['Contains Keywords?'] = df['article'].apply(findKeys)
First I create a series which contains just the first four sentences from the original df['article'] column, and convert it to lower case, assuming that searches should be case-insensitive.
articles = df['article'].apply(lambda x: "\n".join(x.split("\n", maxsplit=4)[:4])).str.lower()
Then use a simple boolean mask to filter only those rows where the keywords were found in the first four sentences.
df[(articles.str.contains("covid")) & (articles.str.contains("chinese") | articles.str.contains("china"))]
Here:
found = []
s1 = "hello"
s2 = "good"
s3 = "great"
for string in article:
    if s1 in string and (s2 in string or s3 in string):
        found.append(string)
I have a dataframe named final with a column named CleanedText, which contains user reviews (text). A review spans multiple lines. I have done preprocessing and removed all commas, full stops, HTML tags, etc., so the data looks like Review1 (row 1): pizza extremely delicious delivery late. Just like this, I have 10000 reviews (corresponding to 10000 rows). Now I want a list of lists where every review is a list. Ex: [['Pizza','extremely','delicious','delivery','late'],['Tommatos','rotten'......[]...[]].
This assumes you've truly stripped the text of all of the 'fun' stuff. Give this a shot.
fulltext = 'stuff with many\nlines and words'
text_array = [line.split() for line in fulltext.splitlines()]
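Applied to the dataframe from the question, a minimal sketch (the dataframe and column names come from the question; the sample reviews are made up):
import pandas as pd

final = pd.DataFrame({'CleanedText': ['pizza extremely delicious delivery late',
                                      'tommatos rotten delivery slow']})
# one list of words per review
reviews = [text.split() for text in final['CleanedText']]
# [['pizza', 'extremely', 'delicious', 'delivery', 'late'], ['tommatos', 'rotten', 'delivery', 'slow']]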
I have a list of reviews and a list of words, and I am trying to count how many times each word shows up in each review. The list of keywords is roughly around 30 and could grow/change. The current population of reviews is roughly 5000, with review word counts ranging from 3 to several hundred words. The number of reviews will definitely grow. Right now the keyword list is static and the number of reviews will not grow too much, so any solution to get the counts of keywords in each review will work, but ideally it will be one where there isn't a major performance issue if the number of reviews drastically increases or the keywords change and all the reviews have to be reanalyzed.
I have been reading through different methods on Stack Overflow and haven't been able to get any to work. I know you can use scikit-learn to get the count of each word, but I haven't figured out if there is a way to count a phrase. I have also tried various regex expressions. If the keyword list were all single words, I know I could very easily use scikit-learn, a loop, or regex, but I am having issues when the keyword has multiple words.
Two links I have tried
Python - Check If Word Is In A String
Phrase matching using regex and Python
the solution here is close, but it doesn't count all occurrences of the same word
How to return the count of words from a list of words that appear in a list of lists?
Both the list of keywords and the reviews are being pulled from a MySQL DB. All keywords are in lowercase. All text has been made lowercase, and all non-alphanumeric characters except spaces have been stripped from the reviews. My original thought was to use scikit-learn's CountVectorizer to count the words, but not knowing how to handle counting a phrase, I switched. I am currently attempting it with loops and regex, but I am open to any solution.
# Example of what I am currently attempting with regex
keywords = ['test','blue sky','grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing','this pharse contains test and blue sky and look another test','the grass is green test']
for review in reviews:
    for word in keywords:
        results = re.findall(r'\bword\b', review)  # this returns no results, the variable word is not getting picked up
        # --also tried variations of this to no avail
        # --tried creating the pattern first and passing it
        # pattern = "r'\\b" + word + "\\b'"
        # results = re.findall(pattern, review)  # this errors with the msg: sre_constants.error: multiple repeat at position 9
#The results would be
review1: test=2; 'blue sky'=0;'grass is green'=0
review2: test=2; 'blue sky'=1;'grass is green'=0
review3: test=1; 'blue sky'=0;'grass is green'=1
I would first do it by brute force rather than overcomplicating it, and try to optimize later.
keywords = ['test', 'blue sky', 'grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing', 'this pharse contains test and blue sky and look another test', 'the grass is green test']
results = dict()
# accumulate substring counts per keyword across all reviews
for i in keywords:
    for j in reviews:
        results[i] = results.get(i, 0) + j.count(i)
print(results)
> {'test': 6, 'blue sky': 1, 'grass is green': 1}
It's important that we query the dict with .get: in case a key isn't set yet, we don't want to deal with a KeyError exception.
If you want to go the complicated route, you can build your own trie and counter structure to do searches in large text files.
Parsing one terabyte of text and efficiently counting the number of occurrences of each word
None of the options you tried searches for the value of word:
re.findall(r'\bword\b', review) checks for the literal word "word" in the string.
When you try pattern = "r'\\b" + word + "\\b'" you search for the string "r'\b[value of word]\b'" - the r and the quotes end up inside the pattern itself.
You can use the first option, but the pattern should be r'\b%s\b' % word. That will search for the value of word.
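Applying that fix to the loop from the question, a sketch that reproduces the expected per-review counts (re.escape is an extra safeguard in case a keyword ever contains regex metacharacters):
import re

keywords = ['test', 'blue sky', 'grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing',
           'this pharse contains test and blue sky and look another test',
           'the grass is green test']

for i, review in enumerate(reviews, 1):
    # count whole-word/phrase matches only, so 'testing' does not count as 'test'
    counts = {kw: len(re.findall(r'\b%s\b' % re.escape(kw), review)) for kw in keywords}
    print('review%d: %s' % (i, counts))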
I have a source.txt file consisting of words. Each word is on a new line.
apple
tree
bee
go
apple
see
I also have a target_words.txt file, where the words are also one per line.
apple
bee
house
garden
eat
Now I have to search for each of the target words in the source file. If a target word is found, e.g. apple, a dictionary entry for the target word and each of the 3 preceding and 3 following words should be made. In the example case, that would be
words_dict = {'apple':'tree', 'apple':'bee', 'apple':'go'}
How can I tell Python, when creating and populating the dictionary, to consider the 3 words before and after each occurrence in the source file?
My idea was to use lists, but ideally the code should be very efficient and fast, as the files consist of some millions of words. I guess that with lists the computation would be very slow.
from collections import defaultdict
words_occ = {}
defaultdict = defaultdict(words_occ)
with open('source.txt') as s_file, open('target_words.txt') as t_file:
    for line in t_file:
        keys = [line.split()]
    lines = s_file.readlines()
    for line in lines:
        s_words = line.strip()
        # if key is found in s_words
        # look at the 1st, 2nd, 3rd word before and after
        # create a key, value entry for each of them
Later, I have to count the occurrences of each key-value pair and add the numbers to a separate dictionary, which is why I started with a defaultdict.
I would be glad about any suggestion for the above code.
The first issue you will face stems from a misunderstanding of dicts: each key can occur only once, so if you ask the interpreter to evaluate the literal you wrote, you might get a surprise:
>>> {'apple':'tree', 'apple':'bee', 'apple':'go'}
{'apple': 'go'}
The problem is that there can only be one value associated with the key 'apple'.
You appear to be searching for suitable data structures, but StackOverflow is for improving or fixing problematic code.
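That said, the usual way around the duplicate-key problem is a dict of lists; a minimal sketch (file names taken from the question, using the 3-word window from the requirements):
from collections import defaultdict

with open('target_words.txt') as t_file:
    targets = set(t_file.read().split())

with open('source.txt') as s_file:
    words = s_file.read().split()

# map each target word to every word within 3 positions of an occurrence
context = defaultdict(list)
for i, word in enumerate(words):
    if word in targets:
        lo, hi = max(0, i - 3), min(len(words), i + 4)
        context[word].extend(words[lo:i] + words[i + 1:hi])
For the example files, context['apple'] begins with ['tree', 'bee', 'go'], matching the intent of the example.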