Shorter way to code in Python

I am a Python beginner. The following code does exactly what I want, but it looks a little clumsy because of the three nested for loops. Can somebody show me a smarter/shorter way to achieve it? Maybe a single function, or parallelizing the for loops.
import re
from collections import Counter

def getWordListAndCounts(text):
    words = []
    for t in text:
        for tt in t:
            for ttt in re.split("\s+", str(tt)):
                words.append(str(ttt))
    return Counter(words)
text = [['I like Apple' , 'I also like Google']]
getWordListAndCounts(text)

Firstly, remove the redundant outer list (it removes one level of nesting).
Since there is no need to store a temporary result in a list, a generator is the more efficient approach.
Check this one-line approach:
text = ['I like Apple' , 'I also like Google']
print Counter(str(ttt) for t in text for ttt in (re.split("\s+", str(t))))

Use meaningful variable names. t, tt and ttt don't help make the code readable.
Why not use "for phrase in text" then "for word in phrase"?
Why are you using double-encoded strings? Unless the data is already in this format when you read it, I would suggest not doing this.
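Putting that advice together, here is a minimal sketch (the name get_word_counts is mine; it assumes the input is a flat list of phrases, as in the reduced example above):
import re
from collections import Counter

def get_word_counts(phrases):
    # single generator pass: split each phrase on whitespace and feed the words to Counter
    return Counter(word for phrase in phrases for word in re.split(r"\s+", phrase))

text = ['I like Apple', 'I also like Google']
print get_word_counts(text)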

import re
from collections import Counter
def getWordListAndCounts(text):
    return Counter(re.split('\s+', str([' '.join(x) for x in text][0])))
text = [['I like Apple' , 'I also like Google']]
print getWordListAndCounts(text)

Related

What is the most efficient way to grab and store part of a given string between keywords with python?

I have an array of keywords:
keyword_list = ['word1', 'anotherWord', 'wordup', 'word to your papa']
I have a string of text:
string_of_text = 'So this is a string of text. I want to talk about anotherWord...and then I\'m going to say something I\'ve been meaning to say "wordup". But I also wanted to say the following: word to your papa. And lastly I wanted to talk about word1...'
I want to return the following:
{'list_word': 'word1', 'string_of_text_after': '...'}, {'list_word': 'anotherWord', 'string_of_text_after': '...and then I\'m going to say something I\'ve been meaning to say "'}, {'list_word': 'wordup', 'string_of_text_after': '". But I also wanted to say the following: '}, {'list_word': 'word to your papa', 'string_of_text_after': '. And lastly I wanted to talk about '}
As you can see, it is a list of dictionaries containing each list word and the text that comes after it, but only up to the point where the next list word is detected.
What would be the most efficient way to do this in Python (Python 3 or later; 2 is also OK if there are any issues with deprecated methods)?
You could try something like this:
keyword_list = ['word1', 'anotherWord', 'wordup', 'word to your papa']
string_of_text = """So this is a string of text. I want to talk about anotherWord...\
and then I'm going to say something I've been meaning to say "wordup".\
But I also wanted to say the following: word to your papa.\
And lastly I wanted to talk about word1..."""
def t(k, t):
    tmp = {i: len(i) for i in k}  # map each keyword to its length
    return [{"list_word": i, "string_of_text_after": t[t.find(i) + tmp[i]:]}
            for i in tmp if t.find(i) > 0]
from pprint import pprint
pprint(t(keyword_list,string_of_text))
Result:
[{'list_word': 'wordup',
  'string_of_text_after': '". But I also wanted to say the following: word to your papa. And lastly I wanted to talk about word1...'},
 {'list_word': 'word1', 'string_of_text_after': '...'},
 {'list_word': 'anotherWord',
  'string_of_text_after': '... and then I\'m going to say something I\'ve been meaning to say "wordup". But I also wanted to say the following: word to your papa. And lastly I wanted to talk about word1...'},
 {'list_word': 'word to your papa',
  'string_of_text_after': '. And lastly I wanted to talk about word1...'}]
ATTENTION
This code has several implications:
the keyword_list has to contain unique elements
the call t.find(i) is made twice per keyword
the function returns a list, which must be held in memory; this could be avoided by returning a generator instead:
return ({"list_word": i, "string_of_text_after": t[t.find(i) + tmp[i]:]} for i in tmp if t.find(i) > 0)
and consuming it where and when needed.
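A small variant that avoids the doubled find() call (a sketch with names of my choosing, otherwise the same semantics):
def after_keywords(keywords, text):
    results = []
    for kw in set(keywords):      # de-duplicate the keyword list, as noted above
        pos = text.find(kw)       # call find() only once per keyword
        if pos > 0:               # note: a keyword at position 0 is skipped, as in the original
            results.append({"list_word": kw,
                            "string_of_text_after": text[pos + len(kw):]})
    return results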
Good luck! :)

How to extract non-matching text in two documents

Let's say I have two strings.
a = 'I am Sam. I love cooking.'
b = 'I am sam. I used to drink a lot.'
I am calculating their similarity score using:
from difflib import SequenceMatcher
s = SequenceMatcher(lambda x: x == " ",a,b)
print s.ratio()
Now I want to print non-matching sentences in both strings. Like this
a = 'I love cooking.'
b = 'I used to drink a lot.'
Any suggestions on what module or approach I can use to do that? I saw the difflib module (https://pymotw.com/2/difflib/), but it prints output annotated with (+, -, !, ...) prefixes, and I don't want output in that format.
It is a very simple script, but I hope it gives you an idea of how to do it:
a = 'I am Sam. I love cooking.'
b = 'I am sam. I used to drink a lot.'
a = a.split('.')
b = b.split('.')
l = min(len(a), len(b))  # compare only up to the shorter list
c = 0
while c < l:
    # case-insensitive comparison; print b's sentence where the two differ
    if a[c].upper() != b[c].upper():
        print b[c] + '.'
    c = c + 1
Use difflib. You can easily post-process the output of difflib.Differ to strip off the first two characters of each unit and convert them to any format you want. Or you can work with the alignments returned by SequenceMatcher.get_matching_blocks, and generate your own output.
Here's how you might do it. If that's not what you want, edit your question to provide a less simplistic example of comparison and the output format you need.
differ = difflib.Differ()
for line in differ.compare(list1, list2):
    if line.startswith("-"):
        print("a=" + line[2:])
    elif line.startswith("+"):
        print("b=" + line[2:])
    # else: the line is common to both; just ignore it
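For the strings from the question, list1 and list2 could be built by splitting on sentence boundaries (a sketch; the exact splitting rule is an assumption):
import difflib

a = 'I am Sam. I love cooking.'
b = 'I am sam. I used to drink a lot.'
list1 = [s.strip() + '.' for s in a.split('.') if s.strip()]
list2 = [s.strip() + '.' for s in b.split('.') if s.strip()]
Note that Differ is case-sensitive, so 'I am Sam.' and 'I am sam.' are also reported as a difference here.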

Python string grouping?

Basically, I print a long message, but I want to group all of those words into 5-character-long strings.
For example "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner." I want to make that
"iPhon 6isn' tsimp lybig ger-i t'sbe terri never yway. Large r,yet drama tical lythi nner. "
As suggested by @vaultah, this is achieved by splitting the string on spaces and joining the pieces back without spaces, then using a slice operation at a fixed stride to build the chunks. An elegant solution is to use a comprehension.
text = "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner."
joined_text = ''.join(text.split())
splitted_to_six = [joined_text[char:char+6] for char in range(0,len(joined_text),6)]
' '.join(splitted_to_six)
I'm sure you can use the re module to get back dashes and apostrophes as they're meant to be
Simply do the following.
import re

sentence = "iPhone 6 isn't simply bigger - it's better in every way. Larger, yet dramatically thinner."
sentence = re.sub(' ', '', sentence)  # drop all spaces
count = 0
new_sentence = ''
for i in sentence:
    if count % 5 == 0 and count != 0:
        new_sentence = new_sentence + ' '  # insert a space after every 5th character
    new_sentence = new_sentence + i
    count = count + 1
print new_sentence
Output:
iPhon e6isn 'tsim plybi gger- it'sb etter ineve ryway .Larg er,ye tdram atica llyth inner .
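For what it's worth, the chunking step can also lean on the standard library's textwrap module (a sketch; it relies on wrap() breaking a space-free string into pieces of the given width):
import textwrap

sentence = "iPhone 6 isn't simply bigger - it's better in every way. Larger, yet dramatically thinner."
joined = ''.join(sentence.split())        # remove all whitespace first
print ' '.join(textwrap.wrap(joined, 5))  # wrap() slices the one long "word" into 5-char pieces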

How to find set of most frequently occurring word-pairs in a file using python?

I have a data set as follows:
"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"
and so on
I want to find out the most frequently occurring word-pairs e.g.
(Statistics,Estimation:2)
(Statistics,Narnia:2)
(Narnia,Statistics)
(MachineLearning,AI:3)
The two words could be in any order and at any distance from each other.
Can someone suggest a possible solution in Python? This is a very large data set.
Any suggestion is highly appreciated.
So this is what I tried, after the suggestions from @275365, with the input read from a file:
def collect_pairs(file):
    pair_counter = Counter()
    for line in open(file):
        unique_tokens = sorted(set(line))
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    print pair_counter

file = 'myfileComb.txt'
p = collect_pairs(file)
The text file has the same number of lines as the original one, but each line holds only the unique tokens. When I run this, it splits the words into letters rather than producing combinations of words. I don't know where I am making a mistake.
You might start with something like this, depending on how large your corpus is:
>>> from itertools import combinations
>>> from collections import Counter
>>> def collect_pairs(lines):
        pair_counter = Counter()
        for line in lines:
            unique_tokens = sorted(set(line))  # exclude duplicates in same line and sort to ensure one word is always before the other
            combos = combinations(unique_tokens, 2)
            pair_counter += Counter(combos)
        return pair_counter
The result:
>>> t2 = [['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'], ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'], ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'], ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'], ['136', 'MachineLearning', 'AI', 'TopGear']]
>>> pairs = collect_pairs(t2)
>>> pairs.most_common(3)
[(('MachineLearning', 'AI'), 3), (('717', 'I like Sheen'), 2), (('Statistics', 'Estimation'), 2)]
Do you want numbers included in these combinations or not? Since you didn't specifically mention excluding them, I have included them here.
EDIT: Working with a file object
The function that you posted as your first attempt above is very close to working. The only thing you need to do is change each line (which is a string) into a tuple or list. Assuming your data looks exactly like the data you posted above (with quotation marks around each term and commas separating the terms), I would suggest a simple fix: you can use ast.literal_eval. (Otherwise, you might need to use a regular expression of some kind.) See below for a modified version with ast.literal_eval:
from itertools import combinations
from collections import Counter
import ast
def collect_pairs(file_name):
    pair_counter = Counter()
    for line in open(file_name):  # each of these lines is simply one long string; you need a list or tuple
        unique_tokens = sorted(set(ast.literal_eval(line)))  # literal_eval converts each line into a tuple before the tuple is converted to a set
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    return pair_counter  # return the actual Counter object
Now you can test it like this:
file_name = 'myfileComb.txt'
p = collect_pairs(file_name)
print p.most_common(10) # for example
There is not much you can do except count all pairs.
Obvious optimizations are to remove duplicate words and synonyms early, to perform stemming (anything that reduces the number of distinct tokens is good!), and to only count pairs (a, b) where a < b (in your example, count either statistics,narnia or narnia,statistics, but not both!).
If you run out of memory, perform two passes. In the first pass, use one or multiple hash functions to obtain a candidate filter. In the second pass, only count words that pass this filter (MinHash/LSH-style filtering).
It is an embarrassingly parallel problem, so it is also easy to distribute to multiple threads or computers.
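As a rough illustration of that last point, the per-line counting can be split across worker processes and the partial Counters merged at the end (a sketch with hypothetical names; it assumes the same quoted, comma-separated line format as above):
from multiprocessing import Pool
from collections import Counter
from itertools import combinations
import ast

def count_chunk(lines):
    # count pairs within one chunk of lines, exactly as in the answer above
    counter = Counter()
    for line in lines:
        tokens = sorted(set(ast.literal_eval(line)))
        counter += Counter(combinations(tokens, 2))
    return counter

def parallel_pairs(all_lines, workers=4):
    size = len(all_lines) // workers + 1
    chunks = [all_lines[i:i + size] for i in range(0, len(all_lines), size)]
    pool = Pool(workers)
    return sum(pool.map(count_chunk, chunks), Counter())  # merge the partial counts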

strip sides of a string in python

I have a list like this:
Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]
I want to strip the unwanted characters using python so the list would look like:
Tomato
Populus trichocarpa
I can do the following for the first one:
name = ">Tomato4439"
name = name.strip(">1234567890")
print name
Tomato
However, I am not sure what to do with the second one. Any suggestion would be appreciated.
Given:
s='Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]'
this:
s = s.split()
[s[0].strip('0123456789,'), s[-2].replace('[',''), s[-1].replace(']','')]
will give you
['Tomato', 'Populus', 'trichocarpa']
It might be worth investigating regular expressions if you are going to do this frequently, especially if the "rules" might not be static, since regular expressions are much more flexible in dealing with the data in that case. For the sample problem you present, though, this will work.
import re
a = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
re.sub(r"^([A-Za-z]+).+\[([^]]+)\]$", r"\1 \2", a)
This gives
'Tomato Populus trichocarpa'
If the strings you're trying to parse are consistent semantically, then your best option might be classifying the different "types" of strings you have, and then creating regular expressions to parse them using python's re module.
>>> import re
>>> line = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
>>> match = re.match("^([a-zA-Z]+).*\[([a-zA-Z ]+)\].*",line)
>>> match.groups()
('Tomato', 'Populus trichocarpa')
Edited to not include the [] in the second group. This should work for anything that matches the pattern of your query (e.g. starts with a name, ends with something in []). For example, it would also match
"Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa apples]"
The previous answers are simpler than mine, but here is one way to print the stuff that you don't want:
tag = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
import re, os
find = re.search('>(.+?) \[', tag).group(1)
print find
Gives you
gi|224089052|ref|XP_002308615.1| predicted protein
Then you can use the replace function to remove that from the original string, and the translate function to remove the extra unwanted characters.
tag2 = tag.replace(find, "")
tag3 = str.translate(tag2, None, ">[],")
print tag3
Gives you
Tomato4439 Populus trichocarpa
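Note that the two-extra-arguments form of translate used above is Python 2 only; on Python 3 the equivalent would be:
tag3 = tag2.translate(str.maketrans('', '', '>[],'))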
