Removing duplicate results from the loop output in Python

My loop generates a series of strings, which are sentences retrieved from a database. The data structure in the database needs to keep the duplicates, but I want to omit them from the output. Assume my loop and its results are as follows:
for text in document:
    print(text)
Output:
He goes to school.
He works here.
we are friends.
He goes to school.
they are leaving us alone.
..........
How can I set up a condition so that the program reads all the generated output and, if it finds duplicate results (e.g. "He goes to school."), shows me only one record instead of multiple similar ones?

already_printed = set()
for text in document:
    if text not in already_printed:
        # first time we see this sentence: print it and remember it
        print(text)
        already_printed.add(text)

You can use a set. Note that a set does not preserve the original order of the sentences:
values = set(document)
for text in values:
    print(text)
Or you can use a list (this keeps the original order, at the cost of a linear membership test):
temp_list = []
for text in document:
    if text not in temp_list:
        temp_list.append(text)
        print(text)
Or you can use a dict (membership tests are constant time; there is no need for .keys() here):
temp_dict = {}
for text in document:
    if text not in temp_dict:
        temp_dict[text] = 1
        print(text)
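As an aside on the dict variant: in Python 3.7+, dicts preserve insertion order, so dict.fromkeys gives the same order-preserving deduplication in one line (a minimal sketch):
# dict.fromkeys keeps only the first occurrence of each sentence, in order
for text in dict.fromkeys(document):
    print(text)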

Split document by '\n', or read it row by row into a list, lowercasing as you go (e.g. arr.append(row.lower()) inside the loop; note the method is lower(), not lowercase()).
arr = list(set(arr)) will then remove the duplicates.
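A minimal sketch of that approach, assuming document is a single newline-separated string:
arr = []
for row in document.split('\n'):
    arr.append(row.lower())
arr = list(set(arr))  # removes duplicates; the original order is lost
for text in arr:
    print(text)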

If case does not matter, you can take a set of the lowercased sentences:
for text in set(i.lower() for i in document):
    print(text)

Use Python's built-in set to remove duplicates:
documents = ["He goes to school", "He works here. we are friends", "He goes to school", "they are leaving us alone"]
print(list(set(documents)))

Related

How to capture output of function in for-loop

I have a for-loop that calls a function. The function iterates over elements in a list and constructs vectors from them. I am able to print out each of these vectors, but since I need to perform operations on them, I need the loop to actually return each of them.
for text in corpus:
    text_vector = get_vector(lexicon, text)
    print(text_vector)
You can store them in a list. Create the list first:
output = []
Then, in the loop, add the values like this:
for text in corpus:
    text_vector = get_vector(lexicon, text)
    print(text_vector)
    output.append(text_vector)
Now you have saved each item from the for-loop in the list.
I think all you need is a list in the enclosing scope; append to it:
vector_texts = []
for text in corpus:
    text_vector = get_vector(lexicon, text)
    vector_texts.append(text_vector)
    print(text_vector)
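Equivalently, a list comprehension collects the same results in one line (a sketch reusing get_vector and lexicon from the question, minus the printing):
vector_texts = [get_vector(lexicon, text) for text in corpus]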

How do I read a file and convert each line into strings and see if any of the strings are in a user input?

What I am trying to do in my program is to have the program open a file with many different words inside it, receive a user input, and check whether any word inside the file appears in the user input.
Inside the file redflags.txt:
happy
angry
ball
jump
each word on a different line.
For example if the user input is "Hey I am a ball" then it will print redflag.
If the user input is "Hey this is a sphere" then it will print noflag.
Redflags = open("redflags.txt")
data = Redflags.read()
text_post = raw_input("Enter text you wish to analyse")
words = text_post.split() and text_post.lower()
if data in words:
    print("redflag")
else:
    print("noflag")
This should do the trick! Set lookups are generally much faster than list lookups, and sets can give you the intersection (overlapping words), differences, and so on. We consume the file that has one word per line, remove newline characters, lowercase, and we have our first list. The second list is created from the user input and split on spacing. Now we can take the set intersection to see if any common words exist.
# read in the words and create list of words with each word on newline
# replace newline chars and lowercase
words = [word.replace('\n', '').lower() for word in open('filepath_to_word_list.txt', 'r').readlines()]
# collect user input and lowercase and split into list
user_input = raw_input('Please enter your words').lower().split()
# use set intersection to find if common words exist and print redflag
if set(words).intersection(set(user_input)):
    print('redflag')
else:
    print('noflag')
with open('redflags.txt', 'r') as f:
    # See this post: https://stackoverflow.com/a/20756176/1141389
    file_words = f.read().splitlines()
# get the user's words, lower case them all, then split
user_words = raw_input('Please enter your words').lower().split()
# use sets to find if any user_words are in file_words
if set(file_words).intersection(set(user_words)):
    print('redflag')
else:
    print('noredflag')
I would suggest you use list comprehensions.
Take a look at this code, which does what you want (I will explain it below):
Redflags = open("redflags.txt")
data = Redflags.readlines()
data = [d.strip() for d in data]
text_post = input("Enter text you wish to analyse:")
text_post = text_post.split()
fin_list = [i for i in data if i in text_post]
if fin_list:
    print("RedFlag")
else:
    print("NoFlag")
Output 1:
Enter text you wish to analyse:i am sad
NoFlag
Output 2:
Enter text you wish to analyse:i am angry
RedFlag
So first, open the file and read it using readlines(); this gives a list of the lines in the file:
>>> data = Redflags.readlines()
['happy\n', 'angry \n', 'ball \n', 'jump\n']
See all those unwanted spaces and newlines (\n)? Use strip() to remove them. But you can't call strip() on a list, so take each item from the list and apply strip() to it. This can be done concisely with a list comprehension.
data = [d.strip() for d in data]
Also, why are you using raw_input()? In Python 3, use input() instead.
After getting and splitting the input text:
fin_list = [i for i in data if i in text_post]
This creates a list of the items i from the data list that are also in the text_post list, so I get the items common to both lists.
>>> fin_list
['angry'] #if my input is "i am angry"
Note: in Python, empty lists are considered false,
>>> bool([])
False
While those with values are considered True,
>>> bool(['some','values'])
True
This way, your if branch only executes if the list is non-empty, meaning 'RedFlag' is printed only when some common item is found between the two lists. You get what you want.

Preprocessing (rstrip, regular expressions, and simpler code)

I'm trying to read 200 txt files and do some preprocessing.
1) How could I write simpler code instead of repeating the same code for each txt file?
2) Can I combine a regular expression with rstrip?
Mainly, I want to get rid of "\n", but sometimes it is stuck to other characters, so what I want is to remove every \n as well as the words that are combined with \n (i.e. "\n?", "!\n", and so on).
3) At the last line, is there a way to gather all the lists into one list with simpler code?
data = open("job (0).txt", 'r').read()
rows0 = data.split(" ")
rows0 = [item.rstrip('\n?, \n') for item in rows0]
data = open("job (1).txt", 'r').read()
rows1 = data.split(" ")
rows1 = [item.rstrip('\n?, \n') for item in rows1]
.....(up to 200th file)
data = open("job (199).txt", 'r').read()
rows199 = data.split(" ")
rows199 = [item.rstrip('\n?, \n') for item in rows199]
ds_l = rows0 + rows1 + ... rows199
First of all, I'm not a Python expert. But since the question has been around for a while already... (at least I'm safe from downvotes if no one looks at this^^)
1) Use loops, and read a programming tutorial.
See for example this post, How do I read a file line-by-line into a list?, on how to get a list of all rows. Then you can loop over the list.
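A minimal sketch of that loop, assuming the files really are named job (0).txt through job (199).txt; this also answers 3), since everything lands in one flat list:
ds_l = []
for i in range(200):
    with open("job ({}).txt".format(i), 'r') as f:
        rows = f.read().split(" ")
        rows = [item.rstrip('\n?, ') for item in rows]
        ds_l += rows  # one flat list across all 200 files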
2) No idea whether it's possible to use a regex with strip; this brought me here, so tell me if you find out.
It's not clear what exactly you are asking for: do you want to get rid of all (space-separated) words that contain any "\n", or just cut the "\n", "\n?", ... parts out of the words?
In the first case, a simple, inelegant solution would be two nested loops, over the rows and over all words in a row, deleting every word that contains a newline (iterating in reverse so the deletions do not shift the remaining indices):
# loop over rows with i as index
row = rows[i].split(" ")
for j in reversed(range(len(row))):
    if "\n" in row[j]:
        del row[j]
rows[i] = " ".join(row)
In the latter case, if there are not too many expressions you want to remove, you can probably use re.sub() somehow. Google helps ;)
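For instance, a hedged sketch that cuts out a newline together with any punctuation directly attached to it (the pattern and the sample text are mine):
import re

text = "great!\n flexible hours\n? good pay"
# a newline plus any adjacent punctuation collapses into a single space
cleaned = re.sub(r'[!?,.]*\n[!?,.]*', ' ', text)
print(cleaned.split())  # ['great', 'flexible', 'hours', 'good', 'pay']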
3) If you have the rows as a list "rows" of strings, you can use join:
ds_l = "".join(rows)
(For join: Python join: why is it string.join(list) instead of list.join(string)?)

Comparing CSV files in Python to see what is in both

I have 2 CSV files that I want to compare: one is a master file of all the countries, and the other has only a few countries. This is an attempt I made at some rudimentary testing:
char = {}
with open('all.csv', 'rb') as lookupfile:
    for number, line in enumerate(lookupfile):
        chars[line.strip()] = number

with open('locations.csv') as textfile:
    text = textfile.read()
    print text

for char in text:
    if char in chars:
        print("Country found {0} found in row {1}".format(char, chars[char]))
I am trying to get, as the final output, the master file of countries with a secondary column indicating whether each one came up in the other list.
Thanks !
Try this:
Write a function to turn the CSV into a Python dictionary containing as keys each of the countries you found in the CSV. It can look like this:
{'US': True, 'UK': True}
Do this for both CSV files.
Now, iterate over dictionary.keys() for the CSV you're comparing against, and just check whether the other dictionary has the same key.
This will be an extremely fast algorithm because dictionaries give us constant time lookup, and you have a data structure which you can easily use to see which countries you found.
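A rough sketch of that idea (the helper name and the assumption that each row of the CSV holds a single country name are mine):
def countries_from_csv(path):
    # build {country: True} from a CSV with one country per row
    result = {}
    with open(path) as f:
        for line in f:
            country = line.strip()
            if country:
                result[country] = True
    return result

master = countries_from_csv('all.csv')
subset = countries_from_csv('locations.csv')

# the master list with a secondary column saying whether each country showed up
for country in master:
    print("{0},{1}".format(country, country in subset))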
As Eric mentioned in comments, you can also use set membership to handle this. This may actually be the simpler, better way to do it:
set1 = set()  # a new empty set
set1.add("country")
if "country" in set1:
    # do something
    pass
You could use exactly the same logic as the original loop:
with open('locations.csv') as textfile:
    for line in textfile:
        country = line.strip()
        if country in chars:
            print("Country found {0} found in row {1}".format(country, chars[country]))

A solution to remove the duplicates?

My code is below. Basically, I've got a CSV file and a text file "input.txt". I'm trying to create a Python application which will take the input from "input.txt" and search through the CSV file for a match and if a match is found, then it should return the first column of the CSV file.
import csv
csv_file = csv.reader(open('some_csv_file.csv', 'r'), delimiter=",")
header = csv_file.next()
data = list(csv_file)

input_file = open("input.txt", "r")
lines = input_file.readlines()

for row in lines:
    inputs = row.strip().split(" ")
    for input in inputs:
        input = input.lower()
        for row in data:
            if any(input in terms.lower() for terms in row):
                print row[0]
Say my CSV file looks like this:
book title, author
The Rock, Herry Putter
Business Economics, Herry Putter
Yogurt, Daniel Putter
Short Story, Rick Pan
And say my input.txt looks like this:
Herry
Putter
Therefore when I run my program, it prints:
The Rock
Business Economics
The Rock
Business Economics
Yogurt
This is because it searches for all titles with "Herry" first, and then searches all over again for "Putter". So in the end, I have duplicates of the book titles. I'm trying to figure out a way to remove them...so if anyone can help, that would be greatly appreciated.
If the original order does not matter, then stick the results into a set first, and print them out at the end. But your example is small enough that speed does not matter much.
Stick the results in a set (which is like a list but only contains unique elements), and print at the end.
Something like:
if any(input in terms.lower() for terms in row):
    if row[0] not in my_set:
        my_set.add(row[0])
(with my_set = set() created before the loops, and the set printed once the search is done)
During the search, stick results into a list, and only add a new result after first checking the list to see whether it is already there. Then, after the search is done, print the list.
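A sketch of that list-based variant, reusing lines and data from the question's code; unlike a set, it preserves the order in which titles are first found:
titles = []
for line in lines:
    for word in line.strip().split(" "):
        word = word.lower()
        for row in data:
            if any(word in terms.lower() for terms in row):
                if row[0] not in titles:  # only add titles we haven't seen
                    titles.append(row[0])

for title in titles:
    print title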
First, gather the search terms you want to look for into a single collection. We use set(...) here to eliminate duplicate search terms:
search_terms = set(open("input.txt", "r").read().lower().split())
Next, iterate over the rows in the data table, selecting each one that matches the search terms. Here, I'm preserving the behavior of the original code, in that we search for the case-normalized search term in any column for each row. If you just wanted to search e.g. the author column, then this would need to be tweaked:
results = [row for row in data
           if any(search_term in item.lower()
                  for item in row
                  for search_term in search_terms)]
Finally, print the results.
for row in results:
    print row[0]
If you wanted, you could also list the authors or any other info in the table. E.g.:
for row in results:
    print '%30s (by %s)' % (row[0], row[1])
