A solution to remove the duplicates? - python

My code is below. Basically, I've got a CSV file and a text file "input.txt". I'm trying to create a Python application which will take the input from "input.txt" and search through the CSV file for a match and if a match is found, then it should return the first column of the CSV file.
import csv
csv_file = csv.reader(open('some_csv_file.csv', 'r'), delimiter = ",")
header = csv_file.next()
data = list(csv_file)
input_file = open("input.txt", "r")
lines = input_file.readlines()
for row in lines:
inputs = row.strip().split(" ")
for input in inputs:
input = input.lower()
for row in data:
if any(input in terms.lower() for terms in row):
print row[0]
Say my CSV file looks like this:
book title, author
The Rock, Herry Putter
Business Economics, Herry Putter
Yogurt, Daniel Putter
Short Story, Rick Pan
And say my input.txt looks like this:
Herry
Putter
Therefore when I run my program, it prints:
The Rock
Business Economics
The Rock
Business Economics
Yogurt
This is because it searches for all titles with "Herry" first, and then searches all over again for "Putter". So in the end, I have duplicates of the book titles. I'm trying to figure out a way to remove them...so if anyone can help, that would be greatly appreciated.

If original order does not matter, then stick the results into a set first, and then print them out at the end. But, your example is small enough where speed does not matter that much.

Stick the results in a set (which is like a list but only contains unique elements), and print at the end.
Something like;
if any(input in terms.lower() for terms in row):
if not row[0] in my_set:
my_set.add(row[0])

During the search stick results into a list, and only add new results to the list after first searching the list to see if the result is already there. Then after the search is done print the list.

First, get the set of search terms you want to look for in a single list. We use set(...) here to eliminate duplicate search terms:
search_terms = set(open("input.txt", "r").read().lower().split())
Next, iterate over the rows in the data table, selecting each one that matches the search terms. Here, I'm preserving the behavior of the original code, in that we search for the case-normalized search term in any column for each row. If you just wanted to search e.g. the author column, then this would need to be tweaked:
results = [row for row in data
if any(search_term in item.lower()
for item in row
for search_term in search_terms)]
Finally, print the results.
for row in results:
print row[0]
If you wanted, you could also list the authors or any other info in the table. E.g.:
for row in results:
print '%30s (by %s)' % (row[0], row[1])

Related

Comparing csv files in python to see what is in both

I have 2 csv files that I want to compare one of which is a master file of all the countries and then another one that has only a few countries. This is an attempt I made for some rudimentary testing:
char = {}
with open('all.csv', 'rb') as lookupfile:
for number, line in enumerate(lookupfile):
chars[line.strip()] = number
with open('locations.csv') as textfile:
text = textfile.read()
print text
for char in text:
if char in chars:
print("Country found {0} found in row {1}".format(char, chars[char]))
I am trying to get a final output of the master file of countries with a secondary column indicating if it came up in the other list
Thanks !
Try this:
Write a function to turn the CSV into a Python dictionary containing as keys each of the country you found in the CSV. It can just look like this:
{'US':True, 'UK':True}
Do this for both CSV files.
Now, iterate over the dictionary.keys() for the csv you're comparing against, and just check to see if the other dictionary has the same key.
This will be an extremely fast algorithm because dictionaries give us constant time lookup, and you have a data structure which you can easily use to see which countries you found.
As Eric mentioned in comments, you can also use set membership to handle this. This may actually be the simpler, better way to do this:
set1 = set() # A new empty set
set1.add("country")
if country in set:
#do something
You could use exactly the same logic as the original loop:
with open('locations.csv') as textfile:
for line in textfile:
if char.strip() in chars:
print("Country found {0} found in row {1}".format(char, chars[char]))

dissociate firstname/surname by comparing

I've a CSV dataset with 2000 rows, with a messy column about first name/surname. In this column I need to dissociate first names and surnames. For that, I've a base with all surnames given in France in the twenty last years.
So, the source database looks like :
"name"; "town"
"Johnny Aaaaaa"; "Bordeaux"
"Bbbb Tom";"Paris"
"Ccccc Pierre Dddd" ; "Lyon"
...
I want obtain something like :
"surname"; "firstname"; "town"
"Aaaaaa"; "Johnny "; "Bordeaux"
"Bbbb"; "Tom"; "Paris"
"Ccccc Dddd" ; "Pierre"; "Lyon"
...
And, my reference database of first names :
"firstname"; "sex"
"Andre"; "M"
"Bob"; "M"
"Johnny"; "M"
...
Technically, I must compare each row from the first base with each field from the second base, in order to identify which character chain correspond to the first name...
I have no idea about the way to do that.
Any ideas are welcome... thanks.
Looks like you want to
Read the data from file say input.csv
Extract the name and split it into first name and last name
Get the sex using first name
And probably write the data again to a new csv or print it.
You can follow the approach as below. You can get more sophisticated in splitting using regex but here is something basic using strip commands:
inFile=open('input.csv','r')
rows=inFile.readlines()
newData=[]
if len(rows) > 1:
for row in rows[1:]:
#Remove the new line chars at the end of line and split on ;
data=row.rstrip('\n').split(';')
#Remove additional spaces in your data
name=data[0].strip()
#Get rid of quotes
name=name.strip('"').split(' ')
fname=name[1]
lname=name[0]
city=data[1].strip()
city=city.strip('"')
#Now you can get the sex info from your other database save this in a list to get the sex info later
sex='M' #replace this with your db calls
newData.append([fname, lname, sex, city])
inFile.close()
#You can put all of this in the new csv file by something like this (it seperates the fileds using comma):
outFile=open('otput.csv','w')
for row in newData:
outFile.write(','.join(row))
outFile.write('\n')
outFile.close(
Well. Finally, I chose the "brute force" approach : each term of each line is compared with the 11.000 keys of my second base (converted in a dictionnary). Not smart, but efficient.
for row in input:
splitted = row[0].lower().split()
for s in splitted :
for cle, valeur in dict.items() :
if cle == s :
print ("{} >> {}".format(cle, valeur))
All ideas about more pretty solutions are still welcome.

Get Attribute From Text File in Python

So I'm making a Yu-Gi-Oh database program. I have all the information stored in a large text file. Each monster is chategorized in the following way:
|Name|NUM 1|DESC 1|TYPE|LOCATION|STARS|ATK|DEF|DESCRIPTION
Here's an actual example:
|A Feather of the Phoenix|37;29;18|FET;YSDS;CP03|Spell Card}Spell||||Discard 1 card. Select from your Graveyard and return it to the top of your Deck.|
So I made a program that searches this large text file by name and it returns the information from the text file without the '|'. Here it is:
with open('TEXT.txt') as fd:
input=[x.strip('|').split('|') for x in fd.readlines()]
to_search={x[0]:x for x in input}
print('\n'.join(to_search[name]))
Now I'm trying to edit my program so I can search for the name of the monster and choose which attribute I want to display. So it'd appear like
A Feather of the Phoenix
Description:
Discard 1 card. Select from your Graveyard and return it to the top of your Deck.
Any clues as to how I can do this?
First, this is a variant dialect of CSV, and can be parsed with the csv module instead of trying to do it manually. For example:
with open('TEXT.txt') as fd:
rows = csv.reader(fd, delimiter='|')
to_search = {row[1]:row for row in rows}
print('\n'.join(to_search[name]))
You might also prefer to use DictReader, so each row is a dict (keyed off the names in the header row, or manually-specified column names if you don't have one):
with open('TEXT.txt') as fd:
rows = csv.DictReader(fd, delimiter='|')
to_search = {row['Name']:row for row in rows}
print('\n'.join(to_search[name]))
Then, to select a specific attribute:
with open('TEXT.txt') as fd:
rows = csv.DictReader(fd, delimiter='|')
to_search = {row['Name']:row for row in rows}
print(to_search[name][attribute])
However… I'm not sure this is a good design in the first place. Do you really want to re-read the entire file for each lookup? I think it makes more sense to read it into memory once, into a general-purpose structure that you can use repeatedly. And in fact, you've almost got such a structure:
with open('TEXT.txt') as fd:
monsters = list(csv.DictReader(fd, delimiter='|'))
monsters_by_name = {monster['Name']: monster for monster in monsters}
Then you can build additional indexes, like a multi-map of monsters by location, etc., if you need them.
All this being said, your original code can almost handle what you want already. to_search[name] is a list. If you just build a map from attribute names to indices, you can do this:
attributes = ['Name', 'NUM 1', 'DESC 1', 'TYPE', 'LOCATION', 'STARS', 'ATK', 'DEF', 'DESCRIPTION']
attributes_by_name = {value: idx for idx, value in enumerate(attributes)}
# ...
with open('TEXT.txt') as fd:
input=[x.strip('|').split('|') for x in fd.readlines()]
to_search={x[0]:x for x in input}
attribute_index = attributes_by_name[attributes]
print(to_search[name][attribute_index])
You could look at the namedtuple class in collections. You will want to make each entry a namedtuple with your fields as attributes. The namedtuple might look like:
Card = namedtuple('Card', 'name, number, description, whatever_else')
As shown in the collections documentation, namedtuple and csv work well together:
import csv
for card in map(Card._make, csv.reader(open("cards", "rb"))):
print card.name, card.description # format however you want here
The mechanics around search can be very complicated. For example, if you want a really fast search built around an exact match, you could build a dictionary for each attribute you're interested in:
name_map = {card.name: card for card in all_cards}
search_result = name_map[name_you_searched_for]
You could also do a startswith search:
possibles = [card for card in all_cards if card.name.startswith(search_string)]
# here you need to decide what to do with these possibles, in this example, I'm just snagging the first one, and I'm not handling the possibility that you don't find one, you should.
search_result = possibles[0]
I recommend against trying to search the file itself. This is an extremely complex kind of search to do and is typically left up to database systems to implement this kind of functionality. If you need to do this, consider switching the application to sqlite or another lightweight database.

How to Store Rows from CSV File into Python and Print Data with HTML

Basically my problem is this: I have a CSV excel file with info on Southpark characters and I and I have an HTML template and what I have to do is take the data by rows (stored in lists) for each character and using the HTML template given implement that data to create 5 seperate HTML pages with the characters last names.
Here is an image of the CSV file: i.imgur.com/rcIPW.png
This is what I have so far:
askfile = raw_input("What is the filename?")
southpark = []
filename = open(askfile, 'rU')
for row in filename:
print row[0:105]
filename.close()
The above prints out all the info on the IDLE shell in five rows but I have to find a way to separate each row AND column and store it into a list (which I don't know how to do). It's pretty rudimentary code I know I'm trying to figure out a way to store the rows and columns first, then I will have to use a function (def) to first assign the data to the HTML template and then create an HTML file from that data/template..and I'm so far a noob I tried searching through the net but I just don't understand the stuff.
I am not allowed to use any downloadable modules but I can use things built in Python like import csv or whatnot, but really its supposed to be written with a couple functions, list, strings, and loops..
Once I figure out how to separate the rows and columns and store them then I can work on implementing into HTML template and creating the file.
I'm not trying to have my HW done for me it's just that I pretty much suck at programming so any help is appreciated!
BTW I am using Python 2.7.2 and if you want to DL the CSV file click here.
UPDATE:
Okay, thanks a lot! That helped me understand what each row was printing and what info is being read by the program. Now since I have to use functions in this program somehow this is what I was thinking.
Each row (0-6) prints out separate values, but just the print row function prints out one character and all his corresponding values which is what I need. What I want is to print out data like "print row" would but I have to store each of those 5 characters in a separate list.
Basically "print row" prints out all 5 characters with each of their corresponding attributes, how can I split each of them into 5 variables and store them as a list?
When I do print row[0] it only prints out the names, or print row1 only prints the DOB. I was thinking of creating a def function that takes only print "row" and splits into 5 variables in a loop and then another def function takes those variables/lists of data and combines them with the HTML template, and at the end I have to figure out how to create HTML files in Python..
Sorry if I sound confusing just trying to make sense of it all. This is my code right now it gives an error that there are too many values to unpack but I am just trying to fiddle around and try different things and see if they work. Based on what I wanted to do above I will probably have to delete all most of this code and find a way to rewrite it with list type functions like .append or .strip, etc which I am not very familiar with..
import csv
original = file('southpark.csv', 'rU')
reader = csv.reader(original)
# List of Data
name, dob, descript, phrase, personality, character, apparel = []
count = 0
def southparkinfo():
for row in reader:
count += 1
if count == 0:
row[0] = name
print row[0] # Name (ex. Stan Marsh)
print "----------------"
elif count == 1:
row[1] = dob
print row[1] # DOB
print "----------------"
elif count == 2:
row[2] = descript
print row[2] # Descriptive saying (ex. Respect My Authoritah!)
print "----------------"
elif count == 3:
row[3] = phrase
print row[3] # Catch Phrase (ex. Mooom!)
print "----------------"
elif count == 4:
row[4] = personality
print row[4] # Personality (ex. Jewish)
print "----------------"
elif count == 5:
row[5] = character
print row[5] # Characteristic (ex. Politically incorrect)
print "----------------"
elif count == 6:
row[6] = apparel
print row[6] # Apparel (ex. red gloves)
return
reader.close()
First and foremost, have a look at the CSV docs.
Once you understand the basics take a look at this code. This should get you started on the right path:
import csv
original = file('southpark.csv', 'rU')
reader = csv.reader(original)
for row in reader:
#will print each row by itself (all columns from names up to what they wear)
print row
print "-----------------"
#will print first column (character names only)
print row[0]
You want to import csv module so you can work with the CSV filetype. Open the file in universal newline mode and read it with csv.reader. Then you can use a for loop to begin iterating through the rows depending on what you want. The first print row will print a single line of all a single character's data (ie: everything from their name up to their clothing type) like so:
['Stan Marsh', 'DOB: October 19th', 'Dude!', 'Aww #$%^!', 'Star Quarterback', 'Wendy', 'red gloves']
-----------------
['Kyle Broflovski', 'DOB: May 26th', 'Kick the baby!', 'You ***!', 'Jewish', 'Canadian', 'Ushanka']
-----------------
['Eric Theodore Cartman', 'DOB: July 1', 'Respect My Authroitah!', 'Mooom!', 'Big-boned', 'Political
ly incorrect', 'Knit-cap!']
-----------------
['Kenny McCormick', 'DOB: March 22', 'DOD: Every other week', 'Mmff Mmff', 'MMMFFF!!!', 'Mysterion!'
, 'Orange Parka']
-----------------
['Leopold Butters Stotch', 'DOB:Younger than the others!', 'The 4th friend', 'Professor chaos', 'stu
tter', 'innocent', 'nerdy']
-----------------
Finally, the second statement print row[0] will provide you with the character names only. You can change the number and you'll be able to grab the other data as necessary. Remember, in a CSV file everything starts at 0, so in your case you can only go up to 6 because A=0, B=1, C=2, etc... To see these outputs more clearly, it's probably best if you comment out one of the print statements so you get a clearer picture of what you are grabbing.
-----------------
Stan Marsh
-----------------
Kyle Broflovski
-----------------
Eric Theodore Cartman
-----------------
Kenny McCormick
-----------------
Leopold Butters Stotch
Note I threw in that print "-----------------" so you would be able to see the different outputs.
Hope this helps you get you off to a start.
Edit To answer your second question: The easiest way (although probably not the best way) to grab all of a single character's info would be to do something like this:
import csv
original = file('southpark.csv', 'rU')
reader = csv.reader(original)
stan = reader.next()
kyle = reader.next()
eric = reader.next()
kenny = reader.next()
butters = reader.next()
print eric
which outputs:
['Eric Theodore Cartman', 'DOB: July 1', 'Respect My Authroitah!', 'Mooom!', 'Big-boned', 'Politically incorrect', 'Knit-cap!']
Take note that if your CSV is modified such that the order of the characters are moved (ex: butters is moved to top) you will output the info of another character.

Searching CSV Files (Python)

I've made this CSV file up to play with.. From what I've been told before, I'm pretty sure this CSV file is valid and can be used in this example.
Basically I have this CSV file 'book_list.csv':
name,author,year
Lord of the Rings: The Fellowship of the Ring,J. R. R. Tolkien,1954
Nineteen Eighty-Four,George Orwell,1984
Lord of the Rings: The Return of the King,J. R. R. Tolkien,1954
Animal Farm,George Orwell,1945
Lord of the Rings: The Two Towers, J. R. R. Tolkien, 1954
And I also have this text file 'search_query.txt', whereby I put in keywords or search terms I want to search for in the CSV file:
Lord
Rings
Animal
I've currently come up with some code (with the help of stuff I've read) that allows me to count the number of matching entries. I then have the program write a separate CSV file 'results.csv' which just returns either 'Matching' or ' '.
The program then takes this 'results.csv' file and counts how many 'Matching' results I have and it prints the count.
import csv
import collections
f1 = file('book_list.csv', 'r')
f2 = file('search_query.txt', 'r')
f3 = file('results.csv', 'w')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)
input = [row for row in c2]
for booklist_row in c1:
row = 1
found = False
for input_row in input:
results_row = []
if input_row[0] in booklist_row[0]:
results_row.append('Matching')
found = True
break
row = row + 1
if not found:
results_row.append('')
c3.writerow(results_row)
f1.close()
f2.close()
f3.close()
d = collections.defaultdict(int)
with open("results.csv", "rb") as info:
reader = csv.reader(info)
for row in reader:
for matches in row:
matches = matches.strip()
if matches:
d[matches] += 1
results = [(matches, count) for matches, count in d.iteritems() if count >= 1]
results.sort(key=lambda x: x[1], reverse=True)
for matches, count in results:
print 'There are', count, 'matching results'+'.'
In this case, my output returns:
There are 4 matching results.
I'm sure there is a better way of doing this and avoiding writing a completely separate CSV file.. but this was easier for me to get my head around.
My question is, this code that I've put together only returns how many matching results there are.. how do I modify it in order to return the ACTUAL results as well?
i.e. I want my output to return:
There are 4 matching results.
Lord of the Rings: The Fellowship of the Ring
Lord of the Rings: The Return of the King
Animal Farm
Lord of the Rings: The Two Towers
As I said, I'm sure there's a much easier way to do what I already have.. so some insight would be helpful. :)
Cheers!
EDIT: I just realized that if my keywords were in lower case, it won't work.. is there a way to avoid case-sensitivity?
Throw away the query file and get your search terms from sys.argv[1:] instead.
Throw away your output file and use sys.stdout instead.
Append matched booklist titles to a result_list. The result_row that you currently have has a rather misleading name. The count that you want is len(result_list). Print that. Then print the contents of result_list.
Convert your query words to lowercase once (before you start reading the input file). As you read each book_list row, convert its title to lowercase. Do your your matching with the lowercase query words and the lowercase title.
Overall plan:
Read in the entire book list csv into a dictionary of {title: info}.
Read in the questions csv. For each keyword, filter the dictionary:
[key for key, value in books.items() if "Lord" in key]
say. Do what you will with the results.
If you want, put the results in another csv.
If you want to deal with casing issues, try turning all the titles to lowercase ("FOO".lower()) when you store them in the dictionary.

Categories

Resources