Advanced comparison of two or more CSV files - Python

I'm just learning Python, and as everyone knows, the best way to learn is practice ;)
And now I have a task at work that I want to try to do in Python, but I need some advice.
Well... I have a few CSV files. The structure looks like:
1st CSV:
workerID, workerName, workerPhoneNumber
The 2nd and the other CSVs contain a subset of this first set.
I mean, in the first file there are, for example, 10,000 employees, and each of the other files contains a subset of those same employees.
For example:
in the first file, I have
00001 Randal 555555
00002 Tom 66666
00003 Anthony 77775
00004 Mark 3424435
00005 Anna 3443223
00006 Monica 412415415
.....
in second file:
00001 Randal 555555
00004 Mark 3424435
00006 Monica 412415415
....
and 3th file:
00001 Randal 555555
00004 Mark 3424435
00005 Anna 3443223
....
I have to check the validity of all users across all files. I mean: check that Anna has the same ID and phone number in every file she appears in, and the same for everyone else (these are huge files, around 100k rows). Then I want to return all mismatches.
An additional problem is that some rows contain "NA" values.
I've just finished a numpy tutorial, but I don't know how to approach this. I don't even know whether numpy is a good fit here. So I need your advice... how can I handle this problem?
EDIT: Workers have unique names :) It's actually a random string, not a name :D just an example :D Within a single file, IDs are unique too.

The use of standard functions and data structures will be enough.
Let's represent your files as lists of dictionaries using list comprehensions:
header = ('id', 'name', 'phone_number')
records_1 = [{k: v for k, v in zip(header, line.strip().split())} for line in open('path_to_file1', 'r')]
records_2 = [{k: v for k, v in zip(header, line.strip().split())} for line in open('path_to_file2', 'r')]
Then, if you want to check your records based on the user name, use a dictionary with the name as key and the record as value:
records_1 = {rec['name']: rec for rec in records_1}
records_2 = {rec['name']: rec for rec in records_2}
and check, for each name, whether the IDs agree. If they don't, save the mismatch to the output:
seen = set()
output = []
for records, others in [(records_1, records_2), (records_2, records_1)]:
    for name, rec in records.items():
        if name in seen or name not in others:
            continue
        seen.add(name)
        if rec['id'] != others[name]['id']:
            output.append((name, rec, others[name]))
Note that we could build the list of (records, others) pairs using permutations from itertools:
https://docs.python.org/3/library/itertools.html
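For example, a minimal sketch of the generalisation to any number of files (record_dicts is a name I'm introducing here; it collects the per-file name-keyed dicts built above):

from itertools import permutations

record_dicts = [records_1, records_2]   # extend with records_3, ... as needed
seen = set()
output = []
for records, others in permutations(record_dicts, 2):
    for name, rec in records.items():
        if name in seen or name not in others:
            continue
        if rec['id'] != others[name]['id'] or rec['phone_number'] != others[name]['phone_number']:
            # report each mismatched name at most once
            output.append((name, rec, others[name]))
            seen.add(name)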
Hope this helps!

Related

How to check how close a number is to another number?

I have a TEXT FILE that looks like:
John: 27
Micheal8483: 160
Mary Smith: 57
Adam 22: 68
Patty: 55
etc. They are usernames, which is why the names occasionally contain numbers. What I want to do is check each of their numbers (the ones after the ":") and get the 3 names whose numbers are closest in value to an integer (specifically named targetNum). It will always be positive.
I have tried multiple things but I am new to Python and I am not really sure how to go about this problem. Any help is appreciated!
You can parse the file into a list of name/number pairs. Then sort the list by difference between a number and targetNum. The first three items of the list will then contain the desired names:
users = []
with open("file.txt") as f:
    for line in f:
        name, num = line.split(":")
        users.append((name, int(num)))

targetNum = 50
users.sort(key=lambda pair: abs(pair[1] - targetNum))
print([pair[0] for pair in users[:3]])  # ['Patty', 'Mary Smith', 'Adam 22']
You could use a regex recipe here:
import re

pattern = r'(\w.+)?:\s(\d+)'
data_1 = []
targetNum = 50
with open('new_file.txt', 'r') as f:
    for line in f:
        data = re.findall(pattern, line)
        for i in data:
            # Store (absolute distance to targetNum, name) so the list sorts by closeness
            data_1.append((abs(int(i[1]) - targetNum), i[0]))

data_1.sort()
print(list(map(lambda x: x[1], data_1[:3])))
output:
['Patty', 'Mary Smith', 'Adam 22']

Python: read a text file into a dictionary

I really hope you can help me since I'm quite new to Python.
I have a simple text file, without any columns, just rows. Something like this:
Bob
Opel
Mike
Ford
Rodger
Renault
Mary
Volkswagen
Note that in the actual text file the names and the cars don't have an extra blank line between them. I had to do this, otherwise StackOverflow would have displayed the names next to each other.
The idea is to create a dictionary out of the text file to get a format like this:
{[Bob : Opel], [Mike : Ford], [Rodger : Renault], [Mary : Volkswagen]}
Can you guys help me out and give an example on how to do this? Would be much appreciated!
You can open the file and iterate through the lines with readlines (note the strip() calls, since readlines keeps the trailing newlines):
lines, pairs = open('file.dat', 'r').readlines(), {}
for i in range(len(lines) // 2):
    pairs[lines[i * 2].strip()] = lines[i * 2 + 1].strip()
print(pairs)
with open('test.txt', 'r') as file_:
    result = {}
    array = list(filter(lambda x: x != '', file_.read().split('\n')))
    for i in range(0, len(array), 2):
        result[array[i]] = array[i + 1]
    print(result)
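A more compact alternative sketch (assuming the file strictly alternates name/car lines, as shown) pairs the even lines with the odd ones via slicing:

with open('test.txt', 'r') as f:
    lines = [line.strip() for line in f if line.strip()]
owners = dict(zip(lines[::2], lines[1::2]))
print(owners)  # {'Bob': 'Opel', 'Mike': 'Ford', 'Rodger': 'Renault', 'Mary': 'Volkswagen'}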

Ascending python text

I have a simple problem: I created a game, and at the end I append the score to a text file. Now I have something like this in the file:
John: 11
Mike: 5
John: 78
John: 3
Steve: 30
I want to give the user the possibility to read the top 3 scores. So far I have created this:
with open(r'C:/path/to/scores.txt', 'r') as f:
    for line in f:
        data = line.split()
        print '{0[0]:<15}{0[1]:<15}'.format(data)
I have this:
John: 11
Mike: 5
John: 78
John: 3
Steve: 30
It looks better, but how can I show only the three best results, with a place number and the highest first?
Something like that:
1. John: 78
2. Steve: 30
3. John: 11
You can edit your code a little bit to store the scores in a list, then sort them using the sorted function. Then you can just take the first three scores of your sorted list.
with open(r'doc.txt', 'r') as f:
    scores = []
    for line in f:
        data = line.split()
        scores.append(data)

top3 = sorted(scores, key=lambda x: int(x[1]), reverse=True)[:3]
for score in top3:
    print '{0[0]:<15}{0[1]:<15}'.format(score)
As in my answer to a very similar question, the answer could be to just use sorted; slicing the result to get only the three top scores is trivial.
That said, you could also switch to using heapq.nlargest over sorted in this case; it takes a key function, just like sorted, and unlike sorted, it will only use memory to store the top X items (and has better theoretical performance when the set to extract from is large and the number of items to keep is small). Aside from not needing reverse=True (because choosing nlargest already implies that), heapq.nlargest is a drop-in replacement for sorted in that case.
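For example, a minimal sketch of the heapq variant, reusing the scores list built in the answer above:

import heapq

top3 = heapq.nlargest(3, scores, key=lambda x: int(x[1]))
for score in top3:
    print('{0[0]:<15}{0[1]:<15}'.format(score))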
Depending on what else you might want to do with the data, I think pandas is a great option here. You can load it into pandas like so:
import pandas as pd

df = []
with open(r'C:/path/to/scores.txt', 'r') as f:
    for line in f:
        data = line.split()
        df.append({'Name': data[0], 'Score': int(data[1])})
df = pd.DataFrame(df)
Then you can sort by score and show the top three:
df.sort_values('Score', ascending=False)[:3]
I recommend reading through the pandas documentation to see everything it can do.
EDIT: For easier reading you can do something like
df = pd.read_table('C:/path/to/scores.txt')
But this would require you to put column headings in that file first
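Alternatively, a sketch that avoids editing the file by supplying the column names yourself (the sep and names arguments here are assumptions based on the "John: 11" format shown):

import pandas as pd

df = pd.read_csv(r'C:/path/to/scores.txt', sep=': ', engine='python',
                 header=None, names=['Name', 'Score'])
print(df.nlargest(3, 'Score'))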
with open(r'C:/path/to/scores.txt', 'r') as f:
    scores = []
    for line in f:
        line = line.strip().split()
        scores.append((line[0], int(line[1])))

sorted_scores = sorted(scores, key=lambda s: s[1], reverse=True)
top_three = sorted_scores[:3]
This will read every line, strip extra whitespace, and split the line, then append it to the scores list. Once all scores have been added, the list gets sorted using the key of the 2nd item in the (name, score) tuple, in reverse, so that the scores run from high-to-low. Then the top_three slices the first 3 items from the sorted scores.
This would work, and depending on your coding style, you could certainly consolidate some of these lines. For the sake of the example, I simply have the contents of your score file in a string:
score_file_contents = """John: 11
Mike: 5
John: 78
John: 3
Steve: 30"""
scores = []
scores = []
for line in score_file_contents.splitlines():  # Simulate reading your file
    name, score = line.split(':')              # Extract name and score
    score = int(score)                         # Want score as an integer
    scores.append((score, name))               # Make my list

scores.sort(reverse=True)                      # List of tuples sorts on first tuple element

for ranking in range(len(scores)):             # Iterate using an index
    if ranking < 3:                            # How many you want to show
        score = scores[ranking][0]             # Extract score
        name = scores[ranking][1]              # Extract name
        print("{}. {:<10} {:<3}".format(ranking + 1, name + ":", score))
Result:
1. John: 78
2. Steve: 30
3. John: 11

Python: Parsing Multiple .txt Files into a Single .csv File?

I'm not very experienced with complicated large-scale parsing in Python. Do you guys have any tips or guides on how to easily parse multiple text files with different formats and combine them into a single .csv file, ultimately entering them into a database?
An example of the text files is as follows:
general.txt (format: Name -- Department (DEPT) Room # [Age]):
John Doe -- Management (MANG) 205 [Age: 40]
Equipment: Laptop, Desktop, Printer, Stapler
Experience: Python, Java, HTML
Description: Hardworking, awesome
Mary Smith -- Public Relations (PR) 605 [Age: 24]
Equipment: Mac, PC
Experience: Social Skills
Description: fun to be around
Scott Lee -- Programmer (PG) 403 [Age: 25]
Equipment: Personal Computer
Experience: HTML, CSS, JS
Description: super-hacker
Susan Kim -- Programmer (PG) 504 [Age: 21]
Equipment: Desktop
Experience: Social Skills
Descriptions: fun to be around
Bob Simon -- Programmer (PG) 101 [Age: 29]
Equipment: Pure Brain Power
Experience: C++, C, Java
Description: never comes out of his room
cars.txt (a list of people who own cars by their department/room #)
Programmer: PG 403, PG 101
Management: MANG 205
house.txt
Programmer: PG 504
The final csv should preferably tabulate to something like:
Name | Division | Division Abbreviation | Equipment | Room | Age | Car? | House? |
Scott Lee Programming PG PC 403 25 YES NO
Mary Smith Public Rel. PR Mac, PC 605 24 NO NO
The ultimate goal is to have a database where searching "PR" would return every row where a person's Department is "PR," etc. There are maybe 30 text files total, each representing one or more columns in the database. Some columns are short paragraphs, which include commas. Around 10,000 rows total. I know Python has a built-in csv module, but I'm not sure where to start, or how to end up with just 1 csv. Any help?
It looks like you're looking for someone who will solve the whole problem for you. Here I am :)
The general idea is to parse the general info into a dict (using regular expressions), then append additional fields to it, and finally write it to CSV. Here's a Python 3.x solution (I think Python 2.7+ should suffice):
import csv
import re


def read_general(fname):
    # Read general info to dict with 'PR 123'-like keys.
    # Regexp that will split a row into a ready-to-use dict:
    re_name = re.compile(r'''
        (?P<Name>.+)
        \ --\   # Separator + space
        (?P<Division>.+)
        \       # Space
        \(
        (?P<Division_Abbreviation>.*)
        \)
        \       # Space
        (?P<Id>\d+)
        \       # Space
        \[Age:\  # Space at the end
        (?P<Age>\d+)
        \]
    ''', re.X)
    general = {}
    with open(fname, 'rt') as f:
        for line in f:
            line = line.strip()
            m = re_name.match(line)
            if m:
                # Name line, start a new man
                man = m.groupdict()
                key = '%s %s' % (m.group('Division_Abbreviation'), m.group('Id'))
                general[key] = man
            elif line:
                # Non-empty lines: add values to the dict
                key, value = line.split(': ', 1)
                man[key] = value
    return general


def add_bool_criteria(fname, field, general):
    # Append a field with a YES/NO value
    with open(fname, 'rt') as f:
        yes_keys = set()
        # Phase one, gather all keys
        for line in f:
            line = line.strip()
            _, keys = line.split(': ', 1)
            yes_keys.update(keys.split(', '))
    # Fill data
    for key, man in general.items():  # iteritems() will be faster in Python 2.x
        man[field] = 'YES' if key in yes_keys else 'NO'


def save_csv(fname, general):
    with open(fname, 'wt') as f:
        # Gather field names
        all_fields = set()
        for value in general.values():
            all_fields.update(value.keys())
        # Write to csv
        w = csv.DictWriter(f, all_fields)
        w.writeheader()
        w.writerows(general.values())


def main():
    general = read_general('general.txt')
    add_bool_criteria('cars.txt', 'Car?', general)
    add_bool_criteria('house.txt', 'House?', general)
    from pprint import pprint
    pprint(general)
    save_csv('result.csv', general)


if __name__ == '__main__':
    main()
I wish you a lot of $$$ for this ;)
Side note
CSV is history; you could use JSON for storage and further use, because it's simpler to work with, more flexible, and human-readable.
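For example, a minimal sketch of dumping the same general dict to JSON instead (the output file name is arbitrary):

import json

with open('result.json', 'w') as f:
    json.dump(general, f, indent=2)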
You just have a function which parses one file, and returns a list of dictionaries containing {'name': 'Bob Simon', 'age': 29, ...} etc. Then call this on each of your files, extending a master list. Then write this master list of dicts as a CSV file.
More elaborately:
First you need to parse the input files; you'd have a function which takes a file and returns a list of "things".
def parse_txt(fname):
    f = open(fname)
    people = []
    # Here, parse f. Maybe using a while loop, and calling
    # f.readline() until there is an empty line. Construct a
    # dictionary from each person's block, and append it to people.
    return people
This returns something like:
people = [
    {'name': 'Bob Simon', 'age': 29},
    {'name': 'Susan Kim', 'age': 21},
]
Then, loop over each of your input files (maybe by using os.listdir, or optparse to get a list of args):
allpeople = []
for curfile in args:
    people = parse_txt(fname=curfile)
    allpeople.extend(people)
So allpeople is a long list of all the people from all files.
Finally you can write this to a CSV file using the csv module (this bit usually involves another function to reorganise the data into a format more compatible with the writer module)
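A minimal sketch of that last step (the fieldnames here are hypothetical; use whichever keys your parse_txt actually produces):

import csv

fieldnames = ['name', 'age']  # hypothetical; match the keys parse_txt produces
with open('combined.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(allpeople)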
I'll do it backwards: I'll start by loading all those house.txt and cars.txt files, each one into a dict, which could look like:
cars = {'MANG': [205], 'PG': [403, 101]}
Since you said you have about 30 of them, you could easily use a nested dict without making things too complicated:
data = {'house': {'PG': [504]}, 'cars': {...}}
Once the data dict is complete, load general.txt, and while building the dict for each employee (or whatever they are), do a dict look-up to see if they have a house, a car, etc.
For example, for John Doe (MANG 205) you'll have to check:
if 205 in data['cars'].get('MANG', []):
    # ...
and update his dict accordingly. Obviously you don't have to hard-code all the possible look-ups; just build a list like ['house', 'cars', ...] and iterate over it.
At the end you should have a big list of dicts with all the info merged, so just write each one of them to a csv file.
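For example, a minimal sketch of loading one of those files into that shape (the helper name load_rooms is mine):

def load_rooms(fname):
    # Parse lines like 'Programmer: PG 403, PG 101' into {'PG': [403, 101]}
    rooms = {}
    with open(fname) as f:
        for line in f:
            if not line.strip():
                continue
            _, entries = line.strip().split(': ', 1)
            for entry in entries.split(', '):
                dept, room = entry.split()
                rooms.setdefault(dept, []).append(int(room))
    return rooms

data = {'cars': load_rooms('cars.txt'), 'house': load_rooms('house.txt')}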
Best possible advice: don't do that.
Your cars and house relations are, ummmm, interesting. Owning a house or a car is an attribute of a person or other entity (company, partnership, joint tenancy, tenancy in common, etc.). It is NOT an attribute of a ("division", room) combination. The first fact in your cars file is "a programmer in room 403 owns a car". What happens in the not-unlikely event that there are 2 or more programmers in the same room?
The equipment shouldn't be stored as one comma-separated list in a single column.
Don't record age; record the date or year of birth.
You need multiple tables in a database, not 1 CSV file. You need to study a book on elementary database design.
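To make that concrete, a minimal sketch of what "multiple tables" could look like with Python's built-in sqlite3 (the schema is only an illustration of the point above, not a full design):

import sqlite3

conn = sqlite3.connect('staff.db')
conn.executescript('''
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        name TEXT,
        division TEXT,
        room INTEGER,
        birth_year INTEGER  -- store birth year, not age
    );
    CREATE TABLE car_ownership (
        person_id INTEGER REFERENCES person(id)  -- a car belongs to a person, not a room
    );
''')
conn.commit()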

Searching CSV Files (Python)

I've made this CSV file up to play with. From what I've been told before, I'm pretty sure this CSV file is valid and can be used in this example.
Basically I have this CSV file 'book_list.csv':
name,author,year
Lord of the Rings: The Fellowship of the Ring,J. R. R. Tolkien,1954
Nineteen Eighty-Four,George Orwell,1984
Lord of the Rings: The Return of the King,J. R. R. Tolkien,1954
Animal Farm,George Orwell,1945
Lord of the Rings: The Two Towers, J. R. R. Tolkien, 1954
And I also have this text file 'search_query.txt', whereby I put in keywords or search terms I want to search for in the CSV file:
Lord
Rings
Animal
I've currently come up with some code (with the help of stuff I've read) that allows me to count the number of matching entries. I then have the program write a separate CSV file 'results.csv' which just returns either 'Matching' or ' '.
The program then takes this 'results.csv' file and counts how many 'Matching' results I have and it prints the count.
import csv
import collections

f1 = file('book_list.csv', 'r')
f2 = file('search_query.txt', 'r')
f3 = file('results.csv', 'w')

c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)

input = [row for row in c2]

for booklist_row in c1:
    row = 1
    found = False
    for input_row in input:
        results_row = []
        if input_row[0] in booklist_row[0]:
            results_row.append('Matching')
            found = True
            break
        row = row + 1
    if not found:
        results_row.append('')
    c3.writerow(results_row)

f1.close()
f2.close()
f3.close()

d = collections.defaultdict(int)
with open("results.csv", "rb") as info:
    reader = csv.reader(info)
    for row in reader:
        for matches in row:
            matches = matches.strip()
            if matches:
                d[matches] += 1

results = [(matches, count) for matches, count in d.iteritems() if count >= 1]
results.sort(key=lambda x: x[1], reverse=True)
for matches, count in results:
    print 'There are', count, 'matching results'+'.'
In this case, my output returns:
There are 4 matching results.
I'm sure there is a better way of doing this that avoids writing a completely separate CSV file... but this was easier for me to get my head around.
My question is: this code that I've put together only returns how many matching results there are. How do I modify it in order to return the ACTUAL results as well?
i.e. I want my output to return:
There are 4 matching results.
Lord of the Rings: The Fellowship of the Ring
Lord of the Rings: The Return of the King
Animal Farm
Lord of the Rings: The Two Towers
As I said, I'm sure there's a much easier way to do what I already have.. so some insight would be helpful. :)
Cheers!
EDIT: I just realized that if my keywords are in lower case, it won't work... is there a way to avoid case-sensitivity?
Throw away the query file and get your search terms from sys.argv[1:] instead.
Throw away your output file and use sys.stdout instead.
Append matched booklist titles to a result_list. The result_row that you currently have has a rather misleading name. The count that you want is len(result_list). Print that. Then print the contents of result_list.
Convert your query words to lowercase once (before you start reading the input file). As you read each book_list row, convert its title to lowercase. Do your matching with the lowercase query words and the lowercase title.
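Putting that advice together, a minimal sketch (Python 2, to match the question's code; the overall structure is my own):

import csv
import sys

terms = [term.lower() for term in sys.argv[1:]]
result_list = []
with open('book_list.csv', 'rb') as f:
    for row in csv.reader(f):
        title = row[0]
        if any(term in title.lower() for term in terms):
            result_list.append(title)

print 'There are', len(result_list), 'matching results.'
for title in result_list:
    print title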
Overall plan:
Read in the entire book list csv into a dictionary of {title: info}.
Read in the query file. For each keyword, filter the dictionary:
[key for key, value in books.items() if "Lord" in key]
say. Do what you will with the results.
If you want, put the results in another csv.
If you want to deal with casing issues, try turning all the titles to lowercase ("FOO".lower()) when you store them in the dictionary.
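A minimal sketch of that plan (file names taken from the question; here the dictionary value is simply the full row):

import csv

with open('book_list.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    books = {row[0]: row for row in reader}

with open('search_query.txt') as f:
    keywords = [line.strip().lower() for line in f if line.strip()]

matches = [title for title in books if any(kw in title.lower() for kw in keywords)]
print('There are {} matching results.'.format(len(matches)))
for title in matches:
    print(title)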
