I am seeking some advice on a script (possibly Python?) that I could use to do the following.
I basically have two documents, taken from a DB:
document one contains :
hash / related username.
example:
fb4aa888c283428482370 username1
fb4aa888c283328862370 username2
fb4aa888c283422482370 username3
fb4aa885djsjsfjsdf370 username4
fb4aa888c283466662370 username5
document two contains:
hash : plaintext
example:
fb4aa888c283428482370:plaintext
fb4aa888c283328862370:plaintext2
fb4aa888c283422482370:plaintext4
fb4aa885djsjsfjsdf370:plaintextetc
fb4aa888c283466662370:plaintextetc
Can anyone think of an easy way for me to match up the hashes in document two with the relevant username from document one, write the result into a new document (say document three), and add the plaintext so it would look like the following...
Hash : Relevant Username : plaintext
This would save me a lot of time cross-referencing two files to find the relevant hash manually and the user it belongs to.
I've never actually used python before, so some examples would be great!
Thanks in advance
I don't have any code for you but a very basic way to do this would be to whip up a script that does the following:
Read the first doc into a dictionary with the hashes as keys.
Read the second doc into a dictionary with the hashes as keys.
Iterate through both dictionaries, by key, in the same loop, writing out the info you want into the third doc.
This doesn't produce exactly the output format you showed, but it should get you close enough to modify to your liking. There are people out there good enough to shorten this into a few lines of code, but I think the readability of keeping it long may be helpful to you as you're just getting started.
Btw, I would probably avoid this altogether and do the join in SQL before creating the file -- but that wasn't really your question :)
usernames = dict()
plaintext = dict()
result = dict()

with open('username.txt') as un:
    for line in un:
        arry = line.split()  # turns the line into a list of two parts
        hash, user = arry[0], arry[1]
        usernames[hash] = user  # add to dictionary (split() already dropped the newline)

with open('plaintext.txt') as pt:
    for line in pt:
        arry = line.split(':')
        hash, txt = arry[0], arry[1]
        plaintext[hash] = txt.strip()  # strip the trailing newline

for key, val in usernames.items():
    hash = key
    txt = plaintext[hash]
    result[val] = txt

with open("dict.txt", "w") as w:
    for name, txt in result.items():
        w.write('{0} = {1}\n'.format(name, txt))

print(usernames)  # {'fb4aa888c283466662370': 'username5', 'fb4aa888c283422482370': 'username3', ...}
print(plaintext)  # {'fb4aa888c283466662370': 'plaintextetc', 'fb4aa888c283422482370': 'plaintext4', ...}
print(result)     # {'username1': 'plaintext', 'username3': 'plaintext4', ...}
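As an aside on the SQL route mentioned above: if both lists are still sitting in the same database, the join could be done there before exporting anything. A rough sketch using Python's sqlite3 module, with made-up table and column names:

import sqlite3

# hypothetical database file and table/column names
conn = sqlite3.connect('users.db')
rows = conn.execute("""
    SELECT u.hash, u.username, p.plaintext
    FROM usernames AS u
    JOIN plaintexts AS p ON p.hash = u.hash
""")

with open('document3.txt', 'w') as out:
    for hash_value, username, plain in rows:
        out.write('{0} : {1} : {2}\n'.format(hash_value, username, plain))

conn.close()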
I'm checking whether several strings are present in a file's contents at the same time, and I'm curious whether this could be done with a switch, which is what I'm moving the code to right now, because I have a lot of lines that look like the one below.
As larsks also asked in the comments: yes, I mean something like the match statement, and I'm aiming for results like that, but I've also found another solution that works for me in cases where I'm looking for only one substring.
My current code looks like this:
f = open('somesortoffilename')
if "string" in f.read() and "otherstring" in f.read(): variable = 'value'
And I would like something like this:
f = open('somesortoffilename')

def f(variable):
    return {
        'string' and 'otherstring' in f.read(): 'value'
    }
Is it possible in any way?
First, we need to make sure that we read the file stream only once: the first call to f.read() will already have consumed all the bytes in the file.
Let's store the contents in a string instead.
with open('somesortoffilename') as file:
    contents = file.read()
The with form makes sure that the file stream is closed after we fetch its contents.
The “switch” pattern can be implemented in Python with dictionaries (the dict type).
switches = {
    'term1': ['string', 'other_string'],
    'term2': ['another_string']
}
We can use this lookup to check whether any string corresponding to a term is found in the file.
def f(contents):
    for term, values in switches.items():
        if any(x in contents for x in values):
            return term
    return None
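Putting the pieces together, usage might look like this (a small sketch; the file name and the 'value' assignment are just carried over from the question):

with open('somesortoffilename') as file:
    contents = file.read()

term = f(contents)   # the first term whose strings match, or None
if term == 'term1':
    variable = 'value'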
I am quite new to Python and I would like to ask the following:
Let's say for example i have the following .txt file:
USERNAME -- example.name
SERVER -- server01
COMPUTERNAME -- computer01
I would like to search through this document for the 3 keywords USERNAME, SERVER and COMPUTERNAME, and when I find them, extract their values, i.e. "example.name", "server01" and "computer01" respectively for each line.
Is this possible? I have already tried looking by line numbers, but I would much prefer searching by keywords instead.
Question 2: Somewhere in this .txt file there is a line with the keyword Adresses: which has multiple values listed on separate lines, like so:
Adresses:
6001:8000:1080:142::12
8002:2013:2380:110::53
9007:2013:2380:117::80
.
.
Would there be any way to get all of the listed addresses as a result, not just the first one? The number of said addresses is dynamic, so it may change in the future.
For this I honestly have no idea how to begin. I'd appreciate any kind of hints or being pointed in the right direction.
Thank you very much for your time and attention!
Like this:
with open("filename.txt") as f:
for x in f:
a = x.split(" -- ")
print(a[1])
If the line with a given value always starts with the keyword, you can try something like this:
with open('file.txt', 'r') as file:
    for line in file:
        if line.startswith('keyword'):
            keyword, value = line.split(' -- ')
To gather all the addresses, I'd initialize a list of addresses beforehand, then add the line
addresses.append(value)
inside the if statement (see the combined sketch below).
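Put together, a small sketch might look like this (the file name and the ' -- ' separator are taken from the examples above; treating every non-blank line after Adresses: as an address is an assumption):

data = {}
addresses = []
in_addresses = False

with open('file.txt', 'r') as file:
    for line in file:
        line = line.strip()
        if line.startswith('Adresses:'):
            in_addresses = True
            continue
        if ' -- ' in line:
            in_addresses = False
            keyword, value = line.split(' -- ')
            data[keyword] = value
        elif in_addresses and line:
            addresses.append(line)

print(data)       # e.g. {'USERNAME': 'example.name', 'SERVER': 'server01', ...}
print(addresses)  # e.g. ['6001:8000:1080:142::12', ...]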
Your best friend for this kind of task will be the str.split function. You can put your data in a dict which will map keywords to values:
data = {}  # Create a new dict
with open('data.txt') as file:  # Open the file containing the data
    lines = file.read().split('\n')  # Split the file into lines
    for line in lines:  # For each line
        if ' -- ' not in line:  # Skip blank or malformed lines
            continue
        keyword, value = line.split(' -- ')  # Extract keyword and value
        data[keyword] = value  # Put it in the dict
Then, you can access your values with data['USERNAME'] for example. This method will work on any document containing a key-value association on each line (even if you have more than 3 keywords). However, it will not work if the same text file contains the addresses in the format you mentioned.
If you want to include the addresses in the same file, you'll need to adapt the code. You can, for example, check whether the split line contains two elements (a key-value pair on the same line, like USERNAME) or only one (a key whose values are on the following lines, like Adresses:). Then you can store the different addresses in a list and bind that list to the data dict (a sketch of this follows the example below). It's not a problem to have a dict in the form:
{
    'USERNAME': 'example.name',
    'Addresses': ['6001:8000:1080:142::12', '8002:2013:2380:110::53']
}
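A rough sketch of that adaptation (assuming the addresses are listed one per line right after the Adresses: line, and that a key whose values span multiple lines ends with a colon):

data = {}
current_key = None

with open('data.txt') as file:
    for line in file:
        line = line.rstrip('\n')
        if not line.strip():
            continue
        parts = line.split(' -- ')
        if len(parts) == 2:               # key-value pair on the same line
            data[parts[0]] = parts[1]
            current_key = None
        elif line.endswith(':'):          # e.g. 'Adresses:' opens a multi-line value
            current_key = line[:-1]
            data[current_key] = []
        elif current_key is not None:     # continuation line: one address
            data[current_key].append(line.strip())

print(data)  # e.g. {'USERNAME': 'example.name', ..., 'Adresses': ['6001:8000:1080:142::12', ...]}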
I need to store basic data about customers, the cars they bought, and the payment schedules for those cars. The data comes from a GUI written in Python. I don't have enough experience to use a database system like SQL, so I want to store my data in a file as plain text. It doesn't have to be online.
To be able to search and filter the data, I first convert my data (lists of lists) to a string, and when I need the data I convert it back to regular Python list syntax. I know it is a very brute-force way, but is it safe to do it like that, or can you advise me of another way?
It is never safe to save your database in a text format (or using pickle or whatever). There is a risk that problems while saving the data may cause corruption. Not to mention risks with your data being stolen.
As your dataset grows there may be a performance hit.
Have a look at SQLite (Python's sqlite3 module), which is small and easier to manage than MySQL, unless you have a very small dataset that will fit in a text file.
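A minimal sqlite3 sketch (the file, table and column names are made up for illustration):

import sqlite3

conn = sqlite3.connect('customers.db')   # the file is created if it doesn't exist
conn.execute('create table if not exists customers (name text, car text, payment real)')
conn.execute('insert into customers values (?, ?, ?)', ('Alice', 'Civic', 250.0))
conn.commit()

# search/filter without loading everything into memory
for row in conn.execute('select * from customers where name = ?', ('Alice',)):
    print(row)

conn.close()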
P.S.: By the way, using Berkeley DB in Python is simple, and you don't have to learn all the DB things, just import bsddb (note that the bsddb module was only in the Python 2 standard library).
The answer to use pickle is good, but I personally prefer shelve. It allows you to keep variables in the same state they were in between launches and I find it easier to use than pickle directly. http://docs.python.org/library/shelve.html
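A quick shelve sketch (the file name and keys are arbitrary):

import shelve

db = shelve.open('customers_shelf')      # behaves like a persistent dictionary
db['customers'] = [['Alice', 'Civic', 250.0], ['Bob', 'Corolla', 300.0]]
db.close()

db = shelve.open('customers_shelf')
print(db['customers'])                   # the data is still there on the next launch
db.close()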
I agree with the others that serious and important data would be more secure in some type of light database, but I can also sympathize with the wish to keep things simple and transparent.
So, instead of inventing your own text-based data format, I would suggest you use YAML.
The format is human-readable, for example:
List of things:
- Alice
- Bob
- Evan
You load the file like this:
>>> import yaml
>>> with open('test.yaml', 'r') as f:
...     data = yaml.safe_load(f)
And data will look like this:
{'List of things': ['Alice', 'Bob', 'Evan']}
Of course you can do the reverse too and save data into YAML, the docs will help you with that.
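For example, a minimal dump sketch (safe_dump is the writing counterpart of safe_load):

import yaml

data = {'List of things': ['Alice', 'Bob', 'Evan']}
with open('test.yaml', 'w') as f:
    yaml.safe_dump(data, f, default_flow_style=False)  # writes block-style YAML like the example above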
At least another alternative to consider :)
Very simple and basic (more at http://pastebin.com/A12w9SVd):
import json, os

db_name = 'udb.db'

def check_db(name=db_name):
    # create an empty db file if it does not exist yet
    if not os.path.isfile(name):
        print 'no db\ncreating..'
        udb = open(name, 'w')
        udb.close()

def read_db():
    try:
        udb = open(db_name, 'r')
    except IOError:
        check_db()
        udb = open(db_name, 'r')  # the file exists now
    try:
        dicT = json.load(udb)
        udb.close()
        return dicT
    except ValueError:  # empty or corrupted file
        udb.close()
        return {}

def update_db(newdata):
    data = read_db()
    data.update(newdata)  # merge the new entries into the existing data
    udb = open(db_name, 'w')
    json.dump(data, udb)
    udb.close()
using:
def adduser():
    print 'add user:'
    name = raw_input('name > ')
    password = raw_input('password > ')
    update_db({name: password})
You can use this library to write an object to a file: http://docs.python.org/library/pickle.html
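A tiny pickle sketch (the file name is arbitrary; note that pickle files must be opened in binary mode):

import pickle

data = [['Alice', 'Civic', 250.0], ['Bob', 'Corolla', 300.0]]

with open('customers.pkl', 'wb') as f:
    pickle.dump(data, f)        # write the object to the file

with open('customers.pkl', 'rb') as f:
    restored = pickle.load(f)   # read it back unchanged

print(restored)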
Writing data to a plain file isn't a safe way of storing data. Better to use a simple database library like SQLAlchemy; it is an ORM for easy database usage...
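A minimal SQLAlchemy sketch (SQLAlchemy 1.4+ style; the Customer model and column names are made up for illustration):

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Customer(Base):
    __tablename__ = 'customers'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    car = Column(String)

engine = create_engine('sqlite:///customers.db')
Base.metadata.create_all(engine)           # create the table if needed

with Session(engine) as session:
    session.add(Customer(name='Alice', car='Civic'))
    session.commit()
    for customer in session.query(Customer).filter_by(name='Alice'):
        print(customer.name, customer.car)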
You can also keep simple data in a plain text file. You then have little support, however, for checking the consistency of the data, duplicate values, etc.
Here is my simple 'card file' style snippet for data in a text file, using namedtuple so that you can access values not only by index within the line but also by their header name:
# text based data input with data accessible
# by named fields or indexing
from __future__ import print_function  ## Python 3 style printing
from collections import namedtuple
import string

filein = open("sample.dat")
datadict = {}

headerline = filein.readline().lower()  ## lowercase field names Python style
## first non-letter and non-number is taken to be the separator
separator = headerline.strip(string.ascii_lowercase + string.digits)[0]
print("Separator is '%s'" % separator)

headerline = [field.strip() for field in headerline.split(separator)]
Dataline = namedtuple('Dataline', headerline)
print('Fields are:', Dataline._fields, '\n')

for data in filein:
    data = [f.strip() for f in data.split(separator)]
    d = Dataline(*data)
    datadict[d.id] = d  ## hash by id values for fast lookup (key field)

## examples based on sample.dat file example
key = '123'
print('Email of record with key %s by field name is: %s' %
      (key, datadict[key].email))

## by number
print('Address of record with key %s by field number is: %s' %
      (key, datadict[key][3]))

## print the dictionary in separate lines for clarity
for key, value in datadict.items():
    print('%s: %s' % (key, value))

input('Ready')  ## let the output be seen when run directly

""" Output:
Separator is ';'
Fields are: ('id', 'name', 'email', 'homeaddress')
Email of record with key 123 by field name is: gishi#mymail.com
Address of record with key 123 by field number is: 456 happy st.
345: Dataline(id='345', name='tony', email='tony.veijalainen#somewhere.com', homeaddress='Espoo Finland')
123: Dataline(id='123', name='gishi', email='gishi#mymail.com', homeaddress='456 happy st.')
Ready
"""
I have a text file with the following format.
The first line includes "USERID"=12345678 and the other lines include the user groups for each application:
For example:
The user with T-number T12345 has WRITE access to APP1 and APP2, and READ-ONLY access to APP1.
T-Number is just some other kind of ID.
00001, 00002 and so on are sequence numbers and can be ignored.
T12345;;USERID;00001;12345678;
T12345;APPLICATION;WRITE;00001;APP1
T12345;APPLICATION;WRITE;00002;APP2
T12345;APPLICATION;READ-ONLY;00001;APP1
I need to do some filtering and merge the line containing USERID with all the lines containing user groups, matching the T-number with the USERID (T12345 = 12345678).
So the output should look like this:
12345678;APPLICATION;WRITE;APP1
12345678;APPLICATION;WRITE;APP2
12345678;APPLICATION;READ-ONLY;APP1
Should I use csv python module to accomplish this?
I do not see any advantage in using the csv module for reading and parsing the input text file. The number of fields varies: 6 fields in the USERID line, with 2 of them empty, but 5 non-empty fields in the other lines. The fields look very simple, so there is no need for csv's handling of the separator character hidden away in quotes and the like. There is no header line as in a csv file, but rather many headers sprinkled in among the data lines.
A simple routine that reads each line, splits it on the semicolon character, parses it, and combines related lines would suffice; a sketch follows below.
The output file is another matter. The lines have the same format, with the same number of fields. So creating that output may be a good use for csv. However, the format is so simple that the file could also be created without csv.
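A sketch of such a routine (an illustration only; it assumes the USERID line for a T-number appears before that T-number's application lines):

userids = {}        # T-number -> USERID, e.g. 'T12345' -> '12345678'
output_lines = []

with open('input.txt') as infile:
    for line in infile:
        fields = line.strip().split(';')
        if fields[2] == 'USERID':
            userids[fields[0]] = fields[4]
        elif fields[1] == 'APPLICATION':
            tnumber, _, access, _, app = fields[:5]
            output_lines.append(';'.join([userids[tnumber], 'APPLICATION', access, app]))

with open('output.txt', 'w') as outfile:
    outfile.write('\n'.join(output_lines) + '\n')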
I am not so sure you should use the csv module here - the file has mixed data, possibly more than just users and user group rights. In the case of a user declaration, you only need to retrieve its group and id, while for the application rights you need to extract the group, app name and right. The more differing data you have, the more issues you will encounter - with manual parsing of the data you are always able to just skip or handle lines when they meet certain criteria.
So far I must say you are better off with manual, line-by-line parsing: structure the lines into something meaningful, then output the data. For instance:
from io import StringIO
from pprint import pprint

feed = """T12345;;USERID;00001;12345678;
T12345;;USERID;00001;2345678;
T12345;;USERID;00002;345678;
T12345;;USERID;00002;45678;
T12345;APPLICATION;WRITE;00001;APP1
T12345;APPLICATION;WRITE;00002;APP2
T12345;APPLICATION;READ-ONLY;00001;APP1
T12345;APPLICATION;WRITE;00002;APP1
T12345;APPLICATION;WRITE;00002;APP2"""

buf = StringIO(feed)

groups = {}

# Read all data into a dict of dicts
for line in buf:
    values = line.strip().split(";")
    if values[3] not in groups:
        groups[values[3]] = {"users": [], "apps": {}}
    if values[2] == "USERID":
        groups[values[3]]['users'].append(values[4])
        continue
    if values[1] == "APPLICATION":
        if values[4] not in groups[values[3]]["apps"]:
            groups[values[3]]["apps"][values[4]] = []
        groups[values[3]]["apps"][values[4]].append(values[2])

print("Structured data with group as root")
pprint(groups)

print("Output data")
for group_id, group in groups.items():
    # Order by user, app
    for user in group["users"]:
        for app_name, rights in group["apps"].items():
            for right in rights:
                print(";".join([user, "APPLICATION", right, app_name]))
I have a text file with some content. I need to search this content frequently. I have the following two options; which one is the best (in terms of faster execution)?
METHOD 1:
def search_list(search_word):
    if search_word in li:
        print "found at line ", li.index(search_word) + 1

if __name__ == "__main__":
    f = open("input.txt", "r")
    li = []
    for i in f.readlines():
        li.append(i.rstrip("\n"))
    search_list("appendix")
METHOD 2:
def search_dict(search_word):
    if d.has_key(search_word):
        print "found at line ", d[search_word]

if __name__ == "__main__":
    f = open("input.txt", "r")
    d = {}
    for i, j in zip(range(1, len(f.readlines())), f.readlines()):
        d[j.rstrip("\n")] = i
    search_dict("appendix")
For frequent searching, a dictionary is definitely better (provided you have enough memory to store the line numbers also) since the keys are hashed and looked up in O(1) operations. However, your implementation won't work. The first f.readlines() will exhaust the file object and you won't read anything with the second f.readlines().
What you're looking for is enumerate:
with open('data') as f:
    d = dict((j[:-1], i) for i, j in enumerate(f, 1))
It should also be pointed out that in both cases, the function which does the searching will be faster if you use try/except provided that the index you're looking for is typically found. (In the first case, it might be faster anyway since in is an order N operation and so is .index for a list).
e.g.:
def search_dict(d, search_string):
    try:
        print "found at line {0}".format(d[search_string])
    except KeyError:
        print "string not found"
or for the list:
def search_list(search_word):
    try:
        print "found at line {0}".format(li.index(search_word) + 1)
    except ValueError:
        print "string not found"
If you do it really frequently, then the second method will be faster (you've built something like an index).
Just adapt it a little bit:
def search_dict(d, search_string):
    line = d.get(search_string)
    if line:
        print "found at line {}".format(line)
    else:
        print "string not found"

d = {}
with open("input.txt", "r") as f:
    for i, word in enumerate(f.readlines(), 1):
        d[word.rstrip()] = i

search_dict(d, "appendix")
I'm posting this after reading the answers of eumiro and mgilson.
If you compare your two methods on the command line, I think you'll find that the first one is faster. The other answers say the second method is faster, but they are based on the premise that you'll do several searches on the file after you've built your index. If you use them as-is from the command line, you will not.
The building of the index is slower than just searching for the string directly, but once you've built an index, searches can be done very quickly, making up for the time spent building it. This extra time is wasted if you just use it once, because when the program is complete, the index is discarded and has to be rebuilt the next run. You need to keep the created index in memory between queries for this to pay off.
There are several ways of doing this; one is making a daemon to hold the index and using a front-end script to query it. Searching for something like python daemon client communication on Google will give you pointers on implementing this.
First one is O(n); second one is O(1), but it requires searching on the key. I'd pick the second one.
Neither one will work if you're doing ad hoc searches in the document. For that you'll need to parse and index using something like Lucene.
Another option to throw in is using the FTS provided by SQLite3... (untested and making the assumption you're looking for whole words, not substrings of words or other such things)
import sqlite3

look_for = 'somestring'

# create db and table
db = sqlite3.connect(':memory:')  # replace with an on-disk file?
db.execute('create virtual table somedata using fts4(line)')

# insert the data
with open('yourfile.txt') as fin:
    for lineno, line in enumerate(fin, 1):
        # You could put in a check here I guess...
        if look_for in line:
            print lineno  # or whatever....
        # put row into FTS table
        db.execute('insert into somedata (line) values (?)', (line,))
    # or, skipping the loop above entirely, possibly more efficient:
    # db.executemany('insert into somedata (line) values (?)', ((line,) for line in fin))
db.commit()

matches = db.execute('select rowid from somedata where line match ?', (look_for,))
print '{} is on lines: {}'.format(look_for, ', '.join(str(match[0]) for match in matches))
If you only wanted the first line, then add limit 1 to the end of the query.
You could also look at using mmap to map the file, then use the .find method to get the earliest offset of the string. Assuming it's not -1 (which means the string was not found), say it's 123456, then do mapped_file[:123456].count('\n') + 1 to get the line number.
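A small sketch of that mmap idea (the file name and search string are placeholders carried over from the question):

import mmap

with open('input.txt', 'rb') as f:
    mapped_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    offset = mapped_file.find(b'appendix')            # earliest offset, or -1 if not found
    if offset != -1:
        line_number = mapped_file[:offset].count(b'\n') + 1
        print('found at line {0}'.format(line_number))
    else:
        print('string not found')
    mapped_file.close()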