I'm making a GUI that creates a config.yml file for another Python script I'm working with. This may not be the cleanest route to take.

config.yml example:

DBtables:
  CurrentMinuteLoad:
    CSV_File: trend.csv
    Table_Name: currentminuteload

Using PySimpleGUI, my button isn't functioning the way I'd expect it to. It currently (and accurately) checks for the reference name (the example here would be CurrentMinuteLoad) and will kick it back if it exists, but it skips the check for the table (so the elif statement gets skipped). Adding the table still works; I'm just not getting the double check that I want. Also, I have to hit the Okay button twice in the GUI for it to work?? A weird quirk that doesn't quite make sense to me.
def add_table():
    window2.read()
    with open("config.yml", "r") as h:
        if values['new_ref'] in h.read():
            sg.popup('Reference name already exists')
        elif values['new_db'] in h.read():
            sg.popup('Table name already exists')
        else:
            with open("config.yml", "a+") as f:
                f.write("\n " + values['new_ref'] + ":")
                f.write("\n CSV_File:" + values['new_csv'])
                f.write("\n Table_Name:" + values['new_db'])
                f.close()
            sg.popup('The reference "' + values['new_ref'] + '" has been included and will add the table "' + values['new_db'] + '" to PG Admin during the next scheduled upload')
When you use h.read(), you should save the returned value: the file is read like a stream, so subsequent calls to read() will return an empty string.
Try editing the code like this:
with open("config.yml", "r") as h:
    content = h.read()
    if values['new_ref'] in content:
        sg.popup('Reference name already exists')
    elif values['new_db'] in content:
        sg.popup('Table name already exists')
    else:
        # ...
You should update the YAML file using a real YAML parser. That allows you to check for duplicate values without using in, which gives false positives when a new value is a substring of an existing value (or key).
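For example, a quick illustration of the false positive with in (hypothetical content):

content = "CurrentMinuteLoad: trend.csv"
print('CurrentMinuteL' in content)   # True, although no reference 'CurrentMinuteL' exists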
In the following I add values twice and show the resulting YAML. The first time around, the check on new_ref and new_db does not find a match, although each is a substring of an existing value. The second time, using the same values, there is of course a match on the previously added values.
import sys
import ruamel.yaml
from pathlib import Path

def add_table(filename, values, verbose=False):
    error = False
    yaml = ruamel.yaml.YAML()
    data = yaml.load(filename)
    dbtables = data['DBtables']
    if values['new_ref'] in dbtables:
        print(f'Reference name "{values["new_ref"]}" already exists')  # use sg.popup in your code
        error = True
    for k, v in dbtables.items():
        if values['new_db'] in v.values():
            print(f'Table name "{values["new_db"]}" already exists')
            error = True
    if error:
        return
    dbtables[values['new_ref']] = d = {}
    for x in ['new_cv', 'new_db']:
        d[x] = values[x]
    yaml.dump(data, filename)
    if verbose:
        sys.stdout.write(filename.read_text())

values = dict(new_ref='CurrentMinuteL', new_cv='trend_csv', new_db='currentminutel')
add_table(Path('config.yaml'), values, verbose=True)
print('========')
add_table(Path('config.yaml'), values, verbose=True)
which gives:

DBtables:
  CurrentMinuteLoad:
    CSV_File: trend.csv
    Table_Name: currentminuteload
  CurrentMinuteL:
    new_cv: trend_csv
    new_db: currentminutel
========
Reference name "CurrentMinuteL" already exists
Table name "currentminutel" already exists
I have the code below, which downloads attachments into parent_directory using an API connection.
Problem: the code works, but it gets stuck when a folder already exists.
Solution wanted: how can I make this code bypass existing folders? If the folder exists, don't do anything and just move on to the next loop iteration.
import pandas as pd
import os
import zipfile

parent_directory = "folderpath"
csv_file_dir = "myfilepath.csv"
user = "API_username"
key = "API_password"

os.chdir(parent_directory)
bdr_data = pd.read_csv(csv_file_dir)
api_first = "… " + user + ":" + key + "…"
for index, row in bdr_data.iterrows():
    #print(row['url_attachment'])
    name = row['Ref_Num']
    os.makedirs(parent_directory + name)
    os.chdir(parent_directory + name)
    url = api_first + row['url_attachment'] + " -o attachments.zip"
    os.system(url)
    os.chdir(parent_directory)
You can do it like this:

for index, row in bdr_data.iterrows():
    name = row['Ref_Num']
    child_dir = (parent_directory + name)
    if os.path.exists(child_dir):             # check if the folder exists
        print(f'{child_dir} already exists')  # you may want to know what was skipped
        continue                              # skip this iteration
    os.makedirs(child_dir)                    # if the folder was not found, do what you need
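As a side note (not part of the original answer): on Python 3.2+, os.makedirs can do the existence check itself via exist_ok=True, although the rest of the loop body would still run, so the explicit continue above is better if you want to skip the download entirely:

os.makedirs(child_dir, exist_ok=True)  # does not raise if the folder already exists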
Hi everyone, this is my first time here, and I am a beginner in Python. I am in the middle of writing a program that produces a txt document containing information about a stock (Watchlist Info.txt), based on the input of another txt document containing the company names (Watchlist).
To achieve this, I have written 3 functions, of which 2 functions reuters_ticker() and stock_price() are completed as shown below:
def reuters_ticker(desired_search):
    # from company name, execute a Google search and return the Reuters stock ticker
    try:
        from googlesearch import search
    except ImportError:
        print('No module named google found')
    query = desired_search + ' reuters'
    for j in search(query, tld="com.sg", num=1, stop=1, pause=2):
        result = j
    ticker = re.search(r'\w+\.\w+$', result)
    return ticker.group()
Stock Price:
def stock_price(company, doc=None):
    ticker = reuters_ticker(company)
    request = 'https://www.reuters.com/companies/' + ticker
    raw_main = pd.read_html(request)
    data1 = raw_main[0]
    data1.set_index(0, inplace=True)
    data1 = data1.transpose()
    data2 = raw_main[1]
    data2.set_index(0, inplace=True)
    data2 = data2.transpose()
    stock_info = pd.concat([data1, data2], axis=1)
    if doc == None:
        print(company + '\n')
        print('Previous Close: ' + str(stock_info['Previous Close'][1]))
        print('Forward PE: ' + str(stock_info['Forward P/E'][1]))
        print('Div Yield(%): ' + str(stock_info['Dividend (Yield %)'][1]))
    else:
        from datetime import date
        with open(doc, 'a') as output:
            output.write(date.today().strftime('%d/%m/%y') + '\t' + str(stock_info['Previous Close'][1]) + '\t' + str(stock_info['Forward P/E'][1]) + '\t' + '\t' + str(stock_info['Dividend (Yield %)'][1]) + '\n')
            output.close()
The 3rd function, watchlist_report(), is where I am getting problems with writing the information in the format as desired.
def watchlist_report(watchlist):
    with open(watchlist, 'r') as companies, open('Watchlist Info.txt', 'a') as output:
        searches = companies.read()
        x = searches.split('\n')
        for i in x:
            output.write(i + ':\n')
            stock_price(i, doc='Watchlist Info.txt')
            output.write('\n')
When I run watchlist_report('Watchlist.txt'), where Watchlist.txt contains 'Apple' and 'Facebook' each on new lines, my output is this:
26/04/20 275.03 22.26 1.12
26/04/20 185.13 24.72 --
Apple:
Facebook:
Instead of what I want and would expect based on the code I have written in watchlist_report():
Apple:
26/04/20 275.03 22.26 1.12
Facebook:
26/04/20 185.13 24.72 --
Therefore, my questions are:
1) Why is my output formatted this way?
2) Which part of my code do I have to change to get the written output in my desired format?
Any other suggestions about how I can clean my code and any libraries I can use to make my code nicer are also appreciated!
You are handling two different file handles: the file handle opened inside stock_price gets closed (and therefore flushed) earlier, so its content is written first, before the outer function's file handle gets closed, flushed and written.
Instead of creating a new open(..) in your function, pass in the current file handle:
def watchlist_report(watchlist):
    with open(watchlist, 'r') as companies, open('Watchlist Info.txt', 'a') as output:
        searches = companies.read()
        x = searches.split('\n')
        for i in x:
            output.write(i + ':\n')
            stock_price(i, doc=output)  # pass the file handle
            output.write('\n')
Inside def stock_price(company, doc=None): use the provided file handle:
def stock_price(company, output=None):  # changed the parameter name here
    # [snip] - removed code unrelated to this answer for brevity's sake
    if output is None:       # check for None using "is"
        print( ... )         # print whatever you like here
    else:
        from datetime import date
        output.write( ... )  # write whatever you want it to write
        # output.close()     # do not close; the outer function does this
Do not close the file handle in the inner function; the context handling with(..) of the outer function does that for you.
The main takeaway for file handling is that things you write(..) to your file are not necessarily placed there immediately. The file handle chooses when to actually persist data to your disk; the latest it does that is when it goes out of scope (of the context handler), or when its internal buffer reaches some threshold so it "thinks" it is now prudent to write the data to your disc. See How often does python flush to a file? for more info.
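A minimal sketch of the flush behaviour (hypothetical file name):

import os

f = open('demo.txt', 'w')
f.write('hello')        # may sit in the process-local buffer for a while
f.flush()               # hand the buffer to the operating system now
os.fsync(f.fileno())    # optionally force the OS to persist it to disk
f.close()               # closing always flushes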
I want to use a .txt file to store API tokens for an application, but I got stuck trying to find a way to replace the API key/token if it is already in the file. This is the code I tried (Python 3.5):
data_to_save = {}
data_to_save['savetime'] = str(datetime.datetime.now())[:19]
data_to_save['api_key'] = key_submitted
data_to_save['user'] = uniqueid

api_exists = False
user_exists = False

with open("databases/api_keys.txt", 'r+') as f:
    database = json.loads(f.read())
    for i in database:
        if i['api_key'] == key_submitted:
            send_text_to_user(userid, "[b]Error: That API key is already in use.[/b]", "red")
            api_exists = True
        if i['user'] == uniqueid:
            user_exists = True
    if user_exists == True:
        if api_exists = True:
            send_text_to_user(userid, "[b]Error: Your API key was already saved at another time.[/b]", "red")
        else:
            f.write(json.dumps(data_to_save)) #Here, StackOverflow
            send_text_to_user(userid, "[b]Okay, I replaced your API key.[/b]", "green")
    f.close()

if user_exists == False:
    writing = open("databases/api_keys.txt", 'a')
    writing.write(json.dumps(data_to_save))
    writing.close()
I also want to know if this is the best way to do it or the code could be optimized and how.
Thank you, it has been done. Final code:

data_to_save = {'savetime': str(datetime.datetime.now())[:19], 'api_key': key_submitted, 'user': uniqueid}
api_exists = False

with open("databases/api_keys.txt", 'r') as f:
    database = json.loads(f.read())

for i in database[:]:  # iterate over a copy so removing is safe
    if i['user'] == uniqueid:
        database.remove(i)
    if i['api_key'] == key_submitted:
        send_text_to_user(userid, "[b]Error: That API key is already in use.[/b]", "red")
        api_exists = True
        break

if not api_exists:
    database.append(data_to_save)
    with open("databases/api_keys.txt", 'w') as f:
        f.write(json.dumps(database))
    send_text_to_user(userid, "[b]Okay, your API key was successfully stored.[/b]")
With this approach we don't even need separate save paths for whether the user exists or not: the old record is deleted if it is found, so it never exists by the time we save, and we just append a "new" record every time, except when the API key already belongs to another user.
There are many issues with given code, so let's start from the beginning:
We don't need to create an empty dict object and fill it on the next lines
data_to_save = {}
data_to_save['savetime'] = str(datetime.datetime.now())[:19]
data_to_save['api_key'] = key_submitted
data_to_save['user'] = uniqueid
when we can just create it already filled:
data_to_save = {'savetime': str(datetime.datetime.now())[:19],
'api_key': key_submitted,
'user': uniqueid}
Assignments are not allowed in if statements (more at docs)
if api_exists = True:
so this line will cause a SyntaxError (I guess this is a typo).
Checks like

if user_exists == True:
    ...

are redundant; we can just write

if user_exists:
    ...

and have the same effect.
We don't need to explicitly close the file when using a with statement; that is what context managers are for: cleanup after exiting the with statement block.
Your databases/api_keys.txt file will contain an invalid JSON object after the first pass, because you are simply writing the newly serialized data_to_save object at the end of the file. You should instead modify the database object (which seems to be a list of dictionaries) and write a serialized new version of it, so we don't need the r+ mode either.
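To illustrate, after one such append the file would contain something like

[{"savetime": "...", "api_key": "a", "user": 1}]{"savetime": "...", "api_key": "b", "user": 2}

and the next json.loads call would fail with a ValueError ("Extra data"), since there are two JSON documents back to back.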
Let's define a utility function which saves new API key data:

def save_database(database,
                  api_keys_file_path="databases/api_keys.txt"):
    with open(api_keys_file_path, 'w') as api_keys_file:
        api_keys_file.write(json.dumps(database))
then we can have something like

data_to_save = {'savetime': str(datetime.datetime.now())[:19],
                'api_key': key_submitted,
                'user': uniqueid}
api_exists = False
user_exists = False
with open("databases/api_keys.txt", 'r') as api_keys_file:
    database = json.loads(api_keys_file.read())
# database object should be iterable,
# containing dictionaries as elements,
# so the only possible solution -- it is a list of dictionaries
for i in database:
    if i['api_key'] == key_submitted:
        send_text_to_user(userid, "[b]Error: That API key is already in use.[/b]", "red")
        api_exists = True
    if i['user'] == uniqueid:
        user_exists = True
if user_exists:
    if api_exists:
        send_text_to_user(userid, "[b]Error: Your API key was already saved at another time.[/b]", "red")
    else:
        # looking for the user record in the database
        user_record = next(record for record in database
                           if record['user'] == uniqueid)
        # setting the new API key
        user_record['api_key'] = key_submitted
        save_database(database)
        send_text_to_user(userid, "[b]Okay, I replaced your API key.[/b]", "green")
if not user_exists:
    database.append(data_to_save)
    save_database(database)
Example
I've created a databases directory with an api_keys.txt file which contains the single line
[]
because at the beginning we have no API keys.
Let's assume our missing objects are defined like

key_submitted = '699aa2c2f9fc41f880d6ec79a9d55f29'
uniqueid = 3
userid = 42

def send_text_to_user(userid, msg, color):
    print(msg)
so with the above code I get empty output on the first script execution, and on the second one:
[b]Error: That API key is already in use.[/b]
[b]Error: Your API key was already saved at another time.[/b]
Further improvements
Should we break out of the for loop once one of the conditions (the API key or the user is already registered in the database) is satisfied?
Maybe it would be better to use an existing database instead of reinventing the wheel? If you want to keep JSON objects, you should take a look at TinyDB.
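A rough sketch with TinyDB (untested, assuming its documented query API, and that datetime and the variables from the snippets above are in scope):

from tinydb import TinyDB, Query

db = TinyDB('databases/api_keys.json')
User = Query()
if db.search(User.api_key == key_submitted):
    send_text_to_user(userid, "[b]Error: That API key is already in use.[/b]", "red")
else:
    # insert a record for a new user, or replace the record of an existing one
    db.upsert({'savetime': str(datetime.datetime.now())[:19],
               'api_key': key_submitted,
               'user': uniqueid},
              User.user == uniqueid)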
I am new to Python, and I want your advice on something.
I have a script that runs one input value at a time, and I want it to be able to run a whole list of such values without me typing them one at a time. I have a hunch that a for loop is needed for the main method listed below. The value is gene_name, so effectively I want to feed in a list of gene_names that the script can run through nicely.
Hope I phrased the question correctly, thanks! The chunk in question seems to be get_probes_from_genes(gene_names).
import json
import urllib2
import os
import pandas as pd

api_url = "http://api.brain-map.org/api/v2/data/query.json"

def get_probes_from_genes(gene_names):
    if not isinstance(gene_names, list):
        gene_names = [gene_names]
    # in case there are white spaces in gene names
    gene_names = ["'%s'" % gene_name for gene_name in gene_names]
    api_query = "?criteria=model::Probe"
    api_query += ",rma::criteria,[probe_type$eq'DNA']"
    api_query += ",products[abbreviation$eq'HumanMA']"
    api_query += ",gene[acronym$eq%s]" % (','.join(gene_names))
    api_query += ",rma::options[only$eq'probes.id','name']"
    data = json.load(urllib2.urlopen(api_url + api_query))
    d = {probe['id']: probe['name'] for probe in data['msg']}
    if not d:
        raise Exception("Could not find any probes for %s gene. Check " \
            "http://help.brain-map.org/download/attachments/2818165/HBA_ISH_GeneList.pdf?version=1&modificationDate=1348783035873 " \
            "for list of available genes." % gene_names)
    return d

def get_expression_values_from_probe_ids(probe_ids):
    if not isinstance(probe_ids, list):
        probe_ids = [probe_ids]
    # in case there are white spaces in probe ids
    probe_ids = ["'%s'" % probe_id for probe_id in probe_ids]
    api_query = "?criteria=service::human_microarray_expression[probes$in%s]" % (','.join(probe_ids))
    data = json.load(urllib2.urlopen(api_url + api_query))
    expression_values = [[float(expression_value) for expression_value in data["msg"]["probes"][i]["expression_level"]] for i in range(len(probe_ids))]
    well_ids = [sample["sample"]["well"] for sample in data["msg"]["samples"]]
    donor_names = [sample["donor"]["name"] for sample in data["msg"]["samples"]]
    well_coordinates = [sample["sample"]["mri"] for sample in data["msg"]["samples"]]
    return expression_values, well_ids, well_coordinates, donor_names

def get_mni_coordinates_from_wells(well_ids):
    package_directory = os.path.dirname(os.path.abspath(__file__))
    frame = pd.read_csv(os.path.join(package_directory, "data", "corrected_mni_coordinates.csv"), header=0, index_col=0)
    return list(frame.ix[well_ids].itertuples(index=False))

if __name__ == '__main__':
    probes_dict = get_probes_from_genes("SLC6A2")
    expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
    print get_mni_coordinates_from_wells(well_ids)
Whoa, first things first: Python ain't Java, so do yourself a favor and use a nice """xxx\nyyy""" string, with triple quotes for multiline text.
api_query = """?criteria=model::Probe"
,rma::criteria,[probe_type$eq'DNA']
...
"""
or something like that. You will get the white space exactly as typed, so you may need to adjust.
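If that white space is a problem, implicit string-literal concatenation (a sketch of the same query, reusing the question's gene_names) keeps the multiline layout in the source without baking newlines or indentation into the value:

api_query = ("?criteria=model::Probe"
             ",rma::criteria,[probe_type$eq'DNA']"
             ",products[abbreviation$eq'HumanMA']"
             ",gene[acronym$eq{}]"
             ",rma::options[only$eq'probes.id','name']").format(','.join(gene_names))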
If, as suggested, you opt to loop over calls to your function from a file, you will need to either try/except your data-not-found exception or handle missing data without throwing an exception. I would opt for returning an empty result myself and letting the caller worry about what to do with it.
If you do opt for raising an exception, create your own rather than using a generic Exception. That way your code can catch your expected exception first.
class MyNoDataFoundException(Exception):
    pass

# replace your current raise code with...
if not d:
    raise MyNoDataFoundException("your message here")
A clarification about catching exceptions, using the accepted answer as a starting point:
if __name__ == '__main__':
    with open(r"/tmp/genes.txt", "r") as f:
        for line in f.readlines():
            # keep track of your input data
            search_data = line.strip()
            try:
                probes_dict = get_probes_from_genes(search_data)
            except MyNoDataFoundException, e:
                # ...and do whatever you feel you need to do here...
                print "bummer about search_data:%s:\nexception:%s" % (search_data, e)
                continue  # skip to the next gene; probes_dict is not valid here
            expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
            print get_mni_coordinates_from_wells(well_ids)
You may want to create a file with gene names, then read the content of the file and call your function in a loop. Here is an example:

if __name__ == '__main__':
    with open(r"/tmp/genes.txt", "r") as f:
        for line in f.readlines():
            probes_dict = get_probes_from_genes(line.strip())
            expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
            print get_mni_coordinates_from_wells(well_ids)
I have a Python script that calls a system program and reads the output from a file out.txt, acts on that output, and loops. However, it doesn't work: a close investigation showed that the Python script just opens out.txt once and then keeps reading from that old copy. How can I make the Python script reread the file on each iteration? I saw a similar question here on SO, but it was about a Python script running alongside a program, not calling it, and the solution doesn't work. I tried closing the file before looping back, but it didn't do anything.
EDIT:
I already tried closing and opening, it didn't work. Here's the code:
import subprocess, os, sys

filename = sys.argv[1]
file = open(filename, 'r')
foo = open('foo', 'w')
foo.write(file.read().rstrip())
foo = open('foo', 'a')
crap = open(os.devnull, 'wb')
numSolutions = 0

while True:
    subprocess.call(["minisat", "foo", "out"], stdout=crap, stderr=crap)
    out = open('out', 'r')
    if out.readline().rstrip() == "SAT":
        numSolutions += 1
        clause = out.readline().rstrip()
        clause = clause.split(" ")
        print clause
        clause = map(int, clause)
        clause = map(lambda x: -x, clause)
        output = ' '.join(map(lambda x: str(x), clause))
        print output
        foo.write('\n' + output)
        out.close()
    else:
        break

print "There are ", numSolutions, " solutions."
You need to flush foo so that the external program can see its latest changes. When you write to a file, the data is buffered in the local process and sent to the system in larger blocks. This is done because updating the system file is relatively expensive. In your case, you need to force a flush of the data so that minisat can see it.
foo.write('\n'+output)
foo.flush()
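Alternatively (a sketch, not part of the original answer), you can open foo line-buffered so each completed line is handed to the OS without an explicit flush. Note that line buffering flushes on the newline, so the newline must come at the end of each write, whereas the question's code puts it first:

foo = open('foo', 'a', 1)   # buffering=1: line-buffered in text mode
foo.write(output + '\n')    # the trailing newline triggers the flush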
I rewrote it to hopefully be a bit easier to understand:

import os
from shutil import copyfile
import subprocess
import sys

TEMP_CNF = "tmp.in"
TEMP_SOL = "tmp.out"
NULL = open(os.devnull, "wb")

def all_solutions(cnf_fname):
    """
    Given a file containing a set of constraints,
    generate all possible solutions.
    """
    # make a copy of the original input file
    copyfile(cnf_fname, TEMP_CNF)
    while True:
        # run minisat to solve the constraint problem
        subprocess.call(["minisat", TEMP_CNF, TEMP_SOL], stdout=NULL, stderr=NULL)
        # look at the result
        with open(TEMP_SOL) as result:
            line = next(result)
            if line.startswith("SAT"):
                # Success - return solution
                line = next(result)
                solution = [int(i) for i in line.split()]
                yield solution
            else:
                # Failure - no more solutions possible
                break
        # disqualify the found solution
        with open(TEMP_CNF, "a") as constraints:
            new_constraint = " ".join(str(-i) for i in solution)
            constraints.write("\n")
            constraints.write(new_constraint)

def main(cnf_fname):
    """
    Given a file containing a set of constraints,
    count the possible solutions.
    """
    count = sum(1 for i in all_solutions(cnf_fname))
    print("There are {} solutions.".format(count))

if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("Usage: {} cnf.in".format(sys.argv[0]))
Open the file at the top of each loop iteration and end the iteration with file_var.close():

for ... :
    ga_file = open('out.txt', 'r')
    # ... do stuff
    ga_file.close()
Demo of an implementation below (as simple as possible; this is all of the Jython code needed)...

__author__ = ''

import time

var = 'false'
while var == 'false':
    out = open('out.txt', 'r')
    content = out.read()
    time.sleep(3)
    print content
    out.close()
generates this output:
2015-01-09, 'stuff added'
2015-01-09, 'stuff added' # <-- this is when i just saved my update
2015-01-10, 'stuff added again :)' # <-- my new output from file reads
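The same open/read/close pattern with a with block (a sketch) closes the file for you on every iteration, even if the body raises:

while True:
    with open('out.txt', 'r') as ga_file:
        content = ga_file.read()
        # ... act on content ...
    # ga_file is closed automatically here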
I strongly recommend reading the error messages; they hold quite a lot of information. I also think the full file name should be printed for debugging purposes.