I have this structure:
├── app
│   ├── __init__.py
│   └── views.py
├── requirements.txt
├── sources
│   └── passport
│       ├── field_mapping.
│       ├── listener.py
│       ├── main.py
This is my __init__.py file:
from flask import Flask
app = Flask(__name__)
from app import views
This is my views file. Is this the best way to send plain text?
from app import app
from flask import Response
from sources.app_metrics import meters
# from sources.passport.main import subscription_types
@app.route('/metrics')
def metrics():
    def generateMetrics():
        metrics = ""
        for subscription in ["something", "some other thing"]:
            metrics += "thing_{}_count {}\n".format(subscription, meters[subscription].get()['count'])
        return metrics
    print(generateMetrics())
    return Response(generateMetrics(), mimetype='text/plain')
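As for the question itself: returning a Response with mimetype='text/plain' is a standard way to serve Prometheus-style text from Flask. A small standalone sketch of just the string-building part (the metric names and counts here are made up, not your real meters):

```python
# Build a Prometheus-style plain-text exposition body (names are examples).
def render_metrics(counts):
    lines = []
    for name, count in counts.items():
        lines.append("thing_{}_count {}".format(name, count))
    return "\n".join(lines) + "\n"

body = render_metrics({"something": 3, "some_other_thing": 7})
print(body)
```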
My sources/passport/main file looks like this:
subscription_types = ["opportunity", "account", "lead"]
if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    ...
    for subscription in subscription_types:
I also ran export FLASK_APP=app/__init__.py before running flask run.
When I visit /metrics I get an error that looks like some kind of circular dependency error.
When I uncomment that import in my views file, the error occurs.
Pulling out subscription_types into a variable and importing it seems to be causing the problem.
My stack trace:
  File "/usr/local/lib/python3.7/site-packages/flask/cli.py", line 235, in locate_app
    __import__(module_name)
  File "/Users/jwan/extract/app/__init__.py", line 5, in <module>
    from app import views
  File "/Users/jwan//extract/app/views.py", line 5, in <module>
    from sources.passport.main import subscription_types
  File "/Users/jwan/extract/sources/passport/main.py", line 3, in <module>
    from sources.passport.listener import subscribe, close_subscriptions
  File "/Users/jwan/extract/sources/passport/listener.py", line 18, in <module>
    QUEUE = boto3.resource("sqs").get_queue_by_name(QueueName=CONFIG["assertions_queue"][ENV])
botocore.errorfactory.QueueDoesNotExist: An error occurred (AWS.SimpleQueueService.NonExistentQueue) when calling the GetQueueUrl operation: The specified queue does not exist for this wsdl version.
My sources/passport/listener file has this on line 18:
import gzip
import log
from os import getenv
from sources.passport.normalizer import normalize_message
from sources.app_metrics import meters
QUEUE = boto3.resource("sqs").get_queue_by_name(QueueName=CONFIG["assertions_queue"][ENV])
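Because QUEUE is built at module import time, merely importing anything from this package (even just subscription_types) triggers the SQS lookup. A common fix is to defer resource creation until first use. A minimal sketch of the pattern (the names and the _connect stand-in are mine, not your real boto3 code):

```python
# Lazy initialization sketch: the expensive handle is created on first use,
# not when the module is imported. `_connect` stands in for the real
# boto3.resource("sqs").get_queue_by_name(...) call.
_queue = None

def _connect():
    return "queue-handle"  # placeholder for the real SQS lookup

def get_queue():
    global _queue
    if _queue is None:
        _queue = _connect()
    return _queue
```

With this shape, `from sources.passport.main import subscription_types` no longer touches SQS at all; the queue is only resolved when some code actually calls get_queue().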
I am working with Python 3.7 with Anaconda and Visual Studio Code.
I have a project folder called providers.
Inside providers I have two folders:
├── providers
│   ├── __init__.py
│   ├── config
│   │   ├── database_connection.py
│   │   ├── __init__.py
│   ├── src
│   │   ├── overview.py
│   │   ├── __init__.py
│   ├── utils
│   │   ├── pandas_functions.py
│   │   ├── __init__.py
I want to import a class named DatabaseConnection, defined in the file database_connection.py, into the file overview.py:
# overview.py
from config.database_connection import DatabaseConnection
This works as expected.
I want to run some tests, I am using pytest, and the script looks like this:
from unittest.mock import patch, Mock
import pandas as pd
from config.database_connection import DatabaseConnection
from providers.utils.pandas_functions import get_df
@patch("providers.utils.pandas_functions.pd.read_sql")
def test_get_df(read_sql: Mock):
    read_sql.return_value = pd.DataFrame({"foo_id": [1, 2, 3]})
    results = get_df()
    read_sql.assert_called_once()
    pd.testing.assert_frame_equal(results, pd.DataFrame({"bar_id": [1, 2, 3]}))
But it gives me this error:
plugins: hypothesis-5.5.4, arraydiff-0.3, astropy-header-0.1.2, doctestplus-0.5.0, mock-3.4.0, openfiles-0.4.0, remotedata-0.3.2
collected 0 items / 1 error
===================================================================================================== ERRORS =====================================================================================================
________________________________________________________________________________ ERROR collecting tests/test_pandas_functions.py _________________________________________________________________________________
ImportError while importing test module 'C:\Users\jordi_adm\Documents\GitHub\mcf-pipelines\cpke-cash-advance\providers\tests\test_pandas_functions.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests\test_pandas_functions.py:3: in <module>
from config.database_connection import DatabaseConnection
E ModuleNotFoundError: No module named 'config'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================================================================ 1 error in 0.35s ================================================================================================
It cannot find the config module, while if I run the overview.py file directly it imports it without problems.
If I write the import with providers at the beginning (from providers.config.database_connection import DatabaseConnection), pytest is able to run the test:
from unittest.mock import patch, Mock
import pandas as pd
from providers.config.database_connection import DatabaseConnection
from providers.utils.pandas_functions import get_df
@patch("providers.utils.pandas_functions.pd.read_sql")
def test_get_df(read_sql: Mock):
    read_sql.return_value = pd.DataFrame({"foo_id": [1, 2, 3]})
    results = get_df()
    read_sql.assert_called_once()
    pd.testing.assert_frame_equal(results, pd.DataFrame({"bar_id": [1, 2, 3]}))
plugins: hypothesis-5.5.4, arraydiff-0.3, astropy-header-0.1.2, doctestplus-0.5.0, mock-3.4.0, openfiles-0.4.0, remotedata-0.3.2
collected 1 item
tests\test_pandas_functions.py . [100%]
=============================================================================================== 1 passed in 0.29s ================================================================================================
If I try to run a script inside of this project with providers at the beginning, for example modifying overview.py:
from providers.config.database_connection import DatabaseConnection
I obtain this error:
from providers.config.database_connection import DatabaseConnection
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
c:\Users\jordi_adm\Documents\GitHub\mcf-pipelines\cpke-cash-advance\providers\run.py in
----> 2 from providers.config.database_connection import DatabaseConnection
ModuleNotFoundError: No module named 'providers'
Why does pytest need providers at the beginning of the import?
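The difference comes from what each runner puts on sys.path: python src/overview.py prepends the script's own directory (so config is importable), while pytest inserts its rootdir (the directory containing providers), so imports must be spelled from that root. A sketch of the mechanism (the directory here is just a stand-in for whichever one the launcher chooses):

```python
# sys.path decides which spellings of an import can resolve.
# `python src/overview.py`: the script's directory is prepended,
#   so `from config...` works.
# `pytest` from the repo root: the root is inserted instead,
#   so only `from providers.config...` works.
import os
import sys

root = os.getcwd()  # stand-in for the directory the launcher prepends
sys.path.insert(0, root)
print(sys.path[0] == root)  # the prepended directory wins the lookup
```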
As much as I think I understand Python's import system, I still find myself lost...
I want to change a file (which is my programs main entry point) into a directory, yet I can't get the imports to run successfully
I can't seem to understand how to get sys.path to match.
$ cat > prog.py << EOF
> import sys
> print(sys.path[0])
> EOF
$ python3 prog.py
/home/me/pyprogram
$ mkdir prog
$ mv prog.py prog/__main__.py
$ python3 prog
prog
$ mv prog/__main__.py prog/__init__.py
$ python3 prog/__init__.py
/home/me/pyprogram/prog
For a bit more context on what I am trying to achieve (and I might be designing my program wrong; feedback gladly accepted):
$ tree --dirsfirst
.
├── prog
│ ├── data_process.py
│ └── __init__.py
├── destination.py
└── source.py
1 directory, 4 files
$ cat source.py
def get():
    return 'raw data'
$ cat destination.py
def put(data):
    print(f"{data} has been passed successfully")
$ cat prog/__init__.py
#!/usr/bin/env python
import os

class Task:
    def __init__(self, func, args=None, kwargs=None):
        self.func = func
        self.args = args if args else []
        self.kwargs = kwargs if kwargs else {}

    def run(self):
        self.func(*self.args, **self.kwargs)

tasks = []

def register_task(args=None, kwargs=None):
    def registerer(func):
        tasks.append(Task(func, args, kwargs))
        return func
    return registerer

for module in os.listdir(os.path.dirname(os.path.abspath(__file__))):
    if module.startswith('_') or module.startswith('.'):
        continue
    __import__(os.path.splitext(module)[0])
del module

for task in tasks:
    task.run()
$ cat prog/data_process.py
from source import get
from destination import put
from . import register_task

@register_task(kwargs={'replace_with': 'cleaned'})
def process(replace_with):
    raw = get()
    cleaned = raw.replace('raw', replace_with)
    put(cleaned)
$ python3 prog/__init__.py
Traceback (most recent call last):
File "prog/__init__.py", line 27, in <module>
__import__(os.path.splitext(module)[0])
File "/home/me/pyprogram/prog/data_process.py", line 1, in <module>
from source import get
ModuleNotFoundError: No module named 'source'
$ mv prog/__init__.py prog/__main__.py
$ python3 prog/
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "prog/__main__.py", line 27, in <module>
__import__(os.path.splitext(module)[0])
File "prog/data_process.py", line 1, in <module>
from source import get
ModuleNotFoundError: No module named 'source'
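When a directory is run with python3 prog/, sys.path[0] is that directory itself, not its parent, so top-level siblings like source.py are invisible. One workaround (a sketch, assuming you keep source.py and destination.py next to the prog/ directory) is to push the project root onto sys.path at the top of __main__.py before the dynamic imports run:

```python
# Top of prog/__main__.py (sketch): make the project root importable so
# `from source import get` resolves when running `python3 prog`.
import os
import sys

ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if ROOT not in sys.path:
    sys.path.insert(0, ROOT)
```

The more conventional alternative is to run the package from the root with python3 -m prog, which puts the current directory on sys.path for you.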
Project structure update
I changed the structure:
1. Placing all libraries into utils.
2. Placing all projects into projects (using __init__.py to allow for easy import of all created projects in the folder).
3. Main program script program.py in the top project directory.
Project structure:
$ tree
.
├── utils
│ ├── source.py
│ ├── remote_dest.py
│ ├── local_dest.py
│ └── __init__.py
├── projects
│ ├── process2.py
│ ├── process1.py
│ └── __init__.py
└── program.py
Contents of libraries defined in utils directory:
$ cat utils/source.py
"""
Emulates expensive resource to get,
bringing the need to cache it for all client projects.
"""
import time

class _Cache:
    def __init__(self):
        self.data = None

_cache = _Cache()

def get():
    """
    Exposed source API for getting the data,
    gets from the remote resource or returns from the available cache.
    """
    if _cache.data is None:  # As well as cache expiration.
        _cache.data = list(_expensive_get())
    return _cache.data

def _expensive_get():
    """
    Emulates an expensive `get` request,
    prints to console when it is invoked.
    """
    print('Invoking expensive get')
    sample_data = [
        'some random raw data',
        'which is in some raw format',
        'it is so raw that it will need cleaning',
        'but now it is very raw'
    ]
    for row in sample_data:
        time.sleep(1)
        yield row
$ cat utils/remote_dest.py
"""
Emulates a limited remote resource.
Uses a thread and a queue to have the data sent in the background.
"""
import time
import threading
import queue

_q = queue.Queue()

def put(data):
    """
    Exposed remote API `put` method.
    """
    _q.put(data)

def _send(q):
    """
    Emulates the remote resource,
    prints to console when data is processed.
    """
    while True:
        time.sleep(1)
        data = q.get()
        print(f"Sending {data}")

threading.Thread(target=_send, args=(_q,), daemon=True).start()
$ cat utils/local_dest.py
"""
Emulates a second data destination.
Demonstrates the need for shared libraries.
"""
import datetime
import os

# Create `out` dir if it doesn't yet exist.
_out_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'out')
if not os.path.exists(_out_dir):
    os.makedirs(_out_dir)

def save(data):
    """
    Exposed API to store data locally.
    """
    out_file = os.path.join(_out_dir, 'data.txt')
    with open(out_file, 'a') as f:
        f.write(f"[{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {data}\n")
Main program execution script contents:
$ cat program.py
#!/usr/bin/env python
import os

class Task:
    """
    Class storing `func` along with its `args` and `kwargs` to be run with.
    """
    def __init__(self, func, args=None, kwargs=None):
        self.func = func
        self.args = args if args else []
        self.kwargs = kwargs if kwargs else {}

    def run(self):
        """
        Executes stored `func` with its arguments.
        """
        self.func(*self.args, **self.kwargs)

    def __repr__(self):
        return f"<Task({self.func.__name__})>"

# List that will store the registered tasks to be executed by the main program.
tasks = []

def register_task(args=None, kwargs=None):
    """
    Registers decorated function along with the passed `args` and `kwargs`
    in the `tasks` list as a `Task` for maintained execution.
    """
    def registerer(func):
        print(f"Appending '{func.__name__}' in {__name__}")
        tasks.append(Task(func, args, kwargs))  # Saves the function as a task.
        print(f"> tasks in {__name__}: {tasks}")
        return func  # Returns the function untouched.
    return registerer

print(f"Before importing projects as {__name__}. tasks: {tasks}")
import projects
print(f"After importing projects as {__name__}. tasks: {tasks}")

print(f"Iterating over tasks: {tasks} in {__name__}")
while True:
    for task in tasks:
        task.run()
    break  # Only run once in the simulation.
Contents of the individual projects defined in the projects directory:
$ cat projects/process1.py
"""
Sample project that uses the shared remote resource to get data
and passes it on to another remote resource after processing.
"""
from utils.source import get
from utils.remote_dest import put
from program import register_task

@register_task(kwargs={'replace_with': 'cleaned'})
def process1(replace_with):
    raw = get()
    for record in raw:
        put(record.replace('raw', replace_with))
$ cat projects/process2.py
"""
Sample project that uses the shared remote resource to get data
and saves it locally after processing.
"""
from utils.source import get
from utils.local_dest import save
from program import register_task

@register_task()
def process2():
    raw = get()
    for record in raw:
        save(record.replace('raw', '----'))
Content of __init__.py file in the projects directory:
$ cat projects/__init__.py
"""
Use the __init__ file to import all projects
that might have been registered with `program.py` using `register_task`.
"""
from . import process1, process2

# TODO: Dynamically import all projects (whether file or directory (as project))
# that will be created in the `projects` directory automatically
# (ignoring any modules that start with an `_`).
# Something in the sense of:
# ```
# for module in os.listdir(os.path.dirname(os.path.abspath(__file__))):
#     if module.startswith('_') or module.startswith('.'):
#         continue
#     __import__(os.path.splitext(module)[0])
# ```
Yet when I run the program I see that:
1. program.py gets executed twice (once as __main__ and once as program).
2. The tasks are appended (in the second execution run).
Yet when iterating over the tasks, none are found.
$ python3 program.py
Before importing projects as __main__. tasks: []
Before importing projects as program. tasks: []
After importing projects as program. tasks: []
Iterating over tasks: [] in program
Appending 'process1' in program
> tasks in program: [<Task(process1)>]
Appending 'process2' in program
> tasks in program: [<Task(process1)>, <Task(process2)>]
After importing projects as __main__. tasks: []
Iterating over tasks: [] in __main__
I don't understand:
Why is the main (program.py) file being executed twice? I thought there couldn't be circular-import trouble, since Python caches imported modules.
(I took the idea of the circular imports from flask applications, i.e. app.py imports routes, models, etc., all of which import app and use it to define the functionality, and app.py imports them back so that the functionality is added (as flask only runs app.py).)
Why is the tasks list empty after the processes are appended to it?
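What is happening: running python3 program.py executes the file as the module named __main__; when projects then does from program import register_task, Python finds no cached module named program, so it executes the same file a second time and creates a separate module object with its own tasks list. The effect can be reproduced in a few lines without any files (a sketch, with SRC standing in for program.py's module-level state):

```python
import types

SRC = "tasks = []"  # stands in for program.py's module-level list

def load(name):
    # Execute the same source under a different module name, which is what
    # happens when program.py is both run as __main__ and imported as `program`.
    mod = types.ModuleType(name)
    exec(SRC, mod.__dict__)
    return mod

main_mod = load("__main__")  # the copy created by `python3 program.py`
imported = load("program")   # the copy created by `import program`
imported.tasks.append("process1")
print(main_mod.tasks)  # [] -- the registration went to the other copy
```

So the registrations really do happen, but they land in the `program` copy of the list, while the `__main__` copy that iterates over tasks stays empty.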
After comparing my circular imports to a flask-based app that does circular imports, as follows:
Sample flask program that uses circular imports
Flask app structure
(venv) $ echo $FLASK_APP
mgflask.py
(venv) $ tree
.
├── app
│ ├── models
│ │ ├── __init__.py
│ │ ├── post.py
│ │ └── user.py
│ ├── templates/
│ ├── forms.py
│ ├── __init__.py
│ └── routes.py
├── config.py
└── mgflask.py
(venv) $ cat mgflask.py
#!/usr/bin/env python
from app import app
# ...
(venv) $ cat app/__init__.py
from flask import Flask
from config import Config
# ... # config imports
app = Flask(__name__) # <---
# ... # config setup
from . import routes, models, errors # <---
(venv) $ cat app/routes.py
from flask import render_template, flash, redirect, url_for, request
# ... # import extensions
from . import app, db  # <---
from .forms import ...
from .models import ...

@app.route('/')
def index():
    return render_template('index.html', title='Home')
(venv) $ flask run
* Serving Flask app "mgflask.py" (lazy loading)
* Environment: production
WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.
* Debug mode: on
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
* Restarting with stat
* Debugger is active!
* Debugger PIN: ???-???-???
I restructured my app by:
1. Moving the Task class, the tasks list, and the register_task decorator function into projects/__init__.py, and at the bottom of that __init__.py file importing the projects defined in the directory.
2. In program.py I now just do from projects import tasks, and everything works as desired.
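A condensed sketch of that arrangement, with the registry living in projects/__init__.py so there is only ever one tasks list (names follow the question; the Task class is collapsed to a tuple here for brevity):

```python
# projects/__init__.py (condensed sketch): the single home for the registry.
tasks = []

def register_task(args=None, kwargs=None):
    def registerer(func):
        tasks.append((func, args or [], kwargs or {}))
        return func
    return registerer

# program.py then only needs:  from projects import tasks

@register_task(kwargs={'replace_with': 'cleaned'})
def process(replace_with):
    return 'raw data'.replace('raw', replace_with)
```

Because every project and program.py now import the same projects package (and Python caches it after the first import), registration and iteration finally see the same list.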
The only question that remains is: what is the difference between running prog.py vs prog/ (which contains __main__.py)? (The first iteration of my question here...)
I am working on writing some data tests. Super simple, nothing crazy.
Here is what my current directory looks like.
.
├── README.md
├── hive_tests
│ ├── __pycache__
│ ├── schema_checks_hive.py
│ ├── test_schema_checks_hive.py
│ └── yaml
│ └── job_output.address_stats.yaml
└── postgres
├── __pycache__
├── schema_checks_pg.py
├── test_schema_checks_pg.py
└── yaml
When I cd into postgres and run pytest, all my tests pass.
When I cd into hive_tests and run pytest, I get an import error.
Here is my schema_checks_hive.py file.
from pyhive import hive
import pandas as pd
import numpy as np
import os, sys
import yaml
def check_column_name_hive(schema, table):
    query = "DESCRIBE {0}.{1}".format(schema, table)
    df = pd.read_sql_query(query, conn)
    print(df.columns)
    return df.columns

check_column_name_hive('myschema', 'mytable')
Here is my test_schema_checks_hive.py file where the tests are located.
import schema_checks_hive as sch
import pandas as pd
import yaml
import sys, os
def test_column_names_hive():
    for filename in os.listdir('yaml'):
        data = ""
        with open("yaml/{0}".format(filename), 'r') as stream:
            data = yaml.safe_load(stream)
        schema = data['schema']
        table = data['table']
        cols = data['columns']
        df = sch.check_column_name_hive(schema, table)
        assert len(cols) == len(df)
        assert cols == df.tolist()
When I run pytest I get an error that says:
ImportError while importing test module '/Usersdata/
tests/hive_tests/test_schema_checks_hive.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
test_schema_checks_hive.py:1: in <module>
    import schema_checks_hive as sch
schema_checks_hive.py:1: in <module>
    from pyhive import hive
E ModuleNotFoundError: No module named 'pyhive'
I would love any help! Thanks so much.
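That final line points at a missing dependency (pyhive) in the interpreter pytest is using, rather than a path problem between the two folders. A quick way to check from the same interpreter, before involving pytest at all (a sketch; swap in the module you care about):

```python
# Check whether an import is resolvable at all in this environment.
import importlib.util

def is_installed(name):
    """Return True if `name` can be imported in this interpreter."""
    return importlib.util.find_spec(name) is not None

# In the failing environment, is_installed("pyhive") would return False;
# "json" is a stdlib module, so it is always present.
print(is_installed("json"))
```

If it comes back False for pyhive, installing the package into that environment (e.g. with pip) is the fix, and the difference between the two folders is simply which dependencies their tests pull in.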
I want to test some methods in my spider.
For example in my project, I have this schema:
toto/
├── __init__.py
├── items.py
├── pipelines.py
├── settings.py
├── spiders
│ ├── __init__.py
│ └── mySpider.py
└── Unitest
└── unitest.py
my unitest.py look like that:
# -*- coding: utf-8 -*-
import re
import weakref
import six
import unittest
from scrapy.selector import Selector
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from unittest.case import TestCase
from toto.spiders import runSpider
class SelectorTestCase(unittest.TestCase):
    sscls = Selector

    def test_demo(self):
        print "test"

if __name__ == '__main__':
    unittest.main()
and my mySpider.py, look like that:
import scrapy
class runSpider(scrapy.Spider):
name = 'blogspider'
start_urls = ['http://blog.scrapinghub.com']
def parse(self, response):
for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
yield scrapy.Request(response.urljoin(url), self.parse_titles)
def parse_titles(self, response):
for post_title in response.css('div.entries > ul > li a::text').extract():
yield {'title': post_title}
In my unitest.py file, how can I call my spider?
I tried to add from toto.spiders import runSpider in my unitest.py file, but it does not work...
I've got this error:
Traceback (most recent call last):
  File "unitest.py", line 10, in <module>
    from toto.spiders import runSpider
ImportError: No module named toto.spiders
How can I fix it?
Try:
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(os.path.realpath(__file__)), '../..')) #2 folder back from current file
from toto.spiders.mySpider import runSpider
I'm getting an error message that I'm unable to tackle. I don't get what the issue with the multiprocessing library is, and I don't understand why it says that it is impossible to import the build_database module while at the same time it executes a function from that module perfectly.
Could somebody tell me if they see something? Thank you.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Python27\lib\multiprocessing\forking.py", line 380, in main
Traceback (most recent call last):
File "<string>", line 1, in <module>
prepare(preparation_data)
File "C:\Python27\lib\multiprocessing\forking.py", line 380, in main
File "C:\Python27\lib\multiprocessing\forking.py", line 495, in prepare
prepare(preparation_data)
'__parents_main__', file, path_name, etc
File "C:\Python27\lib\multiprocessing\forking.py", line 495, in prepare
File "C:\Users\Comp3\Desktop\User\Data\main.py", line 4, in <module>
'__parents_main__', file, path_name, etc
    import database.build_database
  File "C:\Users\Comp3\Desktop\User\Data\main.py", line 4, in <module>
    import database.build_database
ImportError: No module named build_database
ImportError: No module named build_database
This is what I have in my load_bigquery.py file:
# Send CSV to Cloud Storage
def load_send_csv(table):
    job = multiprocessing.current_process().name
    print '[' + table + '] : job starting (' + job + ')'
    bigquery.send_csv(table)

@timer.print_timing
def send_csv(tables):
    jobs = []
    build_csv(tables)
    for t in tables:
        if t not in csv_targets:
            continue
        print ">>>> Starting " + t
        # Load CSV in BigQuery, as parallel jobs
        j = multiprocessing.Process(target=load_send_csv, args=(t,))
        jobs.append(j)
        j.start()
    # Wait for jobs to complete
    for j in jobs:
        j.join()
And I call it like this from my main.py:
bigquery.load_bigquery.send_csv(tables)
My folder is like this:
src
| main.py
|
├───bigquery
│ │ bigquery.py
│ │ bigquery2.dat
│ │ client_secrets.json
│ │ herokudb.py
│ │ herokudb.pyc
│ │ distimo.py
│ │ flurry.py
│ │ load_bigquery.py
│ │ load_bigquery.pyc
│ │ timer.py
│ │ __init__.py
│ │ __init__.pyc
│ │
│ │
├───database
│ │ build_database.py
│ │ build_database.pyc
│ │ build_database2.py
│ │ postgresql.py
│ │ timer.py
│ │ __init__.py
│ │ __init__.pyc
That function works perfectly if I execute load_bigquery.py alone, but if I import it into main.py it fails with the errors given above.
UPDATE:
Here are my imports, in case it helps:
main.py
import database.build_database
import bigquery.load_bigquery
import views.build_analytics
import argparse
import getopt
import sys
import os
load_bigquery.py
import sys
import os
import subprocess
import time
import timer
import distimo
import flurry
import herokudb
import bigquery
import multiprocessing
import httplib2
bigquery.py
import sys
import os
import subprocess
import json
import time
import timer
import httplib2
from pprint import pprint
from apiclient.discovery import build
from oauth2client.file import Storage
from oauth2client.client import AccessTokenRefreshError
from oauth2client.client import OAuth2WebServerFlow
from oauth2client.client import flow_from_clientsecrets
from oauth2client.tools import run
from apiclient.errors import HttpError
Maybe the issue is that load_bigquery.py imports multiprocessing and then main.py imports load_bigquery.py?
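Worth noting, separately from the import question: on Windows, multiprocessing child processes re-import the parent's main module (which is why forking.py shows main.py in the trace), so any top-level work in main.py runs again in every child. The standard defence is the __main__ guard; a sketch with stand-in names, not your real BigQuery code:

```python
# Sketch: keep all top-level work behind the guard so that when
# multiprocessing re-imports this module on Windows, nothing re-runs.
import multiprocessing

def load_send_csv(table):
    # Stand-in for the real per-table job.
    return table.upper()

def main():
    jobs = []
    for t in ["users", "events"]:
        j = multiprocessing.Process(target=load_send_csv, args=(t,))
        jobs.append(j)
        j.start()
    for j in jobs:
        j.join()

if __name__ == "__main__":
    main()
```

This doesn't by itself explain the build_database ImportError, but it removes one source of surprising re-execution during child startup.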
You are probably missing the __init__.py inside src/bigquery/. So your source folders should be:
> src/main.py
> src/bigquery/__init__.py
> src/bigquery/load_bigquery.py
> src/bigquery/bigquery.py
The __init__.py just needs to be empty and is only there so that Python knows that bigquery is a Python package.
UPDATE: Apparently the __init__.py file is present. The actual error message talks about a different error: it cannot import database.build_database.
My suggestion is to look into that. It is not mentioned as being in the src folder...
UPDATE 2: I think you have a clash with your imports. Python 2 has slightly fuzzy relative imports, which sometimes catch people out. You have both a package at the same level as main.py called database and one inside bigquery called database. I think somehow you are ending up with the one inside bigquery, which doesn't have build_database. Try renaming one of them.