One python script executes, another does not, when using cron - python

I have a python script, followback.py, which I am trying to run by using cron.
The script runs fine on its own i.e. when run by command 'python followback.py'.
But the script is never run when using cron.
My crontab file:
* * * * * python /home/ubuntu/./followback.py
* * * * * python /home/ubuntu/./test.py
I am using test.py as a simple test: it writes to a file to let me know that it has been run.
followback.py:
import io, json
import twitter

def save_json(filename, data):
    with io.open('{0}.json'.format(filename),
                 'w', encoding='utf-8') as f:
        f.write(unicode(json.dumps(data, ensure_ascii=False)))

def load_json(filename):
    with io.open('{0}.json'.format(filename),
                 encoding='utf-8') as f:
        return f.read()

CONSUMER_KEY = xx
CONSUMER_SECRET = xx
OAUTH_TOKEN = xx
OAUTH_TOKEN_SECRET = xx
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)
q = 'followback'
count = 20
page = 1
results = []
maxResults = 50
filename = 'attempted_accounts'
# load the accounts handled by earlier runs, if the file exists
try:
    usedUsers = json.loads(load_json(filename))
except IOError:
    usedUsers = []
usedList = [used['id'] for used in usedUsers]
# search for 'followback' and follow the ones with 'followback' in description
while len(results) < maxResults:
    users = twitter_api.users.search(q=q, count=count, page=page)
    results += [user for user in users if 'followback' in user['description'] and user['id'] not in usedList]
    page += 1
for user in results:
    twitter_api.friendships.create(user_id=user['id'], follow='true')
out = usedUsers + [{'id' : e['id']} for e in results]
save_json(filename, out)
The script above simply searches twitter for users with followback in the description and follows them.
The test.py script runs fine through cron but followback.py does not and I have no clue as to what could be wrong.
Any suggestions?

Check whether the followback.py file is executable; if it is not, use chmod +x and it should work. That's the most common cause of this problem. Look at a similar case here.
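When a script works by hand but not under cron, one way to see what actually goes wrong is to write any exception to a log file with an absolute path. The following is only a minimal sketch: `run_logged`, `broken_job` and the log file name are hypothetical, and the `ImportError` is an invented stand-in failure, not a confirmed diagnosis of followback.py.

```python
import os
import tempfile
import traceback

def run_logged(func, log_path):
    """Call func(); on any exception append the traceback to log_path and return False."""
    try:
        func()
        return True
    except Exception:
        with open(log_path, 'a') as log:
            log.write(traceback.format_exc())
        return False

def broken_job():
    # Hypothetical stand-in for the body of followback.py that fails under cron.
    raise ImportError("No module named twitter")

# Absolute path, so the log lands in the same place regardless of cron's working directory.
log_path = os.path.join(tempfile.gettempdir(), 'followback_cron.log')
ok = run_logged(broken_job, log_path)
```

After the next cron run, the log file should contain the traceback, which usually narrows the problem down to a missing module, a wrong interpreter, or a path issue.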

Related

Can't get correct path to roaming, when it is running as windows service

I want to get the path to the Roaming folder, which in my case should end up like this:
A:\Users\Mitja\AppData\Roaming
But when the program is run as a Windows service, all I get is:
C:\Windows\System32\config\systemprofile\AppData\Roaming
I tried multiple libraries, but they all gave the same result. Does anyone know how I could get the correct path?
I already tried all of these:
import os
import ctypes
import ctypes.wintypes

roaming_folder = os.getenv('APPDATA')

roaming_folder = os.path.expanduser('~\\AppData\\Roaming')

roaming_folder = os.environ['APPDATA']

def get_appdata():
    CSIDL_APPDATA = 26
    SHGFP_TYPE_CURRENT = 0
    buf = ctypes.create_unicode_buffer(ctypes.wintypes.MAX_PATH)
    ctypes.windll.shell32.SHGetFolderPathW(None, CSIDL_APPDATA, None, SHGFP_TYPE_CURRENT, buf)
    return buf.value

roaming_folder = get_appdata()

import win32com.client
shell = win32com.client.Dispatch("WScript.Shell")
roaming_folder = shell.SpecialFolders("AppData")

Threading NNTP, how? (Newbie here)

I can't wrap my head around how I could possibly rewrite my code to be multi-threaded.
The code I'm writing automatically archives every single article in a list of existing newsgroups, but I want to be able to make full use of my newsgroup plan and run up to 20 threads. I've never coded threading before and my attempts were in vain.
Here's my code, excluding the username and pass ( but you can get a free account with max 5 threads if you really want to at https://my.xsusenet.com )
Please don't judge me too hard :(
import nntplib
import sys
import datetime
import os

basetime = datetime.datetime.today()
#daysback = int(sys.argv[1])
#date_list = [basetime - datetime.timedelta(days=x) for x in range(daysback)]
s = nntplib.NNTP('free.xsusenet.com', user='USERNAME', password='PASSWORD') # I am only allowed 5 connections at a time, so try for 4.
groups = []
resp, groups_list_tuple = s.list()

def remove_non_ascii_2(string):
    return string.encode('ascii', errors='ignore').decode()

for g_tuple in groups_list_tuple:
    #print(g_tuple) # DEBUG_LINE
    # Parse group_list info
    group = g_tuple[0]
    last = g_tuple[1]
    first = g_tuple[2]
    flag = g_tuple[3]
    # Parse newsgroup info
    resp, count, first, last, name = s.group(group)
    for message_id in range(first, last):
        resp, number, mes_id = s.next()
        resp, info = s.article(mes_id)
        if os.path.exists('.\\' + group):
            pass
        else:
            os.mkdir('.\\' + group)
        print(f"Downloading: {message_id}")
        outfile = open('.\\' + group + '\\' + str(message_id), 'a', encoding="utf-8")
        for line in info.lines:
            outfile.write(remove_non_ascii_2(str(line)) + '\n')
        outfile.close()
I tried threading with a ThreadPoolExecutor to make it use 20 threads, but it failed: it repeated the same process for the same message id. The expected result was to download 20 different messages at a time.
Here's the code I tried with threading. Mind you, I went through 6-8 variations to try to get it to work; this was the last one before I gave up and asked here.
import nntplib
import sys
import datetime
import os
import concurrent.futures

basetime = datetime.datetime.today()
#daysback = int(sys.argv[1])
#date_list = [basetime - datetime.timedelta(days=x) for x in range(daysback)]
s = nntplib.NNTP('free.xsusenet.com', user='USERNAME', password='PASSWORD') # I am only allowed 5 connections at a time, so try for 4.
groups = []
resp, groups_list_tuple = s.list()

def remove_non_ascii_2(string):
    return string.encode('ascii', errors='ignore').decode()

def download_nntp_file(mess_id):
    resp, count, first, last, name = s.group(group)
    message_id = range(first, last)
    resp, number, mes_id = s.next()
    resp, info = s.article(mes_id)
    if os.path.exists('.\\' + group):
        pass
    else:
        os.mkdir('.\\' + group)
    print(f"Downloading: {mess_id}")
    outfile = open('.\\' + group + '\\' + str(mess_id), 'a', encoding="utf-8")
    for line in info.lines:
        outfile.write(remove_non_ascii_2(str(line)) + '\n')
    outfile.close()

for g_tuple in groups_list_tuple:
    #print(g_tuple) # DEBUG_LINE
    # Parse group_list info
    group = g_tuple[0]
    last = g_tuple[1]
    first = g_tuple[2]
    flag = g_tuple[3]
    # Parse newsgroup info
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = executor.submit(download_nntp_file)
I can't test it with XSUseNet.
I wouldn't use global variables, because when threads run at the same time they may read the same values from these variables.
You should instead pass the values to the function as parameters.
Something like this:
def download_nntp_file(g_tuple):
    # ... code which uses `g_tuple` instead of global variables ...

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    for g_tuple in groups_list_tuple:
        executor.submit(download_nntp_file, g_tuple)
But it would be simpler to use map() instead of submit(), because it takes a list of arguments and doesn't need a for-loop:
def download_nntp_file(g_tuple):
    # ... code which uses `g_tuple` instead of global variables ...

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    executor.map(download_nntp_file, groups_list_tuple)
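The parameter-passing approach can be tried without a news server. This is a rough sketch under stated assumptions: `download_article`, `download_all`, the group name and the `(group, message_id)` task tuples are hypothetical, and the worker only builds a string where a real one would fetch and save an article.

```python
import concurrent.futures

def download_article(task):
    """Hypothetical worker: everything it needs arrives in its single argument."""
    group, message_id = task
    # A real worker would fetch the article here, probably over its own
    # NNTP connection rather than a connection shared between threads.
    return "{0}/{1}".format(group, message_id)

def download_all(tasks, max_workers=5):
    """Run the worker over all tasks in a thread pool and collect the results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() hands one task to each call and yields results in input order.
        return list(executor.map(download_article, tasks))

tasks = [("alt.test", n) for n in range(1, 6)]
results = download_all(tasks)
```

Because each call receives its own `(group, message_id)` pair, no two threads work on the same message. In the threaded attempt in the question, every call appears to re-select the group on the shared connection and then read the next article, which would plausibly explain why the same message came back each time.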

Python on Crontab does not execute bash script

import subprocess as sub
import re
import os
from datetime import datetime as influx_timestap
from influxdb import InfluxDBClient
from collections import OrderedDict

insert_json = []
hostname = str(sub.check_output('hostname')).strip()
location = str(sub.check_output(['ps -ef | grep mgr'], shell=True)).split()
current_dir = os.getcwd()
print("script executed")
gg_location_pattern = re.compile(r'mgr\.prm$')
gg_process_pattertn = re.compile(r'^REPLICAT|^EXTRACT')
for index in location:
    if gg_location_pattern.search(index) != None:
        gg_location = index[:-14]
os.chdir(gg_location)
print("checkpoint1")
get_lag = sub.check_output(str(current_dir) + '/ggsci_test.sh', shell=True)
print("checkpoint2")
processes = get_lag.split("\n")
for process in processes:
    if gg_process_pattertn.search(process) != None:
        lag_at_chkpnt = int((process.split()[3]).split(":")[0]) * 3600 + int((process.split()[3]).split(":")[1]) *60 + int((process.split()[3]).split(":")[2])
        time_since_chkpnt = int((process.split()[4]).split(":")[0]) * 3600 + int((process.split()[4]).split(":")[1]) *60 + int((process.split()[4]).split(":")[2])
        process_dict = OrderedDict({"measurement": "GoldenGate_Mon_" + str(hostname) + "_Graph",
                                    "tags": {"hostname": hostname, "process_name": process.split()[2]},
                                    "time": influx_timestap.now().isoformat('T'),
                                    "fields": {"process_type": process.split()[0], "process_status": process.split()[1],
                                               "lag_at_chkpnt": lag_at_chkpnt, "time_since_chkpnt": time_since_chkpnt}})
        insert_json.append(process_dict)
host = 'xxxxxxxx'
port = 'x'
user = 'x'
password = 'x'
dbname = 'x'
print("before client")
client = InfluxDBClient(host, port, user, password, dbname)
client.write_points(insert_json)
print("after client")
This code works perfectly when run manually, but it does not work from crontab. After searching the internet, I found advice to change or set the PATH variable in the crontab. I changed my PATH variable and it is still not working.
The crontab log file shows "checkpoint1" and nothing after that. So the line that is not working is "get_lag = sub.check_output(str(current_dir) + '/ggsci_test.sh', shell=True)".
What can I do from here?
Take care,
it looks like your external script (ggsci_test.sh) has some issues with the paths / general failure.
From the Python subprocess documentation about subprocess.check_output:
If the return code was non-zero it raises a CalledProcessError. The
CalledProcessError object will have the return code in the returncode
attribute and any output in the output attribute.
So that's why you see the error when you catch it but cannot continue.
You should therefore check whether your shell script has any issues that need to be solved first.
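To find out what the shell script reports under cron, the exception can be caught and its return code and output inspected. The sketch below assumes nothing about ggsci_test.sh itself: `run_script` is a hypothetical helper, and the failing child process is a stand-in that exits with status 3.

```python
import subprocess
import sys

def run_script(cmd):
    """Run cmd and return (returncode, output) instead of letting CalledProcessError propagate."""
    try:
        # stderr is merged into stdout so error messages end up in the output too.
        output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        return 0, output
    except subprocess.CalledProcessError as err:
        return err.returncode, err.output

# Stand-in for ggsci_test.sh: prints a message and exits with a non-zero status.
failing = [sys.executable, '-c', "import sys; print('no ggsci'); sys.exit(3)"]
code, output = run_script(failing)
```

Printing `code` and `output` to the cron log should show why the script fails. One thing worth checking, as a guess rather than a confirmed cause: cron typically starts jobs in the user's home directory, so a path built from `os.getcwd()` may not point at the directory that holds ggsci_test.sh.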

Manage Python Multiprocessing with MongoDB

I'm trying to run my code with a multiprocessing function, but Mongo keeps returning
"MongoClient opened before fork. Create MongoClient with
connect=False, or create client after forking."
I really don't understand how I can adapt my code to this.
Basically the structure is:
db = MongoClient().database
db.authenticate('user', 'password', mechanism='SCRAM-SHA-1')
collectionW = db['words']
collectionT = db['sinMemo']
collectionL = db['sinLogic']

def findW(word):
    rows = collectionW.find({"word": word})
    ind = 0
    for row in rows:
        ind += 1
        id = row["_id"]
    if ind == 0:
        a = ind
    else:
        a = id
    return a

def trainAI(stri):
    ...
    if findW(word) == 0:
        _id = db['words'].insert(
            {"_id": getNextSequence(db.counters, "nodeid"), "word": word})
        story = _id
    else:
        story = findW(word)
    ...
def train(index):
    # searching progress
    progFile = "./train/progress{0}.txt".format(index)
    trainFile = "./train/small_file_{0}".format(index)
    if os.path.exists(progFile):
        f = open(progFile, "r")
        ind = f.read().strip()
        if ind != "":
            pprint(ind)
            i = int(ind)
        else:
            pprint("No progress saved or progress lost!")
            i = 0
        f.close()
    else:
        i = 0
    #get the number of line of the file
    rangeC = rawbigcount(trainFile)
    #fix unicode
    non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
    files = io.open(trainFile, "r", encoding="utf8")
    str1 = ""
    str2 = ""
    filex = open(progFile, "w")
    with progressbar.ProgressBar(max_value=rangeC) as bar:
        for line in files:
            line = line.replace("\n", "")
            if i % 2 == 0:
                str1 = line.translate(non_bmp_map)
            else:
                str2 = line.translate(non_bmp_map)
                bar.update(i)
                trainAI(str1 + " " + str2)
                filex.seek(0)
                filex.truncate()
                filex.write(str(i))
            i += 1
#multiprocessing function
maxProcess = 3

def f(l, i):
    l.acquire()
    train(i + 1)
    l.release()

if __name__ == '__main__':
    lock = Lock()
    for num in range(maxProcess):
        pprint("start " + str(num))
        Process(target=f, args=(lock, num)).start()
This code is meant to read 4 different files in 4 different processes and insert the data into the database at the same time.
I copied only part of the code so that you can understand its structure.
I've tried adding connect=False to this code, but it made no difference:
db = MongoClient(connect=False).database
db.authenticate('user', 'password', mechanism='SCRAM-SHA-1')
collectionW = db['words']
collectionT = db['sinMemo']
collectionL = db['sinLogic']
Then I tried to move it into the f function (right before train()), but then the program doesn't find collectionW, collectionT and collectionL.
I'm not an expert in Python or MongoDB, so I hope that this is not a silly question.
The code is running under Ubuntu 16.04.2 with python 2.7.12
db.authenticate will have to connect to mongo server and it will try to make a connection. So, even though connect=False is being used, db.authenticate will require a connection to be open.
Why don't you create the Mongo client instance after the fork? That looks like the easiest solution.
Since db.authenticate must open the MongoClient and connect to the server, it creates connections which won't work in the forked subprocess. Hence, the error message. Try this instead:
db = MongoClient('mongodb://user:password@localhost', connect=False).database
Also, delete the Lock l. Acquiring a lock in one subprocess has no effect on other subprocesses.
Here is how I did it for my problem:
import pathos.pools as pp
import time
import db_access

class MultiprocessingTest(object):
    def __init__(self):
        pass

    def test_mp(self):
        data = [[form,'form_number','client_id'] for form in range(5000)]
        pool = pp.ProcessPool(4)
        pool.map(db_access.insertData, data)

if __name__ == '__main__':
    time_i = time.time()
    mp = MultiprocessingTest()
    mp.test_mp()
    time_f = time.time()
    print 'Time Taken: ', time_f - time_i
Here is db_access.py:
from pymongo import MongoClient

def insertData(form):
    client = MongoClient()
    db = client['TEST_001']
    db.initialization.insert({
        "form": form[0],
        "form_number": form[1],
        "client_id": form[2]
    })
This is happening in your code because you are creating MongoClient() once for all the sub-processes. MongoClient is not fork-safe, so creating it inside each function works. Let me know if there are other solutions.
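Creating a client on every call works, but it can be wasteful. A possible refinement, sketched here without a database, is to keep one client per process and create it lazily on first use. `FakeClient` is a hypothetical stand-in for `pymongo.MongoClient`, and `get_client` is an illustrative helper, not part of PyMongo.

```python
import os

class FakeClient(object):
    """Stand-in for pymongo.MongoClient so the sketch runs without a server."""
    def __init__(self):
        # Remember which process created this client.
        self.pid = os.getpid()

# One cached client per process id. A forked child inherits the parent's
# entry but never uses it, because it looks up its own pid.
_clients = {}

def get_client(factory=FakeClient):
    """Return a client that was created in the current process."""
    pid = os.getpid()
    if pid not in _clients:
        _clients[pid] = factory()
    return _clients[pid]

client = get_client()
```

With real PyMongo, `factory` would be something like `lambda: MongoClient(...)`, and a function such as insertData would call `get_client()` instead of building a new client for each inserted document.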

how can we write the program by using multiprocessing module of Python?

# -*- coding: utf-8 -*-
from __future__ import print_function
import os, codecs, re, string, mysql
import mysql.connector

'''Reading files with txt extension'''
y_ = ""
for root, dirs, files in os.walk("/Users/Documents/source-document/part1"):
    for file in files:
        if file.endswith(".txt"):
            x_ = codecs.open(os.path.join(root,file),"r", "utf-8-sig")
            for lines in x_.readlines():
                y_ = y_ + lines
#print(tokenized_docs)

'''Tokenizing sentences of the text files'''
from nltk.tokenize import sent_tokenize
raw_docs = sent_tokenize(y_)
tokenized_docs = [sent_tokenize(y_) for sent in raw_docs]

'''Removing stop words'''
stopword_removed_sentences = []
from nltk.corpus import stopwords
stopset = stopwords.words("English")
for i in tokenized_docs[0]:
    tokenized_docs = ' '.join([word for word in i.split() if word not in stopset])
    stopword_removed_sentences.append(tokenized_docs)

''' Removing punctuation marks'''
regex = re.compile('[%s]' % re.escape(string.punctuation)) #see documentation here: http://docs.python.org/2/library/string.html
nw = []
for review in stopword_removed_sentences:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review += new_token
    nw.append(new_review)

'''Lowercasing letters after removing puctuation marks.'''
lw = [] #lw stands for lowercase word.
for i in nw:
    k = i.lower()
    lw.append(k)

'''Removing number with a dummy symbol'''
nr = []
for j in lw:
    string = j
    regex = r'[^\[\]]+(?=\])'
    # let "#" be the dummy symbol
    output = re.sub(regex,'#',string)
    nr.append(output)
nrfinal = []
for j in nr:
    rem = 0
    outr = ''
    for i in j:
        if ord(i)>= 48 and ord(i)<=57:
            rem += 1
            if rem == 1:
                outr = outr+ '#'
        else:
            rem = 0
            outr = outr+i
    nrfinal.append(outr)

'''Inserting into database'''
def connect():
    for j in nrfinal:
        conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'Thesis' )
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,j))
        conn.commit()
        conn.close()

if __name__ == '__main__':
    connect()
I am not getting any error with this code, and it works well for the text files. The only problem is the execution time: I have a lot of text files (nearly 6 GB), and the program takes too long to process them. On inspection I found that it is CPU-bound, so multiprocessing is needed to solve it. Please help me rewrite my code with the multiprocessing module so that the processing can be done in parallel.
Thank you all.
There's an example in the Python docs which demonstrates the use of multiprocessing:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
You can use this to adapt your code. Once you've obtained the text files, you use the map function to execute the rest in parallel. You'd have to define a function encapsulating the code you want to execute on multiple cores.
However, reading files in parallel may decrease performance. Also, adding content to the database asynchronously may not work. So you may still want to perform these two tasks in the main process.
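Applied to the question's pipeline, the per-sentence cleaning could be wrapped in one function and handed to `Pool.map`. This is a simplified sketch, not a drop-in replacement: `clean_sentence`, `clean_all` and the tiny `STOPWORDS` set are hypothetical (the question uses NLTK's stop-word list), and the bracket-substitution step is left out.

```python
import re
import string
from multiprocessing import Pool

# Tiny illustrative stop-word set (assumption); the question uses NLTK's list.
STOPWORDS = frozenset(["the", "a", "an", "is", "of"])
PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))
DIGITS_RE = re.compile(r"\d+")

def clean_sentence(sentence):
    """Drop stop words, strip punctuation, lowercase, and mask digit runs with '#'."""
    words = [w for w in sentence.split() if w.lower() not in STOPWORDS]
    text = PUNCT_RE.sub("", " ".join(words)).lower()
    return DIGITS_RE.sub("#", text)

def clean_all(sentences, workers=4):
    """Clean sentences in parallel worker processes, keeping the input order."""
    with Pool(workers) as pool:
        # A larger chunksize cuts inter-process overhead for many short items.
        return pool.map(clean_sentence, sentences, chunksize=1000)
```

`clean_all` would be called under an `if __name__ == '__main__':` guard, after the files have been read and split into sentences in the main process. The cleaned list it returns can then be inserted into the database from the main process, in line with the caveat above.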
