I have a script that processes downloaded log files into CSVs according to some parsing logic.
I want to write those CSVs to a remote directory on a different server, because of space constraints on the server where I execute the script.
I have tried a few variations of the script below, but I just can't seem to figure this out. I understand the "SFTP" and "SSH" commands, but I am not sure if that is the right approach in this use case. (I have the keys set up to allow remote connections between the servers.)
import os
import re
import string
import csv
import gzip
import extract5
import pickle
import shutil
import pygeoip
import paramiko

def conn():
    # open an SSH connection to the remote server (keys are already set up)
    ssh = paramiko.SSHClient()
    ssh.load_system_host_keys()
    ssh.connect('XXXXXXXXXX.XXXX.XXXXXXXX.COM', username='XXXXXXX')
    ssh_stdin, ssh_stdout, ssh_stderr = ssh.exec_command('cd /fs/fs01/crmdata/SYWR/AAM/AAM_Master/')
    return ssh

downloadroot = '/opt/admin/AAM_Scripts/'
outgoing = conn()

for sfile in os.listdir(downloadroot):
    if sfile.endswith('gz'):
        oname = sfile[sfile.find('AAM'):]
        extract5.process(downloadroot, sfile, oname, outgoing)

# delete downloaded files and the pickle
for date in os.listdir(downloadroot):
    #match = re.match('(\d{4})[/.-](\d{2})[/,-](\d{2})$', date)
    if date.find('AAM_CDF_1181_') > 0:
        os.remove('%s%s' % (downloadroot, date))
os.remove('%sdownload.pkl' % (downloadroot))
Is what I am trying to do possible? Am I on the right path, or is my approach completely off? I would love some thoughts on how, or whether, I can accomplish this.
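For what it is worth, one way to do this is to open an SFTP session with paramiko and write each CSV straight to the remote path, so the output never has to sit on the local disk. The sketch below is only illustrative: the host name, username, remote directory, and the write_rows_remotely helper are placeholders I made up, and it assumes key-based authentication is already configured between the servers as described above.

import csv
import io
import paramiko

REMOTE_DIR = '/fs/fs01/crmdata/SYWR/AAM/AAM_Master/'  # hypothetical remote target directory

def open_sftp(host, username):
    # key-based authentication is assumed to be set up already
    ssh = paramiko.SSHClient()
    ssh.load_system_host_keys()
    ssh.connect(host, username=username)
    return ssh, ssh.open_sftp()

def write_rows_remotely(sftp, rows, remote_name):
    # hypothetical helper: serialize parsed rows to CSV and write them on the remote server
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    with sftp.open(REMOTE_DIR + remote_name, 'wb') as remote_file:
        remote_file.write(buf.getvalue().encode('utf-8'))

ssh, sftp = open_sftp('remote.example.com', 'someuser')
try:
    write_rows_remotely(sftp, [['ip', 'timestamp'], ['10.0.0.1', '20140101']], 'AAM_sample.csv')
finally:
    sftp.close()
    ssh.close()

Depending on how extract5.process writes its output, the SFTP client could be passed in as the outgoing handle so the parser writes its CSVs remotely instead of locally.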
My data is stored as CSV files in an OSS bucket on Alibaba Cloud.
I am currently executing a Python script in which I:
Download the file to my local machine.
Make the changes with a Python script on my local machine.
Store the result in AWS.
I have to modify this approach and schedule a cron job in Alibaba Cloud to automate the running of this script.
The Python script will be uploaded into Task Management in Alibaba Cloud.
So the new steps will be:
Read a file from the OSS bucket into pandas.
Modify it (merge it with other data, change some columns) in pandas.
Store the modified data in AWS RDS.
I am stuck at the first step itself.
Error log:
"No module found" for oss2 and pandas.
What is the correct way of doing this?
This is a rough draft of my script (how I was able to execute it on my local machine):
import os, re
import oss2              # throws an error: No module found.
import datetime as dt
import pandas as pd      # throws an error: No module found.
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice

dates = (dt.datetime.now() + dt.timedelta(days=-1)).strftime("%Y%m%d")

def download_file(access_key_id, access_key_secret, endpoint, bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates + 'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            df = pd.read_csv(bucket.get_object(order_file))  # read into pandas
            # FUNCTION to modify and upload
        print("File downloaded")
    except Exception:
        print("Pls check!!! File not read")
    return objectName
A revised version of the script reads the OSS object into an in-memory buffer and then hands that buffer to pandas:

import os, re
import oss2
import datetime as dt
import pandas as pd
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice
import io  # include this new library

dates = (dt.datetime.now() + dt.timedelta(days=-1)).strftime("%Y%m%d")

def download_file(access_key_id, access_key_secret, endpoint, bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates + 'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            bucket_object = bucket.get_object(order_file).read()  # read the file from OSS
            img_buf = io.BytesIO(bucket_object)
            df = pd.read_csv(img_buf)  # read into pandas
            # FUNCTION to modify and upload
        print("File downloaded")
    except Exception:
        print("Pls check!!! File not read")
    return objectName
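The "No module found" errors simply mean that the environment executing the script on Alibaba Cloud does not have oss2 and pandas installed, so they need to be installed into (or bundled with) whatever environment Task Management uses to run the job. For the last step in the list above, storing the modified data in AWS RDS, a minimal sketch using mysql.connector (which the script already imports) could look like the following; the host, credentials, database, table, and column names are all hypothetical placeholders:

import mysql.connector

def upload_to_rds(df):
    # hypothetical connection details for a MySQL instance on AWS RDS
    conn = mysql.connector.connect(
        host='mydb.xxxxxxxx.rds.amazonaws.com',
        user='dbuser',
        password='dbpass',
        database='orders_db',
    )
    cursor = conn.cursor()
    # hypothetical target table with columns matching the DataFrame
    insert_sql = "INSERT INTO orders (order_id, amount) VALUES (%s, %s)"
    rows = [tuple(r) for r in df[['order_id', 'amount']].itertuples(index=False)]
    cursor.executemany(insert_sql, rows)
    conn.commit()
    cursor.close()
    conn.close()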
I am a PHP developer and was tasked with writing a Python script. I am lost.
Short summary:
My client wants to log which directory and file on which URL was visited by which IP.
My code:
import sys
import os
import io
import os.path
import socket
import requests
directory = os.getcwd()
ip = os.environ["REMOTE_ADDR"]
hostname = socket.gethostname()
i = socket.gethostbyname(hostname)
If I may ask two questions:
How do I run Python on cPanel? I have uploaded the file example.py and navigated to it, but the page only displays the code.
On the above code (in online code testers), I get the error KeyError: 'REMOTE_ADDR'. What am I doing wrong?
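For context on the second question: REMOTE_ADDR is an environment variable that the web server sets when it executes the script (for example as a CGI handler), which is why it does not exist in online code testers or on the command line. Below is a hedged sketch of a CGI-style version; the log file path and log format are made up for illustration, and it assumes the hosting account is configured to execute .py files rather than serve them as plain text:

#!/usr/bin/env python3
import os
import datetime

# values provided by the web server; fall back to placeholders when run outside it
ip = os.environ.get("REMOTE_ADDR", "unknown-ip")
url = os.environ.get("REQUEST_URI", "unknown-url")
directory = os.getcwd()

# hypothetical log location and format
with open("/home/youruser/visit_log.txt", "a") as log:
    log.write("%s\t%s\t%s\t%s\n" % (datetime.datetime.now().isoformat(), ip, url, directory))

# a CGI script must print headers before any body
print("Content-Type: text/plain")
print()
print("Logged visit from %s to %s" % (ip, url))

The "page only displays the code" symptom is the same issue from the other direction: the server is serving the file as text instead of executing it, which is a hosting configuration matter (CGI or a Python application handler in cPanel) rather than a problem in the code itself.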
How can I read a NetCDF file remotely with Python?
It is easy to read a file from a server over SSH with Python, but how can I replace the call sftp_client.open() with something like netCDF4.Dataset() so that the result is stored in a variable?
In the following example, I temporarily download the file locally instead of reading it remotely:
import os
import paramiko
import netCDF4
remotefile = 'pathtoremotefile'
localfile = 'pathtolocaltmpfile'
ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh_client.connect('myserver', username="myname", password="mypswd")
sftp_client = ssh_client.open_sftp()
# Something similar to this, but for a NetCDF file?
# Or how to use remote_file in netCDF4 afterwards?
# remote_file = sftp_client.open(remotefile)

# Here, I just download it to manipulate it locally...
sftp_client.get(remotefile, localfile)

try:
    ncfile = netCDF4.Dataset(localfile)
    # Operations...
finally:
    sftp_client.close()
    ssh_client.close()
    os.remove(localfile)
You can mount the remote filesystem locally using sshfs and then open the file as a local file:

import os
import netCDF4

localfile = 'pathtolocalmountpoint/relativepathtofile'

try:
    ncfile = netCDF4.Dataset(localfile)
    # Operations...
finally:
    ncfile.close()
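Another option, if the installed netCDF4 library (and the underlying netcdf-c) is new enough to support the memory= keyword of netCDF4.Dataset, is to read the remote file's bytes over SFTP and open the dataset directly from memory, with no temporary file. This is only a sketch under that assumption, reusing the connection details from the question:

import paramiko
import netCDF4

remotefile = 'pathtoremotefile'

ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh_client.connect('myserver', username="myname", password="mypswd")
sftp_client = ssh_client.open_sftp()

try:
    # pull the raw bytes of the remote NetCDF file over SFTP
    with sftp_client.open(remotefile, 'rb') as remote_file:
        data = remote_file.read()
    # open the dataset from memory (requires a netCDF4 build that supports memory=)
    ncfile = netCDF4.Dataset('inmemory.nc', mode='r', memory=data)
    # Operations...
    ncfile.close()
finally:
    sftp_client.close()
    ssh_client.close()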
Problem
I'm trying to download more than 100,000 files from an FTP server in parallel (using threads). I previously tried it with urlretrieve, as answered here, however this gave me the following error: URLError(OSError(24, 'Too many open files')). Apparently this problem is a bug (I cannot find the reference anymore), so I tried to use urlopen in combination with shutil and then write to a file object which I could close myself, as described here. This seemed to work fine, but then I got the same error again: URLError(OSError(24, 'Too many open files')). I thought that whenever writing to a file is incomplete or fails, the with statement would cause the file to close itself, but apparently the files stay open and eventually cause the script to halt.
Question
How can I prevent this error, i.e. make sure that every file gets closed?
Code
import csv
import urllib.request
import shutil
from multiprocessing.dummy import Pool

def url_to_filename(url):
    filename = 'patric_genomes/' + url.split('/')[-1]
    return filename

def download(url):
    url = url.strip()
    try:
        with urllib.request.urlopen(url) as response, open(url_to_filename(url), 'wb') as out_file:
            shutil.copyfileobj(response, out_file)
    except Exception as e:
        return None, e

def build_urls(id_list):
    base_url = 'ftp://some_ftp_server/'
    urls = []
    for some_id in id_list:
        url = base_url + some_id + '/' + some_id + '.fna'
        print(url)
        urls.append(url)
    return urls

if __name__ == "__main__":
    with open('full_data/genome_ids.txt') as inFile:
        reader = csv.DictReader(inFile, delimiter='\t')
        ids = {row['some_id'] for row in reader}
    urls = build_urls(ids)
    p = Pool(100)
    print(p.map(download, urls))
You may try using contextlib to close your file, like this:

import contextlib

[ ... ]

with contextlib.closing(urllib.request.urlopen(url)) as response, open(url_to_filename(url), 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

[ ... ]
According to the docs:
contextlib.closing(thing)
Return a context manager that closes thing upon completion of the block. [ ... ] without needing to explicitly close page. Even if an error occurs, page.close() will be called when the with block is exited.
A workaround would be to raise the open-files limit on your Linux OS. Check your current hard limit on open files:
ulimit -Hn
Add the following line to your /etc/sysctl.conf file:
fs.file-max = <number>
where <number> is the new upper limit on open files you want to set.
Save and close the file, then run:
sysctl -p
so that the changes take effect.
I believe the file handles you create are not disposed of by the system in time, since it takes a while to close a connection. So you end up exhausting the available file handles (and that includes network sockets) very quickly.
What you are doing is setting up an FTP connection for each of your files. This is bad practice. A better way is to open 5-15 connections and reuse them, downloading the files through the existing sockets without the overhead of an initial FTP handshake for each file; a rough sketch of this follows below. See this post for reference.
P.S. Also, as @Tarun_Lalwani mentioned, it is not a good idea to put more than about 1,000 files in a single folder, as it slows down the file system.
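To make the reuse idea above concrete, here is a rough sketch (not the answerer's code) in which each worker thread opens a single ftplib connection and downloads its whole batch of files over it; the host, credentials, id list, and paths are placeholders borrowed loosely from the question:

import ftplib
from multiprocessing.dummy import Pool

FTP_HOST = 'some_ftp_server'   # placeholder
FTP_USER = 'anonymous'         # placeholder
FTP_PASS = ''                  # placeholder

def download_batch(ids):
    # one FTP connection per worker, reused for every file in the batch
    ftp = ftplib.FTP(FTP_HOST, FTP_USER, FTP_PASS)
    try:
        for some_id in ids:
            remote_path = some_id + '/' + some_id + '.fna'
            local_path = 'patric_genomes/' + some_id + '.fna'
            with open(local_path, 'wb') as out_file:
                ftp.retrbinary('RETR ' + remote_path, out_file.write)
    finally:
        ftp.quit()

def chunks(seq, n):
    # split the id list into batches, one per worker
    seq = list(seq)
    size = max(1, len(seq) // n)
    return [seq[i:i + size] for i in range(0, len(seq), size)]

if __name__ == '__main__':
    ids = ['0001', '0002', '0003']  # placeholder ids
    with Pool(10) as pool:
        pool.map(download_batch, chunks(ids, 10))

Batching this way keeps the number of simultaneous FTP connections (and open local files) equal to the pool size rather than one per URL.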
How can I prevent this error, i.e. make sure that every file gets closed?
To prevent the error you need to either increase the open-file limit or, more reasonably, decrease the concurrency of your thread pool. Connection and file closing are handled properly by the context managers.
Your thread pool has 100 threads and opens at least 200 handles (one for the FTP connection and another for the file). Reasonable concurrency would be about 10-30 threads.
Here's a simplified reproduction which shows that the code is okay. Put some content in somefile in the current directory.
test.py
#!/usr/bin/env python3

import sys
import shutil
import logging
from pathlib import Path
from urllib.request import urlopen
from multiprocessing.dummy import Pool as ThreadPool

def download(id):
    ftp_url = sys.argv[1]
    filename = Path(__name__).parent / 'files'
    try:
        with urlopen(ftp_url) as src, (filename / id).open('wb') as dst:
            shutil.copyfileobj(src, dst)
    except Exception as e:
        logging.exception('Download error')

if __name__ == '__main__':
    with ThreadPool(10) as p:
        p.map(download, (str(i).zfill(4) for i in range(1000)))
And then in the same directory:
$ docker run --name=ftp-test -d -e FTP_USER=user -e FTP_PASSWORD=pass \
-v `pwd`/somefile:/srv/dir/somefile panubo/vsftpd vsftpd /etc/vsftpd.conf
$ IP=`docker inspect --format '{{ .NetworkSettings.IPAddress }}' ftp-test`
$ curl ftp://user:pass@$IP/dir/somefile
$ python3 test.py ftp://user:pass@$IP/dir/somefile
$ docker stop ftp-test && docker rm -v ftp-test
I am trying to transfer files between two different FTP locations. The simple goal is to copy files of a specific type from FTP location A to FTP location B, but only those from the last few hours, using a Python script.
I am using ftplib to perform the task and have put together the code below.
So far the file transfer works fine for the single file defined in the from_Sock variable, but I am hitting a roadblock when trying to loop through all files created within the last 2 hours and copy them. In other words, the script currently copies an individual file, but I want to move all files with a particular extension (for example *.jpg) that were created within the last 2 hours. I tried to use MDTM to find the file modification time, but I have not been able to implement it the right way.
Any help on this is much appreciated. Below is the current code:
import ftplib

srcFTP = ftplib.FTP("test.com", "username", "pass")
srcFTP.cwd("/somefolder")

desFTP = ftplib.FTP("test2.com", "username", "pass")
desFTP.cwd("/")

from_Sock = srcFTP.transfercmd("RETR Test1.text")
to_Sock = desFTP.transfercmd("STOR test1.text")

state = 0
while 1:
    block = from_Sock.recv(1024)
    if len(block) == 0:
        break
    state += len(block)
    while len(block) > 0:
        sentlen = to_Sock.send(block)
        block = block[sentlen:]
    print state, "Total Bytes Transferred"

from_Sock.close()
to_Sock.close()

srcFTP.quit()
desFTP.quit()
Thanks,
DD
Here is some short code that takes a path and uploads every file with a .jpg extension via FTP. It's not exactly what you want, but I stumbled on your question and this might help you on your way.
import os
from ftplib import FTP

def ftpPush(filepathSource, filename, filepathDestination):
    ftp = FTP(IP, username, password)   # fill in your own connection details
    ftp.cwd(filepathDestination)
    # use a binary transfer for .jpg files
    with open(filepathSource + filename, 'rb') as f:
        ftp.storbinary("STOR " + filename, f)
    ftp.quit()

path = '/some/path/'

for fileName in os.listdir(path):
    if fileName.endswith(".jpg"):
        ftpPush(filepathSource=path, filename=fileName, filepathDestination='/some/destination/')
The modification time of a file on an FTP server can be checked using MDTM, as in this example:
fileName = "nameOfFile.txt"
modifiedTime = ftp.sendcmd('MDTM ' + fileName)
# successful response: '213 20120222090254'
ftp.quit()
Now you just need to check when each file was modified, download it if it falls within your desired time threshold, and then upload it to the other server.
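Putting the pieces together, a rough sketch of the "last 2 hours" filter might look like the code below. It is only illustrative: the hostnames and credentials are the placeholders from the question, it assumes the server reports MDTM timestamps in UTC as in the example above, and it buffers each file in memory rather than streaming socket-to-socket as in the question:

import io
import ftplib
from datetime import datetime, timedelta

srcFTP = ftplib.FTP("test.com", "username", "pass")
srcFTP.cwd("/somefolder")
desFTP = ftplib.FTP("test2.com", "username", "pass")
desFTP.cwd("/")

cutoff = datetime.utcnow() - timedelta(hours=2)

for name in srcFTP.nlst():
    if not name.lower().endswith(".jpg"):
        continue
    # MDTM reply looks like '213 20120222090254'
    reply = srcFTP.sendcmd("MDTM " + name)
    modified = datetime.strptime(reply[4:18], "%Y%m%d%H%M%S")
    if modified < cutoff:
        continue
    # download into memory, then upload to the destination server
    buf = io.BytesIO()
    srcFTP.retrbinary("RETR " + name, buf.write)
    buf.seek(0)
    desFTP.storbinary("STOR " + name, buf)

srcFTP.quit()
desFTP.quit()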