Python: how to know if file is finished uploading into hdfs - python

So I have 2 scripts:
script1 uploads a file to HDFS
script2 accesses the folder and reads the files every n seconds
My upload script looks like this:
from hdfs import InsecureClient
from requests import Session
from requests.auth import HTTPBasicAuth

# Authenticated session for the WebHDFS endpoint
session = Session()
session.auth = HTTPBasicAuth('hadoop', 'password')
client_hdfs = InsecureClient('http://hadoop.domain.com:50070', user='hadoop', session=session)

# hdfsPath and filePath are set elsewhere in the script
client_hdfs.upload(hdfsPath, filePath, overwrite=True)
When I read https://martin.atlassian.net/wiki/spaces/lestermartin/blog/2019/03/21/1172373509/are+partially-written+hdfs+files+accessible+not+exactly+but+much+more+yes+than+I+previously+thought
or the Stack Overflow question "Accessing a file that is being written",
it seems that when I upload using the hadoop dfs -put command (or -copyFromLocal or -cp), it creates [filename].COPYING while the file is not finished yet. But with the Python script it seems the file is created under its final name and its size just keeps increasing until the upload completes (so we can download it before it is complete and get a corrupted file).
I want to ask if there is a way to upload the file using Python so that we know whether the file has finished uploading or not.
I actually have another work-around: upload the files into a temporary folder and move them to the correct folder once everything is finished (I am still trying to do this), but any other ideas for this would be appreciated.

You can use the same strategy as HDFS itself:
create [filename].COPYING
when the data is uploaded, rename it to [filename]
I feel like you suggested the same thing with a temp folder instead of a name change, but it amounts to the same idea. Just so you know, renaming a file in HDFS is extremely cheap and fast, so by all means this is a good strategy.
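As a rough sketch of that rename strategy with the same hdfs client as in the question (the .COPYING suffix is just a convention; I'm assuming the library's rename() method here, and that the reader script skips names ending in .COPYING):

temp_path = hdfsPath + '.COPYING'
client_hdfs.upload(temp_path, filePath, overwrite=True)
# rename is a cheap metadata operation on the NameNode; note it can fail if the
# final name already exists, so delete the old file first when overwriting.
client_hdfs.rename(temp_path, hdfsPath)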

Related

Python - How to download a file with given name from online repository (given its URL)

I currently have a main Python script which works by analyzing a given csv file present in its own local working folder. With the aim of automating the analysis of more than one csv file, I'm currently trying to build another script which performs the following tasks:
Download into the local working folder a csv file, identified by its name among the many in an online repository (a OneDrive folder), for which I have the corresponding URL (for the OneDrive folder, not directly for the file).
Run the main script to analyze it.
Remove the analyzed csv file from the local folder and repeat the process.
I'm having some issues with the identification and download of the csv files.
I've seen some approaches using the 'requests' module, but they were more about directly downloading a file from a given URL, not finding it inside an online repository and fetching it from there. For this reason I'm not even sure how to start here.
What I'm looking for is something like:
url = 'https://1drv.ms/xxxxxxxxx'
file_name = 'title.csv'
# -> Download(link = url, file = file_name)
Thanks in advance to anyone who'll take some time to read this! :)
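For the download step itself, a minimal sketch with the requests module could look like the following; note that the URL here is assumed to be a direct download link, and picking a file by name out of a shared OneDrive folder would additionally need the OneDrive/Microsoft Graph API, which isn't shown:

import requests

def download(link, file):
    # Stream the response to disk so a large csv doesn't sit entirely in memory.
    with requests.get(link, stream=True) as r:
        r.raise_for_status()
        with open(file, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)

url = 'https://1drv.ms/xxxxxxxxx'   # placeholder from the question
file_name = 'title.csv'
download(link=url, file=file_name)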

How to update bigquery database updated by a scheduled script based on sync status of input csv files?

Short Explanation
Some csv files arrive on a OneDrive server which is synced onto a machine where a script runs to read them and push them to BigQuery. While the script runs fine now, I intend to run it only after all files added since the last push have synced completely (i.e. are available offline) on that machine...
Long Explanation
So basically we use a local database for our organization's sales history, which I want to push to BigQuery as well to reflect realtime (lagged) info on dashboards and for other analyses, since a lot of other data besides sales history resides there. The database is strictly on-premises and cannot be accessed outside the organization's network (so literally no way to link it to BigQuery!), so I have some people there who periodically (every 1-2 hrs) export day-to-time sales (sales from the start of the day till the time of export) from the database and upload them to OneDrive. I have OneDrive synced on a machine where many other scripts are hosted (it's just convenient!) and I run a Python script there to read all the csvs, combine them and push them to BigQuery. Often there are duplicates, so it is necessary to read all the files, remove duplicates and then push the result to BigQuery, for which I use:
import os
import pandas as pd
from google.oauth2 import service_account

files = [file for file in os.listdir(input_directory) if file.count('-') <= 1]
data = [pd.read_excel(input_directory + file) for file in files if file.endswith('.xlsx')]
all_data = pd.concat(data, ignore_index=True).drop_duplicates()

def upload():
    all_data.to_gbq(project_id=project_id,
                    destination_table=table,
                    credentials=service_account.Credentials.from_service_account_file(
                        'credentials.json'),
                    progress_bar=True,
                    if_exists='replace')
What I am trying to do is only update the BigQuery table when there are new changes at the time the script runs, since the people exporting don't always have time to do it.
My current approach is to write the length of the dataframe to a file at the end of the script:
with open("length.txt", "w") as f:
    f.write(str(len(all_data)))  # write() needs a string, so convert the length
and once all files are read into the dataframe, I use:
if len(all_data) > int(open("length.txt", "r").readlines()[0]):
    upload()
But doing this requires all the files to be read into RAM, and reading so many files actually makes things a bit congested for the other scripts on the machine (RAM-wise). So I would rather not read them all into RAM, as my current approach does.
I also tried accessing file attributes and building logic based on the date modified, but as soon as a new file is added, the date changes even when the file is not fully downloaded on the machine. I also searched for ways to access the sync status of files and came across Determine OneDrive Sync Status From Batch File, but that did not help. Any help improving this situation is appreciated!
We have similar workflows to this, where we regularly load data from files into a database by script. For us, once a file has been processed, we move it to a different directory as part of the Python script. This way, the Python script can load all data from all files in the directory, since it is definitely new data.
If the files are cumulative (contain old data as well as new data) and therefore you only want to load the rows that are new, this is where it gets tricky. You are definitely on the right track, as we use the modified date to ascertain whether a file has changed since we last processed it. In Python you can get this from the os library: os.path.getmtime(file_path).
This should give you the last date/time the file was changed in any way, on any operating system.
I recommend just moving the files out of your folder of new files once they are loaded, to make it easier for your Python script to handle. I do not know much about OneDrive though, so I cannot help with that aspect.
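As a rough sketch of the move-after-processing idea (the directory names are placeholders I've assumed, not anything from the question):

import os
import shutil

input_directory = 'incoming/'        # synced OneDrive folder
processed_directory = 'processed/'   # anything in here has already been pushed

new_files = [f for f in os.listdir(input_directory) if f.endswith('.xlsx')]

if new_files:
    # ... read only new_files, drop duplicates and push to BigQuery here ...
    for f in new_files:
        # Moving the file is how the script remembers it was processed, so the
        # next run only sees files that arrived since then.
        shutil.move(os.path.join(input_directory, f),
                    os.path.join(processed_directory, f))

If the files are cumulative instead, compare os.path.getmtime() on each file against a timestamp recorded at the end of the previous run, and only re-read the ones that changed.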
Good luck!

Best approach to handle large files on django website

Good morning all.
I have a generic question about the best approach to handle large files with Django.
I created a python project where the user is able to read a binary file (usually the size is between 30-100MB). Once the file is read, the program processes the file and shows relevant metrics to the user. Basically it outputs the max, min, average, std of the data.
At the moment, you can only run this project from the command line. I'm trying to create a user interface so that anyone can use it, so I decided to create a webpage using Django. The page is very simple: the user uploads files, then selects which file to process, and the page shows the metrics.
Working on my local machine I was able to implement it: I upload the files (they get saved on the user's machine and then processed). I then created an S3 account, and now the files are all uploaded to S3. The problem I'm having is that when I try to get the file (I'm using smart_open (https://pypi.org/project/smart-open/)) it is really slow to read (for a 30 MB file it's taking 300 sec), but if I download the file and read it, it only takes 8 sec.
My question is: what is the best approach to retrieve files from S3 and process them? I'm thinking of simply downloading the file to my server, processing it, and then deleting it. I've tried this on my localhost and it's fast: downloading from S3 takes 5 sec and processing takes 4 sec.
Would this be a good approach? I'm a bit afraid that, for instance, if I have 10 users at the same time and each one creates a report, then I'll need 10*30 MB = 300 MB of space on the server. Is this practical, or will I fill up the server?
Thank you for your time!
Edit
To give a bit more context, what's making it slow is the f.read() call. Due to the format of the binary file, I have to read it in the following way:
name = f.read(30)
unit = f.read(5)
data_length = f.read(2)
data = f.read(data_length)  # <- this read is what takes so long directly from S3; if I download the file first, it is super fast
All,
After some experimenting, I found a solution that works for me.
import os
import boto3

s3 = boto3.client('s3')
with open('temp_file_name', 'wb') as data:
    s3.download_fileobj(Bucket='YOURBUCKETNAME', Key='YOURKEY', Fileobj=data)
read_file('temp_file_name')  # read_file() is my own processing function
os.remove('temp_file_name')
I don't know if this is the best approach or what the possible downsides of it are. I'll use it and come back to this post if I end up using a different solution.
The problem with my previous approach was that f.read() was taking too long; it seems that every time I need to read another chunk, the program has to go back to S3 (or something), and that is what takes so long. What ended up working for me was to download the file directly to my server, read it, and then delete it once I'm done. Using this solution I got the same speeds as when working on a local server (reading directly from my laptop).
If you are working with medium-sized files (30-50 MB), this approach seems to work. My only concern is whether the server will run out of disk space if we try to download a really large file.
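One small refinement, since several users might generate reports at the same time, could be a uniquely named temporary file so concurrent downloads don't overwrite each other; a sketch, assuming the same boto3 client and read_file() function as above:

import os
import tempfile
import boto3

s3 = boto3.client('s3')

# NamedTemporaryFile gives each request its own file name; delete=False lets
# us reopen the file by path after the with-block.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    s3.download_fileobj(Bucket='YOURBUCKETNAME', Key='YOURKEY', Fileobj=tmp)
    temp_path = tmp.name

try:
    read_file(temp_path)
finally:
    os.remove(temp_path)   # clean up even if processing fails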

How to load large xml dataset file in python?

Hi, I am working on a data analysis project in Python where I have an XML file of around 2.8 GB, which is too large to open. I downloaded EmEditor, which helped me open the file. The problem is that when I try to load the file in Python on Google Colaboratory like this:
import xml.etree.ElementTree as ET
tree = ET.parse('dataset.xml')  # dataset.xml is the name of my file
root = tree.getroot()
I get the error No such file or directory: 'dataset.xml'. I have my dataset.xml file on my desktop, and it can be opened using EmEditor, which gives me the idea that it can be edited and loaded via EmEditor, but I don't know. I would appreciate your help with loading the data into Python on Google Colab.
Google Colab runs remotely on a computer from Google, and can't access files that are on your desktop.
To open the file in Python, you'll first need to transfer the file to your Colab instance. There are multiple ways to do this, and you can find them here: https://colab.research.google.com/notebooks/io.ipynb
The easiest is probably this:
from google.colab import files

uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
Although keep in mind that every time you start a new Colab session you'll need to re-upload the file. This is because Google would like to use the computer for someone else when you are not using it, and thus wipes all the data on it.
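Given that the file is around 2.8 GB, re-uploading it every session will be slow. The io notebook linked above also covers mounting your Google Drive, which avoids that; a sketch, assuming dataset.xml has been copied into the top level of your Drive:

from google.colab import drive
import xml.etree.ElementTree as ET

# Mounts your Google Drive under /content/drive (you'll be asked to authorize).
drive.mount('/content/drive')

tree = ET.parse('/content/drive/My Drive/dataset.xml')
root = tree.getroot()

For a file this size, ET.iterparse() is also worth a look, since ET.parse() keeps the whole tree in RAM.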

xlsxwriter on an EC2 instance

I am running a Python script in which I write Excel files to put onto my EC2 instance. However, I have noticed that these Excel files, although they are created, only appear on the server once the code stops.
I guess they are kept in cache but I would like them to be added to the server straight away. Is there a "commit()" to add to the code?
Many thanks
I guess they are kept in cache but I would like them to be added to the server straight away. Is there a "commit()" to add to the code?
No. It isn't possible to stream or write a partial xlsx file the way you can with a CSV or HTML file, since the file format is a collection of XML files in a zip container and it can't be generated until the file is closed.
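So rather than a commit(), the closest equivalent is to close each workbook as soon as you have finished writing it; the file then exists on the server immediately instead of only when the script exits. A minimal sketch, assuming the files are written with the xlsxwriter library (names are placeholders):

import xlsxwriter

# Each workbook is written and closed independently, so report1.xlsx is a
# complete, valid file on disk before work on report2.xlsx even starts.
for name in ['report1.xlsx', 'report2.xlsx']:
    workbook = xlsxwriter.Workbook(name)
    worksheet = workbook.add_worksheet()
    worksheet.write(0, 0, 'hello')
    workbook.close()  # the xlsx is only generated and flushed to disk here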
