Unzip a file and save its content into a database - python

I am building a website using Django where the user can upload a .zip file. I do not know how many subfolders the file has or which types of files it contains.
I want to:
1) Unzip the file
2) Get all the files in the unzipped directory (which might contain nested subfolders)
3) Save these files (the content, not the path) into the database.
I managed to unzip the file and to output the files path.
However, this is not exactly what I want, because I do not care about the file paths but about the files themselves.
In addition, since I am saving the unzipped files into media/documents, if different users upload different zip files and all of them are unzipped, the media/documents folder would become huge and it would be impossible to know who uploaded what.
Unzipping the .zip file
myFile = request.FILES.get('my_uploads')
with ZipFile(myFile, 'r') as zipObj:
    zipObj.extractall('media/documents/')
Getting path of file in subfolders
x = [i[2] for i in os.walk('media/documents/')]
file_names = []
for t in x:
    for f in t:
        file_names.append(f)
views.py # It is not perfect, it is just an idea. I am just debugging.
def homeupload(request):
    if request.method == "POST":
        my_entity = Uploading()
        # my_entity.my_uploads = request.FILES["my_uploads"]
        myFile = request.FILES.get('my_uploads')
        with ZipFile(myFile, 'r') as zipObj:
            zipObj.extractall('media/documents/')
        x = [i[2] for i in os.walk('media/documents/')]
        file_names = []
        for t in x:
            for f in t:
                file_names.append(f)
        my_entity.save()

You really don't have to clutter up your filesystem when using a ZipFile: it has methods that let you read the files stored in the zip directly into memory, and you can then save those objects to a database.
Specifically, we can use .infolist() or .namelist() to get a list of all the files in the zip, and .read() to actually get their contents:
with ZipFile(myFile, 'r') as zipObj:
    file_objects = [zipObj.read(item) for item in zipObj.namelist()]
Now file_objects is a list of bytes objects holding the content of all the files. I didn't bother saving the names or file paths because you said that was unnecessary, but that can be done too. To see what else is available, check out what actually gets returned from infolist.
If you want to save these bytes objects to your database, that is usually possible if your database supports it (most do). If, however, you want these files as plain text rather than bytes, you just have to convert them first, for example with .decode:
with ZipFile(myFile, 'r') as zipObj:
    file_objects = [zipObj.read(item).decode() for item in zipObj.namelist()]
Notice that we didn't save any files to the filesystem at any point, so there is no need to worry about user-uploaded files cluttering up your system. However, the database storage size on your disk will grow accordingly.
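If you do want to persist those bytes, here is a minimal sketch of a Django model that could hold them. The class and field names are assumptions for illustration, not taken from your code; the BinaryField stores the raw bytes of one extracted file, and the owner field addresses the "who uploaded what" concern:
# models.py -- hypothetical model; names are assumptions, not from the question
from django.db import models

class UploadedDocument(models.Model):
    owner = models.ForeignKey('auth.User', on_delete=models.CASCADE)  # who uploaded it
    name = models.CharField(max_length=255, blank=True)               # original member name, optional
    content = models.BinaryField()                                     # raw bytes of one file from the zip

# In the view, with zipObj open as above, something like:
# for item in zipObj.namelist():
#     UploadedDocument.objects.create(owner=request.user, name=item, content=zipObj.read(item))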

Related

AWS Lambda - Combine multiple CSV files from S3 into one file

I am trying to understand and learn how to combine all the files from a specific bucket into one CSV file. The files are like logs, are always in the same format, and are kept in the same bucket. I have this code to access and read them:
bucket = s3_resource.Bucket(bucket_name)
for obj in bucket.objects.all():
    x = obj.get()['Body'].read().decode('utf-8')
    print(x)
It does print them, with separation between the individual files and their column headers.
My question is: how can I modify my loop to get them into just one CSV file?
You should create a file in /tmp/ and write the contents of each object into that file.
Then, when all files have been read, upload the file (or do whatever you want to do with it).
output = open('/tmp/outfile.txt', 'w')
bucket = s3_resource.Bucket(bucket_name)
for obj in bucket.objects.all():
    output.write(obj.get()['Body'].read().decode('utf-8'))
output.close()
Please note that there is a limit of 512MB in the /tmp/ directory.
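Putting it together, a sketch of the full flow might look like this, assuming you also want to push the combined file back to the same bucket (the output key combined/combined.csv is just an example):
import boto3

s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket(bucket_name)

# Write every object's body into one local file under /tmp/
with open('/tmp/combined.csv', 'w') as output:
    for obj in bucket.objects.all():
        output.write(obj.get()['Body'].read().decode('utf-8'))

# Upload the combined file back to S3
bucket.upload_file('/tmp/combined.csv', 'combined/combined.csv')
Using a with block also closes the file for you, which avoids the easy-to-miss close() call above.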

In Python, how do I create a list of files based on specified file extensions?

Let's say I have a folder with a bunch of files (with different file extensions). I want to create a list of files from this folder. However, I want to create a list of files with SPECIFIC file extensions.
These file extensions are categorized into groups.
File Extensions: .jpg, .png, .gif, .pdf, .raw, .docx, .pptx, .xlsx, .js, .html, .css
Group "image" contains .jpg, .png, .gif.
Group "adobe" contains .pdf, .raw. (yes, I'm listing '.raw' as an adobe file for this example :P)
Group "microsoft" contains .docx, .pptx, .xlsx.
Group "webdev" contains .js, .html, .css.
I want to be able to add these file types to a list. That list will be written to a ".txt" file and will contain ALL files with the chosen file extensions.
So if my folder has 5 image files, 10 adobe files, 5 microsoft files, 3 webdev files and I select the "image" and "microsoft" groups, this application in Python would create a .txt file that contains a list of filenames with file extensions that belong only in the image and microsoft groups.
The text file would have a list like below:
picture1.jpg
picture2.png
picture3.gif
picture4.jpg
picture5.jpg
powerpoint.pptx
powerpoint2.pptx
spreadsheet.xlsx
worddocument.docx
worddocument2.docx
As of right now, my code creates a text file containing a list of ALL files in a specified folder.
I could use an "if" statement to match specific file extensions, but I don't think this achieves the result I want. In that case, I would have to create a function for each group (i.e. one each for the image, adobe, microsoft, and webdev groups). I want to be able to combine these groups freely (e.g. image and microsoft files in one list).
Example of an "if" statement:
for elem in os.listdir(filepath):
    if elem.endswith('.jpg'):
        listItem = elem + '\n'
        listName = filepath + (r"\{}List.txt".format(name))
        writeFile = open(listName, 'a')
        writeFile.write(listItem)
        writeFile.close()
    if elem.endswith('.png'):
        listItem = elem + '\n'
        listName = filepath + (r"\{}List.txt".format(name))
        writeFile = open(listName, 'a')
        writeFile.write(listItem)
        writeFile.close()
    if elem.endswith('.gif'):
        listItem = elem + '\n'
        listName = filepath + (r"\{}List.txt".format(name))
        writeFile = open(listName, 'a')
        writeFile.write(listItem)
        writeFile.close()
    else:
        continue
Full code without the "if" statement (generates a .txt file with all filenames from a specified folder):
import os

def enterFilePath():
    global filepath
    filepath = input("Please enter your file path. ")
    os.chdir(filepath)

enterFilePath()

def enterFileName():
    global name
    global listName
    name = str(input("Name the txt file. "))
    listName = name + ".txt"

enterFileName()

def listGenerator():
    for filename in os.listdir(filepath):
        listItem = filename + ' \n'
        listName = filepath + (r"\{}List.txt".format(name))
        writeFile = open(listName, 'a')
        writeFile.write(listItem)
        writeFile.close()

listGenerator()
A pointer before getting into the answer: avoid using global in favor of function parameters and return values. It will make debugging significantly less of a headache and make it easier to follow the flow of data through your program.
nostradamus is correct in his comment: a dict is the ideal way to solve your problem here. I've solved similar problems before using itertools.chain.from_iterable and pathlib.Path, which I'll be using here.
First, the dict:
groups = {
    'image': {'jpg', 'png', 'gif'},
    'adobe': {'pdf', 'raw'},
    'microsoft': {'docx', 'pptx', 'xlsx'},
    'webdev': {'js', 'html', 'css'}
}
This sets up your extension groups that you mentioned, which you can then access easily with groups['image'], groups['adobe'], etc.
Then, using the Path.glob method, itertools.chain.from_iterable, and a comprehension, you can get your list of desired files in a single statement (or function).
from itertools import chain
from pathlib import Path

# Set up some variables
target_groups = ['adobe', 'webdev']

# Initialize generator
files = chain.from_iterable(
    # Glob pattern for the current extension
    Path(filepath).glob(f'*.{ext}')
    # Each group in target_groups
    for group in target_groups
    # Each extension in current group
    for ext in groups[group]
)

# Then, just iterate the files
for fpath in files:
    # Do stuff with file here
    print(fpath.name)
My test directory has one file of each extension you listed, named a, b, etc for each group. Using the above code, my output is:
a.pdf
b.raw
a.js
b.html
c.css
The way the file list/generator is set up means that the list of files will be sorted by extension-group, then by extension, and then by name. If you want to change what groups are being listed, just add/remove values in the target_groups list above (works with a single option as well).
You'll also want to consider parameterizing your targets, such as through input or script arguments, as well as handling cases where a requested group doesn't exist in the groups dictionary. The code above would probably also be more useful as a function, but I'll leave that implementation up to you :)
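To tie this back to the original goal of producing a .txt listing, the same generator can feed a plain write loop. This is a sketch under the same assumptions as above (groups and filepath defined as shown), with the output file name chosen arbitrarily:
from itertools import chain
from pathlib import Path

def write_file_list(filepath, target_groups, out_name='FileList.txt'):
    """Write the names of files matching the chosen groups to a text file."""
    files = chain.from_iterable(
        Path(filepath).glob(f'*.{ext}')
        for group in target_groups
        for ext in groups[group]
    )
    out_path = Path(filepath) / out_name
    with out_path.open('w') as f:
        for fpath in files:
            f.write(fpath.name + '\n')

# Example: list only image and microsoft files
# write_file_list(filepath, ['image', 'microsoft'])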

How to append multiple files into one in Amazon's s3 using Python and boto3?

I have a bucket in Amazon's S3 called test-bucket. Within this bucket, json files look like this:
test-bucket
└── continent
    └── country
        └── <filename>.json
Essentially, the object keys look like continent/country/<filename>.json. Within each country, there are about 100k files, each containing a single dictionary, like this:
{"data":"more data", "even more data":"more data", "other data":"other other data"}
Different files have different lengths. What I need to do is compile all these files together into a single file, then re-upload that file into s3. The easy solution would be to download all the files with boto3, read them into Python, then append them using this script:
import json

def append_to_file(data, filename):
    with open(filename, "a") as f:
        json.dump(data, f)
        f.write("\n")
However, I do not know all the filenames (the names are a timestamp). How can I read all the files in a folder, e.g. Asia/China/*, then append them to a file, with the filename being the country?
Optimally, I don't want to have to download all the files into local storage. If I could load these files into memory that would be great.
EDIT: to make things more clear. Files on s3 aren't stored in folders, the file path is just set up to look like a folder. All files are stored under test-bucket.
The answer to this is fairly simple. You can list all the files in the bucket, using a filter to restrict the listing to a "subdirectory" via the prefix. If you have a list of the continents and countries in advance, you can reduce the list returned. The returned keys include the prefix, so you can filter the list of object names down to the ones you want.
s3 = boto3.resource('s3')
bucket_obj = s3.Bucket(bucketname)
all_s3keys = list(obj.key for obj in bucket_obj.objects.filter(Prefix=job_prefix))

if file_pat:
    filtered_s3keys = [key for key in all_s3keys if bool(re.search(file_pat, key))]
else:
    filtered_s3keys = all_s3keys
The code above returns all the files whose keys start with the prefix provided, including their full prefix. So if you provide prefix='Asia/China/', it will give you a list of only the files under that prefix. In some cases, I take a second step and filter the file names within that 'subdirectory' before I use the full key to access the files.
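As a concrete, hypothetical example of that prefix filtering, listing only the files for one country could look like this (the prefix value is made up for illustration):
job_prefix = 'Asia/China/'
china_keys = [obj.key for obj in bucket_obj.objects.filter(Prefix=job_prefix)]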
The second step is to download all the files:
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    executor.map(lambda s3key: bucket_obj.download_file(s3key, local_filepath, Config=CUSTOM_CONFIG),
                 filtered_s3keys)
For simplicity, I skipped showing that the code generates a distinct local_filepath for each file downloaded, so each object ends up as the file you actually want, where you want it.
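Since the question asked to avoid local storage where possible, an alternative sketch (my own variation, not part of the answer above) reads each object's body straight into memory and puts the combined result back to S3, with the country in the output key. Note that with ~100k files per country everything is held in RAM, so this only works if the combined size fits in memory:
import boto3

s3 = boto3.resource('s3')
bucket_obj = s3.Bucket('test-bucket')

prefix = 'Asia/China/'  # one country "folder"
parts = []

# Read every object under the prefix directly into memory
for obj in bucket_obj.objects.filter(Prefix=prefix):
    parts.append(obj.get()['Body'].read().decode('utf-8'))

# Upload the combined content as a single object named after the country
bucket_obj.put_object(Key='combined/China.json', Body='\n'.join(parts))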

Why are my `binaryFiles` empty when I collect them in pyspark?

I have two zip files on HDFS in the same folder: /user/path-to-folder-with-zips/.
I pass that to binaryFiles in pyspark:
zips = sc.binaryFiles('/user/path-to-folder-with-zips/')
I'm trying to unzip the zip files and do things to the text files in them, so I tried to just see what the content will be when I try to deal with the RDD. I did it like this:
zips_collected = zips.collect()
But, when I do that, it gives an empty list:
>> zips_collected
[]
I know that the zips are not empty - they contain text files. The documentation here says
Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
What am I doing wrong here? I know I can't view the contents of the file because it is zipped and therefore binary. But, I should at least be able to see SOMETHING. Why does it not return anything?
There can be more than one file per zip file, but the contents are always something like this:
rownum|data|data|data|data|data
rownum|data|data|data|data|data
rownum|data|data|data|data|data
I'm assuming that each zip file contains a single text file (code is easily changed for multiple text files). You need to read the contents of the zip file first via io.BytesIO before processing line by line. Solution is loosely based on https://stackoverflow.com/a/36511190/234233.
import io
import gzip

def zip_extract(x):
    """Extract a *.gz file in memory for Spark"""
    file_obj = gzip.GzipFile(fileobj=io.BytesIO(x[1]), mode="r")
    # decode the bytes so the split("\n") below works on strings
    return file_obj.read().decode("utf-8")

zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')
results = zip_data.map(zip_extract) \
    .flatMap(lambda zip_file: zip_file.split("\n")) \
    .map(lambda line: parse_line(line)) \
    .collect()
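One caveat: gzip and zip are different formats, so GzipFile will not open a real .zip archive. If the files genuinely are .zip archives, a variant of the same idea using the zipfile module might look like this (a sketch, assuming each archive can hold several text files and that parse_line is defined elsewhere, as in the snippet above):
import io
import zipfile

def zip_extract_all(x):
    """Return the decoded text of every member of one zip archive."""
    with zipfile.ZipFile(io.BytesIO(x[1]), 'r') as zf:
        return [zf.read(name).decode('utf-8') for name in zf.namelist()]

zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')
results = zip_data.flatMap(zip_extract_all) \
    .flatMap(lambda text: text.split('\n')) \
    .map(lambda line: parse_line(line)) \
    .collect()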

MetaData of downloaded zipped file

url='http://www.test.com/test.zip'
z = zipfile.ZipFile(BytesIO(urllib.urlopen(url).read()))
z.extractall(path='D:')
I wrote the code above to download a zipped file from a URL and extract all its files to a specified drive, and it is working fine.
Is there a way I can get the metadata of all the files extracted from z, for example
filenames, file sizes, and file extensions?
ZipFile objects actually have built-in tools for this that you can use without even extracting anything. infolist returns a list of ZipInfo objects that you can read certain information out of, including the full file name and uncompressed size.
import os

url = 'http://www.test.com/test.zip'
z = zipfile.ZipFile(BytesIO(urllib.urlopen(url).read()))
info = z.infolist()

data = []
for obj in info:
    name = os.path.splitext(obj.filename)
    data.append((name[0], name[1], obj.file_size))
I also used os.path.splitext to separate out the file's name from its extension, since you asked for the file type separately from the name.
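If you want more metadata, ZipInfo objects expose a few other attributes as well, for example the compressed size and the last-modified timestamp; extending the loop above (same variables as in that snippet):
for obj in z.infolist():
    # (name, extension, uncompressed size, compressed size, (year, month, day, h, m, s))
    name = os.path.splitext(obj.filename)
    data.append((name[0], name[1], obj.file_size, obj.compress_size, obj.date_time))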
I don't know of a built-in way to do that using the zipfile module; however, it is easily done using os.path:
import os

EXTRACT_PATH = "D:"

z = zipfile.ZipFile(BytesIO(urllib.urlopen(url).read()))
z.extractall(path=EXTRACT_PATH)

extracted_files = [os.path.join(EXTRACT_PATH, filename) for filename in z.namelist()]
for extracted_file in extracted_files:
    # All metadata operations here, such as:
    print os.path.getsize(extracted_file)
