can't zip files from Jupyter Notebook - python

I am trying to zip files, using the example from https://thispointer.com/python-how-to-create-a-zip-archive-from-multiple-files-or-directory/:
with ZipFile('sample2.zip', 'w') as zipObj2:
    # Add multiple files to the zip
    zipObj2.write('sample_file.csv')
sample2.zip is created, but it is empty. The csv file, of course, exists and is not empty.
I run this code from a Jupyter Notebook.
Edit: I'm using relative paths -
input_dir = "../data/example/"
with zipfile.ZipFile(os.path.join(input_dir, 'f.zip'), 'a') as zipObj2:
    zipObj2.write(os.path.join(input_dir, 'f.tif'))

Did you try closing the zip file to save it?
from zipfile import ZipFile

with ZipFile('sample2.zip', 'w') as zipObj2:
    zipObj2.write('sample_file.csv')
# Note: the with statement already closes the archive on exit,
# so this explicit close() is redundant here.
zipObj2.close()
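For what it's worth, a quick way to confirm whether the archive really ended up empty is to list its contents right after writing; a minimal sketch, assuming sample2.zip and sample_file.csv sit in the notebook's working directory:
import os
from zipfile import ZipFile

print(os.getcwd())  # check which directory the notebook is actually writing into

with ZipFile('sample2.zip', 'w') as zipObj2:
    zipObj2.write('sample_file.csv')

with ZipFile('sample2.zip') as zipObj2:
    print(zipObj2.namelist())  # should print ['sample_file.csv']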

I'm a little confused by your question, but if I'm correct it sounds like you're trying to place multiple CSV files within a single zipped file? If so, this is what you're looking for:
import os
import pandas as pd

# initiate files variable that contains the directory from which you wish to zip csv files
files = [f for f in os.listdir("./your_directory") if f.endswith('.csv')]

# initialize empty DataFrame
all_data = pd.DataFrame()

# iterate through the files variable and concatenate them to all_data
for file in files:
    df = pd.read_csv('./your_directory/' + file)
    all_data = pd.concat([all_data, df])
Then inspect your new DataFrame (all_data) to verify that the contents were transferred.
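If, on the other hand, the goal really is to put several CSV files into one .zip archive, here is a minimal sketch using only the standard library (the archive name csv_bundle.zip is just a placeholder, and the files are assumed to live in your_directory):
import os
from zipfile import ZipFile

csv_files = [f for f in os.listdir('./your_directory') if f.endswith('.csv')]

with ZipFile('csv_bundle.zip', 'w') as archive:
    for name in csv_files:
        # arcname stores just the file name, not the folder path, inside the archive
        archive.write(os.path.join('./your_directory', name), arcname=name)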

Related

How do I loop through a file path in glob.glob to create multiple files at once?

I have 10 different folder paths that I run this code through. Instead of changing them manually, I am trying to create a function that loops through them, changing the file path each time, to save time. Also, can you show me a way to stop the code from duplicating rows when it is run more than once? For example, if I run this code once, it creates one combined file from the files in the folder path. If I run it twice (by accident), it duplicates the rows in the csv: the csv has 100 rows after running the code once, and 200 rows after running it twice, with every row duplicated. I am trying to write the code so that it overwrites the previous file and does not create duplications, because I store this on a server.
So I have 10 of these code blocks written out, each pointing to a separate file location. Instead of running them separately, I want to loop through this code and create multiple files at once.
# Change File Path to personal directory folder
os.chdir("C:/Users/File.csv")
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

# Using Pandas to combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

# export to csv
combined_csv.to_csv("File.csv", index=False, encoding='utf-8')
You should ignore File.csv when processing the list, so you don't append it to itself.
import os
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames if os.path.basename(f) != 'File.csv' ])
Put your code in a function to make it easier to reuse and more readable.
import os
import glob
import pandas as pd

def combine_csvs(path, output="File.csv"):
    os.chdir(path)
    combined = pd.concat([pd.read_csv(f) for f in glob.glob("*.csv") if f != output])
    combined.to_csv(output, index=False)
Then the loop is simply:
for path in my_paths:
    combine_csvs(path)
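For completeness, a hedged sketch of what my_paths might look like (the folder names below are placeholders, not the asker's real directories); since to_csv opens the output in write mode by default, File.csv in each folder is replaced rather than appended to on every run:
my_paths = [
    "C:/Users/reports/folder_01",  # hypothetical paths; substitute the 10 real folders
    "C:/Users/reports/folder_02",
]

for path in my_paths:
    combine_csvs(path)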

Python script to convert multiple txt files into single one

I'm quite new to python and encountered a problem: I want to write a script that starts in a base directory containing several folders, which all have the same subdirectory structure and are numbered with a control variable (scan00, scan01, ...).
I read out the names of the folders in the directory and store them in a variable called foldernames.
Then the script should go into a subdirectory of each of these folders, where multiple txt files are stored. I store them in a variable called "myFiles".
These txt files consist of 3 columns of float values separated by tabs, and each txt file has 3371 rows (they are all identical in terms of rows and columns).
Now my issue: I want the script to copy only the third column of each txt file and put it into a new txt or csv file. The only exception is the first txt file; there, it is important that all three columns are copied to the new file.
For the other files, the third column should be copied into an adjacent column of the new txt/csv file.
So I would like to end up with x columns in the generated txt/csv file, where x is the number of original txt files. If possible, I would like to write the corresponding file names in the first line of the new txt/csv file (here defined as column_names).
At the end, each folder should contain a txt/csv file which combines all of the individual (297) txt files.
import os
import glob

foldernames1 = []
for foldernames in os.listdir("W:/certaindirectory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)

for i in range(1, len(foldernames1)):
    workingpath = "W:/certaindirectory/"+foldernames1[i]+"/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X','Y']+myFiles[1:len(myFiles)]
    files = [open(f) for f in glob.glob('*.txt')]
    fout = open("ResultsCombined.txt", 'w')
    for row in range(1, 3371):  # len(files)):
        for f in files:
            fout.write(f.readline().strip().split('\t')[2])
            fout.write('\t')
        fout.write('\t')
    fout.close()
As an alternative I also tried to do it via the csv module, but I wasn't able to fix my problem:
import os
import glob
import csv

foldernames1 = []
for foldernames in os.listdir("W:/certain directory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)

for i in range(1, len(foldernames1)):
    workingpath = "W:/certain directory/"+foldernames1[i]+"/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X','Y']+myFiles[0:len(myFiles)]
    # print(column_names)
    with open(""+foldernames1[i]+".csv", 'w', newline='') as target:
        writer = csv.DictWriter(target, fieldnames=column_names)
        writer.writeheader()  # if you want a header
        for path in glob.glob('*.txt'):
            with open(path, newline='') as source:
                reader = csv.DictReader(source, delimiter='\t', fieldnames=column_names)
                writer.writerows(reader)
Can anyone help me? Both codes do not deliver what I want. They read out something, but not the values I am interested in. I also have the feeling my code has some issues with the float numbers.
Many thanks and best regards,
quester
pathlib and pandas should make the solution here relatively simple even without knowing the specific file names:
import pandas as pd
from pathlib import Path

p = Path("W:/certain directory/")
# recursively search for .txt files inside all sub directories
txt_files = [txt_file for txt_file in p.rglob("*.txt")]  # use p.glob("*.txt") instead for a non-recursive search

df = pd.DataFrame()
for path in txt_files:
    # use tab separator, read only the 3rd column, name the column after the file, read as floats
    current = pd.read_csv(path,
                          sep="\t",
                          usecols=[2],
                          names=[path.name],
                          dtype="float64")
    # add header=0 to pd.read_csv if there's a header row in the .txt files
    # concat returns a new frame, so assign the result back to df
    df = pd.concat([df, current], axis=1)

df.to_csv("W:/certain directory/floats_third_column.csv", index=False)
Hope this helps!

'EmptyDataError: No columns to parse from file' in Pandas when concatenating all files in a directory into single CSV

So I'm working on a project that analyzes Covid-19 data from this entire year. I have multiple csv files in a given directory, and I am trying to merge the contents from each month into a single, comprehensive csv file. Here's what I have so far, shown below. Specifically, the error message that appears is 'EmptyDataError: No columns to parse from file.' If I delete df = pd.read_csv('./csse_covid_19_daily_reports_us/' + file) and simply run print(file), it lists all the correct files that I am trying to merge. However, when trying to merge all the data into one file, I get that error message. What gives?
import pandas as pd
import os

df = pd.read_csv('./csse_covid_19_daily_reports_us/09-04-2020.csv')
files = [file for file in os.listdir('./csse_covid_19_daily_reports_us')]
all_data = pd.DataFrame()

for file in files:
    df = pd.read_csv('./csse_covid_19_daily_reports_us/' + file)
    all_data = pd.concat([all_data, df])

all_data.head()
Folks, I have resolved this issue. Instead of sifting through files with files = [file for file in os.listdir('./csse_covid_19_daily_reports_us')], I have instead used files=[f for f in os.listdir("./") if f.endswith('.csv')]. This filtered out some garbage files that were not .csv, thus allowing me to compile all data into a single csv.
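Folded back into the original loop, the fix might look something like this (a sketch, assuming the notebook runs from the directory that contains csse_covid_19_daily_reports_us; the output file name is a placeholder):
import os
import pandas as pd

data_dir = './csse_covid_19_daily_reports_us'
# keep only real csv files; this skips hidden/system files such as .DS_Store
files = [f for f in os.listdir(data_dir) if f.endswith('.csv')]

all_data = pd.concat(pd.read_csv(os.path.join(data_dir, f)) for f in files)
all_data.to_csv('combined_daily_reports.csv', index=False)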

Pandas: How to read xlsx files from a folder matching only specific names

I have a folder full of Excel files and I have to read only 3 files from that folder and put each of them into an individual dataframe.
File1: Asterix_New file_Jan2020.xlsx
File2: Asterix_Master file_Jan2020.xlsx
File3: Asterix_Mapping file_Jan2020.xlsx
I am aware of the syntax below, which finds xlsx files in a folder, but I'm not sure how to restrict it to specific keywords, in this case names starting with "Asterix_":
files_xlsx = [f for f in files if f[-4:] == "xlsx"]
Also I am trying to put each of the excel files into an individual dataframe, but am not getting anywhere:
for i in files_xlsx:
    df[i] = pd.read_excel(files_xlsx[0])
Any suggestions are appreciated.
I suggest using pathlib. If all the files are in a folder:
from pathlib import Path
from fnmatch import fnmatch
folder = Path('name of folder')
Search for the files using glob. I also suggest using fnmatch so that files whose extensions are in capital letters are included.
iterdir lets you iterate through the files in the folder
name is an attribute in pathlib that gives you the name of the file as a string
applying the str lower method ensures that uppercase extensions such as XLSX are also captured
excel_only_files = [xlsx for xlsx in folder.iterdir()
                    if fnmatch(xlsx.name.lower(), 'asterix_*.xlsx')]
OR
# you'll have to test this, I did not put it through any tests;
# note that [...] in a glob pattern matches a single character, so each letter is spelled out
excel_only_files = list(folder.rglob('Asterix_*.[xX][lL][sS][xX]'))
from there, you can run a list comprehension to read your files:
dataframes = [pd.read_excel(f) for f in excel_only_files]
Use glob.glob to do your pattern matches
import glob
for i in glob.glob('Asterix_*.xlsx'):
    ...
First generate a list of files you want to read in using glob (based on #cup's answer) and then append them to a list.
import pandas as pd
import glob
my_df_list = [pd.read_excel(f) for f in glob.iglob('Asterix_*.xlsx')]
Depending on what you want to achieve, you can also use a dict to allow for key-value pairs.
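For instance, a hedged sketch keying each DataFrame by its file name (Path is only used here to strip the extension):
import glob
from pathlib import Path
import pandas as pd

# e.g. {'Asterix_New file_Jan2020': <DataFrame>, ...}
dfs = {Path(f).stem: pd.read_excel(f) for f in glob.iglob('Asterix_*.xlsx')}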
At the end of the if statement you need to add another condition for files which also contain 'Asterix_':
files_xlsx = [f for f in files if f[-4:] == "xlsx" and "Asterix_" in f]
The f[-4:] == "xlsx" is to make sure the last 4 characters of the file name are xlsx and "Asterix_" in f makes sure that "Asterix_" exists anywhere in the file name.
To then read these using pandas try:
for file in files_xlsx:
    df = pd.read_excel(file)
    print(df)
That should print each DataFrame read from the Excel files.
If you have read in the file names, you can make sure that it starts with and ends with the desired strings by using this list comprehension:
files = ['filea.txt', 'fileb.xlsx', 'filec.xlsx', 'notme.txt']
files_xlsx = [f for f in files if f.startswith('file') and f.endswith('xlsx')]
files_xlsx # ['fileb.xlsx', 'filec.xlsx']
The list comprehension says, "Give me all the files that start with file AND end with xlsx."

Merge CSV files in ADLS2 that are prepared through DataBricks

While running Databricks code that prepares CSV files and loads them into ADLS2, the output is split into many CSV files as it is loaded into ADLS2.
Is there a way to merge these CSV files in ADLS2 through pyspark?
Thanks
Is there a way to merge these CSV files in ADLS2 thru pyspark.
As far as I know, a Spark dataframe does write the files separately. Theoretically, you could use the spark.read.csv method, which accepts a list of path strings as a parameter.
>>> df = spark.read.csv('path')
Then use the df.toPandas().to_csv() method to write everything out via a pandas dataframe. You could refer to some clues from this case: Azure Data-bricks : How to read part files and save it as one file to blob?
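A hedged sketch of that idea, assuming a Databricks notebook where the spark session already exists (the mount path and output name are placeholders):
# read all part files of the split output into one Spark dataframe
df = spark.read.csv('/mnt/test/output_folder/part-*.csv', header=True)

# collect to the driver and write a single csv; only viable if the data fits in memory
df.toPandas().to_csv('/dbfs/mnt/test/merged.csv', index=False)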
However, I'm afraid that this process could not cope with such high memory consumption. So I'd suggest just using the os package to do the merge job directly. I tested the 2 snippets of code below for your reference.
1st:
import os

path = '/dbfs/mnt/test/'
file_suffix = '.csv'
filtered_files = [file for file in os.listdir(path) if file.endswith(file_suffix)]
print(filtered_files)

with open(path + 'final.csv', 'w') as final_file:
    for file in filtered_files:
        with open(path + file) as f:
            lines = f.readlines()
            final_file.writelines(lines[1:])  # skip each file's header row
2nd:
import os

path = '/dbfs/mnt/test/'
file_suffix = '.csv'
filtered_files = [os.path.join(root, name) for root, dirs, files in os.walk(top=path, topdown=False) for name in files if name.endswith(file_suffix)]
print(filtered_files)

with open(path + 'final2.csv', 'w') as final_file:
    for file in filtered_files:
        with open(file) as f:
            lines = f.readlines()
            final_file.writelines(lines[1:])
The second one also copes with a nested folder hierarchy.
In addition, I'll provide a way here which uses an ADF copy activity to transfer multiple csv files into one file in ADLS gen2.
Please refer to this doc and configure the folder path in the ADLS gen2 source dataset. Then set MergeFiles as the copyBehavior property. (Besides, you could use a wildcard file name like *.csv so that files you don't want to touch in that folder are left out.)
Merges all files from the source folder to one file. If the file name
is specified, the merged file name is the specified name. Otherwise,
it's an autogenerated file name.
