I have written a script which extracts text from multiple CSVs. Can someone help me extend this script so it can read CSV data from different zipped files and create multiple CSVs (one for each zipped file) at a location?
For example, if I have 10 CSVs in zipped folder z1 and 5 in zipped folder z2, I want to extract the files from each zipped file and get the extracted files at one location. In this case the output would be z1.csv (with concatenated data from the 10 CSVs) and z2.csv (with concatenated data from the 5 CSVs).
I am using the following script:
import glob
import os
import pandas as pd

allfiles = glob.glob(os.path.join(input_fldr, "*.csv"))
a = []
b = []
for file_ in allfiles:
    dirname, filename = os.path.split(file_)
    f = open(file_, 'r', encoding='UTF-8')
    lines = f.readlines()
    f.close()
    for line in lines:
        if line.startswith('Hello'):
            a.append(filename)
            b.append(line)
df_a = pd.DataFrame(a, columns=list("A"))
df_b = pd.DataFrame(b, columns=list("B"))
df = pd.concat([df_a, df_b], axis=1)
The Code
The code I came up with, which does roughly what I believe you want, is this (all the files you need for this example are available here):
import zipfile
import pandas as pd

virtual_csvs = []
with zipfile.ZipFile("test3.zip", "r") as f:
    for name in f.namelist():
        if name.endswith(".csv"):
            data = f.open(name)
            virtual_csvs.append(pd.read_csv(data, header=None))
pd.concat(virtual_csvs, axis=1).to_csv('output.csv', header=False, index=False)
Code Breakdown
virtual_csvs = []
We start by creating a list that will store all of the pandas DataFrames, much like your list [df_a, df_b]
with zipfile.ZipFile("test3.zip", "r") as f:
This loads the zipfile "test3.zip" (replace with your own zipfile name) in read mode into the variable f
for name in f.namelist():
This iterates over every file name in the zipfile and loads it into the variable name
if name.endswith(".csv"):
This line is rather self-explanatory - if the file has an extension of .csv, run the following code.
data = f.open(name)
The f.open(name) command opens the file name inside the zip archive - roughly the equivalent of open(name, 'r') for a file on disk
virtual_csvs.append(pd.read_csv(data, header=None))
pd.read_csv(data, header=None) loads that file into a pandas DataFrame (header=None means the file has no header row, so the first line is read as data)
virtual_csvs.append adds that DataFrame to the virtual_csvs list
The final line of this code:
pd.concat(virtual_csvs, axis=1).to_csv('output.csv', header=False, index=False)
concatenates all of the csv files into one larger file ('output.csv').
pd.concat(virtual_csvs, axis=1) joins all the DataFrames in virtual_csvs by column (this returns a pd.DataFrame)
to_csv('output.csv', header=False, index=False) means to convert the given DataFrame to a csv file, named 'output.csv'.
header=False means no column headers are written to the output file
index=False omits the DataFrame's row index from the output
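To cover the multi-zip part of the question (one output CSV per zip file), a minimal sketch along these lines should work; input_fldr and output_fldr are assumed placeholder paths, and rows are stacked with axis=0 here - switch to axis=1 if you want the side-by-side layout used above.
import glob
import os
import zipfile
import pandas as pd

input_fldr = "zips"      # folder containing z1.zip, z2.zip, ... (assumed path)
output_fldr = "output"   # where z1.csv, z2.csv, ... will be written (assumed path)
os.makedirs(output_fldr, exist_ok=True)

for zip_path in glob.glob(os.path.join(input_fldr, "*.zip")):
    frames = []
    with zipfile.ZipFile(zip_path, "r") as zf:
        for name in zf.namelist():
            if name.endswith(".csv"):
                with zf.open(name) as data:
                    frames.append(pd.read_csv(data, header=None))
    if frames:
        # one output file per zip, named after the zip (z1.zip -> z1.csv)
        out_name = os.path.splitext(os.path.basename(zip_path))[0] + ".csv"
        pd.concat(frames, axis=0, ignore_index=True).to_csv(
            os.path.join(output_fldr, out_name), header=False, index=False)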
Related
I'm quite new to Python and have encountered a problem: I want to write a script that starts in a base directory containing several folders, which all have the same subdirectory structure and are numbered with a control variable (scan00, scan01, ...).
I read out the names of the folders in the directory and store them in a variable called foldernames.
Then the script should go into each of these folders, into a subdirectory where multiple txt files are stored. I store them in the variable called "myFiles".
These txt files consist of 3 columns of float values separated by tabs, and each of the txt files has 3371 rows (they are all the same in terms of rows and columns).
Now my issue: I want the script to copy only the third column of all txt files and put it into a new txt or csv file. The only exception is the first txt file, where it is important that all three columns are copied to the new file.
In the other files, the third column of each txt file should be copied into an adjacent column of the new txt/csv file.
So I would like to end up with x columns in the generated txt/csv file, where x is the number of original txt files. If possible, I would like to write the corresponding file names in the first line of the new txt/csv file (here defined as column_names).
At the end, each folder should contain a txt/csv file which combines all of the single (297) txt files.
import os
import glob

foldernames1 = []
for foldernames in os.listdir("W:/certaindirectory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)

for i in range(1, len(foldernames1)):
    workingpath = "W:/certaindirectory/" + foldernames1[i] + "/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X', 'Y'] + myFiles[1:len(myFiles)]

    files = [open(f) for f in glob.glob('*.txt')]
    fout = open("ResultsCombined.txt", 'w')
    for row in range(1, 3371):  # len(files)):
        for f in files:
            fout.write(f.readline().strip().split('\t')[2])
            fout.write('\t')
        fout.write('\t')
    fout.close()
As an alternative I also tried to fix it via a csv file, but I wasn't able to fix my problem:
import os
import glob
import csv

foldernames1 = []
for foldernames in os.listdir("W:/certain directory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)

for i in range(1, len(foldernames1)):
    workingpath = "W:/certain directory/" + foldernames1[i] + "/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X', 'Y'] + myFiles[0:len(myFiles)]
    # print(column_names)
    with open("" + foldernames1[i] + ".csv", 'w', newline='') as target:
        writer = csv.DictWriter(target, fieldnames=column_names)
        writer.writeheader()  # if you want a header
        for path in glob.glob('*.txt'):
            with open(path, newline='') as source:
                reader = csv.DictReader(source, delimiter='\t', fieldnames=column_names)
                writer.writerows(reader)
Can anyone help me? Neither code delivers what I want. They read out something, but not the values I am interested in. I also have the feeling my code has some issues with float numbers.
Many thanks and best regards,
quester
pathlib and pandas should make the solution here relatively simple even without knowing the specific file names:
import pandas as pd
from pathlib import Path

p = Path("W:/certain directory/")
# recursively search for .txt files inside all sub directories
txt_files = [txt_file for txt_file in p.rglob("*.txt")]  # use p.glob("*.txt") instead for a non-recursive search
df = pd.DataFrame()
for path in txt_files:
    # use tab separator, read only the 3rd column, name the column after the file, read as floats
    current = pd.read_csv(path,
                          sep="\t",
                          usecols=[2],
                          names=[path.name],
                          dtype="float64")
    # add header=0 to pd.read_csv if there's a header row in the .txt files
    df = pd.concat([df, current], axis=1)
df.to_csv("W:/certain directory/floats_third_column.csv", index=False)
Hope this helps!
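If you also need the per-folder output and the first file's X/Y columns that the question mentions, a rough per-folder variant could look like the sketch below; "subdir" stands in for the anonymized "/.../" path in your code, and the output file name is just an illustration.
import pandas as pd
from pathlib import Path

base = Path("W:/certain directory/")
for scan_dir in sorted(d for d in base.iterdir() if d.is_dir() and d.name.startswith("scan")):
    txt_files = sorted((scan_dir / "subdir").glob("*.txt"))  # "subdir" is a placeholder
    if not txt_files:
        continue
    parts = []
    for idx, path in enumerate(txt_files):
        # each file has three tab-separated float columns
        data = pd.read_csv(path, sep="\t", header=None,
                           names=["X", "Y", path.name], dtype="float64")
        # keep all three columns for the first file, only the value column otherwise
        parts.append(data if idx == 0 else data[[path.name]])
    # one combined file per scan folder, named after the folder (illustrative name)
    pd.concat(parts, axis=1).to_csv(scan_dir / f"{scan_dir.name}.csv", index=False)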
I have a folder of Excel files, many of which have 3-4 tabs' worth of data that I just want as individual Excel files. For example, let's say I have an Excel file with three tabs: "employees", "summary", and "data". I would want this to create 3 new Excel files: employees.xlsx, summary.xlsx, and data.xlsx.
I have code that will loop through a folder and identify all of the tabs, but I am struggling to figure out how to export the data from each sheet into its own Excel file. I have gotten to the point where I can loop through the folder, open each Excel file, and find the name of each sheet. Here's what I have so far.
import pandas as pd
import os

# filenames
files = os.listdir()
excel_names = list(filter(lambda f: f.endswith('.xlsx'), files))
excels = [pd.ExcelFile(name, engine='openpyxl') for name in excel_names]
sh = [x.sheet_names for x in excels]  # I am getting all of the sheet names here
for s in sh:
    for x in s:
        # this is where I want to start exporting each sheet as its own spreadsheet
        # df.to_excel("output.xlsx", header=False, index=False)  # I want to eventually export it obviously, this is a placeholder
        pass
import pandas as pd
import glob

# get the file names using glob
# (this assumes that the files are in the current working directory)
excel_names = glob.glob('*.xlsx')

# iterate through the excel file names
for excel in excel_names:
    # read the excel file with sheet_name=None
    # this will create a dict of DataFrames keyed by sheet name
    dfs = pd.read_excel(excel, sheet_name=None)
    # iterate over the dict keys (which are the sheet names)
    for key in dfs.keys():
        # use f-strings (Python 3.6+) to assign
        # the new file name from the sheet name
        dfs[key].to_excel(f'{key}.xlsx', index=False)
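One thing to watch for: if two workbooks contain a sheet with the same name, the loop above will overwrite the earlier output file. A small variation that prefixes the output name with the workbook name (a naming choice assumed here, not something the question requires) avoids that:
import os
import glob
import pandas as pd

for excel in glob.glob('*.xlsx'):
    book = os.path.splitext(os.path.basename(excel))[0]
    for sheet_name, df in pd.read_excel(excel, sheet_name=None).items():
        # e.g. "report.xlsx" with an "employees" tab -> "report_employees.xlsx"
        df.to_excel(f'{book}_{sheet_name}.xlsx', index=False)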
So I'm working on a project that analyzes Covid-19 data from this entire year. I have multiple CSV files in a given directory and am trying to merge the contents of all the files from each month into a single, comprehensive CSV file. Here's what I've got so far, shown below. Specifically, the error message that appears is 'EmptyDataError: No columns to parse from file.' If I delete df = pd.read_csv('./csse_covid_19_daily_reports_us/' + file) and simply run print(file), it lists all the correct files that I am trying to merge. However, when trying to merge all the data into one I get that error message. What gives?
import pandas as pd
import os

df = pd.read_csv('./csse_covid_19_daily_reports_us/09-04-2020.csv')
files = [file for file in os.listdir('./csse_covid_19_daily_reports_us')]
all_data = pd.DataFrame()
for file in files:
    df = pd.read_csv('./csse_covid_19_daily_reports_us/' + file)
    all_data = pd.concat([all_data, df])
all_data.head()
Folks, I have resolved this issue. Instead of sifting through files with files = [file for file in os.listdir('./csse_covid_19_daily_reports_us')], I instead used files = [f for f in os.listdir("./") if f.endswith('.csv')]. This filtered out some garbage files that were not .csv, thus allowing me to compile all the data into a single CSV.
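For reference, a minimal sketch of the corrected loop, assuming the CSVs still live in ./csse_covid_19_daily_reports_us/ as in the original code (the output file name is just an illustration):
import os
import pandas as pd

folder = './csse_covid_19_daily_reports_us'
# keep only real .csv files, skipping anything else in the folder
files = [f for f in os.listdir(folder) if f.endswith('.csv')]

all_data = pd.concat(
    (pd.read_csv(os.path.join(folder, f)) for f in files),
    ignore_index=True,
)
all_data.to_csv('merged_daily_reports.csv', index=False)  # illustrative output name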
I'm new to Python... I have tried to apply this code to merge multiple CSV files, but it doesn't work. Basically, I have files which contain stock prices with the header: date, open, High, low, Close, Adj Close, Volume... but each CSV file has a different name: Apl.csv, VIX.csv, FCHI.csv, etc.
I would like to merge all these CSV files into one, but I would also like to add a new column which discloses the name of each CSV file, for example:
stock_id,date,open,High,low,Close,Adj Close,Volume with stock_id = apl, Vix, etc.
I used this code but I got stuck at line 4.
Here is the code:
files = os.listdir()
file_list = list()
for file in os.listdir():
if file.endswith(".csv")
df=pd.read_csv(file,sep=";")
df['filename'] = file
file_list.append(df)
all_days = pd.concat(file_list, axis=0, ignore_index=True)
all_days.to_csv("all.csv")
Could someone help me sort this out?
In Python, the indentation level matters, and you need a colon at the end of an if statement. I can't speak to the method you're trying, but you can clean up the syntax with this:
import os
import pandas as pd

files = os.listdir()
file_list = list()
for file in os.listdir():
    if file.endswith(".csv"):
        df = pd.read_csv(file, sep=";")
        df['filename'] = file
        file_list.append(df)
all_days = pd.concat(file_list, axis=0, ignore_index=True)
all_days.to_csv("all.csv")
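If you also want the stock_id column from the question (derived from the file name rather than storing the full filename), a small sketch along these lines should work; the lowercasing is an assumption based on the apl example:
import os
import pandas as pd

file_list = []
for file in os.listdir():
    if file.endswith(".csv"):
        df = pd.read_csv(file, sep=";")
        # "Apl.csv" -> "apl" (lowercasing assumed from the example in the question)
        df['stock_id'] = os.path.splitext(file)[0].lower()
        file_list.append(df)

all_days = pd.concat(file_list, axis=0, ignore_index=True)
# put stock_id first, matching the desired header order
all_days = all_days[['stock_id'] + [c for c in all_days.columns if c != 'stock_id']]
all_days.to_csv("all.csv", index=False)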
I'm relatively new to Python. Here is what I'd like to do: I have a folder with multiple CSV files (2018.csv, 2017.csv, 2016.csv, etc.), 500 CSV files to be precise. Each CSV contains the header "date", "Code", "Cur", "Price", etc. I'd like to concatenate all 500 CSV files into one dataframe. Here is my code for one CSV file, but it's very slow, and I want to do it for all 500 files and concatenate them into one dataframe:
DB_2017 = pd.read_csv("C:/folder/2018.dat", sep=",", header=None).iloc[:, [0, 4, 5, 6]]
DB_2017.columns = ["date", "Code", "Cur", "Price"]
DB_2017['Code'] = DB_2017['Code'].map(lambda x: x.lstrip('#').rstrip('#'))
DB_2017['Cur'] = DB_2017['Cur'].map(lambda x: x.lstrip('#').rstrip('#'))
DB_2017['date'] = DB_2017['date'].apply(lambda x: pd.Timestamp(str(x)[:10]))
DB_2017['Price'] = pd.to_numeric(DB_2017.Price.replace(',', ';'))
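One way to speed this up and handle all 500 files, sketched below under the assumption that the files sit in C:/folder/ and all share the layout above, is to wrap the per-file cleanup in a function and concatenate the list once at the end (repeatedly concatenating inside a loop is what usually makes this kind of script slow). Note the Price conversion here simply coerces bad values to NaN rather than reproducing the comma handling above.
import glob
import pandas as pd

def load_one(path):
    # same per-file cleanup as in the snippet above
    df = pd.read_csv(path, sep=",", header=None).iloc[:, [0, 4, 5, 6]]
    df.columns = ["date", "Code", "Cur", "Price"]
    df['Code'] = df['Code'].str.strip('#')
    df['Cur'] = df['Cur'].str.strip('#')
    df['date'] = pd.to_datetime(df['date'].astype(str).str[:10])
    df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
    return df

# "C:/folder/*.csv" is an assumption about where the 500 files live
all_files = glob.glob("C:/folder/*.csv")
combined = pd.concat((load_one(f) for f in all_files), ignore_index=True)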
I have a script that parses Excel files from one directory. It joins all of the files together and concatenates them into one.
Right now, the way I write the CSV file from a dataframe is by starting with an empty list, appending the data scraped by the function cutpaste (which parses the data I want from each file into a new dataframe), and then writing one final concatenated CSV file.
files is the variable that holds all the Excel files from a given directory.
# Create new CSV file
df_list = []
for file in files:
    df = pd.read_excel(io=file, sheet_name=sheet)
    new_file = cutpaste(df)
    df_list.append(new_file)
df_final = pd.concat(df_list)
df_final.to_csv('Energy.csv', header=True, index=False)
What I need now is a way of changing my code so that any new Excel files that are not already in Energy.csv get written to Energy.csv.
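One possible approach, sketched below under the assumption that adding a source_file column (a name introduced here purely for illustration) to Energy.csv is acceptable, is to record each source file name in the output and skip files already recorded on later runs:
import os
import pandas as pd

csv_path = 'Energy.csv'

# load the existing output, if any, and the set of files already processed
if os.path.exists(csv_path):
    existing = pd.read_csv(csv_path)
    done = set(existing.get('source_file', pd.Series(dtype=str)))
else:
    existing = pd.DataFrame()
    done = set()

df_list = []
for file in files:  # 'files', 'sheet' and 'cutpaste' come from the existing script
    if file in done:
        continue  # already in Energy.csv, skip it
    df = cutpaste(pd.read_excel(io=file, sheet_name=sheet))
    df['source_file'] = file  # hypothetical column used to track processed files
    df_list.append(df)

if df_list:
    pd.concat([existing] + df_list).to_csv(csv_path, header=True, index=False)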