I have a zip file of several CSV documents, which I have extracted into a folder called "staging". These files are encoded in Windows CP1252. I would like to read in each CSV file individually as a separate dataframe, remove all of the null values, and then overwrite the old files with UTF-8 encoding. Alternatively, instead of rewriting the CSVs as UTF-8, I could encode the database directly from the pandas dataframes that are produced.
I have browsed the Stack Overflow forums, but the main topic there seems to be concatenating multiple CSVs into a single dataframe; what I need is a separate dataframe for each CSV. Also, I have to remove N/A values, however in the CSVs they have random numbers attached to them (i.e. N/A (3) or N/A(1), etc). Any help would be greatly appreciated.
Here is the code I am working with:
import os
# Create the staging directory
staging_dir = "staging"
os.mkdir(staging_dir)
# Confirm the staging directory path
os.path.isdir(staging_dir)
# Machine independent path to create files
zip_file = os.path.join(staging_dir, "Hospital_Revised_Flatfiles.zip")
# Write the files to the computer (r is the requests response from the earlier download)
zf = open(zip_file, "wb")
zf.write(r.content)
zf.close()
# Program to unzip the files
import zipfile
z = zipfile.ZipFile(zip_file, "r")
z.extractall(staging_dir)
z.close()
# Create the dataframes
import io
import glob
import pandas as pd
files = glob.glob(os.path.join("staging", "*.csv"))
# OS independent reading of files
for file in files:
    dfs = pd.read_csv(file, header=0, encoding='cp1252')
I believe P.Tillmann's solution should've worked. Alternatively, you can load all your dataframes first and then write them back.
files = glob.glob(os.path.join("staging", "*.csv"))
dict_ = {}
for file in files:
    dict_[file] = pd.read_csv(file, header=0, encoding='cp1252').dropna()

for file in dict_:
    dict_[file].to_csv(file, encoding='utf-8')
Just add
dfs.dropna().to_csv(file, encoding='utf-8')
to your last loop. It will drop all rows with null values and then save the dataframe by overwriting the old version.
Also remove the extra opening bracket in your last line: you open two but only close one. That's where the EOF error is coming from.
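Since the question also mentions N/A markers with numbers attached (e.g. N/A (3) or N/A(1)), note that dropna() alone will not catch those, because pandas reads them as ordinary strings. A minimal sketch of one way to handle it, assuming the markers always look like "N/A" followed by an optional parenthesised number (the regex is an assumption about the data):

import glob
import os

import numpy as np
import pandas as pd

for file in glob.glob(os.path.join("staging", "*.csv")):
    df = pd.read_csv(file, header=0, encoding='cp1252')
    # turn strings like "N/A (3)" or "N/A(1)" into real NaN values before dropping rows
    df = df.replace(to_replace=r'^\s*N/A\s*(\(\d+\))?\s*$', value=np.nan, regex=True)
    df.dropna().to_csv(file, encoding='utf-8', index=False)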
Related
I have multiple csv files present in a Hadoop folder. Each csv file has a header, and the header is the same in every file.
I am writing these csv files with the help of a Spark dataset, like this in Java:
df.write().csv(somePath)
I was also thinking of using coalesce(1), but it is not memory efficient in my case.
I know that this write will also create some redundant files in the folder, so I need to handle that as well.
I want to merge all these csv files into one big csv file, but I don't want to repeat the header in the combined file. I just want one header line on top of the data in my csv file.
I am working with Python to merge these files. I know I can use the hadoop getmerge command, but it will also merge the headers that are present in each csv file,
so I am not able to figure out how I should merge all the csv files without repeating the headers.
coalesce(1) is exactly what you want.
Speed/memory usage is the tradeoff you get for wanting exactly one file.
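For reference, a minimal PySpark sketch of that approach (assuming df is the Spark DataFrame from the question and somePath is the output directory); because everything is funnelled through a single task, exactly one part file with one header row is produced:

# write through a single partition so only one part file (and one header line) is created
(df.coalesce(1)
   .write
   .option("header", True)
   .mode("overwrite")
   .csv("somePath"))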
It seems this will do it for you:
# importing libraries
import pandas as pd
import glob
import os
# merging the files
joined_files = os.path.join("/hadoop", "*.csv")
# A list of all joined files is returned
joined_list = glob.glob(joined_files)
# Finally, the files are joined
df = pd.concat(map(pd.read_csv, joined_list), ignore_index=True)
Edit: I don't know much about Hadoop, but maybe the same logic applies.
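To finish with a single CSV that has one header line on top, the concatenated frame can then be written back out (a sketch; the output file name is an assumption):

# pandas writes the header exactly once for the combined frame
df.to_csv("merged.csv", index=False)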
I've been searching for a way to merge all csv files in a folder. They all have the same headers, but different names. I've found some videos on YouTube about merging and some questions here on Stack Overflow that touch the matter. The problem is that these tutorials are focused on files with similar names, such as sales1, sales2, etc.
In my case, all files in the directory are CSVs and are located in 'D:\XXXX\XXXX\output'
The code I have used is:
import pandas as pd
# set files path
amazon = r'D:\XXXX\XXXX\output\amazonbooks.csv'
bookcrossing = r'D:\XXXX\XXXX\output\bookcrossing.csv'
# merge files
dataFrame = pd.concat(
    map(pd.read_csv, [amazon, bookcrossing]), ignore_index=True)
print(dataFrame)
It would be better if the code could merge all the files that are in the output folder (since all of them are .csv) instead of naming each one of them.
I'd be glad if anyone can help me with this problem, or can guide me on how to solve this.
If the goal is to append the files into a single result, you don't really need any CSV processing at all. Just write the file contents minus the header line (except for the first file). glob will return the file names, with path, that match the pattern "*.csv".
from glob import glob
import os
import shutil
csv_dir = r'D:\XXXX\XXXX\output'
result_csv = r'd:\XXXX\XXXX\combined.csv'
first_hdr = True
# all .csv files in the directory have the same header
with open(result_csv, "w", newline="") as result_file:
    for filename in glob(os.path.join(csv_dir, "*.csv")):
        with open(filename) as in_file:
            header = in_file.readline()
            if first_hdr:
                result_file.write(header)
                first_hdr = False
            shutil.copyfileobj(in_file, result_file)
(assuming all csvs have an equal number of columns)
Try something like this:
import os
import pandas as pd
csvs = [file for file in os.listdir('D:\\XXXX\\XXXX\\output') if file.endswith('.csv')]
result_df = pd.concat([pd.read_csv(f'D:\\XXXX\\XXXX\\output\\{file}') for file in csvs])
I'm quite new to Python and encountered a problem: I want to write a script that starts in a base directory containing several folders, which all have the same structure in their subdirectories and are numbered with a control variable (scan00, scan01, ...).
I read out the names of the folders in the directory and store them in a variable called foldernames.
Then the script should go into a subdirectory of each of these folders, where multiple txt files are stored. I store them in the variable called "myFiles".
These txt files consist of 3 columns of float values separated by tabs, and each of the txt files has 3371 rows (they are all the same in terms of rows and columns).
Now my issue: I want the script to copy only the third column of each txt file and put it into a new txt or csv file. The only exception is the first txt file; there it is important that all three columns are copied to the new file.
From the other files, the third column should be copied into an adjacent column of the new txt/csv file.
So I would like to end up with x columns in the generated txt/csv file, where x is the number of the original txt files. If possible, I would like to write the corresponding file names in the first line of the new txt/csv file (here defined as column_names).
In the end, each folder should contain one txt/csv file that combines all of the single (297) txt files.
import os
import glob

foldernames1 = []
for foldernames in os.listdir("W:/certaindirectory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)

for i in range(1, len(foldernames1)):
    workingpath = "W:/certaindirectory/" + foldernames1[i] + "/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X', 'Y'] + myFiles[1:len(myFiles)]

    files = [open(f) for f in glob.glob('*.txt')]
    fout = open("ResultsCombined.txt", 'w')
    for row in range(1, 3371):  # len(files)):
        for f in files:
            fout.write(f.readline().strip().split('\t')[2])
            fout.write('\t')
        fout.write('\t')
    fout.close()
As an alternative I also tried to do it via a csv file, but I wasn't able to fix my problem:
import os
import glob
import csv

foldernames1 = []
for foldernames in os.listdir("W:/certain directory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)

for i in range(1, len(foldernames1)):
    workingpath = "W:/certain directory/" + foldernames1[i] + "/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X', 'Y'] + myFiles[0:len(myFiles)]
    # print(column_names)

    with open("" + foldernames1[i] + ".csv", 'w', newline='') as target:
        writer = csv.DictWriter(target, fieldnames=column_names)
        writer.writeheader()  # if you want a header
        for path in glob.glob('*.txt'):
            with open(path, newline='') as source:
                reader = csv.DictReader(source, delimiter='\t', fieldnames=column_names)
                writer.writerows(reader)
Can anyone help me? Neither version delivers what I want. They read out something, but not the values I am interested in. I also have the feeling my code has some issues with float numbers.
Many thanks and best regards,
quester
pathlib and pandas should make the solution here relatively simple even without knowing the specific file names:
import pandas as pd
from pathlib import Path

p = Path("W:/certain directory/")
# recursively search for .txt files inside all sub directories
txt_files = [txt_file for txt_file in p.rglob("*.txt")]  # use p.glob("*.txt") instead for non-recursive iteration

df = pd.DataFrame()
for path in txt_files:
    # use tab separator, read only the 3rd column, name the column after the file, read as floats
    current = pd.read_csv(path,
                          sep="\t",
                          usecols=[2],
                          names=[path.name],
                          dtype="float64")
    # add header=0 to pd.read_csv if there's a header row in the .txt files
    df = pd.concat([df, current], axis=1)

df.to_csv("W:/certain directory/floats_third_column.csv", index=False)
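One detail from the question that the loop above glosses over: the first txt file should contribute all three of its columns, not just the third. A hedged tweak to the loop under the same assumptions (tab separated, no header row; the X/Y column labels follow the question's wording):

for i, path in enumerate(txt_files):
    # read all three tab separated columns as floats
    current = pd.read_csv(path, sep="\t", header=None,
                          names=["X", "Y", path.name], dtype="float64")
    if i > 0:
        # for every file after the first, keep only the third column
        current = current[[path.name]]
    df = pd.concat([df, current], axis=1)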
Hope this helps!
I have some Python (3.8) code that does the following:
Walks directory and subdirectories of a given path
Finds all .csv files
Finds all .csv files with 'Pct' in filename
Joins path and file
Reads CSV
Adds filename to df
Concatenates all dfs together
The code below works, but takes a long time (about 15 minutes) to ingest all the CSVs; there are 52,000 files. This might not in fact be a long time, but I want to reduce it as much as possible.
My current working code is below:
import fnmatch
import os

import pandas as pd

start_dirctory = '/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/'  # change this

df_result = None
#loop_number = 0

for path, dirs, files in os.walk(start_dirctory):
    for file in sorted(fnmatch.filter(files, '*.csv')):  # find .csv files
        # print(file)
        if 'Pct' in file:  # filter if contains 'Pct'
            # print('Pct = ', file)
            full_name = os.path.join(path, file)  # make full file path
            df_tmp = pd.read_csv(full_name, header=None)  # read file to df_tmp
            df_tmp['file'] = os.path.basename(file)  # df.file = file name
            if df_result is None:
                df_result = df_tmp
            else:
                df_result = pd.concat([df_result, df_tmp], axis='index', ignore_index=True)
            #print(full_name, 'imported')
    #loop_number = loop_number + 1
    #print('Loop number =', loop_number)
Inspired by this post (glob to find files recursively) and this post (how to speed up importing csvs), I have tried to reduce the time that it takes to ingest all the data, but can't figure out a way to integrate a filter for only filenames that contain 'Pct' and then add the filename to the df. This might not be possible with the code from these examples.
What I have tried below (incomplete):
%%time
import glob
import pandas as pd
df = pd.concat(
    [pd.read_csv(f, header=None)
     for f in glob.glob('/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/**/*.csv', recursive=True)
     ],
    axis='index', ignore_index=True
)
Question
Is there any way that I can reduce the time to read and ingest the CSV's in my code above?
Thanks!
Check out the following solution. It assumes the system's open-file limit is high enough, because it streams the files one by one but has to open each of them to read the headers. In cases where files have different columns, you will get the superset of columns in the resulting file:
from convtools import conversion as c
from convtools.contrib.tables import Table
files = sorted(
    os.path.join(path, file)
    for path, dirs, files in os.walk(start_dirctory)
    for file in files
    if "Pct" in file and file.endswith(".csv")
)
table = None
for file in files:
    table_ = Table.from_csv(file, header=True)  # assuming there's a header
    if table is None:
        table = table_
    else:
        table.chain(table_)
# this will be an iterable of dicts, so consume with pandas or whatever
table.into_iter_rows(dict) # or list, or tuple
# or just write the new file like:
# >>> table.into_csv("concatenated.csv")
# HOWEVER: into_* can only be used once, because Table
# cannot assume the incoming data stream can be read twice
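To make the "consume with pandas" comment concrete, a minimal sketch (used in place of the bare into_iter_rows call above, since into_* methods can only be called once per Table):

import pandas as pd

# materialize the streamed rows into a single dataframe
df = pd.DataFrame(list(table.into_iter_rows(dict)))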
If you are sure that all the files have the same columns (only one file is opened at a time):
Edited to add a file column:
def concat_files(files):
    for file in files:
        yield from Table.from_csv(file, header=True).update(
            file=file
        ).into_iter_rows(dict)
# this will be an iterable of dicts, so consume with pandas or whatever
concat_files(files)
P.S. of course you can replace Table.from_csv with a standard/other reader, but this one adapts to the file, so it is generally faster on large files.
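For completeness, the glob-based pandas one-liner from the question can also carry the 'Pct' filter and the per-row filename column that the question asked about; a hedged sketch of that variant (the recursive glob path is taken from the question):

import glob
import os
import pandas as pd

files = glob.glob('/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/**/*.csv', recursive=True)

df = pd.concat(
    # read each matching file and tag its rows with the file name
    [pd.read_csv(f, header=None).assign(file=os.path.basename(f))
     for f in files if 'Pct' in os.path.basename(f)],
    axis='index', ignore_index=True
)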
I want to open multiple csv files in python, collate them and have python create a new file with the data from the multiple files reorganised...
Is there a way for me to read all the files from a single directory on my desktop into Python like this?
Thanks a lot
If you have a directory containing your csv files, and they all have the extension .csv, then you could use, for example, glob and pandas to read them all in and concatenate them into one csv file. For example, say you have a directory laid out like this:
csvfiles/one.csv
csvfiles/two.csv
where one.csv contains:
name,age
Keith,23
Jane,25
and two.csv contains:
name,age
Kylie,35
Jake,42
Then you could do the following in Python (you will need to install pandas with, e.g., pip install pandas):
import glob
import os
import pandas as pd
# the path to your csv file directory
mycsvdir = 'csvdir'
# get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))
# loop through the files and read them in with pandas
dataframes = [] # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile)
    dataframes.append(df)
# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)
# print out to a new csv file
result.to_csv('all.csv')
Note that the output csv file will have an additional column at the front containing the index of the row. To avoid this you could instead use:
result.to_csv('all.csv', index=False)
You can see the documentation for the to_csv() method here.
Hope that helps.
Here is a very simple way to do what you want to do.
import pandas as pd
import glob, os
os.chdir("C:\\your_path\\")
results = pd.DataFrame([])
for counter, file in enumerate(glob.glob("1*")):
    namedf = pd.read_csv(file, skiprows=0, usecols=[1, 2, 3])
    results = results.append(namedf)  # note: DataFrame.append was removed in pandas 2.x; use pd.concat there
results.to_csv('C:\\your_path\\combinedfile.csv')
Notice this part: glob("1*")
This will look only for files that start with '1' in the name (1, 10, 100, etc). If you want everything, change it to this: glob("*")
Sometimes it's necessary to merge all CSV files into a single CSV file, and sometimes you just want to merge some files that match a certain naming convention. It's nice to have this feature!
I know that the post is a little bit old, but using glob can be quite expensive in terms of memory if you are trying to read large csv files: you store all of that data in a list, and then you still have to have enough memory to concatenate the dataframes inside that list into a single dataframe with all the data. Sometimes this is not possible.
import pandas as pd

data_dir = 'directory path'
df = pd.DataFrame()
for i in range(0, 24):
    csvfile = pd.read_csv('{}/file name{}.csv'.format(data_dir, i), encoding='utf8')
    df = df.append(csvfile)  # DataFrame.append was removed in pandas 2.x; use pd.concat there
    del csvfile
So, in case your csv files have the same name with some kind of number or string that differentiates them, you can just loop through the files with a for loop and delete each one after it has been appended to the dataframe variable. In this case all my csv files have the same name, except that they are numbered in a range that goes from 0 to 23.
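If the end product is a combined CSV on disk rather than an in-memory dataframe, an even lighter variant is to stream each file straight into the output as it is read, so only one file's data is in memory at a time. A sketch under the same numbered-file assumption as above (the output name combined.csv is an assumption):

import pandas as pd

data_dir = 'directory path'
out_path = 'combined.csv'

for i in range(0, 24):
    chunk = pd.read_csv('{}/file name{}.csv'.format(data_dir, i), encoding='utf8')
    # append to the output file, writing the header only for the first file
    chunk.to_csv(out_path, mode='w' if i == 0 else 'a', header=(i == 0), index=False)
    del chunk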