I'm quite new to Python and have encountered a problem: I want to write a script that starts in a base directory containing several folders, which all have the same subdirectory structure and are numbered with a control variable (scan00, scan01, ...).
I read out the names of the folders in the directory and store them in a variable called foldernames.
Then the script should go into a subdirectory of each of these folders where multiple txt files are stored. I store their names in the variable "myFiles".
These txt files consist of 3 columns of float values separated by tabs, and each txt file has 3371 rows (they are all identical in terms of rows and columns).
Now my issue: I want the script to copy only the third column of every txt file and put it into a new txt or csv file. The only exception is the first txt file: there, all three columns should be copied to the new file.
For the other files, the third column should be copied into an adjacent column of the new txt/csv file.
So I would like to end up with x columns in the generated txt/csv file, where x is the number of original txt files. If possible, I would like to write the corresponding file names in the first line of the new txt/csv file (here defined as column_names).
In the end, each folder should contain one txt/csv file that combines all the individual (297) txt files.
import os
import glob

foldernames1 = []
for foldernames in os.listdir("W:/certaindirectory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)

for i in range(1, len(foldernames1)):
    workingpath = "W:/certaindirectory/" + foldernames1[i] + "/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X', 'Y'] + myFiles[1:len(myFiles)]

    files = [open(f) for f in glob.glob('*.txt')]
    fout = open("ResultsCombined.txt", 'w')
    for row in range(1, 3371):  # len(files)):
        for f in files:
            fout.write(f.readline().strip().split('\t')[2])
            fout.write('\t')
        fout.write('\t')
    fout.close()
As an alternative, I also tried to solve it via a csv file, but I wasn't able to fix my problem:
import os
import glob
import csv

foldernames1 = []
for foldernames in os.listdir("W:/certain directory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)

for i in range(1, len(foldernames1)):
    workingpath = "W:/certain directory/" + foldernames1[i] + "/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X', 'Y'] + myFiles[0:len(myFiles)]
    # print(column_names)

    with open("" + foldernames1[i] + ".csv", 'w', newline='') as target:
        writer = csv.DictWriter(target, fieldnames=column_names)
        writer.writeheader()  # if you want a header
        for path in glob.glob('*.txt'):
            with open(path, newline='') as source:
                reader = csv.DictReader(source, delimiter='\t', fieldnames=column_names)
                writer.writerows(reader)
Can anyone help me? Neither version delivers what I want. They read out something, but not the values I am interested in. I also have the feeling my code has some issues with float numbers.
Many thanks and best regards,
quester
pathlib and pandas should make the solution here relatively simple even without knowing the specific file names:
import pandas as pd
from pathlib import Path
p = Path("W:/certain directory/")
# recursively search for .txt files inside all sub directories
txt_files = [txt_file for txt_file in p.rglob("*.txt")]  # p.iterdir() --> glob("*.txt") for non-recursive iteration
df = pd.DataFrame()
for path in txt_files:
    # use tab separator, read only the 3rd column, name the column after the file, read as floats
    current = pd.read_csv(path,
                          sep="\t",
                          usecols=[2],
                          names=[path.name],
                          dtype="float64")
    # add header=0 to pd.read_csv if there's a header row in the .txt files
    df = pd.concat([df, current], axis=1)
df.to_csv("W:/certain directory/floats_third_column.csv", index=False)
Hope this helps!
Related
I've been searching for a way to merge all the csv files in a folder. They all have the same headers, but different names. I've found some videos on YouTube about merging and some questions here on Stack Overflow that touch the matter. The problem is that these tutorials are focused on files with similar names, such as sales1, sales2, etc.
In my case, all files in the directory are CSVs and are located in 'D:\XXXX\XXXX\output'
The code I have used is:
import pandas as pd
# set files path
amazon = r'D:\XXXX\XXXX\output\amazonbooks.csv'
bookcrossing = r'D:\XXXX\XXXX\output\bookcrossing.csv'
# merge files
dataFrame = pd.concat(
map(pd.read_csv, [amazon, bookcrossing]), ignore_index=True)
print(dataFrame)
It would be better if the code could merge all the files in the output folder (since all of them are .csv) instead of naming each one of them.
I'd be glad if anyone could help me with this problem, or guide me on how to solve it.
If the goal is to append the files into a single result, you don't really need any CSV processing at all. Just write each file's contents minus the header line (keeping the header only from the first file). glob will return the file names, with path, that match the pattern "*.csv".
from glob import glob
import os
import shutil
csv_dir = r'D:\XXXX\XXXX\output'
result_csv = r'd:\XXXX\XXXX\combined.csv'
first_hdr = True
# all .csv files in the directory have the same header
with open(result_csv, "w", newline="") as result_file:
    for filename in glob(os.path.join(csv_dir, "*.csv")):
        with open(filename) as in_file:
            header = in_file.readline()
            if first_hdr:
                result_file.write(header)
                first_hdr = False
            shutil.copyfileobj(in_file, result_file)
(assuming all csvs have equal number of columns)
Try something like this:
import os
import pandas as pd

folder = r'D:\XXXX\XXXX\output'
csvs = [file for file in os.listdir(folder) if file.endswith('.csv')]
result_df = pd.concat([pd.read_csv(os.path.join(folder, file)) for file in csvs])
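If you then want the merged result written back to disk as a single file, a possible follow-up (the combined.csv name is just an assumption, not from the original answer) would be:

result_df.to_csv(os.path.join(folder, 'combined.csv'), index=False)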
In a loop I adjust the CSV structure of each file.
Now I want them to save in to the assigned folder with unique file names.
I can save to a CSV file, but then the CSV file gets overwritten, resulting in only the final modified result of the test5 file. I want to save each CSV under its own filename plus a _modified suffix.
I have 5 csv files:
Test1.csv
test2.csv
test3.csv
test4.csv
test5.csv
I import them:
for x in allFiles:
    print(x)
    stop = 1
    with open(x, 'r') as thecsv:
        base = os.path.basename(x)
        filename = os.path.splitext(base)[0]
        print(filename)
Now I loop through the files, manipulate them, and save the result as a DataFrame.
This is working fine.
Now I want to save each file separately in the output folder with a unique name (filename + _modified)
Output='J:\Temp\Output'
This is what I tried:
df2.to_csv(output+filename+'//_modified.csv'),sep=';',header=False,index=False)
also tried:
df2.to_csv(output(os.path.join(name+'//_modified.csv'),sep=';',header=False,index=False)
Hoping for the output folder looks like this:
test1_modified.csv
test2_modified.csv
test3_modified.csv
test4_modified.csv
test5_modified.csv
I would do something like this, making a new name before the call to write it out:
testFiles = ["test1.csv", "test2.csv", "test3.csv",
"test4.csv", "test5.csv"]
# iterate over each one
for f in testFiles:
# strip old extensions, replace with nothing
f = f.replace(".csv", "")
# I'd use join but you can you +
newName = "_".join([f, "_modified.csv"])
print(newName)
# make your call to write it out
I would also check the pandas docs for writing out; it's simpler than what you're trying: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
import pandas as pd
# read data
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
# write data to local
iris.to_csv("iris.csv")
I found the solution to my problem
df.to_csv(output + '\\' + filename + '.csv', sep=';', header=False, index=False)
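For what it's worth, the same write can be expressed with os.path.join, which avoids the backslash escaping; this is just a sketch reusing the variables from above, with the _modified suffix the question asked for:

import os

df.to_csv(os.path.join(output, filename + '_modified.csv'),
          sep=';', header=False, index=False)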
I have written a script which extracts text from multiple csv's. Can someone help me embed into it a part that reads csv data from different zipped files and creates multiple csv's (one for each zipped file) at a given location.
For example, if I have 10 csv's in zipped folder z1 and 5 in zipped folder z2, I want to extract the files from each zipped folder and get the combined files in one location. In this case that would be z1.csv (with concatenated data from the 10 csv's) and z2.csv (with concatenated data from the 5 csv's).
I am using the following script,
import glob
import os
import pandas as pd

allfiles = glob.glob(os.path.join(input_fldr, "*.csv"))
a = []
b = []
for file_ in allfiles:
    dirname, filename = os.path.split(file_)
    f = open(file_, 'r', encoding='UTF-8')
    lines = f.readlines()
    f.close()
    for line in lines:
        if line.startswith('Hello'):
            a.append(filename)
            b.append(line)
df_a = pd.DataFrame(a, columns=list("A"))
df_b = pd.DataFrame(b, columns=list("B"))
df = pd.concat([df_a, df_b], axis=1)
The Code
The code I came up with, which does roughly what I believe you want to happen, is this (all the files you need for this example are available here):
import zipfile
import pandas as pd
virtual_csvs = []
with zipfile.ZipFile("test3.zip", "r") as f:
    for name in f.namelist():
        if name.endswith(".csv"):
            data = f.open(name)
            virtual_csvs.append(pd.read_csv(data, header=None))
pd.concat(virtual_csvs, axis=1).to_csv('output.csv', header=False, index=False)
Code Breakdown
virtual_csvs = []
We start by creating a list that will store all of the pandas DataFrames, much like your list [df_a, df_b]
with zipfile.ZipFile("test3.zip", "r") as f:
This loads the zipfile "test3.zip" (replace with your zipfile name) in read mode into the variable f
for name in f.namelist():
This iterates over every file name in the zipfile, and loads that to the variable: name
if name.endswith(".csv"):
This line is rather self-explanatory - if the file has an extension of .csv, run the following code.
data = f.open(name)
The f.open(name) command opens the file name inside the archive - roughly the equivalent of data = open(name, 'r') for a file on disk
virtual_csvs.append(pd.read_csv(data, header=None))
pd.read_csv(data, header=None) loads that file into a pandas DataFrame (header=None means the files have no column headers, so the first row is read as data)
virtual_csvs.append loads the dataframe into the virtual_csvs list
The final line of this code:
pd.concat(virtual_csvs, axis=1).to_csv('output.csv', header=False, index=False)
concatenates all of the csv files into one larger file ('output.csv').
pd.concat(virtual_csvs, axis=1) means to join all the csv files (DataFrame) in virtual_csvs by column (this returns a pd.DataFrame)
to_csv('output.csv', header=False, index=False) means to convert the given DataFrame to a csv file, named 'output.csv'.
header=False means to remove header names for each column
index=False disables row numbers from the DataFrames
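To get what the question describes, one combined CSV per zip archive (z1.csv, z2.csv, ...), the same idea can be wrapped in a loop over the archives. This is only a sketch: the input and output folder names are assumptions, and it keeps the column-wise concatenation used above.

import glob
import os
import zipfile
import pandas as pd

input_fldr = "zips"       # folder containing z1.zip, z2.zip, ... (assumed name)
output_fldr = "combined"  # folder for z1.csv, z2.csv, ... (assumed name)
os.makedirs(output_fldr, exist_ok=True)

for zip_path in glob.glob(os.path.join(input_fldr, "*.zip")):
    frames = []
    with zipfile.ZipFile(zip_path, "r") as zf:
        for name in zf.namelist():
            if name.endswith(".csv"):
                with zf.open(name) as data:
                    frames.append(pd.read_csv(data, header=None))
    if frames:
        # one combined csv per archive, named after the zip file
        out_name = os.path.splitext(os.path.basename(zip_path))[0] + ".csv"
        pd.concat(frames, axis=1).to_csv(os.path.join(output_fldr, out_name),
                                         header=False, index=False)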
I have hit a wall. So far I have the following code:
import os

# define variables of each directory to be used
parent_data_dir = 'C:\\Users\\Admin\\Documents\\Python Scripts\\Data\\'
orig_data_dir = 'C:\\Users\\Admin\\Documents\\Python Scripts\\Data\\Original\\'
new_data_dir = 'C:\\Users\\Admin\\Documents\\Python Scripts\\Data\\New\\'
# Create list of original data files from orig_data_dir
orig_data = []
for root, dirs, files in os.walk(orig_data_dir):
    for file in files:
        if file.endswith('.csv'):
            orig_data.append(file)
# It populates the file names located in the orig_data_dir
# orig_data = ['Test1.csv', 'Test2.csv', 'Test3.csv']
# Create list of new data files from new_data_dir
new_data = []
for root, dirs, files in os.walk(new_data_dir):
    for file in files:
        if file.endswith('.csv'):
            new_data.append(file)
# It populates the file names located in the new_data_dir
# new_data = ['Test1_2.csv', 'Test2_2.csv', 'Test3_2.csv']
I have three csv files in each directory. The csv files that end with _2.csv have new data that I would like to append to the old data, creating a new csv file for each respective pair. Each csv file has exactly the same rows. What I am trying to do is the following:
Read Test1.csv and Test1_2.csv into one dataframe using the lists I created (if there is a better way, I am open to it) (next iteration = Test2.csv and Test2_2.csv, etc.)
Do some pandas stuff
Write new file called Test_Compiled_1.csv (next iteration = Test_Compiled_2.csv, etc.)
Repeat until each csv pair from the two directories have been combined into a new csv file for each pair.
EDIT:
I have 1000s of csv files. With that said, I need to:
read the first file pair into the same dataframe:
1st iteration: Test1.csv located in orig_data_dir and Test1_2.csv located in new_data_dir
do pandas stuff
write out the populated dataframe to a new file in parent_data_dir
Repeat for each file pair
2nd iteration would be: Test2.csv and Test2_2.csv
1000 iteration would be: Test1000.csv and Test1000_2.csv
Hope this helps clarify.
The best advice is to give the files in each directory the same names,
and to keep only useful data in these directories. Here is a solution for different names:
for filename in os.listdir(orig_data_dir):
    name, ext = os.path.splitext(filename)
    filename_2 = new_data_dir + name + '_2' + ext  # construct new filename from old
    if os.path.isfile(filename_2):
        df_Orig = pd.read_csv(orig_data_dir + filename, index_col=0)
        df_New = pd.read_csv(filename_2, index_col=0)
        df_Orig.append(df_New).to_csv(orig_data_dir + filename)
Here I accumulate the result in the Original file. Only one loop is necessary.
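If you would rather keep the originals untouched and write the combined files into parent_data_dir under names like Test_Compiled_1.csv, as the question describes, a variant could look like this. It is only a sketch: it reuses the directory variables from the question and assumes the original file names follow the Test<N>.csv pattern.

import os
import pandas as pd

for filename in os.listdir(orig_data_dir):
    name, ext = os.path.splitext(filename)
    if ext != '.csv':
        continue
    filename_2 = os.path.join(new_data_dir, name + '_2' + ext)
    if os.path.isfile(filename_2):
        df_orig = pd.read_csv(os.path.join(orig_data_dir, filename), index_col=0)
        df_new = pd.read_csv(filename_2, index_col=0)
        combined = pd.concat([df_orig, df_new])
        # do pandas stuff on `combined` here, then write it out;
        # 'Test7.csv' becomes 'Test_Compiled_7.csv' (assumes the Test<N>.csv pattern)
        number = name.replace('Test', '')
        combined.to_csv(os.path.join(parent_data_dir, f'Test_Compiled_{number}.csv'))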
Something like this would help you:
import os
from itertools import chain
import fnmatch

paths = ('/path/to/directory/one/', '/path/to/directory/two/', 'etc.', 'etc.')
file1 = []
file2 = []
for path, dirs, files in chain.from_iterable(os.walk(path) for path in paths):
    for file in files:
        # keep the full path so the files can be read later
        if file in fnmatch.filter(files, '*1*.csv'):
            file1.append(os.path.join(path, file))
        if file in fnmatch.filter(files, '*2*.csv'):
            file2.append(os.path.join(path, file))
To create your dataframes you would do something like this:
df_file1 = pd.concat([pd.DataFrame(pd.read_csv(file1[0], sep=';')), pd.DataFrame(pd.read_csv(file1[1], sep=';'))], ignore_index=True)
df_file2 etc.
Note: the 'sep' in your csv might be different.
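A more general form of the same idea, handling any number of matched files per list instead of indexing them one by one, might be (a sketch that keeps the ';' separator assumption):

import pandas as pd

df_file1 = pd.concat([pd.read_csv(f, sep=';') for f in file1], ignore_index=True)
df_file2 = pd.concat([pd.read_csv(f, sep=';') for f in file2], ignore_index=True)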
EDIT: I've replaced endswith with fnmatch.filter; you can now use any pattern you like for matching the files you need in the different directories.
I want to open multiple csv files in python, collate them and have python create a new file with the data from the multiple files reorganised...
Is there a way for me to grab all the files from a single directory on my desktop and read them in Python like this?
Thanks a lot
If you have a directory containing your csv files, and they all have the extension .csv, then you could use, for example, glob and pandas to read them all in and concatenate them into one csv file. For example, say you have a directory laid out like this:
csvfiles/one.csv
csvfiles/two.csv
where one.csv contains:
name,age
Keith,23
Jane,25
and two.csv contains:
name,age
Kylie,35
Jake,42
Then you could do the following in Python (you will need to install pandas with, e.g., pip install pandas):
import glob
import os
import pandas as pd
# the path to your csv file directory
mycsvdir = 'csvfiles'
# get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))
# loop through the files and read them in with pandas
dataframes = [] # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile)
    dataframes.append(df)
# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)
# print out to a new csv file
result.to_csv('all.csv')
Note that the output csv file will have an additional column at the front containing the index of the row. To avoid this you could instead use:
result.to_csv('all.csv', index=False)
You can see the documentation for the to_csv() method here.
Hope that helps.
Here is a very simple way to do what you want to do.
import pandas as pd
import glob, os
os.chdir("C:\\your_path\\")
results = pd.DataFrame([])
for counter, file in enumerate(glob.glob("1*")):
    namedf = pd.read_csv(file, skiprows=0, usecols=[1,2,3])
    results = results.append(namedf)
results.to_csv('C:\\your_path\\combinedfile.csv')
Notice this part: glob("1*")
This will look only for files that start with '1' in the name (1, 10, 100, etc). If you want everything, change it to this: glob("*")
Sometimes it's necessary to merge all CSV files into a single CSV file, and sometimes you just want to merge some files that match a certain naming convention. It's nice to have this feature!
I know that the post is a little bit old, but using glob can be quite expensive in terms of memory if you are trying to read large csv files, because you store all that data in a list and then still have to have enough memory to concatenate the dataframes in that list into one dataframe with all the data. Sometimes this is not possible.
import pandas as pd

csv_dir = 'directory path'
df = pd.DataFrame()
for i in range(0, 24):
    csvfile = pd.read_csv('{}/file name{}.csv'.format(csv_dir, i), encoding='utf8')
    df = df.append(csvfile)
    del csvfile
So, in case your csv files have the same name with some kind of number or string that differentiates them, you can just do a for loop through the files and delete each one after it is stored in a dataframe variable using DataFrame.append. In this case all my csv files have the same name except that they are numbered in a range that goes from 0 to 23.