I have a folder containing around 400 txt files. The maximum size of a txt file is 2 to 2.5 MB.
I am trying to convert these files to csv with Python code. My code works perfectly and quickly converts txt to csv when the txt files are small (even with more than 500 files), but when the files are a little heavier it takes quite a long time.
Well, it's obvious that heavy data takes longer, but the problem is that I have been running this conversion process for 2 days and not even 50% is completed.
Is there any way to convert these txt files to csv quickly? I mean within a few hours.
If it takes more than 2 days then I will not have enough time to analyze it.
My code is here:
import os, os.path, glob
import numpy as np
import matplotlib.pyplot as plt
from natsort import natsorted
import pandas as pd
from matplotlib.patches import Ellipse
from matplotlib.text import OffsetFrom
from mpl_toolkits.mplot3d import Axes3D
from random import random
data_folder = "./all/"

files = natsorted(glob.glob(data_folder + 'dump*.data'))
number_of_files = len(files)
#print(number_of_files)
#files

# use one file to extract the header names
with open("./all/dump80000.data") as f:
    lines = f.readlines()

# removing 'ITEM:'
s = 'ITEM: ATOMS '
lines[8] = lines[8].replace(s, '')

# getting the header names
headers = lines[8].split()
headers.append('TIMESTEP')

df = pd.DataFrame(columns=headers)

counter = 0
for total_files in range(number_of_files):
    with open(files[total_files]) as f:
        lines = f.readlines()
    total_atoms = int(lines[3])
    for i in range(total_atoms):
        row_elements = lines[9 + i].split()
        row_elements.append(int(lines[1]))
        df.loc[counter] = row_elements
        counter = counter + 1

df.to_csv(r'all.csv', index=False)
Any idea or suggestion?
Thank you.
In case you need a sample txt file:
https://raw.githubusercontent.com/Laudarisd/dump46000.data
or
https://raw.githubusercontent.com/Laudarisd/test/main/dump46000.data
How about using a simple readline? I suspect readlines and/or pd.DataFrame are consuming most of the time. The following seems to be fast enough for me.
import glob
import time

start = time.time()

data_folder = "./all/"
files = glob.glob(data_folder + 'dump*.data')

# get header from one of the files
with open('all/dump46000.data', 'r') as f:
    for _ in range(8):
        next(f)  # skip first 8 lines
    header = ','.join(f.readline().split()[2:]) + '\n'

for file in files:
    with open(file, 'r') as f, open('all.csv', 'a') as g:  # note the 'a'
        g.write(header)  # write the header
        for _ in range(9):
            next(f)  # skip first 9 lines
        for line in f:
            g.write(line.rstrip().replace(' ', ',') + '\n')

print(time.time() - start)
# id,type,x,y,z,vx,vy,vz,fx,fy,fz
# 201,1,0.00933075,-0.195667,1.53332,-0.000170702,-0.000265168,0.000185569,0.00852572,-0.00882728,-0.0344813
# 623,1,-0.101572,-0.159675,1.52102,-0.000125008,-0.000129469,6.1561e-05,0.0143586,-0.0020444,-0.0400259
# 851,1,-0.0654623,-0.176443,1.52014,-0.00017815,-0.000224676,0.000329338,0.0101743,0.00116504,-0.0344114
# 159,1,-0.0268728,-0.186269,1.51979,-0.000262947,-0.000386994,0.000254515,0.00961213,-0.00640215,-0.0397847
Taking a quick glance at your code, it seems you're taking the following approach to convert a file:
Open the file
Read the entire file into a buffer
Process the buffer
However, you can make some small adjustments to your code:
Open the file
Read one line
Process the line
Continue until the file is done
Basically, take an iterative approach instead of reading the whole file all at once. You can then make it even faster with asyncio, processing all your files concurrently.
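As a rough sketch of that idea (not your exact method): the snippet below streams each file line by line and fans the conversion out over a thread pool with concurrent.futures; asyncio could be used in the same spirit. The ./all/dump*.data pattern, the 9 skipped header lines, and the per-file output names are assumptions taken from the question, not verified against your data.

import glob
from concurrent.futures import ThreadPoolExecutor

def convert_one(path):
    # write <path>.csv next to the input file, processing one line at a time
    out_path = path + '.csv'
    with open(path) as f, open(out_path, 'w') as g:
        for _ in range(9):   # skip the 9 header lines (assumed layout)
            next(f)
        for line in f:       # iterate; never load the whole file into memory
            g.write(line.rstrip().replace(' ', ',') + '\n')
    return out_path

if __name__ == '__main__':
    files = glob.glob('./all/dump*.data')
    with ThreadPoolExecutor(max_workers=8) as pool:
        for done in pool.map(convert_one, files):
            print('wrote', done)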
It's hard to give precise help without knowing exactly what data you want to extract from those files, but from a first glance you should definitely use one of pandas' built-in file reading methods, which are guaranteed to be many times faster than your code. Assuming you wish to skip the first 9 rows, you could do something like:
headers = ["a", "b", ...]
pd.read_csv("./all/dump80000.data", skiprows=9, sep=" ", names=headers)
If this is still not fast enough, you can parallelize your code, since most of the processing is just loading data into memory.
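As an illustration of that parallelization (a sketch only): the column names below are taken from the sample output in this thread, while the ./all/dump*.data pattern, the multiprocessing pool, and the all.csv output name are assumptions.

import glob
from multiprocessing import Pool
import pandas as pd

HEADERS = ["id", "type", "x", "y", "z", "vx", "vy", "vz", "fx", "fy", "fz"]

def read_one(path):
    # each worker parses one dump file with pandas' fast reader
    # (use sep=r"\s+" instead if the files contain repeated or trailing spaces)
    return pd.read_csv(path, skiprows=9, sep=" ", names=HEADERS)

if __name__ == "__main__":
    files = glob.glob("./all/dump*.data")
    with Pool() as pool:
        frames = pool.map(read_one, files)
    pd.concat(frames, ignore_index=True).to_csv("all.csv", index=False)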
I recommend breaking the problem down into distinct steps for a few files, then once you're sure you understand how to correctly code each step independently, you can think about combining them:
convert all TXT to CSVs
process each CSV doing what you need
Here's how to do step 1:
import csv

out_f = open('output.csv', 'w', newline='')
writer = csv.writer(out_f)

in_f = open('input.txt')

# Consume first 8 lines you don't want
for _ in range(8):
    next(in_f)

# Get and fix-up your header
header = next(in_f).replace('ITEM: ATOMS ', '')
writer.writerow(header.split())

# Read the rest of the file line-by-line, splitting by space,
# which will make a row that the CSV writer can write
for line in in_f:
    row = line.split()
    writer.writerow(row)

in_f.close()
out_f.close()
When I ran that against your sample .data file, I got:
id,type,x,y,z,vx,vy,vz,fx,fy,fz
201,1,0.00933075,-0.195667,1.53332,-0.000170702,-0.000265168,0.000185569,0.00852572,-0.00882728,-0.0344813
623,1,-0.101572,-0.159675,1.52102,-0.000125008,-0.000129469,6.1561e-05,0.0143586,-0.0020444,-0.0400259
851,1,-0.0654623,-0.176443,1.52014,-0.00017815,-0.000224676,0.000329338,0.0101743,0.00116504,-0.0344114
...
Do that for all 400 TXT files, then write another script to process the resulting CSVs.
I'm on an M1 MacBook Air with a good, fast SSD. Converting that one .data file takes less than 0.1 seconds. Unless you've got a really slow disk, I cannot see both steps taking more than an hour.
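If it helps, here is a rough sketch of step 1 wrapped in a function and run over every file; the ./all/dump*.data pattern and the _converted.csv output names are assumptions, not something from the question.

import csv
import glob
import os

def convert(in_path, out_path):
    # same logic as above: skip 8 lines, clean up the header, stream the rest
    with open(in_path) as in_f, open(out_path, 'w', newline='') as out_f:
        writer = csv.writer(out_f)
        for _ in range(8):
            next(in_f)
        header = next(in_f).replace('ITEM: ATOMS ', '')
        writer.writerow(header.split())
        for line in in_f:
            writer.writerow(line.split())

for in_path in glob.glob('./all/dump*.data'):
    out_path = os.path.splitext(in_path)[0] + '_converted.csv'
    convert(in_path, out_path)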
I am trying to implement a nested "for" loop for searching CSV files, in such a way that a 'name' found in one CSV file is searched for in the other file. Here is a code example:
import csv
import re

# Open the input files
with open("Authentication.csv", "r") as citiAuthen:
    with open("Authorization.csv", "r") as citiAuthor:
        # Set up the CSV readers and process the headers
        csvAuthen = csv.reader(citiAuthen, quoting=csv.QUOTE_ALL, skipinitialspace=True)
        headerAuthen = next(csvAuthen)
        userIndex = headerAuthen.index("'USERNAME'")
        statusIndex = headerAuthen.index("'STATUS'")

        csvAuthor = csv.reader(citiAuthor)
        headerAuthor = next(csvAuthor)
        userAuthorIndex = headerAuthor.index("'USERNAME'")
        iseAuthorIndex = headerAuthor.index("'ISE_NODE'")

        # Make an empty list
        userList = []
        usrNumber = 0

        # Loop through the authen file and build a list of users
        for row in csvAuthen:
            user = row[userIndex]
            #status = row[statusIndex]
            #if status == "'Pass'":
            for rowAuthor in csvAuthor:
                userAuthor = rowAuthor[userAuthorIndex]
                print(userAuthor)
What is happening is that "print(userAuthor)" makes just one pass, while it should make as many passes as there are rows in csvAuthen.
What am I doing wrong? Any help is really appreciated.
You're reading both files line by line from storage. When you search csvAuthor the first time, if the value you are searching for is not found, the file pointer remains at the end of the file after the search. The next search will start at the end of the file and return immediately. You would need to reset the file pointer to the beginning of the file before each search. It is probably better to just read both files into memory before you start searching them.
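A minimal sketch of the "read into memory first" approach, reusing the variable names from your snippet (the quoting and column details are assumptions carried over from it):

import csv

with open("Authentication.csv") as citiAuthen, open("Authorization.csv") as citiAuthor:
    csvAuthen = csv.reader(citiAuthen, quoting=csv.QUOTE_ALL, skipinitialspace=True)
    headerAuthen = next(csvAuthen)
    userIndex = headerAuthen.index("'USERNAME'")

    csvAuthor = csv.reader(citiAuthor)
    headerAuthor = next(csvAuthor)
    userAuthorIndex = headerAuthor.index("'USERNAME'")

    # materialize the authorization rows once, so they can be scanned repeatedly
    authorRows = list(csvAuthor)

    for row in csvAuthen:
        user = row[userIndex]
        for rowAuthor in authorRows:   # a real list now, not a one-shot iterator
            if rowAuthor[userAuthorIndex] == user:
                print(user)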
I am a beginner in the programming world and would like some tips on how to solve a challenge.
Right now I have ~10,000 .dat files, each with a single line following this structure:
Attribute1=Value&Attribute2=Value&Attribute3=Value...AttributeN=Value
I have been trying to use Python and the csv library to convert these .dat files into a single .csv file.
So far I was able to write something that reads all the files, stores the contents of each file on a new line and substitutes the "&" with ",", but since Attribute1, Attribute2...AttributeN are exactly the same for every file, I would like to make them into column headers and remove them from every other line.
Any tips on how to go about that?
Thank you!
Since you are a beginner, I prepared some code that works and is at the same time very easy to understand.
I assume that you have all the files in a folder called 'input'. The code beneath should be in a script file next to that folder.
Keep in mind that this code should be used to understand how a problem like this can be solved. Optimisations and sanity checks have been left out intentionally.
You might additionally want to check what happens when a value is missing in some line, what happens when an attribute is missing, what happens with corrupted input, and so on. :)
Good luck!
import os

# this function splits attribute=value pairs into two lists:
# the first list holds all the attributes,
# the second list holds all the values
def getAttributesAndValues(line):
    attributes = []
    values = []
    # first we split the input over the &
    attributeValues = line.split('&')
    for attrVal in attributeValues:
        # we split attribute=value over the '=' sign
        # the left part goes to split[0], the value goes to split[1]
        split = attrVal.split('=')
        attributes.append(split[0])
        values.append(split[1])
    # return the attributes list and values list
    return attributes, values

# test the function using the line beneath so you understand how it works
# line = "Attribute1=Value&Attribute2=Value&Attribute3=Value&AttributeN=Value"
# print(getAttributesAndValues(line))

# this function appends a single input file to the output csv file
def writeToCsv(inFile='', wfile="outFile.csv", delim=","):
    # write the header only if the output file does not exist yet or is still empty
    writeHeader = not os.path.exists(wfile) or os.path.getsize(wfile) == 0
    f_in = open(inFile, 'r')   # only reading the input file
    f_out = open(wfile, 'a')   # output file is opened for appending
    # read the whole input file line by line
    lines = f_in.readlines()
    # loop through every line in the file and write its values
    for line in lines:
        header, values = getAttributesAndValues(line)
        # we write the header only once, while the output file is still empty
        if writeHeader:
            f_out.write(delim.join(header) + "\n")
            writeHeader = False
        # we write the values
        f_out.write(delim.join(values) + "\n")
    f_in.close()
    f_out.close()

# read all the file names in the input folder
allInputFiles = os.listdir('input/')

# loop through all the files and write their values to the csv file
for singleFile in allInputFiles:
    writeToCsv('input/' + singleFile)
but since the Attribute1,Attribute2...AttributeN are exactly the same
for every file, I would like to make them into column headers and
remove them from every other line.
line = 'Attribute1=Value1&Attribute2=Value2&Attribute3=Value3'
Once, for the first file:
','.join(k for (k, v) in map(lambda s: s.split('='), line.split('&')))
For each file's content:
','.join(v for (k, v) in map(lambda s: s.split('='), line.split('&')))
Maybe you need to trim the strings additionally; I don't know how clean your input is.
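A rough sketch of how those two expressions could be tied together over all the files; the 'input' folder name and the 'output.csv' file name are assumptions:

import glob

out_lines = []
for i, path in enumerate(sorted(glob.glob('input/*.dat'))):
    with open(path) as f:
        line = f.read().strip()
    pairs = [s.split('=') for s in line.split('&')]
    if i == 0:
        # header row once, taken from the first file
        out_lines.append(','.join(k for k, v in pairs))
    # value row for every file
    out_lines.append(','.join(v for k, v in pairs))

with open('output.csv', 'w') as out:
    out.write('\n'.join(out_lines) + '\n')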
Put the dat files in a folder called myDats. Put this script next to the myDats folder along with a file called temp.txt. You will also need your output.csv. [That is, you will have output.csv, myDats, and mergeDats.py in the same folder]
mergeDats.py
import csv
import os

g = open("temp.txt", "w")
for file in os.listdir('myDats'):
    f = open("myDats/" + file, "r")
    tempData = f.readlines()[0]
    tempData = tempData.replace("&", "\n")
    g.write(tempData + "\n")
    f.close()
g.close()

h = open("temp.txt", "r")
arr = h.read().split("\n")
h.close()

my_dict = {}
for x in arr:
    if "=" not in x:
        continue  # skip blank lines
    temp2 = x.split("=")
    my_dict[temp2[0]] = temp2[1]

with open('output.csv', 'w', newline='') as output:  # use 'wb' in Python 2.x
    w = csv.DictWriter(output, my_dict.keys())
    w.writeheader()
    w.writerow(my_dict)
I am starting out in Python, and I am looking at csv files.
Basically my situation is this:
I have coordinates X, Y, Z in a csv.
X Y Z
1 1 1
2 2 2
3 3 3
and I want to go through, add a user-defined offset value to all Z values, and make a new file with the edited Z values.
Here is my code so far, which I think is right:
import csv

# list of lists we store all data in
allCoords = []

# get offset from user
offset = int(input("Enter an offset value: "))

# read all values into memory
with open('in.csv', 'r') as inFile:  # input csv file
    reader = csv.reader(inFile, delimiter=',')
    for row in reader:
        # do not add the first row to the list
        if row[0] != "X":
            # create a new coord list
            coord = []
            # get a row and put it into new list
            coord.append(int(row[0]))
            coord.append(int(row[1]))
            coord.append(int(row[2]) + offset)
            # add list to list of lists
            allCoords.append(coord)

# write all values into new csv file
with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    firstRow = ['X', 'Y', 'Z']
    allCoords.insert(0, firstRow)
    writer.writerows(allCoords)
But now comes the hard part. How would I go about going through a bunch of csv files (in the same location) and producing a new file for each of them?
I am hoping to have something like this: "filename.csv" turns into "filename_offset.csv", using the original file name as a starter for the new filename and appending "_offset" to the end.
I think I need to use "os." functions, but I am not sure how, so any explanation would be much appreciated along with the code! :)
Sorry if I didn't make much sense, let me know if I need to explain more clearly. :)
Thanks a bunch! :)
shutil.copy2(src, dst)
Similar to shutil.copy(), but metadata is copied as well.
shutil
The glob module finds all the pathnames matching a specified pattern
according to the rules used by the Unix shell. No tilde expansion is
done, but *, ?, and character ranges expressed with [] will be correctly matched.
glob
import os
import glob
from shutil import copy2
import shutil

csv_dir = 'csv_DIR'  # full path of the folder that holds the csv files
files = glob.glob(os.path.join(csv_dir, '*.csv'))
for file in files:
    try:
        # glob already returns the path including csv_dir,
        # so build the new name directly from it
        base, ext = os.path.splitext(file)
        newName = base + '_offset' + ext
        copy2(file, newName)
    except shutil.Error as e:
        print('Error: {}'.format(e))
BTW, you can write ...
for row in reader:
    if row[0] == "X":
        break
for row in reader:
    coord = []
    ...
... instead of ...
for row in reader:
    if row[0] != "X":
        coord = []
        ...
This stops checking for 'X'es after the first line.
It works because you are not working with a real list here but with a self-consuming iterator, which you can stop and restart.
See also: Detecting if an iterator will be consumed.
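A tiny illustration of that behaviour (not part of the answer's code): a single iterator picks up exactly where the previous loop stopped.

it = iter([1, 2, 3, 4, 5])
# the first loop consumes items until it breaks
for x in it:
    if x == 3:
        break
# the second loop resumes right after the break
for x in it:
    print(x)   # prints 4, then 5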
I am a complete newb at this and I have the following script.
It writes some random data to a .csv file. My end goal is to keep this preexisting .csv but add ONE randomly generated data point to the beginning of it in a separate Python script.
Completely new at this -- not sure how to go about doing it. Thanks for your help.
output = [a, b]
d = csv.writer(csvfile, delimiter=',', quotechar='|',
               quoting=csv.QUOTE_MINIMAL)
d.writerow(output)
Are you sure you are trying to add it to the start of the file? I feel like you would want to add it to the end, or, if you did want to add it at the beginning, you would at least want to put it after the header row, which is ['name', 'value'].
As it stands, your current script has several errors when I try to run it myself, so I can help you out a bit there.
The directory string doesn't work because of the slashes. It will work if you add an r in front (for a raw string), like so: r'C:/Users/AMB/Documents/Aptana Studio 3 Workspace/RAVE/RAVE/resources/csv/temperature.csv'
You don't need to import json or logging if this is the entirety of your code.
Inside of your for loop you redefine the temperature writer, which is unnecessary; your definition at the start is good enough.
You have an extra comma in the line output = [timeperiod, temp,]
Moving on to a script that inserts a single data point. This script reads in your existing file and inserts a new line (you would use random values; I used 1 for time and 2 for value) on the second line, beneath the header. Let me know if this isn't what you are looking for.
directory = r"C:/Users/AMB/Documents/Aptana Studio 3 Workspace/RAVE/RAVE/resources/csv/temperature.csv"

with open(directory, 'r') as csvfile:
    s = csvfile.readlines()

time = 1
value = 2
s.insert(2, '%(time)d,%(value)d\n\n' % {'time': time, "value": value})

with open(directory, 'w') as csvfile:
    csvfile.writelines(s)
This next section is in response to your more detailed question in the comments:
import csv
import random

directory = r"C:\Users\snorwood\Desktop\temperature.csv"

# Open the file
with open(directory, 'r') as csvfile:
    s = csvfile.readlines()

# This array will store your data
data = []

# This for loop converts the data read from the text file into integer values in your data set
for i, point in enumerate(s[1:]):
    seperatedPoint = point.strip("\n").split(",")
    if len(seperatedPoint) == 2:
        data.append([int(dataPoint) for dataPoint in seperatedPoint])

# Loop through your animation numberOfLoops times
numberOfLoops = 100
for i in range(numberOfLoops):
    if len(data) == 0:
        break
    del data[0]  # Deletes the first data point
    newTime = data[len(data) - 1][0] + 1  # An int that is one higher than the current last time value
    newRandomValue = 2
    data.append([newTime, newRandomValue])  # Adds the new data point to the end of the array

    # Insert your drawing code here

# Write the data back into the text file
with open(directory, 'w', newline='') as csvfile:  # opens the file for writing
    temperature = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)  # The object that knows how to write to files
    temperature.writerow(["name", "values"])  # Write the header row
    for point in data:  # Loop through the points stored in data
        temperature.writerow(point)  # Write current point in set