I have a script that goes through all the files within a given path ('C:\Users\Razvi\Desktop\italia'), reads the number of lines from each file found with the name "cnt.csv", and then writes into another file named "counters.csv" the current date + the name of the folder + the sum of lines found in the "cnt.csv".
Right now the output file ("counters.csv") looks like this:
30/9/2017
8dg5 5
8dg7 7
01/10/2017
8dg5 8
8dg7 10
The names 8dg5 and 8dg7 are the folders where the script found the file "cnt.csv", and the numbers 5, 7, 8, 10 are the sums of lines found in the csv files each time I run the script; obviously, the date when I run the script is appended every time.
Now what I want is to write those dates in the csv appended as columns, not rows, something like this:
30/9/2017 01/10/2017
8dg5 5 8dg5 8
8dg7 7 8dg7 10
Here is the code:
import re
import os
from datetime import datetime

#italia = r'C:\Users\timraion\Desktop\italia'
italia = r'C:\Users\Razvi\Desktop\italia'
file = open(r'C:\Users\Razvi\Desktop\counters.csv', 'a')
now = datetime.now()
dd = str(now.day)
mm = str(now.month)
yyyy = str(now.year)
date = (dd + "/" + mm + "/" + yyyy)
file.write(date + '\n')
for filename in os.listdir(italia):
    infilename = os.path.join(italia, filename)
    for files in os.listdir(infilename):
        if files.endswith("cnt.csv"):
            result = os.path.join(infilename, "cnt.csv")
            print(result)
            infilename2 = os.path.join(infilename, files)
            lines = 0
            for line in open(infilename2):
                lines += 1
            file = open(r'C:\Users\Razvi\Desktop\counters.csv', 'a')
            file.write(str(filename) + "," + str(lines) + "\n")
file.close()
Thanks!
If you want to add the new entries as columns instead of rows further down, you'll have to read in the counters.csv file each time, append your columns, and rewrite it.
import os
from datetime import datetime
import csv

italia = r'C:\Users\Razvi\Desktop\italia'

counts = []
for filename in os.listdir(italia):
    infilename = os.path.join(italia, filename)
    for files in os.listdir(infilename):
        if files.endswith("cnt.csv"):
            result = os.path.join(infilename, "cnt.csv")
            print(result)
            infilename2 = os.path.join(infilename, files)
            lines = 0
            for line in open(infilename2):
                lines += 1
            counts.append((filename, lines))  # save the folder name and count for later

if os.path.exists(r'C:\Users\Razvi\Desktop\counters.csv'):
    with open(r'C:\Users\Razvi\Desktop\counters.csv', 'r') as csvfile:
        counters = [row for row in csv.reader(csvfile)]  # read the existing csv file
else:
    counters = [[]]

now = datetime.now()
# Add the date and a blank cell to the first row
counters[0].append('{}/{}/{}'.format(now.day, now.month, now.year))
counters[0].append('')

for i, count in enumerate(counts):
    if i + 1 == len(counters):  # check if we need to add a row
        counters.append([''] * (len(counters[0]) - 2))
    while len(counters[i+1]) < len(counters[0]) - 2:
        counters[i+1].append('')  # ensure there are enough columns in the row we're on
    counters[i+1].extend(count)  # add the count info as two new cells

with open(r'C:\Users\Razvi\Desktop\counters.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)  # write the new csv file
    for row in counters:
        writer.writerow(row)
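For illustration, using the dates from the question, after two runs the rewritten counters.csv would look roughly like:

30/9/2017,,01/10/2017,
8dg5,5,8dg5,8
8dg7,7,8dg7,10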
Another note: file is a builtin in Python 2 (and a commonly shadowed name in general), so you shouldn't use it as a variable name, since unexpected things might happen further down in your code.
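A minimal illustration of the hazard, assuming Python 2 where file is a builtin type:

file = open('counters.csv', 'a')  # rebinds the builtin name `file` to a file object
# ... later in the script ...
f = file('other.csv')             # TypeError: 'file' object is not callable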
I have multiple txt files, and each of these txt files has 6 columns. What I want to do: add just one column as the last column, so at the end each txt file has at most 7 columns, and if I run the script again it shouldn't add a new one.
At the beginning each file has six columns:
637.39 718.53 155.23 -0.51369 -0.18539 0.057838 3.209840789730089
636.56 720 155.57 -0.51566 -0.18487 0.056735 3.3520643559939938
635.72 721.52 155.95 -0.51933 -0.18496 0.056504 3.4997850701290125
What I want is to add a new column of zeros only if the current number of columns is 6; after that, it should not add another column when I run the script again (7 columns is the total, where the last one is zeros):
637.39 718.53 155.23 -0.51369 -0.18539 0.057838 3.209840789730089 0
636.56 720 155.57 -0.51566 -0.18487 0.056735 3.3520643559939938 0
635.72 721.52 155.95 -0.51933 -0.18496 0.056504 3.4997850701290125 0
My code works, but it adds one additional column each time I run the script; I want it to add the column just once, when the number of columns is 6. Here (a) gives me the number of columns, and if the condition is fulfilled a new column is added:
from glob import glob
import numpy as np

new_column = [0] * 20

def get_new_line(t):
    l, c = t
    return '{} {}\n'.format(l.rstrip(), c)

def writecolumn(filepath):
    # Load data from file
    with open(filepath) as datafile:
        lines = datafile.readlines()
    a = np.loadtxt(lines, dtype='str').shape[1]
    print(a)
    #if a==6:   (here is the problem)
    n, r = divmod(len(lines), len(new_column))
    column = new_column * n + new_column[:r]
    new_lines = list(map(get_new_line, zip(lines, column)))
    with open(filepath, "w") as f:
        f.writelines(new_lines)

if __name__ == "__main__":
    filepaths = glob("/home/experiment/*.txt")
    for path in filepaths:
        writecolumn(path)
When I check the number of columns with #if a==6 and shift the content inside the if statement, I get an error. Without shifting the content inside the if, everything works fine, but it still adds one column each time I run it.
Any help is appreciated.
To test the code, create one or two txt files with six columns of random numbers.
Could be an indentation problem, i.e. the block below the if: writing the new lines should be indented properly --
This works --
def writecolumn(filepath):
    # Load data from file
    with open(filepath) as datafile:
        lines = datafile.readlines()
    a = np.loadtxt(lines, dtype='str').shape[1]
    print(a)
    if int(a) == 6:
        n, r = divmod(len(lines), len(new_column))
        column = new_column * n + new_column[:r]
        new_lines = list(map(get_new_line, zip(lines, column)))
        with open(filepath, "w") as f:
            f.writelines(new_lines)
Use pandas to read your text file:
import pandas as pd
df = pd.read_csv("whitespace.csv", header=None, delimiter=" ")
Add a column or more as needed
df['somecolname'] = 0
Save DataFrame with no header.
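Putting those three steps together, a minimal sketch, assuming whitespace-delimited files as in the question (whitespace.csv is just a stand-in name):

import pandas as pd

# Read the whitespace-delimited file; header=None because there is no header row
df = pd.read_csv("whitespace.csv", header=None, delimiter=" ")

# Only append the zero column if the file still has six columns
if df.shape[1] == 6:
    df[6] = 0

# Write back space-separated, with no header row and no index column
df.to_csv("whitespace.csv", sep=" ", header=False, index=False)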
I have a CSV file with, let's say, 16000 rows. I need to split it into two separate files, but I also need an overlap of about 360 rows between the files, so rows 1-8360 in one file and rows 8000-16000 in the other. Or 1-8000 and 7640-16000.
The CSV file looks like this:
Value X Y Z
4.5234 -46.29753186 -440.4915915 -6291.285393
4.5261 -30.89639381 -441.8390165 -6291.285393
4.5289 -15.45761327 -442.6481287 -6291.285393
4.5318 0 -442.9179423 -6291.285393
I have used this code in Python 3 to split the file, but I'm unable to get the overlap I want:
with open('myfile.csv', 'r') as f:
    csvfile = f.readlines()

linesPerFile = 8000
filename = 1
for i in range(0, len(csvfile), linesPerFile):
    with open(str(filename) + '.csv', 'w+') as f:
        if filename > 1:  # this is the second or later file, we need to write the
            f.write(csvfile[0])  # header again if 2nd.... file
        f.writelines(csvfile[i:i+linesPerFile])
    filename += 1
And I tried to modify it like this:
for i in range(0,len(csvfile),linesPerFile+360):
and
f.writelines(csvfile[360-i:i+linesPerFile])
but I haven't been able to make it work.
It's very easy with pandas' read_csv and iloc.
import pandas as pd
import numpy as np

# df = pd.read_csv('source_file.csv')
df = pd.DataFrame(data=np.random.randn(16000, 5))  # stand-in for the real data

df.iloc[:8360].to_csv('file_1.csv')
df.iloc[8000:].to_csv('file_2.csv')
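One caveat: to_csv writes the DataFrame index as an extra first column by default, so if the goal is a plain split of the original rows, pass index=False:

df.iloc[:8360].to_csv('file_1.csv', index=False)
df.iloc[8000:].to_csv('file_2.csv', index=False)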
Hopefully you've got a more elegant answer using pandas above. You could consider the version below if you don't want to install extra modules.
def write_files(input_file, file1, file2, file1_end_line_no, file2_start_line_no):
    # Open all 3 file handles
    with open(input_file) as csv_in, open(file1, 'w') as ff, open(file2, 'w') as sf:
        # Process the header line
        header = next(csv_in)
        header = ','.join(header.split())
        ff.write(header + '\n')
        sf.write(header + '\n')
        for index, line in enumerate(csv_in):
            # 4.5234 -46.29753186 -440.4915915 -6291.285393 => 4.5234,-46.29753186,-440.4915915,-6291.285393
            line_content = ','.join(line.split())
            if index <= file1_end_line_no:  # line belongs in the first file
                ff.write(line_content + '\n')
            if index >= file2_start_line_no:  # line belongs in the second file; rows in both ranges form the overlap
                sf.write(line_content + '\n')
Sample run:
if __name__ == '__main__':
    in_file = 'csvfile.csv'
    write_files(
        in_file,
        '1.txt',
        '2.txt',
        2,
        2
    )
What about this?
for i in range(0, len(csvfile), linesPerFile):
    init = i
    with open(str(filename) + '.csv', 'w+') as f:
        if filename > 1:  # this is the second or later file, we need to write the
            f.write(csvfile[0])  # header again if 2nd.... file
            init = i - 360  # start 360 rows earlier to create the overlap
        f.writelines(csvfile[init:i+linesPerFile+1])
    filename += 1
Is this what you are looking for? Please upload a test file if it doesn't work, so we can provide a better answer :-)
I am trying to loop over some files and skip the rows before the header in each file using pandas. All of the files are in the same data format, except some have a different number of rows to skip before the header. Is there a way to loop over the files and start at the header of each file when some have more rows to skip than others?
For example,
some files require this:
f = pd.read_csv(fname, skiprows=7, parse_dates=[0])
And some require this:
f = pd.read_csv(fname, skiprows=15, parse_dates=[0])
Here is my chunk of code looping over my files:
for name, ID in stations:
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        print(fname)
        f = pd.read_csv(fname, skiprows=15, parse_dates=[0])  # could also skip 7 depending on file
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert from km/h to m/s
        dt = f['Date/Time']
One way is to read your file using pure Python I/O to extract the index of the header line, then feed this into the skiprows argument of pd.read_csv.
This is fairly efficient since the first step uses a generator expression which reads only until the desired row is reached.
from io import StringIO
import pandas as pd

data = """dasfaf
kgafsda
Date/Time,num1,num2
2018-01-01,0,1
2018-01-02,2,3
"""

# replace StringIO(data) with open('file.csv', 'r')
with StringIO(data) as fin:
    idx = next(i for i, j in enumerate(fin) if j.startswith('Date/Time'))

# replace StringIO(data) with 'file.csv'
df = pd.read_csv(StringIO(data), skiprows=idx, parse_dates=[0])
print(df)
   Date/Time  num1  num2
0 2018-01-01     0     1
1 2018-01-02     2     3
Wrap this in a function if you need to repeat the task:
def calc_skiprows(fname):
    with open(fname) as fin:
        return next(i for i, j in enumerate(fin) if j.startswith('Date/Time'))

df = pd.read_csv(fname, skiprows=calc_skiprows(fname), parse_dates=[0])
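A hedged usage sketch, dropping the helper into the loop from the question (stations and glob as defined there):

for name, ID in stations:
    for fname in glob.glob(str(ID) + '/*.csv'):
        f = pd.read_csv(fname, skiprows=calc_skiprows(fname), parse_dates=[0])
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert from km/h to m/s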
The first suggestion/answer seemed like a really good way to handle it, but I couldn't get it to work for me for some reason. I did find another way to fix my problem using try/except in Python:
for name, ID in stations:
    # read in each station's .csv files, concatenate together, insert station id column
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        print(fname)
        try:
            f = pd.read_csv(fname, skiprows=7, parse_dates=[0])
        except:
            f = pd.read_csv(fname, skiprows=15, parse_dates=[0])
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert from km/h to m/s
        dt = f['Date/Time']
This way, if the first attempt to read in the file fails (skipping 7 rows), it tries again using the other read_csv line (skipping 15 rows). This is not 100% correct since I am still hardcoding the number of lines to skip, but it works for my needs right now.
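A hedged refinement on the same idea: a bare except also swallows unrelated failures (missing files, permission errors), so it is safer to catch only the exceptions a wrong skiprows is likely to raise, assuming here that the failure surfaces as a parse or value error:

try:
    f = pd.read_csv(fname, skiprows=7, parse_dates=[0])
except (ValueError, pd.errors.ParserError):
    f = pd.read_csv(fname, skiprows=15, parse_dates=[0])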
I have files in a folder which are named according to the date and time they were recorded, for example: Test_20150925_181323.data [here in the file name, 20150925 is the date (2015/09/25) and 181323 is the time (18 hr 13 min 23 sec)]. Likewise I have more than 20 files.
I want to do the following:
Read the files according to increasing order of date and time.
From each file take the values between lines 11 and 21 (these are values recorded in two columns)
Put those values in two arrays, say timevalues=[] and yvalues=[].
Then read the next file, do the same, and append the values between lines 11 and 21 to timevalues and yvalues.
Finally I should have two arrays, timevalues and yvalues, in which the data (between lines 11 and 21 from each file) is appended according to the time it was recorded.
My attempt:
import numpy as np
import re, os
import pandas as pd
from os import walk

path = r'C:\Users\Data1\\'

for data_file in sorted(os.listdir(path)):
    print data_file
    times = []
    yvals = []
    for line in data_file.readlines()[11:21]:  # read lines from 11 to 21
        column = line.split('\t')
        times.append(column[0])
        yvals.append(column[1])
    #print times
    #print yvals
This always gives the error message:
for line in data_file.readlines()[11:21]: # read lines from 11 to 21
AttributeError: 'str' object has no attribute 'readlines'.
Also, I am not sure if this is the right way to read the files according to the time in their filenames.
os.listdir() doesn't return the full paths of files, so you need to join the directory path to each file.
You're trying to .readlines() from your filename string. You need to open() the file to use .readlines().
If all your files follow Test_YYYYMMDD_HHMMSS.data format they will sort properly without any datetime work needed.
The fileinput module makes line counting convenient and handles opening the files sequentially:
import os
import fileinput

path = r'C:\Users\Data'
files = [os.path.join(path, data_file) for data_file in sorted(os.listdir(path))]

times = []
yvals = []
f = fileinput.input(files=files)
for line in f:
    if 11 <= f.filelineno() <= 21:  # filelineno() is 1-based and resets for each new file
        columns = line.split('\t')
        times.append(columns[0])
        yvals.append(columns[1])
f.close()
This is assuming your for data_file in sorted(os.listdir(path)): is working properly. What happens next is that each file is opened, and within that loop you go through lines 11:21.
path = r'C:\Users\Data1\\'
times = []
yvals = []
for data_file in sorted(os.listdir(path)):
    with open(os.path.join(path, data_file), 'r') as f:  # join the directory path to the bare filename
        for line in f.readlines()[11:21]:  # read lines from 11 to 21
            column = line.split('\t')
            times.append(column[0])
            yvals.append(column[1])
import os
from datetime import datetime

def extract_datetime(filename):
    datetime_string = filename.replace('Test_', '')
    datetime_string = datetime_string.replace('.data', '')
    return datetime.strptime(datetime_string, '%Y%m%d_%H%M%S')

# Sort files by the date/time embedded in the filename
log_files = os.listdir(path)
log_files.sort(key=extract_datetime)

for f in log_files:
    with open(os.path.join(path, f), 'r') as infile:
        lines = infile.readlines()[11:21]
        # The rest of your original code here
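To connect it back to the original goal, a sketch of "the rest" under the same assumptions (tab-separated columns, values wanted from lines 11 to 21):

times = []
yvals = []
for f in log_files:
    with open(os.path.join(path, f), 'r') as infile:
        for line in infile.readlines()[11:21]:
            column = line.split('\t')
            times.append(column[0])
            yvals.append(column[1])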
I have a folder with four CSV files. In each CSV there are animals and a number of occurrences for each animal. I'm trying to create a CSV that gathers up the information from all the CSVs in the folder, removes duplicates, and adds a third column listing the original file(s) each animal was found in. For example: lion,4,'file2, file4'
I would really like my new CSV to have a third column that lists which files contain each animal, but I can't figure it out. I tried doing it with a second dictionary - refer to the lines with locationCount.
The current script I am using is below.
The files I have:
file1.csv:
cat,1
dog,2
bird,1
rat,3
file2.csv:
bear,1
lion,1
goat,1
pig,1
file3.csv:
rat,1
bear,1
mouse,1
cat,1
file4.csv:
elephant,1
tiger,2
dog,1
lion,3
Current script:
import glob
import os
import csv, pdb

listCSV = glob.glob('*.csv')

masterCount = {}
locationCount = {}
for i in listCSV:  # iterate over each csv
    filename = os.path.split(i)[1]  # filename for each csv
    with open(i, 'rb') as f:
        reader = csv.reader(f)
        location = []
        for row in reader:
            key = row[0]
            location.append(filename)
            masterCount[key] = masterCount.get(key, 0) + int(row[1])
            locationCount[key] = locationCount.get(key, location)

writer = csv.writer(open('MasterAnimalCount.csv', 'wb'))
for key, value in masterCount.items():
    writer.writerow([key, value])
You're almost right - handle the Locations in the same way as you handle the counts.
I've renamed and shuffled things around, but it's basically the same code structure. masterCount adds a number to the previous numbers, masterLocations adds a filename to a list of previous filenames.
from glob import glob
import os, csv

masterCount = {}
masterLocations = {}
for i in glob('*.csv'):
    filename = os.path.split(i)[1]
    for animal, count in csv.reader(open(i, newline='')):
        masterCount[animal] = masterCount.get(animal, 0) + int(count)
        masterLocations[animal] = masterLocations.get(animal, []) + [filename]

writer = csv.writer(open('MasterAnimalCount.csv', 'w', newline=''))
for animal in masterCount.keys():
    writer.writerow([animal, masterCount[animal], ', '.join(masterLocations[animal])])
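As a design note, the same accumulation pattern can be written with collections.Counter and collections.defaultdict, which avoids the .get() defaults; a sketch under the same file layout:

from collections import Counter, defaultdict
from glob import glob
import os, csv

masterCount = Counter()
masterLocations = defaultdict(list)

for i in glob('*.csv'):
    filename = os.path.split(i)[1]
    with open(i, newline='') as f:
        for animal, count in csv.reader(f):
            masterCount[animal] += int(count)
            masterLocations[animal].append(filename)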