Writing to CSV in Python - Rows inserted out of place

I have a function that writes a set of information as rows to a CSV file in Python. The function is supposed to append the new row to the file, but I am finding that it sometimes misbehaves and places the new row in a separate part of the CSV (please see picture as an example).
Whenever I reformat the data manually, I delete all of the empty cells again, just so you know.
Hoping someone can help, thanks!
def Logger():
    fileName = myDict[Sub]
    with open(fileName, 'a+', newline="") as file:
        writer = csv.writer(file)
        if file.tell() == 0:
            writer.writerow(["Date", "Asset", "Fear", "Anger", "Anticipation", "Trust", "Surprise", "Sadness", "Disgust", "Joy",
                             "Positivity", "Negativity"])
        writer.writerow([date, Sub, fear, anger, anticip, trust, surprise, sadness, disgust, joy, positivity, negativity])

At first I thought it was a simple matter of there not being a trailing newline, and the new row being appended on the same line, right after the last row, but I can see what looks like a row's worth of empty columns between them.
This whole appending thing looks tricky. If you don't have to use Python, and can use a command-line tool instead, I recommend GoCSV.
Here's a sample file I mocked up based on your screenshot:
base.csv
Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 1,5088,0.84,0.58,0.73,1.0,0.26,0.89,0.22,0.5,0.69,0.59
Nov 2,4580,0.0,0.88,0.7,0.71,0.57,0.78,0.2,0.22,0.21,0.17
Nov 3,2469,0.72,0.4,0.66,0.53,0.65,0.64,0.67,0.78,0.54,0.32,,,,,,,
I'm calling it base because it's the file that will be growing, and you can see it's got a problem on the last line: all those extra commas (I don't know how they got there 🤷🏻‍♂️).
The first step will be to clean it, and trim those pesky extra commas:
% gocsv clean base.csv > tmp
% mv tmp base.csv
and now base.csv looks like:
Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 1,5088,0.84,0.58,0.73,1.0,0.26,0.89,0.22,0.5,0.69,0.59
Nov 2,4580,0.0,0.88,0.7,0.71,0.57,0.78,0.2,0.22,0.21,0.17
Nov 3,2469,0.72,0.4,0.66,0.53,0.65,0.64,0.67,0.78,0.54,0.32
Here's another set of data to append, sample2.csv:
Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 4,6040,0.69,0.89,0.72,0.44,0.21,0.15,0.03,0.63,0.78,0.42
Nov 5,7726,0.72,0.12,0.95,0.6,0.88,0.1,0.43,1.0,1.0,0.68
Nov 6,9028,0.87,0.34,0.46,0.57,0.15,0.3,0.8,0.32,0.17,0.42
Nov 7,3544,0.16,0.9,0.37,0.8,0.67,0.0,0.11,0.72,0.93,0.35
GoCSV's stack command will do this job:
% gocsv stack base.csv sample2.csv > tmp
% mv tmp base.csv
and now base.csv looks like:
Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 1,5088,0.84,0.58,0.73,1.0,0.26,0.89,0.22,0.5,0.69,0.59
Nov 2,4580,0.0,0.88,0.7,0.71,0.57,0.78,0.2,0.22,0.21,0.17
Nov 3,2469,0.72,0.4,0.66,0.53,0.65,0.64,0.67,0.78,0.54,0.32
Nov 4,6040,0.69,0.89,0.72,0.44,0.21,0.15,0.03,0.63,0.78,0.42
Nov 5,7726,0.72,0.12,0.95,0.6,0.88,0.1,0.43,1.0,1.0,0.68
Nov 6,9028,0.87,0.34,0.46,0.57,0.15,0.3,0.8,0.32,0.17,0.42
Nov 7,3544,0.16,0.9,0.37,0.8,0.67,0.0,0.11,0.72,0.93,0.35
This can be scripted and simplified like this:
% gocsv clean base.csv > base
% gocsv clean sample2.csv > sample
% gocsv stack base sample > base.csv
% rm base sample
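If you do have to stay in Python, the same clean-and-stack pass is only a few lines with the csv module. Here's a minimal sketch, assuming the base.csv and sample2.csv names from above:
import csv

# Minimal clean-and-stack sketch: trim trailing empty fields from each row
# of base.csv, append the data rows of sample2.csv, and rewrite base.csv.
def clean(row):
    while row and row[-1] == '':
        row.pop()
    return row

with open('base.csv', newline='') as f:
    rows = [clean(row) for row in csv.reader(f)]

with open('sample2.csv', newline='') as f:
    new_rows = list(csv.reader(f))[1:]  # skip the duplicate header row

with open('base.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows + new_rows)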

Try this instead...
import csv

def Logger(col_one, col_two):
    fileName = 'data.csv'
    # newline='' keeps the csv module from inserting blank rows on Windows
    with open(fileName, 'a+', newline='') as file:
        writer = csv.writer(file)
        file.seek(0)
        if file.read().strip() == '':
            writer.writerow(["Date", "Asset"])
        writer.writerow([col_one, col_two])

Related

Get the filename and file path, get the line where the search string is found, and extract only the part that follows the search string on that line

Maybe I will directly explain with an example: I am writing my code in Python, and for the grep part I am also using bash commands.
I have a few files where I need to grep for some pattern, let's say "INFO".
Those files can be present in two different directory structures: type1 and type2
/home/user1/logs/MAIN_JOB/121/patching/a.log (type1)
/home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching/b.log (type2)
/home/user1/logs/MAIN_JOB/SUB_JOB1/142/DB:2/patching/c.log (type2)
Contents of the files:
a.log :
[Thu Jan 20 21:05:00 UTC 2022]: database1: INFO: Subject1: This is subject 1.
b.log :
[Thu Jan 22 18:01:00 UTC 2022]: database1: INFO: Subject2: This is subject 2.
c.log :
[Thu Jan 22 18:01:00 UTC 2022]: database1: ERR: Subject3: This is subject 3.
So I need to know in which of the files the "INFO" string is present. If present, I need to get the following:
filename : a.log / b.log
filepath : /home/user1/logs/MAIN_JOB/121/patching or /home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching
immediate string after the search string : Subject1 / Subject2
So I tried using the grep command with -r to find out which files contain "INFO":
$ grep -r INFO /home/user1/logs/MAIN_JOB
/home/user1/logs/MAIN_JOB/121/patching/a.log:[Thu Jan 20 21:05:00 UTC 2022]: database1: INFO: Subject1: This is subject 1.
/home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching/b.log:[Thu Jan 22 18:01:00 UTC 2022]: database1: INFO: Subject2: This is subject 2.
$
I will store the above grep output in a Python variable and need to extract those things from it.
I initially tried splitting the grep output on "\n", so I get two separate rows:
/home/user1/logs/MAIN_JOB/121/patching/a.log:[Thu Jan 20 21:05:00 UTC 2022]: database1: INFO: Subject1: This is subject 1.
/home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching/b.log:[Thu Jan 22 18:01:00 UTC 2022]: database1: INFO: Subject2: This is subject 2.
and by taking each row, I can split on ":".
First row: I am able to split properly, since ":" is in the right places.
file_with_path : /home/user1/logs/MAIN_JOB/121/patching/a.log (I can get the file name separately with os.path.basename(file_with_path))
immediate string after the search word : "Subject1"
Second row: this is where I need help. The path contains "DB:1", whose ":" breaks a naive split. If I split, I get:
file_with_path : /home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB (not correct)
It should actually be /home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching/b.log
I am unable to find a split that works properly for both cases.
Can you please help me with this? Any command that can do this in bash or Python would be very helpful.
Thank you in advance. Also let me know if more info is needed from me.
Here is my code:
# main dir
patch_log_home = '/home/user1/logs/MAIN_JOB'
cmd = "grep -r 'INFO' {0}"
patch_bug_inc = self._core.exec_os_cmd(cmd.format(patch_log_home))
# if no occurrence reported, return
if len(patch_bug_inc) == 0:
    return
if patch_bug_inc:
    patch_bug_inc = patch_bug_inc.split("\n")
    for inc in patch_bug_inc:
        print("_________________________________________________")
        inc = inc.split(":")
        # to get the subject part
        patch_bug_str_index = [i for i, s in enumerate(inc) if 'INFO' in s][0]
        inc_name = inc[patch_bug_str_index + 1]
        # file name
        log_file_name = os.path.basename(inc[0])
        # get file path
        log_path = os.path.split(inc[0])
        print("log_path :", log_path)
        full_path = log_path[0]
        print("FULL PATH: ", full_path)
Here's one way you could achieve this without calling out to grep, which, as I said in my comment, may not be portable:
import os
import sys

for root, _, files in os.walk('/home/user1/logs/MAIN_JOB'):
    for file in files:
        if file.endswith('.log'):
            path = os.path.join(root, file)
            try:
                with open(path) as infile:
                    for line in infile:
                        if 'INFO:' in line:
                            print(path)
                            break
            except Exception:
                print(f"Unable to process {path}", file=sys.stderr)

How to delete all files in subdirectories older than x days, except those created on a Sunday, in Python

I want a solution that deletes all the files in the subdirectories that are older than x days, but does not delete a file if it was created on a Sunday.
My main folder path is C:\Main_Folder, within which I have this structure:
+---Sub_Folder_1
| Day1.xlsx
| Day2.xlsx
| Day3.xlsx
|
+---Sub_Folder_2
| Day1.xlsx
| Day2.xlsx
| Day3.xlsx
|
\---Sub_Folder_3
Day1.xlsx
Day2.xlsx
Day3.xlsx
I tried the code below, but it deletes the subdirectories as well:
import os, shutil

folder = 'C:\\Main_Folder\\'
for the_file in os.listdir(folder):
    file_path = os.path.join(folder, the_file)
    try:
        if os.path.isfile(file_path):
            os.unlink(file_path)
        elif os.path.isdir(file_path):
            shutil.rmtree(file_path)
    except Exception as e:
        print(e)
In order to check the time a file was last edited, you want two modules: import os.path, time
The first thing to consider is which timestamp you want to use from a file, for example:
print("Last Modified: %s" % time.ctime(os.path.getmtime("file.txt")))
# Last Modified: Mon Jul 30 11:10:21 2018
print("Created: %s" % time.ctime(os.path.getctime('file.txt')))
# Created: Tue Jul 31 09:13:23 2018
So you will have to parse this string and check whether the file is older than x days. Also look through the string for Sun, Mon, Tue, Wed, Thu, Fri, Sat for the weekday value.
fileDate = time.ctime(os.path.getmtime('file.txt'))
if 'Sun' not in fileDate:
    # Sun for Sunday is not in the time string, so check the date and time
You will have to loop through the files in the subdirectories, check the created time or last-modified time, whichever you prefer, then parse the date and time fields from the string, compare them to your cutoff, and remove the files that qualify for deletion.
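Here's a minimal sketch of that loop, assuming x = 30 days and that the last-modified time is the relevant timestamp (swap getmtime for getctime if you prefer "created" semantics on Windows). Comparing the numeric timestamp and the weekday directly avoids parsing the ctime string at all:
import os
import time
from datetime import datetime

# Delete files (not directories) older than DAYS days, skipping files
# whose timestamp falls on a Sunday.
DAYS = 30
cutoff = time.time() - DAYS * 86400

for root, dirs, files in os.walk('C:\\Main_Folder'):
    for name in files:
        path = os.path.join(root, name)
        mtime = os.path.getmtime(path)
        is_sunday = datetime.fromtimestamp(mtime).weekday() == 6
        if mtime < cutoff and not is_sunday:
            os.remove(path)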

PYTHON: Problems with parsing csv file and finding maximum value [duplicate]

This question already has an answer here:
PYTHON: Simplest way to open a csv file and find the maximum number in a column and the name assosciated with it?
(1 answer)
Closed 6 years ago.
I need to read in the file https://drive.google.com/open?id=0B29hT1HI-pwxMjBPQWFYaWoyalE
However, I have tried 3-4 different code methods and repeatedly get the error "line contains NULL byte". I read on other threads that this indicates a problem with your CSV, but this is the file that my professor will be loading and grading me on, and I can't modify it, so I'm looking for a way around this error.
As I mentioned, I've tried several different methods to open the file. Here are my best two:
def largestState():
    INPUT = "statepopulations.csv"
    COLUMN = 5  # 6th column
    with open(INPUT, "rU") as csvFile:
        theFile = csv.reader(csvFile)
        header = next(theFile, None)  # skip header row
        pop = [float(row[COLUMN]) for row in theFile]
        max_pop = max(pop)
        print max_pop

largestState()
This results in the NULL byte error. Please ignore the additional max_pop lines; the next step after reading the file in is to find the maximum value in column F.
def test():
    with open('state-populations.csv', 'rb') as f:
        reader = csv.reader(f)
        for row in reader:
            print row

test()
This results in the NULL Byte Error.
If anyone could offer a simple solution to this problem I'd greatly appreciate it.
File as .txt: https://drive.google.com/open?id=0B29hT1HI-pwxZzhlMGZGVVAzX28
First of all, the "csv" file you have provided via the Google Drive link is NOT a CSV file. It's a gzip'ed XML file.
[~/Downloads] file state-populations.csv
state-populations.csv: gzip compressed data, from Unix
[~/Downloads] gzip -d state-populations.csv
gzip: state-populations.csv: unknown suffix -- ignored
[~/Downloads] mv state-populations.csv state-populations.csv.gz
[~/Downloads] gzip -d state-populations.csv.gz
[~/Downloads] ls state-populations.csv
state-populations.csv
[~/Downloads] file state-populations.csv
state-populations.csv: XML 1.0 document text, ASCII text, with very long lines
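If you'd rather not shell out, the decompression can also be done from Python; a quick sketch (writing to a hypothetical state-populations.xml):
import gzip
import shutil

# Decompress the mislabeled .csv (really gzip data) to an XML file.
with gzip.open('state-populations.csv', 'rb') as src, \
        open('state-populations.xml', 'wb') as dst:
    shutil.copyfileobj(src, dst)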
You can use an XML module to parse it:
[~/Downloads] python
Python 2.7.10 (default, Jul 30 2016, 18:31:42)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('state-populations.csv')
>>> root = tree.getroot()
>>> root
<Element '{http://www.gnumeric.org/v10.dtd}Workbook' at 0x10ded51d0>
>>> root.tag
'{http://www.gnumeric.org/v10.dtd}Workbook'
>>> for child in root:
...     print child.tag, child.attrib
...
{http://www.gnumeric.org/v10.dtd}Version {'Epoch': '1', 'Full': '1.12.9', 'Major': '12', 'Minor': '9'}
{http://www.gnumeric.org/v10.dtd}Attributes {}
{urn:oasis:names:tc:opendocument:xmlns:office:1.0}document-meta {'{urn:oasis:names:tc:opendocument:xmlns:office:1.0}version': '1.2'}
{http://www.gnumeric.org/v10.dtd}Calculation {'ManualRecalc': '0', 'MaxIterations': '100', 'EnableIteration': '1', 'IterationTolerance': '0.001', 'FloatRadix': '2', 'FloatDigits': '53'}
{http://www.gnumeric.org/v10.dtd}SheetNameIndex {}
{http://www.gnumeric.org/v10.dtd}Geometry {'Width': '864', 'Height': '322'}
{http://www.gnumeric.org/v10.dtd}Sheets {}
{http://www.gnumeric.org/v10.dtd}UIData {'SelectedTab': '0'}
The new .txt file looks good, and your function largestState() gives the correct output. Just return the value instead of printing it at the end:
def largestState():
    INPUT = "state-populations.txt"
    COLUMN = 5  # 6th column
    with open(INPUT, "rU") as csvFile:
        theFile = csv.reader(csvFile)
        header = next(theFile, None)  # skip header row
        pop = [float(row[COLUMN]) for row in theFile]
        max_pop = max(pop)
        return max_pop

largestState()

grep with python subprocess replacement

On a switch, I run ntpq -nc rv and get this output:
associd=0 status=0715 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.6p3-RC10#1.2239-o Mon Mar 21 02:53:48 UTC 2016 (1)",
processor="x86_64", system="Linux/3.4.43.Ar-3052562.4155M", leap=00,
stratum=2, precision=-21, rootdelay=23.062, rootdisp=46.473,
refid=17.253.24.125,
reftime=dbf98d39.76cf93ad Mon, Dec 12 2016 20:55:21.464,
clock=dbf9943.026ea63c Mon, Dec 12 2016 21:28:03.009, peer=43497,
tc=10, mintc=3, offset=-0.114, frequency=27.326, sys_jitter=0.151,
clk_jitter=0.162, clk_wander=0.028
I am attempting to create a bash shell command using Python's subprocess module to extract only the value of "offset", i.e. -0.114 in the example above.
I noticed that I can use the subprocess replacement module sh for this:
import sh
print(sh.grep(sh.ntpq("-nc rv"), 'offset'))
and I get:
mintc=3, offset=-0.114, frequency=27.326, sys_jitter=0.151,
which is incorrect as I just want the value for 'offset', -0.114.
Not sure what I am doing wrong here, whether it's my grep invocation or my use of the sh module.
grep reads line by line; it returns every line matching any part of the input, but I think grep is overkill here. Once you have the shell output, just search it for the value you want:
items = sh.ntpq("-nc rv").split(',')
for pair in items:
    if '=' not in pair:
        continue  # date fragments like " Dec 12 2016 ..." also contain commas
    name, value = pair.split('=', 1)
    # strip because we weren't careful with whitespace
    if name.strip() == 'offset':
        print(value.strip())
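If you'd rather avoid the sh dependency altogether, a standard-library sketch (Python 3.7+) does the same job with subprocess and a regex:
import re
import subprocess

# Run ntpq directly and pull the offset value out of its output.
out = subprocess.run(['ntpq', '-nc', 'rv'],
                     capture_output=True, text=True).stdout
match = re.search(r'offset=([-\d.]+)', out)
if match:
    print(match.group(1))  # -0.114 for the sample output above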

How do I truncate an EARRAY in an HDF5 file using pytables?

I have an HDF5 file containing a very large EARRAY that I would like to truncate in order to save disk space and process it more quickly. I am using the truncate method on the node containing the EARRAY. pytables reports that the array has been truncated but it still takes up the same amount of space on disk.
Directory listing before truncation:
$ ll
total 3694208
-rw-rw-r-- 1 chris 189 Aug 27 13:03 main.py
-rw-rw-r-- 1 chris 3782858816 Aug 27 13:00 original.hdf5
The script I am using to truncate (main.py):
import tables
filename = 'original.hdf5'
h5file = tables.open_file(filename, 'a')
print h5file
node = h5file.get_node('/recordings/0/data')
node.truncate(30000)
print h5file
h5file.close()
Output of the script. As expected, the EARRAY goes from very large to much smaller.
original.hdf5 (File) ''
Last modif.: 'Thu Aug 27 13:00:12 2015'
Object Tree:
/ (RootGroup) ''
/recordings (Group) ''
/recordings/0 (Group) ''
/recordings/0/data (EArray(43893300, 43)) ''
/recordings/0/application_data (Group) ''
original.hdf5 (File) ''
Last modif.: 'Thu Aug 27 13:00:12 2015'
Object Tree:
/ (RootGroup) ''
/recordings (Group) ''
/recordings/0 (Group) ''
/recordings/0/data (EArray(30000, 43)) ''
/recordings/0/application_data (Group) ''
Yet the file takes up almost exactly the same amount of space on disk:
$ ll
total 3693196
-rw-rw-r-- 1 chris 189 Aug 27 13:03 main.py
-rw-rw-r-- 1 chris 3781824064 Aug 27 13:03 original.hdf5
What am I doing wrong? How can I reclaim this disk space?
If there were a way to directly modify the contents of the earray, instead of using the truncate method, this would be even more useful for me. Something like node = node[idx1:idx2, :], so that I could select which chunk of data I want to keep. But when I use this syntax, the variable node simply becomes a numpy array and the hdf5 file is not modified.
As discussed in this question, you can't really deallocate disk space from an existing HDF5 file. That's just not part of how HDF5 is designed, and therefore it's not really a part of PyTables. You can either load the data from the file and rewrite it all as a new file (potentially with the same name), or use the command-line utility h5repack to do that for you.
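Here's a minimal sketch of the first option with PyTables, copying just the slice you want to keep into a fresh file. It only copies the one EArray; any other nodes, like /recordings/0/application_data, would need copying too:
import tables

# Copy the first 30000 rows of the EArray into a new file; the new file
# only allocates space for what it actually stores.
with tables.open_file('original.hdf5', 'r') as src, \
        tables.open_file('truncated.hdf5', 'w') as dst:
    data = src.get_node('/recordings/0/data')
    grp = dst.create_group('/', 'recordings')
    grp0 = dst.create_group(grp, '0')
    arr = dst.create_earray(grp0, 'data', atom=data.atom,
                            shape=(0, data.shape[1]))
    arr.append(data[:30000, :])
Alternatively, after your truncate call, running h5repack original.hdf5 repacked.hdf5 from the shell and renaming the result reclaims the space without any Python.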
