Adding a date column to the appended output CSV file in Python

I use the code below to combine all the CSV files; each file has 10,000 rows:
billing_report_2014-02-01.csv
billing_report_2014-02-02.csv
:
fout = open("out.csv", "a")
for num in range(1, 10):
    print(num)
    for line in open("billing_report_2014-02-0" + str(num) + ".csv"):
        fout.write(line)
for num in range(10, 29):
    print(num)
    for line in open("billing_report_2014-02-" + str(num) + ".csv"):
        fout.write(line)
fout.close()
But now I want to add a new date column to out.csv: every row appended from billing_report_2014-02-01.csv should get the value "2014-02-01", every row appended from billing_report_2014-02-02.csv the value "2014-02-02", and so on. How can I approach this?

List the filenames you want to work on, derive the date from each name, and build a generator over the input file that strips trailing newlines and appends a new field with the date, e.g.:
filenames = [
    'billing_report_2014-02-01.csv',
    'billing_report_2014-02-02.csv'
]

with open('out.csv', 'w') as fout:
    for filename in filenames:
        to_append = filename.rpartition('_')[2].partition('.')[0]
        with open(filename) as fin:
            fout.writelines('{},{}\n'.format(line.rstrip(), to_append) for line in fin)
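If the files are strictly comma-separated, the same idea can also be expressed with the csv module, which additionally handles quoting. A sketch; the two tiny sample files here are invented so the snippet is self-contained, and in practice the real billing_report_*.csv files from the question would be listed instead:

```python
import csv

# Invented sample files so the sketch is self-contained
for date, rows in [("2014-02-01", [["a", "1"]]), ("2014-02-02", [["b", "2"]])]:
    with open("billing_report_{}.csv".format(date), "w", newline="") as f:
        csv.writer(f).writerows(rows)

filenames = [
    "billing_report_2014-02-01.csv",
    "billing_report_2014-02-02.csv",
]

with open("out.csv", "w", newline="") as fout:
    writer = csv.writer(fout)
    for filename in filenames:
        # pull "2014-02-01" out of "billing_report_2014-02-01.csv"
        date = filename.rpartition("_")[2].partition(".")[0]
        with open(filename, newline="") as fin:
            for row in csv.reader(fin):
                writer.writerow(row + [date])
```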

I think you can just add the date at the end of each line (note that line still ends with a newline, so strip it first):
for line in open("billing_report_2014-02-0" + str(num) + ".csv"):
    fout.write(line.rstrip('\n') + ',DATE INFORMATION\n')
I am presuming your CSV is really comma-separated; if it is tab-separated, the character should be \t.
You could also use an intermediate step by changing line:
line = line.rstrip('\n') + ',DATE INFORMATION\n'
As you are trying to add the date from the file name, just build it from the loop variable (zero-padded):
line = line.rstrip('\n') + ',2014-02-' + str(num).zfill(2) + '\n'
You could also use the replace function if it is always the ",LLC" string expression; see the example below:
>>> string = "100, 90101, California, Example company,LLC, other data"
>>> string.replace(',LLC',';LLC')
'100, 90101, California, Example company;LLC, other data'
>>>
Putting it all together, and trying to bring in some of the inspiration from @Jon Clements as well (kudos!):
def combine_and_add_date(year, month, startday, endday, replace_dict):
    fout = open("out.csv", "a")
    for num in range(startday, endday + 1):
        daynum = str(num).zfill(2)  # pad single digits with a leading zero
        date_info = year + '-' + month + '-' + daynum
        source_name = 'billing_report_' + date_info + '.csv'
        for line in open(source_name):
            for key in replace_dict:
                line = line.replace(key, replace_dict[key])  # replace returns a new string
            fout.write(line.rstrip('\n') + ',' + date_info + '\n')
    fout.close()
I hope this works; you should (hopefully, I am a newb...) use it like this. Note the dictionary is designed to allow you to make all kinds of replacements:
combine_and_add_date("2014","02",1,28, {',LLC': ';LLC', ',PLC':';PLC'})
fingers crossed

Related

How to append current lines to the previous line when it starts with 'id'

My problem is the following:
I have one text file with more than 1000 rows and want to read it line by line.
I am trying the code below, but I am not getting the expected output.
My source file:
uuid;UserGroup;Name;Description;Owner;Visibility;Members ----> header of the file
id:;group1;raji;xyzabc;ramya;public;
abc
def
geh
id:group2;raji;rtyui;ramya;private
cvb
nmh
poi
import csv

output = []
temp = []
fo = open('usergroups.csv', 'r')
for line in fo:
    #next(uuid)
    line = line.strip()
    if not line:
        continue  # ignore empty lines
    #temp.append(line)
    if not line.startswith('id:') and not None:
        temp.append(line)
        print(line)
    else:
        if temp:
            line += ";" + ",".join(temp)
            temp.clear()
        output.append(line)
print("\n".join(output))
with open('new.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(output)
I am getting this output:
id;group1;raji;xyzabc;ramya;public;uuid;UserGroup;Name;Description;Owner;Visibility;Members
id:group2;raji;rtyui;ramya;private;abc,def,geh
So whenever a line does not start with 'id', it should be appended to the previous line.
My desired output:
uuid;UserGroup;Name;Description;Owner;Visibility;Members ----> header of the file
id;group1;raji;xyzabc;ramya;public;abc,def,geh
id:group2;raji;rtyui;ramya;private;cvb,nmh,poi
There are a few mistakes. I'll only show the relevant corrections:
Use
if not line.startswith('id'):
No 'id:', since you also have a line starting with 'id;', plus you state yourself that a line has to start with "id" (no ":" there). The `and not None` part is unnecessary, because it's always true.
The other part:
output.append(line.split(';'))
because writerows needs an iterable (list) of "row" objects, and a row object is a list of strings. So you need a list of lists, which the above is, thanks to the extra split.
(Of course, now the line print("\n".join(output)) fails, but writer.writerows(output) works.)
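Putting the two corrections together end to end could look like the sketch below. Note that the header handling, the flush of the trailing group, and the rstrip(';') that avoids a doubled ';' are additions beyond the two fixes above, and the sample file is recreated inline so the snippet runs standalone:

```python
import csv

# Recreate the question's sample input so the sketch is self-contained
with open("usergroups.csv", "w") as f:
    f.write("uuid;UserGroup;Name;Description;Owner;Visibility;Members\n"
            "id:;group1;raji;xyzabc;ramya;public;\n"
            "abc\ndef\ngeh\n"
            "id:group2;raji;rtyui;ramya;private\n"
            "cvb\nnmh\npoi\n")

output = []
temp = []
with open("usergroups.csv") as fo:
    for line in fo:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        if output and not line.startswith("id"):
            temp.append(line)  # continuation line
        else:
            if temp:  # flush collected continuations onto the previous id row
                output[-1] = output[-1].rstrip(";") + ";" + ",".join(temp)
                temp.clear()
            output.append(line)  # the header (first line) or a new id row
if temp:  # attach the trailing group
    output[-1] = output[-1].rstrip(";") + ";" + ",".join(temp)

with open("new.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerows(row.split(";") for row in output)
```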
I don't know if it will help you, but with regex this problem is solved in a very simple way. I leave the code here in case you are interested.
import regex as re
input_text = """uuid;UserGroup;Name;Description;Owner;Visibility;Members ----> header of the file
id;group1;raji;xyzabc;ramya;public;
abc
def
geh
id:group2;raji;rtyui;ramya;private
cvb
nmh
poi"""
formatted = re.sub(r"\n(?!(id|\n))", "", input_text)
print(formatted)
uuid;UserGroup;Name;Description;Owner;Visibility;Members ----> header of the file
id;group1;raji;xyzabc;ramya;public;abcdefgeh
id:group2;raji;rtyui;ramya;privatecvbnmhpoi
This code just replaces every match of the regular expression \n(?!(id|\n)) with the empty string. This regular expression matches all line breaks that are not followed by the word "id" or another line break (so a blank line between two groups of ids would be kept).
Writing to a file has not been included here, but there is a list of strings available to work with, as in your original code.
Note: this is not so much an answer to your question as a solution to your problem.
The structure is by and large the same, with a few changes for readability.
Readable code is easier to get right
import csv

output = []
temp = []
currIdLine = ""
with open('usergroups.csv', 'r') as f:
    for dirtyline in f.readlines():
        line = dirtyline.strip()
        if not line:
            print("Skipping empty line")
            continue
        if line.startswith('uuid'):  # append the header to the output
            output.append(line)
            continue
        if line.startswith('id'):
            if temp:
                print(temp)
                # based on the current input, there is a bug here: the output
                # will contain two sequential ';' characters, since the id line
                # may already end with ';'
                output.append(currIdLine + ";" + ','.join(temp))
                temp.clear()
            currIdLine = line
        else:
            temp.append(line)
    output.append(currIdLine + ";" + ','.join(temp))
print(output)
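The doubled-';' bug noted in the comment can be avoided by stripping a trailing ';' before joining, and the assembled list of row strings can then be written straight to a file. A small sketch; the sample values are copied from the question for illustration:

```python
# Join continuation values onto an id line without doubling ';'
currIdLine = "id;group1;raji;xyzabc;ramya;public;"  # note the trailing ';'
temp = ["abc", "def", "geh"]

row = currIdLine.rstrip(";") + ";" + ",".join(temp)
output = ["uuid;UserGroup;Name;Description;Owner;Visibility;Members", row]

# Write the assembled rows, one per line
with open("new.csv", "w") as f:
    f.write("\n".join(output) + "\n")
```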

How to split a text file and modify it in Python?

I currently have a text file that reads like this:
101, Liberia, Monrovia, 111000, 3200000, Africa, English, Liberia Dollar;
102, Uganda, Kampala, 236000, 34000000, Africa, English and Swahili, Ugandan Shilling;
103, Madagascar, Antananarivo, 587000, 21000000, Africa, Magalasy and Frances, Malagasy Ariary;
I'm currently printing the file using this code:
with open("base.txt", 'r') as f:
    for line in f:
        words = line.split(';')
        for word in words:
            print(word)
What I would like to know is: how can I modify a line by its id number (101, for example) while keeping the format it has, and how can I add or remove lines based on their id number?
pandas is a strong tool for this kind of requirement: it provides tools for easily working with CSV files and lets you manage your data in DataFrames.
import pandas as pd
# read the CSV file into DataFrame
df = pd.read_csv('file.csv', sep=',', header=None, index_col = 0)
print (df)
# eliminating the `;` character
df[7] = df[7].map(lambda x: str(x).rstrip(';'))
print (df)
# eliminating the #101 row of data
df.drop(101, axis=0, inplace=True)
print (df)
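To round-trip the result, the cleaned DataFrame can be written back out with to_csv. A sketch: the inline sample stands in for file.csv, header=False mirrors header=None on read, and the index (the id) is written as the first field:

```python
import io
import pandas as pd

# Inline stand-in for file.csv, using two of the question's rows
raw = ("101, Liberia, Monrovia, 111000, 3200000, Africa, English, Liberia Dollar;\n"
       "102, Uganda, Kampala, 236000, 34000000, Africa, English and Swahili, Ugandan Shilling;\n")
df = pd.read_csv(io.StringIO(raw), sep=',', header=None, index_col=0)

df[7] = df[7].map(lambda x: str(x).rstrip(';'))  # drop the trailing ';'
df.drop(101, axis=0, inplace=True)               # remove the row with id 101

df.to_csv('file_out.csv', header=False)          # the id index is written first
```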
My understanding is that you are asking how to modify a word in a line and then insert the modified line back into the file.
Change a word in the file
def change_value(new_value, line_number, column):
    with open("base.txt", 'r+') as f:  # r+ means we can read and write to the file
        lines = f.read().split('\n')  # lines is now a list of all the lines in the file
        words = lines[line_number].split(',')
        words[column] = new_value
        lines[line_number] = ','.join(words)  # rebuild the line with ',' between words
        f.seek(0)
        f.write('\n'.join(lines))  # write the new lines back into the file
        f.truncate()  # drop leftover old content if the new text is shorter
In order to use this function to set line 3, word 2 to Not_Madagascar, call it like this:
change_value("Not_Madagascar", 2, 1)
You always have to subtract 1 from the line/word number, because the first line/word is number 0.
Add a new line to the file
def add_line(words, line_number):
    with open("base.txt", 'r+') as f:
        lines = f.readlines()
        lines.insert(line_number, ','.join(words) + '\n')
        f.seek(0)
        f.writelines(lines)
In order to use this function to add a line at the end containing the words "this line is at the end", call it like this:
add_line(['this', 'line', 'is', 'at', 'the', 'end'], 4)  # 4 is the line number
For more information on opening files see here.
For more information on reading from and modifying files see here.
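A condensed, self-contained sketch of the same edit-in-place idea; rewriting the whole file avoids the leftover-bytes problem that r+ with seek(0) has when the new content is shorter than the old. The sample file is built from the question's rows, and "Not_Uganda" is an invented replacement value:

```python
# Build a sample file in the question's format
lines_in = [
    "101, Liberia, Monrovia, 111000, 3200000, Africa, English, Liberia Dollar;",
    "102, Uganda, Kampala, 236000, 34000000, Africa, English and Swahili, Ugandan Shilling;",
]
with open("base.txt", "w") as f:
    f.write("\n".join(lines_in) + "\n")

def change_value(new_value, line_number, column):
    # Read everything, patch one word, and rewrite the whole file
    with open("base.txt") as f:
        lines = f.read().splitlines()
    words = lines[line_number].split(",")
    words[column] = new_value
    lines[line_number] = ",".join(words)
    with open("base.txt", "w") as f:
        f.write("\n".join(lines) + "\n")

change_value("Not_Uganda", 1, 1)  # line 2, word 2 (0-indexed)
```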
Reading this file into an OrderedDict would probably be helpful if you are trying to preserve the original file ordering as well as have the ability to reference lines in the file for modification/addition/deletion. There are quite a few assumptions about the full format of the file in the following example, but it will work for your test case:
from collections import OrderedDict

content = OrderedDict()
with open('base.txt', 'r') as f:
    for line in f:
        if line.strip():
            print(line)
            words = line.split(',')  # assuming that you meant ',' vs ';' to split the line into words
            content[int(words[0])] = ','.join(words[1:])
print(content[101])  # prints " Liberia, Monrovia, etc"...
content.pop(101, None)  # remove the line with 101 as the "id"
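To write the (possibly modified) content back out in its preserved order, a sketch continuing the same idea; the dict is built inline here so the snippet runs standalone, and the values keep their leading spaces exactly as the parsing above would leave them:

```python
from collections import OrderedDict

# Inline stand-in for the parsed content (id -> rest of the line)
content = OrderedDict([
    (102, " Uganda, Kampala, 236000, 34000000, Africa, English and Swahili, Ugandan Shilling;"),
    (103, " Madagascar, Antananarivo, 587000, 21000000, Africa, Magalasy and Frances, Malagasy Ariary;"),
])

# Rebuild each line as "id,rest", one per line, in insertion order
with open("base_out.txt", "w") as f:
    for id_num, rest in content.items():
        f.write("{},{}\n".format(id_num, rest))
```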

Replace an array of values in a file with new values

import fileinput
import sys

file_platform = 'karbarge.DAT'
Dir = 30
Hs = 4
Tp = 12
dummy = []
newval = [Dir, Hs, Tp]

def replace():
    oldval = []
    line_no = []
    search_chars = ['WaveDir', 'WaveHs', 'WaveTp']
    with open(file_platform) as f_input:
        for line_number, line in enumerate(f_input, start=1):
            for search in search_chars:
                if search in line:
                    first_non_space = line.strip().split(' ')[0]
                    x = search_chars.index(search)
                    dummy.append((search, first_non_space, line_number, x))
                    print(search, first_non_space, line_number, x)
    i = 0
    j = 0
    for line in fileinput.input(file_platform, inplace=1):
        print(i)
        if i == eval(dummy[j][2]):
            line = line.replace(eval(dummy[j][1]), str(newval[dummy[j][3]]))
            j = j + 1
        sys.stdout.write(line)
        i = i + 1
    return

replace()
The aim is to select the first non-space value based on keywords in a file and record it with an index. Then, using that index, I want to replace it in the same file with new values. I am successful in picking up the existing values in the file, but could not replace them with the new values and save the result under the same filename. The code I tried is above.
Could anyone suggest modifications, or a better approach that is simple and clear?
I honestly didn't understand 100% of your goal, but I would like to propose a cleaner solution that might suit your needs. The main idea is to read each line of the old file, replace the values accordingly, store the resulting lines in a list, and finally write that list of lines to a new file.
# This dict maps the keywords to the values to which they should be changed
replacing_dict = {'WaveDir': Dir, 'WaveHs': Hs, 'WaveTp': Tp}

new_lines = []
with open(file_platform) as f_input:
    for line in f_input:
        first_non_space = line.strip().split(' ')[0]
        if first_non_space in replacing_dict:
            # str.replace returns a new string, so the result must be assigned
            line = line.replace(first_non_space, str(replacing_dict[first_non_space]))
        new_lines.append(line)
# Write new_lines into a file and, optionally, replace the old file with this new file
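An end-to-end sketch of that last write-out step, on an invented file whose lines start with the keyword. The real karbarge.DAT layout may differ (it may put the value before the keyword), so treat the file contents and the parsing here as assumptions:

```python
# Invented keyword-first layout; the real karbarge.DAT may put the value first
replacing_dict = {'WaveDir': '30', 'WaveHs': '4', 'WaveTp': '12'}

with open('platform.dat', 'w') as f:
    f.write('WaveDir    wave direction\n'
            'WaveHs     significant wave height\n'
            'Unrelated  line\n')

new_lines = []
with open('platform.dat') as f_input:
    for line in f_input:
        first_non_space = line.strip().split(' ')[0]
        if first_non_space in replacing_dict:
            # str.replace returns a new string; reassigning is essential
            line = line.replace(first_non_space, replacing_dict[first_non_space], 1)
        new_lines.append(line)

# Overwrite the original file with the modified lines
with open('platform.dat', 'w') as f_output:
    f_output.writelines(new_lines)
```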

Iterate over a CSV file Python

I have a CSV file that looks like this
a,b,c
d1,g4,4m
t,35,6y
mm,5,m
I'm trying to replace all the m's and y's preceded by a number with 'month' and 'year' respectively. I'm using the following script.
import re, csv

out = open("out.csv", "wb")
file = "in.csv"
with open(file, 'r') as f:
    reader = csv.reader(f)
    for ss in reader:
        s = str(ss)

month_pair = (re.compile('(\d\s*)m'), 'months')
year_pair = (re.compile('(\d\s*)y'), 'years')

def substitute(s, pairs):
    for (pattern, substitution) in pairs:
        match = pattern.search(s)
        if match:
            s = pattern.sub(match.group(1) + substitution, s)
    return s

pairs = [month_pair, year_pair]
print(substitute(s, pairs))
It does the replacement, but only on the last row, ignoring the ones before it. How can I make it iterate over all the rows and write them to another CSV file?
You can use a positive lookbehind:
>>> re.sub(r'(?<=\d)m','months',s)
'a,b,c\nd1,g4,4months\nt,35,6y\nmm,5,m'
>>> re.sub(r'(?<=\d)y','years',s)
'a,b,c\nd1,g4,4m\nt,35,6years\nmm,5,m'
In this line
print (substitute(s, pairs))
your variable s is only the last line of your file. Note how you update s in your file-reading loop to be the current line.
Solutions (choose one):
You could try another for-loop to iterate over all lines.
Or move the substitution into the for-loop where you read the lines of the file. This is definitely the better solution!
You can easily look up how to write a new file or change the file you are working on.
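A sketch of the second (better) solution: do the substitution inside the reading loop, cell by cell, and write each processed row straight to the output file. File names follow the question, and the input file is recreated inline so the snippet runs standalone:

```python
import csv
import re

# Recreate the question's sample input
with open('in.csv', 'w', newline='') as f:
    f.write('a,b,c\nd1,g4,4m\nt,35,6y\nmm,5,m\n')

month_pat = re.compile(r'(\d\s*)m')
year_pat = re.compile(r'(\d\s*)y')

with open('in.csv', newline='') as fin, open('out.csv', 'w', newline='') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row in reader:
        # substitute in every cell of the current row, not just the last one
        row = [year_pat.sub(r'\1years', month_pat.sub(r'\1months', cell))
               for cell in row]
        writer.writerow(row)
```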

Extracting oddly arranged data from csv and converting to another csv file using python

I have an odd CSV file that has data with header values and their corresponding data arranged as below:
,,,Completed Milling Job,,,,,, # row 1
,,,,Extended Report,,,,,
,,Job Spec numerical control,,,,,,,
Job Number,3456,,,,,, Operator Id,clipper,
Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,
Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,
I need to extract the data from this structure and create another CSV file with the structure below:
Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time,,, # header
Completed Milling Job,3456,Caterpillar,Stepper motor,clipper,3/12/2013 6:22,3/12/2013 9:16,,, # data row
If you notice, there is a new header column added, called "Status", but its value is in the first row of the input file. The rest of the column names in the output file are extracted from the original file.
Any thoughts will be greatly appreciated - thanks
Assuming the files are all exactly like that (at least in terms of capitalization), this should work, though I can only guarantee it on the exact data you have supplied:
#!/usr/bin/python
import glob
from sys import argv

g = open(argv[2], 'w')
g.write("Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time\n")
for fname in glob.glob(argv[1]):
    with open(fname) as f:
        status = f.readline().strip().strip(',')
        f.readline()  # extended report not needed
        f.readline()  # job spec numerical control not needed
        s = f.readline()
        job_no = s.split('Job Number,')[1].split(',')[0]
        op_id = s.split('Operator Id,')[1].strip().strip(',')
        s = f.readline()
        machine_name = s.split('Coder Machine Name,')[1].split(',')[0]
        start_t = s.split('Job Start time,')[1].strip().strip(',')
        s = f.readline()
        machine_type = s.split('Machine type,')[1].split(',')[0]
        end_t = s.split('Job end time,')[1].strip().strip(',')
        g.write(",".join([status, job_no, machine_name, machine_type, op_id, start_t, end_t]) + "\n")
g.close()
It takes a glob argument (like Job*.data) and an output filename and should construct what you need. Just save it as 'so.py' or something and run it as python so.py <data_files_wildcarded> output.csv
Here is a solution that should work on any CSV files that follow the same pattern as what you showed. That is a seriously nasty format.
I got interested in the problem and worked on it during my lunch break. Here's the code:
COMMA = ','
NEWLINE = '\n'

def _kvpairs_from_line(line):
    line = line.strip()
    values = [item.strip() for item in line.split(COMMA)]
    i = 0
    while i < len(values):
        if not values[i]:
            i += 1  # advance past empty value
        else:
            # yield pair of values
            yield (values[i], values[i+1])
            i += 2  # advance past pair

def kvpairs_by_column_then_row(lines):
    """
    Given a series of lines, where each line is comma-separated values
    organized as key/value pairs like so:

    key_1,value_1,key_n+1,value_n+1,...
    key_2,value_2,key_n+2,value_n+2,...
    ...
    key_n,value_n,key_n+n,value_n+n,...

    Yield up key/value pairs taken from the first column, then from the
    second column, and so on.
    """
    pairs = [_kvpairs_from_line(line) for line in lines]
    done = [False for _ in pairs]
    while not all(done):
        for i in range(len(pairs)):
            if not done[i]:
                try:
                    key_value_tuple = next(pairs[i])
                    yield key_value_tuple
                except StopIteration:
                    done[i] = True

STATUS = "Status"
columns = [STATUS]
d = {}

with open("data.csv", "rt") as f:
    # get an iterator that lets us pull lines conveniently from the file
    itr = iter(f)

    # pull the first line and collect the status
    line = next(itr)
    lst = line.split(COMMA)
    d[STATUS] = lst[3]

    # pull the next lines and make sure the file is what we expected
    line = next(itr)
    assert "Extended Report" in line
    line = next(itr)
    assert "Job Spec numerical control" in line

    # pull all remaining lines and save them in a list
    lines = [line.strip() for line in f]

for key, value in kvpairs_by_column_then_row(lines):
    columns.append(key)
    d[key] = value

with open("output.csv", "wt") as f:
    # write the column-headers line
    line = COMMA.join(columns)
    f.write(line + NEWLINE)
    # write the data row
    line = COMMA.join(d[key] for key in columns)
    f.write(line + NEWLINE)
