identify csv in python - python

I have a data dump that is a "messed up" CSV. (About 100 files, each with about 1000 lines of actual CSV data.)
The dump has some other text in addition to CSV. How can I extract the CSV part separately, programmatically?
As an example the data file looks like something like this
Session:1
Data collection date: 09-09-2016
Related questions:
Question 1: parta, partb, partc,
Question 2: parta, partb, partc
"field1","field2","field3","field4"
"data11","data12","data13","data14"
"data21","data22","data23","data24"
"data31","data32","data33","data34"
"data41","data42","data43","data44"
"data51","data52","data53","data54"
I need to extract the csv part.
Caveats,
the text in the beginning is NOT limited to 4 - 5 lines.
the additional text is NOT just in the beginning of the file
I saw this post that suggests using re.split and/or csv.Sniffer,
however my attempt was not fruitful.
with open("untitled.csv") as csvfile:
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
print(dialect.__dict__)
csvstarts = False
csvdump = []
for ln in csvfile.readlines():
toks = re.split(r'[,]', ln)
print(toks)
if toks[0] == '"field1"' and not csvstarts: # identify by the header line
csvstarts = True
continue
if csvstarts:
if toks[0] == '"field1"': # identify the start of subsequent csv data
csvstarts = False
continue
csvdump.append(ln) # record the current line
print(csvdump)
For now I am able to identify the csv lines accurately ONLY if there is one bunch of data.
Is there anything better I can do?

How about this:
import re
my_pattern = re.compile("(\"[\w]+\",)+")
with open('<your_file>', 'rb') as fi:
for f in fi:
result = my_pattern.match(f)
if result:
print f
Assuming the csv data can be differentiated from the rest by having no special characters in them (we only accept each element to have letters or numbers surrounded by double quotes and a comma to separate from the next element)

If your csv lines and only those lines start with \", then you can do this:
import csv
data = list(csv.reader(open("test.csv", 'rb'), quotechar='¬'))
# for quotechar - use something that won't turn up in data
def importCSV(data):
# outputs list of list with required data
# works on the assumption that all required data starts with \"
# and that no text starts with \"
out = []
for line in data:
if (line != []) and (line[0][0] == "\""):
line = [el.replace("\"", "") for el in line]
out.append(line)
return out
useful = importCSV(data)

Can you not read each line and do a regex to see weather or not to pull the data?
Maybe something like:
^(["][\w]["][,])+["][\w]["]$
My regex is not the best and there may likely be a better way but that seemed to work for me.

Related

Concatenate lines with previous line based on number of letters in first column

New to coding and trying to figure out how to fix a broken csv file to make be able to work with it properly.
So the file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that occasional notes have newlines in them and when exporting the csv the tooling does not contain quotation marks to define it as a string within the field.
see below example:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3,4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re
with open('Rapp.txt', 'r') as f:
for line in f:
previous = line #keep current line in variable to join next line
if not re.match(r'^[A-Za-z]{3}', line): #regex to match 3 letters
print(previous.join(line))
Script shows no output just finishes silently, any thoughts?
I think I would go a slightly different way:
import re
all_the_data = ""
with open('Rapp.txt', 'r') as f:
for line in f:
if not re.search("\d{4}-\d{1,2}-\d{1,2};\n", line):
line = re.sub("\n", "", line)
all_the_data = "".join([all_the_data, line])
print (all_the_data)
There a several ways to do this each with pros and cons, but I think this keeps it simple.
Loop the file as you have done and if the line doesn't end in a date and ; take off the carriage return and stuff it into all_the_data. That way you don't have to play with looking back 'up' the file. Again, lots of way to do this. If you would rather use the logic of starts with 3 letters and a ; and looking back, this works:
import re
all_the_data = ""
with open('Rapp.txt', 'r') as f:
all_the_data = ""
for line in f:
if not re.search("^[A-Za-z]{3};", line):
all_the_data = re.sub("\n$", "", all_the_data)
all_the_data = "".join([all_the_data, line])
print ("results:")
print (all_the_data)
Pretty much what was asked for. The logic being if the current line doesn't start right, take out the previous line's carriage return from all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com
The regex in your code matches to all the lines (string) in the txt (finds a valid match to the pattern). The if condition is never true and hence nothing prints.
with open('./Rapp.txt', 'r') as f:
join_words = []
for line in f:
line = line.strip()
if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
print(';'.join(join_words))
join_words = []
join_words.append(line)
else:
join_words.append(line)
print(";".join(join_words))
I've tried to not use regex here to keep it a little clear if possible. But, regex is a better option.
A simple way would be to use a generator that acts as a filter on the original file. That filter would concatenate a line to the previous one if it has not a semicolon (;) in its 4th column. Code could be:
def preprocess(fd):
previous = next(fd)
for line in fd:
if line[3] == ';':
yield previous
previous = line
else:
previous = previous.strip() + " " + line
yield previous # don't forget last line!
You could then use:
with open(test.txt) as fd:
rd = csv.DictReader(preprocess(fd))
for row in rd:
...
The trick here is that the csv module only requires on object that returns a line each time next function is applied to it, so a generator is appropriate.
But this is only a workaround and the correct way would be that the previous step directly produces a correct CSV file.

Parsing a text file with line breaks in python

I have a text file with about 20 entries. They look like this:
~
England
Link: http://imgur.com/foobar.jpg
Capital: London
~
Iceland
Link: http://imgur.com/foobar2.jpg
Capital: Reykjavik
...
etc.
I would like to take these entries and turn them into a CSV.
There is a '~' separating each entry. I'm scratching my head trying to figure out how to go thru line by line and create the CSV values for each country. Can anyone give me a clue on how to go about this?
Use the libraries luke :)
I'm assuming your data is well formatted. Most real world data isn't that way. So, here goes a solution.
>>> content.split('~')
['\nEngland\nLink: http://imgur.com/foobar.jpg\nCapital: London\n', '\nIceland\nLink: http://imgur.com/foobar2.jpg\nCapital: Reykjavik\n', '\nEngland\nLink: http://imgur.com/foobar.jpg\nCapital: London\n', '\nIceland\nLink: http://imgur.com/foobar2.jpg\nCapital: Reykjavik\n']
For writing the CSV, Python has standard library functions.
>>> import csv
>>> csvfile = open('foo.csv', 'wb')
>>> fieldnames = ['Country', 'Link', 'Capital']
>>> writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
>>> for entry in entries:
... cols = entry.strip().splitlines()
... writer.writerow({'Country': cols[0], 'Link':cols[1].split(': ')[1], 'Capital':cols[2].split(':')[1]})
...
If your data is more semi structured or badly formatted, consider using a library like PyParsing.
Edit:
Second column contains URLs, so we need to handle the splits well.
>>> cols[1]
'Link: http://imgur.com/foobar2.jpg'
>>> cols[1].split(':')[1]
' http'
>>> cols[1].split(': ')[1]
'http://imgur.com/foobar2.jpg'
The way that I would do that would be to use the open() function using the syntax of:
f = open('NameOfFile.extensionType', 'a+')
Where "a+" is append mode. The file will not be overwritten and new data can be appended. You could also use "r+" to open the file in read mode, but would lose the ability to edit. The "+" after a letter signifies that if the document does not exist, it will be created. The "a+" I've never found to work without the "+".
After that I would use a for loop like this:
data = []
tmp = []
for line in f:
line.strip() #Removes formatting marks made by python
if line == '~':
data.append(tmp)
tmp = []
continue
else:
tmp.append(line)
Now you have all of the data stored in a list, but you could also reformat it as a class object using a slightly different algorithm.
I have never edited CSV files using python, but I believe you can use a loop like this to add the data:
f2 = open('CSVfileName.csv', 'w') #Can change "w" for other needs i.e "a+"
for entry in data:
for subentry in entry:
f2.write(str(subentry) + '\n') #Use '\n' to create a new line
From my knowledge of CSV that loop would create a single column of all of the data. At the end remember to close the files in order to save the changes:
f.close()
f2.close()
You could combine the two loops into one in order to save space, but for the sake of explanation I have not.

what is a quick way to import a text file in python?

I have a plain text file with a sequence of numbers, one on each line. I need to import those values into a list. I'm currently learning python and I'm not sure of which is a fast or even "standard" way of doing this (also, I come from R so I'm used to the scan or readLines functions that makes this task a breeze).
The file looks like this (note: this isn't a csv file, commas are decimal points):
204,00
10,00
10,00
10,00
10,00
11,00
70,00
276,00
58,00
...
Since it uses commas instead of '.' for decimal points, I guess the task's a little harder, but it should be more or less the same, right?
This is my current solution, which I find quite cumbersome:
f = open("some_file", "r")
data = f.read().replace('\n', '|')
data = data[0:(len(data) - 2)].replace(',', '.')
data = data.split('|')
x = range(len(data))
for i in range(len(data)):
x[i] = float(data[i])
Thanks in advance.
UPDATE
I didn't realize the comma was the decimal separator. If the locale is set right, something like this should work
lines = [locale.atof(line.strip()) for line in open(filename)]
if not, you could do
lines = [float(line.strip().replace(',','.')) for line in open(filename)]
lines = [line.strip() for line in open(filename)]
if you want the data as numbers ...
lines = [map(float,line.strip().split(',')) for line in open(filename)]
edited as per first two comments below
bsoist's answer is good if locale is set correctly. If not, you can simply read the entire file in and split on the line breaks (\n), then use a list comprehension for replacements.
with open('some_file.txt', 'r') as datafile:
data = datafile.read()
x = [float(value.replace(",", ".")) for value in data.split('\n')]
For a more simpler way you could just do
Read = []
with open('File.txt', 'r') as File:
Read = File.readLines()
for A in Read:
print A
The "with open()" will open the file and quit when it's finished reading. This is good practice IIRC.
Then the For loop will just loop over Read and print out the lines.

CSV parsing in Python

I want to parse a csv file which is in the following format:
Test Environment INFO for 1 line.
Test,TestName1,
TestAttribute1-1,TestAttribute1-2,TestAttribute1-3
TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3
Test,TestName2,
TestAttribute2-1,TestAttribute2-2,TestAttribute2-3
TestAttributeValue2-1,TestAttributeValue2-2,TestAttributeValue2-3
Test,TestName3,
TestAttribute3-1,TestAttribute3-2,TestAttribute3-3
TestAttributeValue3-1,TestAttributeValue3-2,TestAttributeValue3-3
Test,TestName4,
TestAttribute4-1,TestAttribute4-2,TestAttribute4-3
TestAttributeValue4-1-1,TestAttributeValue4-1-2,TestAttributeValue4-1-3
TestAttributeValue4-2-1,TestAttributeValue4-2-2,TestAttributeValue4-2-3
TestAttributeValue4-3-1,TestAttributeValue4-3-2,TestAttributeValue4-3-3
and would like to turn this into tab seperated format like in the following:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
Number of TestAttributes vary from test to test. For some tests there are only 3 values, for some others 7, etc. Also as in TestName4 example, some tests are executed more than once and hence each execution has its own TestAttributeValue line. (in the example testname4 is executed 3 times, hence we have 3 value lines)
I am new to python and do not have much knowledge but would like to parse the csv file with python. I checked 'csv' library of python and could not be sure whether it will be enough for me or shall I write my own string parser? Could you please help me?
Best
I'd use a solution using the itertools.groupby function and the csv module. Please have a close look at the documentation of itertools -- you can use it more often than you think!
I've used blank lines to differentiate the datasets, and this approach uses lazy evaluation, storing only one dataset in memory at a time:
import csv
from itertools import groupby
with open('my_data.csv') as ifile, open('my_out_data.csv', 'wb') as ofile:
# Use the csv module to handle reading and writing of delimited files.
reader = csv.reader(ifile)
writer = csv.writer(ofile, delimiter='\t')
# Skip info line
next(reader)
# Group datasets by the condition if len(row) > 0 or not, then filter
# out all empty lines
for group in (v for k, v in groupby(reader, lambda x: bool(len(x))) if k):
test_data = list(group)
# Write header
writer.writerow([test_data[0][1]])
# Write transposed data
writer.writerows(zip(*test_data[1:]))
# Write blank line
writer.writerow([])
Output, given that the supplied data is stored in my_data.csv:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
The following does what you want, and only reads up to one section at a time (saves memory for a large file). Replace in_path and out_path with the input and output file paths respectively:
import csv
def print_section(section, f_out):
if len(section) > 0:
# find maximum column length
max_len = max([len(col) for col in section])
# build and print each row
for i in xrange(max_len):
f_out.write('\t'.join([col[i] if len(col) > i else '' for col in section]) + '\n')
f_out.write('\n')
with csv.reader(open(in_path, 'r')) as f_in, open(out_path, 'w') as f_out:
line = f_in.next()
section = []
for line in f_in:
# test for new "Test" section
if len(line) == 3 and line[0] == 'Test' and line[2] == '':
# write previous section data
print_section(section, f_out)
# reset section
section = []
# write new section header
f_out.write(line[1] + '\n')
else:
# add line to section
section.append(line)
# print the last section
print_section(section, f_out)
Note that you'll want to change 'Test' in the line[0] == 'Test' statement to the correct word for indicating the header line.
The basic idea here is that we import the file into a list of lists, then write that list of lists back out using an array comprehension to transpose it (as well as adding in blank elements when the columns are uneven).

Remove special characters from csv file using python

There seems to something on this topic already (How to replace all those Special Characters with white spaces in python?), but I can't figure this simple task out for the life of me.
I have a .CSV file with 75 columns and almost 4000 rows. I need to replace all the 'special characters' ($ # & * ect) with '_' and write to a new file. Here's what I have so far:
import csv
input = open('C:/Temp/Data.csv', 'rb')
lines = csv.reader(input)
output = open('C:/Temp/Data_out1.csv', 'wb')
writer = csv.writer(output)
conversion = '-"/.$'
text = input.read()
newtext = '_'
for c in text:
newtext += '_' if c in conversion else c
writer.writerow(c)
input.close()
output.close()
All this succeeds in doing is to write everything to the output file as a single column, producing over 65K rows. Additionally, the special characters are still present!
Sorry for the redundant question.
Thank you in advance!
I might do something like
import csv
with open("special.csv", "rb") as infile, open("repaired.csv", "wb") as outfile:
reader = csv.reader(infile)
writer = csv.writer(outfile)
conversion = set('_"/.$')
for row in reader:
newrow = [''.join('_' if c in conversion else c for c in entry) for entry in row]
writer.writerow(newrow)
which turns
$ cat special.csv
th$s,2.3/,will-be
fixed.,even.though,maybe
some,"shoul""dn't",be
(note that I have a quoted value) into
$ cat repaired.csv
th_s,2_3_,will-be
fixed_,even_though,maybe
some,shoul_dn't,be
Right now, your code is reading in the entire text into one big line:
text = input.read()
Starting from a _ character:
newtext = '_'
Looping over every single character in text:
for c in text:
Add the corrected character to newtext (very slowly):
newtext += '_' if c in conversion else c
And then write the original character (?), as a column, to a new csv:
writer.writerow(c)
.. which is unlikely to be what you want. :^)
This doesn't seem to need to deal with CSV's in particular (as long as the special characters aren't your column delimiters).
lines = []
with open('C:/Temp/Data.csv', 'r') as input:
lines = input.readlines()
conversion = '-"/.$'
newtext = '_'
outputLines = []
for line in lines:
temp = line[:]
for c in conversion:
temp = temp.replace(c, newtext)
outputLines.append(temp)
with open('C:/Temp/Data_out1.csv', 'w') as output:
for line in outputLines:
output.write(line + "\n")
In addition to the bug pointed out by #Nisan.H and the valid point made by #dckrooney that you may not need to treat the file in a special way in this case just because it is a CSV file (but see my comment below):
writer.writerow() should take a sequence of strings, each of which would be written out separated by commas (see here). In your case you are writing a single string.
This code is setting up to read from 'C:/Temp/Data.csv' in two ways - through input and through lines but it only actually reads from input (therefore the code does not deal with the file as a CSV file anyway).
The code appends characters to newtext and writes out each version of that variable. Thus, the first version of newtext would be 1 character long, the second 2 characters long, the third 3 characters long, etc.
Finally, given that a CSV file can have quote marks in it, it may actually be necessary to deal with the input file specifically as a CSV to avoid replacing quote marks that you want to keep, e.g. quote marks that are there to protect commas that exist within fields of the CSV file. In that case, it would be necessary to process each field of the CSV file individually, then write each row out to the new CSV file.
Maybe try
s = open('myfile.cv','r').read()
chars = ('$','%','^','*') # etc
for c in chars:
s = '_'.join( s.split(c) )
out_file = open('myfile_new.cv','w')
out_file.write(s)
out_file.close()

Categories

Resources