I have a problem with data exported from SAP. Sometimes there is a line break in the posting text, so what should be one line ends up as two, and this results in a pretty bad data frame.
The most annoying thing is that I am unable to make pandas aware of this problem; it just reads those broken lines even though their column count is smaller than the header's.
An example of a wrong data.txt:
MANDT~BUKRS~BELNR~GJAHR
030~01~0100650326
~2016
030~01~0100758751~2017
You can see that the first data row has a spurious line break after 0100650326: the 2016 belongs to that row. The last row is as it should be.
If I import this file:
data = pd.read_csv(
    path_to_file,
    sep='~',
    encoding='latin1',
    error_bad_lines=True,
    warn_bad_lines=True)
I get this, which is pretty wrong:
MANDT BUKRS BELNR GJAHR
0 30.0 1 100650326.0 NaN
1 NaN 2016 NaN NaN
2 30.0 1 100758751.0 2017.0
Is it possible to fix the wrong line break or to tell pandas to ignore lines where column count is smaller than header?
Just to make it complete. I want to get this:
MANDT BUKRS BELNR GJAHR
0 30 1 100650326 2016
1 30 1 100758751 2017
I tried to open the file and replace '\n' (the line break) with '' (nothing), but this results in a single-line file. That is not what I intended.
You can do some pre-processing to get rid of the unwanted breaks. Below is an example that I tested.
import fileinput

with fileinput.FileInput('input.csv', inplace=True, backup='.orig.bak') as file:
    for line in file:
        print(line.replace('\n', '^'), end='')

with fileinput.FileInput('input.csv', inplace=True, backup='.1.bak') as file:
    for line in file:
        print(line.replace('^~', '~'), end='')

with fileinput.FileInput('input.csv', inplace=True, backup='.2.bak') as file:
    for line in file:
        print(line.replace('^', '\n'), end='')
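If a single pass is preferred, the same repair can be done by buffering each physical line until the row has as many delimiters as the header. This is only a sketch under the assumption that the field count alone identifies a broken row; join_broken_lines is a made-up name, not part of the original post:

```python
def join_broken_lines(lines, sep='~'):
    """Yield logical rows, merging physical lines until each row
    has as many separators as the header line."""
    lines = iter(lines)
    header = next(lines)
    nsep = header.count(sep)          # expected separators per row
    yield header
    buf = ''
    for line in lines:
        buf += line.rstrip('\n')
        if buf.count(sep) >= nsep:    # row complete
            yield buf + '\n'
            buf = ''
    if buf:                           # trailing partial row, passed through as-is
        yield buf + '\n'

rows = list(join_broken_lines([
    'MANDT~BUKRS~BELNR~GJAHR\n',
    '030~01~0100650326\n',
    '~2016\n',
    '030~01~0100758751~2017\n',
]))
# rows[1] == '030~01~0100650326~2016\n'
```

The repaired lines (or a generator over them) can then be fed to pd.read_csv via io.StringIO.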
The correct way would be to fix the file at creation time. If this is not possible, you could pre-process the file or use a wrapper.
Here is a solution using a byte-level wrapper that combines lines until you have the correct number of delimiters. I use a byte-level wrapper to reuse the classes of the io module and add as little code of my own as possible: a RawIOBase reads lines from an underlying byte file object and combines lines until each has the expected number of delimiters (only readinto and readable are overridden).
import io

class csv_wrapper(io.RawIOBase):
    def __init__(self, base, delim):
        self.fd = base              # underlying (byte) file object
        self.nfields = None
        self.delim = ord(delim)     # code of the delimiter (passed as a character)
        self.numl = 0               # current line number, for error reporting
        self._getline()             # load and process the header line

    def _nfields(self):
        # number of delimiters in the current line
        return len([c for c in self.line if c == self.delim])

    def _getline(self):
        while True:
            # load a new line into the internal buffer
            self.line = next(self.fd)
            self.numl += 1
            if self.nfields is None:    # store the number of delimiters if not yet known
                self.nfields = self._nfields()
            else:
                while self.nfields > self._nfields():   # optionally combine lines
                    self.line = self.line.rstrip() + next(self.fd)
                    self.numl += 1
                if self.nfields != self._nfields():     # too many fields here...
                    print("Too many fields at line {}".format(self.numl))
                    continue            # ignore the offending line and proceed
            self.index = 0              # reset line pointers
            self.linesize = len(self.line)
            break

    def readinto(self, b):
        if len(b) == 0:
            return 0
        if self.index == self.linesize:     # if the current buffer is exhausted
            try:                            # read a new one
                self._getline()
            except StopIteration:
                return 0
        n = 0                               # count bytes actually stored
        while n < len(b) and self.index < self.linesize:
            b[n] = self.line[self.index]
            n += 1
            self.index += 1
        return n

    def readable(self):
        return True
You can then change your code to:
data = pd.read_csv(
    csv_wrapper(open(path_to_file, 'rb'), '~'),
    sep='~',
    encoding='latin1',
    error_bad_lines=True,
    warn_bad_lines=True)
I have some structured data in a text file:
Parse.txt
name1
detail:
aaaaaaaa
bbbbbbbb
cccccccc
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
gggggggg
some of the detail4s do not have data and would be replaced by "-":
name2
detail:
aaaaaaaa
bbbbbbbb
cccccccc
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
-
How do I parse the data to get the elements below detail1, detail2 and detail3 of only the records whose detail4 is empty?
So far I have partially working code, but the problem is that it gets each item 40 times. Please help.
Code:
data = []
with open("parse.txt", "r", encoding="utf-8") as text_file:
    for line in text_file:
        data.append(line)

det4li = []
finali = []
for elem, det4 in zip(data, data[1:]):
    if "detail4" in elem:
        det4li.append(det4)
        if "-" in det4:
            for elem1, det1, det2, det3 in zip(data, data[1:], data[3:], data[5:]):
                if "detail1:" in elem1:
                    finali.append(det1.strip() + "," + det2.strip() + "," + det3)
Current Output: 40 records of dddddddd,eeeeeeee,ffffffff
Desired Output: dddddddd,eeeeeeee,ffffffff
Don't try to look ahead. Look behind, by storing preceding data:
final = []
with open("parse.txt", "r", encoding="utf-8") as text_file:
    section = {}
    last_header = None
    for line in text_file:
        line = line.strip()
        if line.startswith('detail'):
            # detail header line, record for later use
            last_header = line.rstrip(':')
        elif not last_header:
            # name line, store as such
            section['name'] = line
        else:
            section[last_header] = line
            if last_header == 'detail4':
                # section complete, process
                if line == '-':
                    # a section we want to keep
                    final.append(section)
                # reset section data
                section, last_header = {}, None
This has the added advantage that you now don't need to read the whole file into memory. If you turn this into a generator (by putting it into a function and replacing the final.append(section) line with yield section), you can even process those matching sections as you read the file without sacrificing readability.
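For illustration, the generator form might look like this; iter_sections and the inline sample data are made up for the sketch, not from the original post:

```python
def iter_sections(lines):
    """Yield each complete section whose detail4 value is '-'."""
    section, last_header = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith('detail'):
            last_header = line.rstrip(':')     # detail header line
        elif not last_header:
            section['name'] = line             # name line
        else:
            section[last_header] = line
            if last_header == 'detail4':       # section complete
                if line == '-':
                    yield section              # a section we want to keep
                section, last_header = {}, None

sample = ["name2", "detail:", "aaaa", "detail1:", "dddd",
          "detail2:", "eeee", "detail3:", "ffff", "detail4:", "-"]
matches = list(iter_sections(sample))
# matches[0]['detail1'] == 'dddd'
```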
I have a text file which contains the data like this
AA 331
line1 ...
line2 ...
% information here
AA 332
line1 ...
line2 ...
line3 ...
%information here
AA 1021
line1 ...
line2 ...
% information here
AA 1022
line1 ...
% information here
AA 1023
line1 ...
line2 ...
% information here
I want to perform an action only for the "information" blocks that come after the smallest integer of each group, i.e. after lines "AA 331" and "AA 1021", and not after lines "AA 332", "AA 1022" and "AA 1023".
P.S. This is just a sample of a large file.
With the code below I try to parse the text file: the first function collects the integers that come after "AA" into a list "list1", and the second function groups them to get the minimal value of each group into "list2". This returns integers like [331, 1021, ...]. So I thought of extracting the lines that come after "AA 331" and performing the action, but I don't know how to proceed.
from itertools import groupby

def getlineindex(textfile):
    with open(textfile) as infile:
        list1 = []
        for line in infile:
            if line.startswith("AA"):
                intid = int(line[3:])
                list1.append(intid)
        return list1

def minimalinteger(list1):
    list2 = []
    for k, v in groupby(list1, key=lambda x: x // 10):
        minimalint = min(v)
        list2.append(minimalint)
    return list2
list2 contains the smallest integers that come after "AA": [331, 1021, ...]
You can use something like:
import re

matcher = re.compile(r"AA (\d+)")
already_was = []
good_block = False
with open(filename) as f:
    for line in f:
        m = matcher.match(line)
        v = int(m.group(1)) // 10 if m else None
        if m and v not in already_was:
            good_block = True
            already_was.append(v)
        elif m and v in already_was:
            good_block = False
        if not m and good_block:
            do_action()
This code only works if the first value in each group is the minimal one.
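One way around that caveat (a sketch, not from the answer above; the sample lines are made up): first collect the minimal id of each group of ten in one pass, then stream the lines again and only act on blocks that follow one of those ids:

```python
import re
from itertools import groupby

matcher = re.compile(r"AA (\d+)")

def minimal_ids(lines):
    """Return the smallest run id of each consecutive group of ten."""
    ids = [int(m.group(1)) for m in map(matcher.match, lines) if m]
    return {min(v) for _, v in groupby(ids, key=lambda x: x // 10)}

lines = ["AA 332", "info x", "AA 331", "info y", "AA 1021", "info z"]
good = minimal_ids(lines)        # {331, 1021}, even though 332 comes first
picked = []
active = False
for line in lines:
    m = matcher.match(line)
    if m:
        active = int(m.group(1)) in good   # process the following block?
    elif active:
        picked.append(line)
# picked == ['info y', 'info z']
```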
Okay, here's my solution. At a high level, I go line by line, watching for AA lines to know when I've found the start/end of a data block, and watch what I call the run number to know whether or not we should process the next block. Then, I have a subroutine that handles any given block, basically reading off all relevant lines and processing them if needed. That subroutine is what watches for the next AA line in order to know when it's done.
import re

runIdRegex = re.compile(r'AA (\d+)')

def processFile(fileHandle):
    lastNumber = None  # Last run number, so we know if there's been a gap or if we're in a new block of ten.
    line = fileHandle.next()
    while line is not None:  # None is used as a special value indicating we've hit the end of the file.
        processData = False
        match = runIdRegex.match(line)
        if match:
            runNumber = int(match.group(1))
            if lastNumber is None:
                # Startup/first iteration
                processData = True
            elif runNumber - lastNumber == 1:
                # Continuation, see if the tens are the same.
                lastNumberTens = lastNumber / 10
                runNumberTens = runNumber / 10
                if lastNumberTens != runNumberTens:
                    processData = True
            else:
                processData = True
            # Always remember where we were.
            lastNumber = runNumber
            # And grab and process data.
            line = dataBlock(fileHandle, process=processData)
        else:
            try:
                line = fileHandle.next()
            except StopIteration:
                line = None

def dataBlock(fileHandle, process=False):
    runData = []
    try:
        line = fileHandle.next()
        match = runIdRegex.match(line)
        while not match:
            runData.append(line)
            line = fileHandle.next()
            match = runIdRegex.match(line)
    except StopIteration:
        # Hit end of file
        line = None
    if process:
        # Data processing call here
        # processData(runData)
        pass
    # Return line so we don't lose it!
    return line
Some notes for you. First, I'm in agreement with Jimilian that you should use a regular expression to match AA lines.
Second, the logic we talked about with regard to when we should process data is in processFile. Specifically these lines:
processData = False
match = runIdRegex.match(line)
if match:
    runNumber = int(match.group(1))
    if lastNumber is None:
        # Startup/first iteration
        processData = True
    elif runNumber - lastNumber == 1:
        # Continuation, see if the tens are the same.
        lastNumberTens = lastNumber / 10
        runNumberTens = runNumber / 10
        if lastNumberTens != runNumberTens:
            processData = True
    else:
        processData = True
I assume we don't want to process data, then identify when we do. Logically speaking, you can do the inverse: assume you want to process data, then identify when you don't. Next, we need to store the last run's value in order to know whether or not we need to process this run's data (and watch out for that first-run edge case). We know we want to process data when the sequence is broken (the difference between two runs is greater than 1), which is handled by the else statement. We also know we want to process data when the sequence increments the digit in the tens place, which is handled by my integer divide by 10.
Third, watch out for the return value of dataBlock. If you don't capture it, you're going to lose the AA line that caused dataBlock to stop iterating, and processFile needs that line in order to know whether the next data block should be processed.
Last, I've opted to use fileHandle.next() and exception handling to identify when I get to the end of the file. But don't think this is the only way. :)
Let me know in comments if you have any questions.
I've got some data in a binary file that I need to parse. The data is separated into chunks of 22 bytes, so I'm trying to generate a list of tuples, each tuple containing 22 values. The file isn't separated into lines though, so I'm having problems figuring out how to iterate through the file and grab the data.
If I do this it works just fine:
nextList = f.read(22)
newList = struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", nextList)
where newList contains a tuple of 22 values. However, if I try to apply similar logic to a function that iterates through, it breaks down.
def getAllData():
    listOfAll = []
    nextList = f.read(22)
    while nextList != "":
        listOfAll.append(struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", nextList))
        nextList = f.read(22)
    return listOfAll

data = getAllData()
data = getAllData()
gives me this error:
Traceback (most recent call last):
File "<pyshell#27>", line 1, in <module>
data = getAllData()
File "<pyshell#26>", line 5, in getAllData
listOfAll.append(struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", nextList))
struct.error: unpack requires a bytes object of length 22
I'm fairly new to python so I'm not too sure where I'm going wrong here. I know for sure that the data in the file breaks down evenly into sections of 22 bytes, so it's not a problem there.
Since you reported that the loop was still running when len(nextList) == 0, this is probably because nextList (which isn't a list, by the way) is an empty bytes object, which is not equal to an empty string object:
>>> b"" == ""
False
and so the condition in your line
while nextList != "":
is never true, even when nextList is empty. That's why using len(nextList) != 22 as a break condition worked, and even
while nextList:
should suffice.
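As a sketch of that fix, iter() with a b'' sentinel keeps reading fixed-size records until read() returns empty; the in-memory file and get_all_data name here are just for demonstration:

```python
import io
import struct

def get_all_data(f, size=22):
    """Read fixed-size binary records until read() returns b''."""
    records = []
    for chunk in iter(lambda: f.read(size), b''):   # stop on empty bytes
        records.append(struct.unpack("%dB" % size, chunk))
    return records

buf = io.BytesIO(bytes(range(44)))   # two fake 22-byte records
data = get_all_data(buf)
# data[0] == tuple(range(22))
```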
read(22) isn't guaranteed to return a string of length 22. Its contract is to return a string of any length between 0 and 22 (inclusive); a string of length zero indicates there is no more data to be read. In Python 3, file objects opened in binary mode produce bytes objects instead of str, and str and bytes are never considered equal.
If your file is small-ish, then you'd be better off reading the entire file into memory and then splitting it into chunks, e.g.
listOfAll = []
data = f.read()
for i in range(0, len(data), 22):
    t = struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", data[i:i+22])
    listOfAll.append(t)
Otherwise you will need to do something more complicated with checking the amount of data you get back from the read.
def dataiter(f, chunksize=22, buffersize=4096):
    data = b''
    while True:
        newdata = f.read(buffersize)
        if not newdata:  # end of file
            if data:
                yield data
                # or raise an error, as 0 < len(data) < chunksize
                # or pad with zeros to chunksize
            return
        data += newdata
        i = 0
        while len(data) - i >= chunksize:
            yield data[i:i+chunksize]
            i += chunksize
        data = data[i:]  # keep the remainder of unused data (b'' if all was used)
So the question gives me 19 DNA sequences and asks me to make a basic text table. The first column is the sequence ID, the second the length of the sequence, the third the number of "A"s, the 4th "G"s, the 5th "C"s, the 6th "T"s, the 7th %GC, and the 8th whether or not the sequence contains "TGA". I then take all these values and write the table to "dna_stats.txt".
Here is my code:
fh = open("dna.fasta", "r")
Acount = 0
Ccount = 0
Gcount = 0
Tcount = 0
seq = 0
alllines = fh.readlines()
for line in alllines:
    if line.startswith(">"):
        seq += 1
        continue
    Acount += line.count("A")
    Ccount += line.count("C")
    Gcount += line.count("G")
    Tcount += line.count("T")
    genomeSize = Acount + Gcount + Ccount + Tcount
    percentGC = (Gcount + Ccount) * 100.00 / genomeSize
    print "sequence", seq
    print "Length of Sequence", len(line)
    print Acount, Ccount, Gcount, Tcount
    print "Percent of GC", "%.2f" % (percentGC)
    if "TGA" in line:
        print "Yes"
    else:
        print "No"
    fh2 = open("dna_stats.txt", "w")
    for line in alllines:
        splitlines = line.split()
        lenstr = str(len(line))
        seqstr = str(seq)
        fh2.write(seqstr + "\t" + lenstr + "\n")
I found that you have to convert the variables into strings. All of the values are calculated correctly when I print them in the terminal. However, I keep getting only 19 in the first column, when it should go 1, 2, 3, 4, 5, etc. to represent all of the sequences. I tried it with the other variables and just got the totals for the whole file. I started making the table but have not finished it.
So my biggest issue is that I don't know how to get the values for the variables for each specific line.
I am new to python and programming in general so any tips or tricks or anything at all will really help.
I am using python version 2.7
Well, your biggest issue:
for line in alllines: #1
    ...
    fh2 = open("dna_stats.txt","w")
    for line in alllines: #2
        ....
Indentation matters. This says "for every line (#1), open a file and then loop over every line again(#2)..."
De-indent those things.
This puts the info in a dictionary as you go and allows DNA sequences to span multiple lines:
from __future__ import division  # ensure things like 1/2 are 0.5 rather than 0
from collections import defaultdict

fh = open("dna.fasta", "r")
alllines = fh.readlines()
fh2 = open("dna_stats.txt", "w")
seq = 0
data = dict()
for line in alllines:
    if line.startswith(">"):
        seq += 1
        data[seq] = defaultdict(int)  # default value is zero if a key is not present, so we can do += without initializing
        data[seq]['seq'] = seq
        previous_line_end = ""  # "TGA" might be split across lines
        continue
    data[seq]['Acount'] += line.count("A")
    data[seq]['Ccount'] += line.count("C")
    data[seq]['Gcount'] += line.count("G")
    data[seq]['Tcount'] += line.count("T")
    line_over = previous_line_end + line[:3]
    data[seq]['hasTGA'] = data[seq]['hasTGA'] or ("TGA" in line) or ("TGA" in line_over)
    previous_line_end = line[-4:].strip()  # save for the next line, removing the newline character
for seq in data.keys():
    data[seq]['genomeSize'] = (data[seq]['Acount'] + data[seq]['Gcount']
                               + data[seq]['Ccount'] + data[seq]['Tcount'])
    data[seq]['percentGC'] = (data[seq]['Gcount'] + data[seq]['Ccount']) * 100.00 / data[seq]['genomeSize']
    s = '%(seq)d, %(genomeSize)d, %(Acount)d, %(Ccount)d, %(Gcount)d, %(Tcount)d, %(percentGC).2f, %(hasTGA)s\n'
    fh2.write(s % data[seq])
fh.close()
fh2.close()
I am trying to write a Python program that reads each line from an infile. This infile is a list of dates. I want to test each line with a function isValid(), which returns true if the date is valid, and false if it is not. If the date is valid, it is written into an output file. If it is not, invalid is written into the output file. I have the function, and all I want to know is the best way to test each line with the function. I know this should be done with a loop, I'm just uncertain how to set up the loop to test each line in the file one-by-one.
Edit: I now have a program that basically works, but I'm getting strange results in the output file. Hopefully someone with Python 3 experience can explain why.
def main():
    datefile = input("Enter filename: ")
    t = open(datefile, "r")
    c = t.readlines()
    ofile = input("Enter filename: ")
    o = open(ofile, "w")
    for line in c:
        b = line.split("/")
        e = b[0]
        f = b[1]
        g = b[2]
        text = str(e) + " " + str(f) + ", " + str(g)
        text2 = "The date " + text + " is invalid"
        if isValid(e, f, g) == True:
            o.write(text)
        else:
            o.write(text2)

def isValid(m, d, y):
    if m == 1 or m == 3 or m == 5 or m == 7 or m == 8 or m == 10 or m == 12:
        if d is range(1, 31):
            return True
    elif m == 2:
        if d is range(1, 28):
            return True
    elif m == 4 or m == 6 or m == 9 or m == 11:
        if d is range(1, 30):
            return True
    else:
        return False
This is the output I'm getting.
The date 5 19, 1998
is invalidThe date 7 21, 1984
is invalidThe date 12 7, 1862
is invalidThe date 13 4, 2000
is invalidThe date 11 40, 1460
is invalidThe date 5 7, 1970
is invalidThe date 8 31, 2001
is invalidThe date 6 26, 1800
is invalidThe date 3 32, 400
is invalidThe date 1 1, 1111
is invalid
In recent versions of Python you can use the context-management support that file objects provide:
results = list()
with open(some_file) as f:
    for line in f:
        if isValid(line, date):
            results.append(line)
... or even more tersely with a list comprehension:
with open(some_file) as f:
    results = [line for line in f if isValid(line, date)]
For progressively older versions of Python you might need to open and close the file explicitly (with simple implicit iteration over the file: for line in file:), or use more explicit iteration (f.readline() or f.readlines(), depending on whether you want to "slurp" in the entire file, with the memory overhead that implies, or iterate line by line).
Also note that you may wish to strip the trailing newlines off these file contents (perhaps by calling line.rstrip('\n') --- or possibly just line.strip() if you want to eliminate all leading and trailing whitespace from each line).
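For example, stripping the newline before splitting a date line (a minimal illustration with a made-up input line):

```python
line = "05/19/1998\n"
m, d, y = line.rstrip('\n').split('/')   # drop the trailing newline first
print((m, d, y))  # ('05', '19', '1998')
```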
(Edit based on additional comment to previous answer):
The function signature isValid(m, d, y) suggests that you're passing a date to this function (month, day, year), but that doesn't make sense given that you must also, somehow, pass in the data to be validated (a line of text, a string, etc.).
To help you further you'll have to provide more information (preferably the source, or a relevant portion of it, for this isValid() function).
In my initial answer I assumed that your isValid() function was merely scanning for any valid date in its single argument. I've modified my code examples to show how one might pass a specific date, as a single argument, to a function with the calling signature isValid(some_data, some_date).
with open(fname) as f:
    for line in f.readlines():
        test(line)