Reading CSV with newlines within row - python

I'm using Pandas to read a CSV file, but some of the rows continue on the next line, so the closing quote character (") ends up at the start of the next line. I'm trying to find a parameter for 'pd.read_csv' that will fix it, but after reading the documentation I'm still not sure there is one.
Ex:
"0","","0","0","0","0","0
","0","0","0","0","0","0"
In other words,
"0","","0","0","0","0","0\r\n","0","0","0","0","0","0"

import csv
with open(file, 'rU') as myfile:
    # drop the embedded newlines so each logical record closes on a single line
    filtered = (line.replace('\n', '') for line in myfile)
    for row in csv.reader(filtered):
        print(row)
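If the end goal is still a DataFrame, one option is to apply that same newline filter and hand the parsed rows to pandas. A minimal sketch, assuming the stray line breaks only ever occur inside quoted fields and that path (a placeholder name) points at the file:

import csv
import pandas as pd

path = 'broken.csv'  # hypothetical file name

with open(path, 'r', newline='') as myfile:
    # strip the physical line breaks so the quoted field split across two lines closes cleanly
    filtered = (line.replace('\r', '').replace('\n', '') for line in myfile)
    rows = list(csv.reader(filtered))

df = pd.DataFrame(rows)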

Related

How to know CSV line count before loading in python?

I am new to Python and have a requirement to load dataframes from various CSV files. It turns out that there is business logic that depends on the number of rows in the csv. Can I know the total row count of a CSV beforehand, without calling read_csv?
yes, you can:
lines = sum(1 for line in open('/path/to/file.csv'))
but be aware that Pandas will read the whole file again
If you are sure that the whole file will fit into memory, you can read it once and reuse the text:
import io
import pandas as pd

with open('/path/to/file.csv') as f:
    data = f.readlines()
lines = len(data)
df = pd.read_csv(io.StringIO(''.join(data)), ...)
You can read the file without saving the contents. Try:
with open(filename, "r") as filehandle:
    number_of_lines = len(filehandle.readlines())

Extracting a substring from string in Python and putting it to a file

I have a file in the following format
name#company.com, information
name#company2.com, information
....
What I need to do is read in the file and output only the email addresses to a file. I have created the following code:
with open('n-emails.txt') as f:
    lines = f.readlines()
    print(lines)
Can someone please show me how to get only the email part of the file and how to output it to a file? This is all done on a Mac.
Two different ways of doing it:
Without the csv module: read each line, split on the comma, strip the blanks, and print:
with open('n-emails.txt') as f:
    for line in f:
        toks = line.split(",")
        if toks:
            print(toks[0].strip())
With the csv module: wrap the opened file in a csv reader, iterate over the rows, and print the first (stripped) field of each row:
import csv

with open('n-emails.txt') as f:
    cr = csv.reader(f, delimiter=",")
    for row in cr:
        print(row[0].strip())
The second method has the advantage of being robust to commas contained in cells, quoting, and so on; that's why I recommend it.
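Since the question also asks how to write the result out, here is a short sketch of the same csv-based approach that writes to a file (the output name emails-only.txt is just a placeholder):

import csv

with open('n-emails.txt') as f, open('emails-only.txt', 'w') as out:
    for row in csv.reader(f, delimiter=","):
        if row:
            # keep only the first column, i.e. the email address
            out.write(row[0].strip() + "\n")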

From excel to txt - Separate lines

I'm writing a program where I export an Excel file to .txt and then have to import that .txt file into my program. The main goal is to extract the same part from each line, but the problem is that in the .txt file the lines of the Excel sheet end up as one huge string with no \n. Do you know if there is a way to separate them within the program, and if so, how can I do it?
The file I'm working with can be downloaded in http://we.tl/YtixI1ck6l
and so far I was trying something like
ppi = []
for line in read_text:
    prot_interaction = line[0:14]
    ppi.append(prot_interaction)

result_ppi = []
for line in read_text:
    result = line[-1]
    result_ppi.append(result)
But since the text isn't split into lines, just one long string, I'm not getting any good results.
Using that file as an example, use the csv module to parse it.
Example:
import csv

with open('/tmp/Model_Oralome.txt', 'rU') as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        print(row[0])
Prints:
ppi
C4FQL5;Q08426
C8PB60;D2NP19
P40189;Q05655
P22712;Q9NR31
...
P05783;P02751
B5E709;D2NPK7
Q8N7J2;Q9UKZ4
(BTW, the issue you may be having with this particular file is the line terminations are a CR only from a Mac Classic OS. You can fix that in Python by using the Universal Newline mode when you open the file...)
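In Python 3 the 'rU' mode is deprecated; a rough equivalent (a sketch, reusing the same /tmp/Model_Oralome.txt path) is to open the file with newline='' and let the csv module deal with the CR-only line endings itself:

import csv

with open('/tmp/Model_Oralome.txt', newline='') as f:
    # the csv reader recognises \r, \n and \r\n as record terminators
    for row in csv.reader(f, delimiter="\t"):
        print(row[0])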
Excel is exporting the text file with carriage returns (\r) instead of newlines (\n).
ppi = []
with open("Model_Oralome.txt", 'r') as f:
    lines = f.readlines()
    lines = lines[0].split('\r')
From here you can iterate through each line of lines. Since it looks like you want the value of the first column:
lines = lines[1:]
for line in lines:
    content = line.split('\t')
    ppi.append(content[0])

Generating CSV and blank line

I am generating and parsing CSV files and I'm noticing something odd.
When the CSV gets generated, there is always an empty line at the end, which is causing issues when subsequently parsing them.
My code to generate is as follows:
with open(file, 'wb') as fp:
    a = csv.writer(fp, delimiter=",", quoting=csv.QUOTE_NONE, quotechar='')
    a.writerow(["Id", "Building", "Age", "Gender"])
    results = get_results()
    for val in results:
        if any(val):
            a.writerow(val)
It doesn't show up via the command line, but I do see it in my IDE/text editor
Does anyone know why it is doing this?
Could it be possible whitespace?
Is the problem the line terminator? It could be as simple as changing one line:
a = csv.writer(fp, delimiter=",", quoting=csv.QUOTE_NONE, quotechar='', lineterminator='\n')
I suspect this is it, since csv.writer defaults to using carriage return + line feed ("\r\n") as the line terminator. The program you are using to read the file might be expecting just a line feed ("\n"). This is common when switching files back and forth between *nix and Windows.
If this doesn't work, then the program you are using to read the file seems to expect no line terminator after the last row, and I'm not sure the csv module supports that. For that, you could write the csv to a StringIO, strip() it, and then write that to your file.
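A sketch of that StringIO idea, with placeholder data in rows (the question's quoting options are left out for brevity):

import csv
import io

rows = [["Id", "Building", "Age", "Gender"], ["1", "A", "34", "F"]]  # placeholder rows

buf = io.StringIO()
writer = csv.writer(buf, delimiter=",", lineterminator="\n")
writer.writerows(rows)

# strip() removes the terminator after the last row before the text hits disk
with open("output.csv", "w", newline="") as fp:
    fp.write(buf.getvalue().strip())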
Also, since you are not quoting anything, is there a reason to use csv at all? Why not:
with open(file, 'wb') as fp:
    fp.write("\n".join(",".join(field for field in record) for record in get_results()))

Replace a word in a file

I am new to Python programming...
I have a .txt file....... It looks like..
0,Salary,14000
0,Bonus,5000
0,gift,6000
I want to replace the first '0' value with '1' in each line. How can I do this? Can anyone help me, with sample code?
Thanks in advance.
Nimmyliji
I know that you're asking about Python, but forgive me for suggesting that perhaps a different tool is better for the job. :) It's a one-liner via sed:
sed 's/^0,/1,/' yourtextfile.txt > output.txt
This applies the regex /^0,/ (which matches any 0, that occurs at the beginning of a line) to each line and replaces the matched text with 1, instead. The output is redirected into the specified file, output.txt.
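If you'd rather stay in Python, the same substitution can be sketched with the standard re module (file names follow the sed example above):

import re

with open("yourtextfile.txt") as src, open("output.txt", "w") as dst:
    for line in src:
        # replace a leading "0," with "1,", leaving the rest of the line untouched
        dst.write(re.sub(r"^0,", "1,", line))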
inFile = open("old.txt", "r")
outFile = open("new.txt", "w")
for line in inFile:
    outFile.write(",".join(["1"] + line.split(",")[1:]))
inFile.close()
outFile.close()
If you would like something more general, take a look at the Python csv module. It contains utilities for processing comma-separated values (abbreviated as csv) in files. It can work with an arbitrary delimiter, not only the comma, so since your sample is obviously a csv file, you can use it as follows:
import csv

with open("old.txt") as src, open("new.txt", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerows(["1"] + line[1:] for line in reader)
To overwrite original file with new one:
import os
os.remove("old.txt")
os.rename("new.txt", "old.txt")
I think that writing to a new file and then renaming it is more fault-tolerant and less likely to corrupt your data than overwriting the source file directly. Imagine that your program raises an exception after the source file has already been read into memory and reopened for writing: you would lose the original data, and your new data wouldn't be saved because of the crash. With this approach, you only lose the new data while preserving the original.
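If you want the final swap to be a single step, os.replace (Python 3.3+) overwrites the target in one rename, which is atomic on the same filesystem on POSIX; a small sketch following the file names above:

import os

# new.txt must be fully written and closed before the swap
os.replace("new.txt", "old.txt")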
o = open("output.txt", "w")
for line in open("file"):
    s = line.split(",")
    s[0] = "1"
    o.write(','.join(s))
o.close()
Or you can use fileinput with in-place editing:
import fileinput

for line in fileinput.FileInput("file", inplace=1):
    s = line.split(",")
    s[0] = "1"
    # the line still carries its original newline, so suppress print's own
    print(','.join(s), end='')
f = open(filepath, 'r')
data = f.readlines()
f.close()
edited = []
for line in data:
    edited.append('1' + line[1:])
f = open(filepath, 'w')
f.writelines(edited)
f.flush()
f.close()
Or in Python 2.5+:
with open(filepath, 'r') as f:
    data = f.readlines()
with open(outfilepath, 'w') as f:
    for line in data:
        f.write('1' + line[1:])
This should do it. I wouldn't recommend it for a truly big file though ;-)
What is going on (ex 1):
1: Open the file in read mode
2,3: Read all the lines into a list (each line is a separate index) and close the file.
4,5,6: Iterate over the list, constructing a new list where each line has its first character replaced by a 1. The line[1:] slices the string from index 1 onward, and we concatenate the '1' with the truncated string.
7-10: Reopen the file in write mode, write the list to the file (overwriting it), flush the buffer, and close the file handle.
In Ex. 2:
I use the with statement, which closes the file handles automatically, but do essentially the same thing.
