I have a csv file in the format:
text label
it was incredible!! 1
the politician was exhausted 0
'and so was little Sebastian!' 0
I am trying to read it using pandas:
train = pd.read_csv("myfile.csv", header=0, delimiter="\t", quoting=3)
print(train.shape)
Printing the shape of train gives me double the number of lines originally present in the CSV file.
The problem I found is that alternate lines in the DataFrame train are being split at the newline character, so that train["text"][0] gives:
"it was incredible!!"
train["text"][1] gives:
"
The same happens for every alternate pair of lines, which results in double the original size. I figured the likely reason is that before writing my list of tuples, i.e. mylist = [(text, '1'), (text, '0'), ...], to the CSV file, printing mylist[0] gives:
('it was incredible \n', '1')
Similarly, mylist[2] would give:
(" 'and so was little Sebastian! '\n", '0')
i.e. a '\n' is somehow appended at the end of each text. Is there any way to prevent these line splits by eliminating '\n' character?
You can slice off the last character using [:-1]:
line = 'x,y,z\n'
print(line[:-1])  # Out: x,y,z
Or replace '\n' with '':
line = line.replace('\n', '')
What you want might be to strip your train data of any trailing newline characters, which can be done for strings with the Python string method rstrip:
.rstrip('\n')
Similarly, for a pandas Series the method is:
pandas.Series.str.strip()
(See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.strip.html)
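As a minimal sketch using plain strings (the sample rows are taken from the question), rstrip removes the trailing newlines before the data ever reaches pandas:

```python
# rstrip('\n') removes trailing newline characters from each string
rows = ["it was incredible!!\n", "'and so was little Sebastian!'\n"]
cleaned = [row.rstrip('\n') for row in rows]
print(cleaned[0])  # it was incredible!!
```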
I will give you an idea:
test = "hi \n"
print(test)
print(test[:-1])
With [:-1] you slice off the last character.
Related
I have a complete_list_of_records which has a length of 550
this list would look something like this:
Apples
Pears
Bananas
The issue is that when I use:
with open("recordedlines.txt", "a") as recorded_lines:
    for i in complete_list_of_records:
        recorded_lines.write(i)
the resulting file is 393 lines long and in some places the structure looks like this:
Apples
PearsBananas
Pineapples
I have tried with "w" instead of "a" (append) and manually inserted "\n" for each item in the list, but this just creates blank lines on every second row, and some rows still have the same issue with two records on one line.
Anyone who has encountered something similar?
From the comments seen so far, I think there are strings in the source list that contain newline characters in positions other than at the end. Also, it seems that some strings end with newline character(s) but not all.
I suggest replacing embedded newlines with some other character - e.g., underscore.
Therefore I suggest this:
with open("recordedlines.txt", "w") as recorded_lines:
    for line in complete_list_of_records:
        line = line.rstrip()              # remove trailing whitespace
        line = line.replace('\n', '_')    # replace any embedded newlines with underscore
        print(line, file=recorded_lines)  # the print function will add a newline
You could simply strip all whitespaces off in any case and then insert a newline per hand like so:
with open("recordedlines.txt", "a") as recorded_lines:
    for i in complete_list_of_records:
        recorded_lines.write(i.strip() + "\n")
You need to use file.writelines(listOfRecords), but the list values must end with '\n':
f = open("demofile3.txt", "a")
li = ["See you soon!", "Over and out."]
li = [i+'\n' for i in li]
f.writelines(li)
f.close()
#open and read the file after the appending:
f = open("demofile3.txt", "r")
print(f.read())
output will be
See you soon!
Over and out.
You can also use a for loop with write(), adding '\n' at each iteration.
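For instance, that loop form might look like this (reusing demofile3.txt from the example above):

```python
# Equivalent loop form: write each record and append the newline yourself
records = ["See you soon!", "Over and out."]
with open("demofile3.txt", "w") as f:
    for rec in records:
        f.write(rec + "\n")
```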
complete_list_of_records =['1.Apples','2.Pears','3.Bananas','4.Pineapples']
with open("recordedlines.txt", "w") as recorded_lines:
    for i in complete_list_of_records:
        recorded_lines.write(i + "\n")
This should work. Make sure that what you write is a string.
Issue
Hello all,
in a text file I need to replace an unknown string with another. To find it, I first need to find the line before it, 'name Blur2', since there are many lines beginning with 'xpos':
name Blur2
xpos 12279 # 12279 is the end of line to find and put in a variable
Code to get the unknown string:
#string to find:
keyString = ' name Blur2'
f2 = open("output_file.txt", 'w+')
with open("input_file.txt", 'r+') as f1:
    lines = f1.readlines()
    for i in range(0, len(lines)):
        line = lines[i]
        if keyString in line:
            nextLine = lines[i + 1]
            print(' nextLine: ', nextLine)  # result: nextLine: xpos 12279
            number = nextLine.rsplit(' xpos ', 1)[1]
            print(' number: ', number)  # result: number: 12279
            # convert the string to an int and add 10:
            newString = '{0}\n'.format(int(number) + 10)
            print(' newString: ', newString)  # result: newString: 12289
            f2.write(nextLine.replace(number, newString))  # this line isn't working
f2.close()
So I changed my approach completely, but the last line (f2.write...) isn't working as expected. Does anyone know why?
thanks again for your help :)
Regex seems like it would help; see https://regex101.com/.
Regex searches a string using a small pattern language. I listed the most important parts of this pattern below; regex is sometimes a better alternative to Python's native string manipulation.
You first describe the pattern you will be using, then compile it. I defined the pattern as a raw string using r'', which means I don't have to double-escape \ inside the string (a literal backslash in a regex is r'\\' rather than '\\\\').
There are a couple of parts to this regex.
\s for whitespace(characters like space, ' ')
\n and \r are newline and carriage return. [^...] defines a set of characters not to match, so [^\n\r] matches anything that is not a newline or carriage return. The * means zero or more of the preceding element, and $ anchors the match at the end of the line.
So the pattern searches specifically for 'name Blur2' followed by any number of whitespace characters and a newline. The parentheses make this group 1 (explained below). The second part, ([^\n\r]*$), captures any number of characters that aren't a newline or carriage return, up to the end of that line.
Groups correspond to the parentheses, so (name Blur2\s*\n) is group 1 and the line you want replaced, ([^\n\r]*$), is group 2. checkre.sub replaces the whole match with group 1 plus the new string, so the first line is kept as-is and the second line becomes your new string:
import re
check = r'(name Blur2\s*\n)([^\n\r]*$)'
checkre = re.compile(check, re.MULTILINE)
result = checkre.sub(r'\g<1>' + newstring, filestring)
You need to set re.MULTILINE since you're checking multiple lines. If the '\n' might not be there on the last line, you could match (\n|\Z) instead, which accepts either a newline or the absolute end of the string.
rioV8's comment works, but you could also use '.{5}$', which matches any 5 characters before the end of the line. That can be handy inside a regex.
It should be possible to get the old string with
oldstring = checkre.search(filestring).group(2)
I have not played with span yet, but
stringmatch = checkre.search(filestring)
oldstring = stringmatch.group(2)
newfilestring = filestring[:stringmatch.span()[0]] + stringmatch.group(1) + newstring + filestring[stringmatch.span()[1]:]
should be pretty close to what you're looking for, although the splice may not be exactly correct.
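Putting the pieces together, here is a runnable sketch on a small in-memory string (the file contents are assumed from the question, and the replacement value ' xpos 12289' is just an example):

```python
import re

# Hypothetical sample mirroring the question's file
text = "name Blur2\n xpos 12279\nother line\n"

# group 1: the 'name Blur2' line; group 2: the following xpos line
pattern = re.compile(r'(name Blur2\s*\n)([^\n\r]*$)', re.MULTILINE)

# keep group 1 unchanged and substitute a new xpos line for group 2
result = pattern.sub(lambda m: m.group(1) + ' xpos 12289', text)
print(result)  # name Blur2 / xpos 12289 / other line
```

Using a lambda as the replacement sidesteps any backreference-escaping issues when the new string contains digits.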
The initial program was pretty close. I edited a little bit of it to tweak a few things that were wrong.
You weren't initially writing the lines that didn't need to be replaced, and I'm not sure why you needed to join things; replacing the number directly seems to work. Rebinding i inside a for loop has no effect on the iteration in Python, and you need to skip one line so it isn't written to the file twice, so I changed it to a while loop. Anyway, ask any questions you have, but the code below seems to work.
#string to find:
keyString = ' name Blur2'
f2 = open("output_file.txt", 'w+')
with open("test.txt", 'r+') as f1:
    lines = f1.readlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if keyString in line:
            f2.write(line)
            nextLine = lines[i + 1]
            # end of necessary 'i' reads; increment i to avoid writing the replaced line twice
            i += 1
            print(' nextLine: ', nextLine)  # result: nextLine: xpos 12279
            number = nextLine.rsplit(' xpos ', 1)[1]
            # as was said in a comment, this could also be number = nextLine[-5:]
            print(' number: ', number)  # result: number: 12279
            # convert the string to an int, add 10, and format with a trailing newline:
            newString = '{0}\n'.format(int(number) + 10)
            print(' newString: ', newString)  # result: newString: 12289
            f2.write(nextLine.replace(number, newString))
        else:
            f2.write(line)
        i += 1
f2.close()
I have a huge text file in which all the numbers are separated by a mix of spaces and tabs, with a comma as the decimal separator. The first column is in scientific notation and the next ones are plain numbers, also with commas. Here is the first row as an example:
0,0000000E00 -2,7599284 -1,3676726 -1,7231264 -1,0558825 -1,8871096 -3,0763804 -3,2206187 -3,2308111 -2,3147060 -3,9572818 -4,0232415 -4,2180738
The file is so huge that Notepad++ can't process it to convert the "," to ".".
So what I do is :
with open(file) as fp:
    line = fp.readline()
    cnt = 1
    while line:
        digits = re.findall(r'([\d.:]+)', line)
        s = line
        s = s.replace('.', '').replace(',', '.')
        number = float(s)
        cnt += 1
        line = fp.readline()
I even tried to use digits, but that splits the first column into two numbers.
And below is the error I get from the .replace/float approach. What I would have preferred is to convert the commas to dots without being tripped up by formats like scientific notation. I appreciate your help.
ValueError: could not convert string to float: ' 00000000E00
\t-29513521 \t-17002219 \t-22375536 \t-14994097
\t-24163610 \t-34076621 \t-31233623 \t-32341597
\t-24724552 \t-42434935 \t-43454237 \t-44885144
\n'
I have also included what the input looks like in the txt file and what I need in the output (CSV format).
input seems like this :
first line :
between 1st and 2nd column : 3 spaces + 1 Tab
between rest of columns : 6 spaces + 1 Tab
second line and on :
between 1st and 2nd column : 2 spaces + 1 Tab
between rest of columns : 6 spaces + 1 Tab
this is a screenshot of the txt input file:
Attention: there is one space at the beginning of each line
and what I want as output is csv file with separated columns with " ; "
You may try reading the entire file into a Python string, and then doing a global replacement of comma to dot:
data = ""
with open('nums.csv', 'r') as file:
    data = file.read().replace(',', '.').replace(' ', ';')

with open("nums_out.csv", "w") as out_file:
    out_file.write(data)
For a possibly more robust solution, should there exist the possibility that two columns could be separated by multiple whitespace characters, use re.sub:
data = ""
with open('nums.csv', 'r') as file:
    data = file.read().replace(',', '.')
data = re.sub(r'^[^\S\r\n]+', '', data, flags=re.MULTILINE)
data = re.sub(r'(?<=\S)[^\S\r\n]+', ';', data)
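A one-row demonstration of the comma-to-dot plus whitespace-to-semicolon approach (the row content and spacing are assumed from the question; the leading-space regex is anchored per line with re.MULTILINE, since Python's re does not allow the variable-width lookbehind (?<=\n|^)):

```python
import re

# Hypothetical one-row sample: comma decimals, mixed spaces and tabs, leading space
row = " 0,0000000E00   \t-2,7599284      \t-1,3676726\n"
data = row.replace(',', '.')
data = re.sub(r'^[^\S\r\n]+', '', data, flags=re.MULTILINE)  # strip leading spaces on each line
data = re.sub(r'(?<=\S)[^\S\r\n]+', ';', data)               # collapse inner runs to ';'
print(data)  # 0.0000000E00;-2.7599284;-1.3676726
```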
If you're working with tabular data in Python, you'll want to use the pandas package. It's a large package, so if this is a one-off, the overhead of installing it might not be worth it.
Pandas has a read_csv function that deals with this easily, and the result can be exported to csv:
import pandas as pd

dataframe = pd.read_csv("input.txt", sep=r"\s+", decimal=",")
dataframe.to_csv("output.csv", sep=";", header=False, index=False)
Note: if your original file has no header, also pass header=None to the read_csv function.
The problem is that you're converting the entire line to a float, which Python can't parse. It will recognize the individual floats, and even the scientific notation, when you cast them separately.
What you could do is split the line using str.split(). Without arguments, the split function will split on any whitespace character including '\t'. You can then convert each to a float and rebuild the string.
with open(file) as fp:
    line = fp.readline()
    cnt = 1
    while line:
        s = line.replace('.', '').replace(',', '.')
        # Split the string into a list of strings
        s_list = s.split()
        # Convert each string to a float
        numbers = [float(num) for num in s_list]
        # Rebuild the string for further use
        s = " \t".join(str(num) for num in numbers)
        cnt += 1
        line = fp.readline()
How can I reduce multiple blank lines in a text file to a single line at each occurrence?
I have read the entire file into a string, because I want to do some replacement across line endings.
with open(sourceFileName, 'rt') as sourceFile:
    sourceFileContents = sourceFile.read()
This doesn't seem to work
while '\n\n\n' in sourceFileContents:
sourceFileContents = sourceFileContents.replace('\n\n\n', '\n\n')
and nor does this
sourceFileContents = re.sub('\n\n\n+', '\n\n', sourceFileContents)
It's easy enough to strip them all, but I want to reduce multiple blank lines to a single one, each time I encounter them.
I feel that I'm close, but just can't get it to work.
This is a reach, but perhaps some of the lines aren't completely blank (i.e. they have only whitespace characters that give the appearance of blankness). You could try removing all possible whitespace between newlines.
re.sub(r'(\n\s*)+\n+', '\n\n', sourceFileContents)
Edit: realized the second '+' was superfluous, as the \s* will catch newlines between the first and last. We just want to make sure the last character is definitely a newline so we don't remove leading whitespace from a line with other content.
re.sub(r'(\n\s*)+\n', '\n\n', sourceFileContents)
Edit 2
re.sub(r'\n\s*\n', '\n\n', sourceFileContents)
This should be an even simpler solution. We really just want to catch any possible whitespace (which includes intermediate newlines) between our two anchor newlines that make up the blank line, and collapse it down to just the two newlines.
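For example (sample text assumed), note that even a whitespace-only "blank" line collapses correctly:

```python
import re

text = "para one\n\n\n\npara two\n \t\npara three\n"
# any run of whitespace between two newlines becomes exactly one blank line
collapsed = re.sub(r'\n\s*\n', '\n\n', text)
print(collapsed)
```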
Your code works for me. Maybe there's a chance a carriage return \r is present:
re.sub(r'[\r\n][\r\n]{2,}', '\n\n', sourceFileContents)
You can use just str methods split and join:
text = "some text\n\n\n\nanother line\n\n"
print("\n".join(item for item in text.split('\n') if item))
A very simple approach using the re module:
import re
text = 'Abc\n\n\ndef\nGhijk\n\nLmnop'
text = re.sub('[\n]+', '\n', text) # Replacing one or more consecutive newlines with single \n
Result:
'Abc\ndef\nGhijk\nLmnop'
If the lines are completely empty, you can use regex positive lookahead to replace them with single lines:
sourceFileContents = re.sub(r'\n+(?=\n)', '\n', sourceFileContents)
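A quick sketch of the lookahead version (sample text assumed):

```python
import re

text = "a\n\n\n\nb\n"
# each run of newlines that is followed by one more newline collapses to a
# single '\n'; the lookahead newline survives, leaving exactly one blank line
out = re.sub(r'\n+(?=\n)', '\n', text)
print(out)
```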
If you replace your read statement with the following, then you don't have to worry about whitespace or carriage returns:
with open(sourceFileName, 'rt') as sourceFile:
    sourceFileContents = ''.join([l.rstrip() + '\n' for l in sourceFile])
After doing this, both of your methods you tried in the OP work.
OR
Just write it out in a simple loop.
with open(sourceFileName, 'rt') as sourceFile:
    lines = ['']
    for line in (l.rstrip() for l in sourceFile):
        if line != '' or lines[-1] != '\n':
            lines.append(line + '\n')
    sourceFileContents = "".join(lines)
I guess another option which is longer, but maybe prettier?
with open(sourceFileName, 'rt') as sourceFile:
    last_line = None
    lines = []
    for line in sourceFile:
        # if you want to skip lines with only whitespace, you could add something like:
        # line = line.lstrip(" \t")
        if line != "\n" or last_line != "\n":
            lines.append(line)
        last_line = line
    contents = "".join(lines)
I was trying to find some clever generator function way of writing this, but it's been a long week so I can't.
Code untested, but I think it should work?
(edit: One upside is I removed the need for regular expressions which fixes the "now you have two problems" problem :) )
(another edit based on Marc Chiesa's suggestion of lingering whitespace)
For someone who can't do regex like me, if the code to process is Python:
import autopep8
autopep8.fix_code('your_code')
Another quick solution, just in case your code isn't Python:
for x in range(100):
    content = content.replace("  ", " ")  # reduce runs of multiple spaces
# then
for x in range(20):
    content = content.replace("\n\n", "\n")  # reduce runs of blank lines
Note that str.replace returns a new string, so the result has to be assigned back to content. Also, if you have very long runs of spaces or more than about 20 consecutive newlines, you'll want to increase the repetition counts.
If decoding from unicode, watch out for non-breaking spaces which show up in cat -vet as M-BM-:
sourceFileContents = sourceFile.read()
sourceFileContents = re.sub(r'\n(\s*\n)+', '\n\n', sourceFileContents.replace("\xc2\xa0", " "))
I have a CSV file which is made of words in the first column. (1 word per row)
I need to print a list of these words, i.e.
CSV File:
a
and
because
have
Output wanted:
"a","and","because","have"
I am using Python and so far I have the following code:
text=open('/Users/jessieinchauspe/Dropbox/Smesh/TMT/zipf.csv')
text1 = ''.join(ch for ch in text)
for word in text1:
    print '"' + word + '"' + ','
This is returning:
"a",
"",
"a",
"n",
...
Whereas I need everything on one line, and not character by character but word by word.
Thank you for your help!
EDIT: this is a screenshot of the preview of the CSV file
Just loop over the file directly:
with open('/Users/jessieinchauspe/Dropbox/Smesh/TMT/zipf.csv') as text:
    print ','.join('"{0}"'.format(word.strip()) for word in text)
The above code:
Loops over the file; this gives you a line (including the newline \n character).
Uses .strip() to remove whitespace around the word (including the newline).
Uses .format() to put the word in quotes ('word' becomes '"word"')
Uses ','.join() to join all quoted words together into one list with commas in between.
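The same steps on an in-memory list standing in for the file (contents assumed from the question):

```python
# Each element plays the role of one line read from the file, newline included
lines = ["a\n", "and\n", "because\n", "have\n"]
result = ','.join('"{0}"'.format(word.strip()) for word in lines)
print(result)  # "a","and","because","have"
```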
When you do:
text = open('/Users/jessieinchauspe/Dropbox/Smesh/TMT/zipf.csv')
that basically returns an iterator with each line as an element. If you want a list out of that, and you're sure there is only one word per line, then all you need to do is:
result=list(text)
print result
Otherwise, you can get just the first word on each line like so:
result = [x.split(',')[0] for x in text]
print result
You could also use the CSV module:
import csv
input_f = '/Users/jessieinchauspe/Dropbox/Smesh/TMT/zipf.csv'
output_f = '/Users/jessieinchauspe/Dropbox/Smesh/TMT/output.csv'
with open(input_f, 'r') as input_handle, open(output_f, 'w') as output_handle:
    writer = csv.writer(output_handle)
    writer.writerow(list(input_handle))
If you put a comma at the end of the print statement it suppresses the newline (Python 2; in Python 3 you would pass end='' to the print() function).
print '"' + word + '"' +',',
Will give you the output on one line.
print ','.join('"%s"' % line.strip() for line in open('/tmp/test'))