When I dump processed CSV to stdout with csv.writer, ^M characters are appended to the output. Why are they appearing?
writer = csv.writer(sys.stdout, delimiter=output_delimiter, quotechar=quotechar)
for row in csv.reader(open(args[0],"U"), delimiter=delimiter, quotechar=quotechar):
    writer.writerow(row)
How I invoke the cmd:
./csvcut -d ',' -q \" -f 2,4,5,6,7,8,9,10 data/listings.csv > data/extracted.csv
The generated file(data/extracted.csv) is:
name,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price^M
COZICOMFORT LONG TERM STAY ROOM 2,Francesca,North Region,Woodlands,1.44255,103.7958,Private room,83^M
Pleasant Room along Bukit Timah,Sujatha,Central Region,Bukit Timah,1.33235,103.78521,Private room,81^M
COZICOMFORT,Francesca,North Region,Woodlands,1.44246,103.79667,Private room,69^M
Ensuite Room (Room 1 & 2) near EXPO,Belinda,East Region,Tampines,1.34541,103.95712,Private room,206^M
B&B Room 1 near Airport & EXPO,Belinda,East Region,Tampines,1.34567,103.95963,Private room,94^M
Room 2-near Airport & EXPO,Belinda,East Region,Tampines,1.34702,103.96103,Private room,104^M
3rd level Jumbo room 5 near EXPO,Belinda,East Region,Tampines,1.34348,103.96337,Private room,208^M
The input file(data/listings.csv) is:
1024986,Super Host Apartment,5643415,Martin,Central Region,River Valley,1.29349,103.83837,Entire home/apt,140,2,145,2019-08-22,2.04,1,230
1060046,S$950/mth spacious room for short/long term lease,5748910,Sarah,West Region,Bukit Panjang,1.38123,103.76874,Private room,49,3,15,2019-06-01,0.20,2,131
1078804,Cozy Studio Room,4602014,F,North-East Region,Hougang,1.36764,103.90228,Private room,31,30,60,2018-09-28,0.78,3,225
1131162,Cozy Room at Bedok Reservoir,6205166,Lydia,East Region,Bedok,1.33729,103.9298,Private room,72,2,1,2017-04-01,0.03,1,360
^M is how many editors and terminals display a carriage return (\r). Line endings are represented differently across operating systems. In this case the \r comes from the csv module itself: csv.writer terminates every row with \r\n by default (its lineterminator parameter), regardless of platform, and that is what ends up in the redirected file.
On *nix, a line ends with \n, while on Windows it ends with \r\n. ^M is the text editor showing you the \r.
From what I see, you have these options:
Write to the file directly in binary mode. (like here)
Strip the \r (shown as ^M) from the output file, with dos2unix or a string replace().
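Another option, assuming you can edit the script, is to override the row terminator. The sketch below is written for Python 3 and uses in-memory buffers instead of sys.stdout so that the result can be inspected; the sample rows are made up for illustration.

```python
import csv
import io

rows = [["name", "price"], ["COZICOMFORT", "69"]]  # made-up sample rows

# Default dialect: every row is terminated with "\r\n" on any platform.
default_buf = io.StringIO()
csv.writer(default_buf).writerows(rows)

# Overriding lineterminator gives plain "\n" endings, so no ^M shows up.
unix_buf = io.StringIO()
csv.writer(unix_buf, lineterminator="\n").writerows(rows)

print(repr(default_buf.getvalue()))
print(repr(unix_buf.getvalue()))
```

In the script from the question, this would presumably mean adding lineterminator='\n' to the existing csv.writer(sys.stdout, ...) call.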
I have a long list of unformatted data, say data.txt, where each set starts with a header and ends with a blank line, like:
TypeA/Price:20$
alexmob
moblexto
unkntom
TypeB/Price:25$
moblexto2
unkntom0
alexmob3
poptop9
tyloret
TypeC/Price:30$
rtyuoper0
kunlohpe6
mobryhox
Now, I want to put each set's header next to each of its items, separated by a comma, like:
alexmob,TypeA/Price:20$
moblexto,TypeA/Price:20$
unkntom,TypeA/Price:20$
moblexto2,TypeB/Price:25$
unkntom0,TypeB/Price:25$
alexmob3,TypeB/Price:25$
poptop9,TypeB/Price:25$
tyloret,TypeB/Price:25$
rtyuoper0,TypeC/Price:30$
kunlohpe6,TypeC/Price:30$
mobryhox,TypeC/Price:30$
so that whenever I grep for a keyword, the relevant content comes together with its header, like:
$grep mob data.txt
alexmob,TypeA/Price:20$
moblexto,TypeA/Price:20$
moblexto2,TypeB/Price:25$
alexmob3,TypeB/Price:25$
mobryhox,TypeC/Price:30$
I am a newbie at bash scripting as well as Python and recently started learning these, so I would really appreciate any simple bash script (using sed/awk) or Python script.
Using sed
$ sed '/Type/{h;d;};/[a-z]/{G;s/\n/,/}' input_file
alexmob,TypeA/Price:20$
moblexto,TypeA/Price:20$
unkntom,TypeA/Price:20$
moblexto2,TypeB/Price:25$
unkntom0,TypeB/Price:25$
alexmob3,TypeB/Price:25$
poptop9,TypeB/Price:25$
tyloret,TypeB/Price:25$
rtyuoper0,TypeC/Price:30$
kunlohpe6,TypeC/Price:30$
mobryhox,TypeC/Price:30$
Match lines containing Type, copy them to the hold space (h), and delete them from the output (d).
Match lines containing lowercase letters, append the contents of the hold space to them (G), and finally substitute the newline with a comma.
I would use GNU AWK for this task in the following way. Let file.txt content be
TypeA/Price:20$
alexmob
moblexto
unkntom
TypeB/Price:25$
moblexto2
unkntom0
alexmob3
poptop9
tyloret
TypeC/Price:30$
rtyuoper0
kunlohpe6
mobryhox
then
awk '/^Type/{header=$0;next}{print /./?$0 ";" header:$0}' file.txt
output
alexmob;TypeA/Price:20$
moblexto;TypeA/Price:20$
unkntom;TypeA/Price:20$
moblexto2;TypeB/Price:25$
unkntom0;TypeB/Price:25$
alexmob3;TypeB/Price:25$
poptop9;TypeB/Price:25$
tyloret;TypeB/Price:25$
rtyuoper0;TypeC/Price:30$
kunlohpe6;TypeC/Price:30$
mobryhox;TypeC/Price:30$
Explanation: If line starts with (^) Type set header value to that line ($0) and go to next line. For every line print if it does contain at least one character (/./) line ($0) concatenated with ; and header, otherwise print line ($0) as is.
(tested in GNU Awk 5.0.1)
Using any awk in any shell on every Unix box regardless of which characters are in your data:
$ awk -v RS= -F'\n' -v OFS=',' '{for (i=2;i<=NF;i++) print $i, $1; print ""}' file
alexmob,TypeA/Price:20$
moblexto,TypeA/Price:20$
unkntom,TypeA/Price:20$
moblexto2,TypeB/Price:25$
unkntom0,TypeB/Price:25$
alexmob3,TypeB/Price:25$
poptop9,TypeB/Price:25$
tyloret,TypeB/Price:25$
rtyuoper0,TypeC/Price:30$
kunlohpe6,TypeC/Price:30$
mobryhox,TypeC/Price:30$
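Since the question also asks about Python, here is a minimal sketch of the same idea. It assumes, as the question describes, that the first non-blank line of each set is the header and that a blank line ends the set; attach_headers is a hypothetical helper name, and the sample is a shortened version of the data above.

```python
def attach_headers(lines, sep=","):
    """Yield 'item,header' for each item line; a blank line ends a set."""
    header = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            header = None      # a blank line closes the current set
        elif header is None:
            header = line      # first non-blank line of a set is its header
        else:
            yield f"{line}{sep}{header}"


sample = """TypeA/Price:20$
alexmob
moblexto

TypeB/Price:25$
moblexto2
""".splitlines()

result = list(attach_headers(sample))
print("\n".join(result))
```

To run it on a real file, the list could be replaced by an open file object, for example attach_headers(open("data.txt")), since the function only iterates over lines.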
I'm working on Windows. I have a Python file that creates a new CSV file, and I view the result using Notepad (and also Microsoft Excel).
import csv
data = [['fruit','quantity'], ['apple',5], ['banana',7],['mango',8]]
with open(r'd:\lineter.csv', 'w') as l:
    w = csv.writer(l, delimiter='|', lineterminator='\r')
    w.writerows(data)
The resulting file in Notepad:
fruit|quantityapple|5banana|7mango|8
Does the carriage return \r work or not? It works like lineterminator='' in Notepad. But in Excel, it works like '\n'.
The output doesn't seem to implement carriage return. When I use lineterminator as:
w = csv.writer(l, delimiter='|', lineterminator='*\r*\n')
The output in Notepad is:
fruit|quantity**
apple|5**
banana|7**
mango|8**
This is evident here too.
How does '\r' work in lineterminator in writer()?
Or is there another thing happening there?
The shorter answer:
When to use carriage return (CR, \r) vs. line feed (LF, \n) vs. both (CRLF, \r\n) to make a new line appear in a text editor on Windows, Mac, and Linux:
How does '\r' work in lineterminator in writer()??
It works fine in csv.writer(). This really isn't a Python, CSV, or writer problem. This is an operating system historical difference (actually, it's more accurate to state it is a program-specific difference) going back to the 1960s or so.
Or is there another thing happening there?
Yes, this is the one.
Your version of Notepad doesn't recognize a carriage return (\r) as a character used to display new lines, and hence won't display it as such in Notepad. Other text editors, such as Sublime Text 3, however probably would, even on Windows.
Up until about the year 2018 or so, Windows and Notepad required a carriage return + line feed (\r\n) together to display a new line. Contrast this to Mac and Linux, which require only \n.
The solution is to use \r\n for a new line on Windows, and \n alone for a new line on Mac or Linux. You can also try a different text editor, such as Sublime Text, when viewing or editing text files, or upgrade your version of Windows or Notepad, if possible, as somewhere around the year 2018 Windows Notepad started to accept \r alone as a valid old-Mac-style new line char.
(from the OP's comment under this answer):
Then why to give '\r\n'???
When a programmer writes a program, the programmer can make the program do whatever the programmer wants the program to do. When Windows programmers made Windows and Notepad they decided to make the program do nothing if it got a \r, nothing if it got a \n, and to do a new line if it got a \r\n together. It's that simple. The program is doing exactly what the programmers told it to do, because they decided that's how they wanted the program to work. So, if you want a new line in the older (pre-2018) version of Notepad in Windows, you must do what the programmers require you to do to get it. \r\n is it.
This goes back to the days of teletypewriters (read the "History" and "Representation" sections here), and this page about "teleprinters" / "teletypewriters" / "teletype or TTY machines" too:
A typewriter or electromechanical printer can print characters on paper, and execute operations such as move the carriage back to the left margin of the same line (carriage return), advance to the same column of the next line (line feed), and so on.
(source; emphasis added)
The mechanical carriage return button on a teletypewriter (\r now on a computer) meant: "return the carriage (print head) to the beginning of the line" (meaning: the far left side of the page), and the line feed mechanical mechanism on a teletypewriter (\n now on a computer) meant: "roll the paper up one line so we can now type onto the next line." Without the mechanical line feed (\n) action, the carriage return (\r) alone would move the mechanical print head to the far left of the page and cause you to type right back on top of the words you already typed! And without the carriage return mechanical action (\r on a computer), the line feed mechanical action (\n) alone would cause you to just type in the last column at the far right on each new line on the page, never able to return the print head to the left side of the page again! On an electro-mechanical teletypewriter, they both had to be used: the carriage return would bring the print head back to the left side of the page, and the line feed action would move the print head down to the next line. So, presumably, Windows programmers felt it was logical to keep that tradition alive, and they decided to require both a \r\n together to create a new line on a computer, since that's how it had to be done traditionally on an electro-mechanical teletypewriter.
Read below for details.
Details (the longer answer):
I have some ideas of what's going on, but let's take a look. I believe we have two questions to answer:
Is the \r actually being stored into the file?
Is Notepad actually showing the \r, and if not, why not?
So, for #1. Let's test it on Linux Ubuntu 20.04 (Focal Fossa):
This program:
#!/usr/bin/python3
import csv
data = [['fruit','quantity'], ['apple',5], ['banana',7],['mango',8]]
with open(r'd:\lineter.csv','w') as l:
    w = csv.writer(l, delimiter='|', lineterminator='\r')
    w.writerows(data)
produces this file: d:\lineter.csv. If I open it in the Sublime Text 3 text editor I see:
fruit|quantity
apple|5
banana|7
mango|8
So far so good. Let's look at the characters with hexdump at the command line:
hexdump -c shows the \r characters, sure enough!
$ hexdump -c d\:\\lineter.csv
0000000 f r u i t | q u a n t i t y \r a
0000010 p p l e | 5 \r b a n a n a | 7 \r
0000020 m a n g o | 8 \r
0000028
You can also use hexdump -C to show the characters in hexadecimal instead, and again, I see the \r in the file as a hex 0d char, which is correct.
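The same check can be done from Python instead of hexdump. This is a minimal sketch, assuming Python 3: it writes to a temporary file (a stand-in for d:\lineter.csv) and passes newline='' as the csv documentation recommends, which the program above does not do.

```python
import csv
import os
import tempfile

data = [["fruit", "quantity"], ["apple", 5], ["banana", 7], ["mango", 8]]

# Hypothetical scratch location, used here instead of d:\lineter.csv.
path = os.path.join(tempfile.mkdtemp(), "lineter.csv")
with open(path, "w", newline="") as f:  # newline="" leaves the terminator untouched
    csv.writer(f, delimiter="|", lineterminator="\r").writerows(data)

# Binary mode returns the raw bytes, so "\r" (0x0d) is visible as-is.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
print("CR count:", raw.count(b"\r"), "LF count:", raw.count(b"\n"))
```

A CR count equal to the number of rows and an LF count of zero would confirm that the \r really is stored in the file.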
Ok, so I boot up Windows 10 Professional in my VirtualBox virtual machine in Linux, and open the same file in Notepad, and....it works too! See screenshot:
But, notice the part I circled which says "Macintosh (CR)". I'm running the latest version of Windows 10 Professional. I'm betting you're using an old version of Notepad which doesn't have this fix, and yours won't say that here. This is because for 33 years Notepad didn't handle Carriage Return, or \r, as a valid line-ending, so it wouldn't display it as such. See here: Windows Notepad fixed after 33 years: Now it finally handles Unix, Mac OS line endings.
Due to historical differences dating back to teletypewriters and Morse code (read the "History" and "Representation" sections here), different systems decided to make their text editors treat line endings in different ways. From the article just above (emphasis added):
Notepad previously recognized only the Windows End of Line (EOL) characters, specifically Carriage Return (CR, \r, 0x0d) and Line Feed (LF, \n, 0x0a) together.
For old-school Mac OS, the EOL character is just Carriage Return (CR, \r, 0x0d) and for Linux/Unix it's just Line Feed (LF, \n, 0x0a). Modern macOS, since Mac OS X, follows the Unix convention.
So, what we have here is the following displayed as a newline in a text editor:
Old-school Mac: CR (\r) only
Windows Notepad up until ~2018: CR & LF together (\r\n)
Linux: LF (\n) only
Modern Mac: LF (\n) only
Modern Windows Notepad (year ~2018 and later): any of the scenarios above.
So, for Windows, just stick to always using \r\n for a newline, and for Mac or Linux, just stick to always using \n for a newline, unless you're trying to guarantee old-school (i.e., pre-2019 :)) Windows compatibility of your files, in which case you should use \r\n for newlines as well.
Note, for Sublime Text 3, I just searched the preferences in Preferences → Settings and found this setting:
// Determines what character(s) are used to terminate each line in new files.
// Valid values are 'system' (whatever the OS uses), 'windows' (CRLF) and
// 'unix' (LF only).
"default_line_ending": "system",
So, to use the convention for whatever OS you're running Sublime Text on, the default is "system". To force 'windows' (CRLF) line endings when editing and saving files in Sublime Text, however, use this:
"default_line_ending": "windows",
And to force Unix (Mac and Linux) LF-only line ending settings, use this:
"default_line_ending": "unix",
In the Notepad editor, I can find no such settings to configure. It is a simple editor, catering for 33 years to Windows line endings only.
Additional Reading:
https://en.wikipedia.org/wiki/Teleprinter
https://en.wikipedia.org/wiki/Newline#History
Is a new line = \n OR \r\n?
Why does Windows use CR LF?
[I still need to read & study] Unix & Linux: Why does Linux use LF as the newline character?
[I still need to read & study] Retrocomputing: Why is Windows using CR+LF and Unix just LF when Unix is the older system?
Suppose I have data for 3 persons.
Each person has 3 txt files of the form (variable,value|...):
1.txt
A,10|
B,11|
C,12|
D,13|
E,14|F,15|G,16|H,17|
I,18|J,19|K,20|L,21|
2.txt
M,22|
N,23|
O,24|
P,25|
Q,26|R,27|S,28|T,29|
U,30|V,31|W,32|X,33|
3.txt
Y,34|
Z,35|
AA,36|
AB,37|
AC,38|AD,39|AE,40|AF,41|
AG,42|AH,43|AI,44|AJ,45|
The other persons' files follow the same pattern.
I want to convert these into an Excel file.
(Picture: the expected result after conversion to an Excel file.)
It can be done quite easily in Bash, producing a CSV:
cat [123].txt | tr -d '[A-Z,][:space:]' | tr "|" "," > /tmp/output.csv
And now a short explanation:
cat [123].txt just outputs the text in the files sequentially.
The "|" pipes the output of the previous command to the next one. It is immensely useful and very worth getting to know if you intend to use Bash scripting even a little.
tr -d '[A-Z,][:space:]' deletes all capital letters and commas, as well as any and all spaces and invisible characters such as new line or line break.
tr "|" "," translates the "|" character to a "," character.
And finally > /tmp/output.csv outputs it to a new file.
Now there is one caveat. The file is technically a .TXT file and not a .CSV file. It doesn't change anything if you open it manually, but I don't know your workflow so this could matter.
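For comparison, a rough Python equivalent of the pipeline might look like the sketch below. It works on an in-memory string rather than on the files, values_to_csv is a hypothetical helper name, and the sample is a shortened version of 1.txt. Like the tr command, it also removes literal square brackets and leaves a trailing comma.

```python
import re


def values_to_csv(text):
    """Mimic the tr pipeline: drop capitals, commas, brackets and
    whitespace, then turn every '|' into ','."""
    stripped = re.sub(r"[A-Z,\[\]\s]", "", text)
    return stripped.replace("|", ",")


sample = "A,10|\nB,11|\nE,14|F,15|\n"  # shortened sample in the 1.txt format
line = values_to_csv(sample)
print(line)
```

Reading the real files would only require concatenating their contents before calling the function.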
I exported a CSV from Excel to parse using Python. When I opened the CSV in vim, I noticed that it was all one line, with ^M characters where newlines should be.
Name, Value, Value2, OtherStuff ^M Name, Value, Value2, OtherStuff ^M
I parse the file, modify the values, and put them into a string (using 'rU' mode in the csv reader). However, the string has no newlines. So I am wondering: is there a way to split the string on this ^M character, or a way to replace it with a \n?
^M is how vim displays the carriage return in Windows end-of-lines.
The dos2unix command should fix those up for you:
dos2unix my_file.csv
It's due to the different EOL formats on Windows/Unix.
On windows, it's \r\n
On Unix/Linux/Mac, it's just \n
The ^M is actually vim showing you the windows CR (Carriage Return) or \r
The python open command documentation has more information on handling Universal Newlines: http://docs.python.org/2/library/functions.html#open
If you are on a Unix system, there is a program called dos2unix (and its counterpart unix2dos) that will do exactly that conversion.
For stripping the trailing \r from each line, it is pretty much the same as something like this (GNU sed):
sed -i -e 's/\r$//' file
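To answer the "split or replace" part directly in Python, here is a minimal sketch. The sample string is made up to mimic the exported data and mixes a bare \r with a \r\n.

```python
text = "Name, Value, Value2, OtherStuff\rName, Value, Value2, OtherStuff\r\n"

# splitlines() treats "\r", "\n" and "\r\n" all as line boundaries.
rows = text.splitlines()

# Normalizing: handle "\r\n" first so it does not turn into two newlines.
normalized = text.replace("\r\n", "\n").replace("\r", "\n")

print(rows)
print(repr(normalized))
```

The order of the two replace() calls matters: replacing the lone \r first would convert each \r\n into \n\n.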
What makes parsing a text file in 'r' mode more convenient than parsing it in 'rb' mode?
Especially when the text file in question may contain non-ASCII characters.
This depends a little bit on what version of Python you're using. In Python 2, Chris Drappier's answer applies.
In Python 3, it's a different (and more consistent) story: in text mode ('r'), Python will parse the file according to the text encoding you give it (or, if you don't give one, a platform-dependent default), and read() will give you a str. In binary ('rb') mode, Python does not assume that the file contains things that can reasonably be parsed as characters, and read() gives you a bytes object.
Also, in Python 3, universal newlines (the translation between '\n' and platform-specific newline conventions, so you don't have to care about them) are available for text-mode files on any platform, not just Windows.
from the documentation:
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
The difference lies in how the end-of-line (EOL) is handled. Different operating systems use different characters to mark EOL: \n on Unix, \r on Mac versions prior to OS X, \r\n on Windows. When a file opened in text mode is read, Python replaces the OS-specific end-of-line sequence with just \n. And vice versa: when you write \n to a file opened in text mode, Python writes the OS-specific EOL sequence. You can find your OS's default EOL by checking os.linesep.
When a file is opened in binary mode, no mapping takes place. What you read is what you get. Remember, text mode is the default mode. So if you are handling non-text files (images, video, etc.), make sure you open the file in binary mode, otherwise you’ll end up messing up the file by introducing (or removing) some bytes.
Python also has a universal newline mode. When a file is opened in this mode, Python maps all of the characters \r, \n and \r\n to \n.
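A minimal Python 3 sketch of the difference, using a made-up two-line file with CRLF endings and one non-ASCII character. The scratch file name is hypothetical, and the encoding is passed explicitly so the result does not depend on the platform default.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "crlf.txt")  # hypothetical scratch file
with open(path, "wb") as f:
    f.write("caf\u00e9\r\nline2\r\n".encode("utf-8"))

# Text mode: bytes are decoded and "\r\n" is normalized to "\n".
with open(path, "r", encoding="utf-8") as f:
    as_text = f.read()

# Binary mode: raw bytes, nothing decoded or translated.
with open(path, "rb") as f:
    as_bytes = f.read()

# Text mode with newline="": decoded, but line endings left untouched.
with open(path, "r", encoding="utf-8", newline="") as f:
    untranslated = f.read()

print(repr(as_text), repr(as_bytes), repr(untranslated))
```

The newline="" form is the one the csv module's documentation asks for, because the csv reader and writer handle line endings themselves.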
For clarification and to answer Agostino's comment/question (I don't have sufficient reputation to comment so bear with me stating this as an answer...):
In Python 2 on non-Windows platforms, no line-end modification happens, in either text or binary mode. As has been stated before, in Python 2 Chris Drappier's answer applies (please note that its link nowadays points to the 3.x Python docs, but Chris' quoted text is of course from the Python 2 input and output tutorial).
So no, it is not true that opening a file in text mode with Python 2 on non-Windows does any line end modification:
0 $ cat data.txt
line1
line2
line3
0 $ file data.txt
data.txt: ASCII text, with CRLF line terminators
0 $ python2.7 -c 'f = open("data.txt"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
0 $ python2.7 -c 'f = open("data.txt", "r"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
0 $ python2.7 -c 'f = open("data.txt", "rb"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
It is, however, possible to open the file in universal newline mode in Python 2, which performs exactly that line-end modification:
0 $ python2.7 -c 'f = open("data.txt", "rU"); print f.readlines()'
['line1\n', 'line2\n', 'line3\n']
(the 'U' mode specifier is deprecated in Python 3 and was removed in Python 3.11)
On Python 3, on the other hand, platform-specific line ends do get normalized to '\n' when reading a file in text mode, and '\n' gets converted to the current platform's default line end when writing in text mode (in addition to the bytes<->unicode<->bytes decoding/encoding going on in text mode). E.g. reading a Dos/Win CRLF-line-ended file on Linux will normalize the line ends to '\n'.