csv.writer.writerow is separating my data improperly [duplicate] - python

I wrote a script to delete certain columns (columns 0 through 6) from a set of .csv files.
import csv
import os
data_path = "C:/Users/hhs/dataset/PSP/Upper/"
save_path = "C:/Users/hhs/Refined/PSP/Upper/"
for filename in os.listdir(data_path):
    data_full_path = os.path.join(data_path, filename)
    save_full_path = os.path.join(save_path, filename)
    with open(data_full_path, "r") as source:
        rdr = csv.reader(source)
        with open(save_full_path, "w") as result:
            wtr = csv.writer(result)
            for line in rdr:
                wtr.writerow((line[7]))
One of the original datasets looks like this:
Normals:0 Normals:1 Normals:2 Points:0 Points:1 Points:2 area cp
-0.69498 0.62377 0.34311 28.829 3.4728 -0.947160 0.25877 -0.094391
-0.73130 0.54405 0.39395 30.082 4.9111 -0.785480 0.23499 -0.261690
-0.74539 0.49691 0.42782 31.210 6.4629 -0.626470 0.20982 -0.330730
-0.75245 0.48322 0.42985 32.359 8.0473 -0.455080 0.19428 -0.221340
-0.77195 0.46254 0.41825 33.546 9.7963 -0.270990 0.19849 -0.086641
-0.78905 0.45241 0.39759 34.737 11.6860 -0.079976 0.18456 -0.022418
-0.79771 0.45422 0.37858 35.915 13.5840 0.118160 0.17047 0.026102
-0.80090 0.45479 0.37198 37.092 15.4810 0.330220 0.15594 0.154880
-0.80260 0.45516 0.36904 38.268 17.3770 0.550100 0.14279 0.316590
-0.80504 0.45774 0.36178 39.444 19.2740 0.769020 0.12996 0.475640
-0.80747 0.46024 0.35383 40.620 21.1710 0.982050 0.11692 0.624090
The result does contain the "cp" values from the last column, which is what I want.
However, the output looks very odd: every character ends up in its own column.
c p
- 0 . 0 9 4 3 9
- 0 . 2 6 1 6 9
- 0 . 3 3 0 7 3
- 0 . 2 2 1 3 4
- 0 . 0 8 6 6 4
- 0 . 0 2 2 4 1
0 . 0 2 6 1 0 2
0 . 1 5 4 8 8
0 . 3 1 6 5 9
0 . 4 7 5 6 4
0 . 6 2 4 0 9
.
.
.
Why does the result look like this?

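Fix two issues in the second loop:
1. Pass newline='' when opening the output file (see "CSV file written with Python has blank lines between each row"). Setting delimiter=',' is optional, since a comma is already the default.
2. Change (line[7]) to [line[7]] (see "Why does csvwriter.writerow() put a comma after each character?"). (line[7]) is just a parenthesized string, so writerow() treats each character as a separate field; [line[7]] is a one-element list.
with open(save_full_path, "w", newline='') as result:
    wtr = csv.writer(result, delimiter=',')
    for line in rdr:
        wtr.writerow([line[7]])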
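To see why the brackets matter, here is a minimal sketch that writes the same value both ways. It assumes an in-memory io.StringIO buffer in place of the real output file, and the sample value is taken from the question's data.

```python
import csv
import io

value = "-0.094391"

# (value) is just a parenthesized string, so writerow() iterates over its characters.
buf_string = io.StringIO()
csv.writer(buf_string).writerow((value))
as_string = buf_string.getvalue()

# [value] is a one-element list, so writerow() emits a single field.
buf_list = io.StringIO()
csv.writer(buf_list).writerow([value])
as_list = buf_list.getvalue()

print(repr(as_string))  # one field per character
print(repr(as_list))    # one field holding the whole value
```

A one-element tuple, written (value,) with a trailing comma, would behave like the list.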

Related

Multiplying a specific column in txt file with constant number

I have a txt file with 7 columns. I want to multiply the 3rd column by a constant number, keep all other columns the same, and then write out a file containing all the columns. Can anyone help?
1 2 1
2 2 1
3 2 1
Multiplying column 3 by 14, the output should look like this:
1 2 14
2 2 14
3 2 14
While you have a text file with 7 columns, your example only shows 3.
So I have based my answer on your example:
The important part of code related to multiplication is this:
matrix[:,(target_col-1)] *= c_val
Here is the full PYTHON code:
import numpy as np
# Constant value (used for multiplication)
c_val = 14
# Number of columns in the matrix
n_col = 3
# Column to be multiplied (ie. Third column)
target_col = 3
# Import the text file containing the matrix
filename = 'data.txt'
matrix = np.loadtxt(filename, usecols=range(n_col))
# Multiply the target column (ie. 3rd column) by c_val (ie.14)
matrix[:,(target_col-1)] *= c_val
# Save the matrix to a new text file
with open('new_text_file.txt','wb') as f:
    np.savetxt(f, matrix, delimiter=' ', fmt='%d')
OUTPUT:
new_text_file.txt
1 2 14
2 2 14
3 2 14
This is a possible solution for C++17.
If you are sure about the format of the input file, you could reduce the code to the one below:
Just walk the input stream, multiply every 3rd number by a constant, and start a new line after every 9th number (you mentioned 7 numbers per line, but your example contains 9).
Note that for a real file you would use file streams instead of string streams.
#include <fmt/core.h>
#include <sstream>  // istringstream, ostringstream
void parse_iss(std::istringstream& iss, std::ostringstream& oss, int k) {
    for (int number_counter{ 0 }, number; iss >> number; ++number_counter) {
        oss << ((number_counter % 3 == 2) ? number*k : number);
        oss << ((number_counter % 9 == 8) ? "\n" : " ");
    }
}
int main() {
    std::istringstream iss{
        "1 2 1 2 2 1 3 2 1\n"
        "2 4 2 4 4 5 5 5 6\n"
    };
    std::ostringstream oss{};
    parse_iss(iss, oss, 14);
    fmt::print("{}", oss.str());
}
// Outputs:
//
// 1 2 14 2 2 14 3 2 14
// 2 4 28 4 4 70 5 5 84
This can be done as follows:
MULTIPLIER = 14
input_file_name = "numbers_in.txt"
output_file_name = "numbers_out.txt"
with open(input_file_name, 'r') as f:
    lines = f.readlines()
with open(output_file_name, 'w+') as f:
    for line in lines:
        new_line = ""
        for i, x in enumerate(line.strip().split(" ")):
            if (i+1)%3 == 0:
                # every third value is multiplied
                new_line += str(int(x)*MULTIPLIER) + " "
            else:
                new_line += x + " "
        # write() takes a single string; writelines() expects an iterable of lines
        f.write(new_line + "\n")
# numbers_in.txt:
# 1 2 1 2 2 1 3 2 1
# 1 3 1 3 3 1 4 3 1
# 1 4 1 4 4 1 5 4 1
# numbers_out.txt:
# 1 2 14 2 2 14 3 2 14
# 1 3 14 3 3 14 4 3 14
# 1 4 14 4 4 14 5 4 14

Splitting a string into a list of integers with Python

This method takes a file name and the directory containing the file. The file contains a matrix of data, and the method needs to copy the first 20 columns of each row that follow the row number and the row's letter. The first 3 lines of each file are skipped because they hold information that is not needed; the data at the bottom of the file is not needed either.
For example a file would look like:
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)
-blank line
unimportant information--------
unimportant information--------
The output of the method needs to print out a "matrix" in some given form.
So far the output gives a list of each row as a string, however I'm trying to figure out the best way to approach the problem. I don't know how to ignore the unimportant information at the end of the files. I don't know how to only retrieve the first 20 columns after the letter in each row, and I don't know how to ignore the row number and the row letter.
def pssmMatrix(self,ipFileName,directory):
    dir = directory
    filename = ipFileName
    my_lst = []
    #takes every file in fasta folder and put in files list
    for f in os.listdir(dir):
        #splits the file name into file name and its extension
        file, file_ext = os.path.splitext(f)
        if file == ipFileName:
            with open(os.path.join(dir,f)) as file_object:
                for _ in range(3):
                    next(file_object)
                for line in file_object:
                    my_lst.append(' '.join(line.strip().split()))
    return my_lst
Expected results:
['-1 2 -3 4 5 6 7'], ['3 -1 3 4 0 -2 1'], ['3 -1 3 6 0 -2 5']
Actual results:
['1 F -1 2 -3 4 5 6 7'], ['2 L 3 -1 3 4 0 -2 1'], ['3 A 3 -1 3 6 0 -2 5'], [' '], [' unimportant info'], ['unimportant info']
Try this solution.
import re
reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')
text = """
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)"""
ignore_start = 4 # skip indices 0-3: the leading empty line plus the three header lines
expected_array = []
for index, line in enumerate(text.splitlines()):
    if index >= ignore_start:
        match = reg.search(line)
        if match:
            # Keep the matched run of numbers as one string
            expected_array.append(match.group(0).strip())
print(expected_array)
# Result: [
#'-1 2 -3 4 5 6 7',
#'3 -1 3 4 0 -2 1',
#'3 -1 3 6 0 -2 5'
#]
It looks like the lines you want always start with a number followed by a letter. So you can apply a regular expression that matches only lines with that pattern and captures only the numbers after it.
The expression for this would look like (?<=[0-9]\s[A-Z]\s)[0-9\-\s]+
import re
reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')
for line in file:
    match = reg.search(line)
    if match:
        # search the current line and keep the matched numbers as a single string
        my_lst.append(match.group(0).strip())
Hope that helps
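As an alternative to the regular expression, a plain split() may be enough. The sketch below assumes the layout shown in the question: three header lines, then data rows of the form "number letter ints...", ending at the first blank line. The function name parse_matrix and its parameters n_skip and n_cols are made up for illustration.

```python
def parse_matrix(lines, n_skip=3, n_cols=20):
    """Return one list of ints per data row, dropping the row number and letter."""
    rows = []
    for index, line in enumerate(lines):
        if index < n_skip:
            continue  # header lines
        fields = line.split()
        if not fields:
            break  # the first blank line after the data ends the matrix
        # fields[0] is the row number, fields[1] the letter; keep the next n_cols values
        rows.append([int(x) for x in fields[2:2 + n_cols]])
    return rows


# Sample shaped like the question's file (an assumption about the real layout).
sample = [
    "unimportant information--------",
    "unimportant information--------",
    "",
    "1 F -1 2 -3 4 5 6 7",
    "2 L 3 -1 3 4 0 -2 1",
    "3 A 3 -1 3 6 0 -2 5",
    "",
    "unimportant information--------",
]
matrix = parse_matrix(sample)
print(matrix)
```

Because the function only iterates over lines, an open file object can be passed in place of the list.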

Reading and Rearranging data in Python

I have a very large (10GB) data file of the form:
A B C D
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
I would like to read just the B column of the file and rearrange it in the form
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
It takes a very long time to read the data and rearrange it. Could someone suggest an efficient way to do this in Python?
This is the MATLAB code I used to process the data:
fid = fopen('hpts.out', 'r'); % Open text file
InputText = textscan(fid, '%s', 1, 'delimiter', '\n'); % Read header lines
HeaderLines = InputText{1}
A = textscan(fid,'%n %n %n %n %n', 'HeaderLines', 1);
t = A{1};
vz = A{4};
L = 1;
for j = 1:1:5000
for i=1:1:14999
V1(j,i) = vz(L);
L = L +1 ;
end
end
imagesc(V1);
You can use Python for this, but I think this is exactly the sort of job where a shell script is better, since it's a lot shorter and easier:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
tail removes the first (header) line;
awk gets the second column;
tr puts it on a single line;
and fmt makes lines a maximum of 10 characters.
Since this is a streaming operation, it should not take a lot of memory, and most performance for this is limited to just disk I/O (although shell pipes also introduce some overhead).
Example:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
This streaming approach should perform well:
from itertools import zip_longest  # named izip_longest in Python 2
with open('yourfile', 'r') as fin, open('newfile', 'w') as fout:
    # discard header row
    next(fin)
    # make generator for second column
    col2values = (line.split()[1] for line in fin)
    # zip into groups of five.
    # fillvalue used to make a partial last row look good.
    for row in zip_longest(*[col2values]*5, fillvalue=''):
        fout.write(' '.join(row) + '\n')
Don't read the whole file at once! Read the file line by line:
def read_data():
    with open("filename.txt", 'r') as f:
        next(f)  # skip the header line
        for line in f:
            yield line.split()[1]
with open('file_to_save.txt', 'w') as f:
    for i, data in enumerate(read_data(), start=1):
        # end the row after every fifth value, otherwise separate values with a space
        f.write(data + ('\n' if i % 5 == 0 else ' '))
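In Python 3, the same line-by-line idea can also be sketched with itertools.islice, which pulls a fixed number of values at a time from a generator. The helper name column_in_groups is hypothetical, and io.StringIO stands in for the real 10 GB file so the example is self-contained.

```python
import io
from itertools import islice


def column_in_groups(lines, column=1, group_size=5):
    """Yield lists of up to group_size values from one column, skipping the header."""
    rows = iter(lines)
    next(rows, None)  # discard the header row
    values = (line.split()[column] for line in rows if line.strip())
    while True:
        group = list(islice(values, group_size))
        if not group:
            return
        yield group


# Twelve made-up data rows in the question's format; a file object is consumed the same way.
source = io.StringIO("A B C D\n" + "".join(f"{i} 2 3 4\n" for i in range(1, 13)))
result = [" ".join(group) for group in column_in_groups(source)]
print("\n".join(result))
```

Only one group of values is held in memory at a time, so the input size should not matter much beyond disk I/O.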

Same code works differently depending on file size

I am running a simple code to select text from lines in the input file and write that text to an output file.
with open('inputpath', 'r') as vh_datoteka, open('outputpath', 'w') as iz_datoteka:
    for line in vh_datoteka:
        NMEA = str(line)[24:-39]
        iz_datoteka.write(NMEA + '\n')
The data I need to process looks something like this (two lines):
2012-05-01 23:59:59.007;!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,0*6E;2470028;1;NULL;2012-05-01 21:59:59.007
2012-05-01 23:59:59.007;!AIVDM,1,1,0,,19NSBn001nQ8<7vDhIq43C<2280F,0*07;2470032;1;NULL;2012-05-01 21:59:59.007
...
Since I have large files to process (~2GB) I first tested the code on a small part of one of the large files (simply copied first 1000 or so lines and saved them into a test file).
The code worked perfectly and I got the results I was looking for:
!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,0*6E;
!AIVDM,1,1,0,,19NSBn001nQ8<7vDhIq43C<2280F,0*07;
After that I tried using the code on the whole data and got very different outputs:
2 3 : 5 9 : 5 9 . 0 0 7 ; ! A I V D M , 1 , 1 , 0 , , 3 3 c m > k 1 0
0 0 1 3 v g l D P k W 1 Q S i n 0 0 0 0 , 0 * 6 E ; 2 4 7 0 0 2 8 ; 1
; N U L L ; 2 0 1 2 - 3 : 5 9 : 5 9 . 0 0 7 ; ! A I V D M , 1 , 1 ,
0 , , 1 9 N S B n 0 0 1 n Q 8 < 7 v D h I q 4 3 C < 2 2 8 0 F , 0 * 0
7 ; 2 4 7 0 0 3 2 ; 1 ; N U L L ; 2 0 1 2 - ...
I have been trying to figure out the reason for this behaviour, but I have run out of ideas and need help.
Thank you Tobias for your comment.
Apparently the large data files were in UTF-16-LE, which was the problem. I corrected the Python code to read UTF-16 and write UTF-8, and that did the trick.
import codecs
with codecs.open('inputpath', 'r', encoding='utf-16-le') as vh_datoteka, codecs.open('outputpath', 'w', encoding='utf-8') as iz_datoteka:
    for line in vh_datoteka:
        # line is already decoded text, so the slice counts characters, not bytes
        NMEA = line[24:-39]
        iz_datoteka.write(NMEA + '\n')
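In Python 3 the built-in open() accepts an encoding argument, so the same fix can be sketched without codecs. The example below is only an illustration: it creates a small UTF-16-LE file in a temporary directory (the file names are hypothetical) and assumes each record sits on a single line, as the slice offsets in the question suggest.

```python
import os
import tempfile

sample = (
    "2012-05-01 23:59:59.007;!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,0*6E;"
    "2470028;1;NULL;2012-05-01 21:59:59.007\n"
)

with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "input.txt")    # hypothetical file names
    out_path = os.path.join(tmp, "output.txt")

    # Create a small UTF-16-LE input file to stand in for the real data.
    with open(in_path, "w", encoding="utf-16-le") as f:
        f.write(sample)

    # Decode UTF-16-LE on read, encode UTF-8 on write; the slice is unchanged.
    with open(in_path, "r", encoding="utf-16-le") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line[24:-39] + "\n")

    with open(out_path, "r", encoding="utf-8") as f:
        converted = f.read()

print(converted)
```

Reading UTF-16 data as a single-byte encoding leaves a NUL byte after every ASCII character, which is consistent with the "spaced out" output shown in the question.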

Listing distributions in Python

I have a sheet of numbers, separated by spaces into columns. Each column represents a different category, and within each column, each number represents a different value. For example, column number four represents age, and within the column, the number 5 represents an age of 44-55. Each row is a different person's record. I'd like to use a Python script to search through the sheet and find all rows where the sixth column is "1". After that, I want to know how many times each number in column one appears in those rows. The script should output to the user that "While column six equals '1', the value '1' appears 12 times in column one. The value '2' appears 18 times..." etc. I just want it to list the counts, basically. I'm new to Python, and I've attached my code below. I think I should be using dictionaries, but I'm not sure how. So far, I haven't come close to figuring this out. I would really appreciate it if someone could walk me through the logic behind such code. Thank you so much!
ldata = open("list.data", "r")
income_dist = {}
for line in ldata:
linelist = line.strip().split(" ")
key_income_dist = linelist[6]
if key_income_dist in income_dist:
income_dist[key_income_dist] = 1 + income_dist[key_income_dist]
else:
income_dist[key_income_dist] = 1
ldata.close()
print value_no_occupations
First, indentation is majorly important in Python and the above is bad: the 5 lines following linelist = line.strip().split(" ") need to be indented to be in the loop like they should be.
Next they should be indented further and this line added before them:
if len(linelist)>6 and linelist[6]=="1":
This line skips over short lines (there are some), and tests for what you said you wanted: "where column six equals "1."" This is column [6] where the first number on the line is referenced as [0] (these are "offsets", not "cardinal", or counting, numbers).
You'll probably want to change key_income_dist = linelist[6] to key_income_dist = linelist[0] or [1] to get what you want. Play around if necessary.
Finally, you should say print income_dist at the end to get a look at your results. If you want fancier output, study up on formatting.
This is actually easier than it seems! The key is collections.Counter
from collections import Counter
ldata = open("list.data")
# split() yields strings, so compare against '1', not the integer 1
rows = [tuple(row.split()) for row in ldata if row.split()[5] == '1']
# warning: this will break if some rows are shorter than 6 columns
first_col = Counter(item[0] for item in rows)
If you want the distribution of every column (not just the first) do:
distribution = {column: Counter(item[column] for item in rows) for column in range(len(rows[0]))}
# warning this will break if all rows are not the same size!
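Putting the filter and the count together, a minimal Python 3 sketch might look like the following. The sample rows are made up for illustration, and short or blank rows are skipped rather than raising an error, which is an assumption about how such rows should be handled.

```python
from collections import Counter

# Made-up sample rows; the real data would come from iterating over the open file.
lines = [
    "1 2 1 5 4 1 5",
    "2 1 1 5 5 1 5",
    "1 2 1 3 5 1 5",
    "3 2 5 1 2 6 5",
    "",
    "2 1",
]

split_rows = (line.split() for line in lines)
# Count column one only for rows that have a sixth column equal to the string "1".
counts = Counter(
    fields[0] for fields in split_rows if len(fields) > 5 and fields[5] == "1"
)

print("While column six equals '1',")
for value in sorted(counts):
    print(f"the value {value} appears {counts[value]} times in column one.")
```

Counter is a dict subclass, so it replaces the manual "if key in dict ... else" bookkeeping.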
Considering that the data file has ~9000 rows of data, if you don't want to keep the original data, you can combine steps 1 and 2 so that the program uses less memory and runs a little faster.
ldata = open("list.data", "r")
# read in all the rows, note that the list values are strings instead of integers
# keep only the rows with 6th column = '1'
only1 = []
for line in ldata:
    if line.strip() == '': # ignore blank lines
        continue
    row = tuple(line.strip().split(" "))
    if row[5] == '1':
        only1.append(row)
ldata.close()
# tally the statistics
income_dist = {}
for row in only1:
    if row[0] in income_dist:
        income_dist[row[0]] += 1
    else:
        income_dist[row[0]] = 1
# print result
print "While column six equals '1',"
for num in sorted(income_dist):
    print "the value %s appears %d times in column one." % (num, income_dist[num])
Sample Test Data in list.data:
9 2 1 5 4 5 5 3 3 0 1 1 7 NA
9 1 1 5 5 5 5 3 5 2 1 1 7 1
9 2 1 3 5 1 5 2 3 1 2 3 7 1
1 2 5 1 2 6 5 1 4 2 3 1 7 1
1 2 5 1 2 6 3 1 4 2 3 1 7 1
8 1 1 6 4 8 5 3 2 0 1 1 7 1
1 1 5 2 3 9 4 1 3 1 2 3 7 1
6 1 3 3 4 1 5 1 1 0 2 3 7 1
2 1 1 6 3 8 5 3 3 0 2 3 7 1
4 1 1 7 4 8 4 3 2 0 2 3 7 1
1 1 5 2 4 1 5 1 1 0 2 3 7 1
4 2 2 2 3 2 5 1 2 0 1 1 5 1
8 2 1 3 6 6 2 2 4 2 1 1 7 1
7 2 1 5 3 5 5 3 4 0 2 1 7 1
1 1 5 2 3 9 4 1 3 1 2 3 7 1
6 1 3 3 4 1 5 1 1 0 2 3 7 1
2 1 1 6 3 8 5 3 3 0 2 3 7 1
4 1 1 7 4 8 4 3 2 0 2 3 7 1
1 1 5 2 4 9 5 1 1 0 2 3 7 1
4 2 2 2 3 2 5 1 2 0 1 1 5 1
Following your original program logic, I come up with this version:
ldata = open("list.data", "r")
# read in all the rows, note that the list values are strings instead of integers
linelist = []
for line in ldata:
    linelist.append(tuple(line.strip().split(" ")))
ldata.close()
# keep only the rows with 6th column = '1'
only1 = []
for row in linelist:
    if row[5] == '1':
        only1.append(row)
# tally the statistics
income_dist = {}
for row in only1:
    if row[0] in income_dist:
        income_dist[row[0]] += 1
    else:
        income_dist[row[0]] = 1
# print result
print "While column six equals '1',"
for num in sorted(income_dist):
    print "the value %s appears %d times in column one." % (num, income_dist[num])
