Transposing a big data file in one line [Python/Unix]

I have a (big) file containing values like this:
1 2 3 4 5 6 7
8 9 10 ... N
I want to transpose the data values in this file onto one line, to get the final result:
1 2 3 4 ... N

tr '\n' ' ' < inputfile
Thanks to John Gordon.
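For completeness, here is a minimal pure-Python sketch of the same idea ('inputfile' and 'outputfile' are placeholder names). It streams line by line, so it stays memory-friendly even for a big file:
# Rough Python equivalent of the tr one-liner:
# replace each newline with a space while streaming.
with open('inputfile') as fin, open('outputfile', 'w') as fout:
    for line in fin:
        fout.write(line.replace('\n', ' '))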

Related

Multiplying a specific column in a txt file by a constant number

I have a txt file with 7 columns... I want to multiply the 3rd column by a constant number, keeping all the other columns the same, and then output the file containing all the columns. Can anyone help?
1 2 1
2 2 1
3 2 1
Multiplying column 3 by 14, the output should look like:
1 2 14
2 2 14
3 2 14
While you have a text file with 7 columns, your example only shows 3.
So I have based my answer on your example:
The important part of the code, the multiplication itself, is this line:
matrix[:,(target_col-1)] *= c_val
Here is the full Python code:
import numpy as np
# Constant value (used for multiplication)
c_val = 14
# Number of columns in the matrix
n_col = 3
# Column to be multiplied (ie. Third column)
target_col = 3
# Import the text file containing the matrix
filename = 'data.txt'
matrix = np.loadtxt(filename, usecols=range(n_col))
# Multiply the target column (ie. 3rd column) by c_val (ie.14)
matrix[:,(target_col-1)] *= c_val
# Save the matrix to a new text file
with open('new_text_file.txt', 'wb') as f:
    np.savetxt(f, matrix, delimiter=' ', fmt='%d')
OUTPUT:
new_text_file.txt
1 2 14
2 2 14
3 2 14
This is a possible solution in C++17.
If you are sure about the format of the input file, you could reduce the code to the version below: just walk the input stream, multiply every 3rd number by the constant, and append a newline after every 9th number (you mentioned 7 numbers per line, but your example contains 9 numbers).
Notice you would need to use file streams instead of the string streams used here.
#include <fmt/core.h>
#include <sstream>  // istringstream, ostringstream

void parse_iss(std::istringstream& iss, std::ostringstream& oss, int k) {
    for (int number_counter{ 0 }, number; iss >> number; ++number_counter) {
        oss << ((number_counter % 3 == 2) ? number * k : number);
        oss << ((number_counter % 9 == 8) ? "\n" : " ");
    }
}

int main() {
    std::istringstream iss{
        "1 2 1 2 2 1 3 2 1\n"
        "2 4 2 4 4 5 5 5 6\n"
    };
    std::ostringstream oss{};
    parse_iss(iss, oss, 14);
    fmt::print("{}", oss.str());
}
// Outputs:
//
// 1 2 14 2 2 14 3 2 14
// 2 4 28 4 4 70 5 5 84
It can be done as below:
MULTIPLIER = 14
input_file_name = "numbers_in.txt"
output_file_name = "numbers_out.txt"

with open(input_file_name, 'r') as f:
    lines = f.readlines()

with open(output_file_name, 'w') as f:
    for line in lines:
        new_line = ""
        for i, x in enumerate(line.strip().split(" ")):
            if (i + 1) % 3 == 0:
                new_line += str(int(x) * MULTIPLIER) + " "
            else:
                new_line += x + " "
        f.write(new_line + "\n")
# numbers_in.txt:
# 1 2 1 2 2 1 3 2 1
# 1 3 1 3 3 1 4 3 1
# 1 4 1 4 4 1 5 4 1
# numbers_out.txt:
# 1 2 14 2 2 14 3 2 14
# 1 3 14 3 3 14 4 3 14
# 1 4 14 4 4 14 5 4 14

Splitting a string into a list of integers with Python

This method takes a file name and the directory of the file. The file contains a matrix of data, and the method needs to copy the first 20 columns of each row after the given row number and the corresponding letter for the row. The first 3 lines of each file are skipped because they hold unimportant information, and the data at the bottom of the file is not needed either.
For example a file would look like:
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)
-blank line
unimportant information--------
unimportant information--------
The output of the method needs to print out a "matrix" in some given form.
So far the output gives a list of each row as a string, and I'm trying to figure out the best way to approach the problem. I don't know how to ignore the unimportant information at the end of the files, how to retrieve only the first 20 columns after the letter in each row, or how to ignore the row number and the row letter.
def pssmMatrix(self, ipFileName, directory):
    dir = directory
    filename = ipFileName
    my_lst = []
    # takes every file in the fasta folder and puts it in the files list
    for f in os.listdir(dir):
        # splits the file name into the name and its extension
        file, file_ext = os.path.splitext(f)
        if file == ipFileName:
            with open(os.path.join(dir, f)) as file_object:
                # skip the first 3 header lines
                for _ in range(3):
                    next(file_object)
                for line in file_object:
                    my_lst.append(' '.join(line.strip().split()))
    return my_lst
Expected results:
['-1 2 -3 4 5 6 7'], ['3 -1 3 4 0 -2 1'], ['3 -1 3 6 0 -2 5']
Actual results:
['1 F -1 2 -3 4 5 6 7'], ['2 L 3 -1 3 4 0 -2 1'], ['3 A 3 -1 3 6 0 -2 5'], [' '], [' unimportant info'], ['unimportant info']
Try this solution.
import re

reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')

text = """
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)"""

ignore_start = 4  # skip indices 0-3: the leading blank line, the two header lines, and '-blank line'
expected_array = []

for index, line in enumerate(text.splitlines()):
    if index >= ignore_start:
        if reg.search(line):
            result = reg.search(line).group(0).strip()
            # Use result
            expected_array.append(result)

print(expected_array)
# Result: [
# '-1 2 -3 4 5 6 7',
# '3 -1 3 4 0 -2 1',
# '3 -1 3 6 0 -2 5'
# ]
OK, so it looks to me like you have a file with certain lines that you want, and the lines you want always start with a number followed by a letter. So we can apply a regular expression that matches only lines fitting that pattern, and captures only the numbers after it.
The expression for this would look like (?<=[0-9]\s[A-Z]\s)[0-9\-\s]+
import re

reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')

for line in file:
    if reg.search(line):
        result = reg.search(line).group(0).strip()
        # Use result
        my_lst.append(result)
Hope that helps
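Note that neither snippet limits the output to the first 20 columns after the letter, as the question asks. Here is a regex-free sketch of that part; the filename and the shape of the rows are assumptions based on the sample shown:
# 'data.txt' is a placeholder name; data rows are assumed to start
# with a row number followed by a single letter.
my_lst = []
with open('data.txt') as file_object:
    for line in file_object:
        parts = line.split()
        if len(parts) > 2 and parts[0].isdigit() and parts[1].isalpha():
            # skip the number and letter, keep at most 20 values
            my_lst.append(' '.join(parts[2:22]))
print(my_lst)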

reading a file that is detected as being one column

I have a file full of numbers in the form:
010101228522 0 31010 3 3 7 7 43 0 2 4 4 2 2 3 3 20.00 89165.30
01010222852313 3 0 0 7 31027 63 5 2 0 0 3 2 4 12 40.10 94170.20
0101032285242337232323 7 710153 9 22 9 9 9 3 3 4 80.52 88164.20
0101042285252313302330302323197 9 5 15 9 15 15 9 9 110.63 98168.80
01010522852617 7 7 3 7 31330 87 6 3 3 2 3 2 5 15 50.21110170.50
...
...
I am trying to read this file, but I am not sure how to go about it. When I use the built-in function open, or loadtxt from numpy, and even when converting to pandas, the file is read as one column; that is, its shape is (364 x 1). I want the numbers separated into columns, with the blank spaces replaced by zeros. Any help would be appreciated. Note: in some places there are two consecutive spaces.
If the column's content type is a string, have you tried using str.split()? This will turn the string into a list, giving you each number that is separated by a gap. You could then loop over the items of that list to build a table. Not quite sure this has answered the question; sorry if not. A small example of str.split() follows.
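For illustration, a minimal example of that idea. Note that split() with no argument also collapses runs of spaces; but for this file, where some numbers run together with no space at all, whitespace splitting alone will not fully separate the columns:
# One sample line from the file (two spaces between some numbers):
line = "010101228522 0 31010 3 3  7  7 43"
parts = line.split()               # ['010101228522', '0', '31010', ...]
numbers = [int(p) for p in parts]  # convert each piece to an int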
So I finally solved my problem. I actually had to strip the lines and then read each "letter" from the line; in my case I am picking the individual numbers from the stripped line and appending them to an array. Here is the code for my solution:
import pandas as pd

arr = []
with open('Kp2001', 'r') as f:
    for ii, line in enumerate(f):
        arr.append([])      # creates an n-d array
        cnt = line.strip()  # strip the line
        for letter in cnt:  # get each 'letter' from the line, in my case the individual numbers
            arr[ii].append(letter)  # append them individually so Python does not read them as one string
df = pd.DataFrame(arr)    # converting to DataFrame gives proper columns and keeps the spaces in their respective columns
df2 = df.replace(' ', 0)  # replace the spaces with what you will
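As an alternative, if you can work out the width of every column, pandas can parse fixed-width files directly with read_fwf. A minimal sketch; the widths list below is purely illustrative and must be adjusted to the real layout:
import pandas as pd

# Hypothetical column widths; replace with the actual layout of 'Kp2001'.
widths = [12] + [3] * 14 + [6, 9]
df = pd.read_fwf('Kp2001', widths=widths, header=None)
df = df.fillna(0)  # blank fields are parsed as NaN; turn them into zeros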

Why is set not calculating my unique integers?

I just started teaching myself Python last night via the Python documentation, tutorials, and SO questions.
So far I can ask a user for a file, open and read the file, skip all # comment lines and blank \n lines in the file, read each line into a list, and count the number of integers per line.
I now want to calculate the number of unique integers per line. I realized that Python has a set type, which I thought would work perfectly for this calculation. However, I always receive a value one greater than the prior value (I will show you below). I looked at other SO posts related to sets and do not see what I am missing; I have been stumped for a while.
Here is the code:
with open(filename, 'r') as file:
    for line in file:
        if line.strip() and not line.startswith("#"):
            # calculate the number of integers per line
            names_list.append(line)
            #print "There are ", len(line.split()), " numbers on this line"
            #print names_list
            # calculate the number of unique integers
            myset = set(names_list)
            print myset
            myset_count = len(myset)
            print "unique:", myset_count
For further explanation:
names_list is:
['1 2 3 4 5 6 5 4 5\n', '14 62 48 14\n', '1 3 5 7 9\n', '123 456 789 1234 5678\n', '34 34 34 34 34\n', '1\n', '1 2 2 2 2 2 3 3 4 4 4 4 5 5 6 7 7 7 1 1\n']
and myset is:
set(['1 2 3 4 5 6 5 4 5\n', '1 3 5 7 9\n', '34 34 34 34 34\n', '14 62 48 14\n', '1\n', '1 2 2 2 2 2 3 3 4 4 4 4 5 5 6 7 7 7 1 1\n', '123 456 789 1234 5678\n'])
The output I receive is:
unique: 1
unique: 2
unique: 3
unique: 4
unique: 5
unique: 6
unique: 7
The output that should occur is:
unique: 6
unique: 3
unique: 5
unique: 5
unique: 1
unique: 1
unique: 7
Any suggestions as to why my set per line is not calculating the correct number of unique integers per line? I would also like any suggestions on how to improve my code in general (if you would like) because I just started learning Python by myself last night and would love tips. Thank you.
The problem is that, as you iterate over your file, you append each line to the list names_list. After that, you build a set out of these lines. Your text file does not seem to have any duplicate lines, so printing the length of your set just displays the current number of lines you have processed.
Here's a commented fix:
with open(filename, 'r') as file:
    for line in file:
        if line.strip() and not line.startswith("#"):
            numbers = line.split()         # splits the string by whitespace and gives you a list
            unique_numbers = set(numbers)  # builds a set of the strings in numbers
            print(len(unique_numbers))     # prints the number of items in the set
Note that we build a set from the currently processed line (after splitting it), whereas your original code stores all lines and then builds a set of the lines themselves on each iteration.
myset = set(names_list)
should be
myset = set(line.split())
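One caveat with both versions: they compare the numbers as strings, so e.g. '07' and '7' would be counted as two different values. If that can occur in your input, convert to int first:
myset = set(int(n) for n in line.split())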

Reading and Rearranging data in Python

I have a very large (10GB) data file of the form:
A B C D
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
I would like to read just the B column of the file and rearrange it in the form
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
It takes a very long time to read the data and rearrange it; could someone give me an efficient method to do this in Python?
This is the code that I used in MATLAB for processing the data:
fid = fopen('hpts.out', 'r');  % Open text file
InputText = textscan(fid, '%s', 1, 'delimiter', '\n');  % Read header lines
HeaderLines = InputText{1}
A = textscan(fid, '%n %n %n %n %n', 'HeaderLines', 1);
t = A{1};
vz = A{4};
L = 1;
for j = 1:1:5000
    for i = 1:1:14999
        V1(j,i) = vz(L);
        L = L + 1;
    end
end
imagesc(V1);
You can use Python for this, but I think this is exactly the sort of job where a shell script is better, since it's a lot shorter and easier:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
tail removes the first (header) line;
awk gets the second column;
tr puts it on a single line;
and fmt makes lines a maximum of 10 characters.
Since this is a streaming operation, it should not take much memory, and performance is mostly limited by disk I/O (although shell pipes also introduce some overhead).
Example:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
This streaming approach should perform well:
from itertools import izip_longest

with open('yourfile', 'r') as fin, open('newfile', 'w') as fout:
    # discard header row
    next(fin)
    # make a generator for the second column
    col2values = (line.split()[1] for line in fin)
    # zip into groups of five;
    # fillvalue is used to make a partial last row look good
    for row in izip_longest(*[col2values] * 5, fillvalue=''):
        fout.write(' '.join(row) + '\n')
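On Python 3 the same approach works unchanged, except that izip_longest was renamed to zip_longest:
from itertools import zip_longest  # Python 3 name for izip_longest

with open('yourfile', 'r') as fin, open('newfile', 'w') as fout:
    next(fin)  # discard header row
    col2values = (line.split()[1] for line in fin)
    for row in zip_longest(*[col2values] * 5, fillvalue=''):
        fout.write(' '.join(row) + '\n')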
Don't read the whole file at once! Read the file line by line:
def read_data():
    with open("filename.txt", 'r') as f:
        next(f)  # skip the header row
        for line in f:
            yield line.split()[1]

with open('file_to_save.txt', 'w') as f:
    for i, data in enumerate(read_data()):
        f.write(data + ' ')
        if (i + 1) % 5 == 0:  # start a new row after every 5th value
            f.write('\n')
