Python script for copying a column

This seems like a simple question, but I can't find an answer.
Input:
a 3 4
b 1 4
c 8 3
d 3 8
Wanted output:
a a 3 4
b b 1 4
c c 8 3
d d 3 8
Note: the input .txt file has many more rows than shown here.

You didn't ask for it, but would you want awk? You could do:
awk '{$1=$1 OFS $1}1' Input
This rewrites the first field as two copies of itself separated by the output field separator, and the trailing 1 prints every (modified) line. Or the more obvious but less flexible:
awk '{print $1, $1, $2, $3}' Input

Assuming you've read your results into a list, you want:
values = ["a",1,2,3]
values.insert(0,values[0])
This inserts the value of index 0 (in this case "a") at position 0, moving all the other contents of values to the right.
This will also work on strings, so if your results are read in as a single string you can do the following - note that I am keeping the spaces between the values and doing it a bit differently:
values = "a 1 2 3"
values = values[:2] + values
# values is now "a a 1 2 3"
Here we take the first two characters of the string (values[:2], equivalent to values[0:2], i.e. "a ") and prepend them to the existing string.
Hope this helps!

Try this:
fin = open("text.txt")
content = fin.readlines()
fin.close()
for elem in content:
    # elem[0] is the first character; strip the trailing newline before printing
    print(elem[0], elem[0] + elem[1:].rstrip('\n'))
Output:
a a 3 4
b b 1 4
c c 8 3
d d 3 8

with open("sample.csv") as inputs:
for line in inputs:
trimed_line = line.strip()
parts = trimed_line.split()
print("{0} {1}".format(parts[0], trimed_line))
output:
a a 3 4
b b 1 4
c c 8 3
d d 3 8

Related

Edit columns in a text file

I have a text file containing 4 columns. I need to remove the first two columns and replace them with one column. The value which I should put as the new column is being produced in a loop. Here is something I am trying to do.
The input is like this:
1 2 3 4
5 6 7 8
9 1 2 3
The output should be like this:
d 3 4
d 7 8
d 2 3
but "d" is a variable that is being produced in a loop for each line.
with open('EQ.txt','r') as f:
    i = 0
    for line in f
        ...
        ...
        d = r + d
        with open(c.txt, "w") as wrt:
            new_line = d\n.format(line[2], line[3])
            wrt.write(new_line)
What you want is:
new_line = "%d %d %d\n" % (d, line[2], line[3])
The format string has to be in quotes, with %d placeholders marking where you want decimal numbers. Then you list all three values in the tuple after the % operator.
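For completeness, here is a minimal sketch of how the whole loop might look once that line is fixed. The file names EQ.txt and c.txt are taken from the question, compute_d is a hypothetical placeholder for however d is produced in the loop, and I am assuming the columns come from splitting each line:
# Hypothetical complete version of the loop from the question.
with open('EQ.txt', 'r') as f, open('c.txt', 'w') as wrt:
    for line in f:
        cols = line.split()         # cols[2] and cols[3] are the last two columns
        d = compute_d(cols)         # placeholder for however d is produced per line
        wrt.write("%s %s %s\n" % (d, cols[2], cols[3]))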

Reading a file that is detected as being one column

I have a file full of numbers in the form;
010101228522 0 31010 3 3 7 7 43 0 2 4 4 2 2 3 3 20.00 89165.30
01010222852313 3 0 0 7 31027 63 5 2 0 0 3 2 4 12 40.10 94170.20
0101032285242337232323 7 710153 9 22 9 9 9 3 3 4 80.52 88164.20
0101042285252313302330302323197 9 5 15 9 15 15 9 9 110.63 98168.80
01010522852617 7 7 3 7 31330 87 6 3 3 2 3 2 5 15 50.21110170.50
...
...
I am trying to read this file, but I am not sure how to go about it. Whether I use the built-in open(), numpy's loadtxt, or even pandas, the file is read as one column, that is, its shape is (364 x 1). I want the numbers separated into columns and the blank spaces replaced by zeros. Any help would be appreciated. NOTE: in some places two spaces follow each other.
If the column's content type is a string, have you tried using str.split()? This will turn the string into a list, with each number split at each gap. You could then loop over the items in that list to build a table out of it. Not quite sure this has answered the question, sorry if not.
See the documentation for str.split().
So I finally solved my problem. I had to strip each line and then read every "letter" from it - in my case picking the individual numbers from the stripped line and appending them to an array. Here is the code for my solution:
import pandas as pd

arr = []
with open('Kp2001', 'r') as f:
    for ii, line in enumerate(f):
        arr.append([])              # Creates an n-d array (one sub-list per line)
        cnt = line.strip()          # Strip the line
        for letter in cnt:          # Get each 'letter' from the line, in my case the individual numbers
            arr[ii].append(letter)  # Append them individually so python does not read them as one string

df = pd.DataFrame(arr)    # Converting to DataFrame gives proper columns and keeps the spaces in their respective columns
df2 = df.replace(' ', 0)  # Replace the spaces with what you will
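The same idea in a slightly more compact form - just a sketch, assuming the file name Kp2001 from the answer above and that pandas is installed:
import pandas as pd

# Read each line character by character so the fixed-width columns stay aligned,
# then swap the blank cells for zeros.
with open('Kp2001') as f:
    rows = [list(line.rstrip('\n')) for line in f]

df = pd.DataFrame(rows).replace(' ', 0)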

Removing repetitions in one column and iterate lines to collapse in second column

I am looking for a way to generate some statistics on my model predictions.
On the left I have true values and on the right I have predictions.
My true values are over an interval so I want to condense them into a single value for each interval and know which predictions were made.
I think I need to do something like "uniq" to the first column and iterate each line of the second column until the value changes in the first column.
I would imagine that awk would be very good at this, using $1 and $2 to refer to the columns, but iterating over the second column without losing the information in the first column is where I am stuck. It is worth noting that the values in the first column can occur many times, and I want them repeated for each separate interval, just not when they occur consecutively.
I can accept any code that is in shell or python.
Example Input:
1 1
1 0
1 1
2 2
2 2
1 1
3 3
3 3
3 2
3 3
2 3
2 2
2 1
Example output:
1 1 0 1
2 2 2
1 1
3 3 3 2 3
2 3 2 1
Really simple using awk:
awk 'NR>1{cr="\n"}L!=$1{printf "%s%s ",cr,$1;L=$1}{printf " %s" ,$2}END{print ""}' input
Result
1 1 0 1
2 2 2
1 1
3 3 3 2 3
2 3 2 1
Explanation
NR>1{cr="\n"}: cr (the line break) is empty while the first record is processed and is set to a newline from the second record on (NR>1).
L!=$1{printf "%s%s ",cr,$1;L=$1}: if L (the last key) differs from the current one ($1), print cr (empty for the first record) followed by the current key $1, and store that key in L as the last key processed.
{printf " %s" ,$2}: just print the second column of each record.
END{print ""}: print a final newline when all records have been processed.
Here's a version in bash:
#!/bin/bash
while read a b; do
    if [ "$a" != "$val" ]; then
        [ -n "$val" ] && echo "$val $pred"
        val=$a
        pred=$b
    else
        pred="$pred $b"
    fi
done <inputfile
[ -n "$val" ] && echo "$val $pred"

Finding the most frequent items in a dataset

I am working with a big dataset and thus I only want to use the items that are most frequent.
Simple example of a dataset:
1 2 3 4 5 6 7
1 2
3 4 5
4 5
4
8 9 10 11 12 13 14
15 16 17 18 19 20
4 has 4 occurrences,
1 has 2 occurrences,
2 has 2 occurrences,
5 has 2 occurrences,
I want to be able to generate a new dataset just with the most frequent items, in this case the 4 most common:
The wanted result:
1 2 3 4 5
1 2
3 4 5
4 5
4
I am finding the first 50 most common items, but I am failing to print them out in a correct way. (my output is resulting in the same dataset)
Here is my code:
from collections import Counter

with open('dataset.dat', 'r') as f:
    lines = []
    for line in f:
        lines.append(line.split())

c = Counter(sum(lines, []))
p = c.most_common(50)

with open('dataset-mostcommon.txt', 'w') as output:
    ..............
Can someone please help me on how I can achieve it?
You have to iterate over the dataset again and, for each line, keep only the items that are in the most-common set.
If the input lines are sorted, you could do a set intersection and print the result in sorted order. If not, iterate over each line's items and check each one:
for line in dataset:
    for element in line.split():
        if element in most_common_elements:
            print(element, end=' ')
    print()
PS: For Python 2, add from __future__ import print_function on top of your script
According to the documentation, c.most_common returns a list of tuples, so you can get the desired output as follows:
with open('dataset-mostcommon.txt', 'w') as output:
    for item, occurrence in p:
        output.write("%s has %d occurrences,\n" % (item, occurrence))
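Putting the two answers together, a minimal end-to-end sketch (file names taken from the question) that counts the items, keeps the 50 most common, and writes out the filtered dataset:
from collections import Counter

with open('dataset.dat', 'r') as f:
    lines = [line.split() for line in f]

# Count every item across all lines and keep the 50 most common ones.
counts = Counter(item for line in lines for item in line)
most_common_items = {item for item, _ in counts.most_common(50)}

# Write each line back out, keeping only the frequent items.
with open('dataset-mostcommon.txt', 'w') as output:
    for line in lines:
        output.write(' '.join(item for item in line if item in most_common_items) + '\n')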

Reading and Rearranging data in Python

I have a very large (10GB) data file of the form:
A B C D
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
I would like to read just the B column of the file and rearrange it in the form
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
It takes a very long time to read the data and rearrange it. Could someone give me an efficient method to do this in Python? This is the code I used in MATLAB to process the data:
fid = fopen('hpts.out', 'r');                           % Open text file
InputText = textscan(fid, '%s', 1, 'delimiter', '\n');  % Read header lines
HeaderLines = InputText{1}
A = textscan(fid, '%n %n %n %n %n', 'HeaderLines', 1);
t = A{1};
vz = A{4};
L = 1;
for j = 1:1:5000
    for i = 1:1:14999
        V1(j,i) = vz(L);
        L = L + 1;
    end
end
imagesc(V1);
You can use Python for this, but I think this is exactly the sort of job where a shell pipeline is better, since it's a lot shorter and easier:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
tail removes the first (header) line;
awk gets the second column;
tr puts it on a single line;
and fmt makes lines a maximum of 10 characters.
Since this is a streaming operation, it should not take a lot of memory, and most performance for this is limited to just disk I/O (although shell pipes also introduce some overhead).
Example:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
This streaming approach should perform well:
from itertools import izip_longest  # Python 2; on Python 3 use itertools.zip_longest

with open('yourfile', 'r') as fin, open('newfile', 'w') as fout:
    # discard header row
    next(fin)
    # make a generator for the second column
    col2values = (line.split()[1] for line in fin)
    # zip into groups of five;
    # fillvalue makes a partial last row look good.
    for row in izip_longest(*[col2values] * 5, fillvalue=''):
        fout.write(' '.join(row) + '\n')
Don't read the whole file at once! Read the file line by line:
def read_data():
    with open("filename.txt", 'r') as f:
        next(f)                    # skip the header row
        for line in f:
            yield line.split()[1]  # second column

with open('file_to_save.txt', 'w') as f:
    for i, data in enumerate(read_data()):
        f.write(data + ' ')
        if (i + 1) % 5 == 0:       # start a new line after every fifth value
            f.write('\n')
