Splitting a string into a list of integers with Python - python

This method inputs a file and the directory of the file. It contains a matrix of data, and needs to copy the first 20 columns of each row after the given row name and the corresponding letter for the row. The first 3 lines of each file is skipped because it has unimportant information that is not needed, and it also doesn't need the data at the bottom of the file.
For example a file would look like:
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)
-blank line
unimportant information--------
unimportant information--------
The output of the method needs to print out a "matrix" in some given form.
So far the output gives a list of each row as a string, however I'm trying to figure out the best way to approach the problem. I don't know how to ignore the unimportant information at the end of the files. I don't know how to only retrieve the first 20 columns after the letter in each row, and I don't know how to ignore the row number and the row letter.
def pssmMatrix(self,ipFileName,directory):
dir = directory
filename = ipFileName
my_lst = []
#takes every file in fasta folder and put in files list
for f in os.listdir(dir):
#splits the file name into file name and its extension
file, file_ext = os.path.splitext(f)
if file == ipFileName:
with open(os.path.join(dir,f)) as file_object:
for _ in range(3):
next(file_object)
for line in file_object:
my_lst.append(' '.join(line.strip().split()))
return my_lst
Expected results:
['-1 2 -3 4 5 6 7'], ['3 -1 3 4 0 -2 1'], ['3 -1 3 6 0 -2 5']
Actual results:
['1 F -1 2 -3 4 5 6 7'], ['2 L 3 -1 3 4 0 -2 1'], ['3 A 3 -1 3 6 0 -2 5'], [' '], [' unimportant info'], ['unimportant info']

Try this solution.
import re
reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')
text = """
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)"""
ignore_start = 5 # 0,1,2,3 = 4
expected_array = []
for index, line in enumerate(text.splitlines()):
if(index >= ignore_start):
if reg.search(line):
result = reg.search(line).group(0).strip()
# Use Result
expected_array.append(' '.join(result))
print(expected_array)
# Result: [
#'- 1 2 - 3 4 5 6 7',
#'3 - 1 3 4 0 - 2 1',
#'3 - 1 3 6 0 - 2 5'
#]

Ok so it looks to me like you have a file with certain lines that you want and the lines that you want always start with a number followed by a letter. So what we can do is apply a regular expression to this that only gets lines that match that pattern and only get the numbers after the pattern
The expression for this would look like (?<=[0-9]\s[A-Z]\s)[0-9\-\s]+
import re
reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')
for line in file:
if reg.search(line):
result = reg.search(test).group(0)
# Use Result
my_lst.append(' '.join(result))
Hope that helps

Related

csv.writer.writerow is separating my data improperly [duplicate]

This question already has answers here:
Why does csvwriter.writerow() put a comma after each character?
(4 answers)
Closed 3 years ago.
I made a code to delete certain columns (column 0,1,2,3,4,5,6) from a bunch of .csv dataset.
import csv
import os
data_path = "C:/Users/hhs/dataset/PSP/Upper/"
save_path = "C:/Users/hhs/Refined/PSP/Upper/"
for filename in os.listdir(data_path):
data_full_path = os.path.join(data_path, filename)
save_full_path = os.path.join(save_path, filename)
with open(data_full_path,"r") as source:
rdr= csv.reader(source)
with open(save_full_path,"w") as result:
wtr= csv.writer( result )
for line in rdr:
wtr.writerow((line[7]))
One of original dataset looks like this
Normals:0 Normals:1 Normals:2 Points:0 Points:1 Points:2 area cp
-0.69498 0.62377 0.34311 28.829 3.4728 -0.947160 0.25877 -0.094391
-0.73130 0.54405 0.39395 30.082 4.9111 -0.785480 0.23499 -0.261690
-0.74539 0.49691 0.42782 31.210 6.4629 -0.626470 0.20982 -0.330730
-0.75245 0.48322 0.42985 32.359 8.0473 -0.455080 0.19428 -0.221340
-0.77195 0.46254 0.41825 33.546 9.7963 -0.270990 0.19849 -0.086641
-0.78905 0.45241 0.39759 34.737 11.6860 -0.079976 0.18456 -0.022418
-0.79771 0.45422 0.37858 35.915 13.5840 0.118160 0.17047 0.026102
-0.80090 0.45479 0.37198 37.092 15.4810 0.330220 0.15594 0.154880
-0.80260 0.45516 0.36904 38.268 17.3770 0.550100 0.14279 0.316590
-0.80504 0.45774 0.36178 39.444 19.2740 0.769020 0.12996 0.475640
-0.80747 0.46024 0.35383 40.620 21.1710 0.982050 0.11692 0.624090
The result does have the last column as "cp" values, which is what I want.
However, the result looks very weird, every digit is located at different columns.
c p
- 0 . 0 9 4 3 9
- 0 . 2 6 1 6 9
- 0 . 3 3 0 7 3
- 0 . 2 2 1 3 4
- 0 . 0 8 6 6 4
- 0 . 0 2 2 4 1
0 . 0 2 6 1 0 2
0 . 1 5 4 8 8
0 . 3 1 6 5 9
0 . 4 7 5 6 4
0 . 6 2 4 0 9
.
.
.
Why the result looks like this?
Fix two issues in the second loop
add newline & delimiter
CSV file written with Python has blank lines between each row
change (line[7]) to [line[7]]
Why does csvwriter.writerow() put a comma after each character?
with open(save_full_path, "w", newline='') as result:
wtr= csv.writer(result, delimiter=',')
for line in rdr:
wtr.writerow([line[7]])

Finding particular column value via regex

I have a txt file containing multiple rows as below.
56.0000 3 1
62.0000 3 1
74.0000 3 1
78.0000 3 1
82.0000 3 1
86.0000 3 1
90.0000 3 1
94.0000 3 1
98.0000 3 1
102.0000 3 1
106.0000 3 1
110.0000 3 0
116.0000 3 1
120.0000 3 1
Now I am looking for the row which has '0' in the third column .
I am using python regex package. What I have tried is re.match("(.*)\s+(0-9)\s+(1)",line) but of no help..
What should be the regular expression pattern I should be looking for?
You probably don't need a regex for this. You can strip trailing whitespaces from the right side of the line and then check the last character:
if line.rstrip()[-1] == "0": # since your last column only contains 0 or 1
...
Just split line and read value from list.
>>> line = "56.0000 3 1"
>>> a=line.split()
>>> a
['56.0000', '3', '1']
>>> print a[2]
1
>>>
Summary:
f = open("sample.txt",'r')
for line in f:
tmp_list = line.split()
if int(tmp_list[2]) == 0:
print "Line has 0"
print line
f.close()
Output:
C:\Users\dinesh_pundkar\Desktop>python c.py
Line has 0
110.0000 3 0

Python script for copying a column

This seems like a simple question, but I can't find an answer.
Input:
a 3 4
b 1 4
c 8 3
d 3 8
Wanted output:
a a 3 4
b b 1 4
c c 8 3
d d 3 8
Note: the file .txt input has many rows in the first column.
You didn't ask for it, but would you want awk? You could do:
awk '{$1=$1 OFS $1}1' Input
or the more obvious but less flexible:
awk '{print $1 $1 $2 $3}' Input
Assuming you've read your results in an array, you want:
values = ["a",1,2,3]
values.insert(0,values[0])
This inserts the value of index 0 (in this case "a") at position 0, moving all the other contents of values to the right.
This will also work on strings, so if your results are read as a string you can do the following - please note that I am including the spaces after each digit and am doing it a bit differently:
values="a 1 2 3"
values = values[:2] + values
In this example we take the first two array members (values[:2] or values[0:2]) and adding the existing array values to the end.
Hope this helps!
Try this:
fin = open("text.txt")
content = fin.readlines()
fin.close()
for elem in content:
print(elem[0],elem[0]+elem[1:-1])
Output:
a a 3 4
b b 1 4
c c 8 3
d d 3 8
with open("sample.csv") as inputs:
for line in inputs:
trimed_line = line.strip()
parts = trimed_line.split()
print("{0} {1}".format(parts[0], trimed_line))
output:
a a 3 4
b b 1 4
c c 8 3
d d 3 8

Reading and Rearranging data in Python

I have a very large (10GB) data file of the form:
A B C D
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
I would like to read just the B column of the file and rearrange it in the form
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
it takes very long time to read the data and rearrange them, could some give me a very efficient method to do this in python
This is the code that I used for my MATLAB for processing the data
fid = fopen('hpts.out', 'r'); % Open text file
InputText = textscan(fid, '%s', 1, 'delimiter', '\n'); % Read header lines
HeaderLines = InputText{1}
A = textscan(fid,'%n %n %n %n %n', 'HeaderLines', 1);
t = A{1};
vz = A{4};
L = 1;
for j = 1:1:5000
for i=1:1:14999
V1(j,i) = vz(L);
L = L +1 ;
end
end
imagesc(V1);
You can us Python for this, but I think this is exactly the sort of job where a shell script is better, since it's a lot shorter & easier:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
tail removes the first (header) line;
awk gets the second column;
tr puts it on a single line;
and fmt makes lines a maximum of 10 characters.
Since this is a streaming operation, it should not take a lot of memory, and most performance for this is limited to just disk I/O (although shell pipes also introduce some overhead).
Example:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
This streaming approach should perform well:
from itertools import izip_longest
with open('yourfile', 'r') as fin, open('newfile', 'w') as fout:
# discard header row
next(fin)
# make generator for second column
col2values = (line.split()[1] for line in fin)
# zip into groups of five.
# fillvalue used to make a partial last row look good.
for row in izip_longest(*[col2values ]*5, fillvalue=''):
fout.write(' '.join(row) + '\n')
Dont't read the whole file at one time! Read the file line by line:
def read_data():
with open("filename.txt", 'r') as f:
for line in f:
yield line.split()[1]
with open('file_to_save.txt', 'w') as f:
for i, data in enumerate(read_data()):
f.write(data)
if i % 5 == 0:
f.write('\n')

Listing distributions in Python

I have a sheet of numbers, separated by spaces into columns. Each column represents a different category, and within each column, each number represents a different value. For example, column number four represents age, and within the column, the number 5 represents an age of 44-55. Obviously, each row is a different person's record. I'd like to use a Python script to search through the the sheet, and find all columns where the sixth column is number "1." After that, I want to know how many times each number in column one appears where the number in column six is equal to "1." The script should output to the user that "While column six equals '1', the value '1' appears 12 times in column one. The value '2' appears 18 times..." etc. I hope I'm being clear here. I just want it to list the numbers, basically. Anyway, I'm new to Python. I've attached my code below. I think I should be using dictionaries, but I'm just not totally sure how. So far, I haven't really come close to figuring this out. I would really appreciate if someone could walk me through the logic that would be behind such code. Thank you so much!
ldata = open("list.data", "r")
income_dist = {}
for line in ldata:
linelist = line.strip().split(" ")
key_income_dist = linelist[6]
if key_income_dist in income_dist:
income_dist[key_income_dist] = 1 + income_dist[key_income_dist]
else:
income_dist[key_income_dist] = 1
ldata.close()
print value_no_occupations
First, indentation is majorly important in Python and the above is bad: the 5 lines following linelist = line.strip().split(" ") need to be indented to be in the loop like they should be.
Next they should be indented further and this line added before them:
if len(linelist)>6 and linelist[6]=="1":
This line skips over short lines (there are some), and tests for what you said you wanted: "where column six equals "1."" This is column [6] where the first number on the line is referenced as [0] (these are "offsets", not "cardinal", or counting, numbers).
You'll probably want to change key_income_dist = linelist[6] to key_income_dist = linelist[0] or [1] to get what you want. Play around if necessary.
Finally, you should say print income_dist at the end to get a look at your results. If you want fancier output, study up on formatting.
This is actually easier than it seems! The key is collections.Counter
from collections import Counter
ldata = open("list.data")
rows = [tuple(row.split()) for row in ldata if row.split()[5]==1]
# warning this will break if some rows are shorter than 6 columns
first_col = Counter(item[0] for item in rows)
If you want the distribution of every column (not just the first) do:
distribution = {column: Counter(item[column] for item in rows) for column in range(len(rows[0]))}
# warning this will break if all rows are not the same size!
Considering that the data file has ~9000 rows of data, if you don't want to keep the original data, you can combine step 1 & 2 to make the program use less memory and a little faster.
ldata = open("list.data", "r")
# read in all the rows, note that the list values are strings instead of integers
# keep only the rows with 6th column = '1'
only1 = []
for line in ldata:
if line.strip() == '': # ignor blank lines
continue
row = tuple(line.strip().split(" "))
if row[5] == '1':
only1.append(row)
ldata.close()
# tally the statistics
income_dist = {}
for row in only1:
if row[0] in income_dist:
income_dist[row[0]] += 1
else:
income_dist[row[0]] = 1
# print result
print "While column six equals '1',"
for num in sorted(income_dist):
print "the value %s appears %d times in column one." % (num, income_dist[num])
Sample Test Data in list.data:
9 2 1 5 4 5 5 3 3 0 1 1 7 NA
9 1 1 5 5 5 5 3 5 2 1 1 7 1
9 2 1 3 5 1 5 2 3 1 2 3 7 1
1 2 5 1 2 6 5 1 4 2 3 1 7 1
1 2 5 1 2 6 3 1 4 2 3 1 7 1
8 1 1 6 4 8 5 3 2 0 1 1 7 1
1 1 5 2 3 9 4 1 3 1 2 3 7 1
6 1 3 3 4 1 5 1 1 0 2 3 7 1
2 1 1 6 3 8 5 3 3 0 2 3 7 1
4 1 1 7 4 8 4 3 2 0 2 3 7 1
1 1 5 2 4 1 5 1 1 0 2 3 7 1
4 2 2 2 3 2 5 1 2 0 1 1 5 1
8 2 1 3 6 6 2 2 4 2 1 1 7 1
7 2 1 5 3 5 5 3 4 0 2 1 7 1
1 1 5 2 3 9 4 1 3 1 2 3 7 1
6 1 3 3 4 1 5 1 1 0 2 3 7 1
2 1 1 6 3 8 5 3 3 0 2 3 7 1
4 1 1 7 4 8 4 3 2 0 2 3 7 1
1 1 5 2 4 9 5 1 1 0 2 3 7 1
4 2 2 2 3 2 5 1 2 0 1 1 5 1
Following your original program logic, I come up with this version:
ldata = open("list.data", "r")
# read in all the rows, note that the list values are strings instead of integers
linelist = []
for line in ldata:
linelist.append(tuple(line.strip().split(" ")))
ldata.close()
# keep only the rows with 6th column = '1'
only1 = []
for row in linelist:
if row[5] == '1':
only1.append(row)
# tally the statistics
income_dist = {}
for row in only1:
if row[0] in income_dist:
income_dist[row[0]] += 1
else:
income_dist[row[0]] = 1
# print result
print "While column six equals '1',"
for num in sorted(income_dist):
print "the value %s appears %d times in column one." % (num, income_dist[num])

Categories

Resources