I am very new to Python and would be grateful for some guidance with the following.
I have a text file with over 5 million rows and 8 columns, and I am trying to add 15 to each value in column 4 only.
For example:
10 21 34 12 50 111 234 21 7
21 10 23 56 80 90 221 78 90
Would be changed to:
10 21 34 12 **65** 111 234 21 7
21 10 23 56 **95** 90 221 78 90
My script below allows me to isolate the column, but when I try to add any amount to it I get "TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'".
file = open("file.txt")
column = []
for line in file:
    column.append(int(line.split("\t")[3]))
print column
Any advice would be great.
Try this to get you started -- there are many better ways using libraries, but this will show you some basic file-handling methods anyway. It works for the data you posted -- as long as the delimiter in your file is a double space ("  ") and everything in the column can be cast to an int. If not.....
Also -- note that the correct way to start a script is with:
if __name__ == "__main__":
This is because you won't generally want any code to execute if you are making a library...
__author__ = 'charlie'

in_filename = "in_file.txt"
out_filename = "out_file.txt"
delimiter = "  "

def main():
    with open(in_filename, "r") as infile:
        with open(out_filename, "w") as outfile:
            for line in infile:
                ldata = line.split(delimiter)
                ldata[4] = str(int(ldata[4]) + 15)
                outfile.write(delimiter.join(ldata))

if __name__ == "__main__":
    main()
With pandas:
import pandas as pd
df = pd.read_clipboard(header=None)
df[4] += 15
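read_clipboard is handy for testing on the sample you pasted; for the real 5-million-row file, a sketch along these lines should work (the file names are placeholders, and I assume whitespace-delimited columns with no header row):

import pandas as pd

# read the whitespace-delimited file; header=None because there is no header row
df = pd.read_csv("file.txt", sep=r"\s+", header=None)

# label 4 is the fifth column (0-based), the one holding 50 and 80 in the example
df[4] += 15

# write the result back out, space-separated, with no index or header row
df.to_csv("out_file.txt", sep=" ", header=False, index=False)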
I am trying to automate some parts of my work. I have an INP file, which is text-like (but not a .txt file) and contains both strings and ints/floats. I'd like to replace certain columns, from the 6th row to the end, with the values in the output (result) of a loop.
Here's what I want to accomplish for the test.INP:
Keep the first 5 lines, and replace the data in columns 3-5 with the data in result. Hopefully the final test.INP file is not newly created, but has its data replaced in place.
Because the data to be replaced and the target data in result have the same dimensions, and to skip the first 5 lines, I am trying to define a function that reads test.INP line by line in reverse and replaces the values.
Python script:
...
with open('test.INP') as j:
    raw = j.readlines()

def replace(raw_line, sep='\t', idx=[2, 3, 4], values=result[-1:]):
    temp = raw[-1].split('\t')
    for i, v in zip(idx, values):
        temp[i] = str(v)
    return sep.join(temp)
raw[::-1] = replace(raw[::-1])
print('\n'.join(raw))
...
test.INP contents (before):
aa bb cc dd
abcd
e
fg
cols1 cols2 cols3 cols4 cols5 cols6
65 69 433 66 72 70b
65 75 323 61 71 68g
61 72 12 57 73 26c
Result contents:
[[329 50 58]
[258 47 66]
[451 38 73]]
My final goal is to get the test.INP below:
test.INP contents (after):
aa bb cc dd
abcd
e
fg
cols1 cols2 cols3 cols4 cols5 cols6
65 69 329 50 58 70b
65 75 258 47 66 68g
61 72 451 38 73 26c
But the code doesn't work as expected; it seems nothing changed in the test.INP file. Any suggestions?
I also get an error message at the bottom; it says:
ValueError                                Traceback (most recent call last)
<ipython-input-1-92f8c1020af3> in <module>
     36             temp[i] = str(v)
     37         return sep.join(temp)
---> 38 raw[::-1] = replace(raw[::-1])
     39 print('\n'.join(raw))

ValueError: attempt to assign sequence of size 100 to extended slice of size 8
I couldn't understand your code, so I built my own version.
Later I understood what you were trying to do -- you reverse the lines so you can work from the last one until you have used up all the results. The problem is that you forgot the loop that would do this. You run replace only once and send all the rows at once, but replace works on only one row and returns only one row -- so in the end you get one row (with 8 columns) that you try to assign in place of all the rows (probably 100 rows).
Here is a version that works for me. I put the text directly in the code, but I expect it will also work with text read from a file.
text = '''aa bb cc dd
abcd
e
fg
cols1\tcols2\tcols3\tcols4\tcols5\tcols6
65\t69\t433\t66\t72\t70b
65\t75\t323\t61\t71\t68g
61\t72\t12\t57\t73\t26c'''

results = [[329, 50, 58], [258, 47, 66], [451, 38, 73]]
idx = [2, 3, 4]
sep = '\t'

print(text)

#with open('test.INP') as j:
#    lines = j.readlines()

# split text into lines
lines = text.splitlines()

def replace(line_list, result_list, idx):
    for i, v in zip(idx, result_list):
        line_list[i] = str(v)
    return line_list

# start at line 5 and pair each line (text) with the values to replace
for line_number, result_as_list in zip(range(5, len(lines)), results):
    # convert line from string to list
    line_list = lines[line_number].split(sep)
    # replace values
    line_list = replace(line_list, result_as_list, idx)
    # convert line from list to string
    lines[line_number] = sep.join(line_list)

# join lines back into text
text = '\n'.join(lines)
print(text)

with open('test.INP', 'w') as j:
    j.write(text)
I have a contig file loaded in pandas like this:
>NODE_1_length_4014_cov_1.97676
1 AATTAATGAAATAAAGCAAGAAGACAAGGTTAGACAAAAAAAAGAG...
2 CAAAGCCTCCAAGAAATATGGGACTATGTGAAAAGACCAAATCTAC...
3 CCTGAAAGTGACGGGGAGAATGGAACCAAGTTGGAAAACACTCTGC...
4 GAGAACTTCCCCAATCTAGCAAGGCAGGCCAACATTCAAATTCAGG...
5 CCACAAAGATACTCCTCGAGAAGAGCAACTCCAAGACACATAATTG...
6 GTTGAAATGAAGGAAAAAATGTTAAGGGCAGCCAGAGAGAAAGGTC...
7 GGGAAGCCCATCAGACTAACAGCGGATCTCTCGGCAGAAACCCTAC...
8 TGGGGGCCAATATTCAACATTCTTAAAGAAAAGAATTTTCAACCCA...
9 GCCAAACTAAGCTTCATAAGCAAAGGAGAAATAAAATCCTTTACAG...
10 AGAGATTTTGTCACCACCAGGCCTGCCTTACAAGAGCTCCTGAAGG...
11 GAAAGGAAAAACCGGTACCAGCCACTGCAAAATCATGCCAAACTGT...
12 CTAGGAAGAAACTGCATCAACTAATGAGCAAAATAACCAGCTAACA...
13 TCAAATTCACACATAACAATATTAACCTTAAATGTAAATGGGCTAA...
14 AGACACAGACTGGCAAATTGGATAAAGAGTCAAGACCCATCAGTGT...
15 ACCCATCTCAAATGCAGAGACACACATAGGCTCAAAATAAAGGGAT...
16 CAAGCAAATGGAAAACAAAAAAAGGCAGGGGTTGCAATCCTAGTCT...
17 TTTAAACCAACAAAGATCAAAAGAGACAAAGAAGGCCATTACATAA...
18 ATTCAACAAGAAGAGCTAACTATCCTAAATATATATGCACCCAATA...
19 TTCATAAAGCAAGTCCTCAGTGACCTACAAAGAGACTTAGACTCCC...
20 GGAGACTTTAACACCCCACTGTCAACATTAGACAGATCAACGAGAC...
21 GATATCCAGGAATTGAACTCAGCTCTGCACCAAGCGGACCTAATAG...
22 CTCCACCCCAAATCAACAGAATATACATTCTTTTCAGCACCACACC...
23 ATTGACCACATAGTTGGAAGTAAAGCTCTCCTCAGCAAATGTAAAA...
24 ACAAACTGTCTCTCAGACCACAGTGCAATCAAATTAGAACTCAGGA...
25 CAAAACTGCTCAACTACATGAAAACTGAACAACCTGCTCCTGAATG...
26 AACAAAATGAAGGCAGAAATAAAGATGTTCTTTGAAACCAATGAGA...
27 TACCAGAATCTCTGGGACGCATTCAAAGCAGTGTGTAGAGGGAAAT...
28 GCCCACAAGAGAAAGCAGGAAAGATCTAAAATTGACACCCTAACAT...
29 CTAGAGAAGCAAGAGCAAACACATTCAAAAGCTAGCAGAAGGCAAG...
...
8540 >NODE_2518_length_56_cov_219
8541 CCCTTGTTGGTGTTACAAAGCCCTTGAACTACATCAGCAAAGACAA...
8542 >NODE_2519_length_56_cov_174
8543 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8544 >NODE_2520_length_56_cov_131
8545 CCCAGGAGACTTGTCTTTGCTGATGTAGTTCAAGAGCTTTGTAACA...
8546 >NODE_2521_length_56_cov_118
8547 GGCTCCCTATCGGCTCGAATTCCGCTCGACTATTATCGAATTCCGC...
8548 >NODE_2522_length_56_cov_96
8549 CCCGCCCCCAGGAGACTTGTCTTTGCTGATAGTAGTCGAGCGGAAT...
8550 >NODE_2523_length_56_cov_74
8551 AGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCTTTGTAACACCGA...
8552 >NODE_2524_length_56_cov_70
8553 TGCTCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCT...
8554 >NODE_2525_length_56_cov_59
8555 GAGACCCTTGTCGGTGTTACAAAGCCCTTTAACTACATCAGCAAAG...
8556 >NODE_2526_length_56_cov_48
8557 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8558 >NODE_2527_length_56_cov_44
8559 CCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATT...
8560 >NODE_2528_length_56_cov_42
8561 GAGACCCTTGTAGGTGTTACAAAGCCCTTGAACTACATCAGCAAAG...
8562 >NODE_2529_length_56_cov_38
8563 GAGACCCTTGTCGGTGTCACAAAGCCCTTGAACTACATCAGCAAAG...
8564 >NODE_2530_length_56_cov_29
8565 GAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATTCT...
8566 >NODE_2531_length_56_cov_26
8567 AGGTTCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGG...
8568 >NODE_2532_length_56_cov_25
8569 GAGATGTGTATAAGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCT...
How can I split this one column into two columns, putting the >NODE_... identifier in one column and the corresponding sequence in another? Another issue is that the sequences span multiple lines; how can I join them into one string? The expected result looks like this:
contig sequence
NODE_1_length_4014_cov_1.97676 AAAAAAAAAAAAAAA
NODE_........ TTTTTTTTTTTTTTT
Thank you very much.
I can't reproduce your example, but my guess is that you are loading a file with pandas that is not in a tabular format. From your example it looks like your file is formatted:
>Identifier
sequence
>Identifier
sequence
You would have to parse the file before you can put the information into a pandas DataFrame. The logic would be to loop through each line of your file: if the line starts with '>', store the line as an identifier; if not, concatenate it onto the current sequence value. Something like this:
import pandas as pd

testfile = '>NODE_1_length_4014_cov_1.97676\nAAAAAAAATTTTTTCCCCCCCGGGGGG\n>NODE_2518_length_56_cov_219\nAAAAAAAAGCCCTTTTT'.split('\n')

identifiers = []
sequences = []
current_sequence = ''
for line in testfile:
    if line.startswith('>'):
        # a new record starts: save the finished sequence, if there is one
        if current_sequence:
            sequences.append(current_sequence)
            current_sequence = ''
        identifiers.append(line.lstrip('>'))
    else:
        current_sequence += line.strip('\n')
# the last sequence has no '>' line after it, so append it here
sequences.append(current_sequence)

df = pd.DataFrame({'identifiers': identifiers,
                   'sequences': sequences})
Whether this code works depends on the details of your input which you didn't provide, but that might get you started.
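With the two-record testfile above, the resulting frame should look something like:

                      identifiers                    sequences
0  NODE_1_length_4014_cov_1.97676  AAAAAAAATTTTTTCCCCCCCGGGGGG
1     NODE_2518_length_56_cov_219            AAAAAAAAGCCCTTTTT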
I am trying to use pandas.read_sas() to read binary compressed SAS files in chunks and save each chunk as a separate feather file.
This is my code:

import feather as fr
import pandas as pd

pdi = pd.read_sas("C:/data/test.sas7bdat", chunksize=100000, iterator=True)

i = 1
for pdj in pdi:
    fr.write_dataframe(pdj, 'C:/data/test' + str(i) + '.feather')
    i = i + 1
However, I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      1 i = 1
      2 for pdj in pdi:
----> 3 fr.write_dataframe(pdj, 'C:/test' + str(i) + '.feather')
      4 i = i + 1
      5

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\feather.py in write_feather(df, dest)
    116     writer = FeatherWriter(dest)
    117     try:
--> 118         writer.write(df)
    119     except:
    120         # Try to make sure the resource is closed

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\feather.py in write(self, df)
     94
     95         elif inferred_type not in ['unicode', 'string']:
---> 96             raise ValueError(msg)
     97
     98         if not isinstance(name, six.string_types):

ValueError: cannot serialize column 0 named SOME_ID with dtype bytes
I am using Windows 7 and Python 3.6. When I inspect the data, most of the columns' cells are wrapped in b'cell_value', which I assume means the columns are in binary (bytes) format.
I am a complete Python beginner, so I don't understand what the issue is.
Edit: it looks like this was a bug patched in a recent version:
https://issues.apache.org/jira/browse/ARROW-1672
https://github.com/apache/arrow/commit/238881fae8530a1ae994eb0e283e4783d3dd2855
Are the column names strings? Are you sure pdj is of type pd.DataFrame?
Limitations
Some features of pandas are not supported in Feather:
Non-string column names
Row indexes
Object-type columns with non-homogeneous data
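If the failing columns really are bytes (as the ValueError suggests), one workaround is to decode them to strings before writing. A minimal sketch, assuming the values are UTF-8 encoded (swap in 'latin-1' or whatever encoding your SAS file actually uses):

import feather as fr
import pandas as pd

pdi = pd.read_sas("C:/data/test.sas7bdat", chunksize=100000, iterator=True)

for i, pdj in enumerate(pdi, start=1):
    # decode bytes columns to str so Feather can serialize them
    for col in pdj.columns:
        if pdj[col].dtype == object:
            pdj[col] = pdj[col].str.decode('utf-8')
    fr.write_dataframe(pdj, 'C:/data/test' + str(i) + '.feather')

Alternatively, pd.read_sas accepts an encoding argument that decodes values at read time, and upgrading pyarrow past the fix linked above may make the workaround unnecessary.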
I have separate files. One set of files contains only header info, like the examples shown below:
~content of "header1.txt"~
a 3
b 2
c 4
~content of "header2.txt"~
a 4
b 3
c 5
~content of "header3.txt"~
a 1
b 7
c 6
And another set of files contains only data, as shown below:
~content of "data1.txt"~
10 20 30 40
20 14 22 33
~content of "data2.txt"~
11 21 31 41
21 24 12 23
~content of "data3.txt"~
21 22 11 31
10 26 14 33
After combining the corresponding files, the results should be similar to the examples below:
~content of "asc1.txt"~
a 3
b 2
c 4
10 20 30 40
20 14 22 33
~content of "asc2.txt"~
a 4
b 3
c 5
11 21 31 41
21 24 12 23
~content of "asc3.txt"~
a 1
b 7
c 6
21 22 11 31
10 26 14 33
Can anyone give me some help writing this in Python? Thanks!
If you really want it in Python, here is a way to do it:

for i in range(1, 4):
    h = open('header{0}.txt'.format(i), 'r')
    d = open('data{0}.txt'.format(i), 'r')
    a = open('asc{0}.txt'.format(i), 'w')  # 'w' so re-running the script doesn't append duplicates
    hdata = h.readlines()
    ddata = d.readlines()
    a.writelines(hdata + ddata)
    h.close()
    d.close()
    a.close()

Of course, this assumes there are 3 of each kind of file and that they all follow the naming convention you used.
Try this (written in Python 3.4 IDLE). It's pretty long, but it should be easier to understand:
# can start by creating a function to read the contents of
# each file and return the contents as a string
def readFile(file):
    contentsStr = ''
    for line in file:
        contentsStr += line
    return contentsStr

# Read all the header files header1, header2, header3
header1 = open('header1.txt', 'r')
header2 = open('header2.txt', 'r')
header3 = open('header3.txt', 'r')

# Read all the data files data1, data2, data3
data1 = open('data1.txt', 'r')
data2 = open('data2.txt', 'r')
data3 = open('data3.txt', 'r')

# Open/create output files asc1, asc2, asc3
asc1_outFile = open('asc1.txt', 'w')
asc2_outFile = open('asc2.txt', 'w')
asc3_outFile = open('asc3.txt', 'w')

# read the contents of each header file and data file into string variables
header1_contents = readFile(header1)
header2_contents = readFile(header2)
header3_contents = readFile(header3)
data1_contents = readFile(data1)
data2_contents = readFile(data2)
data3_contents = readFile(data3)

# Append the contents of each data file to its
# corresponding header file's contents
asc1_contents = header1_contents + '\n' + data1_contents
asc2_contents = header2_contents + '\n' + data2_contents
asc3_contents = header3_contents + '\n' + data3_contents

# now write the results to the asc1.txt, asc2.txt, and
# asc3.txt output files respectively
asc1_outFile.write(asc1_contents)
asc2_outFile.write(asc2_contents)
asc3_outFile.write(asc3_contents)

# close the file streams
header1.close()
header2.close()
header3.close()
data1.close()
data2.close()
data3.close()
asc1_outFile.close()
asc2_outFile.close()
asc3_outFile.close()

# done!
By the way, ensure that the header files and data files are in the same directory as the Python script. Otherwise, simply edit the file paths in the code above accordingly. The output files asc1.txt, asc2.txt, and asc3.txt will be created in the same directory as your Python source file.
This works if the number of header files equals the number of data files:
# glob is imported to get file names matching the given pattern
import glob

header = []
data = []

# Traverse the files and collect their contents.
# sorted() keeps the header and data files paired in matching order.
for files1 in sorted(glob.glob("directory/header*.txt")):
    a = open(files1, "r").read()
    header.append(a)
for files2 in sorted(glob.glob("directory/data*.txt")):
    a1 = open(files2, "r").read()
    data.append(a1)

# Writing the content into the output files
for i in range(1, len(data) + 1):
    writer = open("directory/asc" + str(i) + ".txt", "w")
    writer.write(header[i - 1] + "\n\n" + data[i - 1])
    writer.close()
Edit
This method will only work if the header and data files are in separate folders, and there should be no files other than the header or data files in those folders:
# glob is imported to get file names matching the given pattern
import glob

header = []
data = []

# Traverse the files and collect their contents.
# sorted() keeps the header and data files paired in matching order.
for files1 in sorted(glob.glob("directory1/*.txt")):
    a = open(files1, "r").read()
    header.append(a)
for files2 in sorted(glob.glob("directory2/*.txt")):
    a1 = open(files2, "r").read()
    data.append(a1)

# Writing the content into the output files
for i in range(1, len(data) + 1):
    writer = open("directory/asc" + str(i) + ".txt", "w")
    writer.write(header[i - 1] + "\n\n" + data[i - 1])
    writer.close()
I have a number set which contains 2375013 unique numbers in a txt file. The data looks like this:
11009
900221
2
3
4930568
293
102
I want to match a number from a line of another data set against this number set to extract the data I need. So I coded it like this:
def get_US_users_IDs(filepath, mode):
    IDs = []
    with open(filepath, mode) as f:
        for line in f:
            sp = line.strip()
            for id in sp:
                IDs.append(id.lower())
    return IDs

...

IDs = "|".join(get_US_users_IDs('/nas/USAuserlist.txt', 'r'))
matcher = re.compile(IDs)
if matcher.match(user_id):
    number_of_US_user += 1
    text = tweet.split('\t')[3]
But it takes a lot of time to run. Any ideas for reducing the run time?
What I understand is that you have a huge number of IDs in a file and you want to know whether a specific user_id is in that file.
You can use a Python set. Membership tests on a set are O(1) on average, unlike matching against one enormous compiled regex:
with open(filepath, mode) as fd:
    IDs = set(int(id) for id in fd)
...
if user_id in IDs:  # user_id must be an int too; build the set from line.strip() instead if you compare strings
    number_of_US_user += 1
...
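A minimal self-contained sketch of the whole idea (the file names and the tweet layout are placeholders inferred from your snippet; I assume the tweets are tab-separated with the user id in the first field):

# build the set once: one stripped string per line of the ID file
with open('/nas/USAuserlist.txt') as f:
    IDs = set(line.strip() for line in f)

number_of_US_user = 0
with open('tweets.txt') as f:
    for tweet in f:
        user_id = tweet.split('\t')[0]
        if user_id in IDs:
            number_of_US_user += 1
            text = tweet.split('\t')[3]
print(number_of_US_user)

This drops the regex entirely: building and matching a 2-million-branch alternation is what makes your version slow, while a set lookup does not depend on the number of IDs.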