Data Manipulation: Stemming from an inability to select lists

I am very new to Python and have no real prior programming knowledge. At my current job I am being asked to take data in the form of text from 500+ files and plot it. I understand the plotting to a degree, but I cannot figure out how to manipulate the data so that it is easy to select specific sections. Currently this is what I have for opening a file:
fp=open("file")
for line in fp:
    words = line.strip().split()
    print words
The result is a list for each line of the file, but I can only access the list made from the last line. Does anyone know a way that would allow me to choose between the different lists? Thanks a lot!!

The easiest way to get a list of lines from a file is as follows:
with open('file', 'r') as f:
    lines = f.readlines()
Now you can split those lines or do whatever you want with them:
lines = [line.split() for line in lines]
I'm not certain that answers your question -- let me know if you have something more specific in mind.
Since I don't understand exactly what you are asking, here are a few more examples of how you might process a text file. You can experiment with these in the interactive interpreter, which you can generally access just by typing 'python' at the command line.
>>> with open('a_text_file.txt', 'r') as f:
...     text = f.read()
...
>>> text
'the first line of the text file\nthe second line -- broken by a symbol\nthe third line of the text file\nsome other data\n'
That's the raw, unprocessed text of the file. It's a string. Strings are immutable -- they can't be altered -- but they can be copied in part or in whole.
>>> text.splitlines()
['the first line of the text file', 'the second line -- broken by a symbol', 'the third line of the text file', 'some other data']
splitlines is a string method. splitlines splits the string wherever it finds a \n (newline) character; it then returns a list containing copies of the separate sections of the string.
>>> lines = text.splitlines()
Here I've just saved the above list of lines to a new variable name.
>>> lines[0]
'the first line of the text file'
Lists are accessed by indexing. Just provide an integer from 0 to len(lines) - 1 and the corresponding line is returned.
>>> lines[2]
'the third line of the text file'
>>> lines[1]
'the second line -- broken by a symbol'
Now you can start to manipulate individual lines.
>>> lines[1].split('--')
['the second line ', ' broken by a symbol']
split is another string method. It's like splitlines but you can specify the character or string that you want to use as the demarcator.
>>> lines[1][4]
's'
You can also index the characters in a string.
>>> lines[1][4:10]
'second'
You can also "slice" a string. The result is a copy of the characters at indices 4 through 9. 10 is the stop value, so the character at index 10 isn't included in the slice. (You can slice lists too.)
>>> lines[1].index('broken')
19
If you want to find a substring within a string, one way is to use index. It returns the index at which the first occurrence of the substring appears. (It throws an error if the substring isn't in the string. If you don't want that, use find, which returns a -1 if the substring isn't in the string.)
>>> lines[1][19:]
'broken by a symbol'
Then you can use that to slice the string. If you don't provide a stop index, it just returns the remainder of the string.
>>> lines[1][:19]
'the second line -- '
If you don't provide a start index, it returns the beginning of the string and stops at the stop index.
>>> [line for line in text.splitlines() if 'line' in line]
['the first line of the text file', 'the second line -- broken by a symbol', 'the third line of the text file']
You can also use in -- it's a boolean operation that returns True if a substring is in a string. In this case, I've used a list comprehension to get only the lines that have 'line' in them. (Note that the last line is missing from the list. It has been filtered.)
Let me know if you have any more questions.
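Coming back to the loop in the question: the reason only the last line is reachable is that words is overwritten on every pass. A minimal sketch of one way around that, assuming whitespace-separated numeric columns (the sample data, the temporary file and the names rows, second_row and first_column are all made up for illustration):

```python
import os
import tempfile

# Hypothetical sample data standing in for one of the real files.
sample = "1.0 2.0 3.0\n4.0 5.0 6.0\n7.0 8.0 9.0\n"
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w") as f:
    f.write(sample)

rows = []  # one list of words per line, kept instead of overwritten
with open(path) as fp:
    for line in fp:
        words = line.strip().split()
        if words:  # skip blank lines
            rows.append(words)

second_row = rows[1]                        # every field on the second line
first_column = [float(r[0]) for r in rows]  # first field of every line, as numbers

print(second_row)
print(first_column)
```

Once every line is stored in rows, any line is rows[i] and any field is rows[i][j], which is usually the shape a plotting call wants.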

Related

script to cat every other (even) line in a set of files together while leaving the odd lines unchanged

I have a set of three .fasta files of standardized format. Each one begins with a string that acts as a header on line 1, followed by a long string of nucleotides on line 2, where the header string denotes the animal that the nucleotide sequence came from. There are 14 of them altogether, for a total of 28 lines, and each of the three files has the headers in the same order. A snippet of one of the files is included below as an example, with the sequences shortened for clarity.
anas-crecca-crecca_KSW4951-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
anas-crecca-crecca_KGM021-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
anas-crecca-crecca_KGM020-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
What I would like to do is write a script or program that cats each of the strings of nucleotides together, but keeps them in the same position. My knowledge, however, is limited to rudimentary python, and I'd appreciate any help or tips someone could give me.

Try this:
data = ""
with open('filename.fasta') as f:
    i = 0
    for line in f:
        i = i + 1
        if i % 2 == 0:
            data = data + line[:-1]
# Copy and paste above block for each file,
# replacing filename with the actual name.
print(data)
Remember to replace "filename.fasta" with your actual file name!
How it works
Variable i acts as a line counter. When it is even, i%2 is zero and the current line is concatenated to the "data" string. This way, the odd lines (the headers) are ignored.
The [:-1] at the end of the data line removes the line break, allowing you to add all sequences to the same line.
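If the goal is instead to join the sequences record by record across the three files while keeping each header in place, something along these lines might work. It is only a sketch: it assumes every file has exactly two lines per record and the same header order, as the question describes, and the function names, file names and tiny sample sequences are invented for the demo.

```python
import os
import tempfile

def read_fasta_pairs(path):
    """Return (header, sequence) pairs from a file with two lines per record."""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    # Even indices (0, 2, ...) hold headers, odd indices hold sequences.
    return list(zip(lines[0::2], lines[1::2]))

def concatenate(paths):
    """Join the sequences of each record across files, keeping header order."""
    per_file = [read_fasta_pairs(p) for p in paths]
    merged = []
    for records in zip(*per_file):
        header = records[0][0]  # headers are assumed identical across files
        merged.append((header, "".join(seq for _, seq in records)))
    return merged

# Hypothetical demo files standing in for the three real .fasta files.
tmp = tempfile.mkdtemp()
contents = [
    "duck_A\nATG\nduck_B\nCCC\n",
    "duck_A\nGGA\nduck_B\nTTT\n",
    "duck_A\nTAG\nduck_B\nAAA\n",
]
paths = []
for n, text in enumerate(contents):
    p = os.path.join(tmp, "gene%d.fasta" % n)
    with open(p, "w") as f:
        f.write(text)
    paths.append(p)

merged = concatenate(paths)
for header, seq in merged:
    print(header)
    print(seq)
```

Using rstrip("\n") rather than [:-1] avoids losing the last nucleotide when the final line of a file has no trailing newline.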

Printing characters from a given sequence till a certain range only. How to do this in Python?

I have a file in which I have a sequence of characters. I want to read the second line of that file and want to read the characters of that line to a certain range only.
I tried this code, however, it is only printing specific characters from both lines. And not printing the range.
with open ("irumfas.fas", "r") as file:
    first_chars = [line[1] for line in file if not line.isspace()]
print(first_chars)
Can anyone help in this regard? How can I give a range?
The sequence I want to print is shown below. I want to print the characters of the second line only, up to a certain position.
IRUMSEQ
ATTATAAAATTAAAATTATATCCAATGAATTCAATTAAATTAAATTAAAGAATTCAATAATATACCCCGGGGGGATCCAATTAAAAGCTAAAAAAAAAAAAAAAAAA

The following approach can be used.
Consider the file contains
RANDOMTEXTSAMPLE
SAMPLERANDOMTEXT
RANDOMSAMPLETEXT
with open('sampleText.txt') as sampleText:
    content = sampleText.read()
    content = content.split("\n")[1]
    content = content[:6]
    print(content)
Output will be
SAMPLE

I think you want something like this:
with open("irumfas.fas", "r") as file:
    second_line = file.readlines()[1]
print(second_line[0:9])
readlines() will give you a list of the lines -- which we index to get only the 2nd line. Your existing code will iterate over all the lines (which is not what you want).
As for extracting a certain range, you can slice the line just as you would slice a list -- in the example above, second_line[0:9] selects the first 9 characters (indices 0 through 8).

You can slice the second line of the file as you would slice a list.
You were very close:
end = 6  # number of characters
with open("irumfas.fas", "r") as file:
    lines = [line for line in file if not line.isspace()]
print(lines[1][:end])
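The answers above all pick out the second line and then slice it. For a large file, a variation that avoids reading every line into memory could look like the sketch below. The helper name slice_of_line and the temporary demo file are assumptions for illustration; itertools.islice is the standard-library function doing the skipping.

```python
import os
import tempfile
from itertools import islice

def slice_of_line(path, line_number, start, stop):
    """Return characters start..stop-1 of the given 1-based line number."""
    with open(path) as f:
        # islice advances to the wanted line without building a list of all lines;
        # the "" default covers files that are shorter than line_number.
        line = next(islice(f, line_number - 1, line_number), "")
    return line.rstrip("\n")[start:stop]

# Hypothetical demo file shaped like the one in the question.
path = os.path.join(tempfile.mkdtemp(), "irumfas.fas")
with open(path, "w") as f:
    f.write("IRUMSEQ\nATTATAAAATTAAAATTATATCC\n")

result = slice_of_line(path, 2, 0, 10)
print(result)
```

The start and stop arguments follow the usual slice convention, so (0, 10) means the first ten characters.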

replace hexadecimal with decimal in multiple locations within text document

I have a rather large text document and would like to replace all instances of hexadecimals inside with regular decimals. Or if possible convert them into text surrounded by '' e.g. 'I01A' instead of $49303141
The hexadecimals are currently marked by starting with $ but I can ctrl+F change that into 0x if that helps, and I need the program to detect the end of the number since some are short $A, while others are long like $568B1F
How could I do this with python, or is it not possible?
Thank you for the help thus far, hoping to clarify my request a bit more to hopefully get a complete solution.
I used a version of Grismar's answer and the output it gives me is
"if not (GetItemTypeId(GetSoldItem())==I0KB) then
set int1= 2+($3E8*3)"
However, I would like to add the ' around the newly created text and convert hex strings shorter than 8 digits to decimals instead so the output becomes
"if not (GetItemTypeId(GetSoldItem())=='I0KB') then
set int1= 2+(1000*3)"
Hoping for some more help to get the rest of the way.
def hex2dec(s):
    return int(s,16)
was my attempt to convert the shorter hexadecimals to decimal but clearly has not worked, throws syntax errors instead.
Also, I will manually deal with the few $ not used to denote a hexadecimal.
# just creating an example file
with open('D:\Deprotect\wc3\mpq editor\Work\\new 4.txt', 'w') as f:
    f.write('if not (GetItemTypeId(GetSoldItem())==$49304B42) then\n')
    f.write('set int1= 2+($3E8*3)\n')
def hex_match_to_string(m):
    return ''.join([chr(int(m.group(1)[i:i+2], 16)) for i in range(0, len(m.group(1)), 2)])
def hex2dec(s):
    return int(s,16)
# open the file for reading
with open('D:\Deprotect\wc3\mpq editor\Work\\new 4.txt', 'r') as file_in:
    # open the same file again for reading and writing
    with open('D:\Deprotect\wc3\mpq editor\Work\\new 4.txt', 'r+') as file_out:
        # start writing at the start of the existing file, overwriting the contents
        file_out.seek(0)
        while True:
            line = file_in.readline()
            if line == '':
                # end of file
                break
            # replace the parts of the string matching the regex
            line = re.sub(r'\$((?:\w\w\w\w\w\w\w\w)+)', hex_match_to_string, line)
            #line = re.sub(r'$\w+', hex2dec,line)
            file_out.write(line)
        # the resulting file is shorter, truncate it from the current position
        file_out.truncate()

See the answer https://stackoverflow.com/a/12597709/1780027 for how to use re.sub to replace specific content of a string with the output of a function. Using this you could presumably use the "int("FFFF", 16) " code snippet you're talking about to perform the action you desire.
EG:
>>> import re
>>> def replace(match):
...     match = match.group(1)
...     return str(int(match, 16))
...
>>> sample = "here's a hex $49303141 and there's a nother $A34B and another $8FD0B"
>>> re.sub(r'\$([a-fA-F0-9]+)', replace, sample)
"here's a hex 1227895105 and there's a nother 41803 and another 589067"

Since you are replacing parts of the file with something that's shorter, you can write to the same file you're reading. But keep in mind that, if you were replacing those parts with something that was longer, you would need to write the result to a new file and replace the old file with the new file once you were done.
Also, from your description, it appears you are reading a text file, which makes reading the file line by line the easiest, but if your file was some sort of binary file, using re wouldn't be as convenient and you'd probably need a different solution.
Finally, your question doesn't mention whether $ might also appear elsewhere in the text file (not just in front of pairs of characters that should be read as hexadecimal numbers). This answer assumes $ only appears in front of strings of 2-character hexadecimal numbers.
Here's a solution:
import re
# just creating an example file
with open('test.txt', 'w') as f:
    f.write('example line $49303141\n')
    f.write('$49303141 example line, with more $49303141\n')
    f.write('\n')
    f.write('just some text\n')
def hex_match_to_string(m):
    return ''.join([chr(int(m.group(1)[i:i+2], 16)) for i in range(0, len(m.group(1)), 2)])
# open the file for reading
with open('test.txt', 'r') as file_in:
    # open the same file again for reading and writing
    with open('test.txt', 'r+') as file_out:
        # start writing at the start of the existing file, overwriting the contents
        file_out.seek(0)
        while True:
            line = file_in.readline()
            if line == '':
                # end of file
                break
            # replace the parts of the string matching the regex
            line = re.sub(r'\$((?:\w\w)+)', hex_match_to_string, line)
            file_out.write(line)
        # the resulting file is shorter, truncate it from the current position
        file_out.truncate()
The regex is simple r'\$((?:\w\w)+)', which matches any string starting with an actual $ (the backslash avoids it being interpreted as 'the beginning of the string') and followed by 1 or more (+) pairs of letters and numbers (\w\w).
The function hex_match_to_string(m) expects a regex match object and loops over pairs of characters in the first matched group. Each pair is turned into its decimal value by interpreting it as a hexadecimal string (int(pair, 16)) and that decimal value is then turned into a character with that ASCII value (chr(value)). All the resulting characters are joined into a single string (''.join(list)).
A different way of writing hex_match_to_string(m):
def hex_match_to_string(m):
    hex_nums = iter(m.group(1))
    return ''.join([chr(int(a, 16) * 16 + int(b, 16)) for a, b in zip(hex_nums, hex_nums)])
This may perform a bit better, since it avoids manipulating strings, but it does the same thing.
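For the follow-up in the question (quoted text for 8-digit values, plain decimals for shorter ones), the two ideas can be combined in a single replacement function. This is a sketch rather than a tested solution for the real file: it assumes every $ token it should touch consists only of hex digits, and that anything that is not exactly 8 digits long should become a decimal number. The names HEX_TOKEN, convert and convert_text are made up here.

```python
import re

# A "$" followed by one or more hex digits; the digits are captured in group 1.
HEX_TOKEN = re.compile(r"\$([0-9A-Fa-f]+)")

def convert(match):
    digits = match.group(1)
    if len(digits) == 8:
        # Four bytes: decode each pair as an ASCII character and quote the result.
        chars = [chr(int(digits[i:i + 2], 16)) for i in range(0, 8, 2)]
        return "'" + "".join(chars) + "'"
    # Assumption: every other length is meant as a plain number.
    return str(int(digits, 16))

def convert_text(text):
    """Return text with every $-prefixed hex token replaced."""
    return HEX_TOKEN.sub(convert, text)

sample = (
    "if not (GetItemTypeId(GetSoldItem())==$49304B42) then\n"
    "set int1= 2+($3E8*3)\n"
)
converted = convert_text(sample)
print(converted)
```

Because a quoted replacement such as 'I0KB' is no longer guaranteed to be shorter than every token it replaces in combination with decimals, it is safer to write the converted text to a new file than to overwrite the original in place.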

Slice strings in .txt and return only one of the new strings

I want to use lines of strings of a .txt file as search queries in other .txt files. But before this, I need to slice those strings of the lines of my original text data. Is there a simple way to do this?
This is my original .txt data:
CHEMBL2057820|MUBD_HDAC2_ligandset|mol2|42|dock12
CHEMBL1957458|MUBD_HDAC2_ligandset|mol2|58|dock10
CHEMBL251144|MUBD_HDAC2_ligandset|mol2|41|dock98
CHEMBL269935|MUBD_HDAC2_ligandset|mol2|30|dock58
... (over thousands)
And I need a new file in which the lines contain only part of those strings, like:
CHEMBL2057820
CHEMBL1957458
CHEMBL251144
CHEMBL269935

Open the file, read in the lines, split each line at the | character, then take the first element of the result:
with open("test.txt") as f:
    parts = (line.lstrip().split('|', 1)[0] for line in f)
    with open('dest.txt', 'w') as dest:
        dest.write("\n".join(parts))
Explanation:
lstrip() removes whitespace from the start of the line.
split("|") returns a list like ['CHEMBL2057820', 'MUBD_HDAC2_ligandset', 'mol2', '42', 'dock12'] for each line.
Since we're only concerned with the first section, it's redundant to split the rest of the line on the | character, so we can pass a maxsplit argument, which stops splitting after that many splits have been made.
So split("|", 1) gives ['CHEMBL2057820', 'MUBD_HDAC2_ligandset|mol2|42|dock12'].
Since we're only interested in the first part, split("|", 1)[0] returns the "CHEMBL..." section.
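To make the difference concrete, here is a small comparison on one line from the question. The variable names are only illustrative; str.partition is shown as an alternative that also stops at the first separator.

```python
line = "CHEMBL2057820|MUBD_HDAC2_ligandset|mol2|42|dock12\n"

# Split on every "|": five fields.
full = line.strip().split("|")

# Split at most once: the ID, then everything else untouched.
limited = line.strip().split("|", 1)

# partition returns (before, separator, after) for the first "|" only.
first, _, rest = line.strip().partition("|")

print(full)
print(limited)
print(first)
```

All three give the same ID in the first position, so the choice mostly comes down to how much of the rest of the line is needed.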

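Use split and readlines:
# open the output file for writing; both files are closed when the block ends
with open('foo.txt') as f, open('bar.txt', 'w') as g:
    lines = f.readlines()
    for line in lines:
        l = line.strip().split('|')[0]
        # write one ID per line
        g.write(l + '\n')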

Reading from a file/split function new character line

f = open("test.txt", 'r+')
print ("Name of the file: ", f.name)
str = f.read();
str = str.split(',')
print(str2)
f.close()
I need to read from a file and it gives the name of the class it has to make and the parameters it needs to pass.
Example:
rectangle,9.7,7.3
square,6
So I have to make a rectangle object and pass it those 2 parameters, then write the results to another file. I am stuck chopping up the string.
I use the split function to get rid of the commas, and it seems to return a list, which I am saving into the str variable (probably a bad name that I should change). However, my concern is that although it takes the commas out, it keeps the \n newline character and joins the end of one line to the start of the next. So it splits it like this ['rectangle', '9.7', '7.3\nsquare', ...
How can I get rid of that?
Any suggestions would be welcomed. Should I read line by line instead of reading the whole thing?

Try calling strip() on each line to get rid of the newline character before splitting it.
Give this a try (EDIT - Annotated code with comments to make it easier to follow):
# Using "with open()" to open the file handle and have it closed automatically for you when the block exits.
with open("test.txt", 'r+') as f:
    print("Name of the file:", f.name)
    # Iterate over each line in the test.txt file
    for line in f:
        # Using strip() to remove newline characters and white space from each end of the line
        # Using split(',') to create a list ("tokens") containing each segment of the line separated by commas
        tokens = line.strip().split(',')
        # Print out the very first element (position 0) in the tokens list, which should be the "class"
        print("class:", tokens[0])
        # Print out all of the remaining elements in the tokens list, starting at the second element (i.e. position "1" since lists are "zero-based")
        # This is using a "slice". "tokens[1:]" means return the contents of the tokens list starting at position 1 and continuing to the end
        # "tokens[1:3]" would mean give me all of the elements of the tokens list starting at position 1 and ending at position 3 (excluding position 3).
        # Loop over the elements returned by the slice, assigning them one by one to the "argument" variable
        for argument in tokens[1:]:
            # Print out the argument
            print("argument:", argument)
output:
Name of the file: test.txt
class: rectangle
argument: 9.7
argument: 7.3
class: square
argument: 6
More information on slice: http://pythoncentral.io/how-to-slice-listsarrays-and-tuples-in-python/
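Once each line is split into tokens, turning it into an object is mostly a lookup from the class name to a constructor. The sketch below is only one possible shape for that step: the Rectangle and Square classes, their area method and the SHAPES table are hypothetical stand-ins, since the question does not show the real classes.

```python
class Rectangle:
    """Hypothetical stand-in for the asker's rectangle class."""
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height

class Square(Rectangle):
    """Hypothetical stand-in: a rectangle with equal sides."""
    def __init__(self, side):
        super().__init__(side, side)

# Map the name found in the file to the class that should be built.
SHAPES = {"rectangle": Rectangle, "square": Square}

def build_shape(line):
    """Turn a line such as 'rectangle,9.7,7.3' into an object."""
    tokens = line.strip().split(",")
    cls = SHAPES[tokens[0]]
    args = [float(t) for t in tokens[1:]]  # the file stores numbers as text
    return cls(*args)

# The two example lines from the question, newline included.
shapes = [build_shape(l) for l in ["rectangle,9.7,7.3\n", "square,6\n"]]
for shape in shapes:
    print(type(shape).__name__, shape.area())
```

An unknown name in the file raises KeyError here; whether to skip such lines or stop is a decision the real program would need to make.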
