Python: Cut and Paste Data on Text File.

Python: Cut and Paste Data on Text File. - python

I have a text file that contains numbers.
I need to move some of the numbers from the beginning of the file to the end in the correct order.
For example, the Original TEXT file has the following content: 0123456789.
I need to move the first 4 numbers to the end in the same order so it'll look like this:
4567890123.
Unfortunately i have no idea how to do this with Python,
I don't know even where to start.
Any pointers to solving this problem would be highly appreciated.

See the Python tutorial (section "Strings"; search for "slice notation"):
>>> a = "0123456789"
>>> b = a[4:] + a[:4]
>>> b
'4567890123'
Or what is it you're really trying to do?

The individual characters of the string a = '0123456789' can be accessed through a[i], where for i=0 you get the character at the first position (indexes are numbered from 0), so '0'. You can also extract several characters at once in the form a[i:j], where i is the position of the first character and j is the position of the character after the last character. If you omit one of i or j, it will take all characters from the beginning or until the end of the string.
So:
a[0] = a[0:1] = a[:1] = '0'
a[1] = a[1:2] = '1'
a[4] = a[4:5] = '4'
a[0:3] = a[:3] = '012'
a[3:5] = '34'
a[4:] = '456789'
So the first 4 characters are a[:4] and the rest is a[4:]. Now you concatenate them together:
a[4:] + a[:4]
and it will return
'4567890123'
In order to read the file, you will have to open it in the read mode and use the first line, stripping any whitespace/newlines:
with open('filename.txt', 'r') as f:
line = f.readline().strip()
print(line[4:] + line[:4])

Related

specific characters printing with Python

given a string as shown below,
"[xyx],[abc].[cfd],[abc].[dgr],[abc]"
how to print it like shown below ?
1.[xyz]
2.[cfd]
3.[dgr]
The original string will always maintain the above-mentioned format.

I did not realize you had periods and commas... that adds a bit of trickery. You have to split on the periods too
I would use something like this...
list_to_parse = "[xyx],[abc].[cfd],[abc].[dgr],[abc]"
count = 0
for i in list_to_parse.split('.'):
for j in i.split(','):
string = str(count + 1) + "." + j
if string:
count += 1
print(string)
string = None
Another option is split on the left bracket, and then just re-add it with enumerate - then strip commas and periods - this method is also probably a tiny bit faster, as it's not a loop inside a loop
list_to_parse = "[xyx],[abc].[cfd],[abc].[dgr],[abc]"
for index, i in enumerate(list.split('[')):
if i:
print(str(index) + ".[" + i.rstrip(',.'))
also strip is really "what characters to remove" not a specific pattern. so you can add any characters you want removed from the right, and it will work through the list until it hits a character it can't remove. there is also lstrip() and strip()
string manipulation can always get tricky, so pay attention. as this will output a blank first object, so index zero isn't printed etc... always practice and learn your needs :D

You can use split() function:
a = "[xyx],[abc].[cfd],[abc].[dgr],[abc]"
desired_strings = [i.split(',')[0] for i in a.split('.')]
for i,string in enumerate(desired_strings):
print(f"{i+1}.{string}")

This is just a fun way to solve it:
lst = "[xyx],[abc].[cfd],[abc].[dgr],[abc]"
count = 1
var = 1
for char in range(0, len(lst), 6):
if var % 2:
print(f"{count}.{lst[char:char + 5]}")
count += 1
var += 1
output:
1.[xyx]
2.[cfd]
3.[dgr]
explanation : "[" appears in these indexes: 0, 6, 12, etc. var is for skipping the next pair. count is the counting variable.
Here we can squeeze the above code using list comprehension and slicing instead of those flag variables. It's now more Pythonic:
lst = "[xyx],[abc].[cfd],[abc].[dgr],[abc]"
lst = [lst[i:i+5] for i in range(0, len(lst), 6)][::2]
res = (f"{i}.{item}" for i, item in enumerate(lst, 1))
print("\n".join(res))

You can use RegEx:
import regex as re
pattern=r"(\[[a-zA-Z]*\])\,\[[a-zA-Z]*\]\.?"
results=re.findall(pattern, '[xyx],[abc].[cfd],[abc].[dgr],[abc]')
print(results)

Using re.findall:
import re
s = "[xyx],[abc].[cfd],[abc].[dgr],[abc]"
print('\n'.join(f'{i+1}.{x}' for i,x in
enumerate(re.findall(r'(\[[^]]+\])(?=,)', s))))
Output:
1.[xyx]
2.[cfd]
3.[dgr]

How to separate different input formats from the same text file with Python

I'm new to programming and python and I'm looking for a way to distinguish between two input formats in the same input file text file. For example, let's say I have an input file like so where values are comma-separated:
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
Where the format is N followed by N lines of Data1, and M followed by M lines of Data2. I tried opening the file, reading it line by line and storing it into one single list, but I'm not sure how to go about to produce 2 lists for Data1 and Data2, such that I would get:
Data1 = ["Washington,A,10", "New York,B,20", "Seattle,C,30", "Boston,B,20", "Atlanta,D,50"]
Data2 = ["New York,5", "Boston,10"]
My initial idea was to iterate through the list until I found an integer i, remove the integer from the list and continue for the next i iterations all while storing the subsequent values in a separate list, until I found the next integer and then repeat. However, this would destroy my initial list. Is there a better way to separate the two data formats in different lists?

You could use itertools.islice and a list comprehension:
from itertools import islice
string = """
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
"""
result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
for parts in [string.split("\n")]
for idx, line in enumerate(parts)
if line.isdigit()]
print(result)
This yields
[['Washington,A,10', 'New York,B,20', 'Seattle,C,30', 'Boston,B,20', 'Atlanta,D,50'], ['New York,5', 'Boston,10']]
For a file, you need to change it to:
with open("testfile.txt", "r") as f:
result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
for parts in [f.read().split("\n")]
for idx, line in enumerate(parts)
if line.isdigit()]
print(result)

You're definitely on the right track.
If you want to preserve the original list here, you don't actually have to remove integer i; you can just go on to the next item.
Code:
originalData = []
formattedData = []
with open("data.txt", "r") as f :
f = list(f)
originalData = f
i = 0
while i < len(f): # Iterate through every line
try:
n = int(f[i]) # See if line can be cast to an integer
originalData[i] = n # Change string to int in original
formattedData.append([])
for j in range(n):
i += 1
item = f[i].replace('\n', '')
originalData[i] = item # Remove newline char in original
formattedData[-1].append(item)
except ValueError:
print("File has incorrect format")
i += 1
print(originalData)
print(formattedData)

The following code will produce a list results which is equal to [Data1, Data2].
The code assumes that the number of entries specified is exactly the amount that there is. That means that for a file like this, it will not work.
2
New York,5
Boston,10
Seattle,30
The code:
# get the data from the text file
with open('filename.txt', 'r') as file:
lines = file.read().splitlines()
results = []
index = 0
while index < len(lines):
# Find the start and end values.
start = index + 1
end = start + int(lines[index])
# Everything from the start up to and excluding the end index gets added
results.append(lines[start:end])
# Update the index
index = end

ArcGIS:Python - Adding Commas to a String

In ArcGIS I have intersected a large number of zonal polygons with another set and recorded the original zone IDs and the data they are connected with. However the strings that are created are one long list of numbers ranging from 11 to 77 (each ID is 11 characters long). I am looking to add a "," between each one making, it easier to read and export later as a .csv file. To do this I wrote this code:
def StringSplit(StrO,X):
StrN = StrO #Recording original string
StrLen = len(StrN)
BStr = StrLen/X #How many segments are inside of one string
StrC = BStr - 1 #How many times it should loop
if StrC > 0:
while StrC > 1:
StrN = StrN[ :((X * StrC) + 1)] + "," + StrN[(X * StrC): ]
StrC = StrC - 1
while StrC == 1:
StrN = StrN[:X+1] + "," + StrN[(X*StrC):]
StrC = 0
while StrC == 0:
return StrN
else:
return StrN
The main issue is how it has to step through multiple rows (76) with various lengths (11 -> 77). I got the last parts to work, just not the internal loop as it returns an error or incorrect outputs for strings longer than 22 characters.
Thus right now:
1. 01234567890 returns 01234567890
2. 0123456789001234567890 returns 01234567890,01234567890
3. 012345678900123456789001234567890 returns either: Error or ,, or even ,,01234567890
I know it is probably something pretty simple I am missing, but I can't seem remember what it is...

It can be easily done by regex.
those ........... are 11 dots for give split for every 11th char.
you can use pandas to create csv from the array output
Code:
import re
x = re.findall('...........', '01234567890012345678900123456789001234567890')
print(x)
myString = ",".join(x)
print(myString)
output:
['01234567890', '01234567890', '01234567890', '01234567890']
01234567890,01234567890,01234567890,01234567890
for the sake of simplicity you can do this
code:
x = ",".join(re.findall('...........', '01234567890012345678900123456789001234567890'))
print(x)

Don't make the loops by yourself, use python libraries or builtins, it will be easier. For example :
def StringSplit(StrO,X):
substring_starts = range(0, len(StrO), X)
substrings = (StrO[start:start + X] for start in substring_starts)
return ','.join(substrings)
string = '1234567890ABCDE'
print(StringSplit(string, 5))
# '12345,67890,ABCDE'

String formatting by inserting ':'

I have a string '0000000000000201' in python
dpid_string = '0000000000000201'
Which is the best way to convert this to the following string
00:00:00:00:00:00:02:01

You'd partition the string into chunks of size 2, and join them with str.join():
':'.join([dpid_string[i:i + 2] for i in range(0, len(dpid_string), 2)])
Demo:
>>> dpid_string = '0000000000000201'
>>> ':'.join([dpid_string[i:i + 2] for i in range(0, len(dpid_string), 2)])
'00:00:00:00:00:00:02:01'

seq = '0000000000000201'
length = 2
":".join([seq[i:i+length] for i in range(0, len(seq), length)])

Although not very simple, you can do
dpid_string = '0000000000000201'
''.join([':' + char if not i % 2 else char for i, char in enumerate(dpid_string)])[1:]
To break it down from within the list comprehension:
[char for char in dpid_string] just loops over characters and returns them as a list.
We want it to return a string, so we join the full list using ''.join(list).
Now we want it to react on the location of the character, so we want to assess the index. Therefore we use i, value in enumerate(list)
If this index is even, add a colon before the char (modulus 2 is False).
Now this leaves us with a colon at index 0, we remove it by indexing [1:]

An alternative using re.sub:
import re
dpid_string = '0000000000000201'
subbed = re.sub('(..)(?!$)', r'\1:', dpid_string)
# 00:00:00:00:00:00:02:01
Read as take every 2 characters that aren't at the end of the string, and replace it with those two characters followed by :.

Unable to parse just sequences from FASTA file

How can I remove ids like '>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n' from sequences?
I have this code:
with open('sequence.fasta', 'r') as f :
while True:
line1=f.readline()
line2=f.readline()
line3=f.readline()
if not line3:
break
fct([line1[i:i+100] for i in range(0, len(line1), 100)])
fct([line2[i:i+100] for i in range(0, len(line2), 100)])
fct([line3[i:i+100] for i in range(0, len(line3), 100)])
Output:
['>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n']
['CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n']
['AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG\n']
['CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA\n']
['AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA\n']
['ATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGAT\n']
['AAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA\n']
['GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCC\n']
['AGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGT\n']
['TTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTT\n']
['GTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGAT\n']
['GTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC\n']
['\n']
...
My function is:
def fct(input_string):
code={"a":0,"c":1,"g":2,"t":3}
p=[code[i] for i in input_string]
n=len(input_string)
c=0
for i, n in enumerate(range(n, 0, -1)):
c +=p[i]*(4**(n-1))
return c+1
fct() returns an integer from a string. For example, ACT gives 8
i.e.: my function must take as input string sequences contain just the following bases A,C,G,T
But when I use my function it gives:
KeyError: '>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n'
I try to remove ids by stripping lines start with > and writing the rest in text file so, my text file output.txt contains just sequences without ids, but when I use my function fct I found the same error:
KeyError: 'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n'
What can I do?

I see two major problems in your code: You're having problems parsing FASTA sequences, and your function is not properly iterating over each sequence.
Parsing FASTA data
Might I suggest using the excellent Biopython package? It has excellent FASTA support (reading and writing) built in (see Sequences in the Tutorial).
To parse sequences from a FASTA file:
for seq_record in SeqIO.parse("seqs.fasta", "fasta"):
print record.description # gi|2765658|emb|Z78533.1...
print record.seq # a Seq object, call str() to get a simple string
>>> print record.id
'gi|2765658|emb|Z78533.1|CIZ78533'
>>> print record.description
'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA'
>>> print record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
>>> print str(record.seq)
'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACC' #(truncated)
Iterating over sequence data
In your code, you have a list of strings being passed to fct() (input_string is not actually a string, but a list of strings). The solution is just to build one input string, and iterate over that.
Other errors in fct:
You need to capitalize the keys to your dictionary: case matters
You should have the return statement after the for loop. Keeping it nested means c is returned immediately.
Why bother constructing p when you can just index into code when iterating over the sequence?
You write over the sequence's length (n) by using it in your for loop as a variable name
Modified code (with proper PEP 8 formatting), and variables renamed to be clearer what they mean (still have no idea what c is supposed to be):
from Bio import SeqIO
def dna_seq_score(dna_seq):
nucleotide_code = {"A": 0, "C": 1, "G": 2, "T": 3}
c = 0
for i, k in enumerate(range(len(dna_seq), 0, -1)):
nucleotide = dna_seq[i]
code_num = nucleotide_code[nucleotide]
c += code_num * (4 ** (k - 1))
return c + 1
for record in SeqIO.parse("test.fasta", "fasta"):
dna_seq_score(record.seq)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: Cut and Paste Data on Text File. - python

See the Python tutorial (section "Strings"; search for "slice notation"): >>> a = "0123456789" >>> b = a[4:] + a[:4] >>> b '4567890123' Or what is it you're really trying to do?

Related

specific characters printing with Python

How to separate different input formats from the same text file with Python

ArcGIS:Python - Adding Commas to a String

String formatting by inserting ':'

Unable to parse just sequences from FASTA file

Categories

Resources