Select and store data from a string in Python

I read the contents of a text file and stored them in a list called lines:
print(lines)
['>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79\n', 'TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC\n', 'AGGACAGGCCGCTAAAGTG\n', '>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80\n', 'CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG\n', 'GCCTGGTAACACGTGCCAGC\n']
If you execute the code
for i in lines:
    print(i)
You get:
>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79
TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC
AGGACAGGCCGCTAAAGTG
>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80
CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG
GCCTGGTAACACGTGCCAGC
I want to store the uppercase sequences (TTGGTTTCGTGGTTT...) as independent elements of an object so that I can operate on them, for example:
seq[1]
>>> TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGCAGGACAGGCCGCTAAAGTG

gattaca = [x.strip() for x in lines if x.isupper()]
>>> gattaca
['TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC',
'AGGACAGGCCGCTAAAGTG',
'CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG',
'GCCTGGTAACACGTGCCAGC']

You can do this:
lines = list(map(str.strip, (filter(str.isupper, lines))))

isupper() tells you whether a string in the list is upper case: it returns True when every cased character in the string is uppercase.
for i in lines:
    if i.isupper():
        pass  # store the string here

To check whether a string is all caps I would use myString == myString.upper().
To get all the all-caps elements you could use a list comprehension like so:
result = [s for s in lines if s == s.upper()]
This would still allow special characters in your string.
If you only want uppercase letters, also check s.strip().isalpha() (the strip() removes the trailing newline, which is not a letter):
result = [s for s in lines if s == s.upper() and s.strip().isalpha()]
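The two checks discussed in these answers are not quite equivalent, which a quick sketch may make clearer. The sample strings below are made up for illustration and are not taken from the original file. str.isupper() needs at least one cased character, while the comparison against upper() also accepts strings with no letters at all.

```python
# Hypothetical sample strings, not taken from the original file.
samples = ["ACGT\n", ">chr12 length: 79\n", "1234\n"]

# str.isupper() is True only if the string has at least one cased
# character and every cased character is uppercase.
by_isupper = [s.strip() for s in samples if s.isupper()]

# Comparing with upper() also accepts strings without any letters.
by_compare = [s.strip() for s in samples if s == s.upper()]

# Stripping first keeps the trailing newline from failing isalpha().
letters_only = [s.strip() for s in samples if s.isupper() and s.strip().isalpha()]

print(by_isupper)
print(by_compare)
print(letters_only)
```

For sequence data that contains only letters the difference does not matter, but it can if the file also holds purely numeric lines.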

I would use a regex:
import re
seq={}
pattern=r'^(>.*$)\n([ACGTU\n]*?)(?=^>|\Z)'
for i,m in enumerate(re.finditer(pattern, ''.join(lines), flags=re.M)):
    seq[i]=m.group(2).replace('\n','')
Then each FASTA seq is mapped to an integer:
>>> seq
{0: 'TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGCAGGACAGGCCGCTAAAGTG', 1: 'CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCGGCCTGGTAACACGTGCCAGC'}
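If a regex feels heavy, the same grouping can probably be done with a plain loop. The sketch below is one possible approach, not the method of any answer above: the function name read_sequences is made up, and it assumes every record starts with a line beginning with ">" followed by one or more sequence lines.

```python
def read_sequences(lines):
    """Join the sequence lines that follow each '>' header into one string."""
    sequences = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                  # skip blank lines
        if line.startswith(">"):
            sequences.append("")      # a header starts a new record
        elif sequences:
            sequences[-1] += line     # continuation of the current record
    return sequences

lines = [
    ">chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79\n",
    "TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC\n",
    "AGGACAGGCCGCTAAAGTG\n",
    ">chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80\n",
    "CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG\n",
    "GCCTGGTAACACGTGCCAGC\n",
]
seq = read_sequences(lines)
print(seq[0])
```

Unlike the isupper() filters, this returns one string per record, so the wrapped lines of a sequence end up joined.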

Related

Align numbers right for any input?

I need to right-align numbers for any input. However, I can only seem to do it for one specific input, not for any input. I've tried turning the list of strings into a list of numbers using a list comprehension and then calling print("{:5d}".format(i)). I've also tried something like print("{:>len(i)}".format(i)).
n = input().split()
m = sorted(n, key =int, reverse = True)
for i in m:
    print("{:>10}".format(i))
Sample Input:
8 11 12 123 45678
Sample Output:
45678
  123
   12
   11
    8
I've managed to do it for the input above, but not for any input.
Maybe it would generalize better if you keep your elements as strings and use rjust():
n = input().split()
for i in n:
    print(i.rjust(len(max(n, key=len))))
You can use a variable width specifier in string formatting:
n = input().split()
l = max(len(i) for i in n)
m = sorted(n, key=int, reverse=True)
for i in m:
    print("{:>{:d}}".format(i, l))
With an input of 1231231232131213 123123213213 12312321321 213123123, the output is:
1231231232131213
    123123213213
     12312321321
       213123123
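The variable-width idea can also be written with an f-string, which accepts a nested field for the width. This is a minimal sketch; the helper name right_align is made up, and it assumes a non-empty list of tokens that all parse as integers.

```python
def right_align(tokens):
    """Sort tokens numerically (descending) and pad them to a common width."""
    width = max(len(t) for t in tokens)              # widest token sets the width
    ordered = sorted(tokens, key=int, reverse=True)
    return [f"{t:>{width}}" for t in ordered]

for row in right_align("8 11 12 123 45678".split()):
    print(row)
```

Returning the padded strings instead of printing inside the function makes the result easy to check or reuse.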

Apply 'for' loop by words

I have the following .txt file:
a.txt
which shows the following 2 lines of data:
300 100 500 250 150
34984 29220 43640 36410 7980
I need to create a code that creates a dictionary that shows the following result:
A 300
B 100
C 500
D 250
E 150
I have tried with this code, but I cannot separate the figures, nor choose only the first line. Any ideas?
import string
mayusculas = string.ascii_uppercase
f = open("a.txt", "r")
for i, c in zip(mayusculas, f):
    print(i, c)
f.close()
Thank you all.
Preserving your code structure, split() is what you're looking for:
f = '''300 100 500 250 150
34984 29220 43640 36410 7980'''
for i, c in zip(mayusculas, (f.split('\n')[0]).split(' ')):
    print(i, c)
Explanation:
f.split('\n'): splits your string into a list, by newline, so you get a two element list
(f.split('\n')[0]).split(' '): I take the first element in your list and I split those by space, getting a five element list with the five elements you need, as stated in your example.
Output:
A 300
B 100
C 500
D 250
E 150
A few notes on your code:
1. Always open files using a context manager
2. You can read the file one line at a time
3. Use split to break the line, don't feed it any arguments so it can parse tabs and multiple spaces properly.
Putting it all together, your code should look like this:
import string
mayusculas = string.ascii_uppercase
with open("a.txt", "r") as f:
    for i, c in zip(mayusculas, f.readline().split()):
        print(i, c)
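The third note is easy to verify. In this small sketch the sample line is made up and contains a tab and a double space; split() with no arguments handles both, while split(" ") does not.

```python
line = "300\t100  500 250\n"   # hypothetical line with a tab and a double space

clean = line.split()       # splits on any run of whitespace, drops the newline
naive = line.split(" ")    # splits on single spaces only

print(clean)
print(naive)
```

The second result keeps the tab inside the first field, contains an empty string for the double space, and leaves the newline on the last field.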
Just read the first line of the file and split it, then zip the result with the uppercase letters:
import string
with open('a.txt', 'r') as f:
    data = f.readline().split()
final_dict = dict(zip(string.ascii_uppercase, data))
final_dict
{'A': '300', 'B': '100', 'C': '500', 'D': '250', 'E': '150'}

Regex also takes index with for loop

The following regex does exactly what I want, except that it also outputs an extra digit (I think it's the row index). This messes up my output. How can I tell it not to pick up the index?
import re
import pandas as pd
df = pd.read_excel("tstfile.xlsx", names=["col1"])
for index, row in df.iterrows():
    # print(index)
    if str(row[0]).split():
        if not re.findall(r"(.[A-Z]\d+\-\d+)", str(row)):
            for i in re.findall(r"(\d+)", str(row)):
                print(i)
Input data would look like:
123, 456
111 * 222
LL123-456
35
I get an output that looks like this:
123
0
456
1
111
2
222
3
35
4
The final desired output should be:
123
456
111
222
35
So only the data that is actually given in as input.
You can change your code like this:
for row in df.values.astype(str):
    for word in row:
        if not re.findall(r"(.[A-Z]\d+\-\d+)", word):
            for num in re.findall(r"(\d+)", word):
                print(num)
Alternatively, here is a one liner that converts the dataframe values into a string and uses the re.findall method to extract the numbers as strings. Words that start with upper case letters and contain a minus sign are excluded.
all_numbers = re.findall(r'(\d+)', ' '.join([j for i in df.values.astype(str) for j in i if not re.search(r'[A-Z].+\-', j)]))
for item in all_numbers:
    print(item)
If you need integers instead of strings, you can convert the list lazily with map, which returns an iterator in Python 3:
all_integers = map(int, all_numbers)
for i in all_integers:
    print(i)
But remember that such an iterator can only be consumed once.
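The single-use behaviour can be checked with a short sketch. The input list here is made up; in Python 3 the object returned by map is exhausted after one pass.

```python
all_numbers = ["123", "456", "111"]   # hypothetical extracted strings

all_integers = map(int, all_numbers)
first_pass = list(all_integers)    # consumes every item
second_pass = list(all_integers)   # nothing is left to yield

print(first_pass)
print(second_pass)

# A list can be traversed as often as needed.
reusable = [int(n) for n in all_numbers]
```

If the numbers are needed more than once, building a list as in the last line is usually the simpler choice.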
You can try this:
>>> data = """123, 456
... 111 * 222
... LL123-456
... 35"""
>>> data = data.replace(',', '')
>>> data = data.split()
>>> x = [int(i) for i in data if i.isdigit()]
>>> print(x)
The output is
[123, 456, 111, 222, 35]

using an integer in a string to create a dictionary (or list) with that many numbers

So I have this text (WordNet) file made up of numbers and words, for example like this:
"09807754 18 n 03 aristocrat 0 blue_blood 0 patrician"
I want to read in the first number as a dictionary name (or list) for the words that follow. The layout never changes: it is always an 8-digit key followed by a two-digit number, a single letter and another two-digit number. This last two-digit number (03) tells how many words (three in this case) are associated with the 8-digit key.
My idea was to look at the 14th position in the string and use that number to run a loop that picks up all of the words associated with that key.
So I think it would go something like this:
with open('nouns.txt','r') as f:
    for line in f:
        words = range(14,15)
        numOfWords = int(words)
        while i =< numOfWords
            #here is where the problem arises,
            #i want to search for words after the spaces 3 (numOfWords) times
            #and put them into a dictionary(or list) associated with the key
            range(0,7) = {word(i+1), word(i+2)}
Technically I am looking for whichever one of these makes more sense:
09807754 = { 'word1':aristocrat, 'word2':blue_blood , 'word3':patrician }
or
09807754 = ['aristocrat', 'blue_blood', 'patrician']
Obviously this doesn't run, but if anyone could give me any pointers it would be greatly appreciated.
>>> L = "09807754 18 n 03 aristocrat 0 blue_blood 0 patrician".split()
>>> L[0], L[4::2]
('09807754', ['aristocrat', 'blue_blood', 'patrician'])
>>> D = {}
>>> D.update({L[0]: L[4::2]})
>>> D
{'09807754': ['aristocrat', 'blue_blood', 'patrician']}
For the extra line in your comment, some extra logic is needed
>>> L = "09827177 18 n 03 aristocrat 0 blue_blood 0 patrician 0 013 # 09646208 n 0000".split()
>>> D.update({L[0]: L[4:4 + 2 * int(L[3]):2]})
>>> D
{'09807754': ['aristocrat', 'blue_blood', 'patrician'], '09827177': ['aristocrat', 'blue_blood', 'patrician']}
res = {}
with open('nouns.txt','r') as f:
    for line in f:
        splited = line.split()
        res[splited[0]] = [w for w in splited[4:] if not w.isdigit()]
Output:
{'09807754': ['aristocrat', 'blue_blood', 'patrician']}
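The count-based slicing can be wrapped in a small function so that trailing fields after the words are ignored. This is a sketch under assumptions: the name parse_synset_line is made up, and the count is parsed as hexadecimal because, as far as I know, the WordNet file format documents it that way. For counts below 10 this gives the same result as plain int().

```python
def parse_synset_line(line):
    """Return (key, words) for one line laid out as described in the question."""
    fields = line.split()
    key = fields[0]
    # Assumption: the word count is hexadecimal; "03" is 3 either way.
    count = int(fields[3], 16)
    words = fields[4:4 + 2 * count:2]   # every second field is a word
    return key, words

synsets = {}
for line in [
    "09807754 18 n 03 aristocrat 0 blue_blood 0 patrician",
    "09827177 18 n 03 aristocrat 0 blue_blood 0 patrician 0 013 # 09646208 n 0000",
]:
    key, words = parse_synset_line(line)
    synsets[key] = words

print(synsets)
```

Because the slice is bounded by the count, the pointer data at the end of the second line never leaks into the word list.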

Python: create a decimal number out of entries from a text file

I have a text file containing entries similar to the following example:
# 8 rows of header
---------------------------------------------
123 ABC12345 A some more variable length text
456 DEF12345 A some more variable length text
789 GHI12345 B some more variable length text
987 JKL12345 A some more variable length text
654 MNO12345 B some more variable length text
321 PQR12345 B some more variable length text
etc...
What I would like to achieve is:
1. Convert the As into 1s and the Bs into 0s in order to get a binary number. For the example above this would be 110100 (i.e. AABABB).
2. Convert this binary number into a decimal number. For the example above this would then be 52.
3. Map this decimal number to a text string (i.e. 52 = "Case 1" or 53 = "Case 2" etc.).
4. Print this on stdout.
I have a little bit of Python experience but the problem above is way beyond my capabilities. Therefore any help from the community would be appreciated.
Many thanks in advance,
Hib
A few pointers (assuming Python 2):
Translating a string:
>>> import string
>>> table = string.maketrans("AB","10")
>>> translated = "AABABB".translate(table)
>>> translated
'110100'
Converting to base 10:
>>> int(translated, 2)
52
No idea how you would map that to those arbitrary strings - more information needed.
Printing to stdout - really? Which part are you having trouble with?
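For readers on Python 3, the translation step looks slightly different, because maketrans is a static method of str there and no longer a function in the string module. A minimal sketch of the same two steps:

```python
# Python 3: build the translation table with str.maketrans.
table = str.maketrans("AB", "10")
translated = "AABABB".translate(table)

# Parse the resulting string as a base-2 number.
value = int(translated, 2)

print(translated, value)
```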
Something like this should work (not tested):
from itertools import islice
binary_map = dict(zip("AB", "10")) # Equivalent to {"A": "1", "B": "0"}
string_map = {52: "Case 1", 53: "Case 2"}
with open("my_text_file") as f:
    binary_str = "".join(binary_map[x.split()[2]] for x in islice(f, 9, None))
binary_value = int(binary_str, 2)
print(string_map[binary_value])
I'll break down the indented code line for you and explain it.
The join method of an empty string will concatenate the strings given in the argument, so "".join(["A", "B", "C"]) is equal to "ABC".
We pass this method a so-called generator expression, X for Y in Z. It has the same syntax as a list comprehension, except the square brackets are omitted.
The islice function returns an iterator that silently skips the first 9 lines of the file object f, so it yields lines starting with the 10th.
The split method of str with no arguments will split on any sequence of whitespace characters (space, tab ("\t"), linefeed ("\n") and carriage return ("\r")) and return a list. So for example, " a \t b\n\t c\n".split() is equal to ['a', 'b', 'c']. We're interested in the third column, x.split()[2], which is either "A" or "B".
Looking up this value in the binary_map dictionary will give us either "1" or "0" instead.
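The pieces described above can be tried without a real file by using io.StringIO as a stand-in. The file contents below are made up for illustration: nine lines to skip, then three data rows.

```python
import io
from itertools import islice

# Stand-in for the real file: eight header rows, one dashed line, three data rows.
header = "".join(f"# header row {n}\n" for n in range(1, 9)) + "-" * 20 + "\n"
rows = "123 ABC12345 A text\n456 DEF12345 B more text\n789 GHI12345 A x\n"
f = io.StringIO(header + rows)

binary_map = {"A": "1", "B": "0"}
# islice(f, 9, None) skips the first nine lines; split()[2] is the third column.
binary_str = "".join(binary_map[line.split()[2]] for line in islice(f, 9, None))
binary_value = int(binary_str, 2)

print(binary_str, binary_value)
```

The three rows carry A, B and A in the third column, so the joined string is 101, which is 5 in decimal.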
a.txt:
# 8 rows of header
123 ABC12345 A some more variable length text
456 DEF12345 A some more variable length text
789 GHI12345 B some more variable length text
987 JKL12345 A some more variable length text
654 MNO12345 B some more variable length text
321 PQR12345 B some more variable length text
You can try this:
>>> int(''.join([line.split(' ')[2] for line in open('a.txt', 'r').readlines()[8:]]).replace('A', '1').replace('B', '0'), 2)
52
As for mapping the int to a string, not sure what you mean.
>>> value = {int(''.join([line.split(' ')[2] for line in open('a.txt', 'r').readlines()[8:]]).replace('A', '1').replace('B', '0'), 2): 'case 52'}
>>> value[52]
'case 52'
>>>
I used the re module to check the format of the lines to be accepted:
>>> import re
>>> def map_file_to_string(string):
...     values = []
...     for line in string.split('\n'):
...         if re.match(r'\d{3} \w{3}\d{5} [AB] .*', line):
...             values.append(1 if line[13] == 'A' else 0)
...     return dict_map[int(''.join(map(str, values)), 2)]
...
>>> dict_map = {52: 'Case 1', 53: 'Case 2'}
>>> s1 = """# 8 rows of header
---------------------------------------------
123 ABC12345 A some more variable length text
456 DEF12345 A some more variable length text
789 GHI12345 B some more variable length text
987 JKL12345 A some more variable length text
654 MNO12345 B some more variable length text
321 PQR12345 B some more variable length text
etc.."""
>>> map_file_to_string(s1)
'Case 1'
>>>
