The following regex does exactly what I want, except that it also outputs an extra digit after each match (I think it's the row index). This messes up my output. How can I tell it not to pick up the index?
import re
import pandas as pd
df = pd.read_excel("tstfile.xlsx", names=["col1"])
for index, row in df.iterrows():
    # print(index)
    if str(row[0]).split():
        if not re.findall("(.[A-Z]\d+\-\d+)", str(row)):
            for i in re.findall("(\d+)", str(row)):
                print(i)
Input data would look like:
123, 456
111 * 222
LL123-456
35
I get an output that looks like this:
123
0
456
1
111
2
222
3
35
4
The final desired output should be:
123
456
111
222
35
So only the data that is actually given in as input.
You can change your code like this:
for row in df.values.astype(str):
    for word in row:
        if not re.findall(r"(.[A-Z]\d+\-\d+)", word):
            for num in re.findall(r"(\d+)", word):
                print(num)
Alternatively, here is a one-liner that joins the dataframe values into a single string and uses re.findall to extract the numbers as strings. Cell values that contain an uppercase letter followed, later on, by a minus sign are excluded.
all_numbers = re.findall(r'(\d+)', ' '.join([j for i in df.values.astype(str) for j in i if not re.search(r'[A-Z].+\-', j)]))
for item in all_numbers:
    print(item)
If you need integers instead of strings, you can convert the list lazily with map:
all_integers = map(int, all_numbers)
for i in all_integers:
    print(i)
But remember that in Python 3 map returns an iterator, which can only be consumed once.
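To make the "only once" caveat concrete, here is a small sketch. The sample values are made up for illustration; the point is only that a second pass over the same map object yields nothing, while a list can be reused.

```python
all_numbers = ["123", "456", "35"]  # sample values, for illustration only

all_integers = map(int, all_numbers)
first_pass = list(all_integers)    # consumes the iterator
second_pass = list(all_integers)   # nothing is left to yield

as_list = [int(n) for n in all_numbers]  # a list can be iterated repeatedly

print(first_pass)
print(second_pass)
print(as_list)
```

If the integers are needed more than once, building a list as in the last step is usually the simpler choice.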
You can try this:
>>> data = """123, 456
... 111 * 222
... LL123-456
... 35"""
>>> data = data.replace(',', '')
>>> data = data.split()
>>> x = [int(i) for i in data if i.isdigit()]
>>> print(x)
The output is
[123, 456, 111, 222, 35]
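The extra digits in the question come from calling str(row): that formats the whole row object, index label included, so the digit pattern matches the label as well. A minimal sketch of the same filtering without pandas, assuming the cell values are already available as plain strings (the list `cells` is a stand-in for the spreadsheet column, not part of the original code):

```python
import re

# Hypothetical stand-in for the values of the spreadsheet column.
cells = ["123, 456", "111 * 222", "LL123-456", "35"]

# Codes such as "LL123-456": any character, an uppercase letter,
# digits, a hyphen, and more digits.
code_pattern = re.compile(r".[A-Z]\d+-\d+")

numbers = []
for cell in cells:
    if code_pattern.search(cell):
        continue  # skip cells that contain such a code
    numbers.extend(re.findall(r"\d+", cell))

print(numbers)
```

Because only the cell text is searched, no index label can leak into the result.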
I need to right-align numbers for any input. However, I can only get it to work for one specific input, not for arbitrary input. I've also tried turning the list of strings into a list of numbers using a list comprehension and then calling print("{:5d}".format(i)). I've also tried something like print("{:>len(i)}".format(i)).
n = input().split()
m = sorted(n, key=int, reverse=True)
for i in m:
    print("{:>10}".format(i))
Sample Input:
8 11 12 123 45678
Sample Output:
45678
  123
   12
   11
    8
I've managed to do it for the input above, but not for any input.
Maybe it would generalize better if you kept your elements as strings and used rjust():
n = input().split()
for i in n:
    print(i.rjust(len(max(n, key=len))))
You can use a variable width specifier in string formatting:
n = input().split()
l = max(len(i) for i in n)
m = sorted(n, key=int, reverse=True)
for i in m:
    print("{:>{:d}}".format(i, l))
With an input of 1231231232131213 123123213213 12312321321 213123123, the output is:
1231231232131213
    123123213213
     12312321321
       213123123
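The same idea can be written with an f-string, where the width is taken from the longest token. This is only a sketch: the helper name `right_align` is made up here, and it assumes the input contains at least one token.

```python
def right_align(numbers):
    """Return the numbers as strings, largest first, padded to a common width."""
    tokens = sorted(numbers, key=int, reverse=True)
    width = max(len(t) for t in tokens)  # raises ValueError on empty input
    return [f"{t:>{width}}" for t in tokens]


for line in right_align("8 11 12 123 45678".split()):
    print(line)
```

Keeping the values as strings and only using int as the sort key avoids converting back and forth just for printing.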
I'm trying to count the occurrences of two-letter pairs in strings that are listed in a tsv file. The list in the tsv file is shown below.
0 ILDIGCGRGRHARALVRRGWQVTGLDLSEDAVAAARSRVADDDLDV...
1 AELETLQAKINPHFLYNSLNSIASLVYTDPEKAEKMVLMLSKLFRV...
2 AQLSSLKEQLNPHFLFNTFNTLYGISLKYPERVPDLIMHTSQLMRY...
3 TEIKALQSQIKPHFLFNTLNAIRCTIINNNNDKAADLVYKLAMLLR...
4 SEMSRLNAQINPHFLFNTLNFFYSEVRTLHPKISESILLLSDIMRY...
...
1000 SELSFLKAQINPHFFFNTLNNIYALTMMDVASAQEALHRLSRMMRY...
1001 ILEPGCGTGRLMLALAEHGHHVAGVDASATALEFCRERLTQHGLTG...
1002 IADLGAGEGTISQLMAQRAKRVIAIDNSEKMVEFGAELARKHGIAN...
1003 AELRALRAQISPHFIYNALAAIASFVRTDPERARELLLEFADFSRY...
1004 VVDLGCGSGASTDALVNSMGHRGETYAAIGIDASAGMLTEAHSKPW...
[1005 rows x 1 columns]
Then I'd like to get the occurrences of AA, AB, AC, ... ZY, ZZ for each row. An example is below.
If there is a string "AEAETLQAKIN" in a row, then I'd like to get the result below.
(The string must be assigned to a variable named acid, e.g. acid = 'AEAETLQAKIN'.)
IN[] ... (I'd like to know what code to write here to get the occurrences.)
OUT[] AA: 0, AC: 0, AD: 0, AE: 2, ... AK: 1, ... EA: 1, ...
If you want a dict containing only the pairs that actually occur, use a defaultdict:
from collections import defaultdict
from string import ascii_uppercase
def occurrences(content):
    result = defaultdict(int)
    for i in range(len(content) - 1):
        result[content[i:i + 2]] += 1
    return result
If you also want the zero counts, so all 26x26=676 pairs, prepare a dict of defaults beforehand:
from itertools import product
OCCURRENCE_DEFAULT = {f"{x}{y}": 0 for x, y in product(ascii_uppercase, repeat=2)}
def occurrences(content):
    result = OCCURRENCE_DEFAULT.copy()
    for i in range(len(content) - 1):
        result[content[i:i + 2]] += 1
    return result
Then apply it to each string of your content:
value = ["0 ILDIGCGRGRHARALVRRGWQVTGLDLSEDAVAAARSRVADDDLDV",
         "4 SEMSRLNAQINPHFLFNTLNFFYSEVRTLHPKISESILLLSDIMRY"]
for row in value:
    occ = occurrences(row.split()[1])
    print(occ)
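As an alternative sketch, collections.Counter can do the counting directly. The helper name `pair_counts` is made up for illustration; `acid` follows the variable name requested in the question. A Counter returns 0 for pairs that never occur, without storing them.

```python
from collections import Counter


def pair_counts(sequence):
    """Count overlapping two-letter pairs in a sequence."""
    return Counter(a + b for a, b in zip(sequence, sequence[1:]))


acid = "AEAETLQAKIN"
counts = pair_counts(acid)
print(counts["AE"], counts["EA"], counts["AK"], counts["AA"])
```

This only covers the "existing pairs" variant; if all 676 keys must be present when printing, the prepared default dict is still the more direct route.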
I read the contents of a txt file and stored them in a list called lines:
print(lines)
['>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79\n', 'TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC\n', 'AGGACAGGCCGCTAAAGTG\n', '>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80\n', 'CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG\n', 'GCCTGGTAACACGTGCCAGC\n']
If you execute the code
for i in lines:
    print(i)
You get:
>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79
TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC
AGGACAGGCCGCTAAAGTG
>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80
CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG
GCCTGGTAACACGTGCCAGC
I want to store the sequences that are in caps (TTGGTTTCGTGGTTT...) as independent elements in an object so that I can operate on them, for example:
seq[1]
>>> TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGCAGGACAGGCCGCTAAAGTG
gattaca = [x.strip() for x in lines if x.isupper()]
>>> gattaca
['TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC',
'AGGACAGGCCGCTAAAGTG',
'CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG',
'GCCTGGTAACACGTGCCAGC']
You can do this:
lines = list(map(str.strip, (filter(str.isupper, lines))))
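With isupper() you can check whether a string in the list is upper case: it returns True when every cased character is upper case and there is at least one cased character.
uppercase_lines = []
for i in lines:
    if i.isupper():
        uppercase_lines.append(i.strip())  # store the string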
To check whether a string is all caps I would use myString == myString.upper().
To get all caps elements you could use a list comprehension like so:
result = [s for s in lines if s == s.upper()]
This would still allow special characters in your string.
If you only want uppercase letters, also use s.isalpha() (after stripping the trailing newline, which is not a letter):
result = [s.strip() for s in lines if s == s.upper() and s.strip().isalpha()]
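A small sketch of how these string checks behave on lines that still carry their trailing newline. The sample `lines` below are shortened, made-up records, not the data from the question.

```python
lines = [">chr12_example header\n", "TTGGTTTC\n", "AGGACAGG\n"]

print("TTGGTTTC\n".isupper())  # the newline is not a cased character, so it is ignored
print("TTGGTTTC\n".isalpha())  # the newline is not a letter

# Strip before isalpha() so the newline does not reject every line.
result = [s.strip() for s in lines if s.isupper() and s.strip().isalpha()]
print(result)
```

The header is rejected because it contains lowercase letters; the sequence lines pass both checks once the newline is stripped.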
I would use a regex:
import re
seq = {}
pattern = r'^(>.*$)\n([ACGTU\n]*?)(?=^>|\Z)'
for i, m in enumerate(re.finditer(pattern, ''.join(lines), flags=re.M)):
    seq[i] = m.group(2).replace('\n', '')
Then each FASTA seq is mapped to an integer:
>>> seq
{0: 'TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGCAGGACAGGCCGCTAAAGTG', 1: 'CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCGGCCTGGTAACACGTGCCAGC'}
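If a regex feels heavy, the same grouping can be sketched with a plain loop that starts a new record at every '>' header and appends the following lines to it. The function name `read_sequences` and the shortened sample records are illustrative assumptions, not taken from the original file.

```python
def read_sequences(lines):
    """Join the sequence lines that follow each '>' header into one string."""
    sequences = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                  # ignore blank lines
        if line.startswith(">"):
            sequences.append("")      # a new record starts here
        elif sequences:
            sequences[-1] += line     # continuation of the current record
    return sequences


lines = [
    ">record_1\n", "TTGG\n", "AGGA\n",
    ">record_2\n", "CTAA\n",
]
seq = read_sequences(lines)
print(seq)
```

Unlike the isupper() filters, this keeps the lines of one record together, so seq[0] and seq[1] are whole sequences rather than individual file lines.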
I have a string thestackoverflow
and I also have # columns = 4
Then I want the output as
thes
tack
over
flow
You can do it using Python slice notation.
Refer to this thread for a nice explanation of slice notation: Understanding slice notation
Example code for your question:
>>> input_string = "thestackoverflow"
>>> chop_size = 4
>>> while input_string:
...     print(input_string[:chop_size])
...     input_string = input_string[chop_size:]
...
thes
tack
over
flow
You can have a look at textwrap
import textwrap
string = 'thestackoverflow'
max_width = 4
result = textwrap.fill(string,max_width)
print(result)
thes
tack
over
flow
If you don't want to use any module:
string = 'thestackoverflow'
max_width = 4
row = 0
result = ''
while row*max_width < len(string):
    result += '\n' + string[row*max_width:(row+1)*max_width]
    row += 1
result = result.strip()
print(result)
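The slicing can also be condensed into a list comprehension that steps through the string in strides of the chunk width. A minimal sketch; the helper name `chunks` is made up for illustration.

```python
def chunks(text, width):
    """Split text into consecutive pieces of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


rows = chunks("thestackoverflow", 4)
print("\n".join(rows))
```

When the length is not a multiple of the width, the last piece is simply shorter, so no special case is needed.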
I have a text file containing entries similar to the following example:
# 8 rows of header
---------------------------------------------
123 ABC12345 A some more variable length text
456 DEF12345 A some more variable length text
789 GHI12345 B some more variable length text
987 JKL12345 A some more variable length text
654 MNO12345 B some more variable length text
321 PQR12345 B some more variable length text
etc...
What I would like to achieve is:
Convert the As into 1s, the Bs into 0s in order to have a binary number
For the example above this would be 110100 (i.e. AABABB)
Convert this binary number into a decimal number
For the example above this would then be 52
Map this decimal number to a text string
(i.e. 52 = "Case 1" or 53 = "Case 2" etc.) and
Print this on the stdout
I have a little bit of Python experience but the problem above is way beyond my capabilities. Therefore any help from the community would be appreciated.
Many thanks in advance,
Hib
A few pointers (assuming Python 2):
Translating a string:
>>> import string
>>> table = string.maketrans("AB","10")
>>> translated = "AABABB".translate(table)
>>> translated
'110100'
Converting to base 10:
>>> int(translated, 2)
52
No idea how you would map that to those arbitrary strings - more information needed.
Printing to stdout - really? Which part are you having trouble with?
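For readers on Python 3, where string.maketrans no longer exists, the same two steps can be sketched with str.maketrans:

```python
# Python 3 variant of the translation and base conversion shown above.
table = str.maketrans("AB", "10")
translated = "AABABB".translate(table)
value = int(translated, 2)

print(translated)
print(value)
```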
Something like this should work (not tested):
from itertools import islice
binary_map = dict(zip("AB", "10"))  # Equivalent to {"A": "1", "B": "0"}
string_map = {52: "Case 1", 53: "Case 2"}
with open("my_text_file") as f:
    binary_str = "".join(binary_map[x.split()[2]] for x in islice(f, 9, None))
binary_value = int(binary_str, 2)
print(string_map[binary_value])
I'll break down the indented code line for you and explain it.
The join method of an empty string will concatenate the strings given in the argument, so "".join(["A", "B", "C"]) is equal to "ABC".
We pass this method a so-called generator expression, X for Y in Z. It has the same syntax as a list comprehension, except the square brackets are omitted.
The islice function returns an iterator that silently skips the first 9 lines of the file object f, so it yields lines starting with the 10th.
The split method of str with no arguments will split on any sequence of whitespace characters (space, tab ("\t"), linefeed ("\n") and carriage return ("\r")) and return a list. So for example, " a \t b\n\t c\n".split() is equal to ['a', 'b', 'c']. We're interested in the third column, x.split()[2], which is either "A" or "B".
Looking up this value in the binary_map dictionary will give us either "1" or "0" instead.
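The steps described above can also be followed one at a time. This Python 3 sketch makes a few assumptions that are not in the original: it reads from an in-memory io.StringIO with placeholder header lines instead of the real file, it skips lines whose third column is not A or B, and it falls back to a made-up "unknown case" label when the number has no entry in string_map.

```python
import io
from itertools import islice

# Stand-in for the real file: 8 placeholder header lines, a separator,
# then the data rows from the question.
sample = io.StringIO(
    "h1\nh2\nh3\nh4\nh5\nh6\nh7\nh8\n"
    "---------------------------------------------\n"
    "123 ABC12345 A some more variable length text\n"
    "456 DEF12345 A some more variable length text\n"
    "789 GHI12345 B some more variable length text\n"
    "987 JKL12345 A some more variable length text\n"
    "654 MNO12345 B some more variable length text\n"
    "321 PQR12345 B some more variable length text\n"
)

binary_map = {"A": "1", "B": "0"}
string_map = {52: "Case 1", 53: "Case 2"}

bits = []
for line in islice(sample, 9, None):      # skip the 9 leading lines
    fields = line.split()                 # split on any whitespace
    if len(fields) >= 3 and fields[2] in binary_map:
        bits.append(binary_map[fields[2]])

binary_str = "".join(bits)
binary_value = int(binary_str, 2)
label = string_map.get(binary_value, "unknown case")  # hypothetical fallback

print(binary_str, binary_value, label)
```

Using dict.get for the final lookup avoids a KeyError when the file encodes a number that has not been mapped yet.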
a.txt:
# 8 rows of header
123 ABC12345 A some more variable length text
456 DEF12345 A some more variable length text
789 GHI12345 B some more variable length text
987 JKL12345 A some more variable length text
654 MNO12345 B some more variable length text
321 PQR12345 B some more variable length text
You can try this:
>>> int(''.join([line.split(' ')[2] for line in open('a.txt', 'r').readlines()[8:]]).replace('A', '1').replace('B', '0'), 2)
52
As for mapping the int to a string, not sure what you mean.
>>> value = {int(''.join([line.split(' ')[2] for line in open('a.txt', 'r').readlines()[8:]]).replace('A', '1').replace('B', '0'), 2): 'case 52'}
>>> value[52]
'case 52'
>>>
I used the re module to check the format of the lines to be accepted:
>>> import re
>>> def map_file_to_string(string):
...     values = []
...     for line in string.split('\n'):
...         if re.match(r'\d{3} \w{3}\d{5} [AB] .*', line):
...             values.append(1 if line[13] == 'A' else 0)
...     return dict_map[int(''.join(map(str, values)), 2)]
...
>>> dict_map = {52: 'Case 1', 53: 'Case 2'}
>>> s1 = """# 8 rows of header
---------------------------------------------
123 ABC12345 A some more variable length text
456 DEF12345 A some more variable length text
789 GHI12345 B some more variable length text
987 JKL12345 A some more variable length text
654 MNO12345 B some more variable length text
321 PQR12345 B some more variable length text
etc.."""
>>> map_file_to_string(s1)
'Case 1'
>>>