I have a tab-delimited txt file like this:
A B aaaKP
C D bbbZ
E F cccLL
This is tab-delimited.
If
phrase = aaa or bbb
column = 3
then I would like only those rows whose 3rd column starts with aaa or bbb
The output will be
A B aaaKP
C D bbbZ
I have code for the case where there is only one phrase:
phrase, column = 'aaa', 3
fn = lambda l : len(l) >= column and len(l[column-1]) >= len(phrase) and phrase == l[column-1][:len(phrase)]
fp = open('output.txt', 'w')
fp.write(''.join(row for row in open('input.txt') if fn(row.split('\t'))))
fp.close()
But what if there are multiple phrases? I tried
phrase, column = {'aaa','bbb'}, 3
but it didn't work.
In the general case you can use regular expressions with branches for quick matching and searching:
import re
phrases = [ 'aaa', 'bbb' ]
column = 3
pattern = re.compile('|'.join(re.escape(i) for i in phrases))
column -= 1
with open('input.txt') as inf, open('output.txt', 'w') as outf:
    for line in inf:
        row = line.split('\t')
        if pattern.match(row[column]):
            outf.write(line)
The code builds a regular expression from all the possible phrases, using re.escape to escape special characters. The resulting expression in this case is aaa|bbb. pattern.match matches the beginning of the string against the pattern (the match must start from the first character).
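A quick interactive check of that anchoring behavior:
>>> import re
>>> pattern = re.compile('aaa|bbb')
>>> bool(pattern.match('aaaKP'))
True
>>> bool(pattern.match('xxaaa'))  # 'aaa' is present, but not at the start
False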
However, if you only need to match the beginning of the string against fixed phrases, note that startswith accepts a tuple, and this is the fastest code:
phrases = [ 'aaa', 'bbb' ]
column = 3
phrase_tuple = tuple(phrases)
column -= 1
with open('input.txt') as inf, open('output.txt', 'w') as outf:
    for line in inf:
        row = line.split('\t')
        if row[column].startswith(phrase_tuple):
            outf.write(line)
The code also demonstrates the use of context managers for opening the files: input.txt is opened before output.txt, so that if the former does not exist, the latter does not get created. Finally, it shows that this reads nicest without any generators or lambdas.
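A quick illustration of the tuple form of startswith:
>>> 'aaaKP'.startswith(('aaa', 'bbb'))
True
>>> 'cccLL'.startswith(('aaa', 'bbb'))
False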
You could use Python's re module for this:
>>> import re
>>> data = """A B aaaKP
... C D bbbZ
... E F cccLL"""
>>> m = re.findall(r'^(?=\S+\s+\S+\s+(?:aaa|bbb)).*$', data, re.M)
>>> for i in m:
...     print i
...
A B aaaKP
C D bbbZ
A positive lookahead is used to check whether the line contains a particular string. The above regex matches lines whose third column starts with aaa or bbb; if it does, the corresponding line is printed.
You could also try this variant, which matches the tab delimiters explicitly:
>>> s = """A B aaaKP
... C D bbbZ
... E F cccLL
... """
>>> m = re.findall(r'^(?=\S+\t\S+\t(?:aaa|bbb)).*$', s, re.M)
>>> for i in m:
...     print i
...
A B aaaKP
C D bbbZ
Solution:
#!/usr/bin/env python
import csv
from pprint import pprint
def read_phrases(filename, phrases):
    with open(filename, "r") as fd:
        reader = csv.reader(fd, delimiter="\t")
        for row in reader:
            if any(row[2].startswith(phrase) for phrase in phrases):
                yield row

pprint(list(read_phrases("foo.txt", ["aaa"])))
pprint(list(read_phrases("foo.txt", ["aaa", "bbb"])))
Example:
$ python foo.py
[['A', 'B', 'aaaKP']]
[['A', 'B', 'aaaKP'], ['C', 'D', 'bbbZ']]
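If you also want the column to be configurable, as in the original question, here is a small variation of the generator (the extra column parameter is my addition, 1-based as in the question):
import csv

def read_phrases(filename, phrases, column=3):
    with open(filename) as fd:
        reader = csv.reader(fd, delimiter="\t")
        for row in reader:
            # column is 1-based, as in the original question
            if any(row[column - 1].startswith(phrase) for phrase in phrases):
                yield row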
I have the following data.
1
abc
>
2
def
efg
>
3
hij
jkl
>
4
mno
5
pqr
stu
I want all the contents after each occurrence of '>' and a number (say '3') to be added to a list.
The output should be like:
[[abc],[[def],[efg]],[[hij],[jkl]],[mno],[[pqr],[stu]]]
Assuming the items are in a list L
>>> from itertools import groupby
>>> [list(g) for k, g in groupby(L, ">0123456789".__contains__) if not k]
[['abc'], ['def', 'efg'], ['hij', 'jkl'], ['mno'], ['pqr', 'stu']]
It's not exactly the right format, because your desired format isn't valid Python, but it should get you started.
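If the items are in a file rather than in a list L, you could build L first; a minimal sketch, assuming one item per line in a file named data.txt (the filename is my assumption):
from itertools import groupby

# read one item per line, dropping surrounding whitespace
with open('data.txt') as f:
    L = [line.strip() for line in f]

result = [list(g) for k, g in groupby(L, ">0123456789".__contains__) if not k]
print(result)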
If you are using a file to store the information, you could use the following approach:
result = []
with open("data.txt", "r") as f:
    # Read file line by line
    for line in f:
        line = line.strip()
        # Ignore '>' or numbers
        if line == '>' or line.isdigit():
            continue
        result.append([line])
print(result)
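Note that this collects every non-separator line as its own single-item list. If you instead want the lines grouped between the '>' and number separators, as in the desired output, a minimal sketch along the same lines:
result = []
group = []
with open("data.txt") as f:
    for line in f:
        line = line.strip()
        if line == '>' or line.isdigit():
            # separator: close the current group, if any
            if group:
                result.append(group)
                group = []
        else:
            group.append(line)
if group:
    result.append(group)
print(result)  # [['abc'], ['def', 'efg'], ['hij', 'jkl'], ['mno'], ['pqr', 'stu']]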
I want to match a certain string in a CSV file and return the column of the string within the CSV file. For example:
import csv
data = ['a','b','c'],['d','e','f'],['h','i','j']
If I'm looking for the word e, I want it to return 1, as it is in the second column.
A solution using a csv.reader object and the enumerate function (to get index/value pairs):
import csv

def get_column(file, word):
    with open(file) as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            for k, v in enumerate(row):
                if v == word:
                    return k  # return immediately to avoid further loop iterations

search_word = 'e'
print(get_column("data/sample.csv", search_word))  # "data/sample.csv" is an example file path
The output:
1
I am not sure why you need csv in this example.
>>> data = ['a','b','c'],['d','e','f'],['h','i','j']
>>>
>>>
>>> string = 'e'
>>> for idx, lst in enumerate(data):
...     if string in lst:
...         print idx
1
A variation of wolendranh's answer:
>>> data = ['a','b','c'],['d','e','f'],['h','i','j']
>>> word = 'e'
>>> for row in data:
...     try:
...         print(row.index(word))
...     except ValueError:
...         continue
Try the following:
>>> data_list = [['a','b','c'],['d','e','f'],['h','i','j']]
>>> col2_list = []
>>>
>>> for d in data_list:
...     col2 = d[1]
...     col2_list.append(col2)
So in the end you get a list with all the values of column [1]:
col2_list = ["b","e","i"]
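The same result as a one-line list comprehension:
>>> [d[1] for d in data_list]
['b', 'e', 'i']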
Sorry for asking, but I'm kind of new to these things. I'm splitting words out of a text and putting them into a dict, creating an index for each token:
import re

f = open('/Users/Half_Pint_Boy/Desktop/sentenses.txt', 'r')
a = 0
c = 0
e = []
for line in f:
    b = re.split('[^a-z]', line.lower())
    a += len(list(filter(None, b)))
    c = c + 1
    e = e + b
d = dict(zip(e, range(len(e))))
But in the end I receive a dict with empty strings in it, like this:
{'': 633,
'a': 617,
'according': 385,
'adjacent': 237,
'allow': 429,
'allows': 459}
How can I remove "" from the final dict? Also, how can I change the indexing afterwards so that "" is not counted? (With "" the index count is 633; without it, 248.)
Big thanks!
How about this?
b = list(filter(None, re.split('[^a-z]', line.lower())))
As an alternative:
b = re.findall('[a-z]+', line.lower())
Either way, you can then also remove that filter from the next line:
a += len(b)
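To see why the empty strings appear in the first place: re.split leaves an empty string wherever two separators are adjacent or a separator sits at the end of the line, while re.findall only returns the matches:
>>> import re
>>> re.split('[^a-z]', 'one, two.')
['one', '', 'two', '']
>>> re.findall('[a-z]+', 'one, two.')
['one', 'two']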
EDIT
As an aside, I think what you end up with here is a dictionary mapping words to the last position in which they appear in the text. I'm not sure if that's what you intended to do. E.g.
>>> dict(zip(['hello', 'world', 'hello', 'again'], range(4)))
{'world': 1, 'hello': 2, 'again': 3}
If you instead want to keep track of all the positions a word occurs, perhaps try this code instead:
from collections import defaultdict
import re

indexes = defaultdict(list)
with open('test.txt', 'r') as f:
    for index, word in enumerate(re.findall(r'[a-z]+', f.read().lower())):
        indexes[word].append(index)
indexes then maps each word to a list of indexes at which the word appears.
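For example, if test.txt contained the text 'hello world hello again' (my own sample input):
>>> dict(indexes)
{'hello': [0, 2], 'world': [1], 'again': [3]}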
EDIT 2
Based on the comment discussion below, I think you want something more like this:
import re

word_positions = {}
with open('test.txt', 'r') as f:
    index = 0
    for word in re.findall(r'[a-z]+', f.read().lower()):
        if word not in word_positions:
            word_positions[word] = index
            index += 1

print(word_positions)
# Output:
# {'hello': 0, 'goodbye': 2, 'world': 1}
Your regex is not a good one. Consider using:
line = re.sub('[^a-z]*$', '', line.strip())
b = re.split('[^a-z]+', line.lower())
Replace:
d = dict(zip(e, range(len(e))))
With:
d = {word:n for n, word in enumerate(e) if word}
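For example (note that the indices still count positions in the original list, empty entries included):
>>> e = ['', 'a', '', 'according']
>>> {word: n for n, word in enumerate(e) if word}
{'a': 1, 'according': 3}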
Alternatively, to avoid the empty entries in the first place, replace:
b=re.split('[^a-z]', line.lower())
With:
b=re.split('[^a-z]+', re.sub('(^[^a-z]+|[^a-z]+$)', '', line.lower()))
I have a tab-delimited text file with two columns. I need to find a way to print all values that “hit” each other on one line.
For example, my input looks like this:
A B
A C
A D
B C
B D
C D
B E
D E
B F
C F
F G
F H
H I
K L
My desired output should look like this:
A B C D
B D E
B C F
F G H
H I
K L
My actual data file is much larger than this if that makes any difference. I would prefer to do this in Unix or Python where possible.
Can anybody help?
Thanks in advance!
Is there no way to supply the input file as .csv? It would make parsing the delimiters easier.
If that isn't possible, try the next example:
from itertools import groupby
from operator import itemgetter

with open('example.txt', 'rb') as txtfile:
    cleaned = []
    # store file information in a list of lists
    for line in txtfile.readlines():
        cleaned.append(line.split())
    # group by first element of nested list
    for elt, items in groupby(cleaned, itemgetter(0)):
        row = [elt]
        for item in items:
            row.append(item[1])
        print row
Hope it helps you.
Solution using a .csv file:
from itertools import groupby
from operator import itemgetter
import csv

with open('example.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    cleaned = []
    for row in reader:
        cleaned.append(row)
    # group by first element of nested list
    for elt, items in groupby(cleaned, itemgetter(0)):
        row = [elt]
        for item in items:
            row.append(item[1])
        print row
I have a text file like this, with some 5000 lines:
5.6 4.5 6.8 "6.5" (new line)
5.4 8.3 1.2 "9.3" (new line)
so the last term is a number between double quotes.
What I want to do, using Python if possible, is to assign the four columns to floating-point variables. The main problem is the last term: I found no way of removing the double quotes from the number. Is this possible in Linux?
This is what I tried:
#!/usr/bin/python
import os,sys,re,string,array
name=sys.argv[1]
infile = open(name,"r")
cont = 0
while 1:
    line = infile.readline()
    if not line: break
    l = re.split("\s+",string.strip(line)).replace('\"','')
    cont = cont + 1
    a = l[0]
    b = l[1]
    c = l[2]
    d = l[3]
for line in open(name, "r"):
    line = line.replace('"', '').strip()
    a, b, c, d = map(float, line.split())
This is kind of bare-bones, and will raise exceptions if (for example) there aren't four values on the line, etc.
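If malformed lines are possible and you would rather skip them than stop with a traceback, here is a minimal defensive variant (my own sketch, reusing name from the question's code):
for line in open(name, "r"):
    parts = line.replace('"', '').split()
    if len(parts) != 4:
        continue  # skip lines that do not have exactly four values
    try:
        a, b, c, d = map(float, parts)
    except ValueError:
        continue  # skip lines with non-numeric values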
There's a module you can use from the standard library called shlex:
>>> import shlex
>>> print shlex.split('5.6 4.5 6.8 "6.5"')
['5.6', '4.5', '6.8', '6.5']
The csv module (standard library) does it automatically, although the docs aren't very specific about skipinitialspace:
>>> import csv
>>> with open(name, 'rb') as f:
...     for row in csv.reader(f, delimiter=' ', skipinitialspace=True):
...         print '|'.join(row)
5.6|4.5|6.8|6.5
5.4|8.3|1.2|9.3
for line in open(fname):
    line = line.split()
    line[-1] = line[-1].strip('"\n')
    floats = [float(i) for i in line]
Another option is to use a built-in module intended for this task, namely csv:
>>> import csv
>>> for line in csv.reader(open(fname), delimiter=' '):
...     print([float(i) for i in line])
[5.6, 4.5, 6.8, 6.5]
[5.4, 8.3, 1.2, 9.3]
Or you can simply replace your line
l = re.split("\s+",string.strip(line)).replace('\"','')
with this:
l = re.split('[\s"]+',string.strip(line))
In essence, I removed the " in "25" using:
Code:
result = result.strip("\"")  # remove double quote characters
I think the easiest and most efficient thing to do would be to slice it!
From your code:
d = l[3]
returns "6.5"
so you simply add another statement:
d = d[1:-1]
Now it will return 6.5 without the leading and trailing double quotes.
Voilà! :)
You can use a regexp; try something like this:
import re
re.findall(r"[0-9.]+", open(name).read())
This will give you a list of all numbers in your file as strings without any quotes.
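If you then need the numbers back as rows of four floats (assuming exactly four numbers per line, as in the sample), you could chunk the flat list:
import re

numbers = re.findall(r"[0-9.]+", open(name).read())
# group the flat list into rows of four floats
rows = [list(map(float, numbers[i:i + 4])) for i in range(0, len(numbers), 4)]
# e.g. [[5.6, 4.5, 6.8, 6.5], [5.4, 8.3, 1.2, 9.3]]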
IMHO, the most universal doublequote stripper is this:
In [1]: s = '1 " 1 2" 0 a "3 4 5 " 6'
In [2]: [i[0].strip() for i in csv.reader(s, delimiter=' ') if i != ['', '']]
Out[2]: ['1', '1 2', '0', 'a', '3 4 5', '6']