I have a large text document that I am reading in and attempting to split into a list of lists. I'm having a hard time with the logic behind actually splitting up the string.
Example of the text:
Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410
This data contains 4 pieces of information in this format:
City[coordinates]Population Distances_to_previous
My aim is to split this data up into a List:
Data = [[City] , [Coordinates] , [Population] , [Distances]]
As far as I know I need to use .split() calls, but I've gotten lost trying to implement them.
I'd be very grateful for some ideas to get started!
I would do this in stages.
Your first split is at the '[' of the coordinates.
Your second split is at the ']' of the coordinates.
Third split is end of line.
The next line (if it starts with a number) is your distances.
I'd start with something like:
# lines: the list of lines read in from the file
numCities = 0
Data = []
i = 0
while i < len(lines):
    split = lines[i].partition('[')
    if (split[1]): # We found something
        city = split[0]
        split = split[2].partition(']')
        if (split[1]):
            coords = split[0] # If you want this as a list then split it on ','
            population = split[2]
            distances = []
            if i > 0:
                i += 1
                distances = lines[i].rsplit(' ')
            Data.append([city, coords, population, distances])
            numCities += 1
    i += 1

for data in Data:
    print(data)
This will print
['Youngstown, OH', '4110,8065', '115436', []]
['Yankton, SD', '4288,9739', '12011', ['966']]
['Yakima, WA', '4660,12051', '49826', ['1513', '2410']]
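If you also want the coordinates as separate numbers rather than one string, a small follow-up on the Data list built above could be:
for entry in Data:
    entry[1] = [int(c) for c in entry[1].split(',')]  # '4110,8065' -> [4110, 8065]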
The easiest way would be with a regex.
lines = """Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410"""
import re
pat = re.compile(r"""
(?P<City>.+?) # all characters up to the first [
\[(?P<Coordinates>\d+,\d+)\] # grabs [(digits,here)]
(?P<Population>\d+) # population digits here
\s # a space or a newline?
(?P<Distances>[\d ]+)? # Everything else is distances""", re.M | re.X)
groups = pat.finditer(lines)
results = [[[g.group("City")],
            [g.group("Coordinates")],
            [g.group("Population")],
            g.group("Distances").split() if g.group("Distances") else [None]]
           for g in groups]
DEMO:
In[50]: results
Out[50]:
[[['Youngstown, OH'], ['4110,8065'], ['115436'], [None]],
[['Yankton, SD'], ['4288,9739'], ['12011'], ['966']],
[['Yakima, WA'], ['4660,12051'], ['49826'], ['1513', '2410']]]
Though if I may, it's probably BEST to do this as a list of dictionaries.
groups = pat.finditer(lines)
results = [{key: g.group(key)
            for key in ["City", "Coordinates", "Population", "Distances"]}
           for g in groups]
# then modify later
for d in results:
    try:
        d['Distances'] = d['Distances'].split()
    except AttributeError:
        # distances is None -- that's okay
        pass
I want to search for a multi-line string in a file in Python. If there is a match, I want to get the start line number, end line number, start column and end column of the match. For example, in the below file,
I want to match the below multi-line string:
pattern = """b'0100000001685c7c35aabe690cc99f947a8172ad075d4401448a212b9f26607d6ec5530915010000006a4730'
b'440220337117278ee2fc7ae222ec1547b3a40fa39a05f91c1e19db60060541c4b3d6e4022020188e1d5d843c'"""
The result of the match should be: start_line: 2, end_line: 3, start_column: 23, end_column: 114.
The start column is the index in that line where the first character of the pattern is matched, and the end column is the index in the last line where the last character of the pattern is matched.
I tried with the re package of Python, but it returns None as it could not find any match.
import re
pattern = """b'0100000001685c7c35aabe690cc99f947a8172ad075d4401448a212b9f26607d6ec5530915010000006a4730'
b'440220337117278ee2fc7ae222ec1547b3a40fa39a05f91c1e19db60060541c4b3d6e4022020188e1d5d843c'"""
with open("test.py") as f:
content = f.read()
print(re.search(pattern, content))
I can find the metadata of the location of a match of a single-line string in a file using
with open("test.py") as f:
    data = f.read()

for n, line in enumerate(data.splitlines()):
    match_index = line.find(pattern)
    if match_index != -1:
        print("Start Line:", n + 1)
        print("End Line", n + 1)
        print("Start Column:", match_index)
        print("End Column:", match_index + len(pattern) + 1)
        break
But, I am struggling to make it work for multi-line strings. How can I match multi-line strings in a file and get the metadata of the location of the match in python?
You should use the re.MULTILINE flag to search multiple lines
import re
pattern = r"(c\nd)"
string = """
a
b
c
d
e
f
"""
match = re.search(pattern, string, flags=re.MULTILINE)
print(match)
To get the start line, you could count the newline characters as follows
start, stop = match.span()
start_line = string[:start].count('\n')
You could do the same for the end_line, or if you know how many lines your pattern spans, you can just add that to avoid counting twice.
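For example, continuing from the snippet above:
end_line = string[:stop].count('\n')   # same counting trick, up to the end of the match
# or, if you already know this particular pattern spans two lines:
end_line = start_line + 1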
To also get the start column, you can check the line itself, or a pure regex solution could also look like:
pattern = "(?:.*\n)*(\s*(c\s*\n\s*d)\s*)"
match = re.match(pattern, string, flags=re.MULTILINE)
start_column = match.start(2) - match.start(1)
start_line = string[:match.start(1)].count('\n')
print(start_line, start_column)
However, I think difflib could be more useful here.
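For instance, difflib.SequenceMatcher can locate the longest common block between the file contents and your text; a rough sketch (not a drop-in solution, and it assumes the pattern really is fixed text, not a regex):
import difflib

with open("test.py") as f:
    content = f.read()

sm = difflib.SequenceMatcher(None, content, pattern, autojunk=False)
m = sm.find_longest_match(0, len(content), 0, len(pattern))
if m.size == len(pattern):  # the whole text was found in the file
    start, stop = m.a, m.a + m.size
    start_line = content[:start].count('\n') + 1
    end_line = content[:stop].count('\n') + 1
    print(start_line, end_line)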
Alternative Solution
Below is a more creative solution to your problem:
You are interested in the row and column position of some sample text (not a pattern, but a fixed text) in a larger text.
This problem reminds me a lot on image registration, see https://en.wikipedia.org/wiki/Digital_image_correlation_and_tracking for a short introduction or https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate2d.html for a more sophisticated example.
import os
from itertools import zip_longest
import numpy as np
text = """Some Title
abc xyz ijk
12345678
abcdefgh
xxxxxxxxxxx
012345678
abcabcabc
yyyyyyyyyyy
"""
template = (
    "12345678",
    "abcdefgh"
)
moving = np.array([
    [ord(char) for char in line]
    for line in template
])
lines = text.split(os.linesep)
values = [
    [ord(char) for char in line]
    for line in lines
]
# use zip longest, to pad array with fill value
reference = np.array(list(zip_longest(*values, fillvalue=0))).T
windows = np.lib.stride_tricks.sliding_window_view(reference, moving.shape)
# get a distance matrix
distance = np.linalg.norm(windows - moving, axis=(2, 3))
# find the minimum and return its index location
row, column = np.unravel_index(np.argmin(distance), distance.shape)
print(row, column)
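If you want this in the same 1-based form the question asks for, a small follow-up on the row and column found above could be:
start_line = row + 1                  # 1-based line of the window's top edge
end_line = row + len(template)        # last line covered by the template
start_column = column                 # 0-based column of the window's left edge
print(start_line, end_line, start_column)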
I've seen variations of this question asked a million times but somehow can't figure out a solution for myself.
( PIN 700W_start_stop( STS_PROP( POS_X 1233 )( POS_Y 456 )( BIT_CNT 1 )( CNCT_ID 7071869 ))(USR_PROP( VAR 1( Var_typ -1 )(AssocCd H12 )( termLBLttt +S)( Anorm 011.1)(Amax 1.0))
How do I pull out the number after 'POS_X'? i.e. 1233
I thought I had it figured out using regex because it seems extremely straightforward. But it's not working (go figure).
import re
import pandas as pd
df_pin = pd.DataFrame(columns =
    ['ID','Pos_x','Pos_y','conn_ID','Association_Code','Anorm','Amax'])
with open(r'C:\Users\user1\Documents\Python Scripts\test1.txt', 'r',
          encoding="ISO-8859-1") as txt:
    for line in txt:
        data = txt.read()
        line = line.strip()
        x = re.search(r'POS_X (\d+)', data)
        df_pin = df_pin.append({'POS_X': x}, ignore_index=True)
        print(x)
Shouldn't this give me the numbers after 'POS_X' and then append them to the corresponding column in my dataframe? There may be multiple occurrences of 'POS_X ###' on the same line; I only want to find the first. What if I wanted to do the same for 'PIN' and extract '700W_start_stop'?
re.search() returns a match object. \d+ is inside the first capture group in the regexp, so you need to use
if x:
    print(x.group(1))
else:
    print("POS_X not found")
to print that.
The whole loop should be:
import re
with open(r'C:\Users\user1\Documents\Python Scripts\test1.txt', 'r', encoding="ISO-8859-1") as txt:
    for line in txt:
        line = line.strip()
        x = re.search(r'POS_X (\d+)', line)
        if x:
            print(x.group(1))
        else:
            print("POS_X not found in", line)
For PIN, you could use:
x = re.search(r'PIN (\w+)', line)
\w matches alphanumeric characters and _.
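Putting both together, a rough sketch (the column names are just illustrative, and rows are collected in a plain list because repeated DataFrame.append calls are slow and removed in newer pandas):
import re
import pandas as pd

rows = []
with open(r'C:\Users\user1\Documents\Python Scripts\test1.txt', 'r', encoding="ISO-8859-1") as txt:
    for line in txt:
        pin = re.search(r'PIN (\w+)', line)
        pos_x = re.search(r'POS_X (\d+)', line)  # search() only returns the first occurrence
        if pin and pos_x:
            rows.append({'ID': pin.group(1), 'Pos_x': int(pos_x.group(1))})

df_pin = pd.DataFrame(rows, columns=['ID', 'Pos_x'])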
In the previous post I did not clarify the question properly, so I would like to start a new topic here.
I have the following items:
a sorted list of 59,000 protein patterns (ranging from 3 characters, e.g. "FFK", to 152 characters long);
some long protein sequences, aka my reference.
I am going to match these patterns against my reference and find the location where each match is found. (My friend helped write a script for that.)
import sys
import re
from itertools import chain, izip
# Read input
with open(sys.argv[1], 'r') as f:
    sequences = f.read().splitlines()
with open(sys.argv[2], 'r') as g:
    patterns = g.read().splitlines()

# Write output
with open(sys.argv[3], 'w') as outputFile:
    data_iter = iter(sequences)
    order = ['antibody name', 'epitope sequence', 'start', 'end', 'length']
    header = '\t'.join([k for k in order])
    outputFile.write(header + '\n')
    for seq_name, seq in izip(data_iter, data_iter):
        locations = [[{'antibody name': seq_name,
                       'epitope sequence': pattern,
                       'start': match.start() + 1,
                       'end': match.end(),
                       'length': len(pattern)}
                      for match in re.finditer(pattern, seq)]
                     for pattern in patterns]
        for loc in chain.from_iterable(locations):
            output = '\t'.join([str(loc[k]) for k in order])
            outputFile.write(output + '\n')
f.close()
g.close()
outputFile.close()
The problem is that within these 59,000 patterns, after sorting, I found that parts of one pattern match parts of other patterns, and I would like to consolidate these into one big "consensus" pattern and just keep the consensus (see examples below):
TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
will yield
TLYLQMNSLRAEDTAV
another example:
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR
will yield
KPGQAPRLLIYGASSRATGIPD
PS : I am aligning them here so it's easier to visualize. The 59,000 patterns initially are not sorted so it's hard to see the consensus in the actual file.
In my particular problem, I am not simply picking the longest pattern; instead, I need to take each pattern into account to find the consensus. I hope I have explained my specific problem clearly enough.
Thanks!
Here's my solution with randomized input order to improve confidence of the test.
import re
import random
data_values = """TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR"""
test_li1 = data_values.split()
#print(test_li1)
test_li2 = ["abcdefghi", "defghijklmn", "hijklmnopq", "mnopqrst", "pqrstuvwxyz"]
def aggregate_str(data_li):
    copy_data_li = data_li[:]
    while len(copy_data_li) > 0:
        remove_li = []
        len_remove_li = len(remove_li)
        longest_str = max(copy_data_li, key=len)
        copy_data_li.remove(longest_str)
        remove_li.append(longest_str)
        while len_remove_li != len(remove_li):
            len_remove_li = len(remove_li)
            for value in copy_data_li:
                value_pattern = "".join([x+"?" for x in value])
                longest_match = max(re.findall(value_pattern, longest_str), key=len)
                if longest_match in value:
                    longest_str_index = longest_str.index(longest_match)
                    value_index = value.index(longest_match)
                    if value_index > longest_str_index and longest_str_index > 0:
                        longest_str = value[:value_index] + longest_str
                        copy_data_li.remove(value)
                        remove_li.append(value)
                    elif value_index < longest_str_index and longest_str_index + len(longest_match) == len(longest_str):
                        longest_str += value[len(longest_str)-longest_str_index:]
                        copy_data_li.remove(value)
                        remove_li.append(value)
                    elif value in longest_str:
                        copy_data_li.remove(value)
                        remove_li.append(value)
        print(longest_str)
        print(remove_li)

random.shuffle(test_li1)
random.shuffle(test_li2)
aggregate_str(test_li1)
#aggregate_str(test_li2)
Output from print().
KPGQAPRLLIYGASSRATGIPD
['KPGQAPRLLIYGASSRATGIPD', 'APRLLIYGASS', 'KPGQAPRLLIYGASSR', 'APRLLIYGASSRAT', 'APRLLIYGASSR', 'APRLLIYGASSRA', 'GQAPRLLIY', 'APRLLIYGASSRATGIPD', 'APRLLIYGASSRATG', 'QAPRLLIYGASSR', 'LLIYGASSRATG', 'KPGQAPRLLIYGASSRATG', 'KPGQAPRLLIYGASSRAT', 'LLIYGASSRATGIPD', 'APRLLIYGASSRATGIP']
TLYLQMNSLRAEDTAV
['YLQMNSLRAEDTAV', 'TLYLQMNSLRAED', 'TLYLQMNSLRAEDT', 'YLQMNSLRAED', 'YLQMNSLRAEDTA', 'YLQMNSLRAEDT']
Edit1 - brief explanation of the code.
1.) Find longest string in list
2.) Loop through all remaining strings and find longest possible match.
3.) Make sure that the match is not a false positive. Based on the way I've written this code, it should avoid pairing single overlaps on terminal ends.
4.) Append the match to the longest string if necessary.
5.) When nothing else can be added to the longest string, repeat the process (1-4) for the next longest string remaining.
Edit2 - Corrected unwanted behavior when treating data like ["abcdefghijklmn", "ghijklmZopqrstuv"]
def main():
    #patterns = ["TLYLQMNSLRAED","TLYLQMNSLRAEDT","YLQMNSLRAED","YLQMNSLRAEDT","YLQMNSLRAEDTA","YLQMNSLRAEDTAV"]
    patterns = ["APRLLIYGASS","APRLLIYGASSR","APRLLIYGASSRA","APRLLIYGASSRAT","APRLLIYGASSRATG","APRLLIYGASSRATGIP","APRLLIYGASSRATGIPD","GQAPRLLIY","KPGQAPRLLIYGASSR","KPGQAPRLLIYGASSRAT","KPGQAPRLLIYGASSRATG","KPGQAPRLLIYGASSRATGIPD","LLIYGASSRATG","LLIYGASSRATGIPD","QAPRLLIYGASSR"]
    test = find_core(patterns)
    test = find_pre_and_post(test, patterns)
    #final = "YLQMNSLRAED"
    final = "KPGQAPRLLIYGASSRATGIPD"
    if test == final:
        print("worked:" + test)
    else:
        print("fail:" + test)

def find_pre_and_post(core, patterns):
    pre = ""
    post = ""
    for pattern in patterns:
        start_index = pattern.find(core)
        if len(pattern[0:start_index]) > len(pre):
            pre = pattern[0:start_index]
        if len(pattern[start_index+len(core):len(pattern)]) > len(post):
            post = pattern[start_index+len(core):len(pattern)]
    return pre+core+post

def find_core(patterns):
    test = ""
    for i in range(len(patterns)):
        for j in range(2, len(patterns[i])):
            patterncount = 0
            for pattern in patterns:
                if patterns[i][0:j] in pattern:
                    patterncount += 1
            if patterncount == len(patterns):
                test = patterns[i][0:j]
    return test

main()
So what I do first is find the main core in the find_core function by starting with a substring of length two of the first string, as one character is not sufficient information. I then check whether that substring is in ALL the strings, which is the definition of a "core".
I then find the index of the core in each string to work out the pre and post substrings to add to the core. I keep track of their lengths and update them whenever a longer one is found. I didn't have time to explore edge cases, so here is my first shot.
I have this code that I've been struggling for a while to optimize.
My dataframe is a CSV file with 2 columns, of which the second column contains the texts.
I have a function summarize(text, n) that needs a single text and an integer as input.
def summarize(text, n):
    sents = sent_tokenize(text)  # text into tokenized sentences
    # Checking if there are less sentences in the given review than the required length of the summary
    assert n <= len(sents)
    list_sentences = [word_tokenize(s.lower()) for s in sents]  # word tokenized sentences
    frequency = calculate_freq(list_sentences)  # calculating the word frequency for all the sentences
    ranking = defaultdict(int)
    for i, sent in enumerate(list_sentences):
        for w in sent:
            if w in frequency:
                ranking[i] += frequency[w]
    # Calling the rank function to get the highest ranking
    sents_idx = rank(ranking, n)
    # Return the best choices
    return [sents[j] for j in sents_idx]
To summarize() all the texts, I first iterate through my dataframe and create a list of all the texts, then I iterate over that list again to send the texts one by one to the summarize() function to get the summary of each text. These for loops are making my code really, really slow, but I haven't been able to figure out a way to make it more efficient, and I would greatly appreciate any suggestions.
data = pd.read_csv('dataframe.csv')
text = data.iloc[:,2] # ilocating the texts
list_of_strings = []
for t in text:
    list_of_strings.append(t)  # creating a list of all the texts

our_summary = []
for s in list_of_strings:
    for f in summarize(s, 1):
        our_summary.append(f)
ours = pd.DataFrame({"our_summary": our_summary})
EDIT:
The other two functions are:
def calculate_freq(list_sentences):
    frequency = defaultdict(int)
    for sentence in list_sentences:
        for word in sentence:
            if word not in our_stopwords:
                frequency[word] += 1
    # We want to filter out the words with frequency below 0.1 or above 0.9 (once normalized)
    if frequency.values():
        max_word = float(max(frequency.values()))
    else:
        max_word = 1
    for w in list(frequency.keys()):  # copy the keys so entries can be deleted inside the loop
        frequency[w] = frequency[w]/max_word  # normalize
        if frequency[w] <= min_freq or frequency[w] >= max_freq:
            del frequency[w]  # filter
    return frequency
def rank(ranking, n):
    # return n first sentences with highest ranking
    return nlargest(n, ranking, key=ranking.get)
Input text: Recipes are easy and the dogs love them. I would buy this book again and again. Only thing is that the recipes don't tell you how many treats they make, but I suppose that's because you could make them all different sizes. Great buy!
Output text: I would buy this book again and again.
Have you tried something like this?
# Test data
df = pd.DataFrame({'ASIN': [0,1], 'Summary': ['This is the first text', 'Second text']})
# Example function
def summarize(text, n=5):
    """A very basic summary"""
    return (text[:n] + '..') if len(text) > n else text

# Applying the function to the text
df['Result'] = df['Summary'].map(summarize)

#    ASIN                 Summary   Result
# 0     0  This is the first text  This ..
# 1     1             Second text  Secon..
Such a long story...
I'm going to assume that since you are performing a text frequency analysis, the order of reviewText doesn't matter. If that is the case:
Mega_String = ' '.join(data['reviewText'])
This should concatenate all the strings in the reviewText column into one big string, with each review separated by a space.
You can then just pass this result to your functions.
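For example, assuming the summarize function from the question and that a single overall summary is what you are after:
overall_summary = summarize(Mega_String, 3)  # top 3 sentences across all reviews
print(overall_summary)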
Currently I have a script that finds all the lines across multiple input files that have something in the format of
Matches: 500 (54.3 %)
and prints out the top 10 highest matches by percentage.
I want to be able to have it also output the top 10 lines for score, e.g. Score: 4000.
import re
def get_values_from_file(filename):
    f = open(filename)
    winpat = re.compile("([\d\.]+)\%")
    xinpat = re.compile("[\d]") #ISSUE, is this the right regex for it? Score: 500****
    values = []
    scores = []
    for line in f.readlines():
        if line.find("Matches") >= 0:
            percn = float(winpat.findall(line)[0])
            values.append(percn)
        elif line.find("Score") >= 0:
            hey = float(xinpat.findall(line)[0])
            scores.append(hey)
    return (scores, values)

all_values = []
all_scores = []
for filename in ["out0.txt", "out1.txt"]: # and so on
    scores, values = get_values_from_file(filename)
    all_values += values
    all_scores += scores

all_values.sort()
all_values.reverse()
all_scores.sort()  # also for scores
all_scores.reverse()
print(all_values[0:10])
print(all_scores[0:10])
Is my regex for the score format correct? I believe that's where I am having the issue, as it doesn't output both correctly.
Any thoughts? Should I split it into two functions?
Thank you.
Is my regex for the score format correct?
No, it should be r"\d+".
You don't need []. Those brackets establish a character class representing all of the characters inside the brackets. Since you only have one character type inside the bracket, they do nothing.
You only match a single character. You need a * or a + to match a sequence of characters.
You have an unescaped backslash in your string. Use the r prefix to allow the regular expression engine to see the backslash.
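For example, a quick interactive check of the corrected pattern:
>>> import re
>>> re.findall(r"\d+", "Score: 4000")
['4000']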
Commentary:
If it were me, I'd let the regular expression do all of the work, and skip line.find() altogether:
#UNTESTED
def get_values_from_file(filename):
    winpat = re.compile(r"Matches:\s*\d+\s*\(([\d\.]+)\%\)")
    xinpat = re.compile(r"Score:\s*([\d]+)")
    values = []
    scores = []
    # Note: "with open() as f" automatically closes f
    with open(filename) as f:
        # Note: "for line in f" is more memory efficient
        # than "for line in f.readlines()"
        for line in f:
            win = winpat.match(line)
            xin = xinpat.match(line)
            if win: values.append(float(win.group(1)))
            if xin: scores.append(float(xin.group(1)))
    return (scores, values)
Just for fun, here is a version of the routine which calls re.findall exactly once per file:
# TESTED
# Compile this only once to save time
pat = re.compile(r'''(?mx)                  # multi-line, verbose
    (?:Matches:\s*\d+\s*\(([\d\.]+)\s*%\))  # "Matches: 300 (43.2%)"
    |
    (?:Score:\s*(\d+))                      # "Score: 4000"
    ''')
def get_values_from_file(filename):
    with open(filename) as f:
        values, scores = zip(*pat.findall(f.read()))
    values = [float(value) for value in values if value]
    scores = [float(score) for score in scores if score]
    return scores, values
No. xinpat will only match single digits, so findall() will return a list of single digits, which is a bit messy. Change it to
xinpat = re.compile("[\d]+")
Actually, you don't need the square brackets here, so you could simplify it to
xinpat = re.compile("\d+")
BTW, the names winpat and xinpat are a bit opaque. The pat bit is OK, but win and xin? And hey isn't great either. But I guess xin and hey are just temporary names you made up when you decided to expand the program.
Another thing I just noticed, you don't need to do
all_values.sort()
all_values.reverse()
You can (and should) do that in one hit:
all_values.sort(reverse=True)