How to merge two text files using python and regex - python

This is what I have so far (edited after blhsing's answer):
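import re
# a number up to 30000, then '#', the description, and a closing '#'
regex = re.compile(r'\b(?:[12]?\d{1,4}|30{4})#[^#]+#')
# read the whole file so that a match can span several lines
with open('text.txt', 'r') as file1:
    string = file1.read()
itemdesc = regex.findall(string)
for word in itemdesc:
    print(word)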
By using: \b(?:[12]?\d{1,4}|30{4})#[^#]+# I could find:
5173#bunch of text here
of, bunch here, text
text here, bunch of
#
After finding this text, I would like to use it to replace the similar entry in another file.
At the current stage, I still need to implement something like:
\b(?:number)#[^#]+#
to find the entry with the same number in the other file and replace it, after first checking whether that number occurs more than once.
After that I will have another problem: saving the multiple occurrences to a separate text file so that I can handle them manually.
I hope you can help. Any help is appreciated; it doesn't need to be a full solution. :)

The problem here is that you're reading the file and matching the regex line by line when you actually want to match the regex over multiple lines. You should therefore read the entire file into one string before matching it against the regex:
import re
regex = re.compile(r'\b(?:[12]?\d{1,4}|30{4})#[^#]+#')
with open('text.txt', 'r') as file1:
    string = file1.read()
itemdesc = regex.findall(string)
for word in itemdesc:
    print(word)
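The follow-up step described in the question (replacing the entry with the same number in a second file and setting aside numbers that occur more than once) could be sketched roughly as below. This is only one possible approach, not something from the original answer: the function name merge_records, the named group id, and the sample strings are all made up for illustration, and it assumes both files fit in memory as strings.

```python
import re
from collections import Counter

# Same pattern as in the question, with the leading number captured as "id".
RECORD = re.compile(r'\b(?P<id>[12]?\d{1,4}|30{4})#[^#]+#')

def merge_records(source_text, target_text):
    """Replace records in target_text with same-numbered records from source_text.

    Returns the merged text and the ids that occur more than once in the
    source; those are left untouched so they can be handled manually.
    """
    counts = Counter(m.group('id') for m in RECORD.finditer(source_text))
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    replacements = {
        m.group('id'): m.group(0)
        for m in RECORD.finditer(source_text)
        if counts[m.group('id')] == 1
    }

    def swap(match):
        # Keep the target's own record when there is no unique replacement.
        return replacements.get(match.group('id'), match.group(0))

    return RECORD.sub(swap, target_text), duplicates

# Hypothetical sample data standing in for the two files.
source = "5173#new text\nover two lines\n#\n7#dup a#\n7#dup b#\n"
target = "5173#old text#\n7#keep me#\n42#untouched#\n"
merged, dups = merge_records(source, target)
print(merged)
print(dups)
```

The list of duplicate ids can then be written to a third file for manual review.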

Related

How can I search for a substring that is between two values of a string in python?

Can someone give me an idea of how to do the following? I have a text file
containing a single line, but it carries several pieces of data, all formatted like XML but inside a .txt file.
For example:
<Respuesta><ResultEnviado><Resultado><Entrega>00123</Entrega><Refer>MueblesHiroshima</Refer><Item>34</Item><Valor>780</Valor></Resultado></ResultEnviado><EntregaItemResulados><EntregaItem><ItemId>123</ItemId><NameItem>MuebleSala</NameItem><ValorItem>180</ValorItem></EntregaItem><EntregaItem><ItemId>124</ItemId><NameItem>MuebleComedor</NameItem>ValorItem>200</ValorItem></EntregaItem><EntregaItem><ItemId>125</ItemId><NameItem>Cama</NameItem>ValorItem>200</ValorItem></EntregaItem><EntregaItem><ItemId>126</ItemId><NameItem>escritorio</NameItem>ValorItem>200</ValorItem></EntregaItem></EntregaItemResulados></Respuesta>
As you can see, it is a file with the .txt extension.
<ResultEnviado><Resultado><Entrega>1213255654</Entrega><Refer>MueblesHiroshima</Refer><Item>34</Item><Valor>780</Valor></Resultado></ResultEnviado>
I am using python for the exercise.
Thank you very much for your comments or ideas.
Here we can use the re.search() function to find everything between the given tags. Note that you need to import the regular expression module with import re.
More info on regular expressions in Python: see the documentation of the re module.
import re
# open the file and read it
path = "C:/temp/file.txt"
with open(path, "r") as f:
    text = f.read()
# use a regular expression to find everything between the tags
match = re.search(r"<ResultEnviado>(.*?)</ResultEnviado>", text)
# print the captured text if the pattern matched
if match:
    print(match.group(1))
else:
    print("No match found.")
this prints:
<Resultado><Entrega>00123</Entrega><Refer>MueblesHiroshima</Refer><Item>34</Item><Valor>780</Valor></Resultado>
Please let me know if you need any more help.
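If the same tag appears several times (as EntregaItem does in the sample), re.search only returns the first hit. A minimal sketch using re.findall instead is shown below; the helper name between and the shortened sample string are made up for illustration, and for real XML a proper parser would usually be the safer choice.

```python
import re

# Shortened, hypothetical sample in the same shape as the question's data.
text = (
    "<Respuesta>"
    "<EntregaItem><ItemId>123</ItemId><NameItem>MuebleSala</NameItem></EntregaItem>"
    "<EntregaItem><ItemId>124</ItemId><NameItem>MuebleComedor</NameItem></EntregaItem>"
    "</Respuesta>"
)

def between(tag, s):
    """Return every non-overlapping substring found between <tag> and </tag>."""
    pattern = "<{0}>(.*?)</{0}>".format(re.escape(tag))
    # re.DOTALL lets '.' cross line breaks in case the data is not on one line.
    return re.findall(pattern, s, re.DOTALL)

items = between("EntregaItem", text)
ids = [between("ItemId", item)[0] for item in items]
print(ids)
```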

Trying to understand how to get import re to work in pycharm

I'm going through a course at work for Python. We're using Pycharm, and I'm not sure if that's what the problem is.
Basically, I have to read in a text file, scrub it, then count the frequency of specific words. The counting is not an issue. (I looped through a scrubbed list, checked the scrubbed list for the specific words, then added the specific words to a dictionary as I looped through the list. It works fine).
My issue is really about scrubbing the data. I ended up doing successive scrubs to get to a final clean list. But when I read the documentation, I should be able to use regex or re and scrub my file with one line of code. No matter what I do, importing re, or regex I get errors that stop my code.
How can I write the below code pythonically?
# Open the file in read mode
with open('chocolate.txt', 'r') as file:
    input_col = file.read().replace(',', '')
text3 = input_col.replace('.', '')
text2 = text3.replace('"', '')
text = text2.split()
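You could try a regular expression that looks something like this:
import re
# the character class removes double quotes, full stops and commas
result = re.sub(r'[".,]', "", text)
print(result)
Here text is the string you read from the text file. Note that an unescaped . outside a character class matches any character, so it has to be escaped or placed inside [...].
Hope this helps!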
x = re.sub(r'("|\.|,)', "", text)  # text is the string read from the file; naming it str would shadow the built-in
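Putting the scrub and the count together could look like the sketch below. It is only an illustration: the scrub function and the sample sentence are made up, and collections.Counter is swapped in for the hand-written dictionary loop from the question. As for the import errors, re is part of Python's standard library and needs no installation, while regex is a separate third-party package; a project file that is itself named re.py may also shadow the standard module, which is worth checking.

```python
import re
from collections import Counter

def scrub(text):
    """Remove double quotes, full stops and commas, then split on whitespace."""
    return re.sub(r'[".,]', '', text).split()

# Hypothetical sample standing in for the contents of chocolate.txt.
sample = 'Dark chocolate, "milk" chocolate. White chocolate.'
words = scrub(sample)
counts = Counter(word.lower() for word in words)
print(words)
print(counts['chocolate'])
```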

Matching a simple string with regex not working?

I have a large txt-file and want to extract all strings with these patterns:
/m/meet_the_crr
/m/commune
/m/hann_2
Here is what I tried:
import re
with open("testfile.txt", "r") as text_file:
    contents = text_file.read().replace("\n", "")
print(re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents))
The result I get is a simple "None". What am I doing wrong here?
You need to keep the line endings and use the re.MULTILINE flag, so that ^ and $ match at every line and re.findall can return multiple results from the larger text:
# write a demo file
with open("t.txt","w") as f:
    f.write("""
/m/meet_the_crr\n
/m/commune\n
/m/hann_2\n\n
# your text looks like this after .read().replace(\"\\n\",\"\")\n
/m/meet_the_crr/m/commune/m/hann_2""")
Program:
import re
regex = r"^\/m\/[a-zA-Z0-9_-]+$"
with open("t.txt","r") as f:
    contents = f.read()
found_all = re.findall(regex,contents,re.M)
print(found_all)
print("-")
print(open("t.txt").read())
Output:
['/m/meet_the_crr', '/m/commune', '/m/hann_2']
Filecontent:
/m/meet_the_crr
/m/commune
/m/hann_2
# your text looks like this after .read().replace("\n","")
/m/meet_the_crr/m/commune/m/hann_2
This is roughly what Wiktor Stribiżew told you in his comment, although he also suggested a better pattern: r'^/m/[\w-]+$'
There is nothing logically wrong with your code, and in fact your pattern will match the inputs you describe:
result = re.match(r'^\/m\/[a-zA-Z0-9_-]+$', '/m/meet_the_crr')
if result:
    print(result.groups()) # this line is reached, as there is a match
Since you did not specify any capture groups, you will see () being printed to the console. You could capture the entire input, and then it would be available, e.g.
result = re.match(r'(^\/m\/[a-zA-Z0-9_-]+$)', '/m/meet_the_crr')
if result:
    print(result.group(1))  # group(1) is the text captured by the parentheses
/m/meet_the_crr
You are reading the whole file into a variable (into memory) using .read(). With .replace("\n", ""), you remove all newlines from the string. re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents) then tries to match the entire string against the \/m\/[a-zA-Z0-9_-]+ pattern, which is impossible after the previous manipulations.
There are at least two ways out. Either remove .replace("\n", "") (to prevent newline removal) and use re.findall(r'^/m/[\w-]+$', contents, re.M) (re.M option will enable matching whole lines rather than the whole text), or read the file line by line and use your re.match version to check each line for a match, and if it matches add to the final list.
Example:
import re
with open("testfile.txt", "r") as text_file:
    contents = text_file.read()
print(re.findall(r'^/m/[\w-]+$', contents, re.M))
Or
import re
with open("testfile.txt", "r") as text_file:
    for line in text_file:
        if re.match(r'/m/[\w-]+\s*$', line):
            print(line.rstrip())
Note I used \w to make the pattern somewhat shorter, but if you are working in Python 3 and only want to match ASCII letters and digits, also use the re.ASCII option.
Also, / is not a special character in Python regex patterns, so there is no need to escape it.
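To illustrate the points above, here is a small sketch with made-up sample data (the /m/café line and the non-matching last line are not from the question). It shows the effect of re.M, of adding re.ASCII, and of stripping the newlines first.

```python
import re

# Hypothetical file contents; one line contains a non-ASCII letter.
contents = "/m/meet_the_crr\n/m/commune\n/m/café\n/m/hann_2\nnot/m/a_match\n"
pattern = r'^/m/[\w-]+$'

# With re.M, ^ and $ match at the start and end of every line.
unicode_hits = re.findall(pattern, contents, re.M)

# Adding re.ASCII restricts \w to [a-zA-Z0-9_], so the café line drops out.
ascii_hits = re.findall(pattern, contents, re.M | re.ASCII)

# After removing the newlines the text is one long line and nothing matches.
flattened = contents.replace("\n", "")
print(unicode_hits)
print(ascii_hits)
print(re.match(pattern, flattened))
```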

how to : multiline to oneline by removing newlines

I'm a newbie who is just starting to learn Python.
I wanted to make a script that counts the occurrences of a letter pattern in a text file. The problem is that my text file has multiple lines, so I couldn't find some occurrences of my pattern because they continue on the next line.
My file and the pattern are DNA sequences.
Example:
'attctcgatcagtctctctagtgtgtgagagactctagctagatcgtccactcactgac**ga
tc**agtcagt**gatc**tctcctactacaaggtgacatgagtgtaaattagtgtgagtgagtgaa'
I'm looking for 'gatc'. The second one was counted, but the first wasn't.
So, how can I make this file to a one line text file?
You can join the lines when you read the pattern from the file:
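with open('dna.txt', 'r') as fd:
    # strip each line's trailing newline before joining; readlines() alone keeps it
    dnatext = ''.join(line.strip() for line in fd)
print(dnatext.count('gatc'))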
dnatext = text.replace('\n', '')  # join the lines; text holds the file's contents
gatc_count = dnatext.count('gatc')  # count 'gatc' occurrences
This should do the trick :
dnatext = "".join(dnatext.split("\n"))
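A small self-contained sketch of the same idea, using a made-up sequence in which one occurrence of the pattern is split across a line break. The function name count_pattern is hypothetical. Splitting on any whitespace also removes stray spaces and Windows-style line endings, and str.count counts non-overlapping occurrences, which is enough for a pattern like 'gatc'.

```python
def count_pattern(text, pattern):
    """Count non-overlapping occurrences of pattern, ignoring all whitespace."""
    joined = "".join(text.split())  # drops '\n', '\r', spaces and tabs
    return joined.count(pattern)

# Hypothetical two-line sequence: the first 'gatc' is broken by the newline.
sample = "actgacga\ntcagtcagtgatctctcc\n"
print(sample.count('gatc'))
print(count_pattern(sample, 'gatc'))
```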

Delete a specific string (not line) from a text file python

I have a text file with two lines:
<BLAHBLAH>483920349<FOOFOO>
<BLAHBLAH>4493<FOOFOO>
That's the only thing in the text file. Using Python, I want to rewrite the text file so that BLAHBLAH and FOOFOO are removed from each line. It seems like a simple task, but even after brushing up on file manipulation I can't find a way to do it.
Help is greatly appreciated :)
Thanks!
If it's a text file as you say, and not HTML/XML/something else, just use replace:
for line in infile.readlines():
    cleaned_line = line.replace("BLAHBLAH","")
    cleaned_line = cleaned_line.replace("FOOFOO","")
and write cleaned_line to an output file.
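f = open(path_to_file, "r+")  # "w+" would truncate the file before it could be read
text = f.read().replace("<BLAHBLAH>","").replace("<FOOFOO>","")
f.seek(0)  # go back to the start before writing
f.write(text)
f.truncate()  # cut off the leftover tail, since the new text is shorter
f.close()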
Update (saving to another file):
f = open(path_to_input_file, "r")
output = open(path_to_output_file, "w")
output.write(f.read().replace("<BLAHBLAH>","").replace("<FOOFOO>",""))
f.close()
output.close()
Consider the regular expressions module re.
import re
result_text = re.sub('<(.|\n)*?>',replacement_text,source_text)
The strings within < and > are identified. It is non-greedy, ie it will accept a substring of the least possible length. For example if you have "<1> text <2> more text", a greedy parser would take in "<1> text <2>", but a non-greedy parser takes in "<1>" and "<2>".
And of course, your replacement_text would be '' and source_text would be each line from the file.
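The difference between greedy and non-greedy matching can be checked with a short sketch. The variable names are made up; the last pattern, <[^>]*>, is a common alternative to (.|\n)*? and is not taken from the answer above.

```python
import re

source_text = "<1> text <2> more text"

# Greedy: .* runs to the last '>' in the string.
greedy = re.findall(r'<.*>', source_text)

# Non-greedy: .*? stops at the first '>' it can.
lazy = re.findall(r'<.*?>', source_text)

# Applied to a line from the question, with an empty replacement.
cleaned = re.sub(r'<[^>]*>', '', "<BLAHBLAH>483920349<FOOFOO>")
print(greedy)
print(lazy)
print(cleaned)
```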
