Find coincidence and add column - python

I want to achieve this specific task, I have 2 files, the first one with emails and credentials:
xavier.desprez#william.com:Xavier
xavier.locqueneux#william.com:vocojydu
xaviere.chevry#pepe.com:voluzigy
Xavier.Therin#william.com:Pussycat5
xiomara.rivera#william.com:xrhj1971
xiomara.rivera#william-honduras.william.com:xrhj1971
and the second one, with emails and location:
xavier.desprez#william.com:BOSNIA
xaviere.chevry#pepe.com:ROMANIA
I want that, whenever the email from the first file is found on the second file, the row is substituted by EMAIL:CREDENTIAL:LOCATION , and when it is not found, it ends up being: EMAIL:CREDENTIAL:BLANK
so the final file must be like this:
xavier.desprez#william.com:Xavier:BOSNIA
xavier.locqueneux#william.com:vocojydu:BLANK
xaviere.chevry#pepe.com:voluzigy:ROMANIA
Xavier.Therin#william.com:Pussycat5:BLANK
xiomara.rivera#william.com:xrhj1971:BLANK
xiomara.rivera#william-honduras.william.com:xrhj1971:BLANK
I have do several tries in python, but it is not even worth it to write it because I am not really close to the solution.
Regards !
EDIT:
This is what I tried:
import os
import sys
with open("test.txt", "r") as a_file:
for line_a in a_file:
stripped_email_a = line_a.strip().split(':')[0]
with open("location.txt", "r") as b_file:
for line_b in b_file:
stripped_email_b = line_b.strip().split(':')[0]
location = line_b.strip().split(':')[1]
if stripped_email_a == stripped_email_b:
a = line_a + ":" + location
print(a.replace("\n",""))
else:
b = line_a + ":BLANK"
print (b.replace("\n",""))
This is the result I get:
xavier.desprez#william.com:Xavier:BOSNIA
xavier.desprez#william.com:Xavier:BLANK
xaviere.chevry#pepe.com:voluzigy:BLANK
xaviere.chevry#pepe.com:voluzigy:ROMANIA
xavier.locqueneux#william.com:vocojydu:BLANK
xavier.locqueneux#william.com:vocojydu:BLANK
Xavier.Therin#william.com:Pussycat5:BLANK
Xavier.Therin#william.com:Pussycat5:BLANK
xiomara.rivera#william.com:xrhj1971:BLANK
xiomara.rivera#william.com:xrhj1971:BLANK
xiomara.rivera#william-honduras.william.com:xrhj1971:BLANK
xiomara.rivera#william-honduras.william.com:xrhj1971:BLANK
I am very close but I get duplicates ;)
Regards

The duplication issue comes from the fact that you are reading two files in a nested way, once a line from the test.txt is read, you open the location.txt file for reading and process it. Then, you read the second line from test.txt, and re-open the location.txt and process it again.
Instead, get all the necessary data from the location.txt, say, into a dictionary, and then use it while reading the test.txt:
email_loc_dict = {}
with open("location.txt", "r") as b_file:
for line_b in b_file:
splits = line_b.strip().split(':')
email_loc_dict[splits[0]] = splits[1]
with open("test.txt", "r") as a_file:
for line_a in a_file:
line_a = line_a.strip()
stripped_email_a = line_a.split(':')[0]
if stripped_email_a in email_loc_dict:
a = line_a + ":" + email_loc_dict[stripped_email_a]
print(a)
else:
b = line_a + ":BLANK"
print(b)
Output:
xavier.desprez#william.com:Xavier:BOSNIA
xavier.locqueneux#william.com:vocojydu:BLANK
xaviere.chevry#pepe.com:voluzigy:ROMANIA
Xavier.Therin#william.com:Pussycat5:BLANK
xiomara.rivera#william.com:xrhj1971:BLANK
xiomara.rivera#william-honduras.william.com:xrhj1971:BLANK

Related

Combine two wordlist in one file Python

I have two wordlists, as per the examples below:
wordlist1.txt
aa
bb
cc
wordlist2.txt
11
22
33
I want to take every line from wordlist2.txt and put it after each line in wordlist1.txt and combine them in wordlist3.txt like this:
aa
11
bb
22
cc
33
.
.
Can you please help me with how to do it? Thanks!
Try to always try to include what you have tried.
However, this is a great place to start.
def read_file_to_list(filename):
with open(filename) as file:
lines = file.readlines()
lines = [line.rstrip() for line in lines]
return lines
wordlist1= read_file_to_list("wordlist1.txt")
wordlist2= read_file_to_list("wordlist2.txt")
with open("wordlist3.txt",'w',encoding = 'utf-8') as f:
for x,y in zip(wordlist1,wordlist2):
f.write(x+"\n")
f.write(y+"\n")
Check the following question for more ideas and understanding: How to read a file line-by-line into a list?
Cheers
Open wordlist1.txt and wordlist2.txt for reading and wordlist3.txt for writing. Then it's as simple as:
with open('wordlist3.txt', 'w') as w3, open('wordlist1.txt') as w1, open('wordlist2.txt') as w2:
for l1, l2 in zip(map(str.rstrip, w1), map(str.rstrip, w2)):
print(f'{l1}\n{l2}', file=w3)
Instead of using .splitlines(), you can also iterate over the files directly. Here's the code:
wordlist1 = open("wordlist1.txt", "r")
wordlist2 = open("wordlist2.txt", "r")
wordlist3 = open("wordlist3.txt", "w")
for txt1,txt2 in zip(wordlist1, wordlist2):
if not txt1.endswith("\n"):
txt1+="\n"
wordlist3.write(txt1)
wordlist3.write(txt2)
wordlist1.close()
wordlist2.close()
wordlist3.close()
In the first block, we are opening the files. For the first two, we use "r", which stands for read, as we don't want to change anything to the files. We can omit this, as "r" is the default argument of the open function. For the second one, we use "w", which stands for write. If the file didn't exist yet, it will create a new file.
Next, we use the zip function in the for loop. It creates an iterator containing tuples from all iterables provided as arguments. In this loop, it will contain tuples containing each one line of wordlist1.txt and one of wordlist2.txt. These tuples are directly unpacked into the variables txt1 and txt2.
Next we use an if statement to check whether the line of wordlist1.txt ends with a newline. This might not be the case with the last line, so this needs to be checked. We don't check it with the second line, as it is no problem that the last line has no newline because it will also be at the end of the resulting file.
Next, we are writing the text to wordlist3.txt. This means that the text is appended to the end of the file. However, the text that was already in the file before the opening, is lost.
Finally, we close the files. This is very important to do, as otherwise some progress might not be saved and no other applications can use the file meanwhile.
Try this:
with open('wordlist1.txt', 'r') as f1:
f1_list = f1.read().splitlines()
with open('wordlist2.txt', 'r') as f2:
f2_list = f2.read().splitlines()
f3_list = [x for t in list(zip(f1, f2)) for x in t]
with open('wordlist3.txt', 'w') as f3:
f3.write("\n".join(f3_list))
with open('wordlist1.txt') as w1,\
open('wordlist2.txt') as w2,\
open('wordlist3.txt', 'w') as w3:
for wordlist1, wordlist2 in zip(w1.readlines(), w2.readlines()):
if wordlist1[-1] != '\n':
wordlist1 += '\n'
if wordlist2[-1] != '\n':
wordlist2 += '\n'
w3.write(wordlist1)
w3.write(wordlist2)
Here you go :)
with open('wordlist1.txt', 'r') as f:
file1 = f.readlines()
with open('wordlist2.txt', 'r') as f:
file2 = f.readlines()
with open('wordlist3.txt', 'w') as f:
for x in range(len(file1)):
if not file1[x].endswith('\n'):
file1[x] += '\n'
f.write(file1[x])
if not file2[x].endswith('\n'):
file2[x] += '\n'
f.write(file2[x])
Open wordlist 1 and 2 and make a line paring, separate each pair by a newline character then join all the pairs together and separated again by a newline.
# paths
wordlist1 = #
wordlist2 = #
wordlist3 = #
with open(wordlist1, 'r') as fd1, open(wordlist2, 'r') as fd2:
out = '\n'.join(f'{l1}\n{l2}' for l1, l2 in zip(fd1.read().split(), fd2.read().split()))
with open(wordlist3, 'w') as fd:
fd.write(out)

Python program to number rows

i have a file with data as such.
>1_DL_2021.1123
>2_DL_2021.1206
>3_DL_2021.1202
>3_DL_2021.1214
>4_DL_2021.1214
>4_DL_2021.1214
>6_DL_2021.1214
>7_DL_2021.1214
>8_DL_2021.1214
now as you can see the data is not numbered properly and hence needs to be numbered.
what im aiming for is this:
>1_DL_2021.1123
>2_DL_2021.1206
>3_DL_2021.1202
>4_DL_2021.1214
>5_DL_2021.1214
>6_DL_2021.1214
>7_DL_2021.1214
>8_DL_2021.1214
>9_DL_2021.1214
now the file has a lot of other stuff between these lines starting with > sign. i want only the > sign stuff affected.
could someone please help me out with this.
also there are 563 such lines so manually doing it is out of question.
So, assuming input data file is "input.txt"
You can achieve what you want with this
import re
with open("input.txt", "r") as f:
a = f.readlines()
regex = re.compile(r"^>\d+_DL_2021\.\d+\n$")
counter = 1
for i, line in enumerate(a):
if regex.match(line):
tokens = line.split("_")
tokens[0] = f">{counter}"
a[i] = "_".join(tokens)
counter += 1
with open("input.txt", "w") as f:
f.writelines(a)
So what it does it searches for line with the regex ^>\d+_DL_2021\.\d+\n$, then splits it by _ and gets the first (0th) element and rewrites it, then counts up by 1 and continues the same thing, after all it just writes updated strings back to "input.txt"
sudden_appearance already provided a good answer.
In case you don't like regex too much you can use this code instead:
new_lines = []
with open('test_file.txt', 'r') as f:
c = 1
for line in f:
if line[0] == '>':
after_dash = line.split('_',1)[1]
new_line = '>' + str(c) + '_' + after_dash
c += 1
new_lines.append(new_line)
else:
new_lines.append(line)
with open('test_file.txt', 'w') as f:
f.writelines(new_lines)
Also you can have a look at this split tutorial for more information about how to use split.

Find differences between a file without checking line by line. Python

I'm trying to check the differences between two output files which contain a mixture of IP Addresses and Subnets. These are stripped from a file and are stored on output1.txt and output2.txt. I'm struggling when doing a comparison. These files don't always have the same number of lines so comparing line by line doesn't seem an option. For example, both files could have IP address 192.168.1.1 but in output1.txt it could be on line 1 and in output2.txt it could be on line 60. How do I compare properly identifying which strings are not in both files?
Code below
import difflib
with open('input1.txt','r') as f:
with open('output1.txt', 'w') as g:
for line in f:
ipaddress = line.split(None, 1)[0]
g.write(ipaddress + "\n")
with open('input2.txt', 'r') as f:
with open('output2.txt', 'w') as g:
for line in f:
ipaddress = line.split(None, 1)[0]
g.write(ipaddress + "\n")
with open('output1.txt', 'r') as output1, open('output2.txt', 'r') as output2:
output1_text = output1.read()
output2_text = output2.read()
d = difflib.Differ()
diff = d.compare(output1_text, output2_text)
print(''.join(diff))
I will eventually want the differences written to a file, but for now just printing out the result is fine.
Appreciate the help.
Thanks.
You probably want a set comparison:
with open('output1.txt') as fh1, open('output2.txt') as fh2:
# collect lines into sets
set1, set2 = set(fh1), set(fh2)
diff = set1.symmetric_difference(set2)
print(''.join(diff))
Where symmetric_difference will:
Return a new set with elements in either the set or other but not both.

Split File when string is matches exactly

I have a huge text file that I need to split based on matching a 'EKYC' only value. However, when other values with similar pattern show up my script fails.
I am new in Python and it is wearing me out.
import sys;
import os;
MASTER_TEXT_FILE=sys.argv[1];
OUTPUT_FILE=sys.argv[2];
L = file(MASTER_TEXT_FILE, "r").read().strip().split("EKYC")
i = 0
for l in L:
i = i + 1
f = file(OUTPUT_FILE+"-%d.ekyc" % i , "w")
print >>f, "EKYC" + l
The script breaks when there is EKYCSMRT or EKYCVDA or EKYCTIGO then how can I make the guard to prevent the splitting to occur before the point.
This is the content of all of the messages
EKYC
WIK 12
EKYC
WIK 12
EKYCTIGO
EKYC
WIK 13
TTL
EKYCVD
EKYC
WIK 14
TTL D
Thanks for the assistance.
If possible, you should avoid reading large files into memory all at once. Instead, stream chunks of them at a time.
The sensible chunks of text files are usually lines. This can be done with .readline(), but simply iterating over the file yields its lines too.
After reading a line (which includes the newline), you can .write() it directly to the current output file.
import sys
master_filename = sys.argv[1]
output_filebase = sys.argv[2]
output = None
output_number = 0
for line in open(master_filename):
if line.strip() == 'EKYC':
if output is not None:
output.close()
output = None
else:
if output is None:
output_number += 1
output_filename = '%s-%d.ekyc' % (output_filebase, output_number)
output = open(output_filename, 'w')
output.write(line)
if output is not None:
output.close()
The output file is closed and reset upon encountering 'EKYC' on its own line.
Here, you'll notice that the output file isn't (re)opened until right before there is a line to write to it: this avoids creating an empty output file in case there are no further lines to write to it. You'll have to re-order this slightly if you want the 'EKYC' line to appear in the output file also.
Based on your sample input file, you need to: split('\nEKYC\n')
#!/usr/bin/env python
import sys
MASTER_TEXT_FILE = sys.argv[1]
OUTPUT_FILE = sys.argv[2]
with open(MASTER_TEXT_FILE) as f:
fdata = f.read()
i = 0
for subset in fdata.split('\nEKYC\n'):
i += 1
with open(OUTPUT_FILE+"-%d.ekyc" % i, 'w') as output:
output.write(subset)
Other comments:
Python doesn't use ;.
Your original code wasn't using os.
It's recommended to use with open(<filename>, <mode>) as f: ... since it handles possible errors and closes the file afterward.

Python adding text at the line

I have a file which contains following row:
//hva_SaastonJakaumanMuutos/printData/reallocationAssignment/changeUser/firstName>
I want to add "John" at the end of line.
I have written following code but for some reason it is not working,
def add_text_to_file(self, file, rowTitle, inputText):
f = open("check_files/"+file+".txt", "r")
fileList = list(f)
f.close()
j = 0
for row in fileList :
if fileList[j].find(rowTitle) > 0 :
fileList[j]=fileList[j].replace("\n","")+inputText+"\n"
break
j = j+1
f = open("check_files/"+file+".txt", "w")
f.writelines(fileList)
f.close()
Do you see where am I doing wrong?
str.find may return 0 if the text you are searching is found at the beginning. After all, it returns the index the match begins.
So your condition should be:
if fileList[j].find(rowTitle) >= 0 :
Edit:
The correction above would save the day but it's better if you things the right way, the pythonic way.
If you are looking for a substring in a text, you can use the foo in bar comparison. It will be True if foo can be found in bar and False otherwise.
You rarely need a counter in Python. enumerate built-in is your friend here.
You can combine the iteration and writing and eliminate an unnecessary step.
strip or rstrip is better than replace in your case.
For Python 2.6+, it is better to use with statement when dealing with files. It will deal with the closing of the file right way. For Python 2.5, you need from __future__ import with_statement
Refer to PEP8 for commonly preferred naming conventions.
Here is a cleaned up version:
def add_text_to_file(self, file, row_title, input_text):
with open("check_files/" + file + ".txt", "r") as infile:
file_list = infile.readlines()
with open("check_files/" + file + ".txt", "w") as outfile:
for row in file_list:
if row_title in row:
row = row.rstrip() + input_text + "\n"
outfile.write(row)
You are not giving much informations, so even thoug I wouldn't use the following code (because I'm sure there are better ways) it might help to clear your problem.
import os.path
def add_text_to_file(self, filename, row_title, input_text):
# filename should have the .txt extension in it
filepath = os.path.join("check_files", filename)
with open(filepath, "r") as f:
content = f.readlines()
for j in len(content):
if row_title in content[j]:
content[j] = content[j].strip() + input_text + "\n"
break
with open(filepath, "w") as f:
f.writelines(content)

Categories

Resources