searching content of one file in another file : python

I am trying to search for names from file 1 in file 2 and merge some data on the matched lines.
file1:
A 28 sep 1980
B 28 jan 1985
C 25 feb 1990
D 27 march 1995
and file2
A hyd
B alig
C slg
D raj
Using this:
import sys
data1 = open(sys.argv[1]).read().rstrip('\n')
data2 = open(sys.argv[2]).read().rstrip('\n')
list1 = data1.split('\n')
list2 = data2.split('\n')
for line in list1:
    for item in list2:
        if line.split('\t')[0] in item.split('\t')[0]:
            print(item,'\t',line.split('\t')[3])
Result:
A hyd 1980
B alig 1985
C slg 1990
D raj 1995
Two questions (for clarifying the concept):
1 - I expected that changing the order of lines in file2 would give me fewer matches, but I still get all the matches. Why?
2 - Although this program serves the purpose, how memory-efficient is it expected to be? Please suggest improvements.
Thanks

1 - I expected that changing the order of lines in file2 would give me fewer matches, but I still get all the matches. Why?
Your program compares every line of file1 with every line of file2 (a full cross-join), so the order of the lines cannot change which matches are found.
2 - Although this program serves the purpose, how memory-efficient is it expected to be? Please suggest improvements.
Poor: it reads both files fully into memory. Read only the shorter file into memory and iterate over the lines of the longer one once.
with open('bigfile.txt', 'r') as bigfile:
    for bigline in bigfile:
        for littleline in littlefiledata:
            ...
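The inner loop can be avoided as well by indexing the smaller file by name in a dictionary and looking up each line of the larger file once. The following is a minimal sketch, assuming tab-separated fields as in the question; merge_by_name is a hypothetical helper name, and the in-memory lists stand in for the two files.

```python
def merge_by_name(city_lines, date_lines):
    """Join two inputs on their first tab-separated field (hypothetical helper)."""
    # Index the smaller input: name -> rest of the line (the city here).
    cities = {}
    for line in city_lines:
        name, city = line.rstrip('\n').split('\t', 1)
        cities[name] = city

    merged = []
    # Stream the larger input once; each dict lookup takes constant time on average.
    for line in date_lines:
        fields = line.rstrip('\n').split('\t')
        name, year = fields[0], fields[-1]
        if name in cities:
            merged.append(f"{name}\t{cities[name]}\t{year}")
    return merged


# Stand-ins for file2 (name, city) and file1 (name, day, month, year).
file2 = ["A\thyd\n", "B\talig\n"]
file1 = ["B\t28\tjan\t1985\n", "A\t28\tsep\t1980\n", "Z\t1\tjan\t2000\n"]
for row in merge_by_name(file2, file1):
    print(row)
```

With real files, the two lists can be replaced by open file objects, so only the dictionary built from the smaller file is held in memory.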

Related

How to print out nicely formatted tables from a dictionary

For the sake of practicing how to be more comfortable and fluent in working with dictionaries, I have written a little program that reads the content of a file and adds it to a dictionary as a key: value pair. This is no problem, but when I got curious about how to print the content out again in the same format as the table in the datafile using for-loops, I ran into trouble.
My question is: How can I print out the content of the dictionary onto the terminal using for-loops?
The datafile is:
Name Age School
Anne 10 Eiksmarka
Tom 15 Marienlyst
Vidar 18 Persbråten
Ayla 18 Kongshavn
Johanne 17 Wang
Silje 16 Eikeli
Per 19 UiO
Ali 25 NTNU
My code is:
infile = open("table.dat", "r")
data = {}
headers = infile.readline().split()
for i in range(len(headers)):
    data[headers[i]] = []
for line in infile:
    words = line.split()
    for i in range(len(headers)):
        data[headers[i]].append(words[i])
infile.close()
I would like to print the data back onto the terminal. Ideally, the printout should look something like this:
Name Age School
Anne 10 Eiksmarka
Tom 15 Marienlyst
Vidar 18 Persbråten
Ayla 18 Kongshavn
Johanne 17 Wang
Silje 16 Eikeli
Per 19 UiO
Ali 25 NTNU
If someone can help me with this, I would be grateful.

The easiest solution is to use a library such as tabulate. Here is an example of its output (you can customize it further):
>>> from tabulate import tabulate
>>> table = [["Sun",696000,1989100000],["Earth",6371,5973.6],
... ["Moon",1737,73.5],["Mars",3390,641.85]]
>>> print(tabulate(table))
----- ------ -------------
Sun 696000 1.9891e+09
Earth 6371 5973.6
Moon 1737 73.5
Mars 3390 641.85
----- ------ -------------
Otherwise, if you MUST use your own custom for-loop, you can add tabs to line up the columns, as in:
print(a+"\t"), where \t is the horizontal tab escape character
Edit: An example of how this can be utilized is below:
infile = open("table.dat", "r")
data = {}
headers = infile.readline().split()
for i in range(len(headers)):
    data[headers[i]] = []
for line in infile:
    words = line.split()
    for i in range(len(headers)):
        data[headers[i]].append(words[i])
        print(words[i],end= '\t')
    print()
infile.close()
Things to note:
1- For each field, we use print(...,end= '\t'), which ends the output with a tab instead of a newline. We might also consider adding more tabs (e.g. end='\t\t') or spaces, or other formatting such as a separator character (e.g. end='\t|\t').
2- After each line, we use print(), this will only print a new line, moving the cursor for the printing downwards.

Take a look at the .ljust, .rjust and .center methods of str. Consider the following simple example:
d = {"Alpha": 1, "Beta": 10, "Gamma": 100, "ExcessivelyLongName": 1}
for key, value in d.items():
    print(key.ljust(5), str(value).rjust(3))
output
Alpha   1
Beta   10
Gamma 100
ExcessivelyLongName   1
Note that ljust pads with spaces (by default) to reach the specified width, and does nothing if the name is already longer than that. Also, since the values are integers, they must first be converted to str before you can use any of these methods.
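To avoid the misalignment caused by entries longer than a fixed width, the width of each column can be computed from the data first. The following is a minimal sketch that applies ljust to a dictionary of columns shaped like the one built in the question; format_table is a hypothetical helper name, and the two-row dictionary is sample data only.

```python
def format_table(data):
    """Return a dict of columns as aligned text lines (hypothetical helper)."""
    headers = list(data)
    # Width of each column: the longest entry, header included.
    widths = {h: max([len(h)] + [len(v) for v in data[h]]) for h in headers}

    lines = [' '.join(h.ljust(widths[h]) for h in headers).rstrip()]
    # zip(*columns) turns the per-column lists back into rows.
    for row in zip(*(data[h] for h in headers)):
        cells = (cell.ljust(widths[h]) for h, cell in zip(headers, row))
        lines.append(' '.join(cells).rstrip())
    return lines


data = {"Name": ["Anne", "Tom"], "Age": ["10", "15"],
        "School": ["Eiksmarka", "Marienlyst"]}
for line in format_table(data):
    print(line)
```

This relies on dictionaries preserving insertion order (Python 3.7 and later), so the columns come out in the order of the header line.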

You can do this using pandas, although the styling isn't exactly the same as yours:
import pandas as pd
with open('filename.csv') as f:
    headers, *data = map(str.split, f.readlines())
df = pd.DataFrame(dict(zip(headers, zip(*data))))
print(df.to_string(index=False))
Name Age School
Anne 10 Eiksmarka
Tom 15 Marienlyst
Vidar 18 Persbråten
Ayla 18 Kongshavn
Johanne 17 Wang
Silje 16 Eikeli
Per 19 UiO
Ali 25 NTNU

Computational tractability of algorithm for matching names in two files in python

So I have two .txt files that I'm trying to match up. The first .txt file is just lines of about 12,500 names.
John Smith
Jane Smith
Joe Smith
The second .txt file also contains lines with names (that might repeat) but also extra info, about 17GB total.
584 19423 john smith John Smith 79946792 5 5 11 2016-06-24
584 19434 john smith John Smith 79923732 5 4 11 2018-03-14
584 19423 jane smith Jane Smith 79946792 5 5 11 2016-06-24
My goal is to find all the names from File 1 in File 2, and then spit out the File 2 lines that contain any of those File 1 names.
Here is my python code:
with open("Documents/File1.txt", "r") as t:
    terms = [x.rstrip('\n') for x in t]
with open("Documents/File2.txt", "r") as f, open("Documents/matched.txt","w") as w:
    for line in f:
        if any([term in line for term in terms]):
            w.write(line)
So this code definitely works, but it has been running for 3 days (and is still going!!!). I did some back-of-the-envelope calculations, and I'm very worried that my algorithm is computationally intractable (or hyper inefficient) given the size of the data.
Would anyone be able to provide feedback re: (1) whether this is actually intractable and/or extremely inefficient and if so (2) what an alternative algorithm might be?
Thank you!!

First, when testing membership, set and dict are going to be much, much faster, so terms should be a set:
with open("Documents/File1.txt", "r") as t:
    terms = set(line.strip() for line in t)
Next, I would split each line into a list and check whether the name is in the set, rather than checking whether each member of the set is in the line. The latter is a substring search, O(N) in the length of the line, and it is repeated for every one of the roughly 12,500 terms. Splitting also lets you directly pick out the columns (via slicing) that contain the first and last name:
with open("Documents/File2.txt", "r") as f, open("Documents/matched.txt","w") as w:
    for line in f:
        # split the line on whitespace
        names = line.split()
        # Your names seem to occur here
        name = ' '.join(names[4:6])
        if name in terms:
            w.write(line)

Appending the length of sentences to file

I found the length and index of each sentence, and I want to save all of them to a new file.
Example: index sentence length
My code:
file = open("testing_for_tools.txt", "r")
lines_ = file.readlines()
for line in lines_:
    lenght=len(line)-1
    print(lenght)
for item in lines_:
    print(lines_.index(item)+1,item)
output:
64
18
31
31
23
36
21
9
1
1 i went to city center, and i bought xbox5 , and some other stuff
2 i will go to gym !
3 tomorrow i, sill start my diet!
4 i achive some and i need more ?
5 i lost lots of weights؟
6 i have to , g,o home,, then sleep ؟
7 i have things to do )
8 i hope so
9 o
desired output and save to new file :
1 i went to city center, and i bought xbox5 , and some other stuff 64
2 i will go to gym ! 18

This can be achieved using the following code. Note the use of with ... as f, which means we don't have to worry about closing the file after using it. In addition, I've used f-strings (requires Python 3.6 or later) and enumerate to get the line number, and concatenated everything into one string, which is written to the output file.
with open("test.txt", "r") as f:
    lines_ = f.readlines()
with open("out.txt", "w") as f:
    for i, line in enumerate(lines_, start=1):
        line = line.strip()
        f.write(f"{i} {line} {len(line)}\n")
Output:
1 i went to city center, and i bought xbox5 , and some other stuff 64
2 i will go to gym ! 18
If you wanted to sort the lines based on length, you could just put the following line after the first with block:
lines_.sort(key=len)
This would then give output:
1 i will go to gym ! 18
2 i went to city center, and i bought xbox5 , and some other stuff 64

Text files help(Python 3)

Students.txt
64 Mary Ryan
89 Michael Murphy
22 Pepe
78 Jenny Smith
57 Patrick James McMahon
89 John Kelly
22 Pepe
74 John C. Reilly
My code
f = open("students.txt","r")
for line in f:
    words = line.strip().split()
    mark = (words[0])
    name = " ".join(words[1:])
    for i in (mark):
        print(i)
The output I'm getting is
6
4
8
9
2
2
7
8
etc...
My expected output is
64
80
22
78
etc..
Just curious to know how I would print the whole integer, not just a single digit at a time.
Any help would be much appreciated.

As I understand it, each line of the text file starts with an integer followed by a name, and you want your code to print only the whole integer.
You can use this code:
f = open("Students.txt","r")
for line in f:
    l = line.split(" ")
    print(l[0])

In Python, when you do this:
for i in (mark):
    print(i)
and mark is of type string, you are asking Python to iterate over each character in the string. So, if your string contains an integer such as 64 and you iterate over the string, you'll get one character (here, one digit) at a time.
I believe in your code the line
mark = (words[0])name = " ".join(words[1:])
is a typo. If you fix that we can help you with what's missing (it's most likely a statement like mark = something.split(), but not sure what something is based on the code).
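The character-by-character behaviour described above can be seen in a small sketch. The value "64" is sample data taken from the first line of the question's file; the variable names chars, same and whole are illustrative only.

```python
mark = "64"

# Iterating over a string yields its characters one by one.
chars = [ch for ch in mark]

# Parentheses alone do not make a tuple, so this is still the string "64".
same = (mark)

# Converting the string first keeps the number whole.
whole = int(mark)

print(chars)  # the separate digits
print(same)   # the unchanged string
print(whole)  # the integer value
```

Printing mark directly inside the loop over the file's lines, without the inner for-loop, is therefore enough to show the whole number.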

You should be using context managers when you open files so that they are automatically closed for you when the scope ends. Also mark should be a list to which you append the first element of the line split. All together it will look like this:
with open("students.txt","r") as f:
    mark = []
    for line in f:
        mark.append(line.strip().split()[0])
for i in mark:
    print(i)

The line
for i in (mark):
is the same as this, because parentheses alone do not create a tuple and mark is still a string:
for i in mark:
I believe you want mark to be an element of an iterable. You can create a single-item tuple with a trailing comma:
for i in (mark,):
and this should give you what you want.

In your line:
line.strip().split()
split() with no argument splits on any run of whitespace. To split on single spaces explicitly, try the following:
str(line).strip().split(" ")

A quick one with list comprehensions:
with open("students.txt","r") as f:
    mark = [line.strip().split()[0] for line in f]
for i in mark:
    print(i)

Python - Parsing Conundrum

I have searched high and low for a resolution to this situation, and tested a few different methods, but I haven't had any luck thus far. Basically, I have a file with data in the following format that I need to convert into a CSV:
(previously known as CyberWay Pte Ltd)
0 2019
01.com
0 1975
1 TRAVEL.COM
0 228
1&1 Internet
97 606
1&1 Internet AG
0 1347
1-800-HOSTING
0 8
1Velocity
0 28
1st Class Internet Solutions
0 375
2iC Systems
0 192
I've tried using re.sub and replacing the whitespace between the numbers on every other line with a comma, but haven't had any success so far. I admit that I normally parse from CSVs, so raw text has been a bit of a challenge for me. I would need to maintain the string formats that are above each respective set of numbers.
I'd prefer the CSV to be formatted as such:
foo bar
0,8
foo bar
0,9
foo bar
0,10
foo bar
0,11
There's about 50,000 entries, so manually editing this would take an obscene amount of time.
If anyone has any suggestions, I'd be most grateful.
Thank you very much.

If you just want to replace whitespace with comma, you can just do:
line = ','.join(line.split())
You'll have to do this only on every other line, but from your question it sounds like you already figured out how to work with every other line.
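Applying that join to every other line can be sketched as follows. This assumes the input alternates strictly between a name line and a numbers line, as in the sample; commas_on_number_lines is a hypothetical helper name, and the list stands in for the lines of the file.

```python
def commas_on_number_lines(lines):
    """Comma-join every second line, leaving name lines as they are (hypothetical helper)."""
    result = []
    for index, line in enumerate(lines):
        line = line.strip()
        if index % 2 == 1:
            # Odd indexes (the 2nd, 4th, ... line) hold the numbers.
            line = ','.join(line.split())
        result.append(line)
    return result


sample = ["1&1 Internet\n", "97 606\n", "1Velocity\n", "0 28\n"]
for line in commas_on_number_lines(sample):
    print(line)
```

Because the name lines are never split, names that contain spaces are kept intact.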

If I have correctly understood your requirement, you need a strip() on all lines and a split based on whitespace on every second line (counting lines from 1):
import re
fp = open("csv.txt", "r")
while True:
    # read the name line; an empty string means end of file
    line = fp.readline()
    if '' == line:
        break
    line = line.strip()
    # read the numbers line that follows and split it on whitespace
    fields = re.split(r"\s+", fp.readline().strip())
    print('"%s",%s,%s' % (line, fields[0], fields[1]))
fp.close()
The output is a CSV (you might need to escape quotes if they occur in your input):
"Content of odd line",Number1,Number2
I do not understand the 'foo,bar' you place as header on your example's odd lines, though.
