Comparing CSV files in Python to see what is in both

I have 2 CSV files that I want to compare, one of which is a master file of all the countries, and another that has only a few countries. This is an attempt I made for some rudimentary testing:

chars = {}
with open('all.csv', 'rb') as lookupfile:
    for number, line in enumerate(lookupfile):
        chars[line.strip()] = number

with open('locations.csv') as textfile:
    text = textfile.read()
    print(text)
    for char in text:
        if char in chars:
            print("Country found {0} found in row {1}".format(char, chars[char]))
I am trying to get a final output of the master file of countries with a secondary column indicating whether each one came up in the other list.
Thanks!

Try this:
Write a function to turn the CSV into a Python dictionary whose keys are each of the countries you found in the CSV. It can look like this:
{'US':True, 'UK':True}
Do this for both CSV files.
Now, iterate over the dictionary.keys() for the csv you're comparing against, and just check to see if the other dictionary has the same key.
This will be an extremely fast algorithm because dictionaries give us constant time lookup, and you have a data structure which you can easily use to see which countries you found.
As Eric mentioned in the comments, you can also use set membership to handle this. This may actually be the simpler, better way to do it:
set1 = set()  # a new empty set
set1.add("country")
if "country" in set1:
    # do something
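For the final output you describe (the master list plus a column saying whether each country showed up in the other file), here is a minimal sketch of the set approach, assuming both files hold the country name in their first column and that result.csv is just a placeholder output name:

import csv

# Collect the countries that appear in the smaller file.
with open('locations.csv', newline='') as f:
    found_countries = {row[0].strip() for row in csv.reader(f) if row}

# Walk the master file and write it back out with an extra "found" column.
with open('all.csv', newline='') as f, open('result.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(f):
        if not row:
            continue
        writer.writerow(row + ['yes' if row[0].strip() in found_countries else 'no'])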

You could use exactly the same logic as the original loop, just iterating over lines instead of characters:

with open('locations.csv') as textfile:
    for line in textfile:
        if line.strip() in chars:
            print("Country found {0} found in row {1}".format(line.strip(), chars[line.strip()]))

Related

How to get an unknown substring between two known substrings, within a giant string/file

I'm trying to get all the substrings under a "customLabel" tag, for example "Month" inside of ...,"customLabel":"Month"},"schema":"metric...
Unusual issue: this is a 1071552-character ndjson file consisting of a single line (so "for line in file:" is pointless, since there's only one).
The best I found was this:
How to find a substring of text with a known starting point but unknown ending point in python
but if I use this, the result obviously doesn't stop (at Month) and keeps writing the whole rest of the file, same as if using partition()[2].
Just know that Month is only an example; customLabel has about 300 variants and they are not listed (I'm actually doing this to list them...)
To give some details here's my script so far:
with open("file.ndjson","rt", encoding='utf-8') as ndjson:
filedata = ndjson.read()
x="customLabel"
count=filedata.count(x)
for i in range (count):
if filedata.find(x)>0:
print("Found "+str(i+1))
So right now it properly tells me how many occurrences of customLabel there are. Instead, I'd like to get the substring that comes after customLabel":" (Month in the example) and put them all in a list, to locate them much more easily and enable the use of replace() for translations later on.
I'd guess regexes are the solution, but I'm pretty new to them, so I'll post that question once I've learned more about them...
If you want to search for all (even nested) customLabel values like this:
{"customLabel":"Month" , "otherJson" : {"customLabel" : 23525235}}
you can use RegEx patterns with the re module
import re

label_values = []
regex_pattern = r"\"customLabel\"[ ]?:[ ]?([0-9a-zA-Z\"]+)"

with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
    for line in ndjson:
        values = re.findall(regex_pattern, line)
        label_values.extend(values)

print(label_values)  # ['"Month"', '23525235']

# If you don't want the items to have quotation marks
label_values = [i.replace('"', "") for i in label_values]
print(label_values)  # ['Month', '23525235']
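Since you mention roughly 300 distinct labels, you may also want to deduplicate the collected values; a small follow-up sketch:

unique_labels = sorted(set(label_values))
print(unique_labels)  # e.g. ['23525235', 'Month']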
Note: If you're only talking about ndjson files and not nested searching, then it'd be better to use the json module to parse the lines and then easily get the value of your specific key which is customLabel.
import json

label = "customLabel"
label_values = []

with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
    for line in ndjson:
        line_json = json.loads(line)
        if line_json.get(label) is not None:
            label_values.append(line_json.get(label))

print(label_values)  # ['Month']
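Since the question says the whole file is one giant line with nested objects, the line-by-line approach above will only see top-level keys. A sketch that instead parses the file once and walks the resulting structure recursively, collecting every customLabel value (assuming the file content is valid JSON):

import json

def collect_values(obj, key, out):
    # Recursively collect every value stored under `key` in nested dicts/lists.
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                out.append(v)
            collect_values(v, key, out)
    elif isinstance(obj, list):
        for item in obj:
            collect_values(item, key, out)

with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
    data = json.loads(ndjson.read())

label_values = []
collect_values(data, "customLabel", label_values)
print(label_values)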

Python fast way to filter lines based on a list of keywords

I have a text file of many thousand lines, each of which contains some string that at some position contains a unique identifier, and a list of identifiers I want to filter for.
I want to extract all lines from this file which contain any identifier from my filter list. Currently I am solving this with two nested loops:
found = []
for identifier in ids:
    with open("file.txt", 'r') as f:
        for line in f.readlines():
            if identifier in line:
                found.append(line)
This however is extremely slow, as I run two nested loops and both the identifier list and the text file are huge. Is there a smart, more performant way in Python to solve this in less than O(n^2)?
Further info & constraints:
Any line may contain only one or no identifiers from my list
I can't sort the file based on the identifiers, as they do not necessarily have a form that can be hierarchically structured
Reordering the code should speed it up, such that you read the text file only once.
f = open("demofile.txt", "r")
mylines = f.readlines()
found = []
for line in mylines:
for identifier in ids:
if identifier in line:
found.append(line)
It's not clear how each line is composed, but if it can be tokenized cleanly, you could make the identifier collection a set, then check whether the line's identifier is in the set.
A set is a hash set, so lookup is O(1), and the entire thing runs in O(n).
Using readlines also seems unnecessary; iterate over the file lazily.
ids = set(ids)
with open('demofile.txt', 'r', encoding='utf-8') as f:
    found = [
        line
        for line in f
        if get_id(line) in ids
    ]
You just need to provide get_id, which slices out "a unique identifier" "at some position".
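As an illustration only (the real slicing depends on how your lines are composed), a hypothetical get_id that assumes the identifier is the second whitespace-separated field:

def get_id(line):
    # Hypothetical: the identifier is assumed to be the second whitespace-separated field.
    parts = line.split()
    return parts[1] if len(parts) > 1 else None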
I would recommend using https://pypi.org/project/triegex/ to build a regular expression that matches any of your identifiers and does a minimum of backtracking.
And now you just have a single loop.
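If you'd rather not add a dependency, a rough standard-library substitute is to compile one alternation of escaped identifiers with the re module (the ids list here is a placeholder). It won't do the backtracking-minimising trie construction that triegex does, but it still scans the file in a single pass:

import re

ids = ["id1", "id2", "id3"]  # placeholder identifiers
pattern = re.compile("|".join(re.escape(identifier) for identifier in ids))

with open("file.txt", "r") as f:
    found = [line for line in f if pattern.search(line)]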
Your code will repeat some lines if a line contains more than one id.
So here we will try a different technique. I am not sure whether it will be faster or not, but it guarantees no repeated lines, and it is based on a strong foundation.
Let us try using pandas:
import pandas as pd

ids = ['id1', 'id2', 'id3']  # list of string ids
df = pd.read_csv('file.txt', sep="`", names=['txt'])
found = list(df.txt.loc[df.txt.apply(lambda x: sum((id in x) for id in ids) > 0)])
Some clarification:
df = pd.read_csv('file.txt', sep="`", names=['txt']) loads the file into a DataFrame. I used ` as the separator so that the whole line is treated as one column, since ` is rarely used in normal text lines.
(id in x) for id in ids generates True and False values based on how many of the ids exist in each line (row), and the outer sum adds up the True values.
if we have a file.txt
fwpep poweripoi id1 ewlfdfkd f p[woer[pwe dlkdfwero0iopwiperew
we;rioepo ,r rtoipweorit ,rt rtopipowerit
werert.rtrtid1eyri id1 id2 pid2oerit poier tpoerit eropitpo
so the found content will be:
Out[1]:found
['fwpep poweripoi id1 ewlfdfkd f p[woer[pwe dlkdfwero0iopwiperew',
'werert.rtrtid1eyri id1 id2 pid2oerit poier tpoerit eropitpo']
meanwhile, your original code's output will be
Out[1]:found
['fwpep poweripoi id1 ewlfdfkd f p[woer[pwe dlkdfwero0iopwiperew',
'werert.rtrtid1eyri id1 id2 pid2oerit poier tpoerit eropitpo',
'werert.rtrtid1eyri id1 id2 pid2oerit poier tpoerit eropitpo']

Problems with using a for loop with a string from a text file to locate an index in a list of lists

I have a text file with a string that has a letter (beginning with "A") assigned to a random country. I import that line from the text file to use with my code, where I have a list of countries and rates. I strip the string so that I am left with the country, and then I want to locate that country in a list of lists that I created. The problem is that when I run my for loop to find the country from the string in the list of lists, where each junior list has the name of a country, its GDP and a rate, the loop runs and can't find the country, even though they are the same type and have the same spelling. Let me post my code and output below.
When I created the txt file or csv file, this is what I used:
f = open("otrasvariables2020.txt", "w")
f.write(str(mis_letras_paises) + "\n")
f.write(str(mis_paises) + "\n") #(This is the string I need)
f.write(str(mis_poblaciones) + "\n")
f.close() #to be ready to use it later
Let me post some of the output.
import linecache

with open("otrasvariables2020.txt") as otras_variables:
    mis_paises = linecache.getline("otrasvariables2020.txt", 2)

# Here I get the line of text I need, I clean the string and create a
# list with 5 countries.
lista_mis_paises = mis_paises.translate({ord(i): None for i in "[]-\'"}).split(", ")

for i in lista_mis_paises:
    if "\n" in i:
        print(i)
        i.replace("\n", "")

for i in lista_mis_paises:
    if len(i) <= 2:
        lista_mis_paises.pop(lista_mis_paises.index(i))
Final part of the question: So, ultimately what I want is to find in the array the junior list for each country in the list/string I imported from the text file. Once I locate that junior list I can use the rates and other values there for calculations I need to do. Any ideas what's wrong? The outcome should be the following: Afganistán and 4 other countries should be found in the list of lists; Afganistán happens to be the 1st item, so I should then be able to create another list of lists but with just the 5 countries instead of the 185 countries I began with.
If your concern is stripping special characters you don't want, I'd do something like this:
countries = linecache.getline("otrasvariables2020.txt",2).strip('[]-\'"').rstrip('\n').split(', ')
Note: with open("otrasvariables2020.txt") as otras_variables: is not used in the code you shared above, so can be removed.
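To then pick out the junior lists whose country appears in that cleaned list, a small sketch (lista_de_listas is a placeholder for your 185-entry list of [country, GDP, rate] lists, and the numbers are made up):

lista_de_listas = [["Afganistán", 19.29, 2.5], ["Albania", 15.28, 1.4]]  # placeholder data
seleccionadas = [fila for fila in lista_de_listas if fila[0] in countries]
print(seleccionadas)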
Hope it helps.

How to parse very big files in python?

I have a very big TSV file: 1.5 GB. I want to parse this file. I'm using the following function:
def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = {}
    for row in eval:
        ids = row.split("\t")
        if ids[0] not in evalIDs.keys():
            evalIDs[ids[0]] = []
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
It has been running for more than 10 hours and is still going. I don't know how to accelerate this step or whether there is another method to parse such a file.
several issues here:
testing for keys with if ids[0] not in evalIDs.keys() takes forever in Python 2, because keys() is a list. .keys() is rarely useful anyway. A better way is if ids[0] not in evalIDs, but...
why not use a collections.defaultdict instead?
why not use csv module?
shadowing the eval built-in (well, not really an issue, seeing how dangerous eval is)
my proposal:
import csv, collections

def readEvalFileAsDictInverse(evalFile):
    with open(evalFile, "r") as handle:
        evalIDs = collections.defaultdict(list)
        cr = csv.reader(handle, delimiter='\t')
        for ids in cr:
            evalIDs[ids[0]].append(ids[1])
        return evalIDs

The magic is that evalIDs[ids[0]].append(ids[1]) creates a list if one doesn't already exist. It's also portable and very fast whatever the Python version, and it saves an if.
I don't think it could be faster with default libraries, but a pandas solution probably would.
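A rough pandas sketch of that idea, assuming the file has exactly two tab-separated columns and no header row:

import pandas as pd

def read_eval_file_with_pandas(eval_file):
    # Read the two tab-separated columns; header=None because the file has no header row.
    df = pd.read_csv(eval_file, sep="\t", header=None, names=["key", "value"], dtype=str)
    # Group the values by key and return a dict of lists, like the original function.
    return df.groupby("key")["value"].apply(list).to_dict()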
Some suggestions:
Use a defaultdict(list) instead of creating inner lists yourself or using dict.setdefault().
dict.setdefault() will create the default value every time; that's a time burner. defaultdict(list) does not; it is optimized:
from collections import defaultdict

def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = defaultdict(list)
    for row in eval:
        ids = row.split("\t")
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
If your keys are valid file names, you might want to investigate awk for much more performance than doing this in Python.
Something along the lines of
awk -F $'\t' '{print > $1}' file1
will create your split files much faster, and you can simply use the latter part of the following code to read from each file (assuming your keys are valid filenames) to construct your lists. (Attribution: here.) You would need to grab your created files with os.walk or similar means. Each line inside the files will still be tab-separated and contain the ID in front.
If your keys are not filenames in their own right, consider storing your different lines in different files and only keeping a dictionary mapping key to filename around.
After splitting the data, load the files as lists again:
Create testfile:
with open ("file.txt","w") as w:
w.write("""
1\ttata\ti
2\tyipp\ti
3\turks\ti
1\tTTtata\ti
2\tYYyipp\ti
3\tUUurks\ti
1\ttttttttata\ti
2\tyyyyyyyipp\ti
3\tuuuuuuurks\ti
""")
Code:
# f.e. https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
def make_filename(k):
    """In case your keys contain non-filename characters, make the key a valid name."""
    return k  # assuming k is a valid file name, else modify it

evalFile = "file.txt"
files = {}

with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key, value, *rest = line.split("\t")  # omit ,*rest if you only have 2 values
        fn = files.setdefault(key, make_filename(key))
        # This will open and close files _a lot_; you might want to keep file handles
        # in your dict instead - but that depends on the key/data/lines ratio in
        # your data. If you have few keys, file handles ought to be better; if you
        # have many, it does not matter.
        with open(fn, "a") as f:
            f.write(value + "\n")

# create your list data from your files:
data = {}
for key, fn in files.items():
    with open(fn) as r:
        data[key] = [x.strip() for x in r]

print(data)
Output:
# for my data: loaded from files called '1', '2' and '3'
{'1': ['tata', 'TTtata', 'tttttttata'],
'2': ['yipp', 'YYyipp', 'yyyyyyyipp'],
'3': ['urks', 'UUurks', 'uuuuuuurks']}
Change evalIDs to a collections.defaultdict(list). You can avoid the if to check if a key is there.
Consider splitting the file externally using split(1), or even inside Python using a read offset. Then use multiprocessing.Pool to parallelise the loading.
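A rough sketch of that parallel idea, assuming the big file has already been split into chunk files of whole lines (the chunk file names below are placeholders):

from collections import defaultdict
from multiprocessing import Pool

def parse_chunk(path):
    # Parse one chunk file into a dict of key -> list of values.
    result = defaultdict(list)
    with open(path, "r") as handle:
        for row in handle:
            ids = row.rstrip("\n").split("\t")
            if len(ids) >= 2:
                result[ids[0]].append(ids[1])
    return result

if __name__ == "__main__":
    chunk_files = ["chunk_aa", "chunk_ab", "chunk_ac"]  # placeholder names from split(1)
    merged = defaultdict(list)
    with Pool() as pool:
        for partial in pool.map(parse_chunk, chunk_files):
            for key, values in partial.items():
                merged[key].extend(values)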
Maybe you can make it somewhat faster; change this:

if ids[0] not in evalIDs.keys():
    evalIDs[ids[0]] = []
evalIDs[ids[0]].append(ids[1])
to
evalIDs.setdefault(ids[0],[]).append(ids[1])
The first solution searches the evalIDs dictionary three times.

A solution to remove the duplicates?

My code is below. Basically, I've got a CSV file and a text file "input.txt". I'm trying to create a Python application which will take the input from "input.txt" and search through the CSV file for a match and if a match is found, then it should return the first column of the CSV file.
import csv

csv_file = csv.reader(open('some_csv_file.csv', 'r'), delimiter=",")
header = csv_file.next()
data = list(csv_file)

input_file = open("input.txt", "r")
lines = input_file.readlines()

for row in lines:
    inputs = row.strip().split(" ")
    for input in inputs:
        input = input.lower()
        for row in data:
            if any(input in terms.lower() for terms in row):
                print row[0]
Say my CSV file looks like this:
book title, author
The Rock, Herry Putter
Business Economics, Herry Putter
Yogurt, Daniel Putter
Short Story, Rick Pan
And say my input.txt looks like this:
Herry
Putter
Therefore when I run my program, it prints:
The Rock
Business Economics
The Rock
Business Economics
Yogurt
This is because it searches for all titles with "Herry" first, and then searches all over again for "Putter". So in the end, I have duplicates of the book titles. I'm trying to figure out a way to remove them...so if anyone can help, that would be greatly appreciated.
If original order does not matter, then stick the results into a set first, and then print them out at the end. But your example is small enough that speed does not matter much.
Stick the results in a set (which is like a list but only contains unique elements), and print at the end.
Something like:

my_set = set()  # created before the search loops

# inside the innermost loop:
if any(input in terms.lower() for terms in row):
    if row[0] not in my_set:
        my_set.add(row[0])
During the search stick results into a list, and only add new results to the list after first searching the list to see if the result is already there. Then after the search is done print the list.
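A sketch of that idea that keeps the original order; using a companion set for the membership test avoids rescanning the list on every hit (variable names follow the question's code):

seen = set()
ordered_titles = []

for row in lines:
    inputs = row.strip().split(" ")
    for input in inputs:
        input = input.lower()
        for data_row in data:
            if any(input in terms.lower() for terms in data_row):
                if data_row[0] not in seen:
                    seen.add(data_row[0])
                    ordered_titles.append(data_row[0])

for title in ordered_titles:
    print(title)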
First, get the set of search terms you want to look for in a single list. We use set(...) here to eliminate duplicate search terms:
search_terms = set(open("input.txt", "r").read().lower().split())
Next, iterate over the rows in the data table, selecting each one that matches the search terms. Here, I'm preserving the behavior of the original code, in that we search for the case-normalized search term in any column for each row. If you just wanted to search e.g. the author column, then this would need to be tweaked:
results = [row for row in data
           if any(search_term in item.lower()
                  for item in row
                  for search_term in search_terms)]
Finally, print the results.
for row in results:
    print row[0]
If you wanted, you could also list the authors or any other info in the table. E.g.:
for row in results:
    print '%30s (by %s)' % (row[0], row[1])
