Get unique values of every column from a gz file

I have a gz file and I want to extract the unique values from each column of the file. The field separator is |. I tried using Python as below.
import sys,os,csv,gzip
from sets import Set
ig = 0
max_d = 1
with gzip.open("fundamentals.20170724.gz","rb") as f:
    reader = csv.reader(f,delimiter="|")
    for i in range(0,400):
        unique = Set()
        print "Unique_value for column "+str(i+1)
        flag = 0
        for line in reader:
            try:
                unique.add(line[i])
                max_d +=1
                if len(unique) >= 10:
                    print unique
                    flag = 1
                    break
            except:
                continue
        if flag == 0: print unique
It works after a fashion, but I don't find it efficient for large files, so I'm looking at this problem from a bash point of view.
Is there a shell script solution?
For example, I have data like this in my file:
5C4423,COMP,ISIN,CA2372051094,2016-04-19,
41C528,COMP,ISIN,US2333774071,2000-01-01,
B62545,COMP,ISIN,NL0000344265,2000-01-01,2007-05-11
9E7F41,COMP,ISIN,CA39260W1023,2013-02-13,2013-08-09
129DC8,COMP,ISIN,US37253A1034,2012-09-07,
4DE8CD,COMP,ISIN,QA000A0NCQB1,2008-03-06,
and I want all unique values from each column.

With the gunzipped file, you could do:
awk -F, 'END { for (i=1;i<=NF;i++) { print "cut -d\",\" -f "i" filename | sort -u" } }' filename | sh
Set the field separator to , and then, for each field in the file, construct a cut command piped through sort -u (plain uniq only collapses adjacent duplicates, so the values have to be sorted first), and finally pipe the whole awk output through sh. The use of cut, sort and sh will slow things down and there is probably a more efficient way, but it's worth a go.

A shell-built pipeline could indeed do this job faster, though likely less memory-efficiently. There are two primary reasons: parallelism and native code.
First, since we have little description of the task, I'll have to read the Python code and figure out what it does.
from sets import Set is an odd line; sets is an old Python 2 standard library module, deprecated since Python 2.6 in favour of the built-in set type, so at best Set is another name for the same concept and at worst a less efficient variant of it.
gzip.open lets the script read a gzipped file. We can replace this with a zcat process.
csv.reader reads character-separated values, in this case splitting on '|'. Deeper inside the code we find only one column (line[i]) is read, so we can replace it with cut or awk ... until i changes. awk can handle that case too, but it's a little trickier.
The trickiest part is the end logic. Every time 10 unique values are found in a column, the program outputs those values and switches to the next column. By the way, Python's for has an else clause specifically for this case, so you don't need a flag variable.
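As a minimal sketch of the for ... else idea (Python 3 syntax; the function name first_n_unique and the sample rows are made up for illustration): the else branch runs only when the loop finishes without hitting break, which is exactly the job the flag variable was doing.

```python
def first_n_unique(rows, column, limit):
    """Collect unique values of one column, stopping once `limit` are found."""
    unique = set()
    for row in rows:
        if column >= len(row):
            continue  # short row: this column is missing
        unique.add(row[column])
        if len(unique) >= limit:
            print("limit reached:", sorted(unique))
            break
    else:
        # Runs only if the loop was never interrupted by break.
        print("input exhausted:", sorted(unique))
    return unique


rows = [["a", "x"], ["b", "x"], ["a", "y"]]  # hypothetical sample data
first_n_unique(rows, 0, 2)  # stops early via break
first_n_unique(rows, 1, 5)  # runs to the end, so the else branch fires
```

Unlike the original script, this checks the row length explicitly instead of catching every exception.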
One of the odder parts of the code is how you catch all exceptions from the inner data processing block. Why is this? There are basically only two sources of exceptions in there: Firstly, the indexing could fail if there aren't that many columns. Secondly, the imported Set type could be throwing exceptions; the built-in set type would not.
So, the analysis of your function is: in a diagonal manner (since the file is never rewound, and columns are not processed in parallel), collect unique values from each column until ten are found, and print them. This means, for instance, that if the first column had fewer than ten unique items, the reader is exhausted on that column and no values are ever collected for any other column. I'm not sure this is the logic you intended.
With such complicated logic, Python's set functionality actually is a good choice; if we could partition the data more easily then uniq might have been better. What throws us off is how the program moves from column to column and only wants a specific number of values.
Thus, the two big time wasters in the Python program are decompressing in the same thread as we do all the parsing, and splitting into all columns when we only need one. The former can be addressed using a thread, and the latter is probably best done using a regular expression such as r'^(?:[^|]*\|){3}([^|]*)'. That expression would skip three columns and the fourth can be read as group 1. It gets more complicated if the CSV has quoting to contain the separator within some column. We could do the line parsing itself in a separate thread, but that wouldn't solve the issue of the many unneeded string allocations.
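To make the regular expression concrete, here is a small sketch that generalises it to any column index. The helper name column_pattern is hypothetical, it assumes a single-character delimiter, and it does not handle quoted fields that contain the delimiter.

```python
import re


def column_pattern(index, delimiter="|"):
    """Compile a regex that skips `index` fields and captures the next one."""
    d = re.escape(delimiter)
    # (?:[^d]*d){index} consumes `index` whole fields plus their delimiters;
    # ([^d]*) then captures the wanted field as group 1.
    return re.compile(r"^(?:[^%s]*%s){%d}([^%s]*)" % (d, d, index, d))


fourth = column_pattern(3)
match = fourth.match("5C4423|COMP|ISIN|CA2372051094|2016-04-19|")
print(match.group(1))         # the fourth field
print(fourth.match("a|b"))    # None: the line has too few columns
```

Only the captured field is turned into a new string, which is the saving the paragraph above is after.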
Note that the problem actually becomes considerably different if what you really want is to process all columns from the start of the file. I also don't know why you specifically process 400 columns regardless of how many exist. If we remove those two constraints, the logic would be more like:
firstline=next(reader)
sets = [{column} for column in firstline]
for line in reader:
    for column,columnset in zip(line,sets):
        columnset.add(column)
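For completeness, a self-contained Python 3 sketch of the same logic reading straight from a gzipped file. The function name unique_per_column and the demo file are assumptions made for illustration; unlike the snippet above, it grows the list of sets when a later row has more columns than the first.

```python
import csv
import gzip
import os
import tempfile


def unique_per_column(path, delimiter="|"):
    """Return one set of unique values per column of a gzipped text file."""
    sets = []
    # "rt" decompresses and decodes to text; newline="" is what csv expects.
    with gzip.open(path, "rt", newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            # Grow the list if this row has more columns than seen so far.
            while len(sets) < len(row):
                sets.append(set())
            for value, column_set in zip(row, sets):
                column_set.add(value)
    return sets


# Tiny demo file so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gz")
with gzip.open(demo_path, "wt", newline="") as f:
    f.write("a|x|1\nb|x|2\na|y\n")

result = unique_per_column(demo_path)
print([sorted(s) for s in result])  # [['a', 'b'], ['x', 'y'], ['1', '2']]
```

Memory use grows with the number of distinct values, not with the file size, since rows are streamed one at a time.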

this is a pure python version based on your idea:
from io import StringIO
from csv import reader
txt = '''5C4423,COMP,ISIN,CA2372051094,2016-04-19,
41C528,COMP,ISIN,US2333774071,2000-01-01,
B62545,COMP,ISIN,NL0000344265,2000-01-01,2007-05-11
9E7F41,COMP,ISIN,CA39260W1023,2013-02-13,2013-08-09
129DC8,COMP,ISIN,US37253A1034,2012-09-07,
4DE8CD,COMP,ISIN,QA000A0NCQB1,2008-03-06,'''
with StringIO(txt) as file:
    rows = reader(file)
    first_row = next(rows)
    unique = [{item} for item in first_row]
    for row in rows:
        for item, s in zip(row, unique):
            s.add(item)
which yields for your input:
[{'129DC8', '41C528', '4DE8CD', '5C4423', '9E7F41', 'B62545'},
{'COMP'},
{'ISIN'},
{'CA2372051094',
'CA39260W1023',
'NL0000344265',
'QA000A0NCQB1',
'US2333774071',
'US37253A1034'},
{'2000-01-01', '2008-03-06', '2012-09-07', '2013-02-13', '2016-04-19'},
{'', '2007-05-11', '2013-08-09'}]
oops, now that i have posted my answer i see, that this is exactly what Yann Vernier proposes at the end of his answer. please upvote this answer which was here way earlier than mine...
if you want to limit the number of unique values, you could cap the size of each set:
from io import StringIO
from csv import reader
MAX_LEN = 3
with StringIO(txt) as file:
    rows = reader(file)
    first_row = next(rows)
    unique = [{item} for item in first_row]
    for row in rows:
        for item, s in zip(row, unique):
            if len(s) < MAX_LEN:
                s.add(item)
print(unique)
with the result:
[{'41C528', '5C4423', 'B62545'},
{'COMP'},
{'ISIN'},
{'CA2372051094', 'NL0000344265', 'US2333774071'},
{'2000-01-01', '2013-02-13', '2016-04-19'},
{'', '2007-05-11', '2013-08-09'}]
this way you would save some memory if one of your columns holds only unique values.

Related

Delete a Portion of a CSV Cell in Python

I have recently stumbled upon a task utilizing some CSV files that are, to say the least, very poorly organized, with one cell containing what should be multiple separate columns. I would like to use this data in a Python script but want to know if it is possible to delete a portion of the row (all of it after a certain point) then write that to a dictionary.
Although I can't show the exact contents of the CSV, it looks like this:
useful. useless useless useless useless
I understand that this will most likely require either a regular expression or an endswith statement, but doing all of that to a CSV file is beyond me. Also, the period written after useful on the CSV should be removed as well, and is not a typo.

If you know the character you want to split on you can use this simple method:
good_data = bad_data.split(".")[0]
good_data = good_data.strip() # remove excess whitespace at start and end
This method will always work: split returns a list, which will always have at least 1 entry (the full string). Using index, by contrast, may throw an exception.
You can also limit the # of splits that will happen if necessary using split(".", N).
https://docs.python.org/2/library/stdtypes.html#str.split
>>> "good.bad.ugly".split(".", 1)
['good', 'bad.ugly']
>>> "nothing bad".split(".")
['nothing bad']
>>> stuff = "useful useless"
>>> stuff = stuff[:stuff.index(".")]
ValueError: substring not found

Actual Answer
Ok then notice that you can use indexing for strings just like you do for lists. I.e. "this is a very long string but we only want the first 4 letters"[:4] gives "this". If we now knew the index of the dot we could just get what you want like that. For exactly that, strings have the index method. So in total you do:
stuff = "useful. useless useless useless useless"
stuff = stuff[:stuff.index(".")]
Now stuff is very useful :).
In case we are talking about a file containing multiple lines like that you could do it for each line. Split that line at , and put all in a dictionary.
data = {}
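with open("./test.txt") as f:
    for i, line in enumerate(f.read().split("\n")):
        # everything before the first period; the whole line if there is none
        # (line.index(".") would raise ValueError on such a line)
        csv_line = line.partition(".")[0]
        for j,col in enumerate(csv_line.split(",")):
            data[(i,j)] = col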
How one would do this
Notice that most people would not want to do it by hand. It is a common task to work on tabled data and there is a library called pandas for that. Maybe it would be a good idea to familiarise yourself a bit more with python before you dive into pandas though. I think a good point to start is this. Using pandas your task would look like this
import pandas as pd
pd.read_csv("./test.txt", comment=".")
giving you what is called a dataframe.

check csv every 5 rows with condition using python3.x

csv data:
c1,v1,c2,v2,Time
13.9,412.1,29.7,177.2,14:42:01
13.9,412.1,29.7,177.2,14:42:02
13.9,412.1,29.7,177.2,14:42:03
13.9,412.1,29.7,177.2,14:42:04
13.9,412.1,29.7,177.2,14:42:05
0.1,415.1,1.3,-0.9,14:42:06
0.1,408.5,1.2,-0.9,14:42:07
13.9,412.1,29.7,177.2,14:42:08
0.1,413.4,1.3,-0.9,14:42:09
0.1,413.8,1.3,-0.9,14:42:10
My current code that I have:
import pandas as pd
import csv
import datetime as dt
#Read .csv file, get timestamp and split it into date and time separately
Data = pd.read_csv('filedata.csv', parse_dates=['Time_Stamp'], infer_datetime_format=True)
Data['Date'] = Data.Time_Stamp.dt.date
Data['Time'] = Data.Time_Stamp.dt.time
#print (Data)
print (Data['Time_Stamp'])
Data['Time_Stamp'] = pd.to_datetime(Data['Time_Stamp'])
#Read timestamp within a certain range
mask = (Data['Time_Stamp'] > '2017-06-12 10:48:00') & (Data['Time_Stamp']<= '2017-06-12 11:48:00')
june13 = Data.loc[mask]
#print (june13)
What I'm trying to do is to read every 5 secs of data, and if 1 out of 5 secs of data of c1 is 10.0 and above, replace that value of c1 with 0.
I'm still new to python and I could not find examples for this. May I have some assistance as this problem is way beyond my python programming skills for now. Thank you!

I don't know the modules around csv files, so my answer might look primitive, and I'm not quite sure what you are trying to accomplish here, but have you thought of dealing with the file textually?
From what I get, you want to read every c1, check the value and modify it.
To read and modify the file, you could do:
with open('filedata.csv', 'r+') as csv_file:
    lines = csv_file.readlines()
    # For each data line, isolate the first field, check it and modify it if needed.
    # I'm seriously not sure, you might have wanted to read only one out of five lines.
    # For that, just do a while loop with an index, which increments through lines by 5.
    for n, line in enumerate(lines):
        if n == 0:
            continue # skip the header line, it holds names rather than numbers
        fields = line.split(',') # split comma-separated-values
        # Check condition and apply needed change.
        if float(fields[0]) >= 10:
            fields[0] = "0" # Directly as a string.
        # Transform the list back into a single string and store it.
        lines[n] = ",".join(fields)
    # Rewrite the file.
    csv_file.seek(0)
    csv_file.writelines(lines)
    csv_file.truncate() # the new content may be shorter than the old one
# Here you are ready to use the file just like you were already doing.
# Of course, the above code could be put in a function for known advantages.
(I don't have python here, so I couldn't test it and typos might be there.)
If you only need the dataframe without the file being modified:
Pretty much the same to be honest.
Instead of the file-writing at the end, you could do :
from io import StringIO # pandas needs a file-like object instead of a plain string.
# Above code here, but without the file-rewriting part at the end.
Data = pd.read_csv(
    StringIO("".join(lines)), # lines from readlines() already end with a newline
    parse_dates=['Time_Stamp'],
    infer_datetime_format=True
)
This should give you the Data you have, with changed values where needed.
Hope this wasn't completely off. Also, some people might find this approach horrible; we already have working modules to do that kind of thing, so why bother dealing with the rough raw data ourselves? Personally, I think that it's often much easier than learning all of the external modules I'll be using in my life if I don't try to understand how the text representation of files can be used. Your opinion might differ.
Also, this code might result in performances being lower, as we need to iterate through the text twice (pandas does it when reading). However, I don't think you'd get faster result by reading the csv like you already do, then iterate through data anyway to check condition. (You might win a cast per c1 checked value, but the difference is small and iterating through pandas dataframe might as well be slower than a list, depending on the state of their current optimisation.)
Of course, if you don't really need the pandas dataframe format, you could completely do it manually, it would take only a few more lines (or not, tbh) and shouldn't be slower, as the amount of iterations would be minimized : you could check conditions on data at the same time as you read it. It's getting late and I'm sure you can figure that out by yourself so I won't code it in my great editor (known as stackoverflow), ask if there's anything !
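A rough sketch of that manual approach, using only the standard csv module and checking the condition while reading. It assumes the goal is simply "set c1 to 0 wherever it is 10.0 or above", which is one reading of the question; the function name zero_high_c1 and the embedded sample are made up for illustration.

```python
import csv
import io

# Two rows in the same layout as the question's data (sample only).
SAMPLE = """c1,v1,c2,v2,Time
13.9,412.1,29.7,177.2,14:42:05
0.1,415.1,1.3,-0.9,14:42:06
"""


def zero_high_c1(text, threshold=10.0):
    """Return the rows as dicts, with c1 set to 0.0 wherever it is >= threshold."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        c1 = float(row["c1"])
        row["c1"] = 0.0 if c1 >= threshold else c1
        rows.append(row)
    return rows


for row in zero_high_c1(SAMPLE):
    print(row["Time"], row["c1"])
```

For a real file, passing an open file object to csv.DictReader instead of the StringIO wrapper should work the same way.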

When I write in csv how do I separate columns in Python

My code is
import pymysql
conn=pymysql.connect(host=.................)
curs=conn.cursor()
import csv
f=open('./kospilist.csv','r')
data=f.readlines()
data_kp=[]
for i in data:
    data_kp.append(i[:-1])
c = csv.writer(open("./test_b.csv","wb"))
def exportFunc():
    result=[]
    for i in range(0,len(data_kp)):
        xp="select date from " + data_kp[i] + " where price is null"
        curs.execute(xp)
        result= curs.fetchall()
        for row in result:
            c.writerow(data_kp[i])
            c.writerow(row)
        c.writerow('\n')
exportFunc()
data_kp is reading the tables name
the tables' names are like this (string, ex: a000010)
I collect table names from here.
Then, execute and get the result.
The actual output of my code is ..
My expectation is
(not 3 columns.. there are 2000 tables)
I thought my code is near the answer... but it's not working..
My work is almost done, but I couldn't finish this part.
I had googled for almost 10 hours..
I don't know how.. please help
I think something is wrong with these part
for row in result:
    c.writerow(data_kp[i])
    c.writerow(row)

The csvwriter.writerow method allows you to write a row in your output csv file. This means that once you have called the writerow method, the line is written and you can't come back to it. When you write the code:
for row in result:
    c.writerow(data_kp[i])
    c.writerow(row)
You are saying:
"For each result, write a line containing data_kp[i] then write a
line containing row."
This way, everything will be written vertically, alternating between data_kp[i] and row.
What is surprising is that it is not what we get in your actual output. I think that you've changed something. Something like that:
c.writerow(data_kp[i])
for row in result:
    c.writerow(row)
But this has not entirely solved your issue, obviously: The names of the tables are not correctly displayed (one character on each column) and they are not side-by-side. So you have 2 problems here:
1. Get the table name in one cell and not split
First, let's take a look at the documentation about the csvwriter:
A row must be an iterable of strings or numbers for Writer objects
But your data_kp[i] is a String, not an "iterable of String". This can't work! But you don't get any error either, why? This is because a String, in python, may be itself considered as an iterable of String. Try by yourself:
for char in "abcde":
    print(char)
And now, you have probably understood what to do in order to make the things work:
# Give an Iterable containing only data_kp[i]
c.writerow([data_kp[i]])
You now have your table name displayed in only 1 cell! But we still have another problem...
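A quick way to see the difference for yourself, sketched with an in-memory buffer instead of a real file (the table name is the example one from the question):

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow("a000010")    # a bare string: every character becomes a cell
writer.writerow(["a000010"])  # a one-element list: a single cell
print(buffer.getvalue())
```

The first line of the output is a,0,0,0,0,1,0 and the second is a000010.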
2. Get the table names displayed side by side
Here, it is a problem in the logic of your code. You are browsing your table names, writing lines containing them and expect them to be written side-by-side and get columns of dates!
Your code needs a little bit of rethinking, because csvwriter is not made for writing columns but lines. We'll therefore use the zip_longest function of the itertools module. One may ask why I don't use Python's built-in zip function: the columns are not guaranteed to be of equal size, and zip stops as soon as it reaches the end of the shortest list!
import csv
import itertools
# each entry of this list will contain a column for your csv file
data_columns = []
def exportFunc():
    for i in range(0,len(data_kp)):
        xp="select date from " + data_kp[i] + " where price is null"
        curs.execute(xp)
        result= curs.fetchall()
        # each column starts with the name of the table;
        # fetchall() returns one tuple per row, so take its only field
        data_columns.append([data_kp[i]] + [row[0] for row in result])
    # the * operator explodes the list into arguments for zip_longest
    zipped_columns = itertools.zip_longest(*data_columns, fillvalue=" ")
    # zip_longest is Python 3, where csv files are opened in text mode with newline=""
    with open("./test_b.csv", "w", newline="") as out:
        csv.writer(out).writerows(zipped_columns)
exportFunc()
Note:
The code provided here has not been tested and may contain bugs. Nevertheless, you should be able (by using the documentation I provided) to fix it in order to make it work! Good luck :)
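Here is a database-free sketch of the column-to-row transposition, so the zip_longest step can be tried in isolation. The table names and dates are invented sample values, and an in-memory buffer stands in for the output file.

```python
import csv
import io
from itertools import zip_longest

# One list per table: the table name followed by its dates (made-up data).
columns = [
    ["a000010", "2017-01-02", "2017-01-03"],
    ["a000020", "2017-02-01"],
    ["a000030"],
]

# Transpose columns into rows, padding the shorter columns with "".
rows = list(zip_longest(*columns, fillvalue=""))

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

The first output row holds the three table names side by side, and each later row holds one date per table, with empty cells where a table has run out of dates.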

What is the fastest performance tuple for large data sets in python?

Right now, I'm basically running through an excel sheet.
I have about 20 names and then I have 50k total values that match to one of those 20 names, so the excel sheet is 50k rows long, column B showing any random value, and column A showing one of the 20 names.
I'm trying to get a string for each of the names that show all of the values.
Name A: 123,244,123,523,123,5523,12505,142... etc etc.
Name B: 123,244,123,523,123,5523,12505,142... etc etc.
Right now, I created a dictionary that runs through the excel sheet, checks if the name is already in the dictionary, if it is, then it does a
strA = strA + "," + foundValue
Then it inserts strA back into the dictionary for that particular name. If the name doesn't exist, it creates that dictionary key and then adds that value to it.
Now, this was working all well at first.. but it's been about 15 or 20 mins and it is only on 5k values added to the dictionary so far and it seems to get slower as time goes on and it keeps running.
I wonder if there is a better way to do this or faster way to do this. I was thinking of building new dictionaries every 1k values and then combine them all together at the end.. but that would be 50 dictionaries total and it sounds complicated.. although maybe not.. I'm not sure, maybe it could work better that way, this seems to not work.
I DO need the string that shows each value with a comma between each value. That is why I am doing the string thing right now.

There are a number of things that are likely causing your program to run slowly.
String concatenation in python can be extremely inefficient when used with large strings.
Strings in Python are immutable. This fact frequently sneaks up and bites novice Python programmers on the rump. Immutability confers some advantages and disadvantages. In the plus column, strings can be used as keys in dictionaries and individual copies can be shared among multiple variable bindings. (Python automatically shares one- and two-character strings.) In the minus column, you can't say something like, "change all the 'a's to 'b's" in any given string. Instead, you have to create a new string with the desired properties. This continual copying can lead to significant inefficiencies in Python programs.
Considering each string in your example could contain thousands of characters, each time you do a concatenation, python has to copy that giant string into memory to create a new object.
This would be much more efficient:
strings = []
strings.append('string')
strings.append('other_string')
...
','.join(strings)
In your case, instead of each dictionary key storing a massive string, it should store a list, and you would just append each match to the list, and only at the very end would you do a string concatenation using str.join.
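A minimal sketch of that list-per-key approach, using collections.defaultdict so missing names need no special handling. The pairs list is made-up sample data standing in for the rows read from the sheet.

```python
from collections import defaultdict

# (name, value) pairs as they might come out of the spreadsheet (sample only).
pairs = [("Name A", 123), ("Name B", 244), ("Name A", 523)]

values_by_name = defaultdict(list)
for name, value in pairs:
    values_by_name[name].append(str(value))  # cheap: no string copying

# Build each comma-separated string exactly once, at the very end.
joined = {name: ",".join(values) for name, values in values_by_name.items()}
print(joined)  # {'Name A': '123,523', 'Name B': '244'}
```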
In addition, printing to stdout is also notoriously slow. If you're printing to stdout on each iteration of your massive 50,000 item loop, each iteration is being held up by the unbuffered write to stdout. Consider only printing every nth iteration, or perhaps writing to a file instead (file writes are normally buffered) and then tailing the file from another terminal.

This answer is based on OP's answer to my comment. I asked what he would do with the dict, suggesting that maybe he doesn't need to build it in the first place. @simon replies:
i add it to an excel sheet, so I take the KEY, which is the name, and
put it in A1, then I take the VALUE, which is
1345,345,135,346,3451,35.. etc etc, and put that into A2. then I do
the rest of my programming with that information...... but i need
those values seperated by commas and acessible inside that excel sheet
like that!
So it looks like the dict doesn't have to be built after all. Here is an alternative: for each name, create a file, and store those files in a dict:
files = {}
name = 'John' # let's say
if name not in files:
    files[name] = open(name, 'w')
Then when you loop over the 50k-row excel, you do something like this (pseudo-code):
for row in rows: # the 50k rows, however you read them
    name, value_string = row.split() # or whatever
    file = files[name]
    file.write(value_string + ',') # if already ends with ',', no need to add
Since your value_string is already comma separated, your file will be csv-like without any further tweaking on your part (except maybe you want to strip the last trailing comma after you're done). Then when you need the values, say, of John, just value = open('John').read().
Now I've never worked with 50k-row excels, but I'd be very surprised if this is not quite a bit faster than what you currently have. Having persistent data is also (well, maybe) a plus.
EDIT:
Above is a memory-oriented solution. Writing to files is much slower than appending to lists (but probably still faster than recreating many large strings). But if the lists are huge (which seems likely) and you run into a memory problem (not saying you will), you can try the file approach.
An alternative, similar to lists in performance (at least for the toy test I tried) is to use StringIO:
from io import StringIO # python 2: from StringIO import StringIO
string_ios = {'John': StringIO()} # a dict to store StringIO objects
for value in ['ab', 'cd', 'ef']:
    string_ios['John'].write(value + ',')
print(string_ios['John'].getvalue())
This will output 'ab,cd,ef,'

Instead of building a string that looks like a list, use an actual list and make the string representation you want out of it when you are done.

The proper way is to collect in lists and join at the end, but if for some reason you want to use strings, you could speed up the string extensions. Pop the string out of the dict so that there's only one reference to it and thus the optimization can kick in.
Demo:
>>> timeit('s = d.pop(k); s = s + "y"; d[k] = s', 'k = "x"; d = {k: ""}')
0.8417842664330237
>>> timeit('s = d[k]; s = s + "y"; d[k] = s', 'k = "x"; d = {k: ""}')
294.2475278390723

It depends on how you have read the excel file, but let's say that the lines are read as delimiter-separated tuples or something:
d = {}
for name, foundValue in line_tuples:
    try:
        d[name].append(foundValue)
    except KeyError:
        d[name] = [foundValue]
d = {k: ",".join(v) for k, v in d.items()}
Alternatively using pandas:
import pandas as pd
df = pd.read_excel("some_excel_file.xlsx")
d = df.groupby("A")["B"].apply(lambda x: ",".join(x.astype(str))).to_dict() # astype(str): join needs strings, and the values may be numeric

Fastest way to grep multiple values from file in python

I have a file of 300m lines (inputFile), all with 2 columns separated by a tab.
I also have a list of 1000 unique items (vals).
I want to create a dictionary with column 1 as key and column 2 as value for all lines in inputFile where the first columns occurs in vals. A few items in vals do not occur in the file, these values have to be saved in a new list. I can use up to 20 threads to speed up this process.
What is the fastest way to achieve this?
My best try till now:
newDict = {}
foundVals = []
cmd = "grep \"" + vals[0]
for val in vals:
    cmd = cmd + "\|^"+val+"[[:space:]]"
cmd = cmd + "\" " + self.inputFile
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
for line in iter(p.stdout.readline, ''):
    info = line.split()
    foundVals.append(info[0])
    newDict.update({info[0]:info[1]})
p.wait()
notFound = [x for x in vals if x not in set(foundVals)]
Example
inputFile:
2 9913
3 9913
4 9646
...
594592886 32630
594592888 32630
594592890 32630
vals:
[1,2,594592888]
wanted dictionary:
{2:9913,594592888:32630}
And in notFound:
[1]

You clarified in a comment that each key occurs at most once in the data. It follows from that and the fact that there are only 1000 keys that the amount of work being done in Python is trivial; almost all your time is spent waiting for output from grep. Which is fine; your strategy of delegating line extraction to a specialized utility remains sound. But it means that performance gains have to be found on the line-extraction side.
You can speed things up some by optimizing your regex. E.g., instead of
^266[[:space:]]\|^801[[:space:]]\|^810[[:space:]]
you could use:
^\(266\|801\|810\)[[:space:]]
so that the anchor doesn't have to be separately matched for each alternative. I see about a 15% improvement on test data (10 million rows, 25 keys) with that change.
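A small sketch of how the single-anchor pattern could be built from vals in Python. The helper names grep_pattern and python_pattern are hypothetical; the first produces the grep-style basic regex shown above, and the second is an equivalent for Python's own re module, which has no POSIX character classes. Neither attempts the prefix factoring.

```python
import re


def grep_pattern(vals):
    """Build a grep basic-regex pattern with one anchor for all keys."""
    alternation = r"\|".join(str(v) for v in vals)
    return r"^\(" + alternation + r"\)[[:space:]]"


def python_pattern(vals):
    """Equivalent compiled pattern for Python's re module."""
    alternation = "|".join(re.escape(str(v)) for v in vals)
    return re.compile(r"^(?:" + alternation + r")\s")


print(grep_pattern([266, 801, 810]))  # ^\(266\|801\|810\)[[:space:]]
```

The keys are assumed to contain no regex metacharacters on the grep side, which holds for the numeric keys in the example.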
A further optimization is to unify common prefixes in the alternation: 266\|801\|810 can be replaced with the equivalent 266\|8\(01\|10\). Rewriting the 25-key regex in this way gives close to a 50% speedup on test data.
At this point grep is starting to show its limits. It seems that it's CPU-bound: iostat reveals that each successive improvement in the regex increases the number of IO requests per second while grep is running. And re-running grep with a warmed page cache and the --mmap option doesn't speed things up (as it likely would if file IO were a bottleneck). Greater speed therefore probably requires a utility with a faster regex engine.
One such is ag (The Silver Searcher), whose regex implementation also performs automatic optimization, so you needn't do much hand-tuning. While I haven't been able to get grep to process the test data in less than ~12s on my machine, ag finishes in ~0.5s for all of the regex variants described above.

If I understand you correctly, you don't want any file row that doesn't match your vals.
Since you're talking about huge files and quite smaller number of wanted values, I would go for something like:
vals_set = set(str(v) for v in vals) # the file yields strings, so compare as strings
found_vals = {}
with open(inputfile,"r") as in_file:
    for line in in_file:
        line = line.split() # Assuming tabs or whitespaces
        if line and line[0] in vals_set: # skip blank lines
            found_vals[line[0]] = line[1]
not_found_vals = vals_set.difference(found_vals)
It will be memory conservative, and you'll have your dict in found_vals and your not-found values (as a set) in not_found_vals. In fact, memory usage, AFAIK, will depend only on the number of vals you want to search for, not on the size of the file.
EDIT:
I think that the easiest way to parallelize this task would be just by splitting the file and searching separately in each piece with a different process. This way you don't need to deal with communicating between threads (easier and faster, I think).
A good way to do it, since I deduce you're using BASH (you used grep :P ) is what is mentioned in this answer:
split -l 1000000 filename
will generate files with 1000000 lines each.
You could easily modify your script to save the matches into a new file for each process, and then merge the different output files.

This is not terribly memory efficient (for a file of 300 million lines, that may amount to a problem). I can't think of a way to save the not-found values within a comprehension except by saving all the values (or reading the file twice). I don't think threads will help much, since the file I/O is likely going to be the performance bottleneck.
I'm assuming that a tab is the delimiting character in the file. (You didn't say, but the example data looks to have a tab.)
vals = [1,2,594592888]
with open(self.inputfile,'r') as i_file:
    all_vals = {
        int(t[0]):int(t[1])
        for t in (
            line.strip().split('\t')
            for line in i_file
        )
    }
newDict = {
    t[0]:t[1] for t in filter(lambda t: t[0] in vals, all_vals.items())
}
# the requested values that never showed up in the file
notFound = list(set(vals).difference(newDict.keys()))
