Searching for area code in txt-file in Python

I have a text file that looks like this:
Thomas Edgarson, Berliner Str 4, 13359 Berlin
Madeleine Jones, Müller Str 5, 15992 Karlsruhe
etc...
It's always two words, followed by a comma, then two words and a number, another comma, and finally the area code and the city. There are no exceptions.
I used
f=open("C:\\Users\\xxxxxx\\Desktop\\useradresses.txt", "r")
text=f.readlines()
f.close()
So now I have a list of all the lines. How can I search for the area codes in these strings? I need to create a dictionary that looks like this:
{'13359':[('Neuss','Wolfgang'),('Juhnke','Harald')]}
Believe me, I've searched, but couldn't find useful information. To me, the whole idea of searching for something like an arbitrary area code in a string is new, and I haven't come across it so far.
I would be happy if you could give me some pointers as to where I should look for tutorials or give me an idea where to start.

dic = {}
with open('filename') as file:
    # split each non-empty line into name, address, and "area-code city"
    for name, addr, zcode in (i.split(',') for i in file if i.rstrip()):
        # key on the area code; the value is a list of split names
        dic.setdefault(zcode.split()[0], []).append(name.split())
Further explanation, as Sjoerd asked:
I use a generator expression to break each line into three variables: name, addr and zcode. Then I split zcode and use the first part (the area code) as the dictionary key.
As the dict may not have that key yet, I use the setdefault method, which sets the key to an empty list before appending the split name.
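For example, run on the two sample lines from the question, the loop builds this (note that the names end up as two-element lists; wrap name.split() in tuple(...) if you want tuples exactly as in the desired output):
dic = {}
lines = ["Thomas Edgarson, Berliner Str 4, 13359 Berlin",
         "Madeleine Jones, Müller Str 5, 15992 Karlsruhe"]
# same logic as above, run on an in-memory list instead of a file
for name, addr, zcode in (i.split(',') for i in lines if i.rstrip()):
    dic.setdefault(zcode.split()[0], []).append(name.split())
print(dic)
# {'13359': [['Thomas', 'Edgarson']], '15992': [['Madeleine', 'Jones']]}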

Loop through the file reading lines and split each line on the comma. Then process each part by splitting on spaces, and add the values to a dictionary.
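A minimal sketch of that approach (assuming the file name from the question and last-name-first tuples, as in the desired output):
result = {}
with open("useradresses.txt") as f:
    for line in f:
        if not line.strip():
            continue                      # skip blank lines
        name, street, zip_city = line.split(",")
        area_code = zip_city.split()[0]   # first token after the last comma
        first, last = name.split()
        result.setdefault(area_code, []).append((last, first))
print(result)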

d = {}
for line in open('useradresses.txt', 'r'):
    if line.strip() == '':
        continue
    name, strasse, plzort = line.split(',')
    nachname, vorname = name.split()
    plz, ort = plzort.split()
    if plz in d:
        d[plz].append((nachname, vorname))
    else:
        d[plz] = [(nachname, vorname)]
print(d)

Python has a lot of libraries dealing with string manipulation, which is what this is. You'll be wanting the re library and the shlex library. I'd suggest the following code:
with open("C:\\Users\\xxxxxx\\Desktop\\useradresses.txt", "r") as f:
for line in f.readlines():
split = shlex.split(line)
mydict[split[6]] = [(split[0], split[1])]
This won't be perfect, it will overwrite identical zip codes, and drops some values. It should point you in the right direction though.
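If the overwriting matters, one possible tweak (still assuming the fixed format from the question, so the area code is always the second-to-last token) would be to collect a list per area code:
import shlex

mydict = {}
with open("C:\\Users\\xxxxxx\\Desktop\\useradresses.txt", "r") as f:
    for line in f:
        if not line.strip():
            continue
        parts = shlex.split(line)
        # parts[-2] is the area code; strip the comma shlex leaves on the surname
        mydict.setdefault(parts[-2], []).append((parts[0], parts[1].rstrip(','))) 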

Delete a Portion of a CSV Cell in Python

I have recently stumbled upon a task utilizing some CSV files that are, to say the least, very poorly organized, with one cell containing what should be multiple separate columns. I would like to use this data in a Python script but want to know if it is possible to delete a portion of the row (all of it after a certain point) then write that to a dictionary.
Although I can't show the exact contents of the CSV, it looks like this:
useful. useless useless useless useless
I understand that this will most likely require either a regular expression or an endswith check, but doing all of that to a CSV file is beyond me. Also, the period written after useful in the CSV should be removed as well; it is not a typo.
If you know the character you want to split on you can use this simple method:
good_data = bad_data.split(".")[0]
good_data = good_data.strip() # remove excess whitespace at start and end
This method will always work: split returns a list that always has at least one entry (the full string), whereas using index may throw an exception.
You can also limit the number of splits that will happen, if necessary, using split(".", N).
https://docs.python.org/2/library/stdtypes.html#str.split
>>> "good.bad.ugly".split(".", 1)
['good', 'bad.ugly']
>>> "nothing bad".split(".")
['nothing bad']
>>> stuff = "useful useless"
>>> stuff = stuff[:stuff.index(".")]
ValueError: substring not found
Actual Answer
OK, notice that you can use slicing on strings just like you do on lists, e.g. "this is a very long string but we only want the first 4 letters"[:4] gives "this". If we knew the index of the dot, we could get what you want just like that. For exactly that, strings have the index method. So in total you do:
stuff = "useful. useless useless useless useless"
stuff = stuff[:stuff.index(".")]
Now stuff is very useful :).
If we are talking about a file containing multiple lines like that, you can do it for each line: split the line at "," and put everything in a dictionary.
data = {}
with open("./test.txt") as f:
    for i, line in enumerate(f.read().split("\n")):
        csv_line = line[:line.index(".")]
        for j, col in enumerate(csv_line.split(",")):
            data[(i, j)] = col
How one would do this
Notice that most people would not want to do this by hand. Working on tabular data is a common task, and there is a library called pandas for that. Maybe it would be a good idea to familiarise yourself a bit more with Python before you dive into pandas, though. I think a good point to start is this. Using pandas, your task would look like this:
import pandas as pd
pd.read_csv("./test.txt", comment=".")
giving you what is called a dataframe.
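If you go that route, a small follow-up sketch (assuming each line has a single useful field before the first "." and that the file has no header row) shows how to get the cleaned values back out of the dataframe:
import pandas as pd

# comment="." makes pandas ignore everything after the first "." on each line
df = pd.read_csv("./test.txt", comment=".", header=None, names=["useful"])
print(df["useful"].tolist())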

Read lines from doc using python and compare to a set of words

There's a doc ex.doc with contents like
The doctor set a due date of August 17th.
Now I want to read this line by line using CPython 2.7 and divide each line into an array of words, like this:
['The','doctor','set','a','due','date','of','August','17th','.']
It would be even better if special characters and numbers came out as separate tokens.
Then I would like to compare each word to an existing set of data like
['January'..........]
to figure out which month (or which profession, e.g. searching for doctor in a list of professions) the given line is talking about.
Please try to help with simple and clean code.
Here's my progress:
with open("ex.doc") as file:
    my_list = file.readlines()
my_list = [x.strip() for x in my_list]
mt = ['Jan', 'Feb']
mtPresent = 0
for lines in my_list:
    for words in lines:
        if words in mt:
            mtPresent = 1
    if mtPresent == 1:
        print(lines)
        # if line contains jan or feb then only that line will be displayed
    mtPresent = 0
I should also be able to compare a word such as ':' against a list like ['is', ':'].
Thanks !
So, your question contains two problems. The first is splitting a sentence into its words. There are a number of resources out there on how best to do tokenization, as it is called, but for now something like this will do:
import re
text = "The doctor set a due date of August 17th."
words = set(re.findall(r'\w+', text))
Because you only care about which words occur and not their order, I chose to use the set datatype, since finding elements in it is faster and more convenient in python.
Now, given a set of months and professions that you want to find, you can simply do the following:
months = {"August", "February"} # or whatever you need
professions = {"doctor", "carpenter"} # same here
month = months.intersection(words)
profession = professions.intersection(words)
which will print
print(month)
>> {'August'}
print(profession)
>> {'doctor'}
If you have questions regarding regexes, sets, or anything else, please feel free to ask.
Edit:
If the .docx file you extract the text from is a simple, unformatted, line-based file, this code should be able to convert it into a list of strings that is easier to work with:
from docx import Document

text_lines = []
for line in Document("demo.docx").paragraphs:
    text_lines.append(line.text)
The docx library can be installed on Python 2.7 through 3.4 with
pip install python-docx
If you have any say in the project, I would advise not storing information that is supposed to be machine-readable in something as unwieldy as docx. Plain .txt is, in my experience, preferable.

What is the fastest performance tuple for large data sets in python?

Right now, I'm basically running through an excel sheet.
I have about 20 names and then I have 50k total values that match to one of those 20 names, so the excel sheet is 50k rows long, column B showing any random value, and column A showing one of the 20 names.
I'm trying to get a string for each of the names that shows all of its values.
Name A: 123,244,123,523,123,5523,12505,142... etc etc.
Name B: 123,244,123,523,123,5523,12505,142... etc etc.
Right now, I have a loop that runs through the excel sheet and checks if the name is already in the dictionary; if it is, then it does a
strA = strA + "," + foundValue
Then it inserts strA back into the dictionary for that particular name. If the name doesn't exist, it creates that dictionary key and then adds that value to it.
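Roughly, what I am doing looks like this (a sketch, with made-up names for how I read the sheet):
rows = [("Name A", 123), ("Name A", 244), ("Name B", 523)]   # stand-in for the 50k sheet rows
results = {}
for name, value in rows:
    foundValue = str(value)
    if name in results:
        # build one long comma-separated string per name (this is the slow part)
        results[name] = results[name] + "," + foundValue
    else:
        results[name] = foundValue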
Now, this was working well at first, but it's been about 15 or 20 minutes and only 5k values have been added to the dictionary so far, and it seems to get slower the longer it runs.
I wonder if there is a better or faster way to do this. I was thinking of building a new dictionary every 1k values and then combining them all at the end, but that would be 50 dictionaries total and it sounds complicated (although maybe not, I'm not sure; maybe it could work better that way). In any case, the current approach seems not to work.
I DO need the string that shows each value with a comma between each value. That is why I am doing the string thing right now.
There are a number of things that are likely causing your program to run slowly.
String concatenation in python can be extremely inefficient when used with large strings.
Strings in Python are immutable. This fact frequently sneaks up and bites novice Python programmers on the rump. Immutability confers some advantages and disadvantages. In the plus column, strings can be used as keys in dictionaries and individual copies can be shared among multiple variable bindings. (Python automatically shares one- and two-character strings.) In the minus column, you can't say something like, "change all the 'a's to 'b's" in any given string. Instead, you have to create a new string with the desired properties. This continual copying can lead to significant inefficiencies in Python programs.
Considering each string in your example could contain thousands of characters, each time you do a concatenation, python has to copy that giant string into memory to create a new object.
This would be much more efficient:
strings = []
strings.append('string')
strings.append('other_string')
...
','.join(strings)
In your case, instead of each dictionary key storing a massive string, it should store a list, and you would just append each match to the list, and only at the very end would you do a string concatenation using str.join.
In addition, printing to stdout is also notoriously slow. If you're printing to stdout on each iteration of your massive 50,000 item loop, each iteration is being held up by the unbuffered write to stdout. Consider only printing every nth iteration, or perhaps writing to a file instead (file writes are normally buffered) and then tailing the file from another terminal.
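For example, a small sketch of printing progress only every 1000 rows (the counter and row source here are just stand-ins):
rows = range(50000)                        # stand-in for the 50k spreadsheet rows
for i, row in enumerate(rows):
    # ... do the real per-row work here ...
    if i % 1000 == 0:                      # report progress only every 1000 rows
        print("processed", i, "rows")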
This answer is based on OP's answer to my comment. I asked what he would do with the dict, suggesting that maybe he doesn't need to build it in the first place. simon replies:
"i add it to an excel sheet, so I take the KEY, which is the name, and put it in A1, then I take the VALUE, which is 1345,345,135,346,3451,35.. etc etc, and put that into A2. then I do the rest of my programming with that information...... but i need those values seperated by commas and acessible inside that excel sheet like that!"
So it looks like the dict doesn't have to be built after all. Here is an alternative: for each name, create a file, and store those files in a dict:
files = {}
name = 'John'  # let's say
if name not in files:
    files[name] = open(name, 'w')
Then when you loop over the 50k-row excel, you do something like this (pseudo-code):
for row in 50k_rows:
    name, value_string = row.split()  # or whatever
    file = files[name]
    file.write(value_string + ',')  # if it already ends with ',', no need to add one
Since your value_string is already comma separated, your file will be csv-like without any further tweaking on your part (except maybe you want to strip the last trailing comma after you're done). Then when you need the values, say, of John, just value = open('John').read().
Now I've never worked with 50k-row excels, but I'd be very surprised if this is not quite a bit faster than what you currently have. Having persistent data is also (well, maybe) a plus.
EDIT:
The solution above is oriented toward saving memory. Writing to files is much slower than appending to lists (but probably still faster than re-creating many large strings). But if the lists are huge (which seems likely) and you run into a memory problem (not saying you will), you can try the file approach.
An alternative, similar to lists in performance (at least for the toy test I tried) is to use StringIO:
from io import StringIO  # Python 2: from StringIO import StringIO

string_ios = {'John': StringIO()}  # a dict to store StringIO objects
for value in ['ab', 'cd', 'ef']:
    string_ios['John'].write(value + ',')
print(string_ios['John'].getvalue())
This will output 'ab,cd,ef,'
Instead of building a string that looks like a list, use an actual list and make the string representation you want out of it when you are done.
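A minimal sketch of that idea:
values = []
values.append(123)
values.append(244)
values.append(5523)
# build the comma-separated string once, at the very end
result = ",".join(str(v) for v in values)
print(result)   # -> 123,244,5523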
The proper way is to collect the values in lists and join them at the end, but if for some reason you want to stick with strings, you can speed up the repeated concatenation: pop the string out of the dict so that there is only one reference to it, which lets CPython's in-place string concatenation optimization kick in.
Demo:
>>> timeit('s = d.pop(k); s = s + "y"; d[k] = s', 'k = "x"; d = {k: ""}')
0.8417842664330237
>>> timeit('s = d[k]; s = s + "y"; d[k] = s', 'k = "x"; d = {k: ""}')
294.2475278390723
It depends on how you have read the excel file, but let's say the lines are read as (name, value) tuples or something similar:
d = {}
for name, foundValue in line_tuples:
    try:
        d[name].append(foundValue)
    except KeyError:
        d[name] = [foundValue]
d = {k: ",".join(v) for k, v in d.items()}
Alternatively using pandas:
import pandas as pd
df = pd.read_excel("some_excel_file.xlsx")
d = df.groupby("A")["B"].apply(lambda x: ",".join(x)).to_dict()

Python - Using eval() to put a CSV into a dictionary

Hey, I wonder if you can help me. I did some research on using eval() to read the lines of my CSV and then put them into a dictionary. The problem is that my CSV has 4 pieces of data: the name, the first score, the second score and the third score. How would I transfer this data from a CSV into a dictionary within Python so that later on I can check whether a user's name is in the dictionary and append to it or edit the scores?
I would like the key to be the name, and the scores to be kept in a list so they can be appended to or deleted later.
Thanks for your help.
There is a module in the Python standard library that will help you with reading/writing CSV files. Let me assume that your csv file looks like this:
Jim, 45, 78, 90
Mary, 100,98, 99
Molly, 78, 45,46
Mat, 76, 89, 95
Then:
import csv

scores = {}
with open('score.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        scores.setdefault(row[0], []).extend(row[1:])
This will create a dictionary scores with names as keys and a list of scores as values:
{'Mat': [' 76', ' 89', ' 95'], 'Jim': [' 45', ' 78', ' 90'], 'Molly': [' 78', ' 45', '46'], 'Mary': [' 100', '98', ' 99']}
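Note that the scores keep the leading spaces from the file. If you want clean integers instead, a small follow-up on the dictionary built above (assuming every score field is numeric):
# int() tolerates the surrounding whitespace, so ' 45' becomes 45
scores = {name: [int(s) for s in row_scores] for name, row_scores in scores.items()}
print(scores)
# e.g. {'Jim': [45, 78, 90], 'Mary': [100, 98, 99], ...}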
import csv
from collections import defaultdict

# Your target is a dictionary {name: [scores]}
scores = defaultdict(list)
with open(csvfilename) as csvfile:
    for row in csv.reader(csvfile):
        scores[row[0]].extend(row[1:])
I don't think eval is a good tool for this. It is really easy to introduce security vulnerabilities with it, as it will parse and execute whatever you pass it. As an exercise, think about why it may not be okay to execute data from some CSV file. Spoiler: your CSV file is a serialization format, and the talk Tom Eastman - Serialization formats are not toys - PyCon 2015 shows the dangers that may exist there. For bonus insight, look at the source of the collections module we imported defaultdict from and think about why that use of exec by Raymond Hettinger is different from using eval on data.
eval() is not what you want here, I don't think. eval() reads a string and interprets it as Python code; what you want is simple file I/O and string manipulation.
import numpy

# non-numpy possibilities available; dtype=str keeps names and scores as strings
data = numpy.genfromtxt("filename.csv", delimiter=",", dtype=str)
my_dict = {}
for row in data:
    my_dict[row[0]] = row[1:]
If you really, really want to do it with eval: Well, first, you shouldn't, unless you have a very good reason. Just parse the file as CSV, not as Python code. The right way to do that is with the csv module, as in Chris Wesseling's answer (or, if you're already using NumPy or Pandas, using their functions).
But if you really, really, really want to, can you?
Well, sometimes.
The most basic CSV dialect doesn't quote strings, so its lines aren't going to be valid as Python code. And some CSV dialects handle embedded quotes in ways that either aren't valid in Python, or mean something different.
But some dialects do happen to make at least most rows legal, and meaningful, as Python tuple literals consisting of Python str, int, and float literals. And for those dialects, technically, yes, you could parse them with eval, like this:
scores = {}
with open(path) as f:
    for line in f:
        name, *newscores = eval(line)
        scores.setdefault(name, []).extend(newscores)
But again, you shouldn't.
And even if you really, really, really want to do this, you should at least use literal_eval instead; it will handle all the same legal values that eval would without opening the big gaping security holes (e.g., someone putting __import__('os').system('rm -rf /') in a CSV) and painful-to-debug edge cases.
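For instance, under the same assumption that each row happens to be a valid Python tuple literal, the loop above could swap in literal_eval from the ast module:
from ast import literal_eval

scores = {}
with open(path) as f:
    for line in f:
        # literal_eval only accepts Python literals, so it cannot run arbitrary code
        name, *newscores = literal_eval(line)
        scores.setdefault(name, []).extend(newscores)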
But even with literal_eval, you don't want it. You want to parse the actual CSV dialect you have, not just treat it as a similar but different language and cross your fingers.

detecting and printing the difference between two text files using python 3.2

I have been trying to write a program in Python to read two text files and print the differences in their texts. The two files are similar except for the line numberings, which differ because of some comments that have been inserted. I have tried using the difflib module, but it is giving me errors.
import difflib
from difflib import *

temp3 = []
temp4 = []
with open("seqdetect", 'r') as f:
    with open("seqdetect_2", 'r') as g:
        for item in f:
            temp1 = item.split()
            temp3.append(temp1)
        for items in g:
            temp2 = items.split()
            temp4.append(temp2)
d = difflib.Differ()
diff = d.compare(temp3, temp4)
print('\n'.join(diff))
Could you please suggest an alternative.
Regards,
Mayank
Ok, I've tried out your code and found the issue.
The Differ.compare() method expects to be given two lists of strings, representing the lines of your two texts. However, because of your item.split() calls, your lists temp3 and temp4 are lists of lists of strings (the words of each line), not lists of strings.
I'm not sure exactly what you were wanting to do with that split, so I'm not sure what the best fix is. If you really do want it to tell you the individual words that have been added or removed, you can replace your calls to append() with extend() in the two for loops. But that doesn't seem very useful, frankly.
More likely the splitting is a mistake. Rather than looping over the lines in your files, just read them all into lists using readlines() and let the Differ do its work on them.
with open("seqdetect") as f, open("seqdetect_2") as g:
flines = f.readlines()
glines = g.readlines()
d = difflib.Differ()
diff = d.compare(flines, glines)
print("\n".join(diff))
If you want to do some filtering on what counts as a difference (ignoring whitespace differences, or whatever) you should explore the difflib documentation, and pass an appropriate function as the linejunk or charjunk parameters of the Differ's constructor.
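For example, a small sketch using the junk helpers difflib already ships with:
import difflib

# IS_LINE_JUNK skips blank or "#"-only lines; IS_CHARACTER_JUNK treats spaces/tabs as junk
d = difflib.Differ(linejunk=difflib.IS_LINE_JUNK,
                   charjunk=difflib.IS_CHARACTER_JUNK)
diff = d.compare(["a quick  fox\n"], ["a quick fox\n"])
print("".join(diff))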
