Email harvest with Python

I developed an application to harvest email addresses written in several formats from files.
Formats: ishani#dolly.lk
ishani(at)dit.dolly.lk
ishani at cs dot dolly dot edu
The problem is that the output list contains extra items besides the full extracted email addresses, and I couldn't figure out why. I have tried various approaches. I think the problem is in my regular expression or in my logic.
Here is my code:
import re
data = f.read()  # f is an already opened file object
regexp_email = r'(([\w]+)#([\w]+)([.])([\w]+[\w.]+))|(([\w]+)(\(at\))([\w]+)([.])([\w]+[\w.]+))|(([\w]+)(\sat\s)([\w-]+)(\sdot\s)([\w]+(\sdot\s[\w]+)))'
pattern = re.compile(regexp_email)
emailAddresses = re.findall(pattern, data)
print(emailAddresses)
The output looks like this:
[('ishani#sliit.lk', 'ishani', 'sliit', '.', 'lk', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('', '', '', '', '', 'ishani(at)dit.sliit.lk', 'ishani', '(at)', 'dit', '.', 'sliit.lk', '', '', '', '', '', '', ''), ('', '', '', '', '', '', '', '', '', '', '', 'ishani at cs dot dolly dot edu', 'ishani', ' at ', 'cs', ' dot ', 'dolly dot edu', ' dot edu')]
but I am expecting output like this:
['ishani#dolly.lk','ishani(at)dit.dolly.lk','ishani at cs dot dolly dot edu']
Is there a method anyone has tried that would solve my problem?

Change your regexp_email to this:
r'[\w]+#[\w]+[.][\w]+[\w.]+|[\w]+\(at\)[\w]+[.][\w]+[\w.]+|[\w]+\sat\s[\w-]+\sdot\s[\w]+\sdot\s[\w]+'
It doesn't seem that you need the capturing groups, so I have removed all of them.
You also don't need the [] around \w if \w is all you need to specify:
r'\w+#\w+[.]\w+[\w.]+|\w+\(at\)\w+[.]\w+[\w.]+|\w+\sat\s[\w-]+\sdot\s\w+\sdot\s\w+'
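As a quick check of the group-free pattern, here is a minimal sketch. The sample text and the names EMAIL_PATTERN, sample and addresses are made up for illustration. With no capturing groups in the pattern, re.findall returns the full matches as plain strings.

```python
import re

# The group-free pattern from above, split over three lines for readability.
EMAIL_PATTERN = re.compile(
    r'\w+#\w+[.]\w+[\w.]+'
    r'|\w+\(at\)\w+[.]\w+[\w.]+'
    r'|\w+\sat\s[\w-]+\sdot\s\w+\sdot\s\w+'
)

# Hypothetical input text containing the three address styles.
sample = (
    "contact ishani#dolly.lk or ishani(at)dit.dolly.lk, "
    "or write to ishani at cs dot dolly dot edu today"
)

# Without capturing groups, findall yields one string per match.
addresses = EMAIL_PATTERN.findall(sample)
print(addresses)
```

Note that the third alternative only covers names with exactly two " dot " separators, as in the original pattern.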

You could just skip the blanks
print [e for ea in emailAddresses for e in ea if e]
which produces
['ishani#sliit.lk', 'ishani', 'sliit', '.', 'lk', 'ishani(at)dit.sliit.lk', 'ishani', '(at)', 'dit', '.', 'sliit.lk', 'ishani at cs dot dolly dot edu', 'ishani', ' at ', 'cs', ' dot ', 'dolly dot edu', ' dot edu']
which isn't exactly what you asked for...
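If you would rather keep the capturing groups, one possible alternative (a sketch, not part of the approach above) is to iterate over match objects with finditer and take group(0), which is always the whole match. The data string below is made up to stand in for the file contents.

```python
import re

# The original pattern from the question, capturing groups included.
regexp_email = r'(([\w]+)#([\w]+)([.])([\w]+[\w.]+))|(([\w]+)(\(at\))([\w]+)([.])([\w]+[\w.]+))|(([\w]+)(\sat\s)([\w-]+)(\sdot\s)([\w]+(\sdot\s[\w]+)))'
pattern = re.compile(regexp_email)

# Hypothetical stand-in for f.read().
data = "ishani#dolly.lk\nishani(at)dit.dolly.lk\nishani at cs dot dolly dot edu\n"

# group(0) is the full match, regardless of how many groups the pattern has.
full_matches = [m.group(0) for m in pattern.finditer(data)]
print(full_matches)
```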

Related

Loop Through List of Lists in Python and Merge Entries

I'm in the middle of cleaning up some data in Python. I get a load of lists with 6 entries which I eventually want to put into a dataframe, however before doing that I'd like to loop through and check if entry 1 in the list is the only non-empty string in the list. For instance I have a list called transaction_list:
transaction_list = [
['01Oct', 'CMS ACC TRF /MISC CREDIT, AG', '', '', '5,000.00', '11,377.00'],
['', 'MULTI RESOURCES LIMITED,', '', '', '', ''],
['', 'CMS19274001077, 1094175DAAA107', '', '', '', ''],
['01Oct', 'INTER ACC CREDIT, SY', '', '', '1,000.00', '12,732.07'],
['', 'WTA CO3 (MAUR) LIMITED,', '', '', '', ''],
['', 'CMS19274009397, 729981UAAA298', '', '', '', ''],
['01Oct', 'HOUSE CHEQUE PRESENTED, , 639584,', '639584', '400.00', '', '12,332.07'],
['01Oct', 'CHEQUE PROCESSING FEE, , ,', '', '0.50', '', '12,331.57']
]
I basically need to loop through and store each list in memory. However, if a list has only entry 1 populated and all other entries are empty strings, I want to append that entry, separated by a line break, to entry 1 of the previous kept list and then delete the list. The final result should look something like this:
transaction_list = [
['01Oct', 'CMS ACC TRF /MISC CREDIT, AG \n MULTI RESOURCES LIMITED \n CMS19274001077, 1094175DAAA107',
'', '', '5,000.00', '11,377.00'],
['01Oct', 'INTER ACC CREDIT, SY \n WTA CO3 (MAUR) LIMITED \n CMS19274009397, 729981UAAA298',
'', '', '1,000.00', '12,732.07'],
['01Oct', 'HOUSE CHEQUE PRESENTED, , 639584,',
'639584', '400.00', '', '12,332.07'],
['01Oct', 'CHEQUE PROCESSING FEE, , ,', '', '0.50', '', '12,331.57']
]
I've been cracking my head all night on this with no luck.
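Once you find the right groupby key, the rest can be done with a custom agg function.
The group key is the cumulative sum of a flag marking the rows where column 0 is non-null, so every continuation row inherits the number of the row that started its group: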
import pandas as pd
import numpy as np
transaction_list = [
['01Oct', 'CMS ACC TRF /MISC CREDIT, AG', '', '', '5,000.00', '11,377.00'],
['', 'MULTI RESOURCES LIMITED,', '', '', '', ''],
['', 'CMS19274001077, 1094175DAAA107', '', '', '', ''],
['01Oct', 'INTER ACC CREDIT, SY', '', '', '1,000.00', '12,732.07'],
['', 'WTA CO3 (MAUR) LIMITED,', '', '', '', ''],
['', 'CMS19274009397, 729981UAAA298', '', '', '', ''],
['01Oct', 'HOUSE CHEQUE PRESENTED, , 639584,', '639584', '400.00', '', '12,332.07'],
['01Oct', 'CHEQUE PROCESSING FEE, , ,', '', '0.50', '', '12,331.57']
]
df = pd.DataFrame(transaction_list)
df = df.replace('',np.nan)
df.groupby((~df[0].isnull()).cumsum()).agg({0: 'first',
                                            1: lambda x: '\n'.join(x),
                                            2: 'first',
                                            3: 'first',
                                            4: 'first',
                                            5: 'first'}).fillna('').values.tolist()
Output
[['01Oct',
'CMS ACC TRF /MISC CREDIT, AG\nMULTI RESOURCES LIMITED,\nCMS19274001077, 1094175DAAA107',
'',
'',
'5,000.00',
'11,377.00'],
['01Oct',
'INTER ACC CREDIT, SY\nWTA CO3 (MAUR) LIMITED,\nCMS19274009397, 729981UAAA298',
'',
'',
'1,000.00',
'12,732.07'],
['01Oct',
'HOUSE CHEQUE PRESENTED, , 639584,',
'639584',
'400.00',
'',
'12,332.07'],
['01Oct', 'CHEQUE PROCESSING FEE, , ,', '', '0.50', '', '12,331.57']]
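To see why a cumulative sum works as a grouping key, here is a small standard-library sketch that mimics (~df[0].isnull()).cumsum() on the first column of the sample data. It uses itertools.accumulate instead of pandas, so it only illustrates the idea; the names first_column and group_ids are made up for this example.

```python
from itertools import accumulate

# First entry of each row in transaction_list ('' marks a continuation row).
first_column = ['01Oct', '', '', '01Oct', '', '', '01Oct', '01Oct']

# 1 where a new record starts, 0 for continuation rows; the running total
# then gives every row the number of the record it belongs to.
group_ids = list(accumulate(int(value != '') for value in first_column))
print(group_ids)  # [1, 1, 1, 2, 2, 2, 3, 4]
```

Rows that share a number end up in the same group, which is what lets agg join their descriptions.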
Try the following code snippet. It checks the required condition, concatenates the string, deletes that inner list. It only increments the index if the list was not deleted, else keeps it the same so looping isn't disrupted.
i = 0
while i < len(transaction_list):
    if transaction_list[i][0] == '':
        # Continuation row: append its text to the previous row, then drop it
        transaction_list[i-1][1] = transaction_list[i-1][1] + ' \n ' + transaction_list[i][1]
        del transaction_list[i]
    else:
        i += 1
print(transaction_list)
Hope it Helps!
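If you prefer not to delete from the list while looping over it, a variant that builds a new list is sketched below. It is only one possible take: the function name merge_continuation_rows and the shortened sample are made up, and it applies the stricter test from the question (entry 1 is the only non-empty entry).

```python
def merge_continuation_rows(rows, sep='\n'):
    """Return a new list in which continuation rows are folded into
    the description (entry 1) of the row that precedes them."""
    merged = []
    for row in rows:
        # A continuation row has nothing but entry 1 populated.
        is_continuation = row[0] == '' and all(cell == '' for cell in row[2:])
        if is_continuation and merged:
            merged[-1][1] += sep + row[1]
        else:
            merged.append(list(row))  # copy, so the input stays untouched
    return merged


# Shortened sample taken from the question.
rows = [
    ['01Oct', 'CMS ACC TRF /MISC CREDIT, AG', '', '', '5,000.00', '11,377.00'],
    ['', 'MULTI RESOURCES LIMITED,', '', '', '', ''],
    ['', 'CMS19274001077, 1094175DAAA107', '', '', '', ''],
    ['01Oct', 'CHEQUE PROCESSING FEE, , ,', '', '0.50', '', '12,331.57'],
]
merged = merge_continuation_rows(rows)
print(merged)
```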

Python FutureWarning. new syntax?

Python raises a FutureWarning about my syntax and says that soon it will no longer be possible to write it this way. Can you please tell me how to change the function?
def cleaning_name1(data):
    """Cleaning name1 minus brand + space+articul + art + round brackets """
    data['name1'] = data['name1'].str.split('артикул').str[0]
    data['name1'] = data['name1'].str.split('арт').str[0]
    data['name1'] = (data['name1'].str.replace('brand ', '', )
                     .str.replace(' ', '', ).str.replace('(', '', ).str.replace(')', '', ))
    return data
In the pandas versions that emit this warning, .str.replace() defaults to regex=True. The default is changing to regex=False (it did change in pandas 2.0), so you should make the choice explicit in your calls.
data['name1'] = (data['name1'].str.replace('brand ', '', regex=False)
                 .str.replace(' ', '', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False))
Although in your case, it would be better to use a regular expression, so you can do all the replacements in a single call:
data['name1'] = data['name1'].str.replace('brand|[ ()]', '', regex=True)
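The single pattern can be compared with the chained literal replacements using plain strings and the standard re module, which pandas relies on in regex mode. This is a rough sketch with made-up sample strings and function names. It also shows one difference: the pattern removes "brand" even when no space follows it.

```python
import re


def clean_chained(text):
    # Four literal replacements, mirroring the regex=False calls.
    return (text.replace('brand ', '').replace(' ', '')
                .replace('(', '').replace(')', ''))


def clean_regex(text):
    # One pass: drop the word 'brand', spaces and round brackets.
    return re.sub(r'brand|[ ()]', '', text)


sample = 'brand Foo (red) 12'   # hypothetical product name
print(clean_chained(sample))    # Foored12
print(clean_regex(sample))      # Foored12

# The two differ when 'brand' is not followed by a space.
print(clean_chained('brandX'))  # brandX
print(clean_regex('brandX'))    # X
```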

How can I insert a value into a numpy record array?

This question has been edited to make more sense.
The original question was how to insert values into a numpy record array. I have had some success but still have an issue. Based on the website below, I have been inserting values into a record array.
Python code:
import numpy as np
instance_format = {
    'names': ('name', 'offset'),
    'formats': ('U100', 'U30')}
instance = np.zeros(20, dtype=instance_format)
# I am placing values in the array similar to this
instance[0]['name'] = "Wire 1"
instance[1]['name'] = "Wire 2"
instance[2]['name'] = "Wire 3"
instance[0]['offset'] = "0x103"
instance[1]['offset'] = "0x104"
instance[2]['offset'] = "0x105"
# Here is the insertion statement that works
instance1 = np.insert(instance1, 1, "Module one")
print(instance1)
Output
[('One Wire 1', '0x103')
('Module One', 'Module One')
('One Wire 2', '0x104')
('One Wire 3', '0x105')
So the insert statement works; however, it inserts the value into both the name and the offset fields. I want to insert it only into the name field. How do I do this?
Thanks
Your instance
In [470]: instance
Out[470]:
array([('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''),
('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''),
('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''),
('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''),
('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')],
dtype=[('name', '<U100'), ('module', '<U100'), ('offset', '<U30')])
does not look like
['One Wire Instance 1', 'One Wire Instance 2', 'One Wire Instance 3']
Are you talking about one record of instance, which would display as
('One Wire Instance 1', 'One Wire Instance 2', 'One Wire Instance 3')
with each string being the name, module, and offset.
Or are these 3 strings e.g. instance['name'][:3], the 'name' field from 3 records?
Inserting a new record into the instance array is one thing, adding a new field to the array is quite another.
To use np.insert with a structured array, you need to provide a 1-element array with the correct dtype.
With your new instance:
In [580]: newone = np.array(("module one",'',''),dtype=instance.dtype)
In [581]: newone
Out[581]:
array(('module one', '', ''),
dtype=[('name', '<U100'), ('module', '<U100'), ('offset', '<U30')])
In [582]: np.insert(instance,1,newone)
Out[582]:
array([('Wire 1', '', '0x103'), ('module one', '', ''),
('Wire 2', '', '0x104'), ('Wire 3', '', '0x105')],
dtype=[('name', '<U100'), ('module', '<U100'), ('offset', '<U30')])
np.insert is just a function that performs these steps:
In [588]: instance2 = np.zeros((4,),dtype=instance.dtype)
In [589]: instance2[:1]=instance[:1]
In [590]: instance2[2:]=instance[1:3]
In [591]: instance2
Out[591]:
array([('Wire 1', '', '0x103'), ('', '', ''), ('Wire 2', '', '0x104'),
('Wire 3', '', '0x105')],
dtype=[('name', '<U100'), ('module', '<U100'), ('offset', '<U30')])
In [592]: instance2[1]=newone
In [593]: instance2
Out[593]:
array([('Wire 1', '', '0x103'), ('module one', '', ''),
('Wire 2', '', '0x104'), ('Wire 3', '', '0x105')],
dtype=[('name', '<U100'), ('module', '<U100'), ('offset', '<U30')])
It creates a new array of the correct target size, copies elements from the original array, and puts the new array into the empty slot.
I can't understand what you mean by:
I want to insert the name "Reserved" in the second element which would make the array have the following contents
['One Wire Instance 1','Reserved' , 'One Wire Instance 2', 'One Wire Instance 3']
Do you want:
instance[1] = 'Reserved','', ''
?
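For the two-field dtype shown in the question, the same idea can be sketched as follows. This is a minimal example, assuming only three records are in use; the names wires, new_record and result are made up for illustration.

```python
import numpy as np

instance_format = {'names': ('name', 'offset'), 'formats': ('U100', 'U30')}

# Three populated records, filled one field at a time.
wires = np.zeros(3, dtype=instance_format)
wires['name'] = ['Wire 1', 'Wire 2', 'Wire 3']
wires['offset'] = ['0x103', '0x104', '0x105']

# A 1-element array with the same dtype; the offset field is left empty.
new_record = np.array([('Module one', '')], dtype=wires.dtype)

# np.insert returns a new array; wires itself is not modified.
result = np.insert(wires, 1, new_record)
print(result)
```

Because the inserted value already has the array's dtype, only the name field receives "Module one" and the offset stays an empty string.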

create dictionary from csv file using a key within row in python

How would I create a dictionary from a CSV file if the key is the last item (index 9) in every row? For example:
,,,,,,,,,KEY_1
,,,,,,,,,KEY_1
,,,,,,,,,KEY_1
,,,,,,,,,KEY_2
,,,,,,,,,KEY_2
,,,,,,,,,KEY_2
,,,,,,,,,KEY_3
,,,,,,,,,KEY_3
,,,,,,,,,KEY_3
Is there a way to create a dictionary that would look like this:
dictt = {
'KEY_1':[,,,,,,,,], [,,,,,,,,], [,,,,,,,,],
'KEY_2':[,,,,,,,,], [,,,,,,,,], [,,,,,,,,],
'KEY_3':[,,,,,,,,], [,,,,,,,,], [,,,,,,,,],
}
I only have 6 months of self-taught Python and I am working through the growing pains. Any help is greatly appreciated. Thank you in advance.
In answer to your "is it possible" question, one must say "not quite", because no Python construct matches the syntax you show:
dictt = {
'KEY_1':[,,,,,,,,], [,,,,,,,,], [,,,,,,,,],
'KEY_2':[,,,,,,,,], [,,,,,,,,], [,,,,,,,,],
'KEY_3':[,,,,,,,,], [,,,,,,,,], [,,,,,,,,],
}
Entering this would be a syntax error, and no code can thus build the equivalent.
But if you actually mean, e.g,
dictt = {
'KEY_1':[['','',,,,,,,], [,,,,,,,,], [,,,,,,,,]],
'KEY_2':[[,,,,,,,,], [,,,,,,,,], [,,,,,,,,]],
'KEY_3':[[,,,,,,,,], [,,,,,,,,], [,,,,,,,,]],
}
(and so on replacing each ,, to have something inside, e.g an empty string -- not gonna spend a long time editing this to fix it!-), then sure, it is possible.
E.g:
import collections
import csv
dictt = collections.defaultdict(list)
with open('some.csv') as f:
    r = csv.reader(f)
    for row in r:
        # key on the last field, store the remaining fields
        dictt[row[-1]].append(row[:-1])
When this is done dictt will be an instance of collections.defaultdict (a subclass of dict) but you can use it as a dict. Or if you absolutely insist on its being a dict and not a subclass thereof (though there is no conceivably good reason to thus insist), follow up with
dictt = dict(dictt)
and voila, it's converted:-)
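Here is a self-contained sketch of the same approach that can be run without a file on disk. The three-column sample data and the names sample and groups are invented for illustration; io.StringIO stands in for the opened CSV file.

```python
import collections
import csv
import io

# Hypothetical CSV content: two data columns, key in the last column.
sample = io.StringIO("a,b,KEY_1\nc,d,KEY_1\ne,f,KEY_2\n")

groups = collections.defaultdict(list)
for row in csv.reader(sample):
    if not row:
        continue  # skip blank lines
    groups[row[-1]].append(row[:-1])

# Optional: convert to a plain dict once grouping is finished.
groups = dict(groups)
print(groups)
```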
Another way:
txt='''\
,,,,,,,,,KEY_1
,,,,,,,,,KEY_1
,,,,,,,,,KEY_1
,,,,,,,,,KEY_2
,,,,,,,,,KEY_2
,,,,,,,,,KEY_2
,,,,,,,,,KEY_3
,,,,,,,,,KEY_3
,,,,,,,,,KEY_3
'''
import csv
result={}
for line in csv.reader(txt.splitlines()):
    result.setdefault(line[-1], []).append(line[:-1])
>>> result
{'KEY_1': [['', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '']], 'KEY_3': [['', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '']], 'KEY_2': [['', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '']]}

Python: extracting patterns from CSV [closed]

Below is a sample of the typical contents of a CSV file.
['**05:32:55PM**', '', '', 'Event Description', '0', "89.0 near Some Street; Suburb Ext 3; in Town Park; [**Long 37\xb0 14' 34.8 E Lat 29\xb0", '']
['', '', '', '', '', "17' 29.1 S** ]", '']
['06:09:11PM', '', '', 'Event Description', '0', "89.0 near Someother Street; Suburb Ext 3; in Town Park; [Long 37\xb0 14' 34.9 E Lat 29\xb0", '']
['', '', '', '', '', "17' 29.1 S ]", '']
['Report Line Header ', '', '', '', '', '', '']
['HeaderX', ': HeaderY', '', 'HeaderZ', '', 'HeaderAA', '']
['From Date', ': 2014/01/17 06:00:00 AM', '', 'To Date : 2014/01/17 06:15:36 PM', '', 'HeaderBB', '']
['HeaderA', 'HeaderB', 'Header0', 'Header1', 'Header2', 'Header3', '']
['', '', '', '', 'Header 4', 'Header5', '']
From each line containing the date/time and the location (marked with ** -- **), I would like to extract just that relevant information and ignore the rest.
Even just printing the results to the screen would be OK; ideally, I would create a new CSV containing only the time and lat/long.
If you really want to extract the data of this file formatted as in your example, then you could use the following since the data in every line has a list representation:
>>> import ast
>>> f = open('data.txt', 'r')
>>> lines = f.readlines()
>>> for line in lines:
...     list_representation_of_line = ast.literal_eval(line)
...     for element in list_representation_of_line:
...         if element.startswith('**') and element.endswith('**'):
...             print list_representation_of_line
...             # or print single fields, e.g. timeIndex = 0 or another index
...             # print list_representation_of_line[timeIndex]
...             break
...
...
['**05:32:55PM**', '', '', 'Event Description', '0', "89.0 near Some Street; Suburb Ext 3; in Town Park; [**Long 37\xb0 14' 34.8 E Lat 29\xb0", '']
>>>
Otherwise, you should reformat your data as CSV.
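Going one step further, the time and the coordinates could be pulled out roughly like this. It is only a sketch under several assumptions: every line is the repr of a 7-item list as in the sample, the location sits in item 5, and its second half is always on the next line. The names COORDS and extract_events are made up, and the ** markers are treated as optional.

```python
import ast
import re

# Two lines copied from the sample; each is the repr of a Python list.
raw_lines = [
    r"""['**05:32:55PM**', '', '', 'Event Description', '0', "89.0 near Some Street; Suburb Ext 3; in Town Park; [**Long 37\xb0 14' 34.8 E Lat 29\xb0", '']""",
    r"""['', '', '', '', '', "17' 29.1 S** ]", '']""",
]

# Longitude follows 'Long', latitude follows 'Lat' up to the closing bracket.
COORDS = re.compile(r"Long\s+(.*?)\s+Lat\s+(.*?)\s*\**\s*\]")


def extract_events(lines):
    rows = [ast.literal_eval(line) for line in lines]
    events = []
    for i, row in enumerate(rows):
        if "Long" not in row[5]:
            continue  # header or continuation line
        location = row[5]
        # Assumed layout: the rest of the location is on the next line.
        if i + 1 < len(rows) and rows[i + 1][0] == "":
            location += " " + rows[i + 1][5]
        match = COORDS.search(location)
        if match:
            events.append((row[0].strip("*"), match.group(1), match.group(2)))
    return events


events = extract_events(raw_lines)
print(events)
```

Each tuple holds the time, the longitude and the latitude, so the list could be passed to csv.writer's writerows to produce the new file.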
If that's really what your CSV file looks like, I wouldn't even bother. It's got different data on different rows, and a huge mess of nested ad-hoc strings, with separators within separators.
Even once you get to your lat and long figures, they look like a bizarre mix of decimal, hex and character data.
I think you'd be asking for trouble by giving the impression that you can deal with data in that format programmatically. If it's just a once off task, and that's the extent of the data, I'd do it by hand.
If not, I think the correct solution is to push back and try to get some cleaner data.
