Python: extracting patterns from CSV [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
Below is a sample of the typical contents of a CSV file.
['**05:32:55PM**', '', '', 'Event Description', '0', "89.0 near Some Street; Suburb Ext 3; in Town Park; [**Long 37\xb0 14' 34.8 E Lat 29\xb0", '']
['', '', '', '', '', "17' 29.1 S** ]", '']
['06:09:11PM', '', '', 'Event Description', '0', "89.0 near Someother Street; Suburb Ext 3; in Town Park; [Long 37\xb0 14' 34.9 E Lat 29\xb0", '']
['', '', '', '', '', "17' 29.1 S ]", '']
['Report Line Header ', '', '', '', '', '', '']
['HeaderX', ': HeaderY', '', 'HeaderZ', '', 'HeaderAA', '']
['From Date', ': 2014/01/17 06:00:00 AM', '', 'To Date : 2014/01/17 06:15:36 PM', '', 'HeaderBB', '']
['HeaderA', 'HeaderB', 'Header0', 'Header1', 'Header2', 'Header3', '']
['', '', '', '', 'Header 4', 'Header5', '']
From each line containing the date/time and the location (marked above with ** -- **), I would like to extract just that relevant information and ignore the rest.
Printing the results to the screen would be fine, but ideally I would create a new CSV containing only the time and lat/long.

If you really want to extract the data from this file formatted as in your example, you could use the following, since every line of the data has a list representation:
>>> import ast
>>> f = open('data.txt', 'r')
>>> lines = f.readlines()
>>> for line in lines:
...     list_representation_of_line = ast.literal_eval(line)
...     for element in list_representation_of_line:
...         if element.startswith('**') and element.endswith('**'):
...             print list_representation_of_line
...             # or print single fields, e.g. timeindex = 0 or another index
...             # print list_representation_of_line[timeindex]
...             break
...
['**05:32:55PM**', '', '', 'Event Description', '0', "89.0 near Some Street; Suburb Ext 3; in Town Park; [**Long 37\xb0 14' 34.8 E Lat 29\xb0", '']
>>>
Otherwise, you should reformat your data as proper CSV.
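If the two-line pattern in the sample really holds, a rough sketch of pulling the time and lat/long out of one event could look like this (the field indices, the in-memory rows, and the strip() cleanup are assumptions based on the sample data, not a tested general solution):

```python
# Sketch only: assumes each event spans two list rows, with the time in
# field 0 of the first row and the location split across field 5 of both
# rows, as in the sample data above.
row1 = ['**05:32:55PM**', '', '', 'Event Description', '0',
        "89.0 near Some Street; Suburb Ext 3; in Town Park; [**Long 37\xb0 14' 34.8 E Lat 29\xb0", '']
row2 = ['', '', '', '', '', "17' 29.1 S** ]", '']

time = row1[0].strip('*')
# join the two halves of the location and strip the markers/brackets
location = (row1[5].split('[', 1)[1] + ' ' + row2[5]).strip('* ]')
print([time, location])
```

A csv.writer could then emit [time, location] rows to a new file.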

If that's really what your CSV file looks like, I wouldn't even bother. It's got different data on different rows, and a huge mess of nested ad-hoc strings, with separators within separators.
Even once you get to your lat and long figures, they look like a bizarre mix of decimal, hex and character data.
I think you'd be asking for trouble by giving the impression that you can deal with data in that format programmatically. If it's just a one-off task, and that's the extent of the data, I'd do it by hand.
If not, I think the correct solution is to push back and try to get some cleaner data.

Loop Through List of Lists in Python and Merge Entries

I'm in the middle of cleaning up some data in Python. I get a load of lists with 6 entries each, which I eventually want to put into a dataframe. Before doing that, however, I'd like to loop through and check whether entry 1 is the only non-empty string in a list. For instance, I have a list called transaction_list:
transaction_list = [
['01Oct', 'CMS ACC TRF /MISC CREDIT, AG', '', '', '5,000.00', '11,377.00'],
['', 'MULTI RESOURCES LIMITED,', '', '', '', ''],
['', 'CMS19274001077, 1094175DAAA107', '', '', '', ''],
['01Oct', 'INTER ACC CREDIT, SY', '', '', '1,000.00', '12,732.07'],
['', 'WTA CO3 (MAUR) LIMITED,', '', '', '', ''],
['', 'CMS19274009397, 729981UAAA298', '', '', '', ''],
['01Oct', 'HOUSE CHEQUE PRESENTED, , 639584,', '639584', '400.00', '', '12,332.07'],
['01Oct', 'CHEQUE PROCESSING FEE, , ,', '', '0.50', '', '12,331.57']
]
I basically need to loop through and keep each list; however, if a list has only entry 1 populated and the other entries are empty strings, I want to merge that entry, via a line break, into entry 1 of the previous list and then delete that list. So the final result should look something like this:
transaction_list = [
['01Oct', 'CMS ACC TRF /MISC CREDIT, AG \n MULTI RESOURCES LIMITED \n CMS19274001077, 1094175DAAA107',
'', '', '5,000.00', '11,377.00'],
['01Oct', 'INTER ACC CREDIT, SY \n WTA CO3 (MAUR) LIMITED \n CMS19274009397, 729981UAAA298',
'', '', '1,000.00', '12,732.07'],
['01Oct', 'HOUSE CHEQUE PRESENTED, , 639584,',
'639584', '400.00', '', '12,332.07'],
['01Oct', 'CHEQUE PROCESSING FEE, , ,', '', '0.50', '', '12,331.57']
]
I've been cracking my head all night on this with no luck.
Once you find the right groupby level, the rest can be accomplished with a custom agg function.
The groups can be determined with the cumsum of the non-null values in column 0:
import pandas as pd
import numpy as np
transaction_list = [
['01Oct', 'CMS ACC TRF /MISC CREDIT, AG', '', '', '5,000.00', '11,377.00'],
['', 'MULTI RESOURCES LIMITED,', '', '', '', ''],
['', 'CMS19274001077, 1094175DAAA107', '', '', '', ''],
['01Oct', 'INTER ACC CREDIT, SY', '', '', '1,000.00', '12,732.07'],
['', 'WTA CO3 (MAUR) LIMITED,', '', '', '', ''],
['', 'CMS19274009397, 729981UAAA298', '', '', '', ''],
['01Oct', 'HOUSE CHEQUE PRESENTED, , 639584,', '639584', '400.00', '', '12,332.07'],
['01Oct', 'CHEQUE PROCESSING FEE, , ,', '', '0.50', '', '12,331.57']
]
df = pd.DataFrame(transaction_list)
df = df.replace('', np.nan)
df.groupby((~df[0].isnull()).cumsum()).agg({0: 'first',
                                            1: lambda x: '\n'.join(x),
                                            2: 'first',
                                            3: 'first',
                                            4: 'first',
                                            5: 'first'}).fillna('').values.tolist()
Output
[['01Oct',
'CMS ACC TRF /MISC CREDIT, AG\nMULTI RESOURCES LIMITED,\nCMS19274001077, 1094175DAAA107',
'',
'',
'5,000.00',
'11,377.00'],
['01Oct',
'INTER ACC CREDIT, SY\nWTA CO3 (MAUR) LIMITED,\nCMS19274009397, 729981UAAA298',
'',
'',
'1,000.00',
'12,732.07'],
['01Oct',
'HOUSE CHEQUE PRESENTED, , 639584,',
'639584',
'400.00',
'',
'12,332.07'],
['01Oct', 'CHEQUE PROCESSING FEE, , ,', '', '0.50', '', '12,331.57']]
Try the following code snippet. It checks the required condition, concatenates the string, and deletes that inner list. It only increments the index if no list was deleted; otherwise it keeps the index the same, so the looping isn't disrupted.
i = 0
while i < len(transaction_list):
    if transaction_list[i][0] == '':
        transaction_list[i-1][1] = transaction_list[i-1][1] + ' \n ' + transaction_list[i][1]
        del transaction_list[i]
    else:
        i += 1
print(transaction_list)
Hope it helps!

Python FutureWarning: new syntax?

Python warns about my syntax and says that soon it will be impossible to write it like that. Can you please tell me how to change the function?
def cleaning_name1(data):
    """Clean name1: remove brand + space + articul + art + round brackets."""
    data['name1'] = data['name1'].str.split('артикул').str[0]
    data['name1'] = data['name1'].str.split('арт').str[0]
    data['name1'] = (data['name1'].str.replace('brand ', '')
                     .str.replace(' ', '').str.replace('(', '').str.replace(')', ''))
    return data
Currently, .str.replace() defaults to regex=True. This is planned to change to regex=False in the future. You should make this explicit in your calls.
data['name1'] = (data['name1'].str.replace('brand ', '', regex=False)
                 .str.replace(' ', '', regex=False)
                 .str.replace('(', '', regex=False)
                 .str.replace(')', '', regex=False))
Although in your case, it would be better to use a regular expression, so you can do all the replacements in a single call:
data['name1'] = data['name1'].str.replace('brand|[ ()]', '', regex=True)
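As a quick sanity check of the combined pattern outside pandas (the sample string below is made up for illustration):

```python
import re

# 'brand Widget (blue)' is an invented example; the pattern removes the
# literal 'brand' plus any spaces and round brackets in one pass.
print(re.sub('brand|[ ()]', '', 'brand Widget (blue)'))  # → Widgetblue
```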

Using regex on Python to find any numerical value in an expression

I am trying to get all numerical values (integers, decimals, floats, scientific notation) from an expression, and I want to differentiate them from digits that are not really numbers but part of a name. For example, in the expression below
230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1
the first 230 is not a numerical value, as it is part of a tag (230FIC000.PV).
Using the web tool regexp.com, I came up with the following expression, which works for the expression above.
(?!\s)(?<!\w)[+-]?((\d+\.\d*)|(\.\d+)|(\d+))([eE][+-]?\d+)?(\s)|(?<!\w)[0-9]\d+(?<!\s)$
However, when I try to use the above expression in Python's re.findall(), I receive as the result a list of 5 tuples with 6 elements each.
import re
pat = r'(?!\s)(?<!\w)[+-]?((\d+\.\d*)|(\.\d+)|(\d+))([eE][+-]?\d+)?(\s)|(?<!\w)[0-9]\d+(?<!\s)$'
exp = '230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1 '
matches = re.findall(pat,exp)
The result is
[('2', '', '', '2', 'e3', ' '),
 ('20', '', '', '20', '', ' '),
 ('20.4', '20.4', '', '', '', ' '),
 ('45', '', '', '45', '', ' '),
 ('2', '', '', '2', 'e4', ' ')]
I would like some help understanding what is happening, and whether there is any way to get this done similarly to how it works on regexp.com.
This happens because re.findall() returns a tuple of the captured groups for each match whenever the pattern contains capturing groups, rather than the whole match. The pattern below uses only a non-capturing (?:...) group, so all the items it returns are plain strings. This should take care of it:
import re
st = '230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1'
re.findall(r'-?[0-9]+\.?[0-9]*(?:[Ee]\ *-?\ *[0-9]+)|-?\d+\.\d+|\b\d+\b', st)
referred: How to extract numbers from strings,
Extracting scientific numbers from string,
and Extracting decimal values from string
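Alternatively (a sketch, not part of the answer above), the question's original pattern can be kept as-is if every capturing group is turned into a non-capturing (?:...) group, so findall() returns whole matches:

```python
import re

# Same structure as the question's pattern, with all groups non-capturing:
pat = r'(?!\s)(?<!\w)[+-]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][+-]?\d+)?\s|(?<!\w)[0-9]\d+(?<!\s)$'
exp = '230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1 '
print([m.strip() for m in re.findall(pat, exp)])
# → ['-2e3', '20', '-20.4', '45', '-2E-4', '1']
```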

Preserve whitespaces when using split() and join() in python

I have a data file with columns like
BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77
and the individual columns are separated by a varying number of whitespaces.
My goal is to read in those lines, do some math on several rows, for example multiplying column 4 by .95, and write them out to a new file. The new file should look like the original one, except for the values that I modified.
My approach would be reading in the lines as items of a list. And then I would use split() on those rows I am interested in, which will give me a sublist with the individual column values. Then I do the modification, join() the columns together and write the lines from the list to a new text file.
The problem is those varying amounts of whitespace. I don't know how to reintroduce them in the same way I read them in. The only way I can think of is to count characters in each line before splitting it, which would be very tedious. Does someone have a better idea for tackling this problem?
You want to use re.split() in that case, with a group:
re.split(r'(\s+)', line)
would return both the columns and the whitespace so you can rejoin the line later with the same amount of whitespace included.
Example:
>>> re.split(r'(\s+)', line)
['BBP1', '   ', '0.000000', '  ', '-0.150000', '    ', '2.033000', '  ', '0.00', ' ', '-0.150', '   ', '1.77']
You probably do want to remove the newline from the end.
Another way to do this:
s = 'BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77'
s.split(' ')
>>> ['BBP1', '', '', '0.000000', '', '-0.150000', '', '', '', '2.033000', '', '0.00', '-0.150', '', '', '1.77']
If we pass the space character as the argument to split(), it creates the list without eating successive space characters, so the original number of space characters is restored by the 'join' function.
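To illustrate the round trip (the sample line is made up to match the example above):

```python
# split(' ') keeps empty strings standing in for the extra spaces, so
# ' '.join() reproduces the original spacing exactly.
s = 'BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77'
parts = s.split(' ')
print(' '.join(parts) == s)  # → True
```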
For lines that have whitespace at the beginning and/or end, a more robust pattern is (\S+) to split at non-whitespace characters:
import re
line1 = ' 4 426.2 orange\n'
line2 = '12 82.1 apple\n'
re_S = re.compile(r'(\S+)')
items1 = re_S.split(line1)
items2 = re_S.split(line2)
print(items1) # [' ', '4', ' ', '426.2', ' ', 'orange', '\n']
print(items2) # ['', '12', ' ', '82.1', ' ', 'apple', '\n']
These two lines have the same number of items after splitting, which is handy. The first and last items are always whitespace strings. These lines can be reconstituted using a join with a zero-length string:
print(repr(''.join(items1))) # ' 4 426.2 orange\n'
print(repr(''.join(items2))) # '12 82.1 apple\n'
To contrast the example with a similar pattern (\s+) (lower-case) used in the other answer here, each line splits with different result lengths and positions of the items:
re_s = re.compile(r'(\s+)')
print(re_s.split(line1)) # ['', ' ', '4', ' ', '426.2', ' ', 'orange', '\n', '']
print(re_s.split(line2)) # ['12', ' ', '82.1', ' ', 'apple', '\n', '']
As you can see, this would be a bit more difficult to process in a consistent manner.
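Putting it together for the original goal (a sketch; the column index and the %.6f formatting are assumptions): split with a captured-whitespace group, modify one token, and rejoin with ''.join() so the spacing survives:

```python
import re

# Tokens land at even indices, whitespace runs at odd indices, so
# column 4 of the data line is parts[6] (assumed 1-based column numbering).
line = 'BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77'
parts = re.split(r'(\s+)', line)
parts[6] = '%.6f' % (float(parts[6]) * 0.95)
print(''.join(parts))  # → BBP1   0.000000  -0.150000    1.931350  0.00 -0.150   1.77
```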

Email harvest with python

I developed an application to harvest any type of email address from files, in formats such as:
ishani#dolly.lk
ishani(at)dit.dolly.lk
ishani at cs dot dolly dot edu
But the problem is that the output shows some extra items in the list besides the extracted full email address. I couldn't figure out why; I have tried various ways. I think there is a problem in my regular expression or the logic.
Here is my code:
import re

data = f.read()  # f is the input file, opened earlier
regexp_email = r'(([\w]+)#([\w]+)([.])([\w]+[\w.]+))|(([\w]+)(\(at\))([\w]+)([.])([\w]+[\w.]+))|(([\w]+)(\sat\s)([\w-]+)(\sdot\s)([\w]+(\sdot\s[\w]+)))'
pattern = re.compile(regexp_email)
emailAddresses = re.findall(pattern, data)
print emailAddresses
The output is like this:
[('ishani#sliit.lk', 'ishani', 'sliit', '.', 'lk', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('', '', '', '', '', 'ishani(at)dit.sliit.lk', 'ishani', '(at)', 'dit', '.', 'sliit.lk', '', '', '', '', '', '', ''), ('', '', '', '', '', '', '', '', '', '', '', 'ishani at cs dot dolly dot edu', 'ishani', ' at ', 'cs', ' dot ', 'dolly dot edu', ' dot edu')]
But I am expecting an output like this:
['ishani#dolly.lk','ishani(at)dit.dolly.lk','ishani at cs dot dolly dot edu']
Is there any method that anyone has tried which would solve my problem?
Change your regexp_email to this:
r'[\w]+#[\w]+[.][\w]+[\w.]+|[\w]+\(at\)[\w]+[.][\w]+[\w.]+|[\w]+\sat\s[\w-]+\sdot\s[\w]+\sdot\s[\w]+'
It doesn't seem that you need the capturing groups, so I have removed all of them.
You also don't need the [] around \w if \w is all you need to specify:
r'\w+#\w+[.]\w+[\w.]+|\w+\(at\)\w+[.]\w+[\w.]+|\w+\sat\s[\w-]+\sdot\s\w+\sdot\s\w+'
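As a quick check that findall() now returns whole matches with the group-free pattern (the sample text below is made up):

```python
import re

# Group-free version of the pattern from above; the input line is invented.
pattern = r'\w+#\w+[.]\w+[\w.]+|\w+\(at\)\w+[.]\w+[\w.]+|\w+\sat\s[\w-]+\sdot\s\w+\sdot\s\w+'
data = 'contact ishani#dolly.lk or ishani(at)dit.dolly.lk or ishani at cs dot dolly dot edu'
print(re.findall(pattern, data))
# → ['ishani#dolly.lk', 'ishani(at)dit.dolly.lk', 'ishani at cs dot dolly dot edu']
```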
You could just skip the blanks:
print [e for ea in emailAddresses for e in ea if e]
which produces
['ishani#sliit.lk', 'ishani', 'sliit', '.', 'lk', 'ishani(at)dit.sliit.lk', 'ishani', '(at)', 'dit', '.', 'sliit.lk', 'ishani at cs dot dolly dot edu', 'ishani', ' at ', 'cs', ' dot ', 'dolly dot edu', ' dot edu']
which isn't exactly what you asked for...
