Parsing and reformatting CSV/text data using Python


Sorry if this is a bit of a beginner's question, but I haven't had much experience with Python and could really use some help figuring this out. If there is a better programming language for tackling this, I'd be more than open to hearing about it.
I'm working on a small project, and I have two blocks of data, formatted differently from each other. They're all spreadsheets saved as CSV files, and I'd really like to make one group match the other without having to manually edit all the data.
What I need to do is go through a CSV and reformat any data saved like this:
10W
20E
15-16N
17-18S
To a format like this (respective line to respective format):
10,W
20,E
,,15,16,N
,,17,18,S
This is so that they can just be copied over when opened as spreadsheets.
I'm able to read the files into a string in Python, but I'm unsure how to write something that searches for a number-hyphen-number-letter format.
I'd be immensely grateful for any help I can get. Thanks

This sounds like a good use-case for regular expressions. Once you've split the lines up into individual strings and stripped the whitespace (using s.strip()) these should work (I'm assuming those are cardinal directions; you'll need to change [NESW] to something else if that assumption is incorrect.):
>>> import re
>>> re.findall(r'\A(\d+)([NESW])', '16N')
[('16', 'N')]
>>> re.findall(r'\A(\d+)([NESW])', '15-16N')
[]
>>> re.findall(r'\A(\d+)-(\d+)([NESW])', '15-16N')
[('15', '16', 'N')]
>>> re.findall(r'\A(\d+)-(\d+)([NESW])', '16N')
[]
The first regex '\A(\d+)([NESW])' matches only a string that begins with a sequence of digits followed by a capital letter N, E, S, or W. The second matches only a string that begins with a sequence of digits followed by a hyphen, followed by another sequence of digits, followed by a capital letter N, E, S, or W. Forcing it to match at the beginning ensures that these regexes don't match a suffix of a longer string.
Then you can do something like this:
>>> vals = re.findall(r'\A(\d+)([NESW])', '16N')[0]
>>> ','.join(vals)
'16,N'
>>> vals = re.findall(r'\A(\d+)-(\d+)([NESW])', '15-16N')[0]
>>> ',,' + ','.join(vals)
',,15,16,N'

This is a whole solution that uses regexes. @senderle has beaten me to the answer, so feel free to accept his response. I'm adding this because I know how difficult it was to wrap my head around re in my own code at first.
import re
# note: \d{2} only matches two-digit numbers, as in the sample data
dash = re.compile(r'(\d{2})-(\d{2})([WENS])')
no_dash = re.compile(r'(\d{2})([WENS])')
raw = '''10W
20E
15-16N
17-18S'''
lines = raw.split('\n')
data = []
for l in lines:
    if '-' in l:
        # range such as 15-16N: two empty columns, then both numbers and the letter
        match = re.search(dash, l).groups()
        data.append(',,%s,%s,%s' % (match[0], match[1], match[2]))
    else:
        # single value such as 10W
        match = re.search(no_dash, l).groups()
        data.append('%s,%s' % (match[0], match[1]))
print('\n'.join(data))
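As a variation on the code above, both shapes can probably be handled by one pattern with an optional first number. This is only a sketch: the name reformat_cell is made up for illustration, it assumes the letters are always N, E, S or W, and it leaves any cell it does not recognise untouched.

```python
import re

# One pattern for both shapes: an optional "digits-" prefix, then digits, then a direction.
CELL = re.compile(r'(?:(\d+)-)?(\d+)([NESW])')

def reformat_cell(cell):
    """Hypothetical helper: '10W' -> '10,W', '15-16N' -> ',,15,16,N'."""
    m = CELL.fullmatch(cell.strip())
    if m is None:
        return cell  # not in either format, leave unchanged
    first, second, direction = m.groups()
    if first is None:
        return '%s,%s' % (second, direction)
    return ',,%s,%s,%s' % (first, second, direction)

raw = '''10W
20E
15-16N
17-18S'''
print('\n'.join(reformat_cell(line) for line in raw.splitlines()))
```

Using \d+ instead of \d{2} means numbers of any length are accepted, which may or may not be what the real data needs.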

In your case, I think the quick solution would involve regular expressions.
You can either use the match method to extract the different tokens when they match a given regular expression, or the split method to split your string into tokens on a separator.
However, in your case the separator would be a single character, so you can use the split method of the str class instead.
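To make the str.split suggestion concrete, here is a rough sketch without regular expressions. It assumes every cell ends in exactly one direction letter and that the part before it is either one number or two numbers joined by a hyphen; the name split_cell is invented for this example.

```python
def split_cell(cell):
    """Hypothetical helper: split '15-16N' into ['15', '16', 'N'] using str methods only."""
    cell = cell.strip()
    numbers, direction = cell[:-1], cell[-1]  # last character is the direction letter
    return numbers.split('-') + [direction]

for cell in ['10W', '15-16N']:
    parts = split_cell(cell)
    # Two leading empty columns for ranges, as in the question's target format.
    prefix = ',,' if len(parts) == 3 else ''
    print(prefix + ','.join(parts))
```

Unlike the regex answers, this does no validation, so malformed cells pass through silently.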

Related

In python, find tokens in line

A long time ago I wrote a tool that parses text files line by line and does some stuff depending on commands and conditions in the file.
I used regex for this, although I was never good at regex.
A line holding a condition looks like this:
[type==STRING]
And the regex I use is:
re.compile(r'^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
This regex gives me the keyword "type" and the value "STRING".
However, now I need to update my tool to allow more than one condition per line, e.g.
[type==STRING][amount==0]
I need to update my regex to get me two pairs of results, one pair type/STRING and one pair amount/0.
But I'm lost on this. My regex above gets me zero results with this line.
Any ideas how to do this?

You could either match a second pair of groups:
^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*(?:\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*)?$
Or you can omit the anchors and the [^\[\]]* part to get the group1 and group 2 values multiple times:
\[([^\]\[=]*)==([^\]\[=]*)\]
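In Python, the unanchored pattern can be passed to re.findall, which returns one tuple per bracketed condition. A minimal sketch, using the example line from the question (the variable names are only illustrative):

```python
import re

# Unanchored pattern: one [key==value] condition per match.
condition = re.compile(r'\[([^\]\[=]*)==([^\]\[=]*)\]')

line = '[type==STRING][amount==0]'
pairs = condition.findall(line)
print(pairs)  # [('type', 'STRING'), ('amount', '0')]
```

Because the anchors are gone, this no longer rejects lines with stray text around the brackets, so add a separate check if that matters.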

Is it a requirement that you use regex? You can alternatively accomplish this pretty easily using the split function twice and stripping the first opening and last closing bracket.
line_to_parse = "[type==STRING]"
# omit the first and last char before splitting
pairs = line_to_parse[1:-1].split("][")
for pair in pairs:
    x, y = pair.split("==")

Rather depends on the precise "rules" that describe your data. However, for your given data why not:
import re
text = '[type==STRING][amount==0]'
words = re.findall(r'\w+', text)
lst = []
for i in range(0, len(words), 2):
    lst.append((words[i], words[i+1]))
print(lst)
Output:
[('type', 'STRING'), ('amount', '0')]

Replace Items with regex (kind of)

I have been handed a bunch of data and am trying to get rid of certain characters. The data contains multiple instances of "^{number}", e.g. "^0", "^1", "^2", etc.
I am trying to replace all of these instances with an empty string, "". Is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".

I understand that the digits are always at the end of the string; have a look at the solutions below.
With regex:
import re
text = 'xyz125'
s = re.sub(r"\d+$", '', text)
print(s)
It should print:
xyz
Without regex (keep in mind that this solution removes all digits, not only the ones at the end of the string):
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
It should print:
xyz
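If the markers are literally a caret followed by digits, as the question describes, and can occur anywhere in the string, a single re.sub call with an escaped caret may be closer to what is needed. This is a sketch, and the sample text is invented for illustration:

```python
import re

text = '^1Hello ^2World^0'  # made-up sample data
# \^ matches a literal caret; \d+ matches the digits that follow it.
cleaned = re.sub(r'\^\d+', '', text)
print(cleaned)  # Hello World
```

Digits that are not preceded by a caret are left alone, which the chained replace calls in the question also guarantee.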

Regular expression help to find space after a long string

My code is as follows:
list = re.findall("PROGRAM S\d\d", contents)
If I print the list, I only get the match up to S51, but I want to capture everything.
I want to find everything like "PROGRAM S51_Mix_Station". I know how to match the digits, but I don't know how to match everything up to the next space; there is usually a space after the last character.
Thanks in advance.

You can also use \w+:
import re
s = "PROGRAM S51_Mix_Station"
new_data = re.findall(r'^PROGRAM\s\w+_\w+_\w+', s)
final_data = new_data[0] if new_data else new_data
Output:
'PROGRAM S51_Mix_Station'

OK, thanks. I found another solution:
lista = re.findall(r"PROGRAM S\d\d\S+", contents)
Here \S+ matches the run of non-whitespace characters that follows the digits.

You could use this:
list = re.findall(r"PROGRAM S\d\d[^ ]*", contents)
This would match PROGRAM S followed by two digits, followed by any number of non-space characters. If you want the match to stop at any whitespace character rather than only at a space, then @Wiktor's comment would be better, i.e. use PROGRAM S\d\d\S*.
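A short sketch of how the two variants differ; the contents string here is invented, since the original file is not shown:

```python
import re

contents = "PROGRAM S51_Mix_Station\nPROGRAM S52_Pump end"  # made-up sample

# \S* stops at any whitespace character (space, tab, newline).
stop_at_whitespace = re.findall(r"PROGRAM S\d\d\S*", contents)
# [^ ]* stops only at a space, so it runs across the newline.
stop_at_space = re.findall(r"PROGRAM S\d\d[^ ]*", contents)

print(stop_at_whitespace)
print(stop_at_space)
```

With this sample, the first call returns both program names, while the second returns a single match that spills over onto the next line.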

I want to split a string by a character on its first occurrence, which belongs to a list of characters. How to do this in python?

Basically, I have a list of special characters. I need to split a string by a character if it belongs to this list and exists in the string. Something on the lines of:
def find_char(string):
    if string.find("some_char"):
        #do xyz with some_char
    elif string.find("another_char"):
        #do xyz with another_char
    else:
        return False
and so on. The way I think of doing it is:
def find_char_split(string):
    char_list = [",","*",";","/"]
    for my_char in char_list:
        if string.find(my_char) != -1:
            my_strings = string.split(my_char)
            break
        else:
            my_strings = False
    return my_strings
Is there a more Pythonic way of doing this? Or would the above procedure be fine? Please help, I'm not very proficient in Python.
(EDIT): I want it to split on the first occurrence of the character, which is encountered first. That is to say, if the string contains multiple commas, and multiple stars, then I want it to split by the first occurrence of the comma. Please note, if the star comes first, then it will be broken by the star.

I would favor using the re module for this because the expression for splitting on multiple arbitrary characters is very simple:
r'[,*;/]'
The brackets create a character class that matches anything inside of them. The code is like this:
import re
results = re.split(r'[,*;/]', my_string, maxsplit=1)
The maxsplit argument makes it so that the split only occurs once.
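A small sketch of how this behaves on strings that mix the delimiters (the sample strings are invented). The split lands on whichever listed character appears first, which seems to be what the edit to the question asks for:

```python
import re

star_first = re.split(r'[,*;/]', 'a*b,c*d', maxsplit=1)
comma_first = re.split(r'[,*;/]', 'a,b*c,d', maxsplit=1)
no_delimiter = re.split(r'[,*;/]', 'plain text', maxsplit=1)

print(star_first)    # ['a', 'b,c*d']
print(comma_first)   # ['a', 'b*c,d']
print(no_delimiter)  # ['plain text']
```

Note that when no delimiter is present, re.split returns a one-element list rather than False, so callers that relied on the False return value need to check the length instead.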
If you are doing the same split many times, you can compile the regex and search on that same expression a little bit faster (but see Jon Clements' comment below):
c = re.compile(r'[,*;/]')
results = c.split(my_string)
If this speed-up is important (it probably isn't), you can use the compiled version in a function instead of having it recompile every time. For that, make a function that returns a closure holding the compiled expression:
def split_chars(chars, maxsplit=0, flags=0, string=None):
    # see note about the + symbol below
    c = re.compile('[{}]+'.format(''.join(chars)), flags=flags)
    def f(string, maxsplit=maxsplit):
        return c.split(string, maxsplit=maxsplit)
    return f if string is None else f(string)
Then:
special_split = split_chars(',*;/', maxsplit=1)
result = special_split(my_string)
But also:
result = split_chars(',*;/', maxsplit=1, string=my_string)
The purpose of the + character is to treat multiple delimiters as one if that is desired (thank you Jon Clements). If this is not desired, you can just use re.compile('[{}]'.format(''.join(chars))) above. Note that even with maxsplit=1 the + matters when delimiters are adjacent: 'a,,b' splits into ['a', 'b'] with it and into ['a', ',b'] without it.
Finally: have a look at this talk for a quick introduction to regular expressions in Python, and this one for a much more information packed journey.

Search for string in file while ignoring id and replacing only a substring

I’ve got a master .xml file generated by an external application and want to create several new .xmls by adapting and deleting some rows with python. The search strings and replace strings for these adaptions are stored within an array, e.g.:
replaceArray = [
    [u'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"',
     u'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'],
    [u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="false"/>',
     u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="true"/>'],
    [u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>',
     u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="true"/>']]
So I'd like to iterate through my file and replace all occurrences of 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"' with 'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"' and so on.
Unfortunately, the ID values of "RowID", "id_tool_base" and "ref_layerid_mapping" might change occasionally. So what I need is to search for matches of the whole string in the master file regardless of which ID value is in between the quotation marks, and to replace only the substring that differs between the two strings of the replaceArray (e.g. use="true" instead of use="false"). I'm not very familiar with regular expressions, but I think I need something like this for my search?
re.sub(r'<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use="false"/>', "", sentence)
I'm happy about any hint that points me in the right direction! If you need any further information or if something is not clear in my question, please let me know.

One way to do this is to have a function for replacing text. The function gets the match object from re.sub and inserts the ID captured from the string being replaced.
import re
s = 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"'
pat = re.compile(r'ref_layerid_mapping=(.+) lyvis="off" toc_visible="off"')
def replacer(m):
    # keep the captured id and switch both flags to "on"
    return "ref_layerid_mapping=" + m.group(1) + ' lyvis="on" toc_visible="on"'
re.sub(pat, replacer, s)
Output:
'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'
Another way is to use back-references in replacement pattern. (see http://www.regular-expressions.info/replacebackref.html)
For example:
import re
s = "Ab ab"
re.sub(r"(\w)b (\w)b", r"\1d \2d", s)
Output:
'Ad ad'
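Applied to one of the tool lines from the question, the back-reference approach might look like the sketch below. It assumes the ids are purely numeric and that the attribute order is fixed; the variable names are only illustrative.

```python
import re

line = '<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>'

# Group 1 captures everything up to use=, whatever the numeric ids happen to be.
pattern = re.compile(r'(<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use=)"false"')

# \1 re-inserts the captured prefix unchanged; only the value of use is replaced.
updated = pattern.sub(r'\1"true"', line)
print(updated)
```

A line whose use attribute is already "true" does not match the pattern and is returned unchanged.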
