While I can manipulate a CSV file with Python without an issue if it's strictly comma delimited, I'm running into a massive problem with this format I'm working with. It's comma delimited, but the last column consists of a mess of about six commas within the following figure:
"{""EvidenceDetails"": [{""MitigationString"": """", ""Criticality"": 2, ""Timestamp"": ""2018-05-07T13:51:02.000Z"", ""CriticalityLabel"": ""Suspicious"", ""EvidenceString"": ""1 sighting on 1 source: item. Most recent item: Item4: item. I've never seen this IP before. Most recent link (May 7, 2018): link"", ""Rule"": ""Recent""}, {""MitigationString"": """", ""Criticality"": 2, ""Timestamp"": ""2018-05-09T05:32:41.316Z"", "etc"}]}"
The other columns are standard comma separation, but this one column is a mess. I need to only pull out the timestamps' YYYY-MM-DD; nothing else. I can't seem to figure out a way to strip out the unnecessary characters, however.
Any suggestions? I'm working with Python specifically, but if there's something else I should look to, let me know!
Thanks!
Rather than splitting/stripping, it may be easier to use a regular expression to extract the datestamps you want directly.
Here is an example with the line you provided in your question:
import re
pattern_to_use = "[0-9]{4}-[0-9]{2}-[0-9]{2}"
string_to_search = """""{""EvidenceDetails"": [{""MitigationString"": """", ""Criticality"": 2, ""Timestamp"": ""2018-05-07T13:51:02.000Z"", ""CriticalityLabel"": ""Suspicious"", ""EvidenceString"": ""1 sighting on 1 source: item. Most recent item: Item4: item. I've never seen this IP before. Most recent link (May 7, 2018): link"", ""Rule"": ""Recent""}, {""MitigationString"": """", ""Criticality"": 2, ""Timestamp"": ""2018-05-09T05:32:41.316Z"", "etc"}]}"""""
print(re.findall(pattern_to_use, string_to_search))
This will print a list containing the datestamps, in the order they appeared in the string (i.e., ['2018-05-07', '2018-05-09']).
See the Python 3 docs for more information on regular expressions.
You're looking at JSON format, so try using the json module:
import json

# if data is in a file
with open('your filename here', 'r') as f:
    data = json.load(f)

# if data is stored in a string variable
data = json.loads(stringvar)
The data variable should now contain your data in a more easily accessible format.
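For instance, once the cell has been parsed, the YYYY-MM-DD part can be sliced off each timestamp. A minimal sketch, assuming the structure shown in the question (the sample here is trimmed down to just the Timestamp keys):

```python
import json

# Trimmed stand-in for the problem column (structure assumed from the question)
cell = ('{"EvidenceDetails": [{"Timestamp": "2018-05-07T13:51:02.000Z"}, '
        '{"Timestamp": "2018-05-09T05:32:41.316Z"}]}')

data = json.loads(cell)
# Keep only the YYYY-MM-DD prefix of each timestamp
dates = [item["Timestamp"][:10] for item in data["EvidenceDetails"]]
print(dates)  # ['2018-05-07', '2018-05-09']
```

Note that the doubled quotes (`""`) in the raw CSV are just the csv module's quote escaping; reading the file with csv.reader first will hand you the cell with plain quotes, ready for json.loads.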
Related
I'm using Open Refine to do something that I KNOW Python can do. I'm using it to convert a csv into an XML metadata document. I can figure out most of it, but the one thing that trips me up, is this GREL line:
{{forEach(cells["subjectTopicsLocal"].value.split('; '), v, '<subject authority="local"><topic>'+v.escape("xml")+'</topic></subject>')}}
What this does is beautiful for me. I've got a "subject" field in my Excel spreadsheet. My volunteers enter keywords, separated with a "; ". I don't know how many keywords they'll come up with, and sometimes there is only one. That GREL line creates a new <subject authority="local"><topic></topic></subject> element for each term entered, and of course slides it into the field.
I know there has to be a Python expression that can do this. Could someone recommend best practice for this? I'd appreciate it!
Basically you want to use 'split' in Python to convert the string from your subject field into a Python list, and then you can iterate over the list.
So assuming you've read the content of the 'subject' field from a line in your csv/excel document already and assigned it to a string variable 'subj' you could do something like:
subjList = subj.split('; ')
for subject in subjList:
    # do what you need to do to output 'subject' in an xml element here
This Python expression is the equivalent to your GREL expression:
['<subject authority="local"><topic>' + escape(v) + '</topic></subject>' for v in value.split('; ')]
It will create a list of XML snippets containing your subjects. It assumes that you've created or imported an appropriate escape function, such as
from xml.sax.saxutils import escape
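Wrapped up as a function, a minimal runnable sketch (the keyword string is invented for illustration):

```python
from xml.sax.saxutils import escape

def subjects_to_xml(value):
    """Build one <subject> element per '; '-separated keyword."""
    return ['<subject authority="local"><topic>' + escape(v) + '</topic></subject>'
            for v in value.split('; ')]

print(subjects_to_xml('history; maps & atlases'))
```

escape turns the & into &amp;, so the snippets are well-formed XML, matching what .escape("xml") did in the GREL expression.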
I have a csv file that I need to format before I can send the data to Zabbix, but I've encountered a problem in one of the csv files I have to use.
This is an example of the part of the file that I have problems with:
Some_String_That_ineed_01[ifcontainsCell01removehere],xxxxxxxxx02
Some_String_That_ineed_01vtcrmp01[ifcontainsCell01removehere]-[1],aass
So these are 2 lines from the file; the other lines I have already treated.
I need to check if Cell01 is in the line:
if 'Cell01' in h: # do something
I need to remove all the content between the [ ] (brackets included) that contains the Cell01 word, and leave only this:
Some_String_That_ineed_01,xxxxxxxxx02
Some_String_That_ineed_01vtcrmp01-[1],aass
Otherwise my script already handles it quite easily. There must be a better way than what I have in mind, which is to use h.split on the first [, then split again on the comma, then remove the content I don't want and join the strings that are left. I can't simply use replace, because I need to keep this data ([1]).
Later on, with the desired result, I will add this to zabbix sender as the item and item key_. I already have the hosts, timestamps, and values.
You should use a regexp (re module). A negated character class keeps the match from spilling past a single pair of brackets:
import re

s = "Some_String_That_ineed_01[ifcontainsCell01removehere],xxxxxxxxx02"
replaced = re.sub(r'\[[^\]]*Cell01[^\]]*\]', '', s)
print(replaced)
will return:
[root#somwhere test]# python replace.py
Some_String_That_ineed_01,xxxxxxxxx02
You can also experiment here.
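One detail worth noting: with a negated character class (`[^\]]*` rather than a greedy `.*`), a match cannot span more than one pair of brackets, so the `-[1]` part of the second sample line survives:

```python
import re

# A bracket group is removed only if it contains Cell01
pattern = re.compile(r'\[[^\]]*Cell01[^\]]*\]')

lines = [
    "Some_String_That_ineed_01[ifcontainsCell01removehere],xxxxxxxxx02",
    "Some_String_That_ineed_01vtcrmp01[ifcontainsCell01removehere]-[1],aass",
]
cleaned = [pattern.sub('', line) for line in lines]
print(cleaned)
# ['Some_String_That_ineed_01,xxxxxxxxx02',
#  'Some_String_That_ineed_01vtcrmp01-[1],aass']
```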
I am trying to learn to use Whoosh. I have a large collection of html documents I want to search. I discovered that the text_content() method creates some interesting problems. For example, I might have some text that is organized in a table that looks like
<html><table><tr><td>banana</td><td>republic</td></tr><tr><td>stateless</td><td>person</td></table></html>
When I take the original string, get the tree, and then use text_content to get the text in the following manner
mytree = html.fromstring(myString)
text = mytree.text_content()
The results have no spaces (as should be expected)
'bananarepublicstatelessperson'
I tried to insert new lines using string.replace()
myString = myString.replace('</tr>','</tr>\n')
I confirmed that the new line was present
'<html><table><tr><td>banana</td><td>republic</td></tr>\n<tr><td>stateless</td><td>person</td></table></html>'
but when I run the same code from above the line feeds are not present. Thus the resulting text_content() looks just like above.
This is a problem for me because I need to be able to separate words. I thought I could add non-breaking spaces after each td, line breaks after rows, and line breaks after body elements, etc., to get text that reasonably conforms to my original source.
I will note that I did some more testing and found that line breaks inserted after paragraph tag closes were preserved. But there is a lot of text in the tables that I need to be able to search.
Thanks for any assistance
You could use this solution:
import re
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

>>> striphtml('I Want This <b>text!</b>')
'I Want This text!'
Found here: using python, Remove HTML tags/formatting from a string
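For the word-separation problem in the question, a small variation on that function helps: substitute a space for each tag instead of an empty string, then collapse the runs of whitespace:

```python
import re

def striphtml_spaced(data):
    # Replace each tag with a space so adjacent cell text stays separated,
    # then collapse runs of whitespace into single spaces
    text = re.sub(r'<.*?>', ' ', data)
    return ' '.join(text.split())

doc = ('<html><table><tr><td>banana</td><td>republic</td></tr>'
       '<tr><td>stateless</td><td>person</td></table></html>')
print(striphtml_spaced(doc))  # banana republic stateless person
```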
Found this great answer on how to check if a list of strings are within a line
How to check if a line has one of the strings in a list?
But trying to do a similar thing with keys in a dict does not seem to do the job for me:
import urllib2
url_info = urllib2.urlopen('http://rss.timegenie.com/forex.xml')
currencies = {"DKK": [], "SEK": []}
print currencies.keys()
testCounter = 0
for line in url_info:
    if any(countryCode in line for countryCode in currencies.keys()):
        testCounter += 1
    if "DKK" in line or "SEK" in line:
        print line
print "testCounter is %i and should be 2 - if not debug the code" % (testCounter)
The output:
['SEK', 'DKK']
<code>DKK</code>
<code>SEK</code>
testCounter is 377 and should be 2 - if not debug the code
I think perhaps my problem is that .keys() gives me an array rather than a list... but I haven't figured out how to convert it.
change:
any(countryCode in line for countryCode in currencies.keys())
to:
any([countryCode in line for countryCode in currencies.keys()])
Your original code uses a generator expression whereas (I think) your intention is a list comprehension.
see: Generator Expressions vs. List Comprehension
UPDATE:
I found that using an ipython interpreter with pylab imported I got the same results as you did (377 counts versus the anticipated 2). I realized the issue was that 'any' was from the numpy package which is meant to work on an array.
Next, I loaded an ipython interpreter without pylab such that 'any' was from builtin. In this case your original code works.
So if you're using an ipython interpreter, type:
help(any)
and make sure it is from the builtin module. If so your original code should work fine.
This is not a very good way to examine an xml file.
It's slow. You are making potentially N*M substring searches where N is the number of lines and M is the number of keys.
XML is not a line-oriented text format. Your substring searches could find attribute names or element names too, which is probably not what you want. And if the XML file happens to put all its elements on one line with no whitespace (common for machine-generated and -processed XML) you will get fewer matches than you expect.
If you have line-oriented text input, I suggest you construct a regex from your list of keys:
import re

linetester = re.compile('|'.join(re.escape(key) for key in currencies))
for match in linetester.finditer(entire_text):
    print match.group(0)

# or if entire_text is too long and you want to consume iteratively:
for line in entire_text:
    for match in linetester.finditer(line):
        print match.group(0)
However, since you have XML, you should use an actual XML processor:
import xml.etree.cElementTree as ET

forex = ET.parse('forex.xml').getroot()  # parse the downloaded feed first
for elem in forex.findall('data/code'):
    if elem.text in currencies:
        print elem.text
If you are only interested in what codes are present and don't care about the particular entry you can use set intersection:
codes = frozenset(e.text for e in forex.findall('data/code'))
print codes & frozenset(currencies)
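A self-contained sketch of the ElementTree approach, with a trimmed stand-in for the feed (the element layout and the sample rates are assumed, matching the paths used above):

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for rss.timegenie.com/forex.xml (structure and rates assumed)
xml_text = """<forex>
  <data><code>DKK</code><rate>6.9</rate></data>
  <data><code>SEK</code><rate>10.4</rate></data>
  <data><code>USD</code><rate>1.0</rate></data>
</forex>"""

currencies = {"DKK": [], "SEK": []}
forex = ET.fromstring(xml_text)

# Element-text comparison instead of substring matching
matches = [e.text for e in forex.findall('data/code') if e.text in currencies]
print(matches)  # ['DKK', 'SEK']

# Or as a set intersection
codes = frozenset(e.text for e in forex.findall('data/code'))
print(codes & frozenset(currencies))
```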
I am trying to find Chinese words in two different files, but it didn't work, so I tried to search for the words in the same file I got them from, but it seems it doesn't find them either. How is that possible?
chin_split = codecs.open("CHIN_split.txt","r+",encoding="utf-8")
I used this for the regex code:
import re

for n in re.findall(ur'[\u4e00-\u9fff]+', chin_split.read()):
    print n in re.findall(ur'[\u4e00-\u9fff]+', chin_split.read())
How come I get only False printed???
FYI I tried to do this and it works:
for x in [1,2,3,4,5,6,6]:
    print x in [1,2,3,4,5,6,6]
BTW, chin_split.txt contains words in English, Hebrew, and Chinese.
some lines from chin_split.txt:
he daodan 核导弹 טיל גרעיני
hedantou 核弹头 ראש חץ גרעיני
helu 阖庐 "ביתו, מעונו
helu 阖庐 שם מלך וו בתקופת ה'אביב והסתיו'"
huiwu 会晤 להיפגש עם
You are reading from the same file object multiple times, and that is the problem.
The first chin_split.read() will yield all the content, but the later calls (inside the loop) will just get an empty string, because the file position is already at the end.
That loop makes no sense, but if you want to keep it, save the file content in a variable first.
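For example (a Python 3 sketch, with two lines standing in for CHIN_split.txt): scan the text once, save the matches, and test membership against the saved list:

```python
import re

# Two sample lines standing in for the contents of CHIN_split.txt
content = "he daodan 核导弹\nhedantou 核弹头"

# Scan the text once and keep the result
words = re.findall(r'[\u4e00-\u9fff]+', content)
for w in words:
    print(w in words)  # True for every word, since we test against the saved list
```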