I want to write a script to:
fetch information and return a JSON file,
filter the JSON file,
then convert that JSON to CSV.
I have figured out steps 1 and 2, but am stuck on step 3. Currently I have to use an online JSON-to-CSV converter to get the desired output.
The online JSON-to-CSV tool has users connect to its API with Python to use the conversion tool, which possibly means the tool itself is a Python module.
JSON file to convert
[{
  "matchId": "2068447050405",
  "timestamp": 1658361314,
  "clubs": {
    "39335": {
      "toa": "486",
      "details": {
        "name": "Team one",
        "clubId": 39335
      }
    },
    "111655": {
      "toa": "229",
      "details": {
        "name": "Team two",
        "clubId": 111655
      }
    }
  },
  "players": {
    "39335": {
      "189908959": {
        "position": "defenseMen",
        "toiseconds": "3600",
        "playername": "player one"
      },
      "828715674": {
        "position": "rightWing",
        "toiseconds": "3600",
        "playername": "player two"
      }
    },
    "111655": {
      "515447555": {
        "position": "defenseMen",
        "toiseconds": "3600",
        "playername": "player three"
      },
      "806370074": {
        "position": "center",
        "toiseconds": "3600",
        "playername": "player four"
      }
    }
  }
}]
Desired CSV output
"matchId","timestamp","clubs__|","clubs__|__toa","clubs__|__details__name","clubs__|__details__clubId","players__|","players__||","players__||__position","players__||__toiseconds","players__||__playername"
"2068447050405","1658361314","39335","486","Team one","39335","39335","189908959","defenseMen","3600","player one"
"2068447050405","1658361314","111655","229","Team two","111655","39335","828715674","rightWing","3600","player two"
"2068447050405","1658361314","","","","","111655","515447555","defenseMen","3600","player three"
"2068447050405","1658361314","","","","","111655","806370074","center","3600","player four"
How it looks in a spreadsheet
Sheet example
Some believe the filter is affecting how the CSV output is formatted, so here are links to the full JSON file and the CSV output of that file; the code is too long to post on this page.
Original JSON before filter
Original JSON
CSV output of original JSON file
CSV output
Edit
I should have mentioned this: the "JSON file to convert" above is only a small sample of the actual JSON I wish to convert. I assumed I would be able to simply add to the code used in the answer; I was wrong.
The JSON I intend to use has 9 total columns for clubs and 52 columns for players.
I'm working hard to really grok jq, so here you go, with no explanation:
jq -r '
.[]
| [.matchId, .timestamp] as [$matchId, $timestamp]
| (.players | [to_entries[] | .key as $id1 | .value | to_entries[] | [$id1, .key, .value.position, .value.toiseconds, .value.playername]]) as $players
| (.clubs | [to_entries[] | [.key, .value.toa, .value.details.name, .value.details.clubId]]) as $clubs
| range([$players, $clubs] | map(length) | max)
| [$matchId, $timestamp] + ($clubs[.] // ["","","",""]) + ($players[.] // ["","","","",""])
| @csv
' file.json
"2068447050405",1658361314,"39335","486","Team one",39335,"39335","189908959","defenseMen","3600","player one"
"2068447050405",1658361314,"111655","229","Team two",111655,"39335","828715674","rightWing","3600","player two"
"2068447050405",1658361314,"","","","","111655","515447555","defenseMen","3600","player three"
"2068447050405",1658361314,"","","","","111655","806370074","center","3600","player four"
The default arrays of empty strings need to be the same size as the number of "real" fields you're grabbing.
Since this is a PITA to keep aligned, an update:
jq -r '
def empty_strings: reduce range(length) as $i ([]; . + [""]);
.[]
| [.matchId, .timestamp] as [$matchId, $timestamp]
| (.players | [to_entries[] | .key as $id1 | .value | to_entries[] | [$id1, .key, .value.position, .value.toiseconds, .value.playername]]) as $players
| (.clubs | [to_entries[] | [.key, .value.toa, .value.details.name, .value.details.clubId]]) as $clubs
| range([$players, $clubs] | map(length) | max)
| [$matchId, $timestamp]
+ ($clubs[.] // ($clubs[0] | empty_strings))
+ ($players[.] // ($players[0] | empty_strings))
| @csv
' file.json
I have the following string
string = "OGC Number | LT No | Job /n 9625878 | EPP3234 | 1206545/n" and continues on
I am trying to write it to a .CSV file where it will look like this:
OGC Number | LT No | Job
------------------------------
9625878 | EPP3234 | 1206545
9708562 | PGP43221 | 1105482
9887954 | BCP5466 | 1025454
where each newline in the string is a new row
where each "|" in the sting is a new column
I am having trouble getting the formatting.
I think I need to use:
string.split('/n')
string.split('|')
Thanks.
Windows 7, Python 2.6
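The two splits can be combined before any CSV writing; here is a minimal sketch using the literal '/n' separators from the string as quoted (I'm assuming those really are the two-character sequence shown, not '\n'):

```python
raw = "OGC Number | LT No | Job /n 9625878 | EPP3234 | 1206545/n"

# split on the literal '/n' markers, dropping empty trailing pieces
rows = [r for r in raw.split('/n') if r.strip()]
# then split each row on '|' and strip the surrounding spaces
table = [[c.strip() for c in r.split('|')] for r in rows]
# table[0] is the header row; table[1:] are the data rows
```

Each inner list can then be passed straight to a csv.writer's writerow.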
Untested:
import csv

text = """
OGC Number | LT No | Job
------------------------------
9625878 | EPP3234 | 1206545
9708562 | PGP43221 | 1105482
9887954 | BCP5466 | 1025454"""

lines = text.strip().splitlines()
with open('outputfile.csv', 'wb') as fout:
    csvout = csv.writer(fout)
    csvout.writerow([col.strip() for col in lines[0].split('|')])  # header
    for row in lines[2:]:  # content, skipping the dashed separator line
        csvout.writerow([col.strip() for col in row.split('|')])
If you are interested in using a third-party module, PrettyTable is very useful and has a nice set of features for dealing with and printing tabular data.
EDIT: Oops, I misunderstood your question!
The code below uses two regular expressions to do the modifications.
import re

text = """OGC Number | LT No | Job
------------------------------
9625878 | EPP3234 | 1206545
9708562 | PGP43221 | 1105482
9887954 | BCP5466 | 1025454
"""
# just setup above

# remove all lines of at least 4 dashes
text = re.sub(r'----+\n', '', text)

# replace all pipe symbols and their
# surrounding spaces with single semicolons
text = re.sub(r' +\| +', ';', text)

print text
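If an actual .csv file is wanted rather than the printed semicolon text, the same cleaned-up lines can be handed to the csv module instead. A sketch (the output filename is my choice; on the question's Python 2.6 the file would be opened with 'wb' and no newline argument):

```python
import csv
import re

text = """OGC Number | LT No | Job
------------------------------
9625878 | EPP3234 | 1206545
"""

text = re.sub(r'----+\n', '', text)           # drop the dashed separator line
rows = [[c.strip() for c in line.split('|')]  # split columns on the pipes
        for line in text.splitlines()]
with open('outputfile.csv', 'w', newline='') as fout:
    csv.writer(fout).writerows(rows)
```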
I just started learning python scripting yesterday and I've already gotten stuck. :(
So I have a data file with a lot of different information in various fields.
Formatted basically like...
Name (tab) Start# (tab) End# (tab) A bunch of fields I need but do not do anything with
Repeat
I need to write a script that takes the start and end numbers, and add/subtract a number accordingly depending on whether another field says + or -.
I know that I can replace words with something like this:
x = open("infile")
y = open("outfile","a")
while 1:
line = f.readline()
if not line: break
line = line.replace("blah","blahblahblah")
y.write(line + "\n")
y.close()
But I've looked at all sorts of different places and I can't figure out how to extract specific fields from each line, read one field, and change other fields. I read that you can read the lines into arrays, but can't seem to find out how to do it.
Any help would be great!
EDIT:
Example of two lines from the data (each | represents a tab character):
chr21 | 33025905 | 33031813 | ENST00000449339.1 | 0 | **-** | 33031813 | 33031813 | 0 | 3 | 1835,294,104, | 0,4341,5804,
chr21 | 33036618 | 33036795 | ENST00000458922.1 | 0 | **+** | 33036795 | 33036795 | 0 | 1 | 177, | 0,
The second and third columns (the start and end numbers) are the ones I'd need to read/change; the bolded sixth column is the +/- field.
You can use csv to do the splitting, although for these sorts of problems I usually just use str.split:
with open('infile') as fin, open('outfile', 'w') as fout:
    for line in fin:
        # use line.split('\t', 3) if the name field can contain spaces
        name, start, end, rest = line.split(None, 3)
        # do something to change start and end here.
        # Note that `start` and `end` are strings, but they can easily be
        # converted using the `int` or `float` builtins.
        fout.write('\t'.join((name, start, end, rest)))
csv is nice if you want to split lines like this:
this is a "single argument"
into:
['this','is','a','single argument']
but it doesn't seem like you need that here.
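For the "do something to change start and end" step, here is a minimal sketch of the add/subtract against the +/- field. The offset value and the column positions are assumptions based on the sample lines in the question:

```python
line = "chr21\t33025905\t33031813\tENST00000449339.1\t0\t-\n"
offset = 10  # hypothetical amount to add or subtract

fields = line.rstrip("\n").split("\t")
sign = 1 if fields[5] == "+" else -1             # column 6 holds the +/- flag
fields[1] = str(int(fields[1]) + sign * offset)  # start
fields[2] = str(int(fields[2]) + sign * offset)  # end
new_line = "\t".join(fields) + "\n"
```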
Could somebody help me figure out a simple way of doing this using any script? I will be running the script on Linux.
1) I have a file1 which has the following lines:
(Bank8GntR[3] | Bank8GntR[2] | Bank8GntR[1] | Bank8GntR[0] ),
(Bank7GntR[3] | Bank7GntR[2] | Bank7GntR[1] | Bank7GntR[0] ),
(Bank6GntR[3] | Bank6GntR[2] | Bank6GntR[1] | Bank6GntR[0] ),
(Bank5GntR[3] | Bank5GntR[2] | Bank5GntR[1] | Bank5GntR[0] ),
2) I need the contents of file1 to be modified as follows and written to a file2:
(Bank15GntR[3] | Bank15GntR[2] | Bank15GntR[1] | Bank15GntR[0] ),
(Bank14GntR[3] | Bank14GntR[2] | Bank14GntR[1] | Bank14GntR[0] ),
(Bank13GntR[3] | Bank13GntR[2] | Bank13GntR[1] | Bank13GntR[0] ),
(Bank12GntR[3] | Bank12GntR[2] | Bank12GntR[1] | Bank12GntR[0] ),
So I have to:
read each line from file1,
search with a regular expression
to match Bank[0-9]+GntR,
replace the matched number with "7 added to the number matched",
insert it back into the line,
write the line into a new file.
How about something like this in Python:
import re

# a function that adds 7 to a matched group.
# group 1 is "Bank" and group 2 is the bank number; grabbing (Bank) first
# avoids catching the digits inside the square brackets.
def plus7(matchobj):
    return '%s%d' % (matchobj.group(1), int(matchobj.group(2)) + 7)

# iterate over the input file, writing modified lines to the output file.
with open('in.txt') as fhi, open('out.txt', 'w') as fho:
    for line in fhi:
        fho.write(re.sub(r'(Bank)(\d+)', plus7, line))
Assuming you don't have to use Python, you can do this using awk (note the three-argument match is a GNU awk feature):
awk 'match($0, /Bank([0-9]+)GntR/, nums) { d = nums[1] + 7; gsub(/Bank[0-9]+GntR\[/, "Bank" d "GntR["); print }' test.txt
This gives the desired output.
The point here is that match will match your data and capture groups which you can use to extract the number. As awk supports arithmetic, you can add 7 within awk and then do a replacement on all the values in the rest of the line. Note, I've assumed all the bank references on a line contain the same number.