String to Csv file using Python - python

I have the following string
string = "OGC Number | LT No | Job /n 9625878 | EPP3234 | 1206545/n" and continues on
I am trying to write it to a .CSV file where it will look like this:
OGC Number | LT No | Job
------------------------------
9625878 | EPP3234 | 1206545
9708562 | PGP43221 | 1105482
9887954 | BCP5466 | 1025454
where each newline in the string is a new row
where each "|" in the string is a new column
I am having trouble getting the formatting.
I think I need to use:
string.split('/n')
string.split('|')
Thanks.
Windows 7, Python 2.6

Untested:
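import csv

text = """OGC Number | LT No | Job
------------------------------
9625878 | EPP3234 | 1206545
9708562 | PGP43221 | 1105482
9887954 | BCP5466 | 1025454"""

lines = text.splitlines()
# Python 2: open in binary mode; on Python 3 use open(..., 'w', newline='')
with open('outputfile.csv', 'wb') as fout:
    csvout = csv.writer(fout)
    # header: split on '|' so each column name becomes its own cell
    csvout.writerow([col.strip() for col in lines[0].split('|')])
    for row in lines[2:]:  # content; lines[1] is the dashed separator
        csvout.writerow([col.strip() for col in row.split('|')])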
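For readers on Python 3, the same idea can be sketched without touching the file system. This is only a sketch: the helper names parse_table and to_csv are made up for illustration, and it assumes that any line consisting solely of dashes is a separator to be skipped.

```python
import csv
import io


def parse_table(text, delimiter="|"):
    """Split pipe-delimited text into a list of rows (lists of cells)."""
    rows = []
    for line in text.strip().splitlines():
        stripped = line.strip()
        # Skip blank lines and separator lines made only of dashes.
        if not stripped or set(stripped) == {"-"}:
            continue
        rows.append([cell.strip() for cell in stripped.split(delimiter)])
    return rows


def to_csv(rows):
    """Render rows as CSV text using the csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


text = """OGC Number | LT No | Job
------------------------------
9625878 | EPP3234 | 1206545
9708562 | PGP43221 | 1105482"""

rows = parse_table(text)
print(to_csv(rows))
```

To write a real file instead, pass a file opened with open('outputfile.csv', 'w', newline='') to csv.writer.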

If you are open to using a third-party module, PrettyTable is very useful and has a nice set of features for handling and printing tabular data.

EDIT: Oops, I misunderstood your question!
The code below will use two regular expressions to do the modifications.
import re
text = """OGC Number | LT No | Job
------------------------------
9625878 | EPP3234 | 1206545
9708562 | PGP43221 | 1105482
9887954 | BCP5466 | 1025454
"""
# just setup above (named text so the builtin str is not shadowed)
# remove all lines with at least 4 dashes
text = re.sub(r'----+\n', '', text)
# replace all pipe symbols with their
# surrounding spaces by single semicolons
text = re.sub(r' +\| +', ';', text)
print text

Related

how to identify and print a pattern inside an ascii file in python 2?

I am trying to develop a program that can read patterns from a txt file using Python 2.x. This pattern is supposed to be a bug:
| |
###O
| |
And the pattern doesn't include the whitespaces.
So far I have come up with a way to open the txt file, read it and process the data inside of it but I can't think of a way to make Python understand this pattern as 1, instead of counting each character. I've tried regular expressions but it ended up showing an output similar to this:
| |
###O
| |
| |
###O
| |
| |
###O
| |
Instead of just saying how many of this pattern were detected inside the file, for example:
There were 3 occurrences.
Update: So far I have this:
file = open('bug.txt', 'r')
data = file.read() #read content from file to a string
occurrences = data.count('| |\n\'###O\'\n| |\n')
print('Number of occurrences of the pattern:', occurrences)
But this is not working. The file contains the pattern 3 times, with whitespace in between. The whitespace is not part of the pattern, and when I paste the pattern from the file it breaks across lines. If I change the pattern to | | ###O | | it shows 0 occurrences, because that is not really the pattern.
It depends on how you store your ASCII data, but if you convert it to a string you can use Python's str.count() method.
For example:
# define string
ascii_string = "| | ###O | | | | ###O | | | | ###O | |"
pattern = "| | ###O | |"
count = ascii_string.count(pattern)
# print count in Python 3+
print("Occurrences:", count)
# in Python 2.7 use the print statement instead:
# print "Occurrences:", count
This will result in:
Occurrences: 3
>>> import re
>>> data = '''| |
... ###O
... | |
... | |
... ###O
... | |
... | |
... ###O
... | |'''
>>> result = re.findall('[ ]*\| \|\n[ ]*###O\n[ ]*\| \|', data)
>>> len(result)
3
>>>
Here len(result) is the number of occurrences.
How to do it from a file:
import re

with open('some file.txt') as fd:
    data = fd.read()
result = re.findall(r'[ ]*\| \|\n[ ]*###O\n[ ]*\| \|', data)
print(len(result))
An alternative that accommodates the edit to the question:
>>> data = '''| |
... ###O
... | |
... | |
... ###O
... | |
... | |
... ###O
... | |
... | | ###O | |'''
>>> data.replace('\n', '').replace(' ', '').count('||###O||')
4
>>>
I solved the problem this way.
def somefuntion(file_name):
    ascii_str = ''
    with open(file_name, 'r') as reader:
        for line in reader.readlines():
            for character in line.replace('\n', '').replace(' ', ''):
                ascii_str += str(ord(character))
    return ascii_str

if __name__ == "__main__":
    bug = somefuntion('bug.txt')
    landscape = somefuntion('landscape.txt')
    print(landscape.count(bug))
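The answers above either hard-code the regular expression or strip all whitespace before counting. A middle ground, sketched here, builds the expression from the pattern's own lines and tolerates indentation and trailing blanks around each line. The function name count_occurrences is made up for illustration, and the sketch assumes the pattern's lines always appear on consecutive lines of the file.

```python
import re


def count_occurrences(text, pattern_lines):
    """Count non-overlapping occurrences of a multi-line pattern.

    Each pattern line may be indented and may carry trailing blanks.
    """
    parts = [r"[ \t]*" + re.escape(line) + r"[ \t]*" for line in pattern_lines]
    # Consecutive pattern lines must be separated by exactly one newline.
    regex = re.compile(r"\n".join(parts))
    return len(regex.findall(text))


bug = ["| |", "###O", "| |"]
landscape = "| |\n###O\n| |\n\n   | |\n   ###O\n   | |\n| |\n###O\n| |"
print("There were %d occurrences." % count_occurrences(landscape, bug))
```

Unlike the approach that removes every space and newline, this does not treat a bug flattened onto one line as a match.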

Only the last line of a multiline file / string is printed

I searched a bit on Stack Overflow and stumbled on different answers but nothing fitted for my situation...
I got a map.txt file like this:
+----------------------+
| |
| |
| |
| test |
| |
| |
| |
+------------------------------------------------+
| | |
| | |
| | |
| Science | Bibliothek |
| | |
| | |
| | |
+----------------------+-------------------------+
when I want to print it using this:
def display_map():
    s = open("map.txt").read()
    return s

print display_map()
it just prints me:
+----------------------+-------------------------+
When I try the same method with another text file like:
line 1
line 2
line 3
it works perfectly.
What am I doing wrong?
I guess this file uses the CR (Carriage Return) character (Ascii 13, or '\r') for newlines; on Windows and Linux this would just move the cursor back to column 1, but not move the cursor down to the beginning of a new line.
(Of course such line terminators would not survive copy-paste to Stack Overflow, which is why this cannot be replicated).
You can debug strange characters in a string with repr:
print(repr(display_map()))
It will print out the string with all special characters escaped.
If you see \r in the repr output, you could try this instead:
def read_map():
    with open('map.txt') as f:  # with ensures the file is closed properly
        return f.read().replace('\r', '\n')  # replace \r with \n
Alternatively, pass the U flag to open for universal newlines (Python 2), which converts '\r', '\r\n' and '\n' all to '\n' on reading, regardless of the underlying operating system's conventions. In Python 3, text mode already does this by default.
def read_map():
    with open('map.txt', 'rU') as f:
        return f.read()
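One caveat with a plain replace('\r', '\n') is that a Windows-style '\r\n' pair would turn into two newlines. A small sketch of a normalizer that handles both cases follows; the name normalize_newlines is made up for illustration, and it assumes the text has already been read without newline translation.

```python
def normalize_newlines(text):
    """Convert '\\r\\n' and bare '\\r' line endings to '\\n'."""
    # Handle '\r\n' first so it does not become two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


sample = "line 1\rline 2\r\nline 3\n"
print(repr(sample))                      # shows the raw \r characters
print(repr(normalize_newlines(sample)))  # only \n remains
```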

Clean up string extracted from csv file

I am extracting certain data from a csv file using Ruby and I want to cleanup the extracted string by removing the unwanted characters.
This is how I extract the data so far:
CSV.foreach(data_file, :encoding => 'windows-1251:utf-8', :headers => true) do |row|
  # create an array for each page
  page_data = []
  # for each page, get the data we are interested in and save it to page_data
  page_data.push(row['dID'])
  page_data.push(row['xTerm'])
  pages_to_import.push(page_data)
end
Then I output the csv file with the extracted data
The output extracted is exactly as it is on the csv data file:
| ID | Term |
|-------|-----------------------------------------|
| 13241 | ##106#107#my##106#term## |
| 13345 | ##63#hello## |
| 11436 | ##55#rock##20#my##10015#18#world## |
However, the result I want is:
| ID | Term |
|-------|-----------------------------------------|
| 13241 | my, term |
| 13345 | hello |
| 11436 | rock, my, world |
Any suggestions on how to achieve this?
Libraries that I'm using:
require 'nokogiri'
require 'cgi'
require 'csv'
Using a regular expression, I'd do:
%w[
##106#107#term1##106#term2##
##63#term1##
##55#term1##20#term2##10015#18#term3##
##106#107#my##106#term##
##63#hello##
##55#rock##20#my##10015#18#world##
].map { |str|
  str.scan(/[^#\d][^#]*(?=#)/)
}
# => [["term1", "term2"], ["term1"], ["term1", "term2", "term3"], ["my", "term"], ["hello"], ["rock", "my", "world"]]
My str is the equivalent of the contents of your row['xTerm'].
The regular expression /[^#\d][^#]*(?=#)/ matches runs that start with a character that is neither # nor a digit, contain no #, and are followed by #.
From the garbage in the string, and your comment that you're using Nokogiri and CSV, and because you didn't show your input data as CSV or HTML, I have to wonder if you're not mangling the incoming data somehow, and trying to wiggle out of it in post-processing. If so, show us what you're actually doing and maybe we can help you get clean data to start.
I'm assuming your terms are bookended and separated by ## and consist of one or more numbers followed by the actual term separated by #. To get the terms into an array:
row['xTerm'].split('##')[1..-1].map { |term| term.split(?#)[-1] }
Then you can join or do whatever you want with it.
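For comparison, the same cleanup can be sketched in Python. This is not part of either Ruby answer; the function name clean_terms is made up, and it assumes a real term is never made up purely of digits.

```python
def clean_terms(raw):
    """Return the non-numeric pieces of a '#'-delimited string, joined by ', '."""
    pieces = raw.split("#")
    # Keep pieces that are non-empty and not purely digits.
    terms = [piece for piece in pieces if piece and not piece.isdigit()]
    return ", ".join(terms)


for raw in ("##106#107#my##106#term##",
            "##63#hello##",
            "##55#rock##20#my##10015#18#world##"):
    print(clean_terms(raw))
```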

Python -- how to read and change specific fields from file? (specifically, numbers)

I just started learning python scripting yesterday and I've already gotten stuck. :(
So I have a data file with a lot of different information in various fields.
Formatted basically like...
Name (tab) Start# (tab) End# (tab) A bunch of fields I need but do not do anything with
Repeat
I need to write a script that takes the start and end numbers, and add/subtract a number accordingly depending on whether another field says + or -.
I know that I can replace words with something like this:
x = open("infile")
y = open("outfile", "a")
while 1:
    line = x.readline()
    if not line: break
    line = line.replace("blah", "blahblahblah")
    y.write(line)  # line already ends with its own newline
x.close()
y.close()
But I've looked at all sorts of different places and I can't figure out how to extract specific fields from each line, read one field, and change other fields. I read that you can read the lines into arrays, but can't seem to find out how to do it.
Any help would be great!
EDIT:
Example of a line from the data here: (Each | represents a tab character)
| |
V V
chr21 | 33025905 | 33031813 | ENST00000449339.1 | 0 | **-** | 33031813 | 33031813 | 0 | 3 | 1835,294,104, | 0,4341,5804,
chr21 | 33036618 | 33036795 | ENST00000458922.1 | 0 | **+** | 33036795 | 33036795 | 0 | 1 | 177, | 0,
The second and third columns (indicated by arrows) would be the ones that I'd need to read/change.
You can use csv to do the splitting, although for these sorts of problems, I usually just use str.split:
with open('infile') as fin, open('outfile', 'w') as fout:
    for line in fin:
        # use line.split('\t', 3) if the name field can contain spaces
        name, start, end, rest = line.split(None, 3)
        # do something to change start and end here.
        # Note that `start` and `end` are strings, but they can easily be
        # converted using the `int` or `float` builtins.
        fout.write('\t'.join((name, start, end, rest)))
csv is nice if you want to split lines like this:
this is a "single argument"
into:
['this','is','a','single argument']
but it doesn't seem like you need that here.
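To make the "change start and end" step concrete, here is a minimal sketch for a single tab-separated line. The function name shift_coordinates and the offset value are made up for illustration, and it assumes the sign sits in the sixth column as a bare + or -, with + meaning add and - meaning subtract.

```python
def shift_coordinates(line, offset):
    """Shift the start and end columns of one tab-separated line."""
    fields = line.rstrip("\n").split("\t")
    start, end, strand = int(fields[1]), int(fields[2]), fields[5]
    # Assumption: '+' adds the offset, anything else subtracts it.
    delta = offset if strand == "+" else -offset
    fields[1] = str(start + delta)
    fields[2] = str(end + delta)
    return "\t".join(fields) + "\n"


plus_line = "chr21\t33036618\t33036795\tENST00000458922.1\t0\t+\t33036795\n"
minus_line = "chr21\t33025905\t33031813\tENST00000449339.1\t0\t-\t33031813\n"
print(shift_coordinates(plus_line, 100))
print(shift_coordinates(minus_line, 100))
```

Inside the loop above, the result of shift_coordinates(line, offset) could be passed to fout.write directly.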

Extracting each line from a file and passing it as a variable to "foreach" loop

Could somebody help me figure out a simple way of doing this using any script ? I will be running the script on Linux
1 ) I have a file1 which has the following lines :
(Bank8GntR[3] | Bank8GntR[2] | Bank8GntR[1] | Bank8GntR[0] ),
(Bank7GntR[3] | Bank7GntR[2] | Bank7GntR[1] | Bank7GntR[0] ),
(Bank6GntR[3] | Bank6GntR[2] | Bank6GntR[1] | Bank6GntR[0] ),
(Bank5GntR[3] | Bank5GntR[2] | Bank5GntR[1] | Bank5GntR[0] ),
2 ) I need the contents of file1 to be modified as following and written to a file2
(Bank15GntR[3] | Bank15GntR[2] | Bank15GntR[1] | Bank15GntR[0] ),
(Bank14GntR[3] | Bank14GntR[2] | Bank14GntR[1] | Bank14GntR[0] ),
(Bank13GntR[3] | Bank13GntR[2] | Bank13GntR[1] | Bank13GntR[0] ),
(Bank12GntR[3] | Bank12GntR[2] | Bank12GntR[1] | Bank12GntR[0] ),
So I have to:
read each line from the file1,
use "search" using regular expression,
to match Bank[0-9]GntR,
replace \1 with "7 added to number matched",
insert it back into the line,
write the line into a new file.
How about something like this in Python:
import re

# a function that adds 7 to a matched group.
# groups 1 and 2: we capture (Bank) as well to avoid catching the digits in brackets.
def plus7(matchobj):
    return '%s%d' % (matchobj.group(1), int(matchobj.group(2)) + 7)

# iterate over the input file, have access to the output file.
with open('in.txt') as fhi, open('out.txt', 'w') as fho:
    for line in fhi:
        fho.write(re.sub(r'(Bank)(\d+)', plus7, line))
Assuming you don't have to use Python, you can do this using awk (the three-argument form of match used here is a GNU awk extension):
cat test.txt | awk 'match($0, /Bank([0-9]+)GntR/, nums) { d=nums[1]+7; gsub(/Bank[0-9]+GntR\[/, "Bank" d "GntR["); print }'
This gives the desired output.
The point here is that match finds your data and supports capturing groups, which you can use to extract the number. As awk supports arithmetic, you can then add 7 within awk and do a replacement on all the values in the rest of the line. Note, I've assumed all the values in the line have the same number in them.
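If a line could ever mix different bank numbers, a per-match replacement avoids the same-number assumption. A minimal Python sketch, in which the function name add_to_bank_numbers and its delta parameter are made up for illustration:

```python
import re


def add_to_bank_numbers(line, delta=7):
    """Add delta to every number that sits between 'Bank' and 'GntR'."""
    def bump(match):
        return "Bank%d" % (int(match.group(1)) + delta)
    # The lookahead keeps the digits inside [..] untouched.
    return re.sub(r"Bank(\d+)(?=GntR)", bump, line)


print(add_to_bank_numbers("(Bank8GntR[3] | Bank8GntR[2] | Bank8GntR[1] | Bank8GntR[0] ),"))
print(add_to_bank_numbers("(Bank8GntR[3] | Bank5GntR[0] ),"))
```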
