This is a short script I've written to refine and validate a large dataset that I have.
# This script refines the job data obtained from the JSI, as rendered by the
# `csv generator` contributed by Luis, for presentation on the dashboard map.
import csv
# The number of columns
num_headers = 9
# Remove invalid characters from records
def url_escaper(data):
    for line in data:
        yield line.replace('&', '&amp;')
# Be sure to configure input & output files
with open("adzuna_input_THRESHOLD.csv", 'r') as file_in, open("adzuna_output_GO.csv", 'w') as file_out:
csv_in = csv.reader( url_escaper( file_in ) )
csv_out = csv.writer(file_out)
# Get rid of rows that have the wrong number of columns
# and rows that have only whitespace for a columnar value
for i, row in enumerate(csv_in, start=1):
if not [e for e in row if not e.strip()]:
if len(row) == num_headers:
csv_out.writerow(row)
else:
print "line %d is malformed" % i
I have one field that is structured like so:
finance|statistics|lisp
I've seen ways to do this kind of trimming with other utilities like R, but ideally I want to achieve the same effect within this Python code.
Maybe I can iterate over all the characters of each columnar value, perhaps as a list, and when I see a `|`, dispose of it and all the text that follows it within that column value.
I suspect it can be achieved with slices, as they do here, but I don't quite understand how slice indices work, and I can't see how to include this step cleanly within the current script's pipeline.
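Something like this is what I have in mind (just a sketch of the idea):
value = 'finance|statistics|lisp'
cut = value.find('|')       # index of the first pipe, or -1 if there is none
if cut != -1:
    value = value[:cut]     # the slice keeps everything before that index
# value is now 'finance'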
With regex I guess it's something like this
(?:\|)(.*)
Why not use string's split method?
In[4]: 'finance|statistics|lisp'.split('|')[0]
Out[4]: 'finance'
It also does not fail with an exception when the string does not contain the separator character:
In[5]: 'finance/statistics/lisp'.split('|')[0]
Out[5]: 'finance/statistics/lisp'
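To fold that into your pipeline, you can rewrite the column in place just before writing the row out. A minimal sketch, reusing the names from your script; PIPE_COLUMN is a placeholder index, so adjust it to wherever your pipe-delimited field actually lives:
PIPE_COLUMN = 3  # placeholder: index of the pipe-delimited field

for i, row in enumerate(csv_in, start=1):
    if not [e for e in row if not e.strip()]:
        if len(row) == num_headers:
            # keep only the text before the first '|' in that column
            row[PIPE_COLUMN] = row[PIPE_COLUMN].split('|')[0]
            csv_out.writerow(row)
        else:
            print "line %d is malformed" % i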
I've got a text file that has several of these blocks of text in it:
Module Resistor_SMD:R_0402_1005Metric (layer B.Cu) (tedit 5B301BBD) (tstamp 5CC0A687)
(at 120.316179 97.92138 90)
(descr "Resistor SMD 0402 (1005 Metric), square (rectangular) end terminal, IPC_7351 nominal, (Body size source: http://www.tortai-tech.com/upload/download/2011102023233369053.pdf), generated with kicad-footprint-generator")
(tags resistor)
(path /610532D4)
(attr smd)
(fp_text reference R59 (at 0 1.17 90) (layer B.SilkS)
I want to pull out the following:
120.316179, 97.92138, 90, and R59
and store it somewhere...
Then, I want to take that collection of line items and throw some away depending on the value(s) of the first two numbers... they're XY coordinates.
Then, write it to a list.
How can I do that with regular expressions?
I'm loading the file and trying to follow along here, but I'm getting lost in the addition of the pandas library.
IMO you don't need re for this task. You can iterate through the lines of your file and, depending on signal strings like '(at ' and 'fp_text reference', you can fill a list of lists of all your resistor data, e.g.:
with open('textfile.txt') as f:
    data = []
    row = []
    for line in f:
        if row:
            if '(fp_text ref' in line.strip():
                row.append(line.strip().split()[2])
                data.append(row)
                row = []
        else:
            if '(at ' in line.strip():
                row = line.strip()[:-1].split()[1:4]
print(data)
# [['120.316179', '97.92138', '90', 'R59']]
And if you want a pandas dataframe from this data:
import pandas as pd
df = pd.DataFrame(data, columns=['x', 'y', 'z', 'R'])
print(df)
#             x         y   z    R
# 0  120.316179  97.92138  90  R59
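Since you also want to throw rows away based on the first two numbers, here is a rough sketch of that with the dataframe above; the threshold values are made-up placeholders, so substitute your real bounds:
# Convert the coordinate columns from strings to numbers first
df['x'] = pd.to_numeric(df['x'])
df['y'] = pd.to_numeric(df['y'])

# Keep only the rows whose XY coordinates fall inside some region
kept = df[(df['x'] < 150.0) & (df['y'] < 100.0)]
print(kept.values.tolist())  # back to a plain list, if you prefer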
This RegEx might help you to capture your three desired strings:
([\d]+\.[\d]{5,}|R[0-9]+)
There are two simple patterns connected using an | (OR):
the one on the left ([\d]+\.[\d]{5,}) checks for your desired float numbers with a 5+ boundary for the float part, and
the one on the right (R[0-9]+) has a left-side R boundary.
You can simply change these boundaries however you wish, and use the captured output via $1 (group(1) in Python) in your code.
You can escape language specific metachars such as . using a \, if necessary.
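In Python, a quick sketch of how you might apply it; the sample line here is shortened from the file in the question:
import re

line = '(at 120.316179 97.92138 90) (fp_text reference R59 (at 0 1.17 90)'
print(re.findall(r'([\d]+\.[\d]{5,}|R[0-9]+)', line))
# ['120.316179', '97.92138', 'R59']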
I have an abstract which I've split into sentences in Python. I want to write to two tables. The first has the following columns: abstract ID (the file number I extracted from my document), sentence ID (automatically generated), and each sentence of the abstract on its own row.
I would want a table that looks like this
abstractID  SentenceID  Sentence
a9001755    0000001     Myxococcus xanthus development is regulated by... (1st sentence)
a9001755    0000002     The C signal appears to be the polypeptide product... (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How do I write the sentences (each on its own row) to the table and assign the sentence IDs as shown above?
This is my code:
import glob;
import re;
import json
org = "NSF Org";
fileNo = "File";
AbstractString = "Abstract";
abstractFlag = False;
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt';
files = glob.glob(path);
for name in files:
    fileA = open(name,'r');
    for line in fileA:
        if line.find(fileNo) != -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name,'r')
    content = fileA.read().split(':')
    abstract = content[len(content)-1]
    abstract = abstract.replace('\n','')
    abstract = abstract.split();
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough to refactor, but feels obviously wrong. It's marked below.
import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0)  # reset file pointer
    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:].strip()
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments
    fh.close()
    concat_abstract = ' '.join(abstract.replace('\n', '').split())
    for s_id, sentence in enumerate(concat_abstract.split('.')):
        # Adjust numeric width arguments to prettify table
        print absID.ljust(15),
        print '{:06d}'.format(s_id).ljust(15),
        print sentence
In the section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not, because the loop keeps overwriting your variables as long as those strings occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say exactly how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it, if this is in its header) rather than looping over it.
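For instance, something along these lines, reusing `filename` from the loop above. This is a sketch only: the patterns assume header lines that look like `File : a9001755`, which I'm guessing at since we can't see your file.
import re

with open(filename) as fh:
    text = fh.read()

# Search the whole file once and take the first occurrence of each header,
# instead of letting a loop overwrite the variables with later matches.
file_match = re.search(r'^File\s*:\s*(\S+)', text, re.MULTILINE)
org_match = re.search(r'^NSF Org\s*:\s*(.+)$', text, re.MULTILINE)
if file_match and org_match:
    absID = file_match.group(1)
    nsfOrg = org_match.group(1).split()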
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using, which collects cruft and spreads state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. That creates a lot of action at a distance, which is best avoided. In the smaller version, you can see how I substituted string literals into the places where they are used only once, and called print statements immediately instead of storing results for later. The result is usually more concise and more easily understood.
I am really new to Python and am struggling with some problems while working on a student project. Basically, I try to read data from a text file which is formatted in columns. I store the data in a list of lists, sort and manipulate it, and write it to a file again. My problem is aligning the written data in proper columns. I found approaches like
"%i, %f, %e" % (1000, 1000, 1000)
but I don't know how many columns there will be. So I wonder if there is a way to set all columns to a fixed width.
This is what the input data looks like:
2 232.248E-09 74.6825 2.5 5.00008 499.482
5 10. 74.6825 2.5 -16.4304 -12.3
This is how I store the data in a list of list:
filename = getInput('MyPath', workdir)
lines = []
f = open(filename, 'r')
while 1:
    line = f.readline()
    if line == '':
        break
    splitted = line.split()
    lines.append(splitted)
f.close()
To write the data, I first join all the row elements of each inner list into one string, separated by a fixed number of spaces. But what I actually need is a fixed total width including the element, and I don't know the number of columns in the file either.
for k in xrange(len(lines)):
    stringlist = ""
    for i in lines[k]:
        stringlist = stringlist + str(i) + ' '
    lines[k] = stringlist + '\n'

f = open(workdir2, 'w')
for i in range(len(lines)):
    f.write(lines[i])
f.close()
This code basically works, but sadly the output isn't formatted properly.
Thank you very much in advance for any help on this issue!
You are absolutely right about being able to format widths as you have above using string formatting. But as you correctly point out, the tricky bit is doing this for a variable-sized output list. Instead, you could use the join() function:
output = ['a', 'b', 'c', 'd', 'e']

# format each column with a width of 10 spaces
width = [10] * len(output)

# write it out, using the join() function
with open('output_example', 'w') as f:
    f.write(''.join('%*s' % i for i in zip(width, output)))
will write out:
'         a         b         c         d         e'
As you can see, the length of the format array width is determined by the length of the output, len(output). This is flexible enough that you can generate it on the fly.
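For instance, you could derive the widths from the data itself instead of hardcoding 10. A small sketch, using a couple of rows from your sample input:
rows = [['2', '232.248E-09', '74.6825'],
        ['5', '10.', '74.6825']]

# width of each column = widest entry in it, plus two spaces of padding
width = [max(len(cell) for cell in col) + 2 for col in zip(*rows)]

for row in rows:
    print(''.join('%*s' % pair for pair in zip(width, row)))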
Hope this helps!
String formatting might be the way to go:
>>> print("%10s%9s" % ("test1", "test2"))
     test1    test2
Though you might want to first create strings from those numbers and then format them as I showed above.
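For example, converting first and then padding everything to one fixed width. This is just a sketch; pick whatever width actually fits your data:
numbers = [2, 232.248e-09, 74.6825, 2.5, 5.00008, 499.482]
cells = [str(n) for n in numbers]                # numbers -> strings first
print(''.join('%15s' % cell for cell in cells))  # every column 15 wide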
I cannot fully follow your file-writing code, but try working from something like this:
with open(workdir2, 'w') as datei:
    for key, item in enumerate(zeilen):
        # adjust the format string to the widths you need
        line = "%4i %s" % (key, item)
        datei.write(line)
This is the python script:
f = open('csvdata.csv','rb')
fo = open('out6.csv','wb')

for line in f:
    bits = line.split(',')
    bits[1] = '"input"'
    fo.write(','.join(bits))

f.close()
fo.close()
I have a CSV file and I'm replacing the content of the 2nd column with the string "input". However, I need to grab some information from that column content first.
The content might look like this:
failurelog_wl","inputfile/source/XXXXXXXX"; "**X_CORD2**"; "Invoice_2M";
"**Y_CORD42**"; "SIZE_ID37""
It has a weird kind of data, as you can see; in particular, there are two double quotes at the end of the line instead of the single one you would expect.
I need to extract the X_CORD and Y_CORD information, like X_CORD = 2 and Y_CORD = 42, before replacing the column value. I then want to insert an extra column, named X_Y, which represents (2_42).
How can I modify my script to do that?
If I understand your question correctly, you can use a simple regular expression to pull out the numbers you want:
import re

f = open('csvdata.csv','rb')
fo = open('out6.csv','wb')

for line in f:
    bits = line.split(',')
    x_y_matches = re.match(r'.*X_CORD(\d+).*Y_CORD(\d+).*', bits[1])
    assert x_y_matches is not None, 'Line had unexpected format: {0}'.format(bits[1])
    x_y = '({0}_{1})'.format(x_y_matches.group(1), x_y_matches.group(2))
    bits[1] = '"input"'
    bits.append(x_y)
    fo.write(','.join(bits))

f.close()
fo.close()
Note that this will only work if column 2 always says 'X_CORD' and 'Y_CORD' immediately before the numbers. If it is sometimes a slightly different format, you'll need to adjust the regular expression to allow for that. I added the assert to give a more useful error message if that happens.
You mentioned wanting the column to be named X_Y. Your script appears to assume that there is no header, and my modified version definitely makes this assumption. Again, you'd need to adjust for that if there is a header line.
And, yes, I agree with the other commenters that using the csv module would be cleaner, in general, for reading and writing csv files.
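For reference, roughly the same logic with the csv module. This is a sketch only; it keeps the no-header assumption and the same regex constraint on column 2:
import csv
import re

with open('csvdata.csv', 'rb') as f, open('out6.csv', 'wb') as fo:
    writer = csv.writer(fo)
    for bits in csv.reader(f):
        m = re.match(r'.*X_CORD(\d+).*Y_CORD(\d+).*', bits[1])
        assert m is not None, 'Line had unexpected format: {0}'.format(bits[1])
        bits[1] = 'input'  # csv.writer handles quoting itself where needed
        bits.append('({0}_{1})'.format(m.group(1), m.group(2)))
        writer.writerow(bits)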
I have a text file containing data in rows and columns (~17000 rows in total). Each column is a uniform number of characters long, with the 'unused' characters filled in by spaces. For example, the first column is 11 characters long, but the last four characters in that column are always spaces (so that it appears to be a nice column when viewed with a text editor). Sometimes it's more than four if the entry is less than 7 characters.
The columns are not otherwise separated by commas, tabs, or spaces. They are also not all the same number of characters (the first two are 11, the next two are 8 and the last one is 5 - but again, some are spaces).
What I want to do is import the entries (which are numbers) in the last two columns if the second column contains the string 'OW' somewhere in it. Any help would be greatly appreciated.
Python's struct.unpack is probably the quickest way to split fixed-length fields. Here's a function that will lazily read your file and return tuples of numbers that match your criteria:
import struct

def parsefile(filename):
    with open(filename) as myfile:
        for line in myfile:
            line = line.rstrip('\n')
            fields = struct.unpack('11s11s8s8s5s', line)
            if 'OW' in fields[1]:
                yield (int(fields[3]), int(fields[4]))
Usage:
if __name__ == '__main__':
    for field in parsefile('file.txt'):
        print field
Test data:
1234567890a1234567890a123456781234567812345
something  maybe OW d 111111118888888855555
aaaaa      bbbbb      1234    1212121233333
other thinganother OW 121212  6666666644444
Output:
(88888888, 55555)
(66666666, 44444)
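If the widths ever change, you can also build the format string from a list instead of hard-coding '11s11s8s8s5s'. A small sketch, using the second line of the test data:
import struct

widths = [11, 11, 8, 8, 5]
fmt = ''.join('%ds' % w for w in widths)  # -> '11s11s8s8s5s'

line = 'something  maybe OW d 111111118888888855555'
print(struct.unpack(fmt, line))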
In Python you can extract a substring at known positions using a slice - this is normally done with the list[start:end] syntax. However you can also create slice objects that you can use later to do the indexing.
So you can do something like this:
columns = [slice(11,22), slice(30,38), slice(38,44)]

myfile = open('some/file/path')
for line in myfile:
    fields = [line[column].strip() for column in columns]
    if "OW" in fields[0]:
        value1 = int(fields[1])
        value2 = int(fields[2])
        ....
Separating out the slices into a list makes it easy to change the code if the data format changes, or you need to do stuff with the other fields.
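For instance, you can derive the slices from a list of column widths rather than hard-coding the offsets. A sketch, using the widths from the question:
widths = [11, 11, 8, 8, 5]

# running offsets: 0, 11, 22, 30, 38
offsets = [sum(widths[:i]) for i in range(len(widths))]
all_columns = [slice(start, start + w) for start, w in zip(offsets, widths)]

# second, fourth and fifth columns, as above
columns = [all_columns[1], all_columns[3], all_columns[4]]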
Here's a function which might help you:
def rows(f, columnSizes):
    while True:
        row = {}
        for (key, size) in columnSizes:
            value = f.read(size)
            if len(value) < size:  # EOF
                return
            row[key] = value
        yield row
for an example of how it's used:
from StringIO import StringIO

sample = StringIO("""aaabbbccc
d e f
g h i
""")

for row in rows(sample, [('first', 3),
                         ('second', 3),
                         ('third', 4)]):
    print repr(row)
Note that unlike the other answers, this example is not line-delimited (it uses the file purely as a provider of bytes, not an iterator of lines), since you specifically mentioned that the fields were not separated, I assumed that the rows might not be either; the newline is taken into account specifically.
You can test if one string is a substring of another with the 'in' operator. For example,
>>> 'OW' in 'hello'
False
>>> 'OW' in 'helOWlo'
True
So in this case, you might do
if 'OW' in row['third']:
    stuff()
but you can obviously test any field for any value as you see fit.
entries = ((float(line[30:38]), float(line[38:43])) for line in myfile if "OW" in line[11:22])

for num1, num2 in entries:
    # whatever
entries = []
with open('my_file.txt', 'r') as f:
    for line in f.read().splitlines():
        line = line.split()
        if line[1].find('OW') >= 0:
            entries.append((int(line[-2]), int(line[-1])))
entries is a list of tuples containing the last two columns of each matching line