Python - Parsing Conundrum - python

I have searched high and low for a resolution to this situation, and tested a few different methods, but I haven't had any luck thus far. Basically, I have a file with data in the following format that I need to convert into a CSV:
(previously known as CyberWay Pte Ltd)
0 2019
01.com
0 1975
1 TRAVEL.COM
0 228
1&1 Internet
97 606
1&1 Internet AG
0 1347
1-800-HOSTING
0 8
1Velocity
0 28
1st Class Internet Solutions
0 375
2iC Systems
0 192
I've tried using re.sub and replacing the whitespace between the numbers on every other line with a comma, but haven't had any success so far. I admit that I normally parse from CSVs, so raw text has been a bit of a challenge for me. I would need to maintain the string formats that are above each respective set of numbers.
I'd prefer the CSV to be formatted as such:
foo bar
0,8
foo bar
0,9
foo bar
0,10
foo bar
0,11
There's about 50,000 entries, so manually editing this would take an obscene amount of time.
If anyone has any suggestions, I'd be most grateful.
Thank you very much.

If you just want to replace whitespace with comma, you can just do:
line = ','.join(line.split())
You'll have to do this only on every other line, but from your question it sounds like you already figured out how to work with every other line.

If I have correctly understood your requirement, you need a strip() on all lines and a split based on whitespace on even lines (lines starting from 1):
import re
fp = open("csv.txt", "r")
while True:
line = fp.readline()
if '' == line:
break
line = line.strip()
fields = re.split("\s+", fp.readline().strip())
print "\"%s\",%s,%s" % ( line, fields[0], fields[1] )
fp.close()
The output is a CSV (you might need to escape quotes if they occur in your input):
"Content of odd line",Number1,Number2
I do not understand the 'foo,bar' you place as header on your example's odd lines, though.

Related

who does IndexError: list index out of range appears ? i did some test still cant find out

Im working on a simple project python to practice , im trying to retreive data from file and do some test on a value
in my case i do retreive data as table from a file , and i do test the last value of the table if its true i add the whole line in another file
Here my data
AE300812 AFROUKH HAMZA 21 admis
AE400928 VIEGO SAN 22 refuse
AE400599 IBN KHYAT mohammed 22 admis
B305050 BOUNNEDI SALEM 39 refuse
here my code :
fichier = open("concours.txt","r")
fichier2 = open("admis.txt","w")
contenu = fichier.read()
tab = contenu.split()
for i in range(0,len(tab),5):
if tab[i+4]=="admis":
fichier2.write(tab[i]+" "+tab[i+1]+" "+tab[i+2]+" "+tab[i+3]+" "+tab[i+4]+" "+"\n")
fichier.close()
And here the following error :
if tab[i+4]=="admis":
IndexError: list index out of range
You look at tab[i+4], so you have to make sure you stop the loop before that, e.g. with range(0, len(tab)-4, 5). The step=5 alone does not guarantee that you have a full "block" of 5 elements left.
But why does this occur, since each of the lines has 5 elements? They don't! Notice how one line has 6 elements (maybe a double name?), so if you just read and then split, you will run out of sync with the lines. Better iterate lines, and then split each line individually. Also, the actual separator seems to be either a tab \t or double-spaces, not entirely clear from your data. Just split() will split at any whitespace.
Something like this (not tested):
fichier = open("concours.txt","r")
fichier2 = open("admis.txt","w")
for line in fichier:
tab = line.strip().split(" ") # actual separator seems to be tab or double-space
if tab[4]=="admis":
fichier2.write(tab[0]+" "+tab[1]+" "+tab[2]+" "+tab[3]+" "+tab[4]+" "+"\n")
Depending on what you actually want to do, you might also try this:
with open("concours.txt","r") as fichier, open("admis.txt","w") as fichier2:
for line in fichier:
if line.strip().endswith("admis"):
fichier2.write(line)
This should just copy the admis lines to the second file, with the origial double-space separator.

Extract a string between other two in Python

I am trying to extract the comments from a fdf (PDF comment file). In practice, this is to extract a string between other two. I did the following:
I open the fdf file with the following command:
import re
import os
os.chdir("currentworkingdirectory")
archcom =open("comentarios.fdf", "r")
cadena = archcom.read()
With the opened file, I create a string called cadena with all the info I need. For example:
cadena = "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</W 3.0>>\nendobj\n219 0 obj\n<</W 3.0>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
I try to extract the needed info with the following line:
a = re.findall(r"nendobj(.*?)W 3\.0",cadena)
Trying to get:
a = "n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<<"
But I got:
a = []
The problem is in the line a = re.findall(r"nendobj(.*?)W 3\.0",cadena) but I don't realize where. I have tried many combinations with no success.
I appreciate any comment.
Regards
It seems to me that there are 2 problems:
a) you are looking for nendobj, but the N is actually part of the line break \n. Thus you'll also not get a leading N in the output, because there is no N.
b) Since the text you're looking for crosses some newlines, you need the re.DOTALL flag
Final code:
a = re.findall("endobj(.*?)W 3\.0",cadena, re.DOTALL)
Also note, that there will be a second result, confirmed by Regex101.

Parsing numbers in strings from a file

I have a txt file as here:
pid,party,state,res
SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%
TB1,Republican,AR,Ted Cruz 27%-Marco Rubio 23%-Donald Trump 23%-Ben Carson 11%
FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%
BN1,Democratic,FL,Hillary Clinton 61%-Bernie Sanders 30%
PB2,Democratic,OH,Hillary Clinton 56%-Bernie Sanders 35%
what I want to do, is check that the % of each "res" gets to 100%
def addPoll(pid,party,state,res,filetype):
with open('Polls.txt', 'a+') as file: # open file temporarly for writing and reading
lines = file.readlines() # get all lines from file
file.seek(0)
next(file) # go to next line --
#this is suppose to skip the 1st line with pid/pary/state/res
for line in lines: # loop
line = line.split(',', 3)[3]
y = line.split()
print y
#else:
#file.write(pid + "," + party + "," + state + "," + res+"\n")
#file.close()
return "pass"
print addPoll("123","Democratic","OH","bla bla 50%-Asd ASD 50%",'f')
So in my code I manage to split the last ',' and enter it into a list, but im not sure how I can get only the numbers out of that text.
You can use regex to find all the numbers:
import re
for line in lines:
numbers = re.findall(r'\d+', line)
numbers = [int(n) for n in numbers]
print(sum(numbers))
This will print
0 # no numbers in the first line
97
85
97
92
93
The re.findall() method finds all substrings matching the specified pattern, which in this case is \d+, meaning any continuous string of digits. This returns a list of strings, which we cast to a list of ints, then take the sum.
It seems like what you have is CSV. Instead of trying to parse that on your own, Python already has a builtin parser that will give you back nice dictionaries (so you can do line['res']):
import csv
with open('Polls.txt') as f:
reader = csv.DictReader(f)
for row in reader:
# Do something with row['res']
pass
For the # Do something part, you can either parse the field manually (it appears to be structured): split('-') and then rsplit(' ', 1) each - separated part (the last thing should be the percent). If you're trying to enforce a format, then I'd definitely go this route, but regex are also a fine solution too for quickly pulling out what you want. You'll want to read up on them, but in your case, you want \d+%:
# Manually parse (throws IndexError if there isn't a space separating candidate name and %)
percents = [candidate.rsplit(' ', 1)[1] for candidate row['res'].split('-')]
if not all(p.endswith('%') for p in percents):
# Handle bad percent (not ending in %)
pass
else:
# Throws ValueError if any of the percents aren't integers
percents = [int(p[:-1]) for p in percents]
if sum(percents) != 100:
# Handle bad total
pass
Or with regex:
percents = [int(match.group(1)) for match in re.finditer(r'(\d+)%', row['res'])]
if sum(percents) != 100:
# Handle bad total here
pass
Regex is certainly shorter, but the former will enforce more strict formatting requirements on row['res'] and will allow you to later extract things like candidate names.
Also some random notes:
You don't need to open with 'a+' unless you plan to append to the file, 'r' will do (and 'r' is implicit, so you don't have to specify it).
Instead of next() use a for loop!

Data reading - csv

I have some datas in a .dfx file and I trying to read it as a csv with pandas. But it has some special characters which are not read by pandas. They are separators as well.I attached one line from it
The "DC4" is being removed when I print the file. The SI is read as space, correctly. I tried some encoding (utf-8, latin1 etc), but no success.
I attached the printed first line as well. I marked the place where the characters should be.
My code is simple:
import pandas
file_log = pandas.read_csv("file_log.DFX", header=None)
print(file_log)
I hope I was clear and someone has an idea.
Thanks in advance!
EDIT:
The input. LINK: drive.google.com/open?id=0BxMDhep-LHOIVGcybmsya2JVM28
The expected output:
88.4373 0 12.07.2014/17:05:22 38.0366 38.5179 1.3448 31.9839
30.0070 0 12.07.2014/17:14:27 38.0084 38.5091 0.0056 0.0033
By examining the example.DFX in hex (with xxd), the two separators are 0x14 and 0x0f accordingly.
Read the csv with multiple separators using python engine:
import pandas
sep1 = chr(0x14) # the one shows dc4
sep2 = chr(0x0f) # the one shows si
file_log = pandas.read_csv('example.DFX', header=None, sep='{}|{}'.format(sep1, sep2), engine='python')
print file_log
And you get:
0 1 2 3 4 5 6 7
0 88.4373 0 12.07.2014/17:05:22 38.0366 38.5179 1.3448 31.9839 NaN
1 30.0070 0 12.07.2014/17:14:27 38.0084 38.5091 0.0056 0.0033 NaN
It seems it has an empty column at the end. But I'm sure you can handle that.
The encoding seems to be ASCII here. DC4 stands for "device control 4" and SI for "Shift In". These are control characters in an ASCII file and not printable. Thus you cannot see them when you issue a "print(file_log)", although it might do something depending on your terminal to view this (like \n would do a new-line).
Try typing file_log in your interpreter to get the representation of that variable and check if those special characters are included. Chances are that you'll see DC4 in the representation as '\x14' which means hexadecimal 14.
You may then further process these strings in your program by using string manipulation like replace.

Regex remove certain characters from a file

I'd like to write a python script that reads a text file containing this:
FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
And output a text file that looks like this:
1 1 8
2 8 15
3 15 22
I essentially don't need the commas or the SEC, NSEG and ANG data. Could someone help me use regex to do this?
So far I have this:
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
with open('RawDataFile_445.txt') as a:
# open all 4 files with a meaningful name
file=[open(outputfile.txt","w")
for line in a:
Without regex:
for line in file:
keep = []
line = line.strip()
if line.startswith('FRAME'):
continue
first, second, *_ = line.split()
keep.append(first)
first, second = second.split('=')
keep.extend(second.split(','))
print(' '.join(keep))
My advice? Since I don't write many regex's I avoid writing big ones all at once. Since you've already done that I would try to verify it a small chunk at a time, as illustrated in this code.
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
r = re.compile(r"\s*(\d+)")
r = re.compile(r"\s*(\d+)\s+J=(\d+)")
with open('RawDataFile_445.txt') as a:
a.readline()
for line in a.readlines():
result = r.match(line)
if result:
print (result.groups())
The first regex is your entire brute of an expression. The next line is the first chunk I verified. The next line is the second, bigger chunk that worked. Notice the slight change.
At this point I would go back, make the correction to the original, whole regex and then copy a bigger chunk to try. And re-run.
Let's focus on an example string we want to parse:
1 J=1,8
We have space(s), digit(s), more space(s), some characters, then digit(s), a comma, and more digit(s). If we replace them with regex characters, we get (\d+)\s+J=(\d+),(\d+), where + means we want 1 or more of that type. Note that we surround the digits with parentheses so we can capture them later with .groups() or .group(#), where # is the nth group.

Categories

Resources