Hello guys, I managed to get some data by scraping with Beautiful Soup, and now I want to format that data so that I can easily export it to CSV and JSON.
My question is: how can one translate this:
Heading :
        Subheading :
AnotherHeading :
        AnotherSubheading :
                Somedata
Heading :
        Subheading :
AnotherHeading :
        AnotherSubheading :
                Somedata
Into this:
[
['Heading',['Subheading']],
['AnotherHeading',['AnotherSubheading',['Somedata']]],
['Heading',['Subheading']],
['AnotherHeading',['AnotherSubheading',['Somedata']]]
]
Indented for clarity
Any rescue attempt would be met with a warm thank you!
So far, with help, we have:
def parse(data):
    stack = [[]]
    levels = [0]
    current = stack[0]
    for line in data.splitlines():
        indent = len(line)-len(line.lstrip())
        if indent > levels[-1]:
            levels.append(indent)
            stack.append([])
            current.append(stack[-1])
            current = stack[-1]
        elif indent < levels[-1]:
            stack.pop()
            current = stack[-1]
            levels.pop()
        current.append(line.strip().rstrip(':'))
    return stack
The problem with that code is that it returns...
[
'Heading ',
['Subheading '],
'AnotherHeading ',
['AnotherSubheading ', ['Somedata'], 'Heading ', 'Subheading '], 'AnotherHeading ',
['AnotherSubheading ', ['Somedata']]
]
Here is a repl:
https://repl.it/yvM/1
Thank you both kirbyfan64sos and SuperBiasedMan
def parse(data):
    currentTab = 0
    currentList = []
    result = [currentList]
    i = 0
    tabCount = 0
    for line in data.splitlines():
        tabCount = len(line)-len(line.lstrip())
        line = line.strip().rstrip(' :')
        if tabCount == currentTab:
            currentList.append(line)
        elif tabCount > currentTab:
            newList = [line]
            currentList.append(newList)
            currentList = newList
        elif tabCount == 0:
            currentList = [line]
            result.append(currentList)
        elif tabCount == 1:
            currentList = [line]
            result[-1].append(currentList)
        currentTab = tabCount
        tabCount = tabCount + 1
        i = i + 1
    print(result)
Well first you want to clear out unnecessary whitespace, so you make a list of all the lines that contain something more than whitespace and set up all the defaults that you start from for the main loop.
teststring = [line for line in teststring.split('\n') if line.strip()]
currentTab = 0
currentList = []
result = [currentList]
This method relies on the mutability of lists, so setting currentList as an empty list and then setting result to [currentList] is an important step: anything we append to currentList afterwards also shows up inside result.
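The point about shared list objects can be illustrated in isolation. This is only a small sketch; the snake_case names are made up for the example and are not the variables from the answer:

```python
current_list = []
result = [current_list]             # result holds a reference to the same list object

current_list.append("Heading")      # mutating the shared list is visible through result
first_group = current_list

current_list = ["AnotherHeading"]   # rebinding the name does NOT change what result holds
result.append(current_list)

print(result)
```

Appending mutates the object that result already refers to, while assigning a new list to the name only changes what the name points at, which is why the parser appends each new list somewhere before moving on.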
for line in teststring:
    i, tabCount = 0, 0
    while line[i] == ' ':
        tabCount += 1
        i += 1
    tabCount //= 8  # integer division, so the level stays an int on Python 3
This is the best way I could think of to check for tab characters at the start of each line. Also, yes, you'll notice I actually checked for spaces, not tabs. Tabs just 100% didn't work; I think it was because I was using repl.it, since I don't have Python 3 installed. It works perfectly fine on Python 2.7, but I won't put up code I haven't verified works. I can edit this if you confirm that using \t and removing tabCount //= 8 produces the desired results.
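If the input may contain either real tab characters or runs of spaces, one hedged way to get a nesting level is to expand tabs first and then count the leading spaces. The helper name indent_level and the default of 8 spaces per level are assumptions made for this sketch:

```python
def indent_level(line, spaces_per_level=8):
    """Return the nesting level of a line indented with tabs or spaces."""
    # str.expandtabs turns each tab into spaces up to the next tab stop.
    expanded = line.expandtabs(spaces_per_level)
    leading = len(expanded) - len(expanded.lstrip(" "))
    return leading // spaces_per_level


print(indent_level("Heading :"))
print(indent_level("\tSubheading :"))
print(indent_level("                Somedata"))
```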
Now, check how indented the line is. If it's the same as our currentTab value, then just append to the currentList.
    if tabCount == currentTab:
        currentList.append(line.strip())
If it's higher, that means we've gone to a deeper list level. We need a new list nested in currentList.
    elif tabCount > currentTab:
        newList = [line.strip()]
        currentList.append(newList)
        currentList = newList
Going backwards is trickier. Since the data only contains 3 nesting levels, I opted to hardcode what to do with the values 0 and 1 (2 should always result in one of the above blocks). If there are no tabs, we can append a new list to result.
    elif tabCount == 0:
        currentList = [line.strip()]
        result.append(currentList)
It's mostly the same for a one tab deep heading, except that you should append to result[-1], as that's the last main heading to nest into.
    elif tabCount == 1:
        currentList = [line.strip()]
        result[-1].append(currentList)
Lastly, make sure currentTab is updated to what our current tabCount is so the next iteration behaves properly.
    currentTab = tabCount
Something like:
def parse(data):
    stack = [[]]
    levels = [0]
    current = stack[0]
    for line in data.splitlines():
        indent = len(line)-len(line.lstrip())
        if indent > levels[-1]:
            levels.append(indent)
            stack.append([])
            current.append(stack[-1])
            current = stack[-1]
        elif indent < levels[-1]:
            stack.pop()
            current = stack[-1]
            levels.pop()
        current.append(line.strip().rstrip(':'))
    return stack[0]
Your format looks a lot like YAML, though; you may want to look into PyYAML.
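For data that can dedent by more than one level at a time, or nest deeper than three levels, a stack of open ancestors is one way to generalize. The following is a minimal sketch rather than a drop-in replacement: parse_outline is a hypothetical name, and it assumes that every line becomes its own list with its children nested inside it, as in the desired output from the question.

```python
def parse_outline(text):
    """Turn an indented outline into nested lists, one list per line."""
    result = []
    stack = []  # (indent, node) pairs for the ancestors that are still open
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        indent = len(raw) - len(raw.lstrip())
        node = [raw.strip().rstrip(" :")]
        # Close every ancestor that is at the same depth or deeper.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if stack:
            stack[-1][1].append(node)   # nest under the nearest ancestor
        else:
            result.append(node)         # no ancestor left: top-level entry
        stack.append((indent, node))
    return result


sample = (
    "Heading :\n"
    "    Subheading :\n"
    "AnotherHeading :\n"
    "    AnotherSubheading :\n"
    "        Somedata\n"
)
print(parse_outline(sample))
```

The while loop is what lets a line jump back several levels at once; a single pop per line is not enough when Somedata is followed directly by a top-level Heading.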
I've been trying to encode a string (e.g. aabbbacc) as something like a2b3a1c2.
This is the code I've tried:
string_value = "aabbbacc"
temp_string = ""
for i in range(0, len(string_value)):
    if i != len(string_value) or i > len(string_value):
        temp_count = 1
        while string_value[i] == string_value[i+1]:
            temp_count += 1
            i += 1
        temp_string += string_value[i] + str(temp_count)
print(temp_string)
The problem is that even though I've added an if condition to stop the out-of-bounds access from happening, I still get this error:
Traceback (most recent call last):
  File "C:run_length_encoding.py", line 6, in <module>
    while string_value[i] == string_value[i+1]:
IndexError: string index out of range
I've also tried:
string_value = "aabbbacc"
temp_string = ""
for i in range(0, len(string_value)):
    count = 1
    while string_value[i] == string_value[i+1]:
        count += 1
        i += 1
        if i == len(string_value):
            break
    temp_string += string_value[i]+ str(count)
print(temp_string)
Now, I know there might be a better way to solve this, but I'm trying to understand why I'm getting the out-of-bounds exception even though I have an if condition to prevent it. At what part of the logic am I going wrong? Please explain.
The problem is here:
for i in range(0, len(string_value)):  # if i is the last index of the string
    count = 1
    while string_value[i] == string_value[i+1]:  # i+1 is now out of bounds
The easiest way to avoid out-of-bounds is to not index the strings at all:
def encode(s):
    if s == '':  # handle empty string
        return s
    current = s[0]  # start with first character (won't fail since we checked for empty)
    count = 1
    temp = ''
    for c in s[1:]:  # iterate through remaining characters (string slicing won't fail)
        if current == c:
            count += 1
        else:  # character changed, output count and reset current character and count
            temp += f'{current}{count}'
            current = c
            count = 1
    temp += f'{current}{count}'  # output last count accumulated
    return temp

print(encode('aabbbacc'))
print(encode(''))
print(encode('a'))
print(encode('abc'))
print(encode('abb'))
Output:
a2b3a1c2

a1
a1b1c1
a1b2
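For comparison, the same run-length encoding can be sketched with itertools.groupby from the standard library, which groups consecutive equal characters without any indexing. The name encode_groupby is made up for this example:

```python
from itertools import groupby


def encode_groupby(s):
    """Run-length encode s, e.g. 'aabbbacc' -> 'a2b3a1c2'."""
    # groupby yields (character, iterator over the run of that character).
    return "".join(f"{char}{sum(1 for _ in run)}" for char, run in groupby(s))


print(encode_groupby("aabbbacc"))
```

An empty string produces no groups, so it comes back as an empty string without a special case.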
First, this check is odd:
if i != len(string_value) or i > len(string_value):
Inside range(0, len(string_value)), i never reaches len(string_value), so the condition is always true and guards nothing.
Second, you check i but read the value at i+1, and potentially the ones after it.
So my suggestion is to put the condition inside your while.
And do not allow string_value[i] to be read after you have checked that i == len(string_value).
(I remind you that: "The break statement, like in C, breaks out of the innermost enclosing for or while loop.")
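One way to read that suggestion is sketched below, under the assumption that the bounds check moves into the while condition and the outer for loop becomes a while loop. The second change matters because reassigning i inside "for i in range(...)" does not carry over to the next iteration. The function name encode_with_while is hypothetical:

```python
def encode_with_while(string_value):
    temp_string = ""
    i = 0
    while i < len(string_value):
        count = 1
        # Check i + 1 is in range BEFORE reading string_value[i + 1].
        while i + 1 < len(string_value) and string_value[i] == string_value[i + 1]:
            count += 1
            i += 1
        temp_string += string_value[i] + str(count)
        i += 1  # move past the run that was just counted
    return temp_string


print(encode_with_while("aabbbacc"))
```

Because "and" short-circuits, string_value[i + 1] is never evaluated once i + 1 reaches the end of the string.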
Iterate through each character in the string and check whether the next character is the same as the current one. If it is, add one to the count; otherwise, append the character and its count to the temp string and reset the count to 1.
string_value = "aabbbacc"
temp_string = ""
count = 1
for i in range(len(string_value)-1):
    if string_value[i] == string_value[i+1]:
        count += 1
    else:
        temp_string += string_value[i]+ str(count)
        count = 1
#add the last char count
temp_string += string_value[i+1]+ str(count)
print(temp_string)
Out: a2b3a1c2
I need a Python function that returns a reversed string under the following conditions:
The position of $ should not change in the reversed string.
It should not use Python built-in functions.
The function should be efficient.
Example : 'pytho$n'
Result : 'nohty$p'
I have already tried with this code:
list = "$asdasdas"
list1 = []
position = ''
for index, i in enumerate(list):
    if i == '$':
        position = index
    elif i != '$':
        list1.append(i)
reverse = []
for index, j in enumerate( list1[::-1] ):
    if index == position:
        reverse.append( '$' )
    reverse.append(j)
print reverse
Thanks in advance.
Recognise that it's a variation on the partitioning step of the Quicksort algorithm, using two pointers (array indices) thus:
data = list("foo$barbaz$$")
i, j = 0, len(data) - 1
while i < j:
    while i < j and data[i] == "$": i += 1
    while i < j and data[j] == "$": j -= 1
    data[i], data[j] = data[j], data[i]
    i, j = i + 1, j - 1
"".join(data)
'zab$raboof$$'
P.S. it's a travesty to write this in Python!
A Pythonic solution could look like this:
def merge(template, data):
    for c in template:
        yield c if c == "$" else next(data)
data = "foo$barbaz$$"
"".join(merge(data, reversed([c for c in data if c != "$"])))
'zab$raboof$$'
I wrote this without using any built-in functions. I hope it fulfils your criteria:
string = "zytho$n"
def reverse(string):
    string_new = string[::-1]
    i = 0
    position = 0
    position_new = 0
    # find the position of "$" in the original string
    for char in string:
        if char=="$":
            position = i
            break
        else:
            i = i + 1
    j = 0
    # find the position of "$" in the reversed string
    for char in string_new:
        if char=="$":
            position_new = j
            break
        else:
            j = j + 1
    final_string = string_new[:position_new]+string_new[position_new+1:position+1]+"$"+string_new[position+1:]
    return(final_string)
string_new = reverse(string)
print(string_new)
The output of this is:
nohty$z
To explain the code: first I used [::-1], which walks the string from the last position backwards and so reverses it. Then I found the position of the $ in both the old and the new string. I took for granted that you have just one $ present, so each search stops at the first $ it finds. Next I stitched the string back together from four parts: the part of the new string up to the $ sign, the part of the new string from after the $ sign through the position of the $ sign in the old string, then the $ sign, and after that the rest of the new string.
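When the string can contain more than one $, or a $ in its first half, stitching slices together gets awkward. A two-pointer pass is one alternative; the sketch below wraps that idea in a function with the made-up name reverse_keep_dollar. It still relies on list() and "".join(), so whether it meets the "no built-ins" condition depends on how strictly that condition is read.

```python
def reverse_keep_dollar(text):
    """Reverse text while leaving every '$' at its original index."""
    chars = list(text)
    i, j = 0, len(chars) - 1
    while i < j:
        if chars[i] == "$":
            i += 1              # skip a '$' on the left
        elif chars[j] == "$":
            j -= 1              # skip a '$' on the right
        else:
            chars[i], chars[j] = chars[j], chars[i]
            i += 1
            j -= 1
    return "".join(chars)


print(reverse_keep_dollar("pytho$n"))
```

Each character is visited at most once, so the pass is linear in the length of the string.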
I have a file, each line of which contains data that I would like to read into a dictionary, resulting in a list of dictionaries. Or a dictionary of dictionaries, keyed by the first element from each line. The first element of each line is the only one that I can guarantee will be of the same type from line to line, i.e. it's a name.
The data in the file looks something like this:
name:value1, var2:('str1', 'str2','str3'), var3:[0.1,1,10] , var4:range(1,10)
name:value2, var5:('str1', 'str2'), var6:range(1,10)
And I'd like to have it end up something like this:
dictionaryList=[
{"name": "value1", "var2":('str1', 'str2','str3'), var3:[0.1,1,10] , var4:range(1,10)},
{name:value2, var5:('str1', 'str2'), var6:range(1,10)}
]
There are a number of questions about reading lines into elements of a single dictionary or reading a file into a nested dictionary. They all rely on splitting the line on a comma, though, i.e.
content = f.readlines()
for line in content:
    line = line.strip('\r').strip('\n').split(':')
If I do that, I end up with breaks in the ranges and arrays and whatnot. I was borderline going to use : as a separator, but that feels like horribly bad form, and I have no way to automatically convert the correct commas to colons when I get sent more data. Is there a way to get around this?
f = open("test.txt","r")
lines = f.readlines()
f.close()

dictList = []
stack = []
brack = "([{"
opbrack = ")]}"
for line in lines:
    d = {}
    key = ""
    val = ""
    k = 0
    for i in line:
        if i == ',':
            if not(stack) or stack[-1] not in brack:
                d[key] = val
                key = ""
                val = ""
                k = 0
            elif k==0:
                key += i
            else:
                val += i
        elif i in brack:
            stack.append(i)
            if k==0:
                key += i
            else:
                val += i
        elif i in opbrack:
            stack.pop()
            if k==0:
                key += i
            else:
                val += i
        elif i == ":":
            if not(stack) or stack[-1] not in brack:
                k = 1
            elif k==0:
                key += i
            else:
                val += i
        else:
            if k==0:
                key += i
            else:
                val += i
    dictList.append(d)
print dictList
This code should do what you need. It assumes that the file is already in proper format.
This is the output:
[{'name': 'value1', ' var2': "('str1', 'str2','str3')", ' var3': '[0.1,1,10] '}, {' var5': "('str1', 'str2')", 'name': 'value2'}]
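Note that var4 and var6 are missing from that output: a key/value pair is only stored when a top-level comma is reached, so the last pair on each line is dropped, and the keys keep their leading spaces. A minimal sketch that tracks bracket depth, keeps the final pair, and strips whitespace could look like the following. The names split_top_level and parse_line are hypothetical, and every value is deliberately left as a string.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators inside (), [] or {}."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))  # keep the final field as well
    return parts


def parse_line(line):
    """Parse 'key:value, key:value' into a dict of stripped strings."""
    record = {}
    for field in split_top_level(line.strip()):
        key, _, value = field.partition(":")  # split on the first colon only
        record[key.strip()] = value.strip()
    return record


line = "name:value1, var2:('str1', 'str2','str3'), var3:[0.1,1,10] , var4:range(1,10)"
print(parse_line(line))
```

Values that are plain literals, such as the tuple and the list, could afterwards be converted with ast.literal_eval; range(1,10) is a call rather than a literal, so it would need separate handling.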
I have a max length of a list item that I need to enforce. How would I accomplish the following:
MAX_LENGTH = 13
>>> str(["hello","david"])[:MAX_LENGTH]
"['hello', 'da"
==> ["hello", "da"]
I was thinking using ast.literal_eval, but was wondering what might be recommended here.
I would caution against this. There have to be safer things to do than this. At the very least, you should never be splitting elements in half. For instance:
import sys
overrun = []
data = ["Hello,"] + ( ["buffer"] * 80 )
maxsize = 800
# sys.getsizeof(data) is now 840 in my implementation.
while True:
    while sys.getsizeof(data) > maxsize:
        overrun.append(data.pop())
    do_something_with(data)
    if overrun:
        data, overrun = overrun, []
    else:
        break
Here is a simplified version of #AdamSmith's answer which I ended up using:
import sys
from copy import copy
def limit_list_size(ls, maxsize=800):
    data = copy(ls)
    while (sys.getsizeof(str(data)) > maxsize):
        if not data: break
        data.pop()
    return data
Note that this will not split mid-word. And because this is returning a copy of the data, the user can see which items were excluded in the output. For example:
old_ls = [...]
new_ls = limit_list_size(old_ls)
overflow_ls = list(set(old_ls) - set(new_ls))
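If the limit is meant in characters of the printed list rather than in bytes of memory, a variant that measures len(str(data)) may be closer to the original question, since sys.getsizeof also counts the string object's own overhead. This is only a sketch, and limit_list_by_repr is a hypothetical name:

```python
def limit_list_by_repr(items, max_chars):
    """Drop trailing items until str(list) fits in max_chars characters."""
    data = list(items)  # work on a copy so the caller's list is untouched
    while data and len(str(data)) > max_chars:
        data.pop()
    return data


print(limit_list_by_repr(["hello", "david"], 13))
```

Like the version above, it removes whole items only and never splits one in half.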
If you want to cap the total length of your concatenated strings at MAX_LENGTH, you could do it pretty simply with a loop, using something like this:
def truncateStringList(myArray):
    currentLength = 0
    result = []
    for string in myArray:
        if (currentLength + len(string)) > MAX_LENGTH:
            # keep only the characters that still fit
            result.append(string[:MAX_LENGTH - currentLength])
            return result
        else:
            result.append(string)
            currentLength += len(string)
    return result
If you want to limit the length of the string representation instead, each element effectively gains 2 characters at the beginning ([' or a space and a quote) and 2 at the end (', or ']), so add 2 to the current length before and after each element in the loop:
    for string in myArray:
        currentLength += 2
        if (currentLength + len(string)) > MAX_LENGTH:
            # keep only the characters that still fit (never a negative slice)
            result.append(string[:max(MAX_LENGTH - currentLength, 0)])
            return result
        else:
            result.append(string)
            currentLength += len(string) + 2
    return result
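Put together as one self-contained function, the string-representation variant might look like the sketch below. The name truncate_repr and the max_length parameter are made up for the example, and it assumes the elements contain no quotes or characters that repr would escape:

```python
def truncate_repr(items, max_length):
    """Truncate a list of strings so that str(list) fits in max_length chars."""
    result = []
    used = 0
    for text in items:
        used += 2  # the two characters in front of the element: [' or space + quote
        remaining = max_length - used
        if remaining <= 0:
            break  # no room left for even one character of this element
        if len(text) > remaining:
            result.append(text[:remaining])  # keep only the part that fits
            break
        result.append(text)
        used += len(text) + 2  # the element plus the two characters after it
    return result


print(truncate_repr(["hello", "david"], 13))
```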
With loops:
max = 11
mylist = ['awdawdwad', 'uppps']
newlist = ','.join(mylist)
print mylist
c= [x for x in newlist if not x==',']
if len(c)>max:
    newlist= list(newlist)
    newlist.reverse()
    for x in range(len(c)-max):
        if newlist[0]==',':
            del newlist[0]
            del newlist[0] # delete two times
        else:
            del newlist[0]
    while newlist[0]==',':
        del newlist[0]
    newlist.reverse()
cappedlist = ''.join(newlist).split(',')
print cappedlist
I am wondering how to append a newline every time the list reaches the size of the checkerboard (8). Here's my code so far. It works, but I want to put a newline every 8 characters.
saveFile=input("Please enter the name of the file you want to save in: ")
outputFile=open(saveFile,"w")
pieceList=[]
for row_index in range (self.SIZE):
    for column_index in range(self.SIZE):
        pieceRow=[]
        char=" "
        if self.grid[row_index][column_index]==Piece(Piece.WHITE):
            char="w"
        elif self.grid[row_index][column_index]==Piece(Piece.RED):
            char="r"
        pieceRow.append(char)
        pieceList.append(pieceRow)
for item in pieceList:
    for char in item:
        outputFile.write("%s" %char)
Use
if row_index % 8 == 0:
    # put newline
saveFile=input("Please enter the name of the file you want to save in: ")
outputFile=open(saveFile,"w")
pieceList=[]
characterCounter =0
for row_index in range (self.SIZE):
    for column_index in range(self.SIZE):
        pieceRow=[]
        char=" "
        if self.grid[row_index][column_index]==Piece(Piece.WHITE):
            char="w"
        elif self.grid[row_index][column_index]==Piece(Piece.RED):
            char="r"
        pieceRow.append(char)
        characterCounter += 1  # Python has no ++ operator
        if characterCounter==8:
            pieceRow.append("\n")
            characterCounter=0
        pieceList.append(pieceRow)
for item in pieceList:
    for char in item:
        outputFile.write("%s" %char)
cnt = 0
for item in pieceList:
    for char in item:
        outputFile.write("%s" %char)
        cnt += 1
        if cnt == 8:
            outputFile.write("\n")
            cnt = 0
Do you mean you want to append newline after each row?
Why not add
pieceRow.append("\n")
before
pieceList.append(pieceRow)
You could use enumerate.
For instance,
CONST_SIZE = 8
for index, item in enumerate(item_list):
    output_file.write(item)
    # index starts at 0, so the 8th item has index 7
    if (index + 1) % CONST_SIZE == 0:
        output_file.write('\n')
This will append a newline each time you've written 8 items to the file.
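Another option is to build the text first by slicing the flat list of characters into rows, and only then write it out in one go. In this sketch, rows_to_text and its size parameter are hypothetical names, and the input is assumed to be a flat list of one-character strings:

```python
def rows_to_text(chars, size=8):
    """Join a flat list of characters into lines of `size` characters each."""
    rows = ["".join(chars[start:start + size])
            for start in range(0, len(chars), size)]
    return "".join(row + "\n" for row in rows)


board = list("w w w w  r r r r")  # 16 characters, i.e. two rows of 8
print(rows_to_text(board), end="")
```

The resulting string can then be passed to a single outputFile.write(...) call.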