data = cursor.fetchone()
while data is not None:
    data = str(data)
    print(data)
    data.split(",")
    for index in data:
        if index.startswith("K"):
            print(index, end=';')
    data = cursor.fetchone()
Here is the relevant part of my code. The data is retrieved from a MySQL server and is a long string of text separated by commas. I can split the string on the commas fine; however, I then need to search for 4-letter strings. I know they start with "K", but when I run the program it only prints the K. How do I get it to print the whole string?
Sample Data:
"N54,W130,KRET,KMPI,COFFEE"
Expected output:
"N54,W130,KRET,KMPI,COFFEE"
"KRET;KMPI;"
Your line data.split(",") does nothing, because you need to assign the result. You also said you want to print only the K? If so, you only want the first character of the string, so use d[0].
data = cursor.fetchone()
while data is not None:
    data = str(data)
    print(data)
    data = data.split(",")
    for d in data:
        if d.startswith("K"):
            print(d, end=';')
    data = cursor.fetchone()
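For anyone without a database handy, the same fixed loop can be checked against the sample string from the question:

```python
# Same logic as the fixed loop above, run on the sample string instead of a cursor
data = "N54,W130,KRET,KMPI,COFFEE"
print(data)
for d in data.split(","):
    if d.startswith("K"):
        print(d, end=';')
# prints: KRET;KMPI;
```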
EDIT: Based on your edit it seems you want the entire string to be printed, so I updated the code for that.
If you are looking for 4-letter strings starting with "K", what about using regular expressions?
import re

regex = r"K[a-zA-Z0-9]{3}"
test_str = "N54,W130,KRET,KMPI,COFFEE"
matches = re.finditer(regex, test_str)
output = ""
for match in matches:
    output += match.group() + ";"
print(output)
The output is: KRET;KMPI;
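As a side note, re.findall returns all the matches as a list directly, so the explicit loop can be skipped (same pattern as above):

```python
import re

test_str = "N54,W130,KRET,KMPI,COFFEE"
# findall returns every non-overlapping match as a list of strings
matches = re.findall(r"K[a-zA-Z0-9]{3}", test_str)
print(";".join(matches) + ";")  # KRET;KMPI;
```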
I have a text file with this content:
group11#,['631', '1051']#,ADD/H/U_LS_FR_U#,group12#,['1', '1501']#,ADD/H/U_LS_FR_U#,group13#,['31', '28']#,ADD/H/UC_DT_SS#,group14#,['18', '27', '1017', '1073']#,AN/H/UC_HR_BAN#,group15#,['13']#,AD/H/U_LI_NW#,group16#,['1031']#,AN/HE/U_LE_NW_IES#
The requirement is to pull each element separated by #, and to store it in a separate variable. The text file above does not have a fixed length, so if there are 200 #-separated values, those should be stored in 200 variables.
So the expected output would be
a = group11, b = [631, 1051] c = ADD/H/U_LS_FR_U, d = group12, e = [1, 1501] f = ADD/H/U_LS_FR_U and so on
I'd then use those a, b, c, d further as:
url = (url+c)
rjson = {"reqparam":{"ids":[str(b)]+str(b)}]}
freq = json.dumps(rjson)
resp = request.request("Post",url,rjson)
Actually, in reqparam, 'b' has to use the values 631 and 1051.
I'm not sure how to achieve this.
I've started with
with open("filename.txt", "r") as f:
    data = f.readlines()

for line in data:
    value = line.strip().split('#')
    print(value)
You should not use a new variable for each object; there are different containers for this, e.g. a list.
To parse this string into a list, you can just split the string using "#," as a divider, after cutting the last symbol (which is "#") from the source:
result = src[:-1].split("#,")
But in the output sample you show that you want items which contain a list to be converted into an actual list. You can do this using ast.literal_eval():
import ast
result = [ast.literal_eval(s) if "[" in s else s for s in src[:-1].split("#,")]
I used a list comprehension in the previous example, but you can write it using a regular for loop:
import ast

result = []
for s in src[:-1].split("#,"):
    if "[" in s:
        try:
            converted = ast.literal_eval(s)  # string repr of a list into a list
        except Exception as e:
            print(f'"{s}" throws an error: {e}')
        else:
            result.append(converted)
    else:
        result.append(s)
You can also use str.strip() to cut "#" and "," from the end of the string (and from the start):
src.strip(",#").split(",#")
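To sanity-check, here is the one-liner run on a shortened version of the sample line (the full line behaves the same way):

```python
import ast

# shortened sample line from the question, for illustration
src = "group11#,['631', '1051']#,ADD/H/U_LS_FR_U#"
result = [ast.literal_eval(s) if "[" in s else s
          for s in src.strip(",#").split("#,")]
print(result)  # ['group11', ['631', '1051'], 'ADD/H/U_LS_FR_U']
```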
I am using the following code to bring back prices from an ecommerce website:
response.css('div.price.regularPrice::text').extract()
but getting the following result:
'\r\n\t\t\tDhs 5.00\r\n\t\t\t\t\t\t\t\t',
'\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t',
I do not want the slashes and letters and only the number 5. How do I get this?
First you can use strip() to remove the surrounding whitespace: tabs "\t", newlines "\n", and carriage returns "\r".
data = ['\r\n\t\t\tDhs 5.00\r\n\t\t\t\t\t\t\t\t',
'\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t']
data = [item.strip() for item in data]
and you get
['Dhs 5.00', '']
Next you can use if to skip empty elements
data = [item for item in data if item]
and you get
['Dhs 5.00']
If item always has the same structure "Dhs XXX.00",
then you can use slicing [4:-3] to remove "Dhs " and ".00"
data = [item[4:-3] for item in data]
and you get
['5']
Now you only have to get the first element, data[0], to get "5".
If you need to, you can convert the string "5" to the integer 5 using int():
result = int(data[0])
You can even put it all in one line:
data = ['\r\n\t\t\tDhs 5.00\r\n\t\t\t\t\t\t\t\t',
'\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t']
data = [item.strip()[4:-3] for item in data if item.strip()]
result = int(data[0])
If you always need only the first element from the list, then you can write it as:
data = ['\r\n\t\t\tDhs 5.00\r\n\t\t\t\t\t\t\t\t',
'\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t']
result = int( data[0].strip()[4:-3] )
Use a regex to fetch only the numbers.
The \d+ expression should do the trick.
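A minimal sketch of that approach; note that \d+ would also match the "00" after the decimal point, so take the first match for the whole-unit price:

```python
import re

raw = '\r\n\t\t\tDhs 5.00\r\n\t\t\t\t\t\t\t\t'
m = re.search(r"\d+", raw)  # first run of digits, i.e. the "5"
if m:
    print(int(m.group()))  # 5
```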
I have a file, and I am trying to replace parts of a line with another word.
A line looks like bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212
I need to delete everything but bob123#bobscarshop.com, but I need to match 23rh32o3hro2rh2 with 23rh32o3hro2rh2:poniacvibe from a different text file, and place poniacvibe in front of bob123#bobscarshop.com,
so it would look like this: bob123#bobscarshop.com:poniacvibe
I've had a hard time trying to go about doing this, but I think I would have to split bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212 with data.split(":"), but some of the lines have a ":" in a spot that I don't want the line to be split at, if that makes any sense...
If anyone could help I would really appreciate it.
ok, it looks to me like you are using a colon : to separate your strings.
in this case you can use .split(":") to break your strings into their component substrings
eg:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
print(firststring.split(":"))
would give:
['bobkeiser', 'bob123#bobscarshop.com', '0.0.0.0.0', '23rh32o3hro2rh2', '234212']
and assuming your substrings will always be in the same order, and the same number of substrings in the main string you could then do:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
firstdata = firststring.split(":")
secondstring = "23rh32o3hro2rh2:poniacvibe"
seconddata = secondstring.split(":")

if firstdata[3] == seconddata[0]:
    outputdata = firstdata
    outputdata.insert(1, seconddata[1])
    outputstring = ""
    for item in outputdata:
        if outputstring == "":
            outputstring = item
        else:
            outputstring = outputstring + ":" + item
what this does is:
extract the bits of the strings into lists
see if the "23rh32o3hro2rh2" string can be found in the second list
find the corresponding part of the second list
create a list to contain the output data and put the first list into it
insert the "poniacvibe" string before "bob123#bobscarshop.com"
stitch the outputdata list back into a string using the colon as the separator
the reason your strings need to have the same number of parts, in the same order, is that the index is being used to find the relevant substrings rather than some form of string matching (which gets much more complex)
if you can keep your data in this form it gets much simpler.
to protect against malformed data (lists too short) you can explicitly test for them before you start, using len(list) to see how many elements are in it.
or you could let it run and catch the exception, however in this case you could end up with unintended results, as it may try to match the wrong elements from the list.
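A minimal sketch of that explicit length check, using the sample strings from above:

```python
firstdata = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212".split(":")
seconddata = "23rh32o3hro2rh2:poniacvibe".split(":")

# only index into the lists once we know they hold enough elements
if len(firstdata) >= 4 and len(seconddata) >= 2:
    matched = firstdata[3] == seconddata[0]
else:
    matched = False
print(matched)  # True
```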
hope this helps
James
EDIT:
ok so if you are trying to match up a long list of strings from files you would probably want something along the lines of:
firstfile = open("firstfile.txt", mode="r")
secondfile = open("secondfile.txt", mode="r")

first_raw_data = firstfile.readlines()
firstfile.close()
second_raw_data = secondfile.readlines()
secondfile.close()

first_data = []
for item in first_raw_data:
    first_data.append(item.replace("\n", "").split(":"))
second_data = []
for item in second_raw_data:
    second_data.append(item.replace("\n", "").split(":"))

output_strings = []
for item in first_data:
    searchstring = item[3]
    for entry in second_data:
        if searchstring == entry[0]:
            output_data = item
            output_string = ""
            output_data.insert(1, entry[1])
            for data in output_data:
                if output_string == "":
                    output_string = data
                else:
                    output_string = output_string + ":" + data
            output_strings.append(output_string)
            break

for entry in output_strings:
    print(entry)
this should achieve what you're after, and as proof of concept it will print the resulting list of strings for you.
if you have any questions feel free to ask.
James
Second edit:
to make this output the results into a file change the last two lines to:
outputfile = open("outputfile.txt", mode="w")
for entry in output_strings:
    outputfile.write(entry + "\n")
outputfile.close()
I need to parse a multi-line string into a data structure containing (1) the identifier and (2) the text after the identifier (but before the next > symbol). The identifier always comes on its own line, but the text can take up multiple lines.
>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa
after execution I might have the data structured something like this:
id = ['identifier1', 'identifier2', 'identifier3']
and
txt =
['lalalalalalalalalalalalalalalalala',
'bababababababababababababababababa',
'wawawawawawawawawawawawawawawawawa']
It seems I would want to use regex to find (1) things after > but before the end of the line, and (2) things between >'s, having temporarily deleted the identifier string and EOL, replacing them with "".
The thing is, I will have hundreds of these identifiers, so I need to run the regex sequentially. Any ideas on how to attack this problem? I am working in python, but feel free to use whatever language you want in your response.
Update 1: code from slater is getting closer, but things are still not partitioned sequentially into id, text, id, text, etc.
teststring = '''>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa'''
# First, split the text into relevant chunks
split_text = teststring.split('>')
#see where we are after split
print split_text
#remove spaces that will mess up the partitioning
while '' in split_text:
    split_text.remove('')
#see where we are after removing '', before partitioning
print split_text
id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]
#see where we are after partition
print id
print txt
print len(split_text)
print len(id)
but the output was:
['', 'identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
3
3
Note: it needs to work for a multi-line string, dealing with all the \n's. A better test case might be:
teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa
wawawawawawawawawawawawawawawawawa'''
# First, split the text into relevant chunks
split_text = teststring.split('>')
#see where we are after split
print split_text
#remove spaces that will mess up the partitioning
while '' in split_text:
    split_text.remove('')
#see where we are after removing '', before partitioning
print split_text
id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]
#see where we are after partition
print id
print txt
print len(split_text)
print len(id)
current output:
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
4
4
Personally, I feel that you should use regex as little as possible. It's slow, difficult to maintain, and generally unreadable.
That said, solving this in python is extremely straightforward. I'm a little unclear on what exactly you mean by running this "sequentially", but let me know if this solution doesn't fit your needs.
# First, split the text into relevant chunks
split_text = text.split('>')
id = [text.partition('\n')[0] for text in split_text]
txt = [text.partition('\n')[2] for text in split_text]
Obviously, you could make the code more efficient, but if you're only dealing with hundreds of identifiers it really shouldn't be needed.
If you want to remove any blank entries that might occur, you can do the following:
list_with_blanks = ['', 'hello', '', '', 'world']
filter(None, list_with_blanks)
>>> ['hello', 'world']
(In Python 3, filter() returns an iterator, so wrap it: list(filter(None, list_with_blanks)).)
Let me know if you have any more questions.
Unless I misunderstood the question, it's as easy as
for line in your_file:
    if line.startswith('>'):
        id.append(line[1:].strip())
    else:
        text.append(line.strip())
Edit: to concatenate multiple lines:
ids, text = [], []
for line in teststring.splitlines():
    if line.startswith('>'):
        ids.append(line[1:])
        text.append('')
    elif text:
        text[-1] += line
I found a solution. It's certainly not very pythonic but it works.
teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala\n
lalalalalalalalalalalalalalalalala\n
>identifier2
bababababababababababababababababa\n
bababababababababababababababababa\n
>identifier3
wawawawawawawawawawawawawawawawawa\n
wawawawawawawawawawawawawawawawawa\n'''
i = 0
j = 0
# split the multiline string by line
dsplit = teststring.split('\n')
# the indicies of identifiers
index = list()
for line in dsplit:
    if line.startswith('>'):
        print line
        index.append(i)
        j = j + 1
    i = i + 1
index.append(i)  # add this so you get the last block of text

# the text corresponding to each index
thetext = list()
# the names corresponding to each gene
thenames = list()
for n in range(0, len(index)-1):
    thetext.append("")
    for k in range(index[n]+1, index[n+1]):
        thetext[n] = thetext[n] + dsplit[k]
    thenames.append(dsplit[index[n]][1:])  # the [1:] removes the first character (>) from the line

print "the indicies", index
print "the text: ", thetext
print "the names", thenames
print "this many text entries: ", len(thetext)
print "this many index entries: ", j
this gives the following output:
>identifier1
>identifier2
>identifier3
the indicies [1, 6, 11, 16]
the text: ['lalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalala', 'babababababababababababababababababababababababababababababababababa', 'wawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawa']
the names ['identifier1', 'identifier2', 'identifier3']
this many text entries: 3
this many index entries: 3
When I try to print splited_data[1] I'm getting the error message IndexError: list index out of range. On the other hand, splited_data[0] is working fine.
I want to insert data into MySQL. splited_data[0] holds my MySQL columns and splited_data[1] holds the MySQL column values. If splited_data[1] is empty, I want to insert an empty string into MySQL instead of getting IndexError: list index out of range. How do I avoid this error? Please help me, thank you.
Here is my code, which is working fine; I only get this error message when splited_data[1] is empty.
def clean(data):
    data = data.replace('[[', '')
    data = data.replace(']]', '')
    data = data.replace(']', '')
    data = data.replace('[', '')
    data = data.replace('|', '')
    data = data.replace("''", '')
    data = data.replace("<br/>", ',')
    return data
for t in xml.findall('//{http://www.mediawiki.org/xml/export-0.5/}text'):
    m = re.search(r'(?ms).*?{{(Infobox film.*?)}}', t.text)
    if m:
        k = m.group(1)
        k.encode('utf-8')
        clean_data = clean(k)  # clean() is used to strip garbage data from the text
        filter_data = clean_data.splitlines(True)  # split the data into lines
        filter_data.pop(0)
        for index, item in enumerate(filter_data):
            splited_data = item.split(' = ', 1)
            print splited_data[0], splited_data[1]
            # splited_data[0] used as mysql column
            # splited_data[1] used as mysql values
Here is the splited_data output:
[u' music ', u'Jatin Sharma\n']
[u' cinematography', u'\n']
[u' released ', u'Film datedf=y201124']
split_data = item.partition('=')
# If there was an '=', then it is now in split_data[1],
# and the pieces you want are split_data[0] and split_data[2].
# Otherwise, split_data[0] is the whole string, and
# split_data[1] and split_data[2] are empty strings ('').
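For example, with one of the sample lines from the question and a line containing no '=' at all:

```python
col, sep, val = ' music = Jatin Sharma\n'.partition('=')
print(col.strip(), '|', val.strip())  # music | Jatin Sharma

col, sep, val = 'no equals sign here'.partition('=')
print(repr(col), repr(val))  # 'no equals sign here' ''
```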
Try splitting on the equals sign without the whitespace on both sides, like this:
splited_data = item.split('=', 1)
A list is contiguous, so you need to make sure its length is greater than your index before you try to access it:
'' if len(splited_data) < 2 else splited_data[1]
You could also check before you split:
if '=' in item:
    col, val = item.split('=', 1)
else:
    col, val = item, ''