Not getting the proper output with re.findall - python

I'm trying to write code that reads my input CSV file with pandas (df_input) and then uses re.findall to look for any occurrence of the entries in a list. This list is imported from another .csv file, where column[0] (df_expressions) contains the expressions I want the code to search for and column[1] (df_translation) contains the values I want the code to return when there's an exact match. This way, when I search for colors like 'burgundy' and 'maroon', they get translated to 'red'. I chose this setup so I can change my expressions and translations without having to change the code itself.
import re
df_name = df_input[0]
def expression(expr, string):
    # True if expr occurs in string as a whole word (case-insensitive)
    return len(re.findall(r'\b' + expr + r'\b', string, re.I)) > 0
resultlist = []
for lineIndex in range(0, len(df_input)):
    matches_list = []
    for expIndex in range(0, len(df_expressions)):
        if expression(str(df_expressions.ix[expIndex]), str(df_name.ix[lineIndex])):
            matches_list.append(df_translation.ix[expIndex])
    resultlist.append(matches_list)  # one list of matches per input row
df_input['Color'] = resultlist
These are the return values:
resultlist
[['Black'], ['White'], ['Blue'], ['Red', 'Black'], ['Pink'], .....
Current output as found in my output.csv after df_input.to_csv(filepath+filename):
Name,Color
a black car,['Black']
a white paper,['White']
the sky is blue,['Blue']
this product is burgundy and black,['Red, Black']
just pink,['Pink']
Preferred output.csv:
Name,Color
a black car,Black
a white paper,White
the sky is blue,Blue
this product is burgundy and black,Red;Black
just pink,Pink
Is it possible to lose the brackets and quotes so whenever I do df_input.to_csv(filepath+filename) I get a clean output?
I've tried df.replace() - doesn't work, neither does adding [0] at the end of my re.findall and a bunch of other stuff. Only thing that seems to do the job is to str(resultlist).replace(), but then the index-match combination is pretty messed up. Any suggestions?

Try the following change and see how it behaves.
Replace
df_input['Color'] = resultlist
With
df_input['Color'] = [', '.join(c) for c in resultlist]
This should transform resultlist into ['Black', 'White', 'Blue', 'Red, Black', 'Pink', ...]
Use ';'.join(c) instead if you want the Red;Black form from your preferred output.
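As a small sketch of the same idea with the semicolon separator from the preferred output: this assumes resultlist holds one list of matched translations per input row. The sample values are copied from the question, and join_matches is a hypothetical helper name, not something from the original code.

```python
# Sample match lists taken from the question's resultlist.
resultlist = [['Black'], ['White'], ['Blue'], ['Red', 'Black'], ['Pink']]

def join_matches(matches, sep=';'):
    """Collapse one row's list of matches into a single string."""
    return sep.join(matches)

colors = [join_matches(m) for m in resultlist]
print(colors)
```

Assigning colors to df_input['Color'] before calling to_csv should then write plain strings instead of list representations.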

Related

Excel cell into list in Python

So I have an Excel column which contains Python lists.
The problem is that when I loop through it in Python, the cells are read as str. Attempting to split a cell produces list items such as:
list = ["['Gdynia',", "'(2262011)']"]
list[0] = "['Gdynia',"
list[1] = "'(2262011)']"
I only want the city name, e.g. 'Gdynia' or 'Tczew'. Any idea how I can do that?
You can split the string at a chosen symbol; ' would work well for your example.
Then you get a list of strings and you can choose the part you need.
cell = "['Gdynia', '(2262011)']"
cell_parts = cell.split("'") # ['[', 'Gdynia', ', ', '(2262011)', ']']
city = cell_parts[1] # 'Gdynia'
Solution with re:
import re
data = ["['Gdynia', '(2262011)']",
        "['Tczew', '(2214011)']",
        "['Zory', '(2479011)']"]
r = re.compile("'(.*?)'")
print(*[r.search(s).group(1) for s in data], sep='\n')
Output
Gdynia
Tczew
Zory
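Another option, sketched here under the assumption that every cell really contains a valid Python list literal, is to parse the string back into a list with ast.literal_eval from the standard library and take the first element. The cells values and the city_from_cell name are illustrative only.

```python
import ast

# Cell contents as they might come back from Excel: strings that look like lists.
cells = ["['Gdynia', '(2262011)']", "['Tczew', '(2214011)']"]

def city_from_cell(cell):
    """Parse the list literal stored in the cell and return its first element."""
    parsed = ast.literal_eval(cell)  # evaluates Python literals only, not arbitrary code
    return parsed[0]

cities = [city_from_cell(c) for c in cells]
print(cities)
```

If a cell is not a well-formed literal, literal_eval raises an exception, so the regex approach may be more forgiving for messy data.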

python script not joining strings as expected

I have a list of lists of sequences, and a corresponding list of lists of names.
testSequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
testNames = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
I also have a list of all the identifying parts of the names:
taxonNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
I am trying to produce a new list, where each item in the list will correspond to one of the "identifying parts of the names", and the string will be made up of all the sequences for that name.
If a name and its sequence do not appear in one of the sublists (e.g. there is no redFish or blueFish in the first sublist of testNames), I want to add a string of hyphens the same length as the sequences in that sublist. This would give me this output:
['aaaa--AAAAAAA', 'cccc--CCCCCC', '----ttTTTTTT', '----ggGGGGGG']
I have this piece of code to do this.
complete = [''] * len(taxonNames)
for i in range(len(testSequences)):
    for j in range(len(taxonNames)):
        sequenceLength = len(testSequences[i][0])
        for k in range(len(testSequences[i])):
            if taxonNames[j] in testNames[i][k]:
                complete[j].join(testSequences[i][k])
            if taxonNames[j] not in testNames[i][k]:
                hyphenString = "-" * sequenceLength
                complete[j].join(hyphenString)
print complete
"complete" should give my final output as explained above, but it comes out looking like this:
['', '', '', '']
How can I fix my code to give me the correct answer?
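The immediate cause of the empty output is that str.join returns a new string and never modifies complete[j], so every result is discarded. Beyond that, the main issue with your code, which makes it very hard to understand, is that you're not really leveraging the language elements that make Python so strong.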
Here's a solution to your problem that works:
test_sequences = [
    ['aaaa', 'cccc'],
    ['tt', 'gg'],
    ['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
test_names = [
    ['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
    ['>xx_redFish |zxx', '>xx_blueFish |zxx'],
    ['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
taxon_names = ['oneFish', 'twoFish', 'redFish', 'blueFish']
def get_seqs(taxon_name, sequences_list, names_list):
    for seqs, names in zip(sequences_list, names_list):
        found_seq = None
        for seq, name in zip(seqs, names):
            if taxon_name in name:
                found_seq = seq
                break
        yield found_seq if found_seq else '-' * len(seqs[0])
result = [''.join(get_seqs(taxon_name, test_sequences, test_names))
          for taxon_name in taxon_names]
print(result)
The generator get_seqs pairs up lists from test_sequences and test_names and for each pair, tries to find the sequence (seq) for the name (name) that matches and yields it, or yields a string of the right number of hyphens for that list of sequences.
The generator (a function that yields multiple values) has code that quite literally follows the explanation above.
The result is then simply a matter of, for each taxon_name, getting all the resulting sequences that match in order and joining them together into a string, which is the result = ... line.
You could make it work with list indexing loops and string concatenation, but this is not a PHP question, now is it? :)
Note: for brevity, you could just access the global test_sequences and test_names instead of passing them in as parameters, but I think that would come back to haunt you if you were to actually use this code. Also, I think it makes semantic sense to change the order of names and sequences in the entire example, but I didn't to avoid further deviating from your example.
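To see why the loop in the question leaves every entry empty, here is a minimal sketch (the two-element complete list is a made-up example): strings are immutable, so str.join hands back a new string and the original element stays as it was unless the result is assigned.

```python
complete = ['', '']

# str.join returns a new string; the result is thrown away here,
# so complete[0] is left unchanged.
complete[0].join('aaaa')
unchanged = complete[0]

# Rebinding the list element keeps the result.
complete[0] += 'aaaa'
complete[0] += '-' * 2
print(repr(unchanged), complete)
```

Note that x.join(y) also treats x as the separator placed between the items of y, which is a different operation from appending y to x.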
Here is a solution that may do what you want. It begins, not with your data structures from this post, but with the three example files from your previous post (which you used to build this post's data structures).
The only thing I couldn't figure out was how many hyphens to use for a missing sequence from a file.
differentNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
files = ['f1.txt', 'f2.txt', 'f3.txt']
data = [[] for _ in range(len(differentNames))]
final = []
for file in files:
    d = dict()
    with open(file, 'r') as fin:
        for line in fin:
            line = line.rstrip()
            if line.startswith('>'): # for ex., >xx_oneFish |xxx
                underscore = line.index('_')
                space = line.index(' ')
                key = line[underscore+1:space]
            else:
                d[key] = line
    for i, key in enumerate(differentNames):
        data[i].append(d.get(key, '-' * 4))
for array in data:
    final.append(''.join(array))
print(final)
Prints:
['AAAAAAAaaaa----', 'CCCCCCcccc----', 'TTTTTT----tt', 'GGGGGG----gg']

Removing '\n', [, and ] in python3

I'm writing a GUI that will generate random names for taverns for some tabletop gameplay. I have .txt docs that contain something like this.
Red
Green
Yellow
Resting
Young
And
King
Dragon
Horse
Salmon
I'm reading and randomly joining them together using the following
x = 1
tavern1 = open('tavnames1.txt', 'r')
name1 = tavern1.readlines()
tav1 = random.sample(name1, int(x))
tav1 = str(tav1)
tav1 =tav1.strip()
tav1 =tav1.replace('\n', '')
tavern2 = open('tavnames2.txt', 'r')
name2 = tavern2.readlines()
tav2 = random.sample(name2, int(x))
tav2 = str(tav2)
TavernName = 'The' + tav1 + tav2
print(TavernName)
The output I get will look something like
The['Young\n']['Salmon\n']
I've tried using .replace() and .strip() on the string but it doesn't seem to work.
Any ideas?
Cheers.
sample() always returns a list, even if there is only one element. You then use str() to convert that list into a string, so Python adds [ ], and strip() doesn't work because the \n is not at the end of the string.
But you can use random.choice(), which returns a single element, so you don't have to convert to a string and you don't get [ ]. Then you can use strip() to remove the \n.
import random
tavern1 = open('tavnames1.txt')
name1 = tavern1.readlines()
tav1 = random.choice(name1).strip()
tavern2 = open('tavnames2.txt')
name2 = tavern2.readlines()
tav2 = random.choice(name2).strip()
tavern_name = 'The {} {}'.format(tav1, tav2)
print(tavern_name)
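The following sketch may help show why replace('\n', '') had no effect in the question. It assumes random.sample returned the one-element list shown (sampled is a stand-in value): str() on a list produces its repr, in which the newline is written as a backslash followed by the letter n.

```python
sampled = ['Young\n']          # what random.sample(name1, 1) might return

as_text = str(sampled)         # the list's repr, brackets and quotes included
print(as_text)

# The newline is now spelled as a backslash followed by 'n',
# so there is no real newline character left to replace.
has_real_newline = '\n' in as_text

cleaned = sampled[0].strip()   # index the list first, then strip
print(has_real_newline, cleaned)
```

Working on the list element itself, instead of on the string form of the whole list, avoids the brackets and quotes as well.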
A way to get rid of the newlines is to read the whole file and use splitlines() (see Reading a file without newlines):
tavern1 = open('tavnames1.txt', 'r')
name1 = tavern1.read().splitlines()
To pick a random item from the list name1 you can use tav1 = random.choice(name1) (see https://docs.python.org/3.6/library/random.html#random.choice).
If you keep random.sample instead, you get a list of values. Because you take just one sample, you get a list with just one item in it, in your example "Young\n", so it looks like ["Young\n"]. To access only that item, drop the str() call and take the first (and only) element with tav1[0].
Calling tav1[0].strip() then takes care of the \n. Do the same for tav2.

Split specific items in list into two

I'm building an XML parser in python for an SVG file. It will eventually become specific instructions for stepper motors.
SVG files contain commands such as 'M', 'C' and 'L'. The path data might look like this:
[M199.66, 0.50C199.6, 0.50...0.50Z]
When I extract the path data, I get a list with one item (a string). I split the long string into multiple strings:
[u'M199.6', u'0.50C199.66', u'0.50']
The 'M, C and L' commands are important - I'm having difficulty splitting '0.5C199.6' into '0.5' and 'C199.6' because it only exists for certain items in the list, and I'd like to retain the C and not discard it. This is what I have so far:
for item in path_strings[0]:
    s=string.split(path_strings[0], ',')
    print s
    break
for i in range(len(s)):
    coordinates=string.split(s[i],'C')
    print coordinates
    break
You could try breaking it into substrings like this:
whole = "0.5C199.66"
start = whole[0:whole.find("C")]
end = whole[whole.find("C"):]
That should give you start == "0.5" and end == "C199.66"
Alternatively you could use the index function instead of find. index raises a ValueError when the substring can't be found, whereas find returns -1, which would silently produce wrong slices. That gives you an easy way to determine that no 'C' command is present in the current string.
http://docs.python.org/2/library/string.html#string-functions
Use a regex to search for the commands ([MCL]).
import re
lst = [u'M199.6', u'0.50C199.66', u'0.50']
for i, j in enumerate(lst):
    m = re.search('(.+?)([MCL].+)', j)
    if m:
        print([m.group(1), m.group(2)]) # = coordinates from your example
        lst[i:i+1] = [m.group(1), m.group(2)] # replace the item in lst with the two split parts
        # or do something else with the coordinates, whatever you want.
print(lst)
This splits your list into:
[u'M199.6', u'0.50', u'C199.66', u'0.50']
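A variation on the regex idea, sketched under the assumption that only the letters M, C and L act as commands: re.findall can cut each item into pieces so that every command letter starts a new piece, without modifying the list while iterating over it. COMMANDS, TOKEN and split_commands are hypothetical names chosen for this example.

```python
import re

# Assumed command letters; extend the set if the path uses others.
COMMANDS = "MCL"
# Either a command letter with the text that follows it,
# or a run of text that contains no command letter.
TOKEN = re.compile(r"[{0}][^{0}]*|[^{0}]+".format(COMMANDS))

def split_commands(items):
    """Split every item so that each command letter starts its own piece."""
    pieces = []
    for item in items:
        pieces.extend(TOKEN.findall(item))
    return pieces

parts = split_commands(['M199.6', '0.50C199.66', '0.50'])
print(parts)
```

Building a new list keeps the original path_strings untouched, which may be convenient while debugging the parser.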

Using conditionals with variable strings in python

I'm pretty new to python, but I think I catch on fast.
Anyways, I'm making a program (not for class, but to help me) and have come across a problem.
I'm trying to document a list of things, and by things I mean close to a thousand of them, with some repeating. So my problem is this:
I would not like to add redundant names to the list, instead I would just like to add a 2x or 3x before (or after, whichever is simpler) it, and then write that to a txt document.
I'm fine with reading and writing from text documents, but my only problem is the conditional statement, I don't know how to write it, nor can I find it online.
for lines in list_of_things:
    if(lines=="XXXX x (name of object here)"):
And then whatever under the if statement. My only problem is that the "XXXX" can be replaced with any string number, but I don't know how to include a variable within a string, if that makes any sense. Even if it is turned into an int, I still don't know how to use a variable within a conditional.
The only thing I can think of is making multiple if statements, which would be really long.
Any suggestions? I apologize for the wall of text.
I'd suggest looping over the lines in the input file and inserting a key in a dictionary for each one you find, then incrementing the value at the key by one for each instance of the value you find thereafter, then generating your output file from that dictionary.
catalog = {}
for line in input_file:
    if line in catalog:
        catalog[line] += 1
    else:
        catalog[line] = 1
Alternatively:
from collections import defaultdict
catalog = defaultdict(int)
for line in input_file:
    catalog[line] += 1
Then just run through that dict and print it out to a file.
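The counting-then-printing step could look roughly like the sketch below. It uses collections.Counter from the standard library in place of the hand-written dictionary, and the "3x name" layout, the sample things list and the format_counts name are assumptions made for illustration.

```python
from collections import Counter

def format_counts(names):
    """Return one line per distinct name, prefixed with 'Nx ' when it repeats."""
    counts = Counter(names)  # maps each name to its number of occurrences
    lines = []
    for name, n in counts.items():
        lines.append("{}x {}".format(n, name) if n > 1 else name)
    return lines

things = ['sword', 'shield', 'sword', 'potion', 'sword']
output_lines = format_counts(things)
print("\n".join(output_lines))
```

Writing the result out is then a matter of passing "\n".join(output_lines) to the write method of a file opened in 'w' mode.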
You may be looking for regular expressions and something like
import re
for line in text:
    match = re.match(r'(\d+) x (.*)', line)
    if match:
        count = int(match.group(1))
        object_name = match.group(2)
        ...
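A self-contained sketch of that pattern in use is shown below. It assumes lines look like "3 x iron sword" and that a line without a count prefix stands for a single item; both the sample lines and the parse_line helper are hypothetical.

```python
import re

LINE = re.compile(r'(\d+) x (.*)')  # e.g. "3 x iron sword"

def parse_line(line):
    """Return (count, name); lines without a count prefix count as 1 (an assumption)."""
    match = LINE.match(line)
    if match:
        return int(match.group(1)), match.group(2)
    return 1, line

parsed = [parse_line(s) for s in ['3 x iron sword', 'shield', '12 x arrow']]
print(parsed)
```

Because \d+ matches any run of digits, a single pattern covers every possible count, so no chain of if statements is needed.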
Something like this?
list_of_things = ['XXXX 1', 'YYYY 1', 'ZZZZ 1', 'AAAA 1', 'ZZZZ 2']
for line in list_of_things:
    for e in ['ZZZZ', 'YYYY']:
        if e in line:
            print(line)
Output:
YYYY 1
ZZZZ 1
ZZZZ 2
You can also use if line.startswith(e): or a regex (if I am understanding your question...)
To include a variable in a string, use format():
>>> i = 123
>>> s = "This is an example {0}".format(i)
>>> s
'This is an example 123'
In this case, the {0} indicates that you're going to put a variable there. If you have more variables, use "This is an example {0} and more {1}".format(i, j) (so a number for each variable, starting from 0).
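This should do it, provided equal items are adjacent (groupby only groups consecutive equal items, so sort the list first if they are not):
from itertools import groupby
a = [1,1,1,1,2,2,2,2,3,3,4,5,5]
print(["%dx %s" % (len(list(group)), key) for key, group in groupby(a)])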
There are two ways to approach this. 1) Use a dictionary to capture the count of each item and then a list to format each item with its count:
list_of_things = ['sun', 'moon', 'green', 'grey', 'sun', 'grass', 'green']
listItemCount = {}
countedList = []
for lines in list_of_things:
    if lines in listItemCount:
        listItemCount[lines] += 1
    else:
        listItemCount[lines] = 1
for id in listItemCount:
    if listItemCount[id] > 1:
        countedList.append(id + ' - x' + str(listItemCount[id]))
    else:
        countedList.append(id)
for item in countedList:
    print(item)
The output of the above would be (line order follows dictionary ordering, so it can differ between Python versions):
sun - x2
grass
green - x2
grey
moon
or 2) use collections.Counter to make things simpler, as shown below:
import collections
list_of_things = ['sun', 'moon', 'green', 'grey', 'sun', 'grass', 'green']
listItemCount = collections.Counter(list_of_things)
listItemCountDict = dict(listItemCount)
countedList = []
for id in listItemCountDict:
    if listItemCountDict[id] > 1:
        countedList.append(id + ' - x' + str(listItemCountDict[id]))
    else:
        countedList.append(id)
for item in countedList:
    print(item)
The output of the above would be (line order follows dictionary ordering, so it can differ between Python versions):
sun - x2
grass
green - x2
grey
moon
