I have a question:
This is a list of lists, formed by the ElementTree library:
[['word1', <Element tag at b719a4cc>], ['word2', <Element tag at b719a6cc>], ['word3', <Element tag at b719a78c>], ['word4', <Element tag at b719a82c>]]
word1..4 may contain Unicode characters, e.g. â, ü, ç.
I want to sort this list of lists by my custom alphabet.
I know how to sort by a custom alphabet from here:
sorting words in python
I also know how to sort by key from here http://wiki.python.org/moin/HowTo/Sorting
The problem is that I couldn't find a way to apply these two methods to sort my "list of lists".
Your first link more or less solves the problem. You just need to have the lambda function only look at the first item in your list:
alphabet = "zyxwvutsrqpomnlkjihgfedcba"
new_list = sorted(inputList, key=lambda word: [alphabet.index(c) for c in word[0]])
One modification I might suggest, if you're sorting a reasonably large list, is to change the alphabet structure into a dict first, so that index lookup is faster:
alphabet_dict = dict((x, alphabet.index(x)) for x in alphabet)
new_list = sorted(inputList, key=lambda word: [alphabet_dict[c] for c in word[0]])
If I'm understanding you correctly, you want to know how to apply the key sorting technique when the key should apply to an element of your object. In other words, you want to apply the key function to 'wordx', not the ['wordx', ...] element you are actually sorting. In that case, you can do this:
my_alphabet = "..."
def my_key(elem):
    word = elem[0]
    return [my_alphabet.index(c) for c in word]
my_list.sort(key=my_key)
or using the style in your first link:
my_alphabet = "..."
my_list.sort(key=lambda elem: [my_alphabet.index(c) for c in elem[0]])
Keep in mind that my_list.sort will sort in place, actually modifying your list. sorted(my_list, ...) will return a new sorted list.
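Putting the two pieces together, here is a minimal self-contained sketch. The reversed alphabet and the placeholder strings standing in for the Element objects are made up for illustration:

```python
alphabet = "zyxwvutsrqponmlkjihgfedcba"
# Precompute each character's rank once, instead of calling
# alphabet.index() for every character of every word during the sort
alphabet_dict = {c: i for i, c in enumerate(alphabet)}

# Plain strings stand in for the ElementTree elements here
my_list = [['banana', 'elem1'], ['apple', 'elem2'], ['cherry', 'elem3']]
my_list.sort(key=lambda elem: [alphabet_dict[c] for c in elem[0]])
print(my_list)  # cherry sorts first, apple last under the reversed alphabet
```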
Works great!!! Thank you for your help
Here is my story:
I have a Turkish-Russian dictionary in xdxf format; the problem was to sort it.
I found a solution here http://effbot.org/zone/element-sort.htm but it didn't sort Unicode characters.
here is final source code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
import codecs
alphabet = u"aâbcçdefgğhiıjklmnoöpqrstuüvwxyz"
tree = ET.parse("dict.xml")
# this element holds the dictionary entries
container = tree.find("entries")
data = []
for elem in container:
    keyd = elem.findtext("k")
    data.append([keyd, elem])
data.sort(key=lambda data: [alphabet.index(c) for c in data[0]])
container[:] = [item[-1] for item in data]
tree.write("new-dict.xml", encoding="utf-8")
sample content of dict.xml
<cont>
<entries>
<ar><k>â</k>def1</ar>
<ar><k>a</k>def1</ar>
<ar><k>g</k>def1</ar>
<ar><k>w</k>def1</ar>
<ar><k>n</k>def1</ar>
<ar><k>u</k>def1</ar>
<ar><k>ü</k>def1</ar>
<ar><k>âb</k>def1</ar>
<ar><k>ç</k>def1</ar>
<ar><k>v</k>def1</ar>
<ar><k>ac</k>def1</ar>
</entries>
</cont>
Thanks to all.
Related
I have a list of hyperlinks with three types of links: htm, csv, and pdf. I would like to pick out just those that are csv.
The list contains strings of the form: csv/damlbmp/20160701damlbmp_zone_csv.zip
I was thinking of running a for loop over the strings and returning only the values whose first 3 characters equal "csv", but I am not really sure how to do this.
I would use link.endswith('csv') (or link.endswith('csv.zip')), where link is a string containing that link.
For example:
lst = ['csv/damlbmp/20160701damlbmp_zone_csv.zip',
       'pdf/damlbmp/20160701damlbmp_zone_pdf.zip',
       'html/damlbmp/20160701damlbmp_zone_html.zip',
       'csv/damlbmp/20160801damlbmp_zone_csv.zip']
csv_files = [link for link in lst if link.endswith('csv.zip')]
If your list is called links:
[x for x in links if 'csv/' in x]
You can try this:
import re
l = ["www.h.com", "abc.csv", "test.pdf", "another.csv"]  # list of links
def MatchCSV(links):
    matches = []
    for string in links:
        m = re.findall(r'[^.]*\.csv', string)
        if len(m) > 0:
            matches.append(m)
    return matches
print(MatchCSV(l))
[['abc.csv'], ['another.csv']]
(endswith is a good option too)
This is one way:
lst = ['csv/damlbmp/20160701damlbmp_zone_csv.zip',
       'pdf/damlbmp/20160701damlbmp_zone_pdf.zip',
       'html/damlbmp/20160701damlbmp_zone_html.zip',
       'csv/damlbmp/20160801damlbmp_zone_csv.zip']
[i for i in lst if i[:3]=='csv']
# ['csv/damlbmp/20160701damlbmp_zone_csv.zip',
# 'csv/damlbmp/20160801damlbmp_zone_csv.zip']
I have the following code:
from xml.etree import ElementTree
tree = ElementTree.parse(file)
my_val = tree.find('./abc').text
and here is an xml snippet:
<item>
<abc>
<a>hello</a>
<b>world</b>
awesome
</abc>
</item>
I need my_val of type string to contain
<a>hello</a>
<b>world</b>
awesome
But it obviously resolves to None
Iterating over findall will give you a list of the subtree's elements.
>>> elements = [ElementTree.tostring(x) for x in tree.findall('./abc/')]
['<a>hello</a>\n ', '<b>world</b>\n awesome\n ']
The problem with this is that text outside of tags is appended to the previous element (as its tail). So you need to clean that up too:
>>> split_elements = [x.split() for x in elements]
[['<a>hello</a>'], ['<b>world</b>', 'awesome']]
Now we have a list of lists that needs to be flattened:
>>> from itertools import chain
>>> flatten_list = list(chain(*split_elements))
['<a>hello</a>', '<b>world</b>', 'awesome']
Finally, you can print it one per line with:
>>> print("\n".join(flatten_list))
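The steps above can be combined into one self-contained snippet; here the sample XML is parsed from a string with fromstring, and encoding='unicode' is passed to tostring so it returns text rather than bytes:

```python
from itertools import chain
from xml.etree import ElementTree

xml = """<item>
<abc>
<a>hello</a>
<b>world</b>
awesome
</abc>
</item>"""

tree = ElementTree.fromstring(xml)
# Serialize each child of <abc>; the tail text rides along with each child
elements = [ElementTree.tostring(x, encoding='unicode') for x in tree.findall('./abc/')]
split_elements = [x.split() for x in elements]
flatten_list = list(chain(*split_elements))
print("\n".join(flatten_list))
```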
One way could be to start by getting the root element
from xml.etree import ElementTree
import string
tree = ElementTree.parse(file)
rootElem = tree.getroot()
Then we can get element abc from the root and iterate over its children, formatting each into a string using the child's attributes:
abcElem = rootElem.find("abc")
my_list = ["<{0.tag}>{0.text}</{0.tag}>".format(child) for child in abcElem]
my_list.append(abcElem.text)
my_val = string.join(my_list, "\n")
I'm sure some other helpful soul knows a way to print these elements using ElementTree or some other XML utility rather than formatting them yourself, but this should start you off.
Answering my own question:
This might be not the best solution but it worked for me
my_val = ElementTree.tostring(tree.find('./abc'), 'utf-8', 'xml').decode('utf-8')
my_val = my_val.replace('<abc>', '').replace('</abc>', '')
my_val = my_val.strip()
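An alternative that avoids the string replace (which could misfire if the literal text '<abc>' ever appeared inside the content) is to serialize only the children and prepend any leading text of <abc> itself; tostring on a child includes its trailing tail text. A sketch:

```python
from xml.etree import ElementTree

xml = "<item><abc><a>hello</a><b>world</b>awesome</abc></item>"
abc = ElementTree.fromstring(xml).find('abc')

# tostring(child) serializes the child plus its tail text ("awesome" here);
# abc.text covers any text that appears before the first child
parts = [abc.text or ''] + [ElementTree.tostring(c, encoding='unicode') for c in abc]
my_val = ''.join(parts).strip()
print(my_val)  # <a>hello</a><b>world</b>awesome
```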
I have a file in which there is the following info:
dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478
I'm looking for a way to match the colon and append whatever appears afterwards (the numbers) to a dictionary the keys of which are the name of the animals in the beginning of each line.
Actually, regular expressions are unnecessary, provided that your data is well formatted and contains no surprises.
Assuming that data is a variable containing the string that you listed above:
dict(item.split(":") for item in data.split())
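Note that this one-liner keys the result by the full filename rather than by the animal name, since the only split is at the colon. For example, on a slice of the sample data:

```python
data = """dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088"""

# data.split() yields one "filename:number" token per line;
# splitting each token at ":" gives (key, value) pairs
d = dict(item.split(":") for item in data.split())
print(d)  # keys are 'dogs_3351.txt', 'cats_1875.txt', 'cats_2231.txt'
```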
t = """
dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478
"""
import re
d = {}
for p, q in re.findall(r'^(.+?)_.+?:(.+)', t, re.M):
    d.setdefault(p, []).append(q)
print d
Why don't you use the Python find method to locate the index of the colon, which you can use to slice the string?
>>> x='dogs_3351.txt:34.13559322033898'
>>> key_index = x.find(':')
>>> key = x[:key_index]
>>> key
'dogs_3351.txt'
>>> value = x[key_index+1:]
>>> value
'34.13559322033898'
>>>
Read in each line of the file as a text and process the lines individually as above.
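Applied line by line (with the sample data inlined as a string here for a runnable sketch), and with an extra split on '_' so the keys are the animal names the question asked for, that might look like:

```python
data = """dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088"""

d = {}
for line in data.splitlines():
    key_index = line.find(':')
    filename = line[:key_index]        # e.g. 'cats_1875.txt'
    value = line[key_index + 1:]       # e.g. '23.25581395348837'
    animal = filename.split('_')[0]    # e.g. 'cats'
    d.setdefault(animal, []).append(value)
print(d)
```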
Without regex and using defaultdict:
from collections import defaultdict
data = """dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478"""
dictionary = defaultdict(list)
for l in data.splitlines():
    animal = l.split('_')[0]
    number = l.split(':')[-1]
    dictionary[animal].append(number)
Just make sure your data is well formatted.
I have a file that contains this:
((S000383212:0.0,JC0:0.244562):0.142727,(S002923086:0.0,(JC1:0.0,JC2:0.0):0.19717200000000001):0.222151,((S000594619:0.0,JC3:0.21869):0.13418400000000003,(S000964423:0.122312,JC4:0.084707):0.18147100000000002):0.011521999999999977);
I have two dictionaries that contain:
org = {'JC4': 'a','JC0': 'b','JC1': 'c','JC2': 'c','JC3': 'd'}
RDP = {'S000383212': 'hello', 'S002923086': 'this', 'S000594619': 'is'}
How would I find every occurrence of a word from either dictionary and replace it with its alternative term?
i.e. if it encounters 'JC0' then it would translate it to 'b'
for key in org.keys() + RDP.keys():
    text = text.replace(key, org.get(key, None) or RDP.get(key, None))
Of course, as TryPyPy said, if you just merge the dicts, it becomes much simpler:
org.update(RDP)
for item in org.items():
text = text.replace(*item)
If the performance isn't very important you can use the following code:
with open('your_file_name.txt') as f:
    text = f.read()
for key, value in org.items() + RDP.items():
    text = text.replace(key, value)
This code has the O(n * k) time complexity, where n is the length of the text and k is the count of entries in both dictionaries. If this complexity doesn't suit your task, the Aho-Corasick algorithm can help you.
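Short of implementing Aho-Corasick, a middle ground is a single compiled regex alternation over all the keys, which scans the text once instead of once per key. A sketch; the keys are sorted longest-first so a shorter key can never clobber a longer one that starts with it:

```python
import re

org = {'JC4': 'a', 'JC0': 'b', 'JC1': 'c', 'JC2': 'c', 'JC3': 'd'}
RDP = {'S000383212': 'hello', 'S002923086': 'this', 'S000594619': 'is'}
mapping = dict(org, **RDP)

# One alternation of all (escaped) keys, longest first
pattern = re.compile("|".join(sorted(map(re.escape, mapping), key=len, reverse=True)))
text = "((S000383212:0.0,JC0:0.244562):0.142727)"
result = pattern.sub(lambda m: mapping[m.group(0)], text)
print(result)  # ((hello:0.0,b:0.244562):0.142727)
```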
You should use the replace string method.
I've searched pretty hard and can't find a question that exactly pertains to what I want to do.
I have a file called "words" that has about 1000 lines of random, A-Z sorted words:
10th
1st
2nd
3rd
4th
5th
6th
7th
8th
9th
a
AAA
AAAS
Aarhus
Aaron
AAU
ABA
Ababa
aback
abacus
abalone
abandon
abase
abash
abate
abater
abbas
abbe
abbey
abbot
Abbott
abbreviate
abc
abdicate
abdomen
abdominal
abduct
Abe
abed
Abel
Abelian
I am trying to load this file into a dictionary where the words are the values and the keys are auto-generated/auto-incremented integers for each word,
e.g. {0: '10th', 1: '1st', 2: '2nd'}, etc.
Below is the code I've hobbled together so far. It seems to sort of work, but it's only showing me the last entry in the file as the only dict pair:
f3data = open('words')
mydict = {}
for line in f3data:
    print line.strip()
    cmyline = line.split()
    key = +1
    mydict[key] = cmyline
print mydict
key = +1
+1 is the same thing as 1. I assume you meant key += 1. I also can't see a reason why you'd split each line when there's only one item per line.
However, there's really no reason to do the looping yourself.
with open('words') as f3data:
    mydict = dict(enumerate(line.strip() for line in f3data))
dict(enumerate(x.rstrip() for x in f3data))
But your error is key += 1.
f3data = open('words')
print f3data.readlines()
The use of zero-based numeric keys in a dict is very suspicious. Consider whether a simple list would suffice.
Here is an example using a list comprehension:
>>> mylist = [word.strip() for word in open('/usr/share/dict/words')]
>>> mylist[1]
'A'
>>> mylist[10]
"Aaron's"
>>> mylist[100]
"Addie's"
>>> mylist[1000]
"Armand's"
>>> mylist[10000]
"Loyd's"
I use str.strip() to remove whitespace and newlines, which are present in /usr/share/dict/words. This may not be necessary with your data.
However, if you really need a dictionary, Python's enumerate() built-in function is your friend here, and you can pass the output directly into the dict() function to create it:
>>> mydict = dict(enumerate(word.strip() for word in open('/usr/share/dict/words')))
>>> mydict[1]
'A'
>>> mydict[10]
"Aaron's"
>>> mydict[100]
"Addie's"
>>> mydict[1000]
"Armand's"
>>> mydict[10000]
"Loyd's"
With keys that dense, you don't want a dict, you want a list.
with open('words') as fp:
    data = map(str.strip, fp.readlines())
But if you really can't live without a dict:
with open('words') as fp:
    data = dict(enumerate(X.strip() for X in fp))
{index: x.strip() for index, x in enumerate(open('filename.txt'))}
This code uses a dictionary comprehension and the enumerate built-in, which takes an input sequence (in this case, the file object, which yields each line when iterated through) and returns an index along with the item. Then, a dictionary is built up with the index and text.
One question: why not just use a list if all of your keys are integers?
Finally, your original code should be:
f3data = open('words')
mydict = {}
for index, line in enumerate(f3data):
    cmyline = line.strip()
    mydict[index] = cmyline
print mydict
Putting the words in a dict makes no sense. If you're using numbers as keys you should be using a list.
from __future__ import with_statement
with open('words.txt', 'r') as f:
    lines = f.readlines()
words = {}
for n, line in enumerate(lines):
    words[n] = line.strip()
print words