I have a list that contains pairs of keywords ('k1', 'k2'). Here's a sample:
print (word_pairs)
--->[('salaire', 'dépense'), ('gratuité', 'argent'), ('causesmwedemwelamwemort', 'cadres'), ('caractèresmwedumwedispositif', 'historique'), ('psychomotricienmwediplôme', 'infirmier'), ('impôtmwesurmwelesmweréunionsmwesportives', 'compensation'), ('affichage', 'affichagemweopinion'), ('délaimweprorogation', 'défaillance'), ('créancemwenotion', 'généralités')]
I have a text file r_isa.txt (205 MB) that contains words that share an "isa" relationship. Here's a sample, where \t represents a literal tab character:
égalité de Parseval\tformule_0.9333\tégalité_1.0
filiation illégitime\tfiliation_1.0
Loi reconnaissant l'égalité\tloi_1.0
égalité entre les sexes\tégalité_1.0
liberté égalité fraternité\tliberté_1.0
This means that "égalité de Parseval" isa "formule" with a score of 0.9333 and isa "égalité" with a score of 1.0, and so on.
Based on the r_isa file, I want to know whether keyword k1 isa k2, and whether k2 isa k1. In the output file, I want to save, one per line, each pair of words that does have the isa relationship.
Here's what I did:
import ast

# Reading data as list
keywords = [line for line in open('version_final_PMI_espace.txt', encoding='utf8')]
keywords = ast.literal_eval(keywords[0])
word_pairs = []
for k, v in keywords.items():
    if v:
        word_pairs.append((k, v[0][0]))
len(list(set(word_pairs)))
#####
with open("r_isa.txt", encoding="utf-8") as readfile, open('Hyperonymy_file_pair.txt', 'w') as writefile:
    for line in readfile:
        firstfield = line.split('\t')[0].lower()
        for w in word_pairs:
            if w[0] == firstfield:
                if w[1] in line:
                    writefile.write("".join(w[0]) + "\t" + "".join(w[1]) + "\n")
This returns seemingly random pairs, for example:
salaire\targent
dépense\tcadres
instead of (when an isa relationship exists):
salaire\tdépense
causesmwedemwelamwemort\tcadres
Where did I go wrong?
Updated Answer
The statement if w[1] in line: is highly suspect. See the following code for what I believe the logic should be. Since I don't have access to your files, I have turned readfile into a list of strings for testing purposes and instead of writing output to writefile, I am just printing some results. I have added some values to word_pairs and readfile so that I get some results. Also note that if you are converting the input file to lower case, then your word pairs must also be lower case.
This code checks if k1 isa k2 and if not, then checks if k2 isa k1.
word_pairs = [('égalité de parseval', 'égalité'), ('salaire', 'dépense'), ('gratuité', 'argent'), ('causesmwedemwelamwemort', 'cadres'), ('caractèresmwedumwedispositif', 'historique'), ('psychomotricienmwediplôme', 'infirmier'), ('impôtmwesurmwelesmweréunionsmwesportives', 'compensation'), ('affichage', 'affichagemweopinion'), ('délaimweprorogation', 'défaillance'), ('créancemwenotion', 'généralités')]
word_pairs2 = [(pair[1], pair[0]) for pair in word_pairs] # reverse the words
word_dict = dict(word_pairs) # create a dictionary for fast searching
word_dict2 = dict(word_pairs2)
readfile = [
'égalité de Parseval\tformule_0.9333\tégalité_1.0',
'filiation illégitime\tfiliation_1.0',
'Loi reconnaissant l\'égalité\tloi_1.0',
'égalité entre les sexes\tégalité_1.0',
'liberté égalité fraternité\tliberté_1.0',
'dépense\tsalaire_.9'
]
for line in readfile:
    fields = line.lower().split('\t')
    first_word = fields.pop(0)
    isa_word = word_dict.get(first_word, word_dict2.get(first_word))  # check k2 isa k1 if k1 isa k2 is false
    if isa_word is not None:
        for field in fields:  # check each one
            fields2 = field.split('_')
            second_word, score = fields2
            if second_word == isa_word:
                print(first_word, second_word, score)
Prints:
égalité de parseval égalité 1.0
dépense salaire .9
If it is possible that k1 isa k2 and k2 isa k1, then you need the more general (but more complicated) code:
word_pairs = [('égalité de parseval', 'égalité'), ('salaire', 'dépense'), ('gratuité', 'argent'), ('causesmwedemwelamwemort', 'cadres'), ('caractèresmwedumwedispositif', 'historique'), ('psychomotricienmwediplôme', 'infirmier'), ('impôtmwesurmwelesmweréunionsmwesportives', 'compensation'), ('affichage', 'affichagemweopinion'), ('délaimweprorogation', 'défaillance'), ('créancemwenotion', 'généralités')]
word_pairs2 = [(pair[1], pair[0]) for pair in word_pairs] # reverse the words
word_dict = dict(word_pairs) # create a dictionary for fast searching
word_dict2 = dict(word_pairs2)
readfile = [
'égalité de Parseval\tformule_0.9333\tégalité_1.0',
'filiation illégitime\tfiliation_1.0',
'Loi reconnaissant l\'égalité\tloi_1.0',
'égalité entre les sexes\tégalité_1.0',
'liberté égalité fraternité\tliberté_1.0',
'salaire\tdépense_1.0',
'dépense\tsalaire_.9'
]
for line in readfile:
    fields = line.lower().split('\t')
    first_word = fields.pop(0)
    # k1 isa k2?
    isa_word = word_dict.get(first_word)
    if isa_word is not None:
        for field in fields:  # check each one
            fields2 = field.split('_')
            second_word, score = fields2
            if second_word == isa_word:
                print(first_word, second_word, score)
    # k2 isa k1?
    isa_word = word_dict2.get(first_word)
    if isa_word is not None:
        for field in fields:  # check each one
            fields2 = field.split('_')
            second_word, score = fields2
            if second_word == isa_word:
                print(first_word, second_word, score)
Prints:
égalité de parseval égalité 1.0
salaire dépense 1.0
dépense salaire .9
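To illustrate why the substring test `if w[1] in line:` is suspect, here is a minimal sketch that compares it with an exact match on parsed fields. The sample line and the helper name `parse_isa_line` are made up for illustration; only the tab-separated "word_score" layout comes from the question.

```python
# Hypothetical line in the r_isa.txt layout: head, then tab-separated "word_score" fields.
line = "salaire minimum\tsalaire_1.0\trémunération_0.8"


def parse_isa_line(line):
    """Split one line into (head, {hypernym: score}). Helper name is an assumption."""
    head, *fields = line.rstrip("\n").lower().split("\t")
    hypernyms = {}
    for field in fields:
        # rpartition splits on the LAST underscore, so words containing "_" survive
        word, _, score = field.rpartition("_")
        hypernyms[word] = float(score)
    return head, hypernyms


head, hypernyms = parse_isa_line(line)

# Substring test: "ration" is found because it is the tail of "rémunération".
substring_hit = "ration" in line

# Exact test: "ration" is not one of the listed hypernyms.
exact_hit = "ration" in hypernyms

print(head, hypernyms, substring_hit, exact_hit)
```

The substring test reports a hit for any keyword that happens to occur inside a longer word or inside the head itself, while the dictionary lookup only matches whole hypernyms.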
kw = [('salaire', 'dépense'),
('gratuité', 'argent'),
('causesmwedemwelamwemort', 'cadres'),
('caractèresmwedumwedispositif', 'historique'),
('psychomotricienmwediplôme', 'infirmier'),
('impôtmwesurmwelesmweréunionsmwesportives', 'compensation'),
('affichage', 'affichagemweopinion'),
('délaimweprorogation', 'défaillance'),
('créancemwenotion', 'généralités')]
lines_from_file = ['égalité de Parseval\tformule_0.9333\tégalité_1.0',
'filiation illégitime\tfiliation_1.0',
'Loi reconnaissant l\'égalité\tloi_1.0',
'égalité entre les sexes\tégalité_1.0',
'liberté égalité fraternité\tliberté_1.0',
'créancemwenotion\tgénéralités_1.0',
'généralités\tcréancemwenotion_1.0']
who_is_who_dict = {}
for line in lines_from_file:
    words = line.split('\t')
    key = words[0]
    other_words = [w.split('_')[0] for w in words[1:]]
    if key in who_is_who_dict:
        who_is_who_dict[key] = who_is_who_dict[key] + other_words
    else:
        who_is_who_dict[key] = other_words
pairs_to_write = []
for kw1, kw2 in kw:
    if (kw1 in who_is_who_dict and kw2 in who_is_who_dict[kw1]
            and kw2 in who_is_who_dict and kw1 in who_is_who_dict[kw2]):
        pairs_to_write.append((kw1, kw2))
print(pairs_to_write)
output :
[('créancemwenotion', 'généralités')]
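For a 205 MB file it may be preferable not to hold every relation in memory. The following is a rough sketch, under the assumption that a pair should be reported when the relation holds in either direction: it streams the file once and keeps only the matches. The `io.StringIO` objects stand in for the real files and their contents are invented.

```python
import io

word_pairs = [("salaire", "dépense"), ("créancemwenotion", "généralités")]

# Stand-in for open("r_isa.txt", encoding="utf-8"); the contents are made up.
readfile = io.StringIO(
    "salaire\trémunération_1.0\n"
    "dépense\tsalaire_0.9\n"
    "créancemwenotion\tgénéralités_1.0\n"
)

# Every (hyponym, hypernym) combination worth looking for, in both directions.
wanted = set(word_pairs) | {(k2, k1) for k1, k2 in word_pairs}
heads = {hyponym for hyponym, _ in wanted}

found = []
for line in readfile:
    head, *fields = line.rstrip("\n").lower().split("\t")
    if head not in heads:
        continue  # this line cannot concern any keyword pair
    for field in fields:
        hypernym = field.rpartition("_")[0]  # drop the trailing "_score"
        if (head, hypernym) in wanted:
            found.append((head, hypernym))

# Stand-in for open("Hyperonymy_file_pair.txt", "w", encoding="utf-8").
writefile = io.StringIO()
for hyponym, hypernym in found:
    writefile.write(f"{hyponym}\t{hypernym}\n")

print(writefile.getvalue(), end="")
```

Set membership keeps the per-line cost constant, so the running time is dominated by reading the file rather than by the number of keyword pairs.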
I have to write a program that correlates smoking with lung cancer risk. For that I have data in two files.
My code pairs up the values that sit on the same line number in each file (e.g. America,23.3 with Spain,77.9 and Italy,24.2 with Russia,60.8).
How can I modify my code so that it pairs the numbers for the same country and leaves out countries that occur in only one file (it shouldn't use Germany, France, China or Korea because they appear in only one file)?
Thank you so much for your help in advance:)
smoking file:
Country, Percent Cigarette Smokers Data
America,23.3
Italy,24.2
Russia,23.7
France,14.9
England,17.9
Spain,17
Germany,21.7
second file:
Cases Lung Cancer per 100000
Spain,77.9
Russia,60.8
Korea,61.3
America,73.3
China,66.8
Vietnam,64.5
Italy,43.9
and my code:
import math

def readFiles(smoking_datafile, cancer_datafile):
    '''
    Reads the data from the provided file objects smoking_datafile
    and cancer_datafile. Returns a list of the data read from each
    in a tuple of the form (smoking_datafile, cancer_datafile).
    '''
    # init
    smoking_data = []
    cancer_data = []
    empty_str = ''
    # read past file headers
    smoking_datafile.readline()
    cancer_datafile.readline()
    # read data files
    eof = False
    while not eof:
        # read line of data from each file
        s_line = smoking_datafile.readline()
        c_line = cancer_datafile.readline()
        # check if at end-of-file of both files
        if s_line == empty_str and c_line == empty_str:
            eof = True
        # check if end of smoking data file only
        elif s_line == empty_str:
            raise OSError('Unexpected end-of-file for smoking data file')
        # check if at end of cancer data file only
        elif c_line == empty_str:
            raise OSError('Unexpected end-of-file for cancer data file')
        # append line of data to each list
        else:
            smoking_data.append(s_line.strip().split(','))
            cancer_data.append(c_line.strip().split(','))
    # return list of data from each file
    return (smoking_data, cancer_data)

def calculateCorrelation(smoking_data, cancer_data):
    '''
    Calculates and returns the correlation value for the data
    provided in lists smoking_data and cancer_data
    '''
    # init
    sum_smoking_vals = sum_cancer_vals = 0
    sum_smoking_sqrd = sum_cancer_sqrd = 0
    sum_products = 0
    # calculate intermediate correlation values
    num_values = len(smoking_data)
    for k in range(0, num_values):
        sum_smoking_vals = sum_smoking_vals + float(smoking_data[k][1])
        sum_cancer_vals = sum_cancer_vals + float(cancer_data[k][1])
        sum_smoking_sqrd = sum_smoking_sqrd + \
            float(smoking_data[k][1]) ** 2
        sum_cancer_sqrd = sum_cancer_sqrd + \
            float(cancer_data[k][1]) ** 2
        sum_products = sum_products + float(smoking_data[k][1]) * \
            float(cancer_data[k][1])
    # calculate and display correlation value
    numer = (num_values * sum_products) - \
        (sum_smoking_vals * sum_cancer_vals)
    denom = math.sqrt(abs( \
        ((num_values * sum_smoking_sqrd) - (sum_smoking_vals ** 2)) * \
        ((num_values * sum_cancer_sqrd) - (sum_cancer_vals ** 2)) \
        ))
    return numer / denom
Let's just focus on getting the data into a format that is easy to work with. The code below will get you a dictionary of the form ...
smokers_cancer_data = {
    'America': {
        'smokers': 23.3,
        'cancer': 73.3
    },
    'Italy': {
        'smokers': 24.2,
        'cancer': 43.9
    },
    ...
}
Once you have this you can get any values you need and perform your calculations. See the code below.
def read_data(filename: str) -> dict:
    with open(filename, 'r') as file:
        next(file)  # Skip the header
        data = dict()
        for line in file:
            cleaned_line = line.rstrip()
            # Skip blank lines
            if cleaned_line:
                data_item = cleaned_line.split(',')
                data[data_item[0]] = float(data_item[1])
    return data

# Load data into python dictionaries
smokers_data = read_data('smokersData.txt')
cancer_data = read_data('lungCancerData.txt')
# Build one dictionary that is easy to work with
smokers_cancer_data = dict()
for (key, value) in smokers_data.items():
    if key in cancer_data:
        smokers_cancer_data[key] = {
            'smokers': smokers_data[key],
            'cancer': cancer_data[key]
        }
print(smokers_cancer_data)
For example, to calculate the sums of the smoker and cancer values:
smokers_total = 0
cancer_total = 0
for (key, value) in smokers_cancer_data.items():
    smokers_total += value['smokers']
    cancer_total += value['cancer']
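Once only the shared countries remain, the correlation itself is short. Below is a minimal sketch that computes a Pearson correlation over the countries present in both data sets; the two dictionaries are typed in from the question's sample files, and the function name `pearson` is an assumption rather than part of either post.

```python
import math

# Values copied from the sample files in the question.
smokers_data = {"America": 23.3, "Italy": 24.2, "Russia": 23.7, "France": 14.9,
                "England": 17.9, "Spain": 17.0, "Germany": 21.7}
cancer_data = {"Spain": 77.9, "Russia": 60.8, "Korea": 61.3, "America": 73.3,
               "China": 66.8, "Vietnam": 64.5, "Italy": 43.9}

# Dictionary key views support set intersection: keep countries found in both files.
common = sorted(smokers_data.keys() & cancer_data.keys())
xs = [smokers_data[country] for country in common]
ys = [cancer_data[country] for country in common]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print(common)
print(pearson(xs, ys))
```

With only four shared countries in the sample, the resulting coefficient says little statistically; the point is just that both lists are built from the same keys in the same order.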
This will build a list of all the countries that have data in both files, along with the data:
l3 = []
with open('smoking.txt', 'r') as f1, open('cancer.txt', 'r') as f2:
    l1, l2 = f1.readlines(), f2.readlines()
for s1 in l1:
    for s2 in l2:
        if s1.split(',')[0] == s2.split(',')[0]:
            cty = s1.split(',')[0]
            smk = s1.split(',')[1].strip()
            cnr = s2.split(',')[1].strip()
            l3.append(f"{cty}: smoking: {smk}, cancer: {cnr}")
print(l3)
Output:
['America: smoking: 23.3, cancer: 73.3', 'Italy: smoking: 24.2, cancer: 43.9', 'Russia: smoking: 23.7, cancer: 60.8', 'Spain: smoking: 17, cancer: 77.9']
I have to perform some analysis on a PSL record which contains information on DNA sequence fragments. Basically I have to find entries that are from the same read in the same contig (these are both values in the PSL entry). The problem is the PSL records are large (10-30 Mb text documents). I wrote a program that works on short records and on the long records given enough time but it took way longer than specified. I was told the program shouldn't take more than ~15 seconds. Mine took over 15 minutes.
PSL records look like this:
275 11 0 0 0 0 0 0 - M02034:35:000000000-A7UU0:1:1101:19443:1992/2 286 0 286 NODE_406138_length_13407_cov_13.425076 13465 408 694 1 286, 0, 408,
171 5 0 0 0 0 0 0 + M02034:35:000000000-A7UU0:1:1101:13497:2001/2 294 0 176 NODE_500869_length_34598_cov_30.643419 34656 34334 34510 1 176, 0, 34334,
188 14 0 10 0 0 0 0 + M02034:35:000000000-A7UU0:1:1101:18225:2002/1 257 45 257 NODE_455027_length_12018_cov_13.759444 12076 11322 11534 1 212, 45, 11322,
My code looks like this:
import sys
class PSLreader:
    '''
    Class to provide reading of a file containing psl alignments
    formatted sequences:
    object instantiation:
    myPSLreader = PSLreader(<file name>):
    object attributes:
    fname: the initial file name
    methods:
    readPSL() : reads psl file, yielding those alignments that are within the first or last
                1000 nt
    readPSLpairs() : yields psl pairs that support a circular hypothesis
    Author: David Bernick
    Date: May 12, 2013
    '''

    def __init__(self, fname=''):
        '''constructor: saves attribute fname '''
        self.fname = fname

    def doOpen(self):
        if self.fname == '':
            return sys.stdin
        else:
            return open(self.fname)

    def readPSL(self):
        '''
        using filename given in init, returns each filtered psl record
        that contains alignments that are within the terminal 1000nt of
        the target. Incomplete psl records are discarded.
        If filename was not provided, stdin is used.
        This method selects for alignments that could be part of a
        circle.
        Illumina pairs aligned to the top strand would have read1(+) and read2(-).
        For the bottom strand, read1(-) and read2(+).
        For potential circularity,
        these are the conditions that can support circularity:
        read1(+) near the 3' terminus
        read1(-) near the 5' terminus
        read2(-) near the 5' terminus
        read2(+) near the 3' terminus
        so...
        any read(+) near the 3', or
        any read(-) near the 5'
        '''
        nearEnd = 1000  # this constant determines "near the end"
        with self.doOpen() as fileH:
            for line in fileH:
                pslList = line.split()
                if len(pslList) < 17:
                    continue
                tSize = int(pslList[14])
                tStart = int(pslList[15])
                strand = str(pslList[8])
                if strand.startswith('+') and (tSize - tStart > nearEnd):
                    continue
                elif strand.startswith('-') and (tStart > nearEnd):
                    continue
                yield line

    def readPSLpairs(self):
        read1 = []
        read2 = []
        for psl in self.readPSL():
            parsed_psl = psl.split()
            strand = parsed_psl[9][-1]  # last character of the read name: '1' or '2'
            if strand == '1':
                read1.append(parsed_psl)
            elif strand == '2':
                read2.append(parsed_psl)
        output = {}
        for psl1 in read1:
            name1 = psl1[9][:-1]
            contig1 = psl1[13]
            for psl2 in read2:
                name2 = psl2[9][:-1]
                contig2 = psl2[13]
                if name1 == name2 and contig1 == contig2:
                    try:
                        output[contig1] += 1
                        break
                    except KeyError:
                        output[contig1] = 1
                        break
        print(output)

PSL_obj = PSLreader('EEV14-Vf.filtered.psl')
PSL_obj.readPSLpairs()
I was given some example code that looks like this:
def doSomethingPairwise(a):
    for leftItem in a[1]:
        for rightItem in a[2]:
            if leftItem[1] == rightItem[1]:  # compare values with ==, not identity with is
                print(a)

thisStream = [['David', 'guitar', 1], ['David', 'guitar', 2],
              ['John', 'violin', 1], ['John', 'oboe', 2],
              ['Patrick', 'theremin', 1], ['Patrick', 'lute', 2]]
thisGroup = None
thisGroupList = [[], [], []]
for name, instrument, num in thisStream:
    if name != thisGroup:
        doSomethingPairwise(thisGroupList)
        thisGroup = name
        thisGroupList = [[], [], []]
    thisGroupList[num].append([name, instrument, num])
doSomethingPairwise(thisGroupList)
But when I tried to implement it my program still took a long time. Am I thinking about this the wrong way? I realize the nested loop is slow but I don't see an alternative.
Edit: I figured it out, the data was presorted which made my brute force solution very impractical and unnecessary.
I hope this helps; the question would benefit from a better example input file.
import sys

# It is better to create a PSLRecord class
class PSLRecord:
    def __init__(self, line):
        pslList = line.split()
        properties = ("matches", "misMatches", "repMatches", "nCount",
                      "qNumInsert", "qBaseInsert", "tNumInsert",
                      "tBaseInsert", "strand", "qName", "qSize", "qStart",
                      "qEnd", "tName", "tSize", "tStart", "tEnd", "blockCount",
                      "blockSizes", "qStarts", "tStarts")
        self.__dict__.update(dict(zip(properties, pslList)))

class PSLreader:
    def __init__(self, fname=''):
        self.fname = fname

    def doOpen(self):
        if self.fname == '':
            return sys.stdin
        else:
            return open(self.fname)

    def readPSL(self):
        with self.doOpen() as fileH:
            for line in fileH:
                if len(line.split()) < 17:
                    continue  # skip incomplete records, as the question's readPSL does
                pslrc = PSLRecord(line)
                yield pslrc

    # Return a dictionary with all psl records grouped by qName and tName
    def readPSLpairs(self):
        dictpsl = {}
        for pslrc in self.readPSL():
            # OP requirement: drop the trailing '1' or '2' character with pslrc.qName[:-1]
            key = (pslrc.qName[:-1], pslrc.tName)
            if key not in dictpsl:
                dictpsl[key] = []
            dictpsl[key].append(pslrc)
        return dictpsl

# The filter function is better kept outside the class and self-contained
def f_filter(pslrec, nearEnd=1000):
    if (pslrec.strand.startswith('+') and
            (int(pslrec.tSize) - int(pslrec.tStart) > nearEnd)):
        return False
    if (pslrec.strand.startswith('-') and
            (int(pslrec.tStart) > nearEnd)):
        return False
    return True

PSL_obj = PSLreader('EEV14-Vf.filtered.psl')
# Read the dictionary of pairs
dictpsl = PSL_obj.readPSLpairs()
from itertools import product
# product from itertools:
# (1) x (2,3) = (1,2),(1,3)
output = {}
for key, v in dictpsl.items():
    name, contig = key
    # Filtered alignments of read 1
    strand_princ = [pslrec for pslrec in v if f_filter(pslrec) and
                    pslrec.qName[-1] == '1']
    # Filtered alignments of read 2
    strand_sec = [pslrec for pslrec in v if f_filter(pslrec) and
                  pslrec.qName[-1] == '2']
    for pslrec_princ, pslrec_sec in product(strand_princ, strand_sec):
        # This loop makes far fewer comparisons because the records were grouped first
        output[contig] = output.get(contig, 0) + 1
Note: 10-30 MB isn't a large file, if you ask me.
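Because the question's edit says the records were presorted, a single pass with `itertools.groupby` may be enough, with no dictionary of all records. The sketch below assumes the input is sorted by read name and uses invented (qName, tName) tuples in place of parsed PSL lines; the helper name `read_name` is hypothetical.

```python
from collections import Counter
from itertools import groupby

# Hypothetical (qName, tName) pairs, already sorted by read name.
records = [
    ("readA/1", "NODE_1"),
    ("readA/2", "NODE_1"),
    ("readB/1", "NODE_1"),
    ("readB/2", "NODE_2"),
    ("readC/1", "NODE_2"),
    ("readC/2", "NODE_2"),
]


def read_name(record):
    """Read name without the trailing '1' or '2' mate marker."""
    return record[0][:-1]


pair_counts = Counter()
# groupby only merges ADJACENT records, which is why the input must be sorted.
for name, group in groupby(records, key=read_name):
    contigs = {"1": set(), "2": set()}
    for q_name, t_name in group:
        contigs[q_name[-1]].add(t_name)
    # A contig counts once per read when both mates align to it.
    for contig in contigs["1"] & contigs["2"]:
        pair_counts[contig] += 1

print(dict(pair_counts))
```

Each group is small, so the work is linear in the number of records instead of quadratic in the number of reads.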
I have 2 CSV files. The first has city name, population and humidity. In the second, cities are mapped to states. I want to get the state-wise total population and average humidity. Can someone help? Here is an example:
CSV 1:
CityName,population,humidity
Austin,1000,20
Sanjose,2200,10
Sacramento,500,5
CSV 2:
State,city name
Ca,Sanjose
Ca,Sacramento
Texas,Austin
I would like to get this output (total population and average humidity per state):
Ca,2700,7.5
Texas,1000,20
The dictionary-based solution doesn't work because the dictionary keeps only one value per key. I gave up and finally used a loop. The code below works; the input I used is shown too.
csv1
state_name,city_name
CA,sacramento
utah,saltlake
CA,san jose
Utah,provo
CA,sanfrancisco
TX,austin
TX,dallas
OR,portland
CSV2
city_name population humidity
sacramento 1000 1
saltlake 300 5
san jose 500 2
provo 100 7
sanfrancisco 700 3
austin 2000 4
dallas 2500 5
portland 300 6
import csv
from pandas import read_csv  # assumes read_csv is pandas.read_csv

def mapping_within_dataframe(self, file1, file2, file3):
    self.csv1 = file1
    self.csv2 = file2
    self.outcsv = file3
    state_city = read_csv(self.csv1)
    city_data = read_csv(self.csv2)
    all_state = list(set(state_city.state_name))
    with open(self.outcsv, 'w', newline='') as out:
        outfile = csv.writer(out, delimiter=',')
        for one_state in all_state:
            one_state_cities = list(state_city.loc[state_city.state_name == one_state, "city_name"])
            one_state_data = 0
            for one_city in one_state_cities:
                one_city_data = city_data.loc[city_data.city_name == one_city, "population"].sum()
                one_state_data = one_state_data + one_city_data
            print(one_state, one_state_data)
            outfile.writerow([one_state, one_state_data])
def output(file1, file2):
    f = lambda x: x.strip()  # strips newline and whitespace characters
    with open(file1) as cities:
        with open(file2) as states:
            states_dict = {}
            cities_dict = {}
            next(states)  # skip the header row
            next(cities)  # skip the header row
            for line in states:
                line = line.split(',')
                # note: each state keeps only the last city listed for it
                states_dict[f(line[0])] = f(line[1])
            for line in cities:
                line = line.split(',')
                cities_dict[f(line[0])] = (int(f(line[1])), int(f(line[2])))
            for state, city in states_dict.items():
                try:
                    print(state, cities_dict[city])
                except KeyError:
                    pass

output(CSV1, CSV2)  # these are the names of the files
This gives the output you wanted. Just make sure the names of cities in both files are the same in terms of capitalization.
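As a rough sketch of the aggregation the question asks for (total population and average humidity per state), the following maps each city to its state and accumulates per state. It uses the question's sample rows through `io.StringIO` as stand-ins for the two CSV files; the variable names are assumptions.

```python
import csv
import io
from collections import defaultdict

# Stand-ins for the two CSV files, using the question's sample rows.
cities_file = io.StringIO(
    "CityName,population,humidity\nAustin,1000,20\nSanjose,2200,10\nSacramento,500,5\n"
)
states_file = io.StringIO("State,city name\nCa,Sanjose\nCa,Sacramento\nTexas,Austin\n")

# Map each city to its state (cities are unique, states are not).
city_to_state = {}
states_reader = csv.reader(states_file)
next(states_reader)  # skip the header row
for state, city in states_reader:
    city_to_state[city.strip()] = state.strip()

populations = defaultdict(int)
humidities = defaultdict(list)
cities_reader = csv.reader(cities_file)
next(cities_reader)  # skip the header row
for city, population, humidity in cities_reader:
    state = city_to_state.get(city.strip())
    if state is None:
        continue  # city without a known state
    populations[state] += int(population)
    humidities[state].append(float(humidity))

# state -> (total population, mean humidity)
summary = {
    state: (populations[state], sum(values) / len(values))
    for state, values in humidities.items()
}
for state, (total_population, mean_humidity) in sorted(summary.items()):
    print(f"{state},{total_population},{mean_humidity}")
```

Keying the lookup by city rather than by state avoids the one-value-per-key problem, since a state can own many cities but each city belongs to one state.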
wordlist A: book jesus christ son david son abraham jacob judah his brothers perez amminadab
wordlist B: akwụkwọ jizọs kraịst nwa devid nwa ebreham jekọb juda ya ụmụnne pirez aminadab
file.txt A:
the book of the history of jesus christ , son of david , son of abraham :
abraham became father to isaac ; isaac became father to jacob ; jacob became father to judah and his brothers ;
file.txt B:
akwụkwọ nke kọrọ akụkọ banyere jizọs kraịst , nwa devid , nwa ebreham :
ebreham mụrụ aịzik ; aịzik amụọ jekọb ; jekọb amụọ juda na ụmụnne ya ndị ikom ;
I have the 2 word lists above (say A and B) in 2 different languages. They contain translations of each other's words, in the same order. My task is to run each word list through the text file of its own language (word list A through file.txt A, word list B through file.txt B) and then, for each pair of corresponding lines, print the positions at which the paired words were found, like this:
2:1 7:6 8:7 10:9 12:10 14:12 16:13 [2:1 means "book" is at position 2 in file.txt A and "akwụkwọ" is at position 1 in file.txt B, and so on]
1:1 11:6 13:8 17:10 19:12 20:13 [1:1 means "abraham" is at position 1 in file.txt A and "ebreham" is at position 1 in file.txt B, and so on]
See the code below:
import sys

def wordlist(filename):
    wordlist = []
    with open(filename, 'rb') as f:
        for line in f:
            wordlist.append(line)
    return wordlist

eng = []
for lines in open('eng_try.txt', 'rb'):
    line = lines.strip()
    eng.append(line)
igb = []
for lines in open('igb_try.txt', 'rb'):
    line = lines.strip()
    igb.append(line)
i = 0
while i < len(eng):
    eng_igb_verse_pair = eng[i] + " " + igb[i]
    line = eng_igb_verse_pair.strip().split()
    for n in range(0, len(wordlist('eng_wordlist.txt'))):
        eng_word = wordlist('eng_wordlist.txt').pop(n)
        igb_word = wordlist('igb_wordlist.txt').pop(n)
        if eng_word in line and igb_word in line:
            print '{0} {1}:{2}'.format(i, line.index[eng_word], line.index[igb_word])
    i += 1
This prints nothing. I know my problem is in the last segment of the program. Can someone help? I am not a very experienced Python programmer. Apologies if my explanation isn't clear.
You mean something like this:
import sys

def checkLine(line_eng, line_igb):
    eng_words = line_eng.split()
    igb_words = line_igb.split()
    for word in eng_words:
        if word in eng:
            igb_word = igb[eng.index(word)]
            print("%d:%d" % (eng_words.index(word) + 1, igb_words.index(igb_word) + 1), end=" ")

def linelist(filename):
    lineslist = []
    for line in open(filename, encoding='utf-8'):
        lineslist.append(line)
    return lineslist

# flat list of the words read from eng_try.txt
eng = []
for lines in open('eng_try.txt', encoding='utf-8'):
    line = lines.strip()
    for w in line.split():
        eng.append(w)
# flat list of the words read from igb_try.txt
igb = []
for lines in open('igb_try.txt', encoding='utf-8'):
    line = lines.strip()
    for w in line.split():
        igb.append(w)
eng_lines = linelist("eng_wordlist.txt")
igb_lines = linelist("igb_wordlist.txt")
for n in range(0, len(eng_lines)):
    print("%d. " % (n + 1), end=" ")
    checkLine(eng_lines[n], igb_lines[n])
    print()
For your files i got result:
1. 2:1 7:6 8:7 10:9 12:10 10:9 16:13
2. 1:1 11:7 11:7 17:11 19:14 20:13
BR
Parasit Hendersson
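Repeated words such as "son" can be handled by consuming each occurrence only once. The following is a minimal sketch of that idea, not the code from either post: it walks the two word lists in parallel and, for each translation pair, takes the next unused position in each line. The word lists and lines are copied from the question; the helper names `next_unused` and `align` are hypothetical.

```python
eng_wordlist = ("book jesus christ son david son abraham jacob judah "
                "his brothers perez amminadab").split()
igb_wordlist = ("akwụkwọ jizọs kraịst nwa devid nwa ebreham jekọb juda "
                "ya ụmụnne pirez aminadab").split()

eng_lines = [
    "the book of the history of jesus christ , son of david , son of abraham :",
    "abraham became father to isaac ; isaac became father to jacob ; "
    "jacob became father to judah and his brothers ;",
]
igb_lines = [
    "akwụkwọ nke kọrọ akụkọ banyere jizọs kraịst , nwa devid , nwa ebreham :",
    "ebreham mụrụ aịzik ; aịzik amụọ jekọb ; jekọb amụọ juda na ụmụnne ya ndị ikom ;",
]


def next_unused(tokens, word, used):
    """Index of the first occurrence of word not consumed yet, or None."""
    for index, token in enumerate(tokens):
        if token == word and index not in used:
            return index
    return None


def align(eng_line, igb_line):
    """Return 1-based 'eng:igb' positions for every word pair found in both lines."""
    eng_tokens, igb_tokens = eng_line.split(), igb_line.split()
    used_eng, used_igb = set(), set()
    pairs = []
    for eng_word, igb_word in zip(eng_wordlist, igb_wordlist):
        i = next_unused(eng_tokens, eng_word, used_eng)
        j = next_unused(igb_tokens, igb_word, used_igb)
        if i is None or j is None:
            continue  # this pair does not occur in both lines
        used_eng.add(i)
        used_igb.add(j)
        pairs.append(f"{i + 1}:{j + 1}")
    return " ".join(pairs)


results = [align(e, g) for e, g in zip(eng_lines, igb_lines)]
for result in results:
    print(result)
```

Positions here count whitespace-separated tokens, punctuation included, so they depend on how the text files are tokenized.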