Removing duplicate location names in Python

I have a plain text file with one entry per line:
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3497 England/LOCATION
3521 >India<TOPONYM>/O
3526 >Zimbabwe<TOPONYM>/O
3531 >England<TOPONYM>/O
3536 >Melbourne<TOPONYM>/O
3541 >England<TOPONYM>/O
3546 >England<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3556 >England<TOPONYM>/O
3561 >England<TOPONYM>/O
3566 >Australia<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3821 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4234 >Hampden<TOPONYM>/O
4239 >Hampden<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4845 >Edinburgh<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O
I want to remove the duplicate location names in this list, so that it looks like this:
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3497 England/LOCATION
3526 >Zimbabwe<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O
I want to remove the duplicate location names while keeping the DOCID lines in the file. I know there is a way to do this on Linux using uniq, but if I run that, it will also remove matching locations across different DOCIDs.
Is there any way to split the file at every DOCID and, within each DOCID, remove the duplicate location names?

I am writing from mobile, so this will not be a complete solution, but the key points:
import re

docid = re.compile(r"^ *\d+ +<DOCID>")
location = re.compile(r"^ *\d+ +>?([^</]+)")

seen = {}
with open("testfile") as f:
    for line in f:
        if docid.match(line):
            seen = {}  # new document: forget the locations seen so far
            print(line, end="")
        else:
            loc = location.match(line).group(1)
            if loc not in seen:
                print(line, end="")
                seen[loc] = True
Basically, it checks each line to see whether it starts a new DOCID. If it doesn't, it reads the location and checks whether it was already seen; if not, it prints the line and records the location in the set of locations read.
If there is a new DOCID, it resets the set of read locations.

Here is a way to do it.
import string

filename = 'testfile'
with open(filename, 'r') as f:
    lines = f.readlines()

final_list = []
unique_list = []  # this resets itself at every DOCID
for line in lines:
    currentline = str(line)
    if 'DOCID' in currentline:
        unique_list = []  # new document: reset the list of seen names
        final_list.append(line)
    else:
        # turn punctuation into spaces so the name is easy to split out
        exclude = set(string.punctuation)
        currentline = ''.join(ch if ch not in exclude else ' ' for ch in currentline)
        city = currentline.split()[1]
        if city not in unique_list:
            unique_list.append(city)
            final_list.append(line)

for line in final_list:
    print(line, end='')
Output:
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3526 >Zimbabwe<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O
Note: testfile is a text file containing your input text. You can optimize the code if necessary; see the sketch below.
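One possible optimization, as a sketch: use a set for the membership test (O(1) instead of scanning a list) and build the punctuation translation table once, outside the loop. Like the original, this keys on the first word after the index, so multi-word names such as Fir Park are compared by their first word only.
import string

# build the punctuation-to-space translation table once
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

final_list = []
seen = set()  # set membership checks are O(1)
with open('testfile') as f:
    for line in f:
        if 'DOCID' in line:
            seen = set()  # new document: forget the names seen so far
            final_list.append(line)
        else:
            city = line.translate(table).split()[1]
            if city not in seen:
                seen.add(city)
                final_list.append(line)

for line in final_list:
    print(line, end='')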


How to append/update new values to the rows of an existing CSV file from a new CSV file, as a new column, in Python using pandas or something else

Old file
Name, 2015
Jack, 205
Jill, 215
Joy, 369
New file
Name, 2016
Hill, 289
Jill, 501
Rauf, 631
Jack, 520
Kay, 236
Joy, 615
Here is what I want:
Name, 2015, 2016
Jack, 205, 520
Jill, 215, 501
Joy, 369, 615
Hill, , 289
Rauf, , 631
Kay, , 236
Here is a post about how to create a new column in Pandas DataFrame based on the existing columns:
https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/
It would help if you explained your problem using the following schema:
post your code
post your error or return code
post what you would have expected
It took me a while to get this neat.
First of all, extract the values from the files:
import csv

with open('old.csv', 'r') as old_file:
    old_csv = [row for row in csv.reader(old_file)]
with open('new.csv', 'r') as new_file:
    new_csv = [row for row in csv.reader(new_file)]
Then we need to get the names from the new file:
new_names = [row[0] for row in new_csv]
Then we can iterate over all old rows to modify the new file and update the values:
for name, number in old_csv:
    # check if the name is already in the new file
    if name in new_names:
        index = new_names.index(name)
        new_csv[index].append(number)
    # if not, add the new name with its number (this is maybe not necessary)
    else:
        new_entry = [name, number]
        new_csv.append(new_entry)
After we have merged the lists, we write the new file:
with open('merged_file.csv', 'w', newline='') as merge_file:
    merger = csv.writer(merge_file)
    for row in new_csv:
        merger.writerow(row)
The resulting file looks like this:
Name, 2016, 2015
Hill, 289
Jill, 501, 215
Rauf, 631
Jack, 520, 205
Kay, 236
Joy, 615, 369
I wasn't sure whether "Name" is a header row or not; if it is, it needs to be handled in the csv reader.
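If the files do have header rows, a minimal sketch of pulling them off before reading the data (same file names as above):
import csv

with open('old.csv', 'r') as old_file:
    reader = csv.reader(old_file)
    old_header = next(reader)          # e.g. ['Name', ' 2015']
    old_csv = [row for row in reader]  # data rows only
The header rows can then be merged separately and written as the first row of merged_file.csv.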
Thanks everyone for replying. I found a way, as follows:
import pandas as pd
old_file = pd.read_csv('old file.csv')
new_file = pd.read_csv('new file.csv')
old_file = old_file[old_file['Name'].notna()]
new_file = new_file[new_file['Name'].notna()]
data_combined = pd.merge(old_file, new_file, left_on='Name', right_on='Name', how='outer')
print(data_combined.fillna(0).convert_dtypes())
This gives the desired output:
Name 2015 2016
0 Jack 205 520
1 Jill 215 501
2 Joy 369 615
3 Hill 0 289
4 Rauf 0 631
5 Kay 0 236
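A small aside on the merge call: since both frames share the key column name, on='Name' is equivalent to the left_on/right_on pair:
data_combined = pd.merge(old_file, new_file, on='Name', how='outer')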

Selecting rows based on multiple conditions using Python pandas

Hi, I am trying to find a row that satisfies multiple user inputs. I want the result to be a single line that matches the flight date and destination, with the origin airport being Atlanta. If the user inputs anything else, it should give back an error and quit.
The input data is a CSV that looks like this:
FL_DATE ORIGIN DEST DEP_TIME
5/1/2017 ATL IAD 1442
5/1/2017 MCO EWR 932
5/1/2017 IAH MIA 1011
5/1/2017 EWR TPA 1646
5/1/2017 RSW EWR 1054
5/1/2017 IAD RDU 2216
5/1/2017 IAD BDL 1755
5/1/2017 EWR RSW 1055
5/1/2017 MCO EWR 744
My current code:
import pandas as pd

df = pd.read_csv("flights.data.csv")  # import data frame

input1 = input('Enter your flight date in MM/DD/YYYY: ')  # input flight date
try:
    date = str(input1)  # flight date is a string
except:
    print('Invalid date')  # error message if it isn't a string
    quit()

input2 = input('Enter your destination airport code: ')  # input airport code
try:
    destination = str(input2)  # destination is a string
except:
    print('Invalid destination airport code')  # error message if it isn't a string
    quit()

df.loc[df['FL_DATE'] == date] & df[df['ORIGIN'] == 'ATL'] & df[df['DEST'] == destination]
#matches flight date and destination, and origin has to equal ATL
The ideal output is just the first row, if I input 5/1/2017 as the date and 'IAD' as the destination.
You should be able to resolve your issue with the example below. Your syntax for multiple conditions was wrong:
import pandas as pd

df = pd.DataFrame({'FL_DATE': ['5/1/2017'], 'ORIGIN': ['ATL'], 'DEST': ['IAD'], 'DEP_TIME': [1442]})
df.loc[(df['FL_DATE'] == '5/1/2017') & (df['ORIGIN'] == 'ATL') & (df['DEST'] == 'IAD')]
Gives
DEP_TIME DEST FL_DATE ORIGIN
1442 IAD 5/1/2017 ATL
You should change your code to something like this
df.loc[(df['FL_DATE'] == date) & (df['ORIGIN'] == 'ATL') & (df['DEST'] == destination)]
In your loc statement, you need to fix your brackets and add parentheses between conditions:
df.loc[(df['FL_DATE'] == input1) & (df['ORIGIN'] == 'ATL') & (df['DEST'] == input2)]
Then it works:
>>> df.loc[(df['FL_DATE'] == date) & (df['ORIGIN'] == 'ATL') & (df['DEST'] == destination)]
FL_DATE ORIGIN DEST DEP_TIME
0 5/1/2017 ATL IAD 1442
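Since the question asks for a single line and an error otherwise, one way to finish this off is to guard against an empty result and take only the first match; a sketch building on the corrected filter:
result = df.loc[(df['FL_DATE'] == date) & (df['ORIGIN'] == 'ATL') & (df['DEST'] == destination)]
if result.empty:
    print('No matching flight found')
    quit()
print(result.head(1))  # only the first matching row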

splitting a text document while making combinations in python

I have two text files: one contains a Neo4j Cypher script and the other contains a list of countries and cities with document IDs and indexes, as given below:
Cypher file:
MATCH (t:Country {name:'%a'}),(o:City {name:'%b'})
WITH point({ longitude: toFloat(t.longitude), latitude: toFloat(t.latitude) }) AS copoint, point({ longitude: toFloat(o.longitude), latitude: toFloat(o.latitude) }) AS cipoint
RETURN distance(copoint, cipoint)
Text file:
5 <DOCID>GH950102-000000<DOCID>/O
114 Cardiff/LOCATION
321 United States'/LOCATION
898 Alps/LOCATION
1029 Dresden/LOCATION
1150 Scotland/LOCATION
1162 Gasforth/LOCATION
1258 Arabia/LOCATION
1261 Hejaz/LOCATION
1265 Aleppo/LOCATION
1267 Northern Syria/LOCATION
1269 Aqaba/LOCATION
1271 Jordan./LOCATION
1543 London/LOCATION
1556 London/LOCATION
1609 London/LOCATION
2040 <DOCID>GH950102-000001<DOCID>/O
2317 America/LOCATION
3096 New York./LOCATION
3131 Great Britain/LOCATION
3147 <DOCID>GH950102-000002<DOCID>/O
3184 Edinburgh/LOCATION
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3497 England/LOCATION
My question is how to split this document wherever a DOCID appears and take the combinations of all the location names between each pair of DOCIDs. The index number and the /LOCATION suffix should be removed when copying a location name into the Cypher script.
I tried with this code but it didn't help.
from itertools import combinations

with open("results.txt") as f:
    for line in f:
        for "DOCID" in line.split():
            cities = (city.strip() for city in f.readlines())

with open("cypher.txt") as g:
    cypher_query = g.readlines()

with open("resultfile.txt", "w") as f:
    for city1, city2 in combinations(cities, 2):
        f.writelines(line.replace("%a", city1).replace("%b", city2) for line in cypher_query)
        f.write("\n")
I don't know Cypher, so you might have to fit that in yourself, but this gives you the combinations:
import re
import itertools

with open("cypher.txt") as g:
    cypher_query = g.readlines()

def write_combinations(locations, outname="resultfile.txt"):
    # write one cypher command for every 2-combination of this document's locations
    with open(outname, "a") as f:
        for city1, city2 in itertools.combinations(locations, 2):
            f.writelines(line.replace("%a", city1.strip()).replace("%b", city2.strip())
                         for line in cypher_query)
            f.write("\n")

with open("textFile", "r") as inputFile:
    locations = set()
    for line in inputFile:
        if "DOCID" in line:
            if len(locations) > 1:
                write_combinations(locations)
            locations.clear()  # reset even if the document had fewer than 2 locations
        else:
            location = re.search(r"(\D+)/LOCATION$", line)
            if location:
                locations.add(location.group(1))
    # flush the locations collected after the final DOCID
    if len(locations) > 1:
        write_combinations(locations)
EDIT: fixed a line; this now produces a file with one Cypher command for each 2-combination of locations. If you want separate files, add a counter or similar to the resultfile filename (see the sketch after the example output). Also note there are names like "Jordan." (with a trailing dot), if that makes any difference.
Example output:
MATCH (t:Country {name:'Alps'}),(o:City {name:'Scotland'})
WITH point({ longitude: toFloat(t.longitude), latitude: toFloat(t.latitude) }) AS copoint, point({ longitude: toFloat(o.longitude), latitude: toFloat(o.latitude) }) AS cipoint
RETURN distance(copoint, cipoint)
MATCH (t:Country {name:'Alps'}),(o:City {name:'Dresden'})
WITH point({ longitude: toFloat(t.longitude), latitude: toFloat(t.latitude) }) AS copoint, point({ longitude: toFloat(o.longitude), latitude: toFloat(o.latitude) }) AS cipoint
RETURN distance(copoint, cipoint)
MATCH (t:Country {name:'Alps'}),(o:City {name:'Gasforth'})
WITH point({ longitude: toFloat(t.longitude), latitude: toFloat(t.latitude) }) AS copoint, point({ longitude: toFloat(o.longitude), latitude: toFloat(o.latitude) }) AS cipoint
RETURN distance(copoint, cipoint)
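For the separate-files idea mentioned in the edit, a sketch: pass write_combinations an output name and bump a counter at every DOCID (the doc_count name is my own):
doc_count = 0
with open("textFile", "r") as inputFile:
    locations = set()
    for line in inputFile:
        if "DOCID" in line:
            if len(locations) > 1:
                write_combinations(locations, "resultfile_%d.txt" % doc_count)
            doc_count += 1
            locations.clear()
        else:
            location = re.search(r"(\D+)/LOCATION$", line)
            if location:
                locations.add(location.group(1))
    if len(locations) > 1:
        write_combinations(locations, "resultfile_%d.txt" % doc_count)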

How to select data between two special characters in Python?

I have a file abcd.txt containing:
"""
hello,123 [1231,12312]1231231
hello, world[3r45t,3242]6542
123 213 135
4234 gdfg gfd 32
sd23 234 sdf 23
hi, hello[234,23423]561
hello, hi[123,123]985
"""
I want to print the string after the second ',' character, up to the ']'.
My output should be:
12312
3242
23423
123
I tried this:
def select(self):
    file = open('gis.dat')
    list1 = []
    for line in file:
        line = line.strip()
        if re.search('[a-zA-Z]', line):
            list1.append(line.partition(',')[-1].rpartition(']')[0])
    return list1
You may use:
import re

for line in open("abcd.txt"):
    match = re.findall(r".*?,.*?,(\d+)", line)
    if match:
        print(match[0])
Output:
12312
3242
23423
123
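If you'd rather avoid a regex, the same extraction works with plain string splitting; a sketch, assuming every relevant line has at least two commas and a closing bracket after the second one:
for line in open("abcd.txt"):
    parts = line.split(',', 2)          # split on the first two commas only
    if len(parts) == 3 and ']' in parts[2]:
        print(parts[2].split(']')[0])   # text after the second comma, up to ']'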

get best line when coordinates overlap

I have a dataset like this:
FBti0018875 2031 2045 - TTCCAGAAACTGTTG hsf 62 0.9763443383736672
FBti0018875 2038 2052 + TTCTGGAATGGCAAC hsf 61 0.96581136371138
FBti0018925 2628 2642 - GACTGGAACTTTTCA hsf 60 0.9532992561656318
FBti0018925 2828 2842 + AGCTGGAAGCTTCTT hsf 63 0.9657036377575696
FBti0018949 184 198 - TTCGAGAAACTTAAC hsf 61 0.965931072979605
FBti0018986 2036 2050 + TTCTAGCATATTCTT hsf 63 0.9943559469645551
FBti0018993 1207 1221 - TTCTAGCATATTCTT hsf 63 0.9943559469645575
FBti0018996 2039 2053 + TTCTAGCATATTCTT hsf 63 0.9943559469645575
FBti0019092 2985 2999 - TTCTAGCATATTCTT hsf 63 0.9943559469645409
FBti0019118 1257 1271 + TTCCAGAATCTTGGA hsf 60 0.9523907773798134
The first column is an identifier, and the second and third are coordinates. I only want to keep one line for each range of coordinates, meaning that when lines overlap I want to keep the best one (best is defined by the last column; higher value = better).
For example for identifier FBti0018875 I would keep the first one because a) there is overlap with the second line and b) its last column value is higher (0.97>0.96).
If there were no overlap between the first and second lines, I would keep both. Sometimes I have 5 or 6 lines for one identifier, so it's not as simple as comparing the current line with the previous one.
So far I have this code that doesn't work.
def test(lista, listb):  # compare lists of coordinates
    a = 0
    b = 0
    found = False
    while a < len(lista) and b < len(listb):
        result = check(lista[a], listb[b])
        if result < 0:
            a += 1
            continue
        if result > 0:
            b += 1
            continue
        # we found overlapping intervals
        found = True
        return (found, a, lista[a], b, listb[b])
    return found

def check((astart, aend), (bstart, bend)):
    if aend < bstart:
        return -1
    if bend < astart:
        return 1
    return 0

refine = open("tffm/tffm_te_hits95.txt", "r")
refined = open("tffm/tffm_te_unique_hits95.txt", "w")
current_TE = []
for hit in refine:
    info = hit.rstrip().split('\t')
    if len(current_TE) == 0 or info[0] == current_TE[0][0]:
        current_TE.append(info)
    elif info[0] != current_TE[0][0]:
        to_keep = []
        i = 0
        if len(current_TE) == 1:
            to_keep.append(0)
        else:
            for i in range(len(current_TE) - 1):
                if [current_TE[i][1], current_TE[i][2]] == [current_TE[i+1][1], current_TE[i+1][2]]:
                    if current_TE[i][7] < current_TE[i+1][7]:
                        to_keep.append(i+1)
                elif test([(current_TE[i][1], current_TE[i][2])], [(current_TE[i+1][1], current_TE[i+1][2])]) != 'False':
                    if current_TE[i][7] < current_TE[i+1][7]:
                        to_keep.append(i+1)
                        try:
                            to_keep.remove(i)
                        except:
                            pass
                    else:
                        to_keep.append(i)
                else:
                    to_keep.append(i)
                if i == len(current_TE) - 1:
                    to_keep.append(i+1)
        for item in set(to_keep):
            print current_TE[item]
        current_TE = []
The expected outcome in this case would be (losing only one FBti0018875 line):
FBti0018875 2031 2045 - TTCCAGAAACTGTTG hsf 62 0.9763443383736672
FBti0018925 2628 2642 - GACTGGAACTTTTCA hsf 60 0.9532992561656318
FBti0018925 2828 2842 + AGCTGGAAGCTTCTT hsf 63 0.9657036377575696
FBti0018949 184 198 - TTCGAGAAACTTAAC hsf 61 0.965931072979605
FBti0018986 2036 2050 + TTCTAGCATATTCTT hsf 63 0.9943559469645551
FBti0018993 1207 1221 - TTCTAGCATATTCTT hsf 63 0.9943559469645575
FBti0018996 2039 2053 + TTCTAGCATATTCTT hsf 63 0.9943559469645575
FBti0019092 2985 2999 - TTCTAGCATATTCTT hsf 63 0.9943559469645409
FBti0019118 1257 1271 + TTCCAGAATCTTGGA hsf 60 0.9523907773798134
With the code I have tried to collect the lines sharing an identifier into a list, then check them for overlapping coordinates and, where they overlap, select one line according to the last column. It succeeds in checking the overlap, but in some versions I only retrieve a handful of lines, or I get:
Traceback (most recent call last):
File "<stdin>", line 29, in <module>
IndexError: list index out of range
Finally I solved it. There was a silly mistake: the string 'False' instead of False.
Here is the solution:
def test(lista, listb):
    a = 0
    b = 0
    found = False
    while a < len(lista) and b < len(listb):
        result = check(lista[a], listb[b])
        if result < 0:
            a += 1
            continue
        if result > 0:
            b += 1
            continue
        # we found overlapping intervals
        found = True
        return (found, a, lista[a], b, listb[b])
    return found

def check((astart, aend), (bstart, bend)):
    if aend < bstart:
        return -1
    if bend < astart:
        return 1
    return 0

def get_unique_sre(current_TE):
    to_keep = range(0, len(current_TE))
    for i in range(len(current_TE) - 1):
        if [current_TE[i][1], current_TE[i][2]] == [current_TE[i+1][1], current_TE[i+1][2]]:
            if current_TE[i][7] < current_TE[i+1][7]:
                try:
                    to_keep.remove(i)
                except:
                    pass
        elif test([(current_TE[i][1], current_TE[i][2])], [(current_TE[i+1][1], current_TE[i+1][2])]) != False:
            if current_TE[i][7] < current_TE[i+1][7]:
                try:
                    to_keep.remove(i)
                except:
                    pass
            else:
                to_keep.remove(i+1)
    final_TE = []
    for i in to_keep:
        final_TE.append(current_TE[i])
    return final_TE

refine = open("tffm/tffm_te_hits95.txt", "r")
refined = open("tffm/tffm_te_unique_hits95.txt", "w")
current_TE = []
for hit in refine:
    info = hit.rstrip().split('\t')
    if len(current_TE) == 0 or info[0] == current_TE[0][0]:
        current_TE.append(info)
    else:
        if len(current_TE) == 1:
            print >> refined, current_TE[0]
        else:
            final_TE = get_unique_sre(current_TE)
            for item in final_TE:
                print >> refined, item
        current_TE = [info]  # keep the line that starts the next identifier
# flush the hits collected for the last identifier
if current_TE:
    for item in get_unique_sre(current_TE):
        print >> refined, item
refined.close()
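For reference, a more compact way to express "keep the best-scoring line per overlapping group" is to sort each identifier's hits by score and keep a hit only when it overlaps nothing kept so far; a sketch (column layout as above, untested on the full data):
def best_non_overlapping(hits):
    # hits: rows already split on tabs; columns 1 and 2 are the
    # coordinates and column 7 is the score used to rank the rows
    kept = []
    for row in sorted(hits, key=lambda r: float(r[7]), reverse=True):
        start, end = int(row[1]), int(row[2])
        # keep the row only if it overlaps nothing already kept
        if all(end < int(k[1]) or int(k[2]) < start for k in kept):
            kept.append(row)
    return kept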
