Splitting a text document while making combinations in Python

I have two text files: one contains a Neo4j script, and the other contains a list of countries and cities with document IDs and indexes, as given below:
Cypher file:
MATCH (t:Country {name:'%a'}),(o:City {name:'%b'})
WITH point({ longitude: toFloat(t.longitude), latitude: toFloat(t.latitude) }) AS copoint, point({ longitude: toFloat(o.longitude), latitude: toFloat(o.latitude) }) AS cipoint
RETURN distance(copoint, cipoint)
Text file:
5 <DOCID>GH950102-000000<DOCID>/O
114 Cardiff/LOCATION
321 United States'/LOCATION
898 Alps/LOCATION
1029 Dresden/LOCATION
1150 Scotland/LOCATION
1162 Gasforth/LOCATION
1258 Arabia/LOCATION
1261 Hejaz/LOCATION
1265 Aleppo/LOCATION
1267 Northern Syria/LOCATION
1269 Aqaba/LOCATION
1271 Jordan./LOCATION
1543 London/LOCATION
1556 London/LOCATION
1609 London/LOCATION
2040 <DOCID>GH950102-000001<DOCID>/O
2317 America/LOCATION
3096 New York./LOCATION
3131 Great Britain/LOCATION
3147 <DOCID>GH950102-000002<DOCID>/O
3184 Edinburgh/LOCATION
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3497 England/LOCATION
My question is: how do I split this document wherever a DOCID appears and take the combinations of all the location names between each DOCID? The index number should be removed, and /LOCATION should also be removed when copying the location name into the Cypher script.
I tried with this code but it didn't help.
from itertools import combinations

with open("results.txt") as f:
    for line in f:
        for "DOCID" in line.split():
            cities = (city.strip() for city in f.readlines())

with open("cypher.txt") as g:
    cypher_query = g.readlines()

with open("resultfile.txt", "w") as f:
    for city1, city2 in combinations(cities, 2):
        f.writelines(line.replace("%a", city1).replace("%b", city2) for line in cypher_query)
        f.write("\n")

I don't know Cypher, so you might have to fit that in yourself, but this gives you the combinations:
import re
import itertools

with open("cypher.txt") as g:
    cypher_query = g.readlines()

with open("textFile", "r") as inputFile:
    locations = set()
    for line in inputFile:
        if "DOCID" in line and len(locations) > 1:
            for city1, city2 in itertools.combinations(locations, 2):
                #
                # here call cypher script with cities as parameter
                #
                with open("resultfile.txt", "a") as f:
                    f.writelines(line.replace("%a", city1.strip()).replace("%b", city2.strip())
                                 for line in cypher_query)
                    f.write("\n")
            locations.clear()
        else:
            location = re.search(r"(\D+)/LOCATION$", line)
            if location:
                locations.add(location.group(1))
EDIT: fixed a line; this now produces a file with one Cypher command for each 2-combination of locations. If you want separate files, add a counter or similar to the resultfile filename (see the sketch after the example output). Also note there are names like Jordan. (with a . at the end), if that makes any difference.
Example output:
MATCH (t:Country {name:'Alps'}),(o:City {name:'Scotland'})
WITH point({ longitude: toFloat(t.longitude), latitude: toFloat(t.latitude) }) AS copoint, point({ longitude: toFloat(o.longitude), latitude: toFloat(o.latitude) }) AS cipoint
RETURN distance(copoint, cipoint)
MATCH (t:Country {name:'Alps'}),(o:City {name:'Dresden'})
WITH point({ longitude: toFloat(t.longitude), latitude: toFloat(t.latitude) }) AS copoint, point({ longitude: toFloat(o.longitude), latitude: toFloat(o.latitude) }) AS cipoint
RETURN distance(copoint, cipoint)
MATCH (t:Country {name:'Alps'}),(o:City {name:'Gasforth'})
WITH point({ longitude: toFloat(t.longitude), latitude: toFloat(t.latitude) }) AS copoint, point({ longitude: toFloat(o.longitude), latitude: toFloat(o.latitude) }) AS cipoint
RETURN distance(copoint, cipoint)
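For completeness, a minimal sketch of the separate-files variant mentioned in the EDIT; the counter-based resultfile_0.txt, resultfile_1.txt, ... naming is illustrative, and the locations set here stands in for the per-DOCID set built above:
import itertools

with open("cypher.txt") as g:
    cypher_query = g.readlines()

locations = {"Alps", "Scotland", "Dresden"}  # illustrative; build this per DOCID as above
for n, (city1, city2) in enumerate(itertools.combinations(sorted(locations), 2)):
    # one output file per combination: resultfile_0.txt, resultfile_1.txt, ...
    with open("resultfile_{}.txt".format(n), "w") as f:
        f.writelines(line.replace("%a", city1).replace("%b", city2)
                     for line in cypher_query)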

Related

PyPDF2 Font Read Issue

I'm writing a script to automate extracting data from pdfs I receive. I'm using PyPDF2 to read the pdfs and extract the text to be interpreted. I've tested pdfs with two different formats. The script works perfectly for the first format. When trying it with the second format I'm getting an indexing error (below). After troubleshooting I've found the issue is due to the font used in the second format. They use "Roboto" while the first, successful format, uses Arial.
I've attached stripped-down versions of the pdfs that are causing issues. One in Roboto and one I manually changed to Arial.
https://drive.google.com/drive/folders/1BhaXPfNyLx8euR2dPQaTqdHvtYJg8yEh?usp=sharing
The snippet of code here is where I'm running into the issue:
import PyPDF2
pdf_roboto = r"C:\Users\Robert.Smyth\Python\test_pdf_roboto.pdf"
pdf_arial = r"C:\Users\Robert.Smyth\Python\test_pdf_arial.pdf"
reader = PyPDF2.PdfFileReader(pdf_roboto)
pageObj = reader.pages[0]
pages_text = pageObj.extractText()
The indexing error I'm getting is:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
C:\Users\ROBERT~1.SMY\AppData\Local\Temp/ipykernel_22076/669450932.py in <module>
1 reader = PyPDF2.PdfFileReader(pdf_roboto)
2 pageObj = reader.pages[0]
----> 3 pages_text = pageObj.extractText()
~\Anaconda3\lib\site-packages\PyPDF2\_page.py in extractText(self, Tj_sep, TJ_sep)
1539 """
1540 deprecate_with_replacement("extractText", "extract_text")
-> 1541 return self.extract_text()
1542
1543 def _get_fonts(self) -> Tuple[Set[str], Set[str]]:
~\Anaconda3\lib\site-packages\PyPDF2\_page.py in extract_text(self, Tj_sep, TJ_sep, orientations, space_width, *args)
1511 orientations = (orientations,)
1512
-> 1513 return self._extract_text(
1514 self, self.pdf, orientations, space_width, PG.CONTENTS
1515 )
~\Anaconda3\lib\site-packages\PyPDF2\_page.py in _extract_text(self, obj, pdf, orientations, space_width, content_key)
1144 if "/Font" in resources_dict:
1145 for f in cast(DictionaryObject, resources_dict["/Font"]):
-> 1146 cmaps[f] = build_char_map(f, space_width, obj)
1147 cmap: Tuple[Union[str, Dict[int, str]], Dict[str, str], str] = (
1148 "charmap",
~\Anaconda3\lib\site-packages\PyPDF2\_cmap.py in build_char_map(font_name, space_width, obj)
20 space_code = 32
21 encoding, space_code = parse_encoding(ft, space_code)
---> 22 map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
23
24 # encoding can be either a string for decode (on 1,2 or a variable number of bytes) of a char table (for 1 byte only for me)
~\Anaconda3\lib\site-packages\PyPDF2\_cmap.py in parse_to_unicode(ft, space_code)
187 cm = prepare_cm(ft)
188 for l in cm.split(b"\n"):
--> 189 process_rg, process_char = process_cm_line(
190 l.strip(b" "), process_rg, process_char, map_dict, int_entry
191 )
~\Anaconda3\lib\site-packages\PyPDF2\_cmap.py in process_cm_line(l, process_rg, process_char, map_dict, int_entry)
247 process_char = False
248 elif process_rg:
--> 249 parse_bfrange(l, map_dict, int_entry)
250 elif process_char:
251 parse_bfchar(l, map_dict, int_entry)
~\Anaconda3\lib\site-packages\PyPDF2\_cmap.py in parse_bfrange(l, map_dict, int_entry)
256 lst = [x for x in l.split(b" ") if x]
257 a = int(lst[0], 16)
--> 258 b = int(lst[1], 16)
259 nbi = len(lst[0])
260 map_dict[-1] = nbi // 2
IndexError: list index out of range
I've found that if I use the exact same pdf and all I change is the font from Roboto to Arial, PyPDF2 has no problem extracting the text. I've searched online and in the PyPDF2 documentation but I can't find any solution on how to get it to extract text in the Roboto font, or add the Roboto font to the PyPDF2 font library.
I'd really appreciate if anyone could provide some advice on how to solve this issue.
Note: manually changing the font from Roboto to Arial isn't a desirable option as I receive hundreds of these invoices monthly.
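No definitive fix here, but two hedged suggestions. The crash happens inside PyPDF2's ToUnicode CMap parser (parse_bfrange), not in your code, so it may be worth trying a newer PyPDF2 release (or its successor, pypdf), where the CMap handling has been reworked. Meanwhile, a batch job can at least be kept alive by skipping pages whose fonts trigger the bug; a minimal sketch, assuming the IndexError above is the only failure mode:
import PyPDF2

def extract_text_best_effort(path):
    # Skips pages whose font CMap crashes the parser; this avoids the
    # IndexError but does NOT recover the text of the skipped pages.
    reader = PyPDF2.PdfReader(path)
    chunks = []
    for page in reader.pages:
        try:
            chunks.append(page.extract_text())
        except IndexError:
            chunks.append("")  # CMap parsing failed for this page's font
    return "\n".join(chunks)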

How to append/update new values to the rows of an existing csv file from a new csv file as a new column in Python using pandas or something else

Old file
Name, 2015
Jack, 205
Jill, 215
Joy, 369
New file
Name, 2016
Hill, 289
Jill, 501
Rauf, 631
Jack, 520
Kay, 236
Joy, 615
Here is what I want:
Name, 2015, 2016
Jack, 205, 520
Jill, 215, 501
Joy, 369, 615
Hill, , 289
Rauf, , 631
Kay, , 236
Here is a post about how to create a new column in Pandas DataFrame based on the existing columns:
https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/
It would help if you were to explain your bug in the following schema:
post your code
post your error or return code
post what you would have expected
It took me a while to get this neat.
First of all, extract the values from the files:
import csv

with open('old.csv', 'r') as old_file:
    old_csv = [row for row in csv.reader(old_file)]
with open('new.csv', 'r') as new_file:
    new_csv = [row for row in csv.reader(new_file)]
Then we need to get the names from the new file:
new_names = [row[0] for row in new_csv]
Then we can iterate over all old rows to update the new file's values:
for name, number in old_csv:
    # Check if the name is already in the file
    if name in new_names:
        index = new_names.index(name)
        new_csv[index].append(number)
    # If not, add the new name with the number. This is maybe not necessary
    else:
        new_entry = [name, number]
        new_csv.append(new_entry)
After we have merged the lists, we write the new file (newline='' keeps csv.writer from inserting blank rows on Windows):
with open('merged_file.csv', 'w', newline='') as merge_file:
    merger = csv.writer(merge_file)
    for row in new_csv:
        merger.writerow(row)
The file looks like this:
Name, 2016, 2015
Hill, 289
Jill, 501, 215
Rauf, 631
Jack, 520, 205
Kay, 236
Joy, 615, 369
I wasn't sure if "Name" is a header row or not; if it is, that needs to be handled when reading the csv (see the sketch below).
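If the first row is a header, here is a sketch of one way to peel it off before merging (assuming csv.reader, as above):
import csv

with open('old.csv', 'r') as old_file:
    reader = csv.reader(old_file)
    old_header = next(reader)          # e.g. ['Name', ' 2015']
    old_csv = [row for row in reader]  # data rows only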
Thanks everyone for replying
I found a way as follows:
import pandas as pd
old_file = pd.read_csv('old file.csv')
new_file = pd.read_csv('new file.csv')
old_file = old_file[old_file['Name'].isna() == False]
new_file = new_file[new_file['Name'].isna() == False]
data_combined = pd.merge(old_file, new_file, left_on='Name', right_on='Name', how='outer')
print(data_combined.fillna(0).convert_dtypes())
This gives the desired output:
Name 2015 2016
0 Jack 205 520
1 Jill 215 501
2 Joy 369 615
3 Hill 0 289
4 Rauf 0 631
5 Kay 0 236
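If the merged result should also be written back to disk, pandas can do that directly (the filename here is illustrative):
data_combined.fillna(0).convert_dtypes().to_csv('combined.csv', index=False)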

OverflowError when trying to convert generators to lists

I'm trying to extract dates from txt files using datefinder.find_dates, which returns a generator object. Everything works fine until I try to convert the generator to a list, at which point I get the following error.
I have been looking around but can't figure out a solution; I'm not sure I really understand the problem either.
import datefinder
import glob

path = "some_path/*.txt"
files = glob.glob(path)
dates_dict = {}
for name in files:
    with open(name, encoding='utf8') as f:
        dates_dict[name] = list(datefinder.find_dates(f.read()))
Returns:
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-53-a4b508b01fe8> in <module>()
1 for name in files:
2 with open(name, encoding='utf8') as f:
----> 3 dates_dict[name] = list(datefinder.find_dates(f.read()))
C:\ProgramData\Anaconda3\lib\site-packages\datefinder\__init__.py in find_dates(self, text, source, index, strict)
29 ):
30
---> 31 as_dt = self.parse_date_string(date_string, captures)
32 if as_dt is None:
33 ## Dateutil couldn't make heads or tails of it
C:\ProgramData\Anaconda3\lib\site-packages\datefinder\__init__.py in parse_date_string(self, date_string, captures)
99 # otherwise self._find_and_replace method might corrupt them
100 try:
--> 101 as_dt = parser.parse(date_string, default=self.base_date)
102 except ValueError:
103 # replace tokens that are problematic for dateutil
C:\ProgramData\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in parse(timestr, parserinfo, **kwargs)
1354 return parser(parserinfo).parse(timestr, **kwargs)
1355 else:
-> 1356 return DEFAULTPARSER.parse(timestr, **kwargs)
1357
1358
C:\ProgramData\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
651 raise ValueError("String does not contain a date:", timestr)
652
--> 653 ret = self._build_naive(res, default)
654
655 if not ignoretz:
C:\ProgramData\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in _build_naive(self, res, default)
1222 cday = default.day if res.day is None else res.day
1223
-> 1224 if cday > monthrange(cyear, cmonth)[1]:
1225 repl['day'] = monthrange(cyear, cmonth)[1]
1226
C:\ProgramData\Anaconda3\lib\calendar.py in monthrange(year, month)
122 if not 1 <= month <= 12:
123 raise IllegalMonthError(month)
--> 124 day1 = weekday(year, month, 1)
125 ndays = mdays[month] + (month == February and isleap(year))
126 return day1, ndays
C:\ProgramData\Anaconda3\lib\calendar.py in weekday(year, month, day)
114 """Return weekday (0-6 ~ Mon-Sun) for year (1970-...), month(1- 12),
115 day (1-31)."""
--> 116 return datetime.date(year, month, day).weekday()
117
118
OverflowError: Python int too large to convert to C long
Can someone explain this clearly?
Thanks in advance
RE-EDIT: After taking into consideration the remarks that were made, I found a minimal, readable and verifiable example. The error occurs on:
import datefinder

generator = datefinder.find_dates("466990103060049")
for s in generator:
    pass
This looks to be a bug in the library you are using. It is trying to parse the string as a year, but this year is too big for Python to handle. The library that datefinder uses says it raises an OverflowError in this instance, but datefinder ignores that possibility.
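You can reproduce the underlying failure directly; the huge "year" blows up when the standard library converts it to a C long (on Windows builds, where a C long is 32 bits; other platforms may raise a ValueError instead):
>>> import datetime
>>> datetime.date(466990103060049, 1, 1)  # the "year" datefinder extracted
OverflowError: Python int too large to convert to C long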
One quick and dirty hack just to get it working would be to do:
>>> datefinder.ValueError = ValueError, OverflowError
>>> list(datefinder.find_dates("2019/02/01 is a date and 466990103060049 is not"))
[datetime.datetime(2019, 2, 1, 0, 0)]
This works because the except ValueError: inside datefinder looks the name up in its own module namespace first, so rebinding datefinder.ValueError to a tuple of exception classes makes that handler swallow the OverflowError as well.
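Alternatively, assuming the trigger is always a long bare digit run like the one in the repro, you could blank such runs out before calling datefinder; the 10+ digit cutoff below is an assumption and may need tuning for your documents:
import re
import datefinder

def find_dates_prefiltered(text):
    # Remove digit runs too long to be a plausible date before parsing.
    cleaned = re.sub(r"\b\d{10,}\b", " ", text)
    return list(datefinder.find_dates(cleaned))

print(find_dates_prefiltered("2019/02/01 is a date and 466990103060049 is not"))
# [datetime.datetime(2019, 2, 1, 0, 0)]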

Python removing duplicate names

I have a plain text file with words on each line:
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3497 England/LOCATION
3521 >India<TOPONYM>/O
3526 >Zimbabwe<TOPONYM>/O
3531 >England<TOPONYM>/O
3536 >Melbourne<TOPONYM>/O
3541 >England<TOPONYM>/O
3546 >England<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3556 >England<TOPONYM>/O
3561 >England<TOPONYM>/O
3566 >Australia<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3821 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4234 >Hampden<TOPONYM>/O
4239 >Hampden<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4845 >Edinburgh<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O
I want to remove the duplicate location names in this list, so that it looks like this:
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3497 England/LOCATION
3526 >Zimbabwe<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O
I want to remove the duplicate location names while the docids remain in the file. I know there is a way to do this on Linux with uniq, but if I run that it will remove duplicate locations across different docids.
Is there any way to split the file at every docid and, within each docid, remove the duplicate location names?
I am writing from mobile, so this will not be a complete solution, but the key points:
import re

Docid = re.compile(r"^ *\d+ +<DOCID>")
Location = re.compile(r"^ *\d+ +>?(.+)/")

Lines = {}
for line in file:  # file is the open input file
    if re.match(Docid, line):
        Lines = {}
        print(line)
    else:
        loc = re.findall(Location, line)[0]
        if loc not in Lines:
            print(line)
            Lines[loc] = True
Basically it checks whether each line is a new docid. If it isn't, it tries to read the location and checks whether it was already seen. If not, it prints the line and adds the location to the set of locations read.
If there is a new docid, it resets the set of seen locations.
Here is a way to do it.
import string

filename = 'testfile'
lines = tuple(open(filename, 'r'))
final_list = []
unique_list = []  # this resets itself every docid
for line in lines:
    currentline = str(line)
    if 'DOCID' in currentline:
        unique_list = []  # this resets itself every docid
        final_list.append(line)
    else:
        exclude = set(string.punctuation)
        currentline = ''.join(ch if ch not in exclude else " " for ch in currentline)
        city = currentline.split()[1]
        if city not in unique_list:
            unique_list.append(city)
            final_list.append(line)
for line in final_list:
    print(line)
output:
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3526 >Zimbabwe<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O
Note: testfile is a text file with your input text. You can optimize the code if necessary.

Clean from .txt, write new variable as a csv delimited string (not list) into a csv file

My .txt file looks like the following:
Page 1 of 49
<="">
View Full Profile
S.S. Anne
Oil Tanker
42 miles offshore
Anchor length is 50 feet
<="">
View Full Profile
S.S. Minnow
Passenger Ship
1502.2 miles offshore
Anchor length is 12 feet
<="">
View Full Profile
S.S. Virginia
Passenger Ship
2 km offshore
Anchor length is 25 feet
<="">
View Full Profile
S.S. Chesapeake
Naval Ship
10 miles offshore
Anchor length is 75 feet
<="">
I've worked out the cleaning part: following each 'View Full Profile' I take the next 4 lines and put them into their own new line item, and I do this for each instance of 'View Full Profile'.
Code:
import csv

data = []
with open('ship.txt', 'r', encoding='utf8') as f:
    lines = f.readlines()
    for i, line in enumerate(lines):
        if 'View Full Profile' in line:
            x = [lines[i+1], lines[i+2], lines[i+3], lines[i+4]]
            data.append(x)
for line in data:
    y = line
    print(y)
with open('ship_test.csv', 'w') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in data:
        writer.writerow(line)
And the output of printing y to see what will be written into the new file is:
['S.S. Anne\n', 'Oil Tanker\n', '42 miles offshore\n', 'Anchor length is 50 feet\n']
['S.S. Minnow\n', 'Passenger Ship\n', '1502.2 miles offshore\n', 'Anchor length is 12 feet\n']
['S.S. Virginia\n', 'Passenger Ship\n', '2 km offshore\n', 'Anchor length is 25 feet\n']
['S.S. Chesapeake\n', 'Naval Ship\n', '10 miles offshore\n', 'Anchor length is 75 feet\n']
[Finished in 0.1s]
When it copies into the new file I get the following:
"S.S. Anne
","Oil Tanker
","42 miles offshore
","Anchor length is 50 feet
"
"S.S. Minnow
","Passenger Ship
","1502.2 miles offshore
","Anchor length is 12 feet
"
"S.S. Virginia
","Passenger Ship
","2 km offshore
","Anchor length is 25 feet
"
"S.S. Chesapeake
","Naval Ship
","10 miles offshore
","Anchor length is 75 feet
"
I am looking to write the output of y into the new file. I believe it writes each item on its own line due to the '\n' attached to each element of x. How do I remove this (I tried splitting and received an error) so that x is written as a single line in the new file as a csv string?
Use the replace method:
line = line.replace('\n', '')
If there is a list of strings:
line = [l.replace('\n', '') for l in line]
You should strip the terminating newline as soon as you read the line. The first part of your code could become:
data = []
with open('ship.txt', 'r', encoding='utf8') as f:
    for line in f:
        if 'View Full Profile' in line:
            x = [next(f).strip(), next(f).strip(),
                 next(f).strip(), next(f).strip()]
            data.append(x)
You could even write the csv file on the fly:
import csv

with open('ship.txt', 'r', encoding='utf8') as f, open('ship_test.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in f:
        if 'View Full Profile' in line:
            x = [next(f).strip(), next(f).strip(),
                 next(f).strip(), next(f).strip()]
            writer.writerow(x)
