I am working on reading PDF files that are CVs into a DataFrame (pandas). However, after reading the files I find a NaN row and a duplicate row of the last CV (alphabetically). Is there something in the code that does this? I can't seem to figure out why. I have tried changing around the iloc index[0] parts and the fileIndex value, but have found no solution. All help is appreciated.
dataset = []
pdf_dir = "C:/Users/user/Documents/CV ML Test/CVs/"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
dataset = pd.DataFrame(columns = ['FileName','Text'])
fileIndex = 0
for file in pdf_files:
    pdfFileObj = open(file,'rb') #'rb' for read binary mode
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    startPage = 0
    text = ''
    cleanText = ''
    while startPage <= pdfReader.numPages-1:
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.split()
    newRow = pd.DataFrame(index = [0], columns = ['FileName', 'Text'])
    newRow.iloc[0]['FileName'] = file
    newRow.iloc[0]['Text'] = text
    dataset = pd.concat([output_data, newRow], ignore_index=True)
Here is a list of the PDF files currently in the directory:
[PDF files in directory][1]
[1]: https://i.stack.imgur.com/8UJZQ.png
This is the result:
FileName Text
0 NaN NaN
1 C:/Users/user/Documents/CV ML Test/CVs\1... [Copyright, ©, 1996-2018,, JobStreet.com., All...
2 C:/Users/user/Documents/CV ML Test/CVs\1... [AMY, PROFILE, Fund, accountant, with, nearly,...
3 C:/Users/user/Documents/CV ML Test/CVs\2... [BEN, Fund, Accoutant, Sep, 2016, -, Present, ...
4 C:/Users/user/Documents/CV ML Test/CVs\3... [CARRIE, Professional, Experience, Citco, Fund...
5 C:/Users/user/Documents/CV ML Test/CVs\4... [DICKSON, PROFESSIONAL, EXPERIENCE, Conifer, F...
6 C:/Users/user/Documents/CV ML Test/CVs\5... [EDWARDO, QUALIFICATION, SUMMARY, Results-driv...
7 C:/Users/user/Documents/CV ML Test/CVs\6... [FAYE, Citco, Fund, Services, (Singapore), Pte...
8 C:/Users/user/Documents/CV ML Test/CVs\7... [GIRAFFE, Work, Experience:, CITCO, Fund, Serv...
9 C:/Users/user/Documents/CV ML Test/CVs\8... [Career, Objectives, Have, strong, interest, i...
10 C:/Users/user/Documents/CV ML Test/CVs\9... [IGNATIUS, Work, Experience, Watiga, &, Co., (...
11 C:/Users/user/Documents/CV ML Test/CVs\9... [IGNATIUS, Work, Experience, Watiga, &, Co., (...
Fixed by changing the following part of the code:
dataset = pd.DataFrame(columns = ['FileName','Text'])
for file in pdf_files:
    pdfFileObj = open(file,'rb') #'rb' for read binary mode
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    startPage = 0
    text = ''
    cleanText = ''
    while startPage <= pdfReader.numPages-1:
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.split()
    newRow = pd.DataFrame(index = [0], columns = ['FileName', 'Text'])
    newRow.iloc[0]['FileName'] = file
    newRow.iloc[0]['Text'] = text
    dataset = pd.concat([dataset, newRow], ignore_index= True )
Note that the last line now passes "dataset" to pd.concat instead of "output_data", so it matches the DataFrame created before the loop. With output_data, every iteration rebuilt dataset from that unrelated variable plus the current row, so the final result was the contents of output_data followed by the last CV.
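A related way to avoid this kind of bug, sketched here under assumptions: collect one dict per file in a plain list and build the table once at the end, so there is no accumulator name inside the loop that can go stale. `build_rows` and `extract_text` are hypothetical names that do not appear in the original code; `extract_text` stands in for the PyPDF2 page loop. Unlike the original loop, this sketch treats a newline as a word separator instead of deleting it.

```python
def build_rows(paths, extract_text):
    """Return one {'FileName', 'Text'} dict per path.

    extract_text is a hypothetical callable standing in for the
    PyPDF2 page loop: it takes a path and returns the raw text.
    """
    rows = []
    for path in paths:
        raw = extract_text(path)
        # str.split() with no arguments splits on any whitespace,
        # including '\n', and drops empty strings.
        rows.append({"FileName": path, "Text": raw.split()})
    return rows


# Stand-in extractor so the sketch runs without any PDF files.
fake_pdfs = {
    "a.pdf": "AMY PROFILE\nFund accountant",
    "b.pdf": "BEN Fund\nAccountant",
}
rows = build_rows(sorted(fake_pdfs), fake_pdfs.get)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` would then give a table with the FileName and Text columns.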
I have two data tables describing journals (title, ISSN, ...), and I want to know whether each journal in table 1 is present in table 2. For the comparison I only use a numeric identifier, the ISSN.
My basic problem is that I can't manage to iterate through all of tab1: the loop stops after reaching the end of tab2.
import csv
tab1 = open("tab1.csv", 'r', encoding='utf8')
readtab1 = csv.DictReader(tab1)
tab2 = open("tab2.csv", 'r', encoding='utf8')
readtab2 = csv.DictReader(tab2)
linenb1 = 0
for row1 in readtab1:
    issn1 = row1['ISSN'].strip()
    linenb1 +=1
    linenb2 = 0
    for row2 in readtab2 :
        issn2 = row2['ISSN'].strip()
        if len(issn2) < 2 : continue
        linenb2+=1
        print(linenb1, linenb2, issn1,issn2)
Console output:
1 1 2552-8831 0001-253X
1 2 2552-8831 0002-2667
[Finished in 0.2s]
tab1
Nom du titre,ISSN,Format,nom editeur
Revue Droit & Litterature,2552-8831,Papier,LGDJ MONTCHRESTIEN
Memoires en Jeu,2497-2711,Papier,EDITIONS KIME
Le Monde,2262-4694,Online,LE MONDE
Journal des Energies Renouvelables,2491-8687,Papier + e-mail,OBSERVER
tab2
ISSN,TITLE,TARGET_PUBLIC_NAME,TARGET_SERVICE,THRESHOLD_ACTIVE,THRESHOLD_GLOBAL,PUBLISHER,LOCAL_THRESHOLD
0001-253X,Aslib proceedings,French National Licences Emerald,getFullTxt,"$obj->parsedDate('>=',1949,1,1) && $obj->parsedDate('<=',2010,65,6)","$obj->parsedDate('>=',1949,1,1) && $obj->parsedDate('<=',2010,65,6)",Emerald Group Publishing Ltd.,
0002-2667,Aircraft Engineering,French National Licences Emerald,getFullTxt,"$obj->parsedDate('>=',1929,1,1) && $obj->parsedDate('<=',1986,58,3)","$obj->parsedDate('>=',1929,1,1) && $obj->parsedDate('<=',1986,58,3)",Emerald Group Pub.,
I don't understand why, because a nested for loop over ranges works:
for i in range(0,5):
    for j in range(1,10):
        print(i,j)
I would try setting the fieldnames and delimiter arguments of the DictReaders. They default to the values in the first row and to a comma, respectively, whereas the data has no header row and is separated by spaces.
fieldnames = "Nom du titre,ISSN,Format,Nom de l'éditeur,Nom de l'abonné".split(',')
abt = open("tab1.csv", 'r', encoding='utf8')
readTab1= csv.DictReader(abt, fieldnames=fieldnames, delimiter=' ')
sfx = open("tab2.csv", 'r', encoding='utf8')
readTab2 = csv.DictReader(sfx, fieldnames=fieldnames, delimiter=' ')
linenb = 0
for row1 in readTab1:
    issn1 = row1['ISSN'].strip()
    if len(issn1) < 2 : continue
    linenb +=1
    linenb2 = 0
    for row2 in readTab2:
        issn2 = row2['ISSN'].strip()
        if len(issn2) < 2 : continue
        linenb2+=1
        print(linenb,issn1,issn2)
The inner reader was staying at the end of tab2 after its first pass; the solution is to add tab2.seek(0).
I found the solution here:
https://stackoverflow.com/a/26526224/3334635
What still seems strange to me is that nested loops work with ranges of numbers but not with the csv reader.
import csv
tab1 = open("tab1.csv", 'r', encoding='utf8', newline='')
readtab1 = csv.DictReader(tab1)
tab2 = open("tab2.csv", 'r', encoding='utf8', newline='')
readtab2 = csv.DictReader(tab2)
linenb1 = 0
for row1 in readtab1:
    issn1 = row1['ISSN'].strip()
    linenb1 +=1
    linenb2 = 0
    for row2 in readtab2 :
        issn2 = row2['ISSN'].strip()
        if len(issn2) < 2 : continue
        linenb2+=1
        print(linenb1, linenb2, issn1,issn2)
    tab2.seek(0)
    next(readtab2) # skip header
1 1 2552-8831 0001-253X
1 2 2552-8831 0002-2667
2 1 2497-2711 0001-253X
2 2 2497-2711 0002-2667
3 1 2262-4694 0001-253X
3 2 2262-4694 0002-2667
4 1 2491-8687 0001-253X
4 2 2491-8687 0002-2667
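The difference from the range example is that a range object can be iterated any number of times, while a csv.DictReader wraps a file object and is exhausted after one pass. If the goal is only to test whether each ISSN from tab1 occurs in tab2, one alternative to rewinding the file is to read the ISSNs of tab2 into a set once. The sketch below assumes that approach. It uses io.StringIO with shortened sample rows in place of the real files, the second tab1 row is invented so that one match exists, and `issns_in_tab2` is an illustrative name.

```python
import csv
import io

# Shortened stand-ins for tab1.csv and tab2.csv (same ISSN column name).
# The second tab1 row is invented so that the example contains a match.
tab1 = io.StringIO(
    "Nom du titre,ISSN,Format,nom editeur\n"
    "Revue Droit & Litterature,2552-8831,Papier,LGDJ MONTCHRESTIEN\n"
    "Aslib proceedings,0001-253X,Papier,Emerald\n"
)
tab2 = io.StringIO(
    "ISSN,TITLE\n"
    "0001-253X,Aslib proceedings\n"
    "0002-2667,Aircraft Engineering\n"
)

# One pass over tab2: keep every ISSN of at least two characters in a set.
issns_in_tab2 = {
    row["ISSN"].strip()
    for row in csv.DictReader(tab2)
    if len(row["ISSN"].strip()) >= 2
}

# One pass over tab1: a set membership test replaces the inner loop.
matches = []
for row in csv.DictReader(tab1):
    issn = row["ISSN"].strip()
    matches.append((issn, issn in issns_in_tab2))

print(matches)
```

Each file is read exactly once, so no seek(0) is needed.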
I have to write a Python script that does the following:
I have an xlsx/csv file that lists 300 cities in one column.
I have to build all pairs of cities and, with the help of the Google API, add the distance and travel time for each pair in additional columns.
My CSV file looks like this:
=======
SOURCE
=======
Agra
Delhi
Jaipur
and the expected output in the csv/xlsx file should look like this:
=============================================
SOURCE | DESTINATION | DISTANCE | TIME_TRAVEL
=============================================
Agra | Delhi | 247 | 4
Agra | Jaipur | 238 | 4
Delhi | Agra | 247 | 4
Delhi | Jaipur | 281 | 5
Jaipur | Agra | 238 | 4
Jaipur | Delhi | 281 | 5
and so on. How can I do this?
NOTE: Distance and travel time come from Google.
To make the pairs, you can use itertools.permutations, which yields every ordered pair of distinct cities.
The code for that would be:
import csv # imports the csv module
import sys # imports the sys module
import itertools

# read the city names from the first column of the input CSV
with open(sys.argv[1], newline='') as f:
    source_list = [row[0] for row in csv.reader(f) if row]

# every ordered (source, destination) pair of two different cities
pairs = itertools.permutations(source_list, 2)

# write one pair per row to the output CSV
with open(sys.argv[2], 'w', newline='') as g:
    mywriter = csv.writer(g)
    mywriter.writerows(pairs)
In addition, to get the distance and travel time from Google, the following sample code may work as a starting point; it prints each request URL to help with debugging.
import csv # imports the csv module
import json
import sys # imports the sys module
import urllib.request

api_google_key = ''
api_google_url = 'https://maps.googleapis.com/maps/api/distancematrix/json?origins='
source_list = []
destination_list = []
distance_list = []
duration_list = []

# input CSV: one "source,destination" pair per row
with open(sys.argv[1], newline='') as f:
    my_list = list(csv.reader(f))

for i in my_list:
    if i:
        # keep only letters and digits so the names can go into the URL
        source = ''.join(e for e in i[0] if e.isalnum())
        destination = ''.join(e for e in i[1] if e.isalnum())
        source_list.append(source)
        destination_list.append(destination)
        request = api_google_url+source+'&destinations='+destination+'&key='+api_google_key
        print(request)
        dist = json.load(urllib.request.urlopen(request))
        # fall back to 0 when the response has no duration for this pair
        duration_dict = 0
        distance_dict = 0
        if dist['rows']:
            element = dist['rows'][0]['elements'][0]
            if 'duration' in element:
                duration_dict = element['duration']['text']
                distance_dict = element['distance']['text']
        distance_list.append(distance_dict)
        duration_list.append(duration_dict)

# output CSV: source, destination, distance, duration
with open(sys.argv[2], 'w', newline='') as g:
    mywriter = csv.writer(g)
    rows = zip(source_list,destination_list,distance_list,duration_list)
    mywriter.writerows(rows)
You can do this with itertools.product, but you will then also get self-pairs such as (Agra, Agra), for which the distance is simply 0.
import itertools
cities = ["Agra","Delhi","Jaipur"]
cities2 = cities
p = itertools.product(cities, cities2)
print(list(p))
In this case you'd get
[('Agra', 'Agra'), ('Agra', 'Delhi'), ('Agra', 'Jaipur'), ('Delhi', 'Agra'), ('Delhi', 'Delhi'), ('Delhi', 'Jaipur'), ('Jaipur', 'Agra'), ('Jaipur', 'Delhi'), ('Jaipur', 'Jaipur')]
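You can then loop over these pairs and make a request to Google for each one to get the travel time and distance. (The iterator p was consumed by list(p) above, so the product is created again here.)
>>> for pair in itertools.product(cities, cities2):
...     print(pair)
...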
('Agra', 'Agra')
('Agra', 'Delhi')
('Agra', 'Jaipur')
('Delhi', 'Agra')
('Delhi', 'Delhi')
('Delhi', 'Jaipur')
('Jaipur', 'Agra')
('Jaipur', 'Delhi')
('Jaipur', 'Jaipur')
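If the self-pairs are not wanted, they can be filtered out of the product. As long as the city names are distinct, the result is the same list of ordered pairs that itertools.permutations(cities, 2) produces. A small check, using the three cities above:

```python
import itertools

cities = ["Agra", "Delhi", "Jaipur"]

# Drop pairs where source and destination are the same city.
no_self_pairs = [
    (a, b) for a, b in itertools.product(cities, repeat=2) if a != b
]

# permutations of length 2 yields the same ordered pairs directly.
ordered_pairs = list(itertools.permutations(cities, 2))

print(no_self_pairs == ordered_pairs)
```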
You can get all the ordered pairs with itertools.permutations(), like so:
from itertools import permutations

# cities_file and newfile are the input and output paths;
# googleapi.get stands for whatever call fetches distance and time for a pair
with open(cities_file, 'r') as f, open(newfile, 'w') as f2:
    for pair in permutations([a.strip() for a in f.read().splitlines()], 2):
        print(pair)
        response = googleapi.get(pair)
        f2.write(response + '\n')
Output of print(pair):
('Agra', 'Delhi')
('Agra', 'Jaipur')
('Delhi', 'Agra')
('Delhi', 'Jaipur')
('Jaipur', 'Agra')
('Jaipur', 'Delhi')
You can then call the API for each pair in turn and keep writing the results to the file.
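Putting the pieces together, a sketch of the full loop could look like the following. `lookup` is a hypothetical callable standing in for the Google request (it is not a real API client), `write_pairs` is an illustrative name, and the numbers in `fake_table` are simply the figures from the expected output in the question.

```python
import csv
import io
from itertools import permutations


def write_pairs(cities, lookup, out):
    """Write SOURCE, DESTINATION, DISTANCE, TIME_TRAVEL rows to out.

    lookup is a hypothetical callable: (source, destination) ->
    (distance, travel_time). In real use it would wrap the API call.
    """
    writer = csv.writer(out)
    writer.writerow(["SOURCE", "DESTINATION", "DISTANCE", "TIME_TRAVEL"])
    for source, destination in permutations(cities, 2):
        distance, travel_time = lookup(source, destination)
        writer.writerow([source, destination, distance, travel_time])


# Stand-in data taken from the expected output in the question.
fake_table = {
    frozenset(["Agra", "Delhi"]): (247, 4),
    frozenset(["Agra", "Jaipur"]): (238, 4),
    frozenset(["Delhi", "Jaipur"]): (281, 5),
}

out = io.StringIO()
write_pairs(
    ["Agra", "Delhi", "Jaipur"],
    lambda s, d: fake_table[frozenset([s, d])],
    out,
)
print(out.getvalue())
```

With 300 cities there are 300 * 299 = 89,700 ordered pairs, so the number of requests is worth checking against the API quota before running the real thing.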
I have an existing data table with two columns: one is an ID and the other is a comma-separated list of IDs.
For example:
ID | List
---------
1 | 1, 4, 5
3 | 2, 12, 1
I would like to split the column List so that I have a table like this:
ID | List
---------
1 | 1
1 | 4
1 | 5
3 | 2
3 | 12
3 | 1
I have figured this much out so far:
tablename='Querysummary Data'
table=Document.Data.Tables[tablename]
topiccolname='TOPIC_ID'
topiccol=table.Columns[topiccolname]
topiccursor=DataValueCursor.Create[str](topiccol)
docscolname='DOC_IDS'
doccol=table.Columns[docscolname]
doccursor=DataValueCursor.Create[str](doccol)
myPanel = Document.ActivePageReference.FilterPanel
idxSet = myPanel.FilteringSchemeReference.FilteringSelectionReference.GetSelection(table).AsIndexSet()
keys=dict()
topdoc=dict()
for row in table.GetRows(idxSet,topiccursor,doccursor):
    keys[topiccursor.CurrentValue]=doccursor.CurrentValue
for key in keys:
    # renamed from "str" so the built-in type is not shadowed
    docids = keys[key].split(",")
    for i in docids:
        topdoc[key]=i
        print key + " " +i
Now I can print each topic ID with its corresponding document IDs.
How can I create a new data table in Spotfire from this dict()?
I finally solved it myself. There may be better code, but it works:
import clr
clr.AddReference('System.Data')
import System
from System.IO import StreamWriter, MemoryStream, SeekOrigin
from System.Data import DataSet, DataTable
from Spotfire.Dxp.Data import DataValueCursor, DataType, DataTableSaveSettings
from Spotfire.Dxp.Data.Import import TextFileDataSource, TextDataReaderSettings

tablename='Querysummary Data'
table=Document.Data.Tables[tablename]
topiccolname='TOPIC_ID'
topiccol=table.Columns[topiccolname]
topiccursor=DataValueCursor.Create[str](topiccol)
docscolname='DOC_IDS'
doccol=table.Columns[docscolname]
doccursor=DataValueCursor.Create[str](doccol)
myPanel = Document.ActivePageReference.FilterPanel
idxSet = myPanel.FilteringSchemeReference.FilteringSelectionReference.GetSelection(table).AsIndexSet()
# build a string representing the data in semicolon-delimited text format
textData = "TOPIC_ID;DOC_IDS\r\n"
keys=dict()
topdoc=dict()
for row in table.GetRows(idxSet,topiccursor,doccursor):
    keys[topiccursor.CurrentValue]=doccursor.CurrentValue
for key in keys:
    # renamed from "str" so the built-in type is not shadowed
    docids = keys[key].split(",")
    for i in docids:
        textData += key + ";" + i + "\r\n"
dataSet = DataSet()
dataTable = DataTable("DOCIDS")
dataTable.Columns.Add("TOPIC_ID", System.String)
dataTable.Columns.Add("DOC_IDS", System.String)
dataSet.Tables.Add(dataTable)
# make a stream from the string
stream = MemoryStream()
writer = StreamWriter(stream)
writer.Write(textData)
writer.Flush()
stream.Seek(0, SeekOrigin.Begin)
# set up the text data reader
readerSettings = TextDataReaderSettings()
readerSettings.Separator = ";"
readerSettings.AddColumnNameRow(0)
readerSettings.SetDataType(0, DataType.String)
readerSettings.SetDataType(1, DataType.String)
readerSettings.SetDataType(2, DataType.String)
# create a data source to read in the stream
textDataSource = TextFileDataSource(stream, readerSettings)
# add the data into a Data Table in Spotfire
if Document.Data.Tables.Contains("Querysummary Mapping"):
    Document.Data.Tables["Querysummary Mapping"].ReplaceData(textDataSource)
else:
    newTable = Document.Data.Tables.Add("Querysummary Mapping", textDataSource)
    tableSettings = DataTableSaveSettings (newTable, False, False)
    Document.Data.SaveSettings.DataTableSettings.Add(tableSettings)
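Independent of the Spotfire API, the core step is turning each (ID, comma-separated list) row into one row per list element. A minimal plain-Python sketch of that step, using the example table from the question, might look like this (`explode_rows` is an illustrative name, not a Spotfire function):

```python
def explode_rows(rows):
    """Yield (id, item) for every item in each comma-separated list."""
    for row_id, id_list in rows:
        for item in id_list.split(","):
            item = item.strip()  # "1, 4, 5" has spaces after the commas
            if item:
                yield row_id, item


table = [("1", "1, 4, 5"), ("3", "2, 12, 1")]
result = list(explode_rows(table))
print(result)
```

Because it works row by row rather than through a dict keyed by ID, repeated IDs in the first column would each keep their own list.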
I have two CSV files. The first contains city name, population and humidity. The second maps cities to states. I want to get the total population and average humidity per state. Can someone help? Here is an example:
CSV 1:
CityName,population,humidity
Austin,1000,20
Sanjose,2200,10
Sacramento,500,5
CSV 2:
State,city name
Ca,Sanjose
Ca,Sacramento
Texas,Austin
I would like to get this output (total population and average humidity per state):
Ca,2700,7.5
Texas,1000,20
The dictionary-based solution doesn't work because the dictionary keeps only one value per key, so each state ends up with a single city. I gave up on it and finally used a loop. The code below works; the input is shown as well.
csv1
state_name,city_name
CA,sacramento
utah,saltlake
CA,san jose
Utah,provo
CA,sanfrancisco
TX,austin
TX,dallas
OR,portland
CSV2
city_name population humidity
sacramento 1000 1
saltlake 300 5
san jose 500 2
provo 100 7
sanfrancisco 700 3
austin 2000 4
dallas 2500 5
portland 300 6
import csv
from pandas import read_csv

def mapping_within_dataframe(self, file1, file2, file3):
    self.csv1 = file1
    self.csv2 = file2
    self.outcsv = file3
    # open the output path itself, not the literal string 'self.outcsv'
    outfile = csv.writer(open(self.outcsv, 'w', newline=''), delimiter=',')
    state_city = read_csv(self.csv1)
    city_data = read_csv(self.csv2)
    all_state = list(set(state_city.state_name))
    for one_state in all_state:
        one_state_cities = list(state_city.loc[state_city.state_name == one_state, "city_name"])
        # sum the population over all cities of this state
        one_state_data = 0
        for one_city in one_state_cities:
            one_city_data = city_data.loc[city_data.city_name == one_city, "population"].sum()
            one_state_data = one_state_data + one_city_data
        print(one_state, one_state_data)
        # one output row per state: name and total population
        outfile.writerow([one_state, one_state_data])
def output(file1, file2):
    f = lambda x: x.strip() #strips newline and white space characters
    with open(file1) as cities:
        with open(file2) as states:
            states_dict = {}
            cities_dict = {}
            next(states) # skip the header row
            next(cities) # skip the header row
            for line in states:
                line = line.split(',')
                states_dict[f(line[0])] = f(line[1])
            for line in cities:
                line = line.split(',')
                cities_dict[f(line[0])] = (int(f(line[1])) , int(f(line[2])))
            for state , city in states_dict.items():
                try:
                    print(state, cities_dict[city])
                except KeyError:
                    pass

output(CSV1,CSV2) #these are the names of the files
This prints each state together with the (population, humidity) tuple of one of its cities. Because states_dict is keyed by state, it keeps only the last city listed for each state, so the per-state sum and average still have to be computed on top of this. Also make sure the city names are capitalized the same way in both files.
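For the aggregation itself, one possible approach is to map each city to its state and then accumulate population and humidity per state. The sketch below uses only the standard library and reads the sample data from the question out of in-memory strings. `state_totals` is an illustrative name, and the humidity average is an unweighted mean over cities, which matches the expected output in the question.

```python
import csv
import io
from collections import defaultdict

cities_csv = io.StringIO(
    "CityName,population,humidity\n"
    "Austin,1000,20\n"
    "Sanjose,2200,10\n"
    "Sacramento,500,5\n"
)
states_csv = io.StringIO(
    "State,city name\n"
    "Ca,Sanjose\n"
    "Ca,Sacramento\n"
    "Texas,Austin\n"
)


def state_totals(cities_file, states_file):
    """Return {state: (total_population, mean_humidity)}."""
    # Key the mapping by city, so several cities can share one state.
    city_to_state = {
        row["city name"].strip(): row["State"].strip()
        for row in csv.DictReader(states_file)
    }
    population = defaultdict(int)
    humidity = defaultdict(list)
    for row in csv.DictReader(cities_file):
        state = city_to_state.get(row["CityName"].strip())
        if state is None:
            continue  # city without a state mapping
        population[state] += int(row["population"])
        humidity[state].append(float(row["humidity"]))
    return {
        state: (population[state], sum(humidity[state]) / len(humidity[state]))
        for state in population
    }


totals = state_totals(cities_csv, states_csv)
for state, (pop, hum) in sorted(totals.items()):
    print(state, pop, hum)
```

The important design choice is that the dictionary is keyed by city rather than by state, which is what avoids the one-city-per-state problem.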