Calculating average of data set, with text mixed [closed] - python

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I am required to write a Python program that reads a file and calculates the average GDP for each country over the 10-year period.
Basically, my desired result is:
Australia: 1248467214849.1
Azerbaijan: 55506365440.0
Bangladesh: 139036345780.9
Brazil: 2057882976008.9
Brunei Darussalam: 14817756697.0
Burkina Faso: 10081729086.1
Cabo Verde: 1719693752.3
Cambodia: 13779735437.1
Chile: 229246627569.0
China: 7784747168448.6
Czech Republic: 207328405561.6
Dominica: 499171357.0
Egypt, Arab Rep.: 247614743339.3
France: 2702817149305.2
Germany: 3582562859622.3
Greece: 270091322197.4
Guam: 5115700000.0
India: 1726508317353.4
Iran, Islamic Rep.: 454617559842.3
Iraq: 169480789377.9
Japan: 5217301203153.5
Jordan: 29469864942.1
Kazakhstan: 168198946242.6
Kenya: 48807995178.8
Korea, Rep.: 1205755199135.1
Latvia: 28908355369.8
Lebanon: 40455121214.3
Lithuania: 42763449721.2
Madagascar: 9486935333.5
Malaysia: 274833978374.2
Mali: 11894695436.7
Mongolia: 9207583282.1
Mozambique: 12838623643.4
Myanmar: 50703575766.4
Nicaragua: 10212597587.4
Nigeria: 375494148527.7
Paraguay: 23250819867.6
Philippines: 231981575952.4
Qatar: 149455747118.1
Singapore: 257026873704.2
Spain: 1404296966483.9
Sweden: 519174481541.8
Tanzania: 36731725995.3
Tunisia: 44118349316.0
Turkmenistan: 29383204467.2
United Kingdom: 2736682446205.8
United States: 16108231800000.0
Vietnam: 144579453846.2
Zambia: 21393965950.9
Zimbabwe: 11907947332.3
and the provided text file looks like this:
853764622753
1055334825425
927168311000
1142876772659
1390557034408
1538194473087
1567178619062
1459597906913
1345383143356
1204616439828
Australia
33050343783
48852482960
44291490421
52902703376
65951627200
69684317719
74164435946
75244166773
53074370486
37847715736
Azerbaijan
79611888213
91631278239
102477791472
115279077465
128637938711
133355749482
149990451022
172885454931
195078665828
221415162446
Bangladesh
1397084381901
1695824517396
1667019605882
2208871646203
2616201578192
2465188674415
2472806919902
2455993200170
1803652649614
1796186586414
Brazil
12247694247
14393099069
10732366286
13707370737
18525319978
19048495519
18093829923
17098342541
12930394938
11400653732
Brunei Darussalam
6771277871
8369637065
8369175126
8979966766
10724063458
11166063467
11947176342
12377391463
10419303761
11693235542
Burkina Faso
1513934037
1789333749
1711817182
1664310770
1864824081
1751888562
1850951315
1858121723
1574288668
1617467436
Cabo Verde
8639235842
10351914093
10401851851
11242275199
12829541141
14038383450
15449630419
16777820333
18049954289
20016747754
Cambodia
173605968179
179638496279
172389498445
218537551220
252251992029
267122320057
278384332694
260990299051
242517905162
247027912574
Chile
3552182311653
4598206091384
5109953609257
6100620488868
7572553836875
8560547314679
9607224481533
10482372109962
11064666282626
11199145157649
China
189227050760
235718586901
206179982164
207477857919
227948349666
207376427021
209402444996
207818330724
186829940546
195305084919
Czech Republic
421375852
458190185
489074333
493824407
501025303
485997988
501979277
523666347
535095846
581484032
Dominica
130478960092
162818181818
188982374701
218888324505
236001858960
279372758362
288586231502
305529656458
332698041031
332791045964
Egypt, Arab Rep.
2663112510266
2923465651091
2693827452070
2646837111795
2862680142625
2681416108537
2808511203185
2849305322685
2433562015516
2465453975282
France
3439953462907
3752365607148
3418005001389
3417094562649
3757698281118
3543983909148
3752513503278
3890606893347
3375611100742
3477796274497
Germany
318497936901
354460802549
330000252153
299361576558
287797822093
245670666639
239862011450
237029579261
195541761243
192690813127
Greece
4375000000
4621000000
4781000000
4895000000
4928000000
5199000000
5337000000
5531000000
5697000000
5793000000
Guam
1201111768409
1186952757636
1323940295875
1656617073124
1823049927771
1827637859136
1856722121395
2035393459979
2089865410868
2263792499341
India
349881601459
406070949554
414059094949
487069570464
583500357530
598853401276
467414852231
434474616832
385874474399
418976679729
Iran, Islamic Rep.
88840050497
131613661510
111660855043
138516722650
185749664444
218000986223
234648370497
234648370497
179640210726
171489001692
Iraq
4515264514431
5037908465114
5231382674594
5700098114744
6157459594824
6203213121334
5155717056271
4848733415524
4383076298082
4940158776617
Japan
17110587447
21972004086
23820230000
26425379437
28840263380
30937277606
33593843662
35826925775
37517410282
38654727746
Jordan
104849886826
133441612247
115308661143
148047348241
192626507972
207998568866
236634552078
221415572820
184388432149
137278320084
Kazakhstan
31958195182
35895153328
37021512049
39999659234
41953433591
50412754822
55097343448
61445345999
63767539357
70529014778
Kenya
1122679154632
1002219052968
901934953365
1094499338703
1202463682634
1222807284485
1305604981272
1411333926201
1382764027114
1411245589977
Korea, Rep.
30901399261
35596016664
26169854045
23757368290
28223552825
28119996053
30314363219
31419072948
27009231911
27572698482
Latvia
24577114428
29227350570
35477118070
38419626628
40075674163
43868565282
46014226808
47833413749
49459296463
49598825982
Lebanon
39738180077
47850551149
37440673478
37120517694
43476878139
42847900766
46473646002
48545251796
41402022148
42738875963
Lithuania
7342923489
9413002921
8550363975
8729936136
9892702358
9919780071
10601690872
10673516673
9744243420
10001193420
Madagascar
193547824063
230813597938
202257586268
255016609233
297951960784
314443149443
323277158907
338061963396
296434003329
296535930381
Malaysia
8145694632
9750822511
10181021770
10678749467
12978107561
12442747897
13246412031
14388360064
13100058100
14034980334
Mali
4234999823
5623216449
4583850368
7189481824
10409797649
12292770631
12582122604
12226514722
11749620620
11183458131
Mongolia
9366742309
11494837053
10911698208
10154238250
13131168012
14534278446
16018848991
16961127046
14798439527
11014858592
Mozambique
20182477481
31862554102
36906181381
49540813342
59977326086
59937797559
60269734045
65446402659
59687373958
63225097051
Myanmar
7423377429
8496965842
8298695145
8758622329
9774316692
10532001130
10982972256
11880438824
12747741540
13230844687
Nicaragua
166451213396
208064753766
169481317540
369062464570
411743801712
460953836444
514966287207
568498937588
481066152889
404652720165
Nigeria
13794910634
18504130753
15929902138
20030528043
25099681461
24595319574
28965906502
30881166852
27282581336
27424071383
Paraguay
149359920006
174195135053
168334599538
199590775190
224143083707
250092093548
271836123724
284584522899
292774099014
304905406845
Philippines
79712087912
115270054945
97798351648
125122306346
167775274725
186833516484
198727747253
206224725275
164641483516
152451923077
Qatar
179981288567
192225881688
192408387762
236421782178
275599459374
289162118909
302510668904
308142766948
296840704102
296975678610
Singapore
1479341637011
1635015380108
1499099749931
1431616749640
1488067258325
1336018949806
1361854206549
1376910811041
1197789902774
1237255019654
Spain
487816328342
513965650650
429657033108
488377689565
563109663291
543880647757
578742001488
573817719109
497918109302
514459972806
Sweden
21501741757
27368386358
28573777052
31407908612
33878631649
39087748240
44333456245
48197218327
45628320606
47340071107
Tanzania
38908069299
44856586316
43454935940
44050929160
45810626509
45044112939
46251061734
47587913059
43156708809
42062549395
Tunisia
12664165103
19271523179
20214385965
22583157895
29233333333
35164210526
39197543860
43524210526
35799628571
36179885714
Turkmenistan
3074359743898
2890564338235
2382825985356
2441173394730
2619700404733
2662085168499
2739818680930
3022827781881
2885570309161
2647898654635
United Kingdom
14477635000000
14718582000000
14418739000000
14964372000000
15517926000000
16155255000000
16691517000000
17393103000000
18120714000000
18624475000000
United States
77414425532
99130304099
106014659770
115931749697
135539438560
155820001920
171222025117
186204652922
193241108710
205276172135
Vietnam
14056957976
17910858638
15328342304
20265556274
23460098340
25503370699
28045460442
27150630607
21154394546
21063989683
Zambia
5291950100
4415702800
8621573608
10141859710
12098450749
14242490252
15451768659
15891049236
16304667807
16619960402
Zimbabwe
So what I have thought of so far is to use an aggregation loop that checks whether the current line is a GDP value or the name of a country. When it reaches the name of a country, it should calculate the average, print the result, reset the per-country aggregation variables, and continue looping to aggregate the next country's GDP values.
To handle the mixed nature of the input file, I would either use the str.isnumeric() method or keep a counter to detect when 10 GDP values have been read (the next line is then the name of the corresponding country).
for value in open("10year-gdp.txt"):

Something like this in Python 3 may work:
import statistics
with open('10year-gdp.txt') as f:
    items = []
    for line in f.readlines():
        line = line.strip()
        if line.isdigit():
            items.append(float(line))
        else:
            print('{0}: {1}'.format(line, statistics.mean(items)))
            items = []

You can try this one too:
with open("10year-gdp.txt", "r") as infile:
    content = infile.readlines()
content = [content[i:i+11] for i in range(0,len(content),11)]
results = [": ".join([c[10],str(sum(map(float,c[0:10]))/10)]).replace("\n","") for c in content]
for result in results:
    print(result)
Output:
Australia: 1248467214849.1
Azerbaijan: 55506365440.0
Bangladesh: 139036345780.9
Brazil: 2057882976008.9
Brunei Darussalam: 14817756697.0
Burkina Faso: 10081729086.1
Cabo Verde: 1719693752.3
Cambodia: 13779735437.1
Chile: 229246627569.0
China: 7784747168448.6
Czech Republic: 207328405561.6
Dominica: 499171357.0
Egypt, Arab Rep.: 247614743339.3
France: 2702817149305.2
Germany: 3582562859622.3
Greece: 270091322197.4
Guam: 5115700000.0
India: 1726508317353.4
Iran, Islamic Rep.: 454617559842.3
Iraq: 169480789377.9
Japan: 5217301203153.5
Jordan: 29469864942.1
Kazakhstan: 168198946242.6
Kenya: 48807995178.8
Korea, Rep.: 1205755199135.1
Latvia: 28908355369.8
Lebanon: 40455121214.3
Lithuania: 42763449721.2
Madagascar: 9486935333.5
Malaysia: 274833978374.2
Mali: 11894695436.7
Mongolia: 9207583282.1
Mozambique: 12838623643.4
Myanmar: 50703575766.4
Nicaragua: 10212597587.4
Nigeria: 375494148527.7
Paraguay: 23250819867.6
Philippines: 231981575952.4
Qatar: 149455747118.1
Singapore: 257026873704.2
Spain: 1404296966483.9
Sweden: 519174481541.8
Tanzania: 36731725995.3
Tunisia: 44118349316.0
Turkmenistan: 29383204467.2
United Kingdom: 2736682446205.8
United States: 16108231800000.0
Vietnam: 144579453846.2
Zambia: 21393965950.9
Zimbabwe: 11907947332.3

#!/usr/bin/env python
from statistics import mean
GDPGroup = []
GDPDictionary = {}
with open("10year-gdp.txt") as FileObject:
    lines = FileObject.readlines()
for line in lines:
    line = line.strip()
    if not line.isdigit():
        # A country name closes the current group of GDP values
        GDPDictionary[line] = GDPGroup
        GDPGroup = []
    else:
        GDPGroup.append(float(line))
# Replace each list of values with its mean
for key in GDPDictionary:
    GDPDictionary[key] = mean(GDPDictionary[key])
print(GDPDictionary)
Prints out:
{'Guam': 5115700000.0, 'Lithuania': 42763449721.2, 'Azerbaijan': 55506365440.0, 'Bangladesh': 139036345780.9, 'Egypt, Arab Rep.': 247614743339.3, 'Burkina Faso': 10081729086.1, 'Chile': 229246627569.0, 'Mongolia': 9207583282.1, 'Nicaragua': 10212597587.4, 'Brazil': 2057882976008.9, 'Kenya': 48807995178.8, 'Dominica': 499171357.0, 'Japan': 5217301203153.5, 'India': 1726508317353.4, 'Cabo Verde': 1719693752.3, 'United States': 16108231800000.0, 'Greece': 270091322197.4, 'Myanmar': 50703575766.4, 'Madagascar': 9486935333.5, 'Tunisia': 44118349316.0, 'Mozambique': 12838623643.4, 'Cambodia': 13779735437.1, 'Iraq': 169480789377.9, 'Korea, Rep.': 1205755199135.1, 'Kazakhstan': 168198946242.6, 'Turkmenistan': 29383204467.2, 'Germany': 3582562859622.3, 'Iran, Islamic Rep.': 454617559842.3, 'France': 2702817149305.2, 'Paraguay': 23250819867.6, 'United Kingdom': 2736682446205.8, 'Malaysia': 274833978374.2, 'Philippines': 231981575952.4, 'Qatar': 149455747118.1, 'Lebanon': 40455121214.3, 'Jordan': 29469864942.1, 'Mali': 11894695436.7, 'Zambia': 21393965950.9, 'Australia': 1248467214849.1, 'Singapore': 257026873704.2, 'Zimbabwe': 11907947332.3, 'Sweden': 519174481541.8, 'Nigeria': 375494148527.7, 'China': 7784747168448.6, 'Tanzania': 36731725995.3, 'Czech Republic': 207328405561.6, 'Vietnam': 144579453846.2, 'Latvia': 28908355369.8, 'Spain': 1404296966483.9, 'Brunei Darussalam': 14817756697.0}
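The loop-based answers above can be condensed into a small generator. This is only a minimal sketch: the name average_gdp is made up for illustration, the inline sample reuses the Dominica and Guam values from the question, and it assumes every GDP line consists of digits only. Blank lines are skipped so a trailing newline does not trigger an average over an empty list.

```python
def average_gdp(lines):
    """Yield (country, mean GDP) pairs from lines where values precede the name."""
    values = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.isdigit():
            values.append(int(line))
        else:
            # A non-numeric line is the country name that closes the group
            if values:
                yield line, sum(values) / len(values)
            values = []


# Sample data copied from the question (Dominica and Guam)
sample = """421375852
458190185
489074333
493824407
501025303
485997988
501979277
523666347
535095846
581484032
Dominica
4375000000
4621000000
4781000000
4895000000
4928000000
5199000000
5337000000
5531000000
5697000000
5793000000
Guam
""".splitlines()

for country, mean_gdp in average_gdp(sample):
    print(f"{country}: {mean_gdp:.1f}")
```

To process the real file, the same function can be fed an open file object, for example average_gdp(open("10year-gdp.txt")), since it only needs an iterable of lines.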

Related

pdfplumber | Extract text from dynamic column layouts

Attempted Solution at bottom of post.
I have near-working code that extracts the sentence containing a phrase, across multiple lines.
However, some pages have columns, so the outputs are incorrect: separate texts are wrongly merged into one bad sentence.
This problem has been addressed in the following posts:
Solution 1
Solution 2
Question:
How do I detect, with an if-condition, whether a page has columns?
Pages may not have columns,
Pages may have more than 2 columns.
Pages may also have headers and footers (that can be left out).
Example .pdf with dynamic text layout: PDF (pg. 2).
Jupyter Notebook:
# pip install PyPDF2
# pip install pdfplumber
# ---
import pdfplumber
# ---
def scrape_sentence(phrase, lines, index):
    # -- Gather sentence 'phrase' occurs in --
    sentence = lines[index]
    print("-- sentence --", sentence)
    print("len(lines)", len(lines))
    # Previous lines
    pre_i, flag = index, 0
    while flag == 0:
        pre_i -= 1
        if pre_i <= 0:
            break
        sentence = lines[pre_i] + sentence
        if '.' in lines[pre_i] or '!' in lines[pre_i] or '?' in lines[pre_i] or ' • ' in lines[pre_i]:
            flag = 1  # assignment, so the loop stops at a sentence boundary
    print("\n", sentence)
    # Following lines
    post_i, flag = index, 0
    while flag == 0:
        post_i += 1
        if post_i >= len(lines):
            break
        sentence = sentence + lines[post_i]
        if '.' in lines[post_i] or '!' in lines[post_i] or '?' in lines[post_i] or ' • ' in lines[post_i]:
            flag = 1  # assignment, so the loop stops at a sentence boundary
    print("\n", sentence)
    # -- Extract --
    sentence = sentence.replace('!', '.')
    sentence = sentence.replace('?', '.')
    sentence = sentence.split('.')
    sentence = [s for s in sentence if phrase in s]
    print(sentence)
    sentence = sentence[0].replace('\n', '').strip()  # first occurrence
    print(sentence)
    return sentence
# ---
phrase = 'Gulf Petrochemical Industries Company'
with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        if text is None:
            continue
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if phrase in lines[i]:
                sentence = scrape_sentence(phrase, lines, i)
            i += 1
Example Incorrect Output:
-- sentence -- being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of
len(lines) 47
Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of
Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption. represented by natural gas purchases, empowering bahraini nationals through training & employment, utilisation of local contractors and suppliers, energy consumption and other financial, commercial, environmental and social activities that arise as a part of our core operations within the kingdom.GPIC becomes an organizational stakeholder of Global Reporting for the purpose of clarity throughout this report, Initiative ( GRI) in 2014. By supporting GRI, Organizational ‘gpic’, ’we’ ‘us’, and ‘our’ refer to the gulf Stakeholders (OS) like GPIC, demonstrate their commitment to transparency, accountability and sustainability to a worldwide petrochemical industries company; ‘sabic’ refers to network of multi-stakeholders.the saudi basic industries corporation; ‘pic’ refers to the petrochemical industries company, kuwait; ‘nogaholding’ refers to the oil and gas holding company, kingdom of bahrain; and ‘board’ refers to our board of directors represented by a group formed by nogaholding, sabic and pic.the oil and gas holding company (nogaholding) is GPIC is a Responsible Care Company certified for RC 14001 since July 2010. 
We are committed to the safe, ethical and the business and investment arm of noga (national environmentally sound management of the petrochemicals oil and gas authority) and steward of the bahrain and fertilizers we make and export. Stakeholders’ well-being is government’s investment in the bahrain petroleum always a key priority at GPIC.company (bapco), the bahrain national gas company (banagas), the bahrain national gas expansion company (bngec), the bahrain aviation fuelling company (bafco), the bahrain lube base oil company, the gulf petrochemical industries company (gpic), and tatweer petroleum.GPIC SuStaInabIlIty RePoRt 2016 01ii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
[' being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption']
being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption
...
Attempted Minimal Solution:
This separates the text into 2 columns, regardless of whether the page actually has 2.
# pip install PyPDF2
# pip install pdfplumber
# ---
import pdfplumber
import decimal
# ---
with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
    for page in opened_pdf.pages:
        left = page.crop((0, 0, decimal.Decimal(0.5) * page.width, decimal.Decimal(0.9) * page.height))
        right = page.crop((decimal.Decimal(0.5) * page.width, 0, page.width, page.height))
        l_text = left.extract_text()
        r_text = right.extract_text()
        print("\n -- l_text --", l_text)
        print("\n -- r_text --", r_text)
        text = str(l_text) + " " + str(r_text)
Please let me know if there is anything else I should clarify.
This answer enables you to scrape text, in the intended order.
Towards Data Science article PDF Text Extraction in Python:
Compared with PyPDF2, PDFMiner’s scope is much more limited, it really focuses only on extracting the text from the source information of a pdf file.
from io import StringIO
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_string(file_path):
    output_string = StringIO()
    with open(file_path, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    return output_string.getvalue()
file_path = '' # !
text = convert_pdf_to_string(file_path)
print(text)
Cleansing can be applied thereafter.
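For the original question of deciding whether a page has columns at all, one possible heuristic is to look for vertical gutters: horizontal ranges that no word on the page covers. The sketch below is an assumption-laden illustration, not part of either post. The function name count_columns and the min_gap threshold are hypothetical, and it only relies on each word being a dict with "x0" and "x1" keys, which is the shape pdfplumber's page.extract_words() returns.

```python
def count_columns(words, min_gap=15.0):
    """Estimate the number of text columns from word bounding boxes.

    words: iterable of dicts with "x0" and "x1" (left and right edges).
    min_gap: hypothetical minimum gutter width, in page units.
    """
    intervals = sorted((w["x0"], w["x1"]) for w in words)
    if not intervals:
        return 0
    columns = 1
    covered_until = intervals[0][1]
    for x0, x1 in intervals[1:]:
        # A wide stretch that no word covers is treated as a gutter
        if x0 - covered_until >= min_gap:
            columns += 1
        covered_until = max(covered_until, x1)
    return columns


# Synthetic word boxes: two lines on the left, two words on the right
two_column_words = [
    {"x0": 50, "x1": 120}, {"x0": 125, "x1": 280},
    {"x0": 60, "x1": 200},
    {"x0": 320, "x1": 400}, {"x0": 405, "x1": 550},
]
single_column_words = [{"x0": 50, "x1": 300}, {"x0": 305, "x1": 550}]

print(count_columns(two_column_words))
print(count_columns(single_column_words))
```

With pdfplumber this would be called as count_columns(page.extract_words()). Full-width headers and footers cover the gutters and hide the columns, so cropping them away first (as the attempted solution does with page.crop) is probably necessary.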

I keep getting an error in PyCharm, No tests were found and Empty Suite

I keep getting an error while trying to do a test in PyCharm. It keeps saying No tests were found and empty suite. I've tried to tinker with a bunch of different things and can't seem to get this figured out. Does anyone know what the issue is?
import unittest
from city_functions import city_country
class TestCitiesCase(unittest.TestCase):
    """Tests the combined function in city_functions"""
    def test_city_country(self):
        """Does seattle united states of america work"""
        seattle = city_country('seattle', 'united states of america')
        self.assertEqual(seattle, 'seattle, united states of america')
    def test_city_country_population(self):
        """Does seattle united states of america population 500000 work?"""
        seattle = city_country('seattle', 'united states of america', 500000)
        self.assertEqual(seattle, 'seattle, united states of america, population 500000')
unittest.main()
city_functions.py:
def city_country(city, country, population=''):
    if population:
        city = city.title() + ', ' + country.title() + ', population ' + str(population)
    else:
        city = city.title() + ', ' + country.title()
    return city
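The thread gives no confirmed cause, so the following is only a sketch of a layout that unittest can collect. unittest picks up methods whose names start with test and that are defined inside a TestCase subclass, so lost indentation that moves the methods out of the class is one possible reason for an empty suite. Separately, the expected strings have to match what str.title() returns. Here city_country is copied inline as a stand-in for the city_functions module, and the explicit loader and runner calls replace unittest.main() so the snippet can run from any context.

```python
import unittest


def city_country(city, country, population=''):
    """Stand-in for city_functions.city_country, adapted from the question."""
    result = city.title() + ', ' + country.title()
    if population:
        result += ', population ' + str(population)
    return result


class TestCitiesCase(unittest.TestCase):
    """Tests for city_country; the test methods are indented inside the class."""

    def test_city_country(self):
        seattle = city_country('seattle', 'united states of america')
        # title() capitalizes every word, including "Of"
        self.assertEqual(seattle, 'Seattle, United States Of America')

    def test_city_country_population(self):
        seattle = city_country('seattle', 'united states of america', 500000)
        self.assertEqual(
            seattle, 'Seattle, United States Of America, population 500000')


# Load and run the tests explicitly instead of calling unittest.main()
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCitiesCase)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```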

TypeError: unorderable types: int() < str()

An error occurs when I apply the 5W1H extractor (an open-source library on Git) to my JSON news dataset.
The error is raised in the evaluate_location file when it tries to run
raw_locations.sort(key=lambda x: x[1], reverse=True)
The console then shows
TypeError: unorderable types: int() < str()
My question is: does this mean something is wrong with my dataset format? If so, shouldn't the extractor treat all the news data as one long string when it works on this corpus? I'm eagerly looking for a solution to this problem.
This is one of the json news data:
{
"title": "Football: Van Dijk, Ronaldo and Messi shortlisted for FIFA award",
"body": "ROME: Liverpool centre-back Virgil van Dijk is on the shortlist to add FIFA's best player award to his UEFA Men's Player of the Year honour.The Dutch international denied Cristiano Ronaldo and Lionel Messi for the European title last week and the same trio are in the running for the FIFA accolade to be announced in Milan on September 23. Van Dijk starred in Liverpool's triumphant Champions League campaign.England full-back Lucy Bronze won UEFA's women's award and is on FIFA's shortlist with the United States' World Cup-winning duo Megan Rapinoe and Alex Morgan.Manchester City boss Pep Guardiola is up against Liverpool's Jurgen Klopp and Mauricio Pochettino of Tottenham for best men's coach.Phil Neville, who led England's women to a World Cup semi-final, is up for the women's coach award with the USA's Jill Ellis and Sarina Wiegman who guided European champions the Netherlands to the World Cup final. FIFA Best shortlistsMen's player:Cristiano Ronaldo (Juventus/Portugal), Lionel Messi (Barcelona/Argentina), Virgil van Dijk player:Lucy Bronze (Lyon/England), Alex Morgan (Orlando Pride/USA), Megan Rapinoe (Reign FC/USA)Men's coach:Pep Guardiola (Manchester City), Jurgen Klopp (Liverpool), Mauricio Pochettino (Tottenham)Women's coach:Jill Ellis (USA), Phil Neville (England), Sarina Wiegman (Netherlands)Women's goalkeeper:Christiane Endler (Paris St-Germain/Chile), Hedvig Lindahl (Wolfsburg/Sweden), Sari van Veenendaal (Atletico Madrid/Netherlands)Men's goalkeeper:Alisson (Liverpool/Brazil), Ederson (Manchester City/Brazil), Marc-Andre ter Stegen (Barcelona/Germany)Puskas award (for best goal):Lionel Messi (Barcelona v Real Betis), Juan Quintero (River Plate v Racing Club), Daniel Zsori (Debrecen v Ferencvaros)",
"published_at": "2019-09-02",
}
Code:
import json
import logging
import re
# Import paths as given in the Giveme5W1H README
from Giveme5W1H.extractor.document import Document
from Giveme5W1H.extractor.extractor import MasterExtractor

with open("./Labeled.json","r",encoding="utf-8") as json_file:
    data = json.load(json_file)
if __name__ == '__main__':
    # logger setup
    log = logging.getLogger('GiveMe5W')
    log.setLevel(logging.DEBUG)
    sh = logging.StreamHandler()
    sh.setLevel(logging.DEBUG)
    log.addHandler(sh)
    # giveme5w setup - with defaults
    extractor = MasterExtractor()
    Document()
    for i in range(0,1000):
        body = data[i]["body"]
        #print(body)
        #for line in body:
        #print(line[0:line.find('\n')])
        #head = re.sub("[^A-Z\d]", "", "")
        # the first line of the body is used as the lead paragraph
        head = re.search("^[^\n]*", body).group(0)
        head = str(head)
        title = data[i]["title"]
        title = str(title)
        body = data[i]["body"]
        body = str(body)
        published_at = data[i]["published_at"]
        published_at = str(published_at)
        doc1 = Document(title,head,body,published_at)
        doc = extractor.parse(doc1)
Instead of returning the extracted time and location result, it gave me this error:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.5/dist-packages/Giveme5W1H/extractor/extractor.py", line 20, in run
    extractor.process(document)
  File "/usr/local/lib/python3.5/dist-packages/Giveme5W1H/extractor/extractors/abs_extractor.py", line 41, in process
    self._evaluate_candidates(document)
  File "/usr/local/lib/python3.5/dist-packages/Giveme5W1H/extractor/extractors/environment_extractor.py", line 75, in _evaluate_candidates
    locations = self._evaluate_locations(document)
  File "/usr/local/lib/python3.5/dist-packages/Giveme5W1H/extractor/extractors/environment_extractor.py", line 224, in _evaluate_locations
    raw_locations.sort(key=lambda x: x[1], reverse=True)
TypeError: unorderable types: int() < str()
raw_locations is built in the same file, at line 219:
raw_locations.append([parts, location.raw['place_id'], location.point, bb, area, 0, 0, candidate, 0])
Thus, the sort call orders the locations by their place_id. Please check whether the place_id values in your data include both strings and numbers. If so, you need to convert all entries to one type.
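The failure and the suggested fix can be reproduced in isolation. This is a hedged sketch with made-up rows: only the first two fields of each raw_locations entry are modelled, the place_id values are hypothetical, and it assumes every string place_id is numeric so that int() can normalize it.

```python
# Hypothetical rows shaped like the start of a raw_locations entry:
# [parts, place_id]
raw_locations = [
    ["Rome", 1234],
    ["Milan", "987"],   # a place_id that arrived as a string
    ["Liverpool", 42],
]

mixed_sort_failed = False
try:
    # Same call as in environment_extractor.py: int and str keys get compared
    raw_locations.sort(key=lambda x: x[1], reverse=True)
except TypeError as exc:
    mixed_sort_failed = True
    print("sort failed:", exc)

# Normalizing the key to one type makes the comparison well defined
raw_locations.sort(key=lambda x: int(x[1]), reverse=True)
print([row[0] for row in raw_locations])
```

If some place_id values were not numeric, converting everything with str() would also remove the error, although the ordering would then be lexicographic rather than numeric.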

Is there a way to properly convert data from lists to a CSV file using BeautifulSoup?

I am trying to create a webscraper for a website. The problem is that after the collected data is stored in a list, I'm not able to write this to a csv file properly. I have been stuck for ages with this problem and hopefully someone has an idea about how to fix this one!
The loop to get the data from the web pages:
import csv
from htmlrequest import simple_get
from htmlrequest import BeautifulSoup
# Define variables
listData = ['Companies', 'Locations', 'Descriptions']
plus = 15
max = 30
count = 0
# while loop to repeat process till max is reached
while (count <= max):
    start = 'https://www.companiesintheuk.co.uk/find?q=Activities+of+sport+clubs&start=' + str(count) + '&s=h&t=SicCodeSearch&location=&sicCode=93120'
    raw_html = simple_get(start)
    soup = BeautifulSoup(raw_html, 'html.parser')
    for i, div in enumerate(soup.find_all('div', class_="search_result_title")):
        listData[0] = listData[0].strip() + div.text
    for i, div2 in enumerate(soup.find_all('div', class_="searchAddress")):
        listData[1] = listData[1].strip() + div2.text
    # This is extra information
    # for i, div3 in enumerate(soup.find_all('div', class_="searchSicCode")):
    #     listData[2] = listData[2].strip() + div3.text
    count = count + plus
output example if printed:
Companies
(AMG) AGILITY MANAGEMENT GROUP LTD
(KLA) LIONS/LIONESS FOOTBALL TEAMS WORLD CUP LTD
(Dissolved)
1 SPORT ORGANISATION LIMITED
100UK LTD
1066 GYMNASTICS
1066 SPECIALS
10COACHING LIMITED
147 LOUNGE LTD
147 SNOOKER AND POOL CLUB (LEICESTER) LIMITED
Locations
ENGLAND, BH8 9PS
LONDON, EC2M 2PL
ENGLAND, LS7 3JB
ENGLAND, LE2 8FN
UNITED KINGDOM, N18 2QX
AVON, BS5 0JH
UNITED KINGDOM, WC2H 9JQ
UNITED KINGDOM, SE18 5SZ
UNITED KINGDOM, EC1V 2NX
I've tried to get it into a CSV file by using this code but I can't figure out how to properly format my output! Any suggestions are welcome.
# writing to csv
with open('test.csv', 'w') as csvfile:
    write = csv.writer(csvfile, delimiter=',')
    write.writerow(['Name','Location'])
    write.writerow([listData[0],listData[1]])
print("Writing has been done!")
I want the code to format the data properly in the CSV file so that I can import the two columns into a database.
This is the output when I write the data on 'test.csv'
which will result into this when opened up
The expected outcome would be something like this!
I'm not sure how it is improperly formatted, but maybe you just need to replace with open('test.csv', 'w') by with open('test.csv', 'w+', newline='').
I've combined your code, swapping htmlrequest for the requests and bs4 modules and building my own lists instead of using listData. I've left your lists in, but they do nothing:
import csv
import bs4
import requests
# Define variables
listData = ['Companies', 'Locations', 'Descriptions']
company_list = []
locations_list = []
plus = 15
max = 30
count = 0
# while loop to repeat process till max is reached
while count <= max:
    start = 'https://www.companiesintheuk.co.uk/find?q=Activities+of+sport+clubs&start={}&s=h&t=SicCodeSearch&location=&sicCode=93120'.format(count)
    res = requests.get(start)
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    for i, div in enumerate(soup.find_all('div', class_="search_result_title")):
        listData[0] = listData[0].strip() + div.text
        company_list.append(div.text.strip())
    for i, div2 in enumerate(soup.find_all('div', class_="searchAddress")):
        listData[1] = listData[1].strip() + div2.text
        locations_list.append(div2.text.strip())
    # This is extra information
    # for i, div3 in enumerate(soup.find_all('div', class_="searchSicCode")):
    #     listData[2] = listData[2].strip() + div3.text
    count = count + plus
if len(company_list) == len(locations_list):
    with open('test.csv', 'w+', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        writer.writerow(['Name', 'Location'])
        for i in range(len(company_list)):
            writer.writerow([company_list[i], locations_list[i]])
Which generates a csv file like:
Name,Location
(AMG) AGILITY MANAGEMENT GROUP LTD,"UNITED KINGDOM, M6 6DE"
"(KLA) LIONS/LIONESS FOOTBALL TEAMS WORLD CUP LTD
(Dissolved)","ENGLAND, BD1 2PX"
0161 STUDIOS LTD,"UNITED KINGDOM, HD6 3AX"
1 CLICK SPORTS MANAGEMENT LIMITED,"ENGLAND, E10 5PW"
1 SPORT ORGANISATION LIMITED,"UNITED KINGDOM, CR2 6NF"
100UK LTD,"UNITED KINGDOM, BN14 9EJ"
1066 GYMNASTICS,"EAST SUSSEX, BN21 4PT"
1066 SPECIALS,"EAST SUSSEX, TN40 1HE"
10COACHING LIMITED,"UNITED KINGDOM, SW6 6LR"
10IS ACADEMY LIMITED,"ENGLAND, PE15 9PS"
"10TH MAN LIMITED
(Dissolved)","GLASGOW, G3 6AN"
12 GAUGE EAST MANCHESTER COMMUNITY MMA LTD,"ENGLAND, OL9 8DQ"
121 MAKING WAVES LIMITED,"TYNE AND WEAR, NE30 1AR"
121 WAVES LTD,"TYNE AND WEAR, NE30 1AR"
1-2-KICK LTD,"ENGLAND, BH8 9PS"
"147 HAVANA LIMITED
(Liquidation)","LONDON, EC2M 2PL"
147 LOUNGE LTD,"ENGLAND, LS7 3JB"
147 SNOOKER AND POOL CLUB (LEICESTER) LIMITED,"ENGLAND, LE2 8FN"
1ACTIVE LTD,"UNITED KINGDOM, N18 2QX"
1ON1 KING LTD,"AVON, BS5 0JH"
1PUTT LTD,"UNITED KINGDOM, WC2H 9JQ"
1ST SPORTS LTD,"UNITED KINGDOM, SE18 5SZ"
2 BRO PRO EVENTS LTD,"UNITED KINGDOM, EC1V 2NX"
2 SPLASH SWIM SCHOOL LTD,"ENGLAND, B36 0EY"
2 STEPPERS C.I.C.,"SURREY, CR0 6BX"
2017 MOTO LIMITED,"UNITED KINGDOM, ME2 4NW"
2020 ARCHERY LTD,"LONDON, SE16 6SS"
21 LEISURE LIMITED,"LONDON, EC4M 7WS"
261 FEARLESS CLUB UNITED KINGDOM CIC,"LANCASHIRE, LA2 8RF"
2AIM4 LIMITED,"HERTFORDSHIRE, SG2 0JD"
2POINT4 FM LTD,"LONDON, NW10 8LW"
3 LIONS SCHOOL OF SPORT LTD,"BRISTOL, BS20 8BU"
3 PT LTD,"ANTRIM, BT40 2FB"
3 PUTT LIFE LTD,"UNITED KINGDOM, LU3 2DP"
3 THIRTY SEVEN LTD,"KENT, DA9 9RS"
3:30 SOCCER SCHOOL LTD,"UNITED KINGDOM, EH6 7JB"
30 MINUTE WORKOUT (LLANISHEN) LTD,"PONTYCLUN, CF72 9UA"
321 RELAX LTD,"MID GLAMORGAN, CF83 3HL"
360 MOTOR RACING CLUB LTD,"HALSTEAD, CO9 2ET"
3LIONSATHLETICS LIMITED,"ENGLAND, S3 8DB"
3S SWIM ROMFORD LTD,"UNITED KINGDOM, DA9 9DR"
3XL EVENT MANAGEMENT LIMITED,"KENT, BR3 4NW"
3XL MOTORSPORT MANAGEMENT LIMITED,"KENT, BR3 4NW"
4 CORNER FOOTBALL LTD,"BROMLEY, BR1 5DD"
4 PRO LTD,"UNITED KINGDOM, FY5 5HT"
Which seems fine to me, but your post was very unclear about how you expected it to be formatted so I really have no idea
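The quoting visible in the generated file is the csv module protecting fields that contain commas or line breaks. The sketch below shows this without any scraping. It is an illustration under assumptions: the two sample rows are taken from the output above, an in-memory io.StringIO buffer stands in for test.csv, and zip replaces the index-based loop.

```python
import csv
import io

# Sample values taken from the output above; the second name spans two lines
company_list = ["(AMG) AGILITY MANAGEMENT GROUP LTD", "10TH MAN LIMITED\n(Dissolved)"]
locations_list = ["UNITED KINGDOM, M6 6DE", "GLASGOW, G3 6AN"]

# newline="" leaves line endings to the csv module, as with open(..., newline='')
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Name", "Location"])
# zip pairs each company with its location, one CSV row per pair
writer.writerows(zip(company_list, locations_list))

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields, embedded newline included
rows = list(csv.reader(io.StringIO(csv_text, newline="")))
print(rows)
```

If the database import cannot handle multi-line fields, replacing the line break inside the name before writing (for example with a space) is one option.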

Read in a .txt file as desired dataframe format

I have a txt file that looks like this:
Alabama[edit]
Auburn (Auburn University, Edward Via College of Osteopathic Medicine)
Birmingham (University of Alabama at Birmingham, Birmingham School of
Alaska[edit]
Anchorage[21] (University of Alaska Anchorage)
Fairbanks (University of Alaska Fairbanks)[16]
I want to read in the txt file as a data frame that looks like this:
state    county
Alabama  Auburn
Alabama  Birmingham
Alaska   Anchorage
Alaska   Fairbanks
What I have so far is:
import re
import pandas as pd

university_towns = open('university_towns.txt','r')
df_university_towns = pd.DataFrame(columns=['State','RegionName'])
# compile the patterns once, outside the loop
state_pattern = re.compile(r'\[edit\]')
county_pattern = re.compile(r'\(')  # a literal "(" must be escaped
# loop over each line of the file object
# determine if each line is state or county.
# if the line has [edit], it's state
for line in university_towns:
    state_pattern_m = state_pattern.search(line)
    county_pattern_m = county_pattern.search(line)
    if state_pattern_m:
        #extract everything before \[edit]
        print(state_pattern_m.start())
        end_position = state_pattern_m.start()
        print(line[0:end_position])
        state_name = line[0:end_position]
    if county_pattern_m:
        #extract everything before (
        pass
This code will only give me something like this:
State County
Alabama Auburn
Birmingham
.
.
.
This should do it:
key = None
# t is the open file (or any iterable of lines)
for line in t:
    if '[edit]' in line:
        key = line.replace('[edit]', '').strip()
        continue
    if key:
        # Use regex to extract what you need
        print(key, line.split(' ')[0])
I'm not sure what your data looks like, so change the regex to remove the [] from the title (guessing it's a title), and possibly use a regex in place of the '[edit]' in check.
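Putting the pieces together, the state and town pairs can be collected into a list of rows first. This is a minimal sketch under assumptions: parse_towns is a hypothetical helper name, the sample lines are the ones from the question, and the town name is taken to be everything before the first "(" or "[".

```python
import re


def parse_towns(lines):
    """Return (state, town) tuples; lines containing '[edit]' name the state."""
    rows = []
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if '[edit]' in line:
            state = line.replace('[edit]', '').strip()
            continue
        if state is not None:
            # Keep the text before the first "(" or "[" as the town name
            town = re.split(r'\s*[(\[]', line, maxsplit=1)[0].strip()
            rows.append((state, town))
    return rows


sample = [
    "Alabama[edit]",
    "Auburn (Auburn University, Edward Via College of Osteopathic Medicine)",
    "Birmingham (University of Alabama at Birmingham, Birmingham School of",
    "Alaska[edit]",
    "Anchorage[21] (University of Alaska Anchorage)",
    "Fairbanks (University of Alaska Fairbanks)[16]",
]

rows = parse_towns(sample)
for state, town in rows:
    print(state, town)
```

The resulting list can then be handed to pandas, for example pd.DataFrame(rows, columns=['State', 'RegionName']), to get the two-column frame described in the question.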
