Read in a .txt file as desired dataframe format

I have a txt file that looks like this:
Alabama[edit]
Auburn (Auburn University, Edward Via College of Osteopathic Medicine)
Birmingham (University of Alabama at Birmingham, Birmingham School of
Alaska[edit]
Anchorage[21] (University of Alaska Anchorage)
Fairbanks (University of Alaska Fairbanks)[16]
I want to read the txt file into a data frame that looks like this:
state    county
Alabama  Auburn
Alabama  Birmingham
Alaska   Anchorage
Alaska   Fairbanks
What I have so far is:
import re
import pandas as pd

university_towns = open('university_towns.txt', 'r')
df_university_towns = pd.DataFrame(columns=['State', 'RegionName'])
# compile the patterns once, outside the loop
state_pattern = re.compile(r'\[edit\]')
county_pattern = re.compile(r'\(')  # a literal "(" must be escaped in a regex
# loop over each line of the file object
# determine if each line is state or county.
# if the line has [edit], it's state
for line in university_towns:
    state_pattern_m = state_pattern.search(line)
    county_pattern_m = county_pattern.search(line)
    if state_pattern_m:
        # extract everything before [edit]
        print(state_pattern_m.start())
        end_position = state_pattern_m.start()
        print(line[0:end_position])
        state_name = line[0:end_position]
    if county_pattern_m:
        # extract everything before (
        pass  # not written yet
This code will only give me something like this:
State    County
Alabama  Auburn
         Birmingham
.
.
.

This should do it:
key = None
with open('university_towns.txt') as t:
    for line in t:
        if '[edit]' in line:
            # a state line: remember the state name and move on
            key = line.replace('[edit]', '').strip()
            continue
        if key:
            # Use a regex here to extract exactly what you need
            print(key, line.split(' ')[0])
I'm not sure what your data looks like, so change the regex to remove the [] markers from the title (I'm guessing it's a title), and possibly use a regex in place of the '[edit]' in line check.
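The approach above can be sketched a little further to collect (state, town) pairs instead of printing them. This is only a sketch under assumptions: the function name parse_university_towns and the sample list are made up for illustration, and footnote markers are assumed to look like [21].

```python
import re

def parse_university_towns(lines):
    """Return a list of (state, town) pairs from the raw lines."""
    rows = []
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if '[edit]' in line:
            # A state header: remember it for the town lines that follow.
            state = line.replace('[edit]', '').strip()
            continue
        if state is None:
            continue  # a town line before any state header, skip it
        # Keep the text before the first "(" and drop markers such as [21].
        town = line.split('(')[0]
        town = re.sub(r'\[\d+\]', '', town).strip()
        rows.append((state, town))
    return rows

# A few lines copied from the question, used as stand-in input.
sample = [
    "Alabama[edit]",
    "Auburn (Auburn University, Edward Via College of Osteopathic Medicine)",
    "Alaska[edit]",
    "Anchorage[21] (University of Alaska Anchorage)",
    "Fairbanks (University of Alaska Fairbanks)[16]",
]
rows = parse_university_towns(sample)
print(rows)
```

If pandas is available, passing the result to pd.DataFrame(rows, columns=['State', 'RegionName']) should give the two-column frame the question asks for.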

Related

Is there a better way to find specific value in a python dictionary like in list?

I have been practicing on iterating through dictionary and list in Python.
The source file is a csv document containing Country and Capital. It seems I had to go through 2 for loops for country_dict in order to produce the same print result for country_list and capital_list.
Is there a better way to do this in Python dictionary?
The code:
import csv
path = 'Path_to_CSV_File'  # placeholder: replace with the path to the CSV file
country_list = []
capital_list = []
country_dict = {'Country': [], 'Capital': []}
with open(path, mode='r') as data:
    for line in csv.DictReader(data):
        locals().update(line)
        country_dict['Country'].append(Country)
        country_dict['Capital'].append(Capital)
        country_list.append(Country)
        capital_list.append(Capital)
i = 14  # set pointer value to the 15th row in the csv document
#---------------------- Iterating through Dictionary using for loops---------------------------
if i >= (len(country_dict['Country'])-1):
    print("out of bound")
for count1, element in enumerate(country_dict['Country']):
    if count1 == i:
        print('Country = ' + element)
for count2, element in enumerate(country_dict['Capital']):
    if count2 == i:
        print('Capital = ' + element)
#--------------------------------Direct print for list----------------------------------------
print('Country = ' + country_list[i] + '\nCapital = ' + capital_list[i])
The output:
Country = Djibouti
Capital = Djibouti (city)
Country = Djibouti
Capital = Djibouti (city)
The CSV file content:
Country,Capital
Algeria,Algiers
Angola,Luanda
Benin,Porto-Novo
Botswana,Gaborone
Burkina Faso,Ouagadougou
Burundi,Gitega
Cabo Verde,Praia
Cameroon,Yaounde
Central African Republic,Bangui
Chad,N'Djamena
Comoros,Moroni
"Congo, Democratic Republic of the",Kinshasa
"Congo, Republic of the",Brazzaville
Cote d'Ivoire,Yamoussoukro
Djibouti,Djibouti (city)
Egypt,Cairo
Equatorial Guinea,"Malabo (de jure), Oyala (seat of government)"
Eritrea,Asmara
Eswatini (formerly Swaziland),"Mbabane (administrative), Lobamba (legislative, royal)"
Ethiopia,Addis Ababa
Gabon,Libreville
Gambia,Banjul
Ghana,Accra
Guinea,Conakry
Guinea-Bissau,Bissau
Kenya,Nairobi
Lesotho,Maseru
Liberia,Monrovia
Libya,Tripoli
Madagascar,Antananarivo
Malawi,Lilongwe
Mali,Bamako
Mauritania,Nouakchott
Mauritius,Port Louis
Morocco,Rabat
Mozambique,Maputo
Namibia,Windhoek
Niger,Niamey
Nigeria,Abuja
Rwanda,Kigali
Sao Tome and Principe,São Tomé
Senegal,Dakar
Seychelles,Victoria
Sierra Leone,Freetown
Somalia,Mogadishu
South Africa,"Pretoria (administrative), Cape Town (legislative), Bloemfontein (judicial)"
South Sudan,Juba
Sudan,Khartoum
Tanzania,Dodoma
Togo,Lomé
Tunisia,Tunis
Uganda,Kampala
Zambia,Lusaka
Zimbabwe,Harare

I am not sure I get your point, but please check out this code:
import csv
path = 'Path_to_CSV_File'  # placeholder: replace with the path to the CSV file
country_dict = {}
with open(path, mode='r') as data:
    lines = csv.DictReader(data)
    for idx, line in enumerate(lines):
        # store each row under its zero-based index
        country_dict[idx] = {"Country": line["Country"], "Capital": line["Capital"]}
i = 14  # set pointer value to the 15th row in the csv document
#---------------------- Looking the row up by index instead of looping ---------------------------
country_info = country_dict.get(i)
#--------------------------------Direct print----------------------------------------
print('Country = ' + country_info['Country'] + '\nCapital = ' + country_info['Capital'])
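For comparison, the dictionary of lists from the question can also be indexed directly, since each value is an ordinary list. The following is a minimal sketch, assuming a small in-memory sample (io.StringIO) in place of the real CSV file:

```python
import csv
import io

# Stand-in for the CSV file, using a few rows from the question.
sample = io.StringIO(
    "Country,Capital\n"
    "Algeria,Algiers\n"
    "Angola,Luanda\n"
    '"Congo, Republic of the",Brazzaville\n'
)

country_dict = {'Country': [], 'Capital': []}
for row in csv.DictReader(sample):
    country_dict['Country'].append(row['Country'])
    country_dict['Capital'].append(row['Capital'])

i = 2  # zero-based row index (the third data row)
if 0 <= i < len(country_dict['Country']):
    # The values are plain lists, so no loop is needed to reach row i.
    country = country_dict['Country'][i]
    capital = country_dict['Capital'][i]
    print('Country = ' + country + '\nCapital = ' + capital)
else:
    print('out of bound')
```

Reading the fields as row['Country'] also avoids the locals().update(line) trick, which only works at module level.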

Search a series for a word. Return that word and N others in a new column?

Okay, I need help. I created a function to search a string for a specific word. If the function finds the search_word, it returns that word and the N words that precede it. The function works fine with my test strings, but I cannot figure out how to apply it to an entire series.
My goal is to create a new column in the data frame that contains the n_words_prior whenever the search_word exists.
n_words_prior = []
test = "New School District, Dale County"
def n_before_string(string, search_word, N):
    global n_words_prior
    n_words_prior = []
    found_word = string.find(search_word)
    if found_word == -1:
        return ""
    sentence = string[0:found_word]
    n_words_prior = sentence.split()[N:]
    n_words_prior.append(search_word)
    return n_words_prior
The current dataframe looks like this:
data = [['Alabama', 'New School District, Dale County'],
        ['Alaska', 'Matanuska-Susitna Borough'],
        ['Arizona', 'Pima County - Tuscon Unified School District']]
df = pd.DataFrame(data, columns=['State', 'Place'])
The improved function would take the inputs 'Place','County',-1 and create the following result.
improved_function(column, search_word, N)
new_data = [['Alabama', 'New School District, Dale County', 'Dale County'],
            ['Alaska', 'Matanuska-Susitna Borough', ''],
            ['Arizona', 'Pima County - Tuscon Unified School District', 'Pima County']]
new_df = pd.DataFrame(new_data, columns=['State', 'Place', 'Result'])
I thought embedding this function would help, but it has only made things more confusing.
def fast_add(place, search_word):
    # flag rows whose text contains search_word (1 if found, else 0)
    df[search_word] = df[place].str.contains(search_word).apply(lambda found: 1 if found else 0)

def fun(sentence, search_word, n):
    """Return search_word and n preceding words from sentence."""
    words = sentence.split()
    for i, word in enumerate(words):
        if word == search_word:
            # clamp the start so a negative index cannot wrap around
            return ' '.join(words[max(i - n, 0):i + 1])
    return ''
Example:
df['Result'] = df.Place.apply(lambda x: fun(x, 'County', 1))
Result:
   State    Place                                         Result
0  Alabama  New School District, Dale County              Dale County
1  Alaska   Matanuska-Susitna Borough
2  Arizona  Pima County - Tuscon Unified School District  Pima County
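One thing to watch with an exact word comparison is punctuation: a token such as "County," would not equal "County". Below is a hedged variant that strips punctuation before comparing; the name words_before and the extra test string are my own, not part of the answer above.

```python
import string

def words_before(sentence, search_word, n):
    """Return search_word plus up to n words before it, or '' if absent."""
    words = sentence.split()
    for i, word in enumerate(words):
        # Compare without surrounding punctuation so "County," still matches.
        if word.strip(string.punctuation) == search_word:
            start = max(i - n, 0)  # never slice from a negative index
            picked = words[start:i] + [search_word]
            return ' '.join(picked)
    return ''

places = [
    'New School District, Dale County',
    'Matanuska-Susitna Borough',
    'Pima County - Tuscon Unified School District',
    'County, unnamed',  # made-up edge case: match at position 0 with a comma
]
results = [words_before(p, 'County', 1) for p in places]
print(results)
```

With a DataFrame, the same function could be applied as df['Place'].apply(lambda x: words_before(x, 'County', 1)).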

Is there a way to properly convert data from lists to a CSV file using BeautifulSoup?

I am trying to create a webscraper for a website. The problem is that after the collected data is stored in a list, I'm not able to write this to a csv file properly. I have been stuck for ages with this problem and hopefully someone has an idea about how to fix this one!
The loop to get the data from the web pages:
import csv
from htmlrequest import simple_get
from htmlrequest import BeautifulSoup
# Define variables
listData = ['Companies', 'Locations', 'Descriptions']
plus = 15
max = 30
count = 0
# while loop to repeat process till max is reached
while (count <= max):
    start = 'https://www.companiesintheuk.co.uk/find?q=Activities+of+sport+clubs&start=' + str(count) + '&s=h&t=SicCodeSearch&location=&sicCode=93120'
    raw_html = simple_get(start)
    soup = BeautifulSoup(raw_html, 'html.parser')
    for i, div in enumerate(soup.find_all('div', class_="search_result_title")):
        listData[0] = listData[0].strip() + div.text
    for i, div2 in enumerate(soup.find_all('div', class_="searchAddress")):
        listData[1] = listData[1].strip() + div2.text
    # This is extra information
    # for i, div3 in enumerate(soup.find_all('div', class_="searchSicCode")):
    #     listData[2] = listData[2].strip() + div3.text
    count = count + plus
Example output when printed:
Companies
(AMG) AGILITY MANAGEMENT GROUP LTD
(KLA) LIONS/LIONESS FOOTBALL TEAMS WORLD CUP LTD
(Dissolved)
1 SPORT ORGANISATION LIMITED
100UK LTD
1066 GYMNASTICS
1066 SPECIALS
10COACHING LIMITED
147 LOUNGE LTD
147 SNOOKER AND POOL CLUB (LEICESTER) LIMITED
Locations
ENGLAND, BH8 9PS
LONDON, EC2M 2PL
ENGLAND, LS7 3JB
ENGLAND, LE2 8FN
UNITED KINGDOM, N18 2QX
AVON, BS5 0JH
UNITED KINGDOM, WC2H 9JQ
UNITED KINGDOM, SE18 5SZ
UNITED KINGDOM, EC1V 2NX
I've tried to get it into a CSV file by using this code but I can't figure out how to properly format my output! Any suggestions are welcome.
# writing to csv
with open('test.csv', 'w') as csvfile:
    write = csv.writer(csvfile, delimiter=',')
    write.writerow(['Name','Location'])
    write.writerow([listData[0],listData[1]])
print("Writing has been done!")
I want the code to be able to format it properly in the csv file to be able to import the two rows in a database.
The original post showed three screenshots here (not preserved): the raw contents of 'test.csv', the file as it appears when opened, and the expected outcome.

I'm not sure in what way it is improperly formatted, but maybe you just need to open the file with open('test.csv', 'w+', newline='') instead of open('test.csv', 'w').
I've combined your code, swapping htmlrequest for the requests and bs4 modules, and building my own lists instead of using listData (I've left your listData lines in, but they do nothing):
import csv
import bs4
import requests
# Define variables
listData = ['Companies', 'Locations', 'Descriptions']
company_list = []
locations_list = []
plus = 15
max = 30
count = 0
# while loop to repeat process till max is reached
while count <= max:
    start = 'https://www.companiesintheuk.co.uk/find?q=Activities+of+sport+clubs&start={}&s=h&t=SicCodeSearch&location=&sicCode=93120'.format(count)
    res = requests.get(start)
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    for i, div in enumerate(soup.find_all('div', class_="search_result_title")):
        listData[0] = listData[0].strip() + div.text
        company_list.append(div.text.strip())
    for i, div2 in enumerate(soup.find_all('div', class_="searchAddress")):
        listData[1] = listData[1].strip() + div2.text
        locations_list.append(div2.text.strip())
    # This is extra information
    # for i, div3 in enumerate(soup.find_all('div', class_="searchSicCode")):
    #     listData[2] = listData[2].strip() + div3.text
    count = count + plus
# only write the file if every company has a matching location
if len(company_list) == len(locations_list):
    with open('test.csv', 'w+', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        writer.writerow(['Name', 'Location'])
        for i in range(len(company_list)):
            writer.writerow([company_list[i], locations_list[i]])
Which generates a csv file like:
Name,Location
(AMG) AGILITY MANAGEMENT GROUP LTD,"UNITED KINGDOM, M6 6DE"
"(KLA) LIONS/LIONESS FOOTBALL TEAMS WORLD CUP LTD
(Dissolved)","ENGLAND, BD1 2PX"
0161 STUDIOS LTD,"UNITED KINGDOM, HD6 3AX"
1 CLICK SPORTS MANAGEMENT LIMITED,"ENGLAND, E10 5PW"
1 SPORT ORGANISATION LIMITED,"UNITED KINGDOM, CR2 6NF"
100UK LTD,"UNITED KINGDOM, BN14 9EJ"
1066 GYMNASTICS,"EAST SUSSEX, BN21 4PT"
1066 SPECIALS,"EAST SUSSEX, TN40 1HE"
10COACHING LIMITED,"UNITED KINGDOM, SW6 6LR"
10IS ACADEMY LIMITED,"ENGLAND, PE15 9PS"
"10TH MAN LIMITED
(Dissolved)","GLASGOW, G3 6AN"
12 GAUGE EAST MANCHESTER COMMUNITY MMA LTD,"ENGLAND, OL9 8DQ"
121 MAKING WAVES LIMITED,"TYNE AND WEAR, NE30 1AR"
121 WAVES LTD,"TYNE AND WEAR, NE30 1AR"
1-2-KICK LTD,"ENGLAND, BH8 9PS"
"147 HAVANA LIMITED
(Liquidation)","LONDON, EC2M 2PL"
147 LOUNGE LTD,"ENGLAND, LS7 3JB"
147 SNOOKER AND POOL CLUB (LEICESTER) LIMITED,"ENGLAND, LE2 8FN"
1ACTIVE LTD,"UNITED KINGDOM, N18 2QX"
1ON1 KING LTD,"AVON, BS5 0JH"
1PUTT LTD,"UNITED KINGDOM, WC2H 9JQ"
1ST SPORTS LTD,"UNITED KINGDOM, SE18 5SZ"
2 BRO PRO EVENTS LTD,"UNITED KINGDOM, EC1V 2NX"
2 SPLASH SWIM SCHOOL LTD,"ENGLAND, B36 0EY"
2 STEPPERS C.I.C.,"SURREY, CR0 6BX"
2017 MOTO LIMITED,"UNITED KINGDOM, ME2 4NW"
2020 ARCHERY LTD,"LONDON, SE16 6SS"
21 LEISURE LIMITED,"LONDON, EC4M 7WS"
261 FEARLESS CLUB UNITED KINGDOM CIC,"LANCASHIRE, LA2 8RF"
2AIM4 LIMITED,"HERTFORDSHIRE, SG2 0JD"
2POINT4 FM LTD,"LONDON, NW10 8LW"
3 LIONS SCHOOL OF SPORT LTD,"BRISTOL, BS20 8BU"
3 PT LTD,"ANTRIM, BT40 2FB"
3 PUTT LIFE LTD,"UNITED KINGDOM, LU3 2DP"
3 THIRTY SEVEN LTD,"KENT, DA9 9RS"
3:30 SOCCER SCHOOL LTD,"UNITED KINGDOM, EH6 7JB"
30 MINUTE WORKOUT (LLANISHEN) LTD,"PONTYCLUN, CF72 9UA"
321 RELAX LTD,"MID GLAMORGAN, CF83 3HL"
360 MOTOR RACING CLUB LTD,"HALSTEAD, CO9 2ET"
3LIONSATHLETICS LIMITED,"ENGLAND, S3 8DB"
3S SWIM ROMFORD LTD,"UNITED KINGDOM, DA9 9DR"
3XL EVENT MANAGEMENT LIMITED,"KENT, BR3 4NW"
3XL MOTORSPORT MANAGEMENT LIMITED,"KENT, BR3 4NW"
4 CORNER FOOTBALL LTD,"BROMLEY, BR1 5DD"
4 PRO LTD,"UNITED KINGDOM, FY5 5HT"
Which seems fine to me, but your post was very unclear about how you expected it to be formatted so I really have no idea
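Some names in the output above span two lines, for example "10TH MAN LIMITED" followed by "(Dissolved)", because the scraped text contains a newline. If one record per physical line is what a database import needs, the whitespace could be collapsed before writing. This is a sketch only: it writes to an in-memory buffer rather than test.csv, uses two hard-coded rows instead of scraped data, and the helper name one_line is hypothetical.

```python
import csv
import io

# Hard-coded stand-ins for the scraped lists.
companies = ['(AMG) AGILITY MANAGEMENT GROUP LTD', '10TH MAN LIMITED\n(Dissolved)']
locations = ['UNITED KINGDOM, M6 6DE', 'GLASGOW, G3 6AN']

def one_line(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return ' '.join(text.split())

# newline='' stops Python from translating the writer's own line endings,
# the same reason it is passed to open() for real files.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['Name', 'Location'])
for name, location in zip(companies, locations):
    writer.writerow([one_line(name), one_line(location)])

csv_text = buffer.getvalue()
print(csv_text)
```

Using zip pairs the two lists row by row, so no index arithmetic is needed; fields that contain a comma are quoted by csv.writer automatically.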

Calculating average of data set, with text mixed [closed]

I am required to write a Python program that reads a file and calculates the average GDP for each country over the 10-year period.
Basically, my desired result is:
Australia: 1248467214849.1
Azerbaijan: 55506365440.0
Bangladesh: 139036345780.9
Brazil: 2057882976008.9
Brunei Darussalam: 14817756697.0
Burkina Faso: 10081729086.1
Cabo Verde: 1719693752.3
Cambodia: 13779735437.1
Chile: 229246627569.0
China: 7784747168448.6
Czech Republic: 207328405561.6
Dominica: 499171357.0
Egypt, Arab Rep.: 247614743339.3
France: 2702817149305.2
Germany: 3582562859622.3
Greece: 270091322197.4
Guam: 5115700000.0
India: 1726508317353.4
Iran, Islamic Rep.: 454617559842.3
Iraq: 169480789377.9
Japan: 5217301203153.5
Jordan: 29469864942.1
Kazakhstan: 168198946242.6
Kenya: 48807995178.8
Korea, Rep.: 1205755199135.1
Latvia: 28908355369.8
Lebanon: 40455121214.3
Lithuania: 42763449721.2
Madagascar: 9486935333.5
Malaysia: 274833978374.2
Mali: 11894695436.7
Mongolia: 9207583282.1
Mozambique: 12838623643.4
Myanmar: 50703575766.4
Nicaragua: 10212597587.4
Nigeria: 375494148527.7
Paraguay: 23250819867.6
Philippines: 231981575952.4
Qatar: 149455747118.1
Singapore: 257026873704.2
Spain: 1404296966483.9
Sweden: 519174481541.8
Tanzania: 36731725995.3
Tunisia: 44118349316.0
Turkmenistan: 29383204467.2
United Kingdom: 2736682446205.8
United States: 16108231800000.0
Vietnam: 144579453846.2
Zambia: 21393965950.9
Zimbabwe: 11907947332.3
and the provided text file is as:
853764622753
1055334825425
927168311000
1142876772659
1390557034408
1538194473087
1567178619062
1459597906913
1345383143356
1204616439828
Australia
33050343783
48852482960
44291490421
52902703376
65951627200
69684317719
74164435946
75244166773
53074370486
37847715736
Azerbaijan
79611888213
91631278239
102477791472
115279077465
128637938711
133355749482
149990451022
172885454931
195078665828
221415162446
Bangladesh
1397084381901
1695824517396
1667019605882
2208871646203
2616201578192
2465188674415
2472806919902
2455993200170
1803652649614
1796186586414
Brazil
12247694247
14393099069
10732366286
13707370737
18525319978
19048495519
18093829923
17098342541
12930394938
11400653732
Brunei Darussalam
6771277871
8369637065
8369175126
8979966766
10724063458
11166063467
11947176342
12377391463
10419303761
11693235542
Burkina Faso
1513934037
1789333749
1711817182
1664310770
1864824081
1751888562
1850951315
1858121723
1574288668
1617467436
Cabo Verde
8639235842
10351914093
10401851851
11242275199
12829541141
14038383450
15449630419
16777820333
18049954289
20016747754
Cambodia
173605968179
179638496279
172389498445
218537551220
252251992029
267122320057
278384332694
260990299051
242517905162
247027912574
Chile
3552182311653
4598206091384
5109953609257
6100620488868
7572553836875
8560547314679
9607224481533
10482372109962
11064666282626
11199145157649
China
189227050760
235718586901
206179982164
207477857919
227948349666
207376427021
209402444996
207818330724
186829940546
195305084919
Czech Republic
421375852
458190185
489074333
493824407
501025303
485997988
501979277
523666347
535095846
581484032
Dominica
130478960092
162818181818
188982374701
218888324505
236001858960
279372758362
288586231502
305529656458
332698041031
332791045964
Egypt, Arab Rep.
2663112510266
2923465651091
2693827452070
2646837111795
2862680142625
2681416108537
2808511203185
2849305322685
2433562015516
2465453975282
France
3439953462907
3752365607148
3418005001389
3417094562649
3757698281118
3543983909148
3752513503278
3890606893347
3375611100742
3477796274497
Germany
318497936901
354460802549
330000252153
299361576558
287797822093
245670666639
239862011450
237029579261
195541761243
192690813127
Greece
4375000000
4621000000
4781000000
4895000000
4928000000
5199000000
5337000000
5531000000
5697000000
5793000000
Guam
1201111768409
1186952757636
1323940295875
1656617073124
1823049927771
1827637859136
1856722121395
2035393459979
2089865410868
2263792499341
India
349881601459
406070949554
414059094949
487069570464
583500357530
598853401276
467414852231
434474616832
385874474399
418976679729
Iran, Islamic Rep.
88840050497
131613661510
111660855043
138516722650
185749664444
218000986223
234648370497
234648370497
179640210726
171489001692
Iraq
4515264514431
5037908465114
5231382674594
5700098114744
6157459594824
6203213121334
5155717056271
4848733415524
4383076298082
4940158776617
Japan
17110587447
21972004086
23820230000
26425379437
28840263380
30937277606
33593843662
35826925775
37517410282
38654727746
Jordan
104849886826
133441612247
115308661143
148047348241
192626507972
207998568866
236634552078
221415572820
184388432149
137278320084
Kazakhstan
31958195182
35895153328
37021512049
39999659234
41953433591
50412754822
55097343448
61445345999
63767539357
70529014778
Kenya
1122679154632
1002219052968
901934953365
1094499338703
1202463682634
1222807284485
1305604981272
1411333926201
1382764027114
1411245589977
Korea, Rep.
30901399261
35596016664
26169854045
23757368290
28223552825
28119996053
30314363219
31419072948
27009231911
27572698482
Latvia
24577114428
29227350570
35477118070
38419626628
40075674163
43868565282
46014226808
47833413749
49459296463
49598825982
Lebanon
39738180077
47850551149
37440673478
37120517694
43476878139
42847900766
46473646002
48545251796
41402022148
42738875963
Lithuania
7342923489
9413002921
8550363975
8729936136
9892702358
9919780071
10601690872
10673516673
9744243420
10001193420
Madagascar
193547824063
230813597938
202257586268
255016609233
297951960784
314443149443
323277158907
338061963396
296434003329
296535930381
Malaysia
8145694632
9750822511
10181021770
10678749467
12978107561
12442747897
13246412031
14388360064
13100058100
14034980334
Mali
4234999823
5623216449
4583850368
7189481824
10409797649
12292770631
12582122604
12226514722
11749620620
11183458131
Mongolia
9366742309
11494837053
10911698208
10154238250
13131168012
14534278446
16018848991
16961127046
14798439527
11014858592
Mozambique
20182477481
31862554102
36906181381
49540813342
59977326086
59937797559
60269734045
65446402659
59687373958
63225097051
Myanmar
7423377429
8496965842
8298695145
8758622329
9774316692
10532001130
10982972256
11880438824
12747741540
13230844687
Nicaragua
166451213396
208064753766
169481317540
369062464570
411743801712
460953836444
514966287207
568498937588
481066152889
404652720165
Nigeria
13794910634
18504130753
15929902138
20030528043
25099681461
24595319574
28965906502
30881166852
27282581336
27424071383
Paraguay
149359920006
174195135053
168334599538
199590775190
224143083707
250092093548
271836123724
284584522899
292774099014
304905406845
Philippines
79712087912
115270054945
97798351648
125122306346
167775274725
186833516484
198727747253
206224725275
164641483516
152451923077
Qatar
179981288567
192225881688
192408387762
236421782178
275599459374
289162118909
302510668904
308142766948
296840704102
296975678610
Singapore
1479341637011
1635015380108
1499099749931
1431616749640
1488067258325
1336018949806
1361854206549
1376910811041
1197789902774
1237255019654
Spain
487816328342
513965650650
429657033108
488377689565
563109663291
543880647757
578742001488
573817719109
497918109302
514459972806
Sweden
21501741757
27368386358
28573777052
31407908612
33878631649
39087748240
44333456245
48197218327
45628320606
47340071107
Tanzania
38908069299
44856586316
43454935940
44050929160
45810626509
45044112939
46251061734
47587913059
43156708809
42062549395
Tunisia
12664165103
19271523179
20214385965
22583157895
29233333333
35164210526
39197543860
43524210526
35799628571
36179885714
Turkmenistan
3074359743898
2890564338235
2382825985356
2441173394730
2619700404733
2662085168499
2739818680930
3022827781881
2885570309161
2647898654635
United Kingdom
14477635000000
14718582000000
14418739000000
14964372000000
15517926000000
16155255000000
16691517000000
17393103000000
18120714000000
18624475000000
United States
77414425532
99130304099
106014659770
115931749697
135539438560
155820001920
171222025117
186204652922
193241108710
205276172135
Vietnam
14056957976
17910858638
15328342304
20265556274
23460098340
25503370699
28045460442
27150630607
21154394546
21063989683
Zambia
5291950100
4415702800
8621573608
10141859710
12098450749
14242490252
15451768659
15891049236
16304667807
16619960402
Zimbabwe
So what I have thought of so far is:
to use an aggregation loop that checks whether the current line is a GDP value or the name of a country: when it reaches the name of a country it should calculate the average and print out the result, then it should reset the per-country aggregation variables and continue looping to aggregate the next country's GDP values.
And so to handle the mixed nature of the input file, I would either use the str.isnumeric() method or keep a counter to check when 10 GDP values have been read (since the next line would then be the name of the corresponding country).
for value in open("10year-gdp.txt"):

Something like this in Python 3 may work:
import statistics
with open('10year-gdp.txt') as f:
    items = []
    for line in f.readlines():
        line = line.strip()
        if line.isdigit():
            # a GDP value: collect it for the current country
            items.append(float(line))
        else:
            # a country name: report the mean and start a new group
            print('{0}: {1}'.format(line, statistics.mean(items)))
            items = []

You can try this one too:
with open("10year-gdp.txt", "r") as infile:
    content = infile.readlines()
# split into chunks of 11 lines: 10 GDP values followed by the country name
content = [content[i:i+11] for i in range(0,len(content),11)]
results = [": ".join([c[10],str(sum(map(float,c[0:10]))/10)]).replace("\n","") for c in content]
for result in results:
    print(result)
Output:
Australia: 1248467214849.1
Azerbaijan: 55506365440.0
Bangladesh: 139036345780.9
Brazil: 2057882976008.9
Brunei Darussalam: 14817756697.0
Burkina Faso: 10081729086.1
Cabo Verde: 1719693752.3
Cambodia: 13779735437.1
Chile: 229246627569.0
China: 7784747168448.6
Czech Republic: 207328405561.6
Dominica: 499171357.0
Egypt, Arab Rep.: 247614743339.3
France: 2702817149305.2
Germany: 3582562859622.3
Greece: 270091322197.4
Guam: 5115700000.0
India: 1726508317353.4
Iran, Islamic Rep.: 454617559842.3
Iraq: 169480789377.9
Japan: 5217301203153.5
Jordan: 29469864942.1
Kazakhstan: 168198946242.6
Kenya: 48807995178.8
Korea, Rep.: 1205755199135.1
Latvia: 28908355369.8
Lebanon: 40455121214.3
Lithuania: 42763449721.2
Madagascar: 9486935333.5
Malaysia: 274833978374.2
Mali: 11894695436.7
Mongolia: 9207583282.1
Mozambique: 12838623643.4
Myanmar: 50703575766.4
Nicaragua: 10212597587.4
Nigeria: 375494148527.7
Paraguay: 23250819867.6
Philippines: 231981575952.4
Qatar: 149455747118.1
Singapore: 257026873704.2
Spain: 1404296966483.9
Sweden: 519174481541.8
Tanzania: 36731725995.3
Tunisia: 44118349316.0
Turkmenistan: 29383204467.2
United Kingdom: 2736682446205.8
United States: 16108231800000.0
Vietnam: 144579453846.2
Zambia: 21393965950.9
Zimbabwe: 11907947332.3

#!/usr/bin/env python
from statistics import mean
GDPGroup = []
GDPDictionary = {}
with open("10year-gdp.txt") as FileObject:
    lines = FileObject.readlines()
for line in lines:
    line = line.strip()
    if not line.isdigit():
        # a country name closes the group of values read so far
        GDPDictionary[line] = GDPGroup
        GDPGroup = []
    else:
        GDPGroup.append(float(line))
# replace each list of values with its mean
for key in GDPDictionary:
    array = GDPDictionary[key]
    GDPDictionary[key] = mean(array)
print(GDPDictionary)
Prints out:
{'Guam': 5115700000.0, 'Lithuania': 42763449721.2, 'Azerbaijan': 55506365440.0, 'Bangladesh': 139036345780.9, 'Egypt, Arab Rep.': 247614743339.3, 'Burkina Faso': 10081729086.1, 'Chile': 229246627569.0, 'Mongolia': 9207583282.1, 'Nicaragua': 10212597587.4, 'Brazil': 2057882976008.9, 'Kenya': 48807995178.8, 'Dominica': 499171357.0, 'Japan': 5217301203153.5, 'India': 1726508317353.4, 'Cabo Verde': 1719693752.3, 'United States': 16108231800000.0, 'Greece': 270091322197.4, 'Myanmar': 50703575766.4, 'Madagascar': 9486935333.5, 'Tunisia': 44118349316.0, 'Mozambique': 12838623643.4, 'Cambodia': 13779735437.1, 'Iraq': 169480789377.9, 'Korea, Rep.': 1205755199135.1, 'Kazakhstan': 168198946242.6, 'Turkmenistan': 29383204467.2, 'Germany': 3582562859622.3, 'Iran, Islamic Rep.': 454617559842.3, 'France': 2702817149305.2, 'Paraguay': 23250819867.6, 'United Kingdom': 2736682446205.8, 'Malaysia': 274833978374.2, 'Philippines': 231981575952.4, 'Qatar': 149455747118.1, 'Lebanon': 40455121214.3, 'Jordan': 29469864942.1, 'Mali': 11894695436.7, 'Zambia': 21393965950.9, 'Australia': 1248467214849.1, 'Singapore': 257026873704.2, 'Zimbabwe': 11907947332.3, 'Sweden': 519174481541.8, 'Nigeria': 375494148527.7, 'China': 7784747168448.6, 'Tanzania': 36731725995.3, 'Czech Republic': 207328405561.6, 'Vietnam': 144579453846.2, 'Latvia': 28908355369.8, 'Spain': 1404296966483.9, 'Brunei Darussalam': 14817756697.0}
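The answers above differ mainly in how they detect the end of a group. Here is a compact sketch of the "numeric lines, then a country name" idea as a generator, with output formatted to one decimal place as in the desired result. It assumes GDP values are plain digit strings, skips blank lines, and uses a two-value sample per country instead of the real file; the name average_gdp is made up.

```python
import io

def average_gdp(lines):
    """Yield (country, average) pairs; numeric lines precede each country name."""
    values = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines, e.g. at the end of the file
        if line.isdigit():
            values.append(int(line))
        else:
            # A non-numeric line is the country name that closes the group.
            if values:
                yield line, sum(values) / len(values)
            values = []

# Shortened stand-in for 10year-gdp.txt (two values per country, not ten).
sample = io.StringIO(
    "4375000000\n4621000000\nGuam\n"
    "421375852\n458190185\nDominica\n"
)
averages = list(average_gdp(sample))
for country, avg in averages:
    print('{}: {:.1f}'.format(country, avg))
```

Because it does not count lines, this version would keep working if a country had more or fewer than ten values.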

Reading statistics from a .txt file and outputting them

I am supposed to get certain information from a .txt file and output it. This is the information I need:
State with the maximum population
State with the minimum population
Average state population
State of Texas population
The DATA looks like:
Alabama
AL
4802982
Alaska
AK
721523
Arizona
AZ
6412700
Arkansas
AR
2926229
California
CA
37341989
This is my code that does not really do anything I need it to do:
def main():
    # Open the StateCensus2010.txt file.
    census_file = open('StateCensus2010.txt', 'r')
    # Read the state name
    state_name = census_file.readline()
    while state_name != '':
        state_abv = census_file.readline()
        population = int(census_file.readline())
        state_name = state_name.rstrip('\n')
        state_abv = state_abv.rstrip('\n')
        print('State Name: ', state_name)
        print('State Abv.: ', state_abv)
        print('Population: ', population)
        print()
        state_name = census_file.readline()
    census_file.close()
main()
All I have it doing is reading the state name and abbreviation and converting the population into an int. I don't need it to do any of that; however, I'm unsure how to do what the assignment is asking. Any hints would definitely be appreciated! I've been trying some things for the past few hours to no avail.
Update:
This is my updated code; however, I'm receiving the following error:
Traceback (most recent call last):
File "main.py", line 13, in <module>
if population > max_population:
TypeError: unorderable types: str() > int()
Code:
with open('StateCensus2010.txt', 'r') as census_file:
    while True:
        try:
            state_name = census_file.readline()
            state_abv = census_file.readline()
            population = int(census_file.readline())
        except IOError:
            break
        # data processing here
        max_population = 0
        for population in census_file:
            if population > max_population:
                max_population = population
        print(max_population)

The data is in a consistent order: state name, state abbreviation, population. So you only need to read the lines once and pick up all three pieces of information as you go. Below is some sample code.
average = 0.0
total = 0.0
state_min = 999999999999
state_max = 0
statename_min = ''
statename_max = ''
texas_population = 0
with open('StateCensus2010.txt','r') as file:
    # split into lines; splitlines() avoids an empty entry after the final newline
    data = file.read().splitlines()
# get the length of the data by using len() method
# there are 50 states in the text file
# each state has 3 pieces of information stored:
# state name, state abbreviation, population
# that's why length of data which is 150/3 = 50 states
state_total = len(data) // 3
# this count is used as an index for the list
count = 0
for i in range(state_total):
    statename = data[count]
    state_abv = data[count+1]
    population = int(data[count+2])
    print('Statename : ',statename)
    print('State Abv : ',state_abv)
    print('Population: ',population)
    print()
    # sum all states population
    total += population
    if population > state_max:
        state_max = population
        statename_max = statename
    if population < state_min:
        state_min = population
        statename_min = statename
    if statename == 'Texas':
        texas_population = population
    # add 3 because we want to jump to next state
    # for example the first three lines is Alabama info
    # the next three lines is Alaska info and so on
    count += 3
# divide the total population with number of states
average = total/state_total
print(str(average))
print('Lowest population state :', statename_min)
print('Highest population state :', statename_max)
print('Texas population :', texas_population)

This problem is pretty easy using pandas.
Code:
import pandas as pd

states = []
for line in data:
    # each record is three consecutive lines: name, abbreviation, population
    states.append(
        dict(state=line.strip(),
             abbrev=next(data).strip(),
             pop=int(next(data)),
             )
    )
df = pd.DataFrame(states)
print(df)
# .ix was removed from pandas; .loc does the label lookup here
print('\nmax population:\n', df.loc[df['pop'].idxmax()])
print('\nmin population:\n', df.loc[df['pop'].idxmin()])
print('\navg population:\n', df['pop'].mean())
print('\nAZ population:\n', df[df.abbrev == 'AZ'])
Test Data:
from io import StringIO
data = StringIO(u'\n'.join([x.strip() for x in """
Alabama
AL
4802982
Alaska
AK
721523
Arizona
AZ
6412700
Arkansas
AR
2926229
California
CA
37341989
""".split('\n')[1:-1]]))
Results:
  abbrev       pop       state
0     AL   4802982     Alabama
1     AK    721523      Alaska
2     AZ   6412700     Arizona
3     AR   2926229    Arkansas
4     CA  37341989  California
max population:
abbrev            CA
pop         37341989
state     California
Name: 4, dtype: object
min population:
abbrev        AK
pop       721523
state     Alaska
Name: 1, dtype: object
avg population:
10441084.6
AZ population:
  abbrev      pop    state
2     AZ  6412700  Arizona

Another pandas solution, from the interpreter:
>>> import pandas as pd
>>>
>>> records = [line.strip() for line in open('./your.txt', 'r')]
>>>
>>> df = pd.DataFrame([records[i:i+3] for i in range(0, len(records), 3)],
...                   columns=['State', 'Code', 'Pop']).dropna()
>>>
>>> df['Pop'] = df['Pop'].astype(int)
>>>
>>> df
        State Code       Pop
0     Alabama   AL   4802982
1      Alaska   AK    721523
2     Arizona   AZ   6412700
3    Arkansas   AR   2926229
4  California   CA  37341989
>>>
>>> df.loc[df['Pop'].idxmax()]
State    California
Code             CA
Pop        37341989
Name: 4, dtype: object
>>>
>>> df.loc[df['Pop'].idxmin()]
State    Alaska
Code         AK
Pop      721523
Name: 1, dtype: object
>>>
>>> df['Pop'].mean()
10441084.6
>>>
>>> df.loc[df['Code'] == 'AZ']
     State Code      Pop
2  Arizona   AZ  6412700

Please try this. The earlier code was not Python 3 compatible; it only supported Python 2.7.
def extract_data(state):
    total_population = 0
    highest = lowest = None
    highest_state = lowest_state = None
    for abbrev, stats in state.items():
        population = stats.get('population')
        total_population = population + total_population
        # the first state seen initialises both extremes
        if highest is None or highest < population:
            highest = population
            highest_state = abbrev
        if lowest is None or lowest > population:
            lowest = population
            lowest_state = abbrev
    print(highest_state, highest)
    print(lowest_state, lowest)
    print(len(state))
    print(int(total_population/len(state)))
    print(state.get('TX').get('population'))
def main():
    # Open the census file (named states.txt here).
    census_file = open('states.txt', 'r')
    # Read the state name
    state_name = census_file.readline()
    state = {}
    while state_name != '':
        state_abv = census_file.readline()
        population = int(census_file.readline())
        state_name = state_name.rstrip('\n')
        state_abv = state_abv.rstrip('\n')
        if state_abv in state:
            state[state_abv].update({'population': population, 'state_name': state_name})
        else:
            state.setdefault(state_abv, {'population': population, 'state_name': state_name})
        state_name = census_file.readline()
    census_file.close()
    return state
state = main()
extract_data(state)
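For a version without pandas or manual min/max tracking, the built-in max, min and sum functions cover the four requested statistics. This is a sketch under assumptions: it reads a three-state in-memory sample rather than StateCensus2010.txt, so Texas is deliberately absent, and the name read_states is hypothetical.

```python
import io

def read_states(lines):
    """Group the lines into (name, abbreviation, population) records."""
    cleaned = [line.strip() for line in lines if line.strip()]
    return [
        (cleaned[i], cleaned[i + 1], int(cleaned[i + 2]))
        for i in range(0, len(cleaned) - 2, 3)
    ]

# Stand-in for the census file, using three states from the question.
sample = io.StringIO(
    "Alabama\nAL\n4802982\n"
    "Alaska\nAK\n721523\n"
    "California\nCA\n37341989\n"
)
states = read_states(sample)

largest = max(states, key=lambda s: s[2])   # record with the highest population
smallest = min(states, key=lambda s: s[2])  # record with the lowest population
average = sum(s[2] for s in states) / len(states)
by_name = {name: pop for name, _, pop in states}

print('Max:', largest[0], largest[2])
print('Min:', smallest[0], smallest[2])
print('Average:', average)
print('Texas:', by_name.get('Texas', 'not in sample'))
```

Converting the population with int() while reading is also what avoids the "unorderable types: str() > int()" error from the question, since the comparison then only ever sees integers.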
