I want to parse the data from the following API response into a pandas DataFrame. There is an extra parent level in this JSON that I suspect is causing the problem. How can I strip it out and parse the data correctly?
URL: "https://api.covid19india.org/state_district_wise.json"
import pandas as pd
URL = "https://api.covid19india.org/state_district_wise.json"
df = pd.read_json(URL)
df.head()
The above code runs but produces a malformed DataFrame: each cell still contains a nested dict. Please help.
Parsing deeply nested structures in Python is a pain; here is a solution that works for your data:
import requests
import pandas as pd

URL = "https://api.covid19india.org/state_district_wise.json"
d = requests.get(URL).json()

L = []
for k, v in d.items():                          # state -> {'districtData': ..., 'statecode': ...}
    for k1, v1 in v.items():
        if isinstance(v1, dict):                # the 'districtData' parent level
            for k2, v2 in v1.items():           # district -> stats
                if isinstance(v2, dict):
                    for k3, v3 in v2.items():
                        if isinstance(v3, dict):   # the nested 'delta' dict
                            d1 = {f'{k3}.{k4}': v4 for k4, v4 in v3.items()}
                            d2 = {'districtData': k, 'State': k2, 'statecode': v['statecode']}
                            d3 = {**d2, **v2, **d1}
                            del d3[k3]
                            L.append(d3)

df = pd.DataFrame(L)
print(df)
districtData State statecode \
0 State Unassigned Unassigned UN
1 Andaman and Nicobar Islands Nicobars AN
2 Andaman and Nicobar Islands North and Middle Andaman AN
3 Andaman and Nicobar Islands South Andaman AN
4 Andaman and Nicobar Islands Unknown AN
.. ... ... ...
767 West Bengal Purba Bardhaman WB
768 West Bengal Purba Medinipur WB
769 West Bengal Purulia WB
770 West Bengal South 24 Parganas WB
771 West Bengal Uttar Dinajpur WB
notes active confirmed \
0 0 0
1 District-wise numbers are out-dated as cumulat... 0 0
2 District-wise numbers are out-dated as cumulat... 0 1
3 District-wise numbers are out-dated as cumulat... 19 51
4 148 4442
.. ... ... ...
767 618 8773
768 1424 16548
769 350 5609
770 1899 27445
771 358 5197
deceased recovered delta.confirmed delta.deceased delta.recovered
0 0 0 0 0 0
1 0 0 0 0 0
2 0 1 0 0 0
3 0 32 0 0 0
4 60 4234 0 0 0
.. ... ... ... ... ...
767 74 8081 0 0 0
768 212 14912 0 0 0
769 33 5226 0 0 0
770 501 25045 0 0 0
771 55 4784 0 0 0
[772 rows x 11 columns]
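As an alternative to the hand-rolled loops, `pd.json_normalize` can flatten this kind of nesting for you. A minimal sketch on a synthetic fragment shaped like the API response (the sample dict below is invented for illustration, not real API data):

```python
import pandas as pd

# Synthetic fragment mimicking the shape of state_district_wise.json
data = {
    "Andaman and Nicobar Islands": {
        "districtData": {
            "South Andaman": {
                "notes": "", "active": 19, "confirmed": 51,
                "deceased": 0, "recovered": 32,
                "delta": {"confirmed": 0, "deceased": 0, "recovered": 0},
            },
            "Nicobars": {
                "notes": "", "active": 0, "confirmed": 0,
                "deceased": 0, "recovered": 0,
                "delta": {"confirmed": 0, "deceased": 0, "recovered": 0},
            },
        },
        "statecode": "AN",
    }
}

# Peel off the 'districtData' parent level, then let json_normalize
# flatten the remaining nested 'delta' dict into delta.* columns
rows = [
    {"State": state, "District": district, **stats}
    for state, body in data.items()
    for district, stats in body["districtData"].items()
]
df = pd.json_normalize(rows)
print(df.columns.tolist())
```

This produces one row per district with dotted column names (`delta.confirmed` etc.), matching the layout of the loop-based solution above.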
Related
requests.get is not fetching all tags; I need the table from that page
import requests
from bs4 import BeautifulSoup
source = requests.get("https://www.covid19india.org/").text
soup = BeautifulSoup(source, "html.parser")
Welcome to Stack Overflow. As you are new, have a read of the guidance on how to ask a good question.
You won't be able to pull that table with a simple request: it is rendered dynamically by JavaScript and is not present in the raw HTML. The data comes from an API that you can query directly, then build the table from the JSON response.
import requests
import pandas as pd

jsonData = requests.get('https://api.covid19india.org/state_district_wise.json').json()

rows = []
for state, v1 in jsonData.items():
    for district, v2 in v1['districtData'].items():
        active = v2['active']
        confirmed = v2['confirmed']
        deceased = v2['deceased']
        recovered = v2['recovered']
        rows.append([state, district, active, confirmed, deceased, recovered])

df = pd.DataFrame(rows, columns=['State', 'District', 'Active', 'Confirmed', 'Deceased', 'Recovered'])
Output:
print (df)
State District ... Deceased Recovered
0 State Unassigned Unassigned ... 0 0
1 Andaman and Nicobar Islands Nicobars ... 0 0
2 Andaman and Nicobar Islands North and Middle Andaman ... 0 1
3 Andaman and Nicobar Islands South Andaman ... 0 32
4 Andhra Pradesh Anantapur ... 4 92
5 Andhra Pradesh Chittoor ... 1 97
6 Andhra Pradesh East Godavari ... 0 43
7 Andhra Pradesh Guntur ... 8 328
8 Andhra Pradesh Krishna ... 15 280
9 Andhra Pradesh Kurnool ... 21 447
10 Andhra Pradesh Other State ... 0 25
11 Andhra Pradesh Prakasam ... 0 63
12 Andhra Pradesh S.P.S. Nellore ... 4 103
13 Andhra Pradesh Srikakulam ... 0 5
14 Andhra Pradesh Visakhapatnam ... 1 47
15 Andhra Pradesh Vizianagaram ... 0 4
16 Andhra Pradesh West Godavari ... 0 54
17 Andhra Pradesh Y.S.R. Kadapa ... 0 76
18 Andhra Pradesh Unknown ... 1 67
19 Arunachal Pradesh Anjaw ... 0 0
20 Arunachal Pradesh Changlang ... 0 0
21 Arunachal Pradesh East Kameng ... 0 0
22 Arunachal Pradesh East Siang ... 0 0
23 Arunachal Pradesh Kamle ... 0 0
24 Arunachal Pradesh Kra Daadi ... 0 0
25 Arunachal Pradesh Kurung Kumey ... 0 0
26 Arunachal Pradesh Lepa Rada ... 0 0
27 Arunachal Pradesh Lohit ... 0 1
28 Arunachal Pradesh Longding ... 0 0
29 Arunachal Pradesh Lower Dibang Valley ... 0 0
.. ... ... ... ... ...
733 Uttarakhand Rudraprayag ... 0 0
734 Uttarakhand Tehri Garhwal ... 0 0
735 Uttarakhand Udham Singh Nagar ... 0 5
736 Uttarakhand Uttarkashi ... 0 0
737 Uttarakhand Unknown ... 0 2
738 West Bengal Alipurduar ... 0 0
739 West Bengal Bankura ... 0 0
740 West Bengal Birbhum ... 0 6
741 West Bengal Cooch Behar ... 0 0
742 West Bengal Dakshin Dinajpur ... 0 0
743 West Bengal Darjeeling ... 1 5
744 West Bengal Hooghly ... 6 109
745 West Bengal Howrah ... 34 200
746 West Bengal Jalpaiguri ... 0 4
747 West Bengal Jhargram ... 0 0
748 West Bengal Kalimpong ... 1 6
749 West Bengal Kolkata ... 172 568
750 West Bengal Malda ... 0 10
751 West Bengal Murshidabad ... 1 4
752 West Bengal Nadia ... 0 9
753 West Bengal North 24 Parganas ... 35 156
754 West Bengal Other State ... 1 5
755 West Bengal Paschim Bardhaman ... 2 10
756 West Bengal Paschim Medinipur ... 0 19
757 West Bengal Purba Bardhaman ... 0 5
758 West Bengal Purba Medinipur ... 1 31
759 West Bengal Purulia ... 0 0
760 West Bengal South 24 Parganas ... 5 46
761 West Bengal Uttar Dinajpur ... 0 0
762 West Bengal Unknown ... 0 0
[763 rows x 6 columns]
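Once the flat `df` is built you can filter or aggregate it as usual. A small sketch on a toy frame with the same columns (values invented for illustration):

```python
import pandas as pd

# Toy frame with the same columns as the answer's df (invented values)
df = pd.DataFrame({
    "State": ["West Bengal", "West Bengal", "Kerala"],
    "District": ["Kolkata", "Howrah", "Ernakulam"],
    "Confirmed": [568, 200, 120],
})

wb = df[df["State"] == "West Bengal"]            # rows for one state
totals = df.groupby("State")["Confirmed"].sum()  # statewide totals
print(totals)
```

From here, `df.to_csv('districts.csv', index=False)` would write the table out if a CSV is the end goal.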
I have two dataframes:
from datetime import date, timedelta

file_date = (date.today() - timedelta(days=2)).strftime('%m-%d-%Y')
github_dir_path = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports/'
file_path = github_dir_path + file_date + '.csv'
first dataframe:
FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key
0 45001.0 Abbeville South Carolina US 2020-04-28 02:30:51 34.223334 -82.461707 29 0 0 29 Abbeville, South Carolina, US
1 22001.0 Acadia Louisiana US 2020-04-28 02:30:51 30.295065 -92.414197 130 9 0 121 Acadia, Louisiana, US
2 51001.0 Accomack Virginia US 2020-04-28 02:30:51 37.767072 -75.632346 195 3 0 192 Accomack, Virginia, US
3 16001.0 Ada Idaho US 2020-04-28 02:30:51 43.452658 -116.241552 650 15 0 635 Ada, Idaho, US
4 19001.0 Adair Iowa US 2020-04-28 02:30:51 41.330756 -94.471059 1 0 0 1 Adair, Iowa, US
second dataframe (the header and the start of the first row were lost in the paste):
0 0 ... 0 Kerala 0 Kerala 1
2 2020-02-01 Kerala 2 0 0 ... 0 Kerala 0 Kerala 2
3 2020-02-02 Kerala 3 0 0 ... 0 Kerala 0 Kerala 3
4 2020-02-03 Kerala 3 0 0 ... 0 Kerala 0 Kerala 3
Please guide me on how to concatenate the two dataframes. I tried a couple of things but did not get the expected result.
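Without the exact frames it's hard to give a definitive answer, but the two usual options are `pd.concat` to stack frames vertically and `merge` to join on a key column. A minimal sketch on invented toy frames:

```python
import pandas as pd

# Toy frames standing in for the two dataframes (invented for illustration)
a = pd.DataFrame({"Province_State": ["South Carolina"], "Confirmed": [29]})
b = pd.DataFrame({"Province_State": ["Kerala"], "Confirmed": [3]})

# Stack b's rows under a's; ignore_index rebuilds a clean 0..n-1 index
stacked = pd.concat([a, b], ignore_index=True)
print(stacked)
```

If the frames share a key (e.g. a state/province name) and you want their columns side by side rather than stacked, `a.merge(b, on='Province_State', how='outer')` is the shape to reach for instead.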
I am trying to scrape the entire table and store it in a .csv file.
When I try to scrape the data I get the error "No tables found".
Here is my code.
from pandas.io.html import read_html
page = 'https://games.crossfit.com/leaderboard/open/2020?view=0&division=1&scaled=0&sort=0'
tables = read_html(page, attrs={"class":"desktop athletes"})
print ("Extracted {num} tables".format(num=len(tables)))
Any suggestions or guidance would be appreciated.
This page uses JavaScript to fetch data from the server and generate the table.
Using DevTools in Chrome/Firefox you can see (in the Network tab) all requests from the browser to the server; one of the XHR/AJAX requests fetches all the data in JSON format. You can use that URL to get the data as JSON, convert it to Python structures, and skip scraping entirely.
import requests

r = requests.get('https://games.crossfit.com/competitions/api/v1/competitions/open/2020/leaderboards?view=0&division=1&scaled=0&sort=0')
data = r.json()

for row in data['leaderboardRows']:
    print(row['entrant']['competitorName'], row['overallScore'],
          [(x['rank'], x['scoreDisplay']) for x in row['scores']])
Result
Patrick Vellner 64 [('13', '8:38'), ('19', '988 reps'), ('12', '6:29'), ('18', '16:29'), ('2', '10:09')]
Mathew Fraser 74 [('8', '8:28'), ('40', '959 reps'), ('3', '6:08'), ('2', '14:22'), ('21', '10:45')]
Lefteris Theofanidis 94 [('1', '8:05'), ('3', '1021 reps'), ('13', '6:32'), ('4', '15:00'), ('73', '11:11')]
# ... more ...
As stated in the other answer, you can access the API to get the data. To save as CSV, you'll need to work through the JSON format to get what you need (i.e. flatten out the nested data). There are two ways to do it: a) completely flatten it so there is one row per entrant, or b) have a separate row for each entrant for each of their ordinal scores.
The only difference is that with a) you'll have a really wide table (but no repeated data), while with b) you'll have a long table with some repeated data.
Since it's not too big a file, I went with option b), so you can always group by particular columns or filter:
import requests
import pandas as pd

r = requests.get('https://games.crossfit.com/competitions/api/v1/competitions/open/2020/leaderboards?view=0&division=1&scaled=0&sort=0')
data = r.json()

# Note: Series.append/DataFrame.append and Series.iteritems were removed in
# pandas 2.0; this code targets the pandas 1.x API current at the time.
results = pd.DataFrame()
df = pd.DataFrame(data['leaderboardRows'])
for idx, row in df.iterrows():
    entrantData = pd.Series()
    scoresData = pd.DataFrame()
    entrantResults = pd.DataFrame()
    for idx2, each in row.iteritems():
        if type(each) == dict:                  # nested entrant info
            temp = pd.DataFrame.from_dict(each, orient='index')
            entrantData = entrantData.append(temp)
        elif type(each) == list:                # list of 5 ordinal scores
            temp2 = pd.DataFrame(each)
            scoresData = scoresData.append(temp2, sort=True).reset_index(drop=True)
        else:                                   # scalar fields
            entrantData = entrantData.append(pd.Series(each, name=idx2))
    entrantResults = entrantResults.append(scoresData, sort=True).reset_index(drop=True)
    # repeat the entrant's info once per ordinal score, then join
    entrantResults = entrantResults.merge(pd.concat([entrantData.T] * 5, ignore_index=True),
                                          left_index=True, right_index=True)
    results = results.append(entrantResults, sort=True).reset_index(drop=True)

results.to_csv('file.csv', index=False)
Output: first 15 rows of 250
print (results.head(15).to_string())
affiliate affiliateId affiliateName age breakdown competitorId competitorName countryChampion countryOfOriginCode countryOfOriginName divisionId drawBlueHR firstName gender heat height highlight judge lane lastName mobileScoreDisplay nextStage ordinal overallRank overallScore postCompStatus profilePicS3key rank scaled score scoreDisplay scoreIdentifier status time video weight
0 CrossFit Nanaimo 1918 CrossFit Nanaimo 30 10 rounds 158264 Patrick Vellner False CA Canada 1 NaN Patrick M 71 in False Dallyn Giroux Vellner 1 1 64 d471c-P158264_7-184.jpg 13 0 11800382 8:38 9d3979222412df2842a1 ACT 518 0 195 lb
1 CrossFit Soul Miami 1918 CrossFit Nanaimo 30 29 rounds +\n2 thrusters\n 158264 Patrick Vellner False CA Canada 1 NaN Patrick M 71 in False Lamar Vernon Vellner 2 1 64 d471c-P158264_7-184.jpg 19 0 1009880000 988 reps 9bd66b00e8367cc7fd0c ACT NaN 0 195 lb
2 CrossFit Nanaimo 1918 CrossFit Nanaimo 30 165 reps 158264 Patrick Vellner False CA Canada 1 NaN Patrick M 71 in False Jason Lochhead Vellner 3 1 64 d471c-P158264_7-184.jpg 12 0 1001650151 6:29 2347b4cb7339f2a13e6c ACT 389 0 195 lb
3 CrossFit Nanaimo 1918 CrossFit Nanaimo 30 240 reps 158264 Patrick Vellner False CA Canada 1 NaN Patrick M 71 in False Dallyn Giroux Vellner 4 1 64 d471c-P158264_7-184.jpg 18 0 1002400211 16:29 bcfd3882df3fa2e99451 ACT 989 0 195 lb
4 CrossFit New England 1918 CrossFit Nanaimo 30 240 reps 158264 Patrick Vellner False CA Canada 1 NaN Patrick M 71 in False Matt O'Keefe Vellner 5 1 64 d471c-P158264_7-184.jpg 2 0 1002400591 10:09 4bb25bed5f71141da122 ACT 609 0 195 lb
5 CrossFit Mayhem 3220 CrossFit Mayhem 30 10 rounds 153604 Mathew Fraser True US United States 1 NaN Mathew M 67 in False darren hunsucker Fraser 1 2 74 9e218-P153604_4-184.jpg 8 0 11800392 8:28 18b5b2e137f00a2d9d7d ACT 508 0 195 lb
6 CrossFit Soul Miami 3220 CrossFit Mayhem 30 28 rounds +\n4 thrusters\n3 toes-to-bars\n 153604 Mathew Fraser True US United States 1 NaN Mathew M 67 in False Daniel Lopez Fraser 2 2 74 9e218-P153604_4-184.jpg 40 0 1009590000 959 reps b96bc1b7b58fa34a28a1 ACT NaN 0 195 lb
7 CrossFit Mayhem 3220 CrossFit Mayhem 30 165 reps 153604 Mathew Fraser True US United States 1 NaN Mathew M 67 in False Jason Fernandez Fraser 3 2 74 9e218-P153604_4-184.jpg 3 0 1001650172 6:08 4f4a994a045652c894c5 ACT 368 0 195 lb
8 CrossFit Mayhem 3220 CrossFit Mayhem 30 240 reps 153604 Mathew Fraser True US United States 1 NaN Mathew M 67 in False Tasia Percevecz Fraser 4 2 74 9e218-P153604_4-184.jpg 2 0 1002400338 14:22 1a4a7d8760e72bb12d68 ACT 862 0 195 lb
9 CrossFit Mayhem 3220 CrossFit Mayhem 30 240 reps 153604 Mathew Fraser True US United States 1 NaN Mathew M 67 in False Kelley Jackson Fraser 5 2 74 9e218-P153604_4-184.jpg 21 0 1002400555 10:45 b4a259e7049f47f65356 ACT 645 0 195 lb
10 NaN 0 30 10 rounds 514502 Lefteris Theofanidis True GR Greece 1 NaN Lefteris M 171 cm False NaN Theofanidis 1 3 94 931eb-P514502_2-184.jpg 1 0 11800415 8:05 c8907e02512f42ff3142 ACT 485 1 81 kg
11 NaN 0 30 30 rounds +\n1 thruster\n 514502 Lefteris Theofanidis True GR Greece 1 NaN Lefteris M 171 cm False NaN Theofanidis 2 3 94 931eb-P514502_2-184.jpg 3 0 1010210000 1021 reps 63add31b22606957701c ACT NaN 1 81 kg
12 NaN 0 30 165 reps 514502 Lefteris Theofanidis True GR Greece 1 NaN Lefteris M 171 cm False NaN Theofanidis 3 3 94 931eb-P514502_2-184.jpg 13 0 1001650148 6:32 46d7cdb691c25ea38dbe ACT 392 1 81 kg
13 NaN 0 30 240 reps 514502 Lefteris Theofanidis True GR Greece 1 NaN Lefteris M 171 cm False NaN Theofanidis 4 3 94 931eb-P514502_2-184.jpg 4 0 1002400300 15:00 d49e55a2af5840740071 ACT 900 1 81 kg
14 NaN 0 30 240 reps 514502 Lefteris Theofanidis True GR Greece 1 NaN Lefteris M 171 cm False NaN Theofanidis 5 3 94 931eb-P514502_2-184.jpg 73 0 1002400529 11:11 d35c9d687eb6b72c8e36 ACT 671 1 81 kg
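For reference, nested lists like `scores` can also be expanded with `pd.json_normalize` using its `record_path` and `meta` parameters, which avoids the manual append loop. A sketch on an invented fragment shaped like the leaderboard JSON:

```python
import pandas as pd

# Invented fragment mimicking the leaderboard response shape
data = {
    "leaderboardRows": [
        {
            "entrant": {"competitorName": "Patrick Vellner"},
            "overallScore": "64",
            "scores": [
                {"rank": "13", "scoreDisplay": "8:38"},
                {"rank": "19", "scoreDisplay": "988 reps"},
            ],
        }
    ]
}

# One row per score; entrant name and overall score repeated on each row
df = pd.json_normalize(
    data["leaderboardRows"],
    record_path="scores",
    meta=[["entrant", "competitorName"], "overallScore"],
)
print(df)
```

This yields the same "long" shape as option b) above, with columns `rank`, `scoreDisplay`, `entrant.competitorName`, and `overallScore`.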
I have a pandas dataframe that looks like this:
genres.head()
Drama Comedy Action Crime Romance Thriller Adventure Horror Mystery Fantasy ... History Music War Documentary Sport Musical Western Film-Noir News number_of_genres
tconst
tt0111161 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
tt0468569 1 0 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 3
tt1375666 0 0 1 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 3
tt0137523 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
tt0110912 1 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2
I want to be able to get a table where the rows are the genres, the columns are the number of labels for a given movie and the values are the counts. In other words, I want this:
number_of_genres 1 2 3 totals
Drama 451 1481 3574 5506
Comedy 333 1108 2248 3689
Action 9 230 1971 2210
Crime 1 284 1687 1972
Romance 1 646 1156 1803
Thriller 22 449 1153 1624
Adventure 1 98 1454 1553
Horror 137 324 765 1226
Mystery 0 108 792 900
Fantasy 1 74 642 717
Sci-Fi 0 129 551 680
Biography 0 95 532 627
Family 0 60 452 512
Animation 0 6 431 437
History 0 32 314 346
Music 1 87 223 311
War 0 90 162 252
Documentary 70 82 78 230
Sport 0 78 142 220
Musical 0 13 131 144
Western 19 44 57 120
Film-Noir 0 11 50 61
News 0 1 2 3
Total 1046 5530 18567 25143
What is the best way of getting that table Pythonically? I solved the problem with the following code but was wondering if there's a better way:
import numpy as np
import pandas as pd

genres['number_of_genres'] = genres.sum(axis=1)

pivots = []
for column in genres.columns[0:-1]:
    column = pd.DataFrame(genres[column])
    columns = column.join(genres.number_of_genres)
    pivot = pd.pivot_table(columns, values=columns.columns[0],
                           columns='number_of_genres', aggfunc=np.sum)
    pivots.append(pivot)

pivots_df = pd.concat(pivots)
pivots_df['totals'] = pivots_df.sum(axis=1)
pivots_df.loc['Total'] = pivots_df.sum()
[EDIT]: Added jupyter output that should be compatible with pd.read_clipboard(). If I can format the output better, please let me know how I can do so.
Maybe I'm missing something, but doesn't this work for you?
agg = df.groupby('number_of_genres').agg('sum').T
agg['totals'] = agg.sum(axis=1)
Edit: Solution via pivot_table
agg = df.pivot_table(columns='number_of_genres', aggfunc='sum')
agg['total'] = agg.sum(axis=1)
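On a small invented one-hot frame, the groupby version of the same idea looks like this (toy data, not the questioner's genres table):

```python
import pandas as pd

# Tiny one-hot genre frame (invented data): 3 movies, 2 genres
df = pd.DataFrame({
    "Drama":  [1, 1, 0],
    "Comedy": [0, 1, 1],
})
df["number_of_genres"] = df.sum(axis=1)

# Sum the one-hot columns per genre-count, then transpose so
# genres become rows and the counts (1, 2, ...) become columns
agg = df.groupby("number_of_genres").agg("sum").T
agg["totals"] = agg.sum(axis=1)
print(agg)
```

Each cell is the number of movies with that genre and that total number of genres, which is exactly the table the question asks for.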
Pick Tm Player Pos Age To AP1 PB St CarAV ... Att Yds TD Rec Yds TD Tkl Int Sk College/Univ
0 1 CLE Myles Garrett DE 21 2017 0 0 0 0 ... 0 0 0 0 0 0 13 5.0 Texas A&M
1 2 CHI Mitch Trubisky QB 23 2017 0 0 1 0 ... 29 194 0 0 0 0 North Carolina
2 3 SFO Solomon Thomas DE 21 2017 0 0 1 0 ... 0 0 0 0 0 0 25 2.0 Stanford
3 4 JAX Leonard Fournette RB 22 2017 0 0 1 0 ... 207 822 7 25 195 1 LSU
4 5 TEN Corey Davis WR 22 2017 0 0 1 0 ... 0 0 0 22 227 0 West. Michigan
Given this df, I want to count the number of players per College/Univ.
So, just in this particular df, all colleges will have the value of 1.
Given a df and a college name, how can I count the number of items?
You can create a boolean mask and then count the `True` values with `sum`, since `True` is treated as 1:
(df['College/Univ'] == 'Texas A&M').sum()
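If you want counts for every college at once rather than a single name, `value_counts` does it in one call. A sketch on a toy frame (invented rows for illustration):

```python
import pandas as pd

# Toy frame with the relevant column (invented rows)
df = pd.DataFrame({"College/Univ": ["Texas A&M", "LSU", "Texas A&M"]})

# Count of one particular college via a boolean mask
n = (df["College/Univ"] == "Texas A&M").sum()

# Counts for all colleges at once, sorted descending
counts = df["College/Univ"].value_counts()
print(counts)
```

`counts` is a Series indexed by college name, so `counts["Texas A&M"]` gives the same number as the mask-and-sum approach.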