How to solve ValueError: DataFrame constructor not properly called - python

I am trying to load CSV data into a pandas DataFrame, but I get the error "DataFrame constructor not properly called!". I have uploaded the CSV file to GitHub:
file="https://raw.githubusercontent.com/gambler2020/Data_Analysis/master/Economy/WEO_Data.csv"
import pandas as pd
with open("WEO_Data.csv", encoding='utf-16') as f:
    contents = f.read()
df = pd.DataFrame(contents)
ValueError: DataFrame constructor not properly called!
How can I solve this error?

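pd.DataFrame() expects structured data such as a dict, a list of rows, or an array. It does not parse a raw string, which is why the constructor raises this error. Use pd.read_csv to read the file instead, passing the same encoding you used with open():
df = pd.read_csv("WEO_Data.csv", encoding='utf-16')
If you plan to amend the data, you can do so after reading it into a pandas DataFrame.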

Use read_csv with the encoding='utf-16' parameter. It also accepts the URL directly:
import pandas as pd
file="https://raw.githubusercontent.com/gambler2020/Data_Analysis/master/Economy/WEO_Data.csv"
df = pd.read_csv(file, encoding='utf-16')
print(df.head())
Country 1980 1981 1982 1983 1984 1985 1986 \
0 Afghanistan NaN NaN NaN NaN NaN NaN NaN
1 Albania 1.946 2.229 2.296 2.319 2.290 2.339 2.587
2 Algeria 42.346 44.372 44.780 47.529 51.513 61.132 61.535
3 Andorra NaN NaN NaN NaN NaN NaN NaN
4 Angola 6.639 6.214 6.214 6.476 6.864 8.457 7.918
1987 1988 1989 1990 1991 1992 1993 1994 1995 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2.566 2.530 2.779 2.221 1.333 0.843 1.461 2.361 2.882
2 63.300 51.664 52.558 61.892 46.670 49.217 50.963 42.426 42.066
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 9.050 9.818 11.421 12.571 12.186 9.395 6.819 4.965 6.197
1996 1997 1998 1999 2000 2001 2002 2003 2004 \
0 NaN NaN NaN NaN NaN NaN 4.367 4.553 5.146
1 3.200 2.259 2.560 3.209 3.483 3.928 4.348 5.611 7.185
2 46.941 48.178 48.188 48.845 54.749 54.745 56.761 67.864 85.332
3 NaN NaN NaN NaN 1.429 1.547 1.758 2.362 2.896
4 7.994 9.388 7.958 7.526 11.166 10.930 15.286 17.813 23.552
...
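To see why the constructor fails and why a parser fixes it, here is a minimal sketch that needs no download. The csv_text sample is made up for illustration (it only mimics the layout of WEO_Data.csv), and io.StringIO stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the layout of the real file.
csv_text = "Country,1980,1981\nAlbania,1.946,2.229\nAlgeria,42.346,44.372\n"

# Passing raw text to the constructor fails: a string is not tabular data.
error_message = None
try:
    pd.DataFrame(csv_text)
except ValueError as exc:
    error_message = str(exc)

# read_csv parses the same text into rows and columns.
df = pd.read_csv(io.StringIO(csv_text))

print(error_message)
print(df)
```

The same read_csv call works with a local path or a URL in place of the StringIO object.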

Related

How to import a URL in pandas that is neither a CSV nor an Excel file?

I am trying to import data from this World Bank URL:
https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
I want to use the table of all countries and economies.
I can skip the graph by skipping the header, but pd.read_excel and pd.read_csv do not work on this page.
What am I supposed to do?
There is a link on the site to download the data as xlsx. You can read this with pd.read_excel():
import pandas as pd
df = pd.read_excel('https://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=excel', sheet_name='Data', skiprows=3)
Result:
|  | Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | 1979 | 1980 | 1981 | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
| 0 | Aruba | ABW | GDP (current US$) | NY.GDP.MKTP.CD | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 405463417 | 487602458 | 596423607 | 695304363 | 764887117 | 872138715 | 958463184 | 1082979721 | 1245688268 | 1320474860 | 1379960894 | 1531944134 | 1665100559 | 1722798883 | 1873452514 | 1920111732 | 1941340782 | 2021229050 | 2228491620 | 2330726257 | 2424581006 | 2615083799 | 2745251397 | 2498882682 | 2390502793 | 2549720670 | 2534636872 | 2727849721 | 2790849162 | 2962905028 | 2983636872 | 3092430168 | 3202188607 | nan | nan |
| 1 | Africa Eastern and Southern | AFE | GDP (current US$) | NY.GDP.MKTP.CD | 19342484576 | 19753490586 | 21526615650 | 25772356399 | 23563232195 | 26851350246 | 29196502382 | 30219070807 | 32927067005 | 37801761961 | 40377109505 | 44544318707 | 48374959174 | 63079306619 | 78369918525 | 83562484550 | 83337002757 | 95133441245 | 106507911957 | 124687609417 | 156750816224 | 160622014029 | 154904633222 | 160000530887 | 146244041212 | 130638242469 | 147248826582 | 180012868628 | 189290783787 | 194839284973 | 212659048041 | 221099527492 | 220553773354 | 220949576766 | 225099507739 | 253136239805 | 252550100523 | 265549158044 | 250377799052 | 247067404758 | 268315059659 | 242105498360 | 247656772652 | 326744217915 | 405860474813 | 471742666480 | 533533468219 | 613164396848 | 668037143166 | 670986478461 | 805794703846 | 898604749626 | 915590443629 | 930086422790 | 958824753165 | 895440123119 | 856991850399 | 964790654431 | 986610722363 | 980371628600 | 900828558644 |
| 2 | Afghanistan | AFG | GDP (current US$) | NY.GDP.MKTP.CD | 537777811 | 548888896 | 546666678 | 751111191 | 800000044 | 1006666638 | 1399999967 | 1673333418 | 1373333367 | 1408888922 | 1748886596 | 1831108971 | 1595555476 | 1733333264 | 2155555498 | 2366666616 | 2555555567 | 2953333418 | 3300000109 | 3697940410 | 3641723322 | 3478787909 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 4055179566 | 4515558808 | 5226778809 | 6209137625 | 6971285595 | 9747879532 | 10109225814 | 12439087077 | 15856574731 | 17804292964 | 20001598506 | 20561069558 | 20484885120 | 19907111419 | 18017749074 | 18869945678 | 18353881130 | 19291104008 | 19807067268 |
| 3 | Africa Western and Central | AFW | GDP (current US$) | NY.GDP.MKTP.CD | 10407321640 | 11131302981 | 11946843969 | 12680220415 | 13842621612 | 14866816737 | 15837474343 | 14430648807 | 14884699923 | 16887028428 | 23511477700 | 20838908163 | 25272340678 | 31282962686 | 44227412162 | 51459772973 | 62147555474 | 65334104528 | 71220525033 | 88654314398 | 112064063501 | 211065184010 | 187218448133 | 138155586596 | 114296077828 | 116541346401 | 107528972026 | 110354025261 | 108975317104 | 101798537787 | 121837737280 | 117491402473 | 118316836021 | 97186773684 | 85693055814 | 107403017954 | 119043576286 | 119983265523 | 122621303105 | 130198655014 | 134150161633 | 141862545778 | 170531894804 | 197384166043 | 245856459112 | 302110792974 | 384336309536 | 451866076568 | 553031246170 | 492545833171 | 580217267150 | 658428249867 | 716935231751 | 807818949699 | 846943079513 | 757492080700 | 687484728476 | 680989095101 | 738131279382 | 792078923888 | 786584975144 |
| 4 | Angola | AGO | GDP (current US$) | NY.GDP.MKTP.CD | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 5934073604 | 5553824464 | 5553824464 | 5787823809 | 6135166254 | 7558613008 | 7076793823 | 8089279285 | 8775116269 | 10207922517 | 11236275843 | nan | nan | nan | 3390500000 | 5561222222 | 7526963964 | 7649716157 | 6506619145 | 6152936539 | 9129634978 | 8936063723 | 15285594828 | 17812705294 | 23552052408 | 36970918699 | 52381006892 | 65266452081 | 88538611205 | 70307163678 | 83799496611 | 111789686464 | 128052853643 | 136709862831 | 145712200313 | 116193649124 | 101123851090 | 122123822334 | 101353230785 | 89417190341 | 62306913444 |
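The skiprows=3 argument in the read_excel call skips the rows that precede the real header row in the download. The sketch below shows the same idea on a small in-memory CSV. The preamble lines in the sample are invented for illustration, and read_csv stands in for read_excel so that nothing has to be downloaded.

```python
import io

import pandas as pd

# Hypothetical file content: three preamble lines, then the real header.
sample = (
    "Data Source,World Development Indicators\n"
    "Last Updated Date,hypothetical\n"
    "Note,hypothetical preamble line\n"
    "Country Name,Country Code,1960,1961\n"
    "Aruba,ABW,,\n"
    "Afghanistan,AFG,537777811,548888896\n"
)

# skiprows=3 drops the first three lines, so line four becomes the header.
df = pd.read_csv(io.StringIO(sample), skiprows=3)
print(df)
```

If the number of preamble rows in a given download differs, skiprows has to be adjusted to match.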

How to create a column with the most recent data in a Pandas Data Frame

I have a data frame of GDP for various countries. Each row is a different country, and each column after the fourth holds one year's data, from 1960 to 2020. I am trying to create a new column that holds only the most recent available GDP value for each country, since some countries have data for certain years and others do not. I have not been able to come up with a function or method to extract the most recent year's data for each country, and I am not sure how to approach it, so I have not made any progress.
Any help or pointers on how to go about it would be appreciated.
Code implemented to clean the original file so far:
gdp_annual_df = pd.read_excel("API_NY.GDP.MKTP.CD_DS2_en_excel_v2_1622079.xls", sheet_name="Data")
gdp_annual_df.columns = gdp_annual_df.iloc[2]
gdp_annual_df = gdp_annual_df.drop([0,1,2])
Data file from: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
This is the data frame (sorry, I am not sure how to add it properly/neatly):
2 Country Name Country Code Indicator Name Indicator Code 1960.0 1961.0 1962.0 1963.0 1964.0 1965.0 ... 2011.0 2012.0 2013.0 2014.0 2015.0 2016.0 2017.0 2018.0 2019.0 2020.0
3 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.549721e+09 2.534637e+09 2.701676e+09 2.765363e+09 2.919553e+09 2.965922e+09 3.056425e+09 NaN NaN NaN
4 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09 ... 1.780429e+10 2.000160e+10 2.056107e+10 2.048489e+10 1.990711e+10 1.936264e+10 2.019176e+10 1.948438e+10 1.910135e+10 NaN
5 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 1.117897e+11 1.280529e+11 1.367099e+11 1.457122e+11 1.161936e+11 1.011239e+11 1.221238e+11 1.013532e+11 9.463542e+10 NaN
6 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 1.289077e+10 1.231983e+10 1.277622e+10 1.322814e+10 1.138685e+10 1.186120e+10 1.301969e+10 1.514702e+10 1.527808e+10 NaN
7 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 3.629204e+09 3.188809e+09 3.193704e+09 3.271808e+09 2.789870e+09 2.896679e+09 3.000181e+09 3.218316e+09 3.154058e+09 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
262 Kosovo XKX GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 6.701698e+09 6.499807e+09 7.074778e+09 7.396705e+09 6.442916e+09 6.719172e+09 7.245707e+09 7.942962e+09 7.926108e+09 NaN
263 Yemen, Rep. YEM GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 3.272642e+10 3.540134e+10 4.041524e+10 4.320647e+10 3.697620e+10 2.808468e+10 2.456133e+10 2.759126e+10 NaN NaN
264 South Africa ZAF GDP (current US$) NY.GDP.MKTP.CD 7.575397e+09 7.972997e+09 8.497997e+09 9.423396e+09 1.037400e+10 1.133440e+10 ... 4.164189e+11 3.963327e+11 3.668294e+11 3.509046e+11 3.176205e+11 2.963573e+11 3.495541e+11 3.682889e+11 3.514316e+11 NaN
265 Zambia ZMB GDP (current US$) NY.GDP.MKTP.CD 7.130000e+08 6.962857e+08 6.931429e+08 7.187143e+08 8.394286e+08 1.082857e+09 ... 2.345952e+10 2.550306e+10 2.804551e+10 2.715065e+10 2.124335e+10 2.095476e+10 2.586814e+10 2.700524e+10 2.306472e+10 NaN
266 Zimbabwe ZWE GDP (current US$) NY.GDP.MKTP.CD 1.052990e+09 1.096647e+09 1.117602e+09 1.159512e+09 1.217138e+09 1.311436e+09 ... 1.410192e+10 1.711485e+10 1.909102e+10 1.949552e+10 1.996312e+10 2.054868e+10 2.204090e+10 2.431156e+10 2.144076e+10 NaN
Create a list of lists, one per row, then loop through and pick the last non-null value in each:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,np.nan,np.nan], [1,np.nan, 3,4,np.nan],[1,np.nan, 3,4,5]], columns=['a', 'b', 'c', 'd', 'e'])
latest = []
for lst in df.values.tolist():
    # Keep the non-null entries of the row and take the last one
    latest.append([l for l in lst if not pd.isnull(l)][-1])
df['latestValue'] = latest
Input dataframe:
a b c d e
0 1 2.0 3 NaN NaN
1 1 NaN 3 4.0 NaN
2 1 NaN 3 4.0 5.0
Output:
a b c d e latestValue
0 1 2.0 3 NaN NaN 3.0
1 1 NaN 3 4.0 NaN 4.0
2 1 NaN 3 4.0 5.0 5.0
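As an alternative to the explicit loop, here is a vectorized sketch: forward-fill along each row so that the last column ends up holding the last non-null value. This is a swapped-in technique, not the approach from the answer above, and it assumes the year columns are ordered oldest to newest. The small frame and the year_cols and latestYear names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the GDP frame: one text column, three year columns.
df = pd.DataFrame(
    {
        "Country Name": ["A", "B", "C"],
        "2018": [1.0, 1.0, 1.0],
        "2019": [2.0, np.nan, np.nan],
        "2020": [np.nan, np.nan, 5.0],
    }
)
year_cols = ["2018", "2019", "2020"]

# Forward-fill across columns, then read the last column.
df["latestValue"] = df[year_cols].ffill(axis=1).iloc[:, -1]

# Label of the last non-null year in each row.
df["latestYear"] = df[year_cols].apply(lambda row: row.last_valid_index(), axis=1)

print(df)
```

Restricting the operation to year_cols keeps text columns such as the country name from being mistaken for data.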

Grouping by year through months: pivot table

I need to group values by Year, from my dataset:
Date Freq Year Month
0 2020-03-19 32 2020 3
1 2020-03-25 31 2020 3
2 2020-03-23 28 2020 3
3 2020-03-04 26 2020 3
4 2020-08-04 26 2020 8
... ... ... ... ...
2516 2011-09-02 1 2011 9
2517 2013-04-25 1 2013 4
2518 2020-09-02 1 2020 9
2519 2013-09-03 1 2013 9
2520 2015-01-01 1 2015 1
I produced the table below as follows:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
try_this=pd.pivot_table(df, values = 'Freq', index=['Date','Year'], columns = 'Month')
Month 1 2 3 4 5 6 7 8 9 10 11 12
Date Year
2010-03-04 2010 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2010-03-07 2010 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2010-07-31 2010 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN
2010-10-07 2010 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
2010-12-20 2010 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-12-05 2020 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 15.0
2020-12-06 2020 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10.0
2020-12-08 2020 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 18.0
2020-12-09 2020 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4.0
2020-12-10 2020 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 14.0
I am trying to get something like this:
Year 1 2 3 4 5 6 7 8 9 10 11 12
2020 ... 61.0
2019 ...
2018 ...
...
i.e. a table of frequencies grouped by year, with one column per month.
What I tried (code above) does not give me this output.
I would appreciate more help on how to figure this out.
References:
Plot through time setting specific filtering
How to pivot a dataframe in Pandas?
Have you tried using aggfunc in pivot_table? Using only Year as the index, instead of Date and Year, gives one row per year:
df = df[['Year', 'Month', 'Freq']]
df = df.pivot_table(values=['Freq'], columns=['Month'], index=['Year'], aggfunc='sum')
print(df)
Freq
Month 1 3 4 8 9
Year
2011 NaN NaN NaN NaN 1.0
2013 NaN NaN 1.0 NaN 1.0
2015 1.0 NaN NaN NaN NaN
2020 NaN 117.0 NaN 26.0 1.0
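Here is a self-contained sketch of the same idea that starts from the Date column and also produces all twelve month columns, as in the desired output. The five sample rows and the by_year name are made up for illustration; fill_value=0 and the reindex step are optional additions, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample shaped like the question's data.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2020-03-19", "2020-03-25", "2020-08-04", "2013-04-25", "2013-09-03"]
        ),
        "Freq": [32, 31, 26, 1, 1],
    }
)
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month

# One row per year, one column per month, summing Freq within each cell.
by_year = df.pivot_table(
    values="Freq", index="Year", columns="Month", aggfunc="sum", fill_value=0
)

# Show every month 1..12 and list the most recent year first.
by_year = by_year.reindex(columns=range(1, 13), fill_value=0)
by_year = by_year.sort_index(ascending=False)

print(by_year)
```

Dropping fill_value leaves NaN in the months without data, which may be preferable when zero and missing need to be told apart.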

How to convert messy html table to pandas dataframe

I'm trying to scrape SEC 10-Q and 10-K filings. Though I'm able to extract the tables, the CSV output is a bit messy. Is there any way that I can merge the columns with similar header names with pandas? Or any libraries that can help me export SEC filing data tables as csv?
[user#server sec_parser]$ /usr/bin/python3 /home/user/work_files/sec_parser/parser.py --file 10-Q-cmcsa-3312017x10q.htm
0 1 2 3 4 5 6 7 8 9 10 11
0 (in millions) NaN 2017 2017 2017 NaN 2016 2016 2016 NaN NaN NaN
1 Revenue NaN $ 20463 NaN NaN $ 18790 NaN NaN 8.9 %
2 Costs and Expenses: NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Programming and production NaN 6074 6074 NaN NaN 5431 5431 NaN NaN 11.8 NaN
4 Other operating and administrative NaN 5827 5827 NaN NaN 5526 5526 NaN NaN 5.4 NaN
5 Advertising, marketing and promotion NaN 1530 1530 NaN NaN 1466 1466 NaN NaN 4.4 NaN
6 Depreciation NaN 1915 1915 NaN NaN 1785 1785 NaN NaN 7.3 NaN
7 Amortization NaN 587 587 NaN NaN 493 493 NaN NaN 19.0 NaN
8 Operating income NaN 4530 4530 NaN NaN 4089 4089 NaN NaN 10.8 NaN
9 Other income (expense) items, net NaN (625 (625 ) NaN (554 (554 ) NaN 13.0 NaN
10 Income before income taxes NaN 3905 3905 NaN NaN 3535 3535 NaN NaN 10.4 NaN
11 Income tax expense NaN (1,258 (1,258 ) NaN (1,311 (1,311 ) NaN (4.1 )
12 Net income NaN 2647 2647 NaN NaN 2224 2224 NaN NaN 19.0 NaN
13 Net (income) loss attributable to noncontrolli... NaN (81 (81 ) NaN (90 (90 ) NaN (10.2 )
14 Net income attributable to Comcast Corporation NaN $ 2566 NaN NaN $ 2134 NaN NaN 20.2 %
The sample table that I'm trying to convert to CSV: https://edgartable.netlify.app/
Here's my code:
import os
import argparse
import sys
from bs4 import BeautifulSoup
import pandas as pd
args = argparse.ArgumentParser()
args.add_argument('--file', type=str)
args.add_argument('--list', type=str)
opts = args.parse_args()
def parse_file(file):
    data_map = []
    div = []
    tables = []
    with open(file, 'r') as fh:
        soup = BeautifulSoup(fh, 'html.parser')
    for div in soup.find_all('div'):
        # Only process the section that holds the operating results table
        if 'Consolidated Operating Results' not in str(div.find('font')):
            continue
        table = div.find('table')
        dataset = pd.read_html(str(table), skiprows=3)
        print(dataset[0])
        for i, data in enumerate(dataset):
            data.to_csv(f'test{i}.csv', sep='|', index=False, header=False)
def main():
    parse_file(opts.file)
if __name__ == "__main__":
    main()
Try this:
import pandas as pd
df = pd.read_html('https://edgartable.netlify.app/')
df = df[0]
df.to_csv('test.csv')
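The question also asks how to merge the columns that repeat the same header. One possible sketch, not taken from the answer above: treat the first row as the header, blank out the stray "$" cells, then group the columns by header label and keep the first non-null value in each group. The raw frame is a hand-made miniature of the printed table, and the names header, body and merged are hypothetical.

```python
import numpy as np
import pandas as pd

# Hand-made miniature of the scraped table: row 0 holds the repeated headers.
raw = pd.DataFrame(
    [
        ["(in millions)", "2017", "2017", "2017", "2016", "2016", "2016"],
        ["Revenue", "$", 20463, np.nan, "$", 18790, np.nan],
        ["Depreciation", np.nan, 1915, 1915, np.nan, 1785, 1785],
    ]
)

header = raw.iloc[0]
body = raw.iloc[1:].reset_index(drop=True)

# Currency symbols sit in their own cells; treat them as missing.
body = body.mask(body == "$")

# Group columns by header label and keep the first non-null value per row.
merged = body.T.groupby(header.to_numpy(), sort=False).first().T

print(merged)
```

Real filings have more stray cells (parentheses, percent signs), so the masking step would need to be extended for them.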

How to extract column from Python Pandas Pivot_table?

I have the following pandas pivot_table:
print(table)
Year
1980.0 11.38
1981.0 35.68
1982.0 28.88
1983.0 16.80
1984.0 50.35
1985.0 53.95
1986.0 37.08
1987.0 21.70
1988.0 47.21
1989.0 73.45
1990.0 49.37
1991.0 32.23
1992.0 76.14
1993.0 45.99
1994.0 79.22
1995.0 88.11
1996.0 199.15
1997.0 201.07
1998.0 256.33
1999.0 251.12
2000.0 201.63
2001.0 331.49
2002.0 394.97
2003.0 357.61
2004.0 418.85
2005.0 459.41
2006.0 520.52
2007.0 610.44
2008.0 678.49
2009.0 667.39
2010.0 600.36
2011.0 515.93
2012.0 363.30
2013.0 367.98
2014.0 337.10
2015.0 264.26
dtype: float64
How do I extract the first column of this pivot_table? If I just do table[:,0], I get ValueError: Can only tuple-index with a MultiIndex. What can I do to extract the first column of the table?
Simply use reset_index(). The reproducible example below uses loc to slice the column:
import numpy as np
import pandas as pd
np.random.seed(44)
# RANDOM DATA WITH US CLASS I RAILROADS
df = pd.DataFrame({'Name': ['UP', 'BNSF', 'CSX', 'KCS','NSF', 'CN', 'CP']*5,
                   'Other_Sales': np.random.randn(35),
                   'Year': list(range(2007,2014))*5})
table = df.pivot_table('Other_Sales', columns='Name',
                       index='Year', aggfunc='sum')
print(table)
# Name BNSF CN CP CSX KCS NSF UP
# Year
# 2007 NaN NaN NaN NaN NaN NaN -1.785934
# 2008 1.605111 NaN NaN NaN NaN NaN NaN
# 2009 NaN NaN NaN 1.800014 NaN NaN NaN
# 2010 NaN NaN NaN NaN -2.577264 NaN NaN
# 2011 NaN NaN NaN NaN NaN 0.899372 NaN
# 2012 NaN -3.988874 NaN NaN NaN NaN NaN
# 2013 NaN NaN 1.725111 NaN NaN NaN NaN
table = df.pivot_table('Other_Sales', columns='Name',
                       index='Year', aggfunc='sum').sum(axis=1).reset_index()
print(table)
# Year 0
# 0 2007 -1.785934
# 1 2008 1.605111
# 2 2009 1.800014
# 3 2010 -2.577264
# 4 2011 0.899372
# 5 2012 -3.988874
# 6 2013 1.725111
print(table.loc[:,0])
# 0 -1.785934
# 1 1.605111
# 2 1.800014
# 3 -2.577264
# 4 0.899372
# 5 -3.988874
# 6 1.725111
# Name: 0, dtype: float64
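The object printed in the question ends with a dtype: float64 footer, which suggests it is a Series indexed by Year rather than a two-column DataFrame. Under that assumption, the years live in the index and the numbers are the values, as this minimal sketch shows. The three sample rows are copied from the question; the names years, values, as_frame and the "value" column label are made up for illustration.

```python
import pandas as pd

# A Series shaped like the one printed in the question.
table = pd.Series(
    [11.38, 35.68, 28.88],
    index=pd.Index([1980.0, 1981.0, 1982.0], name="Year"),
)

# The "first column" of the printout is the index, the second is the values.
years = table.index.to_numpy()
values = table.to_numpy()

# reset_index turns the index into a real column of a DataFrame.
as_frame = table.reset_index(name="value")
first_column = as_frame.iloc[:, 0]

print(as_frame)
```

Once the Series is a DataFrame, positional slicing with iloc or label slicing with loc works as in the answer above.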
