Taking OpenCorporate API data into a structured CSV

Taking OpenCorporate API data into a structured CSV - python

I'm currently struggling with figuring out how to use pandas to scrape data off of the OpenCorporate API and insert it into a CSV file. I'm not quite sure where I'm messing up.
import pandas as pd
df = pd.read_json('https://api.opencorporates.com/companies/search?q=pwc')
data = df['companies']['company'][0]
result = {'name':data['timestamp'],
'company_number':data[0]['company_number'],
'jurisdiction_code':data[0]['jurisdiction_code'],
'incorporation_date':data[0]['incorporation_date'],
'dissolution_date':data[0]['dissolution_date'],
'company_type':data[0]['company_type'],
'registry_url':data[0]['registry_url'],
'branch':data[0]['branch'],
'opencorporates_url':data[0]['opencorporates_url'],
'previous_names':data[0]['previous_names'],
'source':data[0]['source'],
'url':data[0]['url'],
'registered_address':data[0]['registered_address'],
}
df1 = pd.DataFrame(result, columns=['name', 'company_number', 'jurisdiction_code', 'incorporation_date', 'dissolution_date', 'company_type', 'registry_url', 'branch', 'opencorporates_url', 'previous_names', 'source', 'url', 'registered_address'])
df1.to_csv('company.csv', index=False, encoding='utf-8')

Get the json data with requests and then use pd.io.json.json_normalize to flatten the response.
import requests
json_data = requests.get('https://api.opencorporates.com/companies/search?q=pwc').json()
from pandas.io.json import json_normalize
df = None
for row in json_data["results"]["companies"]:
if df is None:
df = json_normalize(row["company"])
else:
df = pd.concat([df, json_normalize(row["company"])])
You then write the DataFrame to a csv using the df.to_csv() method as described in the question.

It might be easier for you to access to the OpenCorporates database in bulk.
OpenCorporates provides access for commercial users under a closed licence, and as open data for journalists, academics and NGOs who are able to share the results under a share-alike, open data licence. The licence is available here: https://opencorporates.com/info/licence

Related

Using pandas in a csv modified file?

is it possible to use pandas after rewriting data into a csv using something like this:
import csv
headers = []
cleaned_data = open('cleaned_data.csv', 'w')
writer = csv.writer(cleaned_data)
for row in open("house_prices.csv"):
# <-- Some body code here to filter out the headers
This is where I want to continue with my cleaning of data and get rid of rows that contain missing values. I've been told that using pandas is the way to go but I'm not sure if it's ok to do it since the first steps are to write this code:
import pandas as pd
df = pd.read_csv('house_prices.csv')
which conflicts with my first code, right? So is it possible to remove rows of missing values with this method or is there another way without importing anything?
Or would it be possible to combine both?ie:
import csv
import pandas as pd
headers = []
cleaned_data = open('cleaned_data.csv', 'w')
writer = csv.writer(cleaned_data)
df = pd.read_csv('house_prices.csv')
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
for row in open("house_prices.csv"):
# <-- Some body code here to filter out the headers
Would that work? This is the first time I'm seeing pandas

Python Datacompy library: how to save report string into a csv file?

I'm comparing between two data frames using Datacompy, but how can I save the final result as an excel sheet or csv file? I got a string as an output, but how can I save it as a CSV.
import pandas as pd
df1_1=pd.read_csv('G1-1.csv')
df1_2=pd.read_csv('G1-2.csv')
import datacompy
compare = datacompy.Compare(
df1_1,
df1_2,
join_columns='SAMPLED CONTENT (URL to content)',
)
print(compare.report())

I have tried this, and it worked for me:
with open('//Path', encoding='utf-8') as report_file:
report_file.write(compare.report())

If you just using pandas, you can try pandas's own way to write into csv:
> df = pd.DataFrame([['yy','rr'],['tt', 'rr'],['cc', 'rr']], index=range(3),
columns=['a', 'b'])
> df.to_csv('compare.csv')
I hadn't used datacompy, but I suggest that you can make your results into a dataframe, then you can use the to_csv way.

This is working fine me also Full code
compare = datacompy.Compare(
Oracle_DF1,PostgreSQL_DF2,
join_columns=['c_transaction_cd','c_anti_social_force_req_id'], #You can also specify a list of columns
abs_tol=0,
rel_tol=0,
df1_name = 'Oracle Source',
df2_name = 'PostgrSQL Reference'
)
compare.matches(ignore_extra_columns=False)
Report = compare.report() csvFileToWrite=r'D://Postgres_Problem_15Feb21//Oracle_PostgreSQLDataFiles//Sample//summary.csv'
with open(csvFileToWrite,mode='r+',encoding='utf-8') as report_file:
report_file.write(compare.report())

Parse a json file to get the right columns to insert into bigquery

I'm relatively new to Python and I am trying to get some exchange rate data from the ECB free api:
GET https://api.exchangeratesapi.io/latest?base=GBP
I want to ultimately end up with this data in a bigquery table. Loading the data to BQ is fine, but getting it into the right column/row format before sending it the BQ is the problem.
I want to end up with a table like this:
Currency Rate Date
CAD 1.629.. 2019-08-27
HKD 9.593.. 2019-08-27
ISK 152.6.. 2019-08-27
... ... ...
I've tried a few things but not quite got there yet:
# api-endpoint
URL = "https://api.exchangeratesapi.io/latest?base=GBP"
# sending get request and saving the response as response object
r = requests.get(url=URL)
# extracting data in json format
data = r.json()
with open('data.json', 'w') as outfile:
json.dump(data['rates'], outfile)
a_dict = {'date': '2019-08-26'}
with open('data.json') as f:
data = json.load(f)
data.update(a_dict)
with open('data.json', 'w') as f:
json.dump(data, f)
print(data)
Here is the original json file:
{
"rates":{
"CAD":1.6296861353,
"HKD":9.593490542,
"ISK":152.6759753684,
"PHP":64.1305429339,
"DKK":8.2428443501,
"HUF":363.2604778172,
"CZK":28.4888284523,
"GBP":1.0,
"RON":5.2195062629,
"SEK":11.8475893558,
"IDR":17385.9684034803,
"INR":87.6742617713,
"BRL":4.9997236134,
"RUB":80.646191945,
"HRK":8.1744110201,
"JPY":130.2223254066,
"THB":37.5852652759,
"CHF":1.2042718318,
"EUR":1.1055465269,
"MYR":5.1255348081,
"BGN":2.1622278974,
"TRY":7.0550451616,
"CNY":8.6717964026,
"NOK":11.0104695256,
"NZD":1.9192287707,
"ZAR":18.6217151449,
"USD":1.223287232,
"MXN":24.3265563331,
"SGD":1.6981194654,
"AUD":1.8126540855,
"ILS":4.3032293014,
"KRW":1482.7479464473,
"PLN":4.8146551248
},
"base":"GBP",
"date":"2019-08-23"
}

Welcome! How about this, as one way to tackle your problem.
# import the pandas library so we can use it's from_dict function:
import pandas as pd
# subset the json to a dict of exchange rates and country codes:
d = data['rates']
# create a dataframe from this data, using pandas from_dict function:
df = pd.DataFrame.from_dict(d,orient='index')
# add a column for date (this value is taken from the json data):
df['date'] = data['date']
# name our columns, to keep things clean
df.columns = ['rate','date']
This gives you:
rate date
CAD 1.629686 2019-08-23
HKD 9.593491 2019-08-23
ISK 152.675975 2019-08-23
PHP 64.130543 2019-08-23
...
In this case the currency is the index of the dataframe, if you'd prefer it as column of it's own just add:
df['currency'] = df.index
You can then write this dataframe out to a .csv file, or write it into BigQuery.
For this i'd recommend you take a look at The BigQuery Client library, this can be a little hard to get your head around at first, so you may also want to check out pandas.DataFrame.to_gbq, which is easier, but less robust (see this link for more detail on Client library vs. a pandas function.

Thanks Ben P for the help.
Here is my script that works for those interested. It uses an internal library my team uses for the BQ load, but the rest is pandas and requests:
from aa.py.gcp import GCPAuth, GCPBigQueryClient
from aa.py.log import StandardLogger
import requests, os, pandas as pd
# Connect to BigQuery
logger = StandardLogger('test').logger
auth = GCPAuth(logger=logger)
credentials_path = 'XXX'
credentials = auth.get_credentials(credentials_path)
gcp_bigquery = GCPBigQueryClient(logger=logger)
gcp_bigquery.connect(credentials)
# api-endpoint
URL = "https://api.exchangeratesapi.io/latest?base=GBP"
# sending get request and saving the response as response object
r = requests.get(url=URL)
# extracting data in json format
data = r.json()
# extract rates object from json
d = data['rates']
# split currency and rate for dataframe
df = pd.DataFrame.from_dict(d,orient='index')
# add date element to dataframe
df['date'] = data['date']
#column names
df.columns = ['rate', 'date']
# print dataframe
print(df)
# write dateframe to csv
df.to_csv('data.csv', sep='\t', encoding='utf-8')
#########################################
# write csv to BQ table
file_path = os.getcwd()
file_name = 'data.csv'
dataset_id = 'Testing'
table_id = 'Exchange_Rates'
response = gcp_bigquery.load_file_into_table(file_path, file_name, dataset_id, table_id, source_format='CSV', field_delimiter="\t", create_disposition='CREATE_NEVER', write_disposition='WRITE_TRUNCATE',skip_leading_rows=1)

Not reading all rows while importing csv into pandas dataframe

I am trying the kaggle challenge here, and unfortunately I am stuck at a very basic step.
I am trying to read the datasets into a pandas dataframe by executing following command:
test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")
The problem is that this file as you would find out has over 300,000 records, but I am reading only 7945.
print (test.shape)
(7945, 21)
Now I have double checked the file and I cannot find anything special about line number 7945. Any pointers why this could be happening?

I think better is use function read_csv with parameters quoting=csv.QUOTE_NONE and error_bad_lines=False. link
import pandas as pd
import csv
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)
print (test.shape)
#(381422, 22)
But some data (problematic) will be skipped.
If you want skip emails body data, you can use:
import pandas as pd
import csv
test = pd.read_csv(
"output/Emails.csv",
quoting=csv.QUOTE_NONE,
sep=',',
error_bad_lines=False,
header=None,
names=[
"Id", "DocNumber", "MetadataSubject", "MetadataTo", "MetadataFrom",
"SenderPersonId", "MetadataDateSent", "MetadataDateReleased",
"MetadataPdfLink", "MetadataCaseNumber", "MetadataDocumentClass",
"ExtractedSubject", "ExtractedTo", "ExtractedFrom", "ExtractedCc",
"ExtractedDateSent", "ExtractedCaseNumber", "ExtractedDocNumber",
"ExtractedDateReleased", "ExtractedReleaseInPartOrFull",
"ExtractedBodyText", "RawText"])
print (test.shape)
#delete row with NaN in column MetadataFrom
test = test.dropna(subset=['MetadataFrom'])
#delete headers in data
test = test[test.MetadataFrom != 'MetadataFrom']

Read specific columns with pandas or other python module

I have a csv file from this webpage.
I want to read some of the columns in the downloaded file (the csv version can be downloaded in the upper right corner).
Let's say I want 2 columns:
59 which in the header is star_name
60 which in the header is ra.
However, for some reason the authors of the webpage sometimes decide to move the columns around.
In the end I want something like this, keeping in mind that values can be missing.
data = #read data in a clever way
names = data['star_name']
ras = data['ra']
This will prevent my program to malfunction when the columns are changed again in the future, if they keep the name correct.
Until now I have tried various ways using the csv module and resently the pandas module. Both without any luck.
EDIT (added two lines + the header of my datafile. Sorry, but it's extremely long.)
# name, mass, mass_error_min, mass_error_max, radius, radius_error_min, radius_error_max, orbital_period, orbital_period_err_min, orbital_period_err_max, semi_major_axis, semi_major_axis_error_min, semi_major_axis_error_max, eccentricity, eccentricity_error_min, eccentricity_error_max, angular_distance, inclination, inclination_error_min, inclination_error_max, tzero_tr, tzero_tr_error_min, tzero_tr_error_max, tzero_tr_sec, tzero_tr_sec_error_min, tzero_tr_sec_error_max, lambda_angle, lambda_angle_error_min, lambda_angle_error_max, impact_parameter, impact_parameter_error_min, impact_parameter_error_max, tzero_vr, tzero_vr_error_min, tzero_vr_error_max, K, K_error_min, K_error_max, temp_calculated, temp_measured, hot_point_lon, albedo, albedo_error_min, albedo_error_max, log_g, publication_status, discovered, updated, omega, omega_error_min, omega_error_max, tperi, tperi_error_min, tperi_error_max, detection_type, mass_detection_type, radius_detection_type, alternate_names, molecules, star_name, ra, dec, mag_v, mag_i, mag_j, mag_h, mag_k, star_distance, star_metallicity, star_mass, star_radius, star_sp_type, star_age, star_teff, star_detected_disc, star_magnetic_field
11 Com b,19.4,1.5,1.5,,,,326.03,0.32,0.32,1.29,0.05,0.05,0.231,0.005,0.005,0.011664,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2008,2011-12-23,94.8,1.5,1.5,2452899.6,1.6,1.6,Radial Velocity,,,,,11 Com,185.1791667,17.7927778,4.74,,,,,110.6,-0.35,2.7,19.0,G8 III,,4742.0,,
11 UMi b,10.5,2.47,2.47,,,,516.22,3.25,3.25,1.54,0.07,0.07,0.08,0.03,0.03,0.012887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2009,2009-08-13,117.63,21.06,21.06,2452861.05,2.06,2.06,Radial Velocity,,,,,11 UMi,229.275,71.8238889,5.02,,,,,119.5,0.04,1.8,24.08,K4III,1.56,4340.0,,

An easy way to do this is using the pandas library like this.
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print df.keys()
# See content in 'star_name'
print df.star_name
The problem here was the skipinitialspace which remove the spaces in the header. So ' star_name' becomes 'star_name'

According to the latest pandas documentation you can read a csv file selecting only the columns which you want to read.
import pandas as pd
df = pd.read_csv('some_data.csv', usecols = ['col1','col2'], low_memory = True)
Here we use usecols which reads only selected columns in a dataframe.
We are using low_memory so that we Internally process the file in chunks.

Above answers are for python2. So for python 3 users I am giving this answer. You can use the code below:
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print(df.keys())
# See content in 'star_name'
print(df.star_name)

Got a solution to above problem in a different way where in although i would read entire csv file, but would tweek the display part to show only the content which is desired.
import pandas as pd
df = pd.read_csv('data.csv', skipinitialspace=True)
print df[['star_name', 'ra']]
This one could help in some of the scenario's in learning basics and filtering data on the basis of columns in dataframe.

I think you need to try this method.
import pandas as pd
data_df = pd.read_csv('data.csv')
print(data_df['star_name'])
print(data_df['ra'])

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Taking OpenCorporate API data into a structured CSV - python

Related

Using pandas in a csv modified file?

Python Datacompy library: how to save report string into a csv file?

Parse a json file to get the right columns to insert into bigquery

Not reading all rows while importing csv into pandas dataframe

Read specific columns with pandas or other python module

Categories

Resources