API result (complex nested JSON) into a pandas DataFrame - Python

I need to extract some columns from an API response. I tried:
import requests
import pandas as pd

# USGS earthquake feed for 2016-10-01
url = "https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&starttime=2016-10-01&endtime=2016-10-02"
# fetch the URL and decode the JSON response
data = requests.get(url)
eq = data.json()
# the result can be inspected here
eq['features']
def obtain_data(eq):
    # print one tab-separated line per earthquake feature
    print('Lat\tLongitude\tTitle\tPlace\tMag')
    i = 0
    while i < len(eq['features']):
        coords = eq['features'][i]['geometry']['coordinates']
        props = eq['features'][i]['properties']
        print(str(coords[0]) + '\t' + str(coords[1]) + '\t' +
              str(props['title']) + '\t' + str(props['place']) + '\t' + str(props['mag']))
        i = i + 1

final_data = obtain_data(eq)
I need to split the coordinates into 2 columns - Lat and Longitude - and also extract the Title, Place and Mag columns. The output is a CSV with a tab separator.

I think you need:
from pandas.io.json import json_normalize  # in pandas 1.0+ this is pd.json_normalize

# flatten the nested features
df = json_normalize(eq['features'])
# get the first and second values of the coordinate lists
# (GeoJSON coordinates are ordered [longitude, latitude, depth])
df['Longitude'] = df['geometry.coordinates'].str[0]
df['Lat'] = df['geometry.coordinates'].str[1]
# rename the original column names
df = df.rename(columns={'properties.title': 'Title',
                        'properties.place': 'Place',
                        'properties.mag': 'Mag'})
# keep only the necessary columns
df = df[['Lat', 'Longitude', 'Title', 'Place', 'Mag']]
print(df.head())
         Lat   Longitude                                        Title  \
0  38.860700 -118.895700        M 1.0 - 27km ESE of Yerington, Nevada
1  40.676333 -124.254833  M 2.5 - 7km SW of Humboldt Hill, California
2  31.622500 -116.020000       M 2.6 - 53km ESE of Maneadero, B.C., MX
3  36.698667 -121.328167    M 2.1 - 13km SSE of Ridgemark, California
4  33.140500 -115.614500            M 1.5 - 10km W of Calipatria, CA
Place Mag
0 27km ESE of Yerington, Nevada 1.00
1 7km SW of Humboldt Hill, California 2.52
2 53km ESE of Maneadero, B.C., MX 2.57
3 13km SSE of Ridgemark, California 2.06
4 10km W of Calipatria, CA 1.45
# write to file with a tab separator ('earthquakes.csv' is a placeholder filename)
df.to_csv('earthquakes.csv', sep='\t', index=False)
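If you would rather avoid json_normalize, a minimal alternative sketch that builds the rows directly from the parsed GeoJSON (it assumes the same eq dict as above):

import pandas as pd

# build one plain dict per feature, then let pandas assemble the frame
rows = [{'Lat': f['geometry']['coordinates'][1],
         'Longitude': f['geometry']['coordinates'][0],
         'Title': f['properties']['title'],
         'Place': f['properties']['place'],
         'Mag': f['properties']['mag']}
        for f in eq['features']]
df = pd.DataFrame(rows)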

Related

Python Text File to Data Frame with Specific Pattern

I am trying to convert a bunch of text files into a data frame using Pandas.
Each text file contains simple text which starts with two relevant pieces of information: the Number and the Register variables.
Then the text files have some random text that should not be taken into consideration.
Last, the text files contain information such as the share number, the name of the person, birth date, address, and some additional rows that start with a lowercase letter. Each group contains such information, and the pattern is always the same: the first row of a group is defined by a number (hereby id), followed by the word "SHARE".
Here is an example:
Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000
I need to transform the text into a data frame with the following output, where each group is stored in one row:
Number  Register  City    Id  Share    Name          Born        c          f          h          i
01600   4314      London  1   73/1284  John Smith    1960-01-01  NaN        4222/2001  1334/2000  5774/2000
01600   4314      London  4   58/1284  Boris Morgan  1965-01-01  4222/1988  4222/2000  NaN        NaN
My initial approach was to first import the text file and apply regular expression for each case:
import re
import pandas as pd

df = open(r'Test.txt', 'r').read()
for line in re.findall('SHARE.*', df):
    print(line)
But probably there is a better way to do it.
Any help is highly appreciated. Thanks in advance.
This can be done without regex with list comprehension and splitting strings:
import pandas as pd
text = '''Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000'''
text = [i.strip() for i in text.splitlines()] # create a list of lines
data = []
# extract metadata from first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]
# create a list of the index numbers of the lines where new items start
indices = [text.index(i) for i in text if 'SHARE' in i]
# split the list by the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0]+indices, indices+[None])][1:]
for item in items:
    d = {'Number': number, 'Register': register, 'City': city,
         'Id': int(item[0].split()[0]), 'Share': item[0].split(': ')[1],
         'Name': item[1], 'Born': item[2].split()[1]}
    # split the remaining lines into [letter, value] pairs
    pairs = [s.split() for s in item[3:]]
    merged_pairs = []
    for pair in pairs:
        if len(pair[0]) == 1 and pair[0].isalpha():
            merged_pairs.append(pair)
        else:
            # continuation of the previous line: append to its value
            merged_pairs[-1][-1] = merged_pairs[-1][-1] + pair[0]
    d.update({name: value for name, value in merged_pairs})
    data.append(d)

# load the list of dicts as a dataframe
df = pd.DataFrame(data)
Output:
  Number  Register    City  Id    Share          Name        Born          f          h          i          c
0  01600      4314  London   1  73/1284    John Smith  1960-01-01  4222/2001  1334/2000  5774/2000        NaN
1  01600      4314  London   4  58/1284  Boris Morgan  1965-01-01  4222/2000        NaN        NaN  4222/1988

How to convert from pandas dataframe to a dictionary

I have looked at Convert a Pandas DataFrame to a dictionary for guidance on converting my dataframe to a dictionary. However, I can't seem to change my code to convert my output into a dictionary.
Below is my code.
import collections
import pandas as pd

# read the CSV; it contains dates (parse_dates=True)
governmentcsv = pd.read_csv('government-procurement-via-gebiz.csv', parse_dates=True)
# keep only these columns
extract = governmentcsv.loc[:, ['supplier_name', 'award_date']]
extract.award_date = pd.to_datetime(extract.award_date)

def extract_supplier_not_na_2015():
    # keep only year 2015 and suppliers that are not 'na'
    notNAFifteen = extract[(extract.supplier_name != 'na') & (extract.award_date.dt.year == 2015)]
    notNAFifteen.reset_index(drop=True, inplace=True)  # reset index
    notNAFifteen.index += 1  # and make the index start from 1
    #SortednotNAFifteen = collections.orderedDictionary(notNAFifteen)
    return notNAFifteen

print(extract_supplier_not_na_2015())
The OUTPUT is:
supplier_name award_date
1 RMA CONTRACTS PTE. LTD. 2015-01-28
2 TESCOM (SINGAPORE) SOFTWARE SYSTEMS TESTING PT... 2015-07-01
3 MKS GLOBAL PTE. LTD. 2015-04-24
4 CERTIS TECHNOLOGY (SINGAPORE) PTE. LTD. 2015-06-26
5 RHT COMPLIANCE SOLUTIONS PTE. LTD. 2015-08-14
6 CLEANMAGE PTE. LTD. 2015-07-30
7 SOLUTIONSATWORK PTE. LTD. 2015-11-23
8 Ufinity Pte Ltd 2015-05-04
9 NCS PTE. LTD. 2015-01-28
I think I found the dataset:
https://data.gov.sg/dataset/government-procurement
Anyway, here is the code:
import pandas as pd

df = pd.read_csv('government-procurement-via-gebiz.csv',
                 encoding='unicode_escape',
                 parse_dates=['award_date'],
                 infer_datetime_format=True,
                 usecols=['supplier_name', 'award_date'],
                 )
df = df[(df['supplier_name'] != 'Unknown') & (df['award_date'].dt.year == 2015)].reset_index(drop=True)

# faster way:
d1 = df.set_index('supplier_name').to_dict()['award_date']
# alternatively:
d2 = dict(zip(df['supplier_name'], df['award_date']))
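Note that both variants keep only one award_date per supplier name, since dict keys are unique. If duplicated suppliers matter, a short sketch that collects every date per supplier instead:

# a sketch assuming duplicate supplier names should keep all their dates
d3 = df.groupby('supplier_name')['award_date'].apply(list).to_dict()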

How do I replace the similar looking values in a pandas dataframe?

I am new to Pandas. I have the following data types in my dataset. (The dataset is Indian Startup Funding downloaded from Kaggle.)
Date datetime64[ns]
StartupName object
IndustryVertical object
CityLocation object
InvestorsName object
InvestmentType object
AmountInUSD object
dtype: object
data['AmountInUSD'].groupby(data['CityLocation']).describe()
I ran the above operation and found that many city values are near-duplicates, for example:
Bangalore
Bangalore / Palo Alto
Bangalore / SFO
Bangalore / San Mateo
Bangalore / USA
Bangalore/ Bangkok
I want to do the following operation, but I do not know the code for it:
In the column CityLocation, find all cells which start with 'Bang' and replace them all with 'Bangalore'. Help will be appreciated.
I did this
data[data.CityLocation.str.startswith('Bang')]
and I do not know what to do after this.
You can use the loc function to find the values in your column whose substring matches, and replace them with the value of your choosing.
import pandas as pd
df = pd.DataFrame({'CityLocation': ['Bangalore', 'Dangerlore', 'Bangalore/USA'], 'Values': [1, 2, 3]})
print(df)
# CityLocation Values
# 0 Bangalore 1
# 1 Dangerlore 2
# 2 Bangalore/USA 3
df.loc[df.CityLocation.str.startswith('Bang'), 'CityLocation'] = 'Bangalore'
print(df)
# CityLocation Values
# 0 Bangalore 1
# 1 Dangerlore 2
# 2 Bangalore 3
pandas 0.23 has a nice way to handle text. See the docs Working with Text Data. You can use regular expressions to capture and replace text.
import pandas as pd
df = pd.DataFrame({'CityLocation': ["Bangalore / Palo Alto", "Bangalore / SFO", "Other"]})
df['CityLocation'] = df['CityLocation'].str.replace("^Bang.*", "Bangalore")
print(df)
Will yield
CityLocation
0 Bangalore
1 Bangalore
2 Other
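If several cities need the same prefix normalization, a small sketch generalizing either approach with a loop (the city list here is hypothetical):

import pandas as pd

df = pd.DataFrame({'CityLocation': ['Bangalore / SFO', 'Delhi / NCR', 'Mumbai']})
# hypothetical list of canonical city names to normalize by prefix
for city in ['Bangalore', 'Delhi', 'Mumbai']:
    df.loc[df['CityLocation'].str.startswith(city), 'CityLocation'] = city
print(df)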

Pandas read JSON into Excel

I am trying to parse JSON data from a URL. I have fetched the data and parsed it into a dataframe, but from the looks of it I am missing a step.
The data comes back in JSON format, but in Excel my data frame has only two columns: the entry number and the raw JSON text.
import urllib.request
import json
import pandas

with urllib.request.urlopen("https://raw.githubusercontent.com/gavinr/usa-mcdonalds-locations/master/mcdonalds.geojson") as url:
    data = json.loads(url.read().decode())
print(data)

json_parsed = json.dumps(data)
print(json_parsed)

df = pandas.read_json(json_parsed)
writer = pandas.ExcelWriter('Mcdonaldsstorelist.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()
I believe you can use json_normalize:
df = pd.io.json.json_normalize(data['features'])  # pd.json_normalize in pandas 1.0+
df.head()
geometry.coordinates geometry.type properties.address \
0 [-80.140924, 25.789141] Point 1601 ALTON RD
1 [-80.218683, 25.765501] Point 1400 SW 8TH ST
2 [-80.185108, 25.849872] Point 8116 BISCAYNE BLVD
3 [-80.37197, 25.550894] Point 23351 SW 112TH AVE
4 [-80.36734, 25.579132] Point 10855 CARIBBEAN BLVD
properties.archCard properties.city properties.driveThru \
0 Y MIAMI BEACH Y
1 Y MIAMI Y
2 Y MIAMI Y
3 N HOMESTEAD Y
4 Y MIAMI Y
properties.freeWifi properties.phone properties.playplace properties.state \
0 Y (305)672-7055 N FL
1 Y (305)285-0974 Y FL
2 Y (305)756-0400 N FL
3 Y (305)258-7837 N FL
4 Y (305)254-3487 Y FL
properties.storeNumber properties.storeType properties.storeUrl \
0 14372 FREESTANDING http://www.mcflorida.com/14372
1 7408 FREESTANDING http://www.mcflorida.com/7408
2 11511 FREESTANDING http://www.mcflorida.com/11511
3 34014 FREESTANDING NaN
4 12215 FREESTANDING http://www.mcflorida.com/12215
properties.zip type
0 33139-2420 Feature
1 33135 Feature
2 33138 Feature
3 33032 Feature
4 33157 Feature
df.columns
Index(['geometry.coordinates', 'geometry.type', 'properties.address',
'properties.archCard', 'properties.city', 'properties.driveThru',
'properties.freeWifi', 'properties.phone', 'properties.playplace',
'properties.state', 'properties.storeNumber', 'properties.storeType',
'properties.storeUrl', 'properties.zip', 'type'],
dtype='object')
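To finish the original goal, the flattened frame can then be written to Excel in place of the raw JSON dump. A short sketch, assuming data is the dict parsed in the question (the filename comes from the question too):

import pandas as pd

df = pd.io.json.json_normalize(data['features'])
# write the flattened table instead of the raw JSON text
df.to_excel('Mcdonaldsstorelist.xlsx', sheet_name='Sheet1', index=False)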

Not able to pull data from Census API because it is rejecting calls

I am trying to run this script to extract data from the US Census, but the Census API is rejecting my pulls. I did a bit of work but am stumped. Any ideas on how to deal with this?
import sys
import pandas as pd
import requests

# sourced from https://github.com/mortada/fredapi
from fredapi import Fred
fred = Fred(api_key='xxxx')

if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

year_list = '2013', '2014', '2015', '2016', '2017'
month_list = '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'

#############################################
# Get the total exports from the United States
#############################################
exports = pd.DataFrame()
for i in year_list:
    for s in month_list:
        try:
            link = "https://api.census.gov/data/timeseries/intltrade/exports/hs?get=CTY_CODE,CTY_NAME,ALL_VAL_MO,ALL_VAL_YR&time="
            total_link = link + i + '-' + s
            r = requests.get(total_link, headers={'User-agent': 'your bot 0.1'})
            df = pd.read_csv(StringIO(r.text))
            # drop the total sales line
            df = df.drop(df.index[0])
            # rename the columns
            df.columns = ['CTY_CODE', 'CTY_NAME', 'EXPORT MTH', 'EXPORT YR', 'time', 'UN']
            # change the ["1234" to 1234
            df['CTY_CODE'] = df['CTY_CODE'].str[2:-1]
            # change the 2017-01] to 2017-01
            df['time'] = df['time'].str[:-1]
            exports = exports.append(df, ignore_index=False)
        except Exception:
            print(i)
            print(s)
Here you go:
import ast
import itertools
import pandas as pd
import requests

base = "https://api.census.gov/data/timeseries/intltrade/exports/hs?get=CTY_CODE,CTY_NAME,ALL_VAL_MO,ALL_VAL_YR&time="
year_list = ['2013', '2014', '2015', '2016', '2017']
month_list = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']

exports = []
rejects = []
for year, month in itertools.product(year_list, month_list):
    url = '%s%s-%s' % (base, year, month)
    r = requests.get(url, headers={'User-agent': 'your bot 0.1'})
    if r.text:
        r = ast.literal_eval(r.text)
        df = pd.DataFrame(r[2:], columns=r[0])
        exports.append(df)
    else:
        rejects.append((int(year), int(month)))
exports = pd.concat(exports).reset_index().drop('index', axis=1)
Your result looks like this:
CTY_CODE CTY_NAME ALL_VAL_MO ALL_VAL_YR time
0 1010 GREENLAND 233446 233446 2013-01
1 1220 CANADA 23170845914 23170845914 2013-01
2 2010 MEXICO 17902453702 17902453702 2013-01
3 2050 GUATEMALA 425978783 425978783 2013-01
4 2080 BELIZE 17795867 17795867 2013-01
5 2110 EL SALVADOR 207606613 207606613 2013-01
6 2150 HONDURAS 429806151 429806151 2013-01
7 2190 NICARAGUA 75752432 75752432 2013-01
8 2230 COSTA RICA 598484187 598484187 2013-01
9 2250 PANAMA 1046236431 1046236431 2013-01
10 2320 BERMUDA 47156737 47156737 2013-01
11 2360 BAHAMAS 256292297 256292297 2013-01
... ... ... ... ...
13883 0024 LAFTA 27790655209 193139639307 2017-07
13884 0025 EURO AREA 15994685459 121039479852 2017-07
13885 0026 APEC 76654291110 550552655105 2017-07
13886 0027 ASEAN 6030380132 44558200533 2017-07
13887 0028 CACM 2133048149 13333440411 2017-07
13888 1XXX NORTH AMERICA 41622877949 299981278306 2017-07
13889 2XXX CENTRAL AMERICA 4697852283 30756310800 2017-07
13890 3XXX SOUTH AMERICA 8117215081 55039567414 2017-07
13891 4XXX EUROPE 25201247938 189925038230 2017-07
13892 5XXX ASIA 38329181070 274304503490 2017-07
13893 6XXX AUSTRALIA AND OC... 2389798925 16656777753 2017-07
13894 7XXX AFRICA 1809443365 13022520158 2017-07
Walkthrough:
itertools.product iterates over the (year, month) combinations, joining each with your base URL.
If the text of the response object is not blank (periods such as 2017-12 will be blank), create a DataFrame out of the literally-evaluated text, which is a list of lists; use the first element as the columns and skip the second element (the totals line that your own code also drops). A simpler JSON-based variant is sketched below.
Otherwise, add the (year, month) combo to rejects, a list of tuples of the items not found.
I used exports = [] because it is much more efficient to concatenate a list of DataFrames than to repeatedly append to an existing DataFrame.
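Since the endpoint returns JSON, requests' built-in decoder can stand in for ast.literal_eval. A minimal sketch of the same loop under that assumption (the year/month ranges are shortened here just to keep the example quick):

import itertools
import pandas as pd
import requests

base = "https://api.census.gov/data/timeseries/intltrade/exports/hs?get=CTY_CODE,CTY_NAME,ALL_VAL_MO,ALL_VAL_YR&time="
exports = []
rejects = []
for year, month in itertools.product(['2013', '2014'], ['01', '02']):
    r = requests.get('%s%s-%s' % (base, year, month), headers={'User-agent': 'your bot 0.1'})
    if r.ok and r.text:
        payload = r.json()  # a list of lists: header row first
        exports.append(pd.DataFrame(payload[2:], columns=payload[0]))
    else:
        rejects.append((int(year), int(month)))
exports = pd.concat(exports, ignore_index=True)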
