Pandas read JSON into Excel - python

I am trying to parse JSON data from a URL. I have fetched the data and parsed it into a dataframe, but from the looks of it I am missing a step:
the data lands in Excel as raw JSON, and my dataframe has only two columns, an entry number and the JSON text.
import urllib.request
import json
import pandas

with urllib.request.urlopen("https://raw.githubusercontent.com/gavinr/usa-mcdonalds-locations/master/mcdonalds.geojson") as url:
    data = json.loads(url.read().decode())

print(data)
json_parsed = json.dumps(data)
print(json_parsed)
df = pandas.read_json(json_parsed)
writer = pandas.ExcelWriter('Mcdonaldsstorelist.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()

I believe you can use json_normalize:
df = pd.io.json.json_normalize(data['features'])
df.head()
geometry.coordinates geometry.type properties.address \
0 [-80.140924, 25.789141] Point 1601 ALTON RD
1 [-80.218683, 25.765501] Point 1400 SW 8TH ST
2 [-80.185108, 25.849872] Point 8116 BISCAYNE BLVD
3 [-80.37197, 25.550894] Point 23351 SW 112TH AVE
4 [-80.36734, 25.579132] Point 10855 CARIBBEAN BLVD
properties.archCard properties.city properties.driveThru \
0 Y MIAMI BEACH Y
1 Y MIAMI Y
2 Y MIAMI Y
3 N HOMESTEAD Y
4 Y MIAMI Y
properties.freeWifi properties.phone properties.playplace properties.state \
0 Y (305)672-7055 N FL
1 Y (305)285-0974 Y FL
2 Y (305)756-0400 N FL
3 Y (305)258-7837 N FL
4 Y (305)254-3487 Y FL
properties.storeNumber properties.storeType properties.storeUrl \
0 14372 FREESTANDING http://www.mcflorida.com/14372
1 7408 FREESTANDING http://www.mcflorida.com/7408
2 11511 FREESTANDING http://www.mcflorida.com/11511
3 34014 FREESTANDING NaN
4 12215 FREESTANDING http://www.mcflorida.com/12215
properties.zip type
0 33139-2420 Feature
1 33135 Feature
2 33138 Feature
3 33032 Feature
4 33157 Feature
df.columns
Index(['geometry.coordinates', 'geometry.type', 'properties.address',
'properties.archCard', 'properties.city', 'properties.driveThru',
'properties.freeWifi', 'properties.phone', 'properties.playplace',
'properties.state', 'properties.storeNumber', 'properties.storeType',
'properties.storeUrl', 'properties.zip', 'type'],
dtype='object')
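For completeness, the flattened frame can be written straight to Excel, which was the original goal. A minimal end-to-end sketch (pd.json_normalize assumes pandas >= 1.0; the older pd.io.json.json_normalize path shown above is deprecated there):

import urllib.request
import json
import pandas as pd

url = "https://raw.githubusercontent.com/gavinr/usa-mcdonalds-locations/master/mcdonalds.geojson"
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read().decode())

# one row per store, nested keys flattened to dotted column names
df = pd.json_normalize(data['features'])

# write the flattened frame to Excel (no intermediate json.dumps round-trip needed)
df.to_excel('Mcdonaldsstorelist.xlsx', sheet_name='Sheet1', index=False)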

Related

Python: Replace multiple old values with a new value in Pandas

I've been searching around for a while now, but I can't seem to find the answer to this small problem.
I have this code to replace values:
import pandas as pd

df = pd.DataFrame({'Name': ['al', 'el', 'naila', 'dori', 'jlo'],
                   'living': ['Alvando', 'Georgia GG', 'Newyork NY', 'Indiana IN', 'Florida FL'],
                   'sample2': ['malang', 'kaltim', 'ambon', 'jepara', 'sragen'],
                   'output': ['KOTA', 'KAB', 'WILAYAH', 'KAB', 'DAERAH']})
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0)
df = df.replace('KAB', 1)
But I am expecting this output from simpler code that doesn't repeat replace:
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
I've tried using np.where, but it doesn't give the desired result: every row comes out 0, even where the original value should map to 1.
df['output'] = pd.DataFrame({'output':np.where(df == "KAB", 1, 0).reshape(-1, )})
This code should work for you:
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0).replace('KAB', 1)
Output:
>>> df
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
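As a side note, replace also accepts a single dict, which maps every old value to its new value in one call. A minimal sketch over the same data:

# one replace call instead of two chained ones
df['output'] = df['output'].replace({'KOTA': 0, 'WILAYAH': 0, 'DAERAH': 0, 'KAB': 1})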

How to check whether a text column contains a specific string in pandas

I have the following dataframe in pandas:
job_desig salary
senior analyst 12
junior researcher 5
scientist 20
sr analyst 12
Now I want to generate one column with a flag set as below:
sr = ['senior','sr']
job_desig salary senior_profile
senior analyst 12 1
junior researcher 5 0
scientist 20 0
sr analyst 12 1
I am doing the following in pandas:
df['senior_profile'] = [1 if x.str.contains(sr) else 0 for x in df['job_desig']]
You can join all the values of the list with | for regex OR, pass the pattern to Series.str.contains, and finally cast the boolean result to integer, mapping True/False to 1/0:
df['senior_profile'] = df['job_desig'].str.contains('|'.join(sr)).astype(int)
If necessary, use word boundaries (without them, 'sr' would also match inside longer words such as 'nursery'):
pat = '|'.join(r"\b{}\b".format(x) for x in sr)
df['senior_profile'] = df['job_desig'].str.contains(pat).astype(int)
print (df)
job_desig salary senior_profile
0 senior analyst 12 1
1 junior researcher 5 0
2 scientist 20 0
3 sr analyst 12 1
Solution with sets, if the list contains only single-word values:
df['senior_profile'] = [int(bool(set(sr).intersection(x.split()))) for x in df['job_desig']]
You can also do it simply by chaining str.contains checks:
df['senior_profile'] = df['job_desig'].str.contains('senior') | df['job_desig'].str.contains('sr')
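If the match should ignore case, Series.str.contains accepts a case flag; a small sketch on top of the accepted approach:

# case-insensitive flag column ('SENIOR Analyst' would also match)
df['senior_profile'] = df['job_desig'].str.contains('|'.join(sr), case=False).astype(int)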

pandas - top count items after groupby on multiple columns

I have a dataframe grouped by two columns (X, Y) with a count of the elements in Z. The idea is to find the top 2 counts of elements across X, Y.
The grouped dataframe looks like:
mostCountYInX = df.groupby(['X','Y'],as_index=False).count()
C X Y Z
USA NY NY 5
USA NY BR 14
USA NJ JC 40
USA FL MI 3
IND MAH MUM 4
IND KAR BLR 2
IND KER TVM 2
CHN HK HK 3
CHN SH SH 3
Individually, I can extract the information I am looking for:
XTopCountInTopY = mostCountYInX[mostCountYInX['X'] == 'NY']
XTopCountInTopY = XTopCountInTopY.nlargest(2, 'Z')
Above, I knew which group I was looking for (X = 'NY') and got its top 2 records. Is there a way to get them all together?
Say I am interested in IND and USA; then the expected output is:
C X Y Z
USA NJ JC 40
USA NY BR 14
IND MAH MUM 4
IND KAR BLR 2
I think you need to groupby on the index with sort=False, then apply a lambda that sorts each group by Z with ascending=False, take the top 2 rows, and reset_index:
mask = df.index.isin(['USA', 'IND'])
df = df[mask].groupby(df[mask].index, sort=False).\
        apply(lambda x: x.sort_values('Z', ascending=False)[:2]).\
        reset_index(level=0, drop=True)
print(df)
X Y Z
USA NJ JC 40
USA NY BR 14
IND MAH MUM 4
IND KAR BLR 2
EDIT: After the OP changed the dataframe to include the C column:
mask = df['C'].isin(['USA', 'IND'])
df = df[mask].groupby('C', sort=False).\
        apply(lambda x: x.sort_values('Z', ascending=False)[:2]).\
        reset_index(drop=True)
print(df)
C X Y Z
0 USA NJ JC 40
1 USA NY BR 14
2 IND MAH MUM 4
3 IND KAR BLR 2
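As an alternative sketch, the same top-2-per-group result can usually be had without apply by sorting once and then taking groupby(...).head(2), assuming the edited frame with the C column:

mask = df['C'].isin(['USA', 'IND'])
out = (df[mask]
       .sort_values('Z', ascending=False)  # order all rows by count first
       .groupby('C', sort=False)
       .head(2)                            # keep the top 2 rows per country
       .reset_index(drop=True))
print(out)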

API result (complex nested JSON) into a pandas DataFrame

I need to extract some columns from an API. I tried:
# importing requests
import requests as re
# importing csv
import csv
# importing pandas
import pandas as pd
# assigning the URL to the url variable
url = "https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&starttime=2016-10-01&endtime=2016-10-02"
# fetching the URL
data = re.get(url)
# put the parsed JSON in the eq variable
eq = data.json()
# result we can see here
eq['features']

def obtain_data(eq):
    i = 0
    print('Lat\tLongitude\tTitle\tPlace\tMag')
    while i < len(eq['features']):
        print(str(eq['features'][i]['geometry']['coordinates'][0]) + '\t' +
              str(eq['features'][i]['geometry']['coordinates'][1]) + '\t' +
              str(eq['features'][i]['properties']['title']) + '\t' +
              str(eq['features'][i]['properties']['place']) + '\t' +
              str(eq['features'][i]['properties']['mag']))
        i = i + 1

final_data = obtain_data(eq)
I need to split the coordinates into 2 columns, Lat and Longitude, and also extract the Title, Place and Mag columns. The output should be a CSV with a tab separator.
I think you need:
from pandas.io.json import json_normalize

# extract data (eq is the parsed JSON from the question)
df = json_normalize(eq['features'])
# get latitude and longitude (GeoJSON coordinates are [longitude, latitude])
df['Lat'] = df['geometry.coordinates'].str[1]
df['Longitude'] = df['geometry.coordinates'].str[0]
# rename the original column names
df = df.rename(columns={'properties.title': 'Title',
                        'properties.place': 'Place',
                        'properties.mag': 'Mag'})
# filter only the necessary columns
df = df[['Lat', 'Longitude', 'Title', 'Place', 'Mag']]
print(df.head())
         Lat   Longitude                                        Title  \
0  38.860700 -118.895700        M 1.0 - 27km ESE of Yerington, Nevada
1  40.676333 -124.254833  M 2.5 - 7km SW of Humboldt Hill, California
2  31.622500 -116.020000      M 2.6 - 53km ESE of Maneadero, B.C., MX
3  36.698667 -121.328167    M 2.1 - 13km SSE of Ridgemark, California
4  33.140500 -115.614500             M 1.5 - 10km W of Calipatria, CA
Place Mag
0 27km ESE of Yerington, Nevada 1.00
1 7km SW of Humboldt Hill, California 2.52
2 53km ESE of Maneadero, B.C., MX 2.57
3 13km SSE of Ridgemark, California 2.06
4 10km W of Calipatria, CA 1.45
# write to a tab-separated file ('file' is a placeholder for your output path)
df.to_csv(file, sep='\t', index=False)
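For reference, the same frame can be built without json_normalize by walking the nested dicts directly, much like the question's loop. A sketch over the eq variable from the question (again using [longitude, latitude] order for the coordinates):

rows = [{'Lat': f['geometry']['coordinates'][1],
         'Longitude': f['geometry']['coordinates'][0],
         'Title': f['properties']['title'],
         'Place': f['properties']['place'],
         'Mag': f['properties']['mag']}
        for f in eq['features']]
df = pd.DataFrame(rows)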

Setting values when iterating through a DataFrame

I have a dictionary of states (for example, IA: Iowa). I have loaded the dictionary into a DataFrame, byState_df.
Then I am importing a CSV with state death counts that I want to add to byState_df as I read the lines:
byState_df = pd.DataFrame(states.items())
byState_df['Deaths'] = 0
df['Deaths'] = df['Deaths'].convert_objects(convert_numeric=True)
print byState_df
for index, row in df.iterrows():
    if row['Area'] in states:
        byState_df[(byState_df[0] == row['Area'])]['Deaths'] = row['Deaths']
print byState_df
but byState_df is still all 0 afterwards:
0 1 Deaths
0 WA Washington 0
1 WI Wisconsin 0
2 WV West Virginia 0
3 FL Florida 0
4 WY Wyoming 0
5 NH New Hampshire 0
6 NJ New Jersey 0
7 NM New Mexico 0
8 NA National 0
I tested row['Deaths'] while it iterates and it produces the correct values; it just seems the assignment into byState_df never takes effect.
Can you try the following code, which uses .loc instead of chained [][] indexing? The chained form selects into a temporary copy, so the assignment never reaches the original frame:
byState_df = pd.DataFrame(states.items())
byState_df['Deaths'] = 0
df['Deaths'] = df['Deaths'].convert_objects(convert_numeric=True)
print byState_df
for index, row in df.iterrows():
    if row['Area'] in states:
        byState_df.loc[byState_df[0] == row['Area'], 'Deaths'] = row['Deaths']
print byState_df
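As a side note, the iterrows loop can be avoided entirely with a vectorized lookup; a sketch assuming df has one row per Area value matching the state codes in column 0:

# align death counts onto the state codes via an Area-indexed Series
deaths_by_area = df.set_index('Area')['Deaths']
byState_df['Deaths'] = byState_df[0].map(deaths_by_area).fillna(0)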
