I'm interested in creating a choropleth map at the county level with Python. When I run my code without binding any data, the county lines are drawn in beautifully. However, whenever I try to bind my data I get KeyError: None.
From my searching it appeared this is due to values in the GeoJSON not matching up with the values in the data file... but I went in manually and checked, and have already edited the data so there are exactly the same number of rows and exactly the same values... still getting the same error. Very frustrating :(
My code:
import folium
from folium import plugins
from folium.plugins import Fullscreen
import pandas as pd
# Raw strings so the backslashes aren't treated as escape sequences ('\f' is a form feed)
county_geo = r'Desktop\counties.json'
county_data = r'Desktop\fips.csv'
# Read into Dataframe, cast to string for consistency.
df = pd.read_csv(county_data, na_values=[' '])
df['FIPS'] = df['FIPS'].astype(str)
m = folium.Map(location=[48, -102], zoom_start=3)
m.choropleth(geo_path=county_geo,
             data=df,
             columns=['FIPS', 'Value'],
             key_on='feature.properties.id',
             fill_color='PuBu')
Fullscreen().add_to(m)
m
And my error:
KeyError: None
Out[32]:
folium.folium.Map at 0x10231748
Any advice or example code/files that are working for you on a county level would be much appreciated!
EDIT:
I found my own error.
key_on='feature.properties.id',
Should be:
key_on='feature.id',
import json

# Collect the ids present in the GeoJSON and compare them to the FIPS column
keys = [k['id'] for k in json.load(open(r'Desktop\counties.json'))['features']]
missing_keys = set(keys) - set(plot_data['FIPS'])

dicts = []
for k in missing_keys:
    dicts.append({'FIPS': k, 'Value': 0})

mapdata = plot_data
mapdata = mapdata.append(dicts, ignore_index=True)
This will find the keys that are missing from the DataFrame and create new rows for them with a value of 0. That should resolve your KeyError.
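One more thing worth checking, since it bites a lot of people with county-level data (this is an assumption about your CSV, not something visible in the question): county FIPS codes are five digits with leading zeros, and read_csv silently drops those zeros if it parses the column as integers, so '01001' becomes '1001' and never matches the GeoJSON ids. A minimal sketch of the fix:
# Hypothetical fix, assuming the FIPS column lost its leading zeros on read:
# pad back to five digits so the strings match the GeoJSON feature ids again.
df['FIPS'] = df['FIPS'].astype(str).str.zfill(5)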
I wanted to try uploading a series of items to test.wikidata, creating each item and then adding a statement of inception (P571). The CSV file sometimes has a date value and sometimes not. When no date value is given, I want to write out the placeholder 'some value'.
Imagine a dataframe like this:
df = pd.DataFrame({'Object': [1, 2, 3], 'Date': [250, None, 300]})
However, I am not sure how to use Pywikibot to iterate over a csv file, creating an item for each row and adding a statement. Here is the code I wrote:
import pywikibot
import pandas as pd
site = pywikibot.Site("test", "wikidata")
repo = site.data_repository()
df = pd.read_csv('experiment.csv')
item = pywikibot.ItemPage(repo)
for item in df:
    date = df['date']
    prop_date = pywikibot.Claim(repo, u'P571')
    if date == '':
        prop_date.setSnakType('somevalue')
    else:
        target = pywikibot.WbTime(year=date)
        prop_date.setTarget(target)
    item.addClaim(prop_date)
When I run this through PAWS, I get the message: KeyError: 'date'
But I think the real issue here is that I am not sure how to get Pywikibot to iterate over each row of the dataframe and create a new claim for each new date value. I would value any feedback or suggestions for good examples and documentation. Many thanks!
Looking back on this, the solution was to use .itertuples() (or .iterrows(), or .loc[]) to access the values in each row. Two fixes on top of that: the loop variable must not shadow the ItemPage (each row should get its own fresh item), and empty CSV cells come back as NaN rather than '', so test with pd.isna().
So
for row in df.itertuples():
    item = pywikibot.ItemPage(repo)  # a fresh item for each row
    prop_date = pywikibot.Claim(repo, u'P571')
    if pd.isna(row.Date):
        prop_date.setSnakType('somevalue')
    else:
        target = pywikibot.WbTime(year=int(row.Date))
        prop_date.setTarget(target)
    item.addClaim(prop_date)
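For completeness, here is the same loop with .iterrows(), which yields (index, row) pairs where row is a Series indexed by column name; a sketch under the same assumptions as above:
for _, row in df.iterrows():
    item = pywikibot.ItemPage(repo)
    prop_date = pywikibot.Claim(repo, u'P571')
    if pd.isna(row['Date']):
        prop_date.setSnakType('somevalue')
    else:
        prop_date.setTarget(pywikibot.WbTime(year=int(row['Date'])))
    item.addClaim(prop_date)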
I am new to Spyder and am working with the KDD Cup 1999 data. I am trying to create charts based on the dataset, such as totals of srv_error rates. However, when I try to create these charts, errors pop up, and there are a few I can't solve. I have commented the code. Does anyone know what is wrong with it?
# Used to import all packages and/or libraries you will be using
# pd loads and creates the data table or dataframe
import pandas as pd
#### Section for loading data
# If the data file extension is xlsx, use read_excel; if it is csv, use read_csv
# As this is stored in the same area, the absolute path can remain unchanged
df = pd.read_csv('kddcupdata1.csv')
#Pulls specific details
#Pulls first five rows
df.head()
#Pulls first three rows
df.head(3)
#Setting column names
df.columns = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'lnum_compromised', 'lroot_shell', 'lsu_attempted', 'lnum_root', 'lnum_file_creations', 'lnum_shells', 'lnum_access_files', 'lnum_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'label']
#Scatter graph for number of failed logins caused by srv serror rate
df.plot(kind='scatter',x='num_failed_logins',y='srv_serror_rate',color='red')
#This works
#Total num_failed_logins caused by srv_error_rate
# making a dict of lists
info = {'Attack': ['dst_host_same_srv_rate', 'dst_host_srv_rerror_rate'],
        'Num': [0, 1]}
otd = pd.DataFrame(info)
# sum of all 'Num' values stored in 'total'
otd['total'] = otd['Num'].sum()
print(otd)
##################################################################################
#Charts that do not work
import matplotlib.pyplot as plt
#1 ERROR MESSAGE - AttributeError: 'list' object has no attribute 'lsu_attempted'
#Bar chart showing total 1su attempts
df['lsu_attempted'] = df['lsu_attempted'].astype(int)
df = ({'lsu_attempted':[1]})
df['lsu_attempted'].lsu_attempted(sort=0).plot.bar()
ax = df.plot.bar(x='super user attempts', y='Total of super user attempts', rot=0)
df.from_dict('all super user attempts', orient='index')
df.transpose()
#2 ERROR MESSAGE - TypeError: plot got an unexpected keyword argument 'x'
#A simple line plot
plt.plot(kind='bar',x='protocol_type',y='lsu_attempted')
#3 ERROR MESSAGE - TypeError: 'set' object is not subscriptable
df['lsu_attempted'] = df['lsu_attempted'].astype(int)
df = ({'lsu_attempted'})
df['lsu_attempted'].lsu_attempted(sort=0).plot.bar()
ax = df.plot.bar(x='protocol_type', y='lsu_attempted', rot=0)
df.from_dict('all super user attempts', orient='index')
df.transpose()
#5 ERROR MESSAGE - TypeError: 'dict' object is not callable
#Bar chart showing total of chosen protocols used
Data = {'protocol_types': ['tcp','icmp'],
'number of protocols used': [10,20,30]
}
bar = df(Data,columns=['protocol_types','number of protocols used'])
bar.plot(x ='protocol_types', y='number of protocols used', kind = 'bar')
df.show()
Note: if anyone has a clear explanation of what this dataset is about, that would also be helpful. Please link sources if possible!
Your first error comes from this snippet:
df['lsu_attempted'] = df['lsu_attempted'].astype(int)
df = ({'lsu_attempted':[1]})
df['lsu_attempted'].lsu_attempted(sort=0).plot.bar()
ax = df.plot.bar(x='super user attempts', y='Total of super user attempts', rot=0)
df.from_dict('all super user attempts', orient='index')
df.transpose()
The error you get, AttributeError: 'list' object has no attribute 'lsu_attempted', is a result of line 2 above.
Initially df is a pandas DataFrame (line 1 above), but after line 2, df = ({'lsu_attempted':[1]}), df is a dictionary with one key, 'lsu_attempted', whose value is a single-element list.
So in line 3, when you do df['lsu_attempted'] (the first part of that statement), it evaluates to that single-element list, and a list doesn't have an lsu_attempted attribute.
I have no idea what you were trying to achieve, but my strong guess is that you did not intend to replace your DataFrame with a single-key dictionary.
Your 2nd error is easy: you are calling plt.plot incorrectly. x is not a keyword argument (see the matplotlib.pyplot.plot documentation); x and y are positional arguments, and plt.plot has no kind keyword either.
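For reference, a minimal sketch of what that snippet was presumably aiming for; the choice of protocol_type as the column is my assumption:
import matplotlib.pyplot as plt

# plt.plot has no 'kind', 'x', or 'y' keywords; for a bar chart of counts
# per protocol, pandas' own plotting API is the simplest route
df['protocol_type'].value_counts().plot(kind='bar')
plt.show()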
Your 3rd error message results from the same pattern: by that point you had rebound df to a set, ({'lsu_attempted'}), and sets are not subscriptable. The 5th error is related: df is no longer anything callable, so bar = df(Data, columns=...) fails; you presumably meant pd.DataFrame(Data, columns=...).
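And a sketch of what snippet #5 was presumably after: build the frame with pd.DataFrame (not by calling df), keeping the two lists the same length (the original had two protocol names but three numbers):
import pandas as pd
import matplotlib.pyplot as plt

data = {'protocol_types': ['tcp', 'icmp', 'udp'],
        'number of protocols used': [10, 20, 30]}
bar = pd.DataFrame(data, columns=['protocol_types', 'number of protocols used'])
bar.plot(x='protocol_types', y='number of protocols used', kind='bar')
plt.show()  # plt.show(), not df.show(); DataFrames have no show() method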
I'm working on a data frame taken from Adafruit IO, and sadly some of my data is from a time when my project malfunctioned, so some of the values are just equal to NaN.
I tried to remove them with these lines of code:
onlyValidData = temp_data.mask(temp_data['value'] == 'NaN')
onlyValidData
This is data retrieved from an Adafruit IO feed, being analyzed with pandas. I tried using the 'where' function too, but it didn't work.
My entire code is:
import pandas as pd
temp_data = pd.read_json('https://io.adafruit.com/api/(...)')
light_data = pd.read_json('https://io.adafruit.com/api/(...)')
temp_data['created_at'] = pd.to_datetime(temp_data['created_at'], infer_datetime_format=True)
temp_data = temp_data.set_index('created_at')
light_data['created_at'] = pd.to_datetime(light_data['created_at'], infer_datetime_format=True)
light_data = light_data.set_index('created_at')
tempVals = pd.Series(temp_data['value'])
lightVals = pd.Series(light_data['value'])
onlyValidData = temp_data.mask(temp_data['value'] == 'NaN')
onlyValidData
The output is all of my data for some reason, but it should be only the valid values.
Hey, I think the issue here is that you're looking for values equal to the string 'NaN', while actual NaN values aren't a string, or more specifically, aren't anything.
Try using:
onlyValidData = temp_data.mask(temp_data['value'].isnull())
Edit: to remove the rows entirely, rather than marking every value in those rows as NaN:
onlyValidData = temp_data.dropna()
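A tiny self-contained illustration of the difference (toy data, not the actual feed):
import pandas as pd
import numpy as np

demo = pd.DataFrame({'value': [21.5, np.nan, 22.0]})

# mask() keeps the row but blanks out every cell where the condition holds
print(demo.mask(demo['value'].isnull()))

# dropna() removes the offending row altogether
print(demo.dropna())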
I have a list of named polygons:
from shapely.geometry import Point, Polygon  # Point/Polygon are used below
import pandas as pd
import geopandas as gp

df = gp.GeoDataFrame([['a', Polygon([(1, 0), (1, 1), (2, 2), (1, 2)])],
                      ['b', Polygon([(1, 1), (2, 2), (3, 1)])]],
                     columns=['name', 'geometry'])
df = gp.GeoDataFrame(df, geometry='geometry')
and a list of named points:
points = gp.GeoDataFrame([['box', Point(1.5, 1.75)],
                          ['cone', Point(3.0, 2.0)],
                          ['triangle', Point(2.5, 1.25)]],
                         columns=['id', 'geometry'],
                         geometry='geometry')
Currently, I am running a for loop over these points and polygons to see which point falls within which polygon, appending their ids and names to a list loc like so:
loc = []
for geo1, name in zip(df['geometry'], df['name']):
    for geo2, id in zip(points['geometry'], points['id']):
        if geo1.contains(geo2):
            loc.append([id, name])
Now what I want to do is alter the loop so it adds a column called 'inside' to the points dataframe, holding True if the point is in a polygon and False if it isn't.
I've tried:
points['inside'] = ''
for geo1 in df['geometry']:
    for geo2 in points['geometry']:
        if geo1.contains(geo2):
            points['inside'].append('True')
but it doesn't work.
How can I best do this?
Sorry if there is a very basic answer that I have missed.
It's been suggested below that this might be a duplicate of another question; however, the linked question does not cover adding the results to a column, and while the Matplotlib methodology may be faster, when I run the example script provided I get the error: float() argument must be a string or a number, not 'zip'.
You are trying to append to a string, which is why nothing happens. Assigning points['inside'] = [] won't work either: pandas requires a column to have one value per row. Instead, build a plain Python list with one True/False entry per point, then assign it as the column in one go:
inside = []
for geo2 in points['geometry']:
    inside.append(any(geo1.contains(geo2) for geo1 in df['geometry']))
points['inside'] = inside
This works for me...
Hope you find this helpful!
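If you'd rather skip the explicit loops entirely, geopandas can do this in one vectorized step; a sketch, assuming a geopandas version that exposes unary_union on the GeoDataFrame:
# a point is 'inside' if it falls within the union of all polygons
points['inside'] = points.within(df.unary_union)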
I've pulled some stock data from Quandl for both Crude Oil prices (WTI) and Caterpillar (CAT). When I concatenate the two dataframes I'm left with some NaNs. My ultimate goal is to run scipy's pearsonr() to assess the correlation (along with p-values), but I can't get it to work because of all the NaNs, so I'm trying to clean them up. When I use the .fillna() function it doesn't seem to work. I've even tried .interpolate() as well as .dropna(). None of them appear to work. Here is my working code:
import Quandl
import pandas as pd
import numpy as np
#WTI Data#
WTI_daily = Quandl.get("DOE/RWTC", collapse="daily",trim_start="1986-10-10", trim_end="1986-10-15")
WTI_daily.columns = ['WTI']
#CAT Data
CAT_daily = Quandl.get("YAHOO/CAT.6", collapse = "daily",trim_start="1986-10-10", trim_end="1986-10-15")
CAT_daily.columns = ['CAT']
#Combine Data Frames
daily_price_df = pd.concat([CAT_daily, WTI_daily], axis=1)
print daily_price_df
#Verify they are dataFrames:
def really_a_df(var):
    if isinstance(var, pd.DataFrame):
        print "DATAFRAME SUCCESS"
    else:
        print "Wahh Wahh"
    return 'done'
print really_a_df(daily_price_df)
#Fill NAs
#CAN'T GET THIS TO WORK!!
daily_price_df.fillna(method='pad', limit=8)
print daily_price_df
# Try to interpolate
#CAN'T GET THIS TO WORK!!
daily_price_df.interpolate()
print daily_price_df
#Drop NAs
#CAN'T GET THIS TO WORK!!
daily_price_df.dropna(axis=1)
print daily_price_df
For what it's worth, I've managed to get the function working when I create a dataframe from scratch using this code:
import pandas as pd
import numpy as np
d = {'a' : 0., 'b' : 1., 'c' : 2.,'d':None,'e':6}
d_series = pd.Series(d, index=['a', 'b', 'c', 'd','e'])
d_df = pd.DataFrame(d_series)
d_df = d_df.fillna(method='pad')
print d_df
Initially I thought that perhaps my data wasn't in dataframe form, but a simple test confirmed that both are in fact dataframes. The only conclusion that remains (in my opinion) is that it is something about the structure of the Quandl dataframe, or possibly its time-series nature. Please know I'm somewhat new to Python, so structure answers for a beginner/novice. Any help is much appreciated!
Pot shot: have you just forgotten to assign the result, or to use the inplace flag?
daily_price_df = daily_price_df.fillna(method='pad', limit=8)
OR
daily_price_df.fillna(method='pad', limit=8, inplace=True)
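The same applies to the other calls in the question: interpolate() and dropna() also return new frames rather than modifying the original, so each result needs to be assigned (a sketch):
daily_price_df = daily_price_df.interpolate()
daily_price_df = daily_price_df.dropna()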