I have a dataframe "states" that has each state's child poverty rate, and a JSON file called "us_states". I want to create a choropleth map using Plotly Express, but I'm struggling to create the id column. Here is my entire code:
import pandas as pd
import json
import plotly.express as px
states = pd.read_csv('https://raw.githubusercontent.com/ngpsu22/Child-Poverty-State-Map/master/poverty_rate_map.csv')
us_states = pd.read_json('https://github.com/ngpsu22/Child-Poverty-State-Map/raw/master/gz_2010_us_040_00_500k.json')
state_id_map = {}
for feature in us_states['features']:
    feature['id'] = feature['properties']['NAME']
    state_id_map[feature['properties']['STATE']] = feature['id']
states['id'] = states['state'].apply(lambda x: state_id_map[x])
But I get this error:
KeyError: 'Maine'
Since Maine is the first state in my dataframe, the lookup is failing on the very first row. Any suggestions?
Each element of us_states.features is a dict.
Use pd.json_normalize to extract the dict into a dataframe.
'geometry.coordinates' for each row is a large nested list
It's not clear what the loop is supposed to do; the data from the two dataframes can be joined together for easier access using pd.merge.
us_states = pd.read_json('https://github.com/ngpsu22/Child-Poverty-State-Map/raw/master/gz_2010_us_040_00_500k.json')
# convert the dict to dataframe
us_states_features = pd.json_normalize(us_states.features, sep='_')
# the NAME column is then addressed with
us_states_features['properties_NAME']
# join the two dataframe into one
df = pd.merge(states, us_states_features, left_on='state', right_on='properties_NAME')
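To see concretely why the merge avoids the KeyError, here is a minimal runnable sketch with made-up stand-in features (the real ones come from the GeoJSON URL above); json_normalize flattens the nested properties into the join key:

```python
import pandas as pd

# made-up stand-ins for two features of the real GeoJSON
states = pd.DataFrame({'state': ['Maine', 'Texas'],
                       'child_poverty_rate': [15.0, 21.0]})
features = [{'properties': {'NAME': 'Maine', 'STATE': '23'}},
            {'properties': {'NAME': 'Texas', 'STATE': '48'}}]

# flatten the nested dicts into columns like 'properties_NAME'
us_states_features = pd.json_normalize(features, sep='_')
# join on the state name instead of looking it up in a dict
df = pd.merge(states, us_states_features,
              left_on='state', right_on='properties_NAME')
print(df[['state', 'child_poverty_rate', 'properties_STATE']])
```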
I am extracting an HTML table from the web with pandas.
From the result (a list of DataFrame objects) I want to return all DataFrames where a cell value is an element of a given array.
So far I am struggling to compare a single column value rather than the whole object.
Syntax of the table (the header lines are not extracted correctly, so this is the real output):
0           1              2       3
Date        Name           Number  Text
09.09.2022  Smith Jason    3290    Free Car Wash
12.03.2022  Betty Paulsen  231     10l Gasoline
import pandas as pd
import numpy as np
url = f'https://some_website.com'
df = pd.read_html(url)
arr_Nr = ['3290', '9273']
def correct_number():
    for el in df[0][1]:
        if el in arr_Nr:
            return True

def get_winner():
    for el in df:
        if el in arr_Nr:
            return el

print(get_winner())
With the function correct_number() I can tell that there is a winner, but I can't get the details when I try get_winner().
EDIT
So far I think I got one step closer: read_html() returns a list of DataFrame objects. In my example there is only one table, so accessing it via df = dfs[0] should give me the correct DataFrame object.
But when I try the following, the code doesn't work as expected: no filter is applied and the table is returned in full:
df2 = df[df.Number == '3290']
print(df2)
Okay, I finally figured it out:
pandas returns a list of DataFrame objects; in my example there is only one table, so I had to access that DataFrame first.
Before I could compare the values, I parsed them to integers, because pandas extracted them as strings, so my array couldn't compare them properly.
In the end the code looks more elegant than I expected:
import pandas as pd
import numpy as np
url = f'https://mywebsite.com/winners-2022'
dfs_list = pd.read_html(url, header=0, flavor='bs4')
df = dfs_list[0]
winner_nrs = [3290, 843]
result = df[df.Losnummer.astype(int).isin(winner_nrs)]
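The astype/isin filter can be sanity-checked on a toy frame that mimics the scraped table (the column name Losnummer and all values are stand-ins from the example above):

```python
import pandas as pd

# toy table mimicking what read_html returns (values come back as strings)
df = pd.DataFrame({
    'Date': ['09.09.2022', '12.03.2022'],
    'Name': ['Smith Jason', 'Betty Paulsen'],
    'Losnummer': ['3290', '231'],
    'Text': ['Free Car Wash', '10l Gasoline'],
})
winner_nrs = [3290, 843]

# cast the string column to int before comparing against the int array
result = df[df.Losnummer.astype(int).isin(winner_nrs)]
print(result.Name.tolist())  # ['Smith Jason']
```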
I'm trying to create multiple empty DataFrames with a for loop, where each DataFrame has a unique name stored in a list. Per the sample code below, I would like three empty DataFrames: one called A, another B, and the last one C. Thank you.
import pandas as pd
report=['A','B','C']
for i in report:
    report[i] = pd.DataFrame()
It would be best to use a dictionary
import pandas as pd
report=['A','B','C']
df_dict = {}
for i in report:
    df_dict[i] = pd.DataFrame()
print(df_dict['A'])
print(df_dict['B'])
print(df_dict['C'])
You should use a dictionary for that:
import pandas as pd
report = {'A': pd.DataFrame(), 'B': pd.DataFrame(), 'C': pd.DataFrame()}
If you have a list of strings containing the names, which I think is what you are really trying to do:
name_dataframe = ['A', 'B', 'C']
dict_dataframe = {}
for name in name_dataframe:
    dict_dataframe[name] = pd.DataFrame()
It is not good practice, and you should probably use a dictionary for this, but the code below gets the work done if you still need it. It creates the DataFrames in memory with the names in the list report:
for i in report:
    exec(i + ' = pd.DataFrame()')
And if you want to store the empty DataFrames in a list:
df_list = []
for i in report:
    exec(i + ' = pd.DataFrame()\ndf_list.append(' + i + ')')
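For completeness, the dictionary approach recommended above can be written as a one-line comprehension, which sidesteps exec entirely:

```python
import pandas as pd

report = ['A', 'B', 'C']
# one empty DataFrame per name, keyed by that name
df_dict = {name: pd.DataFrame() for name in report}
print(df_dict['A'].empty)  # True
```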
I have two dataframes of unequal size: one contains each cuisine style along with its frequency in the dataset, and the other is the original dataset with restaurant names and the corresponding cuisine. I want to add a new column to the original dataset showing the frequency value of each cuisine, taken from the frequency dataframe. What is the best way to do that? I have tried merge, but that creates NaN values. Please suggest a solution.
I tried the code snippet suggested below, but it did not give me the required result: it generates freq for the first row and excludes the other rows with the same 'name' value.
df = df.assign(freq=0)
# get all the cuisine styles in the cuisine df
for cuisine in np.unique(cuisine_df['cuisine_style']):
    # get the freq
    freq = cuisine_df.loc[cuisine_df['cuisine_style'] == cuisine, 'freq'].values
    # update value in main df
    df.loc[df['cuisine_style'] == cuisine, 'freq'] = freq
Result dataframe
I re-ran the code on your dataset and still got the same results. Here is the code I ran:
import pandas as pd
import numpy as np

# used to set 'Cuisine Style' to the first style in the array of values
def getCusinie(row):
    arr = row['Cuisine Style'].split("'")
    return arr[1]

# read in the data set; use the first column as the index and drop NaNs for ease of use
csv = pd.read_csv('TA_restaurants_curated.csv', index_col=0).dropna()
# get cuisine values
cuisines = csv.apply(lambda row: getCusinie(row), axis=1)
# update dataframe
csv['Cuisine Style'] = cuisines
# dict used to quickly make a new data frame with meaningless frequencies
c = {'Cuisine Style': np.unique(csv['Cuisine Style']), 'freq': range(113)}
cuisine_df = pd.DataFrame(c)
# add 'freq' column to the original DataFrame
csv = csv.assign(freq=0)
# same loop as before
for cuisine in np.unique(cuisine_df['Cuisine Style']):
    # get the freq
    freq = cuisine_df.loc[cuisine_df['Cuisine Style'] == cuisine, 'freq'].values
    # update value in main df
    csv.loc[csv['Cuisine Style'] == cuisine, 'freq'] = freq
Output:
As you can see, every row, including duplicates, has been updated. If they still are not being updated, I'd check that the names are actually equal, i.e. make sure there aren't any hidden spaces or similar causing issues.
You can read up on selecting and indexing DataFrames here.
It's quite long, but you can pick apart what you need when you need it.
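As a sketch of an alternative (with made-up cuisines and frequencies), the same lookup can be done in one step with Series.map, which fills every matching row, duplicates included:

```python
import pandas as pd

# made-up stand-ins for the two dataframes in the question
df = pd.DataFrame({'name': ['R1', 'R2', 'R3'],
                   'cuisine_style': ['Italian', 'Thai', 'Italian']})
cuisine_df = pd.DataFrame({'cuisine_style': ['Italian', 'Thai'],
                           'freq': [2, 1]})

# build a cuisine -> freq lookup and map it onto the main frame
lookup = cuisine_df.set_index('cuisine_style')['freq']
df['freq'] = df['cuisine_style'].map(lookup)
print(df['freq'].tolist())  # [2, 1, 2]
```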
My code works perfectly fine for one dataframe using to_json.
However, now I would like to have a second dataframe in the result,
so I thought creating a dictionary would be the answer.
However, it produces the result below, which is not practical.
Any help please?
I was hoping to produce something a lot prettier without all the "\".
A simple good example
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
df.to_json(orient='records')
A simple bad example
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
{"result_1": df.to_json(orient='records')}
I also tried
jsonify({"result_1": df.to_json(orient='records')})
and
{"result_1": [df.to_json(orient='records')]}
Hi, I think you are on the right track.
My advice is to also use json.loads to decode the JSON and create a list of dictionaries.
As you said, we can create a pandas dataframe and then use df.to_json to convert it.
Then use json.loads on the JSON data and insert the resulting list into a dictionary, e.g.:
import json

data = {}
jsdf = df.to_json(orient="records")
data["result"] = json.loads(jsdf)
Adding elements to the dictionary like this, you will end up with something like:
{"result1": [{...}], "result2": [{...}]}
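Putting that together, a minimal runnable sketch (names and ages are made up) that combines two dataframes into one clean JSON string without the escaped quotes:

```python
import json
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Jack'], 'Age': [28, 34]})
df2 = pd.DataFrame({'Name': ['Steve', 'Ricky'], 'Age': [29, 42]})

data = {}
# decode each dataframe's JSON back into Python objects first...
data['result_1'] = json.loads(df1.to_json(orient='records'))
data['result_2'] = json.loads(df2.to_json(orient='records'))
# ...then serialize the whole dict once, so nothing is double-encoded
print(json.dumps(data))
```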
PS: if you want to generate random values for a dataframe, you can use the Faker library for Python, e.g.:
from faker import Faker

faker = Faker()
profiles = []
for n in range(5):
    profiles.append(list(faker.profile().values()))
df = pd.DataFrame(profiles, columns=faker.profile().keys())
I'm trying to use Python to read my CSV file, extract specific columns into a pandas DataFrame, and show that DataFrame. However, I don't see the dataframe; I receive Series([], dtype: object) as the output. Below is the code that I'm working with.
My document consists of:
product sub_product issue sub_issue consumer_complaint_narrative
company_public_response company state zipcode tags
consumer_consent_provided submitted_via date_sent_to_company
company_response_to_consumer timely_response consumer_disputed?
complaint_id
I want to extract :
sub_product issue sub_issue consumer_complaint_narrative
import pandas as pd
df=pd.read_csv("C:\\....\\consumer_complaints.csv")
df=df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd
input_file = "C:\\....\\consumer_complaints.csv"
dataset = pd.read_csv(input_file)
df = pd.DataFrame(dataset)
cols = [1,2,3,4]
df = df[df.columns[cols]]
Specify the column numbers you want to select here; in a dataframe, columns start from index 0:
cols = []
You can also select columns by name. Just use the following line:
df = df[["Column Name","Column Name2"]]
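Both selection styles can be checked on a toy one-row frame standing in for the complaints CSV (column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame([['a', 'b', 'c', 'd', 'e']],
                  columns=['product', 'sub_product', 'issue',
                           'sub_issue', 'consumer_complaint_narrative'])

# select by position: columns 1-4
by_pos = df[df.columns[[1, 2, 3, 4]]]
# select the same columns by name
by_name = df[['sub_product', 'issue', 'sub_issue',
              'consumer_complaint_narrative']]
print(by_pos.columns.tolist())
```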
A simple way to achieve this would be as follows:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:,'B':'F']
Hope that helps.
This worked for me, using positional slicing with iloc:
df = pd.read_csv(...)
df1 = df.iloc[:, n1:n2]
where n1 < n2 are both column positions in the range, e.g. if you want columns 3-5, use
df1 = df.iloc[:, 3:5]
For the first column, use
df1 = df.iloc[:, 0]
Though I'm not sure how to select a discontinuous range of columns.
We can also use iloc. Given data in dataset2:
dataset2.iloc[:3, [1, 2]]
will spit out the top 3 rows of columns 2-3 (remember, numbering starts at 0).