I am trying to generate a fake dataset for my research using the Faker library, but I am unable to link the gender to the person's first name. Could anyone help with this? The function is given below.
def faker_categorical(num=1, seed=None):
    np.random.seed(seed)
    fake.seed_instance(seed)
    output = [
        {
            "gender": np.random.choice(["M", "F"], p=[0.5, 0.5]),
            "GivenName": fake.first_name_male() if "gender"=="M" else fake.first_name_female(),
            "Surname": fake.last_name(),
            "Zipcode": fake.zipcode(),
            "Date of Birth": fake.date_of_birth(),
            "country": np.random.choice(["United Kingdom", "France", "Belgium"]),
        }
        for x in range(num)
    ]
    return output
df = pd.DataFrame(faker_categorical(num=1000))
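Your question is unclear, but I guess you are looking for a way to refer to the result of np.random.choice() from two different places in your code. In your version, "gender"=="M" compares two string literals, so it is always False and you always get a female name. The fix is easy: assign the result to a temporary variable, then refer to that variable from both places.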
def faker_categorical(num=1, seed=None):
    np.random.seed(seed)
    fake.seed_instance(seed)
    output = []
    for x in range(num):
        gender = np.random.choice(["M", "F"], p=[0.5, 0.5])
        output.append(
            {
                "gender": gender,
                "GivenName": fake.first_name_male() if gender=="M" else fake.first_name_female(),
                "Surname": fake.last_name(),
                "Zipcode": fake.zipcode(),
                "Date of Birth": fake.date_of_birth(),
                "country": np.random.choice(["United Kingdom", "France", "Belgium"]),
            })
    return output
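To illustrate why the literal comparison never selects a male name, here is a minimal stdlib-only sketch. The name lists and the pick_person helper are hypothetical stand-ins for fake.first_name_male() and fake.first_name_female(); they are not part of Faker.

```python
import random

# Hypothetical stand-ins for Faker's name providers.
MALE_NAMES = ["John", "Peter", "Liam"]
FEMALE_NAMES = ["Maria", "Anna", "Chloe"]

def pick_person(rng):
    # Draw the gender once and reuse the variable for both fields.
    gender = rng.choice(["M", "F"])
    names = MALE_NAMES if gender == "M" else FEMALE_NAMES
    return {"gender": gender, "GivenName": rng.choice(names)}

# The comparison from the question involves two literals, not the data.
always_false = ("gender" == "M")

rng = random.Random(0)
people = [pick_person(rng) for _ in range(100)]
```

Because the gender is stored in a variable before the dictionary is built, the name and the gender field always agree.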
There is research on classification that links a name to a gender; for example, John is 99.8% male and Maria is 99.8% female. You can read it here and can also download a .csv file that maps different names to genders. When I needed fake data about people, I parsed that dataset: if the name was there, I assigned the classified gender; if it wasn't (because of locale differences or something else), I just assigned np.random.choice(["MALE", "FEMALE"]). Hope this helps.
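The lookup-with-fallback idea could be sketched roughly as follows. The NAME_TO_GENDER table and the gender_for helper are hypothetical; a real table would be loaded from the downloaded .csv file, whose column layout is not given here.

```python
import random

# Hypothetical excerpt of a name-to-gender table; in practice this would
# be built from the downloaded .csv file.
NAME_TO_GENDER = {"john": "MALE", "maria": "FEMALE"}

def gender_for(name, rng=random):
    # Use the classified gender when the name is known,
    # otherwise fall back to a random choice.
    known = NAME_TO_GENDER.get(name.strip().lower())
    return known if known is not None else rng.choice(["MALE", "FEMALE"])

print(gender_for("John"))
print(gender_for("Maria"))
```

Normalising the name (strip and lowercase) before the lookup is a design choice of this sketch, not something the dataset requires.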
I have a list of countries and their cities on a website. I take all the names of the countries and their capitals from this list, and want to put them in a JSON file like this:
[
{
"country": "Afghanistan",
"city": "Kabul"
},
{
"country": "Aland Islands",
"city": "Mariehamn"
}
]
Here's my code:
cells = soup.table('td')
count = 0
cities_list.write('[\n')
for cell in cells:
    if count == len(cells)-2:
        break
    else:
        cities_list.write(json.dumps({"country": "{}".format(cells[count].getText()),
                                      "city": "{}".format(cells[count].next_sibling.getText())},
                                     indent=2))
        count += 2
cities_list.write('\n]')
My problem is that the objects are not separated by commas:
[
{
"country": "Afghanistan",
"city": "Kabul"
}{
"country": "Aland Islands",
"city": "Mariehamn"
}
]
How can I separate my objects with commas? Also, is it possible to avoid writing '[\n' at the beginning and '\n]' at the end by hand?
Python obviously has no way to know that you are writing a list of objects when you are writing them one at a time ... so just don't.
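cells = soup.table('td')
cities = []
# Take every second cell (the country); [:-2] mirrors the original loop,
# which stops before the final pair of cells.
for cell in cells[:-2:2]:
    cities.append({"country": str(cell.getText()),
                   "city": str(cell.next_sibling.getText())})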
json.dump(cities, cities_list)
Notice also how "{}".format(value) is just a clumsy way to write str(value) (or just value if value is already a string) and how json.dump lets you pass an open file handle to write to.
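As a rough, self-contained illustration of the same approach, the sketch below builds the list first and serialises it in one call. The pairs list is a hypothetical stand-in for the scraped table cells, and io.StringIO stands in for the open file handle cities_list.

```python
import io
import json

# Hypothetical (country, capital) pairs standing in for the scraped cells.
pairs = [("Afghanistan", "Kabul"), ("Aland Islands", "Mariehamn")]
cities = [{"country": country, "city": city} for country, city in pairs]

# StringIO stands in for the open file handle (cities_list in the question).
cities_list = io.StringIO()
json.dump(cities, cities_list, indent=2)
text = cities_list.getvalue()
print(text)
```

Passing indent=2 gives the multi-line layout shown in the question, and the brackets and commas are written by json.dump itself.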
I have some json data similar to this...
{
"people": [
{
"name": "billy",
"age": "12"
...
...
},
{
"name": "karl",
"age": "31"
...
...
},
...
...
]
}
At the moment I can do this to get an entry from the people list...
wantedPerson = "karl"
for person in people:
    if person['name'] == wantedPerson:
        * I have the persons entry *
        break
Is there a better way of doing this? Something similar to how we can .get('key') ?
Thanks,
Chris
Assuming you load that JSON data using the standard library's json module, you're fairly close to optimal. Perhaps you were looking for something like this:
from json import loads
text = '{"people": [{"name": "billy", "age": "12"}, {"name": "karl", "age": "31"}]}'
data = loads(text)
people = [p for p in data['people'] if p['name'] == 'karl']
If you frequently need to access this data, you might just do something like this:
all_people = {p['name']: p for p in data['people']}
print(all_people['karl'])
That is, all_people becomes a dictionary that uses the name as a key, so you can access any person in it quickly by accessing them by name. This assumes however that there are no duplicate names in your data.
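If duplicate names are possible, one option is to map each name to a list of matching entries instead. This is a minimal sketch; the duplicate "karl" record is hypothetical and added only to show the behaviour.

```python
from collections import defaultdict

data = {"people": [
    {"name": "billy", "age": "12"},
    {"name": "karl", "age": "31"},
    {"name": "karl", "age": "45"},  # hypothetical duplicate name
]}

# Group entries by name so that no record is silently overwritten.
people_by_name = defaultdict(list)
for person in data["people"]:
    people_by_name[person["name"]].append(person)

# .get() avoids creating an empty entry for names that are absent.
karls = people_by_name.get("karl", [])
print(karls)
```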
First, there's no problem with your current 'naive' approach - it's clear and efficient since you can't find the value you're looking for without scanning the list.
It seems that by "better" you mean "shorter", so if you want a one-liner solution, consider the following:
next((person for person in people if person['name'] == wantedPerson), None)
It gets the first person in the list that has the required name or None if no such person was found.
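Wrapped in a small function, the same idea might look like the sketch below. The find_person name is hypothetical, and using .get("name") to tolerate entries without a name is an assumption of this sketch.

```python
people = [{"name": "billy", "age": "12"}, {"name": "karl", "age": "31"}]

def find_person(people, wanted_name):
    # Stops at the first match; returns None when nothing matches.
    # .get() tolerates entries that lack a "name" key.
    return next((p for p in people if p.get("name") == wanted_name), None)

found = find_person(people, "karl")
missing = find_person(people, "zed")
```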
Similarly:
ps = {
    "people": [
        {
            "name": "billy",
            "age": "12"
        },
        {
            "name": "karl",
            "age": "31"
        },
    ]
}
print([x for x in ps['people'] if 'karl' in x.values()])
For possible alternatives or details, see e.g. "Get key by value in dictionary".
I have an issue with using Jupyter widgets, namely dropdowns, to build a workflow. My current attempt isn't working. I am aiming to do the following:
1. Run a function that produces a list.
2. Feed this list into a dropdown, from which I select one value (x).
3. Pass x to another function that has a dictionary; it picks up all values associated with this key and produces another list.
4. Feed that list into another dropdown, from which I'd pick one value for processing.
The issue I am running into is that I can get the first list produced and fed into a dropdown. However, the subsequent list is not captured; the function is captured instead, which of course fails down the road. Let me illustrate with some code:
This bit of code simply goes through a list of dictionaries, and places all the unique league instances into a list:
def league_names():
    league_list = []
    data_filenames = [data_file for data_file in os.listdir()
                      if data_file.endswith('.json')]
    with open(data_filenames[0]) as json_file:
        data = json.load(json_file)
        for x in data:
            if x['Competition'] is not None and x['Competition'] not in league_list:
                league_list.append(x['Competition'])
    return league_list[1:]
The following then takes the selected league, searches the same set of dictionaries for all the teams that are part of that league, and adds them to a list.
def team_names(league_select):
    team_list = []
    data_filenames = [data_file for data_file in os.listdir()
                      if data_file.endswith('.json')]
    with open(data_filenames[0]) as json_file:
        data = json.load(json_file)
        for x in data:
            if x['Competition'] == league_select and x['Team'] not in team_list:
                team_list.append(x['Team'])
    return team_list
The way I want to interact with this is that the first league list is passed to a dropdown, from which you pick a league. This passes the league to the second function, which pulls all the teams. This is done with the following:
def league_interact():
    choice = interact(team_names, league_select=league_names())
    return type(choice)
league_interact()
This works in that the list is successfully passed through. What I simply cannot get to work is turning the result of interact into a variable that I can then pass to a subsequent function for further processing.
Below is an example of the json content:
[{"Team": "Yeovil Town FC", "Gender": "M", "Competition": "National League", "Earliest Season": "2003-2004", "Latest Season": "2020-2021", "Total Seasons": "18", "Championships": "1", "Other Names": "", "Code": "bd5179b9", "Prefix": "Yeovil-Town-Stats"},
{"Team": "Yeovil Town LFC", "Gender": "F", "Competition": "", "Earliest Season": "2017", "Latest Season": "2018-2019", "Total Seasons": "3", "Championships": "", "Other Names": "", "Code": "a506e4a2", "Prefix": "Yeovil-Town-Women-Stats"},
{"Team": "York City FC", "Gender": "M", "Competition": "", "Earliest Season": "2002-2003", "Latest Season": "2019-2020", "Total Seasons": "13", "Championships": "0", "Other Names": "", "Code": "e272e7a8", "Prefix": "York-City-Stats"},
{"Team": "Yorkshire Amateur AFC", "Gender": "M", "Competition": "", "Earliest Season": "2019-2020", "Latest Season": "2020-2021", "Total Seasons": "0", "Championships": "", "Other Names": "", "Code": "66379800", "Prefix": "Yorkshire-Amateur-AFC-Stats"}]
Question: In the above case, how would I use interact to obtain the list created by the first choice, rather than a function? I have pulled the type here, and it is a 'function' rather than a list as expected. I tried using .value and some derivatives, but none of them produced a value. Any idea how to approach this so that I can produce a secondary dropdown?
I've tried the following, but I am getting an error:
def league_interact():
    choice = interact(team_names, league_select=league_names())
    return choice
def team_interact():
    choice2 = interact(team_code, team_select=league_interact())
team_interact()
Error: ValueError: <function team_names at 0x0000021359D20B80> cannot be transformed to a widget
Thanks! I did trawl through the documentation, but how to approach this didn't quite click with me.
OK, so I actually managed to figure this out:
@interact(league = league_box)
def choose_both(league):
    team_box.options = team_names(league_box.value)
    return
@interact_manual(team = team_box, use_season = season_box)
def choose_team(team, use_season):
    return team_choice_cap(team_data(team),use_season)
def team_choice_cap(data_set, use_season):
    code = data_set['Code']
    prefix = data_set['Prefix']
    return parse_seasons(code,prefix,use_season)
The interact call above updates the options of the second dropdown (team_box) whenever a league is chosen, and interact_manual then pulls up the rest of the details for the selected team on a manual call.
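On the data side, the options for both dropdowns could be prepared in one pass over the records rather than re-reading the file on every selection. The sketch below is widget-free and uses only the standard library; teams_by_league is a hypothetical helper, the two records are trimmed from the sample JSON, and skipping empty competition names is an assumption of this sketch.

```python
import json
from collections import defaultdict

# Two records in the shape of the sample JSON (fields trimmed for brevity).
raw = '''[
 {"Team": "Yeovil Town FC", "Competition": "National League"},
 {"Team": "York City FC", "Competition": ""}
]'''

def teams_by_league(records):
    # Map each non-empty competition name to its list of unique teams.
    mapping = defaultdict(list)
    for rec in records:
        league = rec.get("Competition")
        if league and rec["Team"] not in mapping[league]:
            mapping[league].append(rec["Team"])
    return dict(mapping)

leagues = teams_by_league(json.loads(raw))
# list(leagues) could feed the first dropdown, leagues[choice] the second.
print(leagues)
```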
I have a json structured like this:
{ "status":"OK", "copyright":"Copyright (c) 2017 Pro Publica Inc. All Rights Reserved.","results":[
{
"member_id": "B001288",
"total_votes": "100",
"offset": "0",
"votes": [
{
"member_id": "B001288",
"chamber": "Senate",
"congress": "115",
"session": "1",
"roll_call": "84",
"bill": {
"number": "H.J.Res.57",
"bill_uri": "https://api.propublica.org/congress/v1/115/bills/hjres57.json",
"title": "Providing for congressional disapproval under chapter 8 of title 5, United States Code, of the rule submitted by the Department of Education relating to accountability and State plans under the Elementary and Secondary Education Act of 1965.",
"latest_action": "Message on Senate action sent to the House."
},
"description": "A joint resolution providing for congressional disapproval under chapter 8 of title 5, United States Code, of the rule submitted by the Department of Education relating to accountability and State ...",
"question": "On the Joint Resolution",
"date": "2017-03-09",
"time": "12:02:00",
"position": "No"
},
Sometimes the "bill" parameter is there, sometimes it is blank, like:
{
"member_id": "B001288",
"chamber": "Senate",
"congress": "115",
"session": "1",
"roll_call": "79",
"bill": {
},
"description": "James Richard Perry, of Texas, to be Secretary of Energy",
"question": "On the Nomination",
"date": "2017-03-02",
"time": "13:46:00",
"position": "No"
},
I want to access and store the "bill_uri" in a list, so I can access it later on. I've already performed .json() through the requests package to process it into python. print votes_json["results"][0]["votes"][0]["bill"]["bill_uri"] etc. works just fine, but when I do:
bill_urls_2 = []
for n in range(0, len(votes_json["results"][0]["votes"])):
    if votes_json["results"][0]["votes"][n]["bill"]["bill_uri"] in votes_json["results"][0]["votes"][n]:
        bill_urls_2.append(votes_json["results"][0]["votes"][n])["bill"]["bill_uri"]
print bill_urls_2
I get the error KeyError: 'bill_uri'. I think I have a problem with the structure of the if statement, specifically what key I'm looking for in the dictionary. Could someone provide an explanation/link to explanation about how to use in to find keys? Or pinpoint the error in how I'm using it?
Update: Aha! I got this to work:
bill_urls_2 = []
for n in range(0, len(votes_json["results"][0]["votes"])):
    if "bill" in votes_json["results"][0]["votes"][n]:
        if "bill_uri" in votes_json["results"][0]["votes"][n]["bill"]:
            bill_urls_2.append(votes_json["results"][0]["votes"][n]["bill"]["bill_uri"])
print bill_urls_2
Thank you to everyone who gave me advice.
The error here is caused by the fact that you are checking for a key by first indexing the dictionary with that same key. Here's a small example:
my_dict = {'A': 1, 'B':2, 'C':3}
Now C may or may not exist in the dict every time. This is how I can check if C exists in the dict:
if 'C' in my_dict:
    print(True)
What you are doing is:
if my_dict['C'] in my_dict:
    print(True)
If C doesn't exist to begin with, my_dict['C'] raises a KeyError before the in check ever runs.
What you need to do is check for "bill_uri" inside the "bill" dictionary:
bill_urls_2 = []
for n in range(0, len(votes_json["results"][0]["votes"])):
    if "bill_uri" in votes_json["results"][0]["votes"][n].get("bill", {}):
        bill_urls_2.append(votes_json["results"][0]["votes"][n]["bill"]["bill_uri"])
print bill_urls_2
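An alternative that avoids the nested membership checks is to chain dict.get() with a default. This is a minimal Python 3 sketch; the votes_json literal is a trimmed, partly hypothetical stand-in for the API response (the third vote, with no "bill" key at all, is invented for illustration).

```python
votes_json = {"results": [{"votes": [
    {"roll_call": "84",
     "bill": {"bill_uri": "https://api.propublica.org/congress/v1/115/bills/hjres57.json"}},
    {"roll_call": "79", "bill": {}},
    {"roll_call": "78"},  # hypothetical vote with no "bill" key at all
]}]}

bill_urls = []
for vote in votes_json["results"][0]["votes"]:
    # A missing "bill" becomes {}, and a missing "bill_uri" becomes None.
    uri = vote.get("bill", {}).get("bill_uri")
    if uri is not None:
        bill_urls.append(uri)
print(bill_urls)
```

Iterating over the votes directly also removes the need for range(len(...)) and repeated indexing.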
I was wondering how I could import a JSON file, and then save that to an ordered CSV file, with header row and the applicable data below.
Here's what the JSON file looks like:
[
{
"firstName": "Nicolas Alexis Julio",
"lastName": "N'Koulou N'Doubena",
"nickname": "N. N'Koulou",
"nationality": "Cameroon",
"age": 24
},
{
"firstName": "Alexandre Dimitri",
"lastName": "Song-Billong",
"nickname": "A. Song",
"nationality": "Cameroon",
"age": 26,
etc. etc. + } ]
Note there are multiple 'keys' (firstName, lastName, nickname, etc.). I would like to create a CSV file with those as the header, then the applicable info beneath in rows, with each row having a player's information.
Here's the script I have so far for Python:
import urllib2
import json
import csv
writefilerows = csv.writer(open('WCData_Rows.csv',"wb+"))
api_key = "xxxx"
url = "http://worldcup.kimonolabs.com/api/players?apikey=" + api_key + "&limit=1000"
json_obj = urllib2.urlopen(url)
readable_json = json.load(json_obj)
list_of_attributes = readable_json[0].keys()
print list_of_attributes
writefilerows.writerow(list_of_attributes)
for x in readable_json:
    writefilerows.writerow(x[list_of_attributes])
But when I run that, I get a "TypeError: unhashable type:'list'" error. I am still learning Python (obviously I suppose). I have looked around online (found this) and can't seem to figure out how to do it without explicitly stating what key I want to print...I don't want to have to list each one individually...
Thank you for any help/ideas! Please let me know if I can clarify or provide more information.
Your TypeError occurs because you are trying to index a dictionary, x, with a list, list_of_attributes, via x[list_of_attributes]. A list is unhashable, so it cannot be used as a dictionary key. In this case you are iterating over readable_json, which yields a dictionary on each iteration. There is no need to pull values out of this data in order to write them out.
The DictWriter should give you what you're looking for.
import csv
[...]
def encode_dict(d, out_encoding="utf8"):
    '''Encode dictionary to desired encoding, assumes incoming data in unicode'''
    encoded_d = {}
    for k, v in d.iteritems():
        k = k.encode(out_encoding)
        v = unicode(v).encode(out_encoding)
        encoded_d[k] = v
    return encoded_d
list_of_attributes = readable_json[0].keys()
# sort fields in desired order
list_of_attributes.sort()
with open('WCData_Rows.csv',"wb+") as csv_out:
    writer = csv.DictWriter(csv_out, fieldnames=list_of_attributes)
    writer.writeheader()
    for data in readable_json:
        writer.writerow(encode_dict(data))
Note:
This assumes that each entry in readable_json has the same fields.
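If the entries do not all share the same fields, one way to relax that assumption is to take the union of all keys and let DictWriter fill the gaps. The sketch below is written for Python 3 (unlike the Python 2 code above); the records are partly hypothetical, and io.StringIO stands in for a file opened with open(path, "w", newline="").

```python
import csv
import io

readable_json = [
    {"firstName": "Alexandre Dimitri", "lastName": "Song-Billong", "age": 26},
    # Hypothetical record with a different set of fields.
    {"firstName": "Nicolas Alexis Julio", "nickname": "N. N'Koulou"},
]

# Union of all keys, sorted for a stable column order.
fieldnames = sorted({key for row in readable_json for key in row})

csv_out = io.StringIO(newline="")  # stands in for a real output file
writer = csv.DictWriter(csv_out, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(readable_json)
csv_text = csv_out.getvalue()
print(csv_text)
```

The restval argument supplies the value written for any field a row lacks, so rows with missing keys produce empty cells instead of raising an error.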
Maybe pandas could do this, but I have never tried to read JSON with it:
import pandas as pd
df = pd.read_json( ... )
df.to_csv( ... )
See the documentation for pandas.DataFrame.to_csv and pandas.io.json.read_json.
EDIT:
data = ''' [
{
"firstName": "Nicolas Alexis Julio",
"lastName": "N'Koulou N'Doubena",
"nickname": "N. N'Koulou",
"nationality": "Cameroon",
"age": 24
},
{
"firstName": "Alexandre Dimitri",
"lastName": "Song-Billong",
"nickname": "A. Song",
"nationality": "Cameroon",
"age": 26
}
]'''
import pandas as pd
df = pd.read_json(data)
print df
df.to_csv('results.csv')
result:
   age             firstName            lastName nationality     nickname
0   24  Nicolas Alexis Julio  N'Koulou N'Doubena    Cameroon  N. N'Koulou
1   26     Alexandre Dimitri        Song-Billong    Cameroon      A. Song
With pandas you can save it as CSV, Excel, etc. (and maybe even directly to a database).
You can also perform operations on the data in the table and show it as a graph.