How to use Faker's unique method with generator objects

How to use Faker's unique method with generator objects - python

I want to use the Faker multi-locale mode (https://maleficefakertest.readthedocs.io/en/docs-housekeeping/fakerclass.html#multiple-locale-mode) and pass a list of locales to my Faker object, then call the respective locale generator and generate unique values when needed in my code.
The "unique" attribute works fine for a Faker object, but does not when it is operating on a Faker Generator. I can see what is happening, but was expecting/hoping that I could just use the same "unique" method with the multi-locales mode. For example:
from faker import Faker
fake1 = Faker("en_US")
fake2 = Faker(["en_CA", "en_US"])
print(type(fake1))
print(fake1.state())
print(fake1.unique.state())
print(type(fake2["en_US"]))
print(fake2["en_US"].state())
print(fake2["en_US"].unique.state())
Gives:
<class 'faker.proxy.Faker'>
Arizona
Illinois
<class 'faker.generator.Generator'>
Oregon
Traceback (most recent call last):
File "test.py", line 12, in <module>
print(fake2["en_US"].unique.state())
AttributeError: 'Generator' object has no attribute 'unique'
Does anyone know a way to use "unique" with the multi-locale mode?

The good news is, Canada has provinces, so you can call fake2.unique.province() 13 times (10 provinces, 3 territories) before getting a UniquenessException.
The bad news is, in playing around with your problem for another locale that also has states (Australia), I couldn't make much sense of the behaviour. It seems to depart from what you want.
Just from observation, it becomes likely that you start getting UniquenessException once one of the locales is exhausted (likely to be Australia in this case, having 8 states).
fake3 = Faker(['en_US', 'en_AU'])
for i in range(70):
try:
print(fake3.unique.state())
except:
print("UniquenessException")
Iowa
Georgia
Tasmania
Idaho
Alabama
Australian Capital Territory
Maine
Montana
Western Australia
Victoria
Queensland
Kentucky
Pennsylvania
New South Wales
Rhode Island
Arizona
South Australia
Louisiana
Northern Territory
Missouri
UniquenessException
Nevada
UniquenessException
Alaska
Connecticut
UniquenessException
Delaware
Kansas
UniquenessException
Indiana
Texas
UniquenessException
UniquenessException
UniquenessException
New York
UniquenessException
UniquenessException
New Mexico
UniquenessException
UniquenessException
UniquenessException
Illinois
Mississippi
UniquenessException
West Virginia
Ohio
Arkansas
Wyoming
UniquenessException
UniquenessException
New Hampshire
UniquenessException
UniquenessException
Hawaii
UniquenessException
Vermont
UniquenessException
North Dakota
UniquenessException
North Carolina
South Carolina
Washington
UniquenessException
UniquenessException
Minnesota
UniquenessException
Utah
UniquenessException
UniquenessException
Tennessee

Related

select random pairs from remaining unique values in a list

Updated: Not sure I explained it well first time.
I have a scheduling problem, or more accurately, a "first come first served" problem. A list of available assets are assigned a set of spaces, available in pairs (think cars:parking spots, diners:tables, teams:games). I need a rough simulation (random) that chooses the first two to arrive from available pairs, then chooses the next two from remaining available pairs, and so on, until all spaces are filled.
Started using teams:games to cut my teeth. The first pair is easy enough. How do I then whittle it down to fill the next two spots from among the remaining available entities? Tried a bunch of different things, but coming up short. Help appreciated.
import itertools
import numpy as np
import pandas as pd
a = ['Georgia','Oregon','Florida','Texas'], ['Georgia','Oregon','Florida','Texas']
b = [(x,y) for x,y in itertools.product(*a) if x != y]
c = pd.DataFrame(b)
c.columns = ['home', 'away']
print(c)
d = c.sample(n = 2, replace = False)
print(d)
The first results is all possible combinations. But, once the first slots are filled, there can be no repeats. in example below, once Oregon and Georgia are slated in, the only remaining options to choose from are Forlida:Texas or Texas:Florida. Obviously just the sample function alone produces duplicates frequently. I will need this to scale up to dozens, then hundreds of entities:slots. Many thanks in advance!
home away
0 Georgia Oregon
1 Georgia Florida
2 Georgia Texas
3 Oregon Georgia
4 Oregon Florida
5 Oregon Texas
6 Florida Georgia
7 Florida Oregon
8 Florida Texas
9 Texas Georgia
10 Texas Oregon
11 Texas Florida
home away
3 Oregon Georgia
5 Oregon Texas

Not exactly sure what you are trying to do. But if you want to randomly pair your unique entities you can simply randomly order them and then place them in a 2-columns dataframe. I wrote this with all the US states minus one (Wyomi):
states = ['Alaska','Alabama','Arkansas','Arizona','California',
'Colorado','Connecticut','District of Columbia','Delaware',
'Florida','Georgia','Hawaii','Iowa','Idaho','Illinois',
'Indiana','Kansas','Kentucky','Louisiana','Massachusetts',
'Maryland','Maine','Michigan','Minnesota','Missouri',
'Mississippi','Montana','North Carolina','North Dakota',
'Nebraska','New Hampshire','New Jersey','New Mexico',
'Nevada','New York','Ohio','Oklahoma','Oregon',
'Pennsylvania','Rhode Island','South Carolina',
'South Dakota','Tennessee','Texas','Utah','Virginia',
'Vermont','Washington','Wisconsin','West Virginia']
a=states.copy()
random.shuffle(states)
c = pd.DataFrame({'home':a[::2],'away':a[1::2]})
print(c)
#Output
home away
0 West Virginia Minnesota
1 New Hampshire Louisiana
2 Nevada Florida
3 Alabama Indiana
4 Delaware North Dakota
5 Georgia Rhode Island
6 Oregon Pennsylvania
7 New York South Dakota
8 Maryland Kansas
9 Ohio Hawaii
10 Colorado Wisconsin
11 Iowa Idaho
12 Illinois Missouri
13 Arizona Mississippi
14 Connecticut Montana
15 District of Columbia Vermont
16 Tennessee Kentucky
17 Alaska Washington
18 California Michigan
19 Arkansas New Jersey
20 Massachusetts Utah
21 Oklahoma New Mexico
22 Virginia South Carolina
23 North Carolina Maine
24 Texas Nebraska
Not sure if this is exactly what you were asking for though.
If you need to schedule all the fixtures of the season, you can check this answer --> League fixture generator in python

Cannot turn string into an integer python3

I'm attempting to convert the following into integers. I have literally tried everything and keep getting errors.
For instance:
pop2007 = pop2007.astype('int32')
ValueError: invalid literal for int() with base 10: '4,779,736'
Below is the DF I'm trying to convert. I've even attempted the .values method with no success.
pop2007
Alabama 4,779,736
Alaska 710,231
Arizona 6,392,017
Arkansas 2,915,918
California 37,253,956
Colorado 5,029,196
Connecticut 3,574,097
Delaware 897,934
Florida 18,801,310
Georgia 9,687,653
Idaho 1,567,582
Illinois 12,830,632
Indiana 6,483,802
Iowa 3,046,355
Kansas 2,853,118
Kentucky 4,339,367
Louisiana 4,533,372
Maine 1,328,361
Maryland 5,773,552
Massachusetts 6,547,629
Michigan 9,883,640
Minnesota 5,303,925
Mississippi 2,967,297
Missouri 5,988,927
Montana 989,415
Nebraska 1,826,341
Nevada 2,700,551
New Hampshire 1,316,470
New Jersey 8,791,894
New Mexico 2059179
New York 19378102
North Carolina 9535483
North Dakota 672591
Ohio 11536504
Oklahoma 3751351
Oregon 3831074
Pennsylvania 12702379
Rhode Island 1052567
South Carolina 4625364
South Dakota 814180
Tennessee 6346105
Texas 25,145,561
Utah 2,763,885
Vermont 625,741
Virginia 8,001,024
Washington 6,724,540
West Virginia 1,852,994
Wisconsin 5,686,986
Wyoming 563,626
Name: 3, dtype: object

You can't turn a string with commas into an integer. Try this.
my_int = '1,000,000'
my_int = int(my_int.replace(',', ''))
print(my_int)

Have you tried pop2007.replace(',','') to remove the commas from your string values before converting to integers?

Converting from HTML to CSV using Python

I'm trying to convert a table found on a website (full details and photo below) to a CSV. I've started with the below code, but the table isn't returning anything. I think it must have something to do with me not understanding the right naming convention for the table, but any additional help will be appreciated to achieve my ultimate goal.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
url = 'https://www.privateequityinternational.com/database/#/pei-300'
page = requests.get(url) #gets info from page
soup = BeautifulSoup(page.content,'html.parser') #parses information
table = soup.findAll('table',{'class':'au-target pux--responsive-table'}) #collecting blocks of info inside of table
table
Output: []
In addition to the URL provided in the above code, I'm essentially trying to convert the below table (found on the website) to a CSV file:

The data is loaded from external URL via Ajax. You can use requests/json module to get it:
import json
import requests
url = 'https://ra.pei.blaize.io/api/v1/institutions/pei-300s?count=25&start=0'
data = requests.get(url).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for item in data['data']:
print('{:<5} {:<30} {}'.format(item['id'], item['name'], item['headquarters']))
Prints:
5611 Blackstone New York, United States
5579 The Carlyle Group Washington DC, United States
5586 KKR New York, United States
6701 TPG Fort Worth, United States
5591 Warburg Pincus New York, United States
1801 NB Alternatives New York, United States
6457 CVC Capital Partners Luxembourg, Luxembourg
6477 EQT Stockholm, Sweden
6361 Advent International Boston, United States
8411 Vista Equity Partners Austin, United States
6571 Leonard Green & Partners Los Angeles, United States
6782 Cinven London, United Kingdom
6389 Bain Capital Boston, United States
8096 Apollo Global Management New York, United States
8759 Thoma Bravo San Francisco, United States
7597 Insight Partners New York, United States
867 BlackRock New York, United States
5471 General Atlantic New York, United States
6639 Permira Advisers London, United Kingdom
5903 Brookfield Asset Management Toronto, Canada
6473 EnCap Investments Houston, United States
6497 Francisco Partners San Francisco, United States
6960 Platinum Equity Beverly Hills, United States
16331 Hillhouse Capital Group Hong Kong, Hong Kong
5595 Partners Group Baar-Zug, Switzerland

And selenium version:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
url = 'https://www.privateequityinternational.com/database/#/pei-300'
driver.get(url) #gets info from page
time.sleep(5)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page,'html.parser') #parses information
table = soup.select_one('table.au-target.pux--responsive-table') #collecting blocks of info inside of table
dfs = pd.read_html(table.prettify())
df = pd.concat(dfs)
df.to_csv('file.csv')
print(df.head(25))
prints:
Ranking Name City, Country (HQ)
0 1 Blackstone New York, United States
1 2 The Carlyle Group Washington DC, United States
2 3 KKR New York, United States
3 4 TPG Fort Worth, United States
4 5 Warburg Pincus New York, United States
5 6 NB Alternatives New York, United States
6 7 CVC Capital Partners Luxembourg, Luxembourg
7 8 EQT Stockholm, Sweden
8 9 Advent International Boston, United States
9 10 Vista Equity Partners Austin, United States
10 11 Leonard Green & Partners Los Angeles, United States
11 12 Cinven London, United Kingdom
12 13 Bain Capital Boston, United States
13 14 Apollo Global Management New York, United States
14 15 Thoma Bravo San Francisco, United States
15 16 Insight Partners New York, United States
16 17 BlackRock New York, United States
17 18 General Atlantic New York, United States
18 19 Permira Advisers London, United Kingdom
19 20 Brookfield Asset Management Toronto, Canada
20 21 EnCap Investments Houston, United States
21 22 Francisco Partners San Francisco, United States
22 23 Platinum Equity Beverly Hills, United States
23 24 Hillhouse Capital Group Hong Kong, Hong Kong
24 25 Partners Group Baar-Zug, Switzerland
And also save data to a file.csv.
Note yo need selenium and geckodriver and in this code geckodriver is set to be imported from c:/program/geckodriver.exe

Robot FrameWork Collections - List comparison Issue

I am trying to compare two identical lists in Robot Framework . The code I am using is :
List Test
Lists Should Be Equal #{List_Of_States_USA} #{List_Of_States_USA-Temp}
and the lists are identical with the following values :
#{List_Of_States_USA} Alabama Alaska American Samoa Arizona Arkansas California Colorado
... Connecticut Delaware District of Columbia Florida Georgia Guam Hawaii
... Idaho Illinois Indiana Iowa Kansas Kentucky Louisiana
... Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri
... Montana National Nebraska Nevada New Hampshire New Jersey New Mexico
... New York North Carolina North Dakota Northern Mariana Islands Ohio Oklahoma Oregon
... Pennsylvania Puerto Rico Rhode Island South Carolina South Dakota Tennessee Texas
... Utah Vermont Virgin Islands Virginia Washington West Virginia Wisconsin
... Wyoming
This test fails with the following error:
FAIL Keyword 'Collections.Lists Should Be Equal' expected 2 to 5 arguments, got 114.
I have searched SO and other sites for a solution, but could not figure out why this happened. Thanks in advance for support

You need to use a $ not #. When you use #, robot expands the lists into multiple arguments.
From the robot framework user's guide:
When a variable is used as a scalar like ${EXAMPLE}, its value will be used as-is. If a variable value is a list or list-like, it is also possible to use as a list variable like #{EXAMPLE}. In this case individual list items are passed in as arguments separately.
Consider the case of #{foo} being a list with the values "one", "two" and "three". In such as case the following two are identical:
some keyword #{foo}
some keyword one two three
You need to change your statement to this:
Lists Should Be Equal ${List_Of_States_USA} ${List_Of_States_USA-Temp}

So, As suggested by Bryan-Oakley above, I modified the test as follows:
${L1} Create List #{List_Of_States_USA}
${L2} Create List #{List_Of_States_USA-Temp}
Lists Should Be Equal ${L1} ${L2}
Now the test passed. Thanks Again # Brian

New column in pandas dataframe based on existing column values

I have a dataframe with a column named 'States' that lists various U.S. states. I need to create another column with a region specifier like 'Atlantic Coast' I have lists of the states that belong to various regions so if the state in df['States'] matches a state in the list 'Atlantic_states' the specifier 'Atlantic Coast' is inserted into the new column df['region specifier'] my code below shows the list I want to compare my dataframe values with and the output of the df['States'] column.
#list of states
Atlantic_states = ['Virginia',
'Massachusetts',
'Maine',
'New York',
'Rhode Island',
'Connecticut',
'New Hampshire',
'Maryland',
'Delaware',
'New Jersey',
'North Carolina',
'South Carolina',
'Georgia',
'Florida']
print(df['States'])
Out:
States
0 Virginia
1 Massachusetts
2 Maine
3 New York
4 Rhode Island
5 Connecticut
6 New Hampshire
7 Maryland
8 Delaware
9 New Jersey
10 North Carolina
11 South Carolina
12 Georgia
13 Florida
14 Wisconsin
15 Michigan
16 Ohio
17 Pennsylvania
18 Illinois
19 Indiana
20 Minnesota
21 New York
22 Washington
23 Oregon
24 California

Whilst Andy's answer works it is not the most efficient way of doing this. There is a handy method that can be called on almost all pandas Series-like objects: .isin(). Entries to this can be lists, dicts and pandas Series.
df = pd.DataFrame(['Virginia','Massachusetts','Maine','New York','Rhode Island',
'Connecticut','New Hampshire','Maryland', 'Delaware',
'New Jersey','North Carolina', 'South Carolina','Georgia','Florida',
'Wisconsin','Michigan', 'Ohio','Pennsylvania','Illinois',
'Indiana','Minnesota','New York','Washington','Oregon',
'California'],
columns=['States'])
Atlantic_states = ['Virginia', 'Massachusetts', 'Maine', 'New York','Rhode Island',
'Connecticut', 'New Hampshire', 'Maryland', 'Delaware',
'New Jersey', 'North Carolina', 'South Carolina', 'Georgia',
'Florida']
df['Coast'] = np.where(df['States'].isin(Atlantic_states), 'Atlantic Coast',
'Unknown')
df.head()
Out[1]:
States Coast
0 Virginia Atlantic Coast
1 Massachusetts Atlantic Coast
2 Maine Atlantic Coast
3 New York Atlantic Coast
4 Rhode Island Atlantic Coast
Benchmarks
Here are some timings using for mapping the first 10 letters of the alphabet to some random int numbers:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(low=0, high=26, size=(1000000,1)),
columns=['numbers'])
letters = dict(zip(list(range(0, 10)), [i for i in 'abcdefghij']))
for apply
%%timeit
def is_atlantic(state):
return True if state in letters else False
df.numbers.apply(is_atlantic)
Out[]: 1 loops, best of 3: 432 ms per loop
Now for map as suggested by JohnE
%%timeit
df.numbers.map(letters)
Out[]: 10 loops, best of 3: 56.9 ms per loop
and finally for isin (also suggested by Nickil Maveli)
%%timeit
df.numbers.isin(letters)
Out[]: 10 loops, best of 3: 20.9 ms per loop
So we see that .isin() is much quicker than .apply() and twice as quick as .map().
Note: apply and isin just return the boolean masks and map fills with the desired strings. Even so, when assigning to another column isin wins out by about 2/3 of the time of map.

You have a couple options. First, to directly answer the question as posed:
Option 1
Create a function that returns whether a state is in the Atlantic region or not
def is_atlantic(state):
return "Atlantic" if state in Atlantic_states else "Unknown"
Now, you use .apply() and get the results (and return it to your new column)
df['Region'] = df['State'].apply(is_atlantic)
This returns a data frame that looks like this:
State Region
0 Virginia Atlantic
1 Massachusetts Atlantic
2 Maine Atlantic
3 New York Atlantic
4 Rhode Island Atlantic
5 Connecticut Atlantic
6 New Hampshire Atlantic
7 Maryland Atlantic
8 Delaware Atlantic
9 New Jersey Atlantic
10 North Carolina Atlantic
11 South Carolina Atlantic
12 Georgia Atlantic
13 Florida Atlantic
14 Wisconsin Unknown
15 Michigan Unknown
16 Ohio Unknown
17 Pennsylvania Unknown
18 Illinois Unknown
19 Indiana Unknown
20 Minnesota Unknown
21 New York Atlantic
22 Washington Unknown
23 Oregon Unknown
24 California Unknown
Option 2
The first option gets cumbersome if you have multiple lists you want to check against. Instead of having multiple lists, I recommend creating a single dictionary with the State as the key and the region as the value. With only 50 values this should be easy enough to maintain.
regions = {
'Virginia': 'Atlantic',
'Massachusetts': 'Atlantic',
'Maine': 'Atlantic',
'New York': 'Atlantic',
'Rhode Island': 'Atlantic',
'Connecticut': 'Atlantic',
'New Hampshire': 'Atlantic',
'Maryland': 'Atlantic',
'Delaware': 'Atlantic',
'New Jersey': 'Atlantic',
'North Carolina': 'Atlantic',
'South Carolina': 'Atlantic',
'Georgia': 'Atlantic',
'Florida': 'Atlantic',
'Wisconsin': 'Midwest',
'Michigan': 'Midwest',
'Ohio': 'Midwest',
'Pennsylvania': 'Midwest',
'Illinois': 'Midwest',
'Indiana': 'Midwest',
'Minnesota': 'Midwest',
'New York': 'Atlantic',
'Washington': 'West',
'Oregon': 'West',
'California': 'West'
}
You can use .apply() again, with a slightly modified function:
def get_region(state):
return regions[state]
df['Region'] = df['State'].apply(get_region)
This time your dataframe looks like this:
State Region
0 Virginia Atlantic
1 Massachusetts Atlantic
2 Maine Atlantic
3 New York Atlantic
4 Rhode Island Atlantic
5 Connecticut Atlantic
6 New Hampshire Atlantic
7 Maryland Atlantic
8 Delaware Atlantic
9 New Jersey Atlantic
10 North Carolina Atlantic
11 South Carolina Atlantic
12 Georgia Atlantic
13 Florida Atlantic
14 Wisconsin Midwest
15 Michigan Midwest
16 Ohio Midwest
17 Pennsylvania Midwest
18 Illinois Midwest
19 Indiana Midwest
20 Minnesota Midwest
21 New York Atlantic
22 Washington West
23 Oregon West
24 California West

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to use Faker's unique method with generator objects - python

Related

select random pairs from remaining unique values in a list

Cannot turn string into an integer python3

Converting from HTML to CSV using Python

Robot FrameWork Collections - List comparison Issue

New column in pandas dataframe based on existing column values

Categories

Resources