I'm trying to create a dataset with a unique ID code, but I get a
'ValueError: not enough values to unpack (expected 6, got 5)'
on line 8. Basically, I am trying to:
generate a unique 6-digit ID code
prefix each dataset value with 'ID', e.g. ID123456
UPDATE:
I fixed the error and the ID prefix. Now, how do I make sure the generated ID is unique in the dataset?
from faker import Faker
import random
import pandas as pd
Faker.seed(0)
random.seed(0)
fake = Faker("en_US")
fixed_digits = 6
concatid = 'ID'
idcode, name, city, country, job, age = [[] for k in range(6)]
for row in range(100):
    idcode.append(concatid + str(random.randrange(111111, 999999, fixed_digits)))
    name.append(fake.name())
    city.append(fake.city())
    country.append(fake.country())
    job.append(fake.job())
    age.append(random.randint(20, 100))
d = {"ID Code":idcode, "Name":name, "Age":age, "City":city, "Country":country, "Job":job}
df = pd.DataFrame(d)
df.head()
I'm planning to generate 1k rows.
To answer the question:
now how do I make sure the generated ID is unique in the dataset?
You have to use fake.unique.random_int.
So your code will look like this:
from faker import Faker
import random
import pandas as pd
Faker.seed(0)
random.seed(0)
fake = Faker("en_US")
fixed_digits = 6
concatid = 'ID'
idcode, name, city, country, job, age = [[] for k in range(6)]
for row in range(100):
    idcode.append(concatid + str(fake.unique.random_int(min=111111, max=999999)))
    name.append(fake.name())
    city.append(fake.city())
    country.append(fake.country())
    job.append(fake.job())
    age.append(random.randint(20, 100))
d = {"ID Code":idcode, "Name":name, "Age":age, "City":city, "Country":country, "Job":job}
df = pd.DataFrame(d)
df.head()
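One caveat with fake.unique: every call draws from the same pool, and once the range is exhausted Faker raises a UniquenessException (not an issue for the 1k rows mentioned, since this range holds hundreds of thousands of values). As a stdlib-only alternative sketch, random.sample draws all the IDs in one call and is unique by construction:

```python
import random

random.seed(0)

# Draw 1,000 distinct 6-digit numbers in one call; random.sample never
# repeats a value, so the IDs are unique by construction.
n_rows = 1000
idcode = ["ID" + str(n) for n in random.sample(range(100000, 1000000), n_rows)]
```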
Related
I would like to run two separate loops on df. In the first step, I would like to filter the df by sex (male, female) and year (2008-2013) and save these dataframes in a list. In the second step, I would like to apply some kind of analysis to each element of the list and name the output based on the sex & year combination it came from.
I realize I can do this in one step, but my actual code is significantly more complex and throws an error, which stops the loop so it never advances to the second stage. Consequently, I need to break it up into two steps. This is what I have so far; I would like to ask for help on the second stage. How do I run the make_graph function on each element of the list and name it according to the sex & year combination?
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df_toy=pd.DataFrame([])
df_toy['value'] = np.random.randint(low=1, high=1000, size=100000)
df_toy['age'] = np.random.choice(range(0, 92), 100000)
df_toy['sex'] = np.random.choice([0, 1], 100000)
df_toy['year'] = np.random.randint(low=2008, high=2013, size=100000)
def format_data(df_toy, SEX, YEAR):
    df_toy = df_toy[(df_toy["sex"] == SEX) & (df_toy["year"] == YEAR)]
    return df_toy

def make_graph(df_):
    plt.scatter(age, value)
    return df_toy

dfs = []
for SEX in range(0, 3):
    for YEAR in range(2008, 2014):
        dfs.append(format_data(df_toy, SEX, YEAR))

for i in range(len(dfs)):
    df_ = dfs[i]
    make_graph(df_)
    df_YEAR_SEX = df_
IIUC you could filter, plot, and save the data like this. Since I don't know the actual data, I don't know why you need to do it in two steps, but here is how you could do it with a few changes.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Input data
df_toy = pd.DataFrame({
    'value': np.random.randint(low=1, high=1000, size=100000),
    'age': np.random.choice(range(0, 92), 100000),
    'sex': np.random.choice([0, 1], 100000),
    'year': np.random.randint(low=2008, high=2013, size=100000)
})

def filter_and_plot(df, SEX, YEAR):
    # filter the df for sex and year
    tmp = df[(df["sex"] == SEX) & (df["year"] == YEAR)]
    # create a new plot for each filtered df and plot it
    fig, ax = plt.subplots()
    ax.scatter(x=tmp['age'], y=tmp['value'], s=0.4)
    # return the filtered df
    return tmp

result_dict = {}
for SEX in range(0, 2):
    for YEAR in range(2008, 2013):
        # use an f-string to build a dictionary key that includes sex and year;
        # keys look like "df_1_2009" and the value is the filtered dataframe
        result_dict[f"df_{SEX}_{YEAR}"] = filter_and_plot(df_toy, SEX, YEAR)
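After the loops finish, each filtered dataframe can be looked up by its key. A tiny sketch of the lookup pattern, using a hand-built stand-in for result_dict:

```python
import pandas as pd

# Hypothetical stand-in for the result_dict built above (keys follow f"df_{SEX}_{YEAR}").
result_dict = {
    "df_0_2008": pd.DataFrame({"value": [10, 20], "sex": [0, 0], "year": [2008, 2008]}),
    "df_1_2009": pd.DataFrame({"value": [30], "sex": [1], "year": [2009]}),
}

# Pull one sex/year slice back out by key.
df_1_2009 = result_dict["df_1_2009"]
print(df_1_2009["year"].tolist())  # -> [2009]
```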
I have a dataframe with 4 columns. The first 3 columns are only useful as a group-by for me. I want to get all the possible combinations of Event Numbers for one Employee No/Client Number/Date. As an example, in the photo below:
https://i.stack.imgur.com/5r3vQ.png
This is the output I would want to get:
https://i.stack.imgur.com/JiroJ.png
Note that for me the order is not important, meaning that the combination 123,4567 is the same as the combination 4567,123. So if there were, say, 5 cases of 123,4567 and 8 cases of 4567,123, I would want only one line with 123,4567 and 13.
Any ideas? I'm still new to Python and kind of stuck!
Thank you very much :)
Edit :
This code seems to be working :
import sys
import time
from collections import Counter
from itertools import chain, combinations

sys.path.append('C:/Config Python')
import config
import pyodbc
import pandas as pd
import numpy as np

pd.options.display.max_colwidth = 150

# Build Teradata connection function
def td_connect(usr, pwd, DRIVER='XXX', DBCNAME='YYY'):
    try:
        conn_td = pyodbc.connect(DRIVER=DRIVER, DBCNAME=DBCNAME, UID=usr, PWD=pwd, autocommit=True)
        return conn_td
    except IOError as e:
        print('I/O error!')

# Give the query you wish to run
sql = """
The code is here
"""

# Put td login information
conn = td_connect(usr=config.username, pwd=config.password)

# Get data
df = pd.read_sql(sql, conn)
df

gp = df.groupby(['Employee no', 'Client number', 'Date'])
d = dict()
for name, group in gp:
    l = group['Event Number'].to_list()
    try:
        d[len(l)].append(l)
    except KeyError:
        d[len(l)] = [l]
d

meets = []
for i in d.keys():
    meets.append(Counter(chain.from_iterable(combinations(line, i) for line in d[i])))
print(meets)
Inspired by Concatenate strings from several rows using Pandas groupby:
df['Combinations'] = df.groupby(['Employee no', 'Client number', 'Date'])['Event Number'].transform(lambda x: ",".join(x))
df['Counts'] = df.groupby(['Employee no', 'Client number', 'Date'])['Event Number'].transform('count')
result = df[['Employee no', 'Client number', 'Date', 'Combinations', 'Counts']].drop_duplicates()
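Since the order of events should not matter (123,4567 and 4567,123 are the same combination), one option is to sort the events within each group before joining; a small self-contained sketch with made-up data:

```python
import pandas as pd

# Hypothetical miniature of the grouped data described in the question.
df = pd.DataFrame({
    "Employee no": [1, 1, 2, 2],
    "Client number": [9, 9, 9, 9],
    "Date": ["2020-01-01"] * 4,
    "Event Number": ["123", "4567", "4567", "123"],
})

# Sort events within each group so "123,4567" and "4567,123" collapse into
# the same combination, then count how often each combination occurs.
combos = (
    df.groupby(["Employee no", "Client number", "Date"])["Event Number"]
      .apply(lambda s: ",".join(sorted(s)))
      .value_counts()
)
print(combos)  # "123,4567" appears for both employees -> count 2
```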
I pull the data from the census API using the census wrapper, and I would like to filter that data with a list of zips I compiled.
So I am trying to filter the data from a census pull request. I have a CSV file of the zips I want to use, and I have already put them into a list. I have tried a few things, such as putting the census data in a dataframe and trying to filter the Zipcode column by my list, but I don't think my syntax is correct.
This is just the test data I pulled:
census_data = c.acs5.get(('NAME', 'B25034_010E'),
                         {'for': 'zip code tabulation area:*'})
census_pd = census_pd.rename(columns={"NAME": "Name", "zip code tabulation area": "Zipcode"})
censusfilter = census_pd['Zipcode'==ziplst]
So I tried this way, and I also tried a for loop where I take census_pd['Zipcode'] with an inner for loop that iterates over the list, using an if statement like zip1 == zip2 and appending to a list.
My dependencies:
# Dependencies
import pandas as pd
import requests
import json
import pprint
import numpy as np
import matplotlib.pyplot as plt
from census import Census
import gmaps
from us import states

# Census & gmaps API keys
from config import (api_key, gkey)

c = Census(api_key, year=2013)

# Configure gmaps
gmaps.configure(api_key=gkey)
As mentioned, I want to filter whatever data I pull from the census down to the specific zip codes I use.
It's not clear what your data looks like. I am guessing that you have a scalar column and you want to filter that column using a list. If that is the question, then you can use the built-in isin method to filter the dataframe.
import pandas as pd
data = {'col': [2, 3, 4], 'col2': [1, 2, 3], 'col3': ["asd", "ads", "asdf"]}
df = pd.DataFrame.from_dict(data)
random_list = ["asd", "ads"]
df_filtered = df[df["col3"].isin(random_list)]
The sample data isn't very clear, so below is how to filter a dataframe on a column using a list of values to filter by:
import pandas as pd
from io import StringIO

# Example data
df = pd.read_csv(StringIO(
'''zip,some_column
"01234",A1
"01234",A2
"01235",A3
"01236",B1
'''), dtype={"zip": str})

zips_list = ["01234", "01235"]

# using a join
zips_df = pd.DataFrame({"zip": zips_list})
df1 = df.merge(zips_df, how='inner', on='zip')
print(df1)

# using query
df2 = df.query('zip in @zips_list')
print(df2)

# using an index
df.set_index("zip", inplace=True)
df3 = df.loc[zips_list]
print(df3)
Output in all cases (the index variant shows the same rows, with zip as the index):
zip some_column
0 01234 A1
1 01234 A2
2 01235 A3
I would like to get a number of entries at once, in a specific order of ID column values. To make things more complicated, as input I have rows with ID1 and ID2, and for each row either ID1 or ID2 is in the table, but not both.
The IDs are all unique.
import pandas as pd
import numpy as np

print('Generating table and matchTable...')
N = 10000
# General unique IDs list to draw from
ids = np.random.choice(a=list(range(N*100)), replace=False, size=N*10)
# First N ids go into MAIN_IDS
mainIDs = ids[:N]
data = np.random.randint(low=0, high=25, size=N)
table = pd.DataFrame({'MAIN_IDS': mainIDs, 'DATA': data})
# These ids exist in the table as MAIN_IDS
tableIdsList = np.random.choice(mainIDs, replace=False, size=int(N/10))
notInTableIdsList = ids[N:N+int(N/10)]
idsA = np.zeros(shape=(int(N/10)), dtype=int)
idsB = np.zeros(shape=(int(N/10)), dtype=int)
for i in range(len(idsA)):
    if np.random.random() > 0.4:
        idsA[i] = tableIdsList[i]
        idsB[i] = notInTableIdsList[i]
    else:
        idsA[i] = notInTableIdsList[i]
        idsB[i] = tableIdsList[i]
matchTable = pd.DataFrame({'ID1': idsA, 'ID2': idsB})
print('  Done!')

print('Generating the correct result...')
correctResult = []
for i in range(len(tableIdsList)):
    correctResult.append(data[np.where(mainIDs == tableIdsList[i])[0][0]])
correctResult = np.array(correctResult)
print('  Done!')
I want to get DATA, where MAIN_ID==ID1 or ID2, but in the order of the matchTable.
First filter your match table down to the IDs that exist in the table, then use reindex:
idx = matchTable.where(matchTable.isin(table.MAIN_IDS.tolist())).stack()
table = table.set_index('MAIN_IDS').reindex(idx).reset_index()
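To illustrate on a tiny made-up example: the where/isin/stack step keeps only the IDs that exist in the table, in match-table row order, and reindex then pulls DATA in exactly that order:

```python
import pandas as pd

# Miniature versions of the question's tables (values invented for the demo).
table = pd.DataFrame({"MAIN_IDS": [10, 20, 30], "DATA": ["a", "b", "c"]})
matchTable = pd.DataFrame({"ID1": [30, 99], "ID2": [98, 10]})

# Keep only the IDs present in table, preserving matchTable's row order
# (stack drops the NaNs left by where; cast back to int for clean labels).
idx = matchTable.where(matchTable.isin(table.MAIN_IDS.tolist())).stack().astype(int)

# reindex returns the rows in exactly that order.
result = table.set_index("MAIN_IDS").reindex(idx)["DATA"].tolist()
print(result)  # -> ['c', 'a']
```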
I have a dataframe and am trying to get the closest matches using mahalanobis distance across three categories, like:
from io import StringIO
from sklearn import metrics
import pandas as pd
stringdata = StringIO(u"""pid,ratio1,pct1,rsp
0,2.9,26.7,95.073615
1,11.6,29.6,96.963660
2,0.7,37.9,97.750412
3,2.7,27.9,102.750412
4,1.2,19.9,93.750412
5,0.2,22.1,96.750412
""")
stats = ['ratio1','pct1','rsp']
df = pd.read_csv(stringdata)
d = metrics.pairwise.pairwise_distances(df[stats].to_numpy(),
                                        metric='mahalanobis')
print(df)
print(d)
Where that pid column is a unique identifier.
What I need to do is take the ndarray returned by the pairwise_distances call and update the original dataframe so each row has some kind of list of its closest N matches (so pid 0 might have an ordered list by distance like 2, 1, 5, 3, 4, or whatever it actually is), but I'm totally stumped as to how this is done in Python.
from io import StringIO

import numpy as np
import pandas as pd
from sklearn import metrics

stringdata = StringIO(u"""pid,ratio1,pct1,rsp
0,2.9,26.7,95.073615
1,11.6,29.6,96.963660
2,0.7,37.9,97.750412
3,2.7,27.9,102.750412
4,1.2,19.9,93.750412
5,0.2,22.1,96.750412
""")
stats = ['ratio1', 'pct1', 'rsp']
df = pd.read_csv(stringdata)

dist = metrics.pairwise.pairwise_distances(df[stats].to_numpy(),
                                           metric='mahalanobis')
dist = pd.DataFrame(dist)
# argsort each row: column 0 is the nearest pid, column 1 the next, etc.
ranks = pd.DataFrame(np.argsort(dist.to_numpy(), axis=1), index=dist.index)
df["rankcol"] = ranks.apply(lambda row: ','.join(map(str, row)), axis=1)
df
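One detail to watch: in each argsort row the first position is the row itself (distance 0), so the self-match usually gets dropped before building the list. A sketch on a hand-written distance matrix (made-up numbers standing in for the Mahalanobis output), keeping the N closest pids per row:

```python
import numpy as np
import pandas as pd

# Made-up symmetric distance matrix with a zero diagonal,
# the shape pairwise_distances returns for 3 rows.
dist = pd.DataFrame([
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 3.0],
    [1.0, 3.0, 0.0],
])

N = 2
ranks = np.argsort(dist.to_numpy(), axis=1)
# Column 0 is always the row itself (distance 0), so skip it.
closest = [",".join(map(str, row[1:N + 1])) for row in ranks]
print(closest)  # -> ['2,1', '0,2', '0,1']
```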